Storing Analytics Data on AWS
Recently the Wharton Customer Analytics Initiative began looking into ways to store research projects online. I did some research for them to compare AWS against in-house solutions. In the end we decided to use EBS-backed instances on AWS because of the reasonable price and the flexibility they offer. Here’s a little information on how we arrived at our decision.
WCAI (The Wharton Customer Analytics Initiative) is “The world’s preeminent academic research center focusing on the development and application of customer analytics methods.” Every few months WCAI receives a dataset from a corporate partner (ranging from ticket sales on StubHub to email marketing data from Charming Shoppes). After receiving the data, WCAI runs some queries and a basic analysis before sending the data to top researchers across the country. Throughout the research process WCAI supports the researchers by providing access to the data and running any additional queries they may need.
Currently there is no standardized way to store the data for the life of the project. The raw files are submitted and stored on box.com, but for analysis they must be loaded into a database. Some projects are loaded into a MySQL or Postgres database on Wimi-Cruncher (the in-house server); others are stored locally on researchers’ computers in a database program like Microsoft Access. Our goal is to create a uniform long-term storage solution for new projects.
The Wimi-Cruncher server is outdated, so while looking at ways to replace it I was asked to estimate the cost of running future projects on Amazon Web Services. Here’s a quick table of the estimated per-project costs on Amazon.
|Service||Use||Price per Unit||Usage||Estimated Monthly Cost|
|S3||Storing raw files||$0.125 per GB-month||40 GB||$5.00|
|EC2 Machine (High-Memory Double Extra Large Instance)||Performing SQL queries||$0.90 per hour||20 hours per week (about 80 hours per month)||$72.00|
|EC2 Storage||Storing the instance's hard drive||$0.10 per GB-month||40 GB||$4.00|
|Data out||Downloading data||$0.125 per GB||100 GB||$12.50|
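To make the arithmetic behind the table explicit, here is a minimal Python sketch that recomputes each line item and sums them. The prices and usage figures come straight from the table; the `LineItem` class, the variable names, and the 4-weeks-per-month conversion (which is what the $72 EC2 figure implies) are my own assumptions for illustration.

```python
from dataclasses import dataclass

# Assumption: the table's $72 EC2 estimate implies a 4-week month.
WEEKS_PER_MONTH = 4


@dataclass
class LineItem:
    """One row of the cost table (hypothetical helper, not an AWS API)."""
    service: str
    unit_price: float      # USD per unit (GB-month, hour, or GB)
    monthly_units: float   # units consumed per month

    @property
    def monthly_cost(self) -> float:
        return self.unit_price * self.monthly_units


items = [
    LineItem("S3", 0.125, 40),                            # 40 GB stored
    LineItem("EC2 machine", 0.90, 20 * WEEKS_PER_MONTH),  # 20 hours per week
    LineItem("EC2 storage", 0.10, 40),                    # 40 GB volume
    LineItem("Data out", 0.125, 100),                     # 100 GB downloaded
]

total = sum(item.monthly_cost for item in items)

for item in items:
    print(f"{item.service:<12} ${item.monthly_cost:7.2f}")
print(f"{'Total':<12} ${total:7.2f}")
```

Under those assumptions the four rows add up to roughly $93.50 per project per month, most of it from the EC2 instance hours.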
AWS also offers great additional features such as image backup, the ability to change instance sizes as needed, and Glacier, a cheap long-term data storage solution.
Notice that we also pay only for the server hours we need. Since server usage is charged by the hour, we are billed only for the hours an instance is actually up to perform queries.
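As a rough illustration of why hourly billing matters here, the sketch below compares the instance cost at the estimated usage with the cost of leaving the same instance running all month. The $0.90 hourly rate and 20 hours per week come from the table above; the 4-week and 30-day month conversions and the function name are assumptions made for this example.

```python
HOURLY_RATE = 0.90  # USD per instance-hour, from the cost table


def instance_cost(hours_up: float) -> float:
    """Cost of keeping one instance running for the given number of hours."""
    return HOURLY_RATE * hours_up


# Assumption: 20 hours per week over a 4-week month.
as_needed = instance_cost(20 * 4)

# Assumption: a 30-day month with the instance never stopped.
always_on = instance_cost(24 * 30)

print(f"Up only for queries: ${as_needed:.2f}")
print(f"Running all month:   ${always_on:.2f}")
```

With these numbers, stopping the instance between query sessions brings the monthly compute bill from about $648 down to about $72.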
Compared to the cost of purchasing a server or buying a virtual machine through Penn, this was by far the cheapest and most flexible choice, so we made the decision to set it up for the next project.
While AWS has done a fantastic job of providing many simple and intuitive ways to control instances, we want to give researchers a straightforward way to spin up a project instance and perform queries without having to understand too much of the underlying infrastructure. I have been assigned the task of creating an app that provides this functionality, and will describe the implementation choices I made in future posts.