This post is by Prateek Gupta, a lead engineer at BloomReach. It is also cross-posted on the AWS Big Data Blog.
BloomReach has built a personalized discovery platform with applications for organic search, site search, content marketing and merchandizing. BloomReach ingests data from a variety of sources such as merchant inventory feed, sitefetch data from merchants’ websites and pixel data. The data is collected, parsed, stored and used to match user intent to content on merchants’ websites and to provide merchants with insights into consumer behavior and the performance of products on their sites.
A sample data ingestion flow for merchant data is shown in the figure below. BloomReach ingests merchant data including crawled merchant pages, merchant feed, and pixel data. There are ETL (extract-transform-load) flows that clean, filter and normalize the data and put it into the product database. Individual applications may use this data to produce derived relations. The product database also supports many applications including the “What’s Hot” application that displays relevant trending products to the user on merchant website.
Below is a sample workflow for personalization:
At BloomReach, we launch 1,500 to 2,000 Amazon EMR clusters and run 6,000 Hadoop jobs every day. As a growing company, we’ve seen our use of Amazon EMR rise dramatically in a short time:
It is critical that we keep our Amazon EMR costs down as we scale up. To that end, we’ve adopted the following strategies
- Use AWS Spot Instances rather than On-Demand Instances whenever possible. Amazon Elastic Cloud Compute (Amazon EC2) Spot Instances are unused Amazon EC2 capacity that you bid on; the price you pay is determined by the supply and demand for Spot Instances. The cost of using Spot Instances can be 80% less than using On-Demand Instances. It’s important to manage Spot Instances because they can be terminated if the Spot market price exceeds your bid price. At BloomReach, we have written an orchestration system that schedules jobs on Amazon EMR. The system implements a Hartmann pipeline that can run a variety of jobs both locally and on Amazon EMR. It can also detect failures such as Spot Instance termination and reschedule jobs on different clusters as needed.
- Create a system that shares clusters among several small jobs rather than launching a separate cluster for every job. Remember, whether your job takes 10 minutes or 60 minutes, you’re paying for an hour of access. If you have four 10-minute jobs, you could share one cluster to do them all and be charged for one hour. Or you could employ one cluster for each and be charged for four hours. Sharing clusters among jobs also allows you to save the time and cost of bootstrapping a new cluster. The time savings alone can be a significant factor for real-time jobs.
- Use Amazon EMR tags for cost tracking. Using EMR tags lets you track the cost of your cloud usage by project or by department, which gives you deeper insight into return on investment and provides transparency for budgeting purposes.
- Create a lifecycle management system that allows you to track clusters and eliminate idle clusters.
- Use the right instance types for your jobs. For example, use c3 instance type for compute-heavy jobs. This can significantly reduce waste and costs based on the scale of your jobs. Below is an algorithm we have found useful for selecting the instance type with the best value for compute capacity based on its Spot price:
Incorporating these Amazon EMR strategies can help you increase efficiency, contain costs, and make a good thing even better.
If you are interested in working on these challenges at BloomReach, contact us at www.bloomreach.com/careers