Simplicity

This is a post by Amit Aggarwal, Head of Development at BloomReach.

“Simplicity is the ultimate sophistication.” (Leonardo da Vinci)

Complexity can lead to systems that are unmaintainable, hard to test and debug, and difficult to scale. Good engineering systems, on the other hand, are simple: easy to explain, easy to test and … 

 

The Evolution of Fault Tolerant Redis Cluster

This is a post by Hongche Liu and Jurgen Philippaerts from the Personalization and Ops Teams at BloomReach. At BloomReach, we use Redis, an open source advanced key-value cache and store, which is often referred to as a data structure server since values can contain strings, hashes, lists, sets, sorted sets, bitmaps and hyperloglogs. In one … 

 

Solr Compute Cloud – An Elastic Solr Infrastructure

This is a post by Nitin Sharma and Li Ding, Engineers from the Search and Data Infrastructure Team at BloomReach. Scaling a multi-tenant search platform that has high availability while maintaining low latency is a hard problem to solve. It’s especially hard when the platform is running a heterogeneous workload on hundreds of millions of … 

 

Crawling Billions of Pages: Building Large Scale Crawling Cluster (Part 2)

Previously in “Crawling Billions of Pages: Building Large Scale Crawling Cluster (Part 1),” we talked about how to build an asynchronous fetcher that downloads raw HTML pages efficiently. Now we have to go from a single machine to a cluster of fetchers. We therefore need a way to synchronize all the fetcher nodes so … 

 

Crawling Billions of Pages: Building Large Scale Crawling Cluster (Part 1)

This post is by Chou-han Yang, principal engineer at BloomReach. At BloomReach, we are constantly crawling our customers’ websites to ensure their quality and to obtain the information we need to run our marketing applications. It is fairly easy to build a prototype with a few lines of script, and there are a bunch of … 

 

Mapreduce Fun: Sampling for Large Data Set

This post is by Chou-han Yang, principal engineer at BloomReach. The coolest thing about MapReduce is that we suddenly have enormous computing power and storage at our disposal. To me, it’s like a kid who suddenly has a new toy and a desire to incorporate it into his favorite games. What could be more fun than … 

 

Introducing Briefly: A Python DSL to Scale Complex Mapreduce Pipelines

This post is by Chou-han Yang, principal engineer at BloomReach. Today we are excited to announce Briefly, a new open-source project designed to tackle the challenge of simultaneously handling the flow of Hadoop and non-Hadoop tasks. In short, Briefly is a Python-based, meta-programming job-flow control engine for big data processing pipelines. We called it Briefly … 

 

Strategies for Reducing Your Amazon EMR Costs

This post is by Prateek Gupta, a lead engineer at BloomReach. It is also cross-posted on the AWS Big Data Blog. BloomReach has built a personalized discovery platform with applications for organic search, site search, content marketing and merchandizing. BloomReach ingests data from a variety of sources such as merchant inventory feeds, sitefetch data from merchants’ websites … 

 

Open Source at BloomReach

BloomReach benefits enormously from open source software throughout our data processing and serving systems. Our backend data processing and analytics systems use Hadoop, Cassandra and a myriad of libraries from the Apache and Python projects and other communities — and of course Linux. While the bulk of our code is tightly linked to our data …