BloomReach benefits enormously from open source software throughout our data processing and serving systems. Our backend data processing and analytics systems use Hadoop, Cassandra and a myriad of libraries from the Apache and Python projects and other communities, and of course Linux. While the bulk of our code is tightly linked to our data and systems, such as our Web Relevance Engine pipeline and our machine learning systems, we’ve also built some standalone pieces of infrastructure that we’d like to give back to the community. As a first step in this direction, we’ve released as open source a couple of tools that we’ve developed and found quite useful.
The first tool, Zinc, is a simple but highly scalable versioned data store for files. It operates much like a revision control system for source code (such as Git or Subversion), but with an emphasis on scalability and simplicity in managing large or numerous data or configuration files.
The second tool is s4cmd. S4cmd is a command-line utility for accessing Amazon S3, inspired by the highly useful s3cmd. We’ve used s3cmd heavily in a variety of scripts for data-intensive applications. However, as the need for a number of small improvements arose, we created our own implementation and gave it the catchy name s4cmd. It is intended as an alternative to s3cmd when performance matters or files are large, and it includes a number of additional features and fixes that we have found useful.
We’ll write a bit more about each tool soon, but in the meantime, check out the docs and code on GitHub and let us know what you think!