Crawling Billions of Pages: Building a Large-Scale Crawling Cluster (Part 1)

This post is by Chou-han Yang, principal engineer at BloomReach.

At BloomReach, we are constantly crawling our customers’ websites to ensure their quality and to obtain the information we need to run our marketing applications. It is fairly easy to build a prototype with a few lines of script, and there are a number of open-source tools available for that, such as Apache Nutch. We chose to build our own crawling cluster at BloomReach after evaluating several open-source options. Our requirements were:

  • Handle more than a billion pages in about a week.
  • Use Hadoop (with Amazon EMR) to parse those pages efficiently.
  • Maintain a constant QPS (queries per second) for each website.
  • Have multiple task queues with different priorities for each website.
  • Handle long latency for slow websites.

A very typical architecture for crawling clusters includes three main components:

  1. A fetcher that sends HTTP requests and reads the response content.
  2. A centralized queuing system for job management and distribution.
  3. A backend pipeline to parse and post-process pages.

[Figure: crawling infrastructure]

For part one of this series, I would like to focus on the fetcher that we use to crawl our customers’ pages. I’ll cover other components separately in future posts.

First attempt: single process loop

To kick-start our discussion, we will just use simple code snippets to demonstrate a very simple fetcher:
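
In Java, with illustrative class and method names, such a single-process loop might look like:

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.List;

    public class SimpleFetcher {
        // Fetch every URL sequentially in one loop; one slow page blocks everything.
        public static void fetchAll(List<String> urls) throws Exception {
            for (String url : urls) {
                HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
                try (InputStream in = conn.getInputStream()) {
                    byte[] body = in.readAllBytes();   // blocks until the page is fully read
                    // Hand the page off to the parsing pipeline here.
                    System.out.println(url + ": " + body.length + " bytes");
                }
            }
        }
    }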

This is a very straightforward implementation of the fetcher with several potential scaling issues:

  • It uses a single thread, so only one page is fetched at a time. In order to fetch from more than one website concurrently, the fetcher needs multiple processes.
  • If a single page takes a long time to process, or even worse, the server times out without any response, the whole process will be stuck.

As straightforward as it is, this approach won’t go very far before operational headaches set in: the memory overhead of running one process per website quickly consumes all available memory. A natural improvement is to use multiple threads in a single process.

Second attempt: multithreaded HTTP client

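A sketch of the multithreaded variant, again with illustrative names, might run one fetch loop per website on its own thread:

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.List;
    import java.util.Map;

    public class ThreadedFetcher {
        // One thread per website, each running its own sequential fetch loop.
        // A hung page still stalls that website's entire loop.
        public static void start(Map<String, List<String>> urlsBySite) {
            for (Map.Entry<String, List<String>> site : urlsBySite.entrySet()) {
                Thread t = new Thread(() -> {
                    for (String url : site.getValue()) {
                        try {
                            HttpURLConnection conn =
                                (HttpURLConnection) new URL(url).openConnection();
                            conn.setConnectTimeout(10_000);
                            conn.setReadTimeout(30_000);
                            try (InputStream in = conn.getInputStream()) {
                                byte[] body = in.readAllBytes();   // still blocks per page
                                System.out.println(url + ": " + body.length + " bytes");
                            }
                        } catch (Exception e) {
                            System.err.println(url + " failed: " + e.getMessage());
                        }
                    }
                }, "fetcher-" + site.getKey());
                t.start();
            }
        }
    }
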
This approach is more modern and removes the need to run more than one process per machine, but the shortcoming remains: a single slow page can still stall a website’s whole loop. Compared to multiple processes, multiple threads are more memory-efficient, but they hit their limit at around 400 to 500 threads on a quad-core machine.

Third attempt: asynchronous HTTP (Windows style)

To solve the problem of a blocking thread per website loop, solutions were developed long ago on Windows. An experienced Windows IIS programmer will be very familiar with the event-driven programming paradigm. Coming up with equivalent code in Java isn’t easy, but it might look something like:
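
One way to sketch that callback style in Java is with the JDK’s asynchronous HttpClient, used here purely to illustrate the paradigm; the class and method names are illustrative:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.List;

    public class CallbackFetcher {
        private final HttpClient client = HttpClient.newHttpClient();

        // Fire off every request without blocking; a callback runs when each
        // response arrives instead of a thread sitting and waiting for it.
        public void fetchAll(List<String> urls) {
            for (String url : urls) {
                HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
                client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                      .thenAccept(resp -> System.out.println(url + " -> " + resp.statusCode()))
                      .exceptionally(err -> {
                          System.err.println(url + " failed: " + err.getMessage());
                          return null;
                      });
            }
        }
    }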

Windows usually uses a single thread to process all events, but you can allow multiple threads by changing the setting of the IIS Web server. The Windows operating system can dispatch different events to different window handlers so you can handle all asynchronous HTTP calls efficiently. For a very long time, people weren’t able to do this on Linux-based operating systems since the underlying socket library contained a potential bottleneck.

Fourth attempt: HTTP client with asynchronous I/O

That bottleneck was removed in Linux kernel 2.5.44 with the introduction of the epoll system call, which allows a process to monitor a huge number of TCP connections without polling each one individually. This also triggered the creation of a series of non-blocking libraries such as Java NIO.

Network libraries based on Java NIO can easily scale from a few thousand to tens of thousands of TCP connections per machine. The CPU no longer spends its time waiting or context-switching between a huge number of threads, so both performance and throughput increase.
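
A bare-bones sketch of the idea with a raw NIO Selector follows; it is illustrative only, and the HTTP handling is deliberately minimal:

    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.SocketChannel;
    import java.nio.charset.StandardCharsets;
    import java.util.Iterator;
    import java.util.List;

    public class NioFetcher {
        // One selector (backed by epoll on Linux) watches every connection;
        // a single thread is woken only when some socket is actually ready.
        public static void run(List<InetSocketAddress> servers) throws Exception {
            Selector selector = Selector.open();
            for (InetSocketAddress addr : servers) {
                SocketChannel ch = SocketChannel.open();
                ch.configureBlocking(false);
                ch.connect(addr);                          // completes asynchronously
                ch.register(selector, SelectionKey.OP_CONNECT);
            }
            ByteBuffer buf = ByteBuffer.allocate(64 * 1024);
            while (!selector.keys().isEmpty()) {
                selector.select(1_000);                    // wait for any socket to become ready
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    SocketChannel ch = (SocketChannel) key.channel();
                    if (key.isConnectable() && ch.finishConnect()) {
                        // Connected: send a minimal request, then wait for the response.
                        ch.write(ByteBuffer.wrap(
                            "GET / HTTP/1.0\r\n\r\n".getBytes(StandardCharsets.US_ASCII)));
                        key.interestOps(SelectionKey.OP_READ);
                    } else if (key.isReadable()) {
                        buf.clear();
                        if (ch.read(buf) < 0) {            // server closed the connection
                            key.cancel();
                            ch.close();
                        }
                        // Otherwise: hand the buffer off to the parsing pipeline.
                    }
                }
            }
        }
    }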

We use Netty to build our crawler, not only because it uses Java NIO, but also because it provides a good pipeline abstraction over the network stack. It is easy to insert a handler for HTTPS, compression, or timeouts without compromising the code structure.
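
A rough sketch of that pipeline wiring with Netty 4 might look like the following; the handler names and the PageHandler class are illustrative, and an SslHandler for HTTPS would slot in the same way:

    import io.netty.bootstrap.Bootstrap;
    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelInitializer;
    import io.netty.channel.ChannelPipeline;
    import io.netty.channel.EventLoopGroup;
    import io.netty.channel.SimpleChannelInboundHandler;
    import io.netty.channel.socket.SocketChannel;
    import io.netty.channel.socket.nio.NioSocketChannel;
    import io.netty.handler.codec.http.HttpClientCodec;
    import io.netty.handler.codec.http.HttpContentDecompressor;
    import io.netty.handler.codec.http.HttpObject;
    import io.netty.handler.codec.http.HttpResponse;
    import io.netty.handler.timeout.ReadTimeoutHandler;

    public class CrawlerBootstrap {
        // Each concern (timeouts, HTTP codec, decompression, our own logic)
        // is a separate handler slotted into the channel pipeline.
        public static Bootstrap build(EventLoopGroup group) {
            return new Bootstrap()
                .group(group)
                .channel(NioSocketChannel.class)
                .handler(new ChannelInitializer<SocketChannel>() {
                    @Override
                    protected void initChannel(SocketChannel ch) {
                        ChannelPipeline p = ch.pipeline();
                        p.addLast("timeout", new ReadTimeoutHandler(30));     // drop slow servers
                        p.addLast("codec", new HttpClientCodec());            // HTTP encode/decode
                        p.addLast("inflater", new HttpContentDecompressor()); // gzip/deflate
                        p.addLast("crawler", new PageHandler());              // our response handling
                    }
                });
        }

        // Placeholder for the crawler's own response handling.
        private static class PageHandler extends SimpleChannelInboundHandler<HttpObject> {
            @Override
            protected void channelRead0(ChannelHandlerContext ctx, HttpObject msg) {
                if (msg instanceof HttpResponse) {
                    System.out.println("status: " + ((HttpResponse) msg).status());
                }
            }
        }
    }

The nice property of this structure is that adding a new concern means adding a new handler to the pipeline, rather than scattering changes through the fetch loop.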

Conclusion

Based on our stress tests, each node with a quad-core CPU can sustain up to 600 queries per second, saturating its network bandwidth at an average HTML page size of 400 KB. With a six-node cluster, we can crawl at 3,600 QPS, which is about 311 million pages a day, or roughly 1.2 billion pages in four days.
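Spelling out the arithmetic: 3,600 requests per second × 86,400 seconds per day ≈ 311 million pages per day, so four days of crawling comes to roughly 1.24 billion pages.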

Next time, we will talk about how to store tasks containing very long lists of URLs, with efficient queuing and scheduling.
