I would like to suggest an architecture that scales reasonably well and performs better than sleeping for random intervals between requests.
Each domain is associated with a queue of known pages that are to be crawled, and two additional fields (a sketch of this structure follows the list):
- The time between requests for this domain.
- The earliest time when the next page can be requested.
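For illustration, here is a minimal Python sketch of such a domain object; all names (`Domain`, `crawl_delay`, `next_request_at`) are made up for this answer. Making the next request time the ordering key lets these objects go straight into a heap:

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass(order=True)
class Domain:
    # Ordering key: the earliest time (as time.monotonic()) at which the
    # next page of this domain may be requested.
    next_request_at: float
    # Seconds to wait between two requests to this domain.
    crawl_delay: float = field(default=1.0, compare=False)
    name: str = field(default="", compare=False)
    # Known pages of this domain that still have to be crawled; in CPython,
    # deque.append() and deque.popleft() are thread-safe.
    pages: deque = field(default_factory=deque, compare=False)
```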
We now have a priority queue of domains, sorted by the time of the next request. These domain objects are units of work. The worker threads take units of work from the queue, and the queue guarantees that it hands out the domain whose next request is due soonest. When a worker receives a unit of work, it first checks whether the scheduled request time lies in the future. If so, the thread sleeps until then. Otherwise, or afterwards (a worker sketch follows this list):
- The next page for that domain is requested and processed. This may discover new pages to be crawled.
- If the server requests rate limiting (e.g. by answering 429 Too Many Requests), then the time between requests for that domain is increased.
- The unit of work is given back to the job queue, and a new job is requested.
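A worker loop implementing these steps might look like this; `fetch_and_process()` is a stand-in for the actual HTTP and parsing code, and `take()`, `submit()`, and `give_back()` belong to the job queue sketched further below:

```python
import time

def worker(job_queue):
    while True:
        domain = job_queue.take()            # earliest-due domain; blocks if none
        delay = domain.next_request_at - time.monotonic()
        if delay > 0:
            time.sleep(delay)                # the request time lies in the future
        url = domain.pages.popleft()
        new_urls, rate_limited = fetch_and_process(url)  # hypothetical HTTP + parsing
        for found in new_urls:
            job_queue.submit(found)          # routed through the queue, see below
        if rate_limited:                     # e.g. 429 Too Many Requests
            domain.crawl_delay *= 2          # back off on this domain
        job_queue.give_back(domain)          # hand ownership back for rescheduling
```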
When the job queue receives ownership of the domain object back, it first checks whether there are any further pages to be crawled. If so, the time for the next request is calculated. This may be the time between requests added to the time of the previous request, or a random value for which that delay is the minimum or the mean.
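The job queue itself might then look as follows. For simplicity, this sketch protects the shared state with a lock instead of running the queue in a dedicated thread; the effect is the same, since all bookkeeping happens in one place. The jitter in `give_back()` implements the variant where the configured delay is the minimum, and it bases the next request time on the moment the previous request finished:

```python
import heapq
import random
import threading
import time

class JobQueue:
    """Hands out domains in order of their next allowed request time (a sketch)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._heap = []       # scheduled domains, ordered by next_request_at
        self._domains = {}    # every domain known to the system, by name
        self._active = set()  # names currently scheduled or owned by a worker

    def take(self):
        """Block until a domain is available, then hand out the earliest-due one."""
        with self._cond:
            while not self._heap:
                self._cond.wait()
            return heapq.heappop(self._heap)

    def give_back(self, domain):
        """Reschedule a domain once a worker has finished a request on it."""
        with self._cond:
            if domain.pages:
                # Next request: now plus the crawl delay, plus random jitter,
                # so the configured delay is the minimum rather than the mean.
                domain.next_request_at = (time.monotonic() + domain.crawl_delay
                                          + random.uniform(0, domain.crawl_delay))
                heapq.heappush(self._heap, domain)
                self._cond.notify()
            else:
                # Nothing left to crawl: the domain goes idle until submit()
                # later adds a page for it.
                self._active.discard(domain.name)
```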
This architecture has the advantage that ownership is clearly defined, so only one thread is making requests to any given domain at a time. The disadvantages are that the job queue does a lot of work and can become a point of contention, and that communication between threads adds overhead.
There is an additional point to take care of: How are pages to be crawled added to a domain that is still unknown to the system, or that is currently owned by another thread? This should be handled by the job queue in order to avoid concurrency issues.
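One way to handle this is a `submit()` method on the `JobQueue` sketched above, which creates unknown domains and only re-schedules a domain if no worker currently owns it:

```python
from urllib.parse import urlsplit

# A further method of the JobQueue sketched above:
def submit(self, url):
    """Route a discovered URL through the queue, avoiding concurrent access."""
    name = urlsplit(url).netloc            # treat the URL's host part as the domain
    with self._cond:
        domain = self._domains.get(name)
        if domain is None:
            # Domain still unknown to the system: create it, due immediately.
            domain = Domain(next_request_at=time.monotonic(), name=name)
            self._domains[name] = domain
        domain.pages.append(url)
        if name not in self._active:
            # The domain was idle: put it back into circulation. If a worker
            # owns it right now, give_back() will reschedule it instead.
            self._active.add(name)
            heapq.heappush(self._heap, domain)
            self._cond.notify()
```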
Performance can be increased by having each worker thread request multiple pages simultaneously using asynchronous operations. This way, the time between sending an HTTP request and receiving a response can be used for working on other things.
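Roughly, with asyncio this could look like the sketch below; `take_async()` is an assumed async-aware variant of `take()` (e.g. built around `asyncio.Condition`), and `fetch_page()` stands in for a non-blocking HTTP call, e.g. via the third-party aiohttp package:

```python
import asyncio
import time

async def async_worker(job_queue):
    # Many of these coroutines can share a single thread: while one awaits
    # an HTTP response, the event loop runs the others.
    while True:
        domain = await job_queue.take_async()            # assumed async variant of take()
        delay = domain.next_request_at - time.monotonic()
        if delay > 0:
            await asyncio.sleep(delay)                   # yields the thread instead of blocking
        url = domain.pages.popleft()
        new_urls, rate_limited = await fetch_page(url)   # hypothetical non-blocking fetch
        for found in new_urls:
            job_queue.submit(found)
        if rate_limited:
            domain.crawl_delay *= 2
        job_queue.give_back(domain)

async def main(concurrency=20):
    queue = JobQueue()                                   # sketch from above
    queue.submit("https://example.com/")
    # Run many workers in one thread; they never return in this sketch.
    await asyncio.gather(*(async_worker(queue) for _ in range(concurrency)))
```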