Let's say I want to program a parallelized web crawler with a shared FIFO queue (multi-producer/multi-consumer). The queue only contains URLs. How do I detect the end of the queue, i.e. that there is no more work and the workers can shut down?
A worker process is always consumer and producer at the same time, because it takes a URL from the queue, crawls it, and adds any found URLs back to the queue. I think there is no way to have separate processes for the consumer and producer roles in this scenario (see the sketch below).
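To make the setup concrete, here is a minimal sketch of the worker loop I have in mind, assuming Python's `multiprocessing`; `crawl()` is just a hypothetical placeholder for the actual fetch/parse step:

```python
from multiprocessing import Process, Queue

def crawl(url):
    """Hypothetical placeholder: fetch `url` and return the URLs found on that page."""
    return []

def worker(queue):
    while True:                   # <- when is it ever safe to leave this loop?
        url = queue.get()         # consumer side: blocks forever once no more URLs arrive
        for found in crawl(url):
            queue.put(found)      # producer side: feed newly discovered URLs back in

if __name__ == "__main__":
    queue = Queue()
    queue.put("http://example.com")   # seed URL
    workers = [Process(target=worker, args=(queue,)) for _ in range(4)]
    for p in workers:
        p.start()
    # ...and here I have no condition on which to join or terminate the workers.
```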
Since the amount of input data is unknown in advance (though finite), no worker ever knows that it has produced the last URL, so it's impossible to use a 'poison pill' as a sentinel in the queue, right?
Also, checking the queue size or emptiness is not a reliable way to tell that the crawl is finished: with multiple consumers/producers, another worker may have just taken the last URL and still be crawling it, about to add new ones.
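Continuing the sketch above, this is the kind of check I mean, which I believe is broken for exactly that reason:

```python
def naive_worker(queue):
    while True:
        if queue.empty():          # race: "empty right now" != "no more work ever"
            break                  # this worker can exit while others still produce
        url = queue.get()          # can even block here if another worker grabbed
                                   # the URL between empty() and get()
        for found in crawl(url):   # crawl() as in the sketch above
            queue.put(found)
```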
Please enlighten me :-)