6

Here are some requirements for a queue:

  • Every few days add ~100k tasks with various priorities
  • Workers will pull tasks at typically less than 10 / second
  • Tasks need to be completed by ~2 unique workers (for error checking), and then depending on outcome potentially by an additional worker
  • Persistent storage of tasks

Since the task processing rate is quite modest, is it worth adding a dedicated message queueing system to my stack, or should I reuse the database (MongoDB)?

hoju

3 Answers

3

What impresses me most here isn't the slow throughput. It's ~100k tasks every few days. Sounds like batch processing to me. That's the sort of thing you'd want to be persistent so you could shut down to install an update every once in a while. That doesn't necessarily mean a DB, but it's certainly not something you want to have only in volatile memory.

candied_orange
  • That is a good point - added this to question. However message queues don't have to just be stored in memory: https://www.rabbitmq.com/persistence-conf.html – hoju Jul 29 '16 at 06:24
  • @CandiedOrange So instead of persisting to the database, processing in parallel with queues, and relying on a redelivery policy, you think a batch process is better? Why? – deFreitas Jun 14 '17 at 23:08
  • @deFreitas because I haven't seen a requirement yet that convinces me this couldn't be done with a file system and a simple script. Sure it can also be done with a database and a message queue. It could be done on a data center running a Hadoop cluster. If the product owner doesn't care about speed then all I care about is not losing work if it gets interrupted. – candied_orange Jun 15 '17 at 00:29
  • @CandiedOrange Understood. I agree that not losing work is the main point, especially when my process is slow – deFreitas Jun 15 '17 at 00:42
3

It's hard to say without a complete understanding of what you are trying to accomplish but based on what you are saying, I think a database makes more sense. You might want to use a database and a queuing system together.

The reason is that in this kind of situation you generally want some sort of audit-balance-control capability, and queues won't provide this. You are going to want to know who did what and when, right? Where are you going to track that? Most likely in a DB. You also need to worry about things like a worker selecting a task and then going home sick. You don't want a bunch of long-term uncommitted reads on the queues.

The important thing to understand about queues is that reads are destructive. This makes them a poor solution (by themselves) in a lot of common situations. In a nutshell, once someone has committed the read, it's difficult to impossible to know that the message was ever there. If a message gets read and committed by a consumer but that consumer fails to process it successfully, things can get lost. A lot of queuing systems have "guaranteed delivery", which doesn't mean as much as people think it does. It simply means the message got to its destination (or was reported as undeliverable); it doesn't guarantee that the consumer did anything with it. "So what, reading from a database doesn't change that," you might say. The difference is that when you read the message from the DB, it doesn't disappear. Often you won't even want to delete it from the DB but rather track its status. This means you don't lose messages due to flaws in the consumers.
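To make the difference concrete, here is a minimal sketch of status tracking with a lease, so that a task claimed by a worker who never finishes becomes claimable again. It uses Python's built-in sqlite3 purely as a stand-in for the real database (MongoDB in the question). The table layout, the `claim_next` and `complete` helpers, and the 300-second lease are all assumptions made for illustration.

```python
import sqlite3

# sqlite3 is only a stand-in for the real database; all names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tasks ("
    " id INTEGER PRIMARY KEY,"
    " priority INTEGER NOT NULL,"
    " status TEXT NOT NULL DEFAULT 'pending',"  # pending -> claimed -> done
    " worker TEXT,"
    " lease_expires REAL)"
)

def claim_next(worker, now, lease_seconds=300.0):
    """Claim the highest-priority task that is pending or whose lease lapsed."""
    # Commits on success, rolls back on error. With several connections this
    # select-then-update would need a write lock or an atomic find-and-update.
    with conn:
        row = conn.execute(
            "SELECT id FROM tasks"
            " WHERE status = 'pending'"
            " OR (status = 'claimed' AND lease_expires < ?)"
            " ORDER BY priority DESC, id LIMIT 1",
            (now,),
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE tasks SET status = 'claimed', worker = ?, lease_expires = ?"
            " WHERE id = ?",
            (worker, now + lease_seconds, row[0]),
        )
        return row[0]

def complete(task_id, worker):
    """Mark a task done; the row stays in the table as an audit record."""
    with conn:
        conn.execute(
            "UPDATE tasks SET status = 'done' WHERE id = ? AND worker = ?",
            (task_id, worker),
        )
```

Nothing is deleted here: a finished task stays in the table with its status and the worker that handled it, and a task whose lease expired is simply handed to the next worker that asks.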

I think the hybrid solution works well here. Use queues as a distribution mechanism. They are great for handling lots of readers pulling from the same set of things and making sure only one reader gets each thing. You can do that in DBs but it's kind of ugly. Then, once a reader has claimed an item, it can update the DB with that info with very little contention. You can send the same message on the queue as many times as needed for your 2-worker requirement (you'll need a mechanism to avoid the same worker handling both requests), and send it again when a worker pulls a message but never does the task.
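The mechanism for keeping the same worker from handling a task twice can live in the database. The following is a rough sketch, again using Python's built-in sqlite3 as a stand-in for the real store. The `tasks` and `claims` tables and the `claim` and `report` helpers are hypothetical, and the rule "ask for a third worker when the first two disagree" is only one assumed reading of "depending on outcome".

```python
import sqlite3

# sqlite3 stands in for the real database; schema and names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tasks (id INTEGER PRIMARY KEY, required INTEGER NOT NULL DEFAULT 2);
CREATE TABLE claims (
    task_id INTEGER NOT NULL,
    worker  TEXT NOT NULL,
    outcome TEXT,
    PRIMARY KEY (task_id, worker)  -- a worker can claim a given task only once
);
""")

def claim(worker):
    """Give `worker` a task it has not seen that still needs more workers."""
    with conn:
        row = conn.execute(
            "SELECT t.id FROM tasks t"
            " WHERE (SELECT COUNT(*) FROM claims c WHERE c.task_id = t.id)"
            "       < t.required"
            " AND NOT EXISTS (SELECT 1 FROM claims c"
            "                 WHERE c.task_id = t.id AND c.worker = ?)"
            " ORDER BY t.id LIMIT 1",
            (worker,),
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "INSERT INTO claims (task_id, worker) VALUES (?, ?)", (row[0], worker)
        )
        return row[0]

def report(task_id, worker, outcome):
    """Record an outcome; if the first two workers disagree, ask for a third."""
    with conn:
        conn.execute(
            "UPDATE claims SET outcome = ? WHERE task_id = ? AND worker = ?",
            (outcome, task_id, worker),
        )
        outcomes = [r[0] for r in conn.execute(
            "SELECT outcome FROM claims"
            " WHERE task_id = ? AND outcome IS NOT NULL", (task_id,))]
        required = conn.execute(
            "SELECT required FROM tasks WHERE id = ?", (task_id,)).fetchone()[0]
        # Assumed escalation rule: one extra worker, requested at most once.
        if required == 2 and len(outcomes) == 2 and outcomes[0] != outcomes[1]:
            conn.execute("UPDATE tasks SET required = 3 WHERE id = ?", (task_id,))
```

The composite primary key is what enforces uniqueness; the queue, if one is used, only has to deliver "there is work" notifications and does not need to know who has already seen which task.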

JimmyJames
1

I've been using Apache Camel with ActiveMQ lately, and can say that a message queue is very easy to work with. Speed and performance have not been a problem at all for me.

ActiveMQ, along with pretty much every other large-scale message broker, supports persistence. In ActiveMQ it is as simple as supplying a file path to a database in the config. If throughput is a problem, I have heard that Apache Kafka is a good candidate.

Kafka has a theoretically higher throughput than ActiveMQ because ActiveMQ adds 144 bytes of overhead to each message, while Kafka adds only 9(!) bytes.
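Taking those per-message figures at face value, a quick back-of-the-envelope calculation shows what the overhead amounts to at the load described in the question (~100k tasks per batch, at most about 10 pulls per second). The constants below come from this thread; the `overhead` helper is just an illustrative name.

```python
ACTIVEMQ_OVERHEAD = 144  # bytes per message, figure quoted above
KAFKA_OVERHEAD = 9       # bytes per message, figure quoted above
BATCH_SIZE = 100_000     # tasks per batch, from the question
PULL_RATE = 10           # tasks per second, upper bound from the question

def overhead(per_message_bytes):
    """Return (bytes per batch, bytes per second) of per-message overhead."""
    return per_message_bytes * BATCH_SIZE, per_message_bytes * PULL_RATE

for name, size in (("ActiveMQ", ACTIVEMQ_OVERHEAD), ("Kafka", KAFKA_OVERHEAD)):
    per_batch, per_second = overhead(size)
    print(f"{name}: {per_batch / 1e6:.1f} MB per batch, {per_second} B/s")
# ActiveMQ: 14.4 MB per batch, 1440 B/s
# Kafka: 0.9 MB per batch, 90 B/s
```

At this rate either figure is small, so the difference would likely only matter at much higher message volumes.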

In my experience, probably because I am not a database wizard, I would say that it is easier to get good results with a message queue.

See this article: The Database As Queue Anti-Pattern

haraldfw
  • Does ActiveMQ support multiple unique workers doing the same tasks? I haven't found a straightforward way to do this with the message queues I looked at so far. – hoju Jul 29 '16 at 08:10
  • @hoju What you are looking for is called a topic. Every message broker I have checked out supports this functionality. http://activemq.apache.org/how-does-a-queue-compare-to-a-topic.html – haraldfw Jul 29 '16 at 09:46
  • As far as I can tell, topics don't provide the fine-grained control I am after. Specifically, I want the first 2 unique workers that pull this task to execute it, and then, depending on the outcome, I may want some additional unique workers to execute it. – hoju Jul 29 '16 at 17:08