I need to write a server application that fetches mails from different mail servers/mailboxes and then needs to process/analyze these mails. Traditionally, I would do this multi-threaded, launching a thread for fetching mails (or maybe one per mailbox) and then process the mails.

We are moving more and more to servers where we have 8+ cores, so I would like to make use of these cores as much as possible (and not use 1 at 100% and leave the seven others untouched). So conceptually, as an example, it would be nice that I could write the application in such a way that two cores are "continuously" fetching emails and four cores are "continuously" processing/analyzing the emails (since processing and analyzing mails is more CPU intensive than fetching mails).

This seems like a good concept, but after studying some parallel patterns, I'm not really sure how this is best implemented. None of the patterns really fit. I'm working in VS2012, native C++, but I guess from a design point of view this does not really matter and just some pointers on how to organize this would be great!

Den
  • I'd just like to make the point that a single-process NodeJS instance on a single core often beats threaded servers that don't do I/O correctly. Most of the time in these applications tends to be wasted **waiting for I/O**. How much of your use case is the actual processing/analyzing, and how much time is spent waiting for emails? If you're not sure, I think that profiling that specific aspect will really help you in making a good design decision. – Benjamin Gruenbaum Jul 11 '13 at 22:38
  • @Benjamin: that was indeed also my worry. Therefore I started first with writing an email generator script and different models for fetching the mails, to see whether there is actually a point in going multiprocessor. Thx for the input – Wim Van Houts Jul 13 '13 at 10:11

1 Answer

The actor model of concurrency seems like it might be a good fit for this.

The Model

In case you're not familiar with this model, it goes as follows:

Actors are threads that run in a loop. Each actor has a producer-consumer message queue; external code and other actors communicate with an actor by sending a message to it (queuing it in its message queue).

An actor's thread will block waiting for a message in its message queue, and when one appears the actor will deal with it, then loop back to process or wait for the next message. Repeat.
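As a rough illustration of the loop described above, here is a minimal sketch using only the C++11 standard library. The `Actor` class, its `send` method, and the string message type are hypothetical names chosen for this example; a real actor library would offer typed messages, supervision, and more.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical minimal actor: one thread plus one mailbox (a blocking queue).
class Actor {
public:
    typedef std::function<void(const std::string&)> Handler;

    explicit Actor(Handler handler)
        : handler_(std::move(handler)), stopping_(false),
          thread_(&Actor::run, this) {
        assert(handler_ != nullptr);  // an actor needs something to do
    }

    // Ask the loop to stop, let it drain the mailbox, then join the thread.
    ~Actor() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        ready_.notify_one();
        thread_.join();
    }

    // Called by external code or other actors: queue a message and return.
    void send(std::string message) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            mailbox_.push(std::move(message));
        }
        ready_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::string message;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                // Block until a message arrives or shutdown is requested.
                ready_.wait(lock, [this] { return stopping_ || !mailbox_.empty(); });
                if (mailbox_.empty()) return;  // stopping and nothing left
                message = std::move(mailbox_.front());
                mailbox_.pop();
            }
            handler_(message);  // handle outside the lock, then loop again
        }
    }

    Handler handler_;
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::string> mailbox_;
    bool stopping_;
    std::thread thread_;  // declared last so it starts after the other members
};

// Usage sketch: send three messages and return what the actor handled.
std::vector<std::string> demo_actor() {
    std::vector<std::string> seen;
    {
        Actor logger([&seen](const std::string& m) { seen.push_back(m); });
        logger.send("mail-1");
        logger.send("mail-2");
        logger.send("mail-3");
    }  // destructor joins once the mailbox is drained
    return seen;
}
```

Because each actor handles its messages on a single thread, the handler itself needs no locking for state that only that actor touches.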

Note: "Actors" are sometimes called "Agents", but that term is misapplied. See the comment thread below for more.

Architecture

You could create actors specifically for downloading messages (say one per mailserver/mailbox) and other actors for processing the e-mails once they've been downloaded.

Connecting the two, you could have a single routing actor that receives references to the downloaded mail files from the fetching actors. It would either send each reference to an available processing actor or, if all the processing actors were busy, spin up another one to process it. When a processing actor finished, it would send a message back to the routing actor saying it was done, so the routing actor would know it could send it another reference to process.
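A simplified sketch of this fetch/process split follows. Note that it swaps the routing actor for a shared blocking queue: idle processing threads pull the next reference themselves, which has the same "hand work to whoever is free" effect without explicit "done" messages. `Channel`, `run_pipeline`, and the mail reference strings are hypothetical, and the fetching and processing steps are stand-ins.

```cpp
#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Blocking queue that can be closed once all producers are finished.
template <typename T>
class Channel {
public:
    Channel() : closed_(false) {}

    void push(T item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(std::move(item));
        }
        ready_.notify_one();
    }

    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }

    // Returns false once the channel is closed and empty.
    bool pop(T& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return closed_ || !items_.empty(); });
        if (items_.empty()) return false;
        out = std::move(items_.front());
        items_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> items_;
    bool closed_;
};

// `fetchers` threads produce mail references and `processors` threads
// consume them. Returns the number of references processed.
std::size_t run_pipeline(std::size_t fetchers, std::size_t processors,
                         std::size_t mails_per_fetcher) {
    Channel<std::string> inbox;
    std::atomic<std::size_t> processed(0);

    std::vector<std::thread> fetch_threads;
    for (std::size_t f = 0; f < fetchers; ++f) {
        fetch_threads.emplace_back([&inbox, f, mails_per_fetcher] {
            for (std::size_t i = 0; i < mails_per_fetcher; ++i) {
                // Stand-in for downloading one mail and saving it to a file.
                inbox.push("mailbox" + std::to_string(f) + "/mail" + std::to_string(i));
            }
        });
    }

    std::vector<std::thread> process_threads;
    for (std::size_t p = 0; p < processors; ++p) {
        process_threads.emplace_back([&inbox, &processed] {
            std::string ref;
            while (inbox.pop(ref)) {
                // Stand-in for the CPU-heavy analysis of one mail.
                ++processed;
            }
        });
    }

    for (auto& t : fetch_threads) t.join();
    inbox.close();  // no more mail will arrive
    for (auto& t : process_threads) t.join();
    return processed.load();
}
```

Calling `run_pipeline(2, 4, 50)` would mirror the two-fetchers, four-processors split from the question. A dedicated routing actor becomes worthwhile when you need per-worker decisions, such as spinning up extra processors on demand.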

I'm betting by this point there's a library for actors for C++ [UPDATE: See comment by @rwong below]. If all else fails you could try Erlang ;)

I'm not sure how the C++ threading libraries work -- whether they map threads to a single core or multiple cores -- but if this doesn't do it for you, you could take the same concept and, instead of using threads, make the actors discrete processes and use some sort of message-passing framework for the communication.
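On the core-mapping question: `std::thread` creates ordinary operating system threads, and the OS scheduler is free to run them on any available core, so a multi-threaded design can use all eight. A small sketch for sizing the processing pool is shown here; `worker_count` and the fallback value of 2 are assumptions made for this example.

```cpp
#include <thread>

// Pick a number of processing threads, leaving some cores for fetching.
// `reserved_for_fetching` is a hypothetical tuning parameter.
unsigned worker_count(unsigned reserved_for_fetching) {
    // hardware_concurrency() is only a hint and may return 0 when unknown.
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 2;  // assumed fallback, not a measured value
    return cores > reserved_for_fetching ? cores - reserved_for_fetching : 1;
}
```

Since fetching threads spend most of their time blocked on the network, it may be reasonable to reserve fewer cores for them than the number of mailboxes; profiling should decide.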


Edit: I'm betting you'll have a bottleneck at the network, though, so it might not even make sense to want to occupy all the cores at once (unless the processing takes a loooong time).

Edit: Expanded answer and corrected terminology (Agent -> Actor)

paul
  • Do you mean by bottleneck at the network that a gigabit connection would not be fast enough to download sufficient emails to keep a core busy? Processing would take reasonable time (several regex to execute per mail, on mails having possibly large body). – Wim Van Houts Jul 11 '13 at 20:54
  • Yes, that was what I meant about the bottleneck. My thought was that if you had umpteen e-mails all trying to come in through a connection at once (with multiple download agents running) that it'd slow things down and that the processor agents wouldn't have enough to do. That's just my gut instinct though, I don't have any hard evidence :) – paul Jul 11 '13 at 20:58
  • @paul Agents are _autonomous_ beings in their computational environment showing _goal driven_ behavior. This answer suggests no goal driven behavior, no deliberation and no autonomy. While these might be considered purely reactive agents (much like a light switch could), I don't think it's the right term to use in this case. I think producer/consumer is more what you had in mind. Multiagent systems excel at _huge_ scaling problems where coordination is needed and conflict resolution and common knowledge are issues - an 8 core server is not such a system. – Benjamin Gruenbaum Jul 11 '13 at 22:35
  • Microsoft [Parallel Patterns Library (PPL)](http://msdn.microsoft.com/en-us/library/dd492418.aspx) provides an implementation of [asynchronous agents](http://msdn.microsoft.com/en-us/library/dd551463.aspx), also known as the actor model. [This online book by Campbell and Miller](http://msdn.microsoft.com/en-us/library/gg675934.aspx) provides a step-by-step guide to writing high-performance concurrent software. – rwong Jul 12 '13 at 02:59
  • @BenjaminGruenbaum "Actor model" is the correct term. However, for historical reasons there are multiple frameworks that used the term "Agents" for what is actor model, therefore when we discuss practical programming issues the reader has to understand that these terms *were* used interchangeably. (Sort of a "do as I say, not as I do" situation.) – rwong Jul 12 '13 at 03:02
  • @BenjaminGruenbaum Thanks for the correction. I've seen the Actor model referred to as both the Actor model (in Erlang) and Agents (in F#, where the FSharpx library has an Actor type named "Agent") which is why I used the term interchangeably. I'll be sure to use the correct term henceforth (and I believe I have some documentation at work to correct...) – paul Jul 12 '13 at 12:19
  • @BenjaminGruenbaum Also, yes, I was looking for the term producer/consumer while writing my answer but it kept eluding me! – paul Jul 12 '13 at 12:22
  • @rwong Thanks for the support and the links! Those look like they'll be handy. – paul Jul 12 '13 at 12:23
  • @rwong I agree that it's mainly terminology, and both actors and agents are beautiful ideas to abstract concurrency issues. I just wanted to mention `Actor` was a more correct term for what you suggested - and producer/consumer was probably even more correct due to the purely reactive nature and __lack of state__. Paul, if you're interested in agents, I think that [this book](http://www.cs.ox.ac.uk/people/michael.wooldridge/pubs/imas/IMAS2e.html) is a good place to start :) It's an interesting field with some great ideas to solve large scale concurrency problems and cooperation problems. – Benjamin Gruenbaum Jul 12 '13 at 12:24
  • There are several other names related to producer/consumer: pipe-and-filter, pipeline, channel, port, etc. However, I think OP will need another concept: incrementalism. For example, when a batch of 100000 items is received, an intermediate process ought to break it down further into batches of maybe 100 items. Not doing so will introduce serialization (or prevent concurrency from happening). – rwong Jul 12 '13 at 12:39
  • @BenjaminGruenbaum Thanks for the book reference, I'll take a look! Also, I updated my answer to use "Actor" instead of "Agent". – paul Jul 12 '13 at 14:01
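
The batching point rwong raises in the comments above can be sketched as a small helper that splits one large batch into fixed-size chunks, so that several processing threads can each take a chunk instead of one thread taking the whole batch. `split_into_chunks` is a hypothetical name, and the chunk size of 100 is just the figure from the comment.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Split `items` into consecutive chunks of at most `chunk_size` elements.
template <typename T>
std::vector<std::vector<T>> split_into_chunks(const std::vector<T>& items,
                                              std::size_t chunk_size) {
    assert(chunk_size > 0);  // a zero chunk size would never make progress
    std::vector<std::vector<T>> chunks;
    for (std::size_t start = 0; start < items.size(); start += chunk_size) {
        std::size_t end = std::min(start + chunk_size, items.size());
        // Copy the half-open range [start, end) into a new chunk.
        chunks.emplace_back(items.begin() + start, items.begin() + end);
    }
    return chunks;
}
```

Each chunk could then be queued as its own message, which keeps all processing threads busy even when the mail arrives in one large batch.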