
Given a workload with many long running CPU/IO intensive operations (e.g. producing multi-GB text files from database reads and business rules), what is the best way to balance .net application load across a group of Windows Server 2012 R2 VMs?

Ideally, the approach would allow for adding/removing/patching nodes as needed. It's pretty much a 24/7 operation with little opportunity for maintenance windows.

Background

We are in the process of migrating millions of lines of COBOL code from an IBM mainframe to C#. The current workload is batch-centered, with lots of jobs that fire up at a scheduled time to perform units of work of various sizes.

Wherever possible, we'll look to make the converted processes more real-time and event-driven. However, because of our dependence on existing inputs/outputs to/from many upstream and downstream partners, and a desire to get from one platform to the other in the shortest time possible with minimal disruption, we're going to have constraints that prevent us from going with a completely greenfield approach.

Due to the large volumes of constantly changing data involved and the interconnectedness of that data with other legacy systems, moving to a cloud-based infrastructure is probably not feasible.

Approaches we're considering

JMS queues - We have an enterprise-class, JMS-compliant ESB. We could queue the work and have consumers on each VM pull work based on resource availability.

Task coordinator - We've considered creating a custom workload manager that would monitor servers and decide where to send work. Something along these lines.

Dan Pichelman
Paul G
    Check out [Project Orleans](https://github.com/dotnet/orleans). – Kasey Speakman Oct 16 '15 at 13:39
  • what technology to take up next is explicitly off-topic per [help/on-topic]. See http://meta.programmers.stackexchange.com/a/6486/40980 – gnat Oct 16 '15 at 13:44
    @gnat While I agree that this is probably off-topic, I disagree that this specific rule applies here. This is a question about which technology will solve a particular problem, not a "what should I learn" question. If you interpret the question as asking which concepts to use for this scaling instead of which existing tools, it could even be on-topic here. – CodesInChaos Oct 16 '15 at 13:46
  • @gnat - Thanks. My intent was definitely to focus more on scaling patterns than any particular tool/technology, although I do have some technology constraints. If the consensus is that it would be better suited for a different site in the SE family, I'll happily delete it here and move it there. – Paul G Oct 16 '15 at 13:55
    @PaulG I'd first try asking about the concepts here. But perhaps that would still be considered too broad. Then after you've figured out what you need, you could ask on [softwarerecs.se] about existing tools matching your requirements. – CodesInChaos Oct 16 '15 at 14:02
  • When you're all done I'd be curious to know how much slower I/O becomes. My understanding is that COBOL is wicked-fast reading & writing files. – radarbob Oct 19 '15 at 00:12

1 Answer


Here is an excerpt from a related question, specifically about scaling.

Scaling across machines can be done in various ways, but what you likely want is partitioning. The easiest way is to hash some request value to decide which server to send work to. For instance, if your customer ID is an integer (or can be consistently reduced to one, e.g. with GetHashCode) and you have 3 machines processing tasks, you can use a simple modulus to decide which customer goes to which machine: machineNumber = customerId % 3. Customer IDs 1, 4, 7, etc. will always go to machine 1. However, if machine 1 goes down, a third of the customer requests will not get processed until it comes back up. Since these are long-running imports anyway, that's likely not a big deal. The load will also not be distributed evenly, since some customers are usually heavier users than others. Again, probably not a huge deal. Measure to make sure.
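As a minimal sketch of the modulus idea (the machine names here are placeholders, not anything from the question):

```csharp
using System;

public static class PartitionRouter
{
    // Hypothetical machine names; the real cluster would supply these.
    static readonly string[] Machines = { "machine-0", "machine-1", "machine-2" };

    public static string RouteCustomer(int customerId)
    {
        // Modulus partitioning: the same customer always lands on the same machine.
        // (A negative ID would need Math.Abs or a proper hash first.)
        int machineNumber = customerId % Machines.Length;
        return Machines[machineNumber];
    }

    public static void Main()
    {
        Console.WriteLine(RouteCustomer(1)); // machine-1
        Console.WriteLine(RouteCustomer(4)); // machine-1 (same partition as 1)
        Console.WriteLine(RouteCustomer(6)); // machine-0
    }
}
```

Because the mapping is deterministic, no coordination between nodes is needed; the trade-off is exactly the one described above: a down machine strands its whole partition.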

Another way that is resilient to failure is to use a distributed directory. It keeps track of which node currently owns which customer. Project Orleans uses a mechanism like this. It allows for nodes to fail and customers to be transitioned to another node. Before allocating a new customer on a node, you can also query the node to see which is the least loaded. However, I'm not aware of a pre-built component for this purpose, and building it yourself is perilous to your time.
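To make the directory idea concrete, here is a deliberately naive in-memory sketch (all names are illustrative; a real version would need persistence and cluster coordination, which is what Orleans provides):

```csharp
using System.Collections.Generic;
using System.Linq;

// Naive single-process sketch of a distributed directory: tracks which node
// owns which customer, allocates new customers to the least-loaded node,
// and releases a failed node's customers for reassignment.
public class CustomerDirectory
{
    private readonly Dictionary<int, string> owner = new Dictionary<int, string>();
    private readonly List<string> nodes =
        new List<string> { "node-a", "node-b", "node-c" };

    public string GetOrAssignNode(int customerId)
    {
        if (owner.TryGetValue(customerId, out var node))
            return node; // stable ownership while the node is alive

        // Allocate to the node currently owning the fewest customers.
        var leastLoaded = nodes
            .OrderBy(n => owner.Values.Count(v => v == n))
            .First();
        owner[customerId] = leastLoaded;
        return leastLoaded;
    }

    public void OnNodeFailed(string failedNode)
    {
        nodes.Remove(failedNode);
        // Drop the failed node's assignments; those customers get a new
        // owner on their next lookup.
        foreach (var id in owner.Where(kv => kv.Value == failedNode)
                                .Select(kv => kv.Key).ToList())
            owner.Remove(id);
    }
}
```

The hard parts this sketch ignores (keeping the directory itself available, agreeing on node liveness) are precisely why building it yourself is perilous to your time.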

Since you mentioned queues, you might look at a pattern called Competing Consumers, which is available in various queue systems. In that case, you set up multiple machines or VMs (nodes) to service the queue. Whenever a new message comes through, the first node to take it gets the message and is expected to process it.
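The shape of the pattern can be sketched in-process with a BlockingCollection standing in for the broker (with a real JMS/ESB queue, each consumer would run on its own VM and pull messages over the network):

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class CompetingConsumersDemo
{
    // Sketch only: an in-process queue shared by several competing consumers.
    public static int Run(int jobCount, int consumerCount)
    {
        var queue = new BlockingCollection<string>();
        int processed = 0;

        // Competing consumers: whichever worker takes a message first owns it.
        var consumers = Enumerable.Range(0, consumerCount)
            .Select(_ => Task.Run(() =>
            {
                foreach (var job in queue.GetConsumingEnumerable())
                {
                    // ...the long-running CPU/IO work would happen here...
                    Interlocked.Increment(ref processed);
                }
            }))
            .ToArray();

        for (int j = 0; j < jobCount; j++) queue.Add($"job-{j}");
        queue.CompleteAdding(); // signal "no more work"; consumers drain and exit
        Task.WaitAll(consumers);
        return processed;       // each job is handled exactly once
    }

    public static void Main() => Console.WriteLine(Run(10, 3));
}
```

A nice property here is that load balancing falls out for free: a busy node simply takes messages less often, and adding or removing a node is just starting or stopping a consumer.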

The Issues and Considerations section of the competing consumers link has some excellent points regarding such a queue/workload system.

Kasey Speakman