
I have a C# system that reads thousands of messages per minute from a Kafka topic. The messages are filtered, and I need to insert the filtered messages into an MSSQL table.

A normal insert statement is too slow and seems to hammer the database with thousands of inserts running at the same time.

(Currently using Dapper and SqlBulkCopy in the system - so I have access to those tools by default)

I've solved the issue by buffering the records and then performing a BulkCopy. That works, but it does mean that there is some latency while I buffer enough records to make the BulkCopy worthwhile. The users really want access to the records as soon as possible. We are in a high-volume, near-real-time data processing environment.
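Roughly, the current buffering looks like the sketch below (simplified - the message type, table name, and flush threshold are placeholders, not my actual code):

    // Simplified sketch of the current approach: buffer filtered records and
    // flush them with SqlBulkCopy once the buffer is big enough.
    using System.Collections.Generic;
    using System.Data;
    using Microsoft.Data.SqlClient; // or System.Data.SqlClient

    public record FilteredMessage(int Id, string Payload); // placeholder shape

    public class BufferedWriter
    {
        private const int FlushThreshold = 5000;              // tuned by experiment
        private readonly List<FilteredMessage> _buffer = new();
        private readonly string _connectionString;

        public BufferedWriter(string connectionString) => _connectionString = connectionString;

        public void Add(FilteredMessage message)
        {
            _buffer.Add(message);
            if (_buffer.Count >= FlushThreshold)
                Flush();
        }

        public void Flush()
        {
            if (_buffer.Count == 0) return;

            var table = new DataTable();
            table.Columns.Add("ID", typeof(int));
            table.Columns.Add("Payload", typeof(string));
            foreach (var m in _buffer)
                table.Rows.Add(m.Id, m.Payload);

            using var connection = new SqlConnection(_connectionString);
            connection.Open();
            using var bulkCopy = new SqlBulkCopy(connection) { DestinationTableName = "dbo.TableA" };
            bulkCopy.WriteToServer(table);

            _buffer.Clear();
        }
    }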

Any idea what other solutions I could try?

Would creating smaller buffers and then performing an insert like the one below provide a significant benefit?

   Insert into TableA( ID ) Values (1), (2), (3)
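With Dapper I imagine it would be built up roughly like this (just a sketch - parameterised rather than literal values, and SQL Server caps a single VALUES list at 1000 rows, so the batch would have to stay under that):

    // Illustrative only: a small multi-row INSERT issued through Dapper.
    using System.Collections.Generic;
    using System.Linq;
    using Dapper;
    using Microsoft.Data.SqlClient;

    public static class SmallBatchInsert
    {
        public static void Insert(SqlConnection connection, IReadOnlyList<int> ids)
        {
            // Builds "INSERT INTO TableA (ID) VALUES (@p0), (@p1), ..." with one
            // parameter per row; must stay under the 1000-row VALUES limit.
            var values = string.Join(", ", ids.Select((_, i) => $"(@p{i})"));
            var parameters = new DynamicParameters();
            for (var i = 0; i < ids.Count; i++)
                parameters.Add($"p{i}", ids[i]);

            connection.Execute($"INSERT INTO TableA (ID) VALUES {values}", parameters);
        }
    }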

Are there any standard patterns for handling this kind of situation?

Would transactions at all be useful in this scenario? Maybe multiple single inserts with a single commit for a small batch?
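For example, something along these lines (again just a sketch, reusing the single-column TableA from above):

    // Illustrative only: multiple single-row inserts sharing one transaction,
    // committed once per small batch. Assumes 'connection' is already open.
    using System.Collections.Generic;
    using Dapper;
    using Microsoft.Data.SqlClient;

    public static class TransactionalBatch
    {
        public static void Insert(SqlConnection connection, IEnumerable<int> ids)
        {
            using var transaction = connection.BeginTransaction();
            foreach (var id in ids)
                connection.Execute("INSERT INTO TableA (ID) VALUES (@id)", new { id }, transaction);
            transaction.Commit();
        }
    }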

At this point I'm just guessing on what to try next :)

  • What is your definition of "near-real-time"? The answers here will be very different if you need 10s of ms latency than if seconds of latency is okay. – Philip Kendall Nov 10 '21 at 11:01
  • I've seen a lot of very similar questions to this asked here and on Stack Overflow. I'm starting to think that an RDBMS is really not the right kind of tool for tasks like this. – Robert Harvey Nov 10 '21 at 12:51
  • Try to use Bulk Copy, but not with all the data at once. Instead, try smaller batches of 10, 50, or 100 records (you will have to run some experiments to find the optimal balance). – Doc Brown Nov 10 '21 at 13:17
  • @RobertHarvey It could also be that the task requirements didn't adequately specify the desired throughput, thus leading to running on underpowered hardware. Especially as data would seem to be desired to be pulled out of the DB at least as fast as it goes in (to service multiple clients). But at least you can solve hardware problems by simply throwing more money at them. However no amount of money thrown at software is going to truly fix inadequate hardware - all it can do is to trim the fat that currently exists. – Peter M Nov 10 '21 at 13:53
  • @PhilipKendall: I can stomach a latency of about 2 seconds - but there are thousands of records arriving each minute. So, if a single insert has a 2 second latency - I have a massive problem. The consumers are still just people - so a few seconds won't make a difference, but if it's more than 5 seconds late they will brand the system as a failure. – Mulciber Coder Nov 10 '21 at 14:48
  • @DocBrown The problem with BulkInsert is that it doesn't seem to scale well when you have multiple inserts happening at the same time. And, when you make batches too small, you have many sql connections being established. The process of establishing a connection is also relatively expensive... If I establish a connection for every 50 records (for example) - no amount of connection pooling will save me :) – Mulciber Coder Nov 10 '21 at 14:51
  • @PeterM Hardware isn't an option in this case. Anyway, we're not talking about a small server here. (roughly 256 GB RAM with 128+cores). The question meant to ask about programming paradigms, rather than brute force solutions. I do take your point though - there is a point where brute force is the only solution. – Mulciber Coder Nov 10 '21 at 14:54
  • @MulciberCoder With all due respect, 1000s of records a minute is small data; that's less than 200 per second. If inserting that number of records into your database is your bottleneck, the first thing I'd do would be to get a good database engineer to look at the structure of your database and make recommendations. – Philip Kendall Nov 10 '21 at 15:00
  • @PhilipKendall The database structure is a single table. Not much to optimize there. – Mulciber Coder Nov 10 '21 at 15:35
  • @MulciberCoder Indexes, partitions, constraints, etc, etc, etc... lots and lots of optimisation options, and they really do make a difference. Or maybe you want to have multiple tables. – Philip Kendall Nov 10 '21 at 15:39
  • Can't we use parallel executions here? Or use multiple threads for concurrent bulk copy? Or won't those work? I am just curious. – T.kowshik Yedida Nov 10 '21 at 16:46
  • @MulciberCoder having only a single table could be a potential reason for latency if there is contention on disk IO, indexes, etc. Some type of sharding may help. – Dan Wilson Nov 10 '21 at 17:56
  • Using an APM trace (inverted flame graph) may show you where time is being spent. – Dan Wilson Nov 10 '21 at 17:57
  • Ask here: https://dba.stackexchange.com/ ... Or here: https://google.com ... :D – svidgen Nov 10 '21 at 18:29
  • Also consider your isolation level. The usual setup will add an implied transaction to every insert, limiting your throughput and parallelization options. – Telastyn Nov 11 '21 at 01:48
  • This is a system with continual batches - why would you ever have to _establish_ a connection to the database in the steady state? Why can't you keep them open and reuse them for one batch after another? IOW you don't have to open-insert-close-repeat, you can open-insert-insert-insert-insert-...-close when done (see the sketch after these comments). – davidbak Nov 11 '21 at 02:27
  • What are you using to trigger the bulk update? Number of records, or time since last bulk update? Have you profiled the system to see where it's spending its time? – Ben Aveling Nov 11 '21 at 05:45
  • @davidbak Thanks David. That was the kind of answer I was really looking for. You've nailed at least one of the issues right on the head. I've used this solution once in a similar project and the only issue was gracefully (and cleanly in code) handling disconnections - but that is just a technicality. – Mulciber Coder Nov 11 '21 at 09:59
  • @BenAveling At the moment, I wait for about 1000 - 5000 records or for Kafka to not return data for more than 3 seconds. If either of those conditions happens - the load is triggered. I did that because just waiting for transactions could cause me to not flush to the db what I already have. The 1000 to 5000 depends on how busy the topic is - the busier it gets - the larger I make the buffer. Anything above 10 000 means that the write to sql starts to take a bit longer. Anything less than 1 000 starts creating too many bulkcopies. – Mulciber Coder Nov 11 '21 at 10:03
  • @DanWilson Thanks. Definitely worth a look. I strongly suspect that the issue is purely on the SQL insert side and, as davidbak pointed out, likely in the opening of connections, more so than the actual database inserts themselves. It would be nice to substantiate that though.... – Mulciber Coder Nov 11 '21 at 10:15
  • Good! just make sure that you know, if you're using some kind of intermediary layer between you and the database, where the _implicit_ transaction boundaries are if you aren't _explicitly_ encapsulating each insert into its own transaction ... hope that works. – davidbak Nov 11 '21 at 16:00
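A sketch of davidbak's suggestion from the comments - open one connection, reuse it for batch after batch, and only reconnect if it drops (table and column names are placeholders; reconnection handling is kept minimal):

    // Sketch only: one long-lived connection reused across SqlBulkCopy batches.
    using System.Collections.Generic;
    using System.Data;
    using Microsoft.Data.SqlClient;

    public sealed class ReusableBulkWriter : System.IDisposable
    {
        private readonly string _connectionString;
        private SqlConnection _connection;

        public ReusableBulkWriter(string connectionString) => _connectionString = connectionString;

        public void WriteBatch(IEnumerable<(int Id, string Payload)> rows)
        {
            // Reuse the existing connection; reopen only when it has been lost.
            if (_connection == null || _connection.State != ConnectionState.Open)
            {
                _connection?.Dispose();
                _connection = new SqlConnection(_connectionString);
                _connection.Open();
            }

            var table = new DataTable();
            table.Columns.Add("ID", typeof(int));
            table.Columns.Add("Payload", typeof(string));
            foreach (var (id, payload) in rows)
                table.Rows.Add(id, payload);

            using var bulkCopy = new SqlBulkCopy(_connection) { DestinationTableName = "dbo.TableA" };
            bulkCopy.WriteToServer(table);
        }

        public void Dispose() => _connection?.Dispose();
    }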

1 Answer


Note this article states that too large a batch size may be detrimental, so you need to tune it carefully. The article also has other recommendations on things to check regarding your indexes and I/O performance.

When talking about buffering, you also need to think about back pressure and delays through the entire pipeline. You don't want your database to wait on Kafka, or vice versa.

What I would do is have one thread continually polling Kafka, doing the filtering, and putting records into a buffer, and another thread doing a bulk copy of all the records buffered since the last bulk copy. That way your batch size naturally adjusts to what your database can handle, neither side is ever sitting idle, and your latency adjusts based on your incoming Kafka event rate.
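Roughly, that shape could look like the sketch below (not tested against your setup - the BlockingCollection buffer, message type, and table names are assumptions, not anything from your code):

    // Rough sketch: a Kafka-polling thread fills the buffer, a writer thread
    // drains whatever has accumulated and bulk copies it, so the batch size
    // adapts to how fast SQL Server keeps up.
    using System.Collections.Concurrent;
    using System.Data;
    using Microsoft.Data.SqlClient;

    public record FilteredMessage(int Id, string Payload); // placeholder shape

    public class AdaptiveBatchWriter
    {
        private readonly BlockingCollection<FilteredMessage> _buffer = new(boundedCapacity: 100_000);

        // Called from the Kafka polling/filtering thread.
        public void Enqueue(FilteredMessage message) => _buffer.Add(message);
        public void CompleteAdding() => _buffer.CompleteAdding();

        // Runs on its own thread until CompleteAdding() is called and the buffer drains.
        public void WriteLoop(string connectionString)
        {
            using var connection = new SqlConnection(connectionString);
            connection.Open();

            while (!_buffer.IsCompleted)
            {
                // Wait briefly for the first record, then grab everything else available.
                if (!_buffer.TryTake(out var first, millisecondsTimeout: 1000))
                    continue;

                var table = new DataTable();
                table.Columns.Add("ID", typeof(int));
                table.Columns.Add("Payload", typeof(string));
                table.Rows.Add(first.Id, first.Payload);

                while (_buffer.TryTake(out var next))
                    table.Rows.Add(next.Id, next.Payload);

                using var bulkCopy = new SqlBulkCopy(connection) { DestinationTableName = "dbo.TableA" };
                bulkCopy.WriteToServer(table);
            }
        }
    }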

I don't know what's available for C#, but many streaming libraries will help you with this sort of buffering. Monix observables have bufferIntrospective, for example.

Karl Bielefeldt
  • That article is spot on. There is a diminishing return on BulkCopy as the batch size grows. I think it's normally not noticeable when you do proper bulk loads - but in a smaller-batch environment you see it clearly. Larger batch sizes also have an impact on the log in SQL. I believe that I've adequately catered for the back-pressure - but I guess I'll find out as I start testing this more. You described my code almost to a tee (2 threads...). C# has a very nice BlockingCollection that acts as my buffer and helps with the producer-consumer problem. – Mulciber Coder Nov 11 '21 at 10:20