
This is basically a logging/counting application that counts the number of packets, the packet types, etc. on a P2P chat network. That works out to about 4-6 million packets in a 5-minute period. Because I only take a "snapshot" of this information, I remove packets older than 5 minutes only every five minutes. So the maximum number of items that will be in this collection is 10 to 12 million.

Because I need to make 300 connections to different superpeers, each packet may be received, and an insert attempted, at least 300 times (which is probably why holding this data in memory is the only reasonable option).

Currently, I have been using a Dictionary for storing this information. But because of the large number of items I'm trying to store, I run into issues with the large object heap, and memory usage grows continuously over time.

Dictionary<ulong, Packet>

public class Packet
{
    public ushort RequesterPort;   // combined with the IP address (stored in the key) to identify a peer
    public bool IsSearch;
    public string SearchText;
    public bool Flagged;
    public byte PacketType;        // used for the per-type counts in the snapshot
    public DateTime TimeStamp;     // used to expire packets older than 5 minutes
}
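In simplified form, the insertion path looks something like this (a sketch; the real code differs, but the duplicate check is why every packet has to stay around for the full 5 minutes):

private readonly Dictionary<ulong, Packet> packets = new Dictionary<ulong, Packet>();

public bool TryRecord(ulong key, Packet packet)
{
    lock (packets)                       // 300 connections insert concurrently
    {
        if (packets.ContainsKey(key))    // already seen within the last 5 minutes
            return false;
        packets.Add(key, packet);
        return true;
    }
}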

I have tried using MySQL, but it was not able to keep up with the amount of data I need to insert (while checking to make sure it was not a duplicate), and that was while using transactions.

I tried MongoDB, but the CPU usage for that was insane and it could not keep up either.

My main issue arises every 5 minutes, when I remove all packets that are older than 5 minutes and take a "snapshot" of this data. I use LINQ queries to count the number of packets of a given packet type. I also run a Distinct() query on the data, where I strip 4 bytes (the IP address) out of the KeyValuePair's key, combine it with the RequesterPort value from the KeyValuePair's value, and use that to get the number of distinct peers across all the packets.
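Roughly, the snapshot logic looks like this (simplified; the exact bit layout of the key doesn't matter here, only that the IP address lives in 4 bytes of it):

var cutoff = DateTime.UtcNow.AddMinutes(-5);

// count packets per packet type
var countsByType = packets.Values
    .GroupBy(p => p.PacketType)
    .ToDictionary(g => g.Key, g => g.Count());

// distinct peers: 4 bytes of IP taken from the key, combined with the requester port
int distinctPeers = packets
    .Select(kvp => ((kvp.Key & 0xFFFFFFFFUL) << 16) | kvp.Value.RequesterPort)
    .Distinct()
    .Count();

// purge everything older than 5 minutes
var expired = packets.Where(kvp => kvp.Value.TimeStamp < cutoff)
                     .Select(kvp => kvp.Key)
                     .ToList();
foreach (var key in expired)
    packets.Remove(key);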

The application currently hovers around 1.1GB of memory usage, and when a snapshot is called it can go so far as to double the usage.

Now this wouldn't be an issue if I had an insane amount of RAM, but the VM I have this running on is limited to 2GB of RAM at the moment.

Is there any easy solution?

Josh
  • It's a very memory-intensive scenario, and on top of that you are running the application in a VM, wow. Anyway, did you explore memcached for storing the packets? Basically you can run memcached on a separate machine and the application can keep running on the VM itself. –  Mar 14 '12 at 07:07
  • As you've tried both MySQL and MongoDB already, it would seem that maybe the requirements of your application (if you want to do it right) dictate that you simply need more horsepower. If your application is important to you, beef up the server. You also might want to revisit your "purging" code. I'm sure you could find a more optimized way of handling that, so that it doesn't make your app unusable. – Matt Beckman Mar 14 '12 at 07:13
  • What does your profiler tell you? – jasonk Mar 15 '12 at 13:08
  • You won't get anything faster than local heap. My suggestion would be to manually invoke garbage collection after purging. – vartec Mar 15 '12 at 13:20
  • @vartec - as a matter of fact, contrary to popular belief, manually invoking the garbage collector does not actually guarantee immediate, well... garbage collection. The GC might defer the action to a later period according to its own GC algorithm. Invoking it every 5 minutes might even add to the strain, instead of relieving it. Just saying ;) – Jas Mar 15 '12 at 21:57
  • @Jas: AFAIK, it depends on the parameter you pass. But in the general case, yeah, you have a good point. – vartec Mar 16 '12 at 09:32
  • GC.WaitForFullGCComplete(int timeoutInMS) is the API to use if you want to wait to ensure that a full GC.Collect has taken place. – Norman H Jan 24 '14 at 14:29
  • Redis sounds like a useful consideration, if you were thinking MySQL and MongoDB but they were too slow. In-memory storage on a separate server. :) Alternatively, I would consider B-Trees for their memory-efficiency. The challenge is that there are **2** desirable indexes: one on timestamp and one on [unique identifier of packet]. If you can get away with storing all 300 duplicates, you can skip the latter index, of course. Third, there are dynamic 2-way resizable arrays that might work with timestamp order as well. There's a paper online that will help you implement it... yourself, though. – Timo Feb 01 '16 at 08:42
  • I would suggest constantly deleting and adding, instead of waiting for 5 minutes to do both. – Krythic Jul 21 '16 at 13:39

5 Answers


Instead of having one dictionary and searching that dictionary for entries that are too old, have 10 dictionaries. Every 30 seconds or so create a new "current" dictionary and discard the oldest dictionary with no searching at all.

Next, when you're discarding the oldest dictionary, put all the old objects onto a FILO queue for later, and instead of using "new" to create new objects pull an old object off the FILO queue and use a method to reconstruct the old object (unless the queue of old objects is empty). This can avoid a lot of allocations and a lot of garbage collection overhead.
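A minimal sketch of the idea (names and sizes are only illustrative):

class SlicedPacketStore
{
    const int SliceCount = 10;                           // 10 x 30 s = the 5-minute window
    readonly Dictionary<ulong, Packet>[] _slices = new Dictionary<ulong, Packet>[SliceCount];
    readonly Stack<Packet> _pool = new Stack<Packet>();  // the "FILO queue" of recycled objects
    int _current;

    public SlicedPacketStore()
    {
        for (int i = 0; i < SliceCount; i++)
            _slices[i] = new Dictionary<ulong, Packet>();
    }

    // Called every 30 seconds: the oldest slice is discarded wholesale (no
    // searching), and its Packet objects go back to the pool instead of to the GC.
    public void Rotate()
    {
        _current = (_current + 1) % SliceCount;
        foreach (var p in _slices[_current].Values)
            _pool.Push(p);
        _slices[_current] = new Dictionary<ulong, Packet>();
    }

    // Use instead of "new Packet()" so old objects are reused.
    public Packet RentPacket()
    {
        return _pool.Count > 0 ? _pool.Pop() : new Packet();
    }

    public bool TryAdd(ulong key, Packet packet)
    {
        // the duplicate check still has to look at every live slice
        for (int i = 0; i < SliceCount; i++)
            if (_slices[i].ContainsKey(key))
                return false;
        _slices[_current].Add(key, packet);
        return true;
    }
}

Dropping a whole slice is O(1) no matter how many packets it holds; the trade-off is that the duplicate check has to consult every live slice (see the comments below).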

Brendan
  • Partitioning by time slice! Just what I was going to suggest. – James Anderson Mar 16 '12 at 02:15
  • The issue with this is, I would have to query all of those dictionaries that were made within the last five minutes. As there are 300 connections, the same packet is going to arrive at each one at least once. So in order to not handle the same packet more than once, I must keep them for at least the 5-minute period. – Josh Mar 16 '12 at 07:19
  • Part of the problem with generic structures is that they aren't customised for a specific purpose. Perhaps you should add a "nextItemForHash" field and a "nextItemForTimeBucket" field to your Packet structure and implement your own hash table, and stop using Dictionary. That way you can quickly find all packets that are too old and only search once when a packet is inserted (ie. have your cake and eat it too). It'd also help for memory management overhead (as "Dictionary" wouldn't be allocating/freeing extra data structures for Dictionary management). – Brendan Mar 16 '12 at 10:29
  • @Josh the fastest way to determine if you've seen something before is a [hashset](http://stackoverflow.com/a/4558802/156755). Time-sliced hash sets would be fast and you still wouldn't need to search to evict old items. If you haven't seen it before, then you can store it in your dictionar(y/ies). – Basic Nov 01 '16 at 22:41
  • https://www.infoq.com/articles/Big-Memory-Part-3 – itadapter DKh Jul 15 '17 at 22:43

Simple approach: try memcached.

  • It is optimized to run tasks like this.
  • It can reuse spare memory on less busy boxes, not only on your dedicated box.
  • It has built-in cache expiration mechanism, which is lazy so no hiccups.

The downside is that it's memory-based and does not have any persistence. If an instance is down, data is gone. If you need persistence, serialize the data yourself.

More complex approach: try Redis.

  • It is optimized to run tasks like this.
  • It has built-in cache expiration mechanism.
  • It scales / shards easily.
  • It has persistence.

The downside is that it's slightly more complex.
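For illustration, a rough sketch using the StackExchange.Redis client (the connection string and key naming are made up; memcached's add-with-expiry gives you the same pattern):

using System;
using StackExchange.Redis;

class PacketDeduplicator
{
    private readonly IDatabase _db;

    public PacketDeduplicator(string connectionString)
    {
        _db = ConnectionMultiplexer.Connect(connectionString).GetDatabase();
    }

    // SET pkt:<key> 1 EX 300 NX -- stores the key with a 5-minute expiry, but
    // only if it does not already exist; the return value is the duplicate check.
    public bool IsFirstTimeSeen(ulong packetKey)
    {
        return _db.StringSet("pkt:" + packetKey, 1, TimeSpan.FromMinutes(5), When.NotExists);
    }
}

Redis then handles both the 5-minute expiry and the "have I already seen this packet" check, and the memory lives outside the 2GB VM.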

9000
  • Memcached can be split across machines to increase the amount of ram available. You could have a second server serializing data to the filesystem so that you will not lose things if a memcache box goes down. The Memcache API is very simple to use and works from any language allowing you to use different stacks in different places. – Michael Shopsin Feb 03 '14 at 16:40

The first thought that springs to mind is: why wait 5 minutes? Could you do the snapshots more often and thus reduce the big load spike you see at the 5-minute boundary?

Secondly, LINQ is great for concise code, but in reality LINQ is syntactic sugar on "regular" C# and there is no guarantee that it will generate the most optimal code. As an exercise you could try to rewrite the hot spots without LINQ; you may not improve performance, but you will have a clearer idea of what you are doing, and it would make profiling easier.
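For example, the per-type counts and the distinct-peer count from the question could be computed in one plain pass over the dictionary (a sketch; the field names follow the question, the key layout is an assumption):

int[] countsByType = new int[256];           // PacketType is a byte
var seenPeers = new HashSet<ulong>();

foreach (var kvp in packets)
{
    countsByType[kvp.Value.PacketType]++;
    // 4-byte IP from the key combined with the requester port (layout assumed)
    ulong peer = ((kvp.Key & 0xFFFFFFFFUL) << 16) | kvp.Value.RequesterPort;
    seenPeers.Add(peer);
}
int distinctPeers = seenPeers.Count;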

Another thing to look at is data structures. I don't know what you do with your data, but could you simplify the data you store in any way? Could you use a string or byte array and then extract relevant parts from those items as you need them? Could you use a struct instead of a class and even do something evil with stackalloc to set aside memory and avoid GC runs?

Steve
  • Don't use a string/byte array, use something like a BitArray: http://msdn.microsoft.com/en-us/library/system.collections.bitarray.aspx to avoid having to manually bit-twiddle. Otherwise, this is a good answer, there's not really an easy option other than better algorithms, more hardware or better hardware. – Ed James Mar 15 '12 at 14:16
  • The five minute thing is due to the fact that these 300 connections may receive the same packet. So I have to keep track of what I have already handled, and 5 minutes is the amount of time it takes for packets to fully propagate to all nodes on this particular network. – Josh Mar 16 '12 at 07:15

You don't have to store all the packages for the queries you have mentioned. For example, here is a package type counter:

You need two arrays:

int[] packageCounters = new int[NumberOfTotalTypes];
int[,] counterDifferencePerMinute = new int[6, NumberOfTotalTypes];

The first array keeps track of how many packages there are of each type. The second array keeps track of how many packages have been added in each minute, so that you know how many packages need to be removed at every minute interval. I hope you can tell that the second array is used as a circular FIFO queue.

So for each package, the following operations are performed:

packageCounters[packageType] += 1;
counterDifferencePerMinute[current, packageType] += 1;
if (oneMinutePassed) {
    current = (current + 1) % 6;
    for (int i = 0; i < NumberOfTotalTypes; i++) {
        packageCounters[i] -= counterDifferencePerMinute[current, i];
        counterDifferencePerMinute[current, i] = 0;
    }
}

At any time, the package counters can be retrieved by index instantly, and we don't need to store all the packages.

Codism
  • The main reason for having to store the data that I do, is the fact that these 300 connections may receive the same exact packet. So I need to keep every seen packet for at least five minutes in order to make sure I do not handle/count them more than once. Which is what the ulong for the dictionary key is for. – Josh Mar 16 '12 at 07:16

(I know this is an old question, but I ran across it while looking for a solution to a similar problem where the second-gen garbage collection pass was pausing the app for several seconds, so I'm recording this for other people in a similar situation.)

Use a struct rather than a class for your data (but remember it's treated as a value with pass-by-copy semantics). This takes out one level of searching the GC has to do on each mark pass.

Use arrays (if you know the size of the data you are storing) or List<T>, which uses arrays internally. If you really need fast random access, use a dictionary of array indices. This takes out another couple of levels (or a dozen or more if you are using a SortedDictionary) for the GC to have to search.

Depending on what you are doing, searching a list of structs may be faster than the dictionary lookup (due to memory locality) - profile for your particular application.

The combination of struct and list reduces both the memory usage and the size of the garbage collector sweep significantly.
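A rough sketch of the shape this takes (sizes and names are illustrative; note that the SearchText string is still a heap reference the GC has to trace):

public struct PacketData                  // value type: stored inline in the list's backing array
{
    public ushort RequesterPort;
    public bool IsSearch;
    public string SearchText;             // still a heap reference
    public bool Flagged;
    public byte PacketType;
    public DateTime TimeStamp;
}

// one contiguous backing array instead of millions of individual objects
List<PacketData> packets = new List<PacketData>(12000000);
// fast random access via a dictionary of indices into the list
Dictionary<ulong, int> indexByKey = new Dictionary<ulong, int>();

void Add(ulong key, PacketData data)
{
    if (indexByKey.ContainsKey(key))      // duplicate from another connection
        return;
    indexByKey[key] = packets.Count;
    packets.Add(data);
}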

Malcolm
  • I have a recent experiment that generates collections & dictionaries on disk just as fast, using SQLite: https://github.com/modma/PersistenceCollections – ModMa Jul 10 '17 at 09:54