Your bitset idea has promise and reminded me of Bloom filters which have gained popularity of late. I think the central problem - as you noted - is this: "How to handle possible duplicates/collisions?"
Fortunately this is a problem that has been studied heavily with respect to hash table implementation - see the "Collision Resolution" section of the Wikipedia article on hash tables. Much of this knowledge has been baked into existing libraries of course, but unfortunately that may not always translate perfectly into what you'd like to do.
It appears that you have a couple of valuable constraints that may make the problem quite tractable:
- IDs can be represented as `UInt64`s.
- Only 20M items need to fit into the cache.
You've already considered that the key range of #1 is too big for memory - but #2 can be quite helpful. As pointed out elsewhere, the raw bytes to store 20M `UInt64`s come to quite a reasonable 160MB - though of course other data-structure overhead can be heavy on top of that. It's still a useful reference point.
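To put both constraints in perspective (and since your side of this is Java), here's a quick back-of-the-envelope check - the numbers, not the code, are the point:

```java
public class MemoryMath {
    public static void main(String[] args) {
        // Constraint #1: a bitset covering every possible 64-bit ID would need
        // 2^64 bits = 2^61 bytes - roughly 2.3 exabytes. Not happening.
        double bitsetBytes = Math.pow(2, 64) / 8;
        System.out.printf("Full UInt64 bitset: %.1f exabytes%n", bitsetBytes / 1e18);

        // Constraint #2: a raw array of 20M 64-bit IDs is the 160MB reference point.
        long rawBytes = 20_000_000L * 8;
        System.out.println("Raw 20M id array: " + rawBytes / 1_000_000 + " MB");
    }
}
```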
So first, are the built-in data structures really that bad? I tried .NET's built-in `HashSet<UInt64>` and - depending on your key distribution - it used about 720MB of memory and took only 6.5 seconds:
Estimated hash memory usage: 719833240
Total addition time: 00:00:06.4888422
Stats:
Added: 18851584
Found: 1148416
Using `UInt64` directly probably saved quite a bit. (Although I'm afraid you'll have to see what Java can do.)
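On the Java side, a rough equivalent of that baseline experiment might look like the sketch below (the class name and setup are my own, not a tested port). Note that Java's `HashSet<Long>` boxes every value, so expect noticeably more per-entry overhead than .NET's `HashSet<UInt64>`. I've scaled the loop down to 2M draws to keep it quick; bump `N` to 20M for a real comparison.

```java
import java.util.HashSet;
import java.util.Random;

public class HashSetBaseline {
    // Scaled down 10x for a quick run; use 20_000_000 for the real comparison.
    static final int N = 2_000_000;

    public static void main(String[] args) {
        Random random = new Random(12345);
        HashSet<Long> seen = new HashSet<>(N * 2); // pre-size to limit rehashing
        int added = 0, found = 0;
        long start = System.nanoTime();
        for (int i = 0; i < N; i++) {
            // Draw from a range ~5x the item count, like the C# simulated data.
            long id = 1 + random.nextInt(5 * N);
            if (seen.add(id)) added++; // add() returns false for duplicates
            else found++;
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("Added: " + added + ", Found: " + found + ", Time: " + elapsedMs + " ms");
    }
}
```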
But can we do better by leveraging constraint #2 while still staying relatively lazy and leaning on library built-ins? Quite possibly, yes. One simple approach is to build the world's simplest hash table ourselves: a fixed bucket count with a few slots per bucket, falling back to a built-in set whenever a bucket fills up. The slots-per-bucket count becomes an easy space-vs-time knob, and we never have to implement any of the hard collision-management work because we punt that to the overflow set.
With this type of approach, I saw results like this on my box (code below). Note the slightly reduced memory usage and the substantially reduced time.
Estimated hash memory usage: 562606716 (total 723054508)
Total addition time: 00:00:02.6151904
Stats:
AddedInSlot: 18426506
FoundInPrimarySlot: 988489
FoundInLaterSlot: 146550
AddedInOverflow: 425078
FoundInOverflow: 13377
Code (caveat emptor - there may be bugs! I also used explicit types everywhere for maximum clarity):
using System;
using System.Collections.Generic;
using System.Diagnostics;

namespace BigHashChecker
{
    public enum AddResult
    {
        AddedInSlot,
        FoundInPrimarySlot,
        FoundInLaterSlot,
        FoundInOverflow,
        AddedInOverflow
    }

    public class BigHash
    {
        public const Int32 BUCKET_SIZE = 2;
        public const Int32 BUCKET_COUNT = 0x2000000; //Round 20M up to the next power of two (required for BUCKET_MASK)
        public const UInt32 BUCKET_MASK = BUCKET_COUNT - 1;
        public const UInt64 RESERVED_VALUE = 0; //Exclude ID 0 for simplicity; it marks an empty slot

        public UInt64[] Buckets = new UInt64[BUCKET_SIZE * BUCKET_COUNT];
        public HashSet<UInt64> Overflow = new HashSet<UInt64>();

        public AddResult Add(UInt64 value)
        {
            //Find which bucket it should go in.
            //Note this greatly depends on the distribution of your keys.
            //However, we will stick with the most naive method possible
            //for speed and simplicity right now.
            UInt32 whichBucket = (UInt32)(value & BUCKET_MASK);
            UInt32 bucketStartIndex = whichBucket * BUCKET_SIZE;

            for (UInt32 i = 0; i < BUCKET_SIZE; i++)
            {
                UInt64 slotValue = Buckets[bucketStartIndex + i];
                if (slotValue == RESERVED_VALUE)
                {
                    Buckets[bucketStartIndex + i] = value;
                    return AddResult.AddedInSlot;
                }
                else if (slotValue == value)
                {
                    return i == 0 ? AddResult.FoundInPrimarySlot : AddResult.FoundInLaterSlot;
                }
            }

            //Could not fit it into the bucket slots; check/add the overflow set.
            //HashSet<T>.Add returns false when the value is already present.
            if (!Overflow.Add(value))
            {
                return AddResult.FoundInOverflow;
            }
            return AddResult.AddedInOverflow;
        }
    }

    class Program
    {
        private const int SIMULATED_DATA_COUNT = 20 * 1000 * 1000;

        static void Main(string[] args)
        {
            //UInt64[] simulatedData = GenerateUniformData(); //Uniform UInt64 - very sparse, virtually no overlap
            //UInt64[] simulatedData = GenerateDataOverRange(1, 0x2000000); //Match the bucket count - minimal overlap
            //UInt64[] simulatedData = GenerateDataOverRange(1, (int)(0.95 * SIMULATED_DATA_COUNT)); //Force ~5% overlap
            UInt64[] simulatedData = GenerateDataOverRange(1, 5 * 0x2000000); //Range of 5x the bucket count - mostly sparse but a few second/overflow slots

            Int64 sampleDataMemoryBaseLine = GC.GetTotalMemory(false);
            BigHash bigHash = new BigHash();

            Console.WriteLine("Running...");
            int[] addResults = new int[5];
            Stopwatch stopwatch = Stopwatch.StartNew();
            for (int i = 0; i < simulatedData.Length; i++)
            {
                AddResult result = bigHash.Add(simulatedData[i]);
                addResults[(int)result]++;
            }
            stopwatch.Stop();

            Int64 withHashMemoryBaseline = GC.GetTotalMemory(false);
            Int64 estimateMemoryBytes = withHashMemoryBaseline - sampleDataMemoryBaseLine;
            Console.WriteLine("Estimated hash memory usage: " + estimateMemoryBytes + " (total " + withHashMemoryBaseline + ")");
            Console.WriteLine("Total addition time: " + stopwatch.Elapsed);
            Console.WriteLine("Stats:");
            Console.WriteLine(AddResult.AddedInSlot + ": " + addResults[(int)AddResult.AddedInSlot]);
            Console.WriteLine(AddResult.FoundInPrimarySlot + ": " + addResults[(int)AddResult.FoundInPrimarySlot]);
            Console.WriteLine(AddResult.FoundInLaterSlot + ": " + addResults[(int)AddResult.FoundInLaterSlot]);
            Console.WriteLine(AddResult.AddedInOverflow + ": " + addResults[(int)AddResult.AddedInOverflow]);
            Console.WriteLine(AddResult.FoundInOverflow + ": " + addResults[(int)AddResult.FoundInOverflow]);
            Console.ReadKey();
        }

        static UInt64[] GenerateUniformData()
        {
            UInt64[] simulatedData = new UInt64[SIMULATED_DATA_COUNT];
            Random random = new Random(Seed: 12345);
            byte[] uint64RandomBuffer = new byte[8];
            for (int i = 0; i < SIMULATED_DATA_COUNT; i++)
            {
                random.NextBytes(uint64RandomBuffer);
                simulatedData[i] = BitConverter.ToUInt64(uint64RandomBuffer, 0);
            }
            return simulatedData;
        }

        static UInt64[] GenerateDataOverRange(int minValue, int maxValue)
        {
            UInt64[] simulatedData = new UInt64[SIMULATED_DATA_COUNT];
            Random random = new Random(Seed: 123456);
            for (int i = 0; i < SIMULATED_DATA_COUNT; i++)
            {
                //Bug fix: use minValue rather than a hard-coded 1.
                simulatedData[i] = (UInt64)random.Next(minValue, maxValue);
            }
            return simulatedData;
        }
    }
}
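And since the final implementation will be Java, here's how the same scheme might translate - a flat `long[]` of bucket slots plus a `HashSet<Long>` overflow. This is my own untested sketch, not a library API; the constructor takes the bucket count so the demo can run with a tiny table (use `1 << 25` buckets for your 20M case).

```java
import java.util.HashSet;

public class BigHashJava {
    static final long RESERVED = 0; // exclude ID 0 so it can mark an empty slot

    final int bucketSize;
    final int bucketMask;
    final long[] buckets;
    final HashSet<Long> overflow = new HashSet<>();

    /** bucketCount must be a power of two so (value & mask) picks a bucket. */
    BigHashJava(int bucketCount, int bucketSize) {
        this.bucketSize = bucketSize;
        this.bucketMask = bucketCount - 1;
        this.buckets = new long[bucketCount * bucketSize];
    }

    /** Returns true if value was newly added, false if it was already present. */
    boolean add(long value) {
        int start = (int) (value & bucketMask) * bucketSize;
        for (int i = 0; i < bucketSize; i++) {
            long slot = buckets[start + i];
            if (slot == RESERVED) {
                buckets[start + i] = value; // empty slot: claim it
                return true;
            }
            if (slot == value) {
                return false; // duplicate found in a bucket slot
            }
        }
        return overflow.add(value); // bucket full: punt to the overflow set
    }

    public static void main(String[] args) {
        BigHashJava h = new BigHashJava(1 << 10, 2); // tiny table for the demo
        System.out.println(h.add(42)); // prints true (newly added)
        System.out.println(h.add(42)); // prints false (duplicate)
    }
}
```

At full size (`1 << 25` buckets of 2 slots), the flat array alone is 512MB of primitive longs with zero per-entry boxing - the same ballpark as the C# numbers above.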
Sounds like you have a fun problem! I hope everything works out well in your implementation.