7

How can uniformly and deterministically distribute set of GUIDS to N buckets.

  1. N can be as small as 2.
  2. Need to make sure the same GUID is always mapped to the same bucket.
  3. Can't use any additionally memory.
  4. Input GUID set is not known in advance, and will be generated using common library functions available in standard libraries available in c#, java etc.

BucketId = GUID random part % N, would satisfy the consistency part, but, I don't think it will be uniform or not.

FaisalMansoor
  • 181
  • 1
  • 5
  • 2
    How are your GUIDs being generated? – Philip Kendall Jun 06 '15 at 18:30
  • 2
    please don't **[cross-post](http://meta.stackexchange.com/tags/cross-posting/info "'Cross-posting is frowned upon...'")**: http://stackoverflow.com/questions/30686234/uniformly-distributing-guids-to-bucket-of-size-n – gnat Jun 06 '15 at 18:51
  • @PhilipKendall I am using Guid.NewGuid() for generating guids. https://msdn.microsoft.com/en-us/library/system.guid.newguid%28v=vs.110%29.aspx – FaisalMansoor Jun 08 '15 at 05:39
  • 1
    I know this is closed, but using a SHA-1 hash for example on the guid would provide a good distribution. GUIDs by them selves do not have this property, so partitioning by modulo would probably result in unwanted results. – Joe C Jun 17 '16 at 18:20

1 Answers1

8

(Note: you probably mean "to N buckets". Or "to a group of buckets of size N".)

"GUID random part % N" is the most uniform you can ever hope for.

Lack of uniformity will only be evident in a small data set, in which performance does not matter anyway. In a large data set, where performance matters, it will be quite uniform.

Of course, these are random numbers that we are talking about, so absolute uniformity (all buckets having the exact same load) is practically impossible.

So, when you speak of uniformity, and knowing that you cannot have absolute uniformity, you must be willing to accept some "good enough" uniformity. Which in turn means that you must have some particular uniformity requirements in mind. Please tell us your requirements, and why you arrived at them.

What, you don't have any such requirements? You are just worried before you have an actual problem in your hands? Well then, I would recommend that you stop worrying about the uniformity offered by "GUID random part % N". In all likelihood it will be much better than what you would ever need.

(And, in any case, unbeatable.)

Mike Nakis
  • 32,003
  • 7
  • 76
  • 111
  • Mike can you please clarify how you would go about dividing up a dataset of guids into buckets? – Alex Gordon Aug 14 '15 at 06:19
  • 1
    @IIIIIllllllllIlllllIIIIIIIIlll Suppose B is our array of buckets, addressable using indexes from 0 to N - 1. Each bucket is essentially a container. For each guid G, let R be the random part of the guid expressed as a number, calculate i = R % N, and then append that guid to bucket B[i]. Repeat for all guids. The `%` sign here is the modulus operator, which gives you a number from 0 to N-1, suitable for using as an index into the array of buckets. – Mike Nakis Aug 14 '15 at 21:08
  • Of course this assumes that the range of R is larger (much larger) than N. This is true for any reasonable implementation of guids. If not, then you might need to multiply R by another (preferably huge) random number in order to obtain a large range. – Mike Nakis Aug 14 '15 at 21:13