
I'm partitioning a stream of input data between n servers. The simple way of doing it is to use the hash of some property of each input data packet to assign a single server, using mod or similar, and be done with it.
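
For concreteness, a minimal sketch of that simple approach (Python; `key` stands in for whatever packet property gets hashed):

```python
import hashlib

def assign_server(key: bytes, n: int) -> int:
    """Map a packet key to a single server index with a stable hash."""
    # A stable hash (rather than Python's per-process hash()) keeps
    # the assignment deterministic across restarts.
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % n
```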

However, I want some degree of resiliency: if one server goes down, nothing is lost. I want to partition each data packet to m servers, where 1 < m < n, with each packet guaranteed to go to at least m servers (possibly more). Furthermore, I want the partitioning to be stateless, deterministic, and well-distributed: the calculation uses only the hash(es) of the input data.
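
To make the requirement concrete, here is a sketch of the kind of function I mean. It is a naive construction (hash the key together with each server id and keep the m highest-scoring servers); whether a scheme like this is actually well-distributed is part of what I'm asking:

```python
import hashlib

def assign_servers(key: bytes, n: int, m: int) -> list[int]:
    """Deterministically choose m distinct servers out of n for a key."""
    def score(server: int) -> int:
        # Independent, reproducible score for each (key, server) pair.
        digest = hashlib.sha256(key + server.to_bytes(4, "big")).digest()
        return int.from_bytes(digest[:8], "big")

    # Keep the m servers with the highest scores; the result is
    # stateless and deterministic because it depends only on the key.
    return sorted(range(n), key=score, reverse=True)[:m]
```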

This feels like something that research papers have been written about, but my google-fu has failed me. Are there any existing algorithms which do this, ideally generalisable across n and m?

thecoop
  • [Which hashing algorithm is best for uniqueness and speed?](http://programmers.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed) – gnat May 12 '16 at 13:46
  • That is not relevant; this question is about how, given a hash value, you use it to assign to several different buckets while ensuring a good distribution across them. – thecoop May 12 '16 at 14:04
  • Pretty sure that being well-distributed precludes being stateless and deterministic, if by well-distributed you mean avoiding hotspots. Otherwise, a simple function like `mod (ni/m)`, where `ni` is the initial number of servers, is probably well-distributed on a large enough time scale (see the sketch after these comments). :) But it doesn't behave like you'd want a cluster to behave: adding nodes only adds more redundancy. To increase performance (increase n while keeping m constant), you'd have to rebalance the data between nodes. – Kasey Speakman May 18 '16 at 22:56
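
For illustration, a minimal sketch of the grouping scheme from the last comment, assuming n is a fixed multiple of m (hypothetical helper, Python):

```python
def assign_group(key_hash: int, n: int, m: int) -> list[int]:
    """Pick a fixed group of m servers via hash mod (n / m)."""
    assert n % m == 0, "this scheme assumes n is a multiple of m"
    group = key_hash % (n // m)
    # Every packet hashing to this group goes to the same m servers,
    # so growing the cluster means adding whole groups or rebalancing.
    return [group * m + i for i in range(m)]
```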

0 Answers