5

So imagine two HashMaps that need to be mirrored across two different machines, A and B. Whenever I make a modification to machine A, machine B sees it and vice-versa. Problem is:

These two hash maps need to remain always identical. What computer science strategy can we use to accomplish that? Is it possible to do it without a centralizing/master broker or hash map?

Peter Mel
  • 191
  • 3

3 Answers3

2

If there are multiple machines involved and network latency is non-zero, then keeping the data on those machines "always identical" is obviously impossible.

But there are many weaker yet useful forms of consistency you can achieve (Wikipedia even has a list of them). Achieving those forms of consistency is a problem largely orthogonal to what shape the data has, unless that data has some structure you can exploit such as a natural ordering. Since you said hashmaps, I'll assume there is no natural/useful ordering.

So the part of your question that I can give a direct and answer to is:

Is it possible to do it without a centralizing/master broker or hash map?

Yes, it is possible to achieve some useful forms of consistency without an authoritative "master" machine.

A simple but popular example would be "eventual consistency" where the conflict resolution strategy is "last write wins". Let's say that every hour or so your machines tell each other what the believe the most recent change to the value of foobar is. Once this happens, each machine can see all the timestamps sent by the other machines, so without any further communication each one can pick the latest timestamp and use that as the value of foobar from then on. Of course, it may take up to an hour for a given write operation to be reflected on all machines, which is why it's called eventual consistency. Most systems will be far more clever than that (it'd be silly for a website to go down for one minute every single hour), but that should at least give you an idea what sorts of guarantees you can achieve in practice.

Ixrec
  • 27,621
  • 15
  • 80
  • 87
2

The mirroring that is required for your purpose must be a synchronous mirroring.

This kind of replication strategy is generally achieved via mechanism implying ACID transactions. It always implies a certain latency when doing an operation on any of the machine.

Typically, this would work somewhat like this (simplified):

  • Machine A performs an operation that requires update of the map.
  • Machine A sets a lock on its map
  • Machine A ask B to set a lock on its map
  • Machine A updates its map
  • Machine A informs B of the required update
  • Machine B updates its map
  • Machine B informs A that it finished and releases the lock
  • Machine A releases the lock.

This approach is decentralized. No master. But this way of proceeding is very very heavy if there are many writes on both machines: your map will quickly become a bottleneck. And it is utterly complex: you have to address everything that can go wrong, for example, on of the node that would crash while he locked the table.

Another approach could be to make one machine the master for this table and replicate the changes to the other. Doing so replaces the locking mechanism, and increases the fault tolerance against every machine but the master. In terms of performance, you'll have the same drawbacks that the initial approach.

You can overcome these replication issues by adopting a partitioning strategy (every machine is responsible for some part of the data, to be defined if horizontal or vertical partitioning).

Another approach, even more scalable is to have asynchronous synchronisation: every database is independent, and from time to time they syncrhonize. This can work together very efficiently if used in combination with horizontal partitionning.

Christophe
  • 74,672
  • 10
  • 115
  • 187
  • There is no such thing as truly synchronous mirroring. Also, from the Two Generals Problem we know that it is not possible to make two machines always agree in case of possible network failures. – usr Mar 19 '16 at 20:04
  • @usr With ["synchronous"](http://stackoverflow.com/questions/748175/asynchronous-vs-synchronous-execution-what-does-it-really-mean) I mean "waiting that the whole operation finishes before doing something else" – Christophe Mar 19 '16 at 21:36
1

Fundamentally this not a problem that can be solved. You need to choose your poison.

If you require that the mashmaps are synced at all times they cannot be distributed and you have introduced a single point of failure.

If you can accept that they are sometimes a little off you can get very close to what you want. The good news is can know exactly how they diverge so you can respond accordingly.

This is called master-master replication and it is such a hard problem that most dbs don't support it very well, or at all. One db that does is CouchDB. It solves this hard problem with a simple model where all changes to a document(value in your hashmap) is versioned. If a document has been updated in both mirrors independently both versions are saved as conflicts(one is chosen as default) when replicating. The ability to ask for conflicts enables the client to fix any concurrency problems after the fact.

Esben Skov Pedersen
  • 5,098
  • 2
  • 21
  • 24