
So master-slave replication is great, but it leaves the master as a single point of failure. What is the strategy to solve this problem, if there is one?

I'm looking for the computer science theory behind it, not for a product that magically supports it.

OK, the slave could become the new master, but what's the theory behind this fail-over transition? How is it done?

Peter Mel
  • Depends on the DB. But you can have fail-over where one of the slaves takes over as master. – Ewan Mar 20 '16 at 19:24
  • Fail-over. That's the whole point of master-slave database replication. – Robert Harvey Mar 20 '16 at 19:34
  • @RobertHarvey It does not seem easy or trivial to make a slave become the new master. What's the theory behind it? – Peter Mel Mar 20 '16 at 19:37
  • A slave is promoted to master, the other slaves are reconfigured to use the new master, and the applications using the database server are informed of the new address to use when connecting. – Robert Harvey Mar 20 '16 at 19:50
  • Note, fail-over is only correct *if* you have a perfect failure detector (which you don't). If you fail over because you *think* the master has failed, but it hasn't, you now have two masters. Solving this problem requires consensus (e.g. via Paxos, Raft, Zab). – Derek Elkins left SE Mar 20 '16 at 19:50

1 Answer


As soon as a master fails, the remaining slaves can elect a new master.

For instance, the MongoDB documentation explains how a new master is elected. Depending on the specific requirements, additional complexity can be added, such as non-voting members.
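To make the idea of an election concrete, here is a minimal sketch in Python. It is not MongoDB's actual protocol: the `Replica` class, its `last_applied` field and the `elect_master` function are hypothetical names invented for illustration. The sketch assumes that every reachable replica votes for the most up-to-date reachable replica, and that an election only succeeds when a strict majority of the configured members can take part.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    name: str
    last_applied: int       # position in the replication log (hypothetical field)
    reachable: bool = True


def elect_master(replicas: List[Replica]) -> Optional[Replica]:
    """Return the new master, or None if no majority can be formed.

    `replicas` is the full configured set, including the failed master.
    """
    voters = [r for r in replicas if r.reachable]
    # A strict majority of the *configured* set is required, so two
    # disjoint groups of nodes can never both elect a master.
    if len(voters) * 2 <= len(replicas):
        return None
    # Every voter backs the most up-to-date reachable replica; ties are
    # broken by name so that all voters reach the same answer.
    return max(voters, key=lambda r: (r.last_applied, r.name))


cluster = [
    Replica("a", last_applied=120, reachable=False),  # the failed master
    Replica("b", last_applied=118),
    Replica("c", last_applied=120),
]
winner = elect_master(cluster)
print(winner.name if winner else "no quorum")  # prints "c"
```

The majority requirement is what addresses the "two masters" concern raised in the comments: a minority that merely believes the master is dead cannot promote anyone. Real systems reach that agreement with a consensus protocol rather than a single function call.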

If the original master is restored, there are two alternatives:

  • Either it becomes the master again, and the current master goes back to being a slave,

  • Or it rejoins as a slave and can become master again only through a new election, once the current master fails.
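The second alternative can be sketched with a monotonically increasing election number, an idea borrowed from consensus protocols such as Raft, where it is called a term. The `Node` class and the `rejoin` function below are hypothetical and only illustrate the comparison; a real system would also have to discard or reconcile any writes the old master accepted but never replicated.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    role: str   # "master" or "slave"
    term: int   # number of the last election this node took part in


def rejoin(restored: Node, current_master: Node) -> None:
    """Reconcile a restored node with the cluster's current master."""
    if restored.term < current_master.term:
        # The restored node missed at least one election: its claim to
        # be master is stale, so it steps down and follows the new master.
        restored.role = "slave"
        restored.term = current_master.term


old = Node("a", role="master", term=3)   # crashed while master in term 3
new = Node("c", role="master", term=4)   # elected in term 4 after the crash
rejoin(old, new)
print(old.role, old.term)  # prints "slave 4"
```

Because the comparison only looks at the term, the restored machine does not need to know why it was replaced; seeing a higher term is enough to make it step down.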

For non-critical systems, there is another possibility. If the master fails, none of the slaves replaces it; they continue to serve read-only requests while waiting for the master to be restored. This works in contexts where downtime of the master is acceptable. For instance, if an app stores your favorite movies, the write path can occasionally go down for a few minutes, say once per month: as long as you can still access your data in read-only mode, being temporarily unable to change it is acceptable.
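A minimal sketch of this degraded read-only mode follows. The `Store` class, its `master_up` flag and the `ReadOnlyError` exception are hypothetical names, and replication is done synchronously inside `write` only to keep the example short.

```python
class ReadOnlyError(Exception):
    """Raised when a write arrives while the master is down."""


class Store:
    def __init__(self):
        self.master_up = True
        self.master = {}
        self.slave = {}

    def write(self, key, value):
        if not self.master_up:
            # No fail-over: writes are simply refused until the master returns.
            raise ReadOnlyError("master is down; writes are disabled")
        self.master[key] = value
        self.slave[key] = value   # replication, simplified to a direct copy

    def read(self, key):
        # Reads are served from the slave, so they survive a master outage.
        return self.slave.get(key)


store = Store()
store.write("favorite", "Alien")
store.master_up = False           # simulate the master failing
print(store.read("favorite"))     # prints "Alien"
```

While `master_up` is false, calling `store.write(...)` raises `ReadOnlyError`, which the application can turn into a "try again later" message instead of an outage.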

Arseni Mourzenko