This past week, I was privileged to attend Percona Live 2019, an open-source database conference held this year in Austin, Texas. With the breadth of the sessions I attended being captured on a community.mattermost.com post, I wanted to use this blog post to dive into a recurring topic from the conference: database replication.
Replication is a means by which data on one database server is automatically copied – or replicated – to other database servers called replicas. Having extra copies of a database reduces the chance of data-loss in the event of a server failure. Plus, the copies can be used as databases themselves, spreading out load to improve application performance.
Most types of replication fall into one of the following categories:
Asynchronous replication allows copied data to be out-of-date relative to the master copy. Consider a typical database interaction on a single server:
The client sends a COMMIT
to confirm the transaction, but must wait for the database to flush that transaction to disk to be confident it was persisted. Otherwise, a sudden power loss after the COMMIT
but before the fsync()
would result in data-loss.
When a replica is configured, the server receiving the write is also trying to make a copy of the data. Asynchronous replication releases the client before the copy has been confirmed on the replica:
A power loss before the copy is made would result in the replica being out-of-date, but it can easily catch up from the master copy on restart. Even without a power loss, there is a period of time in which the replica will be out-of-date before it receives the copy. This period of time (sometimes measured as a number of bytes) is called the replication lag.
When an application directly queries an asynchronous replica, it must be prepared to deal with some replication lag. For example, such an application should never INSERT
a row on one server and then immediately query it from another: the data won’t be there yet! Mattermost v5.12 includes a fix for just such an issue. A healthy database keeps replication lag to a minimum. If a replica fails to keep up, it might be necessary for the application to stop querying it to avoid unexpected results.
Sometimes it can be useful to introduce an artificial replication lag. A delayed replica intentionally stays out-of-date with the master copy to simplify disaster recovery scenarios. If an errant application decides to TRUNCATE
a table, a normal replica will happily truncate its own tables to stay up-to-date. By contrast, a delayed replica avoids applying changes right away, maintaining a view into the older data. While this is no replacement for a proper backup, a delayed replica can be useful to recover a dropped table much more quickly than restoring from a full backup.
The various replication options involve a tradeoff between durability – the guarantee that the data will not be lost – and throughput – the speed at which data can be written. While asynchronous replication prioritizes performance at the expense of durability, synchronous replication prioritizes durability. The client committing the transaction must wait for the copy to persist to the replica:
Synchronous replication is the only approach that guarantees no data loss if the master copy fails. The impact to performance increases for each synchronous replica, but an application may combine these replication strategies. One replica would be synchronous to increase durability, and the others would be asynchronous to improve performance:
Semi-synchronous replication attempts to straddle the tradeoff between durability and performance. The client now waits a bit longer for the replica to confirm receipt of the copy, but not for it to write that copy to disk:
This approach exploits the fact that two servers failing at the same time is relatively rare, and the in-memory copy of the data on the replica is extremely likely to get written to disk. Semi-synchronous replication is supported by MySQL, but not yet for Postgres.
Group replication is a variant of synchronous replication with increased flexibility. Instead of configuring a fixed number of replicas, all servers are distributed peers and technically capable of accepting writes. A group consensus model such as Paxos or Raft is used to ensure a majority of nodes have persisted the write before the client is released.
The primary benefit of this approach is built-in, automatic failover. When a server goes down – even the ones receiving writes – the cluster automatically elects a new primary node from the remaining servers and resumes accepting writes. When a replacement server becomes available, it can rejoin the cluster automatically. Ultimately, this self-healing minimizes disruption to the application and avoids the delays of having to get a human involved.
But this increased flexibility comes at the cost of increased complexity. For example, there must be at least three nodes in a group replication cluster, otherwise a majority consensus cannot be well-defined. Bugs in the implementation are more complicated, and there is the rare possibility of split-brain syndrome.
Josh Berkus gave a session at the conference in which he described how Patroni for Postgres avoids split-brain syndrome in a Kubernetes deployment. Normally, operators rely on services to define endpoints, but their operator creates its own endpoint that simultaneously serves as the definition of what server is primary in the cluster and as the means of directing traffic to it. By effectively using this Kubernetes primitive as a locking mechanism, it is impossible for the consensus reached by the cluster to diverge from where a client actually sends writes.
I leave the specifics to the linked documentation, but here’s some MySQL and Postgres resources to consider:
In addition to the stock options available, there is always the opportunity to tune the replication algorithm for a specific workload. Anirban Rahut from Facebook gave a session in which he spoke about their modifications to the (non-group) replication algorithm in MySQL. While the master copy can accept writes in parallel, MySQL replicas currently apply all changes from the master copy serially. If the master copy receives a high volume of writes, the replication lag can grow uncontrollably, with the replicas unable to catch up. Anirban and his team modified the MySQL source to parse and interpret the replication stream, build a DAG modelling the write dependencies, and apply subsets of the binary logs in parallel, all while preserving commit order. For typical workloads, these replicas improved throughput by 4-5x!
All versions of Mattermost can rely on replication to provide redundancy. Mattermost E20 customers can leverage the High Availability Cluster feature set to improve performance by directing read-only traffic to the replicas. Understanding the available replication options is crucial to configuring the database used by Mattermost for the desired durability and performance.