
5. Replication

The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at or repair.

— Douglas Adams, Mostly Harmless (1992)


In Part I of this book, we discussed aspects of data systems that apply when data is stored on a single machine. Now, in Part II, we move up a level and ask: what happens if multiple machines are involved in storage and retrieval of data?

There are various reasons why you might want to distribute a database across multiple machines:

Scalability

If your data volume, read load, or write load grows bigger than a single machine can handle, you can potentially spread the load across multiple machines.

Fault tolerance/high availability

If your application needs to continue working even if one machine (or several machines, or the network, or an entire datacenter) goes down, you can use multiple machines to give you redundancy. When one fails, another one can take over.

Latency

If you have users around the world, you might want to have servers at various locations worldwide so that each user can be served from a datacenter that is geographically close to them. That avoids the users having to wait for network packets to travel halfway around the world.

……

Summary

In this chapter we looked at the issue of replication. Replication can serve several purposes:

High availability

Keeping the system running, even when one machine (or several machines, or an entire datacenter) goes down

Disconnected operation

Allowing an application to continue working when there is a network interruption

Latency

Placing data geographically close to users, so that users can interact with it faster

Scalability

Being able to handle a higher volume of reads than a single machine could handle, by performing reads on replicas

Despite being a simple goal—keeping a copy of the same data on several machines—replication turns out to be a remarkably tricky problem. It requires carefully thinking about concurrency and about all the things that can go wrong, and dealing with the consequences of those faults. At a minimum, we need to deal with unavailable nodes and network interruptions (and that’s not even considering the more insidious kinds of fault, such as silent data corruption due to software bugs).

We discussed three main approaches to replication:

Single-leader replication

Clients send all writes to a single node (the leader), which sends a stream of data change events to the other replicas (followers). Reads can be performed on any replica, but reads from followers might be stale.

Multi-leader replication

Clients send each write to one of several leader nodes, any of which can accept writes. The leaders send streams of data change events to each other and to any follower nodes.

Leaderless replication

Clients send each write to several nodes, and read from several nodes in parallel in order to detect and correct nodes with stale data (a sketch of this scheme follows the list).
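
To make the leaderless approach concrete, here is a minimal sketch of quorum writes and reads with read repair, in Python. Everything in it (Replica, quorum_write, quorum_read, plain integer versions) is invented for illustration and is not the API of any real database; actual systems track causality with version vectors or timestamps and perform these calls over an unreliable network. The usual quorum condition is w + r > n: if writes are confirmed by w out of n replicas and reads consult r replicas, the read set overlaps the write set in at least one replica.

    # A toy in-memory replica. A real replica would be a separate process
    # reached over the network, and any call to it could fail or time out.
    class Replica:
        def __init__(self):
            self.store = {}  # key -> (version, value)

        def put(self, key, version, value):
            current = self.store.get(key)
            if current is None or version > current[0]:
                self.store[key] = (version, value)

        def get(self, key):
            return self.store.get(key)  # (version, value) or None

    def quorum_write(replicas, w, key, version, value):
        # Send the write to every replica; succeed once w acknowledge.
        acks = 0
        for replica in replicas:
            replica.put(key, version, value)  # may fail in a real system
            acks += 1
        if acks < w:
            raise IOError("write quorum not reached")

    def quorum_read(replicas, r, key):
        # Read from r replicas and keep the newest version seen.
        responses = [(replica, replica.get(key)) for replica in replicas[:r]]
        found = [resp for _, resp in responses if resp is not None]
        if not found:
            return None
        newest = max(found, key=lambda resp: resp[0])
        # Read repair: push the newest version back to stale replicas.
        for replica, resp in responses:
            if resp is None or resp[0] < newest[0]:
                replica.put(key, newest[0], newest[1])
        return newest

The read-repair loop at the end is what "detect and correct nodes with stale data" means in practice; many systems also run a background anti-entropy process to repair replicas that are rarely read.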

Each approach has advantages and disadvantages. Single-leader replication is popular because it is fairly easy to understand and there is no conflict resolution to worry about. Multi-leader and leaderless replication can be more robust in the presence of faulty nodes, network interruptions, and latency spikes—at the cost of being harder to reason about and providing only very weak consistency guarantees.

Replication can be synchronous or asynchronous, which has a profound effect on the system behavior when there is a fault. Although asynchronous replication can be fast when the system is running smoothly, it’s important to figure out what happens when replication lag increases and servers fail. If a leader fails and you promote an asynchronously updated follower to be the new leader, recently committed data may be lost.
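
The trade-off is easy to see in a toy single-leader sketch. The Leader and Follower classes and the background-thread "network" below are stand-ins invented for illustration; a real system ships a replication log over the network and must handle failures and reconnection.

    import threading

    class Follower:
        def __init__(self):
            self.data = {}

        def apply(self, key, value):
            self.data[key] = value

    class Leader:
        def __init__(self, followers, synchronous):
            self.data = {}
            self.followers = followers
            self.synchronous = synchronous

        def write(self, key, value):
            self.data[key] = value
            if self.synchronous:
                # Don't acknowledge until every follower has the write: no
                # acknowledged write can be lost on failover, but one slow
                # or unreachable follower stalls all writes.
                for follower in self.followers:
                    follower.apply(key, value)
            else:
                # Acknowledge immediately and replicate in the background:
                # fast, but failing over to a lagging follower silently
                # discards writes that were already acknowledged.
                threading.Thread(
                    target=self._replicate, args=(key, value)).start()

        def _replicate(self, key, value):
            for follower in self.followers:
                follower.apply(key, value)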

We looked at some strange effects that can be caused by replication lag, and we discussed a few consistency models which are helpful for deciding how an application should behave under replication lag (a sketch of one approach follows the list):

Read-after-write consistency

Users should always see data that they submitted themselves.

Monotonic reads

After users have seen the data at one point in time, they shouldn’t later see the data from some earlier point in time.

Consistent prefix reads

Users should see the data in a state that makes causal sense: for example, seeing a question and its reply in the correct order.
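
For example, one common way to provide read-after-write consistency under single-leader replication is to remember when each user last wrote, and to route that user's own reads to the leader until the followers have had time to catch up. The sketch below assumes a one-minute upper bound on replication lag; the bound, and every name in the sketch, are illustrative choices rather than a prescribed mechanism.

    import time

    LAG_BOUND_SECONDS = 60.0  # assumed upper bound on replication lag
    last_write_at = {}        # user_id -> time of that user's last write

    def record_write(user_id):
        last_write_at[user_id] = time.monotonic()

    def choose_replica(user_id, leader, followers):
        # A recent writer may be ahead of the followers, so serve their
        # reads from the leader; everyone else can read from a follower.
        t = last_write_at.get(user_id)
        if t is not None and time.monotonic() - t < LAG_BOUND_SECONDS:
            return leader
        return followers[0]  # in practice, chosen by load or locality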

Finally, we discussed the concurrency issues that are inherent in multi-leader and leaderless replication approaches: because they allow multiple writes to happen concurrently, conflicts may occur. We examined an algorithm that a database might use to determine whether one operation happened before another, or whether they happened concurrently. We also touched on methods for resolving conflicts by merging together concurrent updates.
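
To illustrate, the sketch below compares two version vectors, each represented as a dictionary mapping a replica ID to a counter. This dict-of-counters formulation is a common way of presenting the idea, not the storage format of any particular database. One version dominates another if all of its counters are greater than or equal; if neither dominates, the writes were concurrent and their values must be merged.

    def compare(a, b):
        """Decide how two version vectors (replica id -> counter) relate.

        Returns "a<b" if a happened before b, "a>b" if b happened before
        a, "equal" if they are identical, and "concurrent" otherwise.
        """
        keys = set(a) | set(b)
        a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
        b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
        if a_le_b and b_le_a:
            return "equal"
        if a_le_b:
            return "a<b"
        if b_le_a:
            return "a>b"
        return "concurrent"   # conflict: the values must be merged

    # One write causally follows the other:
    assert compare({"r1": 2, "r2": 1}, {"r1": 2, "r2": 3}) == "a<b"
    # Independent updates on different replicas are concurrent:
    assert compare({"r1": 3}, {"r2": 1}) == "concurrent"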

In the next chapter we will continue looking at data that is distributed across multiple machines, through the counterpart of replication: splitting a large dataset into partitions.
