
8. The Trouble with Distributed Systems

Hey I just met you
The network’s laggy
But here’s my data
So store it maybe

— Kyle Kingsbury, Carly Rae Jepsen and the Perils of Network Partitions (2013)


A recurring theme in the last few chapters has been how systems handle things going wrong. For example, we discussed replica failover (“Handling Node Outages”), replication lag (“Problems with Replication Lag”), and concurrency control for transactions (“Weak Isolation Levels”). As we come to understand various edge cases that can occur in real systems, we get better at handling them.

However, even though we have talked a lot about faults, the last few chapters have still been too optimistic. The reality is even darker. We will now turn our pessimism to the maximum and assume that anything that can go wrong will go wrong. (Experienced systems operators will tell you that is a reasonable assumption. If you ask nicely, they might tell you some frightening stories while nursing their scars of past battles.)

Working with distributed systems is fundamentally different from writing software on a single computer—and the main difference is that there are lots of new and exciting ways for things to go wrong [1, 2]. In this chapter, we will get a taste of the problems that arise in practice, and an understanding of the things we can and cannot rely on.

In the end, our task as engineers is to build systems that do their job (i.e., meet the guarantees that users are expecting), in spite of everything going wrong. In Chapter 9, we will look at some examples of algorithms that can provide such guarantees in a distributed system. But first, in this chapter, we must understand what challenges we are up against.

This chapter is a thoroughly pessimistic and depressing overview of things that may go wrong in a distributed system. We will look into problems with networks (“Unreliable Networks”); clocks and timing issues (“Unreliable Clocks”); and we’ll discuss to what degree they are avoidable. The consequences of all these issues are disorienting, so we’ll explore how to think about the state of a distributed system and how to reason about things that have happened (“Knowledge, Truth, and Lies”).

……

Summary

In this chapter we have discussed a wide range of problems that can occur in distributed systems, including:

  • Whenever you try to send a packet over the network, it may be lost or arbitrarily delayed. Likewise, the reply may be lost or delayed, so if you don’t get a reply, you have no idea whether the message got through.

  • A node’s clock may be significantly out of sync with other nodes (despite your best efforts to set up NTP), it may suddenly jump forward or back in time, and relying on it is dangerous because you most likely don’t have a good measure of your clock’s error interval.

  • A process may pause for a substantial amount of time at any point in its execution (perhaps due to a stop-the-world garbage collector), be declared dead by other nodes, and then come back to life again without realizing that it was paused (see the sketch after this list).
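
That last failure mode is particularly insidious, because the paused code has no way of noticing that it was paused. Below is a minimal sketch of the hazard, in the spirit of the lease-based example discussed earlier in the chapter; the Lease and Request classes and the getIncomingRequest() and process() methods are hypothetical stand-ins, not a real API:

```java
// A minimal sketch, not a real API: Lease, Request, getIncomingRequest(), and
// process() are hypothetical stand-ins. The danger is the gap between checking
// the clock and acting on the result.
public class LeasePauseSketch {

    static class Request {}

    static class Lease {
        long expiryTimeMillis;
        Lease(long expiryTimeMillis) { this.expiryTimeMillis = expiryTimeMillis; }
        Lease renew() { return new Lease(System.currentTimeMillis() + 30_000); }
        boolean isValid() { return System.currentTimeMillis() < expiryTimeMillis; }
    }

    static Request getIncomingRequest() { return new Request(); } // placeholder
    static void process(Request request) { /* handle the request */ }

    public static void main(String[] args) {
        Lease lease = new Lease(System.currentTimeMillis() + 30_000);
        while (true) {
            Request request = getIncomingRequest();

            // Ensure that the lease always has at least 10 seconds remaining.
            if (lease.expiryTimeMillis - System.currentTimeMillis() < 10_000) {
                lease = lease.renew();
            }

            if (lease.isValid()) {
                // If the thread is paused right here (e.g. by a stop-the-world GC)
                // for longer than the remaining lease time, the lease may expire
                // and another node may take over before process() actually runs.
                process(request);
            }
        }
    }
}
```

Even a check like lease.isValid() is only true at the instant it executes; nothing prevents the thread from being suspended immediately afterward, for longer than the lease duration.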

The fact that such partial failures can occur is the defining characteristic of distributed systems. Whenever software tries to do anything involving other nodes, there is the possibility that it may occasionally fail, or randomly go slow, or not respond at all (and eventually time out). In distributed systems, we try to build tolerance of partial failures into software, so that the system as a whole may continue functioning even when some of its constituent parts are broken.
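
To make the caller’s dilemma concrete, consider the following minimal sketch (the sendRequestToNode() method and the node and request strings are invented for illustration): when the call times out, the sender cannot distinguish a lost request, a slow network, a slow or crashed node, or a lost reply.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch only: sendRequestToNode() stands in for a real network call.
public class TimeoutAmbiguitySketch {

    static String sendRequestToNode(String node, String request) throws Exception {
        TimeUnit.SECONDS.sleep(2); // simulate a slow network or a slow node
        return "ok";
    }

    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Callable<String> task = () -> sendRequestToNode("node-2", "apply update 42");
        Future<String> reply = executor.submit(task);
        try {
            System.out.println("reply: " + reply.get(1, TimeUnit.SECONDS));
        } catch (TimeoutException e) {
            // The request may have been lost, may still be queued, or may already
            // have been executed with only the reply going missing. The caller
            // cannot tell which, so retrying blindly risks applying it twice.
            System.out.println("no reply within timeout: outcome unknown");
        } finally {
            executor.shutdownNow();
        }
    }
}
```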

To tolerate faults, the first step is to detect them, but even that is hard. Most systems don’t have an accurate mechanism of detecting whether a node has failed, so most distributed algorithms rely on timeouts to determine whether a remote node is still available. However, timeouts can’t distinguish between network and node failures, and variable network delay sometimes causes a node to be falsely suspected of crashing. Moreover, sometimes a node can be in a degraded state: for example, a Gigabit network interface could suddenly drop to 1 Kb/s throughput due to a driver bug [94]. Such a node that is “limping” but not dead can be even more difficult to deal with than a cleanly failed node.
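
The basic timeout-based approach can be sketched in a few lines (a rough illustration only, assuming nodes exchange periodic heartbeat messages; the node identifiers and the fixed timeout are invented, and production systems often prefer adaptive mechanisms such as the ϕ accrual failure detector [28]):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal timeout-based failure detector sketch. "Suspected" is the strongest
// statement it can make: a suspected node may have crashed, or it may merely be
// slow, paused, or sitting behind a congested or broken network link.
public class TimeoutFailureDetector {

    private final long timeoutMillis;
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    public TimeoutFailureDetector(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called whenever a heartbeat message arrives from a node.
    public void onHeartbeat(String nodeId) {
        lastHeartbeat.put(nodeId, System.currentTimeMillis());
    }

    // True if we have not heard from the node within the timeout.
    public boolean isSuspected(String nodeId) {
        Long last = lastHeartbeat.get(nodeId);
        return last == null || System.currentTimeMillis() - last > timeoutMillis;
    }
}
```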

Once a fault is detected, making a system tolerate it is not easy either: there is no global variable, no shared memory, no common knowledge or any other kind of shared state between the machines. Nodes can’t even agree on what time it is, let alone on anything more profound. The only way information can flow from one node to another is by sending it over the unreliable network. Major decisions cannot be safely made by a single node, so we require protocols that enlist help from other nodes and try to get a quorum to agree.
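
The quorum idea rests on simple arithmetic: a decision counts only if a strict majority of nodes acknowledges it, so that any two quorums must overlap in at least one node. A minimal sketch of that rule (the node count and votes are invented for illustration):

```java
import java.util.List;

// Sketch of the majority rule behind quorum-based decisions. In a real protocol
// the acknowledgements would arrive over the (unreliable) network.
public class QuorumSketch {

    // Safe only if more than half of all nodes acknowledge: any two majorities
    // must share at least one node, which prevents two conflicting decisions.
    static boolean hasQuorum(int acknowledgements, int totalNodes) {
        return acknowledgements > totalNodes / 2;
    }

    public static void main(String[] args) {
        int totalNodes = 5;
        List<Boolean> acks = List.of(true, true, true, false, false); // 3 of 5 said yes
        long yes = acks.stream().filter(a -> a).count();
        // With 5 nodes, 3 acknowledgements form a majority; up to 2 nodes can be
        // down or unreachable and the remaining nodes can still make decisions.
        System.out.println("quorum reached: " + hasQuorum((int) yes, totalNodes));
    }
}
```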

If you’re used to writing software in the idealized mathematical perfection of a single computer, where the same operation always deterministically returns the same result, then moving to the messy physical reality of distributed systems can be a bit of a shock. Conversely, distributed systems engineers will often regard a problem as trivial if it can be solved on a single computer [5], and indeed a single computer can do a lot nowadays [95]. If you can avoid opening Pandora’s box and simply keep things on a single machine, it is generally worth doing so.

However, as discussed in the introduction to Part II, scalability is not the only reason for wanting to use a distributed system. Fault tolerance and low latency (by placing data geographically close to users) are equally important goals, and those things cannot be achieved with a single node.

In this chapter we also went on some tangents to explore whether the unreliability of networks, clocks, and processes is an inevitable law of nature. We saw that it isn’t: it is possible to give hard real-time response guarantees and bounded delays in networks, but doing so is very expensive and results in lower utilization of hardware resources. Most non-safety-critical systems choose cheap and unreliable over expensive and reliable.

We also touched on supercomputers, which assume reliable components and thus have to be stopped and restarted entirely when a component does fail. By contrast, distributed systems can run forever without being interrupted at the service level, because all faults and maintenance can be handled at the node level—at least in theory. (In practice, if a bad configuration change is rolled out to all nodes, that will still bring a distributed system to its knees.)

This chapter has been all about problems, and has given us a bleak outlook. In the next chapter we will move on to solutions, and discuss some algorithms that have been designed to cope with all the problems in distributed systems.

References

  1. Jay Kreps: “Getting Real About Distributed System Reliability,” blog.empathybox.com, March 19, 2012.

  2. Sydney Padua: The Thrilling Adventures of Lovelace and Babbage: The (Mostly) True Story of the First Computer. Particular Books, April 2015. ISBN: 978-0-141-98151-2

  3. Coda Hale: “You Can’t Sacrifice Partition Tolerance,” codahale.com, October 7, 2010.

  4. Jeff Hodges: “Notes on Distributed Systems for Young Bloods,” somethingsimilar.com, January 14, 2013.

  5. Antonio Regalado: “Who Coined 'Cloud Computing'?,” technologyreview.com, October 31, 2011.

  6. Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle: “The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Second Edition,” Synthesis Lectures on Computer Architecture, volume 8, number 3, Morgan & Claypool Publishers, July 2013. doi:10.2200/S00516ED2V01Y201306CAC024, ISBN: 978-1-627-05010-4

  7. David Fiala, Frank Mueller, Christian Engelmann, et al.: “Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing,” at International Conference for High Performance Computing, Networking, Storage and Analysis (SC12), November 2012.

  8. Arjun Singh, Joon Ong, Amit Agarwal, et al.: “Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network,” at Annual Conference of the ACM Special Interest Group on Data Communication (SIGCOMM), August 2015. doi:10.1145/2785956.2787508

  9. Glenn K. Lockwood: “Hadoop's Uncomfortable Fit in HPC,” glennklockwood.blogspot.co.uk, May 16, 2014.

  10. John von Neumann: “Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components,” in Automata Studies (AM-34), edited by Claude E. Shannon and John McCarthy, Princeton University Press, 1956. ISBN: 978-0-691-07916-5

  11. Richard W. Hamming: The Art of Doing Science and Engineering. Taylor & Francis, 1997. ISBN: 978-9-056-99500-3

  12. Claude E. Shannon: “A Mathematical Theory of Communication,” The Bell System Technical Journal, volume 27, number 3, pages 379–423 and 623–656, July 1948.

  13. Peter Bailis and Kyle Kingsbury: “The Network Is Reliable,” ACM Queue, volume 12, number 7, pages 48–55, July 2014. doi:10.1145/2639988.2639988

  14. Joshua B. Leners, Trinabh Gupta, Marcos K. Aguilera, and Michael Walfish: “Taming Uncertainty in Distributed Systems with Help from the Network,” at 10th European Conference on Computer Systems (EuroSys), April 2015. doi:10.1145/2741948.2741976

  15. Phillipa Gill, Navendu Jain, and Nachiappan Nagappan: “Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications,” at ACM SIGCOMM Conference, August 2011. doi:10.1145/2018436.2018477

  16. Mark Imbriaco: “Downtime Last Saturday,” github.com, December 26, 2012.

  17. Marc A. Donges: “Re: bnx2 cards Intermittantly Going Offline,” Message to Linux netdev mailing list, spinics.net, September 13, 2012.

  18. Kyle Kingsbury: “Call Me Maybe: Elasticsearch,” aphyr.com, June 15, 2014.

  19. Salvatore Sanfilippo: “A Few Arguments About Redis Sentinel Properties and Fail Scenarios,” antirez.com, October 21, 2014.

  20. Bert Hubert: “The Ultimate SO_LINGER Page, or: Why Is My TCP Not Reliable,” blog.netherlabs.nl, January 18, 2009.

  21. Nicolas Liochon: “CAP: If All You Have Is a Timeout, Everything Looks Like a Partition,” blog.thislongrun.com, May 25, 2015.

  22. Jerome H. Saltzer, David P. Reed, and David D. Clark: “End-To-End Arguments in System Design,” ACM Transactions on Computer Systems, volume 2, number 4, pages 277–288, November 1984. doi:10.1145/357401.357402

  23. Matthew P. Grosvenor, Malte Schwarzkopf, Ionel Gog, et al.: “Queues Don’t Matter When You Can JUMP Them!,” at 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI), May 2015.

  24. Guohui Wang and T. S. Eugene Ng: “The Impact of Virtualization on Network Performance of Amazon EC2 Data Center,” at 29th IEEE International Conference on Computer Communications (INFOCOM), March 2010. doi:10.1109/INFCOM.2010.5461931

  25. Van Jacobson: “Congestion Avoidance and Control,” at ACM Symposium on Communications Architectures and Protocols (SIGCOMM), August 1988. doi:10.1145/52324.52356

  26. Brandon Philips: “etcd: Distributed Locking and Service Discovery,” at Strange Loop, September 2014.

  27. Steve Newman: “A Systematic Look at EC2 I/O,” blog.scalyr.com, October 16, 2012.

  28. Naohiro Hayashibara, Xavier Défago, Rami Yared, and Takuya Katayama: “The ϕ Accrual Failure Detector,” Japan Advanced Institute of Science and Technology, School of Information Science, Technical Report IS-RR-2004-010, May 2004.

  29. Jeffrey Wang: “Phi Accrual Failure Detector,” ternarysearch.blogspot.co.uk, August 11, 2013.

  30. Srinivasan Keshav: An Engineering Approach to Computer Networking: ATM Networks, the Internet, and the Telephone Network. Addison-Wesley Professional, May 1997. ISBN: 978-0-201-63442-6

  31. Cisco: “Integrated Services Digital Network,” docwiki.cisco.com.

  32. Othmar Kyas: ATM Networks. International Thomson Publishing, 1995. ISBN: 978-1-850-32128-6

  33. “InfiniBand FAQ,” Mellanox Technologies, December 22, 2014.

  34. Jose Renato Santos, Yoshio Turner, and G. (John) Janakiraman: “End-to-End Congestion Control for InfiniBand,” at 22nd Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM), April 2003. Also published by HP Laboratories Palo Alto, Tech Report HPL-2002-359. doi:10.1109/INFCOM.2003.1208949

  35. Ulrich Windl, David Dalton, Marc Martinec, and Dale R. Worley: “The NTP FAQ and HOWTO,” ntp.org, November 2006.

  36. John Graham-Cumming: “How and why the leap second affected Cloudflare DNS,” blog.cloudflare.com, January 1, 2017.

  37. Steve Loughran: “Time on Multi-Core, Multi-Socket Servers,” steveloughran.blogspot.co.uk, September 17, 2015.

  38. James C. Corbett, Jeffrey Dean, Michael Epstein, et al.: “Spanner: Google’s Globally-Distributed Database,” at 10th USENIX Symposium on Operating System Design and Implementation (OSDI), October 2012.

  39. M. Caporaloni and R. Ambrosini: “How Closely Can a Personal Computer Clock Track the UTC Timescale Via the Internet?,” European Journal of Physics, volume 23, number 4, pages L17–L21, June 2012. doi:10.1088/0143-0807/23/4/103

  40. Nelson Minar: “A Survey of the NTP Network,” alumni.media.mit.edu, December 1999.

  41. Viliam Holub: “Synchronizing Clocks in a Cassandra Cluster Pt. 1 – The Problem,” blog.rapid7.com, March 14, 2014.

  42. Poul-Henning Kamp: “The One-Second War (What Time Will You Die?),” ACM Queue, volume 9, number 4, pages 44–48, April 2011. doi:10.1145/1966989.1967009

  43. Nelson Minar: “Leap Second Crashes Half the Internet,” somebits.com, July 3, 2012.

  44. Christopher Pascoe: “Time, Technology and Leaping Seconds,” googleblog.blogspot.co.uk, September 15, 2011.

  45. Mingxue Zhao and Jeff Barr: “Look Before You Leap – The Coming Leap Second and AWS,” aws.amazon.com, May 18, 2015.

  46. Darryl Veitch and Kanthaiah Vijayalayan: “Network Timing and the 2015 Leap Second,” at 17th International Conference on Passive and Active Measurement (PAM), April 2016. doi:10.1007/978-3-319-30505-9_29

  47. “Timekeeping in VMware Virtual Machines,” Information Guide, VMware, Inc., December 2011.

  48. “MiFID II / MiFIR: Regulatory Technical and Implementing Standards – Annex I (Draft),” European Securities and Markets Authority, Report ESMA/2015/1464, September 2015.

  49. Kyle Kingsbury: “Call Me Maybe: Cassandra,” aphyr.com, September 24, 2013.

  50. Kyle Kingsbury: “The Trouble with Timestamps,” aphyr.com, October 12, 2013.

  51. Leslie Lamport: “Time, Clocks, and the Ordering of Events in a Distributed System,” Communications of the ACM, volume 21, number 7, pages 558–565, July 1978. doi:10.1145/359545.359563

  52. Sandeep Kulkarni, Murat Demirbas, Deepak Madeppa, et al.: “Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases,” State University of New York at Buffalo, Computer Science and Engineering Technical Report 2014-04, May 2014.

  53. Justin Sheehy: “There Is No Now: Problems With Simultaneity in Distributed Systems,” ACM Queue, volume 13, number 3, pages 36–41, March 2015. doi:10.1145/2733108

  54. Murat Demirbas: “Spanner: Google's Globally-Distributed Database,” muratbuffalo.blogspot.co.uk, July 4, 2013.

  55. Dahlia Malkhi and Jean-Philippe Martin: “Spanner's Concurrency Control,” ACM SIGACT News, volume 44, number 3, pages 73–77, September 2013. doi:10.1145/2527748.2527767

  56. Manuel Bravo, Nuno Diegues, Jingna Zeng, et al.: “On the Use of Clocks to Enforce Consistency in the Cloud,” IEEE Data Engineering Bulletin, volume 38, number 1, pages 18–31, March 2015.

  57. Spencer Kimball: “Living Without Atomic Clocks,” cockroachlabs.com, February 17, 2016.

  58. Cary G. Gray and David R. Cheriton: “Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency,” at 12th ACM Symposium on Operating Systems Principles (SOSP), December 1989. doi:10.1145/74850.74870

  59. Martin Thompson: “Java Garbage Collection Distilled,” mechanical-sympathy.blogspot.co.uk, July 16, 2013.

  60. Christopher Clark, Keir Fraser, Steven Hand, et al.: “Live Migration of Virtual Machines,” at 2nd USENIX Symposium on Networked Systems Design & Implementation (NSDI), May 2005.

  61. Mike Shaver: “fsyncers and Curveballs,” shaver.off.net, May 25, 2008.

  62. Zhenyun Zhuang and Cuong Tran: “Eliminating Large JVM GC Pauses Caused by Background IO Traffic,” engineering.linkedin.com, February 10, 2016.

  63. David Terei and Amit Levy: “Blade: A Data Center Garbage Collector,” arXiv:1504.02578, April 13, 2015.

  64. Martin Maas, Tim Harris, Krste Asanović, and John Kubiatowicz: “Trash Day: Coordinating Garbage Collection in Distributed Systems,” at 15th USENIX Workshop on Hot Topics in Operating Systems (HotOS), May 2015.

  65. “Predictable Low Latency,” Cinnober Financial Technology AB, cinnober.com, November 24, 2013.

  66. Martin Fowler: “The LMAX Architecture,” martinfowler.com, July 12, 2011.

  67. Flavio P. Junqueira and Benjamin Reed: ZooKeeper: Distributed Process Coordination. O'Reilly Media, 2013. ISBN: 978-1-449-36130-3

  68. Enis Söztutar: “HBase and HDFS: Understanding Filesystem Usage in HBase,” at HBaseCon, June 2013.

  69. Leslie Lamport, Robert Shostak, and Marshall Pease: “The Byzantine Generals Problem,” ACM Transactions on Programming Languages and Systems (TOPLAS), volume 4, number 3, pages 382–401, July 1982. doi:10.1145/357172.357176

  70. Jim N. Gray: “Notes on Data Base Operating Systems,” in Operating Systems: An Advanced Course, Lecture Notes in Computer Science, volume 60, edited by R. Bayer, R. M. Graham, and G. Seegmüller, pages 393–481, Springer-Verlag, 1978. ISBN: 978-3-540-08755-7

  71. Brian Palmer: “How Complicated Was the Byzantine Empire?,” slate.com, October 20, 2011.

  72. Leslie Lamport: “My Writings,” lamport.azurewebsites.net, December 16, 2014. This page can be found by searching the web for the 23-character string obtained by removing the hyphens from the string allla-mport-spubso-ntheweb.

  73. John Rushby: “Bus Architectures for Safety-Critical Embedded Systems,” at 1st International Workshop on Embedded Software (EMSOFT), October 2001.

  74. Jake Edge: “ELC: SpaceX Lessons Learned,” lwn.net, March 6, 2013.

  75. Andrew Miller and Joseph J. LaViola, Jr.: “Anonymous Byzantine Consensus from Moderately-Hard Puzzles: A Model for Bitcoin,” University of Central Florida, Technical Report CS-TR-14-01, April 2014.

  76. James Mickens: “The Saddest Moment,” USENIX ;login: logout, May 2013.

  77. Evan Gilman: “The Discovery of Apache ZooKeeper’s Poison Packet,” pagerduty.com, May 7, 2015.

  78. Jonathan Stone and Craig Partridge: “When the CRC and TCP Checksum Disagree,” at ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM), August 2000. doi:10.1145/347059.347561

  79. Evan Jones: “How Both TCP and Ethernet Checksums Fail,” evanjones.ca, October 5, 2015.

  80. Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer: “Consensus in the Presence of Partial Synchrony,” Journal of the ACM, volume 35, number 2, pages 288–323, April 1988. doi:10.1145/42282.42283

  81. Peter Bailis and Ali Ghodsi: “Eventual Consistency Today: Limitations, Extensions, and Beyond,” ACM Queue, volume 11, number 3, pages 55–63, March 2013. doi:10.1145/2460276.2462076

  82. Bowen Alpern and Fred B. Schneider: “Defining Liveness,” Information Processing Letters, volume 21, number 4, pages 181–185, October 1985. doi:10.1016/0020-0190(85)90056-0

  83. Flavio P. Junqueira: “Dude, Where’s My Metadata?,” fpj.me, May 28, 2015.

  84. Scott Sanders: “January 28th Incident Report,” github.com, February 3, 2016.

  85. Jay Kreps: “A Few Notes on Kafka and Jepsen,” blog.empathybox.com, September 25, 2013.

  86. Thanh Do, Mingzhe Hao, Tanakorn Leesatapornwongsa, et al.: “Limplock: Understanding the Impact of Limpware on Scale-out Cloud Systems,” at 4th ACM Symposium on Cloud Computing (SoCC), October 2013. doi:10.1145/2523616.2523627

  87. Frank McSherry, Michael Isard, and Derek G. Murray: “Scalability! But at What COST?,” at 15th USENIX Workshop on Hot Topics in Operating Systems (HotOS), May 2015.
