1. Reliable, Scalable, and Maintainable Applications

The Internet was done so well that most people think of it as a natural resource like the Pacific Ocean, rather than something that was man-made. When was the last time a technology with a scale like that was so error-free?
Alan Kay, in interview with Dr Dobb's Journal (2012)
Many applications today are data-intensive, as opposed to compute-intensive. Raw CPU power is rarely a limiting factor for these applications; bigger problems are usually the amount of data, the complexity of data, and the speed at which it is changing.
A data-intensive application is typically built from standard building blocks that provide commonly needed functionality. For example, many applications need to:
- Store data so that they, or another application, can find it again later (databases)
- Remember the result of an expensive operation, to speed up reads (caches)
- Allow users to search data by keyword or filter it in various ways (search indexes)
- Send a message to another process, to be handled asynchronously (stream processing)
- Periodically crunch a large amount of accumulated data (batch processing)
If that sounds painfully obvious, that's just because these data systems are such a successful abstraction: we use them all the time without thinking too much. When building an application, most engineers wouldn't dream of writing a new data storage engine from scratch, because databases are a perfectly good tool for the job.
But reality is not that simple. There are many database systems with different characteristics, because different applications have different requirements. There are various approaches to caching, several ways of building search indexes, and so on. When building an application, we still need to figure out which tools and which approaches are the most appropriate for the task at hand. And it can be hard to combine tools when you need to do something that a single tool cannot do alone.
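For example, an application might keep a cache in front of its database and coordinate the two in its own code. The sketch below is illustrative only: plain dicts stand in for a real cache (such as memcached or Redis) and for the database of record, and the cache-aside pattern it shows is just one of several ways to combine the two.

```python
# A minimal cache-aside sketch. The dicts below stand in for a real
# cache and a real database; the key format "user:42" is illustrative.
cache = {}
database = {"user:42": {"name": "Ada"}}

def read_user(key):
    # 1. Try the cache first: fast path for repeated reads.
    if key in cache:
        return cache[key]
    # 2. On a miss, fall back to the database of record...
    value = database.get(key)
    # 3. ...and populate the cache so the next read is fast.
    if value is not None:
        cache[key] = value
    return value

def update_user(key, value):
    # Writes go to the database; the cached copy is invalidated
    # so that subsequent readers do not see stale data.
    database[key] = value
    cache.pop(key, None)
```

Even in this toy version, the application code has to keep the two systems consistent with each other; getting such coordination right is exactly the kind of difficulty the rest of this book explores.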
This book is a journey through both the principles and the practicalities of data systems, and how you can use them to build data-intensive applications. We will explore what different tools have in common, what distinguishes them, and how they achieve their characteristics.
In this chapter, we will start by exploring the fundamentals of what we are trying to achieve: reliable, scalable, and maintainable data systems. We'll clarify what those things mean, outline some ways of thinking about them, and go over the basics that we will need for later chapters. In the following chapters we will continue layer by layer, looking at different design decisions that need to be considered when working on a data-intensive application.
…
Summary
In this chapter, we have explored some fundamental ways of thinking about data-intensive applications. These principles will guide us through the rest of the book, where we dive into deep technical detail.
An application has to meet various requirements in order to be useful. There are functional requirements (what it should do, such as allowing data to be stored, retrieved, searched, and processed in various ways), and some nonfunctional requirements (general properties like security, reliability, compliance, scalability, compatibility, and maintainability). In this chapter we discussed reliability, scalability, and maintainability in detail.
Reliability means making systems work correctly, even when faults occur. Faults can be in hardware (typically random and uncorrelated), software (bugs are typically systematic and hard to deal with), and humans (who inevitably make mistakes from time to time). Fault-tolerance techniques can hide certain types of faults from the end user.
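A simple example of such a technique is retrying an operation that fails transiently, so that the end user never notices the fault. The sketch below is a minimal illustration rather than a production recipe; `flaky_operation` is a hypothetical stand-in for any call (say, a network request) that fails intermittently.

```python
import time

def call_with_retries(operation, attempts=3, base_delay=0.01):
    """Retry an operation that may fail transiently, with exponential
    backoff. Re-raises the last error if every attempt fails."""
    for attempt in range(attempts):
        try:
            return operation()
        except OSError:
            if attempt == attempts - 1:
                raise  # the fault could not be hidden; surface it
            # Wait a little longer before each successive retry.
            time.sleep(base_delay * 2 ** attempt)

# Hypothetical stand-in: fails twice, then succeeds.
calls = {"count": 0}
def flaky_operation():
    calls["count"] += 1
    if calls["count"] < 3:
        raise OSError("transient network error")
    return "ok"
```

Here `call_with_retries(flaky_operation)` returns "ok" after two hidden failures. Note that retrying only hides transient faults; it does nothing for systematic software bugs of the kind discussed above.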
Scalability means having strategies for keeping performance good, even when load increases. In order to discuss scalability, we first need ways of describing load and performance quantitatively. We briefly looked at Twitter's home timelines as an example of describing load, and response time percentiles as a way of measuring performance. In a scalable system, you can add processing capacity in order to remain reliable under high load.
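For instance, response time percentiles can be computed from a sample of measured request times. The function below is an illustrative sketch using the nearest-rank method on a small hand-picked sample; real monitoring systems typically use streaming approximations rather than sorting every measurement.

```python
import math

def percentile(response_times_ms, p):
    """Return the p-th percentile (0 < p <= 100) of a sample,
    using the nearest-rank method."""
    ordered = sorted(response_times_ms)
    # Nearest-rank: the 1-based rank is the ceiling of p% of n.
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

# Ten response times in milliseconds; one slow outlier dominates the tail.
times = [30, 32, 35, 40, 42, 45, 50, 55, 120, 980]
print("p50:", percentile(times, 50))  # 42 ms: the typical request
print("p95:", percentile(times, 95))  # 980 ms: what the slowest users see
```

The gap between the median and the high percentiles is exactly why tail latencies, not averages, are the interesting measure of a service's performance.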
Maintainability has many facets, but in essence it's about making life better for the engineering and operations teams who need to work with the system. Good abstractions can help reduce complexity and make the system easier to modify and adapt for new use cases. Good operability means having good visibility into the system's health, and having effective ways of managing it.
There is unfortunately no easy fix for making applications reliable, scalable, or maintainable. However, there are certain patterns and techniques that keep reappearing in different kinds of applications. In the next few chapters we will take a look at some examples of data systems and analyze how they work toward those goals.
Later in the book, in Part III, we will look at patterns for systems that consist of several components working together, such as the one in Figure 1-1.