
12. The Future of Data Systems

If a thing be ordained to another as to its end, its last end cannot consist in the preservation of its being. Hence a captain does not intend as a last end, the preservation of the ship entrusted to him, since a ship is ordained to something else as its end, viz. to navigation.

(Often quoted as: If the highest aim of a captain was to preserve his ship, he would keep it in port forever.)

​ — St. Thomas Aquinas, Summa Theologica (1265–1274)


So far, this book has been mostly about describing things as they are at present. In this final chapter, we will shift our perspective toward the future and discuss how things should be: I will propose some ideas and approaches that, I believe, may fundamentally improve the ways we design and build applications.

Opinions and speculation about the future are of course subjective, and so I will use the first person in this chapter when writing about my personal opinions. You are welcome to disagree with them and form your own opinions, but I hope that the ideas in this chapter will at least be a starting point for a productive discussion and bring some clarity to concepts that are often confused.

The goal of this book was outlined in Chapter 1: to explore how to create applications and systems that are reliable, scalable, and maintainable. These themes have run through all of the chapters: for example, we discussed many fault-tolerance algorithms that help improve reliability, partitioning to improve scalability, and mechanisms for evolution and abstraction that improve maintainability. In this chapter we will bring all of these ideas together, and build on them to envisage the future. Our goal is to discover how to design applications that are better than the ones of today: robust, correct, evolvable, and ultimately beneficial to humanity.

……

Summary

In this chapter we discussed new approaches to designing data systems, and I included my personal opinions and speculations about the future. We started with the observation that there is no one single tool that can efficiently serve all possible use cases, and so applications necessarily need to compose several different pieces of software to accomplish their goals. We discussed how to solve this data integration problem by using batch processing and event streams to let data changes flow between different systems.

In this approach, certain systems are designated as systems of record, and other data is derived from them through transformations. In this way we can maintain indexes, materialized views, machine learning models, statistical summaries, and more. By making these derivations and transformations asynchronous and loosely coupled, a problem in one area is prevented from spreading to unrelated parts of the system, increasing the robustness and fault-tolerance of the system as a whole.

Expressing dataflows as transformations from one dataset to another also helps evolve applications: if you want to change one of the processing steps, for example to change the structure of an index or cache, you can just rerun the new transformation code on the whole input dataset in order to rederive the output. Similarly, if something goes wrong, you can fix the code and reprocess the data in order to recover.
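As a minimal sketch of this idea, the following Python fragment treats an in-memory list as the system of record and rebuilds a derived view by rerunning a transformation over the whole input. The event shape and the names `EVENT_LOG`, `derive`, `views_per_page`, and `views_per_user` are hypothetical, chosen only for illustration.

```python
# Hypothetical system of record: an append-only list of immutable events.
EVENT_LOG = [
    {"user": "alice", "page": "/home"},
    {"user": "bob", "page": "/home"},
    {"user": "alice", "page": "/pricing"},
]


def derive(events, transform):
    """Rebuild a derived dataset from scratch by applying `transform`
    to every event in the log, in order."""
    state = {}
    for event in events:
        transform(state, event)
    return state


def views_per_page(state, event):
    # First version of the derived view: page -> number of views.
    state[event["page"]] = state.get(event["page"], 0) + 1


def views_per_user(state, event):
    # A changed processing step: user -> number of views.
    state[event["user"]] = state.get(event["user"], 0) + 1


# The same input can be reprocessed with either transformation.
index_v1 = derive(EVENT_LOG, views_per_page)
index_v2 = derive(EVENT_LOG, views_per_user)
```

Because the log is never modified, switching from `views_per_page` to `views_per_user` (or fixing a bug in either) only requires running `derive` again; the old and new outputs can even exist side by side during a migration.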

These processes are quite similar to what databases already do internally, so we recast the idea of dataflow applications as unbundling the components of a database, and building an application by composing these loosely coupled components.

Derived state can be updated by observing changes in the underlying data. Moreover, the derived state itself can further be observed by downstream consumers. We can even take this dataflow all the way through to the end-user device that is displaying the data, and thus build user interfaces that dynamically update to reflect data changes and continue to work offline.
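The chain of observers described above can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not a real stream processor: the classes `Observable` and `CountView` and the `screen` dictionary (a stand-in for a user interface) are hypothetical names.

```python
class Observable:
    """Anything whose changes can be observed by subscribers."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, change):
        for callback in self._subscribers:
            callback(change)


class CountView(Observable):
    """Derived state: a running count per key. It observes a source
    and is itself observable by downstream consumers."""

    def __init__(self, source):
        super().__init__()
        self.counts = {}
        source.subscribe(self._on_change)

    def _on_change(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        # Push the updated value further downstream.
        self.publish((key, self.counts[key]))


source = Observable()        # stand-in for the underlying data
view = CountView(source)     # derived state
screen = {}                  # stand-in for an end-user display


def render(update):
    key, value = update
    screen[key] = value


view.subscribe(render)

source.publish("likes")
source.publish("likes")
source.publish("shares")
```

Each write to `source` flows through the derived view to `screen` without the display ever polling for changes.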

Next, we discussed how to ensure that all of this processing remains correct in the presence of faults. We saw that strong integrity guarantees can be implemented scalably with asynchronous event processing, by using end-to-end operation identifiers to make operations idempotent and by checking constraints asynchronously. Clients can either wait until the check has passed, or go ahead without waiting but risk having to apologize about a constraint violation. This approach is much more scalable and robust than the traditional approach of using distributed transactions, and fits with how many business processes work in practice.
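To illustrate how an end-to-end operation identifier makes an operation idempotent, here is a small Python sketch. It assumes the client generates a unique ID once per logical request and reuses it on every retry; the `Ledger` class and its `apply` method are hypothetical, and a real system would need to record the ID and the effect atomically in durable storage.

```python
import uuid


class Ledger:
    """Toy account store that suppresses duplicate requests."""

    def __init__(self):
        self.balances = {}
        self.applied = set()  # request IDs that have already taken effect

    def apply(self, request_id, account, amount):
        """Apply a credit at most once per request ID.
        Returns True if applied, False if it was a duplicate."""
        if request_id in self.applied:
            return False
        self.applied.add(request_id)
        self.balances[account] = self.balances.get(account, 0) + amount
        return True


ledger = Ledger()

# The client picks the ID once, before the first attempt.
request_id = str(uuid.uuid4())

first = ledger.apply(request_id, "alice", 10)
# A retry after a timeout carries the same ID, so it has no further effect.
retry = ledger.apply(request_id, "alice", 10)
```

The important design choice is that the identifier originates at the client, so duplicates are detected no matter which layer (network, load balancer, or application) caused the retry.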

By structuring applications around dataflow and checking constraints asynchronously, we can avoid most coordination and create systems that maintain integrity but still perform well, even in geographically distributed scenarios and in the presence of faults. We then talked a little about using audits to verify the integrity of data and detect corruption.
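One way to picture an asynchronous constraint check is a single consumer that processes requests in log order and decides the outcome after the fact. The sketch below applies this to a uniqueness constraint on usernames; the function `check_username_claims` and the tuple layout `(request_id, user, username)` are assumptions made for this example.

```python
def check_username_claims(requests):
    """Process claim requests sequentially in log order.
    The first claim on a username wins; later claims by other users
    are rejected, and the client can be notified (the "apology")."""
    owners = {}
    outcomes = []
    for request_id, user, username in requests:
        if username in owners and owners[username] != user:
            outcomes.append((request_id, "rejected"))
        else:
            owners[username] = user
            outcomes.append((request_id, "accepted"))
    return owners, outcomes


# Hypothetical log of requests, already totally ordered.
log = [
    ("r1", "u1", "martin"),
    ("r2", "u2", "martin"),     # conflicts with r1
    ("r3", "u2", "kleppmann"),
]

owners, outcomes = check_username_claims(log)
```

Because the consumer is deterministic and the log order is fixed, rerunning it over the same log yields the same decisions, and no lock or distributed transaction is needed between the clients that submitted the requests.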

Finally, we took a step back and examined some ethical aspects of building data-intensive applications. We saw that although data can be used to do good, it can also do significant harm: making decisions that seriously affect people’s lives and are difficult to appeal against, leading to discrimination and exploitation, normalizing surveillance, and exposing intimate information. We also run the risk of data breaches, and we may find that a well-intentioned use of data has unintended consequences.

As software and data are having such a large impact on the world, we engineers must remember that we carry a responsibility to work toward the kind of world that we want to live in: a world that treats people with humanity and respect. I hope that we can work together toward that goal.

