3. Storage and Retrieval

Wer Ordnung haĢlt, ist nur zu faul zum Suchen. (If you keep things tidily ordered, youāre just too lazy to go searching.)
ā ā German proverb
On the most fundamental level, a database needs to do two things: when you give it some data, it should store the data, and when you ask it again later, it should give the data back to you.
In Chapter 2 we discussed data models and query languagesāi.e., the format in which you (the application developer) give the database your data, and the mechaā nism by which you can ask for it again later. In this chapter we discuss the same from the databaseās point of view: how we can store the data that weāre given, and how we can find it again when weāre asked for it.
Why should you, as an application developer, care how the database handles storage and retrieval internally? Youāre probably not going to implement your own storage engine from scratch, but you do need to select a storage engine that is appropriate for your application, from the many that are available. In order to tune a storage engine to perform well on your kind of workload, you need to have a rough idea of what the storage engine is doing under the hood.
In particular, there is a big difference between storage engines that are optimized for transactional workloads and those that are optimized for analytics. We will explore that distinction later in āTransaction Processing or Analytics?ā, and in āColumn-Oriented Storageā weāll discuss a family of storage engines that is optimized for analytics.
However, first weāll start this chapter by talking about storage engines that are used in the kinds of databases that youāre probably familiar with: traditional relational dataā bases, and also most so-called NoSQL databases. We will examine two families of storage engines: log-structured storage engines, and page-oriented storage engines such as B-trees.
ā¦ā¦
Summary
In this chapter we tried to get to the bottom of how databases handle storage and retrieval. What happens when you store data in a database, and what does the dataā base do when you query for the data again later?
On a high level, we saw that storage engines fall into two broad categories: those optiā mized for transaction processing (OLTP), and those optimized for analytics (OLAP). There are big differences between the access patterns in those use cases:
OLTP systems are typically user-facing, which means that they may see a huge volume of requests. In order to handle the load, applications usually only touch a small number of records in each query. The application requests records using some kind of key, and the storage engine uses an index to find the data for the requested key. Disk seek time is often the bottleneck here.
Data warehouses and similar analytic systems are less well known, because they are primarily used by business analysts, not by end users. They handle a much lower volume of queries than OLTP systems, but each query is typically very demanding, requiring many millions of records to be scanned in a short time. Disk bandwidth (not seek time) is often the bottleneck here, and column- oriented storage is an increasingly popular solution for this kind of workload.
On the OLTP side, we saw storage engines from two main schools of thought:
The log-structured school, which only permits appending to files and deleting obsolete files, but never updates a file that has been written. Bitcask, SSTables, LSM-trees, LevelDB, Cassandra, HBase, Lucene, and others belong to this group.
The update-in-place school, which treats the disk as a set of fixed-size pages that can be overwritten. B-trees are the biggest example of this philosophy, being used in all major relational databases and also many nonrelational ones.
Log-structured storage engines are a comparatively recent development. Their key idea is that they systematically turn random-access writes into sequential writes on disk, which enables higher write throughput due to the performance characteristics of hard drives and SSDs.
Finishing off the OLTP side, we did a brief tour through some more complicated indexing structures, and databases that are optimized for keeping all data in memory.
We then took a detour from the internals of storage engines to look at the high-level architecture of a typical data warehouse. This background illustrated why analytic workloads are so different from OLTP: when your queries require sequentially scanā ning across a large number of rows, indexes are much less relevant. Instead it becomes important to encode data very compactly, to minimize the amount of data that the query needs to read from disk. We discussed how column-oriented storage helps achieve this goal.
As an application developer, if youāre armed with this knowledge about the internals of storage engines, you are in a much better position to know which tool is best suited for your particular application. If you need to adjust a databaseās tuning parameters, this understanding allows you to imagine what effect a higher or a lower value may have.
Although this chapter couldnāt make you an expert in tuning any one particular storā age engine, it has hopefully equipped you with enough vocabulary and ideas that you can make sense of the documentation for the database of your choice.
References
Alfred V. Aho, John E. Hopcroft, and Jeffrey D. Ullman: Data Structures and Algorithms. Addison-Wesley, 1983. ISBN: 978-0-201-00023-8
Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein: Introduction to Algorithms, 3rd edition. MIT Press, 2009. ISBN: 978-0-262-53305-8
Justin Sheehy and David Smith: āBitcask: A Log-Structured Hash Table for Fast Key/Value Data,ā Basho Technologies, April 2010.
Yinan Li, Bingsheng He, Robin Jun Yang, et al.: āTree Indexing on Solid State Drives,ā Proceedings of the VLDB Endowment, volume 3, number 1, pages 1195ā1206, September 2010.
Goetz Graefe: āModern B-Tree Techniques,ā Foundations and Trends in Databases, volume 3, number 4, pages 203ā402, August 2011. doi:10.1561/1900000028
Jeffrey Dean and Sanjay Ghemawat: āLevelDB Implementation Notes,ā github.com.
Dhruba Borthakur: āThe History of RocksDB,ā rocksdb.blogspot.com, November 24, 2013.
Matteo Bertozzi: āApache HBase I/O ā HFile,ā blog.cloudera.com, June 29, 2012.
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, et al.: āBigtable: A Distributed Storage System for Structured Data,ā at 7th USENIX Symposium on Operating System Design and Implementation (OSDI), November 2006.
Patrick O'Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O'Neil: āThe Log-Structured Merge-Tree (LSM-Tree),ā Acta Informatica, volume 33, number 4, pages 351ā385, June 1996. doi:10.1007/s002360050048
Mendel Rosenblum and John K. Ousterhout: āThe Design and Implementation of a Log-Structured File System,ā ACM Transactions on Computer Systems, volume 10, number 1, pages 26ā52, February 1992. doi:10.1145/146941.146943
Adrien Grand: āWhat Is in a Lucene Index?,ā at Lucene/Solr Revolution, November 14, 2013.
Deepak Kandepet: āHacking LuceneāThe Index Format,ā hackerlabs.github.io, October 1, 2011.
Michael McCandless: āVisualizing Lucene's Segment Merges,ā blog.mikemccandless.com, February 11, 2011.
Burton H. Bloom: āSpace/Time Trade-offs in Hash Coding with Allowable Errors,ā Communications of the ACM, volume 13, number 7, pages 422ā426, July 1970. doi:10.1145/362686.362692
āOperating Cassandra: Compaction,ā Apache Cassandra Documentation v4.0, 2016.
Rudolf Bayer and Edward M. McCreight: āOrganization and Maintenance of Large Ordered Indices,ā Boeing Scientific Research Laboratories, Mathematical and Information Sciences Laboratory, report no. 20, July 1970.
Douglas Comer: āThe Ubiquitous B-Tree,ā ACM Computing Surveys, volume 11, number 2, pages 121ā137, June 1979. doi:10.1145/356770.356776
Emmanuel Goossaert: āCoding for SSDs,ā codecapsule.com, February 12, 2014.
C. Mohan and Frank Levine: āARIES/IM: An Efficient and High Concurrency Index Management Method Using Write-Ahead Logging,ā at ACM International Conference on Management of Data (SIGMOD), June 1992. doi:10.1145/130283.130338
Howard Chu: āLDAP at Lightning Speed,ā at Build Stuff '14, November 2014.
Bradley C. Kuszmaul: āA Comparison of Fractal Trees to Log-Structured Merge (LSM) Trees,ā tokutek.com, April 22, 2014.
Manos Athanassoulis, Michael S. Kester, Lukas M. Maas, et al.: āDesigning Access Methods: The RUM Conjecture,ā at 19th International Conference on Extending Database Technology (EDBT), March 2016. doi:10.5441/002/edbt.2016.42
Peter Zaitsev: āInnodb Double Write,ā percona.com, August 4, 2006.
Tomas Vondra: āOn the Impact of Full-Page Writes,ā blog.2ndquadrant.com, November 23, 2016.
Mark Callaghan: āThe Advantages of an LSM vs a B-Tree,ā smalldatum.blogspot.co.uk, January 19, 2016.
Mark Callaghan: āChoosing Between Efficiency and Performance with RocksDB,ā at Code Mesh, November 4, 2016.
Michi Mutsuzaki: āMySQL vs. LevelDB,ā github.com, August 2011.
Benjamin Coverston, Jonathan Ellis, et al.: āCASSANDRA-1608: Redesigned Compaction, issues.apache.org, July 2011.
Igor Canadi, Siying Dong, and Mark Callaghan: āRocksDB Tuning Guide,ā github.com, 2016.
MySQL 5.7 Reference Manual. Oracle, 2014.
Books Online for SQL Server 2012. Microsoft, 2012.
Joe Webb: āUsing Covering Indexes to Improve Query Performance,ā simple-talk.com, 29 September 2008.
Frank Ramsak, Volker Markl, Robert Fenk, et al.: āIntegrating the UB-Tree into a Database System Kernel,ā at 26th International Conference on Very Large Data Bases (VLDB), September 2000.
The PostGIS Development Group: āPostGIS 2.1.2dev Manual,ā postgis.net, 2014.
Robert Escriva, Bernard Wong, and Emin Gün Sirer: āHyperDex: A Distributed, Searchable Key-Value Store,ā at ACM SIGCOMM Conference, August 2012. doi:10.1145/2377677.2377681
Michael McCandless: āLucene's FuzzyQuery Is 100 Times Faster in 4.0,ā blog.mikemccandless.com, March 24, 2011.
Steffen Heinz, Justin Zobel, and Hugh E. Williams: āBurst Tries: A Fast, Efficient Data Structure for String Keys,ā ACM Transactions on Information Systems, volume 20, number 2, pages 192ā223, April 2002. doi:10.1145/506309.506312
Klaus U. Schulz and Stoyan Mihov: āFast String Correction with Levenshtein Automata,ā International Journal on Document Analysis and Recognition, volume 5, number 1, pages 67ā85, November 2002. doi:10.1007/s10032-002-0082-8
Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze: Introduction to Information Retrieval. Cambridge University Press, 2008. ISBN: 978-0-521-86571-5, available online at nlp.stanford.edu/IR-book
Michael Stonebraker, Samuel Madden, Daniel J. Abadi, et al.: āThe End of an Architectural Era (Itās Time for a Complete Rewrite),ā at 33rd International Conference on Very Large Data Bases (VLDB), September 2007.
āVoltDB Technical Overview White Paper,ā VoltDB, 2014.
Stephen M. Rumble, Ankita Kejriwal, and John K. Ousterhout: āLog-Structured Memory for DRAM-Based Storage,ā at 12th USENIX Conference on File and Storage Technologies (FAST), February 2014.
Stavros Harizopoulos, Daniel J. Abadi, Samuel Madden, and Michael Stonebraker: āOLTP Through the Looking Glass, and What We Found There,ā at ACM International Conference on Management of Data (SIGMOD), June 2008. doi:10.1145/1376616.1376713
Justin DeBrabant, Andrew Pavlo, Stephen Tu, et al.: āAnti-Caching: A New Approach to Database Management System Architecture,ā Proceedings of the VLDB Endowment, volume 6, number 14, pages 1942ā1953, September 2013.
Joy Arulraj, Andrew Pavlo, and Subramanya R. Dulloor: āLet's Talk About Storage & Recovery Methods for Non-Volatile Memory Database Systems,ā at ACM International Conference on Management of Data (SIGMOD), June 2015. doi:10.1145/2723372.2749441
Edgar F. Codd, S. B. Codd, and C. T. Salley: āProviding OLAP to User-Analysts: An IT Mandate,ā E. F. Codd Associates, 1993.
Surajit Chaudhuri and Umeshwar Dayal: āAn Overview of Data Warehousing and OLAP Technology,ā ACM SIGMOD Record, volume 26, number 1, pages 65ā74, March 1997. doi:10.1145/248603.248616
Per-Ć ke Larson, Cipri Clinciu, Campbell Fraser, et al.: āEnhancements to SQL Server Column Stores,ā at ACM International Conference on Management of Data (SIGMOD), June 2013.
Franz FƤrber, Norman May, Wolfgang Lehner, et al.: āThe SAP HANA Database ā An Architecture Overview,ā IEEE Data Engineering Bulletin, volume 35, number 1, pages 28ā33, March 2012.
Michael Stonebraker: āThe Traditional RDBMS Wisdom Is (Almost Certainly) All Wrong,ā presentation at EPFL, May 2013.
Daniel J. Abadi: āClassifying the SQL-on-Hadoop Solutions,ā hadapt.com, October 2, 2013.
Marcel Kornacker, Alexander Behm, Victor Bittorf, et al.: āImpala: A Modern, Open-Source SQL Engine for Hadoop,ā at 7th Biennial Conference on Innovative Data Systems Research (CIDR), January 2015.
Sergey Melnik, Andrey Gubarev, Jing Jing Long, et al.: āDremel: Interactive Analysis of Web-Scale Datasets,ā at 36th International Conference on Very Large Data Bases (VLDB), pages 330ā339, September 2010.
Ralph Kimball and Margy Ross: The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd edition. John Wiley & Sons, July 2013. ISBN: 978-1-118-53080-1
Derrick Harris: āWhy Apple, eBay, and Walmart Have Some of the Biggest Data Warehouses Youāve Ever Seen,ā gigaom.com, March 27, 2013.
Julien Le Dem: āDremel Made Simple with Parquet,ā blog.twitter.com, September 11, 2013.
Daniel J. Abadi, Peter Boncz, Stavros Harizopoulos, et al.: āThe Design and Implementation of Modern Column-Oriented Database Systems,ā Foundations and Trends in Databases, volume 5, number 3, pages 197ā280, December 2013. doi:10.1561/1900000024
Peter Boncz, Marcin Zukowski, and Niels Nes: āMonetDB/X100: Hyper-Pipelining Query Execution,ā at 2nd Biennial Conference on Innovative Data Systems Research (CIDR), January 2005.
Jingren Zhou and Kenneth A. Ross: āImplementing Database Operations Using SIMD Instructions,ā at ACM International Conference on Management of Data (SIGMOD), pages 145ā156, June 2002. doi:10.1145/564691.564709
Michael Stonebraker, Daniel J. Abadi, Adam Batkin, et al.: āC-Store: A Column-oriented DBMS,ā at 31st International Conference on Very Large Data Bases (VLDB), pages 553ā564, September 2005.
Andrew Lamb, Matt Fuller, Ramakrishna Varadarajan, et al.: āThe Vertica Analytic Database: C-Store 7 Years Later,ā Proceedings of the VLDB Endowment, volume 5, number 12, pages 1790ā1801, August 2012.
Julien Le Dem and Nong Li: āEfficient Data Storage for Analytics with Apache Parquet 2.0,ā at Hadoop Summit, San Jose, June 2014.
Jim Gray, Surajit Chaudhuri, Adam Bosworth, et al.: āData Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals,ā Data Mining and Knowledge Discovery, volume 1, number 1, pages 29ā53, March 2007. doi:10.1023/A:1009726021843
Last updated
Was this helpful?