
Chapter 6. Partitioning

Clearly, we must break away from the sequential and not limit the computers. We must state definitions and provide for priorities and descriptions of data. We must state relationships, not procedures.

– Grace Murray Hopper, Management and the Computer of the Future (1962)


In Chapter 5 we discussed replication, that is, having multiple copies of the same data on different nodes. For very large datasets, or very high query throughput, that is not sufficient: we need to break the data up into partitions, also known as sharding.

Terminological confusion

What we call a partition here is called a shard in MongoDB, Elasticsearch, and SolrCloud; it’s known as a region in HBase, a tablet in Bigtable, a vnode in Cassandra and Riak, and a vBucket in Couchbase. However, partitioning is the most established term, so we’ll stick with that.

Normally, partitions are defined in such a way that each piece of data (each record, row, or document) belongs to exactly one partition. There are various ways of achieving this, which we discuss in depth in this chapter. In effect, each partition is a small database of its own, although the database may support operations that touch multiple partitions at the same time.

The main reason for wanting to partition data is scalability. Different partitions can be placed on different nodes in a shared-nothing cluster (see the introduction to Part II for a definition of shared nothing). Thus, a large dataset can be distributed across many disks, and the query load can be distributed across many processors.

For queries that operate on a single partition, each node can independently execute the queries for its own partition, so query throughput can be scaled by adding more nodes. Large, complex queries can potentially be parallelized across many nodes, although this gets significantly harder.

Partitioned databases were pioneered in the 1980s by products such as Teradata and Tandem NonStop SQL [1], and more recently rediscovered by NoSQL databases and Hadoop-based data warehouses. Some systems are designed for transactional workloads, and others for analytics (see “Transaction Processing or Analytics?”): this difference affects how the system is tuned, but the fundamentals of partitioning apply to both kinds of workloads.

In this chapter we will first look at different approaches for partitioning large datasets and observe how the indexing of data interacts with partitioning. We’ll then talk about rebalancing, which is necessary if you want to add or remove nodes in your cluster. Finally, we’ll get an overview of how databases route requests to the right partitions and execute queries.

……

Summary

In this chapter we explored different ways of partitioning a large dataset into smaller subsets. Partitioning is necessary when you have so much data that storing and processing it on a single machine is no longer feasible.

The goal of partitioning is to spread the data and query load evenly across multiple machines, avoiding hot spots (nodes with disproportionately high load). This requires choosing a partitioning scheme that is appropriate to your data, and rebalancing the partitions when nodes are added to or removed from the cluster.

We discussed two main approaches to partitioning:

  • Key range partitioning, where keys are sorted, and a partition owns all the keys from some minimum up to some maximum. Sorting has the advantage that efficient range queries are possible, but there is a risk of hot spots if the application often accesses keys that are close together in the sorted order.

    In this approach, partitions are typically rebalanced dynamically by splitting the range into two subranges when a partition gets too big.

  • Hash partitioning, where a hash function is applied to each key, and a partition owns a range of hashes. This method destroys the ordering of keys, making range queries inefficient, but may distribute load more evenly.

    When partitioning by hash, it is common to create a fixed number of partitions in advance, to assign several partitions to each node, and to move entire partitions from one node to another when nodes are added or removed. Dynamic partitioning can also be used. A minimal code sketch of both schemes follows this list.
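To make the two schemes concrete, here is a minimal Python sketch (not from this chapter; the boundary keys, hash function, and partition count are illustrative assumptions):

```python
import bisect
import hashlib

# --- Key-range partitioning ---
# Each partition owns a contiguous range of keys. The boundaries below are
# purely illustrative; real systems derive them from the data distribution.
BOUNDARIES = ["f", "m", "t"]  # partition 0: < "f", 1: "f".."m", 2: "m".."t", 3: >= "t"

def range_partition(key: str) -> int:
    """Return the number of the partition owning `key`."""
    return bisect.bisect_right(BOUNDARIES, key)

# --- Hash partitioning ---
# A fixed number of partitions is created in advance, and each key is
# assigned by hashing it. A stable hash is used deliberately: Python's
# built-in hash() is randomized per process, so it is unsafe here.
NUM_PARTITIONS = 16

def hash_partition(key: str) -> int:
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

print(range_partition("apple"))   # 0: "apple" sorts before "f"
print(range_partition("orange"))  # 2: between "m" and "t"
print(hash_partition("apple"))    # some value in 0..15, stable across runs
```

Note the use of a stable hash: as reference [6] argues for Java’s hashCode, a hash function whose output varies between processes is unsuitable for deciding partition assignment.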

Hybrid approaches are also possible, for example with a compound key: using one part of the key to identify the partition and another part for the sort order.
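As a rough sketch of that idea, assuming a hypothetical (user_id, timestamp) compound key in the style of Cassandra’s partition and clustering keys:

```python
import hashlib

def hash_partition(key: str, num_partitions: int = 16) -> int:
    """Stable hash partitioning, as in the previous sketch."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

def place(user_id: str, timestamp: int):
    # The first key part alone chooses the partition, so all of one user's
    # records land together; the full key orders them within the partition.
    partition = hash_partition(user_id)
    sort_key = (user_id, timestamp)
    return partition, sort_key
```

All of one user’s records end up in the same partition, sorted by timestamp, so a per-user time-range scan stays local even though hashing has destroyed the global key order.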

We also discussed the interaction between partitioning and secondary indexes. A secondary index also needs to be partitioned, and there are two methods:

  • Document-partitioned indexes (local indexes), where the secondary indexes are stored in the same partition as the primary key and value. This means that only a single partition needs to be updated on write, but a read of the secondary index requires a scatter/gather across all partitions.

  • Term-partitioned indexes (global indexes), where the secondary indexes are partitioned separately, using the indexed values. An entry in the secondary index may include records from all partitions of the primary key. When a document is written, several partitions of the secondary index need to be updated; however, a read can be served from a single partition. Both layouts are sketched in code below.
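The contrast in read and write paths can be sketched with in-memory dictionaries standing in for partitions (all names are illustrative, not any real system’s API):

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 4

def partition_of(key: str) -> int:
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

# Document-partitioned (local) index: each partition indexes only its own
# documents, so a write touches a single partition...
local_index = [defaultdict(list) for _ in range(NUM_PARTITIONS)]

def local_write(doc_id: str, term: str) -> None:
    local_index[partition_of(doc_id)][term].append(doc_id)

def local_read(term: str) -> list:
    # ...but a read must scatter/gather across all partitions
    # (parallel RPCs in a real system).
    results = []
    for part in local_index:
        results.extend(part[term])
    return results

# Term-partitioned (global) index: the index is partitioned by the indexed
# term itself, so a read goes to exactly one partition...
global_index = [defaultdict(list) for _ in range(NUM_PARTITIONS)]

def global_write(doc_id: str, term: str) -> None:
    # ...but a document with several terms may require updates to several
    # index partitions, often applied asynchronously in practice.
    global_index[partition_of(term)][term].append(doc_id)

def global_read(term: str) -> list:
    return global_index[partition_of(term)][term]
```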

Finally, we discussed techniques for routing queries to the appropriate partition, which range from simple partition-aware load balancing to sophisticated parallel query execution engines.
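At the simple end of that spectrum, partition-aware routing amounts to a lookup in a partition-to-node assignment map. The sketch below makes up the map and node names; real systems keep such metadata in a coordination service such as ZooKeeper, or propagate it between nodes by gossip:

```python
import hashlib

NUM_PARTITIONS = 16

# partition number -> node currently responsible for it (illustrative)
assignment = {p: f"node-{p % 3}" for p in range(NUM_PARTITIONS)}

def route(key: str) -> str:
    """Return the node to which a request for `key` should be sent."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    partition = int.from_bytes(digest[:8], "big") % NUM_PARTITIONS
    return assignment[partition]

print(route("user:42"))  # e.g. "node-1"
```

Rebalancing then only rewrites entries in the assignment map; the key-to-partition mapping itself stays fixed, which is what makes moving whole partitions between nodes cheap to coordinate.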

By design, every partition operates mostly independently; that’s what allows a partitioned database to scale to multiple machines. However, operations that need to write to several partitions can be difficult to reason about: for example, what happens if the write to one partition succeeds, but another fails? We will address that question in the following chapters.

References

  1. David J. DeWitt and Jim N. Gray: “Parallel Database Systems: The Future of High Performance Database Systems,” Communications of the ACM, volume 35, number 6, pages 85–98, June 1992. doi:10.1145/129888.129894

  2. Lars George: “HBase vs. BigTable Comparison,” larsgeorge.com, November 2009.

  3. “The Apache HBase Reference Guide,” Apache Software Foundation, hbase.apache.org, 2014.

  4. MongoDB, Inc.: “New Hash-Based Sharding Feature in MongoDB 2.4,” blog.mongodb.org, April 10, 2013.

  5. Ikai Lan: “App Engine Datastore Tip: Monotonically Increasing Values Are Bad,” ikaisays.com, January 25, 2011.

  6. Martin Kleppmann: “Java’s hashCode Is Not Safe for Distributed Systems,” martin.kleppmann.com, June 18, 2012.

  7. David Karger, Eric Lehman, Tom Leighton, et al.: “Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web,” at 29th Annual ACM Symposium on Theory of Computing (STOC), pages 654–663, 1997. doi:10.1145/258533.258660

  8. John Lamping and Eric Veach: “A Fast, Minimal Memory, Consistent Hash Algorithm,” arxiv.org, June 2014.

  9. Eric Redmond: “A Little Riak Book,” Version 1.4.0, Basho Technologies, September 2013.

  10. “Couchbase 2.5 Administrator Guide,” Couchbase, Inc., 2014.

  11. Avinash Lakshman and Prashant Malik: “Cassandra – A Decentralized Structured Storage System,” at 3rd ACM SIGOPS International Workshop on Large Scale Distributed Systems and Middleware (LADIS), October 2009.

  12. Jonathan Ellis: “Facebook’s Cassandra Paper, Annotated and Compared to Apache Cassandra 2.0,” docs.datastax.com, September 12, 2013.

  13. “Introduction to Cassandra Query Language,” DataStax, Inc., 2014.

  14. Samuel Axon: “3% of Twitter’s Servers Dedicated to Justin Bieber,” mashable.com, September 7, 2010.

  15. “Riak KV Docs,” docs.riak.com.

  16. Richard Low: “The Sweet Spot for Cassandra Secondary Indexing,” wentnet.com, October 21, 2013.

  17. Zachary Tong: “Customizing Your Document Routing,” elastic.co, June 3, 2013.

  18. “Apache Solr Reference Guide,” Apache Software Foundation, 2014.

  19. Andrew Pavlo: “H-Store Frequently Asked Questions,” hstore.cs.brown.edu, October 2013.

  20. “Amazon DynamoDB Developer Guide,” Amazon Web Services, Inc., 2014.

  21. Rusty Klophaus: “Difference Between 2I and Search,” email to riak-users mailing list, lists.basho.com, October 25, 2011.

  22. Donald K. Burleson: “Object Partitioning in Oracle,” dba-oracle.com, November 8, 2000.

  23. Eric Evans: “Rethinking Topology in Cassandra,” at ApacheCon Europe, November 2012.

  24. Rafał Kuć: “Reroute API Explained,” elasticsearchserverbook.com, September 30, 2013.

  25. “Project Voldemort Documentation,” project-voldemort.com.

  26. Enis Soztutar: “Apache HBase Region Splitting and Merging,” hortonworks.com, February 1, 2013.

  27. Brandon Williams: “Virtual Nodes in Cassandra 1.2,” datastax.com, December 4, 2012.

  28. Richard Jones: “libketama: Consistent Hashing Library for Memcached Clients,” metabrew.com, April 10, 2007.

  29. Branimir Lambov: “New Token Allocation Algorithm in Cassandra 3.0,” datastax.com, January 28, 2016.

  30. Jason Wilder: “Open-Source Service Discovery,” jasonwilder.com, February 2014.

  31. Kishore Gopalakrishna, Shi Lu, Zhen Zhang, et al.: “Untangling Cluster Management with Helix,” at ACM Symposium on Cloud Computing (SoCC), October 2012. doi:10.1145/2391229.2391248

  32. “Moxi 1.8 Manual,” Couchbase, Inc., 2014.

  33. Shivnath Babu and Herodotos Herodotou: “Massively Parallel Databases and MapReduce Systems,” Foundations and Trends in Databases, volume 5, number 1, pages 1–104, November 2013. doi:10.1561/1900000036
