Taken from:
- Build a Database Pt. 2: MemTable
- MemTable – facebook/rocksdb wiki
- Write-Ahead Logging – SQLite
- Writing A Database: Part 2 — Write Ahead Log
- SSTable and Log Structured Storage: LevelDB
- Study Guide — Cassandra Reads and Writes
- On Disk IO, Part 3: LSM Trees
- Bigtable: A Distributed Storage System for Structured Data
MemTable
The MemTable (a.k.a. Memory Table) is the in-memory cache of the latest set of record writes applied to the database before it is flushed into SSTable (SST) files. Simply put, it is a container, whether that be a Vector, HashLinkList, RB Tree, SkipList, HashSkipList, or any other container, that holds the written records sorted, in total order, by key. By keeping the records sorted, lookups and scans in the MemTable can be done efficiently using a data structure that supports an O(log N) access pattern.
It serves both reads and writes: new writes always insert data into the memtable, and reads have to query the memtable before reading from SST files, because data in the memtable is newer. Once a memtable is full, it becomes immutable and is replaced by a new memtable. A background thread flushes the content of the memtable into an SST file, after which the memtable can be destroyed.
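To make the shape of this component concrete, here is a minimal, illustrative MemTable sketch in Python. It keeps keys in total order and tracks an approximate size so the caller knows when to flush; the class and method names are assumptions for illustration, not any particular engine’s API, and a real implementation would use a skip list or similar structure rather than a sorted Python list.

```python
import bisect

class MemTable:
    """Illustrative MemTable: records kept sorted by key, flushed when full."""

    def __init__(self, capacity_bytes=4 * 1024 * 1024):
        self._keys = []          # keys in total order
        self._values = {}        # key -> latest value
        self._size = 0           # approximate memory footprint in bytes
        self._capacity = capacity_bytes

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)   # keep total order on insert
        self._size += len(key) + len(value)
        self._values[key] = value

    def get(self, key):
        return self._values.get(key)

    def scan(self, lo, hi):
        """Yield (key, value) pairs for lo <= key < hi, in key order."""
        i = bisect.bisect_left(self._keys, lo)
        while i < len(self._keys) and self._keys[i] < hi:
            yield self._keys[i], self._values[self._keys[i]]
            i += 1

    def is_full(self):
        return self._size >= self._capacity
```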
Skiplist MemTable
A skiplist-based memtable provides generally good performance for both reads and writes, and for both random access and sequential scans. Besides that, it provides some useful features that other memtable implementations don’t currently support, like Concurrent Insert and Insert with Hint.
MemTable type comparison: SkipList, HashSkipList, HashLinkList, Vector
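As a sketch of why a skiplist gives sorted order with expected O(log N) reads and writes, here is a toy, single-threaded skip list in Python. It only illustrates the data structure; it is not RocksDB’s implementation, and it omits concurrent insert, hints, and arena allocation.

```python
import random

class SkipListNode:
    def __init__(self, key, value, height):
        self.key = key
        self.value = value
        self.forward = [None] * height   # one next-pointer per level

class SkipList:
    MAX_HEIGHT = 16
    P = 0.5                              # chance of growing a node by one level

    def __init__(self):
        self.head = SkipListNode(None, None, self.MAX_HEIGHT)
        self.height = 1

    def _random_height(self):
        h = 1
        while h < self.MAX_HEIGHT and random.random() < self.P:
            h += 1
        return h

    def get(self, key):
        node = self.head
        for level in range(self.height - 1, -1, -1):
            while node.forward[level] is not None and node.forward[level].key < key:
                node = node.forward[level]
        node = node.forward[0]
        return node.value if node is not None and node.key == key else None

    def put(self, key, value):
        update = [self.head] * self.MAX_HEIGHT   # rightmost node before `key` per level
        node = self.head
        for level in range(self.height - 1, -1, -1):
            while node.forward[level] is not None and node.forward[level].key < key:
                node = node.forward[level]
            update[level] = node
        node = node.forward[0]
        if node is not None and node.key == key:  # key exists: overwrite in place
            node.value = value
            return
        height = self._random_height()
        self.height = max(self.height, height)
        new_node = SkipListNode(key, value, height)
        for level in range(height):
            new_node.forward[level] = update[level].forward[level]
            update[level].forward[level] = new_node
```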
At their core, LSM-Tree databases take the random I/O problem of the B-Tree model and turn it into a sequential I/O problem, which is much faster. This is achieved by batching the writes for updated records. The MemTable does this in coordination with two mechanisms: the Write Ahead Log (WAL) and the Sorted String Table (SSTable). First, the WAL holds a replica of the MemTable so we can be assured that our data is intact in the event of a restart. Instead of storing the MemTable byte-for-byte, the WAL stores a running log of the operations applied to the database, hence its name. By replaying the operations stored in the WAL, the MemTable can be recovered. Second, SSTables are created to store MemTables once they have reached capacity. Again, this writes all of the records to disk in one go, eliminating the need for random disk writes.
Immutable files
Keeping the data structure immutable favors sequential writes: data is written to disk in a single, append-only pass.
Another advantage of immutable files is that data can be read from the disk without any segment locking between operations, which significantly simplifies concurrent access. In contrast, mutable data structures employ hierarchical locks and latches to ensure on-disk data structure integrity, allowing multiple readers at the same time but giving writers exclusive ownership of parts of the tree.
Both mutable and immutable data structures require some housekeeping in order to optimize performance, but for different reasons. Since the number of allocated files constantly grows, immutable data structures have to merge and rewrite files to make sure that the smallest possible number of files is hit during a query, as the requested record might be spread across multiple files.
The Write Ahead Log
The Write Ahead Log (WAL) is a commonly used technique in database systems to maintain atomicity and durability of writes. The key idea behind the WAL is that before we make any actual modifications to the database state, we first log, to durable storage (e.g. disk), the complete set of actions that we wish to be atomic and durable.
Durability is provided by writing the intended mutation to the WAL first, before applying the changes to, for example, the in-memory representation. By writing to the WAL first, should the database then crash, we will be able to recover the mutation and reapply it if necessary.
Atomicity is a little more subtle. Suppose a mutation requires changes A, B, and C to happen, but we have no means of atomically applying all of them at once. We could first log

intending to apply A
intending to apply B
intending to apply C

and only then start making the actual applications. Should the server crash halfway, we can look at the log and see what operations potentially need to be redone.
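A minimal sketch of this idea, with assumed names and JSON lines standing in for a real record format: log every intended change durably, then apply it, and on restart redo whatever is in the log (redo is safe here because applying a set twice is idempotent).

```python
import json
import os

def apply_mutation(wal_path, state, changes):
    """Log each intended change before touching `state` (a plain dict here)."""
    with open(wal_path, "a", encoding="utf-8") as wal:
        for key, value in changes:
            wal.write(json.dumps({"op": "set", "key": key, "value": value}) + "\n")
        wal.flush()
        os.fsync(wal.fileno())       # durable before any change is applied
    for key, value in changes:       # the actual, non-atomic application
        state[key] = value

def recover(wal_path, state):
    """On restart, redo every logged operation that may not have been applied."""
    if not os.path.exists(wal_path):
        return
    with open(wal_path, encoding="utf-8") as wal:
        for line in wal:
            record = json.loads(line)
            if record["op"] == "set":
                state[record["key"]] = record["value"]
```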
In DDB, the WAL is an append-only file of records:
record:
length: uint32 // length of data section
checksum: uint32 // CRC32 checksum of data
data: byte[length] // serialized ddb.internal.LogRecord proto
Since serialized protos are not self-describing, we need a length field to know how big the data payload is. Additionally, to guard against various forms of corruption (and bugs!) we have a CRC32 checksum of the data.
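A sketch of that framing in Python follows. The two-uint32, little-endian header layout is an assumption for illustration, and the payload is treated as an opaque byte string rather than a real ddb.internal.LogRecord proto.

```python
import struct
import zlib

HEADER = struct.Struct("<II")   # length, CRC32 (layout assumed for this sketch)

def encode_record(data: bytes) -> bytes:
    return HEADER.pack(len(data), zlib.crc32(data)) + data

def decode_records(buf: bytes):
    """Yield payloads in order, stopping at a torn write or checksum mismatch."""
    offset = 0
    while offset + HEADER.size <= len(buf):
        length, checksum = HEADER.unpack_from(buf, offset)
        start = offset + HEADER.size
        end = start + length
        if end > len(buf):
            break                          # truncated record at the tail
        data = buf[start:end]
        if zlib.crc32(data) != checksum:
            break                          # corruption detected, stop replay
        yield data
        offset = end
```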
SSTable: Sorted String Table
An immutable data structure that stores a large number of key:value pairs sorted by key. It is a file on disk.
- Duplicate keys are fine, there is no need for “padding” for keys or values, and keys and values are arbitrary blobs.
- Read in the entire file sequentially and you have a sorted index. Optionally, if the file is very large, we can also prepend, or create a standalone, key:offset index for fast access.
- Once an SSTable is on disk it is effectively immutable because an insert or delete would require a large I/O rewrite of the file.
- Having said that, for static indexes it is a great solution: read in the index, and you are always one disk seek away, or simply memmap the entire file to memory. Random reads are fast and easy.
- Random writes are much harder and more expensive, that is, unless the entire table is in memory, in which case we’re back to simple pointer manipulation. Turns out, this is the very problem that Google’s BigTable set out to solve: fast read/write access for datasets petabytes in size, backed by SSTables underneath.
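A toy illustration of the format and the standalone key:offset index, assuming string keys and values that contain no tabs or newlines (the on-disk layout here is invented for the sketch, not LevelDB’s):

```python
import json

def write_sstable(path, items):
    """Write key-sorted records and a standalone key:offset index beside them.
    `items` is an iterable of (key, value) string pairs."""
    index = {}
    with open(path, "wb") as f:
        for key, value in sorted(items):
            index[key] = f.tell()                      # byte offset of this record
            f.write(key.encode() + b"\t" + value.encode() + b"\n")
    with open(path + ".index", "w", encoding="utf-8") as f:
        json.dump(index, f)
    return index

def sstable_get(path, index, key):
    """With the index in memory, a lookup is a single seek plus one read."""
    offset = index.get(key)
    if offset is None:
        return None
    with open(path, "rb") as f:
        f.seek(offset)
        k, _, v = f.readline().rstrip(b"\n").partition(b"\t")
        return v.decode() if k == key.encode() else None
```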
LevelDB: SSTables and Log Structured Merge Trees
We want to preserve the fast read access which SSTables give us, but we also want to support fast random writes. Turns out, we already have all the necessary pieces: random writes are fast when the SSTable is in memory (let’s call it MemTable), and if the table is immutable then an on-disk SSTable is also fast to read from. Now let’s introduce the following conventions:
- On-disk SSTable indexes are always loaded into memory
- All writes go directly to the MemTable index
- Reads check the MemTable first and then the SSTable indexes
- Periodically, the MemTable is flushed to disk as an SSTable
- Periodically, on-disk SSTables are “collapsed together”
Writes are always done in memory and hence are always fast. Once the MemTable reaches a certain size, it is flushed to disk as an immutable SSTable. However, we will maintain all the SSTable indexes in memory, which means that for any read we can check the MemTable first, and then walk the sequence of SSTable indexes to find our data. Turns out, we have just reinvented “The Log-Structured Merge-Tree” (LSM Tree), described by Patrick O’Neil, and this is also the very mechanism behind “BigTable Tablets”.
LevelDB: LSM & SSTables: Updates, Deletes and Maintenance
This “LSM” architecture provides a number of interesting behaviors: writes are always fast regardless of the size of dataset (append-only), and random reads are either served from memory or require a quick disk seek.
However, what about updates and deletes?
Once the SSTable is on disk, it is immutable, hence updates and deletes can’t touch the data. Instead, a more recent value is simply stored in the MemTable in the case of an update, and a “tombstone” record is appended for deletes. Because we check the indexes in sequence, future reads will find the updated value or the tombstone record without ever reaching the older values! Finally, having hundreds of on-disk SSTables is also not a great idea, hence periodically we run a process to merge the on-disk SSTables, at which time the update and delete records overwrite and remove the older data.
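Extending the read-path sketch above with tombstone awareness, again under assumed interfaces: a delete is just a write of a sentinel value, and the first source that answers (value or tombstone) decides the result.

```python
TOMBSTONE = object()   # sentinel recorded in place of a value on delete

def lsm_delete(key, memtable):
    memtable.put(key, TOMBSTONE)              # deletes are ordinary writes

def lsm_get(key, memtable, sstables):
    for source in (memtable, *sstables):      # newest source first
        value = source.get(key)
        if value is TOMBSTONE:
            return None                       # deleted; older versions are ignored
        if value is not None:
            return value
    return None
```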
Take an SSTable, add a MemTable, and apply a set of processing conventions, and what you get is a nice database engine for certain types of workloads. In fact, Google’s BigTable, Hadoop’s HBase, and Cassandra, amongst others, are all using a variant or a direct copy of this very architecture.
Google BigTable
A Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.
(row:string, column:string, time:int64) → string
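A toy in-memory stand-in for this map, just to make the three-part indexing concrete (real Bigtable is, of course, distributed and persistent):

```python
# (row, column) -> list of (timestamp, value), newest first
webtable = {}

def put(row, column, timestamp, value):
    cell = webtable.setdefault((row, column), [])
    cell.append((timestamp, value))
    cell.sort(key=lambda tv: tv[0], reverse=True)   # decreasing timestamp order

def get(row, column):
    cell = webtable.get((row, column))
    return cell[0][1] if cell else None             # most recent version
```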
Rows
The row keys in a table are arbitrary strings (currently up to 64KB in size, although 10-100 bytes is a typical size for most of our users). Every read or write of data under a single row key is atomic (regardless of the number of different columns being read or written in the row), a design decision that makes it easier for clients to reason about the system’s behavior in the presence of concurrent updates to the same row.
Bigtable maintains data in lexicographic order by row key. The row range for a table is dynamically partitioned. Each row range is called a tablet, which is the unit of distribution and load balancing. As a result, reads of short row ranges are efficient and typically require communication with only a small number of machines. Clients can exploit this property by selecting their row keys so that they get good locality for their data accesses.
Figure 1: A slice of an example table that stores Web pages. The row name is a reversed URL. The contents column family contains the page contents, and the anchor column family contains the text of any anchors that reference the page. CNN’s home page is referenced by both the Sports Illustrated and the MY-look home pages, so the row contains columns named anchor:cnnsi.com and anchor:my.look.ca. Each anchor cell has one version; the contents column has three versions, at timestamps t3, t5, and t6.
For example, in Webtable, pages in the same domain are grouped together into contiguous rows by reversing the hostname components of the URLs. For example, we store data for maps.google.com/index.html under the key com.google.maps/index.html. Storing pages from the same domain near each other makes some host and domain analyses more efficient.
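A small helper showing the reversal, assuming the URL is given without a scheme, as in the example:

```python
def row_key(url: str) -> str:
    """maps.google.com/index.html -> com.google.maps/index.html"""
    host, _, path = url.partition("/")
    return ".".join(reversed(host.split("."))) + "/" + path
```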
Column Families
Column keys are grouped into sets called column families, which form the basic unit of access control. All data stored in a column family is usually of the same type (we compress data in the same column family together).
A column key is named using the following syntax: family:qualifier
. Column family names must be printable, but qualifiers may be arbitrary strings.
Timestamps
Each cell in a Bigtable can contain multiple versions of the same data; these versions are indexed by timestamp. Different versions of a cell are stored in decreasing timestamp order, so that the most recent versions can be read first.
To make the management of versioned data less onerous, we support two per-column-family settings that tell Bigtable to garbage-collect cell versions automatically. The client can specify either that only the last n versions of a cell be kept, or that only new-enough versions be kept (e.g., only keep values that were written in the last seven days).
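A sketch of those two garbage-collection settings, operating on a single cell’s version list stored newest-first (the function name and signature are assumptions for illustration):

```python
import time

def gc_cell_versions(versions, max_versions=None, max_age_seconds=None, now=None):
    """`versions` is a list of (timestamp, value) pairs in decreasing timestamp
    order. Keep only the last n versions and/or only new-enough versions."""
    now = time.time() if now is None else now
    kept = versions
    if max_versions is not None:
        kept = kept[:max_versions]                      # newest n versions
    if max_age_seconds is not None:
        kept = [(ts, v) for ts, v in kept if now - ts <= max_age_seconds]
    return kept
```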
Building Blocks
Bigtable is built on several other pieces of Google infrastructure. Bigtable uses the distributed Google File System (GFS) [17] to store log and data files.
Bigtable depends on a cluster management system for scheduling jobs, managing resources on shared machines, dealing with machine failures, and monitoring machine status.
SSTable
The Google SSTable file format is used internally to store Bigtable data. An SSTable provides a persistent, ordered immutable map from keys to values, where both keys and values are arbitrary byte strings. Operations are provided to look up the value associated with a specified key, and to iterate over all key/value pairs in a specified key range.
Internally, each SSTable contains a sequence of blocks (typically each block is 64KB in size, but this is configurable). A block index (stored at the end of the SSTable) is used to locate blocks; the index is loaded into memory when the SSTable is opened. A lookup can be performed with a single disk seek: we first find the appropriate block by performing a binary search in the in-memory index, and then read the appropriate block from disk. Optionally, an SSTable can be completely mapped into memory, which allows us to perform lookups and scans without touching disk.
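The single-seek lookup hinges on a binary search over the in-memory block index; a sketch, assuming the index is a key-sorted list of (last_key_in_block, block_offset) pairs:

```python
import bisect

def find_block(block_index, key):
    """Return the offset of the one block that may contain `key`, or None."""
    last_keys = [last_key for last_key, _ in block_index]
    i = bisect.bisect_left(last_keys, key)   # first block whose last key >= key
    if i == len(block_index):
        return None                          # key is beyond the end of the table
    return block_index[i][1]                 # read this block with a single seek
```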
Chubby
Bigtable relies on a highly-available and persistent distributed lock service called Chubby [8]. A Chubby service consists of five active replicas, one of which is elected to be the master and actively serve requests. The service is live when a majority of the replicas are running and can communicate with each other. Chubby uses the Paxos algorithm [9, 23] to keep its replicas consistent in the face of failure.
Compactions
As write operations execute, the size of the memtable increases. When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable and written to GFS. This minor compaction process has two goals: it shrinks the memory usage of the tablet server, and it reduces the amount of data that has to be read from the commit log during recovery if this server dies. Incoming read and write operations can continue while compactions occur.
Every minor compaction creates a new SSTable. If this behavior continued unchecked, read operations might need to merge updates from an arbitrary number of SSTables. Instead, we bound the number of such files by periodically executing a merging compaction in the background.
A merging compaction reads the contents of a few SSTables and the memtable, and writes out a new SSTable. The input SSTables and memtable can be discarded as soon as the compaction has finished.
A merging compaction that rewrites all SSTables into exactly one SSTable is called a major compaction. SSTables produced by non-major compactions can contain special deletion entries that suppress deleted data in older SSTables that are still live. A major compaction, on the other hand, produces an SSTable that contains no deletion information or deleted data.
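A merging compaction is essentially a k-way merge over key-sorted inputs in which the newest version of each key wins. The sketch below uses plain (key, value) iterables in place of real SSTable iterators, reuses the tombstone sentinel from the earlier sketch, and drops tombstones only when `major` is set.

```python
import heapq

TOMBSTONE = object()   # deletion sentinel, as in the tombstone sketch above

def _tag(table, age):
    """Attach the table's age to each record so newer tables win key ties."""
    for key, value in table:
        yield key, age, value

def merging_compaction(tables, major=False):
    """Merge key-sorted (key, value) iterables, ordered newest-first,
    into one sorted stream: the newest version of each key survives,
    and a major compaction additionally purges tombstones."""
    merged = heapq.merge(*(_tag(table, age) for age, table in enumerate(tables)))
    previous_key = object()                  # matches no real key
    for key, _age, value in merged:
        if key == previous_key:
            continue                         # older version of a key already emitted
        previous_key = key
        if major and value is TOMBSTONE:
            continue                         # no deletion information in the output
        yield key, value
```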
Lessons
In the process of designing, implementing, maintaining, and supporting Bigtable, we gained useful experience and learned several interesting lessons.
One lesson we learned is that large distributed systems are vulnerable to many types of failures, not just the standard network partitions and fail-stop failures assumed in many distributed protocols. For example, we have seen problems due to all of the following causes: memory and network corruption, large clock skew, hung machines, extended and asymmetric network partitions, bugs in other systems that we are using (Chubby for example), overflow of GFS quotas, and planned and unplanned hardware maintenance. As we have gained more experience with these problems, we have addressed them by changing various protocols. For example, we added checksumming to our RPC mechanism. We also handled some problems by removing assumptions made by one part of the system about another part. For example, we stopped assuming a given Chubby operation could return only one of a fixed set of errors.
Another lesson we learned is that it is important to delay adding new features until it is clear how the new features will be used. For example, we initially planned to support general-purpose transactions in our API. Because we did not have an immediate use for them, however, we did not implement them. Now that we have many real applications running on Bigtable, we have been able to examine their actual needs, and have discovered that most applications require only single-row transactions.
A practical lesson that we learned from supporting Bigtable is the importance of proper system-level monitoring (i.e., monitoring both Bigtable itself, as well as the client processes using Bigtable). For example, we extended our RPC system so that for a sample of the RPCs, it keeps a detailed trace of the important actions done on behalf of that RPC. This feature has allowed us to detect and fix many problems such as lock contention on tablet data structures, slow writes to GFS while committing Bigtable mutations, and stuck accesses to the METADATA table when METADATA tablets are unavailable.
The most important lesson we learned is the value of simple designs. Given both the size of our system (about 100,000 lines of non-test code), as well as the fact that code evolves over time in unexpected ways, we have found that code and design clarity are of immense help in code maintenance and debugging.
Related Work
Many recent projects have tackled the problem of providing distributed storage or higher-level services over wide area networks, often at “Internet scale.” This includes work on distributed hash tables that began with projects such as CAN [29], Chord [32], Tapestry [37], and Pastry [30]. These systems address concerns that do not arise for Bigtable, such as highly variable bandwidth, untrusted participants, or frequent reconfiguration; decentralized control and Byzantine fault tolerance are not Bigtable goals.
Bigtable locality groups realize similar compression and disk read performance benefits observed for other systems that organize data on disk using column-based rather than row-based storage, including C-Store [1, 34] and commercial products such as Sybase IQ [15, 36], SenSage [31], KDB+ [22], and the ColumnBM storage layer in MonetDB/X100 [38].
The manner in which Bigtable uses memtables and SSTables to store updates to tablets is analogous to the way that the Log-Structured Merge Tree [26] stores updates to index data. In both systems, sorted data is buffered in memory before being written to disk, and reads must merge data from memory and disk.
Bigtable’s load balancer has to solve some of the same kinds of load and memory balancing problems faced by shared-nothing databases (e.g., [11, 35]). Our problem is somewhat simpler: (1) we do not consider the possibility of multiple copies of the same data, possibly in alternate forms due to views or indices; (2) we let the user tell us what data belongs in memory and what data should stay on disk, rather than trying to determine this dynamically; (3) we have no complex queries to execute or optimize.