Storage and Retrieval
This chapter examines how databases store the data they are given and how they find it again when asked, in contrast to Chapter 3’s discussion of data models and query languages from the user’s perspective.
Application developers should understand how databases handle storage and retrieval to select an appropriate storage engine and configure it for optimal performance.
The chapter examines two families of storage engines for OLTP: log-structured storage engines and B-trees. These structures are used for key-value storage and secondary indexes.
Storage engines optimized for analytics and indexes for advanced queries are discussed in later chapters.
Storage and Indexing for OLTP
Two bash functions, db_set and db_get, implement a simple key-value store.
#!/bin/bash
db_set() {
echo "$1,$2" >>database
}
db_get() {
grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}

This implements a simple key-value store as a text file in which each line is a comma-separated key-value pair. Updates to a key are appended to the file, and the latest value is retrieved by taking the last line that matches the key.
The db_set function performs well due to the efficiency of appending to a
file, similar to how databases use append-only logs. While real databases handle
more complex issues, the basic principle remains the same.
The term “log” refers to an append-only sequence of records on disk, not limited to human-readable application logs.
The db_get function performs poorly on a large database: it scans the entire file for occurrences of the key, so a lookup takes O(n) time in the number of records.
Indexes are data structures that efficiently locate data in a database by structuring it in a way that facilitates faster searches.
Indexes, derived from primary data, improve query performance but slow down writes due to the need for index updates.
Indexes improve read query performance in storage systems but increase disk space usage and slow down writes. Databases require manual selection of indexes based on application query patterns to optimize performance.
Log-Structured Storage
To speed up reads from an append-only file, a hash map can be used to map keys to the byte offsets of their most recent values.
Appending key-value pairs to a file updates a hash map with the data offset. Lookups use the hash map to find the offset, seek to it, and read the value, potentially avoiding disk I/O if cached.
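This scheme can be sketched in a few lines of Python. This is a toy illustration, not how any production engine is implemented; the class name and file layout are invented for the example:

```python
import os

class HashIndexedLog:
    """Toy append-only log with an in-memory hash index of byte offsets."""

    def __init__(self, path):
        self.path = path
        self.index = {}          # key -> byte offset of the latest value
        open(path, "a").close()  # make sure the file exists

    def set(self, key, value):
        with open(self.path, "a", encoding="utf-8") as f:
            offset = f.tell()            # end of file before this write
            f.write(f"{key},{value}\n")
        self.index[key] = offset         # remember where the latest value lives

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "r", encoding="utf-8") as f:
            f.seek(offset)               # jump straight to the record
            line = f.readline().rstrip("\n")
        return line.split(",", 1)[1]
```

A lookup now does one seek and one read instead of scanning the whole file, and the read may be served from the filesystem cache with no disk I/O at all.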
This approach speeds up reads, but several problems remain:
- Superseded log entries are never removed, so the file keeps growing and can eventually exhaust disk space.
- The hash map must be persisted to disk; otherwise it has to be rebuilt by scanning the whole log on every restart.
- Hash tables must fit in memory due to the difficulty of maintaining them on disk, which requires excessive random access I/O and complex collision handling.
- Range queries are inefficient in hash maps.
The SSTable file format
Sorted String Tables (SSTables), which store key-value pairs sorted by key, are more commonly used than hash tables for database indexes.
SSTable key-value pairs can be grouped into blocks, with the first key of each block stored in a sparse index for efficient querying.
Sparse indexes allow for efficient searching by leveraging sorted keys. For example, to find “handiwork” between “handbag” and “handsome,” one can seek to the offset for “handbag” and scan the file.
Record blocks can be compressed to save disk space and reduce I/O bandwidth, though it increases CPU usage.
Constructing and merging SSTables
The SSTable format improves read performance, but it makes writes harder: a new key cannot simply be inserted into the middle of a sorted file without rewriting the file.
A log-structured approach, combining an append-only log and a sorted file, can solve the problem.
- Add writes to an in-memory ordered map, such as a red-black tree, skip list, or trie, called the memtable.
- When the memtable exceeds a threshold, it is written to disk as an SSTable file, becoming the most recent database segment. The database can continue writing to a new memtable while the old one is written to disk.
- To read a key’s value, search the memtable and most recent on-disk segment first. If not found, continue searching older segments until the key is located or the oldest segment is reached.
- Periodically merge and compact segment files to remove overwritten or deleted values.
Merging segments, similar to mergesort, involves reading input files side by side, copying the lowest key to the output file, and repeating. This process creates a new sorted merged segment file with one value per key.
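The merge step can be sketched with a mergesort-style k-way merge. The helper name and the convention that later arguments are newer segments are invented for this illustration:

```python
import heapq
from itertools import groupby

def merge_segments(*segments):
    """Merge sorted segments (lists of (key, value) pairs) into one.
    Later arguments are assumed newer; on duplicate keys the newest
    value wins, as in LSM compaction."""
    # Tag each pair with the segment's age (lower tag = newer) so that,
    # for equal keys, the newest value sorts first in the merge.
    tagged = [
        [(key, age, value) for key, value in seg]
        for age, seg in enumerate(reversed(segments))
    ]
    merged = heapq.merge(*tagged)        # k-way merge of sorted inputs
    return [
        (key, next(group)[2])            # keep only the newest value
        for key, group in groupby(merged, key=lambda t: t[0])
    ]
```

The output is again sorted by key with one value per key, so it can serve as a new segment file while the inputs are deleted.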
To prevent data loss in the memtable during a crash, the storage engine appends every write to a separate log on disk. This log, which is not sorted by key, is used to restore the memtable and can be discarded after the memtable is written to an SSTable.
Deleting a key-value pair requires appending a tombstone to the data file, which instructs the merging process to discard previous values.
The algorithm, used in RocksDB, Cassandra, ScyllaDB, and HBase, is based on the Log-Structured Merge-tree (LSM-tree) principle. This principle, inspired by Google’s Bigtable paper, involves merging and compacting sorted files.
LSM storage engines write segment files immutably in one pass. Merging and compaction occur in the background, allowing reads from input segments until the merge is complete.
Segment files can be stored on local disks or object storage, as demonstrated by SlateDB and Delta Lake.
Immutable segment files simplify crash recovery by allowing the database to delete unfinished SSTables and restart. Checksums in the log help detect and discard corrupted or incomplete entries.
Bloom filters
LSM storage engines use Bloom filters in segment files to speed up reads by providing a fast way to check for key existence.
A Bloom filter is a bitmap stored in the SSTable alongside the sparse index. Several hash functions are applied to every key in the SSTable, and the bits of the bitmap corresponding to the hash values are set to 1.
Key presence in an SSTable is determined by hashing the key and checking the corresponding bits, which can be done quickly using bitwise operations.
If any query bit is 0, the key is definitely not in the SSTable. If all bits are 1, the key is likely present, but a false positive is possible.
The false positive probability of a Bloom filter depends on the number of bits allocated per key. Allocating 10 bits per key gives roughly a 1% false positive rate, and the rate drops about tenfold for every 5 additional bits per key.
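A minimal Bloom filter can be sketched as follows. SHA-256 is used only to keep the example self-contained; real engines use much faster non-cryptographic hashes:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: a bitmap plus k hash functions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0                      # bitmap packed into an int

    def _positions(self, key):
        # Derive k hash values by prefixing the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos          # set the corresponding bit to 1

    def might_contain(self, key):
        # Any 0 bit means the key is definitely absent; all 1s means
        # it is probably present (false positives are possible).
        return all(self.bits & (1 << pos) for pos in self._positions(key))
```

With 10 bits per key and 7 hash functions this configuration approaches the 1% false positive rate quoted above.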
False positives are not an issue for LSM storage engines:
- Bloom filter indicates key absence, allowing SSTable skipping.
- If the Bloom filter indicates a key’s presence, the sparse index is consulted to verify its existence. A false positive results in unnecessary work, but the search continues with the next segment.
Compaction strategies
LSM storage compaction strategies vary, impacting performance and efficiency.
Size-tiered compaction
Newer, smaller SSTables are merged into older, larger ones, reducing the number of SSTables and improving write throughput. This process requires significant temporary disk space.
Leveled compaction
Leveled compaction keeps SSTable sizes fixed and groups them into increasing levels, allowing for incremental compaction and more efficient reads.
Size-tiered compaction is better for write-heavy workloads, while leveled compaction is better for read-heavy workloads. Most LSM-tree implementations offer various compaction strategies.
LSM-trees use a cascade of SSTables merged in the background, a simple and effective approach.
Embedded databases, like RocksDB and SQLite, are libraries that run within an application, interacting through function calls. They are suitable for mobile apps and backend systems with small, separate datasets.
B-Trees
B-trees are the most widely used structure for reading and writing database records by key.
B-trees, introduced in 1970, are still the standard index implementation in most databases.
B-trees and SSTables both sort key-value pairs for efficient lookups and queries, but differ in design philosophy.
Log-structured indexes use variable-size segments, while B-trees use fixed-size blocks or pages.
Page numbers, similar to pointers, allow referencing between pages stored in the same file, enabling the construction of a page tree.
A B-tree’s root page contains keys and references to child pages, each responsible for a range of keys.
The example in Figure 4-5 demonstrates how to locate key 251 by following page references within the 200-300 range, eventually reaching a leaf page containing the value.
A B-tree’s branching factor, the number of child-page references per page, depends on the page size and the size of the keys and range boundaries; it is typically several hundred.
To update a value in a B-tree, overwrite the leaf page containing the key. To add a new key, find the appropriate page, add the key, and split the page if necessary.
Inserting a key into a full page requires splitting the page and updating the parent page, potentially causing a cascade of splits up to the root.
The B-tree algorithm keeps the tree balanced, ensuring a depth of O(log n) for n keys. Most databases fit in a tree three or four levels deep, so only a few page references need to be followed to find a key.
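The effect of a large branching factor on depth can be checked with a small, illustrative calculation (the helper function is invented for this example):

```python
def btree_depth(num_keys, branching_factor):
    """Smallest depth d such that branching_factor ** d >= num_keys."""
    depth, capacity = 1, branching_factor
    while capacity < num_keys:
        depth += 1
        capacity *= branching_factor
    return depth

# With a branching factor of 500, a four-level tree can address
# 500 ** 4 = 62.5 billion leaf entries.
```

This is why B-tree lookups touch so few pages: even tens of billions of keys need only four levels at a branching factor of 500.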
Making B-trees reliable
B-trees overwrite disk pages with new data, unlike LSM-trees which append and delete files.
Overwriting multiple pages simultaneously, like in a page split, risks database corruption if a crash occurs mid-operation. This can result in orphaned or partially written pages.
B-tree implementations use a write-ahead log (WAL) to ensure database resilience to crashes. The WAL, an append-only file, records modifications before they are applied to the B-tree, allowing for recovery after a crash.
B-tree implementations buffer modified pages in memory before writing to disk, using a write-ahead log to ensure data durability in case of a crash.
Using B-tree variants
Many B-tree variants have been developed over the years.
- Some databases, like LMDB, use a copy-on-write scheme for crash recovery and concurrency control, writing modified pages to new locations and creating new parent pages.
- Abbreviating keys in tree pages saves space and allows for a higher branching factor.
- B-tree implementations aim to store leaf pages sequentially on disk to speed up scans, but maintaining this order becomes challenging as the tree grows.
- Additional pointers can be added to the tree, such as references from each leaf page to its sibling pages, allowing keys to be scanned in order without jumping back to parent pages.
Comparing B-Trees and LSM-Trees
LSM-trees are better for write-heavy applications, while B-trees are faster for reads. However, performance depends on workload, and some storage engines combine characteristics of both.
Read performance
B-trees and LSM storage engines both offer fast read performance, with B-trees having predictable performance and LSM engines benefiting from Bloom filters to reduce disk I/O.
Range queries are efficient on B-trees due to their sorted structure. LSM storage also benefits from sorted SSTables but requires parallel scanning of segments, making range queries more expensive than point queries.
High write throughput can cause latency spikes in log-structured storage engines if the memtable fills up, leading to backpressure and suspended reads and writes.
Modern SSDs, especially NVMe SSDs, can perform many independent read requests in parallel. Both LSM-trees and B-trees can provide high read throughput, but storage engines must be designed to utilize this parallelism.
Sequential versus random writes
B-trees cause scattered disk operations due to random key writes, while log-structured storage engines write larger segment files, reducing random disk access.
Sequential writes, used in LSM-trees, generally have higher throughput than random writes, used in B-trees, due to disk characteristics. This difference is more pronounced on spinning-disk hard drives than on SSDs.
SSDs have higher throughput for sequential writes than random writes due to the way flash memory is read, written, and erased. Random writes require more garbage collection, reducing write bandwidth and increasing wear on the drive.
Write amplification
LSM-trees incur overhead from multiple I/O operations for each write request, including writing to the log, memtable, and during compaction. This overhead can be reduced by storing values separately from keys.
B-tree indexes require data to be written twice, once to the write-ahead log and once to the tree page, and sometimes entire pages are written to ensure recovery after crashes.
Write amplification, the ratio of bytes written to disk to bytes of logical data written to the database, limits write performance in write-heavy applications: the higher the write amplification, the fewer logical writes per second the database can handle on a given disk.
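As a hypothetical numeric example (the figures below are invented, not measurements of any real system):

```python
def write_amplification(logical_gb, wal_gb, flush_gb, compaction_gb):
    """Total bytes written to disk divided by bytes of logical data written."""
    return (wal_gb + flush_gb + compaction_gb) / logical_gb

# Invented figures for an LSM-tree: 1 GB of application writes that also
# go to the WAL, are flushed once as SSTables, and are rewritten 8 GB's
# worth during compaction over their lifetime.
wa = write_amplification(logical_gb=1, wal_gb=1, flush_gb=1, compaction_gb=8)
```

Here every logical gigabyte causes 10 GB of physical disk writes, so the disk’s write bandwidth supports only a tenth as many application writes as it otherwise could.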
LSM-trees generally have lower write amplification than B-trees, making them better suited for write-heavy workloads.
Write amplification impacts SSD throughput and wear.
Write throughput experiments must be long enough to account for write amplification, as new writes compete with compaction for disk bandwidth as the database grows.
Disk space usage
B-trees can become fragmented over time, requiring a background process to optimize page placement and free up space.
LSM-trees experience less fragmentation due to periodic data file rewrites during compaction. SSTables, used in LSM-trees, allow for better compression of key-value pairs, resulting in smaller files compared to B-trees.
Multiple copies of data on disk can complicate data deletion, especially for compliance with data protection regulations. In LSM storage engines, deleted records may persist until tombstones propagate through all compaction levels, potentially taking a long time.
SSTable segment files’ immutability allows for efficient database snapshots, useful for backups and testing, without needing to copy the files.
Multicolumn and Secondary Indexes
Key-value indexes, similar to primary-key indexes, uniquely identify rows, documents, or vertices in databases.
Secondary indexes, created using the CREATE INDEX command, allow searching by
columns other than the primary key in relational databases.
Secondary indexes differ from key-value indexes in that indexed values are not unique, requiring either a list of matching row identifiers or unique entries with appended row identifiers.
Storing Values Within the Index
Index keys are used for queries, with additional data stored depending on the index type.
- Clustered indexes store actual data within the index structure, like MySQL’s InnoDB primary key and SQL Server’s per-table clustered index.
- Data values can be references to the actual data, either as primary keys or direct disk locations. Heap files, used by Postgres, store data in no particular order and may be append-only or overwrite deleted rows.
- Covering indexes store some table columns within the index, allowing some queries to be answered without accessing the heap or primary key, improving query speed but increasing disk space usage and slowing down writes.
Indexes discussed map a single key to a value; for querying multiple columns or fields, see “Multidimensional and Full-Text Indexes” on page 145.
The heap file approach allows a record to be updated in place if the new value is no larger than the old one; a larger value must be relocated, and the indexes updated to point to the new location.
Keeping Everything in Memory
The data structures discussed so far work around the limitations of disks, which require careful data layout for good performance, in exchange for their durability and lower cost per gigabyte compared to RAM.
In-memory databases are feasible for many datasets due to decreasing RAM costs.
Some in-memory databases prioritize durability through hardware, logging, snapshots, or replication, unlike caching-focused stores like Memcached.
In-memory databases reload state from disk or network replicas upon restart. Disk writes, used for durability, allow for backups and analysis.
VoltDB, SingleStore, and Oracle TimesTen are in-memory databases with a relational model, while RAMCloud is an open-source, in-memory key-value store with durability. Redis and Couchbase offer weak durability by writing to disk asynchronously.
In-memory databases are faster than disk-based ones because they avoid the overhead of encoding data for disk storage, not because they avoid disk reads.
In-memory databases, like Redis, offer data models difficult to implement with disk-based indexes, such as priority queues and sets, due to their in-memory storage.
Data Storage for Analytics
Data warehouses commonly use relational data models and SQL for analytical queries, supported by graphical data analysis tools.
Data warehouses and relational OLTP databases, while both using SQL, are optimized for different query patterns and workloads.
HTAP databases, combining transaction processing and data warehousing, are evolving into separate storage and query engines accessible through a common SQL interface.
Cloud Data Warehouses
Cloud-only data warehouses like BigQuery, Redshift, and Snowflake leverage scalable cloud infrastructure, unlike traditional on-premises data warehouses.
Cloud data warehouses integrate well with other cloud services, offering automatic log ingestion and easy integration with data processing frameworks. They are also more elastic due to their decoupled architecture, allowing independent adjustment of storage and compute resources.
Open source data warehouses like Hive, Trino, and Spark have evolved with the cloud, separating components previously integrated in a single system.
Query engine
Query engines like Trino, DataFusion, and Presto parse and optimize SQL queries for execution, often using parallel, distributed processing.
Storage format
Storage format determines how table rows are encoded as bytes in a file, allowing access by the query engine and other applications. Examples include Parquet, ORC, Lance, and Nimble.
Table format
Table formats like Apache Iceberg and Delta support row inserts and deletions in immutable Parquet files, offering features like time travel and transactions.
Data catalog
Data catalogs, like Snowflake’s Polaris and Databricks’s Unity Catalog, define tables within a database and are used to create, rename, and drop tables. Unlike integrated catalogs and query engines, decoupled catalogs enable data discovery and governance systems to access metadata.
Column-Oriented Storage
Data warehouses typically use a relational schema with a large fact table referencing dimension tables. Efficient storage and querying of trillions of rows in fact tables with petabytes of data is challenging.
Fact tables in data warehouses are typically wide, but queries usually access only a few columns. For example, a query accessing sales data might only need date, product, and quantity columns.
Column-oriented storage stores values from each column together, allowing queries to read only relevant columns, saving work.
Column storage, applicable to both relational and nonrelational data, is exemplified by Parquet, a columnar format supporting document data models.
Column-oriented storage layout stores rows in the same order within each column, allowing for easy reassembly of entire rows.
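The layout difference can be illustrated in Python. The table and column names below are hypothetical, loosely modeled on a sales fact table:

```python
# Row-oriented layout: all values of one row stored together.
rows = [
    {"date_key": 140102, "product_sk": 69, "quantity": 1},
    {"date_key": 140102, "product_sk": 74, "quantity": 3},
    {"date_key": 140103, "product_sk": 69, "quantity": 5},
]

# Column-oriented layout: one array per column, with rows kept in the
# same order in every column, so row i is position i of each array.
columns = {
    name: [row[name] for row in rows]
    for name in ("date_key", "product_sk", "quantity")
}

def reassemble(i):
    """Rebuild row i from the columnar layout."""
    return {name: values[i] for name, values in columns.items()}
```

A query that only needs `quantity` can read `columns["quantity"]` and ignore the other arrays entirely, which is the saving column-oriented storage provides.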
Columnar storage engines store data in blocks of rows, with each block containing values from a specific timestamp range. This allows queries to load only the necessary columns from relevant blocks.
Columnar storage is widely used in analytical databases, including cloud data warehouses, embedded databases, and product analytics systems. It is utilized in storage formats like Parquet and ORC, as well as in-memory analytics formats like Apache Arrow.
Column compression
Column-oriented storage can be compressed to reduce disk and network demands.
Figure 4-7 shows value sequences with repetition, indicating potential for compression. Bitmap encoding, effective in data warehouses, is illustrated in Figure 4-8.
Sparse bitmaps, containing many 0s, can be run-length encoded to improve storage efficiency. Roaring bitmaps offer a technique that switches between bitmap representations for optimal compactness.
Bitmap indexes are well-suited for common data warehouse queries:
WHERE product_sk IN (31, 68, 69)
Load three bitmaps and calculate their bitwise OR.
WHERE product_sk = 30 AND store_sk = 3
Load bitmaps for product_sk and store_sk, calculate bitwise AND.
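Both query shapes can be sketched using Python integers as bitmaps; the column data below is invented:

```python
def bitmap(column, value):
    """One bit per row: bit i is 1 where column[i] equals the value."""
    bits = 0
    for i, v in enumerate(column):
        if v == value:
            bits |= 1 << i
    return bits

def matching_rows(bits, num_rows):
    """Row positions whose bit is set."""
    return [i for i in range(num_rows) if bits & (1 << i)]

# Hypothetical column data for seven rows:
product_sk = [69, 30, 31, 68, 30, 69, 31]
store_sk   = [3,  3,  1,  3,  2,  3,  3]

# WHERE product_sk IN (31, 68, 69): OR the three bitmaps together.
in_query = bitmap(product_sk, 31) | bitmap(product_sk, 68) | bitmap(product_sk, 69)

# WHERE product_sk = 30 AND store_sk = 3: AND the two bitmaps.
and_query = bitmap(product_sk, 30) & bitmap(store_sk, 3)
```

Each condition reduces to cheap bitwise operations over the bitmaps, which is why this encoding suits data warehouse queries so well.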
Bitmaps can answer graph queries, like finding users who follow and are followed by specific users.
Wide-column databases, despite their name, are row-oriented and store all values from a row together, unlike column-oriented databases.
Sort order in column storage
Column store row order is flexible, allowing for insertion order or imposed order for indexing.
Sorting columns independently would lose row information, as items in different columns are only associated by their shared row position.
Data should be sorted by row, even when stored by column, based on the administrator’s knowledge of common queries. Sorting by frequently queried columns, like date_key for date range queries, can improve query performance.
A second sort key can be used to determine the order of rows with identical values in the first column, such as grouping sales by product within a date range.
Sorted order aids column compression, especially when the primary sort column has few distinct values, allowing for efficient run-length encoding.
Sorting the first column has the strongest compression effect, while subsequent columns compress less well.
Writing to column-oriented storage
Column-oriented storage, compression, and sorting improve read query performance in data warehouses.
Writing an individual row to columnar storage is expensive, because compressed column files must be rewritten; bulk writes amortize that cost, so columnar systems are typically loaded in large batches.
Log-structured approach batches writes to an in-memory store, then merges them with disk files for bulk writing, making object storage ideal.
Query execution engines combine column data on disk with recent writes in memory, making data modifications immediately visible in subsequent queries.
Query Execution: Compilation and Vectorization
SQL query plans are optimized by query planners for parallel execution across machines.
The query engine performs operations on column values, such as finding rows matching specific criteria or comparing values. It also examines multiple columns for the same row to fulfill queries.
Data warehouse queries often scan millions of rows, so executing them efficiently is crucial. Two approaches have emerged to avoid the overhead of naively interpreting the query plan one row at a time.
Query compilation
The query engine generates code to execute SQL queries, iterating over rows, performing comparisons, and copying values to an output buffer if conditions are met. The generated code is compiled to machine code and executed on loaded column-encoded data.
Vectorized processing
Queries are interpreted and optimized by processing column values in batches using predefined operators. This allows for efficient filtering, such as finding sales of bananas in a specific store.
Two approaches, differing in implementation, achieve good performance by leveraging modern CPU characteristics like sequential memory access, tight inner loops, parallelism, and direct operation on compressed data.
Materialized Views and Data Cubes
Materialized views are table-like objects storing query results on disk, unlike virtual views which are query shortcuts.
Materialized views require updates when underlying data changes, impacting write performance. However, they can enhance read performance for repeated queries.
Materialized aggregates, a type of materialized view, cache frequently used aggregate functions like COUNT and SUM in data warehouses, improving query performance.
A two-dimensional table can be created with dates and products as axes, where each cell contains the aggregate of a fact attribute for a specific date-product combination. Applying the same aggregate along rows or columns reduces the summary by one dimension.
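A two-dimensional cube and its reduced summaries can be sketched as follows (the dates, products, and amounts are invented):

```python
# Hypothetical data cube: net sales aggregated by (date, product).
dates = ["2024-01-01", "2024-01-02"]
products = ["p1", "p2", "p3"]
cube = {
    ("2024-01-01", "p1"): 149.60, ("2024-01-01", "p2"): 31.01, ("2024-01-01", "p3"): 84.58,
    ("2024-01-02", "p1"): 28.18,  ("2024-01-02", "p2"): 57.25, ("2024-01-02", "p3"): 0.00,
}

def sales_by_date(date):
    """Collapse the product dimension: total sales on one date."""
    return sum(cube[(date, p)] for p in products)

def sales_by_product(product):
    """Collapse the date dimension: total sales of one product."""
    return sum(cube[(d, product)] for d in dates)
```

Summing along a row or column of the cube answers a one-dimensional query without touching the underlying fact table.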
Facts often have multiple dimensions, as illustrated in Figure 3-5 with five dimensions: date, product, store, promotion, and customer.
Materialized data cubes enable fast queries by precomputing data, eliminating the need to scan large datasets.
Data cubes lack flexibility compared to querying raw data, limiting calculations like the proportion of sales from items costing over $100. Data warehouses retain raw data and use data cubes for performance optimization.
Multidimensional and Full-Text Indexes
B-trees and LSM-trees enable range queries on a single attribute, but searching by a single attribute may be insufficient.
Concatenated indexes, the most common type of multicolumn index, combine fields into one key for efficient searching based on the specified order of concatenation.
Multidimensional indexes are crucial for querying multiple columns simultaneously, especially for geospatial data like restaurant locations on a map. A concatenated index over latitude and longitude columns is inefficient for two-dimensional range queries.
SELECT * FROM restaurants WHERE latitude > 51.4946 AND latitude < 51.5079
AND longitude > -0.1162 AND longitude < -0.1004;

Spatial indexes, such as R-trees or Bkd-trees, are commonly used to organize and query multidimensional data efficiently, grouping nearby data points together.
Multidimensional indexes, beyond geographic locations, enable efficient searches in various datasets. For example, a three-dimensional index on color dimensions can search for products within a specific color range, while a two-dimensional index on date and temperature can efficiently search for weather observations within a specific temperature range.
Full-Text Search
Full-text search enables keyword searches within text documents, but challenges like language-specific processing, word matching, and synonyms are beyond this book’s scope.
Full-text search is a multidimensional query where each word is a dimension. A document’s value is 1 if it contains the term, otherwise 0.
Search engines use an inverted index, a key-value structure, to answer queries. The key is a term, and the value is a list of document IDs containing the term.
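An inverted index and a two-term AND query can be sketched in a few lines of Python (the documents are hypothetical):

```python
def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, []).append(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search_and(index, x, y):
    """Documents containing both terms: intersect the two postings lists."""
    return sorted(set(index.get(x, [])) & set(index.get(y, [])))

docs = {
    1: "red apples and red grapes",
    2: "green apples",
    3: "red berries",
}
index = build_inverted_index(docs)
```

Each postings list plays the role of a bitmap over documents; intersecting two lists corresponds to the bitwise AND described next.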
Finding documents containing both terms x and y is similar to a vectorized data warehouse query, efficiently computing the bitwise AND of their bitmaps.
Lucene, used by Elasticsearch and Solr, stores term-to-postings list mappings in sorted files merged in the background. PostgreSQL’s GIN index type also uses postings lists for full-text search in JSON documents.
An alternative to word-based text processing is using n-grams, such as trigrams, to build an inverted index for substring searches. While trigram indexes enable regular expression searches, they are large in size.
Lucene uses a Levenshtein automaton to efficiently search for words within a given edit distance, similar to a trie.
Vector Embeddings
Semantic search, crucial for AI applications like retrieval-augmented generation, understands document concepts and user intentions beyond synonyms and typos.
Semantic search indexes use embedding models, often LLMs, to translate documents into vector embeddings. These embeddings, representing documents in multidimensional space, are similar for semantically similar documents.
Vectors in semantic search are arrays of floating-point numbers representing locations in multidimensional space, unlike the batch of bits processed in vectorized processing.
Three-dimensional vector embeddings illustrate the proximity of Wikipedia pages, such as agriculture and vegetables, compared to unrelated topics like star schemas.
Embedding models use large vectors to represent locations in abstract multidimensional space. Search engines use distance functions like cosine similarity and Euclidean distance to measure vector distances.
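Cosine similarity is straightforward to compute. The three-dimensional vectors below are invented to echo the agriculture/vegetable example; real embeddings have hundreds or thousands of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors:
    1.0 means same direction, 0.0 means orthogonal (unrelated)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented 3-dimensional embeddings: "agriculture" and "vegetable"
# point in similar directions; "star schema" does not.
agriculture = [0.10, 0.22, 0.11]
vegetable   = [0.13, 0.19, 0.12]
star_schema = [0.82, 0.39, -0.74]
```

With these numbers, agriculture is far closer to vegetable than to star schema, matching the intuition that semantically related documents get nearby embeddings.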
Early embedding models like Word2Vec, BERT, and GPT were text-based neural networks. Researchers later developed models for video, audio, and images, and now multimodal models generate embeddings for multiple modalities.
Semantic search engines use embedding models to generate vector embeddings for user queries, which are then used to find similar documents in a vector index.
Vector indexes store document embeddings for querying similar documents.
Flat indexes
Flat indexes store the vectors as-is; queries are exact but slow, because the query vector must be compared with every vector in the index.
Inverted file (IVF) indexes
Inverted file (IVF) indexes cluster vectors into partitions and search only the partitions nearest the query vector, giving a faster but approximate search. Probing more partitions increases accuracy but slows down the search.
Hierarchical Navigable Small World (HNSW) indexes
HNSW indexes use multiple graph layers to represent vector spaces, starting with a small top layer and moving to denser layers to find similar vectors.
Popular vector databases like Faiss and pgvector implement IVF and HNSW indexes.