Database
Alongside operating systems, compilers, and large language models, the database is arguably one of the most fascinating pieces of software ever built. It packs an extraordinary density of CS subfields into a single system, from relational algebra and B-tree indexing to write-ahead logging, multiversion concurrency control, query optimisation (which is essentially a compiler problem), and distributed consensus. Nearly every non-trivial application depends on one.
- https://www.youtube.com/watch?v=W2Z7fbCLSTw
I
1.1. Relational Model
Edgar Codd’s relational model (1970) formalised data management around the concept of relations, which are sets of tuples over named attributes. In practice, a relation maps to a table, a tuple to a row, and an attribute to a column. Each table is defined by a schema that specifies column names, data types, and constraints. Primary keys uniquely identify rows within a table, while foreign keys enforce referential integrity between tables by requiring that a value in one table correspond to an existing primary key in another.
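A minimal sketch of these ideas using Python’s built-in sqlite3 module; the author and book tables and all values are invented for illustration, and note that SQLite only enforces foreign keys once the corresponding pragma is enabled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")        # SQLite enforces FKs only when this is on

conn.execute("""
    CREATE TABLE author (
        id   INTEGER PRIMARY KEY,               -- primary key: uniquely identifies a row
        name TEXT NOT NULL
    )""")
conn.execute("""
    CREATE TABLE book (
        id        INTEGER PRIMARY KEY,
        title     TEXT NOT NULL,
        author_id INTEGER NOT NULL REFERENCES author(id)   -- foreign key into author
    )""")

conn.execute("INSERT INTO author (id, name) VALUES (1, 'E. F. Codd')")
conn.execute("INSERT INTO book VALUES (1, 'A Relational Model of Data', 1)")

try:
    # Referential integrity: author 99 does not exist, so the insert is rejected.
    conn.execute("INSERT INTO book VALUES (2, 'Orphaned Book', 99)")
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```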
Normalisation decomposes tables to eliminate redundancy and update anomalies, with successive normal forms (1NF through BCNF) imposing progressively stricter constraints on functional dependencies. Denormalisation deliberately reintroduces redundancy to optimise read performance for specific query patterns, trading storage and write complexity for faster lookups. The tension between the two is a recurring decision in schema design.
SQL (Structured Query Language) serves as the declarative interface to relational databases. The programmer specifies what data to retrieve or modify (SELECT, INSERT, UPDATE, DELETE) and the database engine determines how to execute the request efficiently. This separation of specification from execution is what enables query optimisation, as the engine can choose among multiple equivalent execution strategies without changing the semantics of the query.
1.2. Storage Engine
A database’s storage engine manages how data is physically laid out on disk and brought into memory. Data is organised into fixed-size pages (typically 4-16 KB), which serve as the unit of I/O between disk and the buffer pool, an in-memory cache of recently accessed pages. The buffer pool interacts with the OS page cache but is managed independently by the database to implement its own eviction policies (e.g. LRU with clock sweep in PostgreSQL) and ensure transactional safety.
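As a rough illustration of buffer-pool bookkeeping, the toy class below caches pages by id and evicts with a clock sweep, a common LRU approximation; real buffer managers also track dirty pages, pin counts, and WAL ordering, all omitted here. The `read_page_from_disk` callback is an assumed stand-in for the I/O path.

```python
class BufferPool:
    """Toy buffer pool: a fixed number of frames plus clock-sweep eviction."""

    def __init__(self, capacity, read_page_from_disk):
        self.capacity = capacity
        self.read_page_from_disk = read_page_from_disk  # fallback I/O, assumed provided
        self.frames = {}       # page_id -> page contents
        self.ref_bit = {}      # page_id -> recently-used flag
        self.clock = []        # resident pages in clock order
        self.hand = 0

    def get_page(self, page_id):
        if page_id in self.frames:              # buffer hit: no I/O
            self.ref_bit[page_id] = True
            return self.frames[page_id]
        if len(self.frames) >= self.capacity:   # buffer full: pick a victim
            self._evict()
        page = self.read_page_from_disk(page_id)
        self.frames[page_id] = page
        self.ref_bit[page_id] = True
        self.clock.append(page_id)
        return page

    def _evict(self):
        # Sweep the hand, clearing reference bits until an unreferenced page is found.
        while True:
            victim = self.clock[self.hand % len(self.clock)]
            if self.ref_bit[victim]:
                self.ref_bit[victim] = False
                self.hand += 1
            else:
                del self.frames[victim], self.ref_bit[victim]
                self.clock.remove(victim)
                return
```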
Within pages, data can be arranged in two fundamentally different orientations. Row-oriented (NSM, N-ary Storage Model) storage places all columns of a row contiguously, which is efficient for OLTP (Online Transaction Processing) workloads that read or write individual records. Column-oriented (DSM, Decomposition Storage Model) storage groups values of the same column together, enabling aggressive compression and fast aggregation over large datasets, which is the foundation of OLAP (Online Analytical Processing) systems and data warehouses.
A heap file stores rows in insertion order with no particular structure, making inserts fast but forcing a full-table scan for any lookup that no index can serve. Clustered indexes (e.g. InnoDB’s primary key index) physically order rows on disk according to the index key, so range queries on the clustering key access contiguous pages. Understanding this physical layout is essential for reasoning about query performance.
1.3. Indexing
An index is an auxiliary data structure that accelerates lookups by mapping search keys to the locations of matching rows. The dominant indexing structure in relational databases is the B+ tree, a balanced, self-adjusting tree where all values reside in leaf nodes connected by sibling pointers. With a branching factor $b$ and $n$ keys, a B+ tree has height $O(\log_b n)$, and since $b$ is typically in the hundreds (one node per page), even tables with billions of rows require only 3-4 levels of traversal.
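A quick back-of-the-envelope check of that claim; the branching factor of 500 is an assumed figure for roughly one node per 8 KB page.

```python
import math

def btree_height(n_keys, branching_factor):
    # Height of a B+ tree is logarithmic in the key count, base = branching factor.
    return math.ceil(math.log(n_keys, branching_factor))

print(btree_height(1_000_000_000, 500))   # -> 4 levels for a billion keys
```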
Hash indexes provide $O(1)$ average-case point lookups but do not support range queries or ordered iteration. Composite indexes on multiple columns $(c_1, c_2, \ldots, c_k)$ satisfy queries that filter on a prefix of the indexed columns, following the leftmost prefix rule. A covering index includes all columns needed by a query, allowing the engine to answer entirely from the index without accessing the base table (an index-only scan).
The query planner chooses between an index scan and a sequential scan based on estimated selectivity. For queries that match a small fraction of rows, the index avoids reading irrelevant pages. For queries that touch a large fraction, a sequential scan is cheaper because it reads pages contiguously rather than following random pointers. The crossover point depends on table size, index structure, and the ratio of random to sequential I/O cost on the storage medium.
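The decision can be caricatured with a two-constant cost model; the constants below mirror PostgreSQL’s default seq_page_cost (1.0) and random_page_cost (4.0), and the one-random-read-per-matching-row assumption is a deliberate simplification.

```python
def choose_scan(total_pages, rows_per_page, selectivity):
    matching_rows = total_pages * rows_per_page * selectivity
    seq_cost = total_pages * 1.0            # every page, read contiguously
    index_cost = matching_rows * 4.0        # one random page fetch per matching row
    return "index scan" if index_cost < seq_cost else "sequential scan"

print(choose_scan(10_000, 100, 0.001))      # 1,000 matches  -> index scan
print(choose_scan(10_000, 100, 0.05))       # 50,000 matches -> sequential scan
```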
II
2.1. Query Optimisation
When a SQL statement arrives, the database processes it through a pipeline analogous to a compiler. The parser validates syntax and produces an abstract syntax tree. The binder resolves table and column names against the catalogue. The logical planner translates the AST into a tree of relational algebra operators (select, project, join, aggregate), then applies equivalence-preserving transformations such as predicate pushdown, join reordering, and projection pruning to reduce intermediate result sizes.
The physical planner maps each logical operator to a concrete algorithm and estimates the cost of each candidate plan using statistics maintained on tables and columns. Cardinality estimation, often based on histograms and selectivity heuristics, is notoriously imprecise, and errors compound multiplicatively across joins. Despite decades of research, the quality of the cardinality estimator remains the single largest determinant of plan quality in practice.
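A sketch of the kind of estimate such statistics support: an equi-width histogram with invented bucket boundaries and counts, interpolating linearly inside the bucket that straddles the predicate.

```python
def estimate_selectivity(histogram, total_rows, predicate_value):
    """Estimate the fraction of rows satisfying column < predicate_value."""
    covered = 0.0
    for low, high, count in histogram:          # (bucket_low, bucket_high, row_count)
        if high <= predicate_value:
            covered += count                    # whole bucket qualifies
        elif low < predicate_value:
            fraction = (predicate_value - low) / (high - low)
            covered += count * fraction         # interpolate within the bucket
    return covered / total_rows

hist = [(0, 25, 400), (25, 50, 300), (50, 75, 200), (75, 100, 100)]
print(estimate_selectivity(hist, 1000, 60))     # -> 0.78
```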
Join algorithms illustrate the range of physical operators. A nested loop join iterates over the outer relation and probes the inner relation for each row, with $O(n \cdot m)$ cost in general, though an index on the inner side reduces each probe to a logarithmic lookup. A hash join builds a hash table on the smaller relation and probes it with the larger, achieving $O(n + m)$ expected cost. A sort-merge join sorts both relations on the join key and merges them in a single pass, performing well when inputs are already sorted or when the result must be ordered.
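A minimal hash equi-join to make the two phases concrete; rows are plain dicts and the example data is invented.

```python
def hash_join(build_rows, probe_rows, build_key, probe_key):
    table = {}
    for row in build_rows:                       # build phase: O(n) over the smaller input
        table.setdefault(row[build_key], []).append(row)
    for row in probe_rows:                       # probe phase: O(m) over the larger input
        for match in table.get(row[probe_key], []):
            yield {**match, **row}               # merge the two matching rows

authors = [{"author_id": 1, "name": "Codd"}, {"author_id": 2, "name": "Gray"}]
books = [{"book_id": 10, "author_id": 1, "title": "A Relational Model of Data"}]
print(list(hash_join(authors, books, "author_id", "author_id")))
```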
2.2. Execution
The classical execution model is the volcano (or iterator) model, where each operator implements an open(), next(), close() interface. Operators are composed into a tree, and calling next() on the root pulls one tuple at a time through the pipeline. This model is simple, composable, and supports pipelining (producing output before consuming all input), but incurs high per-tuple function call overhead.
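A toy rendition of that interface, with a scan feeding a filter; `None` stands in for end-of-stream, and real operators would add projection, joins, and resource management on close().

```python
class SeqScan:
    def __init__(self, rows): self.rows = rows
    def open(self): self.i = 0
    def next(self):
        if self.i >= len(self.rows):
            return None                          # end of stream
        row = self.rows[self.i]; self.i += 1
        return row
    def close(self): pass

class Filter:
    def __init__(self, child, predicate): self.child, self.predicate = child, predicate
    def open(self): self.child.open()
    def next(self):
        # Pull tuples from the child until one satisfies the predicate.
        while (row := self.child.next()) is not None:
            if self.predicate(row):
                return row
        return None
    def close(self): self.child.close()

plan = Filter(SeqScan([{"id": 1, "qty": 5}, {"id": 2, "qty": 40}]), lambda r: r["qty"] > 10)
plan.open()
while (row := plan.next()) is not None:
    print(row)                                   # only the qty > 10 row reaches the root
plan.close()
```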
Vectorised execution (e.g. DuckDB, ClickHouse) amortises this overhead by processing batches of tuples (typically 1024-4096) per next() call, enabling SIMD instructions and better cache utilisation. Some engines go further with JIT compilation, generating specialised code for hot query paths and eliminating interpreter overhead: PostgreSQL (11+) compiles expressions via LLVM, and Apache Spark emits specialised Java code through whole-stage code generation. Parallel query execution partitions work across multiple threads or processes, with the exchange operator redistributing intermediate tuples between parallel pipeline segments.
III
3.1. ACID
A transaction groups one or more operations into a logical unit of work that the database guarantees will be executed according to ACID properties. Atomicity ensures that a transaction either completes entirely or has no effect, even if the system crashes midway. Consistency guarantees that a transaction transforms the database from one valid state to another, respecting all defined constraints. Isolation ensures that concurrent transactions do not observe each other’s intermediate states. Durability guarantees that once a transaction commits, its effects persist even through power failures or crashes.
These properties are not free. Enforcing strict isolation reduces concurrency, and guaranteeing durability requires synchronous disk writes. The art of database engineering lies in providing the strongest guarantees possible while minimising their performance cost, which motivates the mechanisms in the following sections.
3.2. Concurrency Control
Isolation levels relax the strictness of the isolation guarantee to permit greater concurrency. Read Uncommitted allows dirty reads (seeing uncommitted writes). Read Committed prevents dirty reads but permits non-repeatable reads (a row changes between two reads within the same transaction). Repeatable Read prevents both but may allow phantom reads (new rows appearing in a range query). Serialisable, the strongest level, guarantees that the outcome is equivalent to some serial execution of the transactions.
Two-phase locking (2PL) is the classical approach, where a transaction acquires locks before accessing data (growing phase) and releases them only after committing or aborting (shrinking phase). This guarantees serialisability but can cause deadlocks and reduces concurrency. Multiversion concurrency control (MVCC), used by PostgreSQL, MySQL/InnoDB, and Oracle, takes a fundamentally different approach. Each write creates a new version of the row rather than overwriting in place, and each transaction sees a consistent snapshot of the database as of its start time. Readers never block writers and writers never block readers, which dramatically improves throughput for read-heavy workloads.
In PostgreSQL, MVCC is implemented by stamping each row version with xmin (the creating transaction) and xmax (the deleting transaction); a transaction decides which versions it can see by checking these fields against its snapshot, i.e. the set of transactions that had committed when the snapshot was taken. Dead versions are reclaimed by the VACUUM process. MySQL/InnoDB implements MVCC differently, using undo logs to reconstruct older versions on demand rather than storing multiple physical copies.
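A simplified visibility check in the spirit of those rules; it treats transaction ids as monotonically increasing and ignores in-progress transactions, subtransactions, and hint bits, all of which the real logic must handle.

```python
def visible(version, snapshot_xid, committed):
    """A version is visible if created by a committed transaction at or before the
    snapshot, and not deleted by such a transaction."""
    created = version["xmin"] in committed and version["xmin"] <= snapshot_xid
    deleted = (version["xmax"] is not None
               and version["xmax"] in committed
               and version["xmax"] <= snapshot_xid)
    return created and not deleted

versions = [
    {"value": "v1", "xmin": 100, "xmax": 105},   # superseded by transaction 105
    {"value": "v2", "xmin": 105, "xmax": None},  # current version
]
committed = {100, 105}
# A snapshot taken between the two writes still sees the old version.
print([v["value"] for v in versions if visible(v, snapshot_xid=102, committed=committed)])  # ['v1']
```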
3.3. Write-Ahead Logging
Write-ahead logging (WAL) is the mechanism that provides atomicity and durability. Before any modification is applied to the data pages, a log record describing the change is written to a sequential, append-only log file on stable storage. Because the log is written sequentially, it is far cheaper than random writes to data pages, and the database can batch (or “group commit”) multiple transactions’ log records into a single fsync() call.
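The write-ahead rule itself fits in a few lines: append and fsync the log record, then mutate the data. The file name and JSON record format below are invented, and group commit and checkpointing are left out.

```python
import json, os

class WAL:
    def __init__(self, path="wal.log"):
        self.f = open(path, "ab")

    def _append(self, record):
        self.f.write((json.dumps(record) + "\n").encode())
        self.f.flush()
        os.fsync(self.f.fileno())            # force the record to stable storage

    def log_write(self, txid, key, new_value, table):
        self._append({"txid": txid, "key": key, "new": new_value})  # log first...
        table[key] = new_value                                      # ...data second

    def commit(self, txid):
        self._append({"txid": txid, "commit": True})                # commit record

table = {}
wal = WAL()
wal.log_write(txid=1, key="balance:alice", new_value=90, table=table)
wal.commit(txid=1)
```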
During normal operation, modified (“dirty”) pages accumulate in the buffer pool and are flushed to disk lazily by a background writer. A checkpoint periodically forces all dirty pages to disk and records the log position at which recovery can begin, bounding the amount of log that must be replayed after a crash.
On crash recovery, the database replays the WAL from the last checkpoint forward. The redo phase reapplies all committed changes that may not have reached the data files. The undo phase rolls back any transactions that were in progress at the time of the crash. After recovery completes, the database is in a consistent state reflecting exactly the set of committed transactions, regardless of when the crash occurred. This is the database analogue of journaling in file systems, but with the additional complexity of transaction boundaries and concurrent modifications.
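A matching recovery sketch over the toy log above: one pass collects commit records, a second replays only committed changes, so uncommitted work is simply never applied (standing in for the undo phase). A real recovery starts from the last checkpoint rather than the beginning of the log.

```python
import json

def recover(wal_path="wal.log"):
    with open(wal_path) as f:
        records = [json.loads(line) for line in f]
    committed = {r["txid"] for r in records if r.get("commit")}
    table = {}
    for rec in records:
        if not rec.get("commit") and rec["txid"] in committed:
            table[rec["key"]] = rec["new"]   # redo: reapply the committed change
    return table
```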
IV
4.1. Distributed Databases
Scaling beyond a single machine requires distributing data and computation across nodes. Replication copies data to multiple nodes for fault tolerance and read throughput. Leader-follower (primary-replica) replication routes all writes through a single leader that propagates changes to followers, which serve read queries. This is simple but creates a single point of write contention. Multi-leader and leaderless (e.g. Dynamo-style) replication admit concurrent writes at multiple nodes, introducing the complexity of conflict detection and resolution.
Partitioning (or sharding) divides a dataset across nodes so that each node is responsible for a subset of the data. Range partitioning assigns contiguous key ranges to each node, supporting efficient range queries but risking hot spots if access patterns are skewed. Hash partitioning distributes keys uniformly but sacrifices range query locality. Consistent hashing (e.g. DynamoDB, Cassandra) minimises data movement when nodes join or leave the cluster. Distributed transactions across partitions require coordination protocols such as two-phase commit (2PC), which guarantees atomicity at the cost of blocking if the coordinator fails.
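A toy consistent-hashing ring with virtual nodes shows why only neighbouring keys move when membership changes; the hash function, node names, and virtual-node count are arbitrary choices for illustration.

```python
import bisect, hashlib

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        # Place several virtual positions per physical node to smooth the distribution.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        # First ring position clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self.keys, self._hash(key)) % len(self.keys)
        return self.ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))   # removing one node only reassigns keys near its positions
```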
4.2. Modern Landscape
The relational model is not universally optimal. Key-value stores (e.g. Redis, DynamoDB) provide sub-millisecond lookups on single keys. Document databases (e.g. MongoDB) store schema-flexible JSON-like documents, well suited to hierarchical data and rapid iteration. Wide-column stores (e.g. Cassandra, HBase) partition data by row key and column family for high write throughput. Graph databases (e.g. Neo4j) model data as nodes and edges, enabling efficient traversal of relationship-heavy datasets. The choice depends on access patterns, consistency requirements, and operational constraints.
NewSQL systems (e.g. CockroachDB, Google Spanner, TiDB) attempt to combine the scalability of NoSQL with the transactional guarantees of traditional relational databases, typically layering distributed transactions over a consensus protocol (Raft in CockroachDB and TiDB, Paxos in Spanner) and ordering them with hybrid logical clocks or, in Spanner’s case, the TrueTime clock API. On the analytical side, the distinction between OLTP and OLAP workloads has driven architectural divergence. Data warehouses (e.g. Snowflake, BigQuery, Redshift) optimise for large-scale aggregations over column-oriented storage, while HTAP (Hybrid Transactional/Analytical Processing) systems aim to serve both workloads from a single engine.
At the storage engine level, LSM trees (Log-Structured Merge trees, used by RocksDB, Cassandra, LevelDB) offer superior write throughput by buffering writes in memory and periodically merging sorted runs to disk, at the cost of read amplification from checking multiple levels. B+ trees offer more predictable read latency but slower writes due to random I/O for page updates. Many modern systems (e.g. TiDB, CockroachDB) layer SQL query processing on top of an LSM-based storage engine like RocksDB.
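A deliberately tiny LSM sketch: writes buffer in a memtable that is flushed as an immutable sorted run, and a read may have to consult the memtable plus every run from newest to oldest, which is the read amplification referred to above. Compaction, bloom filters, and on-disk persistence are omitted.

```python
class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable, self.runs, self.limit = {}, [], memtable_limit

    def put(self, key, value):
        self.memtable[key] = value               # all writes land in memory first
        if len(self.memtable) >= self.limit:
            # Flush: freeze the memtable as an immutable sorted run.
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):          # newest run wins
            for k, v in run:
                if k == key:
                    return v
        return None

lsm = TinyLSM()
for i in range(10):
    lsm.put(f"k{i}", i)
print(lsm.get("k3"))
```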
4.3. ML and Databases
Modern ML pipelines interact with databases at multiple points. Training data is often stored in relational databases or data warehouses and extracted via SQL queries or batch export. Feature stores (e.g. Feast, Tecton) manage the lifecycle of ML features, ensuring consistency between training-time and serving-time feature computation. At serving time, models may query databases for real-time features (e.g. user history, product metadata) as part of the inference pipeline.
Vector search libraries and databases (e.g. FAISS, Pinecone, Milvus, Weaviate) and vector extensions to existing databases (e.g. pgvector for PostgreSQL) have emerged to support approximate nearest neighbour (ANN) search over high-dimensional embedding spaces. These systems index vectors using structures such as HNSW (Hierarchical Navigable Small World graphs) or IVF (Inverted File Index) with product quantisation, enabling sub-linear similarity search over billions of vectors. This capability underpins retrieval-augmented generation (RAG), recommendation systems, and semantic search, bridging the gap between traditional data management and modern AI applications.
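For scale, the brute-force baseline that ANN indexes approximate is a single matrix-vector product over normalised embeddings; the corpus size and dimensionality below are made-up numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.normal(size=(10_000, 384))                      # 10k document embeddings
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)      # normalise for cosine similarity

def top_k(query, k=5):
    query = query / np.linalg.norm(query)
    scores = corpus @ query                                   # cosine similarity as a dot product
    return np.argsort(-scores)[:k]                            # indices of the k closest vectors

print(top_k(rng.normal(size=384)))
```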
(C:)
I gathered these words solely for my own purposes, without any intention of compromising the rigour of the subjects.
Well, I prefer eating corn in a spiral.