Chapter 1: B-Tree and B+Tree — The Mathematical Skeleton of Indexes
1. Definition
A B-Tree (Balanced Multi-way Search Tree) is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time. It was defined by Rudolf Bayer and Edward M. McCreight in 1972.
A B+Tree is a variant of the B-Tree with the following characteristics:
- Internal nodes do not store data: They only store keys to act as indexes.
- Linked Leaf Nodes: All data is stored in the leaf nodes, which are linked together (often singly or doubly) to facilitate range scans.
2. Technical Depth: Why B+Tree instead of Binary Trees?
The primary reason database indexes use B+Trees rather than Red-Black trees or AVL trees is to minimize disk I/O operations.
A. Fan-out and Tree Height
On disk, data is read in units called Pages (InnoDB defaults to 16KB).
- Binary Trees: Each node has only 2 children. The height $H$ is approximately $\log_2 N$. For 10 million rows, $H \approx 24$, requiring up to 24 disk I/Os.
- B+Trees: By increasing the Fan-out (e.g., storing 100 keys per node), the tree height $H$ is drastically reduced. For the same 10 million rows, the height is typically only 3-4 levels.
B. Disk Pre-fetching and Range Scans
Since internal nodes do not store data, a single Page can hold significantly more keys, further increasing fan-out. Additionally, the linked structure of leaf nodes allows Range Scans to move horizontally across the leaf layer without re-traversing parent nodes.
3. Visualizing the Invisible: The Page Split Process
When a Page is full (reaches its fill-factor limit) and a new record is inserted, the database triggers a low-level operation: the Page Split.
sequenceDiagram
participant App as Insert Request
participant Buffer as Buffer Pool (RAM)
participant PageA as Page 101 (Full)
participant PageB as Page 102 (New)
App->>Buffer: Insert Key 50
Buffer->>PageA: Attempt Write
Note over PageA: Overflow Detected
Buffer->>PageB: Request New Page
PageA->>PageB: Migrate portion of records (e.g., Keys 30-49)
Note over PageA,PageB: Nodes Rebalanced
Buffer-->>Buffer: Update Parent Index Pointer
Note over Buffer: Disk I/O Amplification begins4. Real-World Case: Instagram’s ID Generation Strategy (2011)
Background: In 2011, the Instagram engineering team needed a high-performance ID generation system that could scale across shards.
Technical Choice: They initially considered UUID (Universally Unique Identifier). Reasons for Rejection:
- Index Size: UUIDs are 128-bit, while
bigintis 64-bit. For tables with billions of records, the index becomes bloated and difficult to fit in RAM. - B+Tree Performance Nightmare: UUIDs are random. In a B+Tree, this forces data to be inserted into random Pages. Frequent Page Splits cause physical storage to become fragmented and drastically lower the Fill Factor. This results in heavy random disk I/O, slowing down writes and squeezing Buffer Pool space.
Final Solution: Inspired by Twitter’s Snowflake, they designed a 64-bit ID with a Timestamp Prefix.
- Advantage: Because IDs are chronologically ordered, new records are mostly appended to the rightmost leaf nodes of the B+Tree. This avoids frequent random splits in the middle of the tree and generally maintains a higher Fill Factor (depending on the split policy), significantly improving cache hit rates and Write Locality.
Lesson: In high-performance database design, the sequentiality of an index column is often more important than its randomness. High-quality architects trade some randomness for physical B+Tree continuity.
5. Detailed Defense & Optimization
A. Root Fix: Choose the Right Clustered Index
- Prioritize Auto-increment IDs or keys with a natural chronological order.
- For non-sequential keys, consider lowering the
FILLFACTOR(Postgres) orinnodb_fill_factor(MySQL) to reserve space for random inserts and mitigate splitting.
B. Defense in Depth: Covering Index
- Include frequently queried fields in the secondary index.
- Benefit: Data exists directly in the secondary index’s leaf node, allowing the engine to return results by avoiding the additional back-to-table lookup; the initial index traversal remains $O(\log N)$.
C. Fragmentation Management
- Regularly perform
OPTIMIZE TABLE(MySQL) orREINDEX(Postgres) to reclaim physical space and rebalance the tree structure.
6. References
- Knuth, D. E. - The Art of Computer Programming, Vol 3: Sorting and Searching
- MySQL InnoDB Disk Structure - B+Tree
- PostgreSQL Index Types - B-Tree Internals
