14 changes: 10 additions & 4 deletions docs/databases/database-systems/database-storage.md
@@ -4,6 +4,12 @@ sidebar_position: 3

# Database Storage

:::tip[Status]

This note is complete, reviewed, and considered stable.

:::

**Database storage** is the physical representation of data within a database system. It's typically organized into **files** and **pages**.

## Storage Hierarchy
@@ -114,7 +120,7 @@ Understanding the storage hierarchy is crucial for designing efficient and cost-
- **Example:** Jumping directly to a specific page in a book using the table of contents.

> Random access on **non-volatile** storage is almost always **much slower** than sequential access.
> A DBMS will therefore want to maximize sequential access.

## Database Storage Layers

@@ -124,9 +130,9 @@ A database storage system can be thought of as three stacked layers, each respon

```mermaid
flowchart TB
Logical["Logical Layer\n(schema, tables, queries, indexes)"]
StorageEngine["Storage Engine\n(buffer manager, page manager, access methods, recovery)"]
Physical["Physical Layer\n(file system, block device, disk/SSD, cloud storage)"]
Logical["Logical Layer (schema, tables, queries, indexes)"]
StorageEngine["Storage Engine (buffer manager, page manager, access methods, recovery)"]
Physical["Physical Layer (file system, block device, disk/SSD, cloud storage)"]
Logical -->|requests| StorageEngine
StorageEngine -->|I/O| Physical
```
15 changes: 10 additions & 5 deletions docs/databases/database-systems/index-organized-storage.md
@@ -4,6 +4,12 @@ sidebar_position: 5

# Index Organized Storage

:::tip[Status]

This note is complete, reviewed, and considered stable.

:::

Index-Organized Storage (IOS) is a storage technique in databases where data is stored directly in the index structure itself. Unlike traditional tables where data and indexes are stored separately, an index-organized table (IOT) combines both the data and index, allowing for efficient access patterns and performance benefits in specific use cases.

<div style={{textAlign: 'center'}}>
@@ -13,10 +19,10 @@ flowchart LR
root((Root))
internal1((Internal Node))
internal2((Internal Node))
leafA("[Leaf: PK=1\nRowData]")
leafB("[Leaf: PK=2\nRowData]")
leafC("[Leaf: PK=100\nRowData]")
secIdx("[Secondary Index\n(Non-clustered)]")
leafA("[Leaf: PK=1 RowData]")
leafB("[Leaf: PK=2 RowData]")
leafC("[Leaf: PK=100 RowData]")
secIdx("[Secondary Index (Non-clustered)]")

root --> internal1
root --> internal2
@@ -25,7 +31,6 @@ flowchart LR
internal2 --> leafC
secIdx --> leafB
secIdx --> leafC

```

</div>
6 changes: 6 additions & 0 deletions docs/databases/database-systems/introduction.md
@@ -4,6 +4,12 @@ sidebar_position: 1

# Introduction

:::tip[Status]

This note is complete, reviewed, and considered stable.

:::

**Database Systems** are sophisticated software systems designed to store, manage, retrieve, and protect data efficiently. Understanding the internals and architecture of database systems is crucial for building scalable and reliable applications.

## Core Concepts
18 changes: 11 additions & 7 deletions docs/databases/database-systems/lsm-tree.md
@@ -4,7 +4,11 @@ sidebar_position: 6

# Log-Structured Merge Tree

<!-- markdownlint-disable MD024 -->
:::tip[Status]

This note is complete, reviewed, and considered stable.

:::

Log-Structured Merge (LSM) trees are a fundamental data structure used in database storage, particularly for handling high-write workloads efficiently. Here’s a breakdown of key concepts and considerations when working with LSM storage:

@@ -261,7 +265,7 @@ An SSTable, a file stored on disk, is an immutable and sorted structure optimize
- _Type_: Distributed SQL database.
- _Usage_: TiDB stores its data in TiKV, a distributed key-value store built on RocksDB, so it inherits an LSM-based storage layer while exposing a SQL interface, balancing SQL compatibility with NoSQL performance.

## **3. Time-Series Databases**
## Time-Series Databases

### InfluxDB

@@ -273,7 +277,7 @@ An SSTable, a file stored on disk, is an immutable and sorted structure optimize
- _Type_: Time-series database built on PostgreSQL.
- _Usage_: TimescaleDB keeps PostgreSQL's heap and B-tree storage rather than using an LSM tree; it handles high-frequency insertions in time-series data by partitioning hypertables into time-based chunks, so recent writes touch only the newest chunk and its indexes.

### **4. Search and Logging Databases**
## Search and Logging Databases

### Elasticsearch

@@ -292,9 +296,9 @@ An SSTable, a file stored on disk, is an immutable and sorted structure optimize

## FAQ: LSM Tree Sizing and Tuning

### How do I select the number of levels?
### How do we select the number of levels?

- The number of levels is determined primarily by your total on-disk dataset size (S), the size of your base level or memtable flush size (M), and the growth factor between levels (T). As a rule-of-thumb for leveling compaction:
- The number of levels is determined primarily by our total on-disk dataset size (S), the size of our base level or memtable flush size (M), and the growth factor between levels (T). As a rule of thumb for leveling compaction:

L ≈ ceil(log_T(S / M))

@@ -304,15 +308,15 @@ An SSTable, a file stored on disk, is an immutable and sorted structure optimize
- Choose the memtable / base SSTable size small enough to avoid large L0 write stalls.
- Use a T value (commonly 8–10 for leveling) to make level sizes grow geometrically and keep levels manageable.
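
As a rough illustration of the rule of thumb L ≈ ceil(log_T(S / M)), the sketch below counts how many times the base size must be multiplied by T before it covers the dataset. The function name `estimate_levels` and the example sizes are assumptions made for illustration; real engines apply their own sizing rules on top of this.

```python
def estimate_levels(dataset_bytes: int, base_bytes: int, growth_factor: int) -> int:
    """Smallest L with base_bytes * growth_factor**L >= dataset_bytes.

    This equals ceil(log_T(S / M)) but uses integer arithmetic, which avoids
    floating-point rounding at exact powers of T. At least one level is
    always reported.
    """
    if dataset_bytes <= 0 or base_bytes <= 0:
        raise ValueError("sizes must be positive")
    if growth_factor <= 1:
        raise ValueError("growth factor must be greater than 1")

    levels = 0
    capacity = base_bytes
    while capacity < dataset_bytes:
        capacity *= growth_factor
        levels += 1
    return max(levels, 1)


# Hypothetical workload: 1 TiB dataset, 64 MiB base size, T = 10.
levels = estimate_levels(2**40, 64 * 2**20, 10)
```

With these assumed numbers S / M is 16384, which lies between 10^4 and 10^5, so the estimate is five levels.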

### How do I choose level sizes and the growth factor (T)?
### How do we choose level sizes and the growth factor (T)?

- Strategy:

- Choose a base level size (often determined by memtable size and an L0 cap).
- Select a growth factor T where each level is roughly T times the previous level.
- Larger T reduces the number of levels (and total compaction passes) but increases individual level sizes and can increase read costs for certain patterns.

- Example: If your base level (L1) target is 1GB and T=10, then L2 target is 10GB, L3 is 100GB and so on.
- Example: If our base level (L1) target is 1 GB and T = 10, then the L2 target is 10 GB, L3 is 100 GB, and so on.

- Tuning trade-offs:
- Larger T -> fewer levels -> potentially lower write amplification -> larger compaction work per event -> potentially higher per-compaction latency impact.
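
The geometric growth of level targets can be sketched in a few lines. The helper name `level_targets` is hypothetical, and the 1 GB base with T = 10 simply mirrors the example in the text; actual engines may round or cap these targets.

```python
def level_targets(base_bytes: int, growth_factor: int, num_levels: int) -> list[int]:
    """Target size of each level, starting from the base level (L1).

    Each level is growth_factor times larger than the one before it.
    """
    if num_levels < 1:
        raise ValueError("need at least one level")
    return [base_bytes * growth_factor**i for i in range(num_levels)]


GB = 10**9
targets = level_targets(1 * GB, 10, 3)   # L1, L2, L3 targets in bytes
total_capacity = sum(targets)            # combined target size of all three levels
```

Trying a different T with the same base size shows the trade-off directly: a larger T reaches a given total capacity with fewer entries in the list, but each entry is larger.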
6 changes: 5 additions & 1 deletion docs/databases/database-systems/relational-algebra.md
@@ -4,7 +4,11 @@ sidebar_position: 2

# Relational Algebra

<!-- markdownlint-disable MD024 -->
:::tip[Status]

This note is complete, reviewed, and considered stable.

:::

**Relational Algebra** is a formal system of query operations used on relational databases. It provides a mathematical foundation for understanding how database queries work and serves as the theoretical basis for SQL. Relational algebra operations allow us to manipulate relations (tables) and extract meaningful information from data.

26 changes: 19 additions & 7 deletions docs/databases/database-systems/tuple-oriented-storage.md
@@ -4,6 +4,12 @@ sidebar_position: 4

# Tuple Oriented Storage

:::tip[Status]

This note is complete, reviewed, and considered stable.

:::

## Storage Manager

The Storage Manager is a critical component of a Database Management System (DBMS) responsible for managing the physical storage and retrieval of data. It acts as an interface between the DBMS and the underlying storage devices, such as hard disk drives (HDDs) or solid-state drives (SSDs).
@@ -49,7 +55,7 @@ flowchart LR

### Database Page Components

Each database page can be thought of as four main components: the Header, the Slot Directory, the Data Area, and the Free Space. Together these components let the DBMS store variable-length tuples efficiently and handle inserts, updates, and deletes without layout changes to the schema.
Each database page can be thought of as four main components: the Header, the Slot Directory, the Data Area, and the Free Space. Together, these components let the DBMS store variable-length tuples efficiently and handle inserts, updates, and deletes without layout changes to the schema.
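
A minimal sketch of such a page may help make the four components concrete. Everything here is an assumption for illustration: the 4096-byte page size, the two-field header (slot count and start of the Data Area), and the `SlottedPage` class itself are hypothetical and do not reflect any particular DBMS. The slot directory grows forward from the header, the Data Area grows backward from the end of the page, and the Free Space is whatever lies between them.

```python
import struct

PAGE_SIZE = 4096                 # assumed page size
HEADER_FMT = "<HH"               # hypothetical header: slot count, data-area start
HEADER_SIZE = struct.calcsize(HEADER_FMT)
SLOT_FMT = "<HH"                 # each slot: (tuple offset, tuple length)
SLOT_SIZE = struct.calcsize(SLOT_FMT)


class SlottedPage:
    def __init__(self) -> None:
        self.buf = bytearray(PAGE_SIZE)
        self._write_header(0, PAGE_SIZE)     # empty page: data area starts at the end

    def _read_header(self) -> tuple[int, int]:
        return struct.unpack_from(HEADER_FMT, self.buf, 0)

    def _write_header(self, slot_count: int, data_start: int) -> None:
        struct.pack_into(HEADER_FMT, self.buf, 0, slot_count, data_start)

    def free_space(self) -> int:
        """Bytes between the end of the slot directory and the data area."""
        slot_count, data_start = self._read_header()
        return data_start - (HEADER_SIZE + slot_count * SLOT_SIZE)

    def insert(self, tuple_bytes: bytes) -> int:
        """Store a variable-length tuple and return its slot number."""
        slot_count, data_start = self._read_header()
        if len(tuple_bytes) + SLOT_SIZE > self.free_space():
            raise ValueError("page full")
        new_start = data_start - len(tuple_bytes)      # data area grows backward
        self.buf[new_start:data_start] = tuple_bytes
        slot_pos = HEADER_SIZE + slot_count * SLOT_SIZE  # directory grows forward
        struct.pack_into(SLOT_FMT, self.buf, slot_pos, new_start, len(tuple_bytes))
        self._write_header(slot_count + 1, new_start)
        return slot_count

    def read(self, slot_id: int) -> bytes:
        slot_count, _ = self._read_header()
        if not 0 <= slot_id < slot_count:
            raise IndexError("no such slot")
        offset, length = struct.unpack_from(
            SLOT_FMT, self.buf, HEADER_SIZE + slot_id * SLOT_SIZE
        )
        return bytes(self.buf[offset:offset + length])


page = SlottedPage()
first = page.insert(b"alice")
second = page.insert(b"bob")
```

Because callers refer to tuples by slot number rather than byte offset, a tuple can be moved within the page by updating only its slot entry.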

#### Header

Expand Down Expand Up @@ -104,7 +110,7 @@ This layout — the slot directory growing forward and the Data Area growing bac

### Considerations

- **Page Size:** The choice of page size can impact performance and storage efficiency. Larger pages may reduce the number of I/O operations but can also lead to wasted space if pages are not fully utilized. Also write operations can be slow if the page size is too large.
- **Page Size:** The choice of page size can impact performance and storage efficiency. Larger pages may reduce the number of I/O operations but can also lead to wasted space if pages are not fully utilized. Write operations can also be slow if the page size is too large.
- **Page Organization:** The way data is organized within a page can affect retrieval efficiency. Techniques like B-trees, hash tables, and heap files are commonly used.
- **Page Compression:** Compressing data within pages can reduce storage requirements and improve I/O performance.

@@ -379,20 +385,26 @@ A tuple is essentially a sequence of bytes (these bytes do not have to be contig
- Bit Map for NULL values.
- Note that the DBMS does not need to store meta-data about the schema of the database here.
- **Tuple Data:** Actual data for attributes.
- Attributes are typically stored in the order that you specify them when you create the table.
- Attributes are typically stored in the order that we specify them when we create the table.
- Most DBMSs do not allow a tuple to exceed the size of a page.
- **Tuple Header:** Contains meta-data about the tuple.
- Visibility information for the DBMS’s concurrency control protocol (i.e., information about which transaction created/modified that tuple).
- Bit Map for NULL values.
- Note that the DBMS does not need to store meta-data about the schema of the database here.
- **Tuple Data:** Actual data for attributes.
- Attributes are typically stored in the order that we specify them when we create the table.
- Most DBMSs do not allow a tuple to exceed the size of a page.
- **Unique Identifier:**
- Each tuple in the database is assigned a unique identifier.
- Most common: page id + (offset or slot).
- An application cannot rely on these ids to mean anything.
- An application cannot rely on these IDs to mean anything.
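
The tuple layout described above can be sketched as follows. The layout is a simplification chosen for illustration: the header holds only a 4-byte creating-transaction id (named `xmin` here as an assumption) followed by a NULL bitmap, and every attribute is a 32-bit integer. The `RecordId`, `encode_tuple`, and `decode_tuple` names are hypothetical. Note that the column count is passed in by the caller, reflecting the point that schema metadata does not live in the tuple itself.

```python
import struct
from typing import NamedTuple, Optional, Sequence


class RecordId(NamedTuple):
    """Physical tuple identifier: page id plus slot number."""
    page_id: int
    slot: int


def encode_tuple(xmin: int, values: Sequence[Optional[int]]) -> bytes:
    """Header (transaction id + NULL bitmap) followed by non-NULL int32 values."""
    bitmap = bytearray((len(values) + 7) // 8)
    data = bytearray()
    for i, value in enumerate(values):
        if value is None:
            bitmap[i // 8] |= 1 << (i % 8)      # mark column i as NULL
        else:
            data += struct.pack("<i", value)    # attributes kept in column order
    return struct.pack("<I", xmin) + bytes(bitmap) + bytes(data)


def decode_tuple(raw: bytes, num_columns: int) -> tuple[int, list[Optional[int]]]:
    """Inverse of encode_tuple; num_columns comes from the catalog, not the tuple."""
    (xmin,) = struct.unpack_from("<I", raw, 0)
    bitmap_len = (num_columns + 7) // 8
    bitmap = raw[4:4 + bitmap_len]
    pos = 4 + bitmap_len
    values: list[Optional[int]] = []
    for i in range(num_columns):
        if bitmap[i // 8] & (1 << (i % 8)):
            values.append(None)                 # NULL columns occupy no data bytes
        else:
            (value,) = struct.unpack_from("<i", raw, pos)
            values.append(value)
            pos += 4
    return xmin, values


raw = encode_tuple(7, [1, None, 3])
rid = RecordId(page_id=12, slot=3)
```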

**Denormalized Tuple Data**: If two tables are related, the DBMS can “pre-join” them, so the tables end up on the same page. This makes reads faster since the DBMS only has to load in one page rather than two separate pages. However, it makes updates more expensive since the DBMS needs more space for each tuple.
**Denormalized Tuple Data**: If two tables are related, the DBMS can “pre-join” them, so the tables end up on the same page. This makes reads faster since the DBMS only has to load one page rather than two separate pages. However, it makes updates more expensive since the DBMS needs more space for each tuple.

## Large Attribute Storage (Overflow & External Storage)

When an attribute value (for example, a large TEXT or BLOB column) cannot fit comfortably into the remaining free space on a page, DBMSs use a few common techniques to store it without violating page-size constraints:

- Inline first, overflow later: The DBMS tries to store a small prefix of the attribute inline and stores the remaining portion in overflow pages. The main tuple contains a pointer or descriptor to the overflow chain so the DBMS can reconstruct the full attribute when needed.
- External storage (TOAST / LOB table): Some systems (e.g., PostgreSQL) store large attributes in a separate storage object (TOAST table or LOB store). The tuple stores a compact reference (pointer) to the external storage.
- Overflow pages / chained pages: The DBMS stores the large attribute across multiple overflow pages and links them so the attribute can be streamed or reassembled by following pointers.
- Compression & partial inline: The DBMS may compress the attribute or save a compressed chunk inline and use external storage for the rest.
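
The inline-prefix and chained-overflow ideas can be combined in a small sketch. This is not how any specific system implements it: the in-memory `pages` dictionary stands in for overflow pages on disk, the 8-byte payload size is deliberately tiny so the chain is visible, and `store_large_value` / `load_large_value` are hypothetical names.

```python
from typing import Dict, Optional, Tuple

OVERFLOW_PAYLOAD = 8   # assumed bytes of payload per overflow page (tiny for demo)

# page id -> (payload chunk, id of the next overflow page or None)
OverflowPages = Dict[int, Tuple[bytes, Optional[int]]]


def store_large_value(
    pages: OverflowPages, value: bytes, inline_prefix: int, first_page_id: int
) -> Tuple[bytes, Optional[int]]:
    """Keep a prefix inline and spill the rest into a chain of overflow pages.

    Returns what the main tuple would hold: the inline bytes and a pointer
    to the first overflow page (None if everything fit inline).
    """
    inline, rest = value[:inline_prefix], value[inline_prefix:]
    chunks = [rest[i:i + OVERFLOW_PAYLOAD] for i in range(0, len(rest), OVERFLOW_PAYLOAD)]
    for n, chunk in enumerate(chunks):
        page_id = first_page_id + n
        next_id = page_id + 1 if n + 1 < len(chunks) else None
        pages[page_id] = (chunk, next_id)
    return inline, (first_page_id if chunks else None)


def load_large_value(pages: OverflowPages, inline: bytes, head: Optional[int]) -> bytes:
    """Reassemble the attribute by following the overflow chain."""
    out = bytearray(inline)
    page_id = head
    while page_id is not None:
        chunk, page_id = pages[page_id]
        out += chunk
    return bytes(out)


pages: OverflowPages = {}
value = bytes(range(30))
inline, head = store_large_value(pages, value, inline_prefix=4, first_page_id=100)
restored = load_large_value(pages, inline, head)
```

Reads that only need the inline prefix never touch the overflow pages, which is the main appeal of keeping part of the value in the tuple.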
@@ -433,6 +445,6 @@ In MVCC systems, multiple versions of a tuple can exist concurrently. A tuple is

### Notes & trade-offs

- VACUUM is necessary to reclaim space and update statistics—the more frequently you vacuum, the less bloat and the better the optimizer statistics.
- VACUUM is necessary to reclaim space and keep statistics up to date: the more frequently we vacuum, the less bloat accumulates and the better the optimizer statistics.
- Reclaiming/deleting tuples requires careful management of concurrency and durability (WAL) to ensure other transactions cannot read partially-deleted states.
- Some reclamation processes only mark slots available for reuse (so physical layout doesn't change), while table-rewrite operations will physically compact data and change tuple offsets.