diff --git a/docs/data-structure-and-algorithms/introduction.md b/docs/data-structure-and-algorithms/introduction.md index 23cf3777..4a9105c5 100644 --- a/docs/data-structure-and-algorithms/introduction.md +++ b/docs/data-structure-and-algorithms/introduction.md @@ -3,33 +3,3 @@ sidebar_position: 1 --- # Introduction - -These are my notes on Data Structure Algorithms - -## Content Table - -### Algorithms - -1. [Array and Hashing](/docs/data-structure-and-algorithms/algorithms/arrays-and-hasing.md) -2. [Sorting](/docs/data-structure-and-algorithms/algorithms/sorting.md) -3. [Two Pointers](/docs/data-structure-and-algorithms/algorithms/two-pointers.md) -4. [Sliding Window](/docs/data-structure-and-algorithms/algorithms/sliding-window.md) -5. [Stack](/docs/data-structure-and-algorithms/algorithms/stack.md) -6. [Queue](/docs/data-structure-and-algorithms/algorithms/queue.md) -7. [Tree](/docs/data-structure-and-algorithms/algorithms/tree/introduction.md) - 1. [Introduction](/docs/data-structure-and-algorithms/algorithms/tree/introduction.md) - 2. [Binary Tree](/docs/data-structure-and-algorithms/algorithms/tree/binary-tree.md) - 3. [Strict Binary Tree](/docs/data-structure-and-algorithms/algorithms/tree/strict-binary-tree.md) - 4. [Full vs Complete Binary Tree](/docs/data-structure-and-algorithms/algorithms/tree/full-vs-complete-binary-tree.md) - 5. [Strict vs Complete Binary Tree](/docs/data-structure-and-algorithms/algorithms/tree/strict-vs-complete-binary-tree.md) - 6. [N-ary Tree](/docs/data-structure-and-algorithms/algorithms/tree/n-ary-tree.md) - 7. [Tree Representation](/docs/data-structure-and-algorithms/algorithms/tree/tree-representation.md) - 8. [Tree Traversal](/docs/data-structure-and-algorithms/algorithms/tree/tree-traversal.md) - 9. [Common Algorithms](/docs/data-structure-and-algorithms/algorithms/tree/common-algorithms.md) -8. [Binary Search Tree](/docs/data-structure-and-algorithms/algorithms/bst/introduction.md) - 1. 
[Introduction](/docs/data-structure-and-algorithms/algorithms/bst/introduction.md) - 2. [Key Operations](/docs/data-structure-and-algorithms/algorithms/bst/key-operations.md) - -### Time and Space Complexity - -1. [Time Complexity](/docs/data-structure-and-algorithms/time-space-complexity/time-complexity.md) diff --git a/docs/databases/database-engineering/introduction.md b/docs/databases/database-engineering/introduction.md index dc0da6ae..bb92f532 100644 --- a/docs/databases/database-engineering/introduction.md +++ b/docs/databases/database-engineering/introduction.md @@ -4,6 +4,12 @@ sidebar_position: 1 # Introduction +:::tip[Status] + +This note is complete, reviewed, and considered stable. + +::: + Database engineering is about setting up, configuring, and managing databases to make sure they work well and meet the needs of users. It involves tasks like choosing the right type of database, setting it up for performance and security, managing backups, and making sure the database is always available. Database engineering ensures that databases are reliable, fast, and secure for everyday use. diff --git a/docs/databases/database-engineering/pooling.md b/docs/databases/database-engineering/pooling.md index d84f4df9..f03173a2 100644 --- a/docs/databases/database-engineering/pooling.md +++ b/docs/databases/database-engineering/pooling.md @@ -4,46 +4,365 @@ sidebar_position: 2 # Connection Pooling -**Connection pooling** is a technique used to improve the performance and scalability of applications interacting with databases. It maintains a pool of database connections that can be reused rather than creating and destroying connections for every request. +:::tip[Status] -## How Connection Pooling Works +This note is complete, reviewed, and considered stable. -1. **Create the Pool:** When the application starts, a fixed number of database connections are created and added to the pool. -2. 
**Borrow a Connection:** When the application needs a database connection, it borrows an available connection from the pool. -3. **Return the Connection:** After the operation is complete, the connection is returned to the pool for reuse. -4. **Connection Reuse:** Connections are reused until they become stale or are removed from the pool. Typically when a connected become staled or removed from the pool, a new connection then added to the pool **on demand**. +::: -## Benefits of Connection Pooling +A database connection is a **persistent communication channel** between an application and a database server. -1. **Reduced Overhead:** Opening and closing database connections is expensive; connection pooling reuses connections, avoiding the setup costs. -2. **Improved Performance:** Connections are readily available, which leads to faster query processing and reduced waiting time. -3. **Better Resource Management:** The pool controls the number of active connections, preventing the database from being overwhelmed by too many concurrent requests. -4. **Scalability:** Connection pooling allows applications to handle more requests with feOur resources. -5. **Connection Health Checks:** Pooling libraries often include automatic handling of stale or broken connections. +Establishing a connection involves: -## Normal Database Connections vs. Connection Pooling +* TCP handshake +* Authentication (credentials, certificates, SSL) +* Allocating memory and worker resources on the database +* Registering a session in the database’s internal state -- **Normal Connections:** Every request opens a new connection, which is slow and resource-intensive, leading to high latency and poor scalability. -- **Connection Pooling:** Connections are established once and reused, reducing latency, optimizing resource usage, and improving scalability. +This makes connection creation **significantly more expensive** than executing most queries. 
-## Example Using PostgreSQL and PGBouncer +## Naive ways to connect to a database (and why they fail) -### Configure PGBouncer +### One connection per request -- Install and configure `pgbouncer` to manage connections to PostgreSQL. Example configuration: - ```ini - [databases] - mydb = host=localhost dbname=mydb user=myuser password=mypassword - [pgbouncer] - pool_mode = transaction - max_client_conn = 100 - default_pool_size = 20 - ``` +```text +Request arrives +→ Open database connection +→ Execute query +→ Close connection +→ Send response +``` + +
+ +```mermaid +sequenceDiagram + participant Client + participant App + participant DB + + Client->>App: HTTP Request + App->>DB: Open Connection + App->>DB: Execute Query + App->>DB: Close Connection + App->>Client: Response +``` + +
+ +#### Why this approach fails + +* Connection setup dominates request latency +* High CPU and memory usage on the database +* Connection storms during traffic spikes +* Poor scalability beyond low traffic + +This model is acceptable only for scripts or extremely low-load systems. + +### Single shared global connection + +```text +Create one DB connection +Reuse it for all requests +``` + +
+ +```mermaid +flowchart LR + R1[Request 1] + R2[Request 2] + R3[Request 3] + Conn[(Single DB Connection)] + DB[(Database)] + + R1 --> Conn + R2 --> Conn + R3 --> Conn + Conn --> DB +``` + +
+ +#### Why this approach fails + +* No real concurrency +* Thread safety issues +* One broken connection breaks the entire system + +## Core idea of connection pooling + +**Connection pooling** means: + +> Maintain a fixed number of pre-opened database connections and reuse them across requests. + +Connections are **borrowed**, not created or destroyed per request. + +## How connection pooling works internally + +### Pool initialization + +At application startup: + +
+ +```mermaid +flowchart LR + App[Application Start] + Pool[Connection Pool] + C1[(Conn 1)] + C2[(Conn 2)] + C3[(Conn 3)] + DB[(Database)] + + App --> Pool + Pool --> C1 --> DB + Pool --> C2 --> DB + Pool --> C3 --> DB +``` + +
+ +* Connections are created once +* Stored in an idle state +* Kept alive for reuse + +### Request lifecycle with pooling + +
+ +```mermaid +sequenceDiagram + participant Client + participant App + participant Pool + participant DB + + Client->>App: Request + App->>Pool: Borrow connection + Pool-->>App: Connection + App->>DB: Execute query + DB->>App: Query Response + App->>Pool: Return connection + App->>Client: Response +``` + +
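The borrow/return cycle in the diagram can be sketched in a few lines of Python. This is a toy illustration, not how any particular pooling library is implemented: `ConnectionPool`, `connection()`, and `idle_count()` are hypothetical names, and SQLite's in-memory database stands in for a real server so the snippet runs on its own.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Toy fixed-size pool (hypothetical class, for illustration only)."""

    def __init__(self, size, database=":memory:"):
        # Pool initialization: open every connection once, up front.
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            # check_same_thread=False lets a connection be handed between threads.
            self._idle.put(sqlite3.connect(database, check_same_thread=False))

    @contextmanager
    def connection(self, timeout=1.0):
        # Borrow: blocks while all connections are busy,
        # raises queue.Empty if none frees up within `timeout` seconds.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Return: runs even if the query raised, so the connection is not leaked.
            self._idle.put(conn)

    def idle_count(self):
        return self._idle.qsize()


pool = ConnectionPool(size=2)

with pool.connection() as conn:
    value = conn.execute("SELECT 1 + 1").fetchone()[0]
    idle_while_borrowed = pool.idle_count()

idle_after_return = pool.idle_count()
```

Putting the return step in `finally` is the design choice that matters here: the connection goes back to the pool on both the success path and the error path.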
+ +Key rule: + +* **Connections must always be returned to the pool** + +### Pool exhaustion and waiting + +When all connections are busy: + +
+ +```mermaid +flowchart LR + Req[Incoming Request] + Pool["Connection Pool\n(All Busy)"] + Wait[Wait / Queue] + DB[(Database)] + + Req --> Pool + Pool --> Wait + Wait --> Pool + Pool --> DB +``` + +
+ +This introduces **backpressure**, protecting the database. + +## Key characteristics of a connection pool + +### Pool size + +The pool enforces a maximum number of concurrent database connections. + +* Small pool → higher wait times +* Large pool → database overload + +The pool size is a **control knob**, not a performance booster. + +### Connection reuse + +
+ +```mermaid +flowchart LR + Req1[Request A] + Req2[Request B] + Req3[Request C] + Conn[(Reusable Connection)] + + Req1 --> Conn + Req2 --> Conn + Req3 --> Conn +``` + +
+ +The same connection serves many requests over time, amortizing setup costs. -### Application Connection +## Advantages of connection pooling over other approaches -- Configure Our app to connect through PGBouncer, not directly to PostgreSQL. +### Performance advantage -```plaintext -host=pgbouncer_host port=6432 dbname=mydb user=myuser password=mypassword +Compared to per-request connections: + +
+ +```mermaid +flowchart TB + A[Per-request connections] + B[Connection pooling] + + A -->|Repeated setup| Slow[High Latency] + B -->|Reuse connections| Fast[Low Latency] +``` + +
+ +Pooling removes repeated connection overhead from the critical path. + +### Scalability advantage + +
+ +```mermaid +flowchart LR + Threads[Many App Threads] + Pool[Few DB Connections] + DB[(Database)] + + Threads --> Pool --> DB +``` + +
+ +* Application concurrency ≠ database concurrency +* Pool acts as a **gatekeeper** + +### Database protection + +From the database’s perspective: + +
+ +```mermaid +flowchart LR + App[Application] + Pool[Connection Pool] + DB[(Database)] + + App --> Pool + Pool -->|Limited| DB +``` + +
+ +The pool: + +* Prevents connection floods +* Keeps resource usage stable +* Improves predictability under load + +### Resource efficiency + +Without pooling: + +* Constant connection creation/destruction +* High CPU and memory churn + +With pooling: + +* Long-lived connections +* Better cache and worker reuse + +### Fault tolerance + +Pools can: + +* Detect broken connections +* Replace them automatically +* Retry safely + +This logic is centralized instead of duplicated across application code. + +## Connection pooling vs concurrency (important distinction) + +
+ +```mermaid +flowchart LR + Threads[100 Threads] + Pool[10 DB Connections] + DB[(Database)] + + Threads --> Pool --> DB +``` + +
+ +* Many threads can exist +* Only limited DB access is allowed +* Excess work waits instead of crashing the database + +This is **intentional system design**, not a bottleneck. + +## Common production failure modes + +### Connection leaks + +
+ +```mermaid +flowchart LR + Pool[Connection Pool] + Leak[Borrowed and Not Returned] + Exhausted[Pool Exhausted] + + Pool --> Leak --> Exhausted +``` + +
+ +Result: + +* Requests hang +* System degrades over time + +### Oversized pools + +
+ +```mermaid +flowchart LR + Pool[Huge Pool] + DB[(Database)] + Slow[High Latency] + + Pool --> DB --> Slow ``` + +
+ +More connections increase contention and reduce performance. + +### Long-running transactions + +Connections held too long reduce pool availability and cause cascading delays. + +## Real-world implementations + +Connection pooling is built into most ecosystems: + +* Java → HikariCP +* Python → SQLAlchemy / Django +* Go → `database/sql` +* Node.js → driver-level pools + +Different APIs, same underlying model. diff --git a/docs/databases/database-engineering/transactions.md b/docs/databases/database-engineering/transactions.md index 1dfdfae2..feefaacc 100644 --- a/docs/databases/database-engineering/transactions.md +++ b/docs/databases/database-engineering/transactions.md @@ -4,130 +4,440 @@ sidebar_position: 3 # Transactions -A **transaction** in a database is a sequence of operations that are executed as a single unit of work. Transactions ensure that the database maintains consistency, reliability, and integrity even in situations where errors, crashes, or concurrent updates occur. Transactions allow multiple database operations to be grouped together, so either all the operations succeed or none of them are applied, preserving the consistency of the database. +:::tip[Status] -## Transaction Lifecycle +This note is complete, reviewed, and considered stable. -A transaction follows a specific lifecycle, consisting of four main stages: +::: -### Begin Transaction +A **transaction** is a sequence of database operations that the database treats as **one logical unit**. -A transaction starts when a command such as `BEGIN` or `START TRANSACTION` is issued. This marks the beginning of a new transaction. +A transaction guarantees: -**Example:** +* Either **all operations succeed** +* Or **none of them take effect** -```sql -BEGIN; +There is **no visible partial state**. 
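A minimal sketch of the all-or-nothing guarantee, assuming SQLite through Python's standard `sqlite3` module as a stand-in for any transactional database. The `accounts` schema, the `CHECK` constraint, and the `transfer` and `balances` helpers are made up for this example. The credit runs first on purpose, so that when the debit fails there is already one successful statement to undo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            # Violates the CHECK constraint when src has insufficient funds.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))
```

Calling `transfer(conn, 1, 2, 500)` fails on the debit and leaves both balances untouched, while `transfer(conn, 1, 2, 30)` applies both updates together.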
+ +## Why transactions exist + +Databases operate in an environment where **failures and concurrency are normal**: + +* Multiple users updating data at the same time +* Application crashes +* Database crashes +* Partial execution of multi-step logic + +Without transactions, databases would constantly end up in **corrupted or inconsistent states**. + +Example problem: + +```text +1. Deduct ₹1000 from Account A - executed successfully +2. Add ₹1000 to Account B - db crashed ``` -### Perform Operations +Money disappears. + +Transactions exist to guarantee: + +> **Multiple operations behave as one correct, indivisible unit of work.** + +## Transaction lifecycle (high level) + +
+ +```mermaid +stateDiagram-v2 + [*] --> Active + Active --> Committed + Active --> RolledBack + Committed --> [*] + RolledBack --> [*] +``` + +
+ +* `BEGIN` → transaction becomes **Active** +* `COMMIT` → changes become permanent +* `ROLLBACK` → changes are undone + +## ACID properties (what transactions guarantee) -During the transaction, several SQL operations (e.g., `INSERT`, `UPDATE`, `DELETE`) are executed. These operations are treated as part of a single unit of work. +Transactions are defined by **ACID**. These are **engineering guarantees**, not theory. -**Example:** +## Atomicity – all or nothing + +Atomicity means: + +> A transaction is indivisible. Partial results are never visible. + +Example: ```sql +BEGIN; UPDATE accounts SET balance = balance - 100 WHERE id = 1; UPDATE accounts SET balance = balance + 100 WHERE id = 2; +COMMIT; ``` -### Commit Transaction +If the second update fails, the first update is **undone**. -Once all the operations within the transaction are successfully executed, the transaction is committed using the `COMMIT` command. This makes all changes permanent in the database. +### Atomicity internally -**Example:** +Databases achieve atomicity using: -```sql -COMMIT; +* **Undo information** +* **Transaction logs** + +
+ +```mermaid +flowchart LR + Tx[Transaction] + Change[Data Change] + Undo[Undo / Old Value] + + Tx --> Change + Change --> Undo ``` -### Rollback Transaction +
-If an error occurs during the transaction, or if the changes are no longer needed, the transaction can be rolled back. The `ROLLBACK` command undoes all the operations performed in the transaction, restoring the database to its state before the transaction started. +If the transaction aborts, the database reverts changes using undo data. -**Example:** +## Consistency – rules are enforced -```sql -ROLLBACK; +Consistency means: + +> A transaction moves the database from one **valid state** to another **valid state**. + +Examples of consistency rules: + +* Foreign keys must exist +* Unique constraints must hold +* Balance must not be negative + +If a transaction violates constraints: + +* The database **rejects it** +* The transaction is rolled back + +Important: + +> Transactions do not define rules, **constraints do**. +> Transactions ensure rules are never bypassed. + +## Isolation – transactions don’t interfere + +Isolation means: + +> Concurrent transactions must not see each other’s **partial work**. + +Without isolation, anomalies occur. + +### Dirty read example (bad) + +```text +T1: UPDATE balance = balance - 500 (not committed) +T2: SELECT balance → sees reduced balance +T1: ROLLBACK ``` -## Advantages of Transactions +T2 observed data that **never existed**. -Transactions offer several important benefits that help maintain the integrity and performance of the database: +Isolation prevents this. -1. **Consistency:** Transactions ensure that the database remains in a consistent state. If a transaction involves multiple operations, the database is either updated with all the changes, or none of them are applied. This guarantees the integrity of the data. +## Durability – committed means permanent -2. **Reliability:** Transactions make sure that database operations are applied reliably, even in the event of system crashes. Once a transaction is committed, the changes are permanent, ensuring that the database reflects the intended operations. 
+Durability means: -3. **Atomicity:** The atomic nature of a transaction means that all operations within the transaction are treated as a single unit. If one operation fails, the entire transaction is rolled back, ensuring that partial updates do not corrupt the database. +> Once a transaction commits, its changes will survive crashes. -4. **Concurrency:** In multi-user environments, transactions help manage concurrent access to the database. They ensure that each transaction works as if it Oure the only transaction in the system, preventing conflicts and maintaining consistency. +Databases achieve durability using: -5. **Error Handling:** If an error occurs during a transaction, the changes can be rolled back, and the database can return to its previous consistent state. This helps to avoid partial updates or data corruption. +* Write-Ahead Logging (WAL) +* fsync to stable storage +* Crash recovery -6. **Simplified Development:** Using transactions simplifies the development of complex operations by allowing developers to group multiple SQL statements into a single unit of work. This eliminates the need for manually managing intermediate states. +## Transactions and connections (critical relationship) -## Concurrency Control and Locking +A transaction is **always bound to one database connection**. -In multi-user environments, where multiple transactions can occur simultaneously, it's important to manage how transactions interact with each other to avoid conflicts. Databases use **locks** to ensure that one transaction does not interfere with another, maintaining data integrity. +
-There are several types of locks: +```mermaid +flowchart LR + Conn[DB Connection] + Tx[Transaction Context] + DB[(Database)] -- **Row-level locks:** Prevent other transactions from modifying or reading a row that is being modified by another transaction. -- **Table-level locks:** Prevent any other transaction from accessing the entire table while a transaction is modifying it. + Conn --> Tx --> DB +``` + +
+ +Implications: + +* A transaction cannot span multiple connections +* The connection is held until `COMMIT` or `ROLLBACK` +* This is why long transactions exhaust connection pools -In PostgreSQL, We can use commands like `FOR UPDATE` or `FOR SHARE` to lock rows explicitly within a transaction. +## How transactions work internally (real mechanism) -**Example:** +Internally, transactions rely on **four core systems**: + +1. Transaction IDs (TXID) +2. Write-Ahead Logging (WAL) +3. MVCC (row versioning) +4. Locks (minimal and scoped) + +## Transaction start (BEGIN internally) + +When you run: ```sql BEGIN; -SELECT * FROM accounts WHERE id = 1 FOR UPDATE; -- Lock row for update -UPDATE accounts SET balance = balance - 100 WHERE id = 1; -COMMIT; ``` -## Transaction Isolation Levels +The database: -The **isolation level** of a transaction controls how the operations of one transaction are visible to others. The goal of isolation is to prevent issues like **dirty reads**, **non-repeatable reads**, and **phantom reads**. +1. Assigns a **Transaction ID (TXID)** +2. Creates a **transaction context** +3. Records snapshot information -PostgreSQL supports several isolation levels: +
-### Read Committed (Default) +```mermaid +flowchart LR + Client --> Conn[Connection] + Conn --> Tx[Transaction\nTXID=42] +``` -The default level where a transaction can only see data committed before it started. It prevents dirty reads but allows non-repeatable reads. +
-### Repeatable Read +No data is copied. No locks are taken yet. -Ensures that if a transaction reads a value, it will always read the same value throughout the transaction. This prevents dirty reads and non-repeatable reads, but phantom reads can still occur. +## Write-Ahead Logging (WAL) -### Serializable +### Core rule -The highest level of isolation, where transactions are executed as if they Oure processed serially (one after another), completely eliminating dirty reads, non-repeatable reads, and phantom reads. +> **Changes must be written to the log before data pages are modified.** -## Transaction Deadlocks +Why? -A **deadlock** occurs when two or more transactions are waiting for each other to release locks, causing the transactions to get stuck in a cycle of dependencies. Deadlocks can cause transactions to be stuck indefinitely, and the database system must detect and resolve these deadlocks by rolling back one of the transactions involved. +* Logs are sequential → fast +* Logs are replayable → crash-safe -For example: +### UPDATE internally -- Transaction A locks resource 1 and waits for resource 2. -- Transaction B locks resource 2 and waits for resource 1. +```sql +UPDATE users SET balance = 900 WHERE id = 1; +``` -The database detects this circular waiting and will typically roll back one of the transactions to break the deadlock. +Steps: -## Savepoints +1. Create WAL record (old value, new value, TXID) +2. Append WAL record to disk +3. Modify data page in memory +4. Mark page as dirty -A **savepoint** allows We to set a point within a transaction to which We can later roll back, without affecting the entire transaction. This is useful for partial rollbacks, particularly in complex transactions where only a part of the transaction needs to be undone. +
-**Example:** +```mermaid +sequenceDiagram + participant Tx as Transaction + participant WAL as WAL Log + participant Mem as Memory Page + participant Disk as Disk -```sql -BEGIN; -UPDATE accounts SET balance = balance - 100 WHERE id = 1; -SAVEPOINT sp1; -- Create a savepoint -UPDATE accounts SET balance = balance + 100 WHERE id = 2; --- If the second update fails, roll back to the savepoint without affecting the first update -ROLLBACK TO SAVEPOINT sp1; -COMMIT; + Tx->>WAL: Write change record + WAL->>Disk: fsync + Tx->>Mem: Apply change + Mem-->>Tx: Page marked dirty +``` + +
+ +## COMMIT internally + +When `COMMIT` is issued: + +1. Write COMMIT record to WAL +2. Flush WAL to disk +3. Mark transaction as committed +4. Release locks +5. Make versions visible + +
+ +```mermaid +flowchart TB + Tx --> WAL + WAL --> Disk + Disk --> Tx ``` + +
+ +At this point: + +* Transaction is **durable** +* Crashes cannot undo it + +## ROLLBACK internally + +When `ROLLBACK` happens: + +1. Use undo / old versions +2. Revert in-memory changes +3. Discard transaction context + +
+ +```mermaid +flowchart LR + Tx --> Undo[Undo Data] --> Data[Restore Old State] +``` + +
+ +No durability guarantee is needed for rollback. + +## MVCC – how isolation really works + +Modern databases use **MVCC (Multi-Version Concurrency Control)**. + +Instead of overwriting rows: + +* Updates create **new versions** +* Old versions remain for other transactions + +
+ +```mermaid +flowchart LR + V1[Row v1\nTXID=10] + V2[Row v2\nTXID=20] + V3[Row v3\nTXID=30] + + V1 --> V2 --> V3 +``` + +
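The version chain in the diagram can be modelled with a toy visibility check. This is a deliberately simplified sketch: `RowVersion`, `Snapshot`, and `read` are hypothetical names, a snapshot is reduced to the set of TXIDs that had committed when it was taken, and real engines track considerably more (deleting transactions, in-progress lists, hint bits).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowVersion:
    value: int
    created_by: int  # TXID of the transaction that wrote this version


@dataclass(frozen=True)
class Snapshot:
    committed: frozenset  # TXIDs already committed when the snapshot was taken


def read(versions, snapshot):
    """Return the newest value whose writer is visible to `snapshot`, else None."""
    for version in reversed(versions):  # walk the chain newest-first
        if version.created_by in snapshot.committed:
            return version.value
    return None


# Version chain from the diagram: v1 (TXID 10) -> v2 (TXID 20) -> v3 (TXID 30).
chain = [RowVersion(100, 10), RowVersion(90, 20), RowVersion(80, 30)]

old_reader = Snapshot(committed=frozenset({10, 20}))      # TXID 30 not committed yet
new_reader = Snapshot(committed=frozenset({10, 20, 30}))  # taken after TXID 30 committed
```

Both readers walk the same chain without taking a lock. They get different answers only because their snapshots differ.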
+ +## Snapshots and visibility + +Each transaction gets a **snapshot**: + +* Which transactions were committed +* Which were active + +When reading a row: + +```text +Is this version visible to my snapshot? +→ Yes → return +→ No → skip +``` + +Reads do **not block writes**. + +## Locks (still necessary, but limited) + +Despite MVCC, locks exist: + +| Lock | Purpose | +| -------------- | ------------------------- | +| Row locks | Prevent concurrent writes | +| Table locks | DDL | +| Advisory locks | App-level coordination | + +
+ +```mermaid +flowchart LR + Tx1 -->|Row Lock| Row + Tx2 -->|Wait| Row +``` + +
+ +Locks are: + +* Fine-grained +* Short-lived +* Scoped to transactions + +## Crash recovery + +After a crash: + +1. Database scans WAL +2. Replays committed transactions +3. Ignores uncommitted ones + +
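The three recovery steps above can be sketched as a replay over an in-memory log. The record layout (`txid`, `op`, `key`, `value`) and the `recover` function are assumptions made for illustration; a real WAL is a binary, page-oriented format with checkpoints.

```python
def recover(wal):
    """Rebuild state from a log, applying only committed transactions."""
    # Step 1: scan the log for transactions that reached COMMIT.
    committed = {rec["txid"] for rec in wal if rec["op"] == "commit"}

    state = {}
    for rec in wal:  # replay in log order
        # Step 2: redo committed changes. Step 3: skip everything else.
        if rec["op"] == "set" and rec["txid"] in committed:
            state[rec["key"]] = rec["value"]
    return state


wal = [
    {"txid": 1, "op": "set", "key": "A", "value": 900},
    {"txid": 1, "op": "set", "key": "B", "value": 1100},
    {"txid": 1, "op": "commit"},
    {"txid": 2, "op": "set", "key": "A", "value": 0},
    # Crash happens here: TXID 2 never wrote a commit record.
]

recovered = recover(wal)
```

Because TXID 2 has no commit record, its write to `A` is ignored and the recovered state reflects only TXID 1.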
+ +```mermaid +flowchart LR + Crash --> WAL + WAL --> Redo[Redo Committed] + WAL --> Skip[Ignore Uncommitted] +``` + +
+ +This guarantees **Atomicity + Durability**. + +## Garbage collection of old versions + +Old row versions cannot live forever. + +Background process: + +* Removes versions no snapshot needs +* Reclaims space + +(PostgreSQL: `VACUUM`) + +
+ +```mermaid +flowchart LR + Old[Old Versions] --> GC[Garbage Collector] --> Free[Free Space] +``` + +
+ +Long transactions delay cleanup. + +## Why long transactions are dangerous + +Long transactions: + +* Hold snapshots open +* Prevent garbage collection +* Hold connections +* Increase WAL size + +This leads to: + +* Pool exhaustion +* Disk bloat +* Latency spikes + +Rule: + +> **Start transactions late, commit early.** + +## Mental model + +Think of a transaction as: + +* A **private snapshot** +* A **stream of logged changes** +* A set of **new row versions** +* A **temporary ownership of a connection** + +Other transactions never see half-finished work.