leetcode/DB_questions.txt at main · amitdev101/leetcode · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91

1. Optimistic vs pessimistic locking
2. locking in db avoid dirty reads
3. Transaction Isolation Levels in DBMS
5. why db stores data in B+ trees
6. SQL vs NoSQL difference
7. what is ACID in database
8. Isolation levels in db
9. what is phantom read and how to prevent it
10. Why NoSQL is more scalable than SQL?
11. https://softwareengineering.stackexchange.com/questions/194340/why-are-nosql-databases-more-scalable-than-sql
12. how many masters and replicas do redis cluster maintain?


1. Optimistic vs pessimistic locking
    Optimistic Locking is a strategy where you read a record, take note of a version number (other methods to do this involve dates, timestamps or checksums/hashes) and check that the version hasn't changed before you write the record back. When you write the record back you filter the update on the version to make sure it's atomic. (i.e. hasn't been updated between when you check the version and write the record to the disk) and update the version in one hit.
    If the record is dirty (i.e. different version to yours) you abort the transaction and the user can re-start it.
    This strategy is most applicable to high-volume systems and three-tier architectures where you do not necessarily maintain a connection to the database for your session. In this situation the client cannot actually maintain database locks as the connections are taken from a pool and you may not be using the same connection from one access to the next.
    Pessimistic Locking is when you lock the record for your exclusive use until you have finished with it. It has much better integrity than optimistic locking but requires you to be careful with your application design to avoid Deadlocks. To use pessimistic locking you need either a direct connection to the database (as would typically be the case in a two tier client server application) or an externally available transaction ID that can be used independently of the connection.
    In the latter case you open the transaction with the TxID and then reconnect using that ID. The DBMS maintains the locks and allows you to pick the session back up through the TxID. This is how distributed transactions using two-phase commit protocols (such as XA or COM+ Transactions) work.

5. why db stores data in B+ trees
    Relational databases often use B+ trees as a data structure for storing data because they offer efficient search and retrieval capabilities for large datasets. B+ trees are a type of self-balancing tree structure that allows for fast retrieval of data by using an index to locate the relevant data quickly.
    B+ trees have a number of advantages for database storage:
        Balanced Tree: B+ trees are self-balancing, which means that even if the size of the tree changes due to insertion or deletion of elements, the tree remains balanced, and the search time is not affected.
        Large Fanout: B+ trees have a large fanout, which means that they can hold a large number of keys in each node. This results in fewer levels of the tree and faster search times.
        Sequential Access: B+ trees are optimized for sequential access, making them well-suited for range queries or scans of large amounts of data.
        Disk Access Optimization: B+ trees have a special structure that optimizes disk access. The tree is designed in such a way that all leaf nodes are at the same depth, which ensures that a search will require only a fixed number of disk accesses.
    Because of these advantages, B+ trees are widely used in relational databases as a way to store data efficiently and ensure fast retrieval times for queries.

6. SQL vs NoSQL difference
    SQL and NoSQL are two different types of database management systems, each with its own strengths and weaknesses. Here are some key differences between them:
    Data structure: SQL databases use a structured data model, where data is stored in tables with predefined columns and data types, and relationships between tables are established using foreign keys. NoSQL databases, on the other hand, can use various data models, including document-based, key-value, graph-based, or column-based.
    Scalability: SQL databases are vertically scalable, which means that to increase their capacity, you need to add more resources (CPU, memory, storage) to a single server. NoSQL databases, on the other hand, are horizontally scalable, which means that to increase their capacity, you need to add more servers to a cluster.
    Flexibility: SQL databases have a rigid schema that defines the structure of the data, and any changes to the schema require altering the database structure and migrating the data. NoSQL databases, on the other hand, are schema-less or schema-flexible, which allows for more agile development and easier handling of unstructured or semi-structured data.
    Query language: SQL databases use a standardized query language (SQL) that is optimized for relational data models and provides powerful querying capabilities, such as filtering, sorting, grouping, joining, and aggregating data. NoSQL databases often have their own query languages or APIs that are optimized for their specific data models and use cases.
    ACID compliance: SQL databases are generally ACID compliant, which means that they provide guarantees of data consistency, integrity, and durability in transactions. NoSQL databases, on the other hand, often prioritize performance and scalability over strict consistency, and may not provide full ACID compliance.
    In summary, SQL databases are better suited for structured data and complex queries, while NoSQL databases are better suited for unstructured or semi-structured data, horizontal scalability, and flexible data models. However, the choice between the two ultimately depends on the specific requirements of the project or application.

7. what is ACID in database
    ACID is an acronym that stands for Atomicity, Consistency, Isolation, and Durability, which are the four properties that ensure the reliability, consistency, and accuracy of data in a database system.
    Atomicity: This property ensures that a transaction is treated as a single, indivisible unit of work. It means that either all the operations in a transaction are executed successfully or none of them are. If a failure occurs during the execution of a transaction, the database system must roll back the changes made to the database, so that the data remains in a consistent state.
    Consistency: This property ensures that the data in the database always satisfies a set of predefined rules, constraints, and relationships. It means that every transaction must maintain the integrity and validity of the data, and if a transaction violates any constraint or rule, it must be rolled back.
    Isolation: This property ensures that concurrent transactions do not interfere with each other's execution and that they are executed in isolation. It means that transactions must be executed in a way that their intermediate states are not visible to other transactions until they are committed.
    Durability: This property ensures that once a transaction is committed, its changes are permanent and will survive any subsequent failures, such as power outages, crashes, or network errors. It means that the changes made to the database must be recorded in a durable storage medium, such as a hard disk or a solid-state drive.
    Together, these four properties form the basis of a reliable and consistent database system that can handle multiple transactions concurrently and recover from failures without losing data or violating data integrity. ACID compliance is essential for applications that require a high degree of data accuracy, such as financial systems, e-commerce sites, and healthcare applications.

8. Isolation levels in db
    Isolation level in a database is a property that determines how transactions interact with each other and how the database ensures consistency and correctness of the data in a multi-user environment. There are several isolation levels defined in most database systems, each with its own set of guarantees, trade-offs, and performance characteristics. Here are the four most common isolation levels:
        Read Uncommitted: This is the lowest isolation level, where transactions can read data that has been modified but not yet committed by other transactions, leading to dirty reads. This level provides the highest level of concurrency, but it can also result in inconsistent or incorrect data.
        Read Committed: This level ensures that a transaction can only read data that has been committed by other transactions, avoiding dirty reads. However, it still allows non-repeatable reads, where the same query can return different results within the same transaction if other transactions modify the data. This level is suitable for applications that require moderate consistency and concurrency.
        Repeatable Read: This level ensures that a transaction can read the same data consistently throughout its execution, avoiding non-repeatable reads. It also prevents phantom reads, where a query returns new rows that are added by other transactions during the same transaction. However, it still allows write skews, where two transactions can update the same row concurrently, leading to conflicting changes. This level is suitable for applications that require higher consistency than read committed level.
        Serializable: This is the highest isolation level, where transactions are executed as if they were running serially, with no overlapping effects. It ensures the strongest consistency guarantees by preventing all anomalies, including dirty reads, non-repeatable reads, phantom reads, and write skews. However, it can also cause a high degree of contention, blocking, and reduced concurrency, making it suitable for only specific use cases that require high consistency.
    It's important to note that higher isolation levels come at the cost of reduced concurrency, increased contention, and decreased performance, so it's necessary to choose the appropriate isolation level based on the specific requirements and trade-offs of the application or use case.

9. what is phantom read and how to prevent it.
    A phantom read is a phenomenon that can occur in databases when a transaction reads data that appears to be consistent with a previous query but includes new rows that weren't there before. This can happen when two transactions are accessing the same data simultaneously, and one transaction inserts new rows while the other transaction is reading the same data.
    Phantom reads occur because of a transaction isolation level known as "repeatable read." This isolation level ensures that if a transaction reads a set of rows at a certain point in time, it will always see those same rows until the transaction is complete. However, it doesn't prevent new rows from being added to the table during the transaction, which can lead to phantom reads.
    For example, suppose Transaction A queries a table and returns a set of rows. Then, before Transaction A completes, Transaction B inserts new rows that match the criteria of Transaction A's query. If Transaction A queries the same table again, it will see the new rows that were inserted by Transaction B, even though they weren't present during the first query.
    To avoid phantom reads, a higher isolation level can be used, such as serializable isolation, which prevents any concurrent transactions from modifying the data being accessed. However, this can result in decreased performance and increased blocking of concurrent transactions.

    Yes, there are several ways to prevent phantom reads in addition to using a higher isolation level or a locking mechanism. Here are a few examples:
    Use optimistic concurrency control: This approach involves adding a version number or timestamp to each row in the database, and checking this value before updating or deleting the row. If the value has changed since the row was read, the transaction will fail, preventing phantom reads. This approach can be useful in scenarios where conflicts are infrequent and locking overhead is not desirable.
    Use snapshot isolation: This is a type of transaction isolation that allows each transaction to see a consistent snapshot of the database at the start of the transaction. This ensures that the transaction will not see any changes made by concurrent transactions during its execution, preventing phantom reads. Snapshot isolation can be useful in scenarios where read-heavy workloads are common.
    Use index-range locking: This approach involves locking a range of values in an index, rather than locking individual rows. This can prevent phantom reads, as any changes made to the index within the locked range will be visible to the transaction. However, index-range locking can lead to increased contention and decreased performance in scenarios with high concurrency.
    Use a different database architecture: In some cases, changing the underlying database architecture can help prevent phantom reads. For example, using a distributed database with strong consistency guarantees can ensure that transactions see a consistent view of the data across all nodes, preventing phantom reads.
    Overall, the best approach to preventing phantom reads will depend on the specific requirements of your application and the underlying database technology being used.


10. Why NoSQL is more scalable than SQL?
    Yes, that is correct. LSM (Log-Structured Merge) trees are designed to be better suited for write-heavy workloads, while B-trees are more suited for read-heavy workloads.

    In an LSM tree, data is first written to a log, which is an append-only data structure. This allows for fast writes and eliminates the need for expensive index updates that are required in a B-tree. However, since data is not indexed immediately, reads may require scanning multiple levels of the LSM tree to find the required data, which can be slower than in a B-tree.

    To address this issue, LSM trees periodically merge the log and the indexed data, creating a new sorted index. This process converts random writes into sequential writes, which can be more efficiently written to disk. While this periodic merging can result in occasional spikes in write performance, on average, an LSM tree will perform better for write-heavy workloads.

    In contrast, B-trees provide fast access to indexed data, making them more suited for read-heavy workloads. B-trees maintain a balanced tree structure and perform index updates immediately upon write operations, ensuring that data is always indexed and available for fast access. However, this comes at the cost of more expensive write operations, which can slow down write-heavy workloads.

    Overall, the choice between LSM trees and B-trees depends on the specific requirements of the application and the workload. LSM trees are a good choice for write-heavy workloads where fast writes are critical, while B-trees are a good choice for read-heavy workloads where fast access to indexed data is important.


11. how many masters and replicas do redis cluster maintain?
    In Redis Cluster, each shard (also known as a slot) has a master and one or more replicas. By default, Redis Cluster creates one replica for each master, but it is also possible to configure a different number of replicas per master.

    When a master fails, one of its replicas is promoted to the new master, and a new replica is created for the new master. Redis Cluster uses a consensus protocol called Raft to ensure that the failover process is reliable and consistent.

    The number of masters and replicas in a Redis Cluster depends on the number of nodes in the cluster and the number of hash slots that are configured. Each node can be responsible for one or more hash slots, and each hash slot can have one master and one or more replicas.

    For example, if you have a Redis Cluster with 6 nodes, and you configure 16384 hash slots, each node will be responsible for approximately 2730 hash slots. If you have one master and one replica per hash slot, you will have a total of 16384 masters and 16384 replicas in the cluster.

    However, it's important to note that having too many replicas can have a negative impact on the performance of the cluster, as each replica increases the amount of network traffic and memory usage. Therefore, it's important to balance the number of replicas with the need for high availability and fault tolerance.