[ToDo] Designing Data Intensive Applications

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. Martin Kleppmann. 2017. ISBN: 978-1449373320 .

Link to notes

Part I. Foundations of Data Systems

✅ Ch 01. Reliable, Scalable and Maintainable Applications:

✅ Thinking About Data Systems
✅ Reliability: hardware faults; software errors; human errors; importance of reliability.
✅ Scalability: describing load; describing performance; approaches for coping with load.
✅ Maintainability: operability - making life easy for operations; simplicity - managing complexity; evolvability - making change easy.

Ch 02. Data Models and Query Languages:

Relational model vs. document model: the birth of NoSQL; the object-relational mismatch; many-to-one and many-to-many relationships; are document databases repeating history; relational vs. document databases today.
Query languages for data: declarative queries on the web; map-reduce querying.
Graph-like data models: property graphs; the cypher query language; graph queries in SQL; triple-stores and SPARQL; the foundation - datalog.

Ch 03. Storage and Retrieval:

Data structures that power your database: hash indexes; SSTables and LSM-trees; B-trees; comparing B-trees and LSM-trees; other indexing structures.
Transaction Processing or Analytics: data warehousing; stars and snowflakes - schemas for analytics.
Column-oriented storage: column compression; sort order in column storage; writing to column-oriented storage; aggregation - data cubes and materialized views.

Ch 04. Encoding and Evolution:

Formats for encoding data: language-specific formats; JSON, XML, and binary variants; thrift and protocol buffers; Avro; merits of schemas.
Modes of dataflow: through databases; through REST and RPC services; message-passing dataflow.

Part II. Distributed Data

Ch 05. Replication:

Leaders and followers: synchronous vs. asynchronous replication; setting up new followers; handling node outages; implementation of replication logs.
Problems with replication lag: reading your own writes; monotonic reads; consistent prefix reads; solutions for replication lag.
Multi-leader replication: use cases for multi-leader replication; handling write conflicts; multi-leader replication topologies.
Leaderless replication: writing to the database when a node is down; limitations of quorum consistency; sloppy quorums and hinted handoff; detecting concurrent writes.

Ch 06. Partitioning:

Partitioning and replication.
Partitioning of key-value data: partitioning by key range; partitioning by hash of key; skewed workloads and relieving hot spots.
Partitioning and secondary indexes: partitioning secondary indexes by document, partitioning secondary indexes by term.
Rebalancing partitions: strategies for rebalancing; automatic vs. manual rebalancing.
Request routing: parallel query execution.

Ch 07. Transactions:

The slippery concept of a transaction: the meaning of ACID; single-object and multi-object operations.
Weak isolation levels: read committed; snapshot isolation and repeatable read; preventing lost updates; write skew and phantoms.
Serializability: actual serial execution; two-phase locking (2PL); serializable snapshot isolation (SSI).

Ch 08. The Trouble with Distributed Systems:

Faults and partial failures: cloud computing and supercomputing.
Unreliable networks: network faults in practice; detecting faults; timeouts and unbounded delays; synchronous vs. asynchronous networks.
Unreliable clocks: monotonic vs. time-of-day clocks; clocks synchronization and accuracy; relying on synchronized clocks; process pauses.
Knowledge, truth, and lies: the truth is defined by the majority; byzantine faults; system model and reality.

Ch 09. Consistency and Consensus:

Consistency guarantees
Linearizability: what makes a system linearizable; relying on linearizability; implementing linearizable systems; the cost of linearizability.
Ordering guarantees: ordering and causality; sequence number ordering; total order broadcast.
Distributed transactions and consensus: atomic commit and two-phase commit (2PC); distributed transactions in practice; fault-tolerant consensus; membership and coordination services.

Derived Data

Ch 10. Batch Processing:

Batch processing with Unix tools: simple log analysis; the unix philosophy.
MapReduce and distributed file systems: MapReduce job execution; reduce-side joins and grouping; map-side joins; the output of batch workflows; comparing Hadoop to distributed databases.
Beyond MapReduce: materialization of intermediate state; graphs and iterative processing; high-level APIs and languages.

Ch 11. Stream Processing:

Transmitting event streams: messaging systems; partitioned logs.
Databases and streams: keeping systems in sync; change data capture; event sourcing; state, streams, and immutability.
Processing streams: uses of stream processing; reasoning about time; stream joins; fault tolerance.

Ch 12. The Future of Data Systems:

Data integration: combining specialized tools by deriving data; batch and stream processing.
Unbundling databases: composing data storage technologies; designing applications around dataflow; observing derived state.
Aiming for correctness: the end-to-end argument for databases; enforcing constraints; timeliness and integrity; trust, but verify.
Doing the right thing: predictive analytics; privacy and tracking.