[ToDo] Designing Data-Intensive Applications

Dated Feb 2, 2020; last modified on Sat, 19 Jul 2025

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. Martin Kleppmann. 2017. ISBN: 978-1449373320.

Link to notes

Part I. Foundations of Data Systems

✅ Ch 01. Reliable, Scalable and Maintainable Applications:

  • ✅ Thinking About Data Systems
  • ✅ Reliability: hardware faults; software errors; human errors; importance of reliability.
  • ✅ Scalability: describing load; describing performance; approaches for coping with load.
  • ✅ Maintainability: operability - making life easy for operations; simplicity - managing complexity; evolvability - making change easy.

Ch 02. Data Models and Query Languages:

  • Relational model vs. document model: the birth of NoSQL; the object-relational mismatch; many-to-one and many-to-many relationships; are document databases repeating history; relational vs. document databases today.
  • Query languages for data: declarative queries on the web; map-reduce querying.
  • Graph-like data models: property graphs; the Cypher query language; graph queries in SQL; triple-stores and SPARQL; the foundation - Datalog.

Ch 03. Storage and Retrieval:

  • Data structures that power your database: hash indexes; SSTables and LSM-trees; B-trees; comparing B-trees and LSM-trees; other indexing structures.
  • Transaction processing or analytics: data warehousing; stars and snowflakes - schemas for analytics.
  • Column-oriented storage: column compression; sort order in column storage; writing to column-oriented storage; aggregation - data cubes and materialized views.

Ch 04. Encoding and Evolution:

  • Formats for encoding data: language-specific formats; JSON, XML, and binary variants; Thrift and Protocol Buffers; Avro; merits of schemas.
  • Modes of dataflow: through databases; through REST and RPC services; message-passing dataflow.

Part II. Distributed Data

Ch 05. Replication:

  • Leaders and followers: synchronous vs. asynchronous replication; setting up new followers; handling node outages; implementation of replication logs.
  • Problems with replication lag: reading your own writes; monotonic reads; consistent prefix reads; solutions for replication lag.
  • Multi-leader replication: use cases for multi-leader replication; handling write conflicts; multi-leader replication topologies.
  • Leaderless replication: writing to the database when a node is down; limitations of quorum consistency; sloppy quorums and hinted handoff; detecting concurrent writes.

Ch 06. Partitioning:

  • Partitioning and replication.
  • Partitioning of key-value data: partitioning by key range; partitioning by hash of key; skewed workloads and relieving hot spots.
  • Partitioning and secondary indexes: partitioning secondary indexes by document; partitioning secondary indexes by term.
  • Rebalancing partitions: strategies for rebalancing; automatic vs. manual rebalancing.
  • Request routing: parallel query execution.

Ch 07. Transactions:

  • The slippery concept of a transaction: the meaning of ACID; single-object and multi-object operations.
  • Weak isolation levels: read committed; snapshot isolation and repeatable read; preventing lost updates; write skew and phantoms.
  • Serializability: actual serial execution; two-phase locking (2PL); serializable snapshot isolation (SSI).

Ch 08. The Trouble with Distributed Systems:

  • Faults and partial failures: cloud computing and supercomputing.
  • Unreliable networks: network faults in practice; detecting faults; timeouts and unbounded delays; synchronous vs. asynchronous networks.
  • Unreliable clocks: monotonic vs. time-of-day clocks; clock synchronization and accuracy; relying on synchronized clocks; process pauses.
  • Knowledge, truth, and lies: the truth is defined by the majority; Byzantine faults; system model and reality.

Ch 09. Consistency and Consensus:

  • Consistency guarantees
  • Linearizability: what makes a system linearizable; relying on linearizability; implementing linearizable systems; the cost of linearizability.
  • Ordering guarantees: ordering and causality; sequence number ordering; total order broadcast.
  • Distributed transactions and consensus: atomic commit and two-phase commit (2PC); distributed transactions in practice; fault-tolerant consensus; membership and coordination services.

Part III. Derived Data

Ch 10. Batch Processing:

  • Batch processing with Unix tools: simple log analysis; the Unix philosophy.
  • MapReduce and distributed file systems: MapReduce job execution; reduce-side joins and grouping; map-side joins; the output of batch workflows; comparing Hadoop to distributed databases.
  • Beyond MapReduce: materialization of intermediate state; graphs and iterative processing; high-level APIs and languages.

Ch 11. Stream Processing:

  • Transmitting event streams: messaging systems; partitioned logs.
  • Databases and streams: keeping systems in sync; change data capture; event sourcing; state, streams, and immutability.
  • Processing streams: uses of stream processing; reasoning about time; stream joins; fault tolerance.

Ch 12. The Future of Data Systems:

  • Data integration: combining specialized tools by deriving data; batch and stream processing.
  • Unbundling databases: composing data storage technologies; designing applications around dataflow; observing derived state.
  • Aiming for correctness: the end-to-end argument for databases; enforcing constraints; timeliness and integrity; trust, but verify.
  • Doing the right thing: predictive analytics; privacy and tracking.