Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems.
Martin Kleppmann.
2017.
ISBN: 978-1449373320.
Part I. Foundations of Data Systems
✅ Ch 01. Reliable, Scalable, and Maintainable Applications:
- ✅ Thinking about data systems.
- ✅ Reliability: hardware faults; software errors; human errors; importance of reliability.
- ✅ Scalability: describing load; describing performance; approaches for coping with load.
- ✅ Maintainability: operability - making life easy for operations; simplicity - managing complexity; evolvability - making change easy.
Ch 02. Data Models and Query Languages:
- Relational model vs. document model: the birth of NoSQL; the object-relational mismatch; many-to-one and many-to-many relationships; are document databases repeating history?; relational vs. document databases today.
- Query languages for data: declarative queries on the web; map-reduce querying.
- Graph-like data models: property graphs; the Cypher query language; graph queries in SQL; triple-stores and SPARQL; the foundation - Datalog.
Ch 03. Storage and Retrieval:
- Data structures that power your database: hash indexes; SSTables and LSM-trees; B-trees; comparing B-trees and LSM-trees; other indexing structures.
- Transaction processing or analytics: data warehousing; stars and snowflakes - schemas for analytics.
- Column-oriented storage: column compression; sort order in column storage; writing to column-oriented storage; aggregation - data cubes and materialized views.
Ch 04. Encoding and Evolution:
- Formats for encoding data: language-specific formats; JSON, XML, and binary variants; Thrift and Protocol Buffers; Avro; merits of schemas.
- Modes of dataflow: through databases; through REST and RPC services; message-passing dataflow.
Part II. Distributed Data
Ch 05. Replication:
- Leaders and followers: synchronous vs. asynchronous replication; setting up new followers; handling node outages; implementation of replication logs.
- Problems with replication lag: reading your own writes; monotonic reads; consistent prefix reads; solutions for replication lag.
- Multi-leader replication: use cases for multi-leader replication; handling write conflicts; multi-leader replication topologies.
- Leaderless replication: writing to the database when a node is down; limitations of quorum consistency; sloppy quorums and hinted handoff; detecting concurrent writes.
Ch 06. Partitioning:
- Partitioning and replication.
- Partitioning of key-value data: partitioning by key range; partitioning by hash of key; skewed workloads and relieving hot spots.
- Partitioning and secondary indexes: partitioning secondary indexes by document; partitioning secondary indexes by term.
- Rebalancing partitions: strategies for rebalancing; automatic vs. manual rebalancing.
- Request routing: parallel query execution.
Ch 07. Transactions:
- The slippery concept of a transaction: the meaning of ACID; single-object and multi-object operations.
- Weak isolation levels: read committed; snapshot isolation and repeatable read; preventing lost updates; write skew and phantoms.
- Serializability: actual serial execution; two-phase locking (2PL); serializable snapshot isolation (SSI).
Ch 08. The Trouble with Distributed Systems:
- Faults and partial failures: cloud computing and supercomputing.
- Unreliable networks: network faults in practice; detecting faults; timeouts and unbounded delays; synchronous vs. asynchronous networks.
- Unreliable clocks: monotonic vs. time-of-day clocks; clock synchronization and accuracy; relying on synchronized clocks; process pauses.
- Knowledge, truth, and lies: the truth is defined by the majority; Byzantine faults; system model and reality.
Ch 09. Consistency and Consensus:
- Consistency guarantees.
- Linearizability: what makes a system linearizable; relying on linearizability; implementing linearizable systems; the cost of linearizability.
- Ordering guarantees: ordering and causality; sequence number ordering; total order broadcast.
- Distributed transactions and consensus: atomic commit and two-phase commit (2PC); distributed transactions in practice; fault-tolerant consensus; membership and coordination services.
Part III. Derived Data
Ch 10. Batch Processing:
- Batch processing with Unix tools: simple log analysis; the Unix philosophy.
- MapReduce and distributed file systems: MapReduce job execution; reduce-side joins and grouping; map-side joins; the output of batch workflows; comparing Hadoop to distributed databases.
- Beyond MapReduce: materialization of intermediate state; graphs and iterative processing; high-level APIs and languages.
Ch 11. Stream Processing:
- Transmitting event streams: messaging systems; partitioned logs.
- Databases and streams: keeping systems in sync; change data capture; event sourcing; state, streams, and immutability.
- Processing streams: uses of stream processing; reasoning about time; stream joins; fault tolerance.
Ch 12. The Future of Data Systems:
- Data integration: combining specialized tools by deriving data; batch and stream processing.
- Unbundling databases: composing data storage technologies; designing applications around dataflow; observing derived state.
- Aiming for correctness: the end-to-end argument for databases; enforcing constraints; timeliness and integrity; trust, but verify.
- Doing the right thing: predictive analytics; privacy and tracking.