Distributed Systems

Dated Jul 19, 2025; last modified on Sat, 19 Jul 2025

Jun 6, 2026	»	Distributing Work 3 min; updated Jul 16, 2026 Sharding Sharding comes up when a single database won’t work (e.g., it hits storage, write throughput, or read throughput limits) and you need to split your data across multiple independent servers. For a user-centric social media app, sharding by `user_id` means all of a user’s posts, likes, and comments live on one shard. User-scoped queries are fast, but queries like “trending posts across all users” become expensive because they must hit every shard and aggregate the results. Mutating transactions need to account for the distributed nature of the shards, making them complex and slow. ...
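A minimal sketch of the `user_id` sharding trade-off described in the excerpt above, assuming in-memory dictionaries stand in for independent servers. The names `NUM_SHARDS`, `shard_for`, and `ShardedPosts` are hypothetical and only illustrate the idea; a real deployment would also have to handle resharding and replication.

```python
import hashlib
import heapq

NUM_SHARDS = 4  # hypothetical shard count, chosen only for illustration


def shard_for(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user_id to a shard index using a stable hash."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedPosts:
    """Each dict in self.shards stands in for one independent database server."""

    def __init__(self, num_shards: int = NUM_SHARDS):
        # Each shard maps user_id -> list of (likes, post) tuples.
        self.shards = [dict() for _ in range(num_shards)]

    def add_post(self, user_id: str, post: str, likes: int) -> None:
        shard = self.shards[shard_for(user_id, len(self.shards))]
        shard.setdefault(user_id, []).append((likes, post))

    def posts_by_user(self, user_id: str) -> list:
        # User-scoped query: exactly one shard is consulted.
        shard = self.shards[shard_for(user_id, len(self.shards))]
        return [post for _, post in shard.get(user_id, [])]

    def trending(self, k: int) -> list:
        # Cross-user query: scatter to every shard, then gather and aggregate.
        candidates = []
        for shard in self.shards:
            for posts in shard.values():
                candidates.extend(posts)
        return [post for _, post in heapq.nlargest(k, candidates)]
```

`posts_by_user` touches a single shard no matter how many shards exist, while `trending` does work proportional to the number of shards, which is the cost the excerpt points out.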
Jun 6, 2026	»	Data Modeling 3 min; updated Jul 14, 2026 Data Modeling Choosing what data to store and how to structure it directly affects performance, scalability, and maintenance. Relational databases are useful when you have structured data with clear relationships and need strong consistency (transaction-based actions, enforcing foreign key constraints). NoSQL databases shine for flexible schemas or when you need to scale horizontally across many servers without complex joins. That said, the overlap between relational DBs and NoSQL ones is substantial, so focus on the DB that you’re using and how it helps solve the problem at hand, e.g., NoSQL DBs can express relationships too, SQL DBs can have JSON columns with flexible schemas, etc. ...
Jul 8, 2026	»	System Design Practice: Metrics Monitoring System 5 min; updated Jul 9, 2026 Functional Requirements FR1: Platform can ingest metrics (CPU, memory, latency, custom counters) from services. FR2: Users can query and visualize metrics on dashboards with filters, aggregations, and time ranges. FR3: Users can define alert rules with thresholds over time windows, e.g., “alert if P99 latency > 500ms for 5min”. FR4: Users can receive notifications when alerts fire, e.g., email, Slack. Scale: 500K servers, 5M metrics per second, ~1GB/s raw ingestion. ...
Nov 27, 2024	»	Consistent Hashing 6 min; updated Jun 6, 2026 The term “consistent hashing” makes me think of hashing without randomization. Why isn’t every hash consistent by definition? For example, a map implementation needs its hashes to be consistent, or else lookups of stored values would be inaccurate. Or is consistent hashing a tradeoff between collision-resistance and speed? Web Caching Web caching was the original motivation for consistent hashing. With a web cache, if a browser requests a URL that is not in the cache, the page is downloaded from the server, and the result is sent to both the browser and the cache. On a second request, the page is served from the cache without contacting the server. ...
Jun 6, 2026	»	Networking Essentials 2 min; updated Jun 6, 2026 Notes At a basic level, understand how services talk to each other and what happens when those connections fail or get slow. For 90% of the use cases, default to HTTP over TCP. WebSockets and Server-Sent Events (SSE) come up when you need real-time updates. SSE is unidirectional - the client makes an initial HTTP request to open the connection, and the server pushes data down that connection. WebSockets handle bidirectional communication where both sides send messages freely. ...
Jun 6, 2026	»	CAP Theorem 1 min; updated Jun 6, 2026 Notes The CAP theorem states that you can only have 2 of 3 properties at once: Consistency: All nodes see the same data. Availability: Every request gets a response. Partition Tolerance: System works even when network connections fail between nodes. In practice, network partitions are unavoidable in a distributed system. Choosing consistency means some nodes will refuse to serve requests rather than return potentially stale data. Choosing availability means different nodes might temporarily have different data. ...
Jun 6, 2026	»	Caching 2 min; updated Jun 6, 2026 Notes For read-heavy applications, storing frequently accessed data in fast memory (e.g., Redis) allows you to skip the DB entirely for some reads. A cache hit on Redis takes ~1ms compared to 20-50ms for a typical DB query, and this speedup adds up over millions of requests. Caching requires a solution for invalidating stale data, e.g., when a user updates their profile in the DB. Strategies include invalidating the cache immediately after writes, using short TTLs, etc. ...
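The two invalidation strategies named in the caching excerpt above (invalidate on write, short TTLs) can be combined in a cache-aside wrapper. This is a rough sketch, assuming plain dictionaries stand in for both Redis and the DB; `CacheAside`, `ttl_seconds`, `clock`, and `db_reads` are hypothetical names introduced only for illustration.

```python
import time


class CacheAside:
    """Cache-aside reads with TTL expiry and invalidate-on-write."""

    def __init__(self, db: dict, ttl_seconds: float = 60.0, clock=time.monotonic):
        self.db = db              # stand-in for the database
        self.ttl = ttl_seconds    # how long a cached entry stays fresh
        self.clock = clock        # injectable so expiry can be tested
        self.cache = {}           # key -> (expires_at, value); stand-in for Redis
        self.db_reads = 0         # counts how often the DB was actually hit

    def get(self, key):
        entry = self.cache.get(key)
        if entry is not None and entry[0] > self.clock():
            return entry[1]       # cache hit: the DB is skipped entirely
        # Cache miss or expired entry: read from the DB and repopulate.
        self.db_reads += 1
        value = self.db[key]
        self.cache[key] = (self.clock() + self.ttl, value)
        return value

    def update(self, key, value) -> None:
        # Write to the DB first, then drop the stale cached copy.
        self.db[key] = value
        self.cache.pop(key, None)
```

Invalidating on write keeps readers of this one cache from seeing the old profile, while the TTL bounds staleness for writes that bypass `update`.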
Jun 6, 2026	»	API Design 1 min; updated Jun 6, 2026 Notes For 90% of interviews, default to REST, which maps resources to URLs and uses HTTP methods to manipulate them, e.g., `POST /events/{id}/bookings` for creating a booking. When returning large result sets, pagination comes into play. Cursor-based pagination works better for real-time data where new items get added frequently. Offset-based pagination is fine for most cases. How does cursor-based pagination beat offset-based pagination for new additions? Aren’t new additions invisible to inflight paginations? ...
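One way to probe the question raised in the API design excerpt above is a small simulation. It assumes a newest-first feed of integer ids and a hypothetical `before_id` cursor; `page_by_offset` and `page_by_cursor` are illustrative helpers, not any particular API.

```python
def page_by_offset(items, offset, limit):
    """Offset-based: skip `offset` rows, then take `limit` rows."""
    return items[offset:offset + limit]


def page_by_cursor(items, before_id, limit):
    """Cursor-based: take `limit` rows strictly older than the last id seen."""
    older = [x for x in items if before_id is None or x < before_id]
    return older[:limit]


# Feed of post ids, newest first.
feed = [5, 4, 3, 2, 1]

first_offset_page = page_by_offset(feed, 0, 2)     # [5, 4]
first_cursor_page = page_by_cursor(feed, None, 2)  # [5, 4]

# A new post arrives between the first and second page requests.
feed.insert(0, 6)

# Offset 2 now points one row earlier than before, so id 4 repeats.
second_offset_page = page_by_offset(feed, 2, 2)    # [4, 3]

# The cursor resumes strictly after id 4, regardless of new rows on top.
second_cursor_page = page_by_cursor(feed, first_cursor_page[-1], 2)  # [3, 2]
```

Under these assumptions the new post stays invisible to both in-flight paginations, but only the offset-based one returns a duplicate, because the insert shifted every row's position.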
May 31, 2026	»	Delivery Framework for System Design Interviews 3 min; updated Jun 6, 2026 Requirements (~5min) Functional Requirements Completions for “Users/Clients should be able to…”, e.g., for Twitter: post tweets, follow other users, and see tweets from users they follow. Keep the list targeted and prioritized (e.g., top 3) as your job is to develop a system that meets those requirements. Non-Functional Requirements Completions for “The system should be…”, e.g., for Twitter: Highly available, prioritizing availability over consistency. Scale to support 100M+ daily active users. Low latency, rendering feeds in under 200ms. ...
Jul 19, 2025	»	Resilient App Development in .NET 6 min; updated Jul 19, 2025 `Microsoft.Extensions.Resilience` and `Microsoft.Extensions.Http.Resilience` provide resilience mechanisms against transient failures. These two packages are built on top of the open-source `Polly` resilience library. Build a Resilience Pipeline Given a `ServiceCollection services`, configure a keyed resilience pipeline as follows: `const string key = "Retry-Timeout"; services.AddResiliencePipeline(key, static builder => { builder.AddRetry(new RetryStrategyOptions { ShouldHandle = new PredicateBuilder().Handle<TimeoutRejectedException>() }); builder.AddTimeout(TimeSpan.FromSeconds(1.5)); });` Other `Add*` extension methods include `AddCircuitBreaker`, `AddRateLimiter`, `AddConcurrencyLimiter`, `AddFallback`, and `AddHedging`. Using `AddResiliencePipeline` separates the pipeline’s definition from its usage points where it’s injected. This allows for convenient unit testing, e.g., supplying `ResiliencePipeline<T>.Empty` for faster and less complicated tests. ...
May 17, 2020	»	Mergeable Replicated Data Types 2 min; updated Jul 19, 2025 On a distributed system, each replica should [eventually] converge to the same state. Commutative Replicated Data Types (CRDTs) can accept updates and achieve consistency without remote synchronization. The Need for Commutativity Say we have a queue \( 1 \to 2 \). Suppose two replicas, \(r_1\) and \(r_2\), independently call `pop()`. Each replica will have \(2\) on its queue. However, on receiving an update that the other replica popped, each replica will call `pop()` to be consistent, thereby deleting \(2\). ...
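The double-pop scenario in the excerpt above can be reproduced in a few lines. The second function shows one possible remedy, removing the named element instead of replaying `pop()`, which assumes queue elements are unique and is not necessarily the approach the post goes on to describe. Both function names are hypothetical.

```python
from collections import deque


def naive_replay():
    """Each replica replays the other's pop() as another pop()."""
    r1, r2 = deque([1, 2]), deque([1, 2])
    r1.popleft()  # r1 pops 1 locally
    r2.popleft()  # r2 pops 1 locally
    r1.popleft()  # r1 replays r2's pop and wrongly removes 2
    r2.popleft()  # r2 replays r1's pop and wrongly removes 2
    return list(r1), list(r2)


def replay_by_element():
    """Each replica is told which element was removed (assumes unique elements)."""
    r1, r2 = deque([1, 2]), deque([1, 2])
    removed_by_r1 = r1.popleft()
    removed_by_r2 = r2.popleft()
    # Removing an already-absent element is a no-op, so concurrent pops
    # of the same element do not delete anything further.
    if removed_by_r2 in r1:
        r1.remove(removed_by_r2)
    if removed_by_r1 in r2:
        r2.remove(removed_by_r1)
    return list(r1), list(r2)
```

`naive_replay()` leaves both replicas empty, matching the excerpt, whereas `replay_by_element()` leaves `2` on both queues because the remote update is idempotent.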
Mar 16, 2017	»	Designing Data-Intensive Applications [Book] (8 items) Relational Model Versus Document Model; Thinking About Data Systems; Designing Data-Intensive Applications [Kleppmann, Martin]; Query Languages for Data; Reliability; Maintainability; Blob Storage; Scalability;