Consistent hashing makes me think of hashing without randomization. Why isn’t every hash consistent by definition? A map implementation, for example, would seem to need “consistent” hashing, or it would miss stored values when searching for them. Or is consistent hashing a tradeoff between collision resistance and speed?
Web Caching
Web caching was the original motivation for consistent hashing. With a web cache, if a browser requests a URL that is not in the cache, the page is downloaded from the origin server and the result is sent to both the browser and the cache. On a later request for the same URL, the page is served from the cache without contacting the server.
Web caching improves the end user’s response time. It also reduces network traffic overall.
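As a minimal sketch of that hit/miss flow (the in-memory dict and the placeholder `fetch_from_origin` here stand in for a real cache store and HTTP request; both are assumptions for illustration):

```python
cache = {}  # maps URL -> page contents

def fetch_from_origin(url: str) -> str:
    # Placeholder: a real cache would issue an HTTP GET to the origin server here.
    return f"<html>contents of {url}</html>"

def get_page(url: str) -> str:
    if url in cache:               # cache hit: serve without contacting the server
        return cache[url]
    page = fetch_from_origin(url)  # cache miss: download from the origin server
    cache[url] = page              # store the result for later requests
    return page

get_page("www.contoso.com")  # miss: fetched and stored
get_page("www.contoso.com")  # hit: served from the cache
```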
While giving each user their own cache is good, one could go further and implement a web cache that is shared by many users, e.g., all users on a campus network. These users will enjoy many more cache hits. Akamai’s goal was to turn this into a viable technology.
Caching pages for a large number of users requires more storage than is available on one machine. Suppose the shared cache is spread over \(n\) machines; where should we look for a cached copy of www.contoso.com?
Simple Solution Using Hashing
A hash function \(h\) maps elements of a (usually super-big) universe \(U\), such as the set of URLs, to “buckets”, such as 32-bit values. A good \(h\):
- Is easy to compute, e.g., a few arithmetic operations and maybe a mod.
- Behaves like a totally random function, spreading data evenly across the possible buckets without notable correlations.
Designing good hash functions isn’t easy, but it can be treated as a solved problem. In practice, it’s common to use the MD5 hash function.
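For a concrete feel, here is one possible \(h\) built from MD5; keeping only 32 bits of the digest, and using Python’s hashlib, are choices made just for this sketch:

```python
import hashlib

def h(x: str) -> int:
    """Hash a URL (or any string) to a 32-bit bucket via MD5."""
    digest = hashlib.md5(x.encode("utf-8")).digest()
    # Keep the first 4 of the 16 digest bytes as an unsigned 32-bit integer.
    return int.from_bytes(digest[:4], "big")

print(h("www.contoso.com"))  # some value in [0, 2**32)
```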
If we have \(n\) caches named \(\{0, 1, 2, \ldots, n-1\}\), how about storing the web page with URL \(x\) at the cache server named
$$ h(x) \bmod n \;? $$
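A sketch of this assignment rule, reusing the MD5-based \(h\) from above (the server count and example URLs are made up):

```python
import hashlib

def h(x: str) -> int:
    # Same MD5-based 32-bit hash sketched above.
    return int.from_bytes(hashlib.md5(x.encode("utf-8")).digest()[:4], "big")

n = 4  # number of cache servers, named 0, 1, ..., n-1

def server_for(url: str) -> int:
    return h(url) % n

for url in ["www.contoso.com", "www.contoso.com/a", "www.contoso.com/b"]:
    print(url, "-> cache server", server_for(url))
```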
In practice, \(n\) is not fixed: web caches are added or removed as business or network conditions change. With the approach above, changing \(n\) forces almost all objects to relocate; going from \(n\) to \(n+1\) servers, for example, leaves only about a \(1/(n+1)\) fraction of objects on the server they were already on. Moving data between machines across a network is expensive, which makes this a serious problem when \(n\) changes often.
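A quick experiment with the same hypothetical \(h\) illustrates the scale of the reshuffle when a single server is added (the URL list and server counts are made up):

```python
import hashlib

def h(x: str) -> int:
    # Same MD5-based 32-bit hash as in the sketches above.
    return int.from_bytes(hashlib.md5(x.encode("utf-8")).digest()[:4], "big")

urls = [f"www.contoso.com/page/{i}" for i in range(10_000)]

old_n, new_n = 100, 101
moved = sum(1 for u in urls if h(u) % old_n != h(u) % new_n)
print(f"{moved / len(urls):.1%} of objects change servers")
# Typically prints ~99%: only about a 1/new_n fraction of objects stay put.
```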