A fault is when one component of a system deviates from its spec, while a failure is when the system as a whole no longer provides the required service to the user.
Better to design fault-tolerant systems because we can’t reduce the probability of a fault to zero.
In fault-tolerant systems, one should trigger faults deliberately in order to test the tolerance, e.g. randomly killing individual processes.
However, not all things can be tolerated. For instance, if an attacker gains access to sensitive data, then the damage can’t be undone. In that case, aim for prevention.
Hardware Faults
Hard disks' mean time to failure (MTTF) is about 10 to 50 years. On a storage cluster with 10k disks, we should expect one disk to die per day.
Redundancy of hardware components alone is sufficient as long as backups can be restored quickly to new machines and the occasional downtime can be tolerated.
But modern applications need far more data and compute (think of AWS-scale deployments), and the rate of hardware faults grows proportionally with the number of machines.
We’re now moving to software fault-tolerance. Systems are built such that they can tolerate the loss of entire machines without downtime.
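The two ideas above, deliberately injecting faults (e.g. killing individual processes) and tolerating the loss of whole machines, can be checked against a simulated replicated service. The following is an added example written for these notes, not code from the book or any real system; the replica count, the quorum rule, and the "kill then repair" loop are constructed assumptions. Killing a minority of replicas each round is a fault the service absorbs; killing a majority is a failure, and the run counts how often each happens.

```python
"""Fault injection against a toy replicated store (constructed example)."""
import random
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    alive: bool = True
    data: dict = field(default_factory=dict)


class ReplicatedStore:
    """Key-value store replicated over N nodes.

    A write goes to every live replica; reads and writes both require a
    majority (quorum) of live replicas. Losing a minority of nodes is a
    fault the system tolerates; losing a majority is a failure.
    """

    def __init__(self, names):
        self.replicas = {n: Replica(n) for n in names}

    @property
    def quorum(self):
        return len(self.replicas) // 2 + 1

    def live(self):
        return [r for r in self.replicas.values() if r.alive]

    def put(self, key, value):
        live = self.live()
        if len(live) < self.quorum:
            raise RuntimeError("failure: quorum lost, write rejected")
        for r in live:
            r.data[key] = value

    def get(self, key):
        live = self.live()
        if len(live) < self.quorum:
            raise RuntimeError("failure: quorum lost, read rejected")
        # Majority vote so a stale, just-restarted replica cannot win.
        votes = Counter(r.data.get(key) for r in live)
        return votes.most_common(1)[0][0]

    def kill(self, name):
        """Inject a fault: the replica stops responding (deviates from spec)."""
        self.replicas[name].alive = False

    def restart(self, name):
        """Repair: bring the replica back empty and catch it up from a live peer."""
        r = self.replicas[name]
        r.alive = True
        r.data = {}
        peers = [p for p in self.live() if p is not r and p.data]
        if peers:
            r.data = dict(peers[0].data)


def chaos_run(rounds, n_replicas=3, kills_per_round=1, seed=0):
    """Each round: kill `kills_per_round` random replicas, exercise the
    service, then repair them. Returns the number of rounds in which the
    service was unavailable (a failure) rather than merely faulty."""
    rng = random.Random(seed)
    store = ReplicatedStore([f"replica-{i}" for i in range(n_replicas)])
    store.put("counter", 0)
    failures = 0
    for i in range(1, rounds + 1):
        victims = rng.sample(list(store.replicas), kills_per_round)
        for v in victims:
            store.kill(v)
        try:
            store.put("counter", i)
            if store.get("counter") != i:
                failures += 1
        except RuntimeError:
            failures += 1
        for v in victims:
            store.restart(v)
    return failures


if __name__ == "__main__":
    print("3 replicas, 1 killed per round:", chaos_run(100), "failures")
    print("3 replicas, 2 killed per round:", chaos_run(100, kills_per_round=2), "failures")
    print("5 replicas, 2 killed per round:", chaos_run(100, n_replicas=5, kills_per_round=2), "failures")
```

The point of the loop is the one the notes make about testing tolerance: the fault injector is part of the test, and the measurable outcome is how many injected faults turned into failures, which for a correctly sized quorum should be zero.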
Software Errors
Software errors are notorious because they are correlated across nodes, e.g. a Linux kernel bug that caused many applications to hang simultaneously on the leap second of June 30, 2012.
They are hard to trace because the software relies on some assumption about its environment that is usually, but not always, true.
Human Errors
Make it easy to do the right thing in the abstractions, APIs and admin interfaces.
Provide fully-featured non-production sandboxes for people to experiment in.
Test thoroughly at all levels - unit tests to integration tests to manual tests.
Allow quick and easy recovery from human errors.
Set up detailed and clear monitoring, e.g. performance metrics and error rates.
Reminds me of graph connectivity. Suppose a successful system means that there exists some path \(v_1 \rightsquigarrow v_n\). Tolerance means that removing some node \(v_i\) with \(i \notin \{1, n\}\) still leaves some \(v_1 \rightsquigarrow v_n\). Suddenly, COS 423 starts to make practical sense.
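The connectivity view can be checked mechanically. The following is an added example, not part of the original notes: given a directed graph as an adjacency dict (a constructed format), it lists the intermediate nodes whose removal cuts every \(v_1 \rightsquigarrow v_n\) path (single points of failure) and, generalising the one-node case in the text, tests whether the path survives the removal of any set of up to \(k\) intermediate nodes.

```python
"""Path tolerance to node removal on a small directed graph (constructed example)."""
from collections import deque
from itertools import combinations


def reachable(adj, src, dst, removed=frozenset()):
    """Breadth-first search from src to dst, treating nodes in `removed` as gone."""
    if src in removed or dst in removed:
        return False
    seen = {src}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return True
        for w in adj.get(u, ()):
            if w not in seen and w not in removed:
                seen.add(w)
                queue.append(w)
    return False


def intermediate_nodes(adj, src, dst):
    nodes = set(adj) | {w for ws in adj.values() for w in ws}
    return sorted(nodes - {src, dst})


def single_points_of_failure(adj, src, dst):
    """Intermediate nodes v_i whose removal alone breaks every src ~> dst path."""
    return [v for v in intermediate_nodes(adj, src, dst)
            if not reachable(adj, src, dst, removed=frozenset({v}))]


def tolerates_k_faults(adj, src, dst, k):
    """True iff src ~> dst survives the removal of every set of at most k
    intermediate nodes. Exhaustive, so only for small graphs."""
    inter = intermediate_nodes(adj, src, dst)
    for size in range(1, k + 1):
        for combo in combinations(inter, size):
            if not reachable(adj, src, dst, removed=frozenset(combo)):
                return False
    return True


# Constructed sample graphs: a chain with one bottleneck, and a diamond.
CHAIN = {"v1": ["v2"], "v2": ["v3"]}
DIAMOND = {"v1": ["a", "b"], "a": ["v4"], "b": ["v4"]}

if __name__ == "__main__":
    print("chain SPOFs:", single_points_of_failure(CHAIN, "v1", "v3"))
    print("diamond SPOFs:", single_points_of_failure(DIAMOND, "v1", "v4"))
    print("diamond tolerates 1 fault:", tolerates_k_faults(DIAMOND, "v1", "v4", 1))
    print("diamond tolerates 2 faults:", tolerates_k_faults(DIAMOND, "v1", "v4", 2))
```

The exhaustive check is exponential in the number of intermediate nodes; for a real topology the same quantity (the minimum number of nodes whose removal separates \(v_1\) from \(v_n\)) is computed as the vertex connectivity via max-flow, which by Menger's theorem equals the number of vertex-disjoint paths.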