Why Write Tests
The later a bug is caught in the development cycle, the more expensive it is to fix it. A good test today is a future debugging session saved.
The test is the first client of your code. It uncovers sub-optimal design choices, tight couplings, missed cases, etc.
Some test failures, e.g. flaky tests, are hard to debug and slow down development.
As a code base grows, a poorly written test suite becomes counter-productive, e.g. through instability and slowness.
As the team size grows, programmer ability alone is insufficient to avoid defects. For instance, a 100-person team whose developers each write only a single bug per month still produces about 5 new bugs every workday (100 bugs spread over roughly 20 working days).
Automated testing turns the team members' collective wisdom into a benefit for the entire team. Everyone can run the test and will benefit when it detects an issue.
Manual testing (e.g. a QA department) doesn’t scale well, especially when the software spans multiple libraries/frameworks, platforms, and user configurations, and releases multiple times a day. However, some testing, e.g. judging the quality of search results, involves qualitative human judgment. Humans are also better at searching for complex security vulnerabilities; once the flaw is understood, a check can be added to an automated security testing system.
Expressing tests as code instead of a series of manual steps allows us to run the tests each time the code changes, and modularize the tests to be executed in various environments.
Tests work best as documentation only if the tests are kept clear and concise.
Tests help simplify reviews. A code reviewer need not mentally walk through scenarios to check code correctness, edge cases, and error conditions.
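As a sketch of tests doubling as documentation, here is a pytest-style test (a plain function with bare asserts) for a hypothetical `parse_duration` helper; the test name and body state the contract without the reader having to open the implementation:

```python
# parse_duration is a toy helper invented for illustration.
def parse_duration(text: str) -> int:
    """Parse strings like '2m30s' into a number of seconds."""
    seconds, number = 0, ""
    for ch in text:
        if ch.isdigit():
            number += ch
        elif ch == "m":
            seconds += int(number) * 60
            number = ""
        elif ch == "s":
            seconds += int(number)
            number = ""
        else:
            raise ValueError(f"unexpected character: {ch!r}")
    return seconds

# A clear test reads as a spec: scenario in the name, input -> expectation in the body.
def test_parse_duration_handles_minutes_and_seconds():
    assert parse_duration("2m30s") == 150
```

A reviewer can confirm the edge cases from the test alone, which is the sense in which a concise test simplifies review.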
The reason for a lack of automated tests may be more nuanced, e.g. pain from legacy tests, or an early stage where the focus is validating the idea.
Designing a Test Suite
A test suite is a collection of tests that is run to verify that a feature is working correctly.
My opinions on what makes a good test suite are influenced by my experiences with Google Test (C++), Jest (TS, JS), Mocha (TS, JS), PyTest (Python), and HUnit (Haskell).
Each programming language has competing frameworks for writing tests. The languages themselves seldom come with adequate built-in frameworks. Learning the specific framework becomes yet another thing, but frameworks share some commonalities, e.g. pre-test setup functions, hierarchical tests, etc.
There are 3 modes of testing: unit tests, integration tests, and black-box tests. The first two are aware of source code details, while the third is performed on an already-compiled program.
The size of a test refers to the resources required to run it, e.g. memory, processes, time, etc. The scope of a test refers to the specific code paths that are verified (and not just executed!) in the test.
A taxonomy based on size (rather than “unit”, “integration”, etc.) is more useful because the speed and determinism of the test suite matter, regardless of the scope of the test. Sources of test slowness and/or non-determinism (flakiness) include blocking calls (e.g. sleep), clock time, thread scheduling, network access and latency, disk access, and third-party processes.
Small tests must run in the same process as the code being tested. Any network or disk access must go to a hermetic in-memory implementation. Medium tests must be contained within a single machine, but can span multiple processes (e.g. a real database instance), use threads, make blocking calls, and make network calls to localhost. Large tests do not have these restrictions and can span multiple machines. Tooling can be configured to enforce the size constraints, e.g. a custom test runner that fails small tests that attempt to establish a network connection.
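The enforcement idea can be sketched in-process: temporarily replace `socket.socket` so any network attempt inside a small test fails loudly. Real runners do this more robustly (sandboxing, seccomp, etc.); this is a minimal illustration, and `forbid_network` is a name invented here.

```python
import socket

class NetworkAccessError(RuntimeError):
    """Raised when a small test tries to touch the network."""

def forbid_network(test_fn):
    """Decorator that fails the wrapped test if it opens a socket."""
    def wrapper(*args, **kwargs):
        real_socket = socket.socket
        def guard(*a, **kw):
            raise NetworkAccessError("small tests may not touch the network")
        socket.socket = guard  # intercept socket creation for the test's duration
        try:
            return test_fn(*args, **kwargs)
        finally:
            socket.socket = real_socket  # always restore the real implementation
    return wrapper

@forbid_network
def test_pure_logic_is_fine():
    assert sum([1, 2, 3]) == 6  # no I/O, so the guard never fires
```

A test that called `socket.socket()` under the decorator would fail with `NetworkAccessError`, turning an accidental size violation into an explicit test failure.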
Narrow-scoped tests (unit tests) validate the logic in an individual class or method. Medium-scoped tests (integration tests) verify interactions between a small number of components, e.g. server and database. Large-scoped tests (functional tests, end-to-end tests, system tests) validate interaction of several distinct parts of the system, and emergent behaviors that aren’t expressed in a single class or method.
Ideally, most of the tests should be narrow-scoped unit tests that validate business logic, followed by medium-scoped integration tests, and finally a few large-scoped end-to-end tests. Suites with many end-to-end tests but few integration and unit tests are a symptom of rushing to production without addressing testing debt. Suites with many end-to-end tests and unit tests but few integration tests may suffer from tightly coupled components.
A good test suite should support running a subset of tests (multiple times if need be), hooks for setting up and tearing down a test case, independence of various test cases (the order of running the tests shouldn’t matter), and temporarily disabling tests.
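Most frameworks provide these features out of the box; a minimal stdlib `unittest` sketch (with a hypothetical `InventoryTest` case) shows per-test setup/teardown hooks, order independence via fresh per-test state, and temporary disabling. Subsets can be selected from the command line, e.g. `python -m unittest module.InventoryTest.test_add`.

```python
import unittest

class InventoryTest(unittest.TestCase):
    def setUp(self):
        # Runs before every test; fresh state keeps cases independent
        # of execution order.
        self.items = {}

    def tearDown(self):
        # Runs after every test, even on failure.
        self.items.clear()

    def test_add(self):
        self.items["apples"] = 3
        self.assertEqual(self.items["apples"], 3)

    @unittest.skip("temporarily disabled while a rework lands")
    def test_pricing(self):
        ...
```

The skip decorator records the test as skipped in the report rather than silently deleting it, which preserves the reminder to re-enable it.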
A test should be obvious upon introspection. There are no tests for the tests themselves, so avoid complex test flows, e.g. conditionals and loops.
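One way to keep test flows obvious is to replace hand-rolled loops and branches with a declarative table of cases that the framework iterates and reports individually. pytest offers `@pytest.mark.parametrize`; the stdlib analogue, sketched here with a toy `is_even` function, is `unittest`'s `subTest`:

```python
import unittest

def is_even(n: int) -> bool:
    return n % 2 == 0

class IsEvenTest(unittest.TestCase):
    def test_is_even(self):
        # The test "logic" is just data: (input, expected) pairs.
        cases = [
            (0, True),
            (1, False),
            (2, True),
            (-3, False),
        ]
        for n, expected in cases:
            with self.subTest(n=n):  # each row is reported as its own case
                self.assertEqual(is_even(n), expected)
```

A failure names the offending row (e.g. `n=-3`) instead of aborting the whole loop, so the table stays readable without any conditional logic inside the test.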
The test suite should support analyses such as: identifying flaky tests, generating code coverage reports, surfacing troubleshooting info (e.g. logs, repro commands), etc.
Flaky tests are expensive. If each test has a 0.1% chance of failing when it should not, and one runs 10,000 tests per day, then there will be about 10 flakes to investigate each day. As flakiness increases past roughly 1%, devs stop reacting to test failures.
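The arithmetic behind that estimate, plus a related consequence worth noting: at that flake rate, a full run of the suite almost never comes back entirely green.

```python
flake_rate = 0.001       # 0.1% chance a passing test falsely fails
runs_per_day = 10_000    # test executions per day

# Expected number of false failures (flakes) to triage per day.
expected_flakes = flake_rate * runs_per_day   # 10.0

# Probability that a single run of all 10,000 tests contains at least
# one flake, assuming independent failures: 1 - (1 - p)^n.
p_at_least_one = 1 - (1 - flake_rate) ** runs_per_day  # very close to 1
```

This is why retries and systematic deflaking matter: without them, an "all green" signal from a large suite is nearly unattainable even when the code is correct.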
Attaining 100% test coverage, especially covering all possible code paths, is difficult in large code bases. Are there tools that help developers be more judicious as to which code paths are tested?
Code coverage can flag untested code, but it shouldn’t be used as a proxy for how well a system is tested. Code coverage measures code that was invoked, not code that was validated. Coverage should be computed from small tests only, to avoid coverage inflation from larger tests. Furthermore, devs tend to treat coverage targets as ceilings rather than floors, i.e. they stop adding tests as soon as the coverage threshold is met. Instead of relying on coverage numbers, assess your confidence that everything your customers expect to work will work, given that the tests are passing.
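The invoked-versus-validated gap is easy to demonstrate: both tests below give a toy `discount_cents` function 100% line coverage, but only the second would catch a bug in the computation.

```python
# discount_cents is a toy function invented for illustration; integer
# cents avoid floating-point rounding concerns.
def discount_cents(price_cents: int, percent: int) -> int:
    return price_cents * (100 - percent) // 100

def test_invokes_but_validates_nothing():
    discount_cents(10_000, 20)  # executes every line, asserts nothing

def test_validates_the_behavior():
    assert discount_cents(10_000, 20) == 8_000  # 20% off $100.00 is $80.00
```

A coverage report scores the two tests identically, which is exactly why the number can't stand in for confidence.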
Choosing what to test is sometimes not obvious. Some tests are almost tautological or paranoid.
The Beyoncé Rule: If you liked it, then you shoulda put a test on it. Everything that you don’t want to break should be covered by a test, e.g. performance, behavioral correctness, accessibility, security, etc.
Sometimes it’s hard to get a 100% true representation of what the end-user sees. Tests frequently mock out parts of the system under test.
Brittle tests - those that over-specify expected outcomes or rely on extensive and complicated boilerplate - can resist change. The misuse of mock objects is a prolific source of brittleness.
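A common shape of mock-induced brittleness, sketched with hypothetical classes: the first test pins the exact call sequence and payload, so any benign refactor (batching, caching, reshaping the write) breaks it even when behavior is still correct; the second asserts on observable state via a lightweight fake.

```python
from unittest import mock

class UserStore:
    """Hypothetical component under test."""
    def __init__(self, db):
        self.db = db

    def rename(self, user_id, name):
        self.db.write(user_id, {"name": name})

def test_brittle_overspecified_interaction():
    db = mock.Mock()
    UserStore(db).rename(42, "Ada")
    # Over-specifies *how* the work is done: breaks if rename() ever
    # batches writes or changes the payload shape.
    db.write.assert_called_once_with(42, {"name": "Ada"})

class FakeDb:
    """In-memory fake: lets the test assert on state, not call sequences."""
    def __init__(self):
        self.rows = {}
    def write(self, key, value):
        self.rows[key] = value

def test_resilient_state_based():
    db = FakeDb()
    UserStore(db).rename(42, "Ada")
    # Asserts *what* the user observes; survives internal refactors.
    assert db.rows[42]["name"] == "Ada"
```

Preferring fakes and state assertions over interaction assertions is one practical way to keep tests from over-specifying outcomes.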
Running tests before checking in code usually relies on a hosted CI service like Travis CI. Some projects, e.g. personal ones, may not have funds for such services.
Improving Testing Culture
Learn how to spot and use momentum from Watershed moments. In 2005, the Google Web Server project lead instituted a policy of engineer-driven, automated testing, e.g. new code had to have tests, and tests would be run continuously. Within a year of the policy, despite record numbers of new changes, the number of emergency pushes dropped by half. However, instead of issuing similar mandates to the rest of Google, the Testing Grouplet was focused on demonstrating success, in the belief that successful ideas would spread.
During orientation, new engineers were taught about testing as if it were standard practice at Google. The new engineers would quickly outnumber existing team members, and helped bring about cultural change.
A “Test Certified” program defined levels 1 to 5, with each level having goals achievable in a quarter. Level 1: set up a continuous build; start tracking code coverage; classify all tests as small, medium, or large; identify flaky tests; create a set of fast tests. Level 5: all tests automated; fast tests run before every commit; all non-determinism removed; every behavior covered. An internal dashboard showing the level of each team sparked competition. “Test Certified” was later replaced with an automated approach: Project Health (pH).
Testing on the Toilet (TotT). Weekly flyers on how to improve testing were placed in restroom stalls. An email newsletter would have been lost in the noise. TotT’s reception was polarized (e.g. invasion of personal space), but the uproar subsided. The Google Testing Blog has online versions of the flyers.
- Software Engineering at Google: Lessons Learned From Programming Over Time. Ch 11: Testing Overview. Adam Bender; Tom Manshreck. abseil.io. Feb 28, 2020. ISBN: 9781492082743.
- 7 Absolute Truths I Unlearned as Junior Developer. Monica Lent. monicalent.com. Jun 3, 2019. Accessed Jun 6, 2022.
- Absolute truths I unlearned as junior developer (2019) | Hacker News. news.ycombinator.com. Accessed Jun 6, 2022.
- Google Testing Blog: TotT. testing.googleblog.com. Accessed Jun 13, 2022.
Sometimes I do the opposite: be meticulous when reviewing code, and then blaze through the test files assuming that if the tests are wrong, then they’d fail in CI. I need to get better at reviewing test files, and identifying holes in tests.