Why Write Tests
Some test failures, e.g. flaky tests, are hard to debug and slow down development.
Manual testing (e.g. a QA department) doesn’t scale well, especially when the software spans multiple libraries/frameworks, platforms, user configurations, and intra-day releases. However, some tests, e.g. testing the quality of search results, often involve qualitative human judgment. Humans are also better at searching for complex security vulnerabilities, which can then be added to an automated security testing system once the flaw is understood.
Designing a Test Suite
A test suite is a collection of tests that is run to verify that a feature is working correctly.
My opinions on what makes a good test suite are influenced by my experiences with Google Test (C++), Jest (TS, JS), Mocha (TS, JS), PyTest (Python), and HUnit (Haskell).
Each programming language has competing frameworks for writing tests; the languages themselves seldom come with adequate frameworks. Learning the specific framework becomes yet another task, but frameworks share some commonalities, e.g. pre-test setup functions, hierarchical tests, etc.
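As a minimal sketch, those shared commonalities (a pre-test setup hook and hierarchical grouping) map onto Python’s built-in unittest; the cart example here is invented for illustration:

```python
# Sketch of two features most test frameworks share: a per-test setup hook
# and hierarchical grouping of related tests into a class/suite.
import unittest

class CartTests(unittest.TestCase):  # hierarchical grouping: one suite of tests
    def setUp(self):
        # Setup hook: runs before every test method in this class.
        self.cart = []

    def test_starts_empty(self):
        self.assertEqual(self.cart, [])

    def test_add_item(self):
        self.cart.append("apple")
        self.assertIn("apple", self.cart)

if __name__ == "__main__":
    unittest.main()
```

PyTest, Jest, Mocha, and Google Test all offer equivalents (fixtures, `beforeEach`, test fixtures/classes), so the concepts transfer even when the syntax differs.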
There are three modes of testing: unit tests, integration tests, and black-box tests. The first two are aware of the source code’s details, while the third is performed on an already-built program.
The size of a test refers to the resources required to run it, e.g. memory, processes, time, etc. The scope of a test refers to the specific code paths that are verified (and not just executed!) by the test.
A taxonomy based on size (rather than “unit”, “integration”, etc.) is more useful because speed and determinism of the test suite matter, regardless of the scope of the test. Sources of test slowness and/or non-determinism (flakiness) include blocking calls (e.g. sleep), clock time, thread scheduling, network access and latency, disk access, and third-party processes.
Small tests must run in the same process as the code being tested. Any network or disk access must go to a hermetic in-memory implementation. Medium tests must be contained within a single machine, but can span multiple processes (e.g. a real database instance), use threads, make blocking calls, and make network calls to localhost. Large tests have no such restrictions and can span multiple machines. Tooling can be configured to enforce size constraints, e.g. a custom test runner that fails small tests that attempt to establish a network connection.
Narrow-scoped tests (unit tests) validate the logic in an individual class or method. Medium-scoped tests (integration tests) verify interactions between a small number of components, e.g. server and database. Large-scoped tests (functional tests, end-to-end tests, system tests) validate interaction of several distinct parts of the system, and emergent behaviors that aren’t expressed in a single class or method.
Ideally, most of the tests should be narrow-scoped unit tests that validate business logic, followed by medium-scoped integration tests, and finally a few large-scoped end-to-end tests. Suites with many end-to-end tests but few integration and unit tests are a symptom of rushing to production without addressing testing debt. Suites with many end-to-end and unit tests but few integration tests may suffer from tightly coupled components.
A good test suite should support running a subset of tests (multiple times if need be), hooks for setting up and tearing down a test case, independence between test cases (the order in which tests run shouldn’t matter), and temporarily disabling tests.
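Two of these capabilities, running a subset and temporarily disabling a test, can be sketched with Python’s built-in unittest; the test names here are invented:

```python
# Sketch: run only a subset of tests, and temporarily disable a flaky one.
import unittest

class SearchTests(unittest.TestCase):
    def test_basic_query(self):
        self.assertIn("cat", "a cat sat")

    @unittest.skip("temporarily disabled: flaky, under investigation")
    def test_fuzzy_query(self):
        self.fail("flaky check; not run while skipped")

# Run only a subset: a fresh suite containing a single named test.
subset = unittest.TestSuite([SearchTests("test_basic_query")])
unittest.TextTestRunner(verbosity=0).run(subset)
```

Most frameworks expose the same ideas via command-line filters (e.g. running tests matching a name pattern) and skip/disable annotations.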
The test suite should support analyses such as: identifying flaky tests, generating code coverage reports, surfacing troubleshooting info (e.g. logs, repro commands), etc.
Flaky tests are expensive. If each test has a 0.1% chance of failing when it should not, and one runs 10,000 tests per day, then there will be about 10 flakes to investigate each day. As flakiness increases (past 1%), devs stop reacting to test failures.
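The arithmetic above, plus the chance of a fully green run, can be sketched as:

```python
# Expected number of spurious failures per day, per the estimate above.
flake_rate = 0.001           # 0.1% chance a test fails when it should pass
runs_per_day = 10_000
expected_flakes = flake_rate * runs_per_day
print(expected_flakes)       # 10.0

# Probability that a run of 10,000 such tests is entirely green:
# (1 - p)^n, roughly e**-10 here, i.e. well under 0.01%.
all_green = (1 - flake_rate) ** runs_per_day
```

Even a seemingly tiny per-test flake rate makes an all-green run of a large suite vanishingly unlikely, which is why flakiness has to be driven down aggressively.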
Attaining 100% test coverage, especially covering all possible code paths, is difficult in large code bases. Are there tools that help developers be more judicious as to which code paths are tested?
Code coverage can flag untested code, but it shouldn’t be used as a proxy for how well a system is tested. Code coverage measures code that was invoked, and not code that was validated. Coverage should be used for small tests to avoid coverage inflation. Even further, devs tend to treat coverage targets as ceilings rather than floors, i.e. not adding tests as soon as the coverage threshold is met. Instead of relying on coverage numbers, assess your confidence that everything your customers expect to work will work, given that the tests are passing.
Choosing what to test is sometimes not obvious. Some tests are almost tautological or paranoid.
The Beyoncé Rule: if you liked it, then you shoulda put a test on it. Everything that you don’t want to break should be covered by a test, e.g. performance, behavioral correctness, accessibility, security, etc.
Sometimes it’s hard to get a 100% true representation of what the end-user sees, because tests frequently mock out parts of the system being tested.
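A minimal sketch of that trade-off, using Python’s `unittest.mock`: the test below never exercises the real dependency, so it can’t catch bugs in it. The `PaymentGateway` and `checkout` names are hypothetical.

```python
# Sketch: a test that mocks out a dependency. Fast and deterministic, but it
# does not exercise what the end-user's real transaction would hit.
from unittest import mock

class PaymentGateway:
    def charge(self, cents):
        raise NotImplementedError("talks to a real third-party service")

def checkout(gateway, cents):
    return gateway.charge(cents) == "ok"

fake_gateway = mock.Mock(spec=PaymentGateway)
fake_gateway.charge.return_value = "ok"

assert checkout(fake_gateway, 500)              # passes without a real charge
fake_gateway.charge.assert_called_once_with(500)
```

The mock verifies the interaction contract (`charge` called once with the right amount), but a bug inside the real gateway integration would go undetected; that gap is what larger-scoped tests are for.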
Running tests before checking in code usually relies on a paid hosted service like Travis CI. Some projects, e.g. personal ones, may not have such funds.
Improving Testing Culture
During orientation, new engineers were taught about testing as if it were standard practice at Google. The new engineers soon outnumbered existing team members and helped bring about cultural change.
A “Test Certified” program had levels 1 to 5, with each level’s goals achievable in a quarter. Level 1: set up a continuous build; start tracking code coverage; classify all tests as small, medium, or large; identify flaky tests; create a set of fast tests. Level 5: all tests automated; fast tests run before every commit; all non-determinism removed; every behavior covered. An internal dashboard showing each team’s level sparked competition. “Test Certified” was later replaced with an automated approach: Project Health (pH).
Testing on the Toilet (TotT): weekly flyers on how to improve testing were placed in restroom stalls, since an email newsletter would have been lost in the noise. TotT’s reception was polarized (e.g. complaints about invasion of personal space), but the uproar subsided. The Google Testing Blog has online versions of the flyers.
- 7 Absolute Truths I Unlearned as Junior Developer. Monica Lent. monicalent.com. Jun 3, 2019. Accessed Jun 6, 2022.
- Absolute truths I unlearned as junior developer (2019) | Hacker News. news.ycombinator.com. Accessed Jun 6, 2022.
- Google Testing Blog: TotT. testing.googleblog.com. Accessed Jun 13, 2022.