Why test?

Status: 🟩 COMPLETE
Last updated: 2026-06-19
Plain-English tagline: Tests are the difference between “I think it works” and “I can prove it still works tomorrow after I change something else.” But writing too many of the wrong tests can cost more than they save.

In plain English

“Testing” sounds like a chore. You wrote the code, you tested it once manually, it works — why write MORE code just to verify the first code?

Because software changes. A feature you wrote three months ago works today. Next week, you refactor a utility function it depends on. The next morning, a user reports it’s broken. You stare at the diff for an hour, trying to remember what the code was supposed to do, before you realize a tiny edge case stopped working.

A test is code that exercises your real code and checks the result. Run the test; if the check passes, the behavior still works. If it fails, you know — INSTANTLY, before the user reports it — that you broke something.

Tests are not about proving your code is correct. They’re about catching the moment your code STOPS being correct. They’re a tripwire.

The fundamental promise of testing:

“I can change this code with confidence. If I break something, I’ll know in seconds, not weeks.”

That confidence is what unlocks fearless refactoring, fast iteration, and shipping at speed without piling up regressions. Teams without tests are slower over time, not faster — because every change has to be hand-verified across the whole app.

The catch: tests are also code. They have bugs. They get out of date. They can lock you into specific implementations. Writing too many, or testing the wrong things, can make a project HARDER to change rather than easier.

The art is knowing what to test, at what level, and when — and accepting that some things shouldn’t be tested at all.

Why it matters

Three concrete reasons tests pay for themselves:

They catch regressions. A change that fixes one bug often introduces another somewhere else. Tests turn “did I break anything?” from a hand-checking nightmare into a 30-second CI run.
They document the intent. A well-named test reads like a specification: “when the cart total exceeds $100, free shipping kicks in.” Six months from now, when you’ve forgotten how the code works, the tests tell you what it’s SUPPOSED to do.
They enable refactoring. Without tests, every refactor is a leap of faith. With tests, you can rip out the inside of a function and trust the green tests to confirm the outside still behaves correctly.

And concretely, the costs of NOT testing scale with project size:

For a 1-week prototype: testing is overhead. Skip most of it.
For a 3-month side project: light tests on critical paths pay off.
For a year-long codebase you’ll maintain: tests are the difference between maintainable and unmaintainable.
For a multi-person team: tests are how you share understanding of “what should work.”

The testing pyramid

A widely used mental model for the different LEVELS of tests:

                    ┌──────────────┐
                    │   E2E tests  │   slow, expensive, brittle
                    │   (browser)  │   few of them
                    └──────────────┘
                  ┌──────────────────┐
                  │  Integration     │   medium speed
                  │  tests           │   moderate number
                  └──────────────────┘
              ┌────────────────────────┐
              │   Unit tests           │   fast, cheap, focused
              │   (single function)    │   many of them
              └────────────────────────┘

The pyramid principle: have many cheap, fast unit tests at the base. Fewer integration tests in the middle. Few end-to-end tests at the top.

Why? Because tests at the top are expensive:

They run slowly (E2E: 10s-2min per test; unit: 5ms)
They’re brittle (UI changes break them; race conditions in real browsers)
They’re hard to debug when they fail (everything is involved; where exactly is the bug?)

Unit tests are the opposite — fast, isolated, easy to debug. So lean heavily on them.
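
To make the cost gap concrete, here is a back-of-the-envelope sketch in TypeScript. The 5 ms unit-test figure comes from the list above, and 30 s is one illustrative value inside the 10s-2min E2E range. The test counts are assumptions chosen for illustration, and the calculation assumes tests run one after another with no parallelism.

```typescript
// Rough serial runtime of a suite: number of tests × time per test.
// All figures are illustrative assumptions, not measurements.
function suiteMs(testCount: number, msPerTest: number): number {
  return testCount * msPerTest;
}

const unitSuiteMs = suiteMs(1000, 5);    // 1000 unit tests at 5 ms each
const e2eSuiteMs = suiteMs(50, 30000);   // 50 E2E tests at 30 s each

// unitSuiteMs is 5000 (5 seconds); e2eSuiteMs is 1500000 (25 minutes).
const slowdown = e2eSuiteMs / unitSuiteMs; // 300
```

Under these assumptions, a suite with twenty times fewer tests still takes 300 times longer to run, which is the practical argument for keeping the top of the pyramid small.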

That said: this is a guideline, not a law. The right shape depends on your project:

Heavily UI-driven apps may suit a “trophy” shape (most of the weight on integration tests, fewer unit tests, a thin layer of E2E on top) because the UI logic IS the value.
Pure utility libraries may be a pillar (only unit tests).
Microservice backends often want lots of integration tests to verify cross-service behavior.

Read the pyramid as: “don’t try to do everything at the top level.”

What’s worth testing (and what isn’t)

A working heuristic, in order of priority:

Definitely test

Business logic with branching (pricing, permissions, calculations)
Critical user paths (signup, login, payment)
Bug fixes — write a test that fails BEFORE the fix and passes AFTER. Keeps the bug from reappearing.
Anything that has broken before. Tests are a memory of past pain.
Edge cases in pure functions (empty arrays, zero, negative numbers, unicode)
Anything that writes to the database or calls an external API with side effects (the mutation paths)
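
As a sketch of the bug-fix item above: suppose a hypothetical `average` helper returned `NaN` for an empty array, and the agreed fix is to return 0. The function, the bug, and the chosen fix are all invented for illustration. The regression check is run against the buggy version first to confirm that it fails, then kept after the fix. Plain functions stand in for a test framework so the snippet runs on its own.

```typescript
// Hypothetical helper BEFORE the fix: 0 / 0 yields NaN for an empty array.
function averageBuggy(values: number[]): number {
  const sum = values.reduce((acc, v) => acc + v, 0);
  return sum / values.length;
}

// AFTER the fix: an empty list is defined to average to 0 (an assumed requirement).
function average(values: number[]): number {
  if (values.length === 0) return 0;
  const sum = values.reduce((acc, v) => acc + v, 0);
  return sum / values.length;
}

// The regression check: returns true when the behavior is correct.
function emptyListAveragesToZero(fn: (values: number[]) => number): boolean {
  return fn([]) === 0;
}

const failsBeforeFix = !emptyListAveragesToZero(averageBuggy); // true: the check catches the bug
const passesAfterFix = emptyListAveragesToZero(average);       // true: the fix is verified
```

Seeing the check fail before the fix matters: it shows the check really exercises the bug rather than passing by accident.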

Probably test

Important UI components in isolation
Data transformations (formatters, parsers, validators)
API endpoint contracts (input/output shape stays stable)

Don’t bother

Third-party libraries’ behavior (test YOUR usage, not their code)
Trivial getters/setters
Pure framework wiring with no logic
Visual styling (use visual regression tools if you must)
Implementation details that aren’t behavior

The rule: test BEHAVIOR, not IMPLEMENTATION. A test that says “this function calls this other function” couples your tests to your code in a way that makes refactoring painful. A test that says “given input X, return output Y” survives any refactor that doesn’t change behavior.
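
A minimal sketch of the difference, using a hypothetical `cartTotal` function in two versions. A hand-rolled call counter stands in for a framework mock or spy. All names here are invented for illustration.

```typescript
// A call counter stands in for a mock/spy from a test framework.
let addCalls = 0;
function add(a: number, b: number): number {
  addCalls += 1;
  return a + b;
}

// Version 1: sums prices by calling the add() helper.
function cartTotalV1(prices: number[]): number {
  let total = 0;
  for (const price of prices) total = add(total, price);
  return total;
}

// Version 2: a refactor with identical behavior that no longer uses add().
function cartTotalV2(prices: number[]): number {
  return prices.reduce((total, price) => total + price, 0);
}

// Behavior check: given input X, expect output Y.
function behaviorCheck(fn: (prices: number[]) => number): boolean {
  return fn([10, 20, 5]) === 35;
}

// Implementation check: "the function called add() three times".
function implementationCheck(fn: (prices: number[]) => number): boolean {
  addCalls = 0;
  fn([10, 20, 5]);
  return addCalls === 3;
}

const results = {
  behaviorV1: behaviorCheck(cartTotalV1),             // true
  behaviorV2: behaviorCheck(cartTotalV2),             // true: survives the refactor
  implementationV1: implementationCheck(cartTotalV1), // true
  implementationV2: implementationCheck(cartTotalV2), // false: fails although nothing is broken
};
```

The implementation check fails on version 2 even though every caller gets the same totals. That false alarm is the refactoring pain the rule is meant to avoid.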

A concrete example: a small business rule

Suppose you have a shipping cost calculator:

export function calculateShipping(weight: number, country: string): number {
  if (country === "AU" && weight > 5) return 25;
  if (country === "AU") return 15;
  if (country === "US") return 30;
  return 50;
}

A reasonable unit test (using Vitest):

import { describe, it, expect } from "vitest";
import { calculateShipping } from "./shipping";
 
describe("calculateShipping", () => {
  it("charges $15 for light AU packages", () => {
    expect(calculateShipping(2, "AU")).toBe(15);
  });
 
  it("charges $25 for heavy AU packages", () => {
    expect(calculateShipping(6, "AU")).toBe(25);
  });
 
  it("charges $30 for US packages regardless of weight", () => {
    expect(calculateShipping(2, "US")).toBe(30);
    expect(calculateShipping(20, "US")).toBe(30);
  });
 
  it("charges $50 for other countries", () => {
    expect(calculateShipping(2, "DE")).toBe(50);
  });
 
  it("treats weight at exactly the threshold as light", () => {
    expect(calculateShipping(5, "AU")).toBe(15);  // ≤ 5 is light
  });
});

Five tests. Cover the branches. Specifically test the boundary case (weight === 5). Run in milliseconds.

Now, if someone changes the threshold from 5 to 10, the test fails immediately. If they accidentally swap the order of the AU checks, the tests catch it.
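
Both claims can be checked with a small standalone sketch. It re-creates the shipping rule with the threshold as a parameter, adds a variant with the AU checks swapped, and runs the same expectations against each. The `makeShipping`, `swappedOrder`, and `failingCases` names are hypothetical helpers introduced here, not part of the original module.

```typescript
type ShippingFn = (weight: number, country: string) => number;

// The original rule, with the AU weight threshold as a parameter so it can be varied.
function makeShipping(threshold: number): ShippingFn {
  return (weight, country) => {
    if (country === "AU" && weight > threshold) return 25;
    if (country === "AU") return 15;
    if (country === "US") return 30;
    return 50;
  };
}

// A broken variant: the two AU checks are swapped, so the heavy branch is unreachable.
const swappedOrder: ShippingFn = (weight, country) => {
  if (country === "AU") return 15;
  if (country === "AU" && weight > 5) return 25;
  if (country === "US") return 30;
  return 50;
};

// The same expectations as the test suite above, expressed as data.
const cases: Array<[string, number, string, number]> = [
  ["light AU", 2, "AU", 15],
  ["heavy AU", 6, "AU", 25],
  ["US light", 2, "US", 30],
  ["US heavy", 20, "US", 30],
  ["other country", 2, "DE", 50],
  ["AU at threshold", 5, "AU", 15],
];

// Returns the names of the cases a given implementation fails.
function failingCases(fn: ShippingFn): string[] {
  return cases
    .filter(([, weight, country, expected]) => fn(weight, country) !== expected)
    .map(([name]) => name);
}

const original = failingCases(makeShipping(5));     // []
const threshold10 = failingCases(makeShipping(10)); // ["heavy AU"]
const swapped = failingCases(swappedOrder);         // ["heavy AU"]
```

In both broken variants the "heavy AU" expectation is the one that fails, which points straight at the branch that changed.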

That’s the value: a few minutes spent writing tests keeps catching this class of bug for as long as the tests keep running.

The trap of testing too much

Tests have a cost too. A test you write today is a test you maintain forever:

When the code changes, the test must change too (or fail)
A flaky test (sometimes passes, sometimes fails) is worse than no test — people learn to ignore failures
Test code can have bugs that mask real bugs (a broken assertion that never actually checks anything)
100% coverage doesn’t mean 100% safety — it means every line was executed by SOME test, not that every behavior is correct

A few specific anti-patterns to avoid:

Coverage chasing. Hitting 100% coverage by writing tests that don’t actually assert meaningful things. A test that calls a function and ignores the result is worse than no test.
Implementation testing. Tests that mock 80% of the code under test, then assert “the function called this other function.” When you refactor, these tests fail without any real behavior changing.
Snapshot abuse. Snapshot tests (compare to a saved output blob) are easy to write but provide little signal — when they fail, people often just update the snapshot without checking what changed.
Test-everything-equally. A bug in your logo’s SVG doesn’t deserve a test. A bug in your payment calculation does. Triage what’s worth the cost.
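
The coverage-chasing anti-pattern is easy to show in a sketch. The `applyDiscount` function below is hypothetical and contains a deliberate bug. The first "test" executes every line of it, so a coverage tool would report 100%, yet only the second one notices the problem.

```typescript
// Hypothetical discount rule with a deliberate bug: it should subtract the
// percentage, but it adds it instead.
function applyDiscount(price: number, percent: number): number {
  return price + (price * percent) / 100; // BUG: should be price - ...
}

// Coverage-chasing test: runs every line of applyDiscount but asserts nothing.
// It reports success no matter what the function returns.
function coverageOnlyTest(): boolean {
  applyDiscount(200, 10); // result ignored
  return true;
}

// Meaningful test: compares the result with the expected behavior.
function assertingTest(): boolean {
  return applyDiscount(200, 10) === 180;
}

const coverageOnlyPasses = coverageOnlyTest(); // true, despite the bug
const assertingPasses = assertingTest();       // false: the bug is caught
```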

Roughly: the goal of testing isn’t more tests. It’s confidence with minimal upkeep cost.

TDD — test-driven development

A specific discipline: write the test FIRST, then write the code to make it pass.

1. Write a failing test
2. Write the minimum code to pass it
3. Refactor freely (the test catches regressions)
4. Repeat

This forces you to think about the interface before the implementation. It produces highly testable code by default. It’s also slower in the moment.
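
One pass through the cycle might look like the sketch below. The `isValidPassword` function and its rules (at least 8 characters, at least one digit) are assumptions made for illustration. Everything is shown in one file so it runs on its own; in practice the expectations would live in a test file and would fail until the function is written.

```typescript
// Step 1: write the expectations first. Before isValidPassword exists,
// running them fails (red).
const expectations: Array<[string, boolean]> = [
  ["", false],            // rejects empty passwords
  ["short1", false],      // rejects passwords shorter than 8 characters
  ["longenough", false],  // rejects passwords without a digit
  ["longenough1", true],  // accepts a valid password
];

// Step 2: write the minimum code that makes every expectation pass (green).
function isValidPassword(password: string): boolean {
  return password.length >= 8 && /\d/.test(password);
}

// Step 3: refactor freely, re-running this check after each change.
function allExpectationsPass(): boolean {
  return expectations.every(
    ([password, expected]) => isValidPassword(password) === expected
  );
}

const green = allExpectationsPass(); // true
```

Writing the expectations first fixes the function's signature and edge cases before any implementation exists, which is the "interface before implementation" effect described above.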

For solo prototyping work, full TDD is often overkill. For pure utility functions, bug fixes, and any logic with branching, it’s genuinely useful. A middle ground that many teams adopt is “tests-along-with-code”: write code and tests in the same session, but don’t religiously test-first every line.

Tests as documentation

A well-named test reads like a spec:

import { describe, it } from "vitest";

describe("authentication", () => {
  // it.todo registers a named placeholder; Vitest reports it as pending
  // until a real test body is written.
  it.todo("rejects empty passwords");
  it.todo("rejects passwords shorter than 8 characters");
  it.todo("rejects passwords without a digit");
  it.todo("accepts a valid password and returns a session token");
  it.todo("expires session tokens after 30 days");
});

Reading this list tells you exactly what the authentication system does, in plain language, with executable verification attached. This is often the BEST documentation a codebase has — because docs go out of date silently, but tests fail loudly when they don’t match the code.

For this reason, prefer test names that describe BEHAVIOR (“rejects empty passwords”) over names that describe MECHANISM (“calls password validator”).

What testing CAN’T do

A few things tests don’t catch:

Misunderstood requirements. A test verifies “what we built.” If “what we built” is the wrong thing, the tests are green and the product is wrong.
Visual / UX issues. A button overlapping a banner doesn’t break any test. Visual regression tools help, but they never catch everything.
Performance regressions. A function that’s 100× slower still returns the right value. Tests pass; the app is unusable.
Race conditions you didn’t anticipate. Most tests are deterministic. Real systems are not.
Security vulnerabilities. A SQL injection that slips past your validation works in production; unless you wrote a test for that exact attack, the suite stays green.

This is why testing is ONE quality layer among many: types, linting, manual testing, code review, observability, security scans, production monitoring. No single layer catches everything.

The Bible Quest stack and testing

For a Next.js + Supabase + Vercel project like Bible Quest:

Unit tests (Vitest) — pure functions, utility logic, formatters, validators
Component tests (Vitest + React Testing Library) — React components in isolation
Integration tests — server actions, API routes against a test database
E2E tests (Playwright) — critical paths in a real browser against a preview deploy
Type checking (tsc --noEmit) — catches a vast swath of bugs without writing any tests
Lint (ESLint) — style + common mistakes
CI runs all of the above on every PR

For a solo side project, “type checking + lint + a handful of unit tests for the trickiest logic + 1-2 Playwright E2E tests for the critical paths” is a high-value baseline that doesn’t drown you in test maintenance.

Common gotchas

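Flaky tests undermine the whole system. A test that fails 5% of the time fails about once in every 20 CI runs on average, and the effect compounds: with ten such independent tests, the whole suite passes only about 60% of the time (0.95^10 ≈ 0.60). People learn to “retry until green,” and real failures get ignored. Fix or quarantine flaky tests aggressively.
Tests that rely on real time are brittle. A test that reads the real clock with new Date() can pass today and fail at month end, across a daylight-saving change, or once a hard-coded date has passed. Use fake timers (Vitest’s vi.useFakeTimers()) or pass the current time in as a parameter.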
Tests that rely on network access are slow and flaky. Mock external APIs (MSW), or run integration tests in a CI environment with stable services.
Tests that share state between runs are footguns. A test that creates a user with email test@example.com and forgets to clean up will break the next run that tries to create the same email. Reset state between tests; use unique fixtures.
Test data drifts from production data. Your tests use a perfectly shaped fixture; production has nulls, weird characters, missing fields. Cover edge cases your fixture doesn’t represent.
Mocks lie. A mock that returns { user: { id: 1, name: "test" } } doesn’t reflect what the real database returns. Integration tests against real services catch what mocks miss.
“Coverage” can be a vanity metric. 90% coverage on getters and setters; 30% coverage on the critical pricing logic. Coverage is a starting point, not a measure of safety.
Tests slow down PR cycles. A test suite that takes 15 minutes makes everyone slower. Parallelize the suite, split it into shards, run only the critical subset on fast-iteration branches, and run the full suite on merge to main.
Tests that need a particular env get ignored. A test that only works on a Mac, or only when a specific tool is installed, gets skipped by everyone else. Make tests platform-neutral or document the setup.
Snapshot tests can silently rot. A 500-line snapshot becomes “yeah just update it” every time it fails. Use snapshots only for small, meaningful outputs you’d actually review.
The “Arrange-Act-Assert” pattern is a real win. Structure each test as three sections: set up state (Arrange), do the thing (Act), check the result (Assert). Tests that don’t follow this often have hidden setup or unclear assertions.
Don’t test framework internals. expect(component).toBeInstanceOf(React.Component) is testing React, not your code. Test what YOUR code does.
Tests can catch the wrong bug. A test that asserts result === 42 may pass even when the underlying logic is wrong (the wrong calculation happens to produce 42). Verify with multiple inputs.
Refactor tests too. When the code under test changes, the tests may need to adapt. But aim for tests that survive refactors — focus on behavior, not implementation.
Production != test environment. Your test database may have different indexes, different settings, different versions than production. A test passing isn’t a guarantee the same code works in prod.
AI-generated tests need scrutiny. A test that AI writes can have hardcoded expected values that are wrong, or that mirror the bug in the code. Review carefully.
Don’t measure success by test count. “We added 200 tests this sprint” is meaningless. “We have confidence that our critical paths still work” is the goal.
Test the boundary cases. Off-by-one errors, empty input, single-element input, very large input, unicode characters, null/undefined. These are where bugs hide.
Test the error paths. Most bugs aren’t in the happy path — they’re in error handling. Write tests for what happens when things go wrong.
Tests are a leading indicator of code quality. Hard-to-test code is often poorly designed code. If a function takes 10 setup steps to test, it probably does too much.
Tests are not a substitute for thinking. A green test suite isn’t proof of correctness. It’s evidence of “the cases we thought to test work.” There may be cases you didn’t think of.
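
For the real-time gotcha above, one alternative to fake timers is to pass the clock in as an argument. The sketch below uses a hypothetical `isSessionExpired` function and a 30-day lifetime as assumptions. Because the test supplies both dates, the result does not depend on the day the suite runs.

```typescript
const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;

// Hypothetical session check. The current time is a parameter rather than a
// call to new Date() inside the function, so a test controls it completely.
function isSessionExpired(issuedAt: Date, now: Date): boolean {
  return now.getTime() - issuedAt.getTime() > THIRTY_DAYS_MS;
}

// Fixed, explicit dates: these checks give the same answer on any day they run.
const issuedAt = new Date("2024-01-01T00:00:00Z");
const after29Days = new Date("2024-01-30T00:00:00Z");
const after31Days = new Date("2024-02-01T00:00:00Z");

const stillValid = !isSessionExpired(issuedAt, after29Days); // true
const expired = isSessionExpired(issuedAt, after31Days);     // true
```

Production code would call `isSessionExpired(token.issuedAt, new Date())`, where `token.issuedAt` is again a hypothetical field. Only the call site touches the real clock.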

Sources

Kent C. Dodds — The Testing Trophy — the “trophy” alternative to the pyramid
Martin Fowler — TestPyramid — the widely cited write-up of the pyramid (Fowler credits the idea to Mike Cohn)
Martin Fowler — Mocks Aren’t Stubs — the canonical mocks discussion
Kent Beck — Test-Driven Development by Example — the original TDD book
Google Testing Blog — practical patterns from a large engineering org
