Opinion · 6 min read

Hypothesis theatre: when teams "test" things without a falsifiable claim

TL;DR

A hypothesis isn’t a formality — it’s a commitment device. Without one written down before you look at the data, every team will find a narrative that fits the result they already wanted.

Key takeaways

  • A pre-registered hypothesis is a commitment device, not a scientific formality. It forces you to state what you expect before the data arrives, which is the only thing that prevents you from retrofitting the conclusion afterwards.
  • High-ego teams are especially vulnerable. When nobody wants to be wrong, the absence of a written prediction means the goalposts move silently — and everyone remembers having predicted whatever actually happened.
  • Write your hypothesis in the form “If X, then Y will change by Z, because [mechanism]” before touching any tooling — if you can’t fill in the mechanism, you don’t have a hypothesis yet.
  • Underpowered tests compound the problem — sample sizes too small to detect real effects give teams a scientific-looking excuse to confirm whatever they already believed.
  • Exploratory testing is legitimate, but label it honestly. Exploration generates hypotheses; it doesn’t confirm them. The danger is when teams treat exploratory results as confirmatory evidence.

The hypothesis is the commitment, not the formality

Most product teams that call themselves data-driven are not running experiments. They’re running tests — and those are different things.

A test produces a number. An experiment produces an inference. The difference is whether you had a falsifiable claim written down before you looked at the data.

Without that written claim, something predictable happens: the conclusion bends to fit whatever the team already wanted. A metric goes up? “We expected that.” It goes down? “Well, that’s not the metric we were really optimising for.” It stays flat? “We need more time.” Every outcome confirms the prior belief, because there was never a stated prediction that could be contradicted.

A pre-registered hypothesis is not a scientific formality. It is a commitment device. It forces a team to say, on the record, “we believe X will happen, and if it doesn’t, we were wrong.” That sentence is the entire point. Everything else — the flags, the dashboards, the p-values — is infrastructure in service of that commitment.

With vs Without Pre-Commitment

Criteria                    | Pre-committed hypothesis | Hypothesis theatre
----------------------------|--------------------------|-------------------
Prediction recorded         | Yes                      | No
Goalposts fixed             | Yes                      | No
Null result meaningful      | Yes                      | No
Motivated reasoning blocked | Yes                      | No
Learning produced           | Yes                      | No

High-ego teams retrofit conclusions without noticing

The commitment problem gets worse in proportion to the seniority and confidence in the room. In teams where people have strong opinions and reputations attached to their ideas, the absence of a written hypothesis creates a silent rewriting of history.

Here is the pattern: a senior person advocates for a change. The team ships it behind a flag. The results come back ambiguous or negative. Without a pre-registered prediction, nobody can point to a document and say “we expected X and got Y.” Instead, the narrative shifts. The metrics that moved become the ones that matter. The ones that didn’t get reframed as lagging indicators. Everyone remembers having predicted roughly what happened.

This is not dishonesty. It is ordinary motivated reasoning, and it happens in every team where status is tied to being right. The only reliable defence is a written prediction made before the data arrives — not because writing things down is virtuous, but because it removes the room to manoeuvre afterwards.

The uncomfortable corollary: if your team resists pre-registering hypotheses, that resistance is itself evidence of the problem. A team confident in its reasoning should welcome the chance to be proven right on the record. A team that prefers to keep predictions vague is preserving the option to claim any outcome as a win.

What a falsifiable hypothesis actually requires

A falsifiable hypothesis has three parts: a predicted direction, a predicted magnitude (or at least a minimum detectable effect you care about), and a stated mechanism.

Direction: “This change will increase checkout completion rate” — not “this change might affect checkout.” If both an increase and a decrease would lead to the same decision (“keep testing”), you don’t have a hypothesis.

Magnitude: Before running the test, you need to answer: what effect size would change my decision? If a 0.1% lift is indistinguishable from noise at your traffic volume and wouldn’t change your roadmap anyway, your experiment cannot produce a meaningful result regardless of outcome. Tools like Evan Miller’s sample size calculator or the statsmodels.stats.power module in Python let you compute this before writing a single line of flag logic.
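That pre-test arithmetic can be sketched in a few lines. This uses the standard normal-approximation formula for a two-proportion z-test (the same calculation tools like Evan Miller's calculator perform more carefully); the 5% baseline and +1 pp lift below are illustrative numbers, not from any real test:

```python
import math
from statistics import NormalDist

def sample_size_per_arm(p_base, mde_abs, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant for a two-proportion z-test,
    via the standard normal-approximation formula.

    p_base:  baseline conversion rate (e.g. 0.05 for 5%)
    mde_abs: minimum detectable absolute lift you would act on
             (e.g. 0.01 for +1 percentage point)
    """
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)   # two-sided significance threshold
    z_beta = z(power)            # power requirement
    p_var = p_base + mde_abs
    variance = p_base * (1 - p_base) + p_var * (1 - p_var)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)

# Detecting a 5% -> 6% lift at alpha = 0.05 with 80% power needs
# roughly 8,000+ visitors per arm. If a week of traffic gives you 800,
# the honest conclusion is "we cannot run this experiment yet".
n = sample_size_per_arm(0.05, 0.01)
```

If `n` comes out larger than the traffic you can realistically collect, that is the moment to redesign the test or widen the minimum detectable effect, not after the flat result arrives.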

Mechanism: This is the part almost everyone skips. If you believe a change will increase checkout completion, why? Is it because the new copy reduces cognitive load at the decision point? Because removing the promo code field eliminates a known drop-off? The mechanism matters because it determines what else you should observe — and whether a confirming result is actually evidence of your belief or an artefact.

Without a mechanism, you cannot replicate results, cannot transfer learning to related decisions, and cannot distinguish “this worked because our theory is correct” from “this worked by coincidence this quarter.”

How underpowered tests launder bad decisions

Once you understand hypothesis theatre as a commitment problem, the statistical failures become easier to see as symptoms rather than root causes.

The most damaging form is the underpowered test, because it gives motivated reasoning a scientific costume. A test with 60% power — common in teams that split traffic 50/50 for a week without calculating sample size first — will miss real effects 40% of the time. When the test comes back flat, the team records “no signal” and moves on. But “no signal” at n=800 is not the same claim as “no signal” at n=80,000. The first is absence of evidence; the second is evidence of absence.
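A quick simulation makes the miss rate concrete. The parameters below (5% baseline, a genuine +0.5 pp lift, 800 visitors per arm) are illustrative assumptions chosen to match the n=800 scenario above:

```python
import math
import random

def significant(conv_a, conv_b, n, z_crit=1.96):
    """Two-sided two-proportion z-test at alpha = 0.05."""
    p_a, p_b = conv_a / n, conv_b / n
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    return se > 0 and abs(p_b - p_a) / se > z_crit

random.seed(0)
n, runs = 800, 2000
misses = sum(
    not significant(
        sum(random.random() < 0.050 for _ in range(n)),  # control: 5.0%
        sum(random.random() < 0.055 for _ in range(n)),  # variant: 5.5%, a real lift
        n,
    )
    for _ in range(runs)
)
# At n=800 per arm, the test fails to detect this real effect in the
# vast majority of simulated runs.
miss_rate = misses / runs
```

Every one of those missed runs would be recorded as “no signal” and used to justify whatever the team already preferred.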

This creates a laundering mechanism: a team runs an experiment that couldn’t have detected anything short of a 15% effect, finds nothing, and uses that result to justify inaction or continued investment — whichever was politically preferred before the test started. The experiment absorbed the discomfort of the decision without producing any information.

The inverse is also real. Underpowered tests that happen to hit significance are almost certainly inflated — this is the Winner’s Curse in A/B testing. A 5% lift detected in a low-powered test is much more likely to be a 1–2% effect with noise than a genuine 5% effect. Teams that track these results as wins and project them forward build roadmaps on phantom data.
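The inflation is easy to demonstrate: simulate the same underpowered test many times and look only at the runs that “won”. The numbers here (5% baseline, a true +0.5 pp effect, 800 visitors per arm) are illustrative assumptions:

```python
import math
import random

random.seed(1)
n = 800                                # per arm; far too small for this effect
true_base, true_lift = 0.050, 0.005    # real effect: +0.5 percentage points
sig_lifts = []
for _ in range(3000):
    conv_a = sum(random.random() < true_base for _ in range(n))
    conv_b = sum(random.random() < true_base + true_lift for _ in range(n))
    p_a, p_b = conv_a / n, conv_b / n
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    if se > 0 and (p_b - p_a) / se > 1.96:  # a "statistically significant win"
        sig_lifts.append(p_b - p_a)

# Conditional on reaching significance, the observed lift is several
# times the true +0.5 pp effect: the Winner's Curse in miniature.
avg_winner_lift = sum(sig_lifts) / len(sig_lifts)
```

The tests that “win” are precisely the ones where noise happened to point the same way as the effect, so projecting those observed lifts forward overstates the real impact by a large multiple.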

Bayesian approaches (as implemented in tools like VWO’s Bayesian engine or the pymc library) partially address this by reporting posterior distributions rather than binary significance — but they don’t fix the upstream problem. If the team never committed to a prediction, no amount of statistical sophistication prevents the conclusion from being retrofitted after the fact.

The Theatre Loop

No written prediction → underpowered test → ambiguous result → narrative retrofitted → original decision confirmed → experiment cited as evidence → and back to no written prediction.

The honest case for exploratory testing

There is a legitimate version of running tests without a sharp hypothesis: exploratory analysis on a new surface where you genuinely don’t know what to measure. If you’ve just shipped a feature no one has built before, or you’re in a new market with no comparable data, pre-specifying a hypothesis can be epistemic overconfidence.

This is fine. The error isn’t exploration — it’s calling exploration an experiment. Exploratory tests should be labelled as such, expected to generate hypotheses rather than confirm them, and held to a different evidentiary standard. If your team treats exploratory results as confirmatory, you’ve done the same thing: dressed up a less rigorous process in the language of the more rigorous one.

The practical rule: if you would change your decision based on a flat result, it’s an experiment. If you wouldn’t, it’s a measurement. Label it accordingly and don’t pretend the measurement validated a hypothesis it couldn’t have tested.

The fix is not more process

Adding pre-registration templates, experiment review boards, or mandatory hypothesis fields to your tracking tool will not fix this if the underlying incentive is to use experimentation as political cover rather than as a learning mechanism.

The fix is simpler and more uncomfortable: commit in writing before you run the test, and make null results count. If a well-powered, well-specified test comes back flat, that is a strong result. Record the inference (“we now have evidence that X does not materially affect Y at our current scale”), update your model of the product, and let it influence future decisions.

Teams that resist writing predictions down, or that treat flat results as wasted sprints, are revealing that they were never running experiments. They were running validation ceremonies for decisions already made.

Run fewer tests. Commit to predictions before you start. And when one comes back flat, say what you learned.
