Instrumentation before iteration: why shipping faster can make you slower if you can’t observe outcomes
TL;DR
Shipping without instrumentation creates the illusion of velocity — you accumulate deploys but not decisions, and every iteration that can’t be evaluated is a coin flip that delays genuine learning by days or weeks.
Key takeaways
- Before shipping any feature that changes a model-driven behaviour, define the metric that would tell you it worked — not after, before. If you cannot name the metric, you are not ready to ship.
- Log model inputs, retrieved context, and tool call arguments at the trace level, not just the output. Output-only logging makes root cause analysis nearly impossible when behaviour degrades.
- Set a measurable rollback threshold before deploying prompt or model changes. “We revert if task completion rate drops below 78% in the first 200 requests” beats “we’ll monitor it” every time.
- An eval harness with 50 representative examples run on every deploy catches regressions in under two minutes. It does not need to be comprehensive — it needs to be fast and consistent.
- Speed compounds when you can learn from each deploy. Speed without observability just accumulates debt — you eventually have to stop, instrument everything, and re-run the experiments you already thought you ran.
The deploy count that means nothing
There is a version of fast shipping that feels productive and accomplishes almost nothing.
You push a prompt change. Users continue using the product. Nothing obviously breaks. You push another change. Same result. After two weeks, you have 14 deploys and no idea whether any of them improved anything. The iteration count is real. The learning is not.
I ran this pattern for about six weeks when I first started building production LLM features. I was moving quickly by every surface measure — commits per day, deploys per week, tickets closed. What I was actually doing was generating a backlog of unanswered questions that I’d have to pay back with interest once something visibly broke.
The inflection point came when a client asked a simple question: “Is the summarisation better than it was last month?” I had 30+ commits touching the summarisation pipeline. I had no answer. Not because the answer was ambiguous — because I had no data.
Why LLM products make this worse than regular software
Traditional software has observable failure modes. An API returns 500. A function throws an exception. A page renders blank. The feedback loop between a bad deploy and knowing about it can be minutes.
LLM-driven behaviour fails softly and ambiguously. A retrieval change that reduces answer quality by 15% produces no errors, no alerts, no red dashboards. Users get slightly worse answers. Some of them stop asking follow-up questions. Engagement dips over two weeks. By the time you notice, you have six commits that could be the culprit.
This is not a hypothetical degradation pattern. It is the standard one. Teams shipping LLM features without structured evals routinely spend far longer on root cause analysis when behaviour degrades compared to teams running even minimal eval suites. The teams without evals were not slower at shipping — they were slower at learning, which is the only thing that actually compounds.
The softness of failure is why you cannot borrow the “ship and observe” heuristic directly from web product development. Web metrics (clickthrough, conversion, bounce) reflect user actions. LLM quality metrics require deliberate instrumentation — they don’t surface in your analytics by accident.
What instrumentation actually requires before you ship
The minimum viable instrumentation for an LLM feature is not a dashboard. It is three things:
A defined success metric for this specific change. Not “the model should do better” — “task completion rate on the 50-item eval set should stay above 82%.” Before you merge a prompt change, you need a number. The number does not need to be precise. It needs to exist.
Trace-level logging. Your logs need to capture what went into the model (the full prompt including retrieved context and tool results), not just what came out. Output-only logging is nearly useless for debugging. When a user reports a bad response, the question is almost always: “what context was the model working with?” If you cannot answer that, you cannot diagnose the problem.
Practically, this means using a tracing library — LangSmith for LangChain pipelines, Arize Phoenix for broader model monitoring, Weights & Biases for more ML-oriented teams, or a simple structured JSON logger if your pipeline is custom. Any of these work. No logging does not work.
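If you go the custom route, the structured-JSON option can be a single function. A minimal sketch — `log_trace`, its signature, and the field names are illustrative, not taken from any particular library:

```python
import json
import time
import uuid

def log_trace(user_query, retrieved_context, tool_results, prompt, response, sink):
    """Write one JSON line per request, capturing what went INTO the model
    (retrieved context, tool results, the full assembled prompt), not just
    what came out."""
    record = {
        "trace_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "user_query": user_query,
        "retrieved_context": retrieved_context,  # what the model actually saw
        "tool_results": tool_results,
        "prompt": prompt,                        # the full prompt as sent
        "response": response,                    # output last, not output only
    }
    sink.write(json.dumps(record) + "\n")
    return record
```

When a user reports a bad response, you grep for the trace ID and read `retrieved_context` first — in practice that answers "what was the model working with?" faster than anything else.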
A rollback condition, stated before deploy. “We will revert if X.” X is a specific number measured in the first N requests. This forces you to define what bad looks like before you have confirmation bias about your own change. It also means your on-call rotation has a clear decision rule at 2am.
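Stated as code, the decision rule is trivial to automate. A sketch using the article's own example numbers (78% over the first 200 requests); the `should_rollback` name and signature are illustrative:

```python
def should_rollback(completions, threshold=0.78, window=200):
    """Decide rollback from the first `window` requests.

    `completions` is a list of 1/0 task-completion outcomes in arrival
    order. Returns True only once the window is full and the completion
    rate has fallen below the pre-agreed threshold."""
    sample = completions[:window]
    if len(sample) < window:
        return False  # not enough data yet to trigger the rule
    rate = sum(sample) / len(sample)
    return rate < threshold
```

The value is not the five lines of code — it is that `threshold` and `window` were committed before the deploy, so the 2am decision is a lookup, not a debate.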

The objection: instrumentation slows you down early
The real pushback I hear is: “this adds overhead at exactly the moment when I need to move fast to validate the concept.”
This is partially correct and mostly wrong.
For a zero-to-one prototype — two days to check if an idea is worth pursuing at all — full instrumentation is overkill. You are not trying to measure quality; you are trying to see if the thing can exist. That phase is real, and it is short.
The error is extending that logic to the next phase. Once you have confirmed the concept and are now iterating toward a usable product, the cost of instrumentation drops steeply and the cost of not instrumenting accumulates. A 50-example eval harness using pytest and the OpenAI or Anthropic API takes roughly four hours to build for a new feature. It then runs in under two minutes per deploy. That investment pays back within the first three deploys that would otherwise have produced ambiguous results.
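As a sketch of the harness's shape: the version below assumes a `generate` callable that wraps your OpenAI or Anthropic client and a per-example `passes` grader — both names are illustrative. The pytest wiring is then a single test asserting `check_deploy(...)[0]` on every CI run.

```python
def run_eval(examples, generate, passes):
    """Run every eval example through `generate` and score each result
    with `passes`; return the overall task completion rate."""
    results = [passes(ex, generate(ex["input"])) for ex in examples]
    return sum(results) / len(results)

def check_deploy(examples, generate, passes, threshold=0.82):
    """Gate a deploy on the eval set: (passed, rate)."""
    rate = run_eval(examples, generate, passes)
    return rate >= threshold, rate
```

Fifty examples through a hosted model API keeps this under two minutes; the threshold comes from the success metric you wrote down before starting the change.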
The teams I have seen move fastest over a 3-6 month arc are consistently the ones who spent days 3-5 of a new feature building the evaluation harness. Not because it felt fast in week one — it did not — but because by month two they were making clean, evidence-backed decisions rather than arguing about impressions.
Velocity is a measure of decisions made, not deploys shipped
The deploy count is not the metric. The number of validated decisions per week is the metric.
A team shipping 10 deploys a week with no evaluation infrastructure is accumulating questions. A team shipping 4 deploys a week with a consistent eval harness, trace logging, and defined rollback conditions is compounding knowledge. After three months, the second team is moving faster by any measure that matters — feature stability, debugging time, confidence in changes.
The practical change is small. Before you start on any prompt change, retrieval tuning, or model swap: write down the metric that will tell you it worked. Set up the logging so you can compute that metric. Define the rollback condition. Then ship.
You are not adding process. You are converting deploys into experiments, and experiments into decisions. That is what velocity actually means.
Deploy-First vs Instrument-First
| Criterion | Deploy-First | Instrument-First |
|---|---|---|
| Regression detection | Slow | Fast |
| Root cause speed | Slow | Fast |
| Confidence in changes | Weak | Strong |
| 3-month learning rate | Weak | Strong |