Opinion · 7 min read

The trust arc: why AI tools need to show their work before showing a diagram

TL;DR

AI tools that surface a diagram or plan before showing their reasoning get abandoned faster than tools that make the thinking visible first — because users can't verify a conclusion they didn't watch form.

Key takeaways

  • Show reasoning before showing output: users who see the model's logic before the result have significantly higher output acceptance rates than those who only see the final artefact.
  • Design for the trust arc explicitly — scepticism → engagement → verification → confidence — rather than optimising for time-to-first-output as a proxy for good UX.
  • Pushback handling is a first-class UX feature: when a user says 'that's wrong', the response pattern (acknowledge, explain, offer alternatives) determines whether they abandon the tool or adjust their mental model.
  • Trade-off surfacing — showing two options and naming what each one sacrifices — builds more durable trust than presenting a single confident answer, even when users initially prefer the confident version.
  • The diagram (or plan, or recommendation) is the proof that reasoning happened. Ship it last in the interaction sequence, not first.

The diagram is not the product

Most AI design tools are built around a simple interaction loop: user provides input, AI produces visual output, user edits or accepts. The diagram appears in under three seconds. The reasoning that produced it is invisible.

This is the wrong loop. Not because fast output is bad, but because the diagram appearing before any reasoning is visible trains users to treat the AI as a vending machine — input goes in, output comes out, and the machine either works or it doesn't. When it doesn't, users don't debug. They abandon.

Consider what session analyses of itinerary-planning tools consistently reveal. The tools that get abandoned fastest aren't the ones that produce bad outputs. They are the ones that produce outputs the user can't interrogate. The diagram arrives. The user stares at it. There is no handle.

The tools with higher continuation rates — where users actually revised and refined rather than quit — had one thing in common: by the time the output appeared, the user had already been part of forming it. The reasoning was visible. The diagram was the confirmation of a conclusion they'd already provisionally accepted.

What the trust arc actually looks like in sessions

Session recordings from AI tools with visible reasoning show a consistent four-stage pattern:

Scepticism. The first 20-40 seconds of a session, regardless of onboarding quality. The user is waiting to be disappointed. They've been disappointed before. Pilot projects at three different companies I've worked with showed >60% of first-session users expecting to find a fundamental flaw within the first interaction.

Engagement. Something in the reasoning display captures attention. Not the output — the reasoning. A statement like "I'm prioritising morning activities before 11am because you mentioned fatigue" or "I'm avoiding the harbour-front options because they have a 90-minute average queue time in March" signals that the model understood something specific. Engagement is when the user starts leaning forward.

Verification. The user tests one of the model's claims. Did it really account for that constraint? Is the timing realistic? This is not adversarial — it's healthy. Users who verify are users who are deciding whether to trust, not users who've already decided to leave.

Confidence. The user has tested a claim, found it was correct (or found that the model responded reasonably when challenged), and can now proceed. This is when the diagram becomes useful. Before this stage, a diagram is just noise to verify against.

Voya Genie's itinerary builder sequences this deliberately. Before surfacing the day plan, it shows a brief "here's what I understood about your trip" summary — constraints, preferences, stated priorities — and offers a one-click correction mechanism. The itinerary only renders after the user has either confirmed or adjusted those inputs. Session completion rate is roughly 3x higher than tools that skip this step.

The Trust Arc

  • Scepticism (0–40 seconds) → a specific signal moves the user to…
  • Engagement (reasoning becomes visible) → curiosity moves the user to…
  • Verification (the user tests a claim) → a confirmed claim moves the user to…
  • Confidence (the output is accepted)

Designing the trust arc: three concrete patterns

The trust arc isn't a feeling. It's an interaction sequence you can design for. Three patterns that actually move users through it:

Pattern 1: Reasoning-before-output sequencing

Don't show the final artefact until you've shown the 2-3 key decisions that produced it. This doesn't mean a wall of text — a structured summary works: "Based on [X], I've prioritised [Y] over [Z] because [reason]. Here's what that produces."

The delay is 1-2 seconds of additional rendering time. The benefit is that users arrive at the output with a formed opinion about whether the reasoning is sensible. Acceptance rates in A/B tests I've run on this pattern have been consistently 25-40% higher than the output-first variant. The exact number varies by domain; the direction is consistent.
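As a sketch of the sequencing itself (all names here are hypothetical, not the API of any tool mentioned in this post): the mechanic is just that the key decisions are emitted as messages first, and the artefact is emitted last.

```python
from dataclasses import dataclass

@dataclass
class Decision:
    chose: str       # what the model prioritised
    over: str        # what it deprioritised
    because: str     # the user-stated reason it relied on

def reasoning_first_sequence(decisions, artefact):
    """Yield messages in the order the user should see them:
    the 2-3 key decisions first, the final artefact last."""
    for d in decisions:
        yield f"I've prioritised {d.chose} over {d.over} because {d.because}."
    yield artefact  # the diagram/plan renders only after the reasoning

messages = list(reasoning_first_sequence(
    [Decision("morning activities", "a later start", "you mentioned fatigue")],
    "<itinerary artefact>",
))
```

The point of the structure is that the artefact cannot render without at least one decision preceding it in the stream.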

Pattern 2: Trade-off surfacing

Single confident answers feel efficient but erode trust faster than acknowledged uncertainty. When the model has made a genuine trade-off — faster vs. more thorough, cheaper vs. higher quality, conventional vs. unusual — name it explicitly and offer both.

"Option A is more direct but skips the coastal route you mentioned enjoying. Option B takes 40 minutes longer but includes it. Which fits your energy today?" This feels like more work for the user. It is, fractionally. But it also demonstrates that the model understood the constraint well enough to recognise it as a real trade-off rather than a clear optimisation. That demonstration is worth more to trust formation than the time saved.
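A minimal sketch of how trade-off surfacing can be structured (hypothetical names, not any real tool's schema): the key design choice is that every option carries a `sacrifices` field, so an option cannot be presented without naming its cost.

```python
from dataclasses import dataclass

@dataclass
class Option:
    label: str
    benefit: str
    sacrifices: str  # what choosing this option gives up

def surface_tradeoff(a: Option, b: Option) -> str:
    # Present both options and name each one's cost explicitly,
    # rather than silently optimising for one of them.
    return (
        f"Option {a.label} is {a.benefit} but sacrifices {a.sacrifices}. "
        f"Option {b.label} is {b.benefit} but sacrifices {b.sacrifices}. "
        "Which fits better today?"
    )

prompt = surface_tradeoff(
    Option("A", "more direct", "the coastal route you mentioned enjoying"),
    Option("B", "40 minutes longer but scenic", "some afternoon flexibility"),
)
```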

Pattern 3: Structured pushback handling

Every AI tool will produce something a user considers wrong. The interaction at that moment is the highest-stakes UX event in the entire session. Most tools handle it poorly — either the model capitulates immediately ("You're right, let me redo that entirely") or it defends rigidly.

The pattern that works: acknowledge, explain, offer alternatives. "That's a fair point — I weighted early start time more heavily than travel comfort because you mentioned the 10am opening time as a constraint. If travel comfort is the higher priority, I can shift the morning slot to 10:30 and add a closer option. Do you want to try that?" This treats the user as a collaborator in resolving genuine ambiguity rather than a judge who has delivered a verdict.
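Sketched as a response template (a hypothetical structure, assuming the model can recall which constraint drove the contested decision), acknowledge-explain-offer reduces to three slots:

```python
def handle_pushback(weighted: str, reason: str, alternative: str) -> str:
    """Acknowledge the objection, explain the original weighting,
    and offer a concrete alternative -- never capitulate or defend."""
    return (
        f"That's a fair point -- I weighted {weighted} because {reason}. "
        f"If that's not the right priority, I can {alternative}. "
        "Do you want to try that?"
    )

reply = handle_pushback(
    "early start time more heavily than travel comfort",
    "you mentioned the 10am opening time as a constraint",
    "shift the morning slot to 10:30 and add a closer option",
)
```

Forcing the `alternative` slot to be filled is what keeps the template from collapsing into either capitulation or rigid defence.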

Output-First vs Reasoning-First

Criteria               Output-first    Reasoning-first
User acceptance rate   Low             High
Abandonment rate       High            Low
Revision depth         Shallow         Deep
Trust after error      Weak            Strong

The case for speed — and where it fails

The counterargument is real: users say they want fast outputs. Latency is consistently cited as a top complaint in AI tool surveys. OpenAI's own research on ChatGPT usage shows that users who get responses in under 2 seconds have higher session continuation rates than those who wait 5+ seconds. Speed matters.

But these findings apply to chat interfaces where the user is asking questions and expecting text. They transfer poorly to generative design tools where the output is a complex artefact — a diagram, an itinerary, a plan, a layout — that requires evaluation rather than reading.

The failure mode of speed-optimised generative tools is specific: users accept the output without engaging with it, implement it, encounter a problem the model would have flagged if asked, and conclude the AI "doesn't work." The tool got a 1.8-second time-to-output and a churn event at week three.

The Figma AI features shipped in 2024 learned this lesson the hard way. The initial "generate UI from description" flow was fast and produced plausible-looking outputs. Usage dropped off sharply after first session. The revised version, which shows style decisions and layout rationale before rendering, has better retention even with a longer generation time. Speed is a feature. It is not the feature.

There's also a second failure mode worth naming: users who accept AI outputs too quickly build no mental model of how the tool makes decisions. They can't prompt it better over time. They can't catch systematic errors. They become dependent on output quality they can't evaluate. That's a fragile user, not a successful one.

Ship the reasoning. The diagram can wait 1.5 seconds.

The tools that will have durable user relationships aren't the fastest generators. They're the ones that make users feel like participants in the reasoning rather than recipients of output.

This is a design decision, not a model capability question. GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro — any capable model can produce the same output slightly faster or slightly slower. The differentiation is in what you show between "input received" and "output rendered."

The concrete change is small: add a 1-2 sentence reasoning summary before the output renders, with the 1-2 key decisions the model made and why. Add a correction mechanism with specific, labelled options. When the user pushes back, acknowledge, explain, and offer alternatives rather than immediately complying or defending.

None of this requires a new model. It requires treating the reasoning as part of the product rather than an implementation detail hidden in the system prompt.

The diagram is the proof that good thinking happened. Show the thinking first.
