Blog

Coming Soon

Tutorials, essays, and the longer-form thinking behind the work — drawn from a year of building full-time with AI, including six months on Claude Max 20x, maxed out every week (200M+ tokens).

These are my own perspectives from hands-on building. I've used AI at scale to help draft and organise them.

98 topics across 13 categories

Product, measurement, and decision quality

The "missing traffic" trap: why most conversion fixes fail when the real issue is distribution

Most teams optimise the funnel they can see. But if nobody’s reaching your signup page, conversion rate improvements are solving the wrong problem entirely.

Metric overload as a product bug: dashboards that look "rigorous" but reduce decision accuracy

When your dashboard has 47 metrics, nobody makes faster decisions — they make fewer. Rigour without focus is just noise with better formatting.

Build–Measure–Learn done properly: what "measure" actually means (and what it doesn’t)

Everyone cites the lean loop, but most teams skip the hard part: defining what “measure” means before they build anything.

Hypothesis theatre: when teams "test" things without a falsifiable claim

Running A/B tests without a falsifiable hypothesis isn’t experimentation — it’s a performance of rigour that produces no learning.

Instrumentation before iteration: why shipping faster can make you slower if you can’t observe outcomes

You can’t ship faster if you can’t observe what happened. Velocity without observability is just moving faster in the dark.

Leading vs lagging indicators in SaaS: what to track when revenue is too slow to teach you anything

Revenue tells you what happened months ago. If that’s your primary signal, you’re steering by the wake instead of looking at what’s ahead.

User feedback as a noisy sensor: how to treat anecdotes like data, without ignoring them

Anecdotes aren’t useless, but they’re not data either. The skill is treating them as signals without over-weighting any single one.

The hidden cost of "one more metric": cognitive load and analysis paralysis as measurable waste

Every metric you add competes for attention. At some point, the dashboard itself becomes the bottleneck to good decisions.

Designing reports that force decisions: how to structure outputs so "next action" is unavoidable

A report that doesn’t make “what do we do next?” obvious has failed at its only job — regardless of how well it visualises the data.

When customers ask for features they already have: UX discoverability as the real problem

Before you build it, check whether the real problem is that nobody can find the feature you already shipped.

LLMs, prompting, and getting useful output

The "Illuminate → Solve → Refine" pattern: a minimalist loop that beats most fancy prompting

Forget elaborate prompting frameworks. This three-step loop handles 90% of real tasks better than any chain-of-thought template.

Why LLMs mis-handle time: what that breaks in planning, estimation, and scheduling

Models reason about “tomorrow” and “next quarter” as if they were interchangeable. Understanding why helps you work around it in planning tasks.

Prompt chains as product UX: the difference between "chatting" and "operational workflows"

The gap between chatting with an AI and building an operational workflow is prompt chaining — and most products haven’t made that leap yet.

Self-optimising prompts: letting the model critique and rewrite its own system instructions (and the failure modes)

Having the model critique and rewrite its own instructions sounds clever until you hit the failure modes. Here’s what actually works.

Caching answers without lying: how to precompute response options while staying honest and flexible

Precomputing responses saves tokens and latency, but only if you don’t pretend the cached answer was freshly reasoned. The honesty problem is real.

The token economy: practical ways to reduce cost without degrading output quality

Practical techniques for cutting LLM costs — from prompt compression to model routing — without making the output noticeably worse.

Avoiding abstraction drift: keeping an LLM anchored to concrete artefacts, not vibes

Left unchecked, LLMs drift toward vague, abstract language. Keeping outputs anchored to concrete artefacts requires deliberate prompting.

When the model is confidently wrong: designing "trust brakes" instead of hoping users notice

The most dangerous failure mode isn’t hallucination — it’s hallucination delivered with authority. Designing systems that catch this before users do.

Perplexity as a trust signal: when uncertainty estimates help — and when they mislead

Perplexity can be a rough proxy for how reliable an output is, but only in specific contexts. Here’s when it helps and when it actively misleads.

LLMs as editors vs authors: why "rewrite" workflows often outperform "generate from scratch"

Asking a model to improve existing text consistently outperforms generating from scratch. The implications for product design are significant.

Agentic systems, tools, and engineering patterns

Single-agent vs multi-agent architectures: when "more agents" is actually worse

Adding agents feels like adding capability. In practice, coordination overhead often makes multi-agent systems slower and less reliable.

Separation of concerns for agent apps: brain vs voice vs UI vs tools (why it matters)

Brain, voice, UI, and tools are different responsibilities. Mixing them in one component is the fastest path to unmaintainable agent code.

Ephemeral storage in AI apps: the sharp edges nobody notices until production

Everything works until your agent’s memory disappears mid-task. The storage problems nobody mentions until you’re debugging in production.

PoC MVP vs scalable MVP: what to deliberately not build early (and why founders struggle with this)

Founders consistently over-build the first version. What to deliberately leave out — and why the instinct to “do it properly” is often wrong early on.

Why LangChain-style wrappers can slow you down: when direct API integration is cleaner and safer

Abstraction layers promise speed. When the abstraction doesn’t match your use case, you spend more time fighting the framework than the problem.

“Computer control” as an enterprise workaround: when APIs don’t exist but outcomes still matter

When there’s no API, there’s often a screen. Browser and desktop automation as a legitimate integration pattern, not a hack.

Trustworthy agent design: guardrails, audit logs, and "why this action happened"

“It just does stuff” isn’t an architecture. Guardrails, audit logs, and explainability are what make agents deployable, not optional extras.

Reusable foundations for startups: auth, billing, AI layer, observability — the minimum kit

The minimum infrastructure kit that prevents you from rebuilding the same plumbing for every new project.

Prompt-auditing as QA: treating prompt outputs like software artefacts with tests

Treating prompt changes like code changes — with version control, tests, and rollback — instead of hoping the new wording works.

Split-brain agents — separate thinking from doing: why one model trying to plan and execute increases hallucinated actions

Splitting the work into two passes, where a planner outputs a task graph with acceptance tests and an executor performs only the mapped tasks, is more reliable than one model doing both.
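
A minimal sketch of the split, assuming a hypothetical call_model helper and a model that returns JSON; the names and task-graph shape are illustrative, not a fixed API.

```python
import json
from typing import Callable

# Hypothetical stand-in for whichever LLM client you actually use.
ModelCall = Callable[[str], str]

def plan(call_model: ModelCall, goal: str) -> dict:
    """Pass 1: the planner only produces a task graph plus acceptance tests."""
    raw = call_model(
        "Break this goal into numbered tasks with dependencies and one "
        f"acceptance test per task. Respond as JSON. Goal: {goal}"
    )
    return json.loads(raw)  # e.g. {"tasks": [{"id": 1, "action": "...", "test": "..."}]}

def execute(call_model: ModelCall, task_graph: dict) -> list[dict]:
    """Pass 2: the executor performs only tasks that exist in the graph."""
    results = []
    for task in task_graph["tasks"]:
        output = call_model(f"Perform exactly this task, nothing else: {task['action']}")
        results.append({"id": task["id"], "output": output, "test": task["test"]})
    return results
```

The boundary is the point: the executor never invents tasks, and every output arrives paired with the acceptance test the planner wrote for it.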

Multi-model committees (the verifier pattern): the biggest gain isn’t better ideas — it’s catching failures systematically

Use Model A to generate, Model B to audit against a rubric: security, correctness, completeness, edge cases. Verification beats “smartest model.”
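
A sketch of the committee, with generate and audit standing in for calls to two different models; the rubric wording and the PASS convention are assumptions.

```python
from typing import Callable

RUBRIC = "security, correctness, completeness, edge cases"

def committee(generate: Callable[[str], str],
              audit: Callable[[str], str],
              task: str,
              max_rounds: int = 2) -> str:
    """Model A drafts; Model B audits against a fixed rubric; revise until clean."""
    draft = generate(task)
    for _ in range(max_rounds):
        review = audit(
            f"Audit this answer against the rubric ({RUBRIC}). "
            f"List concrete failures, or reply PASS.\n\nTask: {task}\n\nAnswer:\n{draft}"
        )
        if review.strip().upper().startswith("PASS"):
            break
        draft = generate(
            f"Task: {task}\n\nRevise the answer to fix:\n{review}\n\nPrevious answer:\n{draft}"
        )
    return draft
```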

Growth, traffic loops, and go-to-market

Why most "growth advice" is second-tier: anything that assumes existing traffic is not a startup plan

Any strategy that assumes you already have traffic is useless at zero. Most growth content is written for companies past the hardest stage.

Traffic loops that work from zero: directories, integrations, marketplaces, partner listings

The unglamorous channels that actually generate first users. No audience required, no ad budget needed.

An agent that generates a traffic plan from your website: what it should output to be genuinely executable

What an AI-generated distribution plan should actually contain to be executable — not just a list of “try social media” platitudes.

Conversion is not growth: why being good at funnels doesn’t solve acquisition

A perfect funnel with no traffic is a perfect funnel with no customers. Acquisition and conversion are fundamentally different problems.

Pricing early-stage B2B AI: what customers actually pay for (hint: risk reduction)

Customers don’t pay for features; they pay for risk reduction. Pricing frameworks that reflect what buyers actually value.

Consulting → e-learning → product: a staged path to revenue that doesn’t require VC fantasy

A staged revenue path that funds product development without requiring venture capital or betting everything on launch day.

The "vertical or die" constraint: why generalist AI tools lose — even if they’re better engineered

Horizontal AI tools compete on capability. Vertical tools compete on understanding. At early stage, the latter wins almost every time.

Building trust artefacts for AI products: security posture, reliability proof, and sample outputs

Security pages, reliability metrics, and sample outputs. The tangible proof that makes enterprise buyers comfortable saying yes.

Trust, privacy, and high-stakes AI usage

Privacy boundaries in the workplace: how medical info leaks happen (and how to reduce the risk)

How medical and personal information leaks into AI workflows, and practical patterns for reducing that exposure.

Designing AI for compliance-first clients: what "reliability" and "security" mean in practice

“Reliable” and “secure” mean specific things to regulated industries. What you need to demonstrate, not just claim.

When the process harms the person: systems that are "procedurally correct" but psychologically destructive

Some systems follow every rule perfectly while making people’s lives worse. The gap between procedural correctness and human outcomes.

AI in legal/process-heavy domains: where automation is genuinely feasible vs where nuance dominates

Where automation genuinely works in high-nuance environments, and where the complexity makes it actively dangerous.

The ethics of human-like scraping: capability vs consent vs platform rules (and the business risk)

Just because an AI can crawl and extract like a human doesn’t mean consent, platform rules, and business risk disappear.

How LLMs actually work

Transformers and attention in plain language: what you actually need to know vs what you can safely ignore

What builders actually need to understand about the architecture, and what’s safely ignorable academic detail.

Tokenisation matters more than you think: why your costs, context limits, and edge cases all trace back to BPE

Your costs, context limits, and weird edge cases all trace back to how text gets split into tokens. Understanding BPE changes how you build.
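
A small worked example with the tiktoken tokenizer; the per-million-token price is purely illustrative, not any provider’s real rate.

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a BPE vocabulary used by several recent models

for text in ["hello world", "antidisestablishmentarianism", "naïve café 🤖"]:
    tokens = enc.encode(text)
    print(f"{text!r} -> {len(tokens)} tokens")

# Cost and context limits scale with tokens, not characters or words.
ILLUSTRATIVE_PRICE_PER_MILLION_TOKENS = 0.50  # dollars; check your provider's pricing
prompt = "Summarise the attached report in three bullet points."
cost = len(enc.encode(prompt)) * ILLUSTRATIVE_PRICE_PER_MILLION_TOKENS / 1_000_000
print(f"~${cost:.8f} for the prompt alone, before the report or the reply")
```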

Pre-training → fine-tuning → RLHF: the three-stage pipeline and what each stage actually contributes

The pipeline that turns raw text prediction into a useful assistant. What each stage contributes and where things go wrong.

Embeddings are the real foundation: why search, RAG, clustering, and recommendations all start here

Search, RAG, clustering, and recommendations all start with vector representations. If you don’t understand embeddings, everything downstream is a black box.

Context windows explained honestly: what 4K vs 128K vs 1M tokens actually means for your app (and what it doesn’t)

128K tokens sounds huge until you realise how fast files, conversation history, and system prompts eat it. What the numbers actually mean.

Scaling laws and why "just make it bigger" stopped being the only strategy

The era of pure parameter scaling is over. What’s replacing it — test-time compute, synthetic data, architecture changes — and why builders should care.

Synthetic data and the data wall: how labs are breaking through now that the internet has been read

The internet has been read. How labs are generating training data now, and what that means for model quality and availability.

RAG, search, and grounding AI in reality

RAG end to end: chunking, embedding, retrieval, reranking, generation — and where each step actually fails

The dominant pattern for grounding LLMs in your own data. Where each step actually fails and what that means for your pipeline’s reliability.
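
A compressed sketch of the loop; embed, rerank, and generate are hypothetical stand-ins for whichever providers you use, and each numbered step is a place the real pipeline can fail.

```python
from typing import Callable, Sequence

def rag_answer(question: str,
               documents: Sequence[str],
               embed: Callable[[str], list[float]],
               rerank: Callable[[str, list[str]], list[str]],
               generate: Callable[[str], str],
               chunk_size: int = 800,
               top_k: int = 5) -> str:
    # 1. Chunking: bad splits here poison everything downstream.
    chunks = [doc[i:i + chunk_size]
              for doc in documents
              for i in range(0, len(doc), chunk_size)]

    # 2. Embedding and 3. retrieval: nearest neighbours by cosine similarity.
    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
        return dot / norm if norm else 0.0

    q_vec = embed(question)
    shortlist = sorted(chunks, key=lambda c: cosine(q_vec, embed(c)), reverse=True)[: top_k * 3]

    # 4. Reranking: a second pass reorders the shortlist more carefully.
    best = rerank(question, shortlist)[:top_k]

    # 5. Generation: answer only from the retrieved context.
    context = "\n---\n".join(best)
    return generate(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
```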

Vector databases compared honestly: Pinecone vs Weaviate vs Qdrant vs Chroma vs pgvector

When you genuinely need a dedicated vector DB and when PostgreSQL with pgvector is perfectly fine.

Hallucination is a retrieval problem: why grounding beats guardrails for reducing confabulation

Guardrails and disclaimers don’t fix confabulation. Grounding the model in retrieved facts does. Retrieval quality is the real lever.

AI search vs traditional search: why "10 blue links" is dying and what replaces it

The link-based search paradigm is dying. Answer engines that synthesise and cite are replacing it, and that changes how people discover information.

Citations as a trust mechanism: what Perplexity got right about making AI answers verifiable

Every claim linked to a source. This pattern should be standard, not novel. Why citation-first design builds user trust faster than disclaimers.

Fine-tuning vs RAG vs prompt engineering: the actual decision framework for when to use which

When to invest in each approach based on cost, accuracy, latency, and maintenance burden. The decision framework most tutorials skip.

Context engineering — the evolution beyond prompt engineering

It’s less about finding the right words and more about curating the right context configuration for each task. The shift from prompting to engineering.

Evals, benchmarks, and quality assurance

Why vibes-based evaluation will kill your product: how to build systematic evals that catch regressions

“It seems fine” isn’t a quality bar. How to build eval suites that catch regressions before your users do.

AI benchmarks are a game — and the industry is cheating: how to read MMLU, Arena, and LiveBench critically

MMLU is saturated, Arena scores get gamed, and providers cherry-pick results. How to read benchmark claims without being misled.

Testing non-deterministic outputs: snapshot testing, eval suites, and drift detection for LLM features

When the output is different every time, traditional testing breaks. Snapshot testing, eval suites, and drift detection as alternatives.

The eval gap nobody talks about: why your model works in the playground but fails in production

The gap between playground and production is usually context, not capability. How deployment conditions change model behaviour.

Measuring faithfulness in chain-of-thought: when the model’s reasoning doesn’t match its answer

Research shows visible reasoning doesn’t always match the actual decision process. What this means for trust and AI oversight.

Release volatility and trust: how to defend your workflow from silent model regressions

When models silently change, workflows break. A 10-prompt benchmark suite and versioned system prompts are your defence.
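
A minimal version of that defence, sketched with a hypothetical call_model; the file layout and check format are assumptions.

```python
import json
from pathlib import Path
from typing import Callable

def run_benchmark(call_model: Callable[[str], str],
                  suite_path: str = "benchmarks/prompts.json") -> list[str]:
    """Run a fixed prompt suite and report which cases regressed.

    prompts.json is versioned in git next to your system prompts, e.g.:
    [{"id": "refund-policy", "prompt": "...", "must_contain": ["14 days"]}]
    """
    failures = []
    for case in json.loads(Path(suite_path).read_text()):
        output = call_model(case["prompt"])
        missing = [s for s in case["must_contain"] if s not in output]
        if missing:
            failures.append(f"{case['id']}: missing {missing}")
    return failures

# Run on a schedule and after every provider update; a non-empty result means
# the model (or your prompt) changed behaviour without anyone announcing it.
```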

The model landscape

Open-weight vs closed API: Llama, Mistral, DeepSeek and Qwen are closing the gap — the real tradeoffs

The capability gap is closing fast. The real tradeoffs are cost, control, privacy, and operational burden.

Reasoning models changed everything: o1, o3, DeepSeek-R1, and why test-time compute is the new frontier

The frontier shifted from parameter scaling to test-time compute. What reasoning models mean for how you build and what becomes possible.

Multimodal AI is now table stakes: vision, audio, and cross-modal generation aren’t future features

Vision, audio, and cross-modal generation are baseline expectations for any frontier model. What this unlocks for product builders.

Small language models and edge AI: why the next leap may come from models getting smaller, not bigger

Sub-7B, quantised models that run on-device. What they enable and when they’re good enough to replace API calls.

Choosing the right model for your use case: a practical map of latency, cost, capability, and privacy tradeoffs

GPT-4o vs Claude vs Gemini vs open-weight models — mapped to real scenarios instead of synthetic benchmarks.

Model lock-in is real: how to architect for switching providers without rewriting your app

The abstraction layer that pays for itself. How to structure your codebase so changing models doesn’t mean changing everything.
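
One way to keep that seam thin, sketched as a minimal interface; the class names and method shape are illustrative, not tied to any particular SDK.

```python
from typing import Protocol

class ChatModel(Protocol):
    """The only surface the rest of the app is allowed to depend on."""
    def complete(self, system: str, user: str) -> str: ...

class AnthropicChat:
    def complete(self, system: str, user: str) -> str:
        raise NotImplementedError  # wrap the vendor SDK here, and only here

class OpenAIChat:
    def complete(self, system: str, user: str) -> str:
        raise NotImplementedError  # the second adapter; switching providers becomes config

def summarise(model: ChatModel, text: str) -> str:
    # Application code accepts the protocol, never a concrete vendor client.
    return model.complete("You summarise documents.", f"Summarise:\n{text}")
```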

The build vs buy vs fine-tune decision: when to use APIs, when to self-host, when to invest in custom training

When hosted APIs are enough, when self-hosting open-weight models is justified, and when custom training is the right call.

AI security, regulation, and compliance

Prompt injection is the #1 unsolved security problem: direct injection, indirect injection, and what actually works as defence

Direct and indirect injection attacks, and the defence strategies that actually work today — not just the ones that sound plausible.

The EU AI Act is now in force: risk tiers, GPAI obligations, open-source exemptions, and what builders need to do

The world’s first comprehensive AI law. What builders concretely need to do, not just know about.

Constitutional AI and alignment: how Anthropic trains models to follow principles instead of pattern-matching preferences

How training models against a written constitution differs from RLHF alone, and why this matters for product builders, not just researchers.

Red teaming your own product: why adversarial testing shouldn’t be left to the model provider

Your model provider tests their model. You need to test your product — the combination of model, prompt, tools, and user behaviour.

When safety training breaks: sleeper agents, alignment faking, and reward hacking — what the research actually shows

The limits of current safety techniques, grounded in actual research findings rather than speculation.

Sycophancy as a product risk: when the model tells users what they want to hear instead of what’s accurate

A design problem, not just a training problem. How sycophancy degrades decision quality and what product teams can do about it.

AI economics, cost curves, and developer experience

The 1000x cost collapse: inference went from $60 to $0.06 per million tokens in four years

What this cost curve means for which business models become viable — and which pricing strategies expire.

AI coding tools hit 85% adoption: what the productivity data actually shows (and what the METR study says it doesn’t)

Developers think they’re faster. The METR study says otherwise. What this tension means for engineering workflows.

MCP and tool use: how Anthropic’s open protocol became the standard for connecting LLMs to external systems

From Anthropic project to Linux Foundation standard. What MCP means for agent architecture and why it won.

Structured output and JSON mode: getting LLMs to reliably produce machine-parseable responses

Function calling, tool use schemas, and constrained generation. The techniques that work and the failure modes that don’t get covered in tutorials.
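
A sketch of the receiving end, assuming a hypothetical call_model that has been asked for JSON; whichever constrained-generation feature your provider offers, validate the result yourself before trusting it.

```python
import json
from typing import Callable

REQUIRED_KEYS = {"title": str, "priority": int, "tags": list}

def extract_ticket(call_model: Callable[[str], str], report: str, retries: int = 2) -> dict:
    prompt = (
        "Return ONLY a JSON object with keys title (string), priority (integer), "
        f"tags (list of strings). Bug report:\n{report}"
    )
    for _ in range(retries + 1):
        raw = call_model(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON: retry with the same prompt
        if isinstance(data, dict) and all(
            isinstance(data.get(key), expected) for key, expected in REQUIRED_KEYS.items()
        ):
            return data
    raise ValueError("model never produced valid, schema-conforming JSON")
```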

Inference optimisation for real people: quantisation, speculative decoding, and KV-cache tricks that cut costs

Practical techniques for cutting inference costs by 50–70% without degrading output quality.

API design patterns for LLM products: streaming, retries, fallbacks, rate limiting, and multi-model routing

The production patterns nobody covers in tutorials. What your LLM-powered API layer actually needs to handle.
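
A sketch of the retry-and-fallback slice of that list, with primary and backup standing in for two model clients; the backoff numbers are placeholders.

```python
import time
from typing import Callable

def with_fallback(primary: Callable[[str], str],
                  backup: Callable[[str], str],
                  prompt: str,
                  retries: int = 2,
                  base_delay: float = 0.5) -> str:
    """Retry the primary model with exponential backoff, then fail over to the backup."""
    for attempt in range(retries):
        try:
            return primary(prompt)
        except Exception:
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, ...
    return backup(prompt)  # if the backup also fails, let its exception propagate
```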

AI and jobs — the nuanced version: augmentation vs replacement, and why the answer depends on the task, not the role

The answer isn’t “AI will replace X jobs.” It’s that specific tasks within roles are being transformed. The granularity matters.

Claude Code, skills, and the practitioner’s edge

The Claude Code skills framework — what most people get wrong

Skills ≠ slash commands ≠ MCP. The invocation hierarchy, trigger descriptions, and token efficiency of on-demand loading vs always-loaded context.

System prompt leaks and what they reveal about Claude’s design philosophy

Most coverage treats leaks as scandal. The prompt is actually good engineering — the ~15K token overhead, classifier stack, and deliberate tradeoffs.

MCP one year later — the USB-C of AI (or is it?)

When MCP genuinely adds value vs when native tools are simpler. Security realities: tool poisoning, exfiltration risks, and the code execution pattern.

Context window management is the real skill

Why “stay under 60%” isn’t arbitrary, how handoff patterns keep sessions clean, and the token math of CLAUDE.md vs on-demand skills.

Extended thinking — the feature most people misuse

Think tool vs extended thinking, interleaved thinking in Claude 4+, and why token budget calibration matters more than just enabling it.

The "Claude is dead" cycle: why community perception oscillates while usage grows 5.5x

Recency bias in comparison posts, how system prompt conservatism frustrates casual users but delights power users, and what builders should take from it.

Solo SaaS + Claude Code — the real economics

Actual cost breakdowns from building real products. Where Claude Code excels (greenfield, refactoring) vs where it struggles (legacy edge cases).

What I learned: practical lessons from building with LLMs

Getting Claude to actually follow rules: enforceable constraints, not conversational prose

Good prompts aren’t prose — they’re contracts. Always/Never/If–then constraints with the top 3 restated at the end.
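
The shape of such a contract, written out as a plain string; the rules themselves are placeholders for whatever your product actually requires.

```python
SYSTEM_PROMPT = """\
ALWAYS:
- Answer in British English.
- Cite the source document ID for every factual claim.

NEVER:
- Invent a document ID.
- Offer legal advice.

IF the user asks about something outside the provided documents,
THEN reply exactly: "I don't have a source for that."

Top 3 rules, restated: cite document IDs, never invent IDs, refuse when there is no source.
"""
```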

Context is not memory: why uploading documents doesn’t mean the model retrieves the right parts

Maintain a living brief plus retrieval shards. Ask it to cite chunk IDs from your own structure rather than hoping it finds the right paragraph.

Stop telling it to "just read" your PDF: chunk by decision purpose, not headings

The bottleneck isn’t file size — it’s selection accuracy. Generate a table of contents with IDs, then query by ID.
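
A sketch of the ID-first approach, assuming you control the chunking; the chunk IDs and contents are invented.

```python
# Step 1: chunk by decision purpose, not by headings, and give each chunk a stable ID.
CHUNKS = {
    "PRICING-01": "Per-seat pricing tiers and volume discounts ...",
    "PRICING-02": "Refund policy: 14 days, pro-rated after that ...",
    "SECURITY-01": "Data residency, encryption at rest, SOC 2 status ...",
}

def table_of_contents() -> str:
    """Step 2: send this small TOC to the model instead of the whole document."""
    return "\n".join(f"{cid}: {text[:60]}" for cid, text in CHUNKS.items())

def fetch(chunk_ids: list[str]) -> str:
    """Step 3: the model asks for IDs; you return only those chunks."""
    return "\n\n".join(CHUNKS[cid] for cid in chunk_ids if cid in CHUNKS)
```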

Conversation simulation doesn’t scale: why 50-turn dialogues collapse, and how state machines fix it

Long simulated dialogues collapse into repetition and lost constraints. Convert sims into state machines: scenario → state → inputs → expected outputs.
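
One scenario converted into that shape; the states, inputs, and expected keywords are invented for illustration, and call_model is a hypothetical stand-in.

```python
# scenario -> state -> inputs -> expected outputs, instead of a 50-turn free-form dialogue
REFUND_SCENARIO = {
    "start": {
        "input": "I want my money back",
        "expect_keywords": ["order number"],
        "next": "awaiting_order_number",
    },
    "awaiting_order_number": {
        "input": "Order #4512",
        "expect_keywords": ["refund", "14 days"],
        "next": "done",
    },
}

def run_scenario(call_model, scenario: dict, start: str = "start") -> list[str]:
    """Walk the states and record every expected keyword the model failed to produce."""
    failures, state = [], start
    while state != "done":
        step = scenario[state]
        reply = call_model(step["input"]).lower()
        failures += [kw for kw in step["expect_keywords"] if kw not in reply]
        state = step["next"]
    return failures
```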

Model quality matters less than workflow robustness

Tooling flakiness, UI artefacts, and environment quirks invalidate results even when the model is performing well. Build checkpoint protocols.

A practical Claude Code SOP: the boring discipline that actually works

Scope locks, checkpoints, explicit file ownership, and the plan → patch → test → summarise loop. Not exciting, but reliable.

Limits, tokens, and the hidden cost curve

“Stuff more context in” is the fastest way to burn budget and degrade output quality simultaneously. Prefer summaries and pointers.

Coming Soon