Reference · 2025–2026 · evidence-graded

How the leading AI-native startups actually build.

A single map of the beliefs, architecture, engine, sequence, economics, risks and tools behind AI-first companies. Built from primary sources and named case studies, then adversarially fact-checked. Generic and shareable: every example is a public company.

10 axioms | 20 concepts · 5 layers | 1 self-improving loop | 8 build phases | 12 edge cases | 25 claims verified · 5 killed

1 Product · 2 Intelligence · 3 Moats · 4 Economics & Org · 5 Discipline

The Axioms

Ten irreducible truths. If you disagree with these, the rest will not help you.

AI is the building layer, not a feature. The agent wraps deterministic tools; tools do not wrap the agent.

The model is a commodity; the moat is everything around it. Data, workflow, evals, distribution. Never the prompt.

Context engineering is the core discipline. Most agent failures are context failures, not reasoning failures.

Compounding requires a closed loop. Sense, decide, act, evaluate, improve. Only the closed loop compounds.

Legibility precedes intelligence. If it was not recorded, it did not happen to your AI.

The constraint shifts from headcount towards inference — directionally, and contested. A direction of travel, not a settled law.

Verticals win on depth. Depth = proprietary data + codified workflow + evals, accumulated through deployment.

Evaluation is the steering wheel. Without evals you are driving blind at speed.

Code is ephemeral; context is permanent. Regenerate code as models improve; the durable asset is what you know.

Start with one named user and one bulletproof loop. Everything else is premature.

The Stack — 20 concepts, 5 layers

The concepts form an architecture, not a list. Build from the bottom up.

▲ build upward — each layer rests on the one below ▲

LAYER 1

Product Architecture — what the product is

1 · AI as the building layer

If removing the AI leaves a working product, you built a co-pilot. The shadow: the “horseless carriage” anti-pattern (Pete Koomen).

2 · Agent-native properties

Parity, granularity, composability, emergent capability, self-improvement (Dan Shipper).

3 · Machine-readable interfaces

Build for agents: APIs, MCPs, CLIs, not human-first GUIs.

4 · LLM-land vs code-land

Judgment in the LLM; deterministic actions in code. Confusing them is the No. 1 failure.

5 · Execution → ideation shift

When building is near-free, “what to build” becomes the bottleneck (a16z).
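A minimal sketch of the concept 4 split, assuming a hypothetical refund workflow. The model only proposes a structured action; plain code validates and executes it. `propose_action` is a stand-in for a real LLM call, and every name and limit here is illustrative rather than taken from any cited company.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedAction:
    name: str
    amount_cents: int
    reason: str


def propose_action(ticket_text: str) -> ProposedAction:
    """LLM-land stand-in (hypothetical): a real system would call a model here
    and parse its structured output. Hard-coded so the sketch is runnable."""
    return ProposedAction(name="refund", amount_cents=1250, reason="duplicate charge")


MAX_REFUND_CENTS = 5000  # assumed business rule, enforced in code, not in the prompt


def execute(action: ProposedAction, ledger: list) -> str:
    """Code-land: deterministic validation and execution of the proposal."""
    if action.name != "refund":
        return "rejected: unknown action"
    if not 0 < action.amount_cents <= MAX_REFUND_CENTS:
        return "rejected: amount outside limit"
    ledger.append({"type": "refund", "amount_cents": action.amount_cents, "reason": action.reason})
    return "executed"


ledger = []
result = execute(propose_action("I was charged twice"), ledger)
```

The design choice is that the model never touches the ledger directly: a bad proposal is rejected by the same code path every time, whatever the prompt said.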

LAYER 2

The Intelligence Engine — how it gets smart

6 · The self-improving loop

Sense → decide → act → evaluate → improve, continuously. See “The Engine” below.

7 · Context engineering

Curate the finite context window; progressive disclosure, just-in-time retrieval. Beware “context rot”.

8 · Thin harness, fat skills

Reuse the harness; put all intelligence in markdown skills (SKILL.md, progressive disclosure).

9 · Legibility: record everything

Conversations, tickets, calls, decisions. Cannot be retrofitted.

10 · Company brain / world model

Two queryable models: how the company works + everything about customers.

11 · Skill self-improvement

Feed usage transcripts back as metaprompts; the skill surpasses any individual.

12 · Evals as the steering wheel

Measure consistency across runs. Single-run metrics massively overstate reliability.
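To make concept 12 concrete, here is a sketch of a consistency metric in the pass^k style that tau-bench-style suites report: the chance that k independent runs of the same task all succeed. The per-task estimate C(c, k) / C(n, k) for c successes in n runs is a standard combinatorial estimator; the task counts below are invented for illustration.

```python
from math import comb


def pass_hat_k(successes_per_task: list, n_runs: int, k: int) -> float:
    """Average over tasks of C(c, k) / C(n_runs, k), where c is the number of
    successful runs for that task. With k=1 this is the plain success rate."""
    if not 1 <= k <= n_runs:
        raise ValueError("k must be between 1 and n_runs")
    per_task = [comb(c, k) / comb(n_runs, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)


# Hypothetical results: 4 tasks, 8 runs each, successful runs per task.
successes = [8, 6, 4, 2]
single_run = pass_hat_k(successes, n_runs=8, k=1)   # 0.625
all_of_four = pass_hat_k(successes, n_runs=8, k=4)  # about 0.307
```

On these made-up numbers the single-run score of 62.5% drops to roughly 31% once four consecutive successes are required, which is the gap the concept warns about.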

LAYER 3

Moats — why it is defensible

13 · Domain-first verticals

Own a vertical end-to-end. Sierra, Harvey, Decagon, Hippocratic.

14 · Data + workflow moat

Not the model, not the prompt. Better models make the app layer more capable, not thinner.

15 · Forward-deployed engineering

Embed in the customer; extract tacit knowledge; build evals; merge reusable features back.

LAYER 4

Economics & Organisation — how it is run

16 · Burn tokens, not headcount

Directional, contested. Numbers are real; formal metrics were refuted and rolled back.

17 · Flat, egalitarian, trust-by-default

Everyone gets the infrastructure; conversations visible. A startup-stage edge.

18 · Outcome-based pricing

Charge per resolution, not per seat (Sierra). Forces eval discipline.

LAYER 5

Discipline — what keeps it honest

19 · The named-user gate (Phase 0)

PR-FAQ + 3 falsifiable pillars + one named real user + pre-mortem + kill criteria. No name, no build.

20 · Tokenmaxxing (with discipline)

Spend tokens on high-leverage work: research, evals, hard reasoning. Track cost-per-outcome.

The Engine — the self-improving loop

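This loop lives inside Layer 2 of the stack. Each of its five motions maps to its own component: a sensor layer, a policy layer, a tool layer, a quality gate and a learning step. These are parts of the loop, not the five stack layers above. When all five run with minimal human intervention, the system improves with every cycle.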

Sense · sensor layer

Read the world: customer messages, tickets, cancellations, telemetry, code changes.

Decide · policy layer

Rules for autonomy: what the agent may do alone, what needs approval, what must be logged.

Act · tool layer

Execute deterministic real actions through tools. Reversible & audited where it matters.

Evaluate · quality gate

Grounding check, safety, brand/policy. Plus production sampling for the side effects evals miss.

Improve · learning

Detect what failed; propose skill updates; feed back into the sensor layer. The nightly cycle.

Proof it is real: Cognition’s Devin reportedly became up to 4× faster and 2× more resource-efficient through deployment, lifting merged-PR rate from ~34% to ~67% — a loop that turned on itself.

The Build Sequence

Each step is a prerequisite for the next. Parameterise it: choose a vertical V, a named user U, a core workflow W.

WEEK 0 · GATE

Phase 0 — named user

PR-FAQ, three falsifiable pillars, one named user U who confirms they would pay, pre-mortem, kill criteria.

→ Layer 5 discipline

WEEKS 1–2

Legibility

One database. Record everything from day one. Append-only event log. Cannot be retrofitted.

→ Layer 2 · concept 9
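One way to sketch the append-only event log, assuming SQLite as the single database. The table name, columns and event kinds are hypothetical. The triggers make the table reject updates and deletes, so history can only grow.

```python
import json
import sqlite3
from datetime import datetime, timezone


def open_event_log(path: str = ":memory:") -> sqlite3.Connection:
    """Create the events table plus triggers that block UPDATE and DELETE."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS events (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            recorded_at TEXT NOT NULL,
            kind TEXT NOT NULL,
            payload TEXT NOT NULL
        );
        CREATE TRIGGER IF NOT EXISTS events_no_update BEFORE UPDATE ON events
        BEGIN SELECT RAISE(ABORT, 'events is append-only'); END;
        CREATE TRIGGER IF NOT EXISTS events_no_delete BEFORE DELETE ON events
        BEGIN SELECT RAISE(ABORT, 'events is append-only'); END;
    """)
    return conn


def record(conn: sqlite3.Connection, kind: str, payload: dict) -> int:
    """Append one event with a UTC timestamp and return its row id."""
    cur = conn.execute(
        "INSERT INTO events (recorded_at, kind, payload) VALUES (?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), kind, json.dumps(payload)),
    )
    conn.commit()
    return cur.lastrowid


log = open_event_log()
first_id = record(log, "ticket.opened", {"ticket": "T-1", "text": "charged twice"})
record(log, "agent.decision", {"ticket": "T-1", "action": "refund"})

try:
    log.execute("DELETE FROM events WHERE id = ?", (first_id,))
    delete_blocked = False
except sqlite3.DatabaseError:
    delete_blocked = True  # the trigger aborted the statement
```

Payloads here are raw for brevity; under the privacy constraints noted in edge case E5 a real log would store references into a PII vault instead.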

WEEKS 2–4

Harness + first skill

Choose one harness; do not build your own. Write workflow W as a SKILL.md. Judgment in skill, actions in code.

→ Layer 2 · 7, 8
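Progressive disclosure can be sketched as two reads of the same SKILL.md: a cheap pass that keeps only the `name` and `description` frontmatter in context, and a full load when the skill is selected. The parser below is deliberately naive and the skill content is hypothetical; a real loader would use a YAML library and whatever selection logic the chosen harness provides.

```python
import tempfile
from pathlib import Path


def read_frontmatter(skill_file: Path) -> dict:
    """Read only the 'key: value' lines between the leading '---' fences."""
    meta = {}
    with skill_file.open(encoding="utf-8") as fh:
        if fh.readline().strip() != "---":
            return meta
        for line in fh:
            if line.strip() == "---":
                break
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta


def load_body(skill_file: Path) -> str:
    """Load the full instructions only when the skill is actually needed."""
    text = skill_file.read_text(encoding="utf-8")
    parts = text.split("---", 2)
    return parts[2].strip() if len(parts) == 3 else text.strip()


# Demo: a hypothetical skill written to a temporary directory.
root = Path(tempfile.mkdtemp())
skill = root / "refund-triage" / "SKILL.md"
skill.parent.mkdir()
skill.write_text(
    "---\n"
    "name: refund-triage\n"
    "description: Decide whether a billing ticket qualifies for a refund.\n"
    "---\n"
    "# Refund triage\n"
    "1. Check for duplicate charges.\n"
    "2. Propose a refund within policy.\n",
    encoding="utf-8",
)

index = {p.parent.name: read_frontmatter(p) for p in root.glob("*/SKILL.md")}  # always in context
body = load_body(skill)  # loaded just in time
```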

WEEKS 3–5

Policy layer

Define autonomy boundaries. Mark every action reversible or not. Reversible + low-stakes can be autonomous.

→ loop · decide
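The autonomy rule in this phase fits in a few lines. This sketch assumes each action is registered with two flags, `reversible` and `high_stakes`; the registry entries are hypothetical, and unknown actions default to human approval.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    AUTONOMOUS = "autonomous"
    NEEDS_APPROVAL = "needs_approval"


@dataclass(frozen=True)
class ActionSpec:
    name: str
    reversible: bool
    high_stakes: bool


def decide(action: ActionSpec) -> Decision:
    """Only reversible, low-stakes actions may run without a human."""
    if action.reversible and not action.high_stakes:
        return Decision.AUTONOMOUS
    return Decision.NEEDS_APPROVAL


# Hypothetical registry: every action is marked before the agent may use it.
REGISTRY = {
    "add_internal_note": ActionSpec("add_internal_note", reversible=True, high_stakes=False),
    "issue_refund": ActionSpec("issue_refund", reversible=False, high_stakes=True),
    "pause_subscription": ActionSpec("pause_subscription", reversible=True, high_stakes=True),
}

audit_log = []


def route(action_name: str) -> Decision:
    """Look up the action, decide, and log the decision either way."""
    spec = REGISTRY.get(action_name)
    decision = decide(spec) if spec is not None else Decision.NEEDS_APPROVAL
    audit_log.append((action_name, decision.value))
    return decision
```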

WEEKS 5–6

Quality gate + evals

Stand up evals before scaling usage. Grounding, safety, brand. Add production sampling. Measure across runs.

→ Layer 2 · 12

WEEKS 6–8

Learning loop

Nightly cycle reads transcripts, proposes skill improvements. Measure resolution, escalation, satisfaction.

→ loop · improve

MONTHS 2–3

Company brain

Queryable world models over everything recorded: vector store + event log. Prune for context rot.

→ Layer 2 · 10

MONTHS 4–6

Moat & expansion

Deepen data + workflow moat. Adjacent workflows on the same substrate, outcome pricing, FDE for high-value verticals.

→ Layer 3 · 13–15

Evaluation — what each concept is for

Some concepts are entry tickets, some are durable advantages, some are powerful levers that misfire if misapplied.

Table stakes

Do these or you are AI-assisted, not AI-first

AI as building layer (1)
LLM-land vs code-land (4)
Context engineering (7)
Thin harness, fat skills (8)
Legibility (9)
Evals (12)
Named-user gate (19)

Moats

Compounding, hard to copy

Self-improving loop (6) — 3–6 mo
Company brain (10) — 6–12 mo
Skill self-improvement (11) — 2–3 mo
Data + workflow moat (14) — per deploy
Forward-deployed engineering (15)

Tactical

High leverage, apply with discipline

Tokenmaxxing (20)
Burn tokens not headcount (16)
Outcome pricing (18)
Flat / trust-by-default org (17)

Edge Cases & Mitigations

Where most implementations fail. Each is a real failure mode with a concrete mitigation.

E1 · Thin-wrapper trap

Next model release commoditises you. Fix: moat must be owned data + workflow + evals from deployment, never the prompt.

E2 · Eval blind spot

Evals only catch what you thought to measure. Fix: production sampling + adversarial evals; monitor downstream effects.

E3 · Runaway loop

A loop compounds errors as fast as wins. Fix: human gate on high-stakes; reversibility; kill switch; real policy layer.

E4 · Token-cost explosion

Tokenmaxxing everything burns cash. Fix: only high-leverage tasks; track cost-per-outcome; per-workflow budgets.
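A per-workflow budget with cost-per-outcome tracking might look like the following. It is a sketch under assumptions: dollar figures, the workflow name and the definition of a successful outcome are all invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkflowBudget:
    name: str
    monthly_budget_usd: float
    spent_usd: float = 0.0
    outcomes: int = 0

    def can_spend(self, estimated_usd: float) -> bool:
        """Check the budget before starting a run, not after."""
        return self.spent_usd + estimated_usd <= self.monthly_budget_usd

    def record_run(self, cost_usd: float, succeeded: bool) -> None:
        """Every run costs money; only successful runs count as outcomes."""
        self.spent_usd += cost_usd
        self.outcomes += int(succeeded)

    def cost_per_outcome(self) -> Optional[float]:
        """Total spend divided by successful outcomes; None until one exists."""
        return self.spent_usd / self.outcomes if self.outcomes else None


triage = WorkflowBudget(name="refund-triage", monthly_budget_usd=2.00)
for cost, ok in [(0.40, True), (0.60, False), (0.50, True)]:
    if triage.can_spend(cost):
        triage.record_run(cost, ok)

unit_cost = triage.cost_per_outcome()    # 1.50 spent over 2 outcomes
next_run_allowed = triage.can_spend(0.75)  # would exceed the 2.00 budget
```

Failed runs are charged to the same denominator, so a workflow that burns tokens without resolving anything shows up as a rising cost per outcome.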

E5 · Legibility vs privacy

“Record everything” hits GDPR/PII. Fix: consent receipts, data minimisation, PII vaults; record process, not raw data.

E6 · Over-automation

Staff feel replaced; bad autonomous calls. Fix: humans at the edges; approval thresholds; audit trails; reversibility.

E7 · Model lock-in

One lab’s model; they raise prices or deprecate. Fix: provider-agnostic abstraction; evals catch swap regressions.
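A provider-agnostic abstraction can be as small as one interface plus a golden set that is run before any swap. Both provider classes below are hypothetical stubs with canned behaviour; real adapters would wrap a vendor SDK behind the same `complete` method.

```python
from typing import Protocol


class ChatModel(Protocol):
    def complete(self, prompt: str) -> str: ...


class ProviderA:
    """Hypothetical adapter for the current model."""

    def complete(self, prompt: str) -> str:
        return "REFUND" if "charged twice" in prompt else "ESCALATE"


class ProviderB:
    """Hypothetical adapter for a candidate replacement model."""

    def complete(self, prompt: str) -> str:
        return "ESCALATE"


GOLDEN_SET = [
    ("Customer says they were charged twice.", "REFUND"),
    ("Customer threatens legal action.", "ESCALATE"),
]


def pass_rate(model: ChatModel, golden: list) -> float:
    """Fraction of golden cases where the model returns the expected label."""
    hits = sum(model.complete(prompt) == expected for prompt, expected in golden)
    return hits / len(golden)


baseline = pass_rate(ProviderA(), GOLDEN_SET)   # 1.0
candidate = pass_rate(ProviderB(), GOLDEN_SET)  # 0.5
swap_is_safe = candidate >= baseline            # False: the eval caught a regression
```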

E8 · Flat-org limit

Trust-by-default breaks at scale / regulated. Fix: treat as stage-specific; formalise selectively as you grow.

E9 · FDE becomes consulting

Bespoke one-offs never merge back. Fix: 70%+ of FDE code in main repo by month 12; one reusable feature per engagement.

E10 · Building before a named user

Cheaper execution makes it more tempting. Fix: the Phase 0 gate. No name, no build.

E11 · “What to build” thrash

Execution is cheap; teams build everything. Fix: tight loops with real users; exploration over more pipelines.

E12 · Context rot

World model accumulates stale, contradictory data. Fix: decay, contradiction detection, periodic distillation — prune as well as add.
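Decay and a crude form of contradiction handling can be sketched together. The assumptions are loud ones: facts are keyed, a newer fact for the same key supersedes an older one, and freshness halves every 90 days. None of these thresholds come from the sources; they are placeholders to tune.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Fact:
    key: str
    value: str
    observed_at: datetime


def freshness(fact: Fact, now: datetime, half_life_days: float = 90.0) -> float:
    """Exponential decay: 1.0 when just observed, 0.5 after one half-life."""
    age_days = max((now - fact.observed_at).total_seconds() / 86400, 0.0)
    return 0.5 ** (age_days / half_life_days)


def prune(facts: list, now: datetime, min_freshness: float = 0.25) -> list:
    """Keep the newest fact per key, then drop anything that has gone stale."""
    newest = {}
    for fact in facts:
        current = newest.get(fact.key)
        if current is None or fact.observed_at > current.observed_at:
            newest[fact.key] = fact
    return [f for f in newest.values() if freshness(f, now) >= min_freshness]


now = datetime(2026, 6, 1, tzinfo=timezone.utc)
facts = [
    Fact("acme.plan", "starter", now - timedelta(days=400)),
    Fact("acme.plan", "enterprise", now - timedelta(days=10)),   # contradicts the older fact
    Fact("acme.contact", "former admin", now - timedelta(days=365)),
]
kept = prune(facts, now)  # only the recent 'enterprise' fact survives
```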

The Tool Stack (generic)

The categories you assemble, with public examples. Pick one per category and resist rebuilding the commodity layers.

HARNESS

Agent runtime

The core loop: input → LLM → tool calls. Do not build your own.

e.g. Claude Code · Cursor · Codex · open agent runtimes

SKILLS

Skill files

Markdown workflows loaded by progressive disclosure.

e.g. Anthropic Agent Skills (SKILL.md)

INTERFACES

Machine-readable contracts

How agents and tools talk. Built for agents, not humans.

e.g. MCP servers · APIs · CLIs

CONTEXT

Retrieval & memory

Just-in-time loading; semantic + keyword retrieval; keep the window clean.

e.g. vector store (pgvector) · hybrid RAG · RRF re-rank
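Reciprocal rank fusion (RRF) is the simplest way to merge a semantic ranking with a keyword ranking: each document scores the sum of 1 / (k + rank) over the lists it appears in, with k = 60 as the conventional constant. The document ids below are hypothetical, and ties are broken alphabetically as an arbitrary choice for this sketch.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse several ranked lists of document ids into one ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; alphabetical order breaks exact ties.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


semantic = ["doc_a", "doc_b", "doc_c"]  # e.g. from a vector search
keyword = ["doc_b", "doc_d", "doc_a"]   # e.g. from a keyword search
fused = reciprocal_rank_fusion([semantic, keyword])
# fused == ["doc_b", "doc_a", "doc_d", "doc_c"]
```

Because only ranks are used, the two retrievers do not need comparable score scales, which is why RRF suits hybrid retrieval.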

WORLD MODEL

Company brain store

Queryable customer + company models over everything recorded.

e.g. Postgres + vector DB · append-only event log

EVALS

Evaluation harness

Multi-run consistency, adversarial cases, production sampling.

e.g. eval frameworks · golden sets · tau-bench-style suites

OBSERVABILITY

Traces & transcripts

Every conversation and tool call logged; raw material for the learning loop.

e.g. agent-trace logging · transcript stores

ACTIONS

Deterministic tools

Code-land: payments, writes, deploys. Reversible & audited.

e.g. payment APIs · DB writes · deploy hooks

What Is Not Settled

Built with adversarial verification: every major claim was challenged, and five popular claims were killed. Share these caveats alongside the guide — they separate a credible reference from hype.

Claims that failed verification — do not repeat

~~Nvidia formally budgets ~$250K tokens per $500K engineer.~~ refuted 0-3
~~Founders compress idea-to-ship from 6 months to 1 day.~~ refuted 0-3
~~Cal AI: $50M ARR with 7 people, no VC.~~ refuted 0-3
~~“Software for Agents” as an explicit YC investment thesis.~~ refuted 1-2 — pattern real, attribution not
~~Eval cost now rivals or exceeds training cost.~~ refuted 1-2

Genuinely open questions

Is token spend a durable metric or a 2026 fad? Numbers real; formal adoption refuted; leaderboards rolled back.
Does the data moat actually defend verticals? Domain-first is confirmed; durable defensibility vs lab commoditisation is unproven. Treat as a hypothesis to test.
Does the named-user discipline hold empirically? Sound practice, but untested by any surviving source. A principle, not a finding.
True capital efficiency of AI-native firms? The leanness narrative outran the verifiable evidence.

Time-sensitivity: nearly every source dates from Dec 2025 – May 2026 in a fast-moving field. Re-verify before betting heavily.

Sources

Primary sources and named case studies. Method: 5-angle web deep-research (24 sources, 115 claims, 25 adversarially verified — 20 confirmed, 5 killed), plus practitioner search, 1 June 2026.

PRI · YC, The Playbook for Building an AI-Native Company
Diana Hu · AI as the OS, self-improving loop, flat orgs

PRI · YC, Requests for Startups
AI-native services, the “company brain” primitive

SEC · a16z, Notes on AI Apps in 2026
building layer, thick apps, execution→ideation

SEC · a16z, Good News: AI Will Eat Application Software
data moat from enterprise deployment

PRI · Anthropic, The Founder’s Playbook
4-stage map: Idea, MVP, Launch, Scale

PRI · Anthropic, Agent Skills
SKILL.md + progressive disclosure

PRI · Anthropic, Effective Context Engineering
context as a finite resource; context rot

PRI · Anthropic, Effective Harnesses for Long-Running Agents
initializer + coding agent; state in git

PRI · Cursor, The Third Era of Software Development
Michael Truell · agent fleets; usage inversion

SEC · InfoQ, Cursor 3 agent-first interface
IDE as fallback; 2× agent vs tab users

PRI · Pete Koomen, Horseless Carriages
the AI-as-feature anti-pattern

SEC · Contrary, Cognition / Devin breakdown
the self-improving loop, made real

SEC · Sacra, Sierra
outcome pricing; agents that act; FDE

BLOG · Perspective, Harvey AI & Forward-Deployed Engineering
FDE as GTM; firm-specific data moat

BLOG · Perspective, FDE Founder’s Playbook 2026
70%+ code in main repo by month 12

BLOG · VC Cafe, Vertical AI in 2026: the good, the bad, the ugly
contrarian: thin wrappers commoditise

BLOG · Hugging Face EvalEval, the eval-cost bottleneck
agent benchmarks resist compression; tau-bench 60→25%

SEC · Kingy, Claudeonomics
token spend numbers (with rollback caveat)

Dan Shipper, agent-native software pillars
parity, granularity, composability, emergent, self-improve

Aaron Levie, context as the moat
domain understanding flywheel

Compiled 2026-06-01, last reviewed 2026-06-02 · evidence-graded and adversarially verified · every example is a public company.
