Reference · 2025–2026 · evidence-graded
How the leading AI-native startups actually build.
A single map of the beliefs, architecture, engine, sequence, economics, risks and tools behind AI-first companies. Built from primary sources and named case studies, then adversarially fact-checked. Generic and shareable: every example is a public company.
The Axioms
Ten irreducible truths. If you disagree with these, the rest will not help you.
AI is the building layer, not a feature. The agent wraps deterministic tools; tools do not wrap the agent.
The model is a commodity; the moat is everything around it. Data, workflow, evals, distribution. Never the prompt.
Context engineering is the core discipline. Most agent failures are context failures, not reasoning failures.
Compounding requires a closed loop. Sense, decide, act, evaluate, improve. Only the closed loop compounds.
Legibility precedes intelligence. If it was not recorded, it did not happen to your AI.
The constraint shifts from headcount towards inference — directionally, and contested. A direction of travel, not a settled law.
Verticals win on depth. Depth = proprietary data + codified workflow + evals, accumulated through deployment.
Evaluation is the steering wheel. Without evals you are driving blind at speed.
Code is ephemeral; context is permanent. Regenerate code as models improve; the durable asset is what you know.
Start with one named user and one bulletproof loop. Everything else is premature.
The Stack — 20 concepts, 5 layers
The concepts form an architecture, not a list. Build from the bottom up.
▲ build upward — each layer rests on the one below ▲
LAYER 1 · Product Architecture — what the product is
LAYER 2 · The Intelligence Engine — how it gets smart
LAYER 3 · Moats — why it is defensible
LAYER 4 · Economics & Organisation — how it is run
LAYER 5 · Discipline — what keeps it honest
The Engine — the self-improving loop
This loop lives inside Layer 2. Its five motions each map to one architectural layer. When all five run with minimal human intervention, the system improves with every cycle.
Sense. Read the world: customer messages, tickets, cancellations, telemetry, code changes.
Decide. Rules for autonomy: what the agent may do alone, what needs approval, what must be logged.
Act. Execute real actions through deterministic tools, reversible and audited where it matters.
Evaluate. Grounding check, safety, brand/policy, plus production sampling for the side effects evals miss.
Improve. Detect what failed, propose skill updates, and feed them back into the sensor layer. This is the nightly cycle.
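The five motions above (sense, decide, act, evaluate, improve, as named in the axioms) can be sketched as one cycle. This is a minimal illustration under stated assumptions, not a reference implementation: every name here (`LoopState`, `run_cycle`, the toy refund rule, the toy quality gate) is hypothetical, and a real system would call an LLM and real tools where this sketch uses plain functions.

```python
from dataclasses import dataclass, field


@dataclass
class LoopState:
    """What the loop remembers between cycles (hypothetical shape)."""
    skill_notes: list[str] = field(default_factory=list)  # proposed skill updates
    audit_log: list[dict] = field(default_factory=list)   # one record per handled message


def sense(inbox: list[str]) -> list[str]:
    # Sense: read the world. Here that only means dropping blank messages.
    return [m.strip() for m in inbox if m.strip()]


def decide(message: str) -> dict:
    # Decide: apply autonomy rules. Toy rule (an assumption): refunds need approval.
    needs_approval = "refund" in message.lower()
    return {"message": message, "action": "reply", "needs_approval": needs_approval}


def act(decision: dict) -> dict:
    # Act: run a deterministic tool, or queue the decision for a human.
    status = "queued_for_human" if decision["needs_approval"] else "sent"
    return {**decision, "status": status}


def evaluate(result: dict) -> bool:
    # Evaluate: toy gate that passes only what was completed without escalation.
    return result["status"] == "sent"


def improve(state: LoopState, result: dict, passed: bool) -> None:
    # Improve: log the outcome and propose a skill update for anything that failed the gate.
    state.audit_log.append({**result, "passed": passed})
    if not passed:
        state.skill_notes.append(f"review handling of: {result['message']}")


def run_cycle(state: LoopState, inbox: list[str]) -> None:
    for message in sense(inbox):
        result = act(decide(message))
        improve(state, result, evaluate(result))


state = LoopState()
run_cycle(state, ["Where is my order?", "  ", "I want a refund"])
print(len(state.audit_log), state.skill_notes)
```

The point of the shape is that `improve` writes back into state that the next cycle reads, which is what closes the loop.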
The Build Sequence
Each step is a prerequisite for the next. Parameterise it: choose a vertical V, a named user U, a core workflow W.
Phase 0 — named user
PR-FAQ, three falsifiable pillars, one named user U who confirms they would pay, pre-mortem, kill criteria.
Legibility
One database. Record everything from day one. Append-only event log. Cannot be retrofitted.
Harness + first skill
Choose one harness; do not build your own. Write workflow W as a SKILL.md. Judgment in skill, actions in code.
Policy layer
Define autonomy boundaries. Mark every action reversible or not. Reversible + low-stakes can be autonomous.
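The rule in this step is small enough to write down directly. A minimal sketch, assuming each action carries two flags: the `Action` and `Verdict` types, the `decide` function and the example action names are hypothetical, and a real policy layer would also log every decision.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    AUTONOMOUS = "autonomous"
    NEEDS_APPROVAL = "needs_approval"


@dataclass(frozen=True)
class Action:
    name: str
    reversible: bool   # can the effect be undone after the fact?
    high_stakes: bool  # would a mistake be costly even if undone?


def decide(action: Action) -> Verdict:
    # Only reversible AND low-stakes actions run without a human in the loop.
    if action.reversible and not action.high_stakes:
        return Verdict.AUTONOMOUS
    return Verdict.NEEDS_APPROVAL


actions = [
    Action("draft_reply", reversible=True, high_stakes=False),
    Action("issue_refund", reversible=False, high_stakes=True),
    Action("change_plan", reversible=True, high_stakes=True),
]
for a in actions:
    print(a.name, decide(a).value)
```

Defaulting to approval means an action nobody has classified yet cannot run on its own by accident.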
Quality gate + evals
Stand up evals before scaling usage. Grounding, safety, brand. Add production sampling. Measure across runs.
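"Measure across runs" can be read as: send the same input several times and check how often the agent agrees with itself. The sketch below is one hedged interpretation. `consistency`, `scripted_agent` and the 0.9 threshold are assumptions, and the scripted agent is a deterministic stand-in for a real, non-deterministic model call.

```python
from collections import Counter
from itertools import cycle
from typing import Callable


def consistency(agent: Callable[[str], str], prompt: str, runs: int = 10) -> tuple[str, float]:
    """Run one prompt `runs` times; return the most common answer and its share."""
    answers = Counter(agent(prompt) for _ in range(runs))
    answer, count = answers.most_common(1)[0]
    return answer, count / runs


def scripted_agent(outputs: list[str]) -> Callable[[str], str]:
    # Stand-in for a model: replays a fixed sequence of answers in a loop.
    replay = cycle(outputs)
    return lambda prompt: next(replay)


agent = scripted_agent(["approve", "approve", "deny", "approve"])
answer, share = consistency(agent, "Refund order 17?", runs=8)
passes = share >= 0.9  # hypothetical threshold; choose one per workflow
print(answer, share, passes)
```

A single passing run would have hidden the disagreement that the share makes visible here.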
Learning loop
Nightly cycle reads transcripts, proposes skill improvements. Measure resolution, escalation, satisfaction.
Company brain
Queryable world models over everything recorded: vector store + event log. Prune for context rot.
Moat & expansion
Deepen data + workflow moat. Adjacent workflows on the same substrate, outcome pricing, forward-deployed engineering (FDE) for high-value verticals.
Evaluation — what each concept is for
Some concepts are entry tickets, some are durable advantages, some are powerful levers that misfire if misapplied.
Table stakes
- AI as building layer (1)
- LLM-land vs code-land (4)
- Context engineering (7)
- Thin harness, fat skills (8)
- Legibility (9)
- Evals (12)
- Named-user gate (19)
Moats
- Self-improving loop (6) — 3–6 mo
- Company brain (10) — 6–12 mo
- Skill self-improvement (11) — 2–3 mo
- Data + workflow moat (14) — per deploy
- Forward-deployed engineering (15)
Tactical
- Tokenmaxxing (20)
- Burn tokens not headcount (16)
- Outcome pricing (18)
- Flat / trust-by-default org (17)
Edge Cases & Mitigations
Where most implementations fail. Each is a real failure mode with a concrete mitigation.
Next model release commoditises you. Fix: moat must be owned data + workflow + evals from deployment, never the prompt.
Evals only catch what you thought to measure. Fix: production sampling + adversarial evals; monitor downstream effects.
A loop compounds errors as fast as wins. Fix: human gate on high-stakes; reversibility; kill switch; real policy layer.
Tokenmaxxing everything burns cash. Fix: only high-leverage tasks; track cost-per-outcome; per-workflow budgets.
“Record everything” hits GDPR/PII. Fix: consent receipts, data minimisation, PII vaults; record process, not raw data.
Staff feel replaced, and the agent makes bad autonomous calls. Fix: humans at the edges; approval thresholds; audit trails; reversibility.
You depend on one lab’s model, and the lab raises prices or deprecates it. Fix: provider-agnostic abstraction; evals catch swap regressions.
Trust-by-default breaks at scale or in regulated settings. Fix: treat as stage-specific; formalise selectively as you grow.
Bespoke one-offs never merge back. Fix: 70%+ of FDE code in main repo by month 12; one reusable feature per engagement.
Cheaper execution makes it more tempting to build without a named user. Fix: the Phase 0 gate. No name, no build.
Execution is cheap; teams build everything. Fix: tight loops with real users; exploration over more pipelines.
World model accumulates stale, contradictory data. Fix: decay, contradiction detection, periodic distillation — prune as well as add.
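The last mitigation above (decay plus contradiction detection) can be sketched in a few lines. This is an illustration under assumptions, not a prescribed design: the `Fact` shape, the exponential half-life of 90 days, the 0.25 cut-off and the "newest value wins" rule are all hypothetical choices, and periodic distillation is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subject: str
    attribute: str
    value: str
    age_days: float


def weight(fact: Fact, half_life_days: float = 90.0) -> float:
    # Exponential decay: a fact loses half its weight every half-life.
    return 0.5 ** (fact.age_days / half_life_days)


def prune(facts: list[Fact], min_weight: float = 0.25) -> tuple[list[Fact], list[tuple[str, str]]]:
    """Drop decayed facts, then keep the newest value per (subject, attribute).

    Returns the surviving facts and the keys where values disagreed.
    """
    newest: dict[tuple[str, str], Fact] = {}
    conflicts: list[tuple[str, str]] = []
    for fact in facts:
        if weight(fact) < min_weight:
            continue  # too stale to keep
        key = (fact.subject, fact.attribute)
        current = newest.get(key)
        if current is None:
            newest[key] = fact
            continue
        if current.value != fact.value and key not in conflicts:
            conflicts.append(key)  # contradiction detected
        if fact.age_days < current.age_days:
            newest[key] = fact  # newer observation wins
    return list(newest.values()), conflicts


facts = [
    Fact("acme", "plan", "starter", age_days=60),
    Fact("acme", "plan", "enterprise", age_days=5),
    Fact("acme", "contact", "old@example.com", age_days=400),
]
kept, conflicts = prune(facts)
print([f.value for f in kept], conflicts)
```

Surfacing the conflicting keys, rather than silently overwriting, gives a human or a later distillation pass something to review.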
The Tool Stack (generic)
The categories you assemble, with public examples. Pick one per category and resist rebuilding the commodity layers.
Agent runtime
The core loop: input → LLM → tool calls. Do not build your own.
Skill files
Markdown workflows loaded by progressive disclosure.
Machine-readable contracts
How agents and tools talk. Built for agents, not humans.
Retrieval & memory
Just-in-time loading; semantic + keyword retrieval; keep the window clean.
Company brain store
Queryable customer + company models over everything recorded.
Evaluation harness
Multi-run consistency, adversarial cases, production sampling.
Traces & transcripts
Every conversation and tool call logged; raw material for the learning loop.
Deterministic tools
Code-land: payments, writes, deploys. Reversible & audited.
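For the deterministic-tools category, "reversible & audited" can be illustrated with a tool that records each call and keeps the means to undo it. This is a toy sketch under assumptions: `AuditedTools`, its in-memory `balances` and the `credit` operation are hypothetical, and a real payment or deploy tool would need durable storage and a real compensating action.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AuditedTools:
    balances: dict[str, int] = field(default_factory=dict)
    audit: list[dict] = field(default_factory=list)                # every call, in order
    _undo: list[Callable[[], None]] = field(default_factory=list)  # compensating actions

    def credit(self, account: str, amount: int) -> None:
        # Deterministic write: validate, apply, log, and register how to undo it.
        if amount <= 0:
            raise ValueError("amount must be positive")
        self.balances[account] = self.balances.get(account, 0) + amount
        self.audit.append({"op": "credit", "account": account, "amount": amount})

        def compensate() -> None:
            self.balances[account] -= amount

        self._undo.append(compensate)

    def undo_last(self) -> bool:
        # Reverse the most recent action, and log the reversal itself.
        if not self._undo:
            return False
        self._undo.pop()()
        self.audit.append({"op": "undo"})
        return True


tools = AuditedTools()
tools.credit("cust_1", 500)
tools.undo_last()
print(tools.balances, [entry["op"] for entry in tools.audit])
```

The undo is itself an audited event, so the log shows both the mistake and its correction.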
What Is Not Settled
Built with adversarial verification: every major claim was challenged, and five popular claims were killed. Share these caveats alongside the guide — they separate a credible reference from hype.
Claims that failed verification — do not repeat
- Nvidia formally budgets ~$250K tokens per $500K engineer (refuted 0-3).
- Founders compress idea-to-ship from 6 months to 1 day (refuted 0-3).
- Cal AI: $50M ARR with 7 people, no VC (refuted 0-3).
- “Software for Agents” as an explicit YC investment thesis (refuted 1-2; pattern real, attribution not).
- Eval cost now rivals or exceeds training cost (refuted 1-2).
Genuinely open questions
- Is token spend a durable metric or a 2026 fad? Numbers real; formal adoption refuted; leaderboards rolled back.
- Does the data moat actually defend verticals? Domain-first is confirmed; durable defensibility vs lab commoditisation is unproven. Treat as a hypothesis to test.
- Does the named-user discipline hold empirically? Sound practice, but untested by any surviving source. A principle, not a finding.
- True capital efficiency of AI-native firms? The leanness narrative outran the verifiable evidence.
Time-sensitivity: nearly every source dates from Dec 2025 – May 2026 in a fast-moving field. Re-verify before betting heavily.
Sources
Primary sources and named case studies. Method: 5-angle web deep-research (24 sources, 115 claims, 25 adversarially verified — 20 confirmed, 5 killed), plus practitioner search, 1 June 2026.
Compiled 2026-06-01, last reviewed 2026-06-02 · evidence-graded and adversarially verified · every example is a public company.