Methodology · 2026 · evidence-graded

Building an AI-first organisation, small or large.

A repeatable, gated, self-correcting methodology — five stages wrapped in a loop that keeps the method itself from going stale. Read by organisation type, run the stages with their gates and templates, and use the implementation runbook to know exactly what to put in place. Generic and shareable: every example is a public framework.

5 gated stages · 7 org types · 2 modes · 4 tiers · 5 layers to put in place

Frame · Found · Forge · Flywheel · Federate · Meta-loop

The one idea it hangs on

The thesis

An AI-first organisation is a set of closed loops that compound a proprietary asset — data + codified workflow + evals — while spending inference instead of headcount.

It is mostly a work-design problem, not a tech problem

Roughly 10% of the effort is algorithms, 20% data and tech, 70% people, process and change. Treat AI as an IT project and it fails.

BCG 10-20-70 (directional)

The moat is a result of execution, not a precondition

At 0-to-1 the only thing that matters is execution. You build defensibility by deploying, not by planning it.

YC, adapting Helmer’s Seven Powers

The spine

Five stages, each separated by a hard gate you cannot pass without evidence. A continuous meta-loop wraps all of them. The gates are the methodology; the stages are just where they sit.

◀ META-LOOP · re-verify · stress-test vs model upgrades · re-run evals on live traffic · prune ▶

Frame (decide what to build & why it compounds) → Found (build the control plane first) → Forge (build the first closed loop) → Flywheel (compound the asset; run AI-first) → Federate (scale the substrate to new domains)

The five stages: gate, metrics, template

Each stage names its concepts, the single gate that lets you move on, what to track, and a copyable artefact.

STAGE 1

Frame

decide what to build

Workflow-first: map an end-to-end workflow, decompose into segments, find the one that is high-volume, repetitive, money-adjacent and painful. Run it manually yourself before automating it.

Gate (any red stops): (1) a named user who’ll pay — validate 5+ pre-sales; (2) a measurable outcome; (3) feasibility (20–50 real inputs, hand-scored); (4) a moat hypothesis.

Track · named user confirmed (y/n) · outcome KPI + sample size · feasibility accuracy % · pre-sales count

named-user discipline · measurable-outcome gate · Seven Powers moat · two-sentence pitch

Template · the one-page Frame charter

NAMED USER:     [name / role] — spoken to? [y/n] — would pay? [y/n]
PROBLEM:        [the pain, in their words]
OUTCOME (KPI):  [the number that defines success] · sample size: [n]
FEASIBILITY:    [accuracy on 20-50 hand-scored test inputs] = [__%]
MOAT (in 18mo): [data / workflow / switching cost we will own]
PITCH:          [what] for [customer], solving [problem]. [the hook].
KILL CRITERIA:  stop if [no paying user in 8 weeks / accuracy < X]
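
The "any red stops" rule of the Frame gate can be checked mechanically once the charter is filled in. What follows is a minimal sketch, not part of the methodology itself: the `FrameCharter` record, its field names and the 0.8 default accuracy threshold are all assumptions made for illustration (the charter leaves the threshold as X).

```python
from dataclasses import dataclass


@dataclass
class FrameCharter:
    """Hypothetical record mirroring the one-page Frame charter."""
    named_user_would_pay: bool
    pre_sales: int
    outcome_kpi: str             # empty string: no measurable outcome yet
    feasibility_inputs: int      # real inputs that were hand-scored
    feasibility_accuracy: float  # fraction in [0, 1]
    moat_hypothesis: str


def frame_gate(charter: FrameCharter, min_accuracy: float = 0.8) -> list[str]:
    """Return the red checks; an empty list means the gate passes.

    min_accuracy is an assumed stand-in for the charter's 'accuracy < X'.
    """
    reds = []
    if not charter.named_user_would_pay or charter.pre_sales < 5:
        reds.append("named user")
    if not charter.outcome_kpi.strip():
        reds.append("measurable outcome")
    if charter.feasibility_inputs < 20 or charter.feasibility_accuracy < min_accuracy:
        reds.append("feasibility")
    if not charter.moat_hypothesis.strip():
        reds.append("moat hypothesis")
    return reds
```

Returning the list of failed checks, rather than a bare boolean, keeps the reason for a stop visible in the charter review.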

STAGE 2

Found

make it loop-ready

Build the control plane before any autonomy: legibility (record everything), owned memory (not vendor-locked), permissions, observability, rollback, human override, cost management. Reuse an existing harness; don’t rebuild the loop.

Gate: governance + control plane exist before you orchestrate multiple agents. Every action tagged reversible-or-not.

Track · control-plane checklist complete · % of actions classified reversible/irreversible · is everything recorded? (y/n)

legibility precedes intelligence · own your memory · policy layer + reversibility · thin harness, fat skills

Template · the action policy / reversibility table

ACTION                | REVERSIBLE? | AUTONOMY      | GATE
read / retrieve       | n/a         | autonomous    | grounding check
draft / suggest       | yes         | autonomous    | quality gate
send external comms   | partly      | needs review  | human approve
move money / refund   | no          | NEVER auto    | human + audit log
delete / overwrite    | no          | NEVER auto    | human + backup
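
One way to enforce the table is to keep it as data and consult it before every tool call. The sketch below assumes hypothetical action keys (`read`, `move_money`, and so on) for the rows above, and adds one design assumption of its own: any action missing from the table is treated as NEVER auto.

```python
from enum import Enum


class Autonomy(Enum):
    AUTONOMOUS = "autonomous"
    NEEDS_REVIEW = "needs review"
    NEVER_AUTO = "never auto"


# Hypothetical action keys, one per row of the policy table.
POLICY = {
    "read": Autonomy.AUTONOMOUS,
    "draft": Autonomy.AUTONOMOUS,
    "send_external_comms": Autonomy.NEEDS_REVIEW,
    "move_money": Autonomy.NEVER_AUTO,
    "delete": Autonomy.NEVER_AUTO,
}


def may_run(action: str, human_approved: bool = False) -> bool:
    """Allow an action only if the policy permits it.

    Unknown actions fall back to NEVER_AUTO (an assumption of this sketch),
    so a new tool cannot run unattended until someone classifies it.
    """
    level = POLICY.get(action, Autonomy.NEVER_AUTO)
    if level is Autonomy.AUTONOMOUS:
        return True
    return human_approved
```

The audit log and backup steps from the GATE column are left out here; in practice they would wrap the approved call.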

STAGE 3

Forge

build the first closed loop

One workflow, fully closed-loop. Evals before code: 20–50 tasks from real failures define the capability first; split into capability evals (hard, low pass) and regression evals (must stay high). Three-agent harness — Planner → Generator → Evaluator — with the judge separated from the builder. One feature at a time. Test as a human would.
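
The shape of the three-agent harness can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: the three roles are passed in as plain callables, `max_attempts` is an invented parameter, and the `demo_*` stubs stand in for real model calls.

```python
from typing import Callable

Planner = Callable[[str], list[str]]      # task -> ordered features
Generator = Callable[[str], str]          # feature -> candidate output
Evaluator = Callable[[str, str], bool]    # (feature, output) -> pass/fail


def run_loop(task: str, plan: Planner, generate: Generator,
             evaluate: Evaluator, max_attempts: int = 3) -> dict[str, str]:
    """Work through the plan one feature at a time; each must pass the judge."""
    results: dict[str, str] = {}
    for feature in plan(task):
        for _ in range(max_attempts):
            output = generate(feature)
            if evaluate(feature, output):  # the judge is separate from the builder
                results[feature] = output
                break
        else:
            # No attempt passed: stop rather than build on a failed feature.
            raise RuntimeError(f"feature failed evaluation: {feature}")
    return results


# Stubs standing in for model calls (assumptions for illustration only).
def demo_plan(task: str) -> list[str]:
    return [f"{task}: step {i}" for i in (1, 2)]


def demo_generate(feature: str) -> str:
    return feature.upper()


def demo_evaluate(feature: str, output: str) -> bool:
    return output == feature.upper()
```

Keeping the evaluator as its own callable is what lets it be a different prompt, model or test suite from the generator.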

Gate: the loop measurably improves the Stage-1 outcome at pass^k reliability (all of k repeated trials succeed — not one lucky run).
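
To see why pass^k is a stricter bar than a single pass, it helps to compute it. The sketch below assumes trials are independent and that a trial is any zero-argument callable returning pass or fail; both function names are illustrative.

```python
from typing import Callable


def pass_hat_k(run_trial: Callable[[], bool], k: int) -> bool:
    """True only if all k repeated trials succeed (stops at the first failure)."""
    return all(run_trial() for _ in range(k))


def expected_pass_hat_k(p: float, k: int) -> float:
    """Chance that k independent trials all pass, given per-trial pass rate p."""
    return p ** k
```

Under the independence assumption, a loop that passes 90% of single runs passes five in a row only about 59% of the time (0.9^5 ≈ 0.59), which is why one lucky run says little.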

Track · pass^k reliability · capability vs regression pass rates · single-feature discipline held? · bugs caught by human-style testing

eval-driven development · Planner / Generator / Evaluator · LLM-land vs code-land · SKILL.md packaging

Template · a SKILL.md skeleton + an eval task

--- SKILL.md ---
name: resolve-customer-refund
description: Use when a customer asks for a refund; decides eligibility
  from policy. Do NOT use when the request is a complaint with no refund ask.
---
1. get_order(id)            # code-land: deterministic
2. judge eligibility        # LLM-land: against policy in references/
3. if eligible and amount < £X: process_refund()   # otherwise: escalate_to_human()
4. confirm warmly; log decision + reason

--- eval task ---
input:    "order 8841 arrived broken, I want my money back"
expect:   eligible=true, action=refund, amount<=order_total
score:    grounded (each claim from a tool call)? policy-correct? tone?
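
The deterministic part of that eval task (the "expect" line) can be scored in code-land, leaving groundedness and tone to a judge. A minimal sketch, assuming a hypothetical `Decision` record as the loop's output; the names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    """Hypothetical structured output of the refund skill."""
    eligible: bool
    action: str
    amount: float


def score_refund_eval(decision: Decision, order_total: float) -> dict[str, bool]:
    """Check a decision against: eligible=true, action=refund, amount<=order_total."""
    return {
        "eligible": decision.eligible is True,
        "action": decision.action == "refund",
        "amount_within_total": decision.amount <= order_total,
    }


def passed(checks: dict[str, bool]) -> bool:
    """All-pass scoring: a single failed check fails the task."""
    return all(checks.values())
```

Returning the per-check results makes it possible to see which expectation broke when a regression eval drops.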

STAGE 4

Flywheel

compound the asset

Close the loop so it runs with minimal intervention: traces → human + LLM feedback → reusable automated evals (a gate) → rank → implement → repeat. Build the company brain; counter data entropy. Spend tokens not headcount; shift metrics from labour-savings to developmental ones.

Gate: the asset demonstrably compounds — measurably better this month than last, with the eval gate holding.

Track · week-over-week outcome trend · resolution & escalation rate · eval-gate hold rate · cost-per-outcome (not per-token) · context retained / rework eliminated
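
Two of these metrics reduce to simple arithmetic. The sketch below is one possible reading of "better this month than last, with the eval gate holding": it compares the mean KPI of the last four weeks with the four before, and assumes a higher KPI is better. The function names and the four-week window are assumptions.

```python
from statistics import mean


def cost_per_outcome(total_spend: float, outcomes: int) -> float:
    """Spend divided by completed outcomes (not by tokens)."""
    if outcomes <= 0:
        raise ValueError("no outcomes recorded; cost per outcome is undefined")
    return total_spend / outcomes


def is_compounding(weekly_kpi: list[float], eval_gate_held: list[bool],
                   window: int = 4) -> bool:
    """True if the recent window beats the previous one and the gate held throughout it."""
    if len(weekly_kpi) < 2 * window or len(eval_gate_held) < window:
        return False  # not enough history to claim a trend
    recent = mean(weekly_kpi[-window:])
    previous = mean(weekly_kpi[-2 * window:-window])
    return recent > previous and all(eval_gate_held[-window:])
```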

improvement flywheel · metaprompt / dream cycle · company brain · tokens not headcount · outcome pricing (a16z prediction) · forward-deployed engineering

Template · the nightly improvement loop

EVERY NIGHT (automated):
  1. read the day's transcripts + low-scoring eval spans
  2. classify each miss: skill/context gap  vs  model gap
  3. propose skill edits (accept only if a held-out eval improves)
  4. write a short report of proposed changes
WEEKLY (human, 30 min):
  5. review proposals; read a transcript sample; approve/reject
  6. calibrate the LLM-judge against your human grades

STAGE 5

Federate

scale the substrate

The same substrate runs adjacent workflows and verticals; the organisation runs AI-first. Federated design: central guardrails + domain-level execution. Roles compress to IC (builder/operator) and DRI (directly responsible individual). Fund expansion on adoption + outcomes, not technical milestones.

Gate: a second domain runs on the same substrate without re-architecture; the collaboration layer becomes the moat.

Track · domains live on the substrate · adoption % · revenue-per-employee trend · skills reused across domains

federated operating model · IC + DRI roles · six capabilities · collaboration-layer moat

By organisation type

The same methodology reads differently depending on who you are. Each archetype has a binding constraint, a place to start, and a first move. Mode and tier compose with this.

Solo builder / indie

1 person · greenfield · minimum tier

Binding constraint: your time and focus.

Where to start: Frame → one Forge loop. One harness, a few skills, manual evals.

Skip: Federate, company brain, big registries. Over-building is the solo killer.

This week: name one user, run the workflow by hand once, write one SKILL.md.

Early startup

2–15 · greenfield

Binding constraint: finding the compounding loop before runway ends.

Where to start: all five stages in order; forward-deploy for your first big accounts.

Skip: waiting for a moat before building — execution first.

This week: pre-sell to 5; if 0–1 bite, change the idea, not the pitch.

Scale-up / growth

company · greenfield maturing

Binding constraint: the loop calcifying as you grow.

Where to start: deepen the Flywheel; begin Federate to a second domain.

Skip: nothing — but run the Meta-loop hard so success doesn’t ossify.

This week: stand up the nightly improvement loop and a real eval-gate on prod traffic.

SME / mid-market

non-tech-native · transform

Binding constraint: no in-house AI muscle; risk-aversion.

Where to start: Transform mode on ONE beachhead workflow. Buy the harness, don’t build it.

Skip: a big platform play. Prove one loop, show the ROI, then expand.

This week: pick the one painful, measurable, high-volume workflow and put a named owner on it.

Large enterprise / incumbent

500+ · transform

Binding constraint: 70% is people/process/change; legacy architecture; governance.

Where to start: Found (control plane + governance) then Forge on a GM-owned beachhead. Federated, not centralised.

Skip: a company-wide rollout before one loop works. You do not need to re-platform the whole stack.

This week: name an executive sponsor and one beachhead; stand up the control plane before any autonomy.

Agency / services firm

turning engagements into IP

Binding constraint: margins; bespoke work that doesn’t compound.

Where to start: the Flywheel’s forward-deployed motion — embed, build evals from client data before code, hardwire in, productise the pattern back.

Watch: the unit economics floor — below a certain deal size the model doesn’t pay.

This week: pick one repeatable client problem; build the eval suite from their labelled data first.

Regulated / high-stakes

health · finance · legal · safety

Binding constraint: reversibility, auditability, harm avoidance.

Where to start: Found is non-negotiable and heavy — human gates, audit trail, all-pass (not partial-credit) evals, closed-universe pilots.

Skip: autonomy on irreversible actions. Keep a human in the loop on anything that can’t be undone.

This week: write the policy/approval gates and the all-pass eval rubric before building anything.

The Meta-Loop — why the method survives a moving field

AI playbooks rot fast — most organisations say AI project speed already outpaces their governance, and a frontier lab removed a core harness construct the moment a better model shipped. So the methodology re-verifies itself on a cadence: the same sense → evaluate → improve discipline, pointed at the playbook.

1 · Tag assumptions

Every load-bearing assumption (model capability, a vendor figure, a tool) gets a source and a re-check date.
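
An assumption register can be as small as a list of records with a due date. The sketch below is one hedged way to do it; the `Assumption` fields and the `overdue` helper are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Assumption:
    """One load-bearing assumption with its source and re-check date."""
    claim: str
    source: str
    recheck_on: date


def overdue(assumptions: list[Assumption], today: date) -> list[Assumption]:
    """Return assumptions whose re-check date has arrived or passed."""
    return [a for a in assumptions if a.recheck_on <= today]
```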

2 · Stress-test the harness

On each model upgrade, ask what scaffolding the new model made unnecessary — and delete it.

3 · Re-run evals on live traffic

Catch silent drift: the prompt unchanged for 30 days while the model shifts underneath.
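
A simple drift check compares eval scores on a fresh sample of live traffic with a stored baseline. This is a sketch under assumptions: scores are in [0, 1], higher is better, and the 0.05 tolerance is an arbitrary placeholder to tune against your own noise level.

```python
from statistics import mean


def drifted(baseline_scores: list[float], live_scores: list[float],
            tolerance: float = 0.05) -> bool:
    """Flag drift when the live mean falls more than `tolerance` below the baseline mean."""
    return mean(baseline_scores) - mean(live_scores) > tolerance
```

A mean comparison is deliberately crude; with enough samples a proper significance test would reduce false alarms.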

4 · Re-verify claims

Adversarially check claims before they harden into “fact”. Plausible is not the same as true.

5 · Prune

Counter data entropy and playbook rot: remove the stale, keep the living.

Two modes

Mode (greenfield vs transform) composes with your org type and tier. Same five stages, run differently.

Greenfield

a new AI-first business

Run Frame → Found → Forge → Flywheel → Federate in order.
AI removed the old bottlenecks (capital, headcount, technical skill).
Move fast; let execution build the moat.
Flat and trust-by-default from day zero.

Transformation

an incumbent going AI-first

Don’t reframe the whole company; you do not need to re-platform the entire stack.
Start in Forge on one beachhead workflow (escalations, procurement, claims, coordination, compliance).
A GM owns it, not the CIO/CTO.
Skip the slow interim layer; go straight to the agentic workflow. Prove one loop, then spread.

Scaling tiers — fit the weight to the size

The biggest early mistake is over-building. Run only the tier you’re at.

Tier	Minimum viable version	Defer
Solo builder	One harness, a few SKILL.md skills, manual evals, one closed loop.	Tool registry, company brain, org design.
Team (3–7)	Eval discipline + small skill registry + control-plane basics + a DRI per outcome.	Federation, centres of excellence.
Company	The six capabilities; federated guardrails; forward-deployed for high-value accounts.	—
Platform	Open skill registry; collaboration-layer moat; third-party packs.	—

Failure modes — and the gate that catches each

Most attempts die in predictable ways. Each gate exists because of one of these.

Building before a user

Months of beautiful product for no one.

Caught by: the Frame named-user gate (5+ pre-sales or stop).

No measurable outcome

The loop can’t compound; outcome pricing is impossible.

Caught by: the Frame measurable-outcome gate — proxy/calibrated rubric, or exit.

Pilot purgatory

Endless demos that never reach production or compound.

Caught by: the Forge gate — the loop must improve the real outcome at pass^k before you proceed.

Over-automation harm

An autonomous agent makes a bad, irreversible call.

Caught by: the Found control plane — reversibility tagging + human gates before any autonomy.

Vendor-locked memory

Your institutional knowledge lives in someone else’s platform.

Caught by: Found — own your memory layer.

Solo over-engineering

A one-person team builds enterprise machinery and ships nothing.

Caught by: the Solo tier — minimum loop only; defer Federate.

Services-becomes-consulting

Bespoke client work that never productises.

Caught by: the FDE discipline — every engagement merges a reusable feature back.

Playbook rot

The method ages as models and the field move.

Caught by: the Meta-loop — re-verify, stress-test, prune on a cadence.

“It’s an IT project”

Treating transformation as tech, not work-design.

Caught by: the 10-20-70 reality — a GM owns it; 70% is people/process.

Implementation — what to put in place

The stages give you the shape; this is the runbook. First the order to build in, then the concrete artefacts each layer needs, then a self-score that names your next move.

A · The build sequence — minimum order

Do these in order. Step 6 is a hard gate: never pass it on a single lucky run.

1. Frame charter: named user, measurable KPI, feasibility test, moat, kill criteria.
2. Stand up the store + tracing: record every conversation and tool call.
3. Choose the harness; write the policy / reversibility table.
4. Write the evals first: 20–50 cases drawn from real failures.
5. Build one loop: Planner → Generator → Evaluator, judge separated.
6. Gate: the loop improves the KPI at pass^k. Only then proceed.
7. Turn on online evals + the nightly improvement loop.
8. Metrics dashboard + a weekly human transcript review.
9. Company brain over recorded data; outcome-aligned pricing.
10. Federate to a second workflow on the same substrate.

B · The five layers to put in place

An AI-first organisation is these five layers, standing. The first two are foundations; the third is the steering wheel most teams skip.

L1 Control plane · Owner: founder / platform · build first
A single system of record — everything written down
Tracing & observability on every agent run and tool call
An owned memory layer (not vendor-locked)
Policy / permissions layer; every action tagged reversible-or-not
Cost controls — budgets + cost-per-outcome tracking
Harness behind a provider-agnostic abstraction
L2 Skill & prompt layer · Owner: builders
Skill registry — SKILL.md in git, indexed, with a DRY/MECE resolver
Prompt store separated from app code, versioned
Staging → prod promotion + rollback for prompts
Tool definitions versioned (MCP), with strict contracts
LLM-land vs code-land split documented per workflow
L3 Eval & feedback layer · Owner: an evals DRI · the steering wheel
Golden dataset from 20–50 real failures, growing over time
Capability + regression evals, run at pass^k
Online evals on a prod sample; scores written back to spans
An LLM-judge calibrated against human grades
The nightly improvement loop (transcript → proposed edits)
A metrics dashboard (resolution, cost-per-outcome, trend)
L4 People & cadences · Owner: leadership
A DRI per outcome / workflow
A named evals owner
(Enterprise) exec sponsor + GM beachhead owner; FDE function
Nightly loop automated; weekly transcript review (human)
Harness stress-test on every model upgrade
Periodic assumption re-verification (the Meta-loop)
L5 Decision records · Owner: founder / sponsor · write once, revisit
Charter: named user + KPI + sample size
Moat hypothesis
Harness + model provider (+ agnostic abstraction)
Policy / reversibility table — what’s autonomous vs gated
The metrics you steer by
Pricing model (outcome vs seat) + kill criteria
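
The L2 item "staging → prod promotion + rollback for prompts" can be illustrated with a small in-memory store. This is a sketch only: `PromptStore` and its methods are hypothetical, and a real setup would more likely keep versions in git or a database.

```python
class PromptStore:
    """In-memory sketch: versioned prompts with staging, promotion and rollback."""

    def __init__(self) -> None:
        self._versions: dict[str, list[str]] = {}  # name -> versions, oldest first
        self._promoted: dict[str, list[int]] = {}  # name -> history of prod versions

    def stage(self, name: str, text: str) -> int:
        """Record a new version without touching prod; return its version number."""
        versions = self._versions.setdefault(name, [])
        versions.append(text)
        return len(versions) - 1

    def promote(self, name: str, version: int) -> None:
        """Make an existing staged version the prod version."""
        if not 0 <= version < len(self._versions.get(name, [])):
            raise KeyError(f"unknown version {version} for prompt {name!r}")
        self._promoted.setdefault(name, []).append(version)

    def rollback(self, name: str) -> None:
        """Return prod to the previously promoted version."""
        history = self._promoted.get(name, [])
        if len(history) < 2:
            raise RuntimeError("nothing to roll back to")
        history.pop()

    def prod(self, name: str) -> str:
        """Text of the version currently in prod."""
        return self._versions[name][self._promoted[name][-1]]
```

Keeping the promotion history separate from the version list means a rollback never deletes a prompt; it only moves the prod pointer.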

C · Maturity self-score

Rate each layer 0 (nothing in place) to 5 (solid, owned, automated). Your lowest layer is what to put in place next.

Control plane: record-everything, memory, reversibility, cost

Skill & prompt layer: registry, versioned prompts, tool contracts

Eval & feedback layer: golden set, pass^k, online evals, nightly loop

People & cadences: DRIs, evals owner, weekly review, stress-test

Decision records: charter, moat, policy table, pricing, kill criteria
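
The self-score rule ("your lowest layer is what to put in place next") is a one-line minimum. The sketch below adds one assumption of its own: on a tie, the layer listed first wins, which favours the foundations when the dict is ordered L1 to L5.

```python
def next_layer(scores: dict[str, int]) -> str:
    """Return the lowest-scoring layer; ties go to the layer listed first."""
    for layer, score in scores.items():
        if not 0 <= score <= 5:
            raise ValueError(f"{layer}: score must be between 0 and 5, got {score}")
    return min(scores, key=scores.get)


example = {
    "Control plane": 3,
    "Skill & prompt layer": 2,
    "Eval & feedback layer": 1,
    "People & cadences": 1,
    "Decision records": 4,
}
print(next_layer(example))
```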

What’s directional, and when not to use it

Read before copying

Directional, not proven: outcome-based pricing and the collaboration-layer moat are forward-looking predictions. The 10-20-70 split and forward-deployed unit economics come from enterprise cases and may not generalise.
Most thoroughly verified: the build-loop material (eval-driven development, the Planner/Generator/Evaluator harness) comes from primary sources and is reproducible; the strategy/transformation frameworks are credible but partly marketing-adjacent.
A common myth, refuted: transforming an incumbent does not require re-platforming the entire stack. Start on one workflow.
When NOT to use this: if the outcome is genuinely unmeasurable; if most actions are irreversible and can’t be gated; or if there’s no path to a proprietary data/workflow asset — then you’re building a wrapper that gets commoditised.

Sources

Synthesised from primary lab/VC/consultancy material, adversarially fact-checked (24 sources, top claims verified, 1 killed). Vendor and VC frameworks are interested-party sources; cross-source convergence is the main reason for confidence.

YC · The 7 Most Powerful Moats for AI Startups · Seven Powers; moat is execution-result; FDE switching costs

a16z · Big Ideas 2026 · outcome pricing; data entropy; collaboration-layer moat

PRI · Anthropic · Demystifying Evals; Harness Design · eval-driven development; Planner/Generator/Evaluator

PRI · OpenAI · Agent Improvement Loop (cookbook) · the flywheel: traces → feedback → evals → rank → implement

CON · Bain · Roadmap to Reality; AI’s Next Operating Model · phase-gate; control plane early; beachhead workflows; GM ownership

CON · BCG · 10-20-70 operating model · 70% people/process; federated model

CON · McKinsey · Rewired; The Agentic Organization · six capabilities; org paradigm

CON · IBM · 7-step stage-gating framework · workflow-first selection; KPI gate per tranche

PRI · FDE playbooks (Palantir lineage) · five-phase motion; unit economics; productise back

PRI · AI Product Validation Framework · 5+ pre-sales gate; 20–50 hand-scored feasibility test

Compiled 2026-06-05, last reviewed 2026-06-06 · evidence-graded and adversarially verified · generic and shareable — every example is a public framework or a neutral illustration.
