Methodology · 2026 · evidence-graded

Building an AI-first organisation, small or large.

A repeatable, gated, self-correcting methodology — five stages wrapped in a loop that keeps the method itself from going stale. Read by organisation type, run the stages with their gates and templates, and use the implementation runbook to know exactly what to put in place. Generic and shareable: every example is a public framework.

5 gated stages · 7 org types · 2 modes · 4 tiers · 5 layers to put in place
Frame · Found · Forge · Flywheel · Federate · Meta-loop
01

The one idea it hangs on

The thesis

An AI-first organisation is a set of closed loops that compound a proprietary asset — data + codified workflow + evals — while spending inference instead of headcount.

It is mostly a work-design problem, not a tech problem
Roughly 10% of the effort is algorithms, 20% data and tech, 70% people, process and change. Treat AI as an IT project and it fails.
BCG 10-20-70 (directional)
The moat is a result of execution, not a precondition
At 0-to-1 the only thing that matters is execution. You build defensibility by deploying, not by planning it.
YC, adapting Helmer’s Seven Powers
02

The spine

Five stages, each separated by a hard gate you cannot pass without evidence. A continuous meta-loop wraps all of them. The gates are the methodology; the stages are just where they sit.

◀ META-LOOP · re-verify · stress-test vs model upgrades · re-run evals on live traffic · prune ▶
01
Frame
decide what to build & why it compounds
02
Found
build the control plane first
03
Forge
build the first closed loop
04
Flywheel
compound the asset; run AI-first
05
Federate
scale the substrate to new domains
03

The five stages: gate, metrics, template

Each stage names its concepts, the single gate that lets you move on, what to track, and a copyable artefact.

STAGE 1

Frame

decide what to build

Workflow-first: map an end-to-end workflow, decompose into segments, find the one that is high-volume, repetitive, money-adjacent and painful. Run it manually yourself before automating it.

Gate (any red stops): (1) a named user who’ll pay — validate 5+ pre-sales; (2) a measurable outcome; (3) feasibility (20–50 real inputs, hand-scored); (4) a moat hypothesis.
Track · named user confirmed (y/n) · outcome KPI + sample size · feasibility accuracy % · pre-sales count
named-user discipline · measurable-outcome gate · Seven Powers moat · two-sentence pitch
Template · the one-page Frame charter
NAMED USER:     [name / role] — spoken to? [y/n] — would pay? [y/n]
PROBLEM:        [the pain, in their words]
OUTCOME (KPI):  [the number that defines success] · sample size: [n]
FEASIBILITY:    [accuracy on 20-50 hand-scored test inputs] = [__%]
MOAT (in 18mo): [data / workflow / switching cost we will own]
PITCH:          [what] for [customer], solving [problem]. [the hook].
KILL CRITERIA:  stop if [no paying user in 8 weeks / accuracy < X]
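
The Frame gate can also be checked mechanically once the charter is filled in. The following is a minimal sketch, not part of the methodology itself: the `FrameCharter` fields, the function names, and the `min_accuracy` threshold are hypothetical (the charter leaves the accuracy bar as "X", so it stays a parameter here).

```python
from dataclasses import dataclass


@dataclass
class FrameCharter:
    # Hypothetical field names mirroring the one-page charter.
    named_user_would_pay: bool
    pre_sales: int
    outcome_kpi: str          # empty string means no measurable outcome yet
    hand_scores: list[bool]   # one pass/fail per hand-scored real input
    moat_hypothesis: str


def feasibility_accuracy(scores: list[bool]) -> float:
    """Share of hand-scored inputs judged correct (0.0 if none were scored)."""
    return sum(scores) / len(scores) if scores else 0.0


def frame_gate(charter: FrameCharter, min_accuracy: float) -> list[str]:
    """Return the red flags; an empty list means the gate is green."""
    reds = []
    if not charter.named_user_would_pay or charter.pre_sales < 5:
        reds.append("named user: need a payer and 5+ pre-sales")
    if not charter.outcome_kpi.strip():
        reds.append("outcome: no measurable KPI")
    if len(charter.hand_scores) < 20:
        reds.append("feasibility: fewer than 20 hand-scored inputs")
    elif feasibility_accuracy(charter.hand_scores) < min_accuracy:
        reds.append("feasibility: accuracy below threshold")
    if not charter.moat_hypothesis.strip():
        reds.append("moat: no hypothesis written down")
    return reds
```

Because any red stops the stage, the function returns every failing condition rather than a single boolean, so the charter owner can see what to fix first.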
STAGE 2

Found

make it loop-ready

Build the control plane before any autonomy: legibility (record everything), owned memory (not vendor-locked), permissions, observability, rollback, human override, cost management. Reuse an existing harness; don’t rebuild the loop.

Gate: governance + control plane exist before you orchestrate multiple agents. Every action tagged reversible-or-not.
Track · control-plane checklist complete · % of actions classified reversible/irreversible · is everything recorded? (y/n)
legibility precedes intelligence · own your memory · policy layer + reversibility · thin harness, fat skills
Template · the action policy / reversibility table
ACTION                | REVERSIBLE? | AUTONOMY      | GATE
read / retrieve       | n/a         | autonomous    | grounding check
draft / suggest       | yes         | autonomous    | quality gate
send external comms   | partly      | needs review  | human approve
move money / refund   | no          | NEVER auto    | human + audit log
delete / overwrite    | no          | NEVER auto    | human + backup
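
One way to make the table enforceable is to encode it as data that the harness consults before every action. This is a sketch under assumptions: the action keys, the `Autonomy` enum, and `may_run` are hypothetical names, and treating "needs review" and "NEVER auto" identically at runtime (both require a human) is a simplification. Unknown actions are denied by default.

```python
from enum import Enum


class Autonomy(Enum):
    AUTONOMOUS = "autonomous"
    NEEDS_REVIEW = "needs review"
    NEVER_AUTO = "never auto"


# Hypothetical action keys, one per row of the reversibility table.
POLICY = {
    "read": Autonomy.AUTONOMOUS,
    "draft": Autonomy.AUTONOMOUS,
    "send_external": Autonomy.NEEDS_REVIEW,
    "move_money": Autonomy.NEVER_AUTO,
    "delete": Autonomy.NEVER_AUTO,
}


def may_run(action: str, human_approved: bool = False) -> bool:
    """True if the agent may execute the action right now."""
    # Actions missing from the table fall back to the strictest level.
    level = POLICY.get(action, Autonomy.NEVER_AUTO)
    if level is Autonomy.AUTONOMOUS:
        return True
    # Everything else needs an explicit human approval.
    return human_approved
```

For example, `may_run("draft")` is allowed, while `may_run("move_money")` stays blocked until a human approves it.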
STAGE 3

Forge

build the first closed loop

One workflow, fully closed-loop. Evals before code: 20–50 tasks from real failures define the capability first; split into capability evals (hard, low pass) and regression evals (must stay high). Three-agent harness — Planner → Generator → Evaluator — with the judge separated from the builder. One feature at a time. Test as a human would.

Gate: the loop measurably improves the Stage-1 outcome at pass^k reliability (all of k repeated trials succeed — not one lucky run).
Track · pass^k reliability · capability vs regression pass rates · single-feature discipline held? · bugs caught by human-style testing
eval-driven development · Planner / Generator / Evaluator · LLM-land vs code-land · SKILL.md packaging
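
The pass^k gate above can be estimated directly: run each eval task k times and count a task as passed only if every trial succeeds. A minimal sketch, assuming a hypothetical `run_once` callable that executes one trial of a task and returns whether it succeeded:

```python
from collections.abc import Callable, Sequence


def pass_hat_k(tasks: Sequence[str], run_once: Callable[[str], bool], k: int) -> float:
    """Fraction of tasks for which all k repeated trials succeed."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not tasks:
        return 0.0
    # all() stops at the first failed trial, so a flaky task fails fast.
    passed = sum(1 for task in tasks if all(run_once(task) for _ in range(k)))
    return passed / len(tasks)
```

With k = 1 this reduces to an ordinary pass rate; raising k is what separates a reliable loop from one lucky run.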
Template · a SKILL.md skeleton + an eval task
--- SKILL.md ---
name: resolve-customer-refund
description: Use when a customer asks for a refund; decides eligibility
  from policy. Do NOT use when the request is a complaint with no refund ask.
---
1. get_order(id)            # code-land: deterministic
2. judge eligibility        # LLM-land: against policy in references/
3. if eligible and amount < £X: process_refund()   # otherwise: escalate_to_human()
4. confirm warmly; log decision + reason

--- eval task ---
input:    "order 8841 arrived broken, I want my money back"
expect:   eligible=true, action=refund, amount<=order_total
score:    grounded (each claim from a tool call)? policy-correct? tone?
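
The `expect:` line of the eval task is code-land and can be graded deterministically; the `score:` questions (grounding, policy, tone) remain LLM-land and are not covered here. A sketch, assuming the skill returns a dict with hypothetical keys `eligible`, `action` and `amount`:

```python
def grade_refund_case(output: dict, order_total: float) -> dict[str, bool]:
    """Check one skill output against the deterministic expectations."""
    amount = output.get("amount")
    checks = {
        "eligible": output.get("eligible") is True,
        "action": output.get("action") == "refund",
        # A missing or non-numeric amount counts as a failure.
        "amount_within_total": isinstance(amount, (int, float)) and amount <= order_total,
    }
    # The case passes only if every individual check passes.
    checks["passed"] = all(checks.values())
    return checks
```

Returning the individual checks alongside the overall result makes it easier to see which expectation a failing run violated.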
STAGE 4

Flywheel

compound the asset

Close the loop so it runs with minimal intervention: traces → human + LLM feedback → reusable automated evals (a gate) → rank → implement → repeat. Build the company brain; counter data entropy. Spend tokens not headcount; shift metrics from labour-savings to developmental ones.

Gate: the asset demonstrably compounds — measurably better this month than last, with the eval gate holding.
Track · week-over-week outcome trend · resolution & escalation rate · eval-gate hold rate · cost-per-outcome (not per-token) · context retained / rework eliminated
improvement flywheel · metaprompt / dream cycle · company brain · tokens not headcount · outcome pricing (a16z prediction) · forward-deployed engineering
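
Cost-per-outcome, one of the tracked metrics above, is simple arithmetic, but it can move in the opposite direction to token spend. A small sketch; the function name and the figures in the usage note are made up for illustration:

```python
def cost_per_outcome(total_spend: float, outcomes_resolved: int) -> float:
    """Total inference spend divided by the outcomes actually delivered."""
    if outcomes_resolved <= 0:
        # Spend with nothing resolved has no finite cost per outcome.
        return float("inf")
    return total_spend / outcomes_resolved
```

If spend rises from 500 to 600 while resolved outcomes rise from 1,000 to 1,500, cost per outcome falls from 0.5 to 0.4 even though the token bill went up, which is why the metric is tracked per outcome rather than per token.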
Template · the nightly improvement loop
EVERY NIGHT (automated):
  1. read the day's transcripts + low-scoring eval spans
  2. classify each miss: skill/context gap  vs  model gap
  3. propose skill edits (accept only if a held-out eval improves)
  4. write a short report of proposed changes
WEEKLY (human, 30 min):
  5. review proposals; read a transcript sample; approve/reject
  6. calibrate the LLM-judge against your human grades
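
Steps 2 to 4 of the nightly loop can be sketched as a triage function. The `Proposal` fields and bucket names are hypothetical; the only rules taken from the loop are that model gaps are separated from skill/context gaps and that an edit is put forward only if the held-out eval improves. Final approval still happens in the weekly human review.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    skill: str
    miss_type: str          # "skill_gap", "context_gap" or "model_gap"
    heldout_before: float   # held-out eval score without the edit
    heldout_after: float    # held-out eval score with the edit applied


def triage(proposals: list[Proposal]) -> dict[str, list[str]]:
    """Sort proposed skill edits into the buckets of the nightly report."""
    report = {"propose": [], "reject": [], "model_gap": []}
    for p in proposals:
        if p.miss_type == "model_gap":
            # A skill edit will not fix a model limitation; just record it.
            report["model_gap"].append(p.skill)
        elif p.heldout_after > p.heldout_before:
            report["propose"].append(p.skill)
        else:
            report["reject"].append(p.skill)
    return report
```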
STAGE 5

Federate

scale the substrate

The same substrate runs adjacent workflows and verticals; the organisation runs AI-first. Federated design: central guardrails + domain-level execution. Roles compress to IC (individual contributor: builder/operator) and DRI (directly responsible individual). Fund expansion on adoption + outcomes, not technical milestones.

Gate: a second domain runs on the same substrate without re-architecture; the collaboration layer becomes the moat.
Track · domains live on the substrate · adoption % · revenue-per-employee trend · skills reused across domains
federated operating model · IC + DRI roles · six capabilities · collaboration-layer moat
04

By organisation type

The same methodology reads differently depending on who you are. Each archetype has a binding constraint, a place to start, and a first move. Mode and tier compose with this.

Solo builder / indie

1 person · greenfield · minimum tier
Binding constraint: your time and focus.
Where to start: Frame → one Forge loop. One harness, a few skills, manual evals.
Skip: Federate, company brain, big registries. Over-building is the solo killer.
This week: name one user, run the workflow by hand once, write one SKILL.md.

Early startup

2–15 · greenfield
Binding constraint: finding the compounding loop before runway ends.
Where to start: all five stages in order; forward-deploy for your first big accounts.
Skip: waiting for a moat before building — execution first.
This week: pre-sell to 5; if 0–1 bite, change the idea, not the pitch.

Scale-up / growth

company · greenfield maturing
Binding constraint: the loop calcifying as you grow.
Where to start: deepen the Flywheel; begin Federate to a second domain.
Skip: nothing — but run the Meta-loop hard so success doesn’t ossify.
This week: stand up the nightly improvement loop and a real eval-gate on prod traffic.

SME / mid-market

non-tech-native · transform
Binding constraint: no in-house AI muscle; risk-aversion.
Where to start: Transform mode on ONE beachhead workflow. Buy the harness, don’t build it.
Skip: a big platform play. Prove one loop, show the ROI, then expand.
This week: pick the one painful, measurable, high-volume workflow and put a named owner on it.

Large enterprise / incumbent

500+ · transform
Binding constraint: 70% is people/process/change; legacy architecture; governance.
Where to start: Found (control plane + governance) then Forge on a GM-owned beachhead. Federated, not centralised.
Skip: a company-wide rollout before one loop works. You do not need to re-platform the whole stack.
This week: name an executive sponsor and one beachhead; stand up the control plane before any autonomy.

Agency / services firm

turning engagements into IP
Binding constraint: margins; bespoke work that doesn’t compound.
Where to start: the Flywheel’s forward-deployed motion — embed, build evals from client data before code, hardwire in, productise the pattern back.
Watch: the unit economics floor — below a certain deal size the model doesn’t pay.
This week: pick one repeatable client problem; build the eval suite from their labelled data first.

Regulated / high-stakes

health · finance · legal · safety
Binding constraint: reversibility, auditability, harm avoidance.
Where to start: Found is non-negotiable and heavy — human gates, audit trail, all-pass (not partial-credit) evals, closed-universe pilots.
Skip: autonomy on irreversible actions. Keep a human in the loop on anything that can’t be undone.
This week: write the policy/approval gates and the all-pass eval rubric before building anything.
05

The Meta-Loop — why the method survives a moving field

AI playbooks rot fast — most organisations say AI project speed already outpaces their governance, and a frontier lab removed a core harness construct the moment a better model shipped. So the methodology re-verifies itself on a cadence: the same sense → evaluate → improve discipline, pointed at the playbook.

Re-verify: strategy is a living system ↻ 1 tag · 2 stress-test · 3 re-run evals · 4 re-verify · 5 prune
1 · Tag assumptions
Every load-bearing assumption (model capability, a vendor figure, a tool) gets a source and a re-check date.
2 · Stress-test the harness
On each model upgrade, ask what scaffolding the new model made unnecessary — and delete it.
3 · Re-run evals on live traffic
Catch silent drift: the prompt unchanged for 30 days while the model shifts underneath.
4 · Re-verify claims
Adversarially check claims before they harden into “fact”. Plausible is not the same as true.
5 · Prune
Counter data entropy and playbook rot: remove the stale, keep the living.
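
Step 1 becomes actionable when the tagged assumptions live in a small register that can be queried for overdue re-checks. A minimal sketch; the `Assumption` fields and the function name are hypothetical:

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Assumption:
    claim: str         # the load-bearing assumption, in one sentence
    source: str        # where the claim came from
    recheck_on: date   # when it must be re-verified


def due_for_recheck(register: list[Assumption], today: date) -> list[Assumption]:
    """Assumptions whose re-check date has arrived, most overdue first."""
    overdue = [a for a in register if a.recheck_on <= today]
    return sorted(overdue, key=lambda a: a.recheck_on)
```

Running this on the Meta-loop cadence turns "re-verify" from an intention into a short, dated to-do list.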
06

Two modes

Mode (greenfield vs transform) composes with your org type and tier. Same five stages, run differently.

Greenfield

a new AI-first business
  • Run Frame → Found → Forge → Flywheel → Federate in order.
  • AI removed the old bottlenecks (capital, headcount, technical skill).
  • Move fast; let execution build the moat.
  • Flat and trust-by-default from day zero.

Transformation

an incumbent going AI-first
  • Don’t reframe the whole company; you do not need to re-platform the entire stack.
  • Start in Forge on one beachhead workflow (escalations, procurement, claims, coordination, compliance).
  • A GM owns it, not the CIO/CTO.
  • Skip the slow interim layer; go straight to the agentic workflow. Prove one loop, then spread.
07

Scaling tiers — fit the weight to the size

The biggest early mistake is over-building. Run only the tier you’re at.

Tier         | Minimum viable version                                                                 | Defer
Solo builder | One harness, a few SKILL.md skills, manual evals, one closed loop.                     | Tool registry, company brain, org design.
Team (3–7)   | Eval discipline + small skill registry + control-plane basics + a DRI per outcome.     | Federation, centres of excellence.
Company      | The six capabilities; federated guardrails; forward-deployed for high-value accounts.  |
Platform     | Open skill registry; collaboration-layer moat; third-party packs.                      |
08

Failure modes — and the gate that catches each

Most attempts die in predictable ways. Each gate exists because of one of these.

Building before a user
Months of beautiful product for no one.
Caught by: the Frame named-user gate (5+ pre-sales or stop).
No measurable outcome
The loop can’t compound; outcome pricing is impossible.
Caught by: the Frame measurable-outcome gate — proxy/calibrated rubric, or exit.
Pilot purgatory
Endless demos that never reach production or compound.
Caught by: the Forge gate — the loop must improve the real outcome at pass^k before you proceed.
Over-automation harm
An autonomous agent makes a bad, irreversible call.
Caught by: the Found control plane — reversibility tagging + human gates before any autonomy.
Vendor-locked memory
Your institutional knowledge lives in someone else’s platform.
Caught by: Found — own your memory layer.
Solo over-engineering
A one-person team builds enterprise machinery and ships nothing.
Caught by: the Solo tier — minimum loop only; defer Federate.
Services-becomes-consulting
Bespoke client work that never productises.
Caught by: the FDE discipline — every engagement merges a reusable feature back.
Playbook rot
The method ages as models and the field move.
Caught by: the Meta-loop — re-verify, stress-test, prune on a cadence.
“It’s an IT project”
Treating transformation as tech, not work-design.
Caught by: the 10-20-70 reality — a GM owns it; 70% is people/process.
09

Implementation — what to put in place

The stages give you the shape; this is the runbook. First the order to build in, then the concrete artefacts each layer needs, then a self-score that names your next move.

A · The build sequence — minimum order

Do these in order. Step 6 is a hard gate: never pass it on a single lucky run.

1. Frame charter: named user, measurable KPI, feasibility test, moat, kill criteria.
2. Stand up the store + tracing: record every conversation and tool call.
3. Choose the harness; write the policy / reversibility table.
4. Write the evals first: 20–50 cases drawn from real failures.
5. Build one loop: Planner → Generator → Evaluator, judge separated.
6. Gate: the loop improves the KPI at pass^k. Only then proceed.
7. Turn on online evals + the nightly improvement loop.
8. Metrics dashboard + a weekly human transcript review.
9. Company brain over recorded data; outcome-aligned pricing.
10. Federate to a second workflow on the same substrate.

B · The five layers to put in place

An AI-first organisation is these five layers, standing. The first two are foundations; the third is the steering wheel most teams skip.

L1 Control plane

Owner · founder / platform · build first
  • A single system of record — everything written down
  • Tracing & observability on every agent run and tool call
  • An owned memory layer (not vendor-locked)
  • Policy / permissions layer; every action tagged reversible-or-not
  • Cost controls — budgets + cost-per-outcome tracking
  • Harness behind a provider-agnostic abstraction

L2 Skill & prompt layer

Owner · builders
  • Skill registry — SKILL.md in git, indexed, with a DRY/MECE resolver
  • Prompt store separated from app code, versioned
  • Staging → prod promotion + rollback for prompts
  • Tool definitions versioned (MCP), with strict contracts
  • LLM-land vs code-land split documented per workflow
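
The versioned prompt store with promotion and rollback can be as small as the sketch below. It is an in-memory illustration only: the `PromptStore` class and its method names are hypothetical, and a real store would persist versions (for example in git, as the skill registry does).

```python
class PromptStore:
    """Keeps every prompt version and a history of what was live in prod."""

    def __init__(self) -> None:
        self._versions: dict[str, list[str]] = {}      # name -> texts, oldest first
        self._prod_history: dict[str, list[int]] = {}  # name -> promoted version ids

    def add_version(self, name: str, text: str) -> int:
        """Store a new staging version and return its id."""
        self._versions.setdefault(name, []).append(text)
        return len(self._versions[name]) - 1

    def promote(self, name: str, version: int) -> None:
        """Make an existing version the live prod prompt."""
        if not 0 <= version < len(self._versions.get(name, [])):
            raise KeyError(f"unknown version {version} for prompt {name!r}")
        self._prod_history.setdefault(name, []).append(version)

    def rollback(self, name: str) -> None:
        """Return prod to the previously promoted version."""
        history = self._prod_history.get(name, [])
        if len(history) < 2:
            raise RuntimeError("nothing to roll back to")
        history.pop()

    def prod(self, name: str) -> str:
        """Text of the prompt currently live in prod."""
        return self._versions[name][self._prod_history[name][-1]]
```

Keeping the promotion history separate from the version list means a rollback never deletes a prompt; it only changes which version is live.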

L3 Eval & feedback layer

Owner · an evals DRI · the steering wheel
  • Golden dataset from 20–50 real failures, growing over time
  • Capability + regression evals, run at pass^k
  • Online evals on a prod sample; scores written back to spans
  • An LLM-judge calibrated against human grades
  • The nightly improvement loop (transcript → proposed edits)
  • A metrics dashboard (resolution, cost-per-outcome, trend)
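
Calibrating the LLM-judge against human grades needs a number to watch. The sketch below computes raw agreement and Cohen's kappa for pass/fail grades; choosing kappa is an assumption of this example, not something the methodology prescribes, and the function name is hypothetical.

```python
def agreement_and_kappa(human: list[bool], judge: list[bool]) -> tuple[float, float]:
    """Raw agreement and Cohen's kappa between human and judge pass/fail grades."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty grade lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    p_human = sum(human) / n
    p_judge = sum(judge) / n
    # Agreement expected by chance, given each rater's pass rate.
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    if expected == 1.0:
        # Both raters gave one identical label throughout; kappa is undefined.
        return observed, float("nan")
    return observed, (observed - expected) / (1 - expected)
```

Raw agreement alone can look high when almost everything passes; kappa corrects for that chance agreement, which matters for regression evals that are meant to stay near 100%.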

L4 People & cadences

Owner · leadership
  • A DRI per outcome / workflow
  • A named evals owner
  • (Enterprise) exec sponsor + GM beachhead owner; FDE function
  • Nightly loop automated; weekly transcript review (human)
  • Harness stress-test on every model upgrade
  • Periodic assumption re-verification (the Meta-loop)

L5 Decision records

Owner · founder / sponsor · write once, revisit
  • Charter: named user + KPI + sample size
  • Moat hypothesis
  • Harness + model provider (+ agnostic abstraction)
  • Policy / reversibility table — what’s autonomous vs gated
  • The metrics you steer by
  • Pricing model (outcome vs seat) + kill criteria

C · Maturity self-score

Rate each layer 0 (nothing in place) to 5 (solid, owned, automated). Your lowest layer is what to put in place next.

Control plane · record-everything, memory, reversibility, cost
Skill & prompt layer · registry, versioned prompts, tool contracts
Eval & feedback layer · golden set, pass^k, online evals, nightly loop
People & cadences · DRIs, evals owner, weekly review, stress-test
Decision records · charter, moat, policy table, pricing, kill criteria
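
The self-score reduces to picking the lowest-rated layer. A minimal sketch; the function name is hypothetical, and breaking ties in favour of the earlier layer is an assumption of this example (it puts foundations first).

```python
LAYERS = [
    "Control plane",
    "Skill & prompt layer",
    "Eval & feedback layer",
    "People & cadences",
    "Decision records",
]


def next_layer(scores: dict[str, int]) -> str:
    """Return the layer to put in place next: the lowest-scoring one."""
    for layer in LAYERS:
        if layer not in scores:
            raise ValueError(f"missing score for {layer!r}")
        if not 0 <= scores[layer] <= 5:
            raise ValueError(f"score for {layer!r} must be between 0 and 5")
    # min() returns the first minimum, so ties go to the earlier layer.
    return min(LAYERS, key=lambda layer: scores[layer])
```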
10

What’s directional, and when not to use it

Read before copying

  • Directional, not proven: outcome-based pricing and the collaboration-layer moat are forward-looking predictions. The 10-20-70 split and forward-deployed unit economics come from enterprise cases and may not generalise.
  • Verified deepest: the build-loop material (eval-driven development, the Planner/Generator/Evaluator harness) is primary and reproducible; the strategy/transformation frameworks are credible but partly marketing-adjacent.
  • A common myth, refuted: transforming an incumbent does not require re-platforming the entire stack. Start on one workflow.
  • When NOT to use this: if the outcome is genuinely unmeasurable; if most actions are irreversible and can’t be gated; or if there’s no path to a proprietary data/workflow asset — then you’re building a wrapper that gets commoditised.
11

Sources

Synthesised from primary lab/VC/consultancy material, adversarially fact-checked (24 sources, top claims verified, 1 killed). Vendor and VC frameworks are interested-party sources; cross-source convergence is the main reason for confidence.

VC · YC: The 7 Most Powerful Moats for AI Startups · Seven Powers; moat is execution-result; FDE switching costs
VC · a16z: Big Ideas 2026 · outcome pricing; data entropy; collaboration-layer moat
PRI · Anthropic: Demystifying Evals; Harness Design · eval-driven development; Planner/Generator/Evaluator
PRI · OpenAI: Agent Improvement Loop (cookbook) · the flywheel: traces → feedback → evals → rank → implement
CON · Bain: Roadmap to Reality; AI’s Next Operating Model · phase-gate; control plane early; beachhead workflows; GM ownership
CON · BCG: 10-20-70 operating model · 70% people/process; federated model
CON · McKinsey: Rewired; The Agentic Organization · six capabilities; org paradigm
CON · IBM: 7-step stage-gating framework · workflow-first selection; KPI gate per tranche
PRI · FDE playbooks (Palantir lineage) · five-phase motion; unit economics; productise back
PRI · AI Product Validation Framework · 5+ pre-sales gate; 20-50 hand-scored feasibility test

Compiled 2026-06-05, last reviewed 2026-06-06 · evidence-graded and adversarially verified · generic and shareable — every example is a public framework or a neutral illustration.
