Methodology · 2026 · evidence-graded
Building an AI-first organisation, small or large.
A repeatable, gated, self-correcting methodology — five stages wrapped in a loop that keeps the method itself from going stale. Read by organisation type, run the stages with their gates and templates, and use the implementation runbook to know exactly what to put in place. Generic and shareable: every example is a public framework.
The one idea it hangs on
An AI-first organisation is a set of closed loops that compound a proprietary asset — data + codified workflow + evals — while spending inference instead of headcount.
The spine
Five stages, each separated by a hard gate you cannot pass without evidence. A continuous meta-loop wraps all of them. The gates are the methodology; the stages are just where they sit.
The five stages: gate, metrics, template
Each stage names its concepts, the single gate that lets you move on, what to track, and a copyable artefact.
Frame
decide what to build
Workflow-first: map an end-to-end workflow, decompose into segments, find the one that is high-volume, repetitive, money-adjacent and painful. Run it manually yourself before automating it.
Template · the one-page Frame charter
NAMED USER: [name / role] · spoken to? [y/n] · would pay? [y/n]
PROBLEM: [the pain, in their words]
OUTCOME (KPI): [the number that defines success] · sample size: [n]
FEASIBILITY: [accuracy on 20-50 hand-scored test inputs] = [__%]
MOAT (in 18mo): [data / workflow / switching cost we will own]
PITCH: [what] for [customer], solving [problem]. [the hook].
KILL CRITERIA: stop if [no paying user in 8 weeks / accuracy < X]
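The FEASIBILITY and KILL CRITERIA lines of the charter reduce to a small calculation. A minimal sketch, assuming hand scores are recorded as booleans; the function names and the `min_accuracy` parameter (standing in for the template's "X") are illustrative, not part of the charter:

```python
def feasibility_accuracy(hand_scores: list[bool]) -> float:
    """Share of hand-scored test inputs judged correct."""
    n = len(hand_scores)
    if not 20 <= n <= 50:
        # The charter asks for 20-50 hand-scored inputs.
        raise ValueError("expected 20-50 hand-scored test inputs")
    return sum(hand_scores) / n


def should_kill(accuracy: float, min_accuracy: float,
                weeks_elapsed: int, paying_users: int) -> bool:
    """Apply the charter's kill criteria.

    min_accuracy is the template's placeholder "X" (an assumption here).
    """
    no_payer_in_time = weeks_elapsed >= 8 and paying_users == 0
    return no_payer_in_time or accuracy < min_accuracy
```

Writing the threshold down before running the test set keeps the kill decision from being renegotiated after the numbers come in.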
Found
make it loop-ready
Build the control plane before any autonomy: legibility (record everything), owned memory (not vendor-locked), permissions, observability, rollback, human override, cost management. Reuse an existing harness; don’t rebuild the loop.
Template · the action policy / reversibility table
| ACTION | REVERSIBLE? | AUTONOMY | GATE |
|---|---|---|---|
| read / retrieve | n/a | autonomous | grounding check |
| draft / suggest | yes | autonomous | quality gate |
| send external comms | partly | needs review | human approve |
| move money / refund | no | NEVER auto | human + audit log |
| delete / overwrite | no | NEVER auto | human + backup |
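The policy table above can be held as data and consulted before any tool call. A minimal sketch; the action keys, the `Policy` class and the fail-closed handling of unknown actions are assumptions made for illustration, while the row values mirror the table:

```python
from dataclasses import dataclass
from enum import Enum


class Autonomy(Enum):
    AUTONOMOUS = "autonomous"
    NEEDS_REVIEW = "needs review"
    NEVER_AUTO = "NEVER auto"


@dataclass(frozen=True)
class Policy:
    reversible: str      # "n/a", "yes", "partly" or "no", as in the table
    autonomy: Autonomy
    gate: str


# Hypothetical action keys; one entry per table row.
ACTION_POLICY = {
    "read": Policy("n/a", Autonomy.AUTONOMOUS, "grounding check"),
    "draft": Policy("yes", Autonomy.AUTONOMOUS, "quality gate"),
    "send_external_comms": Policy("partly", Autonomy.NEEDS_REVIEW, "human approve"),
    "move_money": Policy("no", Autonomy.NEVER_AUTO, "human + audit log"),
    "delete": Policy("no", Autonomy.NEVER_AUTO, "human + backup"),
}


def may_run_unattended(action: str) -> bool:
    """True only for actions the table marks as autonomous.

    Unknown actions are denied (assumption: fail closed).
    """
    policy = ACTION_POLICY.get(action)
    return policy is not None and policy.autonomy is Autonomy.AUTONOMOUS
```

Keeping the table as data rather than scattered `if` statements means a reviewer can audit autonomy in one place.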
Forge
build the first closed loop
One workflow, fully closed-loop. Evals before code: 20–50 tasks from real failures define the capability first; split into capability evals (hard, low pass) and regression evals (must stay high). Three-agent harness (Planner → Generator → Evaluator), with the judge separated from the builder. One feature at a time. Test as a human would.
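The Planner → Generator → Evaluator shape can be sketched as a small control loop. This is an illustration under stated assumptions, not a reference harness: the three roles are passed in as plain callables (in practice each would wrap a model call), and `run_harness` and `max_attempts` are hypothetical names.

```python
from typing import Callable


def run_harness(
    task: str,
    planner: Callable[[str], list[str]],
    generator: Callable[[str, str], str],
    evaluator: Callable[[str, str], tuple[bool, str]],
    max_attempts: int = 3,
) -> dict[str, str]:
    """Build a task one feature at a time, with a separate judge."""
    results: dict[str, str] = {}
    for feature in planner(task):              # one feature at a time
        feedback = ""
        for _ in range(max_attempts):
            output = generator(feature, feedback)
            # The evaluator is a different callable from the generator,
            # so the builder never grades its own work.
            ok, feedback = evaluator(feature, output)
            if ok:
                results[feature] = output
                break
        else:
            raise RuntimeError(
                f"feature {feature!r} failed evaluation after {max_attempts} attempts"
            )
    return results
```

Passing the evaluator's feedback back into the generator is one design choice among several; the essential point from the text is only that judging and building are separate roles.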
Template · a SKILL.md skeleton + an eval task
--- SKILL.md ---
name: resolve-customer-refund
description: Use when a customer asks for a refund. Decides eligibility from policy, NOT when the request is a complaint with no refund ask.
---
1. get_order(id)  # code-land: deterministic
2. judge eligibility  # LLM-land: against policy in references/
3. if eligible & < £X: process_refund()  # else: escalate_to_human()
4. confirm warmly; log decision + reason
--- eval task ---
input: "order 8841 arrived broken, I want my money back"
expect: eligible=true, action=refund, amount<=order_total
score: grounded (each claim from a tool call)? policy-correct? tone?
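The skill's code-land / LLM-land split and the deterministic part of the eval task might look like this in code. A minimal sketch: `AUTO_REFUND_LIMIT` is a made-up stand-in for the template's "£X", the eligibility judge is injected as a callable so a model call can be swapped for a stub in tests, and the grounding and tone checks from the `score:` line are out of scope here.

```python
from dataclasses import dataclass
from typing import Callable

AUTO_REFUND_LIMIT = 50.0  # placeholder for the template's "£X"; not from the source


@dataclass
class Order:
    order_id: str
    total: float


@dataclass
class Decision:
    eligible: bool
    action: str      # "refund" or "escalate"
    amount: float
    reason: str      # logged alongside the decision (step 4)


def resolve_refund(order: Order, message: str,
                   judge_eligibility: Callable[[Order, str], bool]) -> Decision:
    # LLM-land: eligibility is a judgement against policy.
    eligible = judge_eligibility(order, message)
    # Code-land: the threshold check and routing are deterministic.
    if eligible and order.total < AUTO_REFUND_LIMIT:
        return Decision(True, "refund", order.total, "eligible, under auto limit")
    return Decision(eligible, "escalate", 0.0, "not auto-refundable")


def meets_expectation(decision: Decision, order: Order) -> bool:
    """The eval task's expect line: eligible=true, action=refund, amount<=order_total."""
    return (decision.eligible
            and decision.action == "refund"
            and decision.amount <= order.total)
```

For the eval input above, a stub judge that returns `True` for order 8841 exercises the refund path without calling a model.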
Flywheel
compound the asset
Close the loop so it runs with minimal intervention: traces → human + LLM feedback → reusable automated evals (a gate) → rank → implement → repeat. Build the company brain; counter data entropy. Spend tokens not headcount; shift metrics from labour-savings to developmental ones.
Template · the nightly improvement loop
EVERY NIGHT (automated):
1. read the day's transcripts + low-scoring eval spans
2. classify each miss: skill/context gap vs model gap
3. propose skill edits (accept only if a held-out eval improves)
4. write a short report of proposed changes
WEEKLY (human, 30 min):
5. review proposals; read a transcript sample; approve/reject
6. calibrate the LLM-judge against your human grades
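Two steps of this loop are mechanical enough to sketch: the acceptance rule in step 3 and the judge calibration in step 6. The function names, the `min_gain` margin and the use of simple agreement rate as the calibration measure are assumptions for illustration.

```python
def accept_skill_edit(baseline_score: float, candidate_score: float,
                      min_gain: float = 0.0) -> bool:
    """Step 3: keep an edit only if the held-out eval score improves.

    min_gain is an optional margin (an assumption) to ignore noise-level gains.
    """
    return candidate_score > baseline_score + min_gain


def judge_agreement(human_grades: list[bool], judge_grades: list[bool]) -> float:
    """Step 6: fraction of items where the LLM-judge matches the human grade."""
    if not human_grades or len(human_grades) != len(judge_grades):
        raise ValueError("need two non-empty grade lists of equal length")
    matches = sum(h == j for h, j in zip(human_grades, judge_grades))
    return matches / len(human_grades)
```

If agreement drifts down from one weekly review to the next, that is a signal to revisit the judge prompt before trusting the nightly scores.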
Federate
scale the substrate
The same substrate runs adjacent workflows and verticals; the organisation runs AI-first. Federated design: central guardrails + domain-level execution. Roles compress to IC (builder/operator) and DRI (directly responsible individual). Fund expansion on adoption + outcomes, not technical milestones.
By organisation type
The same methodology reads differently depending on who you are. Each archetype has a binding constraint, a place to start, and a first move. Mode (greenfield vs transformation) and scaling tier combine with the archetype.
Solo builder / indie
Early startup
Scale-up / growth
SME / mid-market
Large enterprise / incumbent
Agency / services firm
Regulated / high-stakes
The Meta-Loop — why the method survives a moving field
AI playbooks rot fast — most organisations say AI project speed already outpaces their governance, and a frontier lab removed a core harness construct the moment a better model shipped. So the methodology re-verifies itself on a cadence: the same sense → evaluate → improve discipline, pointed at the playbook.
Two modes
Mode (greenfield vs transformation) composes with your organisation type and tier. The five stages are the same; how you run them differs.
Greenfield
- Run Frame → Found → Forge → Flywheel → Federate in order.
- AI removed the old bottlenecks (capital, headcount, technical skill).
- Move fast; let execution build the moat.
- Flat and trust-by-default from day zero.
Transformation
- Don’t reframe the whole company; you do not need to re-platform the entire stack.
- Start in Forge on one beachhead workflow (escalations, procurement, claims, coordination, compliance).
- A GM owns it, not the CIO/CTO.
- Skip the slow interim layer; go straight to the agentic workflow. Prove one loop, then spread.
Scaling tiers — fit the weight to the size
The biggest early mistake is over-building. Run only the tier you’re at.
| Tier | Minimum viable version | Defer |
|---|---|---|
| Solo builder | One harness, a few SKILL.md skills, manual evals, one closed loop. | Tool registry, company brain, org design. |
| Team (3–7) | Eval discipline + small skill registry + control-plane basics + a DRI per outcome. | Federation, centres of excellence. |
| Company | The six capabilities; federated guardrails; forward-deployed for high-value accounts. | — |
| Platform | Open skill registry; collaboration-layer moat; third-party packs. | — |
Failure modes — and the gate that catches each
Most attempts die in predictable ways. Each gate exists because of one of these.
Implementation — what to put in place
The stages give you the shape; this is the runbook. First the order to build in, then the concrete artefacts each layer needs, then a self-score that names your next move.
A · The build sequence — minimum order
Do these in order. Step 6 is a hard gate: never pass it on a single lucky run.
B · The five layers to put in place
An AI-first organisation is these five layers, standing. The first two are foundations; the third is the steering wheel most teams skip.
L1 Control plane
- A single system of record — everything written down
- Tracing & observability on every agent run and tool call
- An owned memory layer (not vendor-locked)
- Policy / permissions layer; every action tagged reversible-or-not
- Cost controls — budgets + cost-per-outcome tracking
- Harness behind a provider-agnostic abstraction
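The cost-controls item in the list above (budgets plus cost-per-outcome tracking) comes down to two numbers per workflow. A minimal sketch, assuming each agent run is logged with its inference spend and whether it produced the intended outcome; `RunRecord` and its fields are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    cost: float        # inference spend for one agent run, in one currency
    succeeded: bool    # whether the run produced the intended outcome


def cost_per_outcome(runs: list[RunRecord]) -> float:
    """Total spend divided by successful outcomes (failed runs still cost money)."""
    total = sum(r.cost for r in runs)
    outcomes = sum(r.succeeded for r in runs)
    if outcomes == 0:
        return float("inf")   # spend with nothing to show for it
    return total / outcomes


def over_budget(runs: list[RunRecord], budget: float) -> bool:
    return sum(r.cost for r in runs) > budget
```

Dividing by successful outcomes rather than by runs is the point of the metric: a cheap run that fails makes the number worse, not better.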
L2 Skill & prompt layer
- Skill registry — SKILL.md in git, indexed, with a DRY/MECE resolver
- Prompt store separated from app code, versioned
- Staging → prod promotion + rollback for prompts
- Tool definitions versioned (MCP), with strict contracts
- LLM-land vs code-land split documented per workflow
L3 Eval & feedback layer
- Golden dataset from 20–50 real failures, growing over time
- Capability + regression evals, run at pass^k
- Online evals on a prod sample; scores written back to spans
- An LLM-judge calibrated against human grades
- The nightly improvement loop (transcript → proposed edits)
- A metrics dashboard (resolution, cost-per-outcome, trend)
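The "pass^k" in the eval item above is read here as the probability that all k independent attempts at a task succeed, in contrast to pass@k, where at least one of k succeeds. That reading is an assumption about the intended meaning. Under it, both can be estimated from n recorded trials with c successes using standard combinatorial estimators:

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    if not (0 <= c <= n and 0 < k <= n):
        raise ValueError("need 0 <= c <= n and 0 < k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimated chance that at least one of k sampled trials passes."""
    _check(n, c, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimated chance that all k sampled trials pass (pass^k)."""
    _check(n, c, k)
    return comb(c, k) / comb(n, k)
```

With 8 passes in 10 trials and k = 3, pass@k is 1.0 while pass^k is 56/120, roughly 0.47. The gap is why a reliability gate is usually stated in terms of pass^k: one lucky run satisfies pass@k but not pass^k.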
L4 People & cadences
- A DRI per outcome / workflow
- A named evals owner
- (Enterprise) exec sponsor + GM beachhead owner; FDE function
- Nightly loop automated; weekly transcript review (human)
- Harness stress-test on every model upgrade
- Periodic assumption re-verification (the Meta-loop)
L5 Decision records
- Charter: named user + KPI + sample size
- Moat hypothesis
- Harness + model provider (+ agnostic abstraction)
- Policy / reversibility table — what’s autonomous vs gated
- The metrics you steer by
- Pricing model (outcome vs seat) + kill criteria
C · Maturity self-score
Rate each layer 0 (nothing in place) to 5 (solid, owned, automated). Your lowest layer is what to put in place next.
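The self-score rule can be written down directly. A minimal sketch; the tie-break (the earlier layer wins, on the grounds that the first layers are foundations) is an assumption, since the text does not say how to resolve ties:

```python
LAYERS = [
    "L1 Control plane",
    "L2 Skill & prompt layer",
    "L3 Eval & feedback layer",
    "L4 People & cadences",
    "L5 Decision records",
]


def next_layer(scores: dict[str, int]) -> str:
    """Return the lowest-scoring layer; ties go to the earliest layer."""
    for layer in LAYERS:
        if layer not in scores:
            raise ValueError(f"missing score for {layer}")
        if not 0 <= scores[layer] <= 5:
            raise ValueError(f"score for {layer} must be between 0 and 5")
    # min() returns the first minimal item, so list order breaks ties.
    return min(LAYERS, key=lambda layer: scores[layer])
```

For example, scores of 3, 4, 1, 2, 1 across L1 to L5 point at L3, the eval and feedback layer, because it is the first of the two layers scoring 1.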
What’s directional, and when not to use it
Read before copying
- Directional, not proven: outcome-based pricing and the collaboration-layer moat are forward-looking predictions. The 70-20-10 split and forward-deployed unit economics come from enterprise cases and may not generalise.
- Verified deepest: the build-loop material (eval-driven development, the Planner/Generator/Evaluator harness) is primary and reproducible; the strategy/transformation frameworks are credible but partly marketing-adjacent.
- A common myth, refuted: transforming an incumbent does not require re-platforming the entire stack. Start on one workflow.
- When NOT to use this: if the outcome is genuinely unmeasurable; if most actions are irreversible and can’t be gated; or if there’s no path to a proprietary data/workflow asset — then you’re building a wrapper that gets commoditised.
Sources
Synthesised from primary lab/VC/consultancy material, adversarially fact-checked (24 sources, top claims verified, 1 killed). Vendor and VC frameworks are interested-party sources; cross-source convergence is the main reason for confidence.
Compiled 2026-06-05, last reviewed 2026-06-06 · evidence-graded and adversarially verified · generic and shareable — every example is a public framework or a neutral illustration.
Take it with you
The whole methodology, as a field guide.
Eight pages: the spine, the five gated stages, the organisation-type matrix, the build sequence, the five layers to put in place, and the failure modes. Free, no email.