June 3, 2026 · 9 min read

The wall is you: long-running agents and the 4 levels of AI

AI Agents
Opus 4.8
Autonomy

Matt Maher has a line that stuck with me: the model is plenty smart, and the wall between what we do with AI and what AI can do is the human standing in the middle, handing out tasks. Since May 2026 that wall has been movable, and this essay is about the specific machinery that moves it.

Maher describes four levels of working with AI. Level one is the single loop: you ask, it answers, you check, you ask again. Tab completion lives here, and so does the "type a question into ChatGPT" habit many people still mistake for the whole of AI. Level two is task automation: one request, and a harness fans it out into a hundred steps it runs for you. You ask for fifty features and walk away while the system writes the tests and deploys. Most capable teams read that description and nod, because level two is where they live.

My claim is narrow and, since May, testable: the long-running agent carries you out of level two. Opus 4.8 is the first release where you can run that climb instead of forecasting it.

Figure: Maher's levels as a staircase. The dashed line between two and three is the human who still decomposes the work by hand.

The ceiling of level two

Level two scales you while leaving your job the same shape. You still decompose the work, you still know where every piece goes, and you are still the one who looked at the objective, broke it into a to-do list, and parcelled the list out to a fleet of agents. The agents got faster. Your role stayed put.

Maher tells an anecdote that names the ceiling. He runs several projects at once. Four sessions, five, six, and by seven he tops out. Each session is itself a level-two ask running maybe a hundred steps underneath, so the picture is seven systems times a hundred steps: seven hundred parallel tasks, and that count is also his hard limit. The best move left to him is a bigger request, and past a point a bigger request stops helping.

700 things in parallel, and the wall is still me. That's not what AI promises us.
— Matt Maher, "Think Bigger to Climb the 4 Levels of AI"

Read his numbers and the upgrade you want becomes clear: a system that takes you out of the loop the work runs in. Long-running agents exist for that job.

What changed in 2026

For most of the level-two era the loop reset every session. You came back, re-explained the state, and pushed again. The breakthrough was boring and structural: a harness that lets an agent survive its own context window. Anthropic published the pattern. One initializer agent sets up the environment, writes a setup script, opens a progress log, makes a first commit. Then a coding agent works one increment at a time, commits to git, and leaves a summary for the next fresh-context session to pick up. Memory moves from the chat into the repo, where the agent that wakes up tomorrow reads what yesterday's agent wrote down.
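To make "memory moves into the repo" concrete, here is a minimal sketch of the handoff, assuming a JSON progress file. The file name `progress.json` and the functions `initialize`, `run_session`, and `run_all` are my own hypothetical names, not Anthropic's harness, and the model call and git commit are reduced to a comment.

```python
import json
import tempfile
from pathlib import Path

LOG_NAME = "progress.json"  # hypothetical file name, chosen for this sketch


def initialize(repo: Path, objective: str, increments: list[str]) -> None:
    """Initializer step: record the objective and the planned increments."""
    state = {"objective": objective, "todo": increments, "done": [], "notes": []}
    (repo / LOG_NAME).write_text(json.dumps(state, indent=2))


def run_session(repo: Path) -> bool:
    """One fresh-context session: read the log, do one increment, leave a note.

    Returns True while increments remain, False once the todo list is empty.
    """
    path = repo / LOG_NAME
    state = json.loads(path.read_text())  # the only memory carried between sessions
    if not state["todo"]:
        return False
    task = state["todo"].pop(0)
    # A real harness would call the model and commit to git at this point.
    state["done"].append(task)
    state["notes"].append(f"Finished '{task}'; {len(state['todo'])} increment(s) left.")
    path.write_text(json.dumps(state, indent=2))
    return bool(state["todo"])


def run_all(objective: str, increments: list[str]) -> dict:
    """Drive sessions until the log says nothing is left, then return the log."""
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        initialize(repo, objective, increments)
        while run_session(repo):
            pass
        return json.loads((repo / LOG_NAME).read_text())
```

Each call to `run_session` starts with nothing but the file on disk, which is the property that matters: the session that wakes up next needs no chat history to know where things stand.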

The pattern sounds modest until you put numbers on it. Anthropic reports that at Rakuten, Claude Code implemented an activation-vector extraction method inside vLLM, a codebase of about 12.5 million lines across multiple languages, in a single autonomous run lasting roughly seven hours, landing at 99.9% numerical accuracy against the reference. Worth the caveat: that is a self-reported customer result, not an independently audited one. But the shape of it is the point. A seven-hour unattended run on a codebase no single human holds in their head is beyond a level-two task. The agent broke the work down itself; no engineer wrote the hundred-step list in advance.

The trajectory has a number on it

METR measures the length of task, counted in human-expert time, that a model can finish at a given reliability. They call it the time horizon, and it is the cleanest neutral measure we have. The horizon is doubling every four to seven months. Claude 3 Opus sat near four minutes. Claude 3.7 Sonnet reached about an hour. The best-measured frontier models now clear twelve to seventeen hours. Two orders of magnitude across a handful of generations.
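The arithmetic behind "two orders of magnitude" is quick to check. This sketch uses only the figures quoted above (four minutes, twelve to seventeen hours); the function names are mine, and it says nothing about how METR computes its own estimates.

```python
import math


def fold_change(start_minutes: float, end_minutes: float) -> float:
    """How many times longer the end horizon is than the start horizon."""
    return end_minutes / start_minutes


def doublings(start_minutes: float, end_minutes: float) -> float:
    """Number of doublings needed to grow from start to end."""
    return math.log2(fold_change(start_minutes, end_minutes))


START = 4.0                  # roughly four minutes, as quoted in the text
LOW, HIGH = 12 * 60.0, 17 * 60.0  # twelve and seventeen hours, in minutes

for end in (LOW, HIGH):
    ratio = fold_change(START, end)
    print(
        f"{end / 60:.0f} h: {ratio:.0f}x, "
        f"{math.log10(ratio):.2f} orders of magnitude, "
        f"{doublings(START, end):.2f} doublings"
    )
```

The ratios come out at 180x and 255x, so a little over two orders of magnitude, or roughly seven and a half to eight doublings.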

One caveat: METR measures task difficulty, the human time a task represents, rather than how long an agent runs unattended. So the number bounds the trajectory rather than proving the endpoint. The direction, though, is unambiguous, and it points at the level Maher says no one has reached yet.

Opus 4.8 is the first release built for the level above

Anthropic positions Opus 4.8, released May 28, 2026, for this: the consistency and autonomy to keep working on long-running tasks. It plans before it acts. It carries memory across sessions instead of forgetting at the window boundary. It drives long work forward with minimal oversight. And it orchestrates: a single session can plan, then spin up hundreds of sub-agents, each in its own isolated context, each returning a summary to the parent. The research-preview Dynamic Workflows feature caps that at sixteen concurrent and a thousand total sub-agents per run.
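The fan-out shape is easy to sketch with ordinary `asyncio`. This is not the Dynamic Workflows API, which I am not reproducing here; `sub_agent` and `orchestrate` are hypothetical stand-ins, and the two limits are simply the numbers quoted above.

```python
import asyncio

MAX_CONCURRENT = 16  # concurrency cap quoted in the text
MAX_TOTAL = 1000     # per-run cap quoted in the text


async def sub_agent(task: str) -> str:
    """Hypothetical sub-agent: works in isolation and returns only a summary."""
    await asyncio.sleep(0)  # placeholder for real work
    return f"summary of {task}"


async def orchestrate(tasks: list[str]) -> tuple[list[str], int]:
    """Run every task under the concurrency cap.

    Returns the summaries in task order plus the peak number of sub-agents
    that were active at once.
    """
    if len(tasks) > MAX_TOTAL:
        raise ValueError(f"{len(tasks)} tasks exceeds the per-run cap of {MAX_TOTAL}")
    gate = asyncio.Semaphore(MAX_CONCURRENT)
    active = 0
    peak = 0

    async def guarded(task: str) -> str:
        nonlocal active, peak
        async with gate:  # waits here while the cap is reached
            active += 1
            peak = max(peak, active)
            try:
                return await sub_agent(task)
            finally:
                active -= 1

    summaries = await asyncio.gather(*(guarded(t) for t in tasks))
    return list(summaries), peak
```

Calling `asyncio.run(orchestrate([f"job-{i}" for i in range(200)]))` hands the parent two hundred summaries while never running more than sixteen sub-agents at a time. The parent sees summaries only, which is what keeps its own context small.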

Put the harness, the horizon, and the orchestration together and you get the first version of Maher's level three you can run today. You hand the system an objective. It figures out which level-two jobs to spawn to meet it. You describe the outcome and let the system find the work.

The shape of a level-three ask

The climb is a change in what you type. At level two you ask for features, code, tasks. At level three you ask for an objective with the evidence that proves it is met.

State the objective, not the steps. "Cut p95 checkout latency below 400ms" instead of "add an index to the orders table, then cache the cart lookup." The second is a list of your guesses at the solution. Hand over the target and let the system find the list.
Write the evaluation first. A goal is only as good as the test it runs against. Name the measurement and the end-state evidence up front, and the agent keeps working against that check until it passes. This is what the goal features in agentic tools are for.
Set the guardrails, then leave. Bound the budget, the scope, the things it must not touch. Guardrails are what make walking away rational rather than reckless.
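The three parts of a level-three ask fit in a small data structure. What follows is a sketch under my own assumptions, not any tool's goal feature: `Goal`, `run_until_met`, and `fake_attempt` are hypothetical names, the guardrail is reduced to an attempt budget, and the "agent" is a stub whose latency samples improve on each attempt.

```python
from dataclasses import dataclass
from typing import Callable


def p95(samples_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of a list of latency samples."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil of 0.95 * n
    return ordered[rank - 1]


@dataclass
class Goal:
    objective: str                         # the outcome, stated without steps
    passes: Callable[[list[float]], bool]  # the evaluation, written first
    max_attempts: int                      # the guardrail: a hard budget


def run_until_met(goal: Goal, attempt: Callable[[int], list[float]]) -> tuple[bool, int]:
    """Keep attempting until the evaluation passes or the budget runs out."""
    for n in range(1, goal.max_attempts + 1):
        if goal.passes(attempt(n)):
            return True, n
    return False, goal.max_attempts


def fake_attempt(n: int) -> list[float]:
    """Stub agent step: each attempt shaves 50 ms off a synthetic workload."""
    base = 520 - 50 * n
    return [float(base + i) for i in range(20)]


goal = Goal(
    objective="Cut p95 checkout latency below 400ms",
    passes=lambda samples: p95(samples) < 400,
    max_attempts=5,
)
met, attempts = run_until_met(goal, fake_attempt)
```

With this stub the check passes on the third attempt. The design point is that `passes` exists before any work starts, so "done" is a measurement rather than the agent's opinion, and `max_attempts` is what makes it safe to stop watching.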

What you can do on Monday

Practising the orientation needs no perfect tool. Hand over outcomes instead of tasks, starting with these four moves.

Pick one objective you'd normally decompose and refuse to decompose it. Write the goal, write the test, set a budget, and let an agent run long against it. Judge the result, not the steps it took.
Run it overnight. The horizon is hours now. A task you'd babysit for an afternoon is a task you can start before you log off and review in the morning.
Raise your permission posture on purpose. Choose the highest trust mode the task and environment justify, so the agent works in longer uninterrupted stretches while dangerous or sensitive work stays gated. Most of getting out of the loop is getting out of the approval loop.
Make the agent leave a trail. Commits, a running log, a summary for the next session. The same artifacts that let an agent survive its context window let you trust a run you didn't watch.
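Raising your permission posture comes down to a policy you can write out. In this sketch the action names, the two trust levels, and the functions `decide` and `approvals_needed` are all hypothetical; real agent tools define their own permission modes, and this only illustrates the principle that higher trust removes routine prompts while sensitive work stays gated.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    ASK = "ask"
    DENY = "deny"


# Hypothetical action categories, invented for this illustration.
READ_ONLY = {"read_file", "run_tests", "git_diff"}
WRITES = {"edit_file", "git_commit"}
SENSITIVE = {"deploy", "delete_branch", "read_secrets"}


def decide(action: str, trust: str) -> Decision:
    """Map an action to a decision under a trust level of 'low' or 'high'."""
    if action in SENSITIVE:
        return Decision.ASK  # stays gated whatever the trust level
    if action in READ_ONLY:
        return Decision.ALLOW
    if action in WRITES:
        return Decision.ALLOW if trust == "high" else Decision.ASK
    return Decision.DENY  # unknown actions are refused rather than guessed at


def approvals_needed(plan: list[str], trust: str) -> int:
    """Count how many times a human would be interrupted for this plan."""
    return sum(decide(action, trust) is Decision.ASK for action in plan)


plan = ["read_file", "edit_file", "run_tests", "edit_file", "git_commit", "deploy"]
low = approvals_needed(plan, "low")
high = approvals_needed(plan, "high")
```

For this plan the low-trust posture interrupts you four times and the high-trust posture once, for the deploy. That single remaining prompt is the approval you actually care about.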

You don't have to believe AI will run your company. The one belief this asks of you: the breaking-down and parcelling-out you do by hand at level two is itself work you can describe and delegate. The model has been ready for a while. As of Opus 4.8, the harness around it is too.