Instead of blaming the model for the token bill, let's fix how we deploy it

Every few days someone posts a version of the same complaint: “Opus/Fable burned through my whole budget in twenty minutes,” whether the job was a code refactor or something else entirely. But a modern AI coding assistant isn’t one brain doing one thing; it’s more like an orchestra. Hand it a big job and it breaks that job down into dozens of smaller pieces (hunting down files, reading scripts, editing code, running commands, checking the output), and it can spin up sub-agents to tackle several of those pieces at once. Letting the most powerful, most expensive model do all of that is like hiring a Michelin-starred chef and then having them wash the dishes and answer the phone too.

First, take the model out of most of the job

But before you even start handing out models, there’s a step people tend to skip: separating the deterministic work (the part that follows fixed rules and needs no interpretation from an LLM) from the part that does need a model. In most projects the deterministic part can be well over 80% of the job, and none of it should go anywhere near a model. Once you’ve carved out what really calls for judgment, the rest is just common sense: give each subtask the model that fits it (Opus, Sonnet…). The most capable model earns its keep on judgment and orchestration (planning the approach, making the architectural calls, reviewing the final result), so that’s what you want to save it for. The routine, well-defined work (finding every place a function gets called, extending a script to five more countries, running the tests) should go to lighter, faster, cheaper models that handle it perfectly well and can run in parallel. The skill to learn isn’t “pick the best model”; it’s reserving the expensive reasoning for the decisions that need it and handing everything else to cheaper models. And, of course, letting plain deterministic code deal with whatever doesn’t need a model at all.
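To make the triage above concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `Tier` names, the `Subtask` flags and the `route` function are hypothetical, not part of any real assistant or API, and the formatter task is an invented example of rule-based work. It only shows the order of the questions: can plain code do it, does it need judgment, and if neither, hand it to a lighter model.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    CODE = "plain deterministic code"    # no model involved at all
    LIGHT = "lighter, cheaper model"     # routine, well-defined work
    PREMIUM = "most capable model"       # judgment and orchestration


@dataclass(frozen=True)
class Subtask:
    name: str
    rule_based: bool = False      # follows fixed rules, needs no interpretation
    needs_judgment: bool = False  # planning, architectural calls, final review


def route(task: Subtask) -> Tier:
    """Return the cheapest tier that can handle the subtask (illustrative rule)."""
    if task.rule_based:
        return Tier.CODE
    if task.needs_judgment:
        return Tier.PREMIUM
    return Tier.LIGHT


subtasks = [
    Subtask("apply the formatter to every file", rule_based=True),  # invented example
    Subtask("find every place a function gets called"),
    Subtask("extend a script to five more countries"),
    Subtask("plan the approach", needs_judgment=True),
    Subtask("review the final result", needs_judgment=True),
]

for task in subtasks:
    print(f"{task.name:42} -> {route(task).value}")
```

In a real project the two flags would come from however you describe your tasks. The point of the sketch is that the deterministic check comes first, so a model is only considered once plain code has been ruled out.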

How many agents should we launch?

Once you’re delegating work to agents, an unavoidable question shows up: how many do I launch? The answer isn’t “the more the better” or “the fewer the cheaper,” because both extremes are expensive for opposite reasons. Every agent pays an entry toll (loading its instructions and re-reading the project’s files before it can begin), so splitting a hundred tasks across a hundred tiny agents means paying that toll a hundred times. But cramming them all into a single agent doesn’t work either: its context keeps filling up with the trail of every previous step, until it’s lugging around an ever-heavier backpack that it has to re-read again and again, and it chokes. The cheap spot is in the middle: group the tasks that share context (the same files, the same goal) into one agent so they reuse the toll already paid, and keep separate agents for the ones that are truly independent. In practice it’s almost never one giant agent or a swarm of tiny ones, but a handful, well distributed.
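A toy cost model shows why both extremes lose. It is a sketch under loud assumptions: the token figures (`toll`, `trail`, `work`) are invented for illustration, not measurements, and the model assumes that before each task an agent re-reads the trail left by all of its previous tasks. Under those assumptions, an agent that handles k tasks pays the toll once, plus a re-reading cost that grows with k(k-1)/2.

```python
def agent_cost(k: int, toll: int, trail: int, work: int) -> int:
    """Tokens spent by one agent handling k tasks in sequence (toy model).

    toll  - paid once to load instructions and project files
    trail - tokens each finished task leaves behind in the context
    work  - tokens of actual work per task
    Before task i (0-indexed) the agent re-reads i * trail tokens.
    """
    reread = trail * k * (k - 1) // 2
    return toll + reread + k * work


def total_cost(n_tasks: int, n_agents: int,
               toll: int = 20_000, trail: int = 500, work: int = 1_000) -> int:
    """Total tokens when n_tasks are split as evenly as possible across n_agents."""
    base, extra = divmod(n_tasks, n_agents)
    sizes = [base + 1] * extra + [base] * (n_agents - extra)
    return sum(agent_cost(k, toll, trail, work) for k in sizes)


for agents in (1, 10, 100):
    print(f"{agents:3} agents: {total_cost(100, agents):,} tokens")

# Cheapest agent count for 100 tasks under these made-up numbers
best = min(range(1, 101), key=lambda a: total_cost(100, a))
print("cheapest number of agents:", best)
```

With these made-up numbers, one agent costs 2,595,000 tokens, a hundred agents cost 2,100,000, and ten agents cost 525,000. The exact minimum depends entirely on the assumed toll and trail sizes; what carries over is the shape of the curve, with a cheap middle and two expensive ends.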

The payoff

So the job that used to vaporize the budget in twenty minutes now runs for hours on the same allowance. Put the premium minds on the hard decisions and the quick hands on the mechanical work, set a clear cap on how many run at once, and the very models that looked ruinously expensive turn out to be surprisingly cheap.
