The Orchestration Paradox • secrary [dot] com

We’re getting very good at remembering where the answer lives

This isn’t a new problem just because the interface now has chat bubbles.

Sparrow, Liu, and Wegner (2011), the study people usually mean by the Google effect, found that when people expect information to be easy to retrieve later, they encode less of it. They remember where to find it, not the thing itself. Handy for trivia. Not so handy for systems work.

That lands a little too neatly on multi-agent coding workflows. Agent A has the threading analysis. Agent B has the migration gotchas. Agent C wrote the explanation of why the tests are “likely sufficient.” I’ve learned to hear that phrase the way a mechanic hears a rattle at 100 km/h. Over time, our brain adapts. Of course it does. Why retain the chain of reasoning when the transcript is sitting right there?

Context switching doesn’t become harmless just because we got good at it

There’s also the simple, ugly fact that hopping between tasks shreds attention.

Sophie Leroy’s work on attention residue (2009) is still one of the best descriptions of what this feels like from the inside. When we switch from Task A to Task B, part of our attention stays stuck on A. It doesn’t vanish because we opened a new window. It trails behind us like a shadow we forgot we were casting.

Researchers have measured this cost: task switching hits speed and accuracy even with practice, and heavy multitaskers tend to do worse at filtering what matters. They’re not cognitive superheroes. They’re just distracted. Faster, maybe, but shallower.

Now imagine that with three, four, six active agent threads. One is halfway through a refactor. One is suggesting tests. One is asking us to adjudicate between two plausible implementations. We’re not arriving fresh to each review. We’re dragging residue from the last five.

We become a dispatcher. A pretty fast one, maybe. But a dispatcher all the same.

I’ve noticed this in myself. After a morning of bouncing between agent threads, I’ll sit down to do something that requires actual concentration and find my attention refuses to settle. It wants tabs. It wants movement. It wants the next hit. Sitting with one problem starts to feel not boring but physically wrong, like wearing a coat that’s several sizes too small.

Automation doesn’t remove the need for judgment. It makes that need nastier

The most relevant warning here is older than LLMs by decades.

Bainbridge (1983) was writing about industrial control rooms. But the shape is the same: automate the routine, push the human toward the exceptions, and then, here’s the part nobody talks about, leave the human least practiced at the exact moment practice matters.

AI coding agents create a softer version of the same problem. Their output is rarely pure garbage. It is often mostly right.

And “mostly right” is exactly where oversight gets hard. Obvious garbage is easy. A subtle omission in a migration step, an unstated assumption about ordering, a race condition hiding behind a perfectly readable explanation. That’s where human review is supposed to earn its keep. And if our own involvement has been reduced to skimming polished summaries, we are least prepared at the precise moment the system needs us most.

That’s the part that worries me.

Shallow review has a way of turning into a habit

Once our workday is chopped into little reactive slices, our brain starts expecting that pace.

Mark, Gonzalez, and Harris (2005) found that constant interruption leads people to self-interrupt. That one feels painfully familiar. After enough days of bouncing between Slack, CI, code review, and a little zoo of agents, uninterrupted thought starts to feel almost physically uncomfortable. We reach for the next tab before the current idea has had time to ripen. Five quiet minutes feels suspicious.

That matters because some kinds of understanding need idle space.

Not every useful thought arrives while staring at the code. Some show up on the walk to make coffee, or in the blank pause after reading something difficult and not immediately doing anything with it.

Carr’s The Shallows (2010) is useful here, even if we don’t buy every page of the argument. The pattern is hard to miss. If our day is built out of summaries, snippets, alerts, and fast approvals, we get better at summaries, snippets, alerts, and fast approvals. Funny how that works. But what happens when one of those summaries is hiding the bug that actually matters?

The real problem shows up when we need an actual model in our head

Good software judgment usually isn’t about spotting isolated facts. It’s about building what Kintsch (1998) would call a situation model: not a pile of facts, but a coherent sense of what’s happening. In engineering terms: not just what the code says, but what the system is doing, what it assumes, where it’s brittle, what happens if events arrive in the wrong order, what breaks first.

We do not get that from five-second skim passes.

We get it by staying with the problem long enough for the parts to connect. Reading the code ourselves. Walking the flow. Holding multiple conditions in working memory at once. Imagining the ugly cases, not just the happy path the agent laid out so politely.

This is where the current pattern gets perverse. The more agent output we supervise, the easier it is to lean on the agent’s model of the codebase instead of forming our own. And if the agent misses the one thing that actually matters (say, a timing issue, or a distributed-state edge case that static review won’t surface), the human is supposed to catch it. Supposed to. But that requires a thick mental model, not a scrapbook of summaries.

Each piece pushes the same way. Attention residue steals room for scenario testing. Offloading lets the details evaporate. Multitasking makes it harder to stay with the system long enough to understand it. Automation complacency nudges review toward trust-by-default. Put it together and the failure mode is nasty: the human overseer gets less practiced at oversight while feeling incredibly busy.

Busy is not the same as sharp.

So what do we do?

I’m not arguing that we should throw the tools in the sea. I use them. Gladly. They’re useful. Sometimes wildly useful.

There is a sane version of this workflow. Sometimes writing the prompt forces us to say the thing we were hand-waving. Sometimes an agent’s critique gives the design a wall to push against. Good. Use that. I want that version. The trouble starts when the prompt becomes a substitute for thinking instead of a pressure test for it.

Giuseppe Riva (2025) describes a broader version of this: the comfort-growth paradox. The friendlier AI gets, the less cognitive friction we encounter, and the less we grow. The fix isn’t avoiding the tools. It’s calibrating how much slack they get to give us.

I try to use agents the way you’d teach someone to ride a bike. Hand on the seat at first. Then running alongside. Then standing back. The prompt that hands me the full answer is the one I lean on hardest. The prompt that gives me just enough to figure the rest out myself is the one I learn from.

Practically: use agents for prep work. Let them gather context, summarize logs, point at suspicious files, propose test cases, draft boring scaffolding. Great. Then stop. Close the window. Rebuild the model in our own head.

Read the code ourselves. Trace the behavior ourselves. Argue with the design a little. Sit with the part that feels slippery.

If we’re making a call that depends on judgment, we shouldn’t outsource the formation of that judgment to a relay race of summaries.

The team version is uglier. If everyone is orchestrating from summaries, alignment starts to look better than it is. Standups sound crisp. PRs move. Planning docs accumulate. Meanwhile, nobody can quite explain the system without opening three transcripts and a diagram an agent made last Tuesday. That is not knowledge. It is memory with a search box.

The tools are getting smoother. That is the point, commercially. Less effort, more output. The danger is not that they will get good enough to replace us. The danger is that they will get good enough that we stop noticing we are being replaced by a thinner version of ourselves, one polished summary at a time.

The point is not to touch more threads per hour. The point is to understand at least one important thing well enough that our approval actually means something.

Because that’s the real risk here: not that agents will occasionally be wrong, but that a workflow built around constant orchestration will quietly train us out of the very cognitive habits needed to notice when they are. The degradation is subtle. No alarm goes off. We still feel productive. We’re shipping. Tabs are flying. Everyone looks busy.

And meanwhile the one part of the system that still has to handle ambiguity, the human part, gets softer.

I don’t know how to measure that. Nobody is shipping a dashboard for it. But I know what it feels like to spend a day orchestrating six agents and realize at 5pm that I never actually thought hard about anything.

That’s not efficiency. That’s atrophy with a clean UI.