I run an AI loop workflow to fix my own bugs. I won't run one to build features yet.

Tags: AI loop workflow, autonomous bug fixing, root cause analysis, five whys, regression testing

TL;DR: I run an AI loop workflow that watches my systems, and when something breaks it finds the cause, writes the fix, and proves it with a test before anything ships. I trust it with that because a machine can tell when the job is done. A red test goes green, the alert goes quiet, and none of that is a matter of opinion. I don’t trust it to go off and build a whole feature on its own, because the only test that counts there is a real person using the thing and deciding it’s any good, and you can’t automate that yet. So I use AI for one and not the other, and the line between them is the whole point of this post.

Boris Cherny doesn’t write prompts anymore

A clip from an interview with Boris Cherny, who built Claude Code, did the rounds recently. His line was that he doesn’t prompt Claude anymore; his job is to write loops. Peter Steinberger said something close around the same time: you shouldn’t be prompting coding agents at all, you should be designing the loops that prompt them.

Boris Cherny on Acquired Unplugged (WorkOS): "I don't prompt Claude anymore, my job is to write loops."

The point underneath it is real. The level you work at keeps moving up, from punch cards to assembly to a language like Python to prompting a model in plain English. Writing the loop that drives the model is the next rung up. You stop typing each instruction and start building the thing that decides what the next instruction should be.

That’s the part I want to talk about, because I’ve been running something like it for a while. But two things get flattened when the idea gets passed around, and both of them are where the actual work lives.

This is not a Ralph Wiggum loop

There’s a cheap version of this that people reach for first. Geoffrey Huntley named it the Ralph Wiggum technique. You write one prompt, drop it in a while true loop, and run the agent over and over until it passes some check. It works more often than it has any right to, and it’s funny that it does, but it isn’t what I mean when I say a loop.

A loop worth trusting isn’t one prompt on repeat. It’s everything wrapped around the prompt. Boris said as much himself: the inputs, the rules, the tools it can reach, the memory it carries, the feedback it gets back, the conditions that make it stop, and the gate where a human signs off. Strip those away and you have a model guessing in a circle. Keep them and you have something closer to a software development lifecycle that happens to run on its own. Detect the problem, work out the cause, fix it, test it, ship it, with a check at every step.
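
To make the difference from a bare retry concrete, here is a minimal sketch of a loop with explicit stop conditions and a sign-off gate. It is an illustration, not the code behind this post: `run_loop`, `LoopResult`, the three callbacks, and the default limits are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoopResult:
    status: str    # "approved", "rejected", "over_budget", or "gave_up"
    attempts: int
    spent: float


def run_loop(
    attempt_fix: Callable[[int], float],  # hypothetical: one agent run, returns its cost
    check_passes: Callable[[], bool],     # hypothetical: the mechanical check (tests, alert)
    human_approves: Callable[[], bool],   # hypothetical: the human sign-off gate
    max_attempts: int = 5,                # assumed limit, not from the post
    budget: float = 10.0,                 # assumed limit, not from the post
) -> LoopResult:
    """Run the agent until one of three stop conditions is hit."""
    spent = 0.0
    for attempt in range(1, max_attempts + 1):
        spent += attempt_fix(attempt)
        if check_passes():
            # A green check is necessary but not sufficient: a human still signs off.
            status = "approved" if human_approves() else "rejected"
            return LoopResult(status, attempt, spent)
        if spent >= budget:
            # Stop before cost runs away on an approach that is not converging.
            return LoopResult("over_budget", attempt, spent)
    return LoopResult("gave_up", max_attempts, spent)


# Usage: a fake agent that only succeeds on its third attempt.
state = {"fixed": False}


def fake_agent(attempt: int) -> float:
    state["fixed"] = attempt >= 3
    return 1.5


result = run_loop(fake_agent, lambda: state["fixed"], lambda: True)
print(result.status, result.attempts, result.spent)
```

A Ralph Wiggum loop is this function with only `check_passes` left in. The attempt cap, the budget, and the approval gate are the parts that turn it from a retry into something closer to a process.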

A system that breaks 50 different ways

Here’s what I run it against. I have a set of systems that scrape data, classify images with a CLIP model, download those images, push everything through an ETL pipeline, and pre-calculate the results so they’re fast to read. It’s millions of records, all of it logged in Postgres, every row timestamped and fingerprinted. When something goes wrong, and something always does, I can see exactly what broke and when.
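
The post does not say how the rows are fingerprinted, so the following is only one plausible shape for it: a content hash over a canonical form of the record, stored next to a UTC timestamp. The function names and the row layout are assumptions made for the example.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Any, Dict


def fingerprint(record: Dict[str, Any]) -> str:
    """Stable SHA-256 over the record's content, independent of key order."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def log_row(record: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap a record with the two fields that make a failure traceable."""
    return {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "fingerprint": fingerprint(record),
        "record": record,
    }


# Usage: the same content always hashes to the same fingerprint.
row = log_row({"image_id": "img-1", "bytes": 0})
print(row["fingerprint"][:12], row["logged_at"])
```

With something like this in place, "what broke and when" becomes a lookup: the timestamp narrows the window and the fingerprint identifies the exact payload that was processed.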

Catching the failures was never the hard part. The hard part is how many of them there are. I’m up to around 50 checks now, and each one exists because something broke once and I didn’t want it coming back quietly. Every check has a note next to it explaining why it’s there and what I decided at the time it went in.
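
One way to keep the note attached to the check is to make it impossible to register a check without one. This is a sketch under that assumption. `HealthCheck`, `register`, `REGISTRY`, and the `COUNTS` stand-in for a real database query are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class HealthCheck:
    name: str
    why: str                    # the note: what broke and what was decided
    probe: Callable[[], bool]   # returns True when the system is healthy


REGISTRY: Dict[str, HealthCheck] = {}


def register(name: str, why: str):
    """Decorator that refuses to register a check without its note."""
    if not why.strip():
        raise ValueError(f"check {name!r} needs a note explaining why it exists")

    def wrap(probe: Callable[[], bool]) -> Callable[[], bool]:
        REGISTRY[name] = HealthCheck(name, why, probe)
        return probe

    return wrap


# Hypothetical stand-in for a query against the logged rows.
COUNTS = {"zero_byte_images": 0}


@register(
    "no_zero_byte_images",
    why="Blank downloads once reached the classifier because one fetch path had no retry.",
)
def no_zero_byte_images() -> bool:
    return COUNTS["zero_byte_images"] == 0


def failing_checks() -> List[HealthCheck]:
    """Run every probe and return the checks that are currently red."""
    return [check for check in REGISTRY.values() if not check.probe()]
```

When a check goes red, the note travels with it, so whoever picks up the failure (a person or an agent) starts with the history instead of a bare name.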

Fifty checks is a lot to sit and watch. That’s the job I wanted off my desk.

Before the loop, I did it by hand

This didn’t start as a loop. It started as a skill and a lot of copy-paste.

A check would fail, I’d take it, drop it into a skill I’d written, and let the agent run. The skill pulls in the context for that specific issue, then tells the agent not to touch any code until it has found the root cause and worked through five whys. Five whys is an old Toyota idea, Sakichi Toyoda’s. You ask why something happened, then why that happened, and you keep going until you’re standing on the real cause instead of the symptom. It turns out it works on a broken data pipeline about as well as it worked on a factory floor.

One I actually hit looked like this. A batch of images came back unclassified. Why? The CLIP model was handed blank files. Why? The download step had saved zero-byte images. Why? The source had started returning 429s and nothing was retrying them. Why? There was a retry wrapper, but that one fetch path had been added later and never got wired into it. The root cause was a single missing retry, and the fix was tiny. The regression test was the real output: it asserts that a zero-byte download can never be queued for classification again, so that exact failure can’t come back without tripping an alarm.
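
A regression test of that kind might look like the sketch below. The real queue and test are not shown in this post, so `ClassificationQueue`, `EmptyDownloadError`, and the choice to put the guard at the queue boundary are assumptions made for illustration.

```python
from typing import List


class EmptyDownloadError(ValueError):
    """Raised when a download produced no bytes."""


class ClassificationQueue:
    """Hypothetical in-memory stand-in for the real classification queue."""

    def __init__(self) -> None:
        self.items: List[str] = []

    def enqueue(self, image_id: str, payload: bytes) -> None:
        # The guard: a blank file is rejected at the queue boundary,
        # whichever fetch path produced it.
        if len(payload) == 0:
            raise EmptyDownloadError(f"{image_id}: zero-byte download")
        self.items.append(image_id)


def test_zero_byte_download_is_never_queued() -> None:
    queue = ClassificationQueue()
    try:
        queue.enqueue("img-1", b"")
    except EmptyDownloadError:
        pass
    else:
        raise AssertionError("zero-byte download was queued")
    assert queue.items == []


def test_real_download_is_queued() -> None:
    queue = ClassificationQueue()
    queue.enqueue("img-2", b"\x89PNG")
    assert queue.items == ["img-2"]
```

The test targets the symptom the classifier saw rather than the missing retry itself. That is deliberate in this sketch: a later fetch path that forgets the retry wrapper would still be caught here.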

I worked like that for a few weeks. Grab the failing check, run the skill, read what it came back with, commit, release. It worked well. It was also still me, sat in the middle of it, one issue at a time.

Taking myself out of the middle

The loop is that same process with me lifted out of the centre of it.

A scheduled prompt already checks the health of the systems on its own, and that part has been running for a while. The next step is the one I’m building toward. When it finds an issue, it opens a ticket for it. The ticket triggers a second automation that picks the work up, runs the same root-cause pass, writes the fix, runs the full test suite, and pushes to main once everything is green. Detect, ticket, fix, test, ship, with the five whys and the regression tests living inside the loop instead of in my hands. I move from the middle to the end.
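
The hand-off between the two automations can be sketched with an in-memory queue standing in for the ticket system. Everything here is hypothetical scaffolding: `Ticket`, `detect`, `work`, and the four callbacks are placeholders for the agent steps and the release step, which are not shown in this post.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List


@dataclass
class Ticket:
    check_name: str
    status: str = "open"   # becomes "shipped" or "needs_human"
    log: List[str] = field(default_factory=list)


def detect(failing: List[str], queue: Deque[Ticket]) -> None:
    """First automation: open one ticket per failing check."""
    for name in failing:
        queue.append(Ticket(name))


def work(
    queue: Deque[Ticket],
    find_root_cause: Callable[[str], str],  # hypothetical agent step (five whys)
    write_fix: Callable[[str], None],       # hypothetical agent step
    suite_is_green: Callable[[], bool],     # hypothetical full test run
    ship: Callable[[Ticket], None],         # hypothetical push to main
) -> List[Ticket]:
    """Second automation: drain the queue, shipping only on a green suite."""
    done: List[Ticket] = []
    while queue:
        ticket = queue.popleft()
        cause = find_root_cause(ticket.check_name)
        ticket.log.append(f"root cause: {cause}")
        write_fix(cause)
        if suite_is_green():
            ship(ticket)
            ticket.status = "shipped"
        else:
            # A red suite never ships; the ticket is left for a person.
            ticket.status = "needs_human"
        done.append(ticket)
    return done
```

The two functions share nothing but the queue, which is the point of the ticket: detection and repair can run on different schedules and fail independently.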

Two loops, not one

This is where I part ways with the people selling it as one clean idea. There are two different loops here, and they don’t carry the same risk.

                          Keep it running          Build the thing
What starts it            a failed health check    a spec I wrote
How you know it’s done    a test goes green        someone uses it and it’s right
Who signs off             a machine, then me       only a human
Safe to loop today        yes                      no

The first one keeps my systems alive, and I trust it. A machine can tell when that job is done, because done means a red test going green and an alert going quiet. The proof is mechanical, so I don’t have to take anything on faith.

The second loop builds features. You write a spec, hand it over, and let the AI build the whole thing and ship it. I won’t close that one yet, and not because it can’t be done. For small, boring changes, a button here, a bit of copy there, it’s genuinely fine. But for anything that matters, the only test that counts is a person using the thing the way a person would and asking whether it’s actually any good. You can’t hand that to a test runner.

There’s also the way I build, which doesn’t fit that loop anyway. I don’t write a long spec and walk away from it. I work toward a goal I can picture and let the path bend as I go, because it always does. A loop that needs the whole feature pinned down before it starts is the opposite of how I think. So even where it’s possible, it isn’t for me, not yet.

What the loop actually rests on

The loop itself is the easy bit. Anyone can write a while loop. The reason mine doesn’t quietly wreck things is everything sitting underneath it, which is really just Boris’s list made concrete:

  • Monitoring. You can’t fix what you can’t see. The timestamps and fingerprints are what make a failure reproducible instead of a guess.
  • A full test suite for every system. This is the one people skip, and it’s the one that matters most. It proves the fix is real, and it stops the agent calling it done at 80% or breaking three things to fix one. Every fix adds a test, and that test is the stop condition.
  • Automation that can actually do the work. Not suggest a fix in a chat window. Run it, end to end.
  • Read-only access to the data. The agent can read everything and change nothing. It diagnoses without ever being able to break production.
  • A ticket system. This is the wiring. The health check writes a task, the automation reads it, and without it the two halves never meet.
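
The read-only item in the list above is the easiest to show in a few lines. The data in this post lives in Postgres, where this would typically be a database role granted only SELECT. To keep the example runnable with nothing but the standard library, the sketch below swaps in SQLite opened in read-only mode. The table, file name, and `open_read_only` helper are hypothetical.

```python
import os
import sqlite3
import tempfile
from pathlib import Path


def open_read_only(path: str) -> sqlite3.Connection:
    """Open an existing SQLite file so that every write attempt fails."""
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def demo() -> str:
    """Return the error message the agent's connection gets when it tries to write."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "log.db")

        # Set-up through a normal connection (this is not the agent).
        writer = sqlite3.connect(path)
        writer.execute("CREATE TABLE events (id INTEGER, note TEXT)")
        writer.execute("INSERT INTO events VALUES (1, 'zero-byte image')")
        writer.commit()
        writer.close()

        agent = open_read_only(path)
        try:
            # Reading works as usual.
            rows = agent.execute("SELECT note FROM events").fetchall()
            assert rows == [("zero-byte image",)]
            # Writing is refused by the database, not by the agent's good manners.
            try:
                agent.execute("DELETE FROM events")
            except sqlite3.OperationalError as exc:
                return str(exc)
            return "write unexpectedly succeeded"
        finally:
            agent.close()


print(demo())
```

The design choice that matters is where the restriction lives. A prompt that says "do not modify data" is a request, while a connection that cannot write is a guarantee.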

And then there’s me, at the end, reading the fix before it ships. For now that part stays.

Where it can still go wrong

I’m not relaxed about any of this, and a few things keep me in the chair. An agent that games the test instead of fixing the bug. A fix that sails through a test that was wrong in the first place. Cost quietly running away while the thing grinds on an approach that was never going to work. And five whys is good for one contained failure, but it won’t reason its way out of a novel mess tangled across five systems at once.
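
One partial defence against a test that proves nothing is to require that it was red before the fix and green after it. The sketch below shows that check in isolation. The names are hypothetical, and it does not catch every way an agent could game a test; it only rules out tests that never failed in the first place.

```python
from typing import Callable

# Hypothetical signature: returns True if the payload was queued.
Enqueue = Callable[[bytes], bool]


def regression_test(enqueue: Enqueue) -> bool:
    """Passes only if a zero-byte payload is refused."""
    return enqueue(b"") is False


def verify_red_then_green(
    test: Callable[[Enqueue], bool],
    before_fix: Enqueue,
    after_fix: Enqueue,
) -> str:
    if test(before_fix):
        # The test passed on the broken code, so it does not detect the bug.
        return "test never failed: it proves nothing"
    if not test(after_fix):
        return "fix does not make the test pass"
    return "ok"


def broken_enqueue(payload: bytes) -> bool:
    return True  # queues anything, including blank files


def fixed_enqueue(payload: bytes) -> bool:
    return len(payload) > 0


print(verify_red_then_green(regression_test, broken_enqueue, fixed_enqueue))
```

In a real repository the "before" side would be the test run against the commit prior to the fix. A human reading the diff is still the backstop for a test that is red and green for the wrong reasons.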

So the loop runs where I can verify it, and I stay where I can’t. That isn’t caution for its own sake. Right now it’s the only honest place to draw the line.

The part you can’t shortcut

The loop is the easy part. Everyone can write one now, and Boris is right that the interesting work has moved up to designing the loop instead of typing every prompt into it. But the loop is the last thing you add, not the first.

What you can’t shortcut is everything it stands on. The monitoring, the tests, the clean reproducible data, the years of small decisions that turn a failure into something a machine can actually read. And for anything a person has to live with, a person still has to be the one who decides it’s right.


I run Inselnova, a free browser strategy game, on the same kind of AI workflow.