What I Learned Building My Own Coding Agent

Thorsten Ball just dropped a post called How To Build An Agent where he argues that a coding agent is just “an LLM, a loop, and enough tokens.” He’s right. You can build one in 200 lines of Go. Go read it. It’s the cleanest articulation of the core idea I’ve seen.

I read it and nodded along for the first half, and then started yelling at my screen for the second half.

I’ve been elbow-deep in this category of software, building Nomad, a coding agent I’m putting together for fun (and maybe more later). Building one of these things teaches you a lot very quickly, and Thorsten’s post triggered every “yes, but…” reflex I have.

Here’s the take: Thorsten is right about the first 5% and dangerously misleading about the other 95%.

The minimum viable agent is genuinely trivial. The gap between “minimum viable” and “an agent you’d actually trust to edit your codebase at 2am while you sleep” is where all the engineering lives. This post is a brain dump of what’s in that gap.

The naked loop is a money fire

Here’s what the Thorsten loop looks like, simplified:

while True:
    response = llm.stream(messages)
    if response.has_tool_calls:
        results = execute_tools(response.tool_calls)
        messages.append(results)
    else:
        return response.text

Beautiful. Elegant. Will set your credit card on fire within 24 hours.

Here’s what actually happens when you run this on real tasks:

  • The model gets stuck in a doom loop calling Edit on the same file with slightly different arguments forever, racking up $40 in 10 minutes.
  • A tool returns an error and the model retries the exact same call five times in a row before giving up. Or worse, doesn’t give up.
  • The conversation grows to 200K tokens, hits the context limit, and the model starts hallucinating tools that don’t exist.
  • The user hits Ctrl+C and the agent finishes its 30-second turn anyway because nothing was wired up to actually cancel.

My agent loop ended up at ~330 lines, not 50. The extra 280 lines are:

  • Doom-loop detection: same tool + same args 3x in a row → break the loop and inject “you’ve called this tool with these arguments three times. Try a different approach.”
  • Request cap: max 50 model calls per turn, configurable. Past that, you’ve lost the plot.
  • Error retry budgeting: track per-tool error counts, inject guidance (“this tool has failed 3 times. Read the file first instead of guessing”).
  • AbortSignal everywhere: every tool execution, every API call, every stream parse, all cancellable. Ctrl+C should mean Ctrl+C.
  • Transformer pipeline: composable request transforms with .when() guards, because every provider has its own dumb format quirks (more on this below).
  • Smart compaction: when context fills up, summarize old turns but preserve reasoning chains. Losing the model’s chain-of-thought mid-task makes it stupider on the next turn.
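To make the first two guards concrete, here’s a minimal sketch of doom-loop detection and the request cap. Every name in it (ToolCall, fingerprint, isDoomLoop) is illustrative, not Nomad’s actual API:

```typescript
// Sketch of doom-loop detection plus a per-turn request cap.
// All names here are illustrative, not Nomad's actual API.

interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

const MAX_REQUESTS_PER_TURN = 50;
const DOOM_THRESHOLD = 3;

// Canonicalize a call so {a: 1, b: 2} and {b: 2, a: 1} fingerprint the same.
function fingerprint(call: ToolCall): string {
  const sorted = Object.fromEntries(
    Object.entries(call.args).sort(([a], [b]) => a.localeCompare(b)),
  );
  return `${call.name}:${JSON.stringify(sorted)}`;
}

// True when the last n calls are the same tool with the same args.
function isDoomLoop(history: ToolCall[], n = DOOM_THRESHOLD): boolean {
  if (history.length < n) return false;
  const tail = history.slice(-n).map(fingerprint);
  return tail.every((fp) => fp === tail[0]);
}
```

In the loop, a hit on isDoomLoop triggers the injected “try a different approach” message, and a plain counter checked against MAX_REQUESTS_PER_TURN ends the turn outright.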

None of this is in any “build an agent” tutorial. All of it is the difference between a toy and a tool.

“Provider-agnostic” is easier than you think

Every coding agent on the market is “provider-agnostic.” In practice this usually means: works great with Claude, kinda works with GPT, breaks weirdly with anything else.

I wanted Nomad to actually work with whatever model I point it at: cheap models for cheap tasks, smart models for hard ones. Specifically, I wanted Kimi K2.5, because it’s stupidly good at tool calling for the price.

Here’s what I found out: you only need two provider adapters to cover 110+ providers.

  • anthropic.ts: Claude’s native Messages API. ~200 lines.
  • openai-compat.ts: OpenAI-compatible adapter. Covers GPT, Kimi, Deepseek, OpenRouter, Groq, Together, Fireworks, every local llama.cpp server, and ~100 more. ~200 lines.

That’s it. Two adapters. The OpenAI-compat protocol is the de facto standard and nearly everyone implements it. The dynamic model catalog is fetched from models.dev: 4,171 models, 110 providers, refreshed daily. I don’t hardcode model definitions because the catalog changes every week and I refuse to ship updates every time someone releases a new fine-tune.
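To show why one adapter stretches that far, here’s a sketch of the request the openai-compat path builds. The helper name and types are mine, but the endpoint path and Bearer auth are the shared convention across these providers:

```typescript
// Sketch: the request shape nearly every OpenAI-compatible provider accepts.
// buildChatRequest and ChatRequest are illustrative names, not Nomad's API.

interface ChatRequest {
  model: string;
  messages: { role: "system" | "user" | "assistant" | "tool"; content: string }[];
  tools?: unknown[];
  stream?: boolean;
}

function buildChatRequest(baseUrl: string, apiKey: string, req: ChatRequest) {
  return {
    // Same path on Kimi, Deepseek, Groq, OpenRouter, a local llama.cpp, ...
    url: `${baseUrl.replace(/\/+$/, "")}/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "content-type": "application/json",
        authorization: `Bearer ${apiKey}`, // same auth scheme everywhere
      },
      body: JSON.stringify(req),
    },
  };
}
```

Swapping providers then means changing the base URL, the model string, and the key. The adapter code doesn’t change.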

But here’s the gotcha that ate a frustrating amount of my life: tool_use IDs aren’t standardized across providers.

  • Claude: toolu_01ABC123XYZ
  • Kimi: Tool:0, functions.Edit:3
  • GPT: call_abc123

When you resume on Claude a session that was originally run on Kimi, Claude’s API rejects the request with a 400, because the IDs contain colons and dots, which violate ^[a-zA-Z0-9_-]+$. The fix is one function, sanitizeToolUseIds(), run at API-call time to replace invalid characters with underscores. One day to find the bug. Five minutes to fix. That ratio is normal in agent work.
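A sketch of that sanitizer, reduced to the per-ID replacement (the real one also walks the message history to apply it everywhere an ID appears):

```typescript
// Claude's Messages API requires tool_use IDs matching ^[a-zA-Z0-9_-]+$.
const VALID_TOOL_USE_ID = /^[a-zA-Z0-9_-]+$/;

// Replace anything outside that set (Kimi's colons and dots) with "_".
function sanitizeToolUseId(id: string): string {
  return VALID_TOOL_USE_ID.test(id) ? id : id.replace(/[^a-zA-Z0-9_-]/g, "_");
}
```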

Tools are documentation, not code

The single biggest mental shift from building Nomad: the tool’s description field matters more than its implementation.

I have four tools: Read, Write, Edit, Bash. Each is ~300 lines of TypeScript. Boring, mechanical, completely standard. The model has never been confused about what they do, and that’s only because each tool’s description includes:

  • What the tool does in one sentence
  • What it’s good at
  • What it’s NOT good at (e.g. “do not use Bash to read files, use Read”)
  • A worked example
  • Common failure modes and how to recover

The Edit tool’s description alone is longer than the Edit tool’s implementation. Because the implementation is “do a string replacement, validate uniqueness, return a diff.” But the description is teaching the model when to reach for this tool versus Write or Bash, and that’s the actual hard problem.
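Here’s the shape of the thing, with an invented description. This is not Nomad’s actual Edit prompt, just the structure the checklist above produces:

```typescript
// Illustrative tool definition. The description carries the weight;
// the handler behind it is a boring string replacement.
const editTool = {
  name: "Edit",
  description: [
    "Replace one exact string in a file with another.",
    "Good for: surgical changes when you already know the exact current text.",
    "NOT good for: creating files or bulk rewrites -- use Write for those.",
    "Do not use Bash (sed/awk) to edit files; use this tool.",
    "Example: Edit(path='src/app.ts', old='port = 3000', new='port = 8080')",
    "The old string must match exactly once. If it matches zero times or",
    "more than once, Read the file and include more surrounding context.",
  ].join("\n"),
};
```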

I spent more time iterating on tool descriptions than on tool code. By like 5x.

Don’t test the model

I wrote 26 tests for Nomad. Three of them are “agentic”, meaning they hit a real model and assert on real behavior:

  1. Agent creates a file when asked
  2. Agent reads a file and answers correctly
  3. Agent resumes a session and continues the conversation

That’s it. Three. Not three hundred.

I almost wrote more. I had this beautiful spec with 10 example multi-turn scenarios: scaffold a project, debug a failing test, refactor a module, explore a codebase. I considered turning each one into a test.

Then I sat with it and realized: those tests would test Kimi K2.5, not my agent. If Kimi has a bad day or OpenRouter changes routing, my CI fails for reasons unrelated to my code. I’d be debugging provider flakiness in perpetuity.

The three tests I kept all assert on deterministic agent-loop behavior: did a file get created? did the session ID match? did the message count grow? Things that are about my code, not about whether the LLM is feeling smart today.
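The same principle, sketched with a scripted stand-in for the model so the example stays self-contained. fakeModel and runTurn are illustrative; the real tests hit a live model, but they assert the same kinds of facts:

```typescript
// Assert on what the loop did (a file exists), not on what the model said.
type ToolCall = { name: string; args: { path?: string; content?: string } };
type Step = { toolCalls?: ToolCall[]; text?: string };

// A scripted "model" that replays canned steps in order.
function fakeModel(script: Step[]): () => Step {
  let i = 0;
  return () => script[i++] ?? { text: "done" };
}

// Deterministic mini-loop: execute Write calls until the model emits text.
function runTurn(model: () => Step, files: Map<string, string>): string {
  for (;;) {
    const step = model();
    for (const call of step.toolCalls ?? []) {
      if (call.name === "Write" && call.args.path !== undefined) {
        files.set(call.args.path, call.args.content ?? "");
      }
    }
    if (step.text !== undefined) return step.text;
  }
}
```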

For the rest (model quality, tool composition, multi-turn reasoning) there’s Terminal Bench. That’s literally what it’s for. Don’t reinvent the benchmark in your test suite.

The “elbow grease” is the entire product

Thorsten’s article ends with a line I keep coming back to:

“The real engineering challenge lies not in core concepts but in practical engineering and elbow grease.”

He waves at this like it’s a footnote. It’s the entire moat.

Forge is #1 on Terminal Bench at 81.8%. Their loop is not 100x cleverer than mine. What they have is years of elbow grease: prompt engineering, tool descriptions, error recovery patterns, sub-agent orchestration, model routing heuristics, telemetry that tells them which prompts fail and why.

You can copy their architecture in a weekend. You cannot copy their elbow grease in a weekend. Or a month. Or a year.

The same is true for the agent built into your favorite IDE. Its loop is, mechanically, what Thorsten described. Its prompt is a multi-thousand-token treatise on how to be a good engineer, refined over countless internal dogfood sessions. The prompt is the product.

This was hard for me to internalize because I’m an architecture nerd. I want to believe the clean abstraction is what matters. It’s not. The clean abstraction is the precondition for being able to iterate on the prompt and the tool descriptions, which is where the actual quality comes from.

What I’d tell past me

If I could go back to day one and slip myself a note, it would say:

  1. Build the loop in an afternoon. Don’t overthink it. Thorsten’s right about this part.
  2. Then spend forever on everything that isn’t the loop. Doom detection, error recovery, cancellation, compaction, telemetry, prompt iteration. This is the actual work and it never ends.
  3. Use two provider adapters, no more. anthropic and openai-compat. Fight the urge to “abstract this properly”. It’s already abstract enough.
  4. Write fewer tests than you think. Three agentic tests, lots of unit tests for tools and parsers. Don’t try to test the model.
  5. Read your own tool descriptions out loud. If you can’t explain when to use a tool in one sentence, the model can’t either.
  6. Static config beats dynamic config. Every clever runtime config option I added, I eventually wished was just a constant.
  7. The web UI is a renderer. Don’t put business logic in it. I built Nomad with shared React hooks so the CLI (Ink) and Web (DOM) consume the same logic. Best decision I made.
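One way to sketch point 7: a renderer-agnostic store that both the Ink and DOM layers wrap in a hook. SessionStore is an invented name and Nomad’s hooks are fancier, but the split is the same:

```typescript
// Business logic lives here; React (Ink or DOM) only subscribes and renders.
type Listener = () => void;

class SessionStore {
  messages: readonly string[] = [];
  private listeners = new Set<Listener>();

  send(text: string): void {
    this.messages = [...this.messages, text];
    // ...this is where the agent turn would be kicked off.
    for (const l of this.listeners) l();
  }

  subscribe(listener: Listener): () => void {
    this.listeners.add(listener);
    return () => {
      this.listeners.delete(listener);
    };
  }
}

// In either UI:
//   const messages = useSyncExternalStore(store.subscribe.bind(store), () => store.messages);
```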

So is building your own agent worth it?

If you want to ship a commercial product to compete with the incumbents: probably not. They have a giant head start on the elbow grease that actually matters.

If you want to deeply understand what’s in this category of software, why some agents feel magical and others feel broken, what the actual hard problems are: absolutely yes. Reading Thorsten’s article gives you the intuition. Building one yourself gives you the scar tissue. They are not the same thing.

I’m building mine anyway. Nomad isn’t trying to compete with anyone. It runs on the models I want, costs what I want, and I can change anything in 10 minutes because I wrote it. That’s worth a lot to me even if it never beats Forge on a benchmark.

And as a bonus: I now have opinions about coding agents. Strong ones. Annoying ones. The kind you only earn by building the thing yourself.

Which is the real point, isn’t it.


Nomad is a coding agent I’m building. Clean-room implementation in TypeScript + Bun, designed to work with any model that supports tool calling. Just a thing I’m making. If you want to nerd out about agent architecture, email me.