I Don’t Know Rust. My AI Is Rewriting PHP in It Anyway.

A few nights ago I watched my terminal print out a 26 KB WordPress front page — <title>Phargo Test Site</title>, the block-library CSS, “Hello world!” pulled from a SQLite database, and a clean </html> at the bottom. Completely unremarkable output, except for one detail: the PHP engine that served it contains zero lines of PHP’s actual source code. It’s a from-scratch interpreter written in Rust.

Here’s the part I need you to sit with: I don’t know Rust. I have never written a lexer. I could not explain to you what a “tree-walking evaluator” is without reading the Wikipedia article in another tab. If you cornered me at a party and asked how PHP’s garbage collector works, I would fake a phone call.

The engine is called Phargo, and my contribution to it is, roughly, aiming. An AI writes the code. I point it at a target, read what comes back like a medieval king reviewing naval charts — solemn nod, zero comprehension — and type the most powerful phrase in modern software development:

“looks good, continue.”

The Experiment: Radical Honesty as a Build System

Everyone and their houseplant has an AI-built project right now, and every single one comes with the same unfalsifiable claim: “it works!” Works according to whom? The AI that wrote it? The demo that was recorded on the fourth take?

So the whole experiment rests on one idea, borrowed from watching the Bun team drive their JavaScript runtime against real-world test suites: don’t let the AI grade its own homework.

PHP ships with its own test suite — about 22,000 .phpt files written by the PHP internals team over three decades. Tests I didn’t write. Tests the AI didn’t write. Tests that encode every cursed corner of the language, from DateTime daylight-saving math to what exactly var_dump() prints for a float. That suite is the oracle. The scoreboard runs all of it, and the pass rate is auto-generated into the repo after every run.

The number cannot be flattered, negotiated with, or prompted into a better mood. Either bug40261.phpt passes or it doesn’t.
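For anyone who has never opened one: a .phpt file is plain text, split into sections by marker lines such as --TEST--, --FILE-- and --EXPECT--. Here is a minimal sketch of a reader for that layout. It is not Phargo's actual loader; the function names are made up for illustration, and it does nothing the real run-tests.php does beyond splitting a file into named sections.

```rust
use std::collections::BTreeMap;

/// Returns the section name if `line` looks like `--NAME--`
/// (uppercase ASCII letters and underscores only).
fn section_name(line: &str) -> Option<&str> {
    let inner = line.trim_end().strip_prefix("--")?.strip_suffix("--")?;
    let valid = !inner.is_empty()
        && inner.chars().all(|c| c.is_ascii_uppercase() || c == '_');
    valid.then_some(inner)
}

/// Splits the text of a .phpt file into a map of section name -> body.
/// Lines that appear before the first marker are ignored.
pub fn parse_phpt(src: &str) -> BTreeMap<String, String> {
    let mut sections: BTreeMap<String, String> = BTreeMap::new();
    let mut current: Option<String> = None;

    for line in src.lines() {
        if let Some(name) = section_name(line) {
            sections.insert(name.to_string(), String::new());
            current = Some(name.to_string());
        } else if let Some(name) = &current {
            let body = sections.get_mut(name).expect("section inserted above");
            body.push_str(line);
            body.push('\n');
        }
    }
    sections
}
```

Fed a file with a --FILE-- section containing a one-line echo and an --EXPECT-- section containing Hello, parse_phpt returns both bodies keyed by name, and the scoreboard's job reduces to running the first and comparing against the second.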

Current score: 3,844 of 22,037 — 17.4% of the entire upstream PHP test suite. And before you snort-laugh at 17%: the realistic ceiling is around 40–45%, because the rest of the suite tests C extensions (GD, curl, SOAP, intl, MySQL drivers…) that are explicitly out of scope. Within the actual playing field, the climb is very real — it started at zero.
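The arithmetic behind that paragraph, as a small sketch. The only inputs are the numbers quoted above (3,844 passes, 22,037 tests, a 40–45% ceiling); the function names are mine, and "share of the ceiling" is simply my way of restating the score against the in-scope portion of the suite.

```rust
/// `part` as a percentage of `whole`.
pub fn percent(part: f64, whole: f64) -> f64 {
    100.0 * part / whole
}

/// Returns (overall pass rate, share of a 45% ceiling, share of a 40% ceiling),
/// all expressed as percentages.
pub fn scoreboard_summary(passed: u32, total: u32) -> (f64, f64, f64) {
    let overall = percent(f64::from(passed), f64::from(total));
    let share_if_ceiling_is_45 = percent(overall, 45.0);
    let share_if_ceiling_is_40 = percent(overall, 40.0);
    (overall, share_if_ceiling_is_45, share_if_ceiling_is_40)
}
```

Calling scoreboard_summary(3844, 22037) gives an overall rate of about 17.4%, which works out to roughly 39% to 44% of the realistic playing field, depending on where in the 40–45% range the ceiling actually sits.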

My loop as the human is almost embarrassingly thin:

1. The AI runs a failure histogram over the whole corpus to find the biggest cluster of failing tests it can actually fix
2. It implements the thing
3. It runs the ~22,000-test scoreboard (about 7 minutes of fan noise)
4. If the number went up: commit, push, repeat
5. If the number went down: I get to say my other line, “hmm, that regressed, look again”
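The first step in that loop, the failure histogram, is the easiest one to picture in code. A minimal sketch, assuming each failing test has already been boiled down to a short reason string (an error message, a missing function name, whatever the harness extracts). That reduction and every name below are hypothetical; I am not describing how Phargo actually buckets its failures.

```rust
use std::collections::HashMap;

/// Groups failing tests by reason and returns (reason, count) rows,
/// largest cluster first. Ties are broken alphabetically so the
/// output is stable from run to run.
pub fn failure_histogram<'a>(failures: &[(&'a str, &'a str)]) -> Vec<(&'a str, usize)> {
    let mut counts: HashMap<&'a str, usize> = HashMap::new();
    for &(_test_name, reason) in failures {
        *counts.entry(reason).or_insert(0) += 1;
    }

    let mut rows: Vec<(&'a str, usize)> = counts.into_iter().collect();
    rows.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(b.0)));
    rows
}
```

The top row is the target: whichever single missing piece is responsible for the most red tests gets implemented next.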

That’s it. That’s the job. I’ve achieved peak delegation and I’m not even sorry.

The Oracle Cannot Be Bribed. The Harness, However, Lied to My Face.

Early on, the pass rate plateaued in a way that felt wrong. Whole categories of tests — obviously simple ones — kept failing with diffs that looked identical to the expected output. I stared at those diffs like a man staring at two identical photos in a spot-the-difference puzzle, finding nothing.

The difference was invisible because it was literally invisible: carriage returns. The test corpus had been checked out on Windows with CRLF line endings, and our scoreboard compared output byte-for-byte. PHP’s own test runner normalizes line endings before comparing. Ours didn’t. Which means the harness was silently failing essentially every multi-line test in the corpus on line endings alone, and had been for weeks.

One line of normalization code. Hundreds of tests flipped to green instantly.

The lesson tattooed itself onto the project: measure your measurement. Your oracle is only as honest as the plumbing that connects you to it. We now normalize exactly the way run-tests.php does, and every suspicious plateau since has triggered the same question first — is the engine wrong, or is the scoreboard lying?
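The fix really is about this small. A sketch of a comparison that forgives line endings, assuming (as far as I understand run-tests.php) that the normalization amounts to trimming both sides and turning CRLF into LF. The function names are illustrative, not the scoreboard's real ones.

```rust
/// Trims surrounding whitespace and converts Windows line endings to LF.
pub fn normalize(text: &str) -> String {
    text.trim().replace("\r\n", "\n")
}

/// Compares expected and actual output after normalizing both sides,
/// so a CRLF checkout of the corpus cannot fail a test on its own.
pub fn outputs_match(expected: &str, actual: &str) -> bool {
    normalize(expected) == normalize(actual)
}
```

A byte-for-byte comparison says "a\r\nb\r\n" and "a\nb\n" differ; outputs_match says they are the same output, which is what the test author meant.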

PHP’s Test Suite Is a Minefield With a Readme

Here’s something nobody tells you about running somebody else’s 22,000-file test corpus: some of those files are bombs. Not malicious ones — accidental ones. Regression tests for ancient memory bugs that allocate absurd structures, generator tests that expand into infinity, tests that were only ever meant to run inside PHP’s own carefully-fenced CI.

I found this out the way all great discoveries are made: my development machine hard-restarted. Not “the program crashed.” Not “the terminal froze.” The entire computer went black and rebooted, because a generator test convinced our engine to eat every byte of RAM in the house like a rocket-powered shopping cart with no brakes.

The aftermath turned the engine paranoid, and honestly it wears paranoia well:

A capped global allocator — the engine physically cannot allocate more than 6 GiB, no matter how creative the test gets
A step limit, so infinite loops die with an error instead of a space heater
Caps on string sizes, array nodes, output length, generator expansion
A breadcrumb file the scoreboard updates with the current test name, so when something hangs, we know exactly which file to glare at
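To make the first item on that list concrete: Rust lets a program swap in its own global allocator, and a cap is a thin wrapper around the system one. What follows is a sketch of the general technique, not Phargo's allocator. The type name and the bookkeeping are assumptions of mine; only the 6 GiB figure comes from the list above.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Wraps the system allocator and refuses requests past a byte budget.
pub struct CappedAlloc {
    limit: usize,
    used: AtomicUsize,
}

impl CappedAlloc {
    pub const fn new(limit: usize) -> Self {
        Self { limit, used: AtomicUsize::new(0) }
    }

    /// Bytes currently handed out through this allocator.
    pub fn used(&self) -> usize {
        self.used.load(Ordering::Relaxed)
    }

    /// Atomically reserves `bytes` if that keeps us within the limit.
    fn reserve(&self, bytes: usize) -> bool {
        self.used
            .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |current| {
                current.checked_add(bytes).filter(|&next| next <= self.limit)
            })
            .is_ok()
    }
}

unsafe impl GlobalAlloc for CappedAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        if !self.reserve(layout.size()) {
            // Over budget: report failure instead of asking the OS for more.
            return std::ptr::null_mut();
        }
        let ptr = unsafe { System.alloc(layout) };
        if ptr.is_null() {
            // The system itself refused; give the reservation back.
            self.used.fetch_sub(layout.size(), Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) };
        self.used.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

// 6 GiB, as in the list above. Assumes a 64-bit target, where usize can hold it.
#[global_allocator]
static GLOBAL: CappedAlloc = CappedAlloc::new(6 * 1024 * 1024 * 1024);
```

When the global allocator returns null, ordinary Rust collections abort the process rather than limp on, so a runaway test kills the engine instead of the machine. That is a blunt failure mode, but it is a far better one than a black screen.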

None of this is glamorous language-implementation work. All of it is the difference between “research project” and “thing that can safely chew through 22,000 hostile files unattended while I make coffee.”
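The step limit from the list above is even simpler in principle: count every unit of work the interpreter does and bail out when the budget runs dry. A toy sketch, with a hypothetical run_loop standing in for the evaluator's real loop handling; none of these names come from the engine.

```rust
#[derive(Debug, PartialEq)]
pub enum RunError {
    /// Carries the configured limit so the error message can report it.
    StepLimitExceeded(u64),
}

pub struct StepBudget {
    limit: u64,
    remaining: u64,
}

impl StepBudget {
    pub fn new(limit: u64) -> Self {
        Self { limit, remaining: limit }
    }

    /// Call once per evaluated statement or loop iteration.
    pub fn tick(&mut self) -> Result<(), RunError> {
        if self.remaining == 0 {
            return Err(RunError::StepLimitExceeded(self.limit));
        }
        self.remaining -= 1;
        Ok(())
    }
}

/// Stand-in for interpreting a PHP loop. `iterations: None` models
/// `while (true) {}`; `Some(n)` models a loop that ends after n passes.
pub fn run_loop(budget: &mut StepBudget, iterations: Option<u64>) -> Result<u64, RunError> {
    let mut completed = 0u64;
    loop {
        if iterations == Some(completed) {
            return Ok(completed);
        }
        budget.tick()?;
        completed += 1;
    }
}
```

With a budget of 1,000 steps, a ten-iteration loop returns normally and an infinite one comes back as StepLimitExceeded(1000), which the scoreboard can record as an ordinary failure and move on.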

The Features That Were Quietly Lying

My favorite genre of bug — and the corpus finds them relentlessly — is the feature that exists, parses, runs without error, and does absolutely nothing. The Potemkin builtin. Over the months the suite has exposed, among others:

clone — parsed fine, evaluated to NULL. Engine-wide. Every DateTimeImmutable in every test had been silently broken forever, because immutable date math is made of clone
unset($arr[$key]) — a total no-op. The key simply… stayed
trim($str, $charlist) — ignored the charlist argument since the beginning of time and trimmed whitespace anyway
$$variableVariables — didn’t exist
static function variables — didn’t exist
spl_autoload_register() — accepted your autoloader with a warm smile and never called it
catch (\Throwable) — matched nothing, which is a very funny property for a catch-all to have

Each of these would survive a demo. Each would survive a code review by someone (me) who doesn’t read Rust. None of them survived the corpus. That’s the entire thesis of the experiment in one bullet list: I can’t audit the code, so the 22,000 tests audit it for me, with a thoroughness no human reviewer could sustain past lunch.
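To show how innocent a Potemkin builtin looks, here is trim() both ways. This is an illustration I put together, not the engine's code: broken_trim mimics the bug described above, and php_trim is a simplified fix that works on characters rather than bytes and skips PHP's "a..z" range syntax.

```rust
/// The characters PHP's trim() strips when no charlist is given:
/// space, newline, carriage return, tab, vertical tab, NUL.
const PHP_DEFAULT_TRIM: &str = " \n\r\t\u{0B}\0";

/// The Potemkin version: accepts a charlist, then ignores it.
pub fn broken_trim(input: &str, _charlist: Option<&str>) -> String {
    input.trim().to_string()
}

/// Simplified fix: strip any character found in the charlist from both
/// ends, falling back to PHP's default set when none is supplied.
pub fn php_trim(input: &str, charlist: Option<&str>) -> String {
    let set = charlist.unwrap_or(PHP_DEFAULT_TRIM);
    input.trim_matches(|c: char| set.contains(c)).to_string()
}
```

Both functions agree on every call that omits the second argument, which is exactly why the broken one survives a demo. Only a test that passes "x" as the charlist and expects the x's to disappear tells them apart.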

And Then It Served a WordPress Page

The north star was always WordPress — it’s the final boss of PHP compatibility, a codebase old enough to contain sedimentary layers of every PHP idiom since 2003.

Getting wp-load.php to even bootstrap burned through a chain of blockers that reads like a language-lawyer’s fever dream: goto (yes, WordPress’s HTML parser uses goto), str_replace’s by-reference $count parameter, \xNN escapes inside regex character classes, function_exists() being blind to half our builtins. The installer then corrupted its own database because preg_split was ignoring the PREG_SPLIT_DELIM_CAPTURE flag inside wpdb::prepare — a bug four layers below anything I could have diagnosed, found and fixed by the AI while I supervised with the confidence of a man watching open-heart surgery through frosted glass.
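For readers wondering how one ignored flag can corrupt a database: PREG_SPLIT_DELIM_CAPTURE tells preg_split to keep the captured delimiters in the result instead of throwing them away. The sketch below imitates that with a hand-rolled splitter over just %s, %d and %f. The real pattern in wpdb::prepare is considerably more involved, so treat this as a cartoon of the behavior, with a function name I invented.

```rust
/// Splits `query` on the placeholders %s, %d and %f.
/// With `keep_delims` set, each placeholder appears as its own element
/// (the PREG_SPLIT_DELIM_CAPTURE behavior); without it, they vanish.
pub fn split_placeholders(query: &str, keep_delims: bool) -> Vec<String> {
    let bytes = query.as_bytes();
    let mut parts = Vec::new();
    let mut start = 0;
    let mut i = 0;

    while i + 1 < bytes.len() {
        if bytes[i] == b'%' && matches!(bytes[i + 1], b's' | b'd' | b'f') {
            // Text before the placeholder, possibly empty.
            parts.push(query[start..i].to_string());
            if keep_delims {
                parts.push(query[i..i + 2].to_string());
            }
            i += 2;
            start = i;
        } else {
            i += 1;
        }
    }
    // Whatever follows the last placeholder, possibly empty.
    parts.push(query[start..].to_string());
    parts
}
```

For "a = %s AND b = %d", keeping delimiters yields five pieces with %s and %d among them; dropping delimiters yields three, and the code that was supposed to find placeholders and substitute values into them has nothing left to find.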

And then one evening: wp_install() completes. Admin user created, options table populated, three posts in SQLite. The front page renders — real theme, real posts, real permalinks.

Full disclosure, because radical honesty is the whole bit:

✅ Fresh install runs, front page renders from the database
✅ /wp-admin/ renders too — the actual dashboard, without any issues, which frankly surprised me more than the front page did
⚠️ The REST API is unexplored territory
⚠️ It’s currently ~56x slower than real PHP on that page (7.1 s vs 126 ms), though the shiny new bytecode VM already runs micro-benchmarks at 1–3x of PHP 8.5 and is coming for that number next

What This Is Actually About

I went in expecting to learn whether an AI could write a language engine. The surprising answer is that this was never really the question. Of course it can write a lexer — it’s read every lexer ever published. The real question was always: how does a person who can’t verify the code keep the project honest? And the answer turned out to be old-fashioned, boring, and beautiful: tests somebody else wrote, a number that only moves when reality moves, and every result pushed to a public repo whether it flatters us or not.

I still don’t know Rust. The engine is now ~24,000 lines of it. The scoreboard goes up a few dozen tests at a time, the dev log collects war stories, and somewhere out there is a version of this where WordPress runs in your browser on a Rust engine compiled to WASM.

Watch the number climb: github.com/ekinertac/Phargo. Or don’t, and just enjoy knowing that somewhere, a man who cannot write a parser is shipping one.

Oh, and obviously: this post was drafted with an LLM too, then edited by me until it sounded like me. The machine writes, I aim. It’s the whole point.