
I Let My AI Assistant Run My Entire Dev Pipeline — Here's What Happened

Yesterday morning I told my AI assistant Sully to “keep rolling through the Trello cards” for a project.

Three hours later, 8 PRs were merged. ~3,500 lines of production code. Zero lines written by me.

Not generated-and-reviewed-by-me. Actually shipped. Through a gated pipeline with independent verification at every step.

Let me explain how this works, because it’s not what most people think when they hear “AI writes code.”

The Problem With AI Code Generation

Here’s what everyone does: ask an AI to write code, review it, paste it in. Maybe you use Claude Code or Copilot inline. That’s fine for small stuff.

But when you’re trying to ship features across a real project — Trello cards with acceptance criteria, branches, PRs, code review — the copy-paste loop breaks down. You become the bottleneck.

I wanted to remove myself from the loop entirely. Not from oversight — from execution.

The Full Pipeline — Visualized

[Diagram: AI-Orchestrated Dev Pipeline]

The Architecture

The system has three actors:

  1. Me — I say “work the board” and go make coffee
  2. Sully (orchestrator) — an AI running on a $200 Raspberry Pi via OpenClaw. He manages the whole pipeline
  3. Claude Code (worker) — headless sessions that write the actual code

Sully doesn’t write code. He’s the project manager. He spawns Claude Code sessions, gives them tasks, verifies their work independently, and only merges when everything checks out.

Here’s the flow:

Trello Card → Pre-Flight → Setup → Implement → Ship → Review → Merge → Done

Each phase is a separate, isolated Claude Code session. No session ever sees what the previous one did — they only see the repo state.
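The flow above is essentially a gated loop: each phase spawns a session, and the next phase only runs if an independent check of the repo state passes. A minimal sketch of that loop (the `Phase` type and function names are mine for illustration, not OpenClaw's actual API):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Phase:
    name: str
    run: Callable[[], None]        # spawns an isolated Claude Code session
    verify: Callable[[], bool]     # independent check the orchestrator runs itself

def work_card(phases: list[Phase]) -> str:
    """Run phases in order; halt at the first gate whose verification fails."""
    for phase in phases:
        phase.run()
        if not phase.verify():
            return f"halted at {phase.name}"   # card flagged for human review
    return "done"
```

The important property is that `verify` never looks at the session's own report, only at observable state (git history, open PRs, Trello card status).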

The Phases

Pre-Flight — Sully checks the Trello card exists, has proper structure (a ## Problem heading), and the repo is clean. No session spawned yet.
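Those two pre-flight checks are cheap to express in code. A sketch under my own naming (the real checks go through the Trello API; here the card description is passed in as text):

```python
import subprocess

def card_ok(card_description: str) -> bool:
    """Gate 0a: the card must contain a '## Problem' heading."""
    return any(line.strip().startswith("## Problem")
               for line in card_description.splitlines())

def repo_clean() -> bool:
    """Gate 0b: 'git status --porcelain' prints nothing when the tree is clean."""
    out = subprocess.run(["git", "status", "--porcelain"],
                         capture_output=True, text=True)
    return out.returncode == 0 and out.stdout.strip() == ""
```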

Setup — A Claude Code session fetches the card, moves it to “Doing” on Trello, creates a feature branch. Takes about 60 seconds.

Implement — This is the big one. A fresh Claude Code session gets the full card description and implements it. It uses the /feature-dev plugin (from Anthropic’s official plugin marketplace), which runs a structured 7-phase development workflow with specialized agents — a code explorer analyzes the codebase, a code architect designs the approach, then it implements, tests, and validates. Usually 5-15 minutes.

Ship — Another session verifies acceptance criteria against the diff, runs all project checks, pushes the branch, and creates a PR with a structured body.

Review — This is where it gets interesting. A completely separate session reviews the PR. No self-grading. It launches up to 5 specialized review agents in parallel:

  • Code Reviewer — bugs, logic errors, CLAUDE.md compliance (confidence-scored, only reports findings ≥80% confidence)
  • Silent Failure Hunter — swallowed errors, inadequate error handling
  • Test Analyzer — only runs if test files changed
  • Comment Analyzer — only runs if docs were added
  • Type Design Analyzer — only runs if new types were introduced

The review agents come from the pr-review-toolkit plugin. They run in parallel, and their findings get cross-validated against the actual source files to eliminate false positives.

Merge — Only Sully merges. Never the sessions. He does a final audit (PR size, mergeable status, validation passes), squash merges, moves the card to Done, and notifies me on Telegram.
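The final audit can be driven entirely off `gh pr view --json mergeable,additions,deletions`. A sketch, assuming a size cap I invented for illustration (the actual threshold isn't stated):

```python
import json

MAX_PR_LINES = 1000   # illustrative size cap, not the pipeline's real threshold

def merge_audit(pr_view_json: str) -> bool:
    """Final gate before squash merge, fed the JSON output of
    'gh pr view --json mergeable,additions,deletions'."""
    pr = json.loads(pr_view_json)
    size = pr["additions"] + pr["deletions"]
    return pr["mergeable"] == "MERGEABLE" and size <= MAX_PR_LINES
```

Only if this returns true does the orchestrator run the squash merge, move the Trello card, and send the Telegram notification.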

The Key Insight: Trust Nothing

The entire system is built on one principle: Sully never trusts session output.

Every gate is verified independently with direct commands — git log, gh pr view, Trello API calls. If a session says “I created the PR,” Sully runs gh pr list --head {branch} to confirm. If a session says “tests pass,” Sully runs the test suite himself.
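In practice each gate boils down to parsing the output of those commands rather than trusting the session's summary. A minimal sketch of the "did you really create the PR?" check, using `gh pr list --head <branch> --json number` as the source of truth (helper names are mine):

```python
import json
import subprocess

def pr_exists(gh_json: str) -> bool:
    """'gh pr list ... --json number' prints a JSON array; an empty array
    means the session's 'I created the PR' claim was false."""
    return len(json.loads(gh_json)) > 0

def verify_pr(branch: str) -> bool:
    out = subprocess.run(
        ["gh", "pr", "list", "--head", branch, "--json", "number"],
        capture_output=True, text=True)
    return out.returncode == 0 and pr_exists(out.stdout)
```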

This matters because AI sessions lie. Not maliciously — they just have an optimism bias. They’ll say “done” when they’re 90% done. They’ll say “tests pass” when they ran the wrong test suite. Independent verification catches all of this.

What Happened on March 5th

I was building Meeting Transcriber — a native macOS app that records meetings, transcribes them with whisper.cpp, and generates AI summaries. The Trello board had 13 cards.

By 7:30 AM, 5 cards were already done from previous sessions. I said “keep rolling” and went to work on client stuff.

By 2 PM, all 13 cards were done:

  • Native macOS notifications (UserNotifications framework)
  • DMG installer with code signing and GitHub Actions release pipeline
  • CLI preservation with app bundle integration
  • AI meeting summaries via Claude API
  • Settings UI with macOS Keychain integration
  • Meeting history with search and metadata
  • Two Claude Code plugins

8 PRs merged in a single session. ~3,500 lines of production Swift, TypeScript, shell scripts, and GitHub Actions YAML. Every PR went through the full pipeline — implement, review with parallel agents, independent audit, merge.

What This Isn’t

This isn’t AGI. This isn’t replacing developers. The cards were well-written with clear acceptance criteria, the architecture was already established, and the project had a solid CLAUDE.md with conventions.

This is more like having a very fast, very tireless junior dev who follows instructions precisely, paired with a project manager who never rubber-stamps anything.

The human work was upstream: deciding what to build, writing the cards, establishing the architecture. The execution was automated.

The Stack

  • Orchestrator: OpenClaw on Raspberry Pi 5 ($200)
  • Worker: Claude Code via ACP (Agent Communication Protocol)
  • Plugins: feature-dev and pr-review-toolkit from Anthropic’s plugin marketplace
  • Project management: Trello API
  • Code hosting: GitHub (PRs, Actions)
  • Communication: Telegram (Sully notifies me after each merge)

Total infrastructure cost: the Pi, a Claude Max subscription, and a Trello board.

What I’d Tell You If You’re Considering This

Start with well-structured cards. The pipeline is only as good as the input. A vague card produces vague code.

Use independent verification. If your AI reviews its own code, you have zero actual review. Separate sessions, separate context.

Kill zombie sessions aggressively. AI sessions love to sit idle pretending they’re working. Track commits over time — no new commits in 20 minutes means it’s dead.
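The commit-based liveness check described above is a few lines. A sketch (the 20-minute threshold is from the text; function names are mine):

```python
import subprocess

ZOMBIE_AFTER_S = 20 * 60   # no new commits in 20 minutes => session is dead

def is_zombie(last_commit_unix: int, now_unix: int) -> bool:
    """A session whose branch hasn't advanced in 20 minutes is presumed stuck."""
    return now_unix - last_commit_unix > ZOMBIE_AFTER_S

def branch_last_commit(branch: str) -> int:
    """Unix timestamp of the branch tip, via 'git log -1 --format=%ct <branch>'."""
    out = subprocess.run(["git", "log", "-1", "--format=%ct", branch],
                         capture_output=True, text=True)
    return int(out.stdout.strip())
```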

And most importantly: the goal isn’t to remove humans from software development. It’s to move humans upstream to where they add the most value — architecture, product decisions, quality standards — and let the machines handle the execution.

That’s what happened on March 5th. I made the decisions. Sully ran the pipeline. Claude Code wrote the code. And a meeting transcription app went from 5/13 cards to 13/13 in a single day.