LLM coding agents are moving from novelty to utility. If you want one that can write code, run it, debug failures, and verify with tests—safely—here’s a practical blueprint you can ship fast. For background, see Simon Willison’s write-up on this trend: https://simonwillison.net/2026/Jul/2/llm-coding-agent.
What a modern coding agent should do
- Plan tasks, generate code, and explain its approach in plain English.
- Run code and tests in a locked-down sandbox; read errors and iterate.
- Write or fix unit tests; only declare success when tests pass.
- Propose minimal diffs/patches and create PR-ready summaries.
- Keep cost/time budgets; avoid risky actions without approval.
Minimal architecture you can ship fast
- LLM with tool use: Enable functions for read/write files, run commands, run tests, and apply patches.
- Sandbox runner: Docker (non-root) with CPU/memory/time limits and restricted network egress.
- Workspace: A clean directory the agent can modify, versioned via git to review diffs.
- Test harness: pytest or equivalent to define the “green = done” objective.
- Orchestrator loop: Observe → Act (tool) → Reflect (errors) → Iterate, with retry/backoff.
- Guardrails: Command allowlist, path allowlist, file size caps, and explicit approvals for sensitive actions.
- Telemetry: Structured logs of prompts, tools, outputs, cost, and iteration counts.
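The orchestrator loop above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `propose_and_apply` and `run_tests` are hypothetical callables standing in for "ask the LLM for a patch and apply it" and "run the test harness", and the `StepResult` and `LoopReport` types are invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class StepResult:
    ok: bool      # True when the test run came back green
    output: str   # stdout/stderr text to feed back to the model


@dataclass
class LoopReport:
    success: bool
    iterations: int                      # number of patches attempted
    history: List[str] = field(default_factory=list)


def run_agent_loop(
    propose_and_apply: Callable[[str], None],  # hypothetical: LLM call + patch apply
    run_tests: Callable[[], StepResult],       # hypothetical: sandboxed test run
    max_iterations: int = 5,
) -> LoopReport:
    """Observe -> Act -> Reflect -> Iterate until green or the budget runs out."""
    history: List[str] = []
    feedback = run_tests()                     # Observe: start from the current test state
    for attempt in range(1, max_iterations + 1):
        if feedback.ok:
            return LoopReport(True, attempt - 1, history)
        propose_and_apply(feedback.output)     # Act: patch based on the last errors
        feedback = run_tests()                 # Reflect: re-run tests, capture new errors
        history.append(feedback.output)
    return LoopReport(feedback.ok, max_iterations, history)
```

Capping iterations in the loop itself, rather than trusting the model to stop, is what keeps the cost and time budgets enforceable. Retry/backoff for transient tool failures is left out here to keep the sketch short.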
Quickstart (Python + Docker)
- Create a base image: python:3.11-slim with build tools you need. Run as a non-root user and set the working directory to /workspace.
- Lock it down: set ulimits; use Docker's --cpus, --memory, and --pids-limit flags; disable outbound network unless explicitly needed.
- Expose tools: run_code (bash with timeout), run_tests (pytest -q), read_file, write_file, list_files, apply_patch (unified diff).
- Enforce timeouts: e.g., 30s for code runs, 90s for test runs; kill processes on timeout and return stderr.
- Prefer patches over full rewrites; commit after each passing iteration for traceability.
- Make tests first-class: seed with at least one failing test that encodes the task; success = all green.
- Persist context: summarize state after each iteration to keep the LLM focused and cut token spend.
- Human-in-the-loop: require approval for package installs or network access.
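As a rough sketch of the lock-down and timeout steps above, the helper below builds a `docker run` command line with resource limits and wraps execution in a hard timeout. The image name `agent-sandbox`, the UID `1000:1000`, and the default limit values are assumptions chosen for illustration; the function names are hypothetical.

```python
import subprocess
from typing import Dict, List, Optional


def docker_run_argv(
    image: str,
    workspace: str,
    command: List[str],
    cpus: float = 1.0,
    memory: str = "512m",
    pids_limit: int = 128,
) -> List[str]:
    """Build a locked-down `docker run` invocation (values are illustrative)."""
    return [
        "docker", "run", "--rm",
        "--network", "none",             # default-deny outbound network
        "--cpus", str(cpus),
        "--memory", memory,
        "--pids-limit", str(pids_limit),
        "--user", "1000:1000",           # assumed non-root UID:GID
        "-v", f"{workspace}:/workspace",
        "-w", "/workspace",
        image, *command,
    ]


def run_with_timeout(argv: List[str], timeout_s: float) -> Dict[str, Optional[object]]:
    """Run a command; on timeout the child process is killed and reported."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return {"exit_code": None, "timed_out": True,
                "stdout": "", "stderr": f"timed out after {timeout_s}s"}
    return {"exit_code": proc.returncode, "timed_out": False,
            "stdout": proc.stdout, "stderr": proc.stderr}
```

A test run would then look like `run_with_timeout(docker_run_argv("agent-sandbox", "/tmp/ws", ["pytest", "-q"]), 90)`. One caveat: the timeout kills the local `docker` client process, and the container itself may keep running, so a fuller implementation would probably name the container and stop it explicitly.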
Safety guardrails that matter
- Network policy: default deny egress; allow specific package mirrors only if needed.
- Resource quotas: strict CPU, memory, file descriptors, and process limits.
- Filesystem jail: path allowlist under /workspace; block writes elsewhere.
- Secrets hygiene: clear env vars; scan outputs to prevent leaking tokens or keys.
- Command allowlist: python, pytest, grep, sed, diff, patch; block apt, curl, ssh, and background daemons.
- Cost/time budgets: stop after N iterations or $ budget; surface partial progress with diffs.
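Two of these guardrails, the filesystem jail and the command allowlist, are easy to prototype on the orchestrator side. The sketch below is one possible shape, assuming the tools receive paths and command strings from the model; the function names are hypothetical, and the allowlist simply mirrors the commands named above.

```python
import shlex
from pathlib import Path
from typing import List, Optional

WORKSPACE = Path("/workspace")
ALLOWED_COMMANDS = {"python", "pytest", "grep", "sed", "diff", "patch"}


def is_path_allowed(candidate: str, root: Path = WORKSPACE) -> bool:
    """True only if the path, after resolving '..' and symlinks, stays under root."""
    resolved = (root / candidate).resolve()   # an absolute candidate replaces root here
    return resolved.is_relative_to(root.resolve())


def parse_allowed_command(command_line: str) -> Optional[List[str]]:
    """Return an argv list if the executable is allowlisted, else None.

    The argv is meant to be executed WITHOUT a shell, so characters such as
    ';' or '|' are passed to the program as literal arguments, not interpreted.
    """
    try:
        argv = shlex.split(command_line)
    except ValueError:                        # e.g. an unclosed quote
        return None
    if not argv or argv[0] not in ALLOWED_COMMANDS:
        return None
    return argv
```

An allowlist like this is only one layer: `python` alone can do nearly anything, so the network policy and resource quotas in the sandbox still carry most of the weight.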
Measure what matters
- Pass rate: percentage of tasks where all tests pass.
- Time-to-green: median time from start to passing tests.
- Iterations per task and tool success rate.
- Token and compute cost per successful task.
- Reproducibility: how often the same prompt on the same repo produces the same result.
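Most of these metrics fall out of a simple per-task log. The sketch below assumes a hypothetical `TaskRun` record written by the telemetry layer, and it interprets "cost per successful task" as total spend across all runs divided by the number of successes, which is one reasonable definition among several.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List, Optional


@dataclass
class TaskRun:
    task_id: str
    passed: bool       # all tests green at the end of the run
    seconds: float     # wall-clock time from start to finish
    iterations: int
    cost_usd: float    # token + compute cost attributed to this run


def summarize(runs: List[TaskRun]) -> Dict[str, Optional[float]]:
    """Aggregate pass rate, time-to-green, iterations, and cost per success."""
    passed = [r for r in runs if r.passed]
    total_cost = sum(r.cost_usd for r in runs)
    return {
        "pass_rate": len(passed) / len(runs) if runs else 0.0,
        "median_time_to_green_s": median(r.seconds for r in passed) if passed else None,
        "mean_iterations": sum(r.iterations for r in runs) / len(runs) if runs else 0.0,
        # Failed runs still cost money, so they are charged to the successes.
        "cost_per_success_usd": total_cost / len(passed) if passed else None,
    }
```

Charging failed runs against successes makes a low pass rate show up directly in the cost figure instead of hiding it.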
Where this shines
- Maintenance: fix bugs from failing tests; generate minimal patches with clear PR descriptions.
- Greenfield scaffolding: generate starter modules plus tests to lock behavior.
- Refactors: migrate to new APIs with tests guarding regressions.
- Porting: translate small utilities between languages with test parity.
Limitations (and fixes)
- Hallucinated APIs → require the agent to cite docs and run a minimal proof of concept before large changes.
- Flaky tests → pin versions and random seeds; re-run failed tests to confirm.
- Long loops → enforce tight timeouts and iteration caps; summarize aggressively.
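For the flaky-test fix, re-running a failed test a few times is often enough to separate real failures from noise. The sketch below is one way to do it: `classify_failure` and `run_pytest_once` are hypothetical helpers, the 90-second timeout borrows the test-run limit from the quickstart, and the rule "any passing rerun means flaky" is an assumption you may want to tighten.

```python
import subprocess
import sys
from typing import Callable


def run_pytest_once(node_id: str, timeout_s: int = 90) -> bool:
    """Run a single test by node id (e.g. 'tests/test_x.py::test_y'); True if it passes."""
    proc = subprocess.run(
        [sys.executable, "-m", "pytest", "-q", node_id],
        capture_output=True, text=True, timeout=timeout_s,
    )
    return proc.returncode == 0   # pytest exits with 0 only when all selected tests pass


def classify_failure(run_once: Callable[[], bool], reruns: int = 3) -> str:
    """Re-run a test that already failed once and label the failure.

    'failing' -> it never passed on rerun, so treat it as a real failure.
    'flaky'   -> it passed at least once, so the result is nondeterministic.
    """
    passes = sum(1 for _ in range(reruns) if run_once())
    return "failing" if passes == 0 else "flaky"
```

In use, the orchestrator would call something like `classify_failure(lambda: run_pytest_once("tests/test_x.py::test_y"))` and only feed "failing" results back to the model as errors to fix.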
Resources
- Background: Simon Willison on LLM coding agents — https://simonwillison.net/2026/Jul/2/llm-coding-agent
- Function calling & tool use (OpenAI) — https://platform.openai.com/docs/guides/function-calling
- Tool use (Anthropic Claude) — https://docs.anthropic.com/claude/docs/tool-use
- SWE-bench benchmark — https://www.swe-bench.com/
- Docker resource constraints — https://docs.docker.com/config/containers/resource_constraints/
- pytest docs — https://docs.pytest.org/
Takeaway: Treat tests as the contract, keep the agent boxed in, and measure time-to-green. That’s how you get a useful, safe coding agent—fast.