Build a Safe LLM Coding Agent That Writes, Runs, Debugs, and Tests

LLM coding agents are moving from novelty to utility. If you want one that can write code, run it, debug failures, and verify the result with tests, all without putting your machine at risk, here’s a practical blueprint you can ship fast. For background, see Simon Willison’s write-up on this trend: https://simonwillison.net/2026/Jul/2/llm-coding-agent.

What a modern coding agent should do

Plan tasks, generate code, and explain its approach in plain English.
Run code and tests in a locked-down sandbox; read errors and iterate.
Write or fix unit tests; only declare success when tests pass.
Propose minimal diffs/patches and create PR-ready summaries.
Keep cost/time budgets; avoid risky actions without approval.

Minimal architecture you can ship fast

LLM with tool use: Enable functions for read/write files, run commands, run tests, and apply patches.
Sandbox runner: Docker (non-root) with CPU/memory/time limits and restricted network egress.
Workspace: A clean directory the agent can modify, versioned via git to review diffs.
Test harness: pytest or equivalent to define the “green = done” objective.
Orchestrator loop: Observe → Act (tool) → Reflect (errors) → Iterate, with retry/backoff.
Guardrails: Command allowlist, path allowlist, file size caps, and explicit approvals for sensitive actions.
Telemetry: Structured logs of prompts, tools, outputs, cost, and iteration counts.
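
To make the orchestrator loop concrete, here is a minimal sketch of Observe → Act → Reflect → Iterate with an iteration cap. Everything in it is an assumption for illustration: choose_action stands in for your LLM call, the tools are plain Python callables returning (ok, output), and retry/backoff is left out to keep it short.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical tool shape: takes one string argument, returns (ok, output).
Tool = Callable[[str], Tuple[bool, str]]


@dataclass
class AgentState:
    iterations: int = 0
    history: List[str] = field(default_factory=list)
    done: bool = False


def run_agent(
    choose_action: Callable[[List[str]], Tuple[str, str]],
    tools: Dict[str, Tool],
    max_iterations: int = 5,
) -> AgentState:
    """Loop until run_tests succeeds or the iteration budget is spent.

    choose_action(history) is a stand-in for the LLM call; it returns
    (tool_name, argument).
    """
    state = AgentState()
    while state.iterations < max_iterations and not state.done:
        state.iterations += 1
        tool_name, argument = choose_action(state.history)  # observe and decide
        if tool_name not in tools:
            state.history.append(f"error: unknown tool {tool_name!r}")
            continue
        ok, output = tools[tool_name](argument)  # act
        state.history.append(f"{tool_name} ok={ok}: {output}")  # reflect
        if tool_name == "run_tests" and ok:
            state.done = True  # green = done
    return state


# Demo with fake tools: the "model" writes a fix, then runs the tests.
workspace = {"fixed": False}


def fake_write_file(argument: str) -> Tuple[bool, str]:
    workspace["fixed"] = True
    return True, "wrote file"


def fake_run_tests(argument: str) -> Tuple[bool, str]:
    return workspace["fixed"], "1 passed" if workspace["fixed"] else "1 failed"


def fake_model(history: List[str]) -> Tuple[str, str]:
    return ("write_file", "fix") if not history else ("run_tests", "")


final = run_agent(fake_model, {"write_file": fake_write_file, "run_tests": fake_run_tests})
print(final.done, final.iterations)  # True 2
```

The history list doubles as a crude telemetry log; in a real agent you would likely summarize it between iterations rather than pass it along whole.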

Quickstart (Python + Docker)

Create a base image: start from python:3.11-slim and add the build tools you need. Run as a non-root user and set the working directory to /workspace.
Lock it down: set ulimits; use Docker’s --cpus, --memory, and --pids-limit flags; disable outbound network unless explicitly needed.
Expose tools: run_code (bash with timeout), run_tests (pytest -q), read_file, write_file, list_files, apply_patch (unified diff).
Enforce timeouts: e.g., 30s for code runs, 90s for test runs; kill processes on timeout and return stderr.
Prefer patches over full rewrites; commit after each passing iteration for traceability.
Make tests first-class: seed with at least one failing test that encodes the task; success = all green.
Persist context: summarize state after each iteration to keep the LLM focused and cut token spend.
Human-in-the-loop: require approval for package installs or network access.
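
As a rough sketch of the lock-down and timeout steps above, the helper below builds a docker run command line and runs any command with a hard timeout. The image name agent-sandbox:latest, the limit values, the UID 1000:1000, and the exit code 124 on timeout are all assumptions chosen for illustration, not recommendations.

```python
import subprocess
import sys
from typing import List, Tuple


def docker_argv(workspace: str, command: List[str],
                image: str = "agent-sandbox:latest") -> List[str]:
    """Build a locked-down `docker run` invocation (image name is hypothetical)."""
    return [
        "docker", "run", "--rm",
        "--network", "none",               # default-deny egress
        "--cpus", "1",
        "--memory", "512m",
        "--pids-limit", "128",
        "--ulimit", "nofile=256:256",      # cap open file descriptors
        "--user", "1000:1000",             # non-root (assumed UID:GID)
        "--cap-drop", "ALL",
        "--security-opt", "no-new-privileges",
        "--volume", f"{workspace}:/workspace",
        "--workdir", "/workspace",
        image, *command,
    ]


def run_with_timeout(argv: List[str], timeout_s: float) -> Tuple[int, str, str]:
    """Run argv, kill it on timeout, and return (exit_code, stdout, stderr)."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
        return proc.returncode, proc.stdout, proc.stderr
    except subprocess.TimeoutExpired as exc:
        stderr = exc.stderr or ""
        if isinstance(stderr, bytes):  # partial output may arrive as bytes
            stderr = stderr.decode(errors="replace")
        # 124 is an arbitrary "timed out" code picked for this sketch.
        return 124, "", stderr + f"\ntimed out after {timeout_s}s"


# The demo runs a local Python process so it works without Docker installed.
code, out, err = run_with_timeout([sys.executable, "-c", "print('hello')"], 30)
print(code, out.strip())

# What the run_tests tool might execute (not run here):
print(" ".join(docker_argv("/tmp/agent-workspace", ["pytest", "-q"])))
```

One caveat: killing the docker client process on timeout may leave the container itself running, so a real runner would likely also name the container and stop it explicitly.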

Safety guardrails that matter

Network policy: default deny egress; allow specific package mirrors only if needed.
Resource quotas: strict CPU, memory, file descriptors, and process limits.
Filesystem jail: path allowlist under /workspace; block writes elsewhere.
Secrets hygiene: clear env vars; scan outputs to prevent leaking tokens or keys.
Command allowlist: python, pytest, grep, sed, diff, patch; block apt, curl, ssh, and background daemons.
Cost/time budgets: stop after N iterations or once a dollar budget is spent; surface partial progress with diffs.
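
Three of these guardrails are easy to sketch in code: the command allowlist, the path allowlist, and output redaction. The regex, the metacharacter list, and the function names below are illustrative assumptions. An allowlist by command name is only a coarse filter (python itself can run anything), so the sandbox remains the real boundary.

```python
import re
import shlex
from pathlib import Path

ALLOWED_COMMANDS = {"python", "pytest", "grep", "sed", "diff", "patch"}
WORKSPACE = Path("/workspace")
SHELL_METACHARACTERS = set(";|&`$<>")

# Deliberately simple pattern for this sketch; real secret scanners go further.
SECRET_PATTERN = re.compile(r"(?i)(api[_-]?key|token|secret|password)(\s*[=:]\s*)\S+")


def is_command_allowed(command_line: str) -> bool:
    """Allow only single commands whose first word is on the allowlist."""
    if SHELL_METACHARACTERS & set(command_line):
        return False  # no chaining, pipes, substitution, or redirects
    try:
        parts = shlex.split(command_line)
    except ValueError:  # e.g. unbalanced quotes
        return False
    return bool(parts) and parts[0] in ALLOWED_COMMANDS


def is_path_allowed(candidate: str, root: Path = WORKSPACE) -> bool:
    """True only if candidate resolves to root or something beneath it."""
    root_resolved = root.resolve()
    resolved = (root / candidate).resolve()  # collapses ".." and follows symlinks
    return resolved == root_resolved or root_resolved in resolved.parents


def redact(text: str) -> str:
    """Mask values that look like credentials before they reach logs or the LLM."""
    return SECRET_PATTERN.sub(r"\1\2[REDACTED]", text)


print(is_command_allowed("pytest -q"))               # True
print(is_command_allowed("curl http://example.com"))  # False
print(is_command_allowed("pytest -q && curl x"))      # False
print(is_path_allowed("src/app.py"))                  # True
print(is_path_allowed("../etc/passwd"))               # False
print(redact("GITHUB_TOKEN=abc123 ok"))               # GITHUB_TOKEN=[REDACTED] ok
```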

Measure what matters

Pass rate: percentage of tasks where all tests pass.
Time-to-green: median time from start to passing tests.
Iterations per task and tool success rate.
Token and compute cost per successful task.
Reproducibility: how often the same prompt on the same repo produces the same result.
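
Most of these metrics fall out of a simple per-task record. The sketch below assumes a hypothetical TaskRun shape and reads "cost per successful task" as total spend divided by the number of successes; adjust if you define it differently.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List, Optional


@dataclass
class TaskRun:  # hypothetical record; field names are assumptions
    task_id: str
    passed: bool
    seconds_to_green: Optional[float]  # None when the task never went green
    iterations: int
    cost_usd: float


def summarize(runs: List[TaskRun]) -> Dict[str, Optional[float]]:
    """Compute pass rate, time-to-green, iterations, and cost per success."""
    passed = [r for r in runs if r.passed]
    return {
        "pass_rate": len(passed) / len(runs) if runs else 0.0,
        "median_time_to_green_s": (
            median(r.seconds_to_green for r in passed) if passed else None
        ),
        "mean_iterations": (
            sum(r.iterations for r in runs) / len(runs) if runs else None
        ),
        # Failed runs still cost money, so divide total spend by successes.
        "cost_per_success_usd": (
            sum(r.cost_usd for r in runs) / len(passed) if passed else None
        ),
    }


# Made-up sample data, for illustration only.
runs = [
    TaskRun("fix-parser", True, 40.0, 2, 0.25),
    TaskRun("add-retry", True, 100.0, 4, 0.25),
    TaskRun("port-utils", False, None, 5, 0.50),
]
summary = summarize(runs)
for name, value in summary.items():
    print(f"{name}: {value}")
```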

Where this shines

Maintenance: fix bugs from failing tests; generate minimal patches with clear PR descriptions.
Greenfield scaffolding: generate starter modules plus tests to lock behavior.
Refactors: migrate to new APIs with tests guarding regressions.
Porting: translate small utilities between languages with test parity.

Limitations (and fixes)

Hallucinated APIs → require the agent to cite docs and run a minimal proof of concept before large changes.
Flaky tests → pin versions and random seeds; re-run failed tests to confirm.
Long loops → enforce tight timeouts and iteration caps; summarize aggressively.
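
The flaky-test fix can be sketched as a small classifier: after a test fails once, re-run it a few times and treat any later pass as a sign of flakiness instead of a real regression. The function name, labels, and rerun count are assumptions; run_test stands in for whatever invokes a single test in your sandbox.

```python
from typing import Callable


def classify_failure(run_test: Callable[[], bool], reruns: int = 3) -> str:
    """Call after a test has failed once; run_test() returns True on pass."""
    for _ in range(reruns):
        if run_test():
            return "flaky"  # passed on a re-run, so the first failure is suspect
    return "consistent_failure"  # failed every time: treat as a real bug


# Demo with scripted outcomes in place of real test runs.
outcomes = iter([False, True, False])
print(classify_failure(lambda: next(outcomes)))  # flaky
print(classify_failure(lambda: False))           # consistent_failure
```

Feeding only consistent failures back to the agent keeps it from burning iterations on problems it did not cause.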

Resources

Background: Simon Willison on LLM coding agents — https://simonwillison.net/2026/Jul/2/llm-coding-agent
Function calling & tool use (OpenAI) — https://platform.openai.com/docs/guides/function-calling
Tool use (Anthropic Claude) — https://docs.anthropic.com/claude/docs/tool-use
SWE-bench benchmark — https://www.swe-bench.com/
Docker resource constraints — https://docs.docker.com/config/containers/resource_constraints/
pytest docs — https://docs.pytest.org/

Takeaway: Treat tests as the contract, keep the agent boxed in, and measure time-to-green. That’s how you get a useful, safe coding agent—fast.

