OpenAI showed how to build self-improving tax agents with Codex by letting the model write small programs, run them, check results, and iterate. Here’s the reusable playbook and how to rebuild it with today’s models.
Note: The original Codex API models have been retired. You can replicate this pattern with GPT-4o or GPT-4.1 plus function calling and a secure execution sandbox.
The core loop OpenAI used
- Generate: The model proposes code/tools to parse tax instructions, fill forms, or apply rules.
- Execute: Run code on real examples in a sandboxed environment.
- Evaluate: Compare outputs to tests, ground truth, or heuristics; collect errors.
- Reflect: Have the model explain failures and propose fixes.
- Iterate: Regenerate code with feedback and try again.
- Promote: Keep successful tools/strategies for future tasks.
This tight generate–execute–evaluate loop turns vague instructions into reliable, testable behaviors for complex tax tasks.
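A minimal Python sketch of this loop is shown here, under loud assumptions: the model call is replaced by a hypothetical `generate` stub that replays two canned candidates, the `deductible` function and its test values are invented for illustration, and `exec` stands in for a real sandbox (it is not one).

```python
from typing import Callable, Optional

# Hypothetical stand-in for model output: a buggy first attempt, then a fix.
CANDIDATES = [
    "def deductible(income, rate):\n    return income + rate\n",
    "def deductible(income, rate):\n    return round(income * rate, 2)\n",
]

# Illustrative test cases: (arguments, expected result).
TESTS = [((1000.0, 0.1), 100.0), ((250.0, 0.2), 50.0)]


def generate(attempt: int, feedback: list[str]) -> str:
    """Generate: a real system would prompt a model and include `feedback`."""
    return CANDIDATES[min(attempt, len(CANDIDATES) - 1)]


def execute(source: str) -> Callable:
    """Execute: load the candidate. exec() is NOT a sandbox; isolate it in practice."""
    namespace: dict = {}
    exec(source, namespace)
    return namespace["deductible"]


def evaluate(fn: Callable) -> list[str]:
    """Evaluate: return one message per failing test (empty list means pass)."""
    errors = []
    for args, expected in TESTS:
        try:
            got = fn(*args)
        except Exception as exc:
            errors.append(f"{args}: raised {exc!r}")
            continue
        if got != expected:
            errors.append(f"{args}: expected {expected}, got {got}")
    return errors


def run_loop(max_attempts: int = 3) -> tuple[Optional[str], int]:
    feedback: list[str] = []
    for attempt in range(max_attempts):
        source = generate(attempt, feedback)
        errors = evaluate(execute(source))
        if not errors:
            return source, attempt + 1  # Promote: keep the passing tool
        feedback = errors               # Reflect: feed failures into the next try
    return None, max_attempts           # Budget exhausted


promoted, attempts = run_loop()
print(f"passed after {attempts} attempt(s)" if promoted else "no passing candidate")
```

The budget limit (`max_attempts`) matters as much as the loop itself: without it, a candidate that never passes would iterate forever.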
Blueprint: build your own domain agent
- Domain corpus + retrieval: Index statutes, forms, and FAQs. Use retrieval to ground the model in cited text.
- Task decomposition: Break goals (e.g., “determine deductible amount”) into checkable sub-steps.
- Program synthesis: Ask the model to write small, pure functions for parsing, validation, and calculations.
- Sandboxed execution: Run generated code in a locked container with time/memory limits.
- Automatic evaluator: Create unit tests, gold labels, or rule checkers to score outputs automatically.
- Experience memory: Store prompts, attempts, fixes, and passing solutions; prefer known-good tools first.
- Reflection prompts: After a fail, ask: “Why did this fail? What minimal change would pass the test without breaking others?”
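The sandboxed-execution step can be approximated, for illustration only, by running generated code in a separate Python process with a wall-clock limit. This sketch assumes a hypothetical helper named `run_sandboxed`; it bounds time and isolates the interpreter, but it does not restrict memory, filesystem, or network access, which a locked container would.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_sandboxed(source: str, timeout_s: float = 2.0) -> dict:
    """Run `source` in a child Python process and capture its output.

    Only time is limited here. A production sandbox would also cap memory
    and block network and filesystem access.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(source, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: isolated mode
                capture_output=True,
                text=True,
                timeout=timeout_s,
                cwd=workdir,
            )
        except subprocess.TimeoutExpired:
            return {"ok": False, "stdout": "", "error": "timeout"}
    return {
        "ok": proc.returncode == 0,
        "stdout": proc.stdout.strip(),
        "error": proc.stderr.strip(),
    }


if __name__ == "__main__":
    print(run_sandboxed("print(2 + 3)"))
    print(run_sandboxed("while True: pass", timeout_s=0.5)["error"])
```

The returned `error` text is what a reflection prompt would quote back to the model after a failure.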
Quickstart with modern APIs
- Model: Use GPT-4o/4.1 with function calling to constrain outputs to well-typed tool specs.
- Tools: Provide functions for retrieval, file I/O on whitelisted docs, and code execution in a sandbox.
- Evaluator: Write tests from official examples and edge cases; auto-grade every attempt.
- Controller: A simple loop that routes between generate, execute, evaluate, and reflect until pass or budget limit.
- Logging: Capture prompts, tool calls, code, results, errors, and scores for replay and regression testing.
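As a rough sketch of the tool side, the snippet below declares one tool in the JSON-schema shape that function-calling APIs accept and dispatches a call locally. No API request is made; the tool name `compute_deduction`, its parameters, and the `dispatch` helper are hypothetical, and the validation is deliberately minimal.

```python
import json

# Tool spec in the function-calling shape: a name, a description, and a
# JSON schema for the arguments. The tool itself is a made-up example.
COMPUTE_DEDUCTION_SPEC = {
    "type": "function",
    "function": {
        "name": "compute_deduction",
        "description": "Apply a flat deduction rate to an income amount.",
        "parameters": {
            "type": "object",
            "properties": {
                "income": {"type": "number"},
                "rate": {"type": "number"},
            },
            "required": ["income", "rate"],
            "additionalProperties": False,
        },
    },
}


def compute_deduction(income: float, rate: float) -> float:
    return round(income * rate, 2)


# Maps tool name -> (spec, implementation).
REGISTRY = {"compute_deduction": (COMPUTE_DEDUCTION_SPEC, compute_deduction)}


def dispatch(name: str, arguments_json: str) -> dict:
    """Check a model-issued tool call against its spec, then run it."""
    entry = REGISTRY.get(name)
    if entry is None:
        return {"error": f"unknown tool: {name}"}
    spec, fn = entry
    try:
        args = json.loads(arguments_json)
    except json.JSONDecodeError:
        return {"error": "arguments are not valid JSON"}
    if not isinstance(args, dict):
        return {"error": "arguments must be a JSON object"}
    params = spec["function"]["parameters"]
    missing = [k for k in params["required"] if k not in args]
    extra = [k for k in args if k not in params["properties"]]
    if missing or extra:
        return {"error": f"missing={missing} extra={extra}"}
    if not all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in args.values()
    ):
        return {"error": "all arguments must be numbers"}
    return {"result": fn(**args)}


print(dispatch("compute_deduction", '{"income": 1000, "rate": 0.1}'))
```

Returning errors as data rather than raising lets the controller pass them straight back to the model as feedback.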
Why this works (and evidence)
Grounded tool use and iterative feedback improve reliability, echoing research like ReAct (reasoning + acting) and self-reflection methods.
- OpenAI’s overview: Building self-improving tax agents with Codex.
- ReAct: Synergizing Reasoning and Acting in Language Models (arXiv:2210.03629).
Practical tips
- Keep tasks narrow and testable; ship small tools that compose.
- Constrain outputs via function calling and JSON schemas to reduce drift.
- Instrument success rate, time-to-pass, and regression failures across versions.
- Cite sources in every answer and show the exact lines used.
- Budget time and tokens; stop after diminishing returns.
- Treat PII carefully. Keep all execution and data inside a controlled environment.
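The instrumentation tip can be made concrete with a small sketch over logged attempts. The `Attempt` record, the version labels, and the sample log are all hypothetical; the point is only how success rate, time-to-pass, and regressions could be computed from the same log.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass(frozen=True)
class Attempt:
    version: str    # agent or tool version that ran the task
    task_id: str
    passed: bool
    seconds: float  # wall-clock time for the attempt


def success_rate(log: list[Attempt], version: str) -> float:
    runs = [a for a in log if a.version == version]
    return sum(a.passed for a in runs) / len(runs) if runs else 0.0


def mean_time_to_pass(log: list[Attempt], version: str) -> Optional[float]:
    times = [a.seconds for a in log if a.version == version and a.passed]
    return mean(times) if times else None


def passed_tasks(log: list[Attempt], version: str) -> set[str]:
    return {a.task_id for a in log if a.version == version and a.passed}


def regressions(log: list[Attempt], old: str, new: str) -> set[str]:
    """Tasks that passed under `old`, were attempted under `new`, and no longer pass."""
    attempted_new = {a.task_id for a in log if a.version == new}
    return (passed_tasks(log, old) & attempted_new) - passed_tasks(log, new)


# Invented sample data.
LOG = [
    Attempt("v1", "t1", True, 2.0),
    Attempt("v1", "t2", True, 4.0),
    Attempt("v1", "t3", False, 9.0),
    Attempt("v2", "t1", True, 1.0),
    Attempt("v2", "t2", False, 8.0),
    Attempt("v2", "t3", True, 3.0),
]

print(success_rate(LOG, "v1"), mean_time_to_pass(LOG, "v2"), regressions(LOG, "v1", "v2"))
```

In this sample both versions pass two of three tasks, so the success rate alone would hide that `t2` regressed.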
Limitations to watch
- Evaluation quality: Poor tests teach the wrong behavior.
- Spec drift: Laws, forms, and policies change; schedule re-indexing and re-tests.
- Error compounding: Reflection can overfit fixes to the failing tests; diversify test data and freeze known-good tools.
- Licensing and citations: Ensure you have rights to your corpora and always attribute.
- Operational cost: Sandboxes and grading add latency; batch where possible.
Clear takeaway
The win isn’t magic autonomy. It’s a disciplined loop: generate small tools, run them safely, grade ruthlessly, and keep what works. Do that, and your agent improves itself.

