Turn Your Agent Prompts Into a Dataset: DSPy + Datasette for Transparent, Reproducible Agents

Prompts are product logic. Treat them like data. Inspired by Simon Willison’s write-up on cataloging agent prompts with DSPy and Datasette, here’s a practical way to make prompts queryable, versioned, and testable for real-world teams.

Source: DSPy + Datasette agent prompts (Simon Willison)

Why this matters

Reproducibility: Prompts become rows you can diff, query, and roll back.
Evaluation: Log runs and compare prompt variants with DSPy’s optimization loop.
Governance: Track tool access, safety checks, and approvals over time.
Collaboration: Share a browsable prompt catalog via Datasette, so nobody has to guess which prompt shipped.

Build a prompt registry with DSPy + Datasette (5 steps)

Model your prompts as data: task name, system prompt, tool permissions, parameters (temperature/top_p), owner, version, and change notes.
Capture runs automatically: Log input, output, model, token counts, latency, cost, and a run hash. Store to JSONL or SQLite.
Publish with Datasette: Load your table into SQLite and serve it with Datasette so teammates can search, filter, and export.
Add evaluation signals: success labels, automatic checks, regression status, and scores from the metric you evaluate with in DSPy, so prompt variants can be compared side by side.
Close the loop: Use a DSPy optimizer to tune prompts against your eval set, then write the winning prompt back as a new version.

What to log for agents

Prompt fields: system, developer, and user messages; tool schema; guardrail summary.
Run metadata: model ID, temperature, context length used, token counts, latency, and cost.
Tool traces: which tools were called, arguments, results, and errors (with redaction for sensitive data).
Safety signals: jailbreak hits, toxic content flags, and prompt-injection detections.
Outcome labels: pass/fail, score, evaluator notes, and regression status vs. last release.
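As a rough illustration of the fields above, the sketch below models one run as a record and redacts it before it is written out as a JSONL line. The class names, field names, the `SENSITIVE_KEYS` set, and the email pattern are assumptions for this example; real redaction usually needs a broader set of rules than a single regex.

```python
import json
import re
from dataclasses import asdict, dataclass, field
from typing import Optional

# Illustrative redaction rules only; extend them for your own data.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SENSITIVE_KEYS = {"api_key", "password", "token"}


def redact(value):
    """Recursively mask sensitive dict keys and email addresses in strings."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    if isinstance(value, str):
        return EMAIL_RE.sub("[EMAIL]", value)
    return value


@dataclass
class ToolCall:
    name: str
    arguments: dict
    result: str = ""
    error: Optional[str] = None


@dataclass
class RunRecord:
    model: str
    temperature: float
    tokens_in: int
    tokens_out: int
    latency_ms: float
    tool_calls: list = field(default_factory=list)    # list of ToolCall
    safety_flags: list = field(default_factory=list)  # e.g. "prompt_injection"
    outcome: str = "unlabeled"                        # pass / fail / unlabeled

    def to_jsonl(self):
        """Serialize the record as one redacted JSON line."""
        return json.dumps(redact(asdict(self)), sort_keys=True)
```

Redacting at serialization time means the raw values never reach the log file, which matters once the log is loaded into a shared Datasette instance.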

Practical tips

Version everything: Use semantic versions and include a change note for each prompt update.
Separate content from policy: Keep safety and tool policies as distinct fields you can swap without touching task logic.
Prefer structured prompts: Use named sections (Goal, Constraints, Tools, Output schema) to make diffing and tuning easier.
Start with a small eval set: 20–50 representative tasks are often enough to catch obvious regressions before production; grow the set as new failures surface.
Automate snapshots: On each deploy, snapshot prompts + model IDs so you can reproduce customer-visible behavior.
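Two of these tips, structured prompts and deploy snapshots, lend themselves to a short sketch. The section order, the function names, and the 12-character snapshot ID below are assumptions chosen for illustration, not a prescribed format.

```python
import difflib
import hashlib
import json

# Hypothetical fixed section order, so every prompt renders the same way.
SECTION_ORDER = ("Goal", "Constraints", "Tools", "Output schema")


def render_prompt(sections):
    """Render named sections in a fixed order; fail loudly if one is missing."""
    missing = [name for name in SECTION_ORDER if name not in sections]
    if missing:
        raise ValueError(f"missing sections: {missing}")
    return "\n\n".join(
        f"{name}:\n{sections[name].strip()}" for name in SECTION_ORDER
    )


def diff_prompts(old, new):
    """Return a unified diff between two rendered prompts as a list of lines."""
    return list(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile="old", tofile="new", lineterm="",
    ))


def snapshot_id(prompts, model_ids):
    """Derive a short, deterministic ID from the prompts and model IDs in a deploy."""
    payload = json.dumps({"prompts": prompts, "models": model_ids}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
```

Because every prompt renders its sections in the same order, a diff between two versions shows only the lines that actually changed. Storing the snapshot ID alongside each deploy gives a single value to look up when reproducing behavior a customer saw.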

Risks to watch

Prompt injection: Treat external content as untrusted; log and monitor for tool-abuse patterns.
Data leakage: Redact PII from traces before publishing to Datasette.
Evaluation drift: As models change, re-run evals and pin model versions for critical workflows.
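For evaluation drift in particular, a re-run can be compared against the results recorded for the pinned model. The sketch below assumes each eval result is a simple pass/fail per task ID; the function names and the `tolerance` parameter are hypothetical.

```python
def pass_rate(results, task_ids):
    """Fraction of the given tasks that passed."""
    return sum(1 for task in task_ids if results[task]) / len(task_ids)


def regression_report(baseline, candidate, tolerance=0.0):
    """Compare pass/fail results (task ID -> bool) on the tasks both runs share.

    The run counts as regressed when the candidate pass rate falls more than
    `tolerance` below the baseline pass rate.
    """
    shared = sorted(set(baseline) & set(candidate))
    if not shared:
        raise ValueError("no shared tasks to compare")
    base_rate = pass_rate(baseline, shared)
    cand_rate = pass_rate(candidate, shared)
    return {
        "baseline_pass_rate": base_rate,
        "candidate_pass_rate": cand_rate,
        # Tasks that passed before and fail now are the first ones to inspect.
        "newly_failing": [t for t in shared if baseline[t] and not candidate[t]],
        "regressed": cand_rate < base_rate - tolerance,
    }
```

Storing the resulting `regressed` flag and `newly_failing` list next to each prompt version makes the regression status something teammates can filter on rather than reconstruct.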

Get started

DSPy framework: github.com/stanfordnlp/dspy
Datasette: datasette.io
Reference write-up: Simon Willison on DSPy + Datasette prompts

Takeaway: Treat prompts like code and data. Log, version, evaluate, and publish them. Improvements to your agents become measurable, and the agents themselves become far easier to govern.
