Claude Sonnet 5 is making headlines, and Simon Willison has a hands-on write‑up worth bookmarking. If you’re considering a switch or pilot, here’s a fast, reproducible way to evaluate any new frontier model—before it touches production. Read Simon’s post: Notes on Claude Sonnet 5.
What to test in 60 minutes
- Coding depth: Implement a small feature from a spec. Track unit‑test pass rate, ability to navigate multi‑file repos, and hallucinated API calls.
- Tool use and structured output: Evaluate function calling, JSON‑only responses, and parallel tool executions. Measure failures to follow schemas.
- Retrieval (mini‑RAG): Provide 5–10 short docs and ask grounded questions. Score citation accuracy and refusal when evidence is missing.
- Reasoning under constraints: Pose multi‑step math/logic problems that demand a strict output format, without adding chain‑of‑thought instructions to the prompt. Verify answers against hidden solutions.
- Latency and cost: Capture end‑to‑end latency, output tokens per second, and estimated $/1K tokens on your typical prompts.
- Safety and abuse resistance: Run jailbreak smoke tests, PII redaction checks, and prompt‑injection attempts during tool calls.
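To make "failures to follow schemas" measurable, here is a minimal sketch of a schema‑adherence check for JSON‑only responses. It uses only the Python standard library; the function names and the `required` key‑to‑type mapping are illustrative assumptions, not part of any vendor SDK. A real harness might use a full JSON Schema validator instead.

```python
import json


def schema_errors(raw: str, required: dict[str, type]) -> list[str]:
    """Return the schema violations found in one JSON-only model response.

    `required` is a hypothetical, simplified schema: key name -> expected type.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Any prose around the JSON ("Sure! Here you go: {...}") lands here.
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    errors = []
    for key, expected in required.items():
        if key not in data:
            errors.append(f"missing key: {key}")
        elif not isinstance(data[key], expected):
            errors.append(f"wrong type for {key}")
    return errors


def failure_rate(responses: list[str], required: dict[str, type]) -> float:
    """Fraction of responses with at least one schema violation."""
    if not responses:
        return 0.0
    failed = sum(1 for raw in responses if schema_errors(raw, required))
    return failed / len(responses)
```

Run it over every raw response from a tool‑use or JSON‑only task and compare the failure rate per model. Note that this simple type check treats an integer `1` as a mismatch for `float`, so loosen the mapping if your schema allows either.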
Lightweight test harness (no heavy infra)
- Create a 20–30 item prompt set that reflects your workload (coding, retrieval, analysis, summarization). Keep a hidden answer key where possible.
- Run each task across 2–3 models with fixed temperature, max tokens, and identical system prompts. Set a seed where the API supports one, so runs are reproducible.
- Log per‑request metrics: input/output tokens, latency, cost, schema‑adherence (valid JSON), and pass/fail against assertions.
- Automate retries and timeouts; flag nondeterministic outputs that break schemas. Save raw transcripts for later error analysis.
- If you prefer off‑the‑shelf tools, try the open‑source LM Evaluation Harness and adapt it with your private tasks.
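The steps above can be sketched in a few dozen lines. This is a minimal illustration, not a definitive harness: `Task`, `Record`, `run_suite`, and `pass_rate` are hypothetical names, and each model is assumed to be wrapped in a plain callable that returns `(text, input_tokens, output_tokens)`. You would write that wrapper yourself around whichever SDK you use, and enforce temperature, max tokens, and timeouts inside it.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    task_id: str
    prompt: str
    check: Callable[[str], bool]  # assertion against the hidden answer key


@dataclass
class Record:
    model: str
    task_id: str
    latency_s: float
    input_tokens: int
    output_tokens: int
    passed: bool
    transcript: str  # raw output, kept for later error analysis


# Assumed wrapper shape: prompt -> (text, input_tokens, output_tokens)
ModelFn = Callable[[str], tuple[str, int, int]]


def run_suite(models: dict[str, ModelFn], tasks: list[Task],
              max_retries: int = 2) -> list[Record]:
    """Run every task on every model and log one Record per (model, task)."""
    records = []
    for name, call in models.items():
        for task in tasks:
            for attempt in range(max_retries + 1):
                start = time.perf_counter()
                try:
                    text, n_in, n_out = call(task.prompt)
                except Exception:
                    # Broad catch for the sketch; retry, then log a failure.
                    if attempt == max_retries:
                        elapsed = time.perf_counter() - start
                        records.append(
                            Record(name, task.task_id, elapsed, 0, 0, False, "ERROR"))
                    continue
                latency = time.perf_counter() - start
                records.append(Record(name, task.task_id, latency,
                                      n_in, n_out, task.check(text), text))
                break
    return records


def pass_rate(records: list[Record], model: str) -> float:
    """Share of a model's records that passed their assertion."""
    rows = [r for r in records if r.model == model]
    return sum(r.passed for r in rows) / len(rows) if rows else 0.0
```

Dump the records to CSV or JSON Lines after each run so you can diff models side by side and reread the raw transcripts when a score looks surprising.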
Decision checklist to switch models
- Quality: Target a win of 5–10 points or more on your task‑specific scorecard, rather than relying on public benchmarks.
- Latency: Meets P95 goals under real concurrency, not just single‑request tests.
- Cost: Improves $/task or unlocks material efficiency (fewer retries, shorter prompts).
- Reliability: Stable schema adherence and tool‑call accuracy across 100+ runs.
- Policy fit: Satisfies your safety, data handling, and compliance requirements.
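The quantitative parts of this checklist can be turned into a small gate. The sketch below is one possible reading, with loud assumptions: the function names are hypothetical, P95 uses the nearest‑rank method, per‑1K‑token prices are placeholders you fill in from your provider's price list, and "cost improves" is simplified to "cost per task is no higher than the baseline". Reliability and policy fit still need human review.

```python
def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    if not values:
        raise ValueError("p95 needs at least one value")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil of 0.95 * n
    return ordered[rank - 1]


def cost_per_task(input_tokens: int, output_tokens: int,
                  usd_per_1k_in: float, usd_per_1k_out: float) -> float:
    """Estimated dollars for one task, given placeholder per-1K-token prices."""
    return (input_tokens / 1000 * usd_per_1k_in
            + output_tokens / 1000 * usd_per_1k_out)


def should_pilot(candidate_score: float, baseline_score: float,
                 latencies_s: list[float], p95_budget_s: float,
                 candidate_cost: float, baseline_cost: float,
                 min_gain: float = 5.0) -> bool:
    """True when quality, latency, and cost all clear their thresholds."""
    return (candidate_score - baseline_score >= min_gain
            and p95(latencies_s) <= p95_budget_s
            and candidate_cost <= baseline_cost)
```

Feed `latencies_s` from a run under realistic concurrency, since single‑request timings tend to understate tail latency.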
Further reading
- Simon Willison’s perspective: Notes on Claude Sonnet 5
- Anthropic’s model update context: Claude 3.5 Sonnet
The takeaway
Don’t switch on vibes or single‑shot demos. Run a tight, task‑specific eval, compare cost and latency, and confirm safety under tool use. If it wins your scorecard, then pilot.

