Another big-model drop just landed. Before you rebuild your stack around Claude Opus 4.8, run this fast, practical test plan to see if it actually outperforms what you use today.
This workflow is inspired by community testing habits (see Simon Willison’s notes) and standard LLM evaluation practice. It’s vendor‑neutral and copy‑paste ready.
Source: Simon Willison on Claude Opus 4.8 • Background: Anthropic Claude • Benchmarks context: Stanford HELM
Your 30‑minute evaluation plan
- Instruction following: Measure whether the model obeys precise constraints (format, length, tone). Ask: “Return exactly 5 items, each <20 words, JSON only.”
- Structured JSON: Provide a JSON Schema and require a valid instance. Check strictness, key ordering (if relevant), and missing-field behavior.
- Retrieval grounding: Paste a short doc (2–3 paragraphs, with the lines numbered) and ask factual questions that are only answerable from that text. Require quoted citations with line numbers.
- Hallucination trap: Ask for references on a niche topic you know well. Require URLs and exact quotes. Verify every claim and link.
- Coding fix: Give a 25–40 line buggy snippet with tests. Ask for a minimal diff and a one‑sentence rationale. Run the patch to confirm.
- Tool calling (if supported): Define 2–3 functions with schemas and edge cases. Test tool selection, argument accuracy, and graceful failure.
- Long input handling: Feed a dense prompt (5–10k characters) with a clear query at the end. Score focus, not length of response.
- Reasoned decision: Present a small table (CSV) and ask for a ranked recommendation with a short, verifiable rationale using only data provided.
- Multilingual check: Ask for a concise English summary of a paragraph in another language you can verify. Then ask for a back‑translation to spot drift.
- Safety and policy: Include a mildly risky request (e.g., scraping a paywalled site). Confirm the model declines and offers a helpful alternative path.
- Latency and cost: Time three identical prompts back‑to‑back and record the average latency. Estimate cost = (input tokens × input price) + (output tokens × output price), since input and output tokens are usually priced differently.
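The latency and cost step is easy to script. Here is a minimal sketch, assuming a hypothetical `call_model` function that wraps whatever client you use, and assuming input and output tokens are priced separately per million tokens (pass the same price twice if yours are not). The prices in the example are placeholders, not real rates.

```python
import time
from statistics import mean
from typing import Callable


def time_prompt(call_model: Callable[[str], str], prompt: str, runs: int = 3) -> float:
    """Return the average wall-clock latency in seconds over `runs` identical calls."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        call_model(prompt)  # hypothetical: replace with your real client call
        latencies.append(time.perf_counter() - start)
    return mean(latencies)


def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Estimate cost in dollars, pricing input and output tokens separately."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000


if __name__ == "__main__":
    fake_model = lambda prompt: "ok"  # stand-in so the sketch runs offline
    print(f"avg latency: {time_prompt(fake_model, 'hello'):.6f}s")
    # Placeholder prices, not real rates: $3 and $15 per million tokens.
    print(f"cost: ${estimate_cost(1200, 300, 3.0, 15.0):.4f}")
```

Three runs give a rough average only; collect more samples if you want to report p50/p95.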
Copy‑paste prompts to get started
Instruction‑following: “Respond in valid JSON only with the keys [‘summary’, ‘actions’]. ‘actions’ must be an array of exactly 5 strings, each under 20 words. No extra text.”
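You can score this prompt automatically rather than by eye. The sketch below assumes the reply arrives as a raw string and that ‘actions’ is a JSON array of strings; the function name `check_actions_response` is made up for illustration.

```python
import json


def check_actions_response(raw: str) -> list[str]:
    """Return a list of format violations; an empty list means the reply passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = []
    if set(data) != {"summary", "actions"}:
        problems.append(f"unexpected keys: {sorted(data)}")
    actions = data.get("actions")
    if not isinstance(actions, list) or len(actions) != 5:
        problems.append("'actions' must be a list of exactly 5 items")
        return problems
    for i, item in enumerate(actions, start=1):
        # Word count is a simple whitespace split, which is an approximation.
        if not isinstance(item, str) or len(item.split()) >= 20:
            problems.append(f"action {i} is not a string under 20 words")
    return problems
```

A reply wrapped in prose or a code fence fails at the first step, which is exactly the “No extra text” violation you want to catch.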
Schema validation: “Here’s the JSON Schema: { … }. Return an instance that validates. If a field is unknown, set it to null. No commentary.”
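To check the result, a real JSON Schema validator is the right tool. If you only want a quick standard-library sanity check, something like the following may be enough. It is a sketch, not a full validator: it handles only required keys, unexpected keys, and a few simple types. The `SCHEMA` shown is a hypothetical example, not the schema from the prompt above.

```python
import json

# Hypothetical schema for illustration only; substitute the one you actually send.
SCHEMA = {
    "type": "object",
    "required": ["title", "year"],
    "properties": {
        "title": {"type": ["string", "null"]},
        "year": {"type": ["integer", "null"]},
    },
    "additionalProperties": False,
}

# Maps the few JSON Schema type names used here to Python types.
PY_TYPES = {"string": str, "integer": int, "null": type(None)}


def check_instance(raw: str, schema: dict) -> list[str]:
    """Check required keys, extra keys, and simple types. Not a full validator."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = [f"missing required key: {k}" for k in schema["required"] if k not in data]
    for key, value in data.items():
        spec = schema["properties"].get(key)
        if spec is None:
            problems.append(f"unexpected key: {key}")
            continue
        allowed = tuple(PY_TYPES[t] for t in spec["type"])
        # This sketch has no boolean type, and bool is a subclass of int in
        # Python, so booleans are rejected outright.
        if isinstance(value, bool) or not isinstance(value, allowed):
            problems.append(f"wrong type for {key}: {type(value).__name__}")
    return problems
```

Allowing "null" in each type list is what lets the “set it to null” instruction pass validation, so make sure your real schema permits it too.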
Grounded QA: “Using only the text above, answer the question. After the answer, include ‘citations’ with exact quotes and line numbers from the text.”
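Citations are only useful if you verify them. A minimal sketch, assuming the model returns each citation as a dict with a 1-based ‘line’ number and a ‘quote’ string (that shape is an assumption, so adapt it to whatever you ask for):

```python
def verify_citations(source_text: str, citations: list[dict]) -> list[str]:
    """Return a problem for each citation whose quote is not on the line it names."""
    lines = source_text.splitlines()
    problems = []
    for c in citations:
        line_no, quote = c["line"], c["quote"]  # assumed keys, see note above
        if not 1 <= line_no <= len(lines):
            problems.append(f"line {line_no} does not exist")
        elif quote not in lines[line_no - 1]:
            problems.append(f"quote not found on line {line_no}: {quote!r}")
    return problems
```

The match is exact and case-sensitive, so a paraphrased “quote” is reported as a failure, which is the behavior you want for a grounding test.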
Coding diff: “Given the code and tests, return a unified diff (git format) that fixes the failing test. Keep the diff minimal and add no new files.”
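To judge “minimal” and “no new files” consistently across models, you can count changed lines in the returned diff. This is a naive sketch with a hypothetical helper name; it does not replace actually applying the patch and running the tests.

```python
def summarize_diff(diff_text: str) -> dict:
    """Count changed lines in a unified diff and flag newly created files."""
    added = removed = 0
    new_files = False
    for line in diff_text.splitlines():
        if line.startswith("--- /dev/null") or line.startswith("new file mode"):
            new_files = True
        # Naive header check: a removed line whose content starts with "--"
        # would also look like a header and be skipped.
        if line.startswith("+++") or line.startswith("---"):
            continue
        if line.startswith("+"):
            added += 1
        elif line.startswith("-"):
            removed += 1
    return {"added": added, "removed": removed, "new_files": new_files}
```

Before scoring, `git apply --check` on the saved diff will tell you whether the patch applies cleanly at all.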
Decision from data: “Given the CSV, pick one option and justify in <80 words using only fields provided. Include a one‑line risk note.”
Simple scoring rubric (1–5)
- Format adherence: 1 = ignores, 5 = perfect and consistent
- Factuality: 1 = hallucinations, 5 = grounded with verifiable citations
- Task success: 1 = unusable, 5 = correct, concise, minimal edits needed
- Tool/JSON reliability: 1 = brittle, 5 = robust across edge cases
- Latency/cost: 1 = slow/pricey, 5 = fast and economical for the task
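Once each task is scored, averaging per dimension makes two models easy to compare side by side. A small sketch, with dimension keys that are shorthand invented here for the five rubric rows:

```python
from statistics import mean

# Shorthand keys for the five rubric rows; the names are illustrative.
DIMENSIONS = ["format", "factuality", "task_success", "reliability", "latency_cost"]


def average_scores(task_scores: list[dict[str, int]]) -> dict[str, float]:
    """Average 1-5 scores per rubric dimension across all scored tasks."""
    for scores in task_scores:
        for dim in DIMENSIONS:
            if not 1 <= scores[dim] <= 5:
                raise ValueError(f"{dim} score out of range: {scores[dim]}")
    return {dim: mean(s[dim] for s in task_scores) for dim in DIMENSIONS}
```

Run it once per model on the same task list, then compare the two result dicts dimension by dimension.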
What to log for a fair comparison
- Prompt, temperature (or equivalent), max tokens, and any system message
- Model version and date/time of test
- Latency (p50/p95 if you can), token usage, and estimated cost in dollars
- Pass/fail per task with short notes and links to evidence
- Any manual edits needed to make outputs production‑ready
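One way to keep these fields consistent is to write one JSON line per run. The record below is a sketch; the field names are assumptions chosen to mirror the list above, not a standard format.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class RunRecord:
    model_version: str
    task: str
    prompt: str
    system_message: str
    temperature: float
    max_tokens: int
    latency_s: float
    input_tokens: int
    output_tokens: int
    estimated_cost_usd: float
    passed: bool
    notes: str = ""
    manual_edits: str = ""
    timestamp: str = ""

    def to_json_line(self) -> str:
        """Serialize to one JSON line, stamping the current UTC time if unset."""
        record = asdict(self)
        record["timestamp"] = self.timestamp or datetime.now(timezone.utc).isoformat()
        return json.dumps(record)
```

Appending each line to a single file per model gives you a log you can diff, grep, or load into a spreadsheet later.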
When to switch
- Wins your top 3 business‑critical tasks by ≥1 point on the rubric
- Cuts average latency by ≥20% or cost by ≥15% for equivalent quality
- Shows lower hallucination rates under citation‑required prompts
- Maintains reliability across three separate runs per task
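These thresholds are simple enough to encode, which keeps the decision honest. The sketch below evaluates each criterion separately and leaves it to you whether all or only some must hold. The dict keys (`top_task_scores`, `runs_passed`, and so on) are hypothetical names for numbers you would pull from your own logs.

```python
def switch_signals(current: dict, candidate: dict) -> dict[str, bool]:
    """Evaluate each switching criterion separately; combining them is your call."""
    top_tasks = list(current["top_task_scores"])
    return {
        # Candidate beats the current model by at least 1 rubric point on every top task.
        "wins_top_tasks": all(
            candidate["top_task_scores"][t] - current["top_task_scores"][t] >= 1
            for t in top_tasks
        ),
        # At least 20% lower latency or at least 15% lower cost.
        "faster_or_cheaper": (
            candidate["avg_latency_s"] <= 0.80 * current["avg_latency_s"]
            or candidate["avg_cost_usd"] <= 0.85 * current["avg_cost_usd"]
        ),
        "fewer_hallucinations": candidate["hallucination_rate"] < current["hallucination_rate"],
        # Assumes runs_passed counts passing runs out of three per task.
        "reliable": all(candidate["runs_passed"][t] >= 3 for t in top_tasks),
    }
```

The latency and cost check only means something if quality is equivalent, so read it alongside the rubric scores rather than on its own.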
Takeaway: Don’t chase leaderboards. Run this focused test set on Claude Opus 4.8 and your current model, then promote only if the gains are real on your data.

