Everyone loves a jaw-dropping AI demo—until it crumbles outside the spotlight. Inspired by Simon Willison’s note on Kyle Ferrana, here’s a fast, repeatable way to vet claims before you share or ship them.
Why fast vetting matters
Viral AI clips often cherry-pick ideal prompts, hide manual steps, or overfit to narrow tasks. A 5-minute check can save hours of rework—and your credibility.
The 5-minute checklist
- Pin the claim: Write one sentence that states what the system promises. If you can’t, the demo isn’t specific enough to verify.
- Reproduce with fresh inputs: Swap in 3–5 new examples the demo never showed. If performance collapses, the original examples were likely cherry-picked.
- Baseline it: Compare to a simple heuristic or smaller model. If a regex or ruleset gets you close, the “AI magic” may be overstated.
- Probe failure modes: Try adversarial phrasing, longer/shorter inputs, and out-of-domain data. Note misclassifications and hallucinations.
- Check hidden scaffolding: Look for human-in-the-loop steps, pre-labeling, retrieval indexing, or hand-curated examples that won’t scale.
- Verify model + context: Record model name, context window, temperature, and tools. If those are missing, treat claims as anecdotal.
- Cost and latency sanity check: Estimate per-call tokens and end-to-end latency. Great accuracy at unusable cost is not production-ready.
- Data provenance: Check whether the system could be reproducing private docs or test answers rather than solving the task. Ask what the model could have seen during training.
- Log everything: Save prompts, seeds, timestamps, and outputs. You’ll need them when results drift or regress later.
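A minimal sketch of how the fresh-input, baseline, and logging steps above could fit together, assuming a hypothetical "is this a refund request?" task. Everything named here is illustrative: `call_model` is a placeholder for whatever system the demo uses (not a real API), and `regex_baseline`, `FRESH_EXAMPLES`, and the log file name are assumptions made for the example.

```python
import json
import re
import time
from datetime import datetime, timezone

# Hypothetical fresh inputs the demo never showed: (text, expected label).
FRESH_EXAMPLES = [
    ("I want my money back for this order", "refund"),
    ("Where is my package?", "other"),
    ("Please refund the duplicate charge", "refund"),
    ("Can I change my shipping address?", "other"),
]

def regex_baseline(text: str) -> str:
    """Simple keyword heuristic used as the baseline (an assumption for this sketch)."""
    pattern = r"\b(refund|money back|chargeback)\b"
    return "refund" if re.search(pattern, text, re.IGNORECASE) else "other"

def call_model(text: str) -> str:
    """Placeholder for the system under test; replace with a real call."""
    return "refund" if "refund" in text.lower() else "other"

def run_check(examples, predict, name, log, settings=None):
    """Run `predict` on each example, append one JSON line per call to `log`
    (any writable text stream), and return the accuracy."""
    correct = 0
    for text, expected in examples:
        start = time.perf_counter()
        output = predict(text)
        latency_s = time.perf_counter() - start
        ok = output == expected
        correct += int(ok)
        log.write(json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "system": name,
            "settings": settings or {},  # e.g. model name, temperature, seed
            "input": text,
            "expected": expected,
            "output": output,
            "pass": ok,
            "latency_s": round(latency_s, 6),
        }) + "\n")
    return correct / len(examples)

if __name__ == "__main__":
    with open("vetting_log.jsonl", "a", encoding="utf-8") as log:
        base = run_check(FRESH_EXAMPLES, regex_baseline, "regex_baseline", log)
        model = run_check(FRESH_EXAMPLES, call_model, "placeholder_model", log)
    print(f"baseline accuracy: {base:.2f}, model accuracy: {model:.2f}")
```

If the baseline lands close to the model on inputs the demo never showed, that is a signal to look harder at the claim. The JSONL log keeps inputs, outputs, settings, and timestamps together so later runs can be compared line by line.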
Bonus tools
- Quick eval harness: Create a 10–20 item eval set in a CSV and run batch prompts. Track simple pass/fail to avoid vibe-based judging.
- Screen record: Film your repro attempts. A recording makes it harder to quietly tweak prompts between runs, and it documents real-world latency.
- Red-team prompts: Ask the model to justify its answers, cite sources, and explain its uncertainty, then send requests it should refuse. The responses expose guardrail gaps.
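The quick eval harness mentioned above could look something like this sketch. It assumes a CSV with `input` and `expected` columns and counts an item as a pass when the expected string appears in the output (case-insensitive). Both the column names and the pass rule are assumptions for illustration, and `placeholder_model` stands in for the real system under test.

```python
import csv
import io

# Inline sample so the sketch runs as-is; in practice, open a real CSV file.
SAMPLE_CSV = """input,expected
What is 2 + 2?,4
Capital of France?,Paris
Largest planet in our solar system?,Jupiter
"""

def placeholder_model(prompt: str) -> str:
    """Stand-in for the system under test; replace with a real call."""
    canned = {
        "What is 2 + 2?": "The answer is 4.",
        "Capital of France?": "Paris is the capital.",
    }
    return canned.get(prompt, "I'm not sure.")

def run_eval(csv_file, predict):
    """Run `predict` on every row and record a simple pass/fail per item."""
    rows = []
    for row in csv.DictReader(csv_file):
        output = predict(row["input"])
        # Assumed pass rule: expected text appears in the output.
        passed = row["expected"].strip().lower() in output.lower()
        rows.append({
            "input": row["input"],
            "expected": row["expected"],
            "output": output,
            "passed": passed,
        })
    return rows

def pass_rate(rows):
    """Fraction of items that passed; 0.0 for an empty eval set."""
    return sum(r["passed"] for r in rows) / len(rows) if rows else 0.0

if __name__ == "__main__":
    results = run_eval(io.StringIO(SAMPLE_CSV), placeholder_model)
    for r in results:
        print("PASS" if r["passed"] else "FAIL", "|", r["input"], "->", r["output"])
    print(f"pass rate: {pass_rate(results):.0%}")
```

A substring match is a blunt pass rule and will miss paraphrased correct answers, but it forces an explicit pass/fail decision per item, which is the point of the harness. To use a file instead of the inline sample, pass `open("evals.csv", newline="", encoding="utf-8")` as `csv_file` (the file name is hypothetical).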
For broader guidance, see the NIST AI Risk Management Framework—a credible reference for evaluation, transparency, and risk controls.
Bottom line
If a demo can’t survive fresh inputs, simple baselines, and basic logging, it’s not ready to ship—or share. Trust data, not vibes.

