Security researchers and builders are embracing a simple idea: “hack my AI assistant.” Simon Willison highlights this approach. Used well, it lets you crowdsource attacks safely and ship real fixes quickly.
Why this works
LLM apps expand your attack surface with tools, retrieval, and memory. Prompt injection and data exfiltration are top risks; see the OWASP Top 10 for LLM Applications for specifics. Inviting outside testers exercises those paths in ways an internal review rarely does.
Run your own “Hack My Assistant” drill (quick start)
- Define scope + safe harbor: clearly state in-scope endpoints, test accounts, and legal protections (see disclose.io Safe Harbor).
- Publish rules: no production user data, no denial-of-service; enumerate allowed domains/APIs and rate limits.
- Instrument everything: log prompts, tool calls, responses, and errors with request IDs; scrub PII before storage.
- Sandbox tools: run functions in isolated containers with outbound allowlists and no raw credential exposure.
- Severity rubric: P1 = cross-tenant data leak or arbitrary file/network access; P2 = policy bypass with sensitive metadata; define lower tiers the same way.
- Repro template: require minimal prompt, steps, expected/actual behavior, screenshots, and cURL export.
- Response SLAs: acknowledge in 24h, triage in 72h, and patch on a deadline tied to severity; publish a changelog of fixes and lessons.
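The "instrument everything" step above can be sketched as a single log record per assistant turn. This is a minimal illustration, not a complete PII scrubber: the regex patterns, the `make_log_record` helper, and the record field names are all assumptions made for the example, and real deployments need much broader detection.

```python
import json
import re
import uuid

# Hypothetical patterns for illustration; real PII detection needs far more coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def scrub(text: str) -> str:
    """Replace obvious emails and phone numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def make_log_record(prompt, tool_calls, response, error=None, request_id=None):
    """Build one JSON-serializable record for a single assistant turn."""
    return {
        # The request ID ties the prompt, tool calls, and response together.
        "request_id": request_id or str(uuid.uuid4()),
        "prompt": scrub(prompt),
        "tool_calls": [
            {"name": c["name"], "args": scrub(json.dumps(c["args"], sort_keys=True))}
            for c in tool_calls
        ],
        "response": scrub(response),
        "error": scrub(error) if error else None,
    }


record = make_log_record(
    prompt="Email alice@example.com the report",
    tool_calls=[{"name": "send_email", "args": {"to": "alice@example.com"}}],
    response="Sent to alice@example.com",
    request_id="req-123",
)
print(json.dumps(record, indent=2))
```

Scrubbing happens before the record is built, so nothing unscrubbed reaches storage; the request ID is what lets a reporter's repro be matched to the exact logged turn.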
Guardrails that actually help
- Strict function calling: enforce JSON schema, parameter allowlists, timeouts, and deny-by-default execution.
- RAG hygiene: allowlist sources; strip system-like strings from retrieved text; attach origin metadata and hashes.
- Output gating: add deterministic checks (regex, policy rules) before high-impact actions; don’t rely on vibes-only model grading.
- User-in-the-loop: require explicit confirmation for payments, emails, file writes, and external posts.
- Egress controls: restrict network access, block localhost/IMDS, and proxy all requests with auditing.
- Circuit breakers: rate-limit, detect repetition loops, and auto-disable a tool on anomalous error spikes.
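As a rough sketch of how strict function calling and user-in-the-loop confirmation could fit together, the check below rejects anything not explicitly registered. The tool names, the `ToolSpec` shape, and `validate_tool_call` are hypothetical, a plain type map stands in for a full JSON Schema validator, and timeouts are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    params: dict  # parameter name -> required Python type
    requires_confirmation: bool = False


# Hypothetical registry: any tool not listed here is denied by default.
TOOL_REGISTRY = {
    "search_docs": ToolSpec(params={"query": str, "limit": int}),
    "send_email": ToolSpec(params={"to": str, "body": str}, requires_confirmation=True),
}


class ToolCallRejected(Exception):
    """Raised when a model-proposed tool call fails validation."""


def validate_tool_call(name, args, user_confirmed=False):
    """Return args unchanged if the call is allowed; raise otherwise."""
    spec = TOOL_REGISTRY.get(name)
    if spec is None:
        raise ToolCallRejected(f"unknown tool: {name!r}")
    if not isinstance(args, dict):
        raise ToolCallRejected("arguments must be an object")
    unexpected = set(args) - set(spec.params)
    missing = set(spec.params) - set(args)
    if unexpected or missing:
        raise ToolCallRejected(
            f"bad parameters: unexpected={sorted(unexpected)} missing={sorted(missing)}"
        )
    for key, expected in spec.params.items():
        value = args[key]
        # bool is a subclass of int in Python, so reject it for non-bool params.
        if isinstance(value, bool) and expected is not bool:
            raise ToolCallRejected(f"{key!r} must be {expected.__name__}")
        if not isinstance(value, expected):
            raise ToolCallRejected(f"{key!r} must be {expected.__name__}")
    if spec.requires_confirmation and not user_confirmed:
        raise ToolCallRejected(f"{name!r} needs explicit user confirmation")
    return args


print(validate_tool_call("search_docs", {"query": "refund policy", "limit": 5}))
```

The design choice is that the validator never tries to repair a bad call: unknown tools, extra parameters, and unconfirmed high-impact actions all fail closed.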
What to measure
- Coverage: number of attack classes exercised (prompt injection, indirect injection via RAG, tool abuse, data exfil).
- MTTR: mean time to remediate, covering detection, triage, and patch; report the median too, since a few slow fixes skew the mean.
- Block rate vs. false positives: policy blocks that prevented harm without breaking legitimate tasks.
- Latency and token overhead from guardrails (to spot regressions).
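The timing and block-rate metrics above reduce to simple arithmetic once findings and guardrail decisions are recorded. The sketch below assumes two hypothetical inputs: a list of findings with `reported` and `patched` timestamps, and a list of `(blocked, malicious)` pairs labeled after review. The sample data is made up for illustration.

```python
from datetime import datetime
from statistics import mean, median

# Made-up sample findings: when each was reported and when its fix shipped.
findings = [
    {"reported": datetime(2024, 1, 1), "patched": datetime(2024, 1, 2)},
    {"reported": datetime(2024, 1, 3), "patched": datetime(2024, 1, 5)},
    {"reported": datetime(2024, 1, 10), "patched": datetime(2024, 1, 15)},
]


def hours_to_patch(items):
    """Elapsed hours from report to patch for each finding."""
    return [(f["patched"] - f["reported"]).total_seconds() / 3600 for f in items]


def block_stats(events):
    """events: iterable of (blocked, malicious) booleans labeled after review."""
    events = list(events)
    blocked_malicious = sum(1 for b, m in events if b and m)
    total_malicious = sum(1 for _, m in events if m)
    blocked_benign = sum(1 for b, m in events if b and not m)
    total_benign = sum(1 for _, m in events if not m)
    return {
        "block_rate": blocked_malicious / total_malicious if total_malicious else 0.0,
        "false_positive_rate": blocked_benign / total_benign if total_benign else 0.0,
    }


durations = hours_to_patch(findings)
print("mean hours to patch:", mean(durations))
print("median hours to patch:", median(durations))

# Made-up sample: 4 malicious requests (3 blocked), 5 benign requests (1 blocked).
events = [(True, True)] * 3 + [(False, True)] + [(True, False)] + [(False, False)] * 4
print(block_stats(events))
```

Reporting block rate and false-positive rate side by side keeps a guardrail from looking good only because it blocks everything.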
What not to do
- Don’t rely only on a “prompt firewall.” Defense-in-depth beats a single magic prompt.
- Don’t hand models raw secrets or broad filesystem/network access.
- Don’t fetch from the open internet without isolation, allowlists, and response validation.
- Don’t log unhashed PII or leave artifacts that enable re-identification.
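One way to approximate the allowlist rule for outbound fetches is a URL check that runs before any request leaves the sandbox. This is a sketch under stated assumptions: `ALLOWED_HOSTS` and `is_egress_allowed` are hypothetical names, and the check inspects only the URL string. A real egress proxy would also need to verify the resolved IP address, since an allowlisted name could still point at an internal address.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical allowlist of hostnames the assistant's tools may contact.
ALLOWED_HOSTS = {"api.example.com", "docs.example.com"}


def is_egress_allowed(url: str) -> bool:
    """Deny by default: only HTTPS on the default port to allowlisted hostnames."""
    try:
        parts = urlsplit(url)
        port = parts.port  # raises ValueError on a malformed port
    except ValueError:
        return False
    if parts.scheme != "https" or port not in (None, 443):
        return False
    host = parts.hostname  # lowercased, with any userinfo and port stripped
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)
    except ValueError:
        pass  # not an IP literal, so fall through to the hostname allowlist
    else:
        # IP literals are always denied, which covers 127.0.0.1 and the
        # cloud metadata address 169.254.169.254.
        return False
    return host in ALLOWED_HOSTS


for candidate in [
    "https://api.example.com/v1/items",
    "https://169.254.169.254/latest/meta-data/",
    "https://localhost/admin",
    "https://api.example.com@evil.example.net/",
]:
    print(candidate, "->", is_egress_allowed(candidate))
```

Because the hostname is taken from the parsed URL rather than matched as a substring, a URL that merely embeds an allowlisted name in its userinfo or path is still rejected.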
Takeaway
Turn “hack my assistant” into a disciplined red-team drill: constrain scope, log deeply, sandbox tools, measure impact, and ship fixes. You’ll harden faster than you would with any static checklist.

