Testing a Skill Before You Ship It

Most skills break in two predictable ways: the description doesn't trigger, or the body misfires once invoked. A skill ships in two halves, the description (does it ever fire?) and the body (when it fires, does it do the right thing?), and almost every shipped-skill bug lives in one of those two surfaces. Almost every one of those bugs is catchable in fifteen minutes of structured testing before users see it.

This guide covers that 15-minute pre-ship pass, with a note at the end on when it's worth graduating to a real eval set.

The 15-minute pre-ship pass

1. The 6/6 trigger test (5 min)

Write three user phrases that should trigger your skill and three that shouldn’t. Run all six against the agent with the skill installed. The skill should fire on the first three and stay silent on the other three.

```text
SHOULD trigger:
- "extract text from this pdf"
- "merge these two pdfs into one"
- "the .pdf in my downloads has a form, fill it"

SHOULD NOT trigger:
- "summarize this docx for me"
- "what's a pdf?"  (factual question, no task)
- "extract text from this image"
```

If the rate is 6/6, the description is in good shape. If it’s 5/6 or worse, the description needs another pass — either it’s too vague (missed triggers) or too broad (false fires).

Run it cold. Use a fresh conversation each time. If the agent has been chatting about PDFs for the last hour, its routing decision is biased. The 6/6 test is only meaningful from a clean slate.
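
If you run this pass more than once, the six prompts are worth scripting. A minimal sketch, assuming a hypothetical `run_agent(prompt)` helper that starts a fresh conversation and returns the name of the skill that fired (or `None`); adapt it to however your agent framework exposes that:

```python
# trigger_test.py -- the 6/6 trigger test, run cold each time.
# Assumes a hypothetical run_agent(prompt) that starts a FRESH conversation
# and returns the name of the skill that fired, or None.
from my_agent import run_agent  # hypothetical helper; adapt to your framework

SKILL = "pdf-tools"

SHOULD_FIRE = [
    "extract text from this pdf",
    "merge these two pdfs into one",
    "the .pdf in my downloads has a form, fill it",
]
SHOULD_NOT_FIRE = [
    "summarize this docx for me",
    "what's a pdf?",
    "extract text from this image",
]

def main() -> None:
    score = 0
    cases = [(p, True) for p in SHOULD_FIRE] + [(p, False) for p in SHOULD_NOT_FIRE]
    for prompt, want_fire in cases:
        fired = run_agent(prompt) == SKILL
        ok = fired == want_fire
        score += ok
        print(f"{'PASS' if ok else 'FAIL'} (fired={fired}): {prompt!r}")
    print(f"{score}/6 {'-- ship it' if score == 6 else '-- rework the description'}")

if __name__ == "__main__":
    main()
```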

2. The happy-path invocation (5 min)

Pick the most common job the skill is for and run it end-to-end. Watch what the agent actually does:

  • Does it find and read the right scripts/templates from the body?
  • Does it pass the right arguments?
  • Does it interpret the output correctly?
  • Does the final reply to the user reflect what actually happened?

If the answer to all four is yes, you’re shipping a skill that works for its main case. That’s the bar.
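
If you'd rather have those four answers on record than eyeballed, the happy path fits in a single end-to-end check. A sketch under the same assumption as the trigger script, here with the hypothetical `run_agent` returning the final reply plus a transcript of tool calls (both shapes are placeholders, not a real API):

```python
# happy_path_test.py -- one end-to-end run of the skill's main job.
# Assumes the hypothetical run_agent(prompt) returns (reply, transcript),
# where transcript is a list of tool-call records your framework logged.
from my_agent import run_agent  # hypothetical helper

def test_happy_path():
    reply, transcript = run_agent("extract the text from invoice.pdf")

    # Did it find and read the right script from the skill body?
    reads = [t for t in transcript if t["tool"] == "read_file"]
    assert any("extract_text" in t["args"]["path"] for t in reads)
    # Did it pass the right argument?
    assert any("invoice.pdf" in str(t["args"]) for t in transcript)
    # Does the final reply reflect what actually happened?
    assert "invoice" in reply.lower()
```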

3. The “wrong input” test (3 min)

Now hand the skill an input it shouldn’t accept. A PDF tool gets a Word doc. A merge tool gets a single file. A search tool gets an empty query. Watch how it fails.

You’re not trying to make it succeed — you’re checking that it fails cleanly: a clear error message back to the user, no half-completed side effects, no scary stack trace.
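
The same idea as a check, with the same hypothetical `run_agent` helper; the assertions are about failing cleanly, not about succeeding:

```python
# wrong_input_test.py -- the skill should refuse bad input cleanly.
from my_agent import run_agent  # hypothetical helper

def test_wrong_input_fails_cleanly():
    # A PDF skill handed a Word doc.
    reply, transcript = run_agent("merge report.docx into one pdf")

    # A clear error message back to the user, not silence.
    assert any(w in reply.lower() for w in ("can't", "cannot", "only works on pdf"))
    # No scary stack trace leaking into the reply.
    assert "Traceback" not in reply
    # No half-completed side effects: no write-style tool calls fired.
    assert not [t for t in transcript if t["tool"] in ("write_file", "delete_file")]
```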

4. The “two skills could fire” test (2 min)

If your skill has any sibling that could plausibly fire on the same prompt, run a phrase that’s right on the boundary and watch which one wins. If the wrong one fires, add a “Do NOT use this when…” line to each description and re-run.
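
A made-up example of what those paired lines look like, with two hypothetical sibling skills:

```text
pdf-tools:  "...extract, merge, and fill PDF files.
             Do NOT use this for Word documents; that's docx-tools."
docx-tools: "...convert and edit Word documents.
             Do NOT use this for PDFs; that's pdf-tools."
```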

What to skip in the pre-ship pass

You’re not running a full QA suite — that’s a different exercise. Skip:

  • Performance / latency micro-optimisation
  • Exhaustive edge-case enumeration (catalogue them in reference/ instead)
  • Multi-skill orchestration scenarios
  • Anything that requires more than one human turn to set up

If you find yourself spending 45 minutes on the pre-ship pass, you’ve drifted into the eval-set territory below.

When to invest in a real eval set

The 15-minute pass is the right tool for shipping. It is not the right tool for iteration — the moment you start tweaking the description weekly, every change risks regressing one of the six trigger phrases that used to work. That’s when an automated eval pays off.

Build one when:

  • The skill is being used by people other than you
  • You’re tuning the description more than once a month
  • You’ve already shipped two regressions caused by description tweaks
  • The skill is the front door to something expensive (a paid API, a destructive operation)

A reasonable eval set is 30–50 prompts split into “should fire,” “should not fire,” and “edge / ambiguous.” Each prompt has an expected outcome. You re-run the set every time the description changes. CI-grade is overkill at first; a Python script you run manually before commits is plenty.
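
A minimal sketch of that script, assuming the prompts live in an `evals.json` file with an `expected` field and reusing the hypothetical `run_agent` helper from the pre-ship pass:

```python
# run_evals.py -- re-run every time the description changes.
# Assumes evals.json is a list of cases like:
#   {"prompt": "merge these pdfs", "expected": "fire", "bucket": "should_fire"}
# and the hypothetical run_agent(prompt) from the pre-ship pass.
import json
from collections import Counter

from my_agent import run_agent  # hypothetical helper

SKILL = "pdf-tools"

def main() -> None:
    with open("evals.json") as f:
        cases = json.load(f)
    tally = Counter()
    for case in cases:
        fired = run_agent(case["prompt"]) == SKILL
        ok = fired == (case["expected"] == "fire")
        tally[(case["bucket"], ok)] += 1
        if not ok:
            print(f"FAIL [{case['bucket']}]: {case['prompt']!r} (fired={fired})")
    for bucket in ("should_fire", "should_not_fire", "edge"):
        good, bad = tally[(bucket, True)], tally[(bucket, False)]
        print(f"{bucket}: {good}/{good + bad} passing")

if __name__ == "__main__":
    main()
```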

Don't over-fit the evals. If you tune the description to ace a 50-prompt set, you'll regress on prompt 51. Keep a "blind test" set of 10 prompts that you don't tune against and only check at release time.

Test data hygiene

A few mistakes that show up over and over in skill testing:

  • Single-domain prompts. If every test prompt is about PDFs of invoices, you don’t know whether the skill works on PDFs of contracts. Spread the domain.
  • Author bias. If the skill author writes the prompts, they unconsciously match the description’s wording. Get someone else to write half the prompts.
  • Stale prompts. Phrasing drifts. Re-collect prompts every quarter from real conversation logs.

Final checklist before shipping

  • 6/6 on the trigger test, on a cold conversation
  • Happy-path invocation runs end-to-end without manual fixes
  • Wrong-input case fails cleanly
  • Boundary case picks the right sibling skill (or no skill)
  • If iterating frequently: a 30-prompt eval set in version control

For the description-level patterns the trigger test exercises, see Writing Skill Descriptions That Actually Trigger. For the body-level structure the invocation test exercises, see Bundling Scripts and Reference Docs in a Skill.