Practical Guide to Evaluating and Testing Claude Code Skills
Skills are everywhere. Since Anthropic launched Claude Code, the ecosystem has exploded with custom "Skills"—directories of instructions and scripts that teach Claude how to handle specific domains.
But there is a problem: almost nobody is testing them.
Most developers "vibe-check" their skills with a few manual prompts, see a good response, and ship it. You wouldn’t ship a TypeScript library without a test suite, so why ship a Skill without an evaluation? This is a practical guide to fixing that.
What Are Claude Code Skills?
Agent Skills are folders that augment Claude's capabilities without retraining the model. They follow a progressive disclosure pattern: only the skill's name and description are loaded up front, and the full instructions are pulled in when Claude judges them relevant to the current request.
At a minimum, a skill is a folder under `.claude/skills/` containing a `SKILL.md` file:
```markdown
---
name: ts-zod-architect
description: Use this skill when the user wants to define data models or API schemas using TypeScript and Zod.
---

# TypeScript Zod Architect

Always follow these patterns:

- Use `z.infer<T>` for type extraction.
- Place schemas in `shared/schemas/`.
- Never use `any`; always use strict Zod validation.
```
The `description` in the frontmatter is the "trigger." If it's vague, the skill won't activate reliably. If it's too broad, it will fire on every request and bloat your context.
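One cheap pre-flight check is to lint the frontmatter before you ever run an eval. Here is a minimal sketch; the length threshold and the "Use this skill when…" phrasing check are my own heuristics, not anything Anthropic publishes:

```typescript
// lint-skill.ts -- minimal SKILL.md frontmatter lint (thresholds are illustrative)
function lintSkillDescription(skillMd: string): string[] {
  const warnings: string[] = [];
  const match = skillMd.match(/^---\n([\s\S]*?)\n---/);
  if (!match) return ["No YAML frontmatter found."];

  const descLine = match[1].split("\n").find(l => l.startsWith("description:"));
  if (!descLine) return ["Frontmatter has no description."];

  const desc = descLine.slice("description:".length).trim();
  if (desc.length < 40) warnings.push("Description may be too terse to trigger reliably.");
  if (!/use (this skill )?when/i.test(desc)) warnings.push("Consider phrasing as 'Use this skill when…'.");
  return warnings;
}

// A terse description trips both heuristics:
const md = `---
name: ts-zod-architect
description: Zod stuff.
---`;
console.log(lintSkillDescription(md));
```

Running this over every skill in CI catches the most common trigger problems before they cost you a full eval run.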
Step 1: Define Success with Assertions
To test a skill, we need to move from "it looks okay" to "it passed these 3 checks." In TypeScript, we can define a test suite using a simple structured object. For our ts-zod-architect skill, we want to ensure Claude doesn't just write the code, but follows our architectural rules.
```typescript
// evals/zod-skill.eval.ts
export const zodEval = {
  name: "ts-zod-architect-eval",
  testCases: [
    {
      prompt: "Create a user schema with email and age.",
      assertions: [
        {
          id: "use-zod",
          text: "Uses 'z' from the zod library",
          type: "regex",
          pattern: /import.*z.*from.*'zod'/
        },
        {
          id: "strict-types",
          text: "No 'any' keywords used",
          type: "negative_regex",
          pattern: /: any/
        },
        {
          id: "correct-path",
          text: "Suggests saving to shared/schemas/",
          type: "inclusion",
          value: "shared/schemas/"
        }
      ]
    }
  ]
};
```
Step 2: Build a Lightweight Eval Harness
Claude Code is a terminal tool, so our harness should be too. We can use a simple script to run Claude with the skill enabled, capture the output, and run our assertions.
```typescript
// evals/run.ts
import { runClaudeCode } from './utils/claude-wrapper';
import { gradeOutput } from './utils/grader';
import { zodEval } from './zod-skill.eval';

async function runEvals(suite: typeof zodEval) {
  for (const test of suite.testCases) {
    console.log(`Testing: ${test.prompt}`);

    // Run Claude Code in a clean temporary directory
    const output = await runClaudeCode(test.prompt, { skill: 'ts-zod-architect' });

    const results = test.assertions.map(assertion => ({
      ...assertion,
      passed: gradeOutput(output, assertion)
    }));

    const passRate = (results.filter(r => r.passed).length / results.length) * 100;
    console.log(`Score: ${passRate}%`);
  }
}

runEvals(zodEval);
```
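For reference, a deterministic `gradeOutput` covering the three assertion types from Step 1 might look like this. This is a sketch; the `Assertion` shape simply mirrors the eval file above:

```typescript
// utils/grader.ts -- sketch of a deterministic grader for the three assertion types
interface Assertion {
  id: string;
  text: string;
  type: "regex" | "negative_regex" | "inclusion";
  pattern?: RegExp;
  value?: string;
}

export function gradeOutput(output: string, assertion: Assertion): boolean {
  switch (assertion.type) {
    case "regex":           // output must match the pattern
      return assertion.pattern!.test(output);
    case "negative_regex":  // output must NOT match the pattern
      return !assertion.pattern!.test(output);
    case "inclusion":       // output must contain the literal string
      return output.includes(assertion.value!);
    default:
      return false;
  }
}
```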
Pro Tip: I recommend simple regex for structural checks and a "Model-as-a-Grader" approach for complex logic. LLMs are excellent at verifying whether code "follows architectural patterns"—exactly where regex tends to fail.
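One way to structure a model grader is to inject the model call so the parsing logic stays unit-testable. This is a sketch: `callModel` is a placeholder you would wire to your LLM client of choice, and the PASS/FAIL protocol is my own convention:

```typescript
// utils/llm-grader.ts -- Model-as-a-Grader sketch.
// `callModel` is injected so the grader is testable with a stub;
// wire it to a real LLM client in production.
type CallModel = (prompt: string) => Promise<string>;

export async function gradeWithModel(
  output: string,
  criterion: string,
  callModel: CallModel
): Promise<boolean> {
  const prompt = [
    "You are a strict code reviewer.",
    `Criterion: ${criterion}`,
    "Reply with exactly PASS or FAIL.",
    "--- CODE UNDER REVIEW ---",
    output
  ].join("\n");

  const verdict = await callModel(prompt);
  // Be conservative: anything other than a clear PASS counts as a failure.
  return verdict.trim().toUpperCase().startsWith("PASS");
}
```

Because the model call is injected, you can test the verdict parsing with a stub and only hit a real model in CI.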
Step 3: The Hill-Climbing Loop
The first time you run your evals, you’ll likely see a 40-60% pass rate. The goal is to "hill-climb"—iterate on the SKILL.md instructions until the pass rate stabilizes at 100%.
Common Failure: Vague Triggers
If your skill doesn't trigger, your description is too technical or narrow.
- Bad: "Implements Zod schema definitions with inferred types."
- Good: "Use when the user asks to create, update, or define data models, schemas, or validation logic."
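To make each hill-climbing iteration concrete, print which assertions failed rather than just a score—the failing assertion text tells you exactly which line of SKILL.md to edit next. A small helper (a sketch; `GradedAssertion` mirrors the shape the harness produces):

```typescript
// utils/report.ts -- summarize failures to guide the next SKILL.md edit
// (sketch; `GradedAssertion` mirrors the harness's graded results)
interface GradedAssertion {
  id: string;
  text: string;
  passed: boolean;
}

export function summarizeFailures(results: GradedAssertion[]): string {
  const failed = results.filter(r => !r.passed);
  if (failed.length === 0) return "All assertions passed.";
  return failed.map(r => `FAIL ${r.id}: ${r.text}`).join("\n");
}
```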
Step 4: Graduate to "Negative Tests"
A great skill knows when not to trigger. If you ask Claude "How do I center a div?", your Zod skill shouldn't activate. Add "Negative Controls" to your eval suite to prevent context leakage.
```typescript
{
  prompt: "Center a div with CSS.",
  assertions: [
    {
      id: "no-activation",
      text: "Skill 'ts-zod-architect' was NOT used",
      type: "skill_not_called"
    }
  ]
}
```
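How you detect activation depends on your wrapper. If `runClaudeCode` also reports which skills were loaded during the run—an assumption about your own wrapper, not a documented API—the check itself is a one-liner:

```typescript
// utils/negative-controls.ts -- sketch; assumes the wrapper reports
// which skills were loaded during the run
interface RunResult {
  output: string;
  skillsUsed: string[]; // e.g. parsed from the session transcript
}

export function assertSkillNotCalled(result: RunResult, skillName: string): boolean {
  return !result.skillsUsed.includes(skillName);
}
```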
Conclusion
Building powerful AI Agents with Claude Code is no longer about "magic prompts." It is about Context Engineering.
By treating your Skills as software—with versioned files, structured evals, and automated grading—you turn a "vibe-checked" experiment into a reliable tool. Don't just prompt. Engineer the context.
Thanks for reading! If you want to see the full implementation of the eval harness, check out my GitHub.