Open source. 10 free runs, no credit card.

Is your skill actually helping the model?

Thousands of SKILL.md files are published. Almost none are tested. SkillCheck runs a blind A/B experiment on any agent skill and hands you a verdict with a confidence interval, in about two minutes.

Star on GitHub

npx @sx4im/skillcheck to try it without installing anything.

2
arms per task: with and without the skill
1,000
bootstrap resamples behind every verdict
95%
confidence interval on the measured effect
10
free runs on every new account
The problem

Skills ship on vibes

A skill file changes your agent's behavior on every single request. Yet almost nobody measures whether that change is an improvement.

Untested by default

Skills get written, committed, and shared without a single controlled comparison. "It feels better" is the entire QA process.

Placebo tax

Every skill costs prompt tokens on every call. A skill that does nothing still bills you for the privilege, forever.

Silent rot

Models change underneath you. A skill that helped six months ago can quietly become useless, or start hurting, after an upgrade.

How it works

A drug trial for your skill

One command runs the whole experiment. No setup, no harness, no notebooks.

Normalize

Your skill file is parsed and its declared domain extracted from front matter or the first heading.

Generate tasks

A generator model writes fresh evaluation tasks from the domain only. It never sees the skill body, so tasks cannot leak its instructions.

Run both arms

Every task runs with the skill injected as a system prompt, and again without it. Same model, same temperature, same everything else.

Grade blind

A separate grader scores each output against the task's pass criterion at temperature 0. It never knows which arm produced the output.

Score the difference

A 1,000 iteration paired bootstrap turns the pass rates into an effect size, a 95% confidence interval, and one verdict: HELPS, PLACEBO, or HARMS.

What you get

Evidence, not anecdotes

SkillCheck produces numbers you can put in a PR description. Every run is reproducible and every score carries its uncertainty.

Forced-injection A/B

The same tasks run twice, with and without your skill. The delta in pass rate is the skill's measured effect in percentage points.

Blind grading

Outputs are shuffled before grading, so the grader cannot favor either arm. No self-evaluation bias, no cherry-picking.

Bootstrap confidence

1,000 paired resamples build a 95% interval around the effect. The verdict only says HELPS when the interval clears zero.

Rot detection

Re-run saved results against new model releases. If a verdict flips from HELPS to PLACEBO, you know the skill rotted before your users do.

Reproducible by design

Every result records the skill hash, task suite, model versions, and transcript hashes. Anyone can re-run and verify the number.

Token aware

SkillCheck counts the prompt tokens your skill adds and reports value per 1k tokens. A small win that triples your context is not a win.

SKILL.md AGENTS.md CLAUDE.md any *.md file whole folders
The result

One card. One answer.

No dashboards to interpret. The CLI prints a single result card that says whether the skill earned its place in your prompt.

  • VerdictHELPS, PLACEBO, or HARMS, decided by whether the 95% interval clears zero. No interval, no claim.
  • Skill effectThe change in pass rate, in percentage points. This is the number to quote in your PR.
  • Token costWhat the skill adds to every prompt. Weigh it against the effect before shipping.
  • SatisfactionA 0 to 100 quality score where 50 means no effect. Quick to read, backed by the bootstrap.
Pricing

Start free. Upgrade when it pays off.

Ten runs is enough to test a real skill at two effort levels. Go unlimited when SkillCheck earns a place in your workflow.

Free
$0
  • 10 SkillCheck runs included
  • Full CLI: check, eval, verify
  • Blind grading and bootstrap CI
  • No credit card required

Upgrade lives in your dashboard once you are signed in.

FAQ

Questions, answered

What is SkillCheck?
SkillCheck is an open source CLI and cloud service that measures whether an agent skill file actually improves a model's task performance. It runs a controlled A/B experiment instead of relying on intuition: the same tasks are solved with and without your skill, graded blind, and scored with a bootstrap confidence interval.
How does SkillCheck test a skill?
SkillCheck reads your skill's declared domain and generates fresh evaluation tasks from the domain only, so tasks can never leak the skill's instructions. Each task runs in two arms, with and without the skill injected as a system prompt. A separate grader model scores every output blind, then a 1,000 iteration paired bootstrap produces the effect size, a 95% confidence interval, and one verdict: HELPS, PLACEBO, or HARMS.
Which skill files can I check?
Any Markdown skill file: SKILL.md, AGENTS.md, CLAUDE.md, or any other .md file. You can also point SkillCheck at a folder that contains one and pick the file interactively.
Do I need my own model API key?
No. Sign in with Google or GitHub and SkillCheck Cloud issues you a free API key with 10 runs included. The model provider key stays on the server and never reaches your terminal. Power users can set NVIDIA_API_KEY to run fully direct instead.
What does a PLACEBO verdict mean?
PLACEBO means the 95% confidence interval for the skill's effect overlaps zero: no measurable difference between running with and without the skill. It does not always mean the skill is bad, but it does mean you are paying tokens for an effect you cannot demonstrate.
Is SkillCheck open source?
Yes. The CLI, the evaluation methodology, and this dashboard are MIT licensed and developed in the open at github.com/sx4im/skillcheck.

Stop shipping placebo skills

Sign in with Google or GitHub, grab your key, and get your first verdict in about two minutes.

npm install -g @sx4im/skillcheck skillcheck check ./SKILL.md