Skill Evaluations

You've created a custom skill that teaches Meggy how to write emails in your company's voice. It works great — today. But how do you know it'll still work after the next model update? Or after you tweak the instructions? Or three months from now when you've forgotten what it even does?

Meggy's Skill Evaluation pipeline gives you three ways to answer: scenario tests that verify a skill does what it claims, blind A/B comparisons that show whether it's making responses better, and regression monitoring that runs a daily scan and alerts you when quality starts slipping.

Quality Scores

Every active skill gets a quality score from 0 to 100, updated automatically as you use Meggy. The score is a weighted blend of four signals:

Signal	Weight	What It Measures
😊 User satisfaction	50%	Does the AI produce better responses with this skill? Tracks implicit signals — regenerations (bad), positive reactions and completions (good)
📦 Token efficiency	20%	How much context does the skill add? Skills that bloat the prompt without adding value score lower
⚡ Latency	15%	Does the skill slow things down? Lower overhead means a higher score
🎯 Injection rate	15%	Is the skill actually being used? If the router rarely picks it, something's off

Scores need at least 10 injections before they appear, which prevents snap judgments on skills you've barely used.
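To make the blend concrete, here is a minimal sketch of how the four weights and the 10-injection minimum could combine. It assumes each signal has already been normalized to a 0–100 sub-score; that normalization, and the names quality_score and MIN_INJECTIONS, are illustrative assumptions rather than Meggy's actual implementation.

```python
from typing import Optional

# Weights in percent, taken from the table above.
WEIGHTS = {
    "user_satisfaction": 50,
    "token_efficiency": 20,
    "latency": 15,
    "injection_rate": 15,
}
MIN_INJECTIONS = 10  # scores stay hidden below this many injections


def quality_score(signals: dict[str, float], injections: int) -> Optional[int]:
    """Blend per-signal sub-scores (assumed 0-100 each) into one 0-100 score.

    Returns None while the skill has too few injections to be scored.
    """
    if injections < MIN_INJECTIONS:
        return None
    weighted = sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)
    return round(weighted / 100)


example = {
    "user_satisfaction": 80,
    "token_efficiency": 60,
    "latency": 70,
    "injection_rate": 50,
}
print(quality_score(example, injections=12))  # 70
print(quality_score(example, injections=9))   # None
```

Because user satisfaction carries half the weight, a skill that is cheap and fast but triggers frequent regenerations will still land in the lower bands.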

What the Scores Mean

Score	Rating	What to Do
80–100	Excellent	This skill is a keeper
60–79	Good	Working well, no action needed
40–59	Fair	Worth reviewing — might need a prompt tweak
Below 40	Poor	May be doing more harm than good — consider editing or disabling

Each score also includes a trend — rising, stable, or declining — so you can catch problems before they become critical.
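The rating bands in the table translate directly into a lookup. The sketch below covers only the bands; how Meggy derives the trend is not described here, so it is left out. The function name rating is hypothetical.

```python
def rating(score: int) -> str:
    """Map a 0-100 quality score to the rating bands from the table."""
    if score >= 80:
        return "Excellent"
    if score >= 60:
        return "Good"
    if score >= 40:
        return "Fair"
    return "Poor"


for s in (92, 60, 59, 12):
    print(s, rating(s))
```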

Scenario Tests

Want to verify a skill works correctly? Create a scenario — a test prompt paired with an expected outcome.

How It Works

You define a scenario: "When asked 'Draft a meeting recap', the response should include an action items section"
Meggy sends the prompt to the model with the skill active
An auto-grader checks whether the response matches your expectation, which you can write in natural language or as a regex
The result: pass or fail, with reasoning
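The steps above can be sketched for the regex case as follows. This is an illustration only: Scenario, Result, run_scenario, and the generate callable are hypothetical names, the fake model stands in for a real call with the skill active, and natural-language expectations (which would need a grader model) are not covered.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Scenario:
    prompt: str
    expected_pattern: str  # regex the response is expected to match


@dataclass
class Result:
    passed: bool
    reasoning: str


def run_scenario(scenario: Scenario, generate: Callable[[str], str]) -> Result:
    """Send the prompt to the model, then grade the response with a regex."""
    response = generate(scenario.prompt)
    if re.search(scenario.expected_pattern, response, re.IGNORECASE):
        return Result(True, f"response matched /{scenario.expected_pattern}/")
    return Result(False, f"response did not match /{scenario.expected_pattern}/")


def fake_model(prompt: str) -> str:
    # Stand-in for the real model call with the skill active.
    return "Recap of today's sync.\n\nAction items:\n- Send the notes to the team"


recap = Scenario("Draft a meeting recap", r"action items")
result = run_scenario(recap, fake_model)
print(result.passed, "-", result.reasoning)
```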

You can create as many scenarios as you want for each skill. Run them all at once to get an evaluation report showing:

Pass rate — What percentage of scenarios passed?
Average latency — How fast were the responses?
Token count — How much context did the skill consume?
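As a rough illustration, the three report figures can be derived from per-scenario results like this. The ScenarioRun fields and the summarize function are assumptions made for the example, and whether Meggy reports token count as a total or an average is not stated here; the sketch uses a total.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ScenarioRun:
    scenario_id: str
    passed: bool
    latency_ms: float
    tokens: int


def summarize(runs: list[ScenarioRun]) -> dict[str, float]:
    """Aggregate individual scenario runs into report-level figures."""
    if not runs:
        raise ValueError("cannot summarize an empty evaluation")
    return {
        "pass_rate": 100 * sum(r.passed for r in runs) / len(runs),
        "avg_latency_ms": mean([r.latency_ms for r in runs]),
        "total_tokens": sum(r.tokens for r in runs),
    }


runs = [
    ScenarioRun("tone-1", True, 1000, 200),
    ScenarioRun("tone-2", False, 2000, 300),
    ScenarioRun("format-1", True, 1500, 250),
    ScenarioRun("format-2", True, 1500, 250),
]
print(summarize(runs))
```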

Scenario tests are perfect for catching regressions after editing a skill's instructions. Change something, run the scenarios, see if anything broke.

Creating Scenarios

Open a skill's detail view and navigate to the Evaluations tab. You can:

Write scenarios manually — give it a prompt and describe what the response should contain
Group scenarios by category (e.g. "tone", "accuracy", "formatting")
Re-run individual scenarios or the full suite

A/B Comparison

The quality score tells you how well a skill is performing. But what if you want proof that the skill is actually making responses better?

That's what A/B comparisons are for. Meggy runs a blind test:

Takes a set of prompts
Generates two responses for each — one with the skill, one without
Sends both responses (in random order) to an auto-judge model
The judge picks a winner for each pair, without knowing which had the skill

After all prompts are judged, you get:

Result	What It Means
Skill-on wins	The skill clearly improves responses
Skill-off wins	Responses are better without it — time to revise
Inconclusive	Not enough difference to call a winner

Each comparison includes a confidence score (0–100%), so you know how trustworthy the result is. High confidence with "skill-on wins"? Keep it. Low confidence? Run more prompts.
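Here is a minimal sketch of the blind comparison loop. Everything in it is illustrative: ab_compare, verdict, the judge signature, and the min_margin threshold are assumptions, and Meggy's real confidence calculation is not documented here, so the sketch reports a simple win tally instead of a confidence score.

```python
import random
from typing import Callable, Optional, Sequence

Generator = Callable[[str], str]
Judge = Callable[[str, str, str], int]  # returns 0 or 1: index of the better response


def ab_compare(
    prompts: Sequence[str],
    with_skill: Generator,
    without_skill: Generator,
    judge: Judge,
    rng: Optional[random.Random] = None,
) -> dict[str, int]:
    """Judge each prompt's two responses in random order and tally the wins."""
    rng = rng or random.Random()
    wins = {"skill_on": 0, "skill_off": 0}
    for prompt in prompts:
        pair = [("skill_on", with_skill(prompt)), ("skill_off", without_skill(prompt))]
        rng.shuffle(pair)  # blind: the judge only sees two unlabeled texts
        picked = judge(prompt, pair[0][1], pair[1][1])
        wins[pair[picked][0]] += 1
    return wins


def verdict(wins: dict[str, int], min_margin: float = 0.2) -> str:
    """Assumed rule: call a winner only if the win margin reaches min_margin."""
    total = wins["skill_on"] + wins["skill_off"]
    if total == 0:
        return "inconclusive"
    margin = (wins["skill_on"] - wins["skill_off"]) / total
    if margin >= min_margin:
        return "skill-on wins"
    if margin <= -min_margin:
        return "skill-off wins"
    return "inconclusive"


# Toy stand-ins: the "judge" simply prefers the longer response.
tally = ab_compare(
    ["Draft a recap", "Write a follow-up", "Decline politely"],
    with_skill=lambda p: p + " in our company voice, with a clear sign-off",
    without_skill=lambda p: p,
    judge=lambda prompt, a, b: 0 if len(a) >= len(b) else 1,
)
print(tally, verdict(tally))
```

Shuffling each pair before judging is what keeps the test blind: the judge cannot learn that, say, the first response always had the skill.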

When to Run A/B

After creating a new skill, to verify it helps
After editing a skill, to confirm the changes improved things
When a skill's quality score is declining
Before sharing a skill with others, to validate its value

Regression Monitoring

You don't have to remember to check your skills. Meggy does it for you.

A daily regression scan runs in the background, comparing each skill's most recent evaluation against a baseline you've pinned. If a scenario that passed on the baseline now fails, you'll see an alert before bad outputs become a habit.

How It Works

Pick any evaluation report you're happy with and click Set as Baseline in the Report Detail header. That report becomes the reference point. From then on, Meggy compares every newer run against it:

Run an evaluation — either manually from the Evaluations or Regressions tab, or automatically via the daily quality agent.
Compare — the regression engine looks for scenarios that passed on the baseline but fail in the latest run. New scenarios added after the baseline are ignored (they couldn't have regressed).
Alert — if any baseline-passing scenario is now failing, an alert is created with the exact scenario IDs.
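The compare step boils down to a filtered lookup over scenario IDs. The sketch below assumes each report can be reduced to a mapping from scenario ID to pass/fail; find_regressions is a hypothetical name, not part of Meggy's API.

```python
def find_regressions(baseline: dict[str, bool], latest: dict[str, bool]) -> list[str]:
    """Return IDs of scenarios that passed on the baseline but fail in the latest run.

    Scenarios missing from either report are skipped, so anything added
    after the baseline can never count as a regression.
    """
    return sorted(
        scenario_id
        for scenario_id, passed in baseline.items()
        if passed and scenario_id in latest and not latest[scenario_id]
    )


baseline = {"tone-1": True, "tone-2": True, "format-1": False}
latest = {"tone-1": True, "tone-2": False, "format-1": False, "new-1": False}
print(find_regressions(baseline, latest))  # ['tone-2']
```

Note that format-1 is not reported: it was already failing on the baseline, so it has not regressed.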

What Triggers an Alert

The eval-based check (primary):

Severity	Trigger
🚨 Critical	3 or more baseline-passing scenarios now fail, or at least 50% of overlapping baseline-passing scenarios fail
⚠️ Warning	One or two baseline-passing scenarios now fail (and they make up less than 50% of the overlapping baseline-passing scenarios)
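The two rows can be expressed as a small classifier. One assumption is worth calling out: when one or two failures also reach the 50% mark (for example, 1 of 2), this sketch checks the critical rule first. The function name eval_severity is hypothetical.

```python
from typing import Optional


def eval_severity(regressed: int, overlapping_passing: int) -> Optional[str]:
    """Classify an eval-based regression.

    regressed: baseline-passing scenarios that now fail.
    overlapping_passing: baseline-passing scenarios present in both runs.
    """
    if regressed == 0:
        return None
    # Critical is checked first (an assumption for the overlapping cases).
    if regressed >= 3 or regressed * 2 >= overlapping_passing:
        return "critical"
    return "warning"


print(eval_severity(1, 10))  # warning
print(eval_severity(3, 10))  # critical
print(eval_severity(1, 2))   # critical
```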

A second runtime-metrics check runs as a fallback when no baseline exists yet. It uses a 7-day rolling window so a single bad session doesn't trigger false alarms:

Check	⚠️ Warning	🚨 Critical
Quality score drops	≥ 10 points	≥ 20 points
Users stripping the skill	≥ 15% strip rate	≥ 25% strip rate
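The fallback thresholds can be sketched the same way. The inputs here are assumed to be already aggregated over the 7-day window (how Meggy computes the drop and the strip rate is not covered in this section), and runtime_severity is a hypothetical name.

```python
from typing import Optional


def runtime_severity(score_drop: float, strip_rate: float) -> Optional[str]:
    """Classify the runtime-metrics fallback check.

    score_drop: quality score points lost over the 7-day window.
    strip_rate: fraction of uses where the skill was stripped (0.0-1.0).
    """
    if score_drop >= 20 or strip_rate >= 0.25:
        return "critical"
    if score_drop >= 10 or strip_rate >= 0.15:
        return "warning"
    return None


print(runtime_severity(score_drop=12, strip_rate=0.05))  # warning
print(runtime_severity(score_drop=4, strip_rate=0.30))   # critical
print(runtime_severity(score_drop=4, strip_rate=0.05))   # None
```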

What You See

When a regression is detected, an alert appears under Settings → Skills → [skill] → Regressions (or on the global Skill Regression page) showing:

Which skill is affected
Pass-rate change (e.g. "100 → 80")
Regressed scenarios — the exact scenario IDs that newly fail
Severity — warning (yellow) or critical (red)

If a check produces no alert, the tab tells you why with a friendly status banner — for example, "No baseline marked yet — set one in the Evaluations tab to enable regression detection." or "No regressions — N previously-passing scenarios still passing." So you're never left wondering whether silence means "all good" or "nothing was checked".

What You Can Do

For each alert, you have these options:

Action	Effect
Dismiss	Acknowledge and hide the alert
Evaluate	Mark the alert as evaluated (you've reviewed it)
Quarantine	Temporarily disable the skill while you fix it

And on the Regressions tab itself you'll find two buttons:

Button	When to Use
Run Evaluation	Run a fresh evaluation against the skill's scenarios, then immediately re-check the result against the baseline. The typical flow after pinning a baseline.
Check Now	Re-compare the most recent existing evaluation against the baseline without running anything new — useful when an evaluation has already run elsewhere.

Putting It All Together

Here's a typical workflow for maintaining skill quality:

Create a skill — teach Meggy your company's email style
Write scenarios — define 5-10 test prompts with expected outcomes
Run an A/B comparison — prove the skill improves email drafts
Monitor passively — the regression monitor checks daily
Get alerted — two weeks later, a model update causes a quality drop
Investigate — re-run scenarios, find the ones that failed
Fix and verify — edit the skill instructions, run A/B again to confirm

No manual spreadsheets, no guesswork. Just data.

What's Next?

Skills System — Learn how skills work, from bundled skills to custom creation
Plugins — Plugin health scores are powered by skill quality metrics
Brain Transparency — See skill quality scores in the configuration editor
Agent Creator — Build agents that use your highest-quality skills