Skill Evaluations

You've created a custom skill that teaches Meggy how to write emails in your company's voice. It works great — today. But how do you know it'll still work after the next model update? Or after you tweak the instructions? Or three months from now when you've forgotten what it even does?

Meggy's Skill Evaluation pipeline gives you three ways to answer: scenario tests that verify a skill does what it claims, blind A/B comparisons that prove it's making responses better, and regression monitoring that alerts you the moment quality starts slipping.

Quality Scores

Every active skill gets a quality score from 0 to 100, updated automatically as you use Meggy. The score is a weighted blend of four signals:

| Signal | Weight | What It Measures |
| --- | --- | --- |
| 😊 User satisfaction | 50% | Does the AI produce better responses with this skill? Tracks implicit signals: regenerations (bad), positive reactions and completions (good) |
| 📦 Token efficiency | 20% | How much context does the skill add? Skills that bloat the prompt without adding value score lower |
| Latency | 15% | Does the skill slow things down? Lower overhead means a higher score |
| 🎯 Injection rate | 15% | Is the skill actually being used? If the router rarely picks it, something's off |

Scores need at least 10 injections before they appear, which prevents snap judgments on skills you've barely used.
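To make the weighting concrete, here is a minimal sketch of how such a blend could be computed. It assumes each signal has already been normalized to a 0 to 100 sub-score; the signal names and the `quality_score` function are hypothetical and are not Meggy's actual implementation.

```python
from typing import Optional

# Weights taken from the table above; the key names are illustrative labels.
WEIGHTS = {
    "user_satisfaction": 0.50,
    "token_efficiency": 0.20,
    "latency": 0.15,
    "injection_rate": 0.15,
}
MIN_INJECTIONS = 10  # scores stay hidden until the skill has been injected this often


def quality_score(signals: dict[str, float], injections: int) -> Optional[float]:
    """Blend per-signal sub-scores (assumed 0-100 each) into one 0-100 score."""
    if injections < MIN_INJECTIONS:
        return None  # not enough data yet
    total = sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)
    return round(total, 1)
```

With sub-scores of 80, 60, 100, and 40 (in table order) and 12 injections, this sketch returns 73.0. With only 5 injections it returns `None`.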

What the Scores Mean

| Score | Rating | What to Do |
| --- | --- | --- |
| 80–100 | Excellent | This skill is a keeper |
| 60–79 | Good | Working well, no action needed |
| 40–59 | Fair | Worth reviewing; might need a prompt tweak |
| Below 40 | Poor | May be doing more harm than good; consider editing or disabling |

Each score also includes a trend — rising, stable, or declining — so you can catch problems before they become critical.
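The rating bands can be read as a simple threshold lookup. The function below is an illustrative sketch (the name `rating` is hypothetical), useful if you want to mirror the bands in your own tooling.

```python
def rating(score: float) -> str:
    """Map a 0-100 quality score to the rating bands in the table above."""
    if score >= 80:
        return "Excellent"
    if score >= 60:
        return "Good"
    if score >= 40:
        return "Fair"
    return "Poor"
```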

Scenario Tests

Want to verify a skill works correctly? Create a scenario — a test prompt paired with an expected outcome.

How It Works

  1. You define a scenario: "When asked 'Draft a meeting recap', the response should include an action items section"
  2. Meggy sends the prompt to the model with the skill active
  3. An auto-grader checks whether the response matches your expectation (natural language or regex)
  4. The result: pass or fail, with reasoning

You can create as many scenarios as you want for each skill. Run them all at once to get an evaluation report showing which scenarios passed, which failed, and the grader's reasoning for each.

Scenario tests are perfect for catching regressions after editing a skill's instructions. Change something, run the scenarios, see if anything broke.
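To picture what a regex-style expectation looks like, here is a small sketch of a grader that returns pass or fail with reasoning. The `Scenario`, `Result`, and `grade` names are hypothetical, and the natural-language grading path (which needs a model) is left out.

```python
import re
from dataclasses import dataclass


@dataclass
class Scenario:
    prompt: str
    expected_pattern: str  # regex the response must match (assumed format)


@dataclass
class Result:
    passed: bool
    reasoning: str


def grade(scenario: Scenario, response: str) -> Result:
    """Pass if the expected pattern appears anywhere in the response."""
    match = re.search(scenario.expected_pattern, response, re.IGNORECASE)
    if match:
        return Result(True, f"Found {match.group(0)!r} in the response.")
    return Result(False, f"No match for pattern {scenario.expected_pattern!r}.")
```

For the meeting recap example, a pattern such as `action items` would pass any response that contains an "Action Items" heading and fail one that omits it.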

Creating Scenarios

Open a skill's detail view and navigate to the Evaluations tab to create scenarios for that skill.

A/B Comparison

The quality score tells you how well a skill is performing. But what if you want proof that the skill is actually making responses better?

That's what A/B comparisons are for. Meggy runs a blind test:

  1. Takes a set of prompts
  2. Generates two responses for each — one with the skill, one without
  3. Sends both responses (in random order) to an auto-judge model
  4. The judge picks a winner for each pair, without knowing which had the skill

After all prompts are judged, you get:

| Result | What It Means |
| --- | --- |
| Skill-on wins | The skill clearly improves responses |
| Skill-off wins | Responses are better without it; time to revise |
| Inconclusive | Not enough difference to call a winner |

Each comparison includes a confidence score (0–100%), so you know how trustworthy the result is. High confidence with "skill-on wins"? Keep it. Low confidence? Run more prompts.
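The blind part of the test comes down to shuffling each pair before the judge sees it and only mapping the pick back to a label afterwards. The sketch below shows that idea under assumptions: `generate` and `judge` are hypothetical callables standing in for the model and the auto-judge, and how Meggy turns the tally into a verdict and confidence score is not covered here.

```python
import random
from collections import Counter
from typing import Callable, Optional

Generate = Callable[[str, bool], str]  # (prompt, skill_active) -> response
Judge = Callable[[str, str, str], int]  # (prompt, first, second) -> 0 or 1


def run_ab(prompts: list[str], generate: Generate, judge: Judge,
           rng: Optional[random.Random] = None) -> Counter:
    """Tally judge picks per arm; the judge never sees the arm labels."""
    rng = rng or random.Random()
    tally: Counter = Counter()
    for prompt in prompts:
        pair = [
            ("skill_on", generate(prompt, True)),
            ("skill_off", generate(prompt, False)),
        ]
        rng.shuffle(pair)  # random order hides which response used the skill
        choice = judge(prompt, pair[0][1], pair[1][1])
        tally[pair[choice][0]] += 1  # map the pick back to its arm
    return tally
```

Because the judge only receives the two response texts, any preference it shows has to come from the content rather than from knowing which side had the skill.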

When to Run A/B

Regression Monitoring

You don't have to remember to check your skills. Meggy does it for you.

A daily regression scan runs in the background, comparing each skill's most recent evaluation against a baseline you've pinned. If a scenario that passed on the baseline now fails, you'll see an alert before bad outputs become a habit.

How It Works

Pick any evaluation report you're happy with and click Set as Baseline in the Report Detail header. That report becomes the reference point. From then on, Meggy compares every newer run against it:

  1. Run an evaluation — either manually from the Evaluations or Regressions tab, or automatically via the daily quality agent.
  2. Compare — the regression engine looks for scenarios that passed on the baseline but fail in the latest run. New scenarios added after the baseline are ignored (they couldn't have regressed).
  3. Alert — if any baseline-passing scenario is now failing, an alert is created with the exact scenario IDs.
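The compare step can be sketched as a set comparison over scenario results. This is an illustration only: the `find_regressions` name and the dictionary shape (scenario ID mapped to pass or fail) are assumptions, not Meggy's internal format.

```python
def find_regressions(baseline: dict[str, bool], latest: dict[str, bool]) -> list[str]:
    """Return IDs of scenarios that passed on the baseline but fail in the latest run.

    Scenarios that exist only in the latest run are ignored, since they
    have no baseline result to regress from.
    """
    return sorted(
        scenario_id
        for scenario_id, passed in baseline.items()
        if passed and scenario_id in latest and not latest[scenario_id]
    )
```

For example, with a baseline of `{"recap": True, "tone": True, "signoff": False}` and a latest run of `{"recap": False, "tone": True, "signoff": False, "new": False}`, only `recap` is reported. `signoff` was already failing and `new` has no baseline.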

What Triggers an Alert

The eval-based check (primary):

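| Severity | Trigger |
| --- | --- |
| 🚨 Critical | 3 or more baseline-passing scenarios now fail, or at least 50% of overlapping baseline-passing scenarios fail |
| ⚠️ Warning | One or two baseline-passing scenarios now fail, and they make up less than 50% of the overlapping baseline-passing scenarios |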
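Here is a small sketch of how the two severity levels relate. It assumes the critical rule takes precedence whenever both could apply (for instance, 1 failure out of 2 overlapping scenarios is 50%, so it counts as critical). The `severity` function name is hypothetical.

```python
from typing import Optional


def severity(regressed: int, overlapping_baseline_passing: int) -> Optional[str]:
    """Classify an eval-based regression using the thresholds in the table above."""
    if regressed <= 0 or overlapping_baseline_passing <= 0:
        return None  # nothing regressed, so no alert
    fail_ratio = regressed / overlapping_baseline_passing
    if regressed >= 3 or fail_ratio >= 0.5:
        return "critical"
    return "warning"
```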

A second runtime-metrics check runs as a fallback when no baseline exists yet. It uses a 7-day rolling window so a single bad session doesn't trigger false alarms:

| Check | ⚠️ Warning | 🚨 Critical |
| --- | --- | --- |
| Quality score drops | ≥ 10 points | ≥ 20 points |
| Users stripping the skill | ≥ 15% strip rate | ≥ 25% strip rate |
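The fallback thresholds can be expressed the same way. This sketch assumes both inputs have already been aggregated over the 7-day window, and that the more severe level wins when the two checks disagree. The function and parameter names are illustrative.

```python
from typing import Optional


def runtime_severity(score_drop: float, strip_rate: float) -> Optional[str]:
    """Classify runtime metrics.

    score_drop: points lost in quality score over the rolling window.
    strip_rate: fraction (0.0-1.0) of uses where the user stripped the skill.
    """
    if score_drop >= 20 or strip_rate >= 0.25:
        return "critical"
    if score_drop >= 10 or strip_rate >= 0.15:
        return "warning"
    return None
```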

What You See

When a regression is detected, an alert appears under Settings → Skills → [skill] → Regressions (or in the global Skill Regression page) showing the exact scenarios that passed on the baseline and now fail.

If a check produces no alert, the tab tells you why with a friendly status banner — for example, "No baseline marked yet — set one in the Evaluations tab to enable regression detection." or "No regressions — N previously-passing scenarios still passing." So you're never left wondering whether silence means "all good" or "nothing was checked".

What You Can Do

For each alert, you have these options:

| Action | Effect |
| --- | --- |
| Dismiss | Acknowledge and hide the alert |
| Evaluate | Mark the alert as evaluated (you've reviewed it) |
| Quarantine | Temporarily disable the skill while you fix it |

And on the Regressions tab itself you'll find two buttons:

| Button | When to Use |
| --- | --- |
| Run Evaluation | Run a fresh evaluation against the skill's scenarios, then immediately re-check the result against the baseline. The typical flow after pinning a baseline. |
| Check Now | Re-compare the most recent existing evaluation against the baseline without running anything new. Useful when an evaluation has already run elsewhere. |

Putting It All Together

Here's a typical workflow for maintaining skill quality:

  1. Create a skill — teach Meggy your company's email style
  2. Write scenarios — define 5-10 test prompts with expected outcomes
  3. Run an A/B comparison — prove the skill improves email drafts
  4. Monitor passively — the regression monitor checks daily
  5. Get alerted — two weeks later, a model update causes a quality drop
  6. Investigate — re-run scenarios, find the ones that failed
  7. Fix and verify — edit the skill instructions, run A/B again to confirm

No manual spreadsheets, no guesswork. Just data.

What's Next?