Skill Evaluations

You've created a custom skill that teaches Meggy how to write emails in your company's voice. It works great — today. But how do you know it'll still work after the next model update? Or after you tweak the instructions? Or three months from now when you've forgotten what it even does?

Meggy's Skill Evaluation pipeline gives you three ways to answer: scenario tests that verify a skill does what it claims, blind A/B comparisons that prove it's making responses better, and regression monitoring that alerts you the moment quality starts slipping.

Quality Scores

Every active skill gets a quality score from 0 to 100, updated automatically as you use Meggy. The score is a weighted blend of four signals:

| Signal | Weight | What It Measures |
|---|---|---|
| 😊 User satisfaction | 50% | Does the AI produce better responses with this skill? Tracks implicit signals — regenerations (bad), positive reactions and completions (good) |
| 📦 Token efficiency | 20% | How much context does the skill add? Skills that bloat the prompt without adding value score lower |
| ⏱ Latency | 15% | Does the skill slow things down? Lower overhead means a higher score |
| 🎯 Injection rate | 15% | Is the skill actually being used? If the router rarely picks it, something's off |

Scores need at least 10 injections before they appear, which prevents snap judgments on skills you've barely used.
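The weighted blend above can be sketched in a few lines. This is an illustrative approximation, not Meggy's internal implementation — in particular, the assumption that each signal arrives already normalized to a 0–1 scale is mine:

```python
def quality_score(satisfaction, token_efficiency, latency, injection_rate):
    """Blend the four signals into a 0-100 score.

    Weights follow the table above (50/20/15/15). Each input is assumed
    to be normalized to 0-1, where 1 is best (e.g. for latency, 1 means
    negligible overhead). That normalization is an assumption.
    """
    blended = (
        0.50 * satisfaction
        + 0.20 * token_efficiency
        + 0.15 * latency
        + 0.15 * injection_rate
    )
    return round(blended * 100)


def has_score(injection_count):
    # Scores only appear after at least 10 injections.
    return injection_count >= 10
```

For example, a skill with strong satisfaction (0.8) but middling signals elsewhere (0.5 each) would land at 65 — squarely in the "Good" band.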

What the Scores Mean

| Score | Rating | What to Do |
|---|---|---|
| 80–100 | Excellent | This skill is a keeper |
| 60–79 | Good | Working well, no action needed |
| 40–59 | Fair | Worth reviewing — might need a prompt tweak |
| Below 40 | Poor | May be doing more harm than good — consider editing or disabling |

Each score also includes a trend — rising, stable, or declining — so you can catch problems before they become critical.
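The rating bands and the trend label are both simple threshold checks. A minimal sketch — the 2-point dead band in `trend` is a hypothetical choice to keep small fluctuations from flipping the label:

```python
def rating(score):
    # Bands from the table above.
    if score >= 80:
        return "Excellent"
    if score >= 60:
        return "Good"
    if score >= 40:
        return "Fair"
    return "Poor"


def trend(previous, current, tolerance=2):
    # "tolerance" is a hypothetical dead band, not a documented value.
    if current - previous > tolerance:
        return "rising"
    if previous - current > tolerance:
        return "declining"
    return "stable"
```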

Scenario Tests

Want to verify a skill works correctly? Create a scenario — a test prompt paired with an expected outcome.

How It Works

  1. You define a scenario: "When asked 'Draft a meeting recap', the response should include an action items section"
  2. Meggy sends the prompt to the model with the skill active
  3. An auto-grader checks whether the response matches your expectation (natural language or regex)
  4. The result: pass or fail, with reasoning
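The grading step above can be sketched for the regex case, which is checkable locally; natural-language grading would call a judge model, stubbed out here. The function name and return shape are illustrative, not Meggy's API:

```python
import re


def grade(response, expectation, mode="regex"):
    """Auto-grade a scenario result: returns (passed, reasoning).

    'regex' mode checks the response against a pattern. Natural-language
    grading would require a judge-model call, so it is left as a stub.
    """
    if mode == "regex":
        passed = re.search(expectation, response) is not None
        found = "found" if passed else "not found"
        return passed, f"pattern {expectation!r} {found} in response"
    raise NotImplementedError("natural-language grading needs a judge model")
```

For the meeting-recap scenario, `grade(response, r"(?i)action items")` passes whenever the response contains an action items section, case-insensitively.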

You can create as many scenarios as you want for each skill, and run them all at once to get a combined evaluation report.

Scenario tests are perfect for catching regressions after editing a skill's instructions. Change something, run the scenarios, see if anything broke.

Creating Scenarios

Open a skill's detail view and navigate to the Evaluations tab to create and run scenarios.

A/B Comparison

The quality score tells you how well a skill is performing. But what if you want proof that the skill is actually making responses better?

That's what A/B comparisons are for. Meggy runs a blind test:

  1. Takes a set of prompts
  2. Generates two responses for each — one with the skill, one without
  3. Sends both responses (in random order) to an auto-judge model
  4. The judge picks a winner for each pair, without knowing which had the skill
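The loop above can be sketched as follows. `generate` and `judge` are hypothetical callables standing in for the model calls; the key detail is shuffling each pair so the judge never sees the skill-on response in a fixed position:

```python
import random


def ab_compare(prompts, generate, judge):
    """Blind A/B test: generate with and without the skill, judge blind.

    generate(prompt, skill_on) -> response text
    judge(prompt, a, b) -> "A", "B", or "tie" (never told which is which)
    Both callables are placeholders for real model calls.
    """
    wins = {"skill_on": 0, "skill_off": 0, "tie": 0}
    for prompt in prompts:
        pair = [
            ("skill_on", generate(prompt, skill_on=True)),
            ("skill_off", generate(prompt, skill_on=False)),
        ]
        # Randomize presentation order so position can't leak the answer.
        random.shuffle(pair)
        verdict = judge(prompt, pair[0][1], pair[1][1])
        if verdict == "A":
            wins[pair[0][0]] += 1
        elif verdict == "B":
            wins[pair[1][0]] += 1
        else:
            wins["tie"] += 1
    return wins
```

Because the tally is keyed by label rather than position, the shuffle affects only what the judge sees, never how the votes are counted.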

After all prompts are judged, you get:

| Result | What It Means |
|---|---|
| Skill-on wins | The skill clearly improves responses |
| Skill-off wins | Responses are better without it — time to revise |
| Inconclusive | Not enough difference to call a winner |

Each comparison includes a confidence score (0–100%), so you know how trustworthy the result is. High confidence with "skill-on wins"? Keep it. Low confidence? Run more prompts.

When to Run A/B

Regression Monitoring

You don't have to remember to check your skills. Meggy does it for you.

A regression monitor runs in the background every 24 hours, scanning all active skills for quality drops. If something's going wrong, you'll see an alert before it becomes a problem.

What Triggers an Alert

| Check | ⚠️ Warning | 🚨 Critical |
|---|---|---|
| Quality score drops | ≥ 10 points | ≥ 20 points |
| Users stripping the skill | ≥ 15% strip rate | ≥ 25% strip rate |

Both checks use a 7-day rolling window, so a single bad session doesn't trigger false alarms.
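The two checks reduce to threshold comparisons over the 7-day window. A minimal sketch using the thresholds from the table — the function signature and severity labels are illustrative:

```python
WARNING = {"score_drop": 10, "strip_rate": 0.15}
CRITICAL = {"score_drop": 20, "strip_rate": 0.25}


def check_regression(score_week_ago, score_now, strip_rate):
    """Classify a skill's 7-day window as 'ok', 'warning', or 'critical'.

    score_week_ago / score_now: quality scores (0-100) at the window edges.
    strip_rate: fraction of sessions where users removed the skill.
    """
    drop = score_week_ago - score_now
    if drop >= CRITICAL["score_drop"] or strip_rate >= CRITICAL["strip_rate"]:
        return "critical"
    if drop >= WARNING["score_drop"] or strip_rate >= WARNING["strip_rate"]:
        return "warning"
    return "ok"
```

Either signal alone is enough to raise an alert: a 25-point score drop is critical even with a 0% strip rate, and vice versa.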

What You See

When a regression is detected, an alert appears in Settings → Skills with the details of the drop.

What You Can Do

For each alert, you have four options:

| Action | Effect |
|---|---|
| Dismiss | Acknowledge and hide the alert |
| Re-evaluate | Run the skill's scenario tests to investigate |
| Quarantine | Temporarily disable the skill while you fix it |
| Manual check | Trigger an immediate regression check (instead of waiting 24 hours) |

Putting It All Together

Here's a typical workflow for maintaining skill quality:

  1. Create a skill — teach Meggy your company's email style
  2. Write scenarios — define 5–10 test prompts with expected outcomes
  3. Run an A/B comparison — prove the skill improves email drafts
  4. Monitor passively — the regression monitor checks daily
  5. Get alerted — two weeks later, a model update causes a quality drop
  6. Investigate — re-run scenarios, find the ones that failed
  7. Fix and verify — edit the skill instructions, run A/B again to confirm

No manual spreadsheets, no guesswork. Just data.

What's Next?