You've created a custom skill that teaches Meggy how to write emails in your company's voice. It works great — today. But how do you know it'll still work after the next model update? Or after you tweak the instructions? Or three months from now when you've forgotten what it even does?
Meggy's Skill Evaluation pipeline gives you three ways to answer: scenario tests that verify a skill does what it claims, blind A/B comparisons that prove it's making responses better, and regression monitoring that alerts you the moment quality starts slipping.
Every active skill gets a quality score from 0 to 100, updated automatically as you use Meggy. The score is a weighted blend of four signals:
| Signal | Weight | What It Measures |
|---|---|---|
| 😊 User satisfaction | 50% | Does the AI produce better responses with this skill? Tracks implicit signals — regenerations (bad), positive reactions and completions (good) |
| 📦 Token efficiency | 20% | How much context does the skill add? Skills that bloat the prompt without adding value score lower |
| ⚡ Latency | 15% | Does the skill slow things down? Lower overhead means a higher score |
| 🎯 Injection rate | 15% | Is the skill actually being used? If the router rarely picks it, something's off |
Scores need at least 10 injections before they appear, which prevents snap judgments on skills you've barely used.
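To make the blend concrete, here's a minimal sketch of the weighted score, assuming each signal arrives pre-normalized as a 0–100 sub-score. The names and structure are illustrative, not Meggy's actual internals:

```python
# Illustrative sketch of the quality-score blend described above.
# Assumption: each sub-score is already normalized to the 0-100 range.

WEIGHTS = {
    "satisfaction": 0.50,      # implicit user-satisfaction signals
    "token_efficiency": 0.20,  # context cost of injecting the skill
    "latency": 0.15,           # overhead added to each response
    "injection_rate": 0.15,    # how often the router actually picks it
}

MIN_INJECTIONS = 10  # scores stay hidden below this sample size


def quality_score(signals: dict[str, float], injections: int) -> float | None:
    """Blend normalized 0-100 sub-scores into one quality score."""
    if injections < MIN_INJECTIONS:
        return None  # not enough data yet; avoids snap judgments
    return sum(weight * signals[name] for name, weight in WEIGHTS.items())


score = quality_score(
    {"satisfaction": 82, "token_efficiency": 70, "latency": 90, "injection_rate": 60},
    injections=34,
)
print(score)  # 77.5 -> lands in the "Good" band in the table below
```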
| Score | Rating | What to Do |
|---|---|---|
| 80–100 | Excellent | This skill is a keeper |
| 60–79 | Good | Working well, no action needed |
| 40–59 | Fair | Worth reviewing — might need a prompt tweak |
| Below 40 | Poor | May be doing more harm than good — consider editing or disabling |
Each score also includes a trend — rising, stable, or declining — so you can catch problems before they become critical.
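Here's a small sketch of how the rating bands and trend could fall out of a score history. The half-and-half comparison window and the 5-point threshold are assumptions, since the actual trend logic isn't documented:

```python
def rating(score: float) -> str:
    """Map a quality score onto the bands from the table above."""
    if score >= 80:
        return "Excellent"
    if score >= 60:
        return "Good"
    if score >= 40:
        return "Fair"
    return "Poor"


def trend(history: list[float], threshold: float = 5.0) -> str:
    """Classify a score history as rising, stable, or declining.

    Assumption: compare the average of the newer half of the history
    against the older half, with a 5-point significance threshold.
    """
    if len(history) < 2:
        return "stable"
    mid = len(history) // 2
    older = sum(history[:mid]) / mid
    recent = sum(history[mid:]) / (len(history) - mid)
    if recent - older >= threshold:
        return "rising"
    if older - recent >= threshold:
        return "declining"
    return "stable"


print(rating(77.5), trend([72, 74, 71, 78, 80, 83]))  # Good rising
```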
Want to verify a skill works correctly? Create a scenario — a test prompt paired with an expected outcome.
You can create as many scenarios as you want for each skill. Run them all at once to get an evaluation report showing which scenarios passed and which failed.
Scenario tests are perfect for catching regressions after editing a skill's instructions. Change something, run the scenarios, see if anything broke.
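Conceptually, a scenario run is just a loop: send each test prompt, check the response against the expected outcome, and tally the results. Here's a minimal sketch, where `generate` and `judge` are hypothetical stand-ins for the model call and the outcome check:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Scenario:
    prompt: str    # the test input sent to the model
    expected: str  # the outcome the skill should produce


# A hypothetical scenario for an email-voice skill:
email_tone = Scenario(
    prompt="Draft a reply declining the vendor's proposal.",
    expected="Polite and direct, in the company voice, no filler apologies.",
)


def run_scenarios(
    scenarios: list[Scenario],
    generate: Callable[[str], str],
    judge: Callable[[str, str], bool],
) -> list[tuple[Scenario, bool]]:
    """Run every scenario and collect a pass/fail result for each."""
    results = [(s, judge(generate(s.prompt), s.expected)) for s in scenarios]
    passed = sum(ok for _, ok in results)
    print(f"{passed}/{len(results)} scenarios passed")
    return results
```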
Open a skill's detail view and navigate to the Evaluations tab. From there you can create new scenarios, edit existing ones, and run them one at a time or all at once.
The quality score tells you how well a skill is performing. But what if you want proof that the skill is actually making responses better?
That's what A/B comparisons are for. Meggy runs a blind test: for each test prompt, it generates one response with the skill injected and one without, then a blind judge picks the better of the two without knowing which is which.
After all prompts are judged, you get:
| Result | What It Means |
|---|---|
| Skill-on wins | The skill clearly improves responses |
| Skill-off wins | Responses are better without it — time to revise |
| Inconclusive | Not enough difference to call a winner |
Each comparison includes a confidence score (0–100%), so you know how trustworthy the result is. High confidence with "skill-on wins"? Keep it. Low confidence? Run more prompts.
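For intuition, here's a sketch of a blind A/B loop under stated assumptions: `generate_on`, `generate_off`, and `judge` are hypothetical hooks, and the confidence and tie-break formulas are illustrative, not Meggy's published method:

```python
import random
from typing import Callable


def ab_compare(
    prompts: list[str],
    generate_on: Callable[[str], str],   # response with the skill injected
    generate_off: Callable[[str], str],  # response without it
    judge: Callable[[str, str, str], int | None],  # picks 0 or 1; None = tie
) -> tuple[str, int]:
    """Blind A/B sketch: the judge never learns which response used the skill."""
    wins_on = wins_off = 0
    for prompt in prompts:
        pair = [("on", generate_on(prompt)), ("off", generate_off(prompt))]
        random.shuffle(pair)  # hide which response is which from the judge
        choice = judge(prompt, pair[0][1], pair[1][1])
        if choice is not None:
            if pair[choice][0] == "on":
                wins_on += 1
            else:
                wins_off += 1
    margin = abs(wins_on - wins_off)
    confidence = round(100 * margin / len(prompts))  # assumed formula
    if margin / len(prompts) < 0.2:  # assumed "too close to call" threshold
        return "inconclusive", confidence
    return ("skill-on wins" if wins_on > wins_off else "skill-off wins"), confidence
```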
You don't have to remember to check your skills. Meggy does it for you.
A regression monitor runs in the background every 24 hours, scanning all active skills for quality drops. If something's going wrong, you'll see an alert before it becomes a problem.
| Check | ⚠️ Warning | 🚨 Critical |
|---|---|---|
| Quality score drops | ≥ 10 points | ≥ 20 points |
| Users stripping the skill | ≥ 15% strip rate | ≥ 25% strip rate |
Both checks use a 7-day rolling window, so a single bad session doesn't trigger false alarms.
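The threshold logic itself is straightforward. Here's a sketch, assuming the inputs have already been aggregated over the 7-day window:

```python
def classify_regression(score_drop: float, strip_rate: float) -> str | None:
    """Apply the warning/critical thresholds from the table above.

    Assumption: both inputs are aggregated over the 7-day rolling window,
    with `score_drop` in points and `strip_rate` as a fraction of sessions.
    """
    if score_drop >= 20 or strip_rate >= 0.25:
        return "critical"
    if score_drop >= 10 or strip_rate >= 0.15:
        return "warning"
    return None  # nothing to flag


print(classify_regression(score_drop=12, strip_rate=0.05))  # warning
```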
When a regression is detected, an alert appears in Settings → Skills showing which skill regressed, which check triggered, and whether it's at warning or critical severity.
For each alert, you have four options:
| Action | Effect |
|---|---|
| Dismiss | Acknowledge and hide the alert |
| Re-evaluate | Run the skill's scenario tests to investigate |
| Quarantine | Temporarily disable the skill while you fix it |
| Manual check | Trigger an immediate regression check (instead of waiting 24 hours) |
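As a rough sketch of how those four actions might map onto code (the callables are hypothetical hooks, not Meggy's API):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class AlertAction(Enum):
    DISMISS = "dismiss"
    RE_EVALUATE = "re-evaluate"
    QUARANTINE = "quarantine"
    MANUAL_CHECK = "manual check"


@dataclass
class RegressionAlert:
    skill_name: str
    acknowledged: bool = False


def handle_alert(
    alert: RegressionAlert,
    action: AlertAction,
    run_scenarios: Callable[[str], None],     # reruns the skill's scenario tests
    quarantine_skill: Callable[[str], None],  # temporarily disables the skill
    check_now: Callable[[], None],            # immediate regression check
) -> None:
    """Dispatch the four alert actions from the table above."""
    if action is AlertAction.DISMISS:
        alert.acknowledged = True  # acknowledge and hide the alert
    elif action is AlertAction.RE_EVALUATE:
        run_scenarios(alert.skill_name)
    elif action is AlertAction.QUARANTINE:
        quarantine_skill(alert.skill_name)
    elif action is AlertAction.MANUAL_CHECK:
        check_now()  # don't wait for the next 24-hour sweep
```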
Here's a typical workflow for maintaining skill quality:

1. Write a few scenarios when you create a skill, so you have a baseline to test against.
2. Let the quality score accumulate as you use Meggy, and keep an eye on the trend.
3. After editing a skill's instructions, rerun its scenarios to catch regressions.
4. If you're not sure a skill is earning its keep, run a blind A/B comparison.
5. When a regression alert fires, re-evaluate, quarantine, or revise the skill.
No manual spreadsheets, no guesswork. Just data.