You've created a custom skill that teaches Meggy how to write emails in your company's voice. It works great — today. But how do you know it'll still work after the next model update? Or after you tweak the instructions? Or three months from now when you've forgotten what it even does?
Meggy's Skill Evaluation pipeline gives you three ways to answer: scenario tests that verify a skill does what it claims, blind A/B comparisons that show whether it's making responses better, and regression monitoring that alerts you when quality starts slipping.
Every active skill gets a quality score from 0 to 100, updated automatically as you use Meggy. The score is a weighted blend of four signals:
| Signal | Weight | What It Measures |
|---|---|---|
| 😊 User satisfaction | 50% | Does the AI produce better responses with this skill? Tracks implicit signals — regenerations (bad), positive reactions and completions (good) |
| 📦 Token efficiency | 20% | How much context does the skill add? Skills that bloat the prompt without adding value score lower |
| ⚡ Latency | 15% | Does the skill slow things down? Lower overhead means a higher score |
| 🎯 Injection rate | 15% | Is the skill actually being used? If the router rarely picks it, something's off |
A score appears only after the skill has been injected at least 10 times, which prevents snap judgments on skills you've barely used.
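To make the weighting concrete, here is a minimal sketch of how such a blend could be computed. It is not Meggy's actual implementation: the function name, the signal keys, and the assumption that each signal has already been normalized to a 0–100 sub-score are all hypothetical. Only the weights and the 10-injection minimum come from the text above.

```python
from typing import Mapping, Optional

# Weights in percent, taken from the table above (they sum to 100).
WEIGHTS = {
    "user_satisfaction": 50,
    "token_efficiency": 20,
    "latency": 15,
    "injection_rate": 15,
}
MIN_INJECTIONS = 10  # scores are hidden below this count


def quality_score(signals: Mapping[str, float], injections: int) -> Optional[float]:
    """Blend four sub-scores into one 0-100 score.

    Assumption: every sub-score is already on a 0-100 scale where
    higher is better. Returns None while the skill has too few injections.
    """
    if injections < MIN_INJECTIONS:
        return None
    weighted = sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)
    return round(weighted / 100, 1)


# Hypothetical sub-scores for one skill.
example = {
    "user_satisfaction": 80,
    "token_efficiency": 60,
    "latency": 90,
    "injection_rate": 50,
}
print(quality_score(example, injections=25))  # 73.0
print(quality_score(example, injections=4))   # None
```

Because user satisfaction carries half the weight, a skill that is fast and compact but produces responses people regenerate will still score poorly under this kind of blend.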
| Score | Rating | What to Do |
|---|---|---|
| 80–100 | Excellent | This skill is a keeper |
| 60–79 | Good | Working well, no action needed |
| 40–59 | Fair | Worth reviewing — might need a prompt tweak |
| Below 40 | Poor | May be doing more harm than good — consider editing or disabling |
Each score also includes a trend — rising, stable, or declining — so you can catch problems before they become critical.
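The rating bands above amount to a simple threshold lookup. The sketch below is illustrative only: the function name is hypothetical, and the handling of fractional scores (for example, 79.5 falling into "Good") is an assumption, since the table lists whole-number ranges.

```python
def rating(score: float) -> str:
    """Map a 0-100 quality score to the rating bands from the table.

    Assumption: fractional scores fall into the band whose lower
    bound they have reached (79.5 is "Good", not "Excellent").
    """
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 80:
        return "Excellent"
    if score >= 60:
        return "Good"
    if score >= 40:
        return "Fair"
    return "Poor"


for s in (92, 73, 40, 12):
    print(s, rating(s))
```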
Want to verify a skill works correctly? Create a scenario — a test prompt paired with an expected outcome.
You can create as many scenarios as you want for each skill. Run them all at once to get an evaluation report showing:
Scenario tests are perfect for catching regressions after editing a skill's instructions. Change something, run the scenarios, see if anything broke.
Open a skill's detail view and navigate to the Evaluations tab. You can:
The quality score tells you how well a skill is performing. But what if you want proof that the skill is actually making responses better?
That's what A/B comparisons are for. Meggy runs a blind test:
After all prompts are judged, you get:
| Result | What It Means |
|---|---|
| Skill-on wins | The skill clearly improves responses |
| Skill-off wins | Responses are better without it — time to revise |
| Inconclusive | Not enough difference to call a winner |
Each comparison includes a confidence score (0–100%), so you know how trustworthy the result is. High confidence with "skill-on wins"? Keep it. Low confidence? Run more prompts.
You don't have to remember to check your skills. Meggy does it for you.
A daily regression scan runs in the background, comparing each skill's most recent evaluation against a baseline you've pinned. If a scenario that passed on the baseline now fails, you'll see an alert before bad outputs become a habit.
Pick any evaluation report you're happy with and click Set as Baseline in the Report Detail header. That report becomes the reference point. From then on, Meggy compares every newer run against it:
The eval-based check (primary):
| Severity | Trigger |
|---|---|
| 🚨 Critical | Three or more baseline-passing scenarios now fail, or at least 50% of overlapping baseline-passing scenarios fail |
| ⚠️ Warning | One or two baseline-passing scenarios now fail, and the critical threshold is not met |
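The severity rules in this table can be sketched as a small function. This is a reading of the documented triggers, not Meggy's actual code: the function name and the dictionary representation of a report (scenario name mapped to pass/fail) are hypothetical, and the sketch assumes the critical rule takes precedence whenever both rows would match.

```python
from typing import Mapping, Optional


def eval_regression_severity(
    baseline: Mapping[str, bool], latest: Mapping[str, bool]
) -> Optional[str]:
    """Compare the latest report with the pinned baseline.

    Only scenarios that passed on the baseline AND appear in the
    latest run ("overlapping baseline-passing scenarios") are counted.
    """
    tracked = [name for name, passed in baseline.items() if passed and name in latest]
    failed = sum(1 for name in tracked if not latest[name])
    if failed == 0:
        return None  # nothing regressed, or nothing to compare
    # Critical: three or more failures, or at least 50% of tracked scenarios.
    if failed >= 3 or failed * 2 >= len(tracked):
        return "critical"
    return "warning"  # one or two failures below the 50% mark


baseline = {"greeting": True, "sign_off": True, "tone": True, "length": True, "subject": False}
latest = {"greeting": True, "sign_off": False, "tone": True, "length": True, "subject": False}
print(eval_regression_severity(baseline, latest))  # warning
```

One consequence of the 50% rule as written: when only two baseline-passing scenarios overlap, a single failure already counts as critical. Scenarios that failed on the baseline, such as `subject` in the example, are ignored whatever their current result.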
A second runtime-metrics check runs as a fallback when no baseline exists yet. It uses a 7-day rolling window so a single bad session doesn't trigger false alarms:
| Check | ⚠️ Warning | 🚨 Critical |
|---|---|---|
| Quality score drops | ≥ 10 points | ≥ 20 points |
| Users stripping the skill | ≥ 15% strip rate | ≥ 25% strip rate |
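The fallback thresholds can be read the same way. In the sketch below, the function name and inputs are hypothetical: it assumes the score drop and strip rate have already been computed over the 7-day window (the text does not say how), that the strip rate is a fraction between 0 and 1, and that the more severe of the two checks wins.

```python
from typing import Optional


def runtime_regression_severity(score_drop: float, strip_rate: float) -> Optional[str]:
    """Classify a skill from its 7-day rolling runtime metrics.

    score_drop: points lost by the quality score over the window (assumed precomputed).
    strip_rate: fraction of uses where the user stripped the skill, 0.0 to 1.0.
    """
    if score_drop >= 20 or strip_rate >= 0.25:
        return "critical"
    if score_drop >= 10 or strip_rate >= 0.15:
        return "warning"
    return None


print(runtime_regression_severity(score_drop=12, strip_rate=0.05))  # warning
print(runtime_regression_severity(score_drop=4, strip_rate=0.30))   # critical
print(runtime_regression_severity(score_drop=4, strip_rate=0.05))   # None
```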
When a regression is detected, an alert appears under Settings → Skills → [skill] → Regressions (or on the global Skill Regression page) showing:
If a check produces no alert, the tab tells you why with a friendly status banner — for example, "No baseline marked yet — set one in the Evaluations tab to enable regression detection." or "No regressions — N previously-passing scenarios still passing." So you're never left wondering whether silence means "all good" or "nothing was checked".
For each alert, you have these options:
| Action | Effect |
|---|---|
| Dismiss | Acknowledge and hide the alert |
| Evaluate | Mark the alert as evaluated (you've reviewed it) |
| Quarantine | Temporarily disable the skill while you fix it |
And on the Regressions tab itself you'll find two buttons:
| Button | When to Use |
|---|---|
| Run Evaluation | Run a fresh evaluation against the skill's scenarios, then immediately re-check the result against the baseline. The typical flow after pinning a baseline. |
| Check Now | Re-compare the most recent existing evaluation against the baseline without running anything new — useful when an evaluation has already run elsewhere. |
Here's a typical workflow for maintaining skill quality:
No manual spreadsheets, no guesswork. Just data.