Voice Chat

You're cooking dinner and your hands are covered in flour. You need to know how long to roast the chicken at 375°. You could wash your hands, dry them, unlock your phone, type the question... or you could just say "Hey Meggy, how long do I roast a chicken at 375?" and get an answer spoken back to you.

Voice Chat turns Meggy into a hands-free assistant. It listens for your voice, understands what you say, thinks about the answer using the same powerful AI engine behind text conversations, and speaks the response back to you. Same tools, same memory, same intelligence — just no keyboard required.

How It Works

Voice Chat is built on a four-stage pipeline:

1. Wake Word Detection

Meggy listens for its wake word — "Hey Meggy" — using a lightweight local detection model. This runs continuously in the background without sending any audio to the cloud. When the wake word is detected, the microphone activates and recording begins.

You can also use push-to-talk mode if you prefer — hold a key to speak, release to send. Both modes are available in settings.
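The detection loop above can be sketched as a simple frame-based gate. This is an illustrative stub, not Meggy's actual detector: a real wake word engine runs a small trained model over each audio frame, while `score_frame` here is a stand-in that scores frames by amplitude.

```python
from dataclasses import dataclass

@dataclass
class WakeWordDetector:
    """Hypothetical sketch of a frame-based wake word gate.

    A real detector runs a lightweight local model over audio frames;
    `score_frame` is a stand-in returning a 0..1 confidence.
    """
    threshold: float = 0.8  # confidence needed to trigger recording

    def score_frame(self, frame: list) -> float:
        # Stand-in for a local model: normalized mean absolute amplitude.
        if not frame:
            return 0.0
        return min(1.0, sum(abs(s) for s in frame) / len(frame))

    def process(self, frames):
        """Yield the index of each frame that crosses the threshold,
        i.e. each moment the microphone would activate."""
        for i, frame in enumerate(frames):
            if self.score_frame(frame) >= self.threshold:
                yield i
```

Because scoring happens per frame on-device, nothing is sent anywhere until a trigger fires, which mirrors the privacy property described above.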

2. Speech-to-Text (STT)

Once you finish speaking, your audio is transcribed into text. Meggy supports multiple STT providers:

| Provider | Model | Runs Locally? |
| --- | --- | --- |
| Local Whisper | Whisper.cpp | ✅ Yes (fully on-device) |
| OpenAI | Whisper | Cloud |
| Deepgram | Nova-3 | Cloud |
| Groq | Whisper Large v3 Turbo | Cloud (extremely fast) |
| AssemblyAI | Universal-2 | Cloud (high accuracy) |

If privacy is your priority, the local Whisper option means your voice never leaves your machine. If speed matters most, Groq’s LPU-accelerated Whisper is the fastest cloud option.

3. AI Processing

The transcribed text is processed through the exact same AI pipeline as typed messages. This means Voice Chat has access to everything a typed conversation does: the same tools, the same memory, and the same conversation context.

You can ask voice questions that trigger tool calls — "What's the weather like?", "Turn off the bedroom lights", "Add milk to my shopping list" — and Meggy will use the appropriate tools to fulfill the request.

4. Text-to-Speech (TTS)

The AI's response is spoken back to you using natural-sounding voice synthesis:

| Provider | Voices | Quality |
| --- | --- | --- |
| ElevenLabs | Hundreds of natural voices | Premium, highly expressive |
| OpenAI | 6 built-in voices (Alloy, Echo, Fable, Onyx, Nova, Shimmer) | High quality, fast |
| Deepgram | Aura-2 voices | Low latency, natural |
| Cartesia | Sonic-2 voices | Ultra-low latency, expressive |
| Piper | Local voices | ✅ Runs offline, no cloud needed |
| Web Speech API | Browser built-in voices | Zero setup, works everywhere |

In settings you can choose your preferred provider and voice, and adjust the speaking speed and pitch.
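Those settings can be modeled as a small value object. The field names and clamping ranges below are assumptions for illustration, not Meggy's actual configuration schema:

```python
from dataclasses import dataclass

@dataclass
class VoiceSettings:
    """Illustrative voice settings; field names mirror the options
    described in the text, the ranges are assumptions."""
    provider: str = "openai"
    voice: str = "Nova"
    speed: float = 1.0   # 1.0 = normal speaking rate
    pitch: float = 1.0   # 1.0 = the voice's natural pitch

    def validate(self) -> "VoiceSettings":
        # Clamp to a sane range so extreme values cannot break playback.
        self.speed = min(max(self.speed, 0.5), 2.0)
        self.pitch = min(max(self.pitch, 0.5), 2.0)
        return self
```

Clamping at load time (rather than at playback time) keeps every downstream consumer of the settings simple.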

Voice Activity Detection (VAD)

Meggy uses Voice Activity Detection to know when you've finished speaking. VAD analyzes the audio stream in real time to detect speech boundaries — it knows when you start talking and when you stop, so it doesn't cut you off mid-sentence or wait awkwardly after you've finished.
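The "doesn't cut you off mid-sentence" behavior comes from a hangover window: speech only counts as finished after several consecutive quiet frames. A minimal energy-based sketch (production VADs use trained models, and the threshold and frame counts here are assumptions):

```python
def detect_speech_end(frames, threshold=0.02, hangover=5):
    """Return the index of the frame where speech ended, or None.

    A frame counts as speech when its mean absolute amplitude exceeds
    `threshold`; speech "ends" only after `hangover` consecutive quiet
    frames, so brief pauses between words do not cut the speaker off.
    Illustrative energy-based VAD, not Meggy's actual implementation.
    """
    in_speech = False
    quiet = 0
    for i, frame in enumerate(frames):
        energy = sum(abs(s) for s in frame) / max(len(frame), 1)
        if energy > threshold:
            in_speech = True
            quiet = 0
        elif in_speech:
            quiet += 1
            if quiet >= hangover:
                return i - hangover + 1  # first quiet frame of the run
    return None
```

A short pause (fewer than `hangover` quiet frames) resets nothing irreversibly: the counter simply clears when speech resumes, which is exactly the "no awkward waiting, no premature cutoff" balance described above.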

Platform Support

Voice Chat works on all supported platforms.

Conversational Mode

Want to interrupt Meggy mid-sentence? With conversational mode enabled, you can barge in at any time — correct the AI, change the topic, or redirect the conversation — and Meggy will stop, understand what you meant, and pivot instantly.

An echo gate prevents false triggers by distinguishing your voice from the AI's own speech playing through the speakers. During TTS playback, only audio significantly louder than the speaker baseline passes through to the speech recognizer. After playback ends, a brief cooldown blocks residual echo.

You can tune interrupt sensitivity from 1 (very conservative — only explicit corrections) to 5 (very aggressive — almost any speech interrupts). The default level of 3 works well for most conversations.
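Putting the echo gate and the sensitivity dial together, the logic can be sketched as follows. The threshold multipliers and cooldown duration are assumptions chosen to illustrate the mechanism, not Meggy's tuned values:

```python
class EchoGate:
    """Illustrative echo gate: during TTS playback only audio well above
    the speaker baseline reaches the recognizer; after playback a short
    cooldown blocks residual echo. All numbers are assumptions."""

    # Sensitivity 1 (conservative) .. 5 (aggressive) -> how far above
    # the speaker baseline mic audio must be to count as a barge-in.
    MARGIN = {1: 4.0, 2: 3.0, 3: 2.0, 4: 1.5, 5: 1.2}

    def __init__(self, sensitivity=3, cooldown=0.5):
        self.margin = self.MARGIN[sensitivity]
        self.cooldown = cooldown          # seconds of post-playback blocking
        self.playing = False
        self.baseline = 0.0               # current speaker output level
        self.playback_ended_at = None

    def on_playback(self, speaker_level):
        self.playing = True
        self.baseline = speaker_level

    def on_playback_end(self, now):
        self.playing = False
        self.playback_ended_at = now

    def passes(self, mic_level, now):
        """Should this mic audio reach the speech recognizer?"""
        if self.playing:
            return mic_level > self.baseline * self.margin
        if self.playback_ended_at is not None and now - self.playback_ended_at < self.cooldown:
            return False  # residual echo window
        return True
```

Lower sensitivity levels demand a bigger gap between your voice and the speaker output, which is why level 1 only lets through loud, deliberate interruptions.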

To enable it, go to your agent's channel bindings and toggle Conversational Mode on for the voice channel. See the Conversational Agents article for the full deep-dive.

Live Audio Mode (Speech-to-Speech)

For the ultimate low-latency experience, Meggy supports Live Audio Mode. Instead of the traditional multi-stage pipeline (Voice → STT → LLM → TTS → Voice), Live Audio uses a direct bidirectional speech-to-speech connection.

This eliminates round-trip delays, offering sub-500ms latency that feels like talking to a real human. It supports native conversational barge-in, meaning you can interrupt the AI organically and it will stop instantly, without needing the traditional echo gate.
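Why the latency drops follows from arithmetic: the staged pipeline's delays add up, while speech-to-speech is a single hop. The stage numbers below are hypothetical placeholders, not measured figures for any provider:

```python
# Hypothetical time-to-first-audio per stage, in milliseconds.
# Real numbers vary widely by provider, model, and network.
STAGED = {"stt": 300, "llm_first_token": 400, "tts_first_audio": 200}
LIVE_AUDIO_MS = 450  # single bidirectional speech-to-speech hop (assumed)

def staged_latency_ms(stages):
    """Time to first audible response with the traditional pipeline:
    the stages run sequentially, so their latencies sum."""
    return sum(stages.values())
```

Under these placeholder numbers the staged pipeline needs 900 ms before you hear anything, while the direct connection stays under the sub-500 ms figure cited above.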

Supported Live Audio providers are listed in the Voice & Speech settings.

You can toggle Live Audio Mode on directly from the Voice & Speech settings. When enabled, it bypasses the traditional staged pipeline entirely.

Setting Up Voice Chat

  1. Open Settings → Voice & Speech (in the AI & Models group)
  2. Choose your STT provider (Local Whisper for privacy, Groq for speed, Deepgram for streaming, AssemblyAI for speaker labels, or OpenAI as a reliable default)
  3. Choose your TTS provider (ElevenLabs for the most natural voices, Piper for offline use)
  4. Select a voice from the provider's catalog
  5. Toggle wake word detection on if you want hands-free activation
  6. Start talking!

Tip: You can also configure Voice & Speech during the onboarding wizard when you first set up Meggy.

Voice Chat integrates with all of Meggy's channels — you can start a voice conversation on desktop and continue it via text on WhatsApp, or vice versa. It's all the same conversation, the same memory, the same assistant.