Architecture Deep Dive
How WhatsApp Voice Communication Works in Hermes Agent
From the Baileys WebSocket bridge to Edge TTS voice synthesis — a full end-to-end walk through the pipeline that turns a WhatsApp voice note into an AI reply spoken back as a voice bubble.
System Overview
At a high level the system has three core parts: a Node.js bridge that speaks the WhatsApp Web protocol, a Python gateway that orchestrates the conversation, and a multi-provider TTS engine. The table below breaks these down by file.
| Component | Location | Lang | Role |
|---|---|---|---|
| bridge.js | scripts/whatsapp-bridge/bridge.js | Node | WhatsApp bridge — connects to WhatsApp servers via Baileys |
| whatsapp.py | gateway/platforms/whatsapp.py | Python | Gateway adapter — talks to the bridge over HTTP |
| run.py | gateway/run.py | Python | Gateway main loop — dispatches messages, invokes agent, triggers TTS |
| base.py | gateway/platforms/base.py | Python | Base platform adapter — shared message lifecycle logic |
| tts_tool.py | tools/tts_tool.py | Python | TTS engine — text-to-speech with 10+ providers |
| Baileys | @whiskeysockets/baileys | Node | WhatsApp Web protocol implementation (open source) |
Complete Data Flow
The full path of a voice message, from the user's microphone to the spoken reply that lands back in the chat.
User sends a voice message
The user records a voice note in WhatsApp. The WhatsApp server pushes it to the connected Baileys client via the WhatsApp Web protocol.
bridge.js receives the message
The Node.js bridge uses Baileys (@whiskeysockets/baileys) to connect to WhatsApp Web and listens for the messages.upsert event.

    socket.ev.on('messages.upsert', async ({ messages }) => {
      for (const msg of messages) {
        // Enqueue message internally
        messageQueue.push(msg);
      }
    });

Incoming messages are buffered in an internal queue exposed via GET /messages. WhatsApp voice notes arrive as .ogg Opus audio, tagged mediaType="audio" or "ptt".
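
To make the voice-detection rule concrete, here is a minimal Python sketch that splits a polled batch into voice notes and everything else. Only the mediaType values "audio" and "ptt" come from the description above; the function names and the sample payload fields other than mediaType are hypothetical.

```python
from typing import Any, Dict, List, Tuple

# mediaType values that mark a voice note, per the description above.
VOICE_MEDIA_TYPES = {"audio", "ptt"}


def is_voice_note(msg: Dict[str, Any]) -> bool:
    """Return True when a polled message looks like a WhatsApp voice note."""
    return msg.get("mediaType") in VOICE_MEDIA_TYPES


def split_by_kind(
    messages: List[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Partition a polled batch into (voice, other), preserving order."""
    voice = [m for m in messages if is_voice_note(m)]
    other = [m for m in messages if not is_voice_note(m)]
    return voice, other


# Hypothetical batch; the real GET /messages payload may carry other fields.
batch = [
    {"chatId": "a", "mediaType": "ptt"},
    {"chatId": "b", "body": "hello"},
    {"chatId": "c", "mediaType": "audio"},
]
voice, other = split_by_kind(batch)
```

Messages without a mediaType field fall into the non-voice group, so plain text never triggers the audio download step.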
WhatsAppAdapter polls for messages
The adapter in whatsapp.py runs a long-polling loop, detects voice by mediaType, downloads the audio to a local cache, and builds a MessageEvent.

    import requests

    # Inside the adapter's polling loop: fetch whatever the bridge has queued.
    response = requests.get("http://127.0.0.1:3000/messages")
    for msg_data in response.json():
        event = self._build_message_event(msg_data)
        if event.message_type == MessageType.VOICE:
            # Download and cache the audio file
            event = self._process_voice_attachment(event)
        self.handle_message(event)

BasePlatformAdapter processes the message
handle_message() creates a session key, manages concurrent session locking (supporting interruption), and calls _process_message(), which:
- STT transcription: voice is transcribed via Whisper (local), Groq (cloud), or the OpenAI Whisper API; the text becomes the LLM message body.
- Session management: a session is created/loaded from the database with a ~30 min idle TTL.
- Agent invocation: the text goes to AIAgent.run_conversation(), which feeds it to the LLM (DeepSeek V4 Flash) along with history and skills.
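
The session key and the "locking with interruption" behaviour can be sketched with asyncio. This is one plausible reading, not the code in base.py: the SessionRunner class, the key format, and the choice to cancel the in-flight turn when a newer message arrives are all assumptions made for illustration.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional


class SessionRunner:
    """Runs one agent turn per session; a newer message interrupts the older turn."""

    def __init__(self) -> None:
        self._tasks: Dict[str, asyncio.Task] = {}

    @staticmethod
    def session_key(platform: str, chat_id: str) -> str:
        # Hypothetical key format: one session per platform and chat.
        return f"{platform}:{chat_id}"

    async def handle(
        self, key: str, text: str, agent: Callable[[str], Awaitable[str]]
    ) -> Optional[str]:
        previous = self._tasks.get(key)
        if previous is not None and not previous.done():
            previous.cancel()  # interrupt the in-flight turn for this chat
        task = asyncio.ensure_future(agent(text))
        self._tasks[key] = task
        await asyncio.wait({task})  # returns once the task finishes or is cancelled
        return None if task.cancelled() else task.result()


async def demo():
    async def slow_agent(text: str) -> str:
        await asyncio.sleep(0.05)  # stand-in for an LLM call
        return f"reply to {text}"

    runner = SessionRunner()
    key = runner.session_key("whatsapp", "chat-1")
    first = asyncio.ensure_future(runner.handle(key, "one", slow_agent))
    await asyncio.sleep(0)  # let the first turn start
    second = await runner.handle(key, "two", slow_agent)
    return await first, second


results = asyncio.run(demo())
```

In the demo the first turn is superseded and yields None, while the second turn completes normally. Turns in different chats use different keys and never interrupt each other.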
LLM generates a response
The LLM processes the transcribed message — optionally calling tools like web search or file ops — and returns a response string. It may include MEDIA:<path> tags if the agent called text_to_speech itself.
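
As an illustration of how MEDIA tags might be pulled out of a response string, the sketch below uses a regular expression. It assumes a tag is the literal text MEDIA: followed by a path with no whitespace; the real tag grammar and the real extract_media() implementation may differ, and parse_media_tags is a hypothetical name.

```python
import re
from typing import List, Tuple

# Assumed tag shape: the literal "MEDIA:" followed by a whitespace-free path.
MEDIA_TAG = re.compile(r"MEDIA:(\S+)")


def parse_media_tags(response: str) -> Tuple[List[str], str]:
    """Return (media_paths, text_without_tags) for an agent response."""
    paths = MEDIA_TAG.findall(response)
    text = MEDIA_TAG.sub("", response)
    # Collapse the spaces left behind by removed tags.
    text = re.sub(r"[ \t]+", " ", text).strip()
    return paths, text


paths, text = parse_media_tags("Here is your summary. MEDIA:/tmp/reply.ogg")
```

Returning the cleaned text alongside the paths lets the caller send the audio as a voice note without leaking the raw tag into the text reply.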
Decision — should we send voice?
_should_auto_tts_for_chat() evaluates three conditions: voice.auto_tts is enabled, channels.whatsapp.voiceReply is true, and the agent did not already call TTS itself. Two paths follow.
Path A, auto-TTS (all three conditions hold):

    _process_message()
      → text_to_speech_tool(text=response_text)
      → asyncio.to_thread()       # runs in thread pool
      → returns .mp3 file path
      → play_tts() → send_voice()
      → bridge POST /send-media audio

Path B, agent already called TTS (the response carries MEDIA:<path> tags):

    _process_message()
      → extract_media(response)   # parses MEDIA:<path> tags
      → send_voice()              # sends the generated audio

Audio format conversion (bridge.js)
Before delivery, bridge.js normalizes the format. Supported inputs: ogg / opus / mp3 / wav / m4a — anything that isn't already ogg/opus is converted with ffmpeg on the bridge side.
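
The conversion rule can be sketched as a function that builds an ffmpeg command line only when one is needed. The bridge does this in Node.js; the Python version below is purely illustrative, and the specific flags (libopus encoder, mono output) and the output directory are assumptions rather than the bridge's actual settings.

```python
from pathlib import Path
from typing import List, Optional

# Extensions treated as already being Ogg/Opus, per the text above.
PASSTHROUGH = {".ogg", ".opus"}


def ffmpeg_args(src: str, dst_dir: str = "/tmp") -> Optional[List[str]]:
    """Return an ffmpeg argv converting src to Ogg/Opus, or None if none is needed."""
    path = Path(src)
    if path.suffix.lower() in PASSTHROUGH:
        return None
    dst = Path(dst_dir) / (path.stem + ".ogg")
    return [
        "ffmpeg", "-y",      # overwrite an existing output file
        "-i", str(path),     # input in any format ffmpeg can decode
        "-c:a", "libopus",   # encode the audio stream as Opus
        "-ac", "1",          # mono output (an assumption, common for voice)
        str(dst),
    ]


args = ffmpeg_args("/tmp/hermes_tts_abc123.mp3")
```

The returned list can be passed to subprocess.run(args, check=True); a None result means the file can be sent as-is.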

    TTS produces .mp3 (or other format)
      ↓
    Not ogg/opus? → convert with ffmpeg to ogg
      ↓
    Send with ptt=true (WhatsApp voice bubble mode)
      ↓
    User hears the Edge TTS voice on WhatsApp

Delivery to user
send_voice() POSTs the audio to bridge.js at /send-media; Baileys sends it as PTT (push-to-talk) so it renders as the familiar voice-note bubble. The text reply is sent separately via /send.

    {
      "chatId": "***@lid",
      "filePath": "/tmp/hermes_tts_abc123.mp3",
      "mediaType": "audio",
      "caption": ""
    }

Voice Mode System
A persistent, per-chat voice mode is stored in ~/.hermes/gateway_voice_mode.json. Per-chat settings override the global default.

    {
      "whatsapp:***@lid": "all"
    }

| Mode | Behavior |
|---|---|
| off | No auto-TTS, text-only replies |
| voice_only | Voice input triggers auto-TTS, only the voice reply is sent |
| all | Voice input triggers auto-TTS, both voice + text are sent |
Users toggle modes from the chat with slash commands; the GatewayRunner loads them at connect time and pushes the voice.auto_tts default to every adapter.
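
The mode table can be read as a small decision function. The sketch below is one interpretation: it assumes text input always yields a text-only reply (the table only describes voice input), treats "off" as the fallback default, and uses a made-up chat id. The function names are hypothetical.

```python
import json
from typing import Dict, List

VALID_MODES = {"off", "voice_only", "all"}


def resolve_mode(store: Dict[str, str], platform: str, chat_id: str,
                 default: str = "off") -> str:
    """Per-chat entry wins; otherwise fall back to the global default."""
    return store.get(f"{platform}:{chat_id}", default)


def reply_parts(mode: str, input_was_voice: bool) -> List[str]:
    """Which reply parts to send for one turn, following the mode table."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown voice mode: {mode!r}")
    if mode == "off" or not input_was_voice:
        return ["text"]
    if mode == "voice_only":
        return ["voice"]
    return ["voice", "text"]  # mode == "all"


# Same shape as gateway_voice_mode.json; the chat id is a placeholder.
store = json.loads('{"whatsapp:chat-1": "all"}')
mode = resolve_mode(store, "whatsapp", "chat-1")
parts = reply_parts(mode, input_was_voice=True)
```

Keeping the lookup and the decision separate makes it easy to test each mode without touching the JSON file.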
Configuration Reference
All voice and TTS settings live in ~/.hermes/config.yaml. The master switch is voice.auto_tts.

    tts:
      enabled: true
      replyMode: voice                # Gateway uses voice replies
      provider: edge                  # Edge TTS (free, no API key needed)
      voice: zh-CN-XiaoxiaoNeural     # Microsoft Xiaoxiao (Chinese female)
    voice:
      auto_tts: true                  # Master switch: auto-TTS on all replies
    channels:
      whatsapp:
        voiceReply: true              # WhatsApp channel voice toggle

| Setting | Default | Description |
|---|---|---|
| tts.enabled | true | Master TTS feature toggle |
| tts.replyMode | voice | Reply mode for the gateway |
| tts.provider | edge | TTS engine provider |
| voice.auto_tts | false | Key switch — when true, all replies auto-convert to speech |
| channels.whatsapp.voiceReply | true | WhatsApp channel voice toggle |
If voice replies ever stop and only text arrives, check these three in order: voice.auto_tts, the per-chat gateway_voice_mode.json value, and that tts.enabled is still true.
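
That checklist translates directly into a small diagnostic helper. The sketch assumes the YAML config and the voice-mode JSON have already been parsed into dictionaries; first_voice_blocker is a hypothetical name, and the fallback values mirror the defaults in the table above.

```python
from typing import Dict, Optional


def first_voice_blocker(config: Dict, modes: Dict[str, str],
                        chat_key: str) -> Optional[str]:
    """Return the first setting that would suppress voice replies, or None."""
    # 1. voice.auto_tts defaults to false.
    if not config.get("voice", {}).get("auto_tts", False):
        return "voice.auto_tts is off"
    # 2. A per-chat mode of "off" overrides the global default.
    if modes.get(chat_key) == "off":
        return f"gateway_voice_mode.json sets {chat_key} to off"
    # 3. tts.enabled defaults to true.
    if not config.get("tts", {}).get("enabled", True):
        return "tts.enabled is false"
    return None


config = {"voice": {"auto_tts": True}, "tts": {"enabled": True}}
modes = {"whatsapp:chat-1": "off"}
blocker = first_voice_blocker(config, modes, "whatsapp:chat-1")
```

Because the checks run in the same order as the checklist, the message names the first thing to fix.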
TTS Provider Comparison
The TTS engine (tools/tts_tool.py, 2,283 lines) supports 10+ providers. Edge TTS is the default — free, no API key, high-quality neural voices.
| Provider | Type | API key | Output |
|---|---|---|---|
| edge (default) | Free (Microsoft Edge neural voices) | No | MP3 |
| elevenlabs | Paid | ELEVENLABS_API_KEY | OGG / MP3 |
| openai | Paid | OPENAI_API_KEY | OGG / MP3 |
| mistral (voxtral) | Paid | MISTRAL_API_KEY | OGG |
| neutts / kittentts / piper | Local / Free | No | WAV |
Edge TTS — voice options
| Voice | Gender | Style | Accent |
|---|---|---|---|
| zh-CN-XiaoxiaoNeural | Female | Warm, news | Mandarin |
| zh-CN-XiaoyiNeural | Female | Cheerful, cartoon | Mandarin |
| zh-CN-YunjianNeural | Male | Passionate | Mandarin |
| zh-CN-YunxiNeural | Male | Sunshiny | Mandarin |
| zh-TW-HsiaoChenNeural | Female | Friendly | Taiwanese |
| zh-TW-HsiaoYuNeural | Female | Friendly | Taiwanese |
| zh-TW-YunJheNeural | Male | Friendly | Taiwanese |
The Node.js Bridge in Detail
bridge.js runs as a subprocess of the gateway and exposes an HTTP API on port 3000. Session data persists at ~/.hermes/platforms/whatsapp/session/ (creds.json plus encryption keys).
| Endpoint | Method | Description |
|---|---|---|
| /health | GET | Returns {"status": "connected"} or {"status": "connecting"} |
| /messages | GET | Message polling queue |
| /send | POST | Send text message {chatId, message, replyTo} |
| /send-media | POST | Send media {chatId, filePath, mediaType, caption} |
| /typing | POST | Typing indicator {chatId} |
| /edit | POST | Edit message {chatId, messageId, message} |
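
For illustration, the endpoint table maps onto a thin client like the one below. It only builds the requests and does not send them. The field names come from the table; the JSON content type, the helper names, and the empty caption are assumptions.

```python
import json
import urllib.request

BRIDGE_URL = "http://127.0.0.1:3000"  # bridge address on port 3000


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST for one of the bridge endpoints."""
    return urllib.request.Request(
        BRIDGE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed content type
        method="POST",
    )


def text_request(chat_id: str, message: str) -> urllib.request.Request:
    """Request for POST /send."""
    return build_request("/send", {"chatId": chat_id, "message": message})


def voice_request(chat_id: str, file_path: str) -> urllib.request.Request:
    """Request for POST /send-media with an audio file."""
    return build_request(
        "/send-media",
        {"chatId": chat_id, "filePath": file_path,
         "mediaType": "audio", "caption": ""},
    )


req = voice_request("chat-1", "/tmp/reply.ogg")
```

Passing req to urllib.request.urlopen() would perform the call against a running bridge.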
Pairing & session
Pairing runs via the hermes whatsapp CLI, which generates a QR code for WhatsApp Web. Two modes are supported: self-chat (message yourself, personal use) and bot (receive from any chat, production).
Conversation Session Management
Session handling keeps conversations coherent while avoiding stale context.
- WhatsApp conversation sessions are managed by the gateway, not the bridge.
- Continuous conversations stay within the same session.
- Session TTL is ~30 minutes of idle time before expiry.
- Fresh sessions reload the Manager SOUL.md from disk.
- Each session loads the latest skills and rules configuration.
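
The idle-expiry rule can be sketched with an in-memory store. The gateway keeps sessions in a database, so this is a simplification: the SessionStore class, its injectable clock, and the exact 30-minute value are illustrative assumptions.

```python
import time
from typing import Callable, Dict, List, Tuple

SESSION_TTL_SECONDS = 30 * 60  # roughly 30 minutes of idle time


class SessionStore:
    """Keeps per-chat history and starts fresh after the idle TTL elapses."""

    def __init__(self, ttl: float = SESSION_TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock
        self._sessions: Dict[str, Tuple[float, List[str]]] = {}

    def get(self, key: str) -> List[str]:
        """Return the live session's history, or a new empty one if it expired."""
        now = self._clock()
        entry = self._sessions.get(key)
        if entry is None or now - entry[0] > self._ttl:
            history: List[str] = []  # fresh session: stale context is dropped
        else:
            history = entry[1]
        self._sessions[key] = (now, history)  # every access refreshes the idle timer
        return history


# A fake clock makes the expiry easy to demonstrate.
fake_now = [0.0]
store = SessionStore(ttl=1800, clock=lambda: fake_now[0])
store.get("whatsapp:chat-1").append("hi")
fake_now[0] = 100.0
still_live = list(store.get("whatsapp:chat-1"))
fake_now[0] = 100.0 + 1801.0
after_expiry = list(store.get("whatsapp:chat-1"))
```

The point where a fresh history is created is also where a real implementation would reload SOUL.md and the latest skills configuration.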
Key Files Reference
| File | Purpose |
|---|---|
| gateway/platforms/whatsapp.py | WhatsApp adapter (1,282 lines) |
| gateway/platforms/base.py | Base adapter + auto-TTS logic (3,756 lines) |
| tools/tts_tool.py | TTS engine with 10+ providers (2,283 lines) |
| scripts/whatsapp-bridge/bridge.js | Node.js Baileys bridge |
| ~/.hermes/config.yaml | TTS & voice configuration |
| ~/.hermes/gateway_voice_mode.json | Per-chat voice mode state |
Conclusion
The result is a pipeline in which a user speaks to an AI agent over WhatsApp and hears a natural-sounding reply, on commodity hardware and with zero per-message cost for the TTS component. It combines:
- Baileys for reliable WhatsApp Web connectivity
- STT for transcribing incoming voice notes
- DeepSeek V4 Flash for response generation
- Edge TTS for high-quality, free speech synthesis
- ffmpeg for audio format conversion
- A voice mode system for granular per-chat control