Architecture Deep Dive

How WhatsApp Voice Communication Works in Hermes Agent

From the Baileys WebSocket bridge to Edge TTS voice synthesis: a full end-to-end walkthrough of the pipeline that turns a WhatsApp voice note into an AI reply spoken back as a voice bubble.

Edge TTS · Baileys · Node.js bridge · Python gateway · DeepSeek V4 Flash · ffmpeg

An open-source AI agent framework by Nous Research · source on www.wwaiLAB.com

System Overview

At a high level the system has three core parts: a Node.js bridge that speaks the WhatsApp Web protocol, a Python gateway that orchestrates the conversation, and a multi-provider TTS engine. The individual files and libraries behind them are listed below.

Component	Location	Lang	Role
bridge.js	scripts/whatsapp-bridge/bridge.js	Node	WhatsApp bridge — connects to WhatsApp servers via Baileys
whatsapp.py	gateway/platforms/whatsapp.py	Python	Gateway adapter — talks to the bridge over HTTP
run.py	gateway/run.py	Python	Gateway main loop — dispatches messages, invokes agent, triggers TTS
base.py	gateway/platforms/base.py	Python	Base platform adapter — shared message lifecycle logic
tts_tool.py	tools/tts_tool.py	Python	TTS engine — text-to-speech with 10+ providers
Baileys	@whiskeysockets/baileys	Node	WhatsApp Web protocol implementation (open source)

Complete Data Flow

The full path of a voice message, from the user's microphone to the spoken reply that lands back in the chat.

Architecture — inbound pipeline

Outbound: tts_tool.py → ffmpeg (→ ogg/opus) → bridge.js /send-media → WhatsApp voice note (PTT)

User sends a voice message

The user records a voice note in WhatsApp. The WhatsApp server pushes it to the connected Baileys client via the WhatsApp Web protocol.

bridge.js receives the message

The Node.js bridge uses Baileys (@whiskeysockets/baileys) to connect to WhatsApp Web and listens for the messages.upsert event.

bridge.js (JavaScript): event handling

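// Buffer incoming messages so the gateway can poll them via GET /messages.
const messageQueue = [];

// `socket` is the connected Baileys socket created earlier in bridge.js.
socket.ev.on('messages.upsert', ({ messages }) => {
  for (const msg of messages) {
    messageQueue.push(msg);
  }
});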

Incoming messages are buffered in an internal queue exposed via GET /messages. WhatsApp voice notes arrive as .ogg Opus audio, tagged mediaType="audio" or "ptt".

WhatsAppAdapter polls for messages

The adapter in whatsapp.py runs a long-polling loop, detects voice by mediaType, downloads the audio to a local cache, and builds a MessageEvent.

whatsapp.py (Python): polling the bridge

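import requests

# Runs inside the adapter's polling loop; `self` is the WhatsAppAdapter instance.
response = requests.get("http://127.0.0.1:3000/messages", timeout=30)
response.raise_for_status()  # surface bridge errors instead of parsing a bad body
for msg_data in response.json():
    event = self._build_message_event(msg_data)
    if event.message_type == MessageType.VOICE:
        # Download the audio and cache it locally
        event = self._process_voice_attachment(event)
    self.handle_message(event)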
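
The voice detection step can be illustrated with a small sketch. It assumes each polled message is a dict with an optional mediaType field; the classify_message helper and the enum members other than VOICE are hypothetical names used here for illustration, not the adapter's actual code.

```python
from enum import Enum


class MessageType(Enum):
    TEXT = "text"
    VOICE = "voice"
    OTHER = "other"


# Media types the bridge uses for voice notes, per the description above.
VOICE_MEDIA_TYPES = {"audio", "ptt"}


def classify_message(msg_data: dict) -> MessageType:
    """Map a raw bridge message (assumed shape) to a MessageType."""
    media_type = (msg_data.get("mediaType") or "").lower()
    if media_type in VOICE_MEDIA_TYPES:
        return MessageType.VOICE
    if media_type:
        # Some other attachment, such as an image or a document.
        return MessageType.OTHER
    return MessageType.TEXT
```

Only messages classified as VOICE would need the download-and-cache step before being handed to handle_message().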

BasePlatformAdapter processes the message

handle_message() creates a session key, manages concurrent session locking (supporting interruption), and calls _process_message(), which:

STT transcription — voice is transcribed via Whisper (local), Groq (cloud), or the OpenAI Whisper API; the text becomes the LLM message body.
Session management — a session is created/loaded from the database with a ~30 min idle TTL.
Agent invocation — the text goes to AIAgent.run_conversation(), fed to the LLM (DeepSeek V4 Flash) with history + skills.

LLM generates a response

The LLM processes the transcribed message — optionally calling tools like web search or file ops — and returns a response string. It may include MEDIA:<path> tags if the agent called text_to_speech itself.
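
To make the MEDIA:<path> convention concrete, here is a minimal sketch of how such tags could be pulled out of a response string. It assumes a tag is the literal prefix MEDIA: followed by a path without whitespace; the regular expression and the return shape are illustrative assumptions, not the gateway's real extract_media() implementation.

```python
import re

# Assumed tag shape: "MEDIA:" followed by a path containing no whitespace.
MEDIA_TAG = re.compile(r"MEDIA:(\S+)")


def extract_media(response: str) -> tuple[str, list[str]]:
    """Return (response text without MEDIA tags, list of media paths)."""
    paths = MEDIA_TAG.findall(response)
    text = MEDIA_TAG.sub("", response)
    # Collapse the run of spaces left behind where a tag was removed.
    text = re.sub(r"[ \t]{2,}", " ", text).strip()
    return text, paths
```

The cleaned text can then be sent as the text reply while each extracted path is sent as audio.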

Decision — should we send voice?

_should_auto_tts_for_chat() evaluates three conditions: voice.auto_tts enabled, channels.whatsapp.voiceReply true, and the agent did not already call TTS itself. Two paths follow:

Path A: Auto-TTS (automatic voice reply)

_process_message()
  → text_to_speech_tool(text=response_text)
    → asyncio.to_thread()   # runs in thread pool
    → returns .mp3 file path
  → play_tts() → send_voice()
    → bridge POST /send-media audio

Path B: Manual TTS (agent-triggered)

_process_message()
  → extract_media(response)   # parses MEDIA:<path> tags
  → send_voice()              # sends the generated audio
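
The three-condition check that selects between these paths can be sketched as a pure function. This is an illustration only: the function name, the nested-dict config shape, and the fallback values (taken from the defaults listed in the configuration reference) are assumptions, not the body of _should_auto_tts_for_chat().

```python
def should_auto_tts(config: dict, agent_called_tts: bool) -> bool:
    """Return True when the gateway should synthesize the reply itself (Path A)."""
    # Condition 1: the global auto-TTS switch (assumed default: off).
    auto_tts = config.get("voice", {}).get("auto_tts", False)
    # Condition 2: the WhatsApp channel toggle (assumed default: on).
    voice_reply = (
        config.get("channels", {}).get("whatsapp", {}).get("voiceReply", True)
    )
    # Condition 3: the agent has not already produced audio (that would be Path B).
    return bool(auto_tts) and bool(voice_reply) and not agent_called_tts
```

For example, a config with voice.auto_tts set to true yields Path A unless the agent already emitted a MEDIA tag, in which case the existing audio is sent instead.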

Audio format conversion (bridge.js)

Before delivery, bridge.js normalizes the format. Supported inputs: ogg / opus / mp3 / wav / m4a — anything that isn't already ogg/opus is converted with ffmpeg on the bridge side.

bridge.js (lines 606–631)

TTS produces .mp3 (or other format)
        ↓
Not ogg/opus? → convert with ffmpeg to ogg
        ↓
Send with ptt=true (WhatsApp voice bubble mode)
        ↓
User hears the Edge TTS voice on WhatsApp
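
The conversion step itself lives in bridge.js, but the logic is easy to show in Python for illustration. The sketch below assumes the decision is made by file extension and uses standard ffmpeg options (libopus encoder, mono); the helper names and the exact flags chosen are assumptions, not a transcription of the bridge code.

```python
import subprocess
from pathlib import Path

# Extensions treated as already WhatsApp-ready (an assumption for this sketch).
READY_EXTENSIONS = {".ogg", ".opus"}


def needs_conversion(path: str) -> bool:
    """True when the file is not already ogg/opus, judged by extension."""
    return Path(path).suffix.lower() not in READY_EXTENSIONS


def build_ffmpeg_command(src: str, dst: str) -> list[str]:
    """Build an ffmpeg command that re-encodes src as Opus audio in dst."""
    return [
        "ffmpeg",
        "-y",               # overwrite the output file if it exists
        "-i", src,          # input file (mp3, wav, m4a, ...)
        "-c:a", "libopus",  # encode the audio stream with Opus
        "-ac", "1",         # downmix to mono
        dst,                # the .ogg extension selects the Ogg container
    ]


def to_voice_note(src: str) -> str:
    """Return a path to an ogg/opus file, converting with ffmpeg if needed."""
    if not needs_conversion(src):
        return src
    dst = str(Path(src).with_suffix(".ogg"))
    subprocess.run(build_ffmpeg_command(src, dst), check=True, capture_output=True)
    return dst
```

Separating the command builder from the subprocess call keeps the decision logic testable on machines where ffmpeg is not installed.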

Delivery to user

send_voice() POSTs the audio to bridge.js at /send-media; Baileys sends it as PTT (push-to-talk) so it renders as the familiar voice-note bubble. The text reply is sent separately via /send.

POST /send-media (JSON body)

{
  "chatId": "***@lid",
  "filePath": "/tmp/hermes_tts_abc123.mp3",
  "mediaType": "audio",
  "caption": ""
}
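
A client for this endpoint might look like the following sketch, which uses only the standard library. The payload fields and the port come from the text above; the function names, the timeout, and returning the HTTP status code are assumptions made for illustration rather than the adapter's real send_voice().

```python
import json
import urllib.request

BRIDGE_URL = "http://127.0.0.1:3000"  # local bridge address used in this article


def build_send_media_request(
    chat_id: str, file_path: str, caption: str = ""
) -> urllib.request.Request:
    """Build the POST /send-media request for an audio file."""
    payload = {
        "chatId": chat_id,
        "filePath": file_path,
        "mediaType": "audio",
        "caption": caption,
    }
    return urllib.request.Request(
        f"{BRIDGE_URL}/send-media",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_voice(chat_id: str, file_path: str, timeout: float = 30.0) -> int:
    """Send the request to the bridge and return the HTTP status code."""
    request = build_send_media_request(chat_id, file_path)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Note that the bridge receives a file path, not the audio bytes, so the gateway and the bridge need access to the same filesystem.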

Voice Mode System

A persistent, per-chat voice mode is stored in ~/.hermes/gateway_voice_mode.json. Per-chat settings override the global default.

~/.hermes/gateway_voice_mode.json

{
  "whatsapp:***@lid": "all"
}

Mode	Behavior
off	No auto-TTS, text-only replies
voice_only	Voice input triggers auto-TTS, only the voice reply is sent
all	Voice input triggers auto-TTS, both voice + text are sent

Users toggle modes from the chat with slash commands; the GatewayRunner loads them at connect time and pushes the voice.auto_tts default to every adapter.

/voice all · /voice off · /voice voice_only
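
As a rough sketch of how this state could be handled, the code below parses a /voice command, persists the chosen mode to a JSON file keyed by chat, and resolves the effective mode with a fallback to a global default. All function names are hypothetical, and how the real gateway maps voice.auto_tts onto a default mode is not covered by the source, so the default is passed in explicitly.

```python
import json
from pathlib import Path
from typing import Optional

VALID_MODES = {"off", "voice_only", "all"}


def parse_voice_command(text: str) -> Optional[str]:
    """Return the requested mode for '/voice <mode>', or None if not a match."""
    parts = text.strip().split()
    if len(parts) == 2 and parts[0] == "/voice" and parts[1] in VALID_MODES:
        return parts[1]
    return None


def load_modes(path: Path) -> dict:
    """Load the per-chat mode map; a missing file means no overrides."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def set_mode(path: Path, chat_key: str, mode: str) -> None:
    """Persist a per-chat mode such as {'whatsapp:<chat id>': 'all'}."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown voice mode: {mode!r}")
    modes = load_modes(path)
    modes[chat_key] = mode
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(modes, indent=2), encoding="utf-8")


def resolve_mode(modes: dict, chat_key: str, default_mode: str) -> str:
    """Per-chat settings override the global default."""
    return modes.get(chat_key, default_mode)
```

A chat that has never issued a /voice command simply falls through to the global default.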

Configuration Reference

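All voice and TTS settings live in ~/.hermes/config.yaml. tts.enabled is the master toggle for the TTS feature, and voice.auto_tts is the key switch that turns replies into speech automatically.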

tts.enabled: master TTS toggle

voice.auto_tts: auto-speak every reply

channels.whatsapp.voiceReply: WhatsApp channel toggle

~/.hermes/config.yaml (YAML)

tts:
  enabled: true
  replyMode: voice              # Gateway uses voice replies
  provider: edge                # Edge TTS (free, no API key needed)
  voice: zh-CN-XiaoxiaoNeural   # Microsoft Xiaoxiao (Chinese female)

voice:
  auto_tts: true                # Key switch: auto-TTS on all replies

channels:
  whatsapp:
    voiceReply: true            # WhatsApp channel voice toggle

Setting	Default	Description
tts.enabled	true	Master TTS feature toggle
tts.replyMode	voice	Reply mode for the gateway
tts.provider	edge	TTS engine provider
voice.auto_tts	false	Key switch — when true, all replies auto-convert to speech
channels.whatsapp.voiceReply	true	WhatsApp channel voice toggle
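
One way to think about the defaults in this table is as a base dictionary that the user's config.yaml overrides key by key. The sketch below shows that idea with a recursive merge over already-parsed config data; whether Hermes resolves its configuration this way is an assumption, and deep_merge is a hypothetical helper.

```python
import copy

# Defaults as listed in the table above.
DEFAULTS = {
    "tts": {"enabled": True, "replyMode": "voice", "provider": "edge"},
    "voice": {"auto_tts": False},
    "channels": {"whatsapp": {"voiceReply": True}},
}


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where values in override win over values in base."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections instead of replacing them wholesale.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged
```

With this approach a user file that only sets voice.auto_tts to true still inherits tts.provider: edge and the other defaults.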

Tip

If voice replies ever stop and only text arrives, check these three in order: voice.auto_tts, the per-chat gateway_voice_mode.json value, and that tts.enabled is still true.

TTS Provider Comparison

The TTS engine (tools/tts_tool.py, 2,283 lines) supports 10+ providers. Edge TTS is the default — free, no API key, high-quality neural voices.

Provider	Type	API key	Output
edge (default)	Free (Microsoft Edge neural voices)	No	MP3
elevenlabs	Paid	ELEVENLABS_API_KEY	OGG / MP3
openai	Paid	OPENAI_API_KEY	OGG / MP3
mistral (voxtral)	Paid	MISTRAL_API_KEY	OGG
neutts / kittentts / piper	Local / Free	No	WAV
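
The API key column suggests a simple availability check before a provider is used. The sketch below encodes the table as a mapping and falls back to Edge TTS when a paid provider's key is missing. The fallback behavior and the select_provider name are assumptions for illustration; the source does not say how tts_tool.py handles a missing key.

```python
from typing import Mapping, Optional

# Environment variable each provider needs (None means no key is required),
# taken from the comparison table above.
REQUIRED_KEYS: dict[str, Optional[str]] = {
    "edge": None,
    "elevenlabs": "ELEVENLABS_API_KEY",
    "openai": "OPENAI_API_KEY",
    "mistral": "MISTRAL_API_KEY",
    "neutts": None,
    "kittentts": None,
    "piper": None,
}


def select_provider(
    preferred: str, env: Mapping[str, str], fallback: str = "edge"
) -> str:
    """Return preferred if it is usable, otherwise the key-free fallback."""
    if preferred not in REQUIRED_KEYS:
        raise ValueError(f"unknown TTS provider: {preferred!r}")
    key_name = REQUIRED_KEYS[preferred]
    if key_name is None or env.get(key_name):
        return preferred
    return fallback
```

In practice env would be os.environ; passing it in as a parameter keeps the function easy to test.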

Edge TTS — voice options

Voice	Gender	Style	Accent
zh-CN-XiaoxiaoNeural	Female	Warm, news	Mandarin
zh-CN-XiaoyiNeural	Female	Cheerful, cartoon	Mandarin
zh-CN-YunjianNeural	Male	Passionate	Mandarin
zh-CN-YunxiNeural	Male	Sunshiny	Mandarin
zh-TW-HsiaoChenNeural	Female	Friendly	Taiwanese Mandarin
zh-TW-HsiaoYuNeural	Female	Friendly	Taiwanese Mandarin
zh-TW-YunJheNeural	Male	Friendly	Taiwanese Mandarin
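
For readers who want to try one of these voices directly, the sketch below uses the open-source edge-tts Python package. That tts_tool.py uses this particular package is an assumption; the synthesize wrapper and the output path in the usage note are hypothetical.

```python
async def synthesize(text: str, voice: str, out_path: str) -> str:
    """Synthesize text with an Edge TTS voice and write the audio to out_path."""
    # Imported lazily so this module still loads when edge-tts is not installed.
    import edge_tts

    communicate = edge_tts.Communicate(text, voice)
    await communicate.save(out_path)  # Edge TTS output is MP3
    return out_path
```

A call such as asyncio.run(synthesize("你好", "zh-CN-XiaoxiaoNeural", "/tmp/hermes_tts_demo.mp3")) requires network access. The resulting MP3 would still need the ogg/opus conversion before it can be delivered as a WhatsApp voice note.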

The Node.js Bridge in Detail

bridge.js runs as a subprocess of the gateway and exposes an HTTP API on port 3000. Session data persists at ~/.hermes/platforms/whatsapp/session/ (creds.json + encryption keys).

Endpoint	Method	Description
/health	GET	Returns {"status": "connected"} or {"status": "connecting"}
/messages	GET	Message polling queue
/send	POST	Send text message {chatId, message, replyTo}
/send-media	POST	Send media {chatId, filePath, mediaType, caption}
/typing	POST	Typing indicator {chatId}
/edit	POST	Edit message {chatId, messageId, message}
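
Because the bridge reports "connecting" until the WhatsApp session is up, a client may want to wait on /health before polling or sending. The sketch below shows one way to do that with the standard library. The retry count, delay, and function names are assumptions; the fetch and sleep parameters exist only so the loop can be exercised without a running bridge.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_health(base_url: str = "http://127.0.0.1:3000") -> dict:
    """GET /health and return the parsed JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))


def wait_until_connected(
    fetch: Callable[[], dict] = fetch_health,
    attempts: int = 30,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the health endpoint until it reports 'connected' or attempts run out."""
    for _ in range(attempts):
        try:
            if fetch().get("status") == "connected":
                return True
        except OSError:
            # Covers connection refused and URLError: the bridge is not up yet.
            pass
        sleep(delay)
    return False
```

Returning False instead of raising lets the caller decide whether to restart the bridge subprocess or surface a pairing prompt.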

Pairing & session

Pairing runs via the hermes whatsapp CLI, which generates a QR code for WhatsApp Web. Two modes are supported: self-chat (message yourself, personal use) and bot (receive from any chat, production).

Conversation Session Management

Session handling keeps conversations coherent while avoiding stale context.

Conversation sessions are managed by the gateway, not the bridge (the bridge only persists the WhatsApp login session).
Continuous conversations stay within the same session.
Session TTL is ~30 minutes of idle time before expiry.
Fresh sessions reload the Manager SOUL.md from disk.
Each session loads the latest skills and rules configuration.
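
The idle-expiry rule can be sketched with a small in-memory store. The real gateway keeps sessions in a database, so this is a simplified illustration under that stated difference; the Session and SessionStore names, the exact 30-minute constant, and the injectable clock are assumptions.

```python
import time
from dataclasses import dataclass, field
from typing import Callable

SESSION_TTL_SECONDS = 30 * 60  # roughly 30 minutes of idle time


@dataclass
class Session:
    key: str
    history: list = field(default_factory=list)
    last_active: float = 0.0


class SessionStore:
    """Return the live session for a chat, or a fresh one after idle expiry."""

    def __init__(
        self,
        ttl: float = SESSION_TTL_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl
        self._clock = clock
        self._sessions: dict[str, Session] = {}

    def get(self, key: str) -> Session:
        now = self._clock()
        session = self._sessions.get(key)
        if session is None or now - session.last_active > self._ttl:
            # Expired or unknown: start over with empty history.
            session = Session(key=key)
            self._sessions[key] = session
        session.last_active = now  # every message refreshes the idle timer
        return session
```

Creating a fresh session is also the natural point to reload SOUL.md and the latest skills, as the list above describes.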

Key Files Reference

File	Purpose
gateway/platforms/whatsapp.py	WhatsApp adapter (1,282 lines)
gateway/platforms/base.py	Base adapter + auto-TTS logic (3,756 lines)
tools/tts_tool.py	TTS engine with 10+ providers (2,283 lines)
scripts/whatsapp-bridge/bridge.js	Node.js Baileys bridge
~/.hermes/config.yaml	TTS & voice configuration
~/.hermes/gateway_voice_mode.json	Per-chat voice mode state

Conclusion

The result is a well-orchestrated pipeline in which a user speaks to an AI agent over WhatsApp and hears a natural-sounding reply, running on commodity hardware with zero per-message cost for the TTS component. It combines:

Baileys for reliable WhatsApp Web connectivity
STT for transcribing incoming voice notes
DeepSeek V4 Flash for response generation
Edge TTS for high-quality, free speech synthesis
ffmpeg for audio format conversion
A voice mode system for granular per-chat control