Hermes.Agent / Architecture Deep Dive www.wwaiLAB.com


How WhatsApp Voice Communication Works in Hermes Agent

From the Baileys WebSocket bridge to Edge TTS voice synthesis — a full end-to-end walk through the pipeline that turns a WhatsApp voice note into an AI reply spoken back as a voice bubble.

Edge TTS · Baileys · Node.js bridge · Python gateway · DeepSeek V4 Flash · ffmpeg
01

System Overview

At a high level the system has three layers: a Node.js bridge that speaks the WhatsApp Web protocol, a Python gateway that orchestrates the conversation, and a multi-provider TTS engine. The table lists the five source files that implement them, plus the Baileys library the bridge depends on.

| Component | Location | Lang | Role |
|---|---|---|---|
| bridge.js | scripts/whatsapp-bridge/bridge.js | Node | WhatsApp bridge: connects to WhatsApp servers via Baileys |
| whatsapp.py | gateway/platforms/whatsapp.py | Python | Gateway adapter: talks to the bridge over HTTP |
| run.py | gateway/run.py | Python | Gateway main loop: dispatches messages, invokes agent, triggers TTS |
| base.py | gateway/platforms/base.py | Python | Base platform adapter: shared message lifecycle logic |
| tts_tool.py | tools/tts_tool.py | Python | TTS engine: text-to-speech with 10+ providers |
| Baileys | @whiskeysockets/baileys | Node | WhatsApp Web protocol implementation (open source) |
02

Complete Data Flow

The full path of a voice message, from the user's microphone to the spoken reply that lands back in the chat.

Architecture — inbound pipeline
Outbound: tts_tool.py → ffmpeg (→ ogg/opus) → bridge.js /send-media → WhatsApp voice note (PTT)
1

User sends a voice message

The user records a voice note in WhatsApp. The WhatsApp server pushes it to the connected Baileys client via the WhatsApp Web protocol.

2

bridge.js receives the message

The Node.js bridge uses Baileys (@whiskeysockets/baileys) to connect to WhatsApp Web and listens for the messages.upsert event.

bridge.js: event handling

```javascript
socket.ev.on('messages.upsert', async ({ messages }) => {
  for (const msg of messages) {
    // Enqueue message internally
    messageQueue.push(msg);
  }
});
```

Incoming messages are buffered in an internal queue exposed via GET /messages. WhatsApp voice notes arrive as .ogg Opus audio, tagged mediaType="audio" or "ptt".

3

WhatsAppAdapter polls for messages

The adapter in whatsapp.py runs a long-polling loop, detects voice by mediaType, downloads the audio to a local cache, and builds a MessageEvent.

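whatsapp.py: polling the bridge

```python
import requests

BRIDGE_URL = "http://127.0.0.1:3000"

# Inside the adapter's polling loop: fetch whatever the bridge has queued.
response = requests.get(f"{BRIDGE_URL}/messages", timeout=30)
response.raise_for_status()  # fail loudly rather than parse an error body
for msg_data in response.json():
    event = self._build_message_event(msg_data)
    if event.message_type == MessageType.VOICE:
        # Download and cache the audio file
        event = self._process_voice_attachment(event)
    self.handle_message(event)
```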
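How the adapter tells a voice note from a text message can be sketched as follows. This is a minimal illustration, not the adapter's actual code: the `MessageEvent` fields and the payload keys `chatId` and `body` are assumptions, and only the `mediaType` values "audio" and "ptt" come from the description above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, Optional


class MessageType(Enum):
    TEXT = "text"
    VOICE = "voice"


# Media types the bridge is described as using for voice notes.
VOICE_MEDIA_TYPES = {"audio", "ptt"}


@dataclass
class MessageEvent:
    # Hypothetical shape; the real MessageEvent likely carries more fields.
    chat_id: str
    text: str
    message_type: MessageType
    media_path: Optional[str] = None


def build_message_event(msg_data: Dict[str, Any]) -> MessageEvent:
    """Map one bridge payload to an event (payload keys are assumed)."""
    is_voice = msg_data.get("mediaType") in VOICE_MEDIA_TYPES
    return MessageEvent(
        chat_id=msg_data.get("chatId", ""),
        text=msg_data.get("body", ""),
        message_type=MessageType.VOICE if is_voice else MessageType.TEXT,
    )
```

Keeping the classification in one small function means the polling loop only has to branch on `message_type`.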
4

BasePlatformAdapter processes the message

handle_message() creates a session key, manages concurrent session locking (supporting interruption), and calls _process_message(), which:

  • STT transcription — voice is transcribed via Whisper (local), Groq (cloud), or the OpenAI Whisper API; the text becomes the LLM message body.
  • Session management — a session is created/loaded from the database with a ~30 min idle TTL.
  • Agent invocation — the text goes to AIAgent.run_conversation(), fed to the LLM (DeepSeek V4 Flash) with history + skills.
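One way to read "session locking with interruption" is that a newer message for the same session cancels the reply still in flight. The sketch below shows that reading only; `SessionRunner` and its methods are hypothetical names, and the real `handle_message()` may lock differently.

```python
import asyncio
from typing import Awaitable, Callable, Dict


class SessionRunner:
    """Runs one handler per session key; a newer message interrupts the older one."""

    def __init__(self) -> None:
        self._tasks: Dict[str, asyncio.Task] = {}

    @staticmethod
    def session_key(platform: str, chat_id: str) -> str:
        # Assumed key format, mirroring the "whatsapp:<chat>" keys used for voice mode.
        return f"{platform}:{chat_id}"

    async def submit(self, key: str, handler: Callable[[], Awaitable[str]]) -> str:
        previous = self._tasks.get(key)
        if previous is not None and not previous.done():
            previous.cancel()  # interrupt the in-flight reply for this session
        task = asyncio.ensure_future(handler())
        self._tasks[key] = task
        try:
            return await task
        finally:
            # Only clear the slot if a newer task has not replaced this one.
            if self._tasks.get(key) is task:
                del self._tasks[key]
```

A caller whose handler was interrupted sees `asyncio.CancelledError`, so it can skip sending the stale reply.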
5

LLM generates a response

The LLM processes the transcribed message — optionally calling tools like web search or file ops — and returns a response string. It may include MEDIA:<path> tags if the agent called text_to_speech itself.
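Pulling `MEDIA:<path>` tags out of a response might look like the sketch below. It assumes a tag is the literal prefix `MEDIA:` followed by a path with no whitespace; the real `extract_media()` may use a different grammar or return type.

```python
import re
from typing import List, Tuple

# Assumed tag grammar: "MEDIA:" followed by a run of non-whitespace characters.
MEDIA_TAG = re.compile(r"MEDIA:(\S+)")


def extract_media(response: str) -> Tuple[str, List[str]]:
    """Return the response text without tags, plus the media paths found."""
    paths = MEDIA_TAG.findall(response)
    text = MEDIA_TAG.sub("", response)
    # Collapse the gaps left behind by removed tags.
    text = re.sub(r"[ \t]{2,}", " ", text).strip()
    return text, paths
```

Returning the cleaned text alongside the paths lets the caller send the text reply without leaking file paths into the chat.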

6

Decision — should we send voice?

_should_auto_tts_for_chat() evaluates three conditions: voice.auto_tts enabled, channels.whatsapp.voiceReply true, and the agent did not already call TTS itself. Two paths follow:

Path A: Auto-TTS (automatic voice reply)

```text
_process_message()
  → text_to_speech_tool(text=response_text)
    → asyncio.to_thread()   # runs in thread pool
    → returns .mp3 file path
  → play_tts() → send_voice()
    → bridge POST /send-media audio
```

Path B: Manual TTS (agent-triggered)

```text
_process_message()
  → extract_media(response)   # parses MEDIA:<path> tags
  → send_voice()              # sends the generated audio
```
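The three-condition check that selects between the two paths can be sketched as a pure function. This is an illustration under assumptions: the config is taken to be a nested mapping shaped like config.yaml, and the fallback values follow the defaults in the configuration table (`voice.auto_tts` false, `channels.whatsapp.voiceReply` true). The real `_should_auto_tts_for_chat()` may consult more state.

```python
from typing import Any, Mapping


def should_auto_tts(config: Mapping[str, Any], agent_called_tts: bool) -> bool:
    """Return True only when all three conditions for Path A hold."""
    voice_cfg = config.get("voice", {})
    whatsapp_cfg = config.get("channels", {}).get("whatsapp", {})
    return (
        bool(voice_cfg.get("auto_tts", False))           # global auto-TTS switch
        and bool(whatsapp_cfg.get("voiceReply", True))   # WhatsApp channel toggle
        and not agent_called_tts                         # avoid speaking twice
    )
```

When this returns False because the agent already produced audio, Path B takes over and the existing file is sent as-is.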
7

Audio format conversion (bridge.js)

Before delivery, bridge.js normalizes the format. Supported inputs: ogg / opus / mp3 / wav / m4a — anything that isn't already ogg/opus is converted with ffmpeg on the bridge side.

bridge.js (lines 606–631)

```text
TTS produces .mp3 (or other format)
        ↓
Not ogg/opus? → convert with ffmpeg to ogg
        ↓
Send with ptt=true (WhatsApp voice bubble mode)
        ↓
User hears the Edge TTS voice on WhatsApp
```
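The conversion rule can be illustrated in Python, even though the bridge itself does this in Node.js. The function names, the 32k bitrate, and the choice of the libopus encoder are assumptions for the sketch; only "anything that is not ogg/opus goes through ffmpeg" comes from the text.

```python
import subprocess
from pathlib import Path
from typing import List

# Files with these suffixes are treated as already deliverable.
PASSTHROUGH_SUFFIXES = {".ogg", ".opus"}


def needs_conversion(path: str) -> bool:
    return Path(path).suffix.lower() not in PASSTHROUGH_SUFFIXES


def ffmpeg_command(src: str, dst: str) -> List[str]:
    """Build an ffmpeg call that re-encodes src as Opus audio in an Ogg container."""
    # -y overwrites dst, -vn drops any video or cover-art stream.
    return ["ffmpeg", "-y", "-i", src, "-vn", "-c:a", "libopus", "-b:a", "32k", dst]


def to_voice_note(src: str) -> str:
    """Return a path to an ogg/opus file, converting with ffmpeg if needed."""
    if not needs_conversion(src):
        return src
    dst = str(Path(src).with_suffix(".ogg"))
    subprocess.run(ffmpeg_command(src, dst), check=True, capture_output=True)
    return dst
```

Passing the arguments as a list avoids shell quoting problems with unusual file names.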
8

Delivery to user

send_voice() POSTs the audio to bridge.js at /send-media; Baileys sends it as PTT (push-to-talk) so it renders as the familiar voice-note bubble. The text reply is sent separately via /send.

POST /send-media

```json
{
  "chatId": "***@lid",
  "filePath": "/tmp/hermes_tts_abc123.mp3",
  "mediaType": "audio",
  "caption": ""
}
```
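A standard-library sketch of the delivery call follows. The payload keys match the JSON above and the address matches the polling example; the function names, the timeout, and the assumption that the bridge expects a JSON body with a `Content-Type: application/json` header are illustrative.

```python
import json
import urllib.request

BRIDGE_URL = "http://127.0.0.1:3000"


def build_send_media_request(
    chat_id: str, file_path: str, media_type: str = "audio", caption: str = ""
) -> urllib.request.Request:
    """Build the POST /send-media request without sending it."""
    payload = {
        "chatId": chat_id,
        "filePath": file_path,
        "mediaType": media_type,
        "caption": caption,
    }
    return urllib.request.Request(
        f"{BRIDGE_URL}/send-media",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_voice(chat_id: str, file_path: str, timeout: float = 30.0) -> int:
    """Send the audio through the bridge and return the HTTP status code."""
    request = build_send_media_request(chat_id, file_path)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Splitting request construction from sending keeps the payload easy to inspect in tests without a running bridge.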
03

Voice Mode System

A persistent, per-chat voice mode is stored in ~/.hermes/gateway_voice_mode.json. Per-chat settings override the global default.

~/.hermes/gateway_voice_mode.json

```json
{
  "whatsapp:***@lid": "all"
}
```

| Mode | Behavior |
|---|---|
| off | No auto-TTS, text-only replies |
| voice_only | Voice input triggers auto-TTS, only the voice reply is sent |
| all | Voice input triggers auto-TTS, both voice + text are sent |

Users toggle modes from the chat with slash commands; the GatewayRunner loads them at connect time and pushes the voice.auto_tts default to every adapter.

/voice all · /voice off · /voice voice_only
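Resolving the effective mode for a chat and turning it into a reply plan could look like this. The sketch assumes the JSON file maps `platform:chat_id` keys to mode strings, as in the example above. The `default_mode` parameter and the rule that text input always gets a text-only reply are assumptions, since the table only describes voice input.

```python
import json
from pathlib import Path
from typing import Dict, Tuple

VALID_MODES = {"off", "voice_only", "all"}


def load_voice_modes(path: Path) -> Dict[str, str]:
    """Read per-chat modes; a missing or malformed file yields no overrides."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return {}
    if not isinstance(data, dict):
        return {}
    return {key: mode for key, mode in data.items() if mode in VALID_MODES}


def resolve_mode(
    modes: Dict[str, str], platform: str, chat_id: str, default_mode: str = "off"
) -> str:
    """The per-chat setting wins; otherwise fall back to the global default."""
    return modes.get(f"{platform}:{chat_id}", default_mode)


def plan_reply(mode: str, is_voice_input: bool) -> Tuple[bool, bool]:
    """Return (send_voice, send_text) for one reply."""
    if mode == "off" or not is_voice_input:
        return (False, True)
    if mode == "voice_only":
        return (True, False)
    return (True, True)  # mode == "all"
```

For example, `plan_reply("voice_only", True)` yields a voice bubble with no accompanying text message.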
04

Configuration Reference

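Global voice and TTS settings live in ~/.hermes/config.yaml; per-chat overrides live in gateway_voice_mode.json. tts.enabled turns the TTS feature on or off, and voice.auto_tts is the switch that decides whether replies are spoken automatically.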

  • tts.enabled: master TTS toggle
  • voice.auto_tts: auto-speak every reply
  • channels.whatsapp.voiceReply: WhatsApp channel

~/.hermes/config.yaml

```yaml
tts:
  enabled: true
  replyMode: voice              # Gateway uses voice replies
  provider: edge                # Edge TTS (free, no API key needed)
  voice: zh-CN-XiaoxiaoNeural   # Microsoft Xiaoxiao (Chinese female)

voice:
  auto_tts: true                # Master switch: auto-TTS on all replies

channels:
  whatsapp:
    voiceReply: true            # WhatsApp channel voice toggle
```
| Setting | Default | Description |
|---|---|---|
| tts.enabled | true | Master TTS feature toggle |
| tts.replyMode | voice | Reply mode for the gateway |
| tts.provider | edge | TTS engine provider |
| voice.auto_tts | false | Key switch: when true, all replies auto-convert to speech |
| channels.whatsapp.voiceReply | true | WhatsApp channel voice toggle |
Tip

If voice replies ever stop and only text arrives, check these three in order: voice.auto_tts, the per-chat gateway_voice_mode.json value, and that tts.enabled is still true.
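The three checks in the tip can be written as a small diagnostic. This is a hypothetical helper, not part of Hermes: it assumes the config has been parsed into a nested mapping and that the chat's mode string has already been read from gateway_voice_mode.json, with `None` meaning no per-chat entry.

```python
from typing import Any, List, Mapping, Optional


def diagnose_missing_voice(config: Mapping[str, Any], chat_mode: Optional[str]) -> List[str]:
    """List likely reasons for text-only replies, in the order the tip suggests."""
    problems: List[str] = []
    if not config.get("voice", {}).get("auto_tts", False):
        problems.append("voice.auto_tts is not true")
    if chat_mode == "off":
        problems.append("per-chat voice mode is 'off'")
    if not config.get("tts", {}).get("enabled", True):
        problems.append("tts.enabled is false")
    return problems
```

An empty list means all three settings look right, so the cause lies elsewhere.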

05

TTS Provider Comparison

The TTS engine (tools/tts_tool.py, 2,283 lines) supports 10+ providers. Edge TTS is the default — free, no API key, high-quality neural voices.

| Provider | Type | API key | Output |
|---|---|---|---|
| edge (default) | Free, Microsoft Edge neural voices | No | MP3 |
| elevenlabs | Paid | ELEVENLABS_API_KEY | OGG / MP3 |
| openai | Paid | OPENAI_API_KEY | OGG / MP3 |
| mistral (voxtral) | Paid | MISTRAL_API_KEY | OGG |
| neutts / kittentts / piper | Local / Free | No | WAV |

Edge TTS — voice options

| Voice | Gender | Style | Accent |
|---|---|---|---|
| zh-CN-XiaoxiaoNeural | Female | Warm, news | Mandarin |
| zh-CN-XiaoyiNeural | Female | Cheerful, cartoon | Mandarin |
| zh-CN-YunjianNeural | Male | Passionate | Mandarin |
| zh-CN-YunxiNeural | Male | Sunshiny | Mandarin |
| zh-TW-HsiaoChenNeural | Female | Friendly | Taiwanese |
| zh-TW-HsiaoYuNeural | Female | Friendly | Taiwanese |
| zh-TW-YunJheNeural | Male | Friendly | Taiwanese |
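Selecting a voice from the table and synthesizing speech with it might look like the sketch below. It assumes the third-party `edge-tts` Python package (`pip install edge-tts`); whether tts_tool.py calls that package this way is not confirmed here. `pick_voice` and `synthesize` are hypothetical helper names.

```python
from typing import Dict, Tuple

# Voice name -> (gender, accent), copied from the table above.
EDGE_VOICES: Dict[str, Tuple[str, str]] = {
    "zh-CN-XiaoxiaoNeural": ("Female", "Mandarin"),
    "zh-CN-XiaoyiNeural": ("Female", "Mandarin"),
    "zh-CN-YunjianNeural": ("Male", "Mandarin"),
    "zh-CN-YunxiNeural": ("Male", "Mandarin"),
    "zh-TW-HsiaoChenNeural": ("Female", "Taiwanese"),
    "zh-TW-HsiaoYuNeural": ("Female", "Taiwanese"),
    "zh-TW-YunJheNeural": ("Male", "Taiwanese"),
}


def pick_voice(gender: str, accent: str, default: str = "zh-CN-XiaoxiaoNeural") -> str:
    """Return the first listed voice matching gender and accent, else the default."""
    for name, (voice_gender, voice_accent) in EDGE_VOICES.items():
        if voice_gender == gender and voice_accent == accent:
            return name
    return default


async def synthesize(text: str, voice: str, out_path: str) -> None:
    """Write spoken text to out_path as MP3 (requires network access)."""
    import edge_tts  # imported lazily so pick_voice works without the package

    await edge_tts.Communicate(text, voice).save(out_path)
```

A call such as `asyncio.run(synthesize("你好", pick_voice("Female", "Mandarin"), "/tmp/reply.mp3"))` would produce the MP3 that the bridge later converts to ogg/opus.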
06

The Node.js Bridge in Detail

bridge.js runs as a subprocess of the gateway and serves an HTTP API on port 3000. Its WhatsApp Web session data persists at ~/.hermes/platforms/whatsapp/session/ (creds.json + encryption keys).

| Endpoint | Method | Description |
|---|---|---|
| /health | GET | Returns {"status": "connected"} or {"status": "connecting"} |
| /messages | GET | Message polling queue |
| /send | POST | Send text message {chatId, message, replyTo} |
| /send-media | POST | Send media {chatId, filePath, mediaType, caption} |
| /typing | POST | Typing indicator {chatId} |
| /edit | POST | Edit message {chatId, messageId, message} |
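Because the bridge reports "connecting" until the WhatsApp session is up, a client may want to wait on /health before polling. The sketch below is one way to do that with the standard library; the function names, attempt count, and delay are assumptions.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_status(base_url: str = "http://127.0.0.1:3000", timeout: float = 5.0) -> str:
    """Return the "status" field from GET /health."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as response:
        return json.load(response).get("status", "")


def wait_until_connected(
    get_status: Callable[[], str] = fetch_status,
    attempts: int = 30,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the health endpoint until it reports "connected" or attempts run out."""
    for _ in range(attempts):
        try:
            if get_status() == "connected":
                return True
        except OSError:
            pass  # bridge process not listening yet; urllib errors subclass OSError
        sleep(delay)
    return False
```

Injecting `get_status` and `sleep` keeps the loop testable without a running bridge.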

Pairing & session

Pairing runs via the hermes whatsapp CLI, which generates a QR code for WhatsApp Web. Two modes are supported: self-chat (message yourself, personal use) and bot (receive from any chat, production).

07

Conversation Session Management

Session handling keeps conversations coherent while avoiding stale context.

  • Conversation sessions are managed by the gateway, not the bridge; the bridge only persists the WhatsApp Web login session.
  • Continuous conversations stay within the same session.
  • Session TTL is ~30 minutes of idle time before expiry.
  • Fresh sessions reload the Manager SOUL.md from disk.
  • Each session loads the latest skills and rules configuration.
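The idle-expiry rule can be sketched with an in-memory store. The real gateway is described as loading sessions from a database, so the `SessionStore` class, its fields, and the exact 1,800-second threshold are illustrative assumptions.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List

IDLE_TTL_SECONDS = 30 * 60  # "~30 minutes" in the text; the exact value is assumed


@dataclass
class Session:
    key: str
    history: List[str] = field(default_factory=list)
    last_active: float = 0.0


class SessionStore:
    """Hands out the existing session unless it has been idle longer than the TTL."""

    def __init__(
        self, ttl: float = IDLE_TTL_SECONDS, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._ttl = ttl
        self._clock = clock
        self._sessions: Dict[str, Session] = {}

    def get(self, key: str) -> Session:
        now = self._clock()
        session = self._sessions.get(key)
        if session is None or now - session.last_active > self._ttl:
            # Fresh session: this is where SOUL.md, skills, and rules would be reloaded.
            session = Session(key=key)
            self._sessions[key] = session
        session.last_active = now
        return session
```

Every access refreshes `last_active`, so an ongoing conversation never expires mid-exchange.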
08

Key Files Reference

| File | Purpose |
|---|---|
| gateway/platforms/whatsapp.py | WhatsApp adapter (1,282 lines) |
| gateway/platforms/base.py | Base adapter + auto-TTS logic (3,756 lines) |
| tools/tts_tool.py | TTS engine with 10+ providers (2,283 lines) |
| scripts/whatsapp-bridge/bridge.js | Node.js Baileys bridge |
| ~/.hermes/config.yaml | TTS & voice configuration |
| ~/.hermes/gateway_voice_mode.json | Per-chat voice mode state |
09

Conclusion

The result is a pipeline in which a user speaks to an AI agent over WhatsApp and hears a natural-sounding reply, running on commodity hardware with zero per-message cost for the TTS component. It combines:

  • Baileys for reliable WhatsApp Web connectivity
  • STT for transcribing incoming voice notes
  • DeepSeek V4 Flash for response generation
  • Edge TTS for high-quality, free speech synthesis
  • ffmpeg for audio format conversion
  • A voice mode system for granular per-chat control