Voice mode
Speak your prompt; hear the agent's progress.
Shipped in v0.6.0. Available in every Hover-enabled dev server today.
How it works
Hold the round 🎙 button to the right of Send (push-to-talk). The icon switches to a live elapsed-seconds counter and a mint glow pulses around the button while listening. Speak — in 中文 or English — and release.
- Mid-sentence pauses don't cut you off. Recognition stays open until you release.
- Interim transcripts echo into the textarea live, so you see what the engine is hearing.
- On release, the final transcript fires
submit()— same code path as typing — and the textarea clears.
While the agent works, key step events get spoken aloud:
tool_use— humanized:Opening page/Clicking Submit/Filling form. Diagnostic-only tools (browser_snapshot,browser_take_screenshot) are deliberately silent.text— first sentence of the agent's narration, capped at 60 chars.session_end—Done in N steps./Stopped./Something went wrong.
Press Stop, click a new prompt, or open the mic again to interrupt any in-flight utterance immediately.
Languages
Hover detects the language of your latest prompt (CJK regex) and routes:
| Layer | Behaviour |
|---|---|
| STT | SpeechRecognition.lang defaults to zh-CN. The hint flips to en-US if your last AI text reply was English. |
| TTS phrasing | humanizeTool() emits 点击登录按钮 / Clicking Submit per language. session_end says 完成,共 5 步 / Done in 5 steps. |
| Voice picker | pickVoice() scores by name markers — prefers Siri / Premium / Enhanced / Google / Neural over legacy system voices. Within zh, prefers zh-CN over zh-TW. |
The user's prompt language wins over the agent's reply language — claude / codex commonly answer in English even after a Chinese prompt, but TTS narration sticks with your context.
Privacy
- No cloud round-trip. Both STT and TTS use the browser's built-in Web Speech API.
- Chrome 139+ runs STT on-device via SODA (Speech On-Device API). The first push-to-talk installs the language packs lazily (
SpeechRecognition.install({ langs: ['zh-CN', 'en-US'], processLocally: true })); subsequent recordings never leave the browser. - No new API keys. No
.enventries. No service-side changes — the Node service didn't grow a single network call. - Firefox ships
SpeechRecognitionbehind a flag (dom.webspeech.recognition.enable). The mic button is disabled with a "use Chrome" tooltip.
Settings
Open the ⚙ button in the widget header. The settings overlay has a Speech narration toggle:
- On (default) — agent step events get spoken aloud.
- Off — silent. In-flight utterances are cancelled immediately on toggle.
State persists in localStorage under a key (hover:settings:v1) that's independent of your chat history, so future settings additions won't bump the chat schema.
Implementation notes (for the curious)
All voice logic lives in packages/widget-bootstrap/src/widget/voice.js — pure helpers + two thin factories around SpeechRecognition and SpeechSynthesis.
Two non-obvious correctness fixes:
continuous = true+ final-transcript accumulation. With the defaultcontinuous = false, the engine ends recognition after ~1s of silence — so a thoughtful pause would cut off your sentence. Hover setscontinuous = trueand concatenatesisFinalsegments across multipleonresultbatches; the result fires only whenstop()is called.voiceschangedrace. Chrome returns[]fromgetVoices()on first call, with the real list arriving on avoiceschangedevent. Without an explicit wait, the first utterance picks a null voice and the engine reads Chinese text with an English voice.waitForVoices()fires on WS open (well before any speech is queued) so the first sentence already has a correct voice.
Roadmap
Voice mode v0.6.0 is intentionally MVP. The next stages:
- Pro mode (
HOVER_VOICE_PROVIDER=deepgram+elevenlabs) — opt-in cloud STT/TTS for users who want higher-fidelity 中文 recognition or sub-100ms TTS first-token latency. LLM stays on your local CLI agent — only the I/O adapters are cloud. - Per-voice picker in settings — let users override the auto-selected voice from a dropdown.
- STT language hint UI — small flag indicator on the mic icon showing which language the engine is currently configured for.
See Roadmap for the full picture.