Voice mode

Speak your prompt; hear the agent's progress.

Shipped in v0.6.0. Available in every Hover-enabled dev server today.

How it works

Hold the round 🎙 button to the right of Send (push-to-talk). The icon switches to a live elapsed-seconds counter and a mint glow pulses around the button while listening. Speak — in 中文 or English — and release.

  • Mid-sentence pauses don't cut you off. Recognition stays open until you release.
  • Interim transcripts echo into the textarea live, so you see what the engine is hearing.
  • On release, the final transcript fires submit() — same code path as typing — and the textarea clears.

While the agent works, key step events get spoken aloud:

  • tool_use — humanized: Opening page / Clicking Submit / Filling form. Diagnostic-only tools (browser_snapshot, browser_take_screenshot) are deliberately silent.
  • text — first sentence of the agent's narration, capped at 60 chars.
  • session_endDone in N steps. / Stopped. / Something went wrong.

Press Stop, click a new prompt, or open the mic again to interrupt any in-flight utterance immediately.

Languages

Hover detects the language of your latest prompt (CJK regex) and routes:

LayerBehaviour
STTSpeechRecognition.lang defaults to zh-CN. The hint flips to en-US if your last AI text reply was English.
TTS phrasinghumanizeTool() emits 点击登录按钮 / Clicking Submit per language. session_end says 完成,共 5 步 / Done in 5 steps.
Voice pickerpickVoice() scores by name markers — prefers Siri / Premium / Enhanced / Google / Neural over legacy system voices. Within zh, prefers zh-CN over zh-TW.

The user's prompt language wins over the agent's reply language — claude / codex commonly answer in English even after a Chinese prompt, but TTS narration sticks with your context.

Privacy

  • No cloud round-trip. Both STT and TTS use the browser's built-in Web Speech API.
  • Chrome 139+ runs STT on-device via SODA (Speech On-Device API). The first push-to-talk installs the language packs lazily (SpeechRecognition.install({ langs: ['zh-CN', 'en-US'], processLocally: true })); subsequent recordings never leave the browser.
  • No new API keys. No .env entries. No service-side changes — the Node service didn't grow a single network call.
  • Firefox ships SpeechRecognition behind a flag (dom.webspeech.recognition.enable). The mic button is disabled with a "use Chrome" tooltip.

Settings

Open the button in the widget header. The settings overlay has a Speech narration toggle:

  • On (default) — agent step events get spoken aloud.
  • Off — silent. In-flight utterances are cancelled immediately on toggle.

State persists in localStorage under a key (hover:settings:v1) that's independent of your chat history, so future settings additions won't bump the chat schema.

Implementation notes (for the curious)

All voice logic lives in packages/widget-bootstrap/src/widget/voice.js — pure helpers + two thin factories around SpeechRecognition and SpeechSynthesis.

Two non-obvious correctness fixes:

  • continuous = true + final-transcript accumulation. With the default continuous = false, the engine ends recognition after ~1s of silence — so a thoughtful pause would cut off your sentence. Hover sets continuous = true and concatenates isFinal segments across multiple onresult batches; the result fires only when stop() is called.
  • voiceschanged race. Chrome returns [] from getVoices() on first call, with the real list arriving on a voiceschanged event. Without an explicit wait, the first utterance picks a null voice and the engine reads Chinese text with an English voice. waitForVoices() fires on WS open (well before any speech is queued) so the first sentence already has a correct voice.

Roadmap

Voice mode v0.6.0 is intentionally MVP. The next stages:

  • Pro mode (HOVER_VOICE_PROVIDER=deepgram+elevenlabs) — opt-in cloud STT/TTS for users who want higher-fidelity 中文 recognition or sub-100ms TTS first-token latency. LLM stays on your local CLI agent — only the I/O adapters are cloud.
  • Per-voice picker in settings — let users override the auto-selected voice from a dropdown.
  • STT language hint UI — small flag indicator on the mic icon showing which language the engine is currently configured for.

See Roadmap for the full picture.