Quick Answer
Chain three models: Whisper (speech → text), an LLM (text → response), and a TTS model such as OpenVoice or StyleTTS 2 (text → speech). Stream between steps for sub-second latency. Deploy as a web app with WebRTC mic access or as a mobile app via Capacitor.
- Time to working demo: 1-2 days
- Cost: $0.01-0.05 per 60-second conversation
- Latency target: <800ms total
What You'll Need
- Whisper API or local whisper.cpp
- Streaming LLM (OpenAI-compatible)
- TTS: StyleTTS 2, OpenVoice, or hosted (Cartesia, Deepgram Aura)
- Next.js + WebRTC for web; Capacitor for mobile
Steps
- Set up mic capture. Use the `MediaRecorder` API. Ask AI: "Generate a React hook that captures 16kHz mono audio from the mic and emits 100ms chunks as WebM."
- Stream STT. Send audio chunks to the Whisper API via WebSocket or HTTP stream. For local inference, use `whisper.cpp` compiled to WASM. Target: first partial transcript <300ms.
- VAD (voice activity detection). Use Silero VAD (WASM build) to detect end-of-speech. Otherwise you wait forever for the user to "finish."
- Trigger LLM on end-of-speech. Stream transcript to LLM. Prompt: "You are a concise voice assistant. Keep answers under 40 words unless asked for detail."
- Stream TTS. As LLM tokens arrive, buffer to sentence boundaries, send each sentence to TTS, play audio chunks as they arrive. This is the key to low latency.
- Barge-in support. If the user starts speaking while TTS plays, immediately stop playback and start a new STT pass. Use a state machine: IDLE → LISTENING → THINKING → SPEAKING.
- Deploy. Web: Next.js to Vercel/Coolify. Mobile: wrap in Capacitor, request mic permission on first launch.
- Measure latency. Log the time from mic-stop to the first audio byte. Aim for <800ms. Profile the pipeline and optimize the slowest step first.
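The "buffer to sentence boundaries" step is the part that is easiest to get subtly wrong, so here is a minimal sketch of it. `SentenceBuffer` is a hypothetical helper, not part of any library named above, and it assumes a sentence ends at `.`, `!` or `?` followed by whitespace. That rule is naive: it will also split after abbreviations such as "Dr. ", so treat it as a starting point rather than a finished splitter.

```typescript
// Hypothetical helper: turns a stream of LLM tokens into complete sentences
// that can be sent to TTS one at a time.
class SentenceBuffer {
  private pending = "";

  // Append one token; return every sentence that is now complete.
  push(token: string): string[] {
    this.pending += token;
    const sentences: string[] = [];
    // Assumption: a sentence ends at ., ! or ? followed by whitespace.
    const boundary = /[.!?]+\s+/g;
    let start = 0;
    let match: RegExpExecArray | null;
    while ((match = boundary.exec(this.pending)) !== null) {
      const end = match.index + match[0].length;
      sentences.push(this.pending.slice(start, end).trim());
      start = end;
    }
    // Keep the unfinished tail for the next token.
    this.pending = this.pending.slice(start);
    return sentences;
  }

  // Call once the LLM stream ends to emit whatever text is left.
  flush(): string[] {
    const rest = this.pending.trim();
    this.pending = "";
    return rest.length > 0 ? [rest] : [];
  }
}
```

In use, each token from the LLM stream goes through `push`, and every returned sentence is sent to TTS right away; `flush` handles a final sentence that has no trailing whitespace. Waiting for whitespace after the punctuation keeps decimals like "3.14" in one piece, at the cost of holding the last sentence until the next token or the end of the stream.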
Common Mistakes
- No streaming: Waiting for full transcript + full LLM + full TTS = 5s latency. Stream everything.
- Ignoring barge-in: Users hate being talked over. Detect interruption immediately.
- No VAD: Silence detection via volume threshold is unreliable. Use Silero.
- Long LLM responses: Set a low `max_tokens`. Voice users want brevity.
- No echo cancellation: The mic picks up the TTS speaker output. Enable `echoCancellation: true`.
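Barge-in is easier to get right when the IDLE → LISTENING → THINKING → SPEAKING state machine from the Steps is written as a pure transition function. The sketch below is one possible shape, not a prescribed design: the event names and the `stopPlayback` flag are assumptions made for illustration, and interrupting during THINKING goes beyond the barge-in rule described earlier, which only covers SPEAKING.

```typescript
type State = "IDLE" | "LISTENING" | "THINKING" | "SPEAKING";

// Assumed event names; wire them to your VAD and audio player callbacks.
type VoiceEvent = "SPEECH_START" | "SPEECH_END" | "FIRST_AUDIO" | "PLAYBACK_DONE";

interface Transition {
  next: State;
  // True when the caller should stop TTS playback and drop the pending reply.
  stopPlayback: boolean;
}

function transition(state: State, event: VoiceEvent): Transition {
  switch (state) {
    case "IDLE":
      if (event === "SPEECH_START") return { next: "LISTENING", stopPlayback: false };
      break;
    case "LISTENING":
      if (event === "SPEECH_END") return { next: "THINKING", stopPlayback: false };
      break;
    case "THINKING":
      if (event === "FIRST_AUDIO") return { next: "SPEAKING", stopPlayback: false };
      // Assumption: the user resumed speaking before the reply started.
      if (event === "SPEECH_START") return { next: "LISTENING", stopPlayback: true };
      break;
    case "SPEAKING":
      // Barge-in: the user talks over the assistant.
      if (event === "SPEECH_START") return { next: "LISTENING", stopPlayback: true };
      if (event === "PLAYBACK_DONE") return { next: "IDLE", stopPlayback: false };
      break;
  }
  // Any other event leaves the state unchanged.
  return { next: state, stopPlayback: false };
}
```

Keeping the function free of side effects means the barge-in path can be unit-tested without a microphone: the caller reads `stopPlayback`, halts the audio element, and restarts STT.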
Top Tools
| Tool | Best For | Price |
|---|---|---|
| Whisper API | STT | $0.006/min |
| Cartesia | Low-latency TTS | $0.013/1K chars |
| StyleTTS 2 | Self-hosted TTS | Free |
| Silero VAD | End-of-speech | Free |
| LiveKit | WebRTC infra | Free tier |
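The per-unit prices in the table can be turned into a rough per-conversation estimate. The sketch below uses only the Whisper and Cartesia rates listed above. The LLM cost is passed in as a plain number because the table does not list one, and the function name and example inputs are illustrative assumptions.

```typescript
interface Pricing {
  sttPerMinute: number;  // USD per minute of transcribed audio
  ttsPer1kChars: number; // USD per 1,000 synthesized characters
}

// Rates copied from the table above (Whisper API and Cartesia).
const TABLE_PRICES: Pricing = { sttPerMinute: 0.006, ttsPer1kChars: 0.013 };

// Rough cost of one conversation. llmCost is an assumption supplied by the
// caller, since it depends entirely on the model and token counts.
function estimateCost(
  userSpeechSeconds: number,
  replyChars: number,
  llmCost: number,
  prices: Pricing = TABLE_PRICES
): number {
  const stt = (userSpeechSeconds / 60) * prices.sttPerMinute;
  const tts = (replyChars / 1000) * prices.ttsPer1kChars;
  return stt + tts + llmCost;
}
```

For example, `estimateCost(30, 450, 0.002)` models a 60-second conversation with 30 seconds of user speech, 450 characters of spoken reply, and an assumed $0.002 of LLM usage. It comes to about $0.011, which sits at the low end of the $0.01-0.05 range quoted in the Quick Answer.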
Conclusion
Voice is the next interface, and streaming at every step is what makes it feel magical. Build one narrow voice assistant (doctor's scribe, cooking helper, language tutor) and nail the latency. Everything else follows.
