Voice Setup Guide
This guide covers setting up and using Harombe's voice features (Phase 3).
Overview
Harombe's voice features enable natural spoken interaction with the AI agent using:
- Speech-to-Text (STT): Whisper models (OpenAI) for transcription
- Text-to-Speech (TTS): Piper (fast, local) or Coqui (high-quality) for voice synthesis
- Push-to-Talk: Hold the SPACE key to record audio
- Real-time Processing: Streaming audio input/output
Hardware Requirements
Minimum Requirements
- CPU: Multi-core processor (4+ cores recommended)
- RAM: 4GB available
- VRAM: 2GB for Whisper base model
- Audio: Microphone and speakers/headphones
Recommended Configuration
- CPU: 8+ cores
- RAM: 8GB available
- VRAM: 4GB+ for Whisper medium model
- GPU: NVIDIA (CUDA) or Apple Silicon (MPS) for acceleration
- Audio: USB microphone or headset for better quality
Model VRAM Requirements
| Whisper Model | VRAM | Accuracy | Speed |
|---|---|---|---|
| tiny | 1GB | Good | Very Fast |
| base | 2GB | Better | Fast |
| small | 3GB | Good | Medium |
| medium | 4GB | Very Good | Slower |
| large-v2 | 8GB | Excellent | Slow |
| large-v3 | 10GB | Best | Slowest |
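The table can also be used programmatically to pick a model for a given GPU. The following is a minimal sketch, not part of Harombe: the helper name `largest_model_for` is hypothetical, and the VRAM figures are simply copied from the table above.

```python
from typing import Optional

# VRAM figures (GB) copied from the table above, ordered smallest to largest.
WHISPER_VRAM_GB = {
    "tiny": 1,
    "base": 2,
    "small": 3,
    "medium": 4,
    "large-v2": 8,
    "large-v3": 10,
}


def largest_model_for(vram_gb: float) -> Optional[str]:
    """Return the largest listed model that fits in vram_gb, or None if none fits."""
    fitting = [name for name, need in WHISPER_VRAM_GB.items() if need <= vram_gb]
    # Dicts preserve insertion order, so the last fitting entry is the largest.
    return fitting[-1] if fitting else None
```

For example, `largest_model_for(6)` selects `medium`, since `large-v2` is listed at 8GB. In practice some headroom should be left for the TTS engine and other GPU workloads.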
Installation
System Dependencies
macOS
Ubuntu/Debian
Fedora/RHEL
Windows
Audio drivers are typically included. If issues occur:
Python Dependencies
# Already included if you installed harombe
pip install harombe
# Or install with voice extras explicitly
pip install "harombe[voice]"
Configuration
Basic Configuration
Create or edit harombe.yaml:
voice:
  enabled: true
  stt:
    model: base # Whisper model size
    language: null # Auto-detect language
    device: auto # auto, cpu, cuda, mps
    compute_type: default # default, int8, float16, float32
  tts:
    engine: piper # piper or coqui
    model: en_US-lessac-medium
    speed: 1.0
    device: auto
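With `device: auto`, the engine has to resolve the setting to a concrete device, and this guide gives the preference order as CUDA, then MPS, then CPU. The sketch below illustrates that resolution only. The function `pick_device` is hypothetical, and how Harombe actually probes the hardware is not stated in this guide, so availability is passed in as plain flags.

```python
def pick_device(requested: str, cuda_available: bool, mps_available: bool) -> str:
    """Resolve a `device` setting using the documented order CUDA > MPS > CPU."""
    if requested != "auto":
        # An explicit choice (cpu, cuda, mps) is passed through unchanged.
        return requested
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"
```

If PyTorch happens to be installed, the two flags could be obtained from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`. That is an assumption about the environment, not a statement about Harombe's internals.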
STT Configuration Options
model: Whisper model size
- tiny: 39M params, 1GB VRAM, fastest
- base: 74M params, 2GB VRAM, good balance
- small: 244M params, 3GB VRAM
- medium: 769M params, 4GB VRAM, recommended
- large-v2: 1550M params, 8GB VRAM
- large-v3: 1550M params, 10GB VRAM, most accurate
language: Language code (ISO 639-1)
- null: Auto-detect language
- "en": English
- "es": Spanish
- "fr": French
- "de": German
- "zh": Chinese
- See Whisper docs for full list
device: Compute device
- "auto": Auto-select (CUDA > MPS > CPU)
- "cpu": Force CPU (slower)
- "cuda": NVIDIA GPU
- "mps": Apple Silicon GPU
compute_type: Precision mode
- "default": Auto-select based on device
- "int8": Integer quantization (faster, less accurate)
- "float16": Half precision (good balance)
- "float32": Full precision (slower, most accurate)
TTS Configuration Options
engine: TTS backend
- "piper": Fast, local, neural TTS (recommended for real-time, supports all Python versions)
- "coqui": High-quality, slower (better for production audio, Python <3.11 only)
Piper Models (engine: piper):
- en_US-lessac-medium: Male voice, high quality
- en_US-amy-medium: Female voice, high quality
- en_US-lessac-low: Male voice, faster
- en_GB-southern_english_female-medium: British female
Coqui Models (engine: coqui, requires Python 3.10 or earlier):
- Install with: pip install 'harombe[coqui]' (Python 3.10 or earlier)
- tts_models/en/ljspeech/tacotron2-DDC: High quality
- tts_models/en/vctk/vits: Multi-speaker
- tts_models/multilingual/multi-dataset/your_tts: Multilingual
speed: Speech rate multiplier
- 0.5: Half speed (clearer for transcription)
- 1.0: Normal speed (default)
- 1.5: 50% faster
- 2.0: Double speed (maximum)
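Because the options take a fixed set of values, a configuration can be checked before any engine is started. The following is a rough sketch under stated assumptions: `validate_voice_config` is a hypothetical helper, not a Harombe API, and it checks only the values listed in this guide. The upper speed bound of 2.0 comes from the list above. The lower bound (greater than 0) is an assumption.

```python
from typing import List

# Allowed values as listed in this guide.
STT_MODELS = {"tiny", "base", "small", "medium", "large-v2", "large-v3"}
DEVICES = {"auto", "cpu", "cuda", "mps"}
COMPUTE_TYPES = {"default", "int8", "float16", "float32"}
TTS_ENGINES = {"piper", "coqui"}


def validate_voice_config(voice: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issues found."""
    errors = []
    stt = voice.get("stt", {})
    tts = voice.get("tts", {})
    checks = [
        ("stt.model", stt.get("model"), STT_MODELS),
        ("stt.device", stt.get("device"), DEVICES),
        ("stt.compute_type", stt.get("compute_type"), COMPUTE_TYPES),
        ("tts.engine", tts.get("engine"), TTS_ENGINES),
        ("tts.device", tts.get("device"), DEVICES),
    ]
    for key, value, allowed in checks:
        # Missing keys are skipped so that defaults can apply.
        if value is not None and value not in allowed:
            errors.append(f"{key}: {value!r} is not one of {sorted(allowed)}")
    speed = tts.get("speed")
    # Assumed range: above 0 and at most 2.0 (the documented maximum).
    if speed is not None and not (0 < speed <= 2.0):
        errors.append(f"tts.speed: {speed!r} must be above 0 and at most 2.0")
    return errors
```

The `voice` argument is the mapping found under the `voice:` key of harombe.yaml after parsing. TTS model names are not checked, since the available voices depend on the engine.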
Usage
CLI Voice Mode
# Start voice assistant
harombe voice
# With custom STT model
harombe voice --stt-model medium
# With different TTS engine
harombe voice --tts-engine coqui --tts-model tts_models/en/ljspeech/tacotron2-DDC
Controls:
- SPACE: Press and hold to record, release to process
- Ctrl+C: Exit voice mode
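The press-and-hold behavior can be modeled as a small buffer that collects audio only while the key is down. This is an illustrative sketch of the push-to-talk pattern, not Harombe's implementation. The class `PushToTalkBuffer` and its method names are hypothetical.

```python
class PushToTalkBuffer:
    """Collect audio chunks only between a key press and the matching release."""

    def __init__(self) -> None:
        self._recording = False
        self._chunks = []

    def key_down(self) -> None:
        # Holding a key usually triggers auto-repeat events; ignore them
        # so that an in-progress recording is not reset.
        if self._recording:
            return
        self._recording = True
        self._chunks = []

    def feed(self, chunk: bytes) -> None:
        # Audio arriving while the key is up is discarded.
        if self._recording:
            self._chunks.append(chunk)

    def key_up(self) -> bytes:
        # Stop recording and hand the whole utterance to the caller.
        self._recording = False
        audio = b"".join(self._chunks)
        self._chunks = []
        return audio
```

The bytes returned by `key_up()` would then be passed to the STT engine. Buffering the whole utterance keeps the logic simple at the cost of not starting transcription until the key is released.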
Programmatic Usage
import asyncio

from harombe.agent.loop import Agent
from harombe.config.schema import HarombeConfig
from harombe.llm.ollama import OllamaClient
from harombe.voice.whisper import WhisperSTT
from harombe.voice.piper import PiperTTS
from harombe.cli.voice import VoiceClient


async def main():
    # Configuration
    config = HarombeConfig(
        voice={
            "enabled": True,
            "stt": {"model": "base", "language": "en"},
            "tts": {"engine": "piper", "model": "en_US-lessac-medium"},
        }
    )

    # Initialize engines
    stt = WhisperSTT(model="base")
    await stt.initialize()
    tts = PiperTTS(model="en_US-lessac-medium")
    await tts.initialize()

    # Create agent
    llm = OllamaClient(config=config)
    agent = Agent(llm=llm, config=config)

    # Run voice client
    client = VoiceClient(stt_engine=stt, tts_engine=tts, agent=agent)
    await client.run()


asyncio.run(main())
See examples/09_voice_assistant.py for a complete example.
API Endpoints
Voice features are available via REST and WebSocket APIs:
REST Endpoints:
# Speech-to-text (base64 audio)
curl -X POST http://localhost:8000/voice/stt \
  -H "Content-Type: application/json" \
  -d '{"audio_base64": "...", "language": "en"}'

# Text-to-speech
curl -X POST http://localhost:8000/voice/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "voice": "default", "speed": 1.0}'

# Upload audio file
curl -X POST http://localhost:8000/voice/stt/file \
  -F "file=@recording.wav"
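The speech-to-text call can also be prepared from Python using only the standard library. This is a sketch that mirrors the first curl command. The helper `build_stt_request` is hypothetical, and the response format is not described in this guide, so the example only builds the request.

```python
import base64
import json
import urllib.request


def build_stt_request(
    audio: bytes,
    language: str = "en",
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Build a POST request for /voice/stt with base64-encoded audio."""
    payload = {
        "audio_base64": base64.b64encode(audio).decode("ascii"),
        "language": language,
    }
    return urllib.request.Request(
        f"{base_url}/voice/stt",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With a server running, the request can be sent with `urllib.request.urlopen(build_stt_request(wav_bytes))`, where `wav_bytes` is the content of an audio file read in binary mode.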
WebSocket Streaming:
const ws = new WebSocket("ws://localhost:8000/voice/stream");

// Send audio chunk
ws.send(
  JSON.stringify({
    type: "audio_chunk",
    data: base64AudioData,
    format: "wav",
  }),
);

// Signal end of audio
ws.send(JSON.stringify({ type: "audio_end" }));

// Receive transcription
ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  if (msg.type === "transcription") {
    console.log("Transcribed:", msg.text);
  } else if (msg.type === "audio_chunk") {
    // Play audio chunk
    playAudio(msg.data);
  }
};
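The same message shapes can be produced and consumed from Python. The helpers below are a hypothetical sketch covering only the three message types shown in the JavaScript example. They assume that audio chunks sent by the server are base64-encoded in the same way as chunks sent by the client, which this guide does not confirm.

```python
import base64
import json
from typing import Optional, Tuple, Union


def audio_chunk_message(chunk: bytes, fmt: str = "wav") -> str:
    """Serialize one chunk of audio as an `audio_chunk` message."""
    return json.dumps(
        {
            "type": "audio_chunk",
            "data": base64.b64encode(chunk).decode("ascii"),
            "format": fmt,
        }
    )


def audio_end_message() -> str:
    """Serialize the message that signals the end of the audio."""
    return json.dumps({"type": "audio_end"})


def handle_server_message(raw: str) -> Tuple[str, Optional[Union[str, bytes]]]:
    """Classify a server message as transcription text, audio bytes, or ignored."""
    msg = json.loads(raw)
    if msg.get("type") == "transcription":
        return ("text", msg["text"])
    if msg.get("type") == "audio_chunk":
        # Assumption: server audio is base64-encoded like client audio.
        return ("audio", base64.b64decode(msg["data"]))
    return ("ignored", None)
```

These functions only build and parse strings, so they can be used with any WebSocket client library.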
Troubleshooting
Audio Input Issues
Microphone not detected:
# macOS: Check System Settings > Privacy & Security > Microphone
# Linux: List audio devices
arecord -l
# Test microphone
python -c "import sounddevice as sd; print(sd.query_devices())"
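The device listing printed by the last command includes output-only devices as well. A small filter makes it easier to see which entries can record. This sketch assumes each device entry is a mapping with `name` and `max_input_channels` keys, as returned by `sounddevice.query_devices()`. The helper name `input_devices` is hypothetical.

```python
from typing import Iterable, List, Mapping, Tuple


def input_devices(devices: Iterable[Mapping]) -> List[Tuple[int, str]]:
    """Return (index, name) for every device that has at least one input channel."""
    return [
        (index, device["name"])
        for index, device in enumerate(devices)
        if device.get("max_input_channels", 0) > 0
    ]
```

With sounddevice installed, it can be called as `input_devices(sd.query_devices())` after `import sounddevice as sd`. An empty result suggests that no microphone is visible to PortAudio.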
Permission denied:
# macOS: Grant microphone permission in System Settings
# Linux: Add user to audio group
sudo usermod -a -G audio $USER
# Log out and back in
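VRAM / OOM Errors
Reduce model size: set voice.stt.model in harombe.yaml to a smaller model such as base or tiny, or pass --stt-model on the command line.
Force CPU: set voice.stt.device to cpu in harombe.yaml.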
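Audio Quality Issues
Improve STT accuracy: use a larger Whisper model (for example medium or large-v3), set language explicitly instead of relying on auto-detection, and prefer float16 or float32 over int8.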
Improve TTS quality:
voice:
  tts:
    engine: coqui # Switch from piper to coqui
    model: tts_models/en/ljspeech/tacotron2-DDC
    speed: 0.9 # Slightly slower for clarity
Common Errors
"No module named 'sounddevice'": the voice dependencies are not installed. Install them with pip install "harombe[voice]".
"PortAudio library not found":
# macOS
brew install portaudio
# Ubuntu/Debian
sudo apt-get install portaudio19-dev
# Then reinstall sounddevice
pip uninstall sounddevice
pip install sounddevice
"CUDA out of memory":
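Use a smaller model or run on the CPU: set voice.stt.model to a smaller size such as base or tiny, or set voice.stt.device to cpu.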
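The same recovery can be automated by retrying with progressively smaller models before giving up on the GPU. The sketch below is hypothetical and not part of Harombe. `load` stands for any callable that takes a model name and a device and returns an engine, and out-of-memory failures are assumed to surface as `MemoryError` or `RuntimeError`.

```python
from typing import Any, Callable, Tuple

# Model names from this guide, ordered largest to smallest.
FALLBACK_ORDER = ["large-v3", "large-v2", "medium", "small", "base", "tiny"]


def load_with_fallback(
    load: Callable[[str, str], Any],
    model: str,
    device: str = "cuda",
) -> Tuple[Any, str, str]:
    """Try `model` and then each smaller model on `device`; finally use tiny on CPU.

    Returns (engine, model_used, device_used).
    """
    start = FALLBACK_ORDER.index(model)
    for candidate in FALLBACK_ORDER[start:]:
        try:
            return load(candidate, device), candidate, device
        except (MemoryError, RuntimeError):
            # Treated as out-of-memory here; note that RuntimeError also
            # covers unrelated failures, so real code should inspect it.
            continue
    # Nothing fit on the requested device: fall back to the smallest model on CPU.
    return load("tiny", "cpu"), "tiny", "cpu"
```

Trying smaller models on the GPU first keeps latency low, and the CPU is used only as a last resort because it is the slowest option.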
Performance Optimization
For real-time interaction:
voice:
  stt:
    model: base # Fast transcription
    compute_type: int8 # Quantization
  tts:
    engine: piper # Fast synthesis
For best quality:
voice:
  stt:
    model: large-v3 # Most accurate
    compute_type: float16
  tts:
    engine: coqui
    model: tts_models/en/ljspeech/tacotron2-DDC
Next Steps
- See voice-architecture.md for technical details
- Check examples/09_voice_assistant.py for a complete code example
- See the API Reference for endpoint details