Quickstart guide for self-hosting the server and flashing the firmware.
cd server
docker compose up --build
# Health check
curl http://localhost:8080/health
# Text → Speech (PCM16LE mono 16kHz)
curl -X POST http://localhost:8080/v1/ask \
-H "Content-Type: application/json" \
-d '{"text":"hello"}' --output out.pcm
# Audio → Speech (voice input)
curl -X POST http://localhost:8080/v1/ask_audio \
-F "audio=@recording.wav" --output out.pcm
# Play the audio
ffplay -f s16le -ar 16000 -ac 1 out.pcm
Configure the LLM provider via environment variables:
# Ollama (default, self-hosted)
LLM_PROVIDER=ollama
OLLAMA_MODEL=ministral-3:8b
# OpenAI
LLM_PROVIDER=openai
OPENAI_API_KEY=sk-...
Choose your text-to-speech engine:
# Piper (default, offline)
TTS_PROVIDER=piper
PIPER_VOICE_EN=en_US-lessac-medium
PIPER_VOICE_RU=ru_RU-dmitri-medium
# Edge-TTS (Microsoft, cloud)
TTS_PROVIDER=edge
EDGE_VOICE_EN=en-US-ChristopherNeural
EDGE_VOICE_RU=ru-RU-DmitryNeural
# Test mode (sine wave)
TTS_PROVIDER=test
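To make the audio format concrete, here is a minimal sketch of how a sine-wave test signal in PCM16LE mono 16kHz could be generated. It is not the server's actual test provider: the function name `sine_pcm16le`, the 440 Hz default frequency, the one-second duration, and the amplitude are all assumptions chosen for illustration.

```python
import math
import struct

SAMPLE_RATE = 16_000  # Hz, matches the PCM16LE mono 16kHz format used by the API


def sine_pcm16le(freq_hz: float = 440.0, seconds: float = 1.0,
                 amplitude: float = 0.3) -> bytes:
    """Return a sine tone as raw little-endian signed 16-bit mono samples.

    Hypothetical helper: frequency, duration, and amplitude are illustrative
    defaults, not values taken from the server.
    """
    n_samples = int(SAMPLE_RATE * seconds)
    peak = int(32767 * amplitude)  # scale into the signed 16-bit range
    samples = (
        int(peak * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE))
        for i in range(n_samples)
    )
    # "<h" packs each sample as a little-endian signed 16-bit integer
    return b"".join(struct.pack("<h", s) for s in samples)
```

Writing the returned bytes to a file such as `tone.pcm` should produce something playable with the same `ffplay -f s16le -ar 16000 -ac 1` invocation used for `out.pcm`.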
Voice input uses OpenAI Whisper for transcription:
STT_MODEL=base # Options: base, small, medium
STT_LANGUAGE=auto # Auto-detect, or: en, ru
STT_DEVICE=cpu # Uses int8 quantization
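The variables above can be read and validated in one place. The following is a sketch only, not the server's real configuration code: the `Settings` class and `load_settings` function are hypothetical, and the STT defaults (`base`, `auto`, `cpu`) are assumed from the example values shown here rather than from documented defaults.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Allowed values are taken from the options listed in this guide.
LLM_PROVIDERS = {"ollama", "openai"}
TTS_PROVIDERS = {"piper", "edge", "test"}
STT_MODELS = {"base", "small", "medium"}


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the environment-driven options."""
    llm_provider: str
    tts_provider: str
    stt_model: str
    stt_language: str
    stt_device: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from a mapping (os.environ by default) and validate them."""
    settings = Settings(
        llm_provider=env.get("LLM_PROVIDER", "ollama"),  # documented default
        tts_provider=env.get("TTS_PROVIDER", "piper"),   # documented default
        stt_model=env.get("STT_MODEL", "base"),          # assumed default
        stt_language=env.get("STT_LANGUAGE", "auto"),    # assumed default
        stt_device=env.get("STT_DEVICE", "cpu"),         # assumed default
    )
    if settings.llm_provider not in LLM_PROVIDERS:
        raise ValueError(f"Unknown LLM_PROVIDER: {settings.llm_provider!r}")
    if settings.tts_provider not in TTS_PROVIDERS:
        raise ValueError(f"Unknown TTS_PROVIDER: {settings.tts_provider!r}")
    if settings.stt_model not in STT_MODELS:
        raise ValueError(f"Unknown STT_MODEL: {settings.stt_model!r}")
    # The OpenAI provider cannot work without an API key.
    if settings.llm_provider == "openai" and not env.get("OPENAI_API_KEY"):
        raise ValueError("OPENAI_API_KEY is required when LLM_PROVIDER=openai")
    return settings
```

Passing the environment as a mapping keeps the function easy to exercise with a plain dictionary, for example `load_settings({"TTS_PROVIDER": "test"})`.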
| Endpoint | Description |
|---|---|
| GET /health | Health check |
| POST /v1/ask | Text prompt → LLM → TTS audio stream |
| POST /v1/ask_audio | Audio input → STT → LLM → TTS audio stream |
Response format: chunked application/octet-stream with optional X-Text header containing subtitle text. Audio is PCM16LE mono 16kHz.
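A client for this response format might look like the sketch below, which uses only the Python standard library. It assumes the server runs at a base URL such as `http://localhost:8080`, as in the curl examples; the helper names `build_ask_request`, `ask`, and `save_wav` are hypothetical, and how the `X-Text` header value is encoded is not specified here, so it is returned as received.

```python
import json
import urllib.request
import wave
from typing import Optional, Tuple

SAMPLE_RATE = 16_000  # Hz
SAMPLE_WIDTH = 2      # bytes per sample (16-bit)
CHANNELS = 1          # mono


def build_ask_request(base_url: str, text: str) -> urllib.request.Request:
    """Build the POST /v1/ask request with a JSON body like {"text": "hello"}."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/ask",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url: str, text: str,
        chunk_size: int = 4096) -> Tuple[Optional[str], bytes]:
    """Send a prompt; return (subtitle text or None, raw PCM bytes)."""
    with urllib.request.urlopen(build_ask_request(base_url, text)) as resp:
        subtitle = resp.headers.get("X-Text")  # optional header
        pcm = bytearray()
        # Read the chunked body incrementally until the stream ends.
        while chunk := resp.read(chunk_size):
            pcm.extend(chunk)
    return subtitle, bytes(pcm)


def save_wav(pcm: bytes, path: str) -> float:
    """Wrap raw PCM16LE mono 16kHz data in a WAV header; return seconds."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(pcm)
    return len(pcm) / (SAMPLE_RATE * SAMPLE_WIDTH * CHANNELS)
```

With the server running, `subtitle, pcm = ask("http://localhost:8080", "hello")` followed by `save_wav(pcm, "out.wav")` should yield a file that ordinary audio players can open. This sketch buffers the whole reply for simplicity; a player that wants low latency would instead hand each chunk to the audio device as it arrives.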
. ~/esp/esp-idf/export.sh
cd firmware
idf.py set-target esp32s3
idf.py build
idf.py -p /dev/cu.usbmodemXXXX flash monitor
First boot launches Wi‑Fi provisioning via a captive portal (AP name: VoiceTerminal-XXXX). Connect to that access point and configure your backend endpoint. Then type a prompt and press Enter.
Reference device: M5Stack Cardputer Adv (ESP32-S3)
The firmware architecture supports additional ESP32-S3 devices through hardware abstraction layers.
Core repo: https://github.com/koz-tv/voice-terminal
Website: https://github.com/koz-tv/voiceterminal-web