Documentation

Quickstart guide for self-hosting the server and flashing the firmware.

Server Setup (Docker)

# Build the image and start the server
cd server
docker compose up --build

# Health check
curl http://localhost:8080/health

# Text → Speech (PCM16LE mono 16kHz)
curl -X POST http://localhost:8080/v1/ask \
  -H "Content-Type: application/json" \
  -d '{"text":"hello"}' --output out.pcm

# Audio → Speech (voice input)
curl -X POST http://localhost:8080/v1/ask_audio \
  -F "audio=@recording.wav" --output out.pcm

# Play the audio
ffplay -f s16le -ar 16000 -ac 1 out.pcm
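
Because the output is headerless PCM, its length follows directly from the byte count: 16000 samples per second × 2 bytes per sample × 1 channel = 32000 bytes per second. The sketch below uses that arithmetic to sanity-check a response. The function name pcm_duration_ms is made up for illustration and is not part of the project.

```shell
# Hypothetical helper: duration in milliseconds of a raw
# PCM16LE mono 16kHz file (32000 bytes per second of audio).
pcm_duration_ms() {
  [ -f "$1" ] || { echo "no such file: $1" >&2; return 1; }
  # wc -c pads its output on some systems, so strip the spaces.
  bytes=$(wc -c < "$1" | tr -d ' ')
  echo $(( bytes * 1000 / 32000 ))
}

# Usage:
#   pcm_duration_ms out.pcm
# To wrap the same samples in a WAV container (assumes ffmpeg is
# installed alongside ffplay):
#   ffmpeg -f s16le -ar 16000 -ac 1 -i out.pcm out.wav
```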

LLM Providers

Configure via environment variables:

# Ollama (default, self-hosted)
LLM_PROVIDER=ollama
OLLAMA_MODEL=ministral-3:8b

# OpenAI
LLM_PROVIDER=openai
OPENAI_API_KEY=sk-...
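
Before starting the server it can help to confirm that the chosen provider has what it needs. The sketch below is a hypothetical preflight check, not part of the project. The function name check_llm_env is made up, and it assumes only what the settings above show: ollama is the default, and the OpenAI provider needs OPENAI_API_KEY. The server may accept other values that this guide does not list.

```shell
# Hypothetical preflight check (not part of the project).
# Assumes: LLM_PROVIDER defaults to "ollama"; "openai" needs OPENAI_API_KEY.
check_llm_env() {
  case "${LLM_PROVIDER:-ollama}" in
    ollama)
      echo "LLM provider: ollama (model: ${OLLAMA_MODEL:-server default})"
      ;;
    openai)
      if [ -z "${OPENAI_API_KEY:-}" ]; then
        echo "LLM_PROVIDER=openai but OPENAI_API_KEY is not set" >&2
        return 1
      fi
      echo "LLM provider: openai"
      ;;
    *)
      # Only the two providers listed in this guide are recognised here.
      echo "LLM_PROVIDER not listed in this guide: ${LLM_PROVIDER}" >&2
      return 1
      ;;
  esac
}

# Usage:
#   check_llm_env && docker compose up --build
```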

TTS Providers

Choose your text-to-speech engine:

# Piper (default, offline)
TTS_PROVIDER=piper
PIPER_VOICE_EN=en_US-lessac-medium
PIPER_VOICE_RU=ru_RU-dmitri-medium

# Edge-TTS (Microsoft, cloud)
TTS_PROVIDER=edge
EDGE_VOICE_EN=en-US-ChristopherNeural
EDGE_VOICE_RU=ru-RU-DmitryNeural

# Test mode (sine wave)
TTS_PROVIDER=test

Speech-to-Text (Whisper)

Voice input uses OpenAI Whisper for transcription:

STT_MODEL=base          # Options: base, small, medium
STT_LANGUAGE=auto       # Auto-detect, or: en, ru
STT_DEVICE=cpu          # Uses int8 quantization

API Reference

Endpoint             Description
GET /health          Health check
POST /v1/ask         Text prompt → LLM → TTS audio stream
POST /v1/ask_audio   Audio input → STT → LLM → TTS audio stream

Response format: a chunked application/octet-stream body, with an optional X-Text header that carries the subtitle text. Audio is raw PCM16LE (16-bit signed little-endian), mono, 16kHz.
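
To see the subtitle text next to the audio, curl can save the response headers to a file while the body goes to out.pcm. The sketch below pulls the X-Text value out of such a header dump. The helper name x_text_from_headers and the file name headers.txt are made up for illustration. How non-ASCII text is encoded in the header is not stated here, so the helper prints the raw value unchanged.

```shell
# Hypothetical helper: print the X-Text value from a header dump
# written by curl's -D/--dump-header option. Prints nothing if the
# header is absent (it is optional).
x_text_from_headers() {
  # HTTP header lines end in CRLF and header names are case-insensitive.
  tr -d '\r' < "$1" | sed -n 's/^[Xx]-[Tt][Ee][Xx][Tt]:[[:space:]]*//p'
}

# Usage, with the server from this guide running:
#   curl -sS -X POST http://localhost:8080/v1/ask \
#     -H "Content-Type: application/json" \
#     -d '{"text":"hello"}' -D headers.txt --output out.pcm
#   x_text_from_headers headers.txt
```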

Firmware (ESP-IDF)

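# Load the ESP-IDF environment into the current shell
. ~/esp/esp-idf/export.sh

cd firmware
idf.py set-target esp32s3
idf.py build
# Replace XXXX with your board's serial port suffix
idf.py -p /dev/cu.usbmodemXXXX flash monitor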
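
The port name in the flash command depends on the host. The sketch below lists likely candidates. It is a hypothetical helper, not part of the project, and it assumes the usual device naming: /dev/cu.usbmodem* on macOS, and /dev/ttyACM* or /dev/ttyUSB* on Linux.

```shell
# Hypothetical helper: list likely serial ports for the board.
# The optional argument overrides the directory to search (default: /dev).
list_serial_ports() {
  dir="${1:-/dev}"
  for port in "$dir"/cu.usbmodem* "$dir"/ttyACM* "$dir"/ttyUSB*; do
    # Unmatched globs stay literal, so keep only paths that exist.
    [ -e "$port" ] && echo "$port"
  done
  return 0
}

# Usage (picks the first match; check the list if several boards are attached):
#   idf.py -p "$(list_serial_ports | head -n 1)" flash monitor
```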

On first boot the device starts Wi-Fi provisioning through a captive portal (AP name: VoiceTerminal-XXXX). Connect to that access point and configure your backend endpoint. Once the device is connected, type a prompt and press Enter.

Hardware

Reference device: M5Stack Cardputer Adv (ESP32-S3)

  • Audio codec: ES8311 (I2C)
  • I2S speaker + microphone (PCM16 mono 16kHz)
  • Keyboard: TCA8418 matrix controller
  • Display: ST7789 LCD (135×240)
  • Push-to-talk: GPIO0 button
  • Storage: microSD slot

The firmware reaches the hardware through abstraction layers, so it can support additional ESP32-S3 devices.

Links

Core repo:  https://github.com/koz-tv/voice-terminal
Website:    https://github.com/koz-tv/voiceterminal-web