Voice Configuration
A reference for choosing and configuring voices, languages, and speech behavior for your AI agents. Covers both the dashboard UX and the API.
For browsing and previewing voices visually, see Voice Library. This page is for understanding the underlying configuration model.
The configuration split (read this first)
Akol’s data model has two pieces:
| Lives on | What |
|---|---|
| Agent | Voice ID, agent name, default language, additional languages, avatar |
| Business | System prompt, greeting, fallback message, voice settings (speed, etc.), interruption sensitivity, silence threshold, transfer number |
When a call starts, the engine merges the two: the Business is the source of truth for behavior, and the Agent is the source of truth for identity (voice, name, languages). If a value is missing on the Business, the engine falls back to the Agent. This fallback is being deprecated, so write to the Business when you can.
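The precedence rule can be sketched as a key-by-key merge. This is only an illustration: the engine's actual merge logic is not documented here, and the function name and the assumption that a missing value shows up as an absent key or `None` are hypothetical.

```python
from typing import Any, Mapping


def resolve_call_config(business: Mapping[str, Any], agent: Mapping[str, Any]) -> dict:
    """Merge Agent and Business settings; the Business wins when it has a value."""
    merged = {}
    for key in set(business) | set(agent):
        value = business.get(key)
        # Assumed fallback: use the Agent only when the Business has no value.
        merged[key] = value if value is not None else agent.get(key)
    return merged


agent = {
    "voiceId": "a0e99841-438c-4a64-b679-ae501e7d6091",
    "greetingTemplate": "Hi, this is Sarah.",
}
business = {"greetingTemplate": None, "silenceThresholdMs": 800}
config = resolve_call_config(business, agent)
```

Here `greetingTemplate` is taken from the Agent only because the Business left it empty, which is the deprecated path described above.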
Voices
Listing voices
GET /api/v1/voices
Authorization: Bearer <token>
Returns the ElevenLabs catalog, filtered to the voices Akol supports:
{
"success": true,
"data": [
{
"id": "a0e99841-438c-4a64-b679-ae501e7d6091",
"name": "Sarah",
"gender": "female",
"category": "curated",
"languages": ["en"],
"previewUrl": "https://cdn.akol.ai/voice-previews/sarah.mp3",
"description": "Warm female voice, professional and approachable",
"isFavorite": false,
"isHidden": false
}
]
}
Voice categories
| Category | When to use |
|---|---|
| curated | Hand-picked, work well across most use cases |
| stable | Most consistent across long calls / unusual phrasings |
| emotive | More expressive; good for hospitality, healthcare, sales |
| support | Tuned for customer support cadence (shorter pauses, calmer) |
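Once you have the `data` array from GET /api/v1/voices, narrowing it down is a client-side filter. A minimal sketch, assuming the response shape shown above; `pick_voices` is a hypothetical helper and the second voice in the sample is invented for illustration.

```python
def pick_voices(voices, category=None, language=None):
    """Return visible voices matching an optional category and language code."""
    matches = []
    for voice in voices:
        if voice.get("isHidden"):
            continue  # hidden voices stay valid but are left out of pickers
        if category is not None and voice.get("category") != category:
            continue
        if language is not None and language not in voice.get("languages", []):
            continue
        matches.append(voice)
    return matches


voices = [
    {"name": "Sarah", "category": "curated", "languages": ["en"], "isHidden": False},
    # Invented entry, only here to exercise the filter.
    {"name": "Example", "category": "support", "languages": ["en", "de"], "isHidden": False},
]
german_support = pick_voices(voices, category="support", language="de")
```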
Setting an agent’s voice
PATCH /api/v1/agents/:id
Content-Type: application/json
{
"voiceId": "a0e99841-438c-4a64-b679-ae501e7d6091",
"language": "en-US",
"primaryLanguage": "en"
}
The voice ID must come from /api/v1/voices. Arbitrary ElevenLabs voice IDs that are not in that list (e.g. copied from another platform) are rejected with 422.
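To avoid the 422, you can check the ID against the voice list before sending the PATCH. The sketch below only builds the request; `BASE_URL` is a placeholder, `agent_123` is a made-up agent ID, and sending the same Bearer token as the GET request is an assumption.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder: substitute your Akol API host


def build_set_voice_request(agent_id, voice_id, known_voice_ids, token):
    """Build a PATCH request that sets an agent's voice, refusing unknown IDs."""
    if voice_id not in known_voice_ids:
        raise ValueError(f"voice {voice_id!r} is not in the /api/v1/voices list")
    body = json.dumps({"voiceId": voice_id}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/agents/{agent_id}",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed to match the GET example
        },
        method="PATCH",
    )


known = {"a0e99841-438c-4a64-b679-ae501e7d6091"}
request = build_set_voice_request(
    "agent_123", "a0e99841-438c-4a64-b679-ae501e7d6091", known, "<token>"
)
```

Passing `request` to `urllib.request.urlopen` would send it.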
Voice favorites and hidden
Per-user preferences:
POST /api/v1/voices/:voiceId/favorite
DELETE /api/v1/voices/:voiceId/favorite
POST /api/v1/voices/:voiceId/hide
DELETE /api/v1/voices/:voiceId/hide
Hidden voices don’t appear in the picker but are still valid: existing agents using them continue to work.
Languages
Akol supports the following BCP-47 language codes for STT, LLM, and TTS:
| Language | Code | STT | LLM | TTS (voice support varies) |
|---|---|---|---|---|
| English (US) | en-US | ✓ Nova-3 / Flux | ✓ | All voices |
| English (UK) | en-GB | ✓ Nova-3 | ✓ | Voices with en in languages |
| German | de-DE | ✓ Nova-3 / Flux-de | ✓ | Voices with de in languages |
| Spanish | es-ES | ✓ Nova-3 | ✓ | Subset |
| French | fr-FR | ✓ Nova-3 | ✓ | Subset |
| Portuguese | pt-BR | ✓ Nova-3 | ✓ | Subset |
primaryLanguage is the ISO 639-1 code (e.g. en, de) used for language-specific prompt rules; language is the full BCP-47 code (e.g. en-US) passed to Deepgram for STT.
Multilingual agents
Set additionalLanguages to allow the agent to switch mid-call:
{
"language": "en-US",
"primaryLanguage": "en",
"additionalLanguages": ["es", "de"]
}
The voice engine detects the caller’s language from STT confidence and switches the LLM context. The voice itself doesn’t change, so pick a voice whose languages array covers all your target languages.
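A quick coverage check before saving a multilingual agent might look like this. It is a sketch under two assumptions: that primaryLanguage is the language subtag of `language` (as the examples suggest), and that a voice's `languages` array holds ISO 639-1 codes. Both helper names are hypothetical.

```python
def primary_language(bcp47_code):
    """Take the language subtag of a BCP-47 code, e.g. 'en-US' -> 'en'."""
    return bcp47_code.split("-", 1)[0].lower()


def missing_languages(voice, language, additional_languages=()):
    """Return the target language codes the voice's languages array does not cover."""
    targets = [primary_language(language), *additional_languages]
    supported = set(voice.get("languages", []))
    return [code for code in targets if code not in supported]


sarah = {"name": "Sarah", "languages": ["en"]}
gaps = missing_languages(sarah, "en-US", ["es", "de"])
```

A non-empty result means the voice does not cover every target language.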
Voice behavior tuning (Business-level)
These settings live on the Business. Tune them per business, not per agent.
PATCH /api/v1/businesses/:id
Content-Type: application/json
{
"voiceSettings": { "speed": 1.0 },
"interruptionSensitivity": 0.7,
"silenceThresholdMs": 800,
"maxCallDurationMinutes": 15
}
| Field | Range | Default | What it does |
|---|---|---|---|
| voiceSettings.speed | 0.5 – 2.0 | 1.0 | Playback rate. 1.1 is barely noticeable; 1.3+ sounds rushed. |
| interruptionSensitivity | 0.0 – 1.0 | 0.7 | How quickly the agent stops speaking when the caller talks. Higher = more interruptible. Lower = agent finishes the sentence. |
| silenceThresholdMs | 400 – 2000 | 800 | How long the caller must be silent before the agent assumes they’re done speaking. Shorter = faster turn-taking but more interruptions of slow speakers. |
| maxCallDurationMinutes | 1 – 60 | 15 | Hard cap. The agent ends the call cleanly at this point. |
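The ranges in the table can be checked client-side before sending the PATCH. This is a sketch only: treating the bounds as inclusive is an assumption, and how the API itself responds to out-of-range values is not covered here.

```python
RANGES = {
    "voiceSettings.speed": (0.5, 2.0),
    "interruptionSensitivity": (0.0, 1.0),
    "silenceThresholdMs": (400, 2000),
    "maxCallDurationMinutes": (1, 60),
}


def lookup(settings, dotted_key):
    """Follow a dotted key such as 'voiceSettings.speed' through nested dicts."""
    value = settings
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def validate_voice_behavior(settings):
    """Return a list of human-readable range violations (empty if none)."""
    problems = []
    for key, (low, high) in RANGES.items():
        value = lookup(settings, key)
        if value is None:
            continue  # field omitted, nothing to check
        if not low <= value <= high:
            problems.append(f"{key}={value} is outside {low}-{high}")
    return problems


problems = validate_voice_behavior(
    {"voiceSettings": {"speed": 2.5}, "silenceThresholdMs": 800}
)
```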
Picking sensitivity values
| Use case | interruptionSensitivity | silenceThresholdMs |
|---|---|---|
| Customer service | 0.7 | 800 |
| Healthcare intake | 0.5 (let people finish) | 1200 |
| Outbound sales | 0.8 (quick, responsive) | 600 |
| Elderly users / accessibility | 0.4 | 1500 |
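If you manage several businesses, the table above can be kept as presets and turned into PATCH bodies. A minimal sketch; the preset keys and the `business_patch_for` helper are hypothetical names, and the values are copied from the table.

```python
PRESETS = {
    "customer_service": {"interruptionSensitivity": 0.7, "silenceThresholdMs": 800},
    "healthcare_intake": {"interruptionSensitivity": 0.5, "silenceThresholdMs": 1200},
    "outbound_sales": {"interruptionSensitivity": 0.8, "silenceThresholdMs": 600},
    "accessibility": {"interruptionSensitivity": 0.4, "silenceThresholdMs": 1500},
}


def business_patch_for(use_case, speed=1.0):
    """Build the JSON body for PATCH /api/v1/businesses/:id from a preset."""
    preset = PRESETS[use_case]  # raises KeyError for an unknown use case
    return {"voiceSettings": {"speed": speed}, **preset}


body = business_patch_for("healthcare_intake")
```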
Pronunciation overrides
For brand names, place names, or jargon that TTS mispronounces, add SSML
<phoneme> or <sub> tags directly in your systemPrompt or
greetingTemplate:
You work for <sub alias="Akol">Akol</sub>, a voice AI platform.
Always pronounce <phoneme alphabet="ipa" ph="ˈɑːkoʊl">Akol</phoneme> correctly.
The TTS provider strips the SSML and applies the pronunciation. This works in any field passed to TTS (greeting, fallback, system prompt context).
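When tags are generated from a glossary rather than typed by hand, escaping matters, since an unescaped `&` or quote would break the markup. The helpers below are a hypothetical sketch using the standard library's XML escaping; the AT&T entry is an invented example.

```python
from xml.sax.saxutils import escape, quoteattr


def phoneme(text, ipa):
    """Wrap text in an SSML <phoneme> tag with an IPA pronunciation."""
    return f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(text)}</phoneme>'


def sub(text, alias):
    """Wrap text in an SSML <sub> tag; the alias is what gets spoken."""
    return f"<sub alias={quoteattr(alias)}>{escape(text)}</sub>"


greeting = f"Thanks for calling {phoneme('Akol', 'ˈɑːkoʊl')}."
```

Note that a `<sub>` alias only has an effect when it differs from the enclosed text, for example `sub("AT&T", "A T and T")`.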
Voice quality troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| Agent talks over the caller | interruptionSensitivity too low | Raise to 0.7+ |
| Agent waits too long before speaking | silenceThresholdMs too high, or LLM provider slow | Lower threshold; check /api/v1/health for provider status |
| Agent voice sounds “off” in German | Voice ID doesn’t include de in its languages array | Pick a voice from /api/v1/voices?language=de |
| Voice cuts off mid-word | Network buffering on caller side | Usually carrier-side. Check call’s outcome for hints. |
| Voice prosody resets between sentences | ElevenLabs request continuation context not preserved | Known issue. Keep the agent’s sentences short for now |
Latency budget
End-to-end first-audio-out latency targets:
Caller speech → STT (Deepgram Flux ~150ms)
→ LLM (Groq ~120ms first token)
→ TTS (ElevenLabs Flash ~75ms first chunk)
→ Caller hears agent
────────────
~470ms typical
If your calls feel slow, check /api/v1/calls/:id and look at the metadata.latencies object. The biggest knobs you control:
- System prompt length: over ~3000 tokens significantly increases time to first token (TTFT)
- Function tools: every tool definition adds context; trim unused ones
- Voice category: emotive voices have ~50ms more latency than stable
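To see which stage dominates a slow call, you can compare the per-stage numbers against the targets above. This sketch assumes metadata.latencies is a flat mapping of stage name to milliseconds; its real shape is not shown on this page, and the `stt`/`llm`/`tts` keys are hypothetical.

```python
def slowest_stage(latencies):
    """Return (stage, ms, total_ms) for a mapping of stage name to milliseconds."""
    stage = max(latencies, key=latencies.get)
    return stage, latencies[stage], sum(latencies.values())


# Per-stage targets from the latency budget above.
targets = {"stt": 150, "llm": 120, "tts": 75}
stage, ms, total = slowest_stage(targets)
```

The three stage targets add up to 345 ms; the rest of the ~470ms typical figure is not itemized in the diagram.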