Voice Configuration

A reference for choosing and configuring voices, languages, and speech behavior for your AI agents. Covers both the dashboard UX and the API.

For browsing and previewing voices visually, see Voice Library. This page is for understanding the underlying configuration model.

The configuration split (read this first)

Akol’s data model has two pieces:

Lives on	What
Agent	Voice ID, agent name, default language, additional languages, avatar
Business	System prompt, greeting, fallback message, voice settings (speed, etc.), interruption sensitivity, silence threshold, transfer number

When a call starts, the engine merges the two: the Business is the source of truth for behavior, and the Agent is the source of truth for identity (voice, name, and languages). If a value is missing on the Business, the engine falls back to the Agent. This fallback is being deprecated, so write behavior settings to the Business when you can.
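The merge rule above can be sketched in a few lines. This is an illustration only, not the engine's code: the function name `merge_call_config` is hypothetical, and it assumes that a Business value of `None` (or an absent key) is what counts as "missing".

```python
from typing import Any, Dict


def merge_call_config(agent: Dict[str, Any], business: Dict[str, Any]) -> Dict[str, Any]:
    """Return the effective per-call config (hypothetical helper).

    Assumption: a Business value of None, or an absent key, means "missing".
    """
    merged = dict(agent)  # start from Agent values (identity plus legacy fallbacks)
    for key, value in business.items():
        if value is not None:
            merged[key] = value  # the Business wins whenever it has a value
    return merged


agent = {
    "voiceId": "a0e99841-438c-4a64-b679-ae501e7d6091",
    "language": "en-US",
    "greeting": "Hello from the agent record.",  # legacy value stored on the Agent
}
business = {"greeting": None, "silenceThresholdMs": 800}
effective = merge_call_config(agent, business)
```

Here `effective["greeting"]` comes from the Agent because the Business has none, which is exactly the path the deprecation notice asks you to stop relying on.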

Voices

Listing voices


GET /api/v1/voices
Authorization: Bearer <token>

Returns the ElevenLabs voices that Akol supports (a filtered subset of the full catalog):


{
  "success": true,
  "data": [
    {
      "id": "a0e99841-438c-4a64-b679-ae501e7d6091",
      "name": "Sarah",
      "gender": "female",
      "category": "curated",
      "languages": ["en"],
      "previewUrl": "https://cdn.akol.ai/voice-previews/sarah.mp3",
      "description": "Warm female voice, professional and approachable",
      "isFavorite": false,
      "isHidden": false
    }
  ]
}
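Once you have this response, narrowing it down client-side is straightforward. The sketch below assumes the response shape shown above and nothing more; `pick_voices` is a hypothetical helper name, not part of any Akol SDK.

```python
from typing import Any, Dict, List, Optional


def pick_voices(
    response: Dict[str, Any], language: str, category: Optional[str] = None
) -> List[Dict[str, Any]]:
    """Filter a /api/v1/voices response by language and, optionally, category."""
    if not response.get("success"):
        return []
    return [
        voice
        for voice in response.get("data", [])
        if language in voice.get("languages", [])
        and not voice.get("isHidden", False)  # skip voices the user has hidden
        and (category is None or voice.get("category") == category)
    ]


sample = {
    "success": True,
    "data": [
        {
            "id": "a0e99841-438c-4a64-b679-ae501e7d6091",
            "name": "Sarah",
            "category": "curated",
            "languages": ["en"],
            "isHidden": False,
        }
    ],
}
english_curated = pick_voices(sample, "en", category="curated")
german = pick_voices(sample, "de")
```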

Voice categories

Category	When to use
`curated`	Hand-picked, work well across most use cases
`stable`	Most consistent across long calls / unusual phrasings
`emotive`	More expressive — good for hospitality, healthcare, sales
`support`	Tuned for customer support cadence (shorter pauses, calmer)

Setting an agent’s voice


PATCH /api/v1/agents/:id
Content-Type: application/json
 
{
  "voiceId": "a0e99841-438c-4a64-b679-ae501e7d6091",
  "language": "en-US",
  "primaryLanguage": "en"
}

The voice ID must come from /api/v1/voices. ElevenLabs voice IDs that are not in that list (e.g. copied from another platform) are rejected with 422.
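To avoid the 422, you can check the voice ID against the list before sending. The following is a minimal sketch using only the Python standard library. It builds the request without sending it. `BASE_URL` is a placeholder, `build_voice_patch` is a hypothetical helper, and the Bearer header on this endpoint is an assumption carried over from the voices listing example.

```python
import json
import urllib.request
from typing import Iterable

BASE_URL = "https://api.example.invalid"  # placeholder; use your Akol API host


def build_voice_patch(
    agent_id: str,
    voice_id: str,
    language: str,
    primary_language: str,
    token: str,
    known_voice_ids: Iterable[str],
) -> urllib.request.Request:
    """Build (but do not send) the PATCH request that sets an agent's voice."""
    if voice_id not in set(known_voice_ids):
        # Fail locally instead of waiting for the API to answer 422.
        raise ValueError(f"voice {voice_id!r} is not in the /api/v1/voices list")
    body = {"voiceId": voice_id, "language": language, "primaryLanguage": primary_language}
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/agents/{agent_id}",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed to match the GET example
        },
        method="PATCH",
    )


known = ["a0e99841-438c-4a64-b679-ae501e7d6091"]
request = build_voice_patch("agent_123", known[0], "en-US", "en", "<token>", known)
```

Pass the resulting object to `urllib.request.urlopen` to send it.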

Voice favorites and hidden

Per-user preferences:


POST /api/v1/voices/:voiceId/favorite
DELETE /api/v1/voices/:voiceId/favorite
POST /api/v1/voices/:voiceId/hide
DELETE /api/v1/voices/:voiceId/hide

Hidden voices don’t appear in the picker but are still valid — existing agents using them continue to work.

Languages

Akol supports the following BCP-47 language codes for STT, LLM, and TTS:

Language	Code	STT	LLM	TTS (voice support varies)
English (US)	`en-US`	✓ Nova-3 / Flux	✓	All voices
English (UK)	`en-GB`	✓ Nova-3	✓	Voices with `en` in `languages`
German	`de-DE`	✓ Nova-3 / Flux-de	✓	Voices with `de` in `languages`
Spanish	`es-ES`	✓ Nova-3	✓	Subset
French	`fr-FR`	✓ Nova-3	✓	Subset
Portuguese	`pt-BR`	✓ Nova-3	✓	Subset

`primaryLanguage` is the ISO 639-1 code (e.g. `en`, `de`) used for language-specific prompt rules. `language` is the full BCP-47 code passed to Deepgram, the STT provider.
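Since the two fields describe the same language at different granularity, it can help to derive one from the other so they never drift apart. A small sketch, assuming the six codes in the table above are the complete supported set; `language_fields` is a hypothetical helper.

```python
from typing import Dict

# Codes copied from the table above; assumed here to be the full supported set.
SUPPORTED_LANGUAGES = {"en-US", "en-GB", "de-DE", "es-ES", "fr-FR", "pt-BR"}


def language_fields(language: str) -> Dict[str, str]:
    """Return matching `language` and `primaryLanguage` values for a BCP-47 code."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    # The primary subtag of a BCP-47 tag is the part before the first hyphen.
    return {"language": language, "primaryLanguage": language.split("-")[0].lower()}


fields = language_fields("pt-BR")
```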

Multilingual agents

Set additionalLanguages to allow the agent to switch mid-call:


{
  "language": "en-US",
  "primaryLanguage": "en",
  "additionalLanguages": ["es", "de"]
}

The voice engine detects the caller’s language from STT confidence and switches the LLM context. The voice itself doesn’t change — pick a voice whose languages array covers all your target languages.
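Because the voice stays the same across a language switch, it is worth checking coverage before saving a multilingual agent. A minimal sketch, assuming the voice object has the `languages` array shown in the voices response; `missing_languages` is a hypothetical helper.

```python
from typing import Any, Dict, List


def missing_languages(
    voice: Dict[str, Any], primary_language: str, additional_languages: List[str]
) -> List[str]:
    """List the target languages that the voice's `languages` array does not cover."""
    supported = set(voice.get("languages", []))
    wanted = [primary_language, *additional_languages]
    return [lang for lang in wanted if lang not in supported]


sarah = {"name": "Sarah", "languages": ["en"]}
gaps = missing_languages(sarah, "en", ["es", "de"])
```

An empty result means the voice covers every target language. For the `Sarah` voice from the earlier response, `es` and `de` come back as gaps, so a different voice would be the better choice for this agent.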

Voice behavior tuning (Business-level)

These settings live on the Business. Tune them per business, not per agent.


PATCH /api/v1/businesses/:id
Content-Type: application/json
 
{
  "voiceSettings": { "speed": 1.0 },
  "interruptionSensitivity": 0.7,
  "silenceThresholdMs": 800,
  "maxCallDurationMinutes": 15
}

Field	Range	Default	What it does
`voiceSettings.speed`	`0.5 – 2.0`	`1.0`	Playback rate. `1.1` is barely noticeable; `1.3+` sounds rushed.
`interruptionSensitivity`	`0.0 – 1.0`	`0.7`	How quickly the agent stops speaking when the caller talks. Higher = more interruptible. Lower = agent finishes the sentence.
`silenceThresholdMs`	`400 – 2000`	`800`	How long the caller must be silent before the agent assumes they’re done speaking. Shorter = faster turn-taking but more interruptions of slow speakers.
`maxCallDurationMinutes`	`1 – 60`	`15`	Hard cap. The agent ends the call cleanly at this point.
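The ranges in this table are easy to check before you send the PATCH. The sketch below is a client-side convenience, not the server's validation logic. It assumes the bounds are inclusive, and `out_of_range` is a hypothetical helper.

```python
from typing import Any, Dict, List

# Bounds copied from the table above; treated as inclusive (an assumption).
RANGES = {
    "voiceSettings.speed": (0.5, 2.0),
    "interruptionSensitivity": (0.0, 1.0),
    "silenceThresholdMs": (400, 2000),
    "maxCallDurationMinutes": (1, 60),
}


def lookup(payload: Dict[str, Any], dotted: str) -> Any:
    """Resolve a dotted path such as 'voiceSettings.speed'; None if absent."""
    node: Any = payload
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def out_of_range(payload: Dict[str, Any]) -> List[str]:
    """Describe every field in the payload that falls outside its documented range."""
    problems = []
    for field, (low, high) in RANGES.items():
        value = lookup(payload, field)
        if value is not None and not (low <= value <= high):
            problems.append(f"{field}={value} outside {low}-{high}")
    return problems


ok = out_of_range({"voiceSettings": {"speed": 1.0}, "interruptionSensitivity": 0.7,
                   "silenceThresholdMs": 800, "maxCallDurationMinutes": 15})
bad = out_of_range({"voiceSettings": {"speed": 2.5}, "silenceThresholdMs": 300})
```

Fields left out of the payload are skipped, which matches how a partial PATCH is normally used.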

Picking sensitivity values

Use case	`interruptionSensitivity`	`silenceThresholdMs`
Customer service	0.7	800
Healthcare intake	0.5 (let people finish)	1200
Outbound sales	0.8 (quick, responsive)	600
Elderly users / accessibility	0.4	1500
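If you manage many businesses, these starting points can live in a small lookup table. The preset keys below (`customer_service` and so on) are hypothetical names chosen for this sketch. Only the numeric values come from the table above.

```python
from typing import Dict, Union

# (interruptionSensitivity, silenceThresholdMs) per use case, from the table above.
PRESETS = {
    "customer_service": (0.7, 800),
    "healthcare_intake": (0.5, 1200),
    "outbound_sales": (0.8, 600),
    "accessibility": (0.4, 1500),
}


def tuning_payload(use_case: str) -> Dict[str, Union[float, int]]:
    """Build the Business PATCH fields for a named preset."""
    sensitivity, silence_ms = PRESETS[use_case]
    return {"interruptionSensitivity": sensitivity, "silenceThresholdMs": silence_ms}


payload = tuning_payload("healthcare_intake")
```

Treat the result as a starting point and adjust after listening to real calls.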

Pronunciation overrides

For brand names, place names, or jargon that TTS mispronounces, add SSML <phoneme> or <sub> tags directly in your systemPrompt or greetingTemplate:


You work for <sub alias="Ah-kohl">Akol</sub>, a voice AI platform.
Always pronounce <phoneme alphabet="ipa" ph="ˈɑːkoʊl">Akol</phoneme> correctly.

The TTS provider strips the SSML and applies the pronunciation. This works in any field passed to TTS (greeting, fallback, system prompt context).
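When you generate prompts or greetings programmatically, building the tags with proper escaping avoids malformed SSML. A minimal sketch using the Python standard library; the helper names `phoneme` and `sub` are hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr


def phoneme(text: str, ipa: str) -> str:
    """Wrap text in an SSML <phoneme> tag carrying an IPA pronunciation."""
    # quoteattr adds the surrounding quotes and escapes the attribute value.
    return f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(text)}</phoneme>'


def sub(text: str, alias: str) -> str:
    """Wrap text in an SSML <sub> tag so that the alias is spoken instead."""
    return f"<sub alias={quoteattr(alias)}>{escape(text)}</sub>"


greeting = f"Thanks for calling {phoneme('Akol', 'ˈɑːkoʊl')}."
department = sub("R&D", "research and development")
```

Escaping matters for names containing `&` or `<`, which would otherwise break the markup.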

Voice quality troubleshooting

Symptom	Likely cause	Fix
Agent talks over the caller	`interruptionSensitivity` too low	Raise to 0.7+
Agent waits too long before speaking	`silenceThresholdMs` too high, or LLM provider slow	Lower threshold; check `/api/v1/health` for provider status
Agent voice sounds “off” in German	Voice ID doesn’t include `de` in its `languages` array	Pick a voice from `/api/v1/voices?language=de`
Voice cuts off mid-word	Network buffering on caller side	Usually a carrier-side issue. Check the call’s `outcome` for hints.
Voice prosody resets between sentences	ElevenLabs request continuation context not preserved	This is a known limitation; prompt the agent to use shorter sentences for now

Latency budget

End-to-end first-audio-out latency targets:


Caller speech → STT (Deepgram Flux ~150ms)
              → LLM (Groq ~120ms first token)
              → TTS (ElevenLabs Flash ~75ms first chunk)
              → Caller hears agent
              ────────────
              ~470ms typical (the stages above sum to ~345ms; the remaining ~125ms is not itemized here)

If your calls feel slow, check /api/v1/calls/:id and look at the metadata.latencies object. The biggest knobs you control:

System prompt length: over ~3000 tokens significantly increases time to first token (TTFT)
Function tools: every tool definition adds context; trim unused ones
Voice category: `emotive` voices have ~50ms more latency than `stable`
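To decide which knob to turn first, compare a call's measured latencies with the per-stage targets. The sketch below assumes `metadata.latencies` is a flat object of millisecond values keyed by stage. The key names `stt`, `llm`, and `tts` are hypothetical, so check a real response for the actual shape.

```python
from typing import Dict, Tuple

# Per-stage targets from the latency budget above, in milliseconds.
BUDGET_MS = {"stt": 150, "llm": 120, "tts": 75}


def slowest_stage(latencies: Dict[str, float]) -> Tuple[str, float]:
    """Return the stage with the highest measured latency."""
    stage = max(latencies, key=latencies.get)
    return stage, latencies[stage]


def over_budget(latencies: Dict[str, float]) -> Dict[str, float]:
    """Map each stage that exceeds its target to the overshoot in milliseconds."""
    return {
        stage: ms - BUDGET_MS[stage]
        for stage, ms in latencies.items()
        if stage in BUDGET_MS and ms > BUDGET_MS[stage]
    }


measured = {"stt": 140, "llm": 310, "tts": 80}  # example values, not real call data
worst = slowest_stage(measured)
overshoot = over_budget(measured)
```

A large LLM overshoot points at prompt length or tool definitions. A TTS overshoot points at the voice category.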