# Evals & Testing
The Evals dashboard shows you how well your AI agents perform across real-world scenarios. It runs automated tests — like a practice exam for your agent — and reports a score.
The Evals dashboard is only available to admins. If you don’t see it in your sidebar, contact your account administrator.
## What Gets Tested
Akol tests your agent across 25 scenarios covering the most common call situations:
| Category | Scenarios | What’s Tested |
|---|---|---|
| Dental | 3 | Booking appointments, cancellations, FAQs |
| Restaurant | 3 | Reservations, menu questions, special requests |
| Real Estate | 2 | Property inquiries, scheduling showings |
| Healthcare | 3 | Appointments, insurance questions, symptoms |
| Automotive | 3 | Service booking, maintenance, warranty claims |
| Legal | 3 | Consultations, document requests, billing |
| Function Calling | 4 | Multi-step flows, SMS sending, cross-industry tasks |
| Security | 4 | Prompt injection, data protection, role escape attempts |
Each scenario is a multi-turn conversation — just like a real call — where the test plays the role of a caller and checks how your agent responds.
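To make the shape of a scenario concrete, here is a minimal sketch of how a multi-turn test case might be structured. The field names (`turns`, `assertions`, and so on) are illustrative, not Akol's actual schema:

```python
# Hypothetical scenario definition: a scripted caller plus checks to run
# against the agent's replies. Field names are illustrative only.
scenario = {
    "name": "dental_booking",
    "category": "dental",
    "turns": [
        {"caller": "Hi, I'd like to book a cleaning for next Tuesday."},
        {"caller": "Morning works best, around 9am if possible."},
    ],
    "assertions": [
        {"type": "contains", "value": "Tuesday"},
        {"type": "function_called", "value": "schedule_appointment"},
    ],
}
```

The test harness plays each `caller` turn in order, records the agent's responses and tool calls, then evaluates every assertion against the transcript.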
## Reading Your Results

### Overall Score
At the top of the page you’ll see:
- Score — Percentage of scenarios that passed (100% = all green)
- Passed — How many scenarios passed out of the total
- Duration — How long the full test run took
- Mode — Whether it was a mock or live test
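The summary numbers follow directly from the per-scenario results. A minimal sketch of the arithmetic (the `passed` field name is hypothetical):

```python
def summarize(results):
    """Compute dashboard-style summary stats from per-scenario results.
    Score is the percentage of scenarios that passed."""
    total = len(results)
    passed = sum(1 for r in results if r["passed"])
    return {
        "score": round(100 * passed / total) if total else 0,
        "passed": f"{passed}/{total}",
    }

summarize([{"passed": True}, {"passed": True}, {"passed": False}])
# → {'score': 67, 'passed': '2/3'}
```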
### Scenario Details
Click on any scenario to expand it and see:
- Pass/Fail — Whether all checks passed for this scenario
- Category — Industry, function-calling, or security
- Turn-by-turn conversation — What the caller said, what your agent said, and what actions it took
- Assertion results — Individual checks with pass/fail and reasons
## What Gets Checked
Each scenario runs multiple checks (called assertions) against your agent’s responses:
| Check | What It Verifies |
|---|---|
| Contains | Agent’s response includes a specific phrase |
| Not Contains | Agent’s response avoids certain words |
| Regex Match | Response matches a pattern |
| Function Called | Agent used the right tool (e.g., scheduled an appointment) |
| Function Not Called | Agent didn’t use a tool it shouldn’t have |
| Function Args | Agent passed the correct details to a tool |
| Response Length | Response is within expected word count |
| Tone | Response has the right tone (live mode only) |
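The deterministic checks in this table are straightforward to implement. Here is a sketch of how a few of them might be evaluated; the check-type names mirror the table, but the function and dictionary layout are assumptions, not Akol's actual internals:

```python
import re

def run_assertion(check, response, called_functions):
    """Evaluate one deterministic check against an agent response.
    `check` is {"type": ..., "value": ...}; names are illustrative."""
    kind, expected = check["type"], check["value"]
    if kind == "contains":
        return expected.lower() in response.lower()
    if kind == "not_contains":
        return expected.lower() not in response.lower()
    if kind == "regex_match":
        return re.search(expected, response) is not None
    if kind == "function_called":
        return expected in called_functions
    if kind == "function_not_called":
        return expected not in called_functions
    raise ValueError(f"unknown check type: {kind}")
```

For example, `run_assertion({"type": "contains", "value": "tuesday"}, "Booked for Tuesday at 9am.", [])` returns `True`, while a `function_called` check passes only if the named tool appears in the recorded tool calls.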
## Mock vs Live Mode
Evals can run in two modes:
| Mode | Speed | What It Tests | When to Use |
|---|---|---|---|
| Mock | Very fast (~22ms) | Deterministic checks — did the agent call the right function, say the right thing? | Quick checks, CI pipeline |
| Live | Slower (~2 min) | Everything mock tests + tone and task completion judged by an AI evaluator | Before major releases, thorough testing |
Mock mode skips subjective checks like tone evaluation. Live mode uses an AI judge to evaluate whether the agent’s responses sound right.
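The split between the two modes can be pictured as a dispatch over check types: deterministic checks run in both modes, while subjective checks are skipped in mock mode and delegated to a judge in live mode. A hypothetical sketch (the judge is stubbed as a callable; none of these names are Akol's real API):

```python
def run_scenario(scenario, response, mode="mock", judge=None):
    """Evaluate a scenario's checks. Mock mode skips subjective checks
    (tone); live mode delegates them to an AI-judge callable.
    All names here are illustrative."""
    results = []
    for check in scenario["assertions"]:
        if check["type"] == "tone":
            if mode == "mock":
                continue  # subjective check: skipped in mock mode
            results.append(judge(response, check["value"]))
        else:
            # Simplified stand-in for the deterministic checks above.
            results.append(check["value"] in response)
    return all(results)
```

Usage: with `mode="mock"` the tone check is silently skipped; with `mode="live"` the same scenario also calls `judge(response, "friendly")` and folds its verdict into the pass/fail result.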
## Running Evals
Evals run automatically as part of the CI pipeline. Results are saved and displayed on this dashboard page.
To view the latest results, go to Dashboard > Evals. The page shows the most recent test run with all scenario results.
Evals help catch regressions — if a code change accidentally makes your agent worse at booking appointments or more vulnerable to prompt injection, the eval score will drop and flag it.
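One common way to turn a score drop into a hard failure is a regression gate in the pipeline. This is a hypothetical sketch, not a built-in Akol feature; how you store the baseline and wire the exit code is up to your CI setup:

```python
def gate(current_score, baseline_score, tolerance=0.0):
    """Return a CI exit code: non-zero if the eval score regressed
    past the baseline by more than `tolerance` percentage points."""
    if current_score < baseline_score - tolerance:
        print(f"Eval regression: {current_score}% < baseline {baseline_score}%")
        return 1  # non-zero exit code fails the pipeline
    return 0
```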