There is a recent survey paper on multi-turn conversations: Evaluating LLM-based Agents for Multi-Turn Conversations: A Survey.
Why is multi-turn conversation evaluation so hard? Keywords like counterfactual and simulated user come up everywhere.
Challenges
For a single turn, there are many right ways to answer or ask a question, even for something as obvious as asking the user for their name and DOB.
At turn 1, the agent could ask “May I know your name and DOB?” or just “May I know your name?”. Both are correct, yet each choice sends the conversation down a different path, so a single golden transcript cannot cover every valid continuation.
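To make this concrete, here is a minimal sketch (the `rubric_check` helper is a made-up example, not anything from the survey): exact match against one golden turn rejects the second, equally valid opening, while a looser rubric accepts both.

```python
# A minimal sketch of why golden-transcript matching breaks at turn 1.
# rubric_check is a hypothetical alternative: it only tests whether the
# agent asked for at least the name, rather than requiring one exact phrasing.

GOLDEN_TURN_1 = "May I know your name and DOB?"

def exact_match(agent_turn: str) -> bool:
    """Pass only if the agent reproduces the golden phrasing verbatim."""
    return agent_turn.strip() == GOLDEN_TURN_1

def rubric_check(agent_turn: str) -> bool:
    """Pass if the agent asked a question that mentions the user's name."""
    text = agent_turn.lower()
    return "name" in text and "?" in text

for candidate in ["May I know your name and DOB?", "May I know your name?"]:
    print(candidate, exact_match(candidate), rubric_check(candidate))
# exact_match rejects the second, equally valid opening; rubric_check accepts both.
```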
Thoughts
Another challenge is that the prompt and the data can mismatch, and both are moving targets. The prompt serves as both the goal (instruction following) and the ground truth, so when an eval fails it is hard to tell which happened (a rough sketch follows the list):
- the prompt was not followed, or
- the task itself is not correct.
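One way I think about separating the two, sketched below with hypothetical names (`triage`, `followed_prompt`, and `achieved_task` are placeholders for whatever checks the eval actually uses, whether rules, an LLM judge, or human labels): keep the task goal as a separate artifact from the prompt, and attribute each failure to one bucket instead of reporting a single pass/fail.

```python
# Hypothetical failure-triage sketch: keep the task goal separate from the
# prompt so a failed conversation can be attributed to one of two buckets.
from dataclasses import dataclass
from enum import Enum, auto

class Failure(Enum):
    NONE = auto()
    PROMPT_NOT_FOLLOWED = auto()  # agent ignored an instruction in the prompt
    TASK_NOT_CORRECT = auto()     # prompt was followed, but the task spec itself was off

@dataclass
class EvalCase:
    prompt: str             # instructions given to the agent
    task_goal: str          # what the conversation was actually supposed to achieve
    transcript: list[str]   # the multi-turn conversation under evaluation

def triage(case: EvalCase, followed_prompt: bool, achieved_task: bool) -> Failure:
    """Attribute a failure instead of reporting a single pass/fail score."""
    if achieved_task:
        return Failure.NONE
    if not followed_prompt:
        return Failure.PROMPT_NOT_FOLLOWED
    # The agent did what the prompt said, yet the goal was missed:
    # the prompt (our "ground truth") is the moving target.
    return Failure.TASK_NOT_CORRECT

case = EvalCase(
    prompt="Collect the user's name and DOB before booking.",
    task_goal="Book an appointment for the user.",
    transcript=["May I know your name?", "..."],
)
print(triage(case, followed_prompt=True, achieved_task=False))
# -> Failure.TASK_NOT_CORRECT: the prompt was obeyed, but the task still failed
```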
Benchmark vs conversational evals
Golden references vs metrics
Trade-offs in setting up evals: do we have a triangle? (a rough sketch follows the list)
- eval speed (runtime compute?)
- sensitivity
- alignment
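A rough sketch of the triangle as a data structure, assuming the three axes are eval speed (runtime compute), sensitivity to regressions, and alignment with human judgment; the qualitative ratings are illustrative guesses, not measurements.

```python
# Illustrative comparison of evaluator styles along the three axes.
# The ratings are placeholders to show the trade-off, not real benchmarks.
from dataclasses import dataclass

@dataclass
class EvaluatorProfile:
    name: str
    speed: str        # runtime compute cost of running the eval
    sensitivity: str  # how small a regression it can detect
    alignment: str    # how well scores track human judgment

profiles = [
    EvaluatorProfile("exact match vs golden transcript", speed="high", sensitivity="high", alignment="low"),
    EvaluatorProfile("rubric / heuristic checks",        speed="high", sensitivity="medium", alignment="medium"),
    EvaluatorProfile("LLM judge + simulated user",       speed="low",  sensitivity="medium", alignment="high"),
]

for p in profiles:
    print(f"{p.name:38s} speed={p.speed:6s} sensitivity={p.sensitivity:6s} alignment={p.alignment}")
```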