Multi-turn Conversational Agent Evaluation Review

There is a recent survey paper on this topic: Evaluating LLM-based Agents for Multi-Turn Conversations: A Survey.

Why is multi-turn conversation evaluation so hard? Keywords like counterfactual evaluation and simulated users come up everywhere in this space.
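A minimal sketch of what a simulated-user loop might look like, assuming a hypothetical `complete(system_prompt, transcript)` helper that wraps whatever LLM API you use (it is not a real library call):

```python
from typing import Callable

# Hypothetical helper: (system_prompt, transcript) -> next message text.
Complete = Callable[[str, list[dict[str, str]]], str]

AGENT_PROMPT = "You are a clinic intake agent. Collect the caller's name and DOB."
USER_PERSONA = ("You are a caller named Ana Silva, DOB 1990-03-14. "
                "Answer briefly. Say DONE when the agent has both facts.")

def run_episode(complete: Complete, max_turns: int = 6) -> list[dict[str, str]]:
    """Alternate agent and simulated-user turns, return the transcript."""
    transcript: list[dict[str, str]] = []
    for _ in range(max_turns):
        # Agent speaks, conditioned on the conversation so far.
        agent_msg = complete(AGENT_PROMPT, transcript)
        transcript.append({"role": "assistant", "content": agent_msg})
        # The simulated user sees the roles flipped: the agent is its "user".
        flipped = [
            {"role": "user" if m["role"] == "assistant" else "assistant",
             "content": m["content"]}
            for m in transcript
        ]
        user_msg = complete(USER_PERSONA, flipped)
        transcript.append({"role": "user", "content": user_msg})
        if "DONE" in user_msg:  # stop token requested in the persona prompt
            break
    return transcript
```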

Challenges

For a single turn, there are many right ways to answer or ask a question, even for something as obvious as asking the user for their name and DOB.

In turn 1, the agent could ask “May I know your name and DOB?” or just “May I know your name?”. Both are correct.
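This is one reason exact-match comparison against a single golden transcript breaks down. A rough sketch of a looser per-turn check, using an illustrative rubric of required and forbidden patterns (the field names and patterns below are my own, not from the survey):

```python
import re

# Sketch: judge a turn against a rubric rather than one golden string.
RUBRIC = {
    "must_ask": [r"\bname\b"],                   # has to ask for the name now
    "may_ask":  [r"\bDOB\b", r"date of birth"],  # allowed now or in a later
                                                 # turn, so not enforced here
    "must_not": [r"\bSSN\b"],                    # should not over-collect
}

def turn_ok(turn: str) -> bool:
    required = all(re.search(p, turn, re.I) for p in RUBRIC["must_ask"])
    forbidden = any(re.search(p, turn, re.I) for p in RUBRIC["must_not"])
    return required and not forbidden

# Both phrasings above pass; an exact-match golden reference would fail one.
assert turn_ok("May I know your name and DOB?")
assert turn_ok("May I know your name?")
```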

Thoughts

Another challenge is that the prompt and the data can mismatch, creating moving targets. The prompt is both the goal (instruction following) and the ground truth, so when an eval fails it is hard to tell which of these went wrong (one way to separate them is sketched after this list):

  • Prompt is not followed
  • Task is not correct
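One way to make the distinction inspectable is to grade the two failure modes separately. A sketch, assuming a hypothetical judge callable that answers yes/no questions about a transcript:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical LLM judge: (question, transcript) -> "yes" or "no".
Judge = Callable[[str, str], str]

@dataclass
class Verdict:
    followed_prompt: bool  # did the agent obey its instructions?
    task_correct: bool     # did the conversation achieve the task?

def diagnose(judge: Judge, prompt: str, transcript: str) -> Verdict:
    """Ask two separate questions so a failure points at prompt vs. task."""
    followed = judge(
        f"Did the assistant follow these instructions?\n{prompt}", transcript)
    correct = judge(
        "Did the conversation accomplish the user's task correctly?", transcript)
    return Verdict(followed == "yes", correct == "yes")

# A failing eval now comes with a direction to look in:
#   followed_prompt=False, task_correct=True  -> fix the prompt or the grader
#   followed_prompt=True,  task_correct=False -> fix the task data or the agent
```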

Benchmark vs. Conversational Evaluation

Golden Transcripts vs. Metrics

Trade-offs in setting up evals: do we have a triangle? (A rough sketch follows the list.)

  • eval speed (runtime compute?)
  • sensitivity
  • alignment
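A rough way to picture the triangle, with placeholder ratings (pure guesses for illustration, not measurements) for a few common evaluator setups:

```python
# Placeholder ratings (1 = weak, 3 = strong) along the three axes above.
EVALUATORS = {
    #                 speed  sensitivity  alignment
    "exact match":   (3,     1,           1),
    "LLM-as-judge":  (2,     2,           2),
    "human review":  (1,     3,           3),
}

for name, (speed, sensitivity, alignment) in EVALUATORS.items():
    print(f"{name:>14}: speed={speed} "
          f"sensitivity={sensitivity} alignment={alignment}")
```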
