There is a recent survey paper on multi-turn conversations: Evaluating LLM-based Agents for Multi-Turn Conversations: A Survey.
Why is multi-turn conversation evaluation so hard? Keywords like counterfactual and simulated user come up everywhere.
Challenges
For a single turn, there are many right ways to answer or ask a question, even for something as obvious as asking the user for their name and DOB.
At turn 1, the agent could ask “May I know your name and DOB?” or just “May I know your name?”. Both are correct, yet each choice sends the conversation down a different path, so a single golden transcript cannot cover every valid continuation.
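To make this concrete, here is a minimal sketch (the `rubric_check` helper is a made-up example, not anything from the survey): exact match against one golden turn rejects the second, equally valid opening, while a looser rubric accepts both.

```python
# A minimal sketch of why golden-transcript matching breaks at turn 1.
# rubric_check is a hypothetical alternative: it only tests whether the
# agent asked for at least the name, rather than requiring one exact phrasing.

GOLDEN_TURN_1 = "May I know your name and DOB?"

def exact_match(agent_turn: str) -> bool:
    """Pass only if the agent reproduces the golden phrasing verbatim."""
    return agent_turn.strip() == GOLDEN_TURN_1

def rubric_check(agent_turn: str) -> bool:
    """Pass if the agent asked a question that mentions the user's name."""
    text = agent_turn.lower()
    return "name" in text and "?" in text

for candidate in ["May I know your name and DOB?", "May I know your name?"]:
    print(candidate, exact_match(candidate), rubric_check(candidate))
# exact_match rejects the second, equally valid opening; rubric_check accepts both.
```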
Thoughts
Another challenge is that the prompt and the data can mismatch, and both are moving targets. The prompt serves as both the goal (instruction following) and the ground truth, so when an eval fails it is hard to tell which happened (a rough sketch follows the list):
- the prompt was not followed, or
- the task itself is not correct.
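One way I think about separating the two, sketched below with hypothetical names (`triage`, `followed_prompt`, and `achieved_task` are placeholders for whatever checks the eval actually uses, whether rules, an LLM judge, or human labels): keep the task goal as a separate artifact from the prompt, and attribute each failure to one bucket instead of reporting a single pass/fail.

```python
# Hypothetical failure-triage sketch: keep the task goal separate from the
# prompt so a failed conversation can be attributed to one of two buckets.
from dataclasses import dataclass
from enum import Enum, auto

class Failure(Enum):
    NONE = auto()
    PROMPT_NOT_FOLLOWED = auto()  # agent ignored an instruction in the prompt
    TASK_NOT_CORRECT = auto()     # prompt was followed, but the task spec itself was off

@dataclass
class EvalCase:
    prompt: str             # instructions given to the agent
    task_goal: str          # what the conversation was actually supposed to achieve
    transcript: list[str]   # the multi-turn conversation under evaluation

def triage(case: EvalCase, followed_prompt: bool, achieved_task: bool) -> Failure:
    """Attribute a failure instead of reporting a single pass/fail score."""
    if achieved_task:
        return Failure.NONE
    if not followed_prompt:
        return Failure.PROMPT_NOT_FOLLOWED
    # The agent did what the prompt said, yet the goal was missed:
    # the prompt (our "ground truth") is the moving target.
    return Failure.TASK_NOT_CORRECT

case = EvalCase(
    prompt="Collect the user's name and DOB before booking.",
    task_goal="Book an appointment for the user.",
    transcript=["May I know your name?", "..."],
)
print(triage(case, followed_prompt=True, achieved_task=False))
# -> Failure.TASK_NOT_CORRECT: the prompt was obeyed, but the task still failed
```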
Benchmark vs conversational evals
Golden references vs metrics
Trade-offs in setting up evals: do we have a triangle? (a rough sketch follows the list)
- eval speed (runtime compute?)
- sensitivity
- alignment
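A rough sketch of the triangle as a data structure, assuming the three axes are eval speed (runtime compute), sensitivity to regressions, and alignment with human judgment; the qualitative ratings are illustrative guesses, not measurements.

```python
# Illustrative comparison of evaluator styles along the three axes.
# The ratings are placeholders to show the trade-off, not real benchmarks.
from dataclasses import dataclass

@dataclass
class EvaluatorProfile:
    name: str
    speed: str        # runtime compute cost of running the eval
    sensitivity: str  # how small a regression it can detect
    alignment: str    # how well scores track human judgment

profiles = [
    EvaluatorProfile("exact match vs golden transcript", speed="high", sensitivity="high", alignment="low"),
    EvaluatorProfile("rubric / heuristic checks",        speed="high", sensitivity="medium", alignment="medium"),
    EvaluatorProfile("LLM judge + simulated user",       speed="low",  sensitivity="medium", alignment="high"),
]

for p in profiles:
    print(f"{p.name:38s} speed={p.speed:6s} sensitivity={p.sensitivity:6s} alignment={p.alignment}")
```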