MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback

1University of Illinois Urbana-Champaign, 2Renmin University of China
To appear at ICLR 2024

MINT benchmark measures LLMs' ability to solve tasks with multi-turn interactions by (1) using tools and (2) leveraging natural language feedback.

This table contains the micro average across all task instances originally featured in the MINT paper. It includes test instances from several sources: HumanEval, MBPP, GSM8K, HotpotQA, MATH, MMLU, TheoremQA, and AlfWorld.

This code subset follows the Eurus paper and contains MBPP and HumanEval.

This math subset follows the Eurus paper and contains TheoremQA, MATH, and MMLU.
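For reference, a micro average pools all task instances before averaging, so larger source datasets carry proportionally more weight than in a per-dataset (macro) average. A minimal sketch of the computation; the per-dataset counts below are illustrative placeholders, not MINT's actual numbers:

import statistics  # only needed if you also want the macro average

# Micro average: divide total successes by total instances across all
# source datasets, rather than averaging per-dataset success rates.
# The (solved, total) counts below are made up for illustration.
results = {
    "HumanEval": (12, 45),
    "MBPP":      (20, 91),
    "GSM8K":     (30, 48),
}

solved = sum(s for s, _ in results.values())
total = sum(t for _, t in results.values())
print(f"micro-average success rate: {solved / total:.1%}")  # 62/184 ≈ 33.7%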


MINT can measure different LLMs' ability to provide natural language feedback by measuring the benefit of their feedback (Δ Success Rate) to a fixed LLM (gpt-3.5-turbo-0613).


Please refer to our GitHub repo to add your model to the leaderboard.

Abstract

To solve complex tasks, large language models (LLMs) often require multiple rounds of interactions with the user, sometimes assisted by external tools. However, current evaluation protocols often emphasize benchmark performance with single-turn exchanges, neglecting the nuanced interactions among the user, LLMs, and external tools, while also underestimating the importance of natural language feedback from users. These oversights contribute to discrepancies between research benchmark evaluations and real-world use cases. We introduce MINT, a benchmark that evaluates LLMs' ability to solve tasks with multi-turn interactions by (1) using tools and (2) leveraging natural language feedback. To ensure reproducibility, we provide an evaluation framework where LLMs can access tools by executing Python code and receive users' natural language feedback simulated by GPT-4. We repurpose a diverse set of established evaluation datasets focusing on reasoning, coding, and decision-making and carefully curate them into a compact subset for efficient evaluation.
Our analysis of 20 open- and closed-source LLMs offers intriguing findings.

  • (a) LLMs generally benefit from tools and language feedback, with performance gains (absolute, same below) of 1-8% for each turn of tool use and 2-17% with natural language feedback.
  • (b) Better single-turn performance does not guarantee better multi-turn performance.
  • (c) Surprisingly, on the LLMs evaluated, supervised instruction-finetuning (SIFT) and reinforcement learning from human feedback (RLHF) generally hurt multi-turn capabilities.

We expect MINT can help measure progress and incentivize research in improving LLMs' capabilities in multi-turn interactions, especially for open-source communities where multi-turn human evaluation can be less accessible compared to commercial LLMs with a larger user base.

Interaction Framework

MINT mirrors the real-world User-LLM-Tool collaborative problem-solving setting. To solve a problem, the LLM can (1) use external tools by generating and executing Python programs and/or (2) collect natural language feedback to refine its solutions; the feedback is provided by GPT-4, aiming to simulate human users in a reproducible and scalable way.

  • We measure LLMs' tool-augmented task-solving capability by analyzing their performance gain as the number of turns increases without language feedback (i.e., no red dotted box in the figure below).
  • We quantify LLMs' ability to leverage natural language feedback by the performance gain upon receiving GPT-4-generated feedback (i.e., performance without vs. with the red dotted box in the figure below).
[Figure: an illustrative example of the MINT interaction framework; the red dotted box marks the optional language-feedback turn.]
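A minimal sketch of this interaction loop, in the spirit of the framework (the actual implementation is in the GitHub repo; task, llm, and the helpers below are hypothetical stand-ins, not the framework's real API):

import contextlib
import io

def execute_python(code: str) -> str:
    """Run model-generated code and capture stdout. Toy stand-in: the
    real framework executes tools in a sandboxed Python environment."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})
    except Exception as e:
        return f"Error: {e}"
    return buf.getvalue()

def interact(task, llm, max_turns=5, feedback_fn=None):
    """One episode: up to max_turns solution attempts with tool use, plus
    an optional simulated-user feedback turn after each failed attempt."""
    history = [task.prompt]
    for _ in range(max_turns):
        action = llm.generate(history)      # text, possibly containing code
        history.append(action.text)
        if action.code is not None:         # tool use: run the code and
            history.append(execute_python(action.code))  # show the result
        if task.is_solved(action):          # checked against ground truth
            return True
        if feedback_fn is not None:         # e.g., a GPT-4-simulated user
            history.append(feedback_fn(history))
    return False

Calling interact with feedback_fn=None corresponds to the setting without the red dotted box; passing a GPT-4-backed feedback function corresponds to the setting with it.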

Evaluation

We evaluate 20 LLMs, of which 4 are closed-source and 16 open-source. We cover different sizes and training techniques to better understand how they affect LLMs' multi-turn interaction capability. We consider three variants of training techniques:

  • Base: Pre-trained model
  • SIFT: Supervised Instruction-Finetuning
  • RLHF: Reinforcement Learning from Human Feedback

Tool-augmented Task-Solving capabilities of LLMs

  • We find that all open-source models fall behind most commercial closed-source models in both success rate at k=5 and improvement rate (slope; see the sketch after this list).
  • Absolute performance and improvement per turn (i.e., slope) scale with model size.
  • SIFT on multi-turn data can potentially be helpful. Vicuna-v1.5 (7B), a SIFT variant of LLaMA-2 trained on ShareGPT conversations (most of which are multi-turn), exhibits stronger performance than LLaMA-2 (Base and RLHF)1. We observe a similar trend for Lemur-70b-chat-v1, which continues pre-training LLaMA-2 (70B) on code-intensive data, followed by SIFT on multi-turn data.
  • We find RLHF hurts LLM-tool multi-turn interaction in the LLaMA-2 series. However, it is unclear whether RLHF is problematic in general, or whether the issue only arises when RLHF is primarily applied to single-turn data.
  1. We find some performance degradation in Vicuna-v1.5 (especially the 13B variant), potentially due to training artifacts. We refer to Section 3.5 of the paper for more details.
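The improvement rate above is the slope of success rate as a function of the turn budget k. A minimal sketch of estimating such a slope with a least-squares fit; the success rates below are illustrative, not actual MINT results:

import numpy as np

# Success rate measured at turn budgets k = 1..5 (made-up numbers).
# The improvement rate is the fitted slope: the absolute gain in
# success rate per additional turn of tool use.
k = np.array([1, 2, 3, 4, 5])
success_rate = np.array([0.22, 0.27, 0.31, 0.33, 0.36])

slope, intercept = np.polyfit(k, success_rate, deg=1)
print(f"improvement rate: {slope:+.3f} per turn")  # +0.034, i.e., +3.4% absolute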

LLMs' Ability to Leverage Natural Language Feedback

  • We find no significant difference between open- and closed-source models in terms of Δfeedback (a sketch of how Δfeedback is computed follows this list).
  • Similar to the previous findings, we find that SIFT and RLHF hurt models' ability to leverage feedback for CodeLLaMA (except 7B) and LLaMA-2: they all have lower Δfeedback and Success Rate (with feedback) compared to their base variants. Two further exceptions are Vicuna and Lemur-v1; we speculate that using multi-turn conversations (ShareGPT) for SIFT contributes to these exceptions.
  • Models hardly benefit from self-feedback. We find that GPT-4-0613 using self-generated feedback gains little: only decision-making improves slightly.
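Concretely, Δfeedback is the gap between success rates with and without the simulated feedback over the same task set. A minimal sketch reusing the hypothetical interact loop from the framework section above:

def delta_feedback(tasks, llm, feedback_fn):
    """Δfeedback: the absolute success-rate gain from adding simulated
    language feedback (e.g., from GPT-4) to the interaction loop.
    tasks, llm, and feedback_fn are the hypothetical objects from the
    earlier sketch, not the framework's real API."""
    def success_rate(fb):
        solved = sum(interact(t, llm, feedback_fn=fb) for t in tasks)
        return solved / len(tasks)
    return success_rate(feedback_fn) - success_rate(None)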

LLMs' Ability to Provide Natural Language Feedback

In this section, we fix the evaluated LLM (gpt-3.5-turbo-0613) and use different LLMs to provide language feedback. This allows us to measure different LLMs' effectiveness in providing feedback.
We find that task-solving ability can be orthogonal to feedback-providing ability: an LLM's higher task-solving performance does not necessarily translate to better feedback-providing capability, and vice versa. For example, despite performing the worst at solving tasks, CodeLLaMA (34B, SIFT) can provide feedback that improves the stronger GPT-3.5. A sketch of this setup follows below.
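Under the same hypothetical interfaces as the earlier sketches, the setup looks roughly like this: the task-solving model is held fixed while the feedback provider varies, and each provider is scored by the Δ Success Rate it induces:

def rank_feedback_providers(tasks, solver_llm, provider_llms):
    """Score each feedback provider by the success-rate gain it gives a
    fixed solver (playing gpt-3.5-turbo-0613's role). provider_llms maps
    names to hypothetical model objects, as in the sketches above."""
    baseline = sum(interact(t, solver_llm) for t in tasks) / len(tasks)
    scores = {}
    for name, provider in provider_llms.items():
        fb = lambda history, p=provider: p.generate(history).text
        with_fb = sum(interact(t, solver_llm, feedback_fn=fb)
                      for t in tasks) / len(tasks)
        scores[name] = with_fb - baseline  # Δ Success Rate for this provider
    return scores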

BibTeX

@misc{wang2023mint,
    title={MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback},
    author={Xingyao Wang and Zihan Wang and Jiateng Liu and Yangyi Chen and Lifan Yuan and Hao Peng and Heng Ji},
    year={2023},
    eprint={2309.10691},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}