MT bench

1 introduction

We create MT-bench, a benchmark consisting of 80 high-quality multi-turn questions. MT-bench is designed to test multi-turn conversation and instruction-following ability, covering common use cases and focusing on challenging questions to differentiate models. We identify 8 common categories of user prompts to guide its construction: writing, roleplay, extraction, reasoning, math, coding, knowledge I (STEM), and knowledge II (humanities/social science). For each category, we then manually designed 10 multi-turn questions. Table 1 lists several sample questions.

note: the introduction above is quoted from the MT-bench paper ("Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena").

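2 question categories

The 8 categories, with 10 multi-turn questions each, are: writing, roleplay, extraction, reasoning, math, coding, knowledge I (STEM), and knowledge II (humanities/social science).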
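To make the "8 categories x 10 questions" layout concrete, here is a minimal sketch that loads multi-turn questions from JSON Lines and counts them per category. The field names (`question_id`, `category`, `turns`) and the placeholder prompts are assumptions for illustration, not a confirmed description of the official question file.

```python
import json
from collections import Counter

# Hypothetical sample records; the real benchmark has 80 questions.
# Field names and prompt texts are illustrative assumptions.
SAMPLE_JSONL = """\
{"question_id": 1, "category": "writing", "turns": ["<first-turn prompt>", "<follow-up prompt>"]}
{"question_id": 2, "category": "math", "turns": ["<first-turn prompt>", "<follow-up prompt>"]}
{"question_id": 3, "category": "math", "turns": ["<first-turn prompt>", "<follow-up prompt>"]}
"""


def load_questions(text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def count_by_category(questions):
    """Return a Counter mapping category name to number of questions."""
    return Counter(q["category"] for q in questions)


questions = load_questions(SAMPLE_JSONL)
counts = count_by_category(questions)
```

On the full benchmark, one would expect `counts` to contain 8 keys with a value of 10 each.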

3 how does it evaluate

We propose 3 LLM-as-a-judge variations. They can be implemented independently or in combination:
• Pairwise comparison. An LLM judge is presented with a question and two answers, and tasked to determine which one is better or declare a tie. The prompt used is given in Figure 5 (Appendix).
• Single answer grading. Alternatively, an LLM judge is asked to directly assign a score to a single answer. The prompt used for this scenario is in Figure 6 (Appendix).
• Reference-guided grading. In certain cases, it may be beneficial to provide a reference solution if applicable. An example prompt we use for grading math problems is in Figure 8 (Appendix).
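Whichever variation is used, the judge's free-text reply has to be turned into a verdict or a score. The sketch below assumes the judge is instructed to finish with a bracketed tag: `[[A]]`, `[[B]]`, or `[[C]]` (tie) for pairwise comparison, and `[[rating]]` (for example `Rating: [[5]]`) for single answer grading. The function names are hypothetical, and this is not FastChat's own parsing code.

```python
import re

# Assumed output conventions of the judge prompts (see lead-in).
PAIRWISE_TAG = re.compile(r"\[\[([ABC])\]\]")
RATING_TAG = re.compile(r"\[\[(\d+(?:\.\d+)?)\]\]")


def parse_pairwise(judgment):
    """Map a pairwise judgment to 'A', 'B', 'tie', or 'error' if no tag is found."""
    match = PAIRWISE_TAG.search(judgment)
    if match is None:
        return "error"
    return {"A": "A", "B": "B", "C": "tie"}[match.group(1)]


def parse_rating(judgment):
    """Extract the numeric score from a single-answer judgment, or None if absent."""
    match = RATING_TAG.search(judgment)
    return float(match.group(1)) if match else None
```

Reference-guided grading can reuse the same parsers, since only the prompt changes (it additionally contains the reference solution).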

note: the full prompts are given in the Appendix of the paper.

4 what are the metrics

We define the agreement between two types of judges as the probability of randomly selected individuals (but not identical) of each type agreeing on a randomly selected question. See more explanation in Appendix D.3. Average win rate is the average of win rates against all other players. These metrics can be computed with or without including tie votes.
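The agreement definition can be sketched as follows. The data layout (a dict mapping question id to a dict of judge id to vote) and the choice to average per question are assumptions made for illustration; the paper's Appendix D.3 is the authoritative description.

```python
from itertools import product


def agreement(votes_a, votes_b):
    """Estimate agreement between two judge types.

    votes_a, votes_b: dict question_id -> dict judge_id -> vote.
    For each shared question, every pair (judge of type A, judge of type B)
    with different judge ids is compared; the per-question fractions of
    matching pairs are then averaged over questions.
    """
    per_question = []
    for qid in votes_a.keys() & votes_b.keys():
        pairs = [
            (vote_a, vote_b)
            for (judge_a, vote_a), (judge_b, vote_b) in product(
                votes_a[qid].items(), votes_b[qid].items()
            )
            if judge_a != judge_b  # "but not identical" individuals
        ]
        if pairs:
            matches = sum(vote_a == vote_b for vote_a, vote_b in pairs)
            per_question.append(matches / len(pairs))
    if not per_question:
        return float("nan")
    return sum(per_question) / len(per_question)


# Hypothetical votes: two human annotators and one LLM judge.
human_votes = {"q1": {"h1": "A", "h2": "B"}, "q2": {"h1": "A"}}
llm_votes = {"q1": {"llm": "A"}, "q2": {"llm": "A"}}

human_vs_llm = agreement(human_votes, llm_votes)
human_vs_human = agreement(human_votes, human_votes)
```

To compute the metric without ties, one would drop tie votes from the input dicts before calling `agreement`.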

note: whether a model wins or loses a comparison is decided by the human annotations.
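Average win rate can be sketched like this. The battle tuple format, the model names, and the treatment of ties (a tie counts as a game but not as a win when ties are included) are assumptions for illustration.

```python
from collections import defaultdict


def average_win_rates(battles, include_ties=True):
    """Return dict player -> mean of its win rates against each opponent.

    battles: iterable of (model_a, model_b, winner), where winner is
    'model_a', 'model_b', or 'tie'.
    """
    wins = defaultdict(int)   # (player, opponent) -> number of wins
    games = defaultdict(int)  # (player, opponent) -> number of games
    for model_a, model_b, winner in battles:
        if winner == "tie" and not include_ties:
            continue
        games[(model_a, model_b)] += 1
        games[(model_b, model_a)] += 1
        if winner == "model_a":
            wins[(model_a, model_b)] += 1
        elif winner == "model_b":
            wins[(model_b, model_a)] += 1

    rates = defaultdict(list)
    for (player, opponent), n_games in games.items():
        rates[player].append(wins[(player, opponent)] / n_games)
    return {player: sum(r) / len(r) for player, r in rates.items()}


# Hypothetical battles between three placeholder models.
battles = [
    ("m1", "m2", "model_a"),
    ("m1", "m2", "tie"),
    ("m1", "m3", "model_a"),
    ("m2", "m3", "model_a"),
]
with_ties = average_win_rates(battles, include_ties=True)
without_ties = average_win_rates(battles, include_ties=False)
```

In this toy data, `m1` beats `m2` in one of two games when the tie is counted, so its win rate against `m2` is 0.5 with ties and 1.0 without.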

use FastChat


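# env: mt_bench
git clone https://github.com/lm-sys/FastChat.git
cd FastChat
pip install -e ".[model_worker,llm_judge]"

# the judge scripts live in fastchat/llm_judge
cd fastchat/llm_judge
# the judge is an OpenAI model (GPT-4 by default), so an API key is required
export OPENAI_API_KEY=<your-key>

# grade the listed models; their answers must already exist under data/mt_bench/model_answer/
python gen_judgment.py --model-list gpt-3.5-turbo gpt-4 --parallel 2
# print the resulting MT-bench scores
python show_result.py --model-list gpt-3.5-turbo gpt-4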

mixtral

posted @ 2024-01-09 15:38 Daze_Lu
