Benchmarks for evaluating the performance of LLMs#
Note
This is my personal reading note, so there might be some mistakes in my understanding. Please feel free to correct me (hsiangjenli@gmail.com) if you find any. Thanks!
Existing benchmarks can be divided into the following categories [5]:
- Core-knowledge benchmarks
- Instruction-following benchmarks
- Conversational benchmarks
- Traditional evaluation metrics (a small example follows below)
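To get a feel for the last category, it helps to compute one of the traditional metrics directly. The following is a minimal sketch using the `rouge-score` package (one possible tool among several) to score a candidate summary against a reference; the example sentences are made up for illustration.

```python
# pip install rouge-score
from rouge_score import rouge_scorer

# Build a scorer for ROUGE-1 (unigram overlap) and ROUGE-L (longest common subsequence)
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)

reference = "The cat sat on the mat."
candidate = "A cat was sitting on the mat."

# score(target, prediction) returns precision / recall / F1 for each ROUGE variant
scores = scorer.score(reference, candidate)
for name, result in scores.items():
    print(f"{name}: precision={result.precision:.2f}, "
          f"recall={result.recall:.2f}, f1={result.fmeasure:.2f}")
```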
Open Source Benchmark#
Paper#
2023 – Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
In this paper, the authors argue that aligned models are better at matching human preferences, but this improvement is not accurately captured by existing benchmarks.
LLM-as-a-Judge (a pairwise-comparison sketch follows after this list)
- Pairwise comparison
  - Position bias: the LLM judge tends to favor the response placed in the first position
  - Verbosity bias: the LLM judge favors longer, more verbose responses
  - Self-enhancement bias: the LLM judge prefers responses generated by itself
- Single answer grading
- Reference-guided grading
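Below is a minimal sketch of what pairwise comparison with a position swap could look like. The prompt wording and the `call_llm` helper are hypothetical placeholders (any chat-completion API would do); judging both orderings and only declaring a winner when the two verdicts agree is one way to mitigate position bias, along the lines suggested in the paper.

```python
# call_llm(prompt) -> str is a hypothetical placeholder for any chat-completion API.
def judge_pair(question: str, answer_a: str, answer_b: str, call_llm) -> str:
    """Ask an LLM judge which answer is better, in both orderings, to reduce position bias."""
    template = (
        "You are an impartial judge. Given the user question and two answers, "
        "reply with exactly 'A', 'B', or 'tie'.\n\n"
        "Question: {q}\n\nAnswer A:\n{a}\n\nAnswer B:\n{b}"
    )

    # First ordering: answer_a in slot A, answer_b in slot B
    verdict_1 = call_llm(template.format(q=question, a=answer_a, b=answer_b)).strip()
    # Second ordering: swap positions so each answer is judged in both slots
    verdict_2 = call_llm(template.format(q=question, a=answer_b, b=answer_a)).strip()

    # Map the second verdict back to the original labels
    swapped = {"A": "B", "B": "A", "tie": "tie"}.get(verdict_2, "tie")

    # Only declare a winner when both orderings agree; otherwise call it a tie
    return verdict_1 if verdict_1 == swapped else "tie"
```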
Reference#
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, and others. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022.
Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out, 74–81, 2004.