Benchmarks for evaluating the performance of LLMs#
Note
This is my personal reading note, so there might be some mistakes in my understanding. Please feel free to correct me (hsiangjenli@gmail.com) if you find any. Thanks!
Existing benchmarks can be divided into the following categories [5]:
- Core-knowledge benchmarks
- Instruction-following benchmarks
- Conversational benchmarks
- Traditional evaluation metrics (a small example follows below)
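To get a feel for the last category, it helps to compute one of the traditional metrics directly. The following is a minimal sketch using the `rouge-score` package (one possible tool among several) to score a candidate summary against a reference; the example sentences are made up for illustration.

```python
# pip install rouge-score
from rouge_score import rouge_scorer

# Build a scorer for ROUGE-1 (unigram overlap) and ROUGE-L (longest common subsequence)
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)

reference = "The cat sat on the mat."
candidate = "A cat was sitting on the mat."

# score(target, prediction) returns precision / recall / F1 for each ROUGE variant
scores = scorer.score(reference, candidate)
for name, result in scores.items():
    print(f"{name}: precision={result.precision:.2f}, "
          f"recall={result.recall:.2f}, f1={result.fmeasure:.2f}")
```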
Open Source Benchmark#
Paper#
2023 – Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
In this paper, the authors argue that aligned models are better at matching human preferences, but this improvement is not accurately captured by existing benchmarks.
LLM-as-a-Judge (a pairwise-comparison sketch follows after this list)
- Pairwise comparison
  - Position bias: the LLM judge tends to favor the response placed in the first position
  - Verbosity bias: the LLM judge favors longer, more verbose responses
  - Self-enhancement bias: the LLM judge prefers responses generated by itself
- Single answer grading
- Reference-guided grading
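Below is a minimal sketch of what pairwise comparison with a position swap could look like. The prompt wording and the `call_llm` helper are hypothetical placeholders (any chat-completion API would do); judging both orderings and only declaring a winner when the two verdicts agree is one way to mitigate position bias, along the lines suggested in the paper.

```python
# call_llm(prompt) -> str is a hypothetical placeholder for any chat-completion API.
def judge_pair(question: str, answer_a: str, answer_b: str, call_llm) -> str:
    """Ask an LLM judge which answer is better, in both orderings, to reduce position bias."""
    template = (
        "You are an impartial judge. Given the user question and two answers, "
        "reply with exactly 'A', 'B', or 'tie'.\n\n"
        "Question: {q}\n\nAnswer A:\n{a}\n\nAnswer B:\n{b}"
    )

    # First ordering: answer_a in slot A, answer_b in slot B
    verdict_1 = call_llm(template.format(q=question, a=answer_a, b=answer_b)).strip()
    # Second ordering: swap positions so each answer is judged in both slots
    verdict_2 = call_llm(template.format(q=question, a=answer_b, b=answer_a)).strip()

    # Map the second verdict back to the original labels
    swapped = {"A": "B", "B": "A", "tie": "tie"}.get(verdict_2, "tie")

    # Only declare a winner when both orderings agree; otherwise call it a tie
    return verdict_1 if verdict_1 == swapped else "tie"
```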
Reference#
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, and others. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022.
Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out, 74–81, 2004.