Benchmarks for evaluating the performance of LLMs#

Note

This is my personal reading note, so there may be mistakes in my understanding. Please feel free to correct me (hsiangjenli@gmail.com) if you find any. Thanks!

Existing benchmarks can be divided into the following categories [5]:

  1. Core-knowledge benchmarks

  2. Instruction-following benchmarks

  3. Conversational benchmarks

  4. Traditional evaluation metrics, such as ROUGE [3] and BLEU [4] (a minimal BLEU example follows this list)
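
For the last category, here is a minimal sketch of a traditional reference-based metric, using NLTK's sentence-level BLEU. The reference and candidate sentences are made-up illustrations, not taken from any benchmark.

```python
# Minimal sketch of a traditional evaluation metric (BLEU) using NLTK.
# The reference/candidate sentences are made-up illustrations.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the cat sat on the mat".split()
candidate = "the cat is sitting on the mat".split()

# sentence_bleu takes a list of tokenized references and one tokenized candidate.
# Smoothing avoids zero scores when higher-order n-grams do not overlap.
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```

Such n-gram-overlap metrics work reasonably well when a clear reference exists (translation, summarization), but they correlate poorly with human preference on open-ended chat, which is the gap the LLM-as-a-Judge paper below addresses.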

Open Source Benchmarks#

  1. taide-bench-eval

  2. Judging LLM-as-a-Judge with MT-bench and Chatbot Arena

  3. MMLU [1] (a loading sketch follows this list)

  4. HELM [2]

  5. MT-bench [5]

  6. Chatbot Arena [5]
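
As a rough illustration of how such benchmarks are typically consumed, the sketch below loads MMLU with the Hugging Face datasets library. The dataset id cais/mmlu and the field names are my assumptions based on the Hub copy of the dataset, not something specified in the original paper.

```python
# Sketch: loading the MMLU benchmark from the Hugging Face Hub.
# Dataset id and field names ("question", "choices", "answer") are assumptions
# based on the Hub copy of the dataset.
from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "abstract_algebra", split="test")

sample = mmlu[0]
print(sample["question"])
for idx, choice in enumerate(sample["choices"]):
    print(f"  ({chr(65 + idx)}) {choice}")
print("gold answer index:", sample["answer"])
```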

Paper#

  1. 2023 Judging LLM-as-a-Judge with MT-bench and Chatbot Arena [5]

    • In this paper, the authors argue that aligned models achieve better human preference, but this improvement is not captured well by conventional benchmarks.

    • LLM-as-a-Judge

      1. Pairwise comparison (a minimal judge sketch follows this list)

        • Position Bias: LLM judges tend to favor the response presented in the first position

        • Verbosity Bias: LLM judges favor longer, more verbose responses

        • Self-Enhancement Bias: LLM judges prefer responses generated by themselves

      2. Single answer grading: the judge directly assigns a score to a single response

      3. Reference-guided grading: the judge is additionally given a reference solution to compare against
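
To make the pairwise-comparison mode concrete, below is a minimal sketch of an LLM judge, assuming an OpenAI-style chat-completions client. The model name, prompt wording, and swap-then-compare logic are simplified placeholders, not the exact prompts used in the MT-bench paper.

```python
# Minimal sketch of pairwise-comparison LLM-as-a-Judge with a position swap.
# Assumes an OpenAI-style client; model name and prompt wording are placeholders.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

JUDGE_PROMPT = (
    "You are an impartial judge. Given a user question and two answers, "
    "reply with exactly 'A' if Answer A is better, 'B' if Answer B is better, "
    "or 'tie'.\n\n"
    "Question: {question}\n\nAnswer A:\n{answer_a}\n\nAnswer B:\n{answer_b}"
)

def judge_once(question: str, answer_a: str, answer_b: str) -> str:
    """Return a single 'A' / 'B' / 'tie' verdict from the judge model."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

def pairwise_judge(question: str, answer_1: str, answer_2: str) -> str:
    """Judge twice with swapped positions to reduce position bias."""
    first = judge_once(question, answer_1, answer_2)   # answer_1 shown first
    second = judge_once(question, answer_2, answer_1)  # positions swapped
    if first == "A" and second == "B":
        return "answer_1"
    if first == "B" and second == "A":
        return "answer_2"
    return "tie"  # disagreement between the two passes suggests position bias
```

Judging twice with the answer order swapped and only accepting consistent verdicts is one of the position-bias mitigations discussed in the paper. Single answer grading would instead ask the judge to score one response on a fixed scale, and reference-guided grading would additionally include a reference solution in the prompt.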

Reference#

[1]

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.

[2]

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, and others. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022.

[3]

Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out, 74–81. 2004.

[4]

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–318. 2002.

[5]

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, and others. Judging LLM-as-a-Judge with MT-bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023.