M3Exam

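M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for LLMs

This is the repository for M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models.

TL;DR: We introduce M3Exam, a novel benchmark sourced from real and official human exam questions, for evaluating LLMs in a multilingual, multimodal, and multilevel context.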

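Data

Access the data

You can download the data from here.
The downloaded folder is encrypted (to prevent some automatic crawling scripts). Please get the password from the bottom of this page.
After unzipping the file, you will see the following file structure: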

 data/
    multimodal-questions/         <- questions requiring images
        xx-questions-image.json   <- file containing the questions, xx is a language
        iamges-xx/                <- folder containing all the images for xx
    text-questions/               <- questions with pure text
        xx-questions-dev.json     <- held-out data (e.g., can be used as in-context examples)
        xx-questions-test.json    <- main test data for evaluation

Data format

The questions are stored in JSON format; you can read each JSON file to inspect the data. For example:

    import json
    lang = 'english'  # example value; use any language present in the data folder
    with open(f'./data/text-questions/{lang}-questions-dev.json', 'r') as f:
        data = json.load(f)  # data is a list of questions
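
When working with several languages, the loading step above can be wrapped in a small helper. The following is a minimal sketch: the names `load_questions` and `load_many` are hypothetical and not part of this repository, and `text_dir` is assumed to point at the unzipped folder holding the `xx-questions-dev.json` and `xx-questions-test.json` files.

```python
import json
from pathlib import Path


def load_questions(text_dir, lang, split='dev'):
    """Load one language/split as a list of question dicts.

    Hypothetical helper: text_dir is wherever the unzipped
    xx-questions-{dev,test}.json files live on your machine.
    """
    path = Path(text_dir) / f'{lang}-questions-{split}.json'
    with path.open('r', encoding='utf-8') as f:
        return json.load(f)


def load_many(text_dir, langs, split='dev'):
    """Return {lang: questions}, skipping languages whose file is missing."""
    loaded = {}
    for lang in langs:
        try:
            loaded[lang] = load_questions(text_dir, lang, split)
        except FileNotFoundError:
            continue  # no file for this language/split in text_dir
    return loaded
```

Passing the folder explicitly keeps the helper independent of where the archive was unzipped; for example, `load_many(text_dir, ['english'], split='test')` would return the English test questions keyed by language, with `text_dir` set to your local path.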

Each question is stored as a JSON object with the following fields:

 {
    'question_text': 'Which Civil War event occurred first?',
    'background_description': [],
    'answer_text': '2',
    'options': ['(1) battle of Gettysburg',
    '(2) firing on Fort Sumter',
    '(3) assassination of President Lincoln',
    '(4) Emancipation Proclamation'],
    'need_image': 'no',
    'language': 'english',
    'level': 'mid',
    'subject': 'social',
    'subject_category': 'social-science',
    'year': '2006'
}
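
To illustrate how these fields fit together, here is a minimal sketch that turns one question into a plain-text multiple-choice prompt. The function name `format_prompt` and the prompt layout are assumptions made for illustration, not the prompt used by this repository's `main.py`. It also assumes `background_description` is a list of strings.

```python
def format_prompt(question):
    """Render one question dict as a plain-text multiple-choice prompt.

    Hypothetical helper: the layout is an illustrative choice,
    not the prompt format used by main.py.
    """
    parts = []
    # Assumed to be a list of context strings; empty in the example below
    parts.extend(question.get('background_description', []))
    parts.append(question['question_text'])
    # Options already carry their own labels, e.g. '(1) ...'
    parts.extend(question['options'])
    parts.append('Answer:')
    return '\n'.join(parts)


example = {
    'question_text': 'Which Civil War event occurred first?',
    'background_description': [],
    'answer_text': '2',
    'options': ['(1) battle of Gettysburg',
                '(2) firing on Fort Sumter',
                '(3) assassination of President Lincoln',
                '(4) Emancipation Proclamation'],
    'need_image': 'no',
    'language': 'english',
    'level': 'mid',
}

prompt = format_prompt(example)
print(prompt)
```

Because the options include their own labels, the gold `answer_text` (`'2'` here) can be compared directly against the label a model returns.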

Evaluation

First, fill in your OpenAI API key in the bash scripts:

    python main.py \
        --setting zero-shot \
        --model chat \
        --use_api \
        --selected_langs "['english']" \
        --api_key  # put your key here

Then you can do a quick check by running quick_run.sh, which runs on 10 English questions and writes english-pred.json to the corresponding output folder.
To evaluate, run eval.sh to check the performance on these 10 examples.
To run on more data, refer to run.sh for more detailed settings:

    python main.py \
        --setting zero-shot \
        --model chat \
        --use_api \
        --selected_langs "['english']" \
        --selected_levels "['low', 'mid', 'high']" \
        --num_samples all \
        --api_key  # put your key here

* Specify the languages you want to run with `--selected_langs`.
* To run on all questions, set `--num_samples all`.
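
The layout of `english-pred.json` is not described here, so the following sketch assumes that each prediction record keeps the question's `answer_text` and adds a hypothetical `pred` field holding the option label chosen by the model. Under that assumption, accuracy is the fraction of exact label matches. `eval.sh` remains the reference way to score results.

```python
def accuracy(records):
    """Fraction of records whose predicted label equals the gold label.

    Assumes each record has 'answer_text' (gold, from the dataset) and a
    hypothetical 'pred' key (model output); adjust to the real file layout.
    """
    if not records:
        return 0.0
    correct = sum(
        1 for r in records
        if str(r.get('pred', '')).strip() == str(r['answer_text']).strip()
    )
    return correct / len(records)


# Toy records for illustration only, not real model outputs
toy = [
    {'answer_text': '2', 'pred': '2'},
    {'answer_text': '1', 'pred': '3'},
    {'answer_text': '4', 'pred': ' 4'},
    {'answer_text': '3', 'pred': '3'},
]
print(accuracy(toy))
```

Stripping whitespace before comparing avoids counting a correct label as wrong because of stray spaces in the model output.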

Citation

If you find this useful in your research, please consider citing it:

 @article{zhang2023m3exam,
      title={M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models},
      author={Wenxuan Zhang and Sharifah Mahani Aljunied and Chang Gao and Yew Ken Chia and Lidong Bing},
      year={2023},
      eprint={2306.05179},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Password: 12317
