M3Exam
1.0.0
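This is the repository for M3Exam: a multilingual, multimodal, multilevel benchmark for examining large language models.
TL;DR: We introduce M3Exam, a novel benchmark sourced from real and official human exam questions, for evaluating LLMs in a multilingual, multimodal, and multilevel context.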

data/
    multimodal-questions/          <- questions requiring images
        xx-questions-image.json    <- file containing the questions; xx is a language
        iamges-xx/                 <- folder containing all the images for xx
    text-questions/                <- questions with pure text
        xx-questions-dev.json      <- held-out data (e.g., can be used as in-context examples)
        xx-questions-test.json     <- main test data for evaluation
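
As a minimal sketch of how the layout above maps to file paths, the helper below builds the expected locations of the question files for one language. The function name `question_files` and the `data_root` parameter are hypothetical and not part of the repository; only the folder and file naming pattern comes from the tree above.

```python
from pathlib import Path


def question_files(lang: str, data_root: str = "data") -> dict:
    """Return the expected question-file paths for one language.

    Hypothetical helper: it only mirrors the naming pattern in the tree above.
    """
    root = Path(data_root)
    return {
        "dev": root / "text-questions" / f"{lang}-questions-dev.json",
        "test": root / "text-questions" / f"{lang}-questions-test.json",
        "image": root / "multimodal-questions" / f"{lang}-questions-image.json",
    }


# Report which of the expected files are present on disk.
for split, path in question_files("english").items():
    status = "found" if path.exists() else "missing"
    print(f"{split}: {path} ({status})")
```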
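Each language's questions are stored in a JSON file and can be loaded as follows:

import json

lang = 'english'  # example value; use any language that has a data file
with open(f'./data/text-questions/{lang}-questions-dev.json', 'r', encoding='utf-8') as f:
    data = json.load(f)  # data is a list of questions

Each question is a dictionary of the following form:

{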
    'question_text': 'Which Civil War event occurred first?',
    'background_description': [],
    'answer_text': '2',
    'options': ['(1) battle of Gettysburg',
                '(2) firing on Fort Sumter',
                '(3) assassination of President Lincoln',
                '(4) Emancipation Proclamation'],
    'need_image': 'no',
    'language': 'english',
    'level': 'mid',
    'subject': 'social',
    'subject_category': 'social-science',
    'year': '2006'
}
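
To show how these fields might be used together, here is a small sketch that renders a question as a text prompt and optionally prepends held-out dev questions as in-context examples. The prompt layout and the names `format_question` and `build_prompt` are assumptions for illustration; this is not necessarily the prompt that `main.py` builds. It also assumes that `background_description` is a list of strings.

```python
def format_question(q: dict, with_answer: bool = False) -> str:
    """Render one question dict as plain text (hypothetical prompt layout)."""
    lines = list(q["background_description"])  # assumed: list of context strings
    lines.append(q["question_text"])
    lines.extend(q["options"])
    # Solved examples show the answer label; the target question leaves it open.
    lines.append(f"Answer: {q['answer_text']}" if with_answer else "Answer:")
    return "\n".join(lines)


def build_prompt(test_q: dict, dev_examples: list) -> str:
    """Prepend solved dev questions as in-context examples before the test question."""
    shots = [format_question(d, with_answer=True) for d in dev_examples]
    return "\n\n".join(shots + [format_question(test_q)])


example = {
    "question_text": "Which Civil War event occurred first?",
    "background_description": [],
    "answer_text": "2",
    "options": [
        "(1) battle of Gettysburg",
        "(2) firing on Fort Sumter",
        "(3) assassination of President Lincoln",
        "(4) Emancipation Proclamation",
    ],
    "level": "mid",
}

print(build_prompt(example, dev_examples=[]))  # zero-shot: no in-context examples
```

Passing a non-empty `dev_examples` list gives a few-shot prompt, which is one way to use the dev split as in-context examples.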
A quick example run:

python main.py \
    --setting zero-shot \
    --model chat \
    --use_api \
    --selected_langs "['english']" \
    --api_key YOUR_API_KEY  # placeholder; put your key here
* Run `quick_run.sh`, which will run on 10 English questions and produce `english-pred.json` in the corresponding output folder.
* Run `eval.sh` to check the performance on these 10 examples.
* See `run.sh` for more detailed settings.

python main.py \
    --setting zero-shot \
    --model chat \
    --use_api \
    --selected_langs "['english']" \
    --selected_levels "['low', 'mid', 'high']" \
    --num_samples all \
    --api_key YOUR_API_KEY  # placeholder; put your key here
* Specify the languages you want to run with `--selected_langs`.
* To run on all questions, set `--num_samples all`.
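
As a rough illustration of the kind of check an evaluation step performs, the sketch below compares predicted option labels with each question's `answer_text` and reports accuracy overall and per `level`. It does not reproduce `eval.sh`: the format of the prediction file is not described here, so the predictions are assumed to be a plain list of option labels in the same order as the questions, and the function name `accuracy_by_level` is hypothetical.

```python
from collections import defaultdict


def accuracy_by_level(questions: list, predictions: list) -> dict:
    """Compare predicted labels with 'answer_text', overall and per 'level'."""
    if len(questions) != len(predictions):
        raise ValueError("questions and predictions must have the same length")
    totals = defaultdict(int)
    correct = defaultdict(int)
    for q, pred in zip(questions, predictions):
        hit = str(pred).strip() == q["answer_text"].strip()
        # Count each question once overall and once under its own level.
        for key in ("overall", q["level"]):
            totals[key] += 1
            correct[key] += int(hit)
    return {key: correct[key] / totals[key] for key in totals}


# Made-up questions and model outputs, for illustration only.
questions = [
    {"answer_text": "2", "level": "mid"},
    {"answer_text": "1", "level": "low"},
    {"answer_text": "4", "level": "mid"},
]
predictions = ["2", "3", "4"]

print(accuracy_by_level(questions, predictions))
```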
If you find this useful in your research, please consider citing it:
@article{zhang2023m3exam,
title={M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models},
author={Wenxuan Zhang and Sharifah Mahani Aljunied and Chang Gao and Yew Ken Chia and Lidong Bing},
year={2023},
eprint={2306.05179},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Password: 12317