
Large Language Models have seen trillions of tokens — who knows what is in there? Recent works have evaluated these models on many different tasks, but did they make sure the models had not already seen the training or even the evaluation datasets? In the blog post we show that some popular benchmark datasets have already been memorized by ChatGPT, and that ChatGPT can be prompted to regenerate them.
In this repository we aim to collect as much evidence of contamination as possible, in order to provide the research community with a reliable resource for quickly checking whether a model has already seen a given evaluation dataset. Nevertheless, we are aware that this index is incomplete, so we ask researchers to run small contamination experiments of their own beforehand.
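As a starting point, the sketch below shows one simple way such a check could look (it is not the repository's official protocol): show the model the first instances of a split verbatim, ask it to continue, and compare the generation with the real held-out continuation. The `generate_completion` wrapper is a hypothetical placeholder for whatever LLM API you use, and the difflib similarity is just one easy-to-compute signal.

```python
# Minimal contamination-check sketch (an illustration, not the official protocol).
# Assumption: `generate_completion` is a hypothetical stand-in for your LLM API.
from difflib import SequenceMatcher


def generate_completion(prompt: str) -> str:
    """Hypothetical wrapper around the model you want to audit."""
    raise NotImplementedError("plug in your model API here")


def contamination_score(shown_rows: list[str], hidden_rows: list[str]) -> float:
    """Show the first rows of a split and ask the model for the next ones.

    A near-exact reproduction of `hidden_rows` is strong evidence that the
    split was memorized during training.
    """
    prompt = (
        "The following are the first instances of a dataset split. "
        "Continue with the next instances, verbatim:\n" + "\n".join(shown_rows)
    )
    generated = generate_completion(prompt)
    reference = "\n".join(hidden_rows)
    # String similarity in [0, 1]; values close to 1 suggest memorization.
    return SequenceMatcher(None, generated, reference).ratio()


# Example usage, with rows taken from the split you want to audit:
# score = contamination_score(split_rows[:5], split_rows[5:10])
# print(f"similarity to the real continuation: {score:.2f}")
```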
You can access the search tool: the LM Contamination Index.
The number of datasets and models is daunting, so we envision this as a community effort. If you are passionate about NLP research and want to contribute to measuring contamination in LLM evaluation, please follow the contribution guidelines.
If you want to reference this work, we would appreciate it if you cited the following:
Oscar Sainz, Jon Ander Campos, Iker García-Ferrero, Julen Etxaniz, and Eneko Agirre. Did ChatGPT cheat on your test?, June 2023. URL https://hitz-zentroa.github.io/lm-contamination/blog/.
@misc{sainz2023chatgpt,
title = {Did ChatGPT cheat on your test?},
url = {https://hitz-zentroa.github.io/lm-contamination/blog/},
author = {Sainz, Oscar and Campos, Jon Ander and García-Ferrero, Iker and Etxaniz, Julen and Agirre, Eneko},
year = {2023},
month = {Jun}
}

Oscar Sainz, Jon Campos, Iker García-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, and Eneko Agirre. 2023. NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10776–10787, Singapore. Association for Computational Linguistics.
@inproceedings{sainz-etal-2023-nlp,
title = "{NLP} Evaluation in trouble: On the Need to Measure {LLM} Data Contamination for each Benchmark",
author = "Sainz, Oscar and
Campos, Jon and
Garc{\'\i}a-Ferrero, Iker and
Etxaniz, Julen and
de Lacalle, Oier Lopez and
Agirre, Eneko",
editor = "Bouamor, Houda and
Pino, Juan and
Bali, Kalika",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
month = dec,
year = " 2023 " ,
address = " Singapore " ,
publisher = " Association for Computational Linguistics " ,
url = " https://aclanthology.org/2023.findings-emnlp.722 " ,
doi = " 10.18653/v1/2023.findings-emnlp.722 " ,
pages = " 10776--10787 " ,
abstract = "In this position paper we argue that the classical evaluation on Natural Language Processing (NLP) tasks using annotated benchmarks is in trouble. The worst kind of data contamination happens when a Large Language Model (LLM) is trained on the test split of a benchmark, and then evaluated in the same benchmark. The extent of the problem is unknown, as it is not straightforward to measure. Contamination causes an overestimation of the performance of a contaminated model in a target benchmark and associated task with respect to their non-contaminated counterparts. The consequences can be very harmful, with wrong scientific conclusions being published while other correct ones are discarded. This position paper defines different levels of data contamination and argues for a community effort, including the development of automatic and semi-automatic measures to detect when data from a benchmark was exposed to a model, and suggestions for flagging papers with conclusions that are compromised by data contamination.",
}