
Large Language Models have seen trillions of tokens; who knows what is in there? Recent works have evaluated these models on many different tasks, but do they make sure the model has not already seen the training or even the evaluation datasets? In a blog post, we show that some popular benchmark datasets have already been memorized by ChatGPT, and that ChatGPT can be prompted to regenerate them.
In this repository we aim to collect as much contamination evidence as possible, to give the research community a reliable resource for quickly checking whether a model has already seen their evaluation dataset. We are aware, however, that this index is incomplete, so we ask researchers to run small contamination experiments of their own beforehand.
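To get a sense of what such a small experiment can look like, here is a minimal sketch that asks a chat model to reproduce the first instances of a benchmark split and prints the output for manual comparison. It assumes the openai Python client (>=1.0) with an API key in the environment; the dataset name and prompt wording are illustrative, not the exact prompts used in the blog post.

# Minimal contamination check sketch (illustrative, not the exact blog prompts).
# Assumes: `pip install openai` (>=1.0) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

# The dataset name here is only an example; substitute the benchmark you care about.
PROMPT = (
    "Please generate the first instances of the CoNLL-2003 dataset "
    "(train split), exactly as they appear in the original files."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": PROMPT}],
    temperature=0,  # deterministic output makes memorization easier to spot
)

generated = response.choices[0].message.content
print(generated)

# Compare the generated text against the real first instances of the split:
# verbatim or near-verbatim reproductions are strong evidence of contamination.

If the model reproduces instances verbatim (or nearly so), that is worth reporting as contamination evidence; if not, it does not prove the data was unseen, but it is a cheap first check before running a full evaluation.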
You can access the search tool: the LM Contamination Index.
The number of datasets and models is daunting, so we envision this as a community effort. If you are passionate about NLP research and want to contribute contamination evidence for LLM evaluation, please follow the contribution guidelines.
If you want to refer to this work, we would appreciate it if you cite the following:
Oscar Sainz, Jon Ander Campos, Iker García-Ferrero, Julen Etxaniz, and Eneko Agirre. Did ChatGPT cheat on your test?, June 2023. URL https://hitz-zentroa.github.io/lm-contamination/blog/.
@misc{sainz2023chatgpt,
  title = {Did ChatGPT cheat on your test?},
  url = {https://hitz-zentroa.github.io/lm-contamination/blog/},
  author = {Sainz, Oscar and Campos, Jon Ander and García-Ferrero, Iker and Etxaniz, Julen and Agirre, Eneko},
  year = {2023},
  month = {Jun}
}

Oscar Sainz, Jon Campos, Iker García-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, and Eneko Agirre. 2023. NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10776–10787, Singapore. Association for Computational Linguistics.
@inproceedings{sainz-etal-2023-nlp,
  title = "{NLP} Evaluation in trouble: On the Need to Measure {LLM} Data Contamination for each Benchmark",
  author = "Sainz, Oscar and
    Campos, Jon and
    Garc{\'i}a-Ferrero, Iker and
    Etxaniz, Julen and
    de Lacalle, Oier Lopez and
    Agirre, Eneko",
  editor = "Bouamor, Houda and
    Pino, Juan and
    Bali, Kalika",
  booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
  month = dec,
  year = "2023",
  address = "Singapore",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2023.findings-emnlp.722",
  doi = "10.18653/v1/2023.findings-emnlp.722",
  pages = "10776--10787",
  abstract = "In this position paper we argue that the classical evaluation on Natural Language Processing (NLP) tasks using annotated benchmarks is in trouble. The worst kind of data contamination happens when a Large Language Model (LLM) is trained on the test split of a benchmark, and then evaluated in the same benchmark. The extent of the problem is unknown, as it is not straightforward to measure. Contamination causes an overestimation of the performance of a contaminated model in a target benchmark and associated task with respect to their non-contaminated counterparts. The consequences can be very harmful, with wrong scientific conclusions being published while other correct ones are discarded. This position paper defines different levels of data contamination and argues for a community effort, including the development of automatic and semi-automatic measures to detect when data from a benchmark was exposed to a model, and suggestions for flagging papers with conclusions that are compromised by data contamination.",
}