Code, datasets, and models for the paper "Automatic Evaluation of Attribution by Large Language Models"

Updates (June 26, 2023): 1) Added evaluation results for more models, including GPT-4. 2) Thoroughly re-examined the AttrEval-GenSearch dataset and corrected some annotation issues; the dataset has been updated. 3) Released the training and evaluation code as well as model checkpoints.
We release our datasets (including the training set and two evaluation sets, AttrEval-Simulation and AttrEval-GenSearch) on HuggingFace Datasets (more details can be found on the dataset page):
```python
# Load the datasets
from datasets import load_dataset

# Training set
attr_train = load_dataset("osunlp/AttrScore", "combined_train")

# Evaluation sets
attr_eval_simulation = load_dataset("osunlp/AttrScore", "attreval_simulation")
attr_eval_gensearch = load_dataset("osunlp/AttrScore", "attreval_gensearch")
```
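To sanity-check the download, you can print the schema and one example. This is a minimal sketch; the split name `train` is an assumption here, so check the dataset page for the actual split and column names:

```python
# Inspect the column schema and the first training example.
# "train" is assumed to be the split name returned by load_dataset.
print(attr_train["train"].features)
print(attr_train["train"][0])
```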
We show the results of prompting LLMs and of LLMs fine-tuned on data repurposed from related tasks:

|  |  | AttrEval-Simulation |  |  |  | AttrEval-GenSearch |  |  |  |
|---|---|---|---|---|---|---|---|---|---|
| Setting | Model (Size) | Attr. | Contra. | Extra. | Overall | Attr. | Contra. | Extra. | Overall |
| Zero-shot | Alpaca (7B) | 50.0 | 4.0 | 1.4 | 33.6 | 50.7 | 8.6 | 3.6 | 34.3 |
|  | Alpaca (13B) | 48.3 | 5.6 | 2.2 | 33.5 | 50.6 | 6.1 | 19.3 | 34.7 |
|  | Vicuna (13B) | 46.3 | 8.3 | 21.6 | 34.6 | 54.4 | 13.3 | 26.1 | 41.4 |
|  | ChatGPT | 45.7 | 17.9 | 52.7 | 43.2 | 61.2 | 20.6 | 53.3 | 55.0 |
|  | GPT-4 | 58.7 | 23.2 | 61.5 | 55.6 | 87.3 | 45.0 | 89.6 | 85.1 |
| Few-shot | Alpaca (7B) | 45.4 | 8.2 | 9.6 | 31.9 | 49.6 | 5.2 | 13.5 | 37.2 |
|  | Alpaca (13B) | 38.9 | 20.1 | 2.2 | 33.1 | 50.5 | 10.3 | 5.6 | 34.8 |
|  | Vicuna (13B) | 35.4 | 37.2 | 0.3 | 32.6 | 50.6 | 9.1 | 8.4 | 34.1 |
|  | ChatGPT | 46.6 | 27.6 | 35.8 | 39.2 | 62.6 | 26.8 | 49.5 | 53.3 |
|  | GPT-4 | 61.1 | 31.3 | 68.8 | 60.0 | 85.2 | 53.3 | 88.9 | 84.3 |
| Fine-tuned | RoBERTa (330M) | 62.5 | 54.6 | 74.7 | 65.0 | 47.2 | 25.2 | 62.3 | 49.8 |
|  | GPT-2 (1.5B) | 63.6 | 54.6 | 71.9 | 63.5 | 51.1 | 18.6 | 60.7 | 47.4 |
|  | T5 (770M) | 45.9 | 57.1 | 71.6 | 59.1 | 58.5 | 24.3 | 72.5 | 61.6 |
|  | Flan-T5 (770M) | 57.3 | 50.1 | 70.5 | 59.3 | 64.3 | 27.6 | 72.9 | 64.5 |
|  | Flan-T5 (3B) | 48.1 | 48.7 | 67.1 | 55.7 | 77.7 | 44.4 | 80.0 | 75.2 |
|  | Flan-T5 (11B) | 48.4 | 49.9 | 66.5 | 55.4 | 81.6 | 38.9 | 76.9 | 72.7 |
|  | LLaMA (7B) | 62.2 | 50.7 | 74.6 | 62.8 | 77.9 | 41.1 | 78.3 | 72.5 |
|  | Alpaca (7B) | 66.8 | 41.1 | 76.8 | 64.5 | 73.0 | 30.2 | 80.0 | 72.5 |
|  | Alpaca (13B) | 63.6 | 48.9 | 75.8 | 63.6 | 77.5 | 34.5 | 79.4 | 73.3 |
|  | Vicuna (13B) | 66.2 | 49.1 | 78.6 | 66.0 | 69.4 | 37.7 | 79.9 | 72.1 |
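The per-class columns (Attr., Contra., Extra.) score each of the three labels, with an aggregate in the Overall column. A minimal scikit-learn sketch of how such scores could be computed, with dummy predictions; treating them as per-class F1 plus a micro-averaged overall score is our assumption here, so consult the paper for the official metric definitions:

```python
from sklearn.metrics import f1_score

# Dummy gold labels and predictions, for illustration only.
y_true = ["Attributable", "Contradictory", "Extrapolatory", "Attributable"]
y_pred = ["Attributable", "Extrapolatory", "Extrapolatory", "Attributable"]

labels = ["Attributable", "Contradictory", "Extrapolatory"]
per_class = f1_score(y_true, y_pred, labels=labels, average=None)   # one F1 per label
overall = f1_score(y_true, y_pred, labels=labels, average="micro")  # aggregate score
print(dict(zip(labels, per_class)), overall)
```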
We can prompt LLMs (e.g., ChatGPT and GPT-4) to evaluate attribution. The input consists of the evaluation task prompt, a claim (the concatenation of the query and the answer), and a reference. For example:
Verify whether a given reference can support the claim. Options: Attributable, Extrapolatory or Contradictory. Attributable means the reference fully supports the claim, Extrapolatory means the reference lacks sufficient information to validate the claim, and Contradictory means the claim contradicts the information presented in the reference.

Claim: Who is the current CEO of Twitter? The current CEO of Twitter is Elon Musk

Reference: Elon Musk is the CEO of Twitter. Musk took over as CEO in October 2022 following a back-and-forth affair in which the billionaire proposed to purchase the social media company for $44 billion, tried to back out, and then ultimately went through with the acquisition. After becoming CEO, former CEO Parag Agrawal, CFO Ned Segal, and legal affairs and policy chief Vijaya Gadde were all dismissed from the company.
To reproduce the ChatGPT/GPT-4 numbers in the table, put your OpenAI API key in "./api_key.txt" and run the notebook prompt_chatgpt_gpt4.ipynb.
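For orientation, here is a minimal sketch of such a zero-shot call, assuming the pre-1.0 `openai` Python SDK; the notebook remains the authoritative reference for the exact prompt template and decoding settings:

```python
import openai

# Read the API key from the same file the notebook expects.
openai.api_key = open("./api_key.txt").read().strip()

# The prompt follows the task format shown above; the reference text is
# truncated here for brevity.
prompt = (
    "Verify whether a given reference can support the claim. "
    "Options: Attributable, Extrapolatory or Contradictory.\n\n"
    "Claim: Who is the current CEO of Twitter? The current CEO of Twitter is Elon Musk\n"
    "Reference: Elon Musk is the CEO of Twitter. ..."
)

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # deterministic decoding; an assumption, not the notebook's verified setting
)
print(response["choices"][0]["message"]["content"])  # e.g., "Attributable"
```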
To prompt LLaMA/Alpaca/Vicuna, see the inference instructions for these models below.
You can fine-tune any LM on our repurposed datasets to evaluate attribution.
Here we show an example with LLaMA/Alpaca/Vicuna; you can use any LLaMA-family model with --model_name_or_path. We fully fine-tune the LLaMA/Alpaca/Vicuna 7B/13B models on 4 A100 80GB GPUs:
```bash
torchrun --nproc_per_node 4 train_alpaca.py \
    --model_name_or_path chavinlo/alpaca-13b \
    --data_path osunlp/AttrScore \
    --train_subset 'combined_train' \
    --input_has_query True \
    --num_train_samples -1 \
    --bf16 True \
    --output_dir tmp/alpaca_13b_combined_train/ \
    --evaluation_strategy steps \
    --eval_steps 500 \
    --num_train_epochs 1 \
    --model_max_length 512 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --save_strategy steps \
    --save_steps 5000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --fsdp 'full_shard auto_wrap' \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True
```
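With these settings, the effective batch size is 4 GPUs × 2 examples per device × 8 gradient-accumulation steps = 64.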
You can also load our fine-tuned models for evaluation. We provide checkpoints fine-tuned on combined_train on HuggingFace Models.
For example:
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("osunlp/attrscore-flan-t5-xl")
model = AutoModelForSeq2SeqLM.from_pretrained("osunlp/attrscore-flan-t5-xl")

input = "As an Attribution Validator, your task is to verify whether a given reference can support the given claim. A claim can be either a plain sentence or a question followed by its answer. Specifically, your response should clearly indicate the relationship: Attributable, Contradictory or Extrapolatory. A contradictory error occurs when you can infer that the answer contradicts the fact presented in the context, while an extrapolatory error means that you cannot infer the correctness of the answer based on the information provided in the context.\n\nClaim: Who is the current CEO of Twitter? The current CEO of Twitter is Elon Musk\nReference: Elon Musk is the CEO of Twitter. Musk took over as CEO in October 2022 following a back-and-forth affair in which the billionaire proposed to purchase the social media company for $44 billion, tried to back out, and then ultimately went through with the acquisition. After becoming CEO, former CEO Parag Agrawal, CFO Ned Segal, and legal affairs and policy chief Vijaya Gadde were all dismissed from the company."

input_ids = tokenizer.encode(input, return_tensors="pt")
outputs = model.generate(input_ids)
output = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(output)  # 'Attributable'
```

Or simply use the pipeline:
```python
from transformers import pipeline

model = pipeline("text2text-generation", "osunlp/attrscore-flan-t5-xl")

input = "As an Attribution Validator, your task is to verify whether a given reference can support the given claim. A claim can be either a plain sentence or a question followed by its answer. Specifically, your response should clearly indicate the relationship: Attributable, Contradictory or Extrapolatory. A contradictory error occurs when you can infer that the answer contradicts the fact presented in the context, while an extrapolatory error means that you cannot infer the correctness of the answer based on the information provided in the context.\n\nClaim: Who is the current CEO of Twitter? The current CEO of Twitter is Elon Musk\nReference: Elon Musk is the CEO of Twitter. Musk took over as CEO in October 2022 following a back-and-forth affair in which the billionaire proposed to purchase the social media company for $44 billion, tried to back out, and then ultimately went through with the acquisition. After becoming CEO, former CEO Parag Agrawal, CFO Ned Segal, and legal affairs and policy chief Vijaya Gadde were all dismissed from the company."

output = model(input)[0]['generated_text']
print(output)  # 'Attributable'
```

We show the inference and evaluation script for LLaMA-based models:
```bash
python inference_alpaca.py \
    --model_name_or_path osunlp/attrscore-vicuna-13b \
    --test_data_path osunlp/AttrScore \
    --subset_name attreval_simulation \
    --model_max_length 512
```

All datasets in this project are intended for research purposes only. We collected and annotated the evaluation data using publicly available information on the web, with the assistance of the new Bing generative search engine. We acknowledge that LLMs have the potential to reproduce and amplify harmful information present in the data. We made an effort to mitigate this risk by carefully selecting our evaluation data and by conducting analyses to identify and mitigate potential risks in the process.
Our annotated evaluation set AttrEval-GenSearch is derived from the new Bing, which uses GPT-4 as its backbone. Crucially, we also use GPT-4 to evaluate attribution on AttrEval-GenSearch, where it achieves around 85% overall accuracy. Some bias may arise from GPT-4 both generating the test examples and evaluating the attribution, which could skew our understanding of the models' true performance. We therefore caution against over-optimistic interpretation of these results. We also acknowledge that AttrEval-GenSearch is moderate in size, which may not fully represent the real usage settings of attributed LLMs.
In addition, the AttrEval-Simulation dataset still exhibits a gap from real-world scenarios. The error patterns in this simulated dataset may be overly simplistic and lack diversity, which could limit a model's ability to handle more complex and varied real-world errors effectively. It is also worth noting that the simulated dataset may contain noise and mislabeled examples, which could further hinder a model's learning and subsequent performance. How to obtain high-quality training data at scale for attribution evaluation could be a major focus of future development.
If you find this code or dataset useful, please consider citing our paper:
```
@article{yue2023automatic,
  title={Automatic Evaluation of Attribution by Large Language Models},
  author={Yue, Xiang and Wang, Boshi and Zhang, Kai and Chen, Ziru and Su, Yu and Sun, Huan},
  journal={arXiv preprint arXiv:2305.06311},
  year={2023}
}
```

If you have any questions, please feel free to reach out: Xiang Yue, Yu Su, Huan Sun.