Source code for the ACL 2022 paper "Coherence boosting: When your pretrained language model is not paying enough attention" (arXiv, ACL Anthology)
Long-range semantic coherence remains a challenge in automatic language generation and understanding. We demonstrate that large language models have insufficiently learned the effect of distant words on next-token prediction. We present coherence boosting, an inference procedure that increases an LM's focus on the long-range context. We show the benefits of coherence boosting with pretrained models by distributional analyses of generated ordinary text and dialogue responses. We also find that coherence boosting with state-of-the-art models for various zero-shot NLP tasks yields performance gains with no additional training.
If you find the paper and code useful, please star this repo and cite the paper. Thanks a lot!
```bibtex
@inproceedings{malkin-etal-2022-coherence,
    title = "Coherence boosting: When your pretrained language model is not paying enough attention",
    author = "Malkin, Nikolay and Wang, Zhen and Jojic, Nebojsa",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.565",
    doi = "10.18653/v1/2022.acl-long.565",
    pages = "8214--8236",
}
```

We provide a demo to illustrate the lack of coherence in existing pretrained LMs, i.e., failures to predict the next token of a given context when that prediction clearly requires understanding of distant words. Our proposed coherence boosting fixes such errors by predicting the next token contrastively: it log-linearly combines the two next-token distributions conditioned on the full context and on a partial (short) context.
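For intuition, the contrast can be written in a few lines. The following is a hedged, self-contained sketch with assumed helper names, not the repo's implementation (the actual demo is `cb_demo.py`, shown below); with GPT-2 and `alpha = 0.5` it should closely match the contrastive ranking in the demo output that follows.

```python
# Minimal sketch of the log-linear contrast behind coherence boosting.
# Hypothetical helper names; the repo's actual demo lives in cb_demo.py.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def next_token_logprobs(ids):
    """Log-probabilities of the next token given a list of context token ids."""
    with torch.no_grad():
        logits = model(torch.tensor([ids])).logits[0, -1]
    return torch.log_softmax(logits, dim=-1)

full_ids = tokenizer.encode(' Ballad metre is "less regular and more conversational" than common')
short_ids = full_ids[-8:]  # partial context: only the last 8 tokens
alpha = 0.5

# Coherence boosting: (1 + alpha) * log p(.|full) - alpha * log p(.|partial),
# which upweights tokens whose probability depends on the distant context.
boosted = (1 + alpha) * next_token_logprobs(full_ids) - alpha * next_token_logprobs(short_ids)

top = torch.topk(torch.log_softmax(boosted, dim=-1), k=5)
for lp, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{tokenizer.decode([idx])!r}  logprob={lp:.3f}")
```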
```
>>> from cb_demo import contrasting
>>> contrasting(model_name='gpt2',
...             context=' Ballad metre is "less regular and more conversational" than common metre',
...             partial_length=8,
...             alpha=0.5)
[out]
Top tokens based on full context: Ballad metre is "less regular and more conversational" than common

Rank   Tokens       Logprobs   Probs
-------------------------------------
1      Ġsense       -2.405     9.03%
2      Ġin          -3.900     2.02%
3      .            -3.978     1.87%
4      ,            -4.097     1.66%
5      Ġpractice    -4.287     1.37%
...    ...          ...        ...
13     Ġmetre**     -5.098     0.610609%

** Target Token

Top tokens based on partial context: regular and more conversational" than common

Rank    Tokens          Logprobs   Probs
-----------------------------------------
1       Ġsense          -2.547     7.83%
2       ĠEnglish        -3.352     3.50%
3       .               -3.427     3.25%
4       Ġconversation   -3.445     3.19%
5       ,               -3.634     2.64%
...     ...             ...        ...
14103   Ġmetre**        -13.450    0.000144%

** Target Token

Contrastive next token prediction:

Rank   Tokens     Logprobs   Probs
-----------------------------------
1      Ġmetre**   -0.923     39.74%
2      Ġsense     -2.334     9.69%
3      Ġmeter     -2.785     6.17%
4      Ġin        -3.210     4.03%
5      Ġfoot      -3.220     3.99%

** Target Token
```

To reproduce the results for some of the examples in Figure 1 of the paper:

```
python cb_demo.py --context=' Ballad metre is "less regular and more conversational" than common metre' --model_name='gpt2' --partial_length=8 --alpha=0.5
python cb_demo.py --context=' Isley Brewing Company: Going Mintal — a minty milk chocolate stout' --model_name='gpt2' --partial_length=8 --alpha=0.5
python cb_demo.py --context=' Other times anxiety is not as easy to see, but can still be just as debilitating' --model_name='gpt2' --partial_length=8 --alpha=0.5
```
The LAMBADA task is word prediction similar to the examples shown above: a model is expected to predict the final word of a passage several sentences long. This dataset is a standard benchmark for evaluating modern language models (example).
More importantly, this task explicitly requires reasoning over a broad context: humans can reliably guess the last word when given the whole passage, but not when given only the last sentence. This property makes the benchmark a perfect testbed for evaluating the effectiveness of our proposed coherence boosting.
To run the LAMBADA experiments, simply run the following command:
```
python main.py --tasks='lambada' --models='gpt2-small' --use_val=False --alpha_start=1 --alpha_end=1 --alpha_step=0.1 --slen_start=10 --slen_end=10
```

Some important parameters are listed below; for the full list, run `python main.py --help`.
- `--models`: names of the pretrained language models; multiple models can be run at once, e.g., `'gpt2-small;gpt2-medium'`; to use GPT-3 models, see the note on GPT-3 below
- `--use_val`: whether to use a validation set to select the two hyperparameters, `alpha` and `slen`, representing the boosting coefficient and the length of the partial context
- `--alpha_start`, `--alpha_end`, `--alpha_step`: grid search parameters for the `alpha` hyperparameter
- `--slen_start`, `--slen_end`, `--slen_step`: grid search parameters for the `slen` hyperparameter; note that both hyperparameter settings affect the inference speed on the LAMBADA task (a sketch of how these two hyperparameters enter the boosted prediction follows this list)
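To make the roles of `alpha` and `slen` concrete, here is a hedged sketch of boosted final-word prediction in the LAMBADA style, reusing `model` and `tokenizer` from the sketch above. The function name is hypothetical and this is not the code in `main.py`:

```python
# Hedged sketch of boosted final-word prediction; not the repo's main.py logic.
import torch

def boosted_final_word_correct(passage_ids, alpha=1.0, slen=10):
    """Check whether boosting predicts the final token of a tokenized passage."""
    context, target = passage_ids[:-1], passage_ids[-1]

    def logprobs(ids):
        with torch.no_grad():
            logits = model(torch.tensor([ids])).logits[0, -1]
        return torch.log_softmax(logits, dim=-1)

    # `alpha` scales the contrast; `slen` is the partial-context length.
    boosted = (1 + alpha) * logprobs(context) - alpha * logprobs(context[-slen:])
    return boosted.argmax().item() == target

# Usage: encode a passage whose last token is the target word.
ids = tokenizer.encode(" ... a long passage whose final word is the target")
hit = boosted_final_word_correct(ids, alpha=1.0, slen=10)
```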
We evaluate coherence boosting on the following NLU tasks.

| Task | Cloze Tasks | Question Answering | Text Classification | NLI | Factual Knowledge Retrieval |
|---|---|---|---|---|---|
| Datasets | StoryCloze HellaSwag COPA | CommonsenseQA OpenBookQA ARC Easy/Challenge PiQA | SST-2/5 TREC AGNews | RTE CB BoolQ | LAMA |
Most datasets can be loaded via HuggingFace's `datasets` library; only a few of them need to be downloaded manually, and instructions are prompted when running the code.
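For example (an illustrative call, not something the repo requires you to run yourself), CommonsenseQA can be pulled directly from the Hub:

```python
from datasets import load_dataset

# Downloaded and cached automatically on first use.
dataset = load_dataset("commonsense_qa", split="validation")
```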
To run the NLU experiments, simply run the following command:
```
python main.py --tasks='storycloze;csqa;openbookqa' --models='gpt2-small;gpt2-medium;gpt2-large' --alpha_start=2 --alpha_end=-3 --alpha_step=0.01
```

Some important parameters are listed below; for the full list, run `python main.py --help`.
- `--models`: names of the pretrained language models; multiple models can be run at once, e.g., `'gpt2-small;gpt2-medium'`
- `--use_val`: whether to use a validation set to select the two hyperparameters, `alpha` and `slen`, representing the boosting coefficient and the length of the partial context
- `--alpha_start`, `--alpha_end`, `--alpha_step`: grid search parameters for the `alpha` hyperparameter; note that the code caches intermediate results, so the grid search is fast after the first run

Note on GPT-3: to use GPT-3 models, put your API key in `api_key.txt`.

Our codebase is also flexible enough to incorporate any new multiple-choice dataset with minimal effort (inspired by the open-source project lm-evaluation-harness). It takes roughly three steps:
1. Register the new dataset in `__init__.py` under the `tasks` folder.
2. Create a task class that inherits from `MultipleChoiceTask` and implement its data-processing functions (`load_data`, `standardize`).
3. Implement `get_contrast_ctx`, which is where you define your own premise prompt for boosting (see the skeleton below).

Please feel free to let us know if you run into any issues when adapting our code to other task categories.
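A hypothetical skeleton of such a task class is sketched below; the import path and method signatures are assumptions, so copy the exact interface from an existing class under `tasks/`:

```python
# Hypothetical skeleton; mirror an existing class in tasks/ for exact signatures.
from tasks import MultipleChoiceTask  # assumed import path

class MyNewTask(MultipleChoiceTask):
    def load_data(self):
        # Step 2a: load raw examples, e.g., via HuggingFace datasets.
        ...

    def standardize(self, example):
        # Step 2b: map a raw example to the common multiple-choice format
        # (premise, answer choices, gold label).
        ...

    def get_contrast_ctx(self, example):
        # Step 3: return the partial context (your own premise prompt)
        # that the full context is contrasted against during boosting.
        ...
```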
We provide a generation model wrapper compatible with the HuggingFace `transformers` library in `generation/generation.py`. You can use the class to create a coherence-boosted variant of any autoregressive LM, as in this example script:
```python
>>> boosted_model = generation.BoostedModel(base_model, k=8, alpha_long=1.5, alpha_short=-0.5)
>>> ins = T.LongTensor([tokenizer.encode('Once upon a midnight dreary,')])
>>> outputs = boosted_model.generate(input_ids=ins, do_sample=True, max_length=100, top_p=0.95)
>>> tokenizer.decode(outputs[0])
"Once upon a midnight dreary, while I pondered over these things, I suddenly became aware of a strange and terrible noise. I turned round, and saw that the old man was standing near me. He was wearing a black suit, with a black tie, and a black hat. He had a long, thin, black beard, and his eyes were black. His hair was of a dark brown colour, and was very long. His face was rather large, and his lips were somewhat"
```

The `boosted_model` object can be used flexibly with `generate`.
It is also possible to make the short context the currently generated text minus a fixed-length prefix (e.g., the previous turn in a dialogue) by dynamically setting `boosted_model.k` to the negative of the prefix length.
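For instance, the following sketch assumes that a negative `boosted_model.k` is interpreted as "drop a prefix of that length," as described above; check `generation/generation.py` for the exact semantics:

```python
# Sketch: short context = currently generated text minus the dialogue prefix.
prefix_ids = tokenizer.encode('A: How was the concert last night?\nB:')
boosted_model.k = -len(prefix_ids)  # assumed convention: negative prefix length
ins = T.LongTensor([prefix_ids])
outputs = boosted_model.generate(input_ids=ins, do_sample=True, max_length=60, top_p=0.95)
print(tokenizer.decode(outputs[0]))
```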
We present some conditional generation outputs. The evaluation metrics shown in Table 1 can be computed using the code in the linked repository for the first four columns, and using the code here for the new long-range coherence metric that we introduce.
If you have any questions, feel free to contact Kolya (nikolay.malkin at mila.quebec) and Zhen (wang.9215 at osu.edu).