Source code for the ACL 2022 paper "Coherence Boosting: When Your Pretrained Language Model Is Not Paying Enough Attention" (arXiv, ACL Anthology)
Long-range semantic coherence remains a challenge in automatic language generation and understanding. We demonstrate that large language models have insufficiently learned the effect of distant words on next-token prediction. We present coherence boosting, an inference procedure that increases a LM's focus on a long context. We show the benefits of coherence boosting with pretrained models by distributional analyses of generated ordinary text and dialog responses. It is also found that coherence boosting with state-of-the-art models for various zero-shot NLP tasks yields performance gains with no additional training.
If you find the paper and code useful, please kindly star this repo and cite the paper. Thanks very much!
```bibtex
@inproceedings{malkin-etal-2022-coherence,
    title = "Coherence boosting: When your pretrained language model is not paying enough attention",
    author = "Malkin, Nikolay and Wang, Zhen and Jojic, Nebojsa",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.565",
    doi = "10.18653/v1/2022.acl-long.565",
    pages = "8214--8236"
}
```

We present a demo to show the lack of coherence in existing pretrained LMs, i.e., failures to predict the next token in a given context that clearly require the understanding of distant words. Our proposed coherence boosting resolves such errors: it predicts the next token by log-linearly contrasting two distributions, one conditioned on the full context and one on a partial (short) context.
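For intuition, here is a minimal sketch of that log-linear contrast. It is not the repo's `cb_demo` implementation; it assumes the demo's single `alpha` corresponds to the weight pair `(1 + alpha, -alpha)`, which matches the `alpha_long=1.5, alpha_short=-0.5` used in the generation example later in this README:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def boosted_logprobs(context: str, partial_length: int = 8, alpha: float = 0.5) -> torch.Tensor:
    """Contrast next-token log-probs from the full context with those from
    only the last `partial_length` tokens of it."""
    ids = tokenizer.encode(context, return_tensors="pt")
    with torch.no_grad():
        full = model(ids).logits[0, -1].log_softmax(-1)
        partial = model(ids[:, -partial_length:]).logits[0, -1].log_softmax(-1)
    # Log-linear contrast with weights (1 + alpha, -alpha); alpha = 0.5
    # gives the (1.5, -0.5) pair used in the generation example below.
    return ((1 + alpha) * full - alpha * partial).log_softmax(-1)

scores = boosted_logprobs(' Ballad metre is "less regular and more conversational" than common')
print(tokenizer.decode([int(scores.argmax())]))  # boosted top-1: " metre" (cf. the demo output below)
```

The repo's demo exposes this contrast through `cb_demo.contrasting`, as the following session shows.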
```
>>> from cb_demo import contrasting
>>> contrasting(model_name='gpt2',
...             context=' Ballad metre is "less regular and more conversational" than common metre',
...             partial_length=8,
...             alpha=0.5)
[out]
Top tokens based on full context: Ballad metre is "less regular and more conversational" than common

Rank   Tokens        Logprobs   Probs
--------------------------------------
1      Ġsense        -2.405     9.03%
2      Ġin           -3.900     2.02%
3      .             -3.978     1.87%
4      ,             -4.097     1.66%
5      Ġpractice     -4.287     1.37%
...    ...           ...        ...
13     Ġmetre **     -5.098     0.610609%
** Target Token

Top tokens based on partial context: regular and more conversational" than common

Rank   Tokens          Logprobs   Probs
----------------------------------------
1      Ġsense          -2.547     7.83%
2      ĠEnglish        -3.352     3.50%
3      .               -3.427     3.25%
4      Ġconversation   -3.445     3.19%
5      ,               -3.634     2.64%
...    ...             ...        ...
14103  Ġmetre **       -13.450    0.000144%
** Target Token

Contrastive next token prediction:

Rank   Tokens       Logprobs   Probs
-------------------------------------
1      Ġmetre **    -0.923     39.74%
2      Ġsense       -2.334     9.69%
3      Ġmeter       -2.785     6.17%
4      Ġin          -3.210     4.03%
5      Ġfoot        -3.220     3.99%
** Target Token
```

To reproduce the results of some of the examples shown in Figure 1 of the paper:
```
python cb_demo.py --context=' Ballad metre is "less regular and more conversational" than common metre' --model_name='gpt2' --partial_length=8 --alpha=0.5
python cb_demo.py --context=' Isley Brewing Company: Going Mintal — a minty milk chocolate stout' --model_name='gpt2' --partial_length=8 --alpha=0.5
python cb_demo.py --context=' Other times anxiety is not as easy to see, but can still be just as debilitating' --model_name='gpt2' --partial_length=8 --alpha=0.5
```

Prediction of the final word in the LAMBADA task is similar to the examples shown above: models are expected to predict the final word of a passage of several sentences. This dataset is a standard benchmark for evaluating modern language models (example).
More importantly, this task explicitly requires reasoning over a broad context: humans can reliably guess the last word when given the whole passage, but not when given only the last sentence. This property makes the benchmark a perfect testbed for evaluating the effectiveness of our proposed coherence boosting.
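For concreteness, here is a quick look at how a LAMBADA instance decomposes into a long passage and a target final word. This is a hypothetical snippet, not the repo's evaluation code; it assumes the `lambada` dataset on the HuggingFace Hub, whose `text` field ends with the target word:

```python
# Hypothetical peek at the LAMBADA task format (assumes the 'lambada'
# dataset on the HuggingFace Hub with a 'text' field).
from datasets import load_dataset

sample = load_dataset("lambada", split="test")[0]["text"]
passage, target = sample.rsplit(" ", 1)  # the model must predict `target` from `passage`
print(repr(passage[-80:]), "->", repr(target))
```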
To run the LAMBADA experiments, simply run the following command:
```
python main.py --tasks='lambada' --models='gpt2-small' --use_val=False --alpha_start=1 --alpha_end=1 --alpha_step=0.1 --slen_start=10 --slen_end=10
```

Some important parameters are listed below; for the full list, please run `python main.py --help`.
- `--models`: names of the pretrained language models; multiple models can be run at once by separating them with semicolons, e.g., `'gpt2-small;gpt2-medium'`; if you want to use GPT-3 models, see the note on GPT-3 below.
- `--use_val`: whether to use a validation set to select the two hyperparameters, `alpha` and `slen`, representing the boosting coefficient and the length of the partial context.
- `--alpha_start`, `--alpha_end`, `--alpha_step`: grid-search parameters for the `alpha` hyperparameter.
- `--slen_start`, `--slen_end`, `--slen_step`: grid-search parameters for the `slen` hyperparameter; note that both hyperparameter settings affect the inference speed on the LAMBADA task.

We evaluate coherence boosting on the following NLU tasks.
| Task | Cloze Tasks | Question Answering | Text Classification | NLI | Factual Knowledge Retrieval |
|---|---|---|---|---|---|
| Datasets | StoryCloze, HellaSwag, COPA | CommonsenseQA, OpenBookQA, ARC Easy/Challenge, PiQA | SST-2/5, TREC, AGNews | RTE, CB, BoolQ | LAMA |
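To make the zero-shot recipe concrete, here is a minimal, hypothetical sketch of boosted multiple-choice scoring. The premise-free context below is purely illustrative; the repo defines the actual contrast context per task via `get_contrast_ctx`:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def answer_logprob(prompt: str, answer: str) -> float:
    """Sum the log-probs of the answer tokens conditioned on the prompt."""
    prompt_ids = tokenizer.encode(prompt)
    answer_ids = tokenizer.encode(answer)
    ids = torch.tensor([prompt_ids + answer_ids])
    with torch.no_grad():
        logprobs = model(ids).logits[0].log_softmax(-1)
    # The token at position i is predicted by the logits at position i - 1.
    return sum(logprobs[len(prompt_ids) - 1 + j, t].item()
               for j, t in enumerate(answer_ids))

# A COPA-style example (illustrative, not loaded from the dataset).
full_prompt = "The man broke his toe. What was the cause?"
short_prompt = "What was the cause?"  # premise-free contrast context (assumed form)
choices = [" He got a hole in his sock.", " He dropped a hammer on his foot."]
alpha = 0.5
# Boosted score: up-weight the full prompt, penalize what the premise-free
# context alone already makes likely.
scores = [(1 + alpha) * answer_logprob(full_prompt, c)
          - alpha * answer_logprob(short_prompt, c) for c in choices]
print(choices[scores.index(max(scores))])  # prints the higher-scoring choice
```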
Most of the datasets can be loaded by HuggingFace's `datasets` library; only a few of them require manual downloading, and instructions are prompted when running the code.
To run the NLU experiments, simply run the following command:
```
python main.py --tasks='storycloze;csqa;openbookqa' --models='gpt2-small;gpt2-medium;gpt2-large' --alpha_start=2 --alpha_end=-3 --alpha_step=0.01
```

Some important parameters are listed below; for the full list, please run `python main.py --help`.
- `--models`: names of the pretrained language models; multiple models can be run at once, e.g., `'gpt2-small;gpt2-medium'`.
- `--use_val`: whether to use a validation set to select the two hyperparameters, `alpha` and `slen`, representing the boosting coefficient and the length of the partial context.
- `--alpha_start`, `--alpha_end`, `--alpha_step`: grid-search parameters for the `alpha` hyperparameter; note that the code caches intermediate results, so the grid search over `alpha` is fast after the first run.

**Note on GPT-3**: to use GPT-3 models, place your OpenAI API key in `api_key.txt`.

Our codebase is also flexible enough to incorporate any new multiple-choice dataset with minimal effort (inspired by the open-source project lm-evaluation-harness). There are roughly three steps:
1. Register the new dataset in `__init__.py` in the `tasks` folder.
2. Create a task class that inherits from the `MultipleChoiceTask` class and implement its data-processing functions (`load_data`, `standardize`).
3. Implement `get_contrast_ctx`, where you define your own premise-free prompt for boosting; see the skeleton sketched below.

Please feel free to let us know if you run into any issues when adapting our code for other task classes.
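As a rough illustration of these three steps, here is a hypothetical skeleton; the names follow the description above, but the exact signatures in the repo's `tasks` module may differ:

```python
# Hypothetical skeleton of a new task class (exact base-class signatures
# may differ from the repo's MultipleChoiceTask).
from tasks import MultipleChoiceTask  # assumed import path

class MyNewTask(MultipleChoiceTask):
    def load_data(self):
        # Step 2a: load the raw examples, e.g. with HuggingFace datasets.
        ...

    def standardize(self, example):
        # Step 2b: map a raw example to the shared multiple-choice format
        # (premise, list of answer choices, index of the gold choice).
        ...

    def get_contrast_ctx(self, example):
        # Step 3: return the premise-free (short) context that the full
        # prompt is contrasted against when boosting.
        ...

# Step 1: register MyNewTask in tasks/__init__.py so main.py can find it.
```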
We provide a generation model wrapper compatible with the HuggingFace `transformers` library in `generation/generation.py`. You can create a coherence-boosted variant of any autoregressive LM using the classes there, as in this example script:
```
>>> boosted_model = generation.BoostedModel(base_model, k=8, alpha_long=1.5, alpha_short=-0.5)
>>> ins = T.LongTensor([tokenizer.encode('Once upon a midnight dreary,')])
>>> outputs = boosted_model.generate(input_ids=ins, do_sample=True, max_length=100, top_p=0.95)
>>> tokenizer.decode(outputs[0])
"Once upon a midnight dreary, while I pondered over these things, I suddenly became aware of a strange and terrible noise. I turned round, and saw that the old man was standing near me. He was wearing a black suit, with a black tie, and a black hat. He had a long, thin, black beard, and his eyes were black. His hair was of a dark brown colour, and was very long. His face was rather large, and his lips were somewhat"
```

`generate` can also be used flexibly with `boosted_model` when the short context is the currently generated text minus a prefix of a given length (e.g., the previous turn in a dialogue); this is achieved by dynamically setting `boosted_model.k` to the negative of the prefix length.
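A minimal, hypothetical sketch of that pattern for dialogue; it assumes the `BoostedModel` interface shown above, an import matching `generation/generation.py`, and that `k` can be reassigned between calls:

```python
# Hypothetical sketch: boost a dialogue response against its history by
# excluding the history (the prefix) from the short context.
import torch as T
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from generation import generation  # assumed import path for generation/generation.py

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
base_model = GPT2LMHeadModel.from_pretrained("gpt2")
boosted_model = generation.BoostedModel(base_model, k=8, alpha_long=1.5, alpha_short=-0.5)

history = "A: Where were you last night?\nB:"
prefix_len = len(tokenizer.encode(history))
# Short context = currently generated text minus the dialogue history,
# set dynamically as the negative prefix length (assumed attribute access).
boosted_model.k = -prefix_len
ins = T.LongTensor([tokenizer.encode(history)])
outputs = boosted_model.generate(input_ids=ins, do_sample=True, max_length=60, top_p=0.95)
print(tokenizer.decode(outputs[0]))
```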
We present some conditional generation outputs here. The evaluation metrics shown in Table 1 can be computed with code from this repository for the first four columns, or with the code here for the new long-range coherence metric that we introduce.
If you have any questions, please feel free to contact Kolya (nikolay.malkin at mila.quebec) and Zhen (wang.9215 at osu.edu).