ContinualLM

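Imagine an LM that not only acquires new knowledge effortlessly but also retains its mastery of existing skills, all while successfully transferring knowledge. Is it even possible?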

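News

- We have added checkpoints on Hugging Face for easier reproduction!
- We have added continual_pretrain.ipynb as a self-contained example of the soft-masking scenario. It runs well without GPUs!
- Soft-masking can also work in conventional continual fine-tuning. Check out our latest EMNLP23 paper!
- Wondering whether you can adapt a black-box LLM without worrying about updating its parameters? Check out our latest paper on retrieval-augmented generation (RAG)!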

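Quick Links

- Introduction
- Simple Example
- Dataset
- Architecture
- Installation
- Domain-adaptive Pre-training
- End-task Fine-tuning
- Checkpoints in Hugging Face
- Reference
- Contact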

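Introduction

In 2021, we introduced PyContinual, a straightforward and flexible framework for continual learning. Our research has benefited significantly from this framework. Today, we are excited to share ContinualLM, an extensible continual learning framework focused on language models (LMs), designed to sustain the benefits of continual learning (CL) in this field.

Continual learning for LMs differs from conventional CL because:

- Each task is treated as a domain-specific corpus (at present, our primary focus is domain-adaptive pre-training, also known as pre-finetuning or post-training).
- The evaluation process involves fine-tuning on the corresponding end-task.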

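Our repository includes a PyTorch implementation of a collection of state-of-the-art (SoTA) methods, all using the same training and evaluation pipeline. The repository is committed to advancing the field of continual learning for LMs. The methods included are:

From our group:
- DAS: Continual Learning of Language Models, ICLR 2023
- CPT: Continual Training of Language Models for Few-Shot Learning, EMNLP 2022
- DGA: Adapting a Language Model While Preserving its General Knowledge, EMNLP 2022
- CTR: Achieving Forgetting Prevention and Knowledge Transfer in Continual Learning, NeurIPS 2021
- CLASSIC: CLASSIC: Continual and Contrastive Learning of Aspect Sentiment Classification Tasks, EMNLP 2021
- B-CL: Adapting BERT for Continual Learning of a Sequence of Aspect Sentiment Classification Tasks, NAACL 2021
From other groups (more to come):
- DEMIX: DEMix Layers: Disentangling Domains for Modular Language Modeling, Gururangan et al., NAACL 2022
- EWC: Overcoming Catastrophic Forgetting in Neural Networks, Kirkpatrick et al., PNAS 2017
- DER++: Dark Experience for General Continual Learning: a Strong, Simple Baseline, Buzzega et al., NeurIPS 2020
- HAT: Overcoming Catastrophic Forgetting with Hard Attention to the Task, Serrà et al., ICML 2018
Widely used baselines for continual learning:
- NCL: Naive continual learning. Continual domain-adaptive pre-training on a sequence of domains, with no specific attention paid to forgetting or transfer.
- ONE: Domain-adaptive pre-training conducted individually for each domain.
- Adapter-ONE: Adds an adapter to the Transformer for each domain.
- Prompt-ONE: Adds a prompt to the Transformer for each domain.
- KD: Naive knowledge distillation.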

Simple Example

We have added continual_pretrain.ipynb as a self-contained example of the soft-masking scenario. It runs well without GPUs!

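Dataset

When it comes to continual learning of language models (LMs), finding appropriate datasets is crucial. The datasets we provide adhere to the following principles:

- Domain-specific: The domain corpus must be specific enough to improve end-task performance.
- End-task available: We favor assessing the trained language models through the end-task rather than relying on perplexity, since the former is a more reliable evaluation approach.

We release a dataset comprising 6 distinct domains, each accompanied by its corresponding end-task. The dataset can be found here. Below are some statistics for each domain: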

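Domain Corpus	Size	End-task	Task	#Training	#Testing	#Classes
Yelp Restaurant	758MB	Restaurant	Aspect Sentiment Classification (ASC)	3,452	1,120	3
Amazon Phone	724MB	Phone	Aspect Sentiment Classification (ASC)	239	553	2
Amazon Camera	319MB	Camera	Aspect Sentiment Classification (ASC)	230	626	2
ACL Papers	867MB	ACL	Citation Intent Classification	1,520	421	6
AI Papers	507MB	AI	Relation Classification	2,260	2,388	7
PubMed Papers	989MB	PubMed	Chemical-protein Interaction Prediction	2,667	7,398	13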
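
As a quick sanity check on the statistics above, the table can be held in a small Python structure and summed. This is only an illustrative sketch: the `DOMAINS` name and the tuple layout are assumptions made here and are not part of the repository; the figures are copied from the table.

```python
# Rows copied from the table above:
# (domain corpus, size in MB, #training, #testing, #classes).
# The variable name and tuple layout are illustrative only.
DOMAINS = [
    ("Yelp Restaurant", 758, 3452, 1120, 3),
    ("Amazon Phone", 724, 239, 553, 2),
    ("Amazon Camera", 319, 230, 626, 2),
    ("ACL Papers", 867, 1520, 421, 6),
    ("AI Papers", 507, 2260, 2388, 7),
    ("PubMed Papers", 989, 2667, 7398, 13),
]

total_size_mb = sum(row[1] for row in DOMAINS)
total_train = sum(row[2] for row in DOMAINS)
total_test = sum(row[3] for row in DOMAINS)
smallest = min(DOMAINS, key=lambda row: row[2])  # fewest end-task training examples

print(f"{len(DOMAINS)} domains, {total_size_mb} MB of corpus text")
print(f"{total_train} training and {total_test} test examples across end-tasks")
print(f"Smallest end-task training set: {smallest[0]} ({smallest[2]} examples)")
```

The two Amazon end-tasks have only a few hundred training examples each, while every domain corpus is several hundred megabytes.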

Architecture

The architecture of ContinualLM largely follows that of PyContinual, CPT and DGA.

Installation

conda create --name continuallm --file requirements.txt

Note: Our model is based on transformers==4.17.0 and adapter-transformers==3.0.1. We recommend using these specific versions, as other versions may result in unexpected bugs.
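
Because the pinned versions matter, it can help to check the active environment before launching a run. The following is a minimal sketch using only the standard library; the helper name `find_version_mismatches` is hypothetical and not part of the repository, and it assumes the packages are registered under the distribution names given in the note above.

```python
from importlib import metadata

# Versions pinned in the installation note above.
PINNED = {"transformers": "4.17.0", "adapter-transformers": "3.0.1"}


def find_version_mismatches(pinned, get_version=metadata.version):
    """Return {package: (expected, found)} for packages that are missing or differ.

    `found` is None when the package is not installed.
    """
    mismatches = {}
    for package, expected in pinned.items():
        try:
            found = get_version(package)
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for package, (expected, found) in find_version_mismatches(PINNED).items():
        print(f"{package}: expected {expected}, found {found or 'not installed'}")
```

An empty result means both packages match the recommended versions.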

Domain-adaptive Pre-training

This is where continual learning happens. We will learn a sequence of domains.

max_samples=640000
for idrandom in 0
do
  for pt_task in 0 1 2 3 4 5
  do
    python -m torch.distributed.launch --nproc_per_node 4 --use_env posttrain.py \
      --per_device_train_batch_size 62 \
      --fp16 \
      --max_seq_length 164 \
      --max_samples ${max_samples} \
      --idrandom ${idrandom} \
      --ntasks 6 \
      --pt_task ${pt_task} \
      --baseline 'das'
  done
done

--idrandom: chooses the task sequence. See ./sequences for more details.
--baseline: see the Introduction for the available baseline models (and the choices in config.py).
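
For readers who prefer to drive the runs from Python, the loop above can be sketched as a function that builds one command per domain. The flags and values are taken from the shell script; the function name `build_posttrain_commands` and its keyword arguments are hypothetical. The sketch only prints the commands (a dry run) and does not execute them.

```python
import shlex


def build_posttrain_commands(ntasks=6, idrandom=0, max_samples=640000,
                             baseline="das", nproc_per_node=4):
    """Return one argv list per domain, in the order the shell loop runs them."""
    commands = []
    for pt_task in range(ntasks):
        commands.append([
            "python", "-m", "torch.distributed.launch",
            "--nproc_per_node", str(nproc_per_node), "--use_env", "posttrain.py",
            "--per_device_train_batch_size", "62",
            "--fp16",
            "--max_seq_length", "164",
            "--max_samples", str(max_samples),
            "--idrandom", str(idrandom),
            "--ntasks", str(ntasks),
            "--pt_task", str(pt_task),
            "--baseline", baseline,
        ])
    return commands


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for argv in build_posttrain_commands():
        print(shlex.join(argv))
```

To actually launch the runs, each argv list could be passed to `subprocess.run(argv, check=True)` one after another, on the assumption that the domains must be pre-trained in order.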

End-task Fine-tuning

After continual learning of the LM, we can evaluate performance by running end-task fine-tuning individually.

max_samples=640000
seed=(2021 111 222 333 444 555 666 777 888 999)
for round in 0
do
  for idrandom in 0
  do
    for pt_task in 0 1 2 3 4 5
    do
      for ft_task in $(seq 0 ${pt_task})
      do
        python finetune.py \
          --max_seq_length 164 \
          --pt_task ${pt_task} \
          --ft_task ${ft_task} \
          --idrandom ${idrandom} \
          --ntasks 6 \
          --max_samples ${max_samples} \
          --seed ${seed[$round]} \
          --baseline 'das'
      done
    done
  done
done
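
The nested loops above visit every (pt_task, ft_task) pair with ft_task no larger than pt_task. Reading the flags, this appears to mean that after pre-training up to domain pt_task, each end-task seen so far is fine-tuned and evaluated; that interpretation is an assumption based on the script, not a statement from the authors. A small sketch (the function name `finetune_schedule` is hypothetical) makes the resulting schedule explicit:

```python
def finetune_schedule(ntasks=6):
    """List (pt_task, ft_task) pairs in the order the nested shell loops visit them."""
    return [(pt_task, ft_task)
            for pt_task in range(ntasks)
            for ft_task in range(pt_task + 1)]  # `seq 0 ${pt_task}` is inclusive


schedule = finetune_schedule()
print(len(schedule))   # 21 fine-tuning runs per seed for 6 domains
print(schedule[:3])    # [(0, 0), (1, 0), (1, 1)]
```

With 6 domains this gives 6 * 7 / 2 = 21 fine-tuning runs for each seed and task sequence.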

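Checkpoints in Hugging Face

For those who are interested only in the resulting model, or who want to continue pre-training the model with their own data, we have good news! We offer checkpoints through Hugging Face.

You can easily import our continually post-trained model with Hugging Face's transformers!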

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Import our model. The package will take care of downloading the models automatically
tokenizer = AutoTokenizer.from_pretrained("UIC-Liu-Lab/DAS-Rest2Cam")
model = AutoModelForSequenceClassification.from_pretrained("UIC-Liu-Lab/DAS-Rest2Cam", trust_remote_code=True)

# Tokenize input texts
texts = [
    "There's a kid on a skateboard.",
    "A kid is skateboarding.",
    "A kid is inside the house."
]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Get the model output!
res = model(**inputs)

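If you encounter any problem when loading the models directly through Hugging Face's API, you can also download the models manually from the repository and use model = AutoModel.from_pretrained({PATH TO THE DOWNLOAD MODEL}).

The continual pre-training sequence is the first sequence in ./sequences/posttrain (from Restaurant to Camera). You can use the downloaded weights to fine-tune the corresponding end-task.

If you are interested in the importance files, please refer to before_distill0 and after_mlm{domain_id}. "before" denotes the importance computed before pre-training, which is done only once, before the first domain, to capture the general pre-trained knowledge. "after" denotes the importance computed after pre-training on domain_id.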

Reference

We greatly appreciate your starring the repository and citing our work. Your attention and recognition are highly valued.

  
@inproceedings{ke2022dgs,
  title = {Continual Learning of Language Models},
  author = {Ke, Zixuan and Shao, Yijia and Lin, Haowei and Konishi, Tatsuya and Kim, Gyuhak and Liu, Bing},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year = {2023}
}

@inproceedings{ke2022dga,
  title = {Adapting a Language Model While Preserving its General Knowledge},
  author = {Ke, Zixuan and Shao, Yijia and Lin, Haowei and Xu, Hu and Shu, Lei and Liu, Bing},
  booktitle = {Empirical Methods in Natural Language Processing (EMNLP)},
  year = {2022}
}

@inproceedings{ke2022continual,
  title = {Continual Training of Language Models for Few-Shot Learning},
  author = {Ke, Zixuan and Lin, Haowei and Shao, Yijia and Xu, Hu and Shu, Lei and Liu, Bing},
  booktitle = {Empirical Methods in Natural Language Processing (EMNLP)},
  year = {2022}
}