Causal Distill

Causal Distillation for Language Models (DIITO)

Zhengxuan Wu*, Atticus Geiger*, Josh Rozner, Elisa Kreiss, Hanson Lu, Thomas Icard, Christopher Potts, Noah D. Goodman

This is our implementation of the preprint Causal Distillation for Language Models. The standard approach to distillation trains a student model against two objectives: a task-specific objective (e.g., language modeling) and an imitation objective that encourages the hidden states of the student model to be similar to those of the larger teacher model. In this paper, we show that it is beneficial to augment distillation with a third objective that encourages the student to imitate the causal computation process of the teacher through interchange intervention training (IIT). We name our method the distillation interchange intervention training objective (DIITO).
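To make the idea of an interchange intervention concrete, here is a toy sketch in plain Python. It is not the repository's training code: the two tiny linear "models", the unit alignment, and the squared-error comparison are all assumptions chosen for illustration. It only shows the general pattern of running a model on a base input, overwriting some hidden units with the values they take on a source input, and comparing the teacher's and student's counterfactual outputs.

```python
# Toy illustration only: tiny linear "models", not the repository's training code.

def hidden_layer(weights, x):
    """Compute one hidden activation per weight for a scalar input."""
    return [w * x for w in weights]


def interchange_output(weights, base_x, source_x, swap_units):
    """Run the model on base_x, but overwrite the chosen hidden units
    with the values they take when the model processes source_x."""
    base_hidden = hidden_layer(weights, base_x)
    source_hidden = hidden_layer(weights, source_x)
    for unit in swap_units:
        base_hidden[unit] = source_hidden[unit]
    return sum(base_hidden)  # the readout is a plain sum in this toy


teacher_weights = [1.0, 2.0, 3.0, 4.0]  # hypothetical teacher with 4 hidden units
student_weights = [1.5, 2.5]            # hypothetical student with 2 hidden units

# Hypothetical alignment: teacher units 0-1 correspond to student unit 0.
teacher_cf = interchange_output(teacher_weights, base_x=1.0, source_x=2.0, swap_units=[0, 1])
student_cf = interchange_output(student_weights, base_x=1.0, source_x=2.0, swap_units=[0])

# Squared error stands in for the real objective here; it shrinks as the
# student's counterfactual output approaches the teacher's.
causal_loss = (teacher_cf - student_cf) ** 2
print(teacher_cf, student_cf, causal_loss)
```

In this toy, training would adjust `student_weights` to reduce `causal_loss`; the formal definition of the method is in the IIT paper cited below.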

We find that DIITO is helpful in low-resource settings. DIITO performs on par with standard distillation (97% of its performance) while training with 97% less data.

We fork our main codebase from the HuggingFace Distillation Interface.

Release Notes

✅ 12/02/2021 Our paper on Interchange Intervention Training (IIT) is released! Read it for a more formal definition of the method.
✅ 12/06/2021 Released the causal distillation codebase with the preprint.
✅ 12/06/2021 Released evaluation results on distilled Tiny-BERT (3 layers) with the WikiText-103M dataset.
✅ 01/14/2022 Released a newer version of DIITO and its evaluation results. You can view our privately shared updated preprint for more details.
✅ 02/21/2022 Released the codebase for DIITO-XXS, which applies DIITO to distill task-specific models in NLP, focusing on distillation in low-resource settings. Check out the repo for more info!
⬜ DIITO (6 layers) model trained on English Wikipedia + BookCorpus.

If you experience any issues or have suggestions, please contact me either through the issues page or at [email protected]

Benchmark Results

Here are the results on the GLUE dev sets:

Model	# of Training Tokens	Average Score	CoLA	MNLI	MRPC	QNLI	QQP	RTE	SST-2	STS-B
DistilBERT (6 layers) Devlin et al., 2019	3.3B	79.59	51.30	82.10	87.50	89.20	88.50	59.90	91.30	86.90
DistilBERT (6 layers)	0.1B	75.80	40.43	78.95	87.45	84.76	84.96	60.10	89.38	80.40
DIITO (6 layers)	0.1B	77.14	45.17	79.68	88.18	85.83	85.31	60.94	90.32	81.69
DIITO (6 layers)	3.3B	-	-	-	-	-	-	-	-	-
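The summary figures can be checked directly against the table. The short sketch below recomputes the DIITO average from its eight per-task scores and compares it with the 3.3B-token DistilBERT row. Reading the "97%" figures as this score ratio and this token reduction is our interpretation of the table, not a statement from the authors.

```python
# Per-task GLUE dev scores for DIITO (6 layers, 0.1B tokens), copied from the table.
diito_glue = {
    "CoLA": 45.17, "MNLI": 79.68, "MRPC": 88.18, "QNLI": 85.83,
    "QQP": 85.31, "RTE": 60.94, "SST-2": 90.32, "STS-B": 81.69,
}

# Unweighted mean over the eight tasks.
diito_average = sum(diito_glue.values()) / len(diito_glue)

# Ratio to the DistilBERT average trained on 3.3B tokens (79.59 in the table).
relative_score = diito_average / 79.59

# Fraction of training tokens saved when going from 3.3B to 0.1B.
data_reduction = 1 - 0.1 / 3.3

print(round(diito_average, 2))         # 77.14
print(round(100 * relative_score, 1))  # 96.9
print(round(100 * data_reduction, 1))  # 97.0
```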

Main Contents

Citation
Requirements
Dataset
Distillation
Evaluation

Citation

If you use this repository, please cite the following two papers: the paper for interchange intervention training, and the paper for our distillation method.

  @article{geiger-etal-2021-iit,
        title={Inducing Causal Structure for Interpretable Neural Networks}, 
        author={Geiger, Atticus and Wu, Zhengxuan and Lu, Hanson and Rozner, Josh and Kreiss, Elisa and Icard, Thomas and Goodman, Noah D. and Potts, Christopher},
        year={2021},
        eprint={2112.00826},
        archivePrefix={arXiv},
        primaryClass={cs.LG}
  }

  @article{wu-etal-2021-distill,
        title={Causal Distillation for Language Models}, 
        author={Wu, Zhengxuan and Geiger, Atticus and Rozner, Josh and Kreiss, Elisa and Lu, Hanson and Icard, Thomas and Potts, Christopher and Goodman, Noah D.},
        year={2021},
        eprint={2112.02505},
        archivePrefix={arXiv},
        primaryClass={cs.CL}
  }

Requirements

Python 3.6 or 3.7 is supported.
PyTorch version: 1.9.0
Transformers version: 4.11.3
Datasets version: 1.8.0
Since we build our codebase off the HuggingFace Distillation Interface, please review their documentation for requirements.

Dataset

Following the HuggingFace Distillation Interface, we need to pre-process the datasets before we do distillation. You can refer to their repo for details. We adapted the pre-processing scripts and updated them with a few improvements. For example, we can now binarize datasets from the HuggingFace Dataset Hub directly.

# preprocessing from disk
python scripts/binarized_data.py \
--file_path ../../bert-mid-tuning/data-files/wikitext-15M \
--split train \
--field_name text \
--max_parsing_example 1000 \
--tokenizer_type bert \
--tokenizer_name bert-base-uncased \
--dump_file ./data/binarized_text

# preprocessing from huggingface.
python scripts/binarized_data.py \
--dataset_name bookcorpus \
--split train \
--field_name text \
--tokenizer_type bert \
--tokenizer_name bert-base-uncased \
--dump_file bookcorpus-dataset/binarized_text \
--cache_dir ./distill_cache/

python scripts/binarized_data.py \
--dataset_name wikitext \
--split train \
--field_name text \
--tokenizer_type bert \
--tokenizer_name bert-base-uncased \
--dump_file wikitext-dataset/binarized_text \
--cache_dir ./distill_cache/

python scripts/binarized_data.py \
--dataset_name wikitext+bookcorpus \
--split train \
--field_name text \
--tokenizer_type bert \
--tokenizer_name bert-base-uncased \
--dump_file wikitext+bookcorpus-dataset/binarized_text \
--cache_dir ./distill_cache/

# helper script to combine two binarized data files
python scripts/data_combinator.py \
--file_path_left ./bookcorpus-dataset/binarized_text.train.bert-base-uncased.pickle \
--file_path_right ./wikitext-dataset/binarized_text.train.bert-base-uncased.pickle \
--split train \
--tokenizer_name bert-base-uncased \
--dump_file wikitext+bookcorpus-dataset/binarized_text

# multiprocessing preprocessor.
python scripts/binarized_data.py \
--dataset_name bookcorpus \
--split train \
--field_name text \
--tokenizer_type bert \
--tokenizer_name bert-base-uncased \
--dump_file bookcorpus-dataset/binarized_text \
--cache_dir ./distill_cache/ \
--fast_process \
--preprocessing_num_workers 48
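As a rough mental model of what "binarizing" means here, the sketch below turns text lines into lists of token ids and pickles them, which is what the `.pickle` file names above suggest. The whitespace tokenizer, the toy vocabulary, and the function name `binarize` are hypothetical stand-ins; the actual script uses the `bert-base-uncased` tokenizer and may store the ids differently.

```python
import os
import pickle
import tempfile


def binarize(lines, vocab, unk_id=0):
    """Map each non-empty line to a list of token ids (toy whitespace tokenizer)."""
    encoded = []
    for line in lines:
        tokens = line.lower().split()
        if not tokens:
            continue  # skip blank lines
        encoded.append([vocab.get(token, unk_id) for token in tokens])
    return encoded


# Hypothetical four-entry vocabulary; a real BERT vocabulary has 30522 entries.
toy_vocab = {"[UNK]": 0, "causal": 1, "distillation": 2, "works": 3}
toy_lines = ["Causal distillation works", "", "causal models"]

sequences = binarize(toy_lines, toy_vocab)

# Dump to a temporary file and read it back to confirm the round trip.
dump_path = os.path.join(tempfile.mkdtemp(), "binarized_text.train.toy.pickle")
with open(dump_path, "wb") as handle:
    pickle.dump(sequences, handle)
with open(dump_path, "rb") as handle:
    restored = pickle.load(handle)

print(restored)
```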

After you get the datasets ready, you need to generate token counts as well.

python scripts/token_counts.py \
--data_file data/binarized_text.train.bert-base-uncased.pickle \
--token_counts_dump data/binarized_text.train.token_counts.bert-base-uncased.pickle \
--vocab_size 30522
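Conceptually, the token-count step tallies how often each token id occurs in the binarized data and stores one count per vocabulary entry. The sketch below illustrates that idea on toy data. The function name `count_tokens` and the list-of-lists input are assumptions, and the real script's internals may differ.

```python
from collections import Counter


def count_tokens(sequences, vocab_size):
    """Return a list where position i holds the number of occurrences of token id i."""
    counter = Counter()
    for sequence in sequences:
        counter.update(sequence)
    counts = [0] * vocab_size  # ids that never occur keep a count of 0
    for token_id, occurrences in counter.items():
        counts[token_id] = occurrences
    return counts


# Toy binarized data with a hypothetical vocabulary of 5 ids.
toy_sequences = [[1, 2, 3], [1, 0], [3, 3]]
toy_counts = count_tokens(toy_sequences, vocab_size=5)
print(toy_counts)
```

The output list always has exactly `vocab_size` entries, which is why the command above passes `--vocab_size 30522` for `bert-base-uncased`.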

Distillation

Before training, we recommend that you initialize your student model with weights extracted from the teacher model.

python scripts/extract_distilbert.py \
--model_type bert \
--model_name bert-base-uncased \
--dump_checkpoint ./distillation_checkpoints/bert-base-uncased_num_layer_3.pth \
--num_layers 3
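The general idea of initializing a shallow student from a deeper teacher can be sketched as copying a subset of teacher layers and renumbering them. Everything specific below is an assumption for illustration: the simplified parameter names, the choice of teacher layers, and the helper `extract_student_weights`. Which layers the script actually selects, and how it renames parameters for DistilBERT, is not stated here.

```python
def extract_student_weights(teacher_state, teacher_layers):
    """Copy non-layer parameters as-is and keep only the chosen teacher layers,
    renumbered 0..len(teacher_layers)-1 in the student."""
    prefix = "encoder.layer."
    student_state = {}
    for key, value in teacher_state.items():
        if not key.startswith(prefix):
            student_state[key] = value  # e.g. embeddings are copied unchanged
            continue
        layer_str, rest = key[len(prefix):].split(".", 1)
        layer = int(layer_str)
        if layer in teacher_layers:
            new_index = teacher_layers.index(layer)
            student_state[f"{prefix}{new_index}.{rest}"] = value
    return student_state


# Toy 12-layer "teacher" whose parameters are placeholder strings, not tensors.
toy_teacher = {"embeddings.word_embeddings.weight": "E"}
for i in range(12):
    toy_teacher[f"encoder.layer.{i}.attention.self.query.weight"] = f"Q{i}"

chosen_layers = [0, 5, 11]  # hypothetical selection for a 3-layer student
toy_student = extract_student_weights(toy_teacher, chosen_layers)
print(sorted(toy_student))
```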

Now, here is an example of how to distill with or without our causal distillation objective:

CUDA_VISIBLE_DEVICES=0,1,2,3 python causal_train.py \
--force \
--n_gpu 4 \
--log_interval 10 \
--student_type distilbert \
--student_config ./training_configs/distilbert-base-uncased-large.json \
--student_pretrained_weights ./distillation_checkpoints/bert-base-uncased_num_layer_6.pth \
--teacher_type bert \
--teacher_name bert-base-uncased \
--neuron_mapping ./training_configs/single_middle_layer_6.nm \
--mlm --alpha_ce 0.25 --alpha_mlm 0.25 --alpha_cos 0.25 --alpha_clm 0.0 --alpha_causal_ce 0.25 --alpha_causal_cos 0.0 \
--interchange_prop 0.3 --interchange_max_token -1 --interchange_consecutive_only \
--freeze_pos_embs \
--dump_path ./results/ \
--data_file ./wikitext-dataset/binarized_text.train.bert-base-uncased.pickle \
--token_counts ./wikitext-dataset/binarized_text.train.token_counts.bert-base-uncased.pickle \
--seed 42 \
--n_epoch 3 \
--gradient_accumulation_steps 6 \
--batch_size 40

Note that you can turn our causal distillation objective on or off through the corresponding arguments. For instance, we recently added the --alpha_causal_cos argument to support a causal loss on the cosine loss term. Note that the effective batch size in our setting is set to 240.
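One way to read the `--alpha_*` flags is as weights in a weighted sum of loss terms, where a weight of 0 switches a term off. The sketch below illustrates that reading with made-up loss values. The weighted-sum form, the helper `combine_losses`, and the toy numbers are assumptions rather than the repository's code. It also shows one way to arrive at 240 from the command above, namely `--batch_size 40` times `--gradient_accumulation_steps 6`. How the number of GPUs enters is not stated here.

```python
def combine_losses(losses, alphas):
    """Weighted sum of loss terms; terms with weight 0 are skipped entirely."""
    total = 0.0
    for name, weight in alphas.items():
        if weight > 0.0:
            total += weight * losses[name]
    return total


# Weights copied from the example command.
alphas = {
    "ce": 0.25, "mlm": 0.25, "cos": 0.25, "clm": 0.0,
    "causal_ce": 0.25, "causal_cos": 0.0,
}

# Made-up per-term loss values, for illustration only.
toy_losses = {
    "ce": 2.0, "mlm": 4.0, "cos": 1.0, "clm": 9.0,
    "causal_ce": 3.0, "causal_cos": 9.0,
}

with_causal = combine_losses(toy_losses, alphas)
without_causal = combine_losses(toy_losses, {**alphas, "causal_ce": 0.0})

# One reading of "effective batch size": per-step batch times accumulation steps.
effective_batch_size = 40 * 6

print(with_causal, without_causal, effective_batch_size)
```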

Evaluation

After you get your distilled models, you need to fine-tune and evaluate them on downstream tasks. We provide all the scripts you need to run.

MLM Evaluation

CUDA_VISIBLE_DEVICES=0 python run_mlm.py \
--model_name_or_path ./path_to_your_model/ \
--dataset_dir ../path_to_your_data/ \
--tokenizer_name bert-base-uncased \
--do_eval \
--output_dir /tmp/test-mlm \
--cache_dir ./distill_cache/
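MLM evaluation is commonly summarized as perplexity, the exponential of the mean evaluation loss. The upstream HuggingFace `run_mlm.py` example reports it this way, and we assume this fork keeps that behavior. The loss value below is made up for illustration.

```python
import math


def perplexity(eval_loss):
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    return math.exp(eval_loss)


toy_eval_loss = 2.0  # hypothetical evaluation loss, not a reported result
toy_perplexity = perplexity(toy_eval_loss)
print(round(toy_perplexity, 3))
```

Lower is better: a loss of 0 corresponds to a perplexity of exactly 1.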

GLUE Evaluation

CUDA_VISIBLE_DEVICES=0,1,2,3 python run_glue.py \
--model_name_or_path ./path_to_your_model/ \
--tokenizer_name bert-base-uncased \
--task_name sst2 \
--do_train \
--do_eval \
--max_seq_length 128 \
--per_device_train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 3 \
--output_dir ./results/ \
--save_total_limit 1 \
--cache_dir ./distill_cache/
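The command above fine-tunes on a single task (`sst2`). To cover all eight tasks in the benchmark table, one option is to generate one command per task, as in the sketch below. The per-task output directory and the reuse of identical hyperparameters for every task are assumptions for illustration, and the script only prints the commands rather than running them.

```python
import shlex

# Task names as accepted by the HuggingFace run_glue.py example,
# matching the eight columns of the benchmark table.
GLUE_TASKS = ["cola", "mnli", "mrpc", "qnli", "qqp", "rte", "sst2", "stsb"]


def glue_command(task, model_path="./path_to_your_model/"):
    """Build the run_glue.py command line for one task as a single string."""
    args = [
        "python", "run_glue.py",
        "--model_name_or_path", model_path,
        "--tokenizer_name", "bert-base-uncased",
        "--task_name", task,
        "--do_train",
        "--do_eval",
        "--max_seq_length", "128",
        "--per_device_train_batch_size", "32",
        "--learning_rate", "2e-5",
        "--num_train_epochs", "3",
        "--output_dir", f"./results/{task}/",  # hypothetical per-task directory
        "--save_total_limit", "1",
        "--cache_dir", "./distill_cache/",
    ]
    return " ".join(shlex.quote(arg) for arg in args)


commands = [glue_command(task) for task in GLUE_TASKS]
for command in commands:
    print(command)
```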

CoNLL Evaluation

CUDA_VISIBLE_DEVICES=0,1,2,3 python run_ner.py \
--model_name_or_path ./path_to_your_model/ \
--tokenizer_name bert-base-uncased \
--dataset_name conll2003 \
--do_train \
--do_eval \
--output_dir ./ner_results/ \
--save_total_limit 1 \
--cache_dir ./distill_cache/

SQuAD Evaluation

CUDA_VISIBLE_DEVICES=0,1,2,3 python run_qa.py \
--model_name_or_path ./path_to_your_model/ \
--tokenizer_name bert-base-uncased \
--dataset_name squad \
--do_train \
--do_eval \
--per_device_train_batch_size 12 \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--save_total_limit 1 \
--output_dir ./qa_results/
