Few-NERD: Not Only a Few-shot NER Dataset

This is the source code of the ACL-IJCNLP 2021 paper: Few-NERD: A Few-shot Named Entity Recognition Dataset. Check out the Few-NERD website.

********************************* Updates *********************************

09/03/2022: We have added the training script for supervised training using a BERT tagger. Run bash data/download.sh supervised to download the data, then run bash run_supervised.sh.
01/09/2021: We have revised the results of the supervised setting of Few-NERD on arXiv. Thanks to Pedromlf for the help.
19/08/2021: Important: to accompany the released episode data, we have updated the training script. Simply add --use_sampled_data when running train_demo.py to train and test on the released episode data.
02/06/2021: To simplify training, we have released the data sampled by episode (see "Get the Data" below for the download command). The files are named {train/dev/test}_{N}_{K}.jsonl. We sampled 20,000, 1,000, and 5,000 episodes for train, dev, and test, respectively.
26/05/2021: The current Few-NERD (SUP) is sentence-level. We will soon release Few-NERD (SUP) 1.1, which is paragraph-level and contains more contextual information.
11/06/2021: We have modified the word tokenization and will update the latest results soon. We sincerely thank Tingtingma and Chandan Akiti.

Contents

Website
Overview
Getting Started
- Requirements
- Few-NERD Dataset
  - Get the Data
  - Data Format
- Structure
- Key Implementations
  - N-way K~2K-shot Sampler
- How to Run
Citation
Connection

Overview

Few-NERD is a large-scale, fine-grained, manually annotated named entity recognition dataset containing 8 coarse-grained types, 66 fine-grained types, 188,200 sentences, 491,711 entities, and 4,601,223 tokens. Three benchmark tasks are built on it: one supervised, Few-NERD (SUP), and two few-shot, Few-NERD (INTRA) and Few-NERD (INTER).

[Figure: the schema of Few-NERD]

Few-NERD is manually annotated based on context. For example, in the sentence "London is the fifth album by the British rock band ...", the named entity London is labeled as Art-Music.

Requirements

Run the following command to install the remaining dependencies:

pip install -r requirements.txt

Few-NERD Dataset

Get the Data

Few-NERD contains 8 coarse-grained types, 66 fine-grained types, 188,200 sentences, 491,711 entities, and 4,601,223 tokens.
We have split the data into 3 training modes: one for the supervised setting (supervised) and two for the few-shot setting (inter and intra). Each mode contains three files: train.txt, dev.txt, and test.txt. The supervised datasets are randomly split. The inter datasets are randomly split within each coarse type, i.e. each file contains all 8 coarse types but different fine-grained types. The intra datasets are randomly split by coarse type.
The split datasets are downloaded automatically the first time you run the model. To download the data manually, run data/download.sh, and remember to pass the parameter supervised, inter, or intra to select the dataset.
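
The difference between the inter and intra splits described above can be illustrated with a toy sketch. The type names (A-1, B-2, ...) and both helper functions are hypothetical, and the round-robin assignment stands in for the random split; the actual splits are the ones shipped in the downloaded files.

```python
from collections import defaultdict

SPLITS = ("train", "dev", "test")


def coarse_of(label):
    """Return the coarse part of a 'coarse-fine' label, e.g. 'A' for 'A-1'."""
    return label.split("-", 1)[0]


def intra_split(fine_types, coarse_to_split):
    """intra: every fine type follows the split assigned to its coarse type."""
    split = {name: [] for name in SPLITS}
    for label in fine_types:
        split[coarse_to_split[coarse_of(label)]].append(label)
    return split


def inter_split(fine_types):
    """inter: the fine types of each coarse type are spread over all splits.

    Round-robin is used here only to keep the toy example deterministic.
    """
    by_coarse = defaultdict(list)
    for label in fine_types:
        by_coarse[coarse_of(label)].append(label)
    split = {name: [] for name in SPLITS}
    for labels in by_coarse.values():
        for i, label in enumerate(labels):
            split[SPLITS[i % len(SPLITS)]].append(label)
    return split


toy_types = ["A-1", "A-2", "A-3", "B-1", "B-2", "B-3"]
print(inter_split(toy_types))  # every split sees both coarse types A and B
print(intra_split(toy_types, {"A": "train", "B": "test"}))  # coarse types never cross splits
```

In the inter case every split shares the coarse types but not the fine-grained ones; in the intra case the coarse types themselves are disjoint across splits.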

To obtain one of the three benchmark datasets of Few-NERD, run data/download.sh with the parameter supervised, inter, or intra, for example:

bash data/download.sh supervised

To get the data sampled by episode, run

bash data/download.sh episode-data
unzip -d data/ data/episode-data.zip
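
Once unzipped, the episode files follow the {train/dev/test}_{N}_{K}.jsonl naming noted in the updates. A minimal sketch for locating and reading one of them is given below. The helper names and the default directory data/episode-data are assumptions (adjust the root to wherever the archive actually unpacks), and no field names are assumed: each line is simply parsed as one JSON object.

```python
import json
from pathlib import Path


def episode_path(split, n, k, root="data/episode-data"):
    """Build the path of an episode file named {split}_{N}_{K}.jsonl.

    `root` is an assumed location, not something the repository guarantees.
    """
    if split not in ("train", "dev", "test"):
        raise ValueError(f"unknown split: {split!r}")
    return Path(root) / f"{split}_{n}_{k}.jsonl"


def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


print(episode_path("dev", 5, 1))
```

Inspecting the keys of the first object returned by read_jsonl is a quick way to see how an episode is laid out before wiring it into a data loader.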

Data Format

The data are pre-processed into the typical NER data format shown below (token\tlabel).

Between	O
1789	O
and	O
1793	O
he	O
sat	O
on	O
a	O
committee	O
reviewing	O
the	O
administrative	MISC-law
constitution	MISC-law
of	MISC-law
Galicia	MISC-law
to	O
little	O
effect	O
.	O
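
A minimal sketch of reading this format follows. It assumes that sentences are separated by blank lines and that consecutive tokens carrying the same non-O label form one entity (the sample above uses no B-/I- prefixes, so two adjacent entities of the same type would be merged). The function names are illustrative and are not taken from util/data_loader.py.

```python
def read_ner_lines(lines):
    """Group 'token<TAB>label' lines into (tokens, labels) sentences."""
    sentences, tokens, labels = [], [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():  # a blank line is assumed to end a sentence
            if tokens:
                sentences.append((tokens, labels))
                tokens, labels = [], []
            continue
        token, label = line.rsplit("\t", 1)
        tokens.append(token)
        labels.append(label)
    if tokens:  # last sentence without a trailing blank line
        sentences.append((tokens, labels))
    return sentences


def entity_spans(tokens, labels):
    """Merge runs of identical non-O labels into (text, label) pairs."""
    spans, current, current_label = [], [], "O"
    for token, label in zip(tokens, labels):
        if label != current_label:
            if current_label != "O":
                spans.append((" ".join(current), current_label))
            current, current_label = [], label
        if label != "O":
            current.append(token)
    if current_label != "O":
        spans.append((" ".join(current), current_label))
    return spans


sample = "the\tO\nadministrative\tMISC-law\nconstitution\tMISC-law\nof\tMISC-law\nGalicia\tMISC-law\nto\tO\n"
for toks, labs in read_ner_lines(sample.splitlines()):
    print(entity_spans(toks, labs))
```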

Structure

The structure of our project is:

--util
| -- framework.py
| -- data_loader.py
| -- viterbi.py             # viterbi decoder for structshot only
| -- word_encoder
| -- fewshotsampler.py

-- proto.py                 # prototypical model
-- nnshot.py                # nnshot model

-- train_demo.py            # main training script

Key Implementations

Sampler

As described in our paper, we design an N-way K~2K-shot sampling strategy for this work. The implementation is in util/fewshotsampler.py.
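
The idea of N-way K~2K-shot sampling can be sketched as a greedy loop: sentences are drawn at random and kept only if they stay within the N target classes and keep every class at or below 2K entities, until each class has at least K. The code below is a simplified illustration under those assumptions, not the code in util/fewshotsampler.py. Each sentence is represented only by a hypothetical dict mapping class name to entity count.

```python
import random
from collections import Counter


def is_valid(sentence, counts, targets, k):
    """Check whether adding `sentence` keeps the support set within K~2K."""
    if not sentence:
        return False  # sentence has no entities
    helps = False
    for cls, n in sentence.items():
        if cls not in targets:
            return False  # would introduce a class outside the episode
        if counts[cls] + n > 2 * k:
            return False  # would exceed the 2K upper bound
        if counts[cls] < k:
            helps = True  # contributes to a class still below K
    return helps


def sample_support(sentences, targets, k, rng, max_tries=10000):
    """Greedily pick sentence indices until every target class has >= K entities."""
    targets = set(targets)
    counts, chosen, tries = Counter(), [], 0
    while not all(counts[c] >= k for c in targets):
        if tries >= max_tries:
            raise RuntimeError("could not build a support set")
        tries += 1
        idx = rng.randrange(len(sentences))
        if idx in chosen or not is_valid(sentences[idx], counts, targets, k):
            continue
        chosen.append(idx)
        counts.update(sentences[idx])
    return chosen, counts


toy = [{"A": 1}, {"B": 1}, {"A": 1, "B": 1}, {"C": 1}, {}, {"A": 3}]
print(sample_support(toy, ["A", "B"], k=1, rng=random.Random(0)))
```

The K~2K range exists because a single sentence may contain several entities, so an exact count of K per class cannot always be reached by adding whole sentences.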

ProtoBERT

Prototypical networks with BERT are implemented in model/proto.py.
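
As a rough illustration of the prototypical-network idea (not the code in model/proto.py), each class prototype is the mean of its support token embeddings, and a query token takes the label of the closest prototype. The sketch uses toy 2-D vectors in place of BERT embeddings; the function names are hypothetical, and the use_dot switch mirrors the role of the --dot option listed under "How to Run".

```python
from collections import defaultdict


def prototypes(support_vecs, support_labels):
    """Average the support embeddings of each class."""
    grouped = defaultdict(list)
    for vec, label in zip(support_vecs, support_labels):
        grouped[label].append(vec)
    return {
        label: [sum(dim) / len(vecs) for dim in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def classify(query_vec, protos, use_dot=False):
    """Return the label whose prototype is most similar to the query."""
    def score(label):
        proto = protos[label]
        if use_dot:
            return sum(q * p for q, p in zip(query_vec, proto))
        # negative squared L2 distance: larger means closer
        return -sum((q - p) ** 2 for q, p in zip(query_vec, proto))
    return max(protos, key=score)


protos = prototypes([[0.0, 0.0], [0.0, 2.0], [4.0, 4.0]], ["O", "O", "art"])
print(protos)
print(classify([3.0, 3.0], protos))
```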

NNShot & StructShot

NNShot with BERT is implemented in model/nnshot.py.
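
The nearest-neighbour rule behind NNShot can be sketched as follows: instead of averaging a class into one prototype, each class is scored by its single closest support token. This is an illustration with toy 2-D vectors and hypothetical function names, not the code in model/nnshot.py.

```python
def squared_distance(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nn_scores(query_vec, support_vecs, support_labels):
    """Score each class by the negative distance to its nearest support token."""
    scores = {}
    for vec, label in zip(support_vecs, support_labels):
        score = -squared_distance(query_vec, vec)
        if label not in scores or score > scores[label]:
            scores[label] = score
    return scores


def nn_label(query_vec, support_vecs, support_labels):
    """Return the class of the nearest support token."""
    scores = nn_scores(query_vec, support_vecs, support_labels)
    return max(scores, key=scores.get)


support = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
labels = ["O", "O", "art"]
print(nn_label([4.0, 4.0], support, labels))
```

Per-class scores of this kind are what a structured decoder can consume as emission scores.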

StructShot is realized by adding an extra Viterbi decoder in util/framework.py.

Note that the backbone BERT encoder we use for the StructShot model is not pre-trained on an NER task.
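
To show what a Viterbi decoder adds on top of per-token scores, here is a generic sketch of Viterbi decoding over log-scores. It is not the code in util/viterbi.py: the dictionary-based inputs are an assumption made for readability, and the --tau re-normalization of transition probabilities is not modeled.

```python
import math


def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions:   list of {label: log-score}, one dict per token
    transitions: {(previous_label, label): log-score}
    """
    if not emissions:
        return []
    best = [{lab: emissions[0][lab] for lab in labels}]
    back = []
    for emission in emissions[1:]:
        scores, pointers = {}, {}
        for cur in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions[(p, cur)])
            scores[cur] = best[-1][prev] + transitions[(prev, cur)] + emission[cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # follow the back-pointers from the best final label
    path = [max(labels, key=lambda lab: best[-1][lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


log = math.log
emissions = [{"O": log(0.9), "X": log(0.1)}, {"O": log(0.45), "X": log(0.55)}]
transitions = {("O", "O"): log(0.9), ("O", "X"): log(0.1),
               ("X", "O"): log(0.5), ("X", "X"): log(0.5)}
print(viterbi(emissions, transitions, ["O", "X"]))
```

In this toy input a token-by-token argmax would label the second token X, while the unlikely O-to-X transition makes the decoder prefer O for both tokens.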

How to Run

Run train_demo.py. The arguments are listed below. The default parameters are for the proto model on the inter mode dataset.

--mode                  training mode, must be inter, intra, or supervised
--trainN                N in train
--N                     N in val and test
--K                     K shot
--Q                     Num of query per class
--batch_size            batch size
--train_iter            num of iters in training
--val_iter              num of iters in validation
--test_iter             num of iters in testing
--val_step              val after training how many iters
--model                 model name, must be proto, nnshot or structshot
--max_length            max length of tokenized sentence
--lr                    learning rate
--weight_decay          weight decay
--grad_iter             accumulate gradient every x iterations
--load_ckpt             path to load model
--save_ckpt             path to save model
--fp16                  use nvidia apex fp16
--only_test             no training process, only test
--ckpt_name             checkpoint name
--seed                  random seed
--pretrain_ckpt         bert pre-trained checkpoint
--dot                   use dot instead of L2 distance in distance calculation
--use_sgd_for_bert      use SGD instead of AdamW for BERT.
# only for structshot
--tau                   StructShot parameter to re-normalize the transition probabilities

For the hyperparameter --tau in StructShot, we use 0.32 in the 1-shot setting, 0.318 for the 5-way-5-shot setting, and 0.434 for the 10-way-5-shot setting.
Taking the structshot model on the inter dataset as an example, the experiments can be run as follows.

5-way-1~5-shot

python3 train_demo.py --mode inter \
--lr 1e-4 --batch_size 8 --trainN 5 --N 5 --K 1 --Q 1 \
--train_iter 10000 --val_iter 500 --test_iter 5000 --val_step 1000 \
--max_length 64 --model structshot --tau 0.32

5-way-5~10-shot

python3 train_demo.py --mode inter \
--lr 1e-4 --batch_size 1 --trainN 5 --N 5 --K 5 --Q 5 \
--train_iter 10000 --val_iter 500 --test_iter 5000 --val_step 1000 \
--max_length 32 --model structshot --tau 0.318

10-way-1~5-shot

python3 train_demo.py --mode inter \
--lr 1e-4 --batch_size 4 --trainN 10 --N 10 --K 1 --Q 1 \
--train_iter 10000 --val_iter 500 --test_iter 5000 --val_step 1000 \
--max_length 64 --model structshot --tau 0.32

10-way-5~10-shot

python3 train_demo.py --mode inter \
--lr 1e-4 --batch_size 1 --trainN 10 --N 10 --K 5 --Q 1 \
--train_iter 10000 --val_iter 500 --test_iter 5000 --val_step 1000 \
--max_length 32 --model structshot --tau 0.434
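
When scripting several of these runs, the four settings above can be kept in one table and turned into command lines. The sketch below only rearranges the values already shown in this section; the SETTINGS table and build_command helper are hypothetical conveniences and are not part of the repository.

```python
import shlex

# (N, K) -> values copied from the four example commands above
SETTINGS = {
    (5, 1): {"batch_size": 8, "Q": 1, "max_length": 64, "tau": 0.32},
    (5, 5): {"batch_size": 1, "Q": 5, "max_length": 32, "tau": 0.318},
    (10, 1): {"batch_size": 4, "Q": 1, "max_length": 64, "tau": 0.32},
    (10, 5): {"batch_size": 1, "Q": 1, "max_length": 32, "tau": 0.434},
}


def build_command(n, k, mode="inter"):
    """Return the argument list for one StructShot run of train_demo.py."""
    s = SETTINGS[(n, k)]
    return [
        "python3", "train_demo.py", "--mode", mode,
        "--lr", "1e-4", "--batch_size", str(s["batch_size"]),
        "--trainN", str(n), "--N", str(n), "--K", str(k), "--Q", str(s["Q"]),
        "--train_iter", "10000", "--val_iter", "500",
        "--test_iter", "5000", "--val_step", "1000",
        "--max_length", str(s["max_length"]),
        "--model", "structshot", "--tau", str(s["tau"]),
    ]


for n, k in SETTINGS:
    print(shlex.join(build_command(n, k)))
```

The returned list can be passed directly to subprocess.run from the repository root.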

Citation

If you use Few-NERD in your work, please cite our paper:

@inproceedings{ding-etal-2021-nerd,
    title = "Few-{NERD}: A Few-shot Named Entity Recognition Dataset",
    author = "Ding, Ning  and
      Xu, Guangwei  and
      Chen, Yulin  and
      Wang, Xiaobin  and
      Han, Xu  and
      Xie, Pengjun  and
      Zheng, Haitao  and
      Liu, Zhiyuan",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-long.248",
    doi = "10.18653/v1/2021.acl-long.248",
    pages = "3198--3213",
}