textaugment

Version 2.0.0 (16-11-2023)

TextAugment: Improving Short Text Classification through Global Augmentation Methods

You have just found Textaugment.

Textaugment is a Python 3 library for augmenting text for natural language processing applications. It stands on the giant shoulders of NLTK, Gensim v3.x, and TextBlob and plays nicely with them.

Acknowledgements

Cite this paper when using this library. An arXiv version is also available.

 @inproceedings{marivate2020improving,
  title={Improving short text classification through global augmentation methods},
  author={Marivate, Vukosi and Sefara, Tshephisho},
  booktitle={International Cross-Domain Conference for Machine Learning and Knowledge Extraction},
  pages={385--399},
  year={2020},
  organization={Springer}
}

Table of Contents

- Features
- Citation Paper
  - Requirements
  - Installation
  - How to use
    - Word2vec-based augmentation
    - WordNet-based augmentation
    - RTT-based augmentation
- Easy data augmentation (EDA)
- An easier data augmentation (AEDA)
- Mixup augmentation
  - Implementation
- Acknowledgements

Features

- Generate synthetic data to improve model performance without manual effort
- Simple, lightweight, easy-to-use library
- Plug and play with any machine learning framework (e.g. PyTorch, TensorFlow, Scikit-learn)
- Supports textual data

Citation Paper

Improving short text classification through global augmentation methods.

Requirements

Python 3

The following software packages are dependencies and will be installed automatically.

$ pip install numpy nltk gensim==3.8.3 textblob googletrans

The following code downloads the NLTK corpus for WordNet.

import nltk
nltk.download('wordnet')

The following code downloads the NLTK tokenizer. This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviations, collocations, and words that start sentences.

nltk.download('punkt')

The following code downloads the default NLTK part-of-speech tagger model. A part-of-speech tagger processes a sequence of words and attaches a part-of-speech tag to each word.

nltk.download('averaged_perceptron_tagger')

Use gensim to load a pre-trained word2vec model, such as the Google News model from Google Drive.

import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)

You can also use gensim to load Facebook's FastText English and multilingual models.

import gensim
model = gensim.models.fasttext.load_facebook_model('./cc.en.300.bin.gz')

Or train one from scratch using your own data or one of the following public datasets:

- Text8 Wiki
- Dataset from "One Billion Word Language Modeling Benchmark"

Installation

Install from pip [Recommended]

$ pip install textaugment
or install the latest release directly from GitHub
$ pip install git+https://github.com/dsfsi/textaugment.git

Install from source

$ git clone git@github.com:dsfsi/textaugment.git
$ cd textaugment
$ python setup.py install

How to use

The following augmentation types can be used:

word2vec

from textaugment import Word2vec

fasttext

from textaugment import Fasttext

wordnet

from textaugment import Wordnet

translate (this requires internet access)

from textaugment import Translate

FastText/Word2vec-based augmentation

See this notebook for an example

Basic example

>>> from textaugment import Word2vec, Fasttext
>>> # `model` is either a path to a saved gensim model or the gensim model itself
>>> t = Word2vec(model='path/to/gensim/model')
>>> t.augment('The stories are good')
The films are good
>>> t = Fasttext(model='path/to/gensim/model')
>>> t.augment('The stories are good')
The films are good
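
To make the idea behind embedding-based replacement concrete, the following minimal sketch swaps a word for its nearest neighbour by cosine similarity. It does not call textaugment or gensim: the two-dimensional `TOY_VECTORS` and the helpers `cosine`, `most_similar`, and `replace_word` are assumptions made purely for illustration, and a real word2vec or FastText model would supply the vectors instead.

```python
import math

# Hypothetical 2-D embeddings; a real model has hundreds of dimensions.
TOY_VECTORS = {
    "stories": (0.9, 0.1),
    "films": (0.8, 0.2),
    "movies": (0.7, 0.3),
    "good": (0.1, 0.9),
    "excellent": (0.2, 0.8),
}


def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def most_similar(word, vectors=TOY_VECTORS):
    """Return the vocabulary word closest to `word`, excluding itself."""
    others = [w for w in vectors if w != word]
    return max(others, key=lambda w: cosine(vectors[word], vectors[w]))


def replace_word(sentence, target, vectors=TOY_VECTORS):
    """Replace every occurrence of `target` with its nearest neighbour."""
    words = sentence.split()
    return " ".join(most_similar(w, vectors) if w == target else w for w in words)


print(replace_word("The stories are good", "stories"))
```

With these toy vectors, "films" is the nearest neighbour of "stories", so the sentence becomes "The films are good".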

Advanced example

>>> runs = 1  # Number of times to augment a sentence. Default is 1.
>>> v = False  # Verbose mode: replace all the words. If enabled, runs has no effect. Used in this paper (https://www.cs.cmu.edu/~diyiy/docs/emnlp_wang_2015.pdf)
>>> p = 0.5  # Probability of success of an individual trial (0.1 < p < 1.0). Default is 0.5. Used by the geometric distribution to select words from a sentence.

>>> # `model` is either a path to a saved gensim model or the gensim model itself
>>> word = Word2vec(model='path/to/gensim/model', runs=5, v=False, p=0.5)
>>> word.augment('The stories are good', top_n=10)
The movies are excellent
>>> fast = Fasttext(model='path/to/gensim/model', runs=5, v=False, p=0.5)
>>> fast.augment('The stories are good', top_n=10)
The movies are excellent
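
The role of `p` can be illustrated with a small standard-library sketch. One plausible reading, assumed here and not confirmed by the text above, is that each word gets its own geometric trial and is selected when the first success falls on trial 1, which happens with probability `p`. The helper names `geometric_trials` and `select_words` are hypothetical.

```python
import random


def geometric_trials(p, rng):
    """Count Bernoulli(p) trials up to and including the first success."""
    trials = 1
    while rng.random() >= p:  # failure: try again
        trials += 1
    return trials


def select_words(words, p=0.5, rng=None):
    """Keep the words whose geometric draw succeeds on the first trial.

    Assumption: this mirrors the documented use of a geometric
    distribution with success probability p (0.1 < p < 1.0).
    """
    rng = rng or random.Random()
    return [w for w in words if geometric_trials(p, rng) == 1]


words = "The stories are good".split()
print(select_words(words, p=0.5, rng=random.Random(0)))
```

Under this reading, a larger `p` selects more words for replacement, and `p` close to 1 selects nearly all of them.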

WordNet-based augmentation

Basic example

>>> import nltk
>>> nltk.download('punkt')
>>> nltk.download('wordnet')
>>> from textaugment import Wordnet
>>> t = Wordnet()
>>> t.augment('In the afternoon, John is going to town')
In the afternoon , John is walking to town

Advanced example

>>> v = True  # Enable verb augmentation. Default is True.
>>> n = False  # Enable noun augmentation. Default is False.
>>> runs = 1  # Number of times to augment a sentence. Default is 1.
>>> p = 0.5  # Probability of success of an individual trial (0.1 < p < 1.0). Default is 0.5. Used by the geometric distribution to select words from a sentence.

>>> t = Wordnet(v=False, n=True, p=0.5)
>>> t.augment('In the afternoon, John is going to town', top_n=10)
In the afternoon , Joseph is going to town .
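
The effect of the `v` and `n` switches can be sketched as a filter over part-of-speech tags. The example below is an illustration only: the tags are written by hand in Penn Treebank style rather than produced by the NLTK tagger, and `candidate_positions` is a hypothetical helper, not part of textaugment.

```python
# Hand-tagged tokens (Penn Treebank style); an assumption for illustration.
TAGGED = [
    ("In", "IN"), ("the", "DT"), ("afternoon", "NN"), (",", ","),
    ("John", "NNP"), ("is", "VBZ"), ("going", "VBG"), ("to", "TO"),
    ("town", "NN"),
]


def candidate_positions(tagged, v=True, n=False):
    """Return indices of tokens eligible for replacement.

    v enables verbs (tags starting with VB), n enables nouns
    (tags starting with NN).
    """
    prefixes = []
    if v:
        prefixes.append("VB")
    if n:
        prefixes.append("NN")
    return [i for i, (_, tag) in enumerate(tagged)
            if any(tag.startswith(prefix) for prefix in prefixes)]


print(candidate_positions(TAGGED, v=True, n=False))  # verb positions
print(candidate_positions(TAGGED, v=False, n=True))  # noun positions
```

With `v=False, n=True`, only nouns such as "John" are candidates, which is consistent with the advanced example replacing "John" and leaving "going" untouched.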

RTT-based augmentation

Example

>>> src = "en"  # source language of the sentence
>>> to = "fr"  # target language
>>> from textaugment import Translate
>>> t = Translate(src="en", to="fr")
>>> t.augment('In the afternoon, John is going to town')
In the afternoon John goes to town
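
Round-trip translation (RTT) sends a sentence to the target language and back, so the returned sentence is often a paraphrase. The sketch below shows only that control flow: the word-level dictionaries `EN_TO_FR` and `FR_TO_EN` and the helpers `translate_words` and `round_trip` are toy assumptions standing in for a real translation service, which the library reaches over the internet.

```python
# Toy word-level dictionaries; a real translator works on whole sentences.
EN_TO_FR = {"big": "grand", "house": "maison"}
FR_TO_EN = {"grand": "large", "maison": "house"}


def translate_words(sentence, table):
    """Map each word through `table`, leaving unknown words unchanged."""
    return " ".join(table.get(word, word) for word in sentence.split())


def round_trip(sentence, forward, backward):
    """Translate with `forward`, then translate the result with `backward`."""
    return backward(forward(sentence))


paraphrase = round_trip(
    "the big house",
    forward=lambda s: translate_words(s, EN_TO_FR),
    backward=lambda s: translate_words(s, FR_TO_EN),
)
print(paraphrase)
```

Because the backward table is not an exact inverse of the forward one, "big" comes back as "large", which is the kind of variation RTT relies on.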

EDA: Easy data augmentation techniques for boosting performance on text classification tasks

This is the implementation of EDA by Jason Wei and Kai Zou.

https://www.aclweb.org/anthology/d19-1670.pdf

See this notebook for an example

Synonym Replacement

Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.

Basic example

>>> from textaugment import EDA
>>> t = EDA()
>>> t.synonym_replacement("John is going to town", top_n=10)
John is give out to town
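
The steps described above can be sketched with the standard library alone. This is a simplified illustration, not the library's code: `TOY_SYNONYMS` and `STOP_WORDS` are hypothetical stand-ins for a real thesaurus and stop-word list, and the `rng` parameter is added only to make runs reproducible.

```python
import random

# Hypothetical toy thesaurus and stop-word list, for illustration only.
TOY_SYNONYMS = {
    "going": ["moving", "travelling"],
    "town": ["city", "village"],
}
STOP_WORDS = {"is", "to", "the", "a", "in"}


def synonym_replacement(sentence, n=1, synonyms=TOY_SYNONYMS, rng=None):
    """Replace up to n non-stop words with a randomly chosen synonym."""
    rng = rng or random.Random()
    words = sentence.split()
    # Positions of words that are not stop words and have a known synonym.
    candidates = [i for i, w in enumerate(words)
                  if w.lower() not in STOP_WORDS and w.lower() in synonyms]
    for i in rng.sample(candidates, min(n, len(candidates))):
        words[i] = rng.choice(synonyms[words[i].lower()])
    return " ".join(words)


print(synonym_replacement("John is going to town", n=1, rng=random.Random(0)))
```

Words without an entry in the thesaurus, such as "John", are never replaced in this sketch.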

Random Deletion

Randomly remove each word in the sentence with probability p.

Basic example

>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_deletion("John is going to town", p=0.2)
is going to town
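
As a rough illustration of the rule above, the following standard-library sketch drops each word independently with probability `p`. Falling back to a single random word when everything is deleted is an assumption of this sketch, and the `rng` parameter exists only for reproducibility.

```python
import random


def random_deletion(sentence, p=0.2, rng=None):
    """Drop each word independently with probability p."""
    rng = rng or random.Random()
    words = sentence.split()
    if len(words) <= 1:
        return sentence
    kept = [w for w in words if rng.random() >= p]
    # Assumption: never return an empty sentence; keep one random word.
    if not kept:
        kept = [rng.choice(words)]
    return " ".join(kept)


print(random_deletion("John is going to town", p=0.2, rng=random.Random(0)))
```

The surviving words keep their original order, so the output is always a subsequence of the input.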

Random Swap

Randomly choose two words in the sentence and swap their positions. Do this n times.

Basic example

>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_swap("John is going to town")
John town going to is
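
A minimal standard-library sketch of the swap operation follows. It is an illustration under stated assumptions rather than the library's code: the parameter name `n` and the `rng` argument are choices made here.

```python
import random


def random_swap(sentence, n=1, rng=None):
    """Swap two randomly chosen, distinct positions; repeat n times."""
    rng = rng or random.Random()
    words = sentence.split()
    if len(words) < 2:
        return sentence
    for _ in range(n):
        i, j = rng.sample(range(len(words)), 2)  # two different indices
        words[i], words[j] = words[j], words[i]
    return " ".join(words)


print(random_swap("John is going to town", n=1, rng=random.Random(0)))
```

The output always contains the same words as the input; only their order changes.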

Random Insertion

Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this n times.

Basic example

>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_insertion("John is going to town")
John is going to make up town
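
The insertion rule can be sketched in the same spirit. As before, `TOY_SYNONYMS` and `STOP_WORDS` are hypothetical placeholders for a real thesaurus and stop-word list, and the function is an illustration rather than the library's implementation.

```python
import random

# Hypothetical toy thesaurus and stop-word list, for illustration only.
TOY_SYNONYMS = {
    "going": ["moving", "travelling"],
    "town": ["city", "village"],
}
STOP_WORDS = {"is", "to", "the", "a", "in"}


def random_insertion(sentence, n=1, synonyms=TOY_SYNONYMS, rng=None):
    """Insert a synonym of a random non-stop word at a random position, n times."""
    rng = rng or random.Random()
    words = sentence.split()
    for _ in range(n):
        candidates = [w for w in words
                      if w.lower() not in STOP_WORDS and w.lower() in synonyms]
        if not candidates:
            break  # nothing in the sentence has a known synonym
        synonym = rng.choice(synonyms[rng.choice(candidates).lower()])
        # randint is inclusive, so the synonym may also be appended at the end.
        words.insert(rng.randint(0, len(words)), synonym)
    return " ".join(words)


print(random_insertion("John is going to town", n=1, rng=random.Random(0)))
```

Each iteration lengthens the sentence by one word while leaving the original words in their original order.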

AEDA: An easier data augmentation technique for text classification

This is the implementation of AEDA by Karimi et al., a variant of EDA. It is based on the random insertion of punctuation marks.

https://aclanthology.org/2021.Findings-EMNLP.234.pdf

Implementation

See this notebook for an example

Random Insertion of Punctuation Marks

Basic example

>>> from textaugment import AEDA
>>> t = AEDA()
>>> t.punct_insertion("John is going to town")
! John is going to town
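
The following standard-library sketch illustrates punctuation insertion. The six marks and the choice of between one insertion and roughly a third of the sentence length follow the AEDA paper as I understand it; the function name `insert_punctuation`, the `ratio` parameter, and the `rng` argument are assumptions of this sketch, not the library's API.

```python
import random

PUNCTUATION = [".", ";", "?", ":", "!", ","]


def insert_punctuation(sentence, ratio=1 / 3, rng=None):
    """Insert between 1 and ratio * len(words) punctuation marks at random positions."""
    rng = rng or random.Random()
    words = sentence.split()
    if not words:
        return sentence
    max_marks = max(1, int(ratio * len(words)))
    n_marks = rng.randint(1, max_marks)
    for _ in range(n_marks):
        # randint is inclusive, so a mark may land before the first word
        # or after the last one.
        words.insert(rng.randint(0, len(words)), rng.choice(PUNCTUATION))
    return " ".join(words)


print(insert_punctuation("John is going to town", rng=random.Random(0)))
```

Unlike the EDA operations, every original word is kept in its original order, and only punctuation is added.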

Mixup augmentation

This is the implementation of mixup augmentation by Hongyi Zhang, Moustapha Cisse, Yann Dauphin, and David Lopez-Paz, adapted to NLP.

Used in Augmenting Data with Mixup for Sentence Classification: An Empirical Study.

Mixup is a generic and straightforward data augmentation principle. In essence, mixup trains a neural network on convex combinations of pairs of examples and their labels. By doing so, mixup regularises the neural network to favour simple linear behaviour in between training examples.
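
A convex combination of one pair of examples can be sketched with the standard library alone. Here the mixing weight is drawn from a Beta(alpha, alpha) distribution as in the mixup paper; the function name `mixup_pair`, the use of plain Python lists for feature vectors, and the one-hot labels are assumptions of this sketch, and for text the inputs would typically be word or sentence embeddings.

```python
import random


def mixup_pair(x1, y1, x2, y2, alpha=1.0, rng=None):
    """Blend two examples and their one-hot labels with weight lam ~ Beta(alpha, alpha)."""
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam


# Two toy examples with 2-D features and one-hot labels for two classes.
x, y, lam = mixup_pair([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0],
                       alpha=1.0, rng=random.Random(0))
print(x, y, lam)
```

The mixed label is a soft label: its entries still sum to 1, with weight `lam` on the first class and `1 - lam` on the second.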

Implementation

See this notebook for an example

Built with ❤ on Python

Authors

Joseph Sefara (http://www.speechtech.co.za)
Vukosi Marivate (http://www.vima.co.za)

Licence

MIT licensed. See the bundled licence file for more details.
