TweetNLP

TweetNLP for all the NLP enthusiasts working on Twitter and social media! The Python library tweetnlp provides a collection of useful tools to analyze and understand tweets, such as sentiment analysis, emoji prediction, and named-entity recognition, powered by state-of-the-art language models specialized on social media.

News (December 2022): We presented a TweetNLP demo paper ("TweetNLP: Cutting-Edge Natural Language Processing for Social Media") at EMNLP 2022. The final version can be found here.

TweetNLP Hugging Face page: all the main TweetNLP models can be found here on Hugging Face.

Resources:

Quick tour with a Colab notebook:
Play with the TweetNLP online demo: link
EMNLP 2022 paper: link
2nd Cardiff NLP Summer Workshop tutorial:
2nd Cardiff NLP Summer Workshop tutorial (solution):

Table of contents:

Load Model & Dataset
Fine-tune Model

Get Started

Install TweetNLP via pip on your console.

pip install tweetnlp

Model & Dataset

In this section, you will learn how to get the models and datasets with tweetnlp. The models follow the HuggingFace model format and the datasets are in the format of HuggingFace datasets. Easy introductions to HuggingFace models and datasets can be found on the HuggingFace webpage, so please check them if you are new to HuggingFace.

Tweet Classification

The classification module consists of seven different tasks (topic classification, sentiment analysis, irony detection, hate speech detection, offensive language detection, emoji prediction, and emotion analysis). In each example, the model is instantiated by tweetnlp.load_model("task-name"), and the prediction is run by passing a text or a list of texts as argument to the corresponding function.

Topic Classification: The aim of this task is, given a tweet, to assign topics related to its content. The task is formed as a supervised multi-label classification problem where each tweet is assigned one or more topics from a total of 19 available topics. The topics were carefully curated based on Twitter trends with the aim to be broad and general, and consist of classes such as arts and culture, music, or sports. Our internally annotated dataset contains over 10K manually labeled tweets (check the paper here, or the HuggingFace dataset page).

import tweetnlp

# MULTI-LABEL MODEL
model = tweetnlp.load_model('topic_classification')  # Or `model = tweetnlp.TopicClassification()`
model.topic("Jacob Collier is a Grammy-awarded English artist from London.")  # Or `model.predict`
# >>> {'label': ['celebrity_&_pop_culture', 'music']}
# Note: the probability of the multi-label model is the output of a sigmoid function on the binary prediction of whether each topic is positive or negative.
model.topic("Jacob Collier is a Grammy-awarded English artist from London.", return_probability=True)
# >>> {'label': ['celebrity_&_pop_culture', 'music'],
#  'probability': {'arts_&_culture': 0.037371691316366196,
#   'business_&_entrepreneurs': 0.010188567452132702,
#   'celebrity_&_pop_culture': 0.92448890209198,
#   'diaries_&_daily_life': 0.03425711765885353,
#   'family': 0.00796138122677803,
#   'fashion_&_style': 0.020642118528485298,
#   'film_tv_&_video': 0.08062587678432465,
#   'fitness_&_health': 0.006343095097690821,
#   'food_&_dining': 0.0042883665300905704,
#   'gaming': 0.004327300935983658,
#   'learning_&_educational': 0.010652057826519012,
#   'music': 0.8291937112808228,
#   'news_&_social_concern': 0.24688217043876648,
#   'other_hobbies': 0.020671198144555092,
#   'relationships': 0.020371075719594955,
#   'science_&_technology': 0.0170074962079525,
#   'sports': 0.014291072264313698,
#   'travel_&_adventure': 0.010423899628221989,
#   'youth_&_student_life': 0.008605164475739002}}

# SINGLE-LABEL MODEL
model = tweetnlp.load_model('topic_classification', multi_label=False)  # Or `model = tweetnlp.TopicClassification(multi_label=False)`
model.topic("Jacob Collier is a Grammy-awarded English artist from London.")
# >>> {'label': 'pop_culture'}
# Note: the probability of the single-label model is the softmax over the labels.
model.topic("Jacob Collier is a Grammy-awarded English artist from London.", return_probability=True)
# >>> {'label': 'pop_culture',
#  'probability': {'arts_&_culture': 9.20625461731106e-05,
#   'business_&_entrepreneurs': 6.916998972883448e-05,
#   'pop_culture': 0.9995898604393005,
#   'daily_life': 0.00011083036952186376,
#   'sports_&_gaming': 8.668467489769682e-05,
#   'science_&_technology': 5.152115045348182e-05}}

# GET DATASET
dataset_multi_label, label2id_multi_label = tweetnlp.load_dataset('topic_classification')
dataset_single_label, label2id_single_label = tweetnlp.load_dataset('topic_classification', multi_label=False)
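
The two notes in the topic classification example contrast sigmoid outputs (multi-label) with a softmax (single-label). A minimal sketch of that difference follows; it does not use tweetnlp, the raw scores are made up for illustration, and the 0.5 decision threshold is an assumption rather than something the library documents here.

```python
import math


def sigmoid(x):
    """Squash one raw score into an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1 across all labels."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores (logits) for three topics; not real model output.
labels = ["music", "sports", "celebrity_&_pop_culture"]
logits = [1.6, -4.2, 2.5]

# Multi-label: each topic is judged independently, so several can be kept.
multi_probs = {label: sigmoid(z) for label, z in zip(labels, logits)}
multi_labels = [label for label, p in multi_probs.items() if p > 0.5]  # assumed threshold

# Single-label: topics compete, and only the most probable one is kept.
single_probs = dict(zip(labels, softmax(logits)))
single_label = max(single_probs, key=single_probs.get)

print(multi_labels)
print(single_label)
```

This is why the multi-label probabilities in the example above do not sum to 1, while the single-label probabilities do.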

Sentiment Analysis: The sentiment analysis task integrated in TweetNLP is a simplified version where the goal is to predict the sentiment of a tweet with one of the three following labels: positive, neutral, or negative. The base dataset for English is the unified TweetEval version of the SemEval-2017 dataset from the task on Sentiment Analysis in Twitter (check the paper here).

import tweetnlp

# ENGLISH MODEL
model = tweetnlp.load_model('sentiment')  # Or `model = tweetnlp.Sentiment()`
model.sentiment("Yes, including Medicare and social security saving?")  # Or `model.predict`
# >>> {'label': 'positive'}
model.sentiment("Yes, including Medicare and social security saving?", return_probability=True)
# >>> {'label': 'positive', 'probability': {'negative': 0.004584966693073511, 'neutral': 0.19360853731632233, 'positive': 0.8018065094947815}}

# MULTILINGUAL MODEL
model = tweetnlp.load_model('sentiment', multilingual=True)  # Or `model = tweetnlp.Sentiment(multilingual=True)`
model.sentiment("天気が良いとやっぱり気持ち良いなあ")
# >>> {'label': 'positive'}
model.sentiment("天気が良いとやっぱり気持ち良いなあ", return_probability=True)
# >>> {'label': 'positive', 'probability': {'negative': 0.028369612991809845, 'neutral': 0.08128828555345535, 'positive': 0.8903420567512512}}

# GET DATASET (ENGLISH)
dataset, label2id = tweetnlp.load_dataset('sentiment')
# GET DATASET (MULTILINGUAL)
for l in ['all', 'arabic', 'english', 'french', 'german', 'hindi', 'italian', 'portuguese', 'spanish']:
    dataset_multilingual, label2id_multilingual = tweetnlp.load_dataset('sentiment', multilingual=True, task_language=l)
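
When return_probability=True is used, the result is a plain dictionary, so it can be post-processed with ordinary Python. The sketch below ranks labels by probability; `top_k` is a hypothetical helper (not part of tweetnlp), and the sample dictionary is copied from the English sentiment output shown above.

```python
def top_k(prediction, k=2):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(prediction["probability"].items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Copied from the English sentiment example; no model is loaded here.
prediction = {
    "label": "positive",
    "probability": {
        "negative": 0.004584966693073511,
        "neutral": 0.19360853731632233,
        "positive": 0.8018065094947815,
    },
}

ranking = top_k(prediction, k=2)
print(ranking)
```

The same helper applies to any task whose output has a 'probability' dictionary, for example emoji prediction with its 20 labels.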

Irony Detection: This is a binary classification task where, given a tweet, the goal is to detect whether it is ironic or not. It is based on the Irony Detection dataset from the SemEval 2018 task (check the paper here).

import tweetnlp

# MODEL
model = tweetnlp.load_model('irony')  # Or `model = tweetnlp.Irony()`
model.irony('If you wanna look like a badass, have drama on social media')  # Or `model.predict`
# >>> {'label': 'irony'}
model.irony('If you wanna look like a badass, have drama on social media', return_probability=True)
# >>> {'label': 'irony', 'probability': {'non_irony': 0.08390884101390839, 'irony': 0.9160911440849304}}

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('irony')

Hate Speech Detection: The hate speech detection task consists of detecting whether a tweet is hateful towards a target community. The underlying model is based on a unified collection of hate speech detection datasets (see the reference paper).

import tweetnlp

# MODEL
model = tweetnlp.load_model('hate')  # Or `model = tweetnlp.Hate()`
model.hate('Whoever just unfollowed me you a bitch')  # Or `model.predict`
# >>> {'label': 'non-hate'}
model.hate('Whoever just unfollowed me you a bitch', return_probability=True)
# >>> {'label': 'non-hate', 'probability': {'non-hate': 0.7263831496238708, 'hate': 0.27361682057380676}}

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('hate')

Offensive Language Identification: This task consists of identifying whether some form of offensive language is present in a tweet. For our benchmark we rely on the SemEval-2019 OffensEval dataset (check the paper here).

import tweetnlp

# MODEL
model = tweetnlp.load_model('offensive')  # Or `model = tweetnlp.Offensive()`
model.offensive("All two of them taste like ass.")  # Or `model.predict`
# >>> {'label': 'offensive'}
model.offensive("All two of them taste like ass.", return_probability=True)
# >>> {'label': 'offensive', 'probability': {'non-offensive': 0.16420328617095947, 'offensive': 0.8357967734336853}}

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('offensive')

Emoji Prediction: The goal of emoji prediction is to predict the final emoji of a given tweet. The dataset used to fine-tune our models is the TweetEval adaptation of the SemEval 2018 task on Emoji Prediction (check the paper here), including 20 emoji as labels (❤, 😍, 😂, 💕, 🔥, 😊, 😎, ✨, 💙, 😘, 📷, 🇺🇸, ☀, 💜, 😉, 💯, 😁, 🎄, 📸, 😜).

import tweetnlp

# MODEL
model = tweetnlp.load_model('emoji')  # Or `model = tweetnlp.Emoji()`
model.emoji('Beautiful sunset last night from the pontoon @TupperLakeNY')  # Or `model.predict`
# >>> {'label': '📷'}
model.emoji('Beautiful sunset last night from the pontoon @TupperLakeNY', return_probability=True)
# >>> {'label': '📷',
#  'probability': {'❤': 0.13197319209575653,
#   '😍': 0.11246423423290253,
#   '😂': 0.008415069431066513,
#   '💕': 0.04842926934361458,
#   '🔥': 0.014528146013617516,
#   '😊': 0.1509675830602646,
#   '😎': 0.08625403046607971,
#   '✨': 0.01616635173559189,
#   '💙': 0.07396604865789413,
#   '😘': 0.03033279813826084,
#   '📷': 0.16525287926197052,
#   '🇺🇸': 0.020336611196398735,
#   '☀': 0.00799981877207756,
#   '💜': 0.016111424192786217,
#   '😉': 0.012984540313482285,
#   '💯': 0.012557178735733032,
#   '😁': 0.031386848539114,
#   '🎄': 0.006829539313912392,
#   '📸': 0.04188741743564606,
#   '😜': 0.011156936176121235}}

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('emoji')

Emotion Recognition: Given a tweet, this task consists of associating it with its most appropriate emotion. As a reference dataset we use the SemEval 2018 task on Affect in Tweets (check the paper here). The latest multi-label model includes eleven emotion types.

import tweetnlp

# Note: the explicit `multi_label` argument below is assumed by analogy with topic classification; it selects between the two emotion models.

# MULTI-LABEL MODEL
model = tweetnlp.load_model('emotion', multi_label=True)  # Or `model = tweetnlp.Emotion(multi_label=True)`
model.emotion('I love swimming for the same reason I love meditating...the feeling of weightlessness.')  # Or `model.predict`
# >>> {'label': 'joy'}
# Note: the probability of the multi-label model is the output of a sigmoid function on the binary prediction of whether each emotion is positive or negative.
model.emotion('I love swimming for the same reason I love meditating...the feeling of weightlessness.', return_probability=True)
# >>> {'label': 'joy',
#  'probability': {'anger': 0.00025800734874792397,
#   'anticipation': 0.0005329723935574293,
#   'disgust': 0.00026112011983059347,
#   'fear': 0.00027552215033210814,
#   'joy': 0.7721399068832397,
#   'love': 0.1806265264749527,
#   'optimism': 0.04208092764019966,
#   'pessimism': 0.00025325192837044597,
#   'sadness': 0.0006160663324408233,
#   'surprise': 0.0005619609728455544,
#   'trust': 0.002393839880824089}}

# SINGLE-LABEL MODEL
model = tweetnlp.load_model('emotion', multi_label=False)  # Or `model = tweetnlp.Emotion(multi_label=False)`
model.emotion('I love swimming for the same reason I love meditating...the feeling of weightlessness.')  # Or `model.predict`
# >>> {'label': 'optimism'}
# Note: the probability of the single-label model is the softmax over the labels.
model.emotion('I love swimming for the same reason I love meditating...the feeling of weightlessness.', return_probability=True)
# >>> {'label': 'optimism', 'probability': {'joy': 0.01367587223649025, 'optimism': 0.7345258593559265, 'anger': 0.1770714670419693, 'sadness': 0.07472680509090424}}

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('emotion')

WARNING: The single-label and multi-label emotion models have different label sets (single-label has four classes, 'joy'/'optimism'/'anger'/'sadness', while multi-label has eleven classes, 'anger'/'anticipation'/'disgust'/'fear'/'joy'/'love'/'optimism'/'pessimism'/'sadness'/'surprise'/'trust').

Named Entity Recognition

This module consists of a named-entity recognition (NER) model specifically trained for tweets. The model is instantiated by tweetnlp.load_model("ner"), and the prediction is run by giving a text or a list of texts as argument to the ner function (check the paper here, or the HuggingFace dataset page).

import tweetnlp

# MODEL
model = tweetnlp.load_model('ner')  # Or `model = tweetnlp.NER()`
model.ner('Jacob Collier is a Grammy-awarded English artist from London.')  # Or `model.predict`
# >>> [{'type': 'person', 'entity': 'Jacob Collier'}, {'type': 'event', 'entity': ' Grammy'}, {'type': 'location', 'entity': ' London'}]
# Note: the probability for the predicted entity is the mean of the probabilities over the sub-tokens representing the entity.
model.ner('Jacob Collier is a Grammy-awarded English artist from London.', return_probability=True)  # Or `model.predict`
# >>> [
#   {'type': 'person', 'entity': 'Jacob Collier', 'probability': 0.9905318220456442},
#   {'type': 'event', 'entity': ' Grammy', 'probability': 0.19164378941059113},
#   {'type': 'location', 'entity': ' London', 'probability': 0.9607000350952148}
# ]

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('ner')
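
The note in the NER example says an entity's probability is the mean over the sub-tokens that make up the entity. A small sketch of that aggregation is given below; the sub-token strings and probabilities are invented for illustration, and `entity_probability` is a hypothetical helper, not a tweetnlp function.

```python
from statistics import mean


def entity_probability(sub_token_probs):
    """Average the per-sub-token probabilities of one predicted entity."""
    return mean(sub_token_probs)


# Hypothetical split of one entity into sub-tokens with made-up probabilities.
sub_tokens = ["Jac", "ob", " Collier"]
sub_token_probs = [0.99, 0.98, 0.97]

score = entity_probability(sub_token_probs)
print(sub_tokens, round(score, 4))
```

One consequence of averaging is that a single uncertain sub-token lowers the score of the whole entity.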

Question Answering

This module consists of a question answering model specifically trained for tweets. The model is instantiated by tweetnlp.load_model("question_answering"), and the prediction is run by giving a question or a list of questions along with a context or a list of contexts as arguments to the question_answering function (check the paper here, or the HuggingFace dataset page).

import tweetnlp

# MODEL
model = tweetnlp.load_model('question_answering')  # Or `model = tweetnlp.QuestionAnswering()`
model.question_answering(
    question='who created the post as we know it today?',
    context="'So much of The Post is Ben,' Mrs. Graham said in 1994, three years after Bradlee retired as editor. 'He created it as we know it today.'— Ed O'Keefe (@edatpost) October 21, 2014"
)  # Or `model.predict`
# >>> {'generated_text': 'ben'}

# GET DATASET
dataset = tweetnlp.load_dataset('question_answering')

Question Answer Generation

This module consists of a question and answer pair generation model specifically trained for tweets. The model is instantiated by tweetnlp.load_model("question_answer_generation"), and the prediction is run by giving a context or a list of contexts as argument to the question_answer_generation function (check the paper here, or the HuggingFace dataset page).

import tweetnlp

# MODEL
model = tweetnlp.load_model('question_answer_generation')  # Or `model = tweetnlp.QuestionAnswerGeneration()`
model.question_answer_generation(
    text="'So much of The Post is Ben,' Mrs. Graham said in 1994, three years after Bradlee retired as editor. 'He created it as we know it today.'— Ed O'Keefe (@edatpost) October 21, 2014"
)  # Or `model.predict`
# >>> [
#     {'question': 'who created the post?', 'answer': 'ben'},
#     {'question': 'what did ben do in 1994?', 'answer': 'he retired as editor'}
# ]

# GET DATASET
dataset = tweetnlp.load_dataset('question_answer_generation')

Language Modeling

The masked language model predicts the masked token in the given sentence. It is instantiated by tweetnlp.load_model('language_model'), and the prediction is run by giving a text or a list of texts as argument to the mask_prediction function. Please make sure that each text contains a <mask> token, since that is the token the model is asked to predict.

import tweetnlp
model = tweetnlp.load_model('language_model')  # Or `model = tweetnlp.LanguageModel()`
model.mask_prediction("How many more <mask> until opening day? ?", best_n=2)  # Or `model.predict`
# >>> {'best_tokens': ['days', 'hours'],
#  'best_scores': [5.498564104033932e-11, 4.906026140893971e-10],
#  'best_sentences': ['How many more days until opening day? ?',
#   'How many more hours until opening day? ?']}
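
Since every input to mask_prediction needs a <mask> token, it can help to validate texts before handing them to the model. The sketch below is one way to do that check in plain Python; `check_mask` is a hypothetical helper, not part of tweetnlp, and how the library itself reacts to a missing token is not covered here.

```python
MASK_TOKEN = "<mask>"


def check_mask(texts):
    """Return the inputs as a list, raising ValueError if any text lacks the mask token."""
    if isinstance(texts, str):
        texts = [texts]  # accept a single text as well as a list of texts
    missing = [t for t in texts if MASK_TOKEN not in t]
    if missing:
        raise ValueError(f"{len(missing)} text(s) contain no {MASK_TOKEN} token: {missing}")
    return list(texts)


valid = check_mask("How many more <mask> until opening day?")
print(valid)

try:
    check_mask(["I love <mask>!", "no mask here"])
except ValueError as err:
    print(f"rejected: {err}")
```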

Tweet Embedding

The tweet embedding model produces a fixed-length embedding for a tweet. The embedding represents the semantics of the tweet, and it can be used for semantic search over tweets by using the similarity between embeddings. The model is instantiated by tweetnlp.load_model('sentence_embedding'), and the prediction is run by passing a text or a list of texts as argument to the embedding function.

Get Embedding

import tweetnlp
model = tweetnlp.load_model('sentence_embedding')  # Or `model = tweetnlp.SentenceEmbedding()`

# Get sentence embedding
tweet = "I will never understand the decision making of the people of Alabama. Their new Senator is a definite downgrade. You have served with honor.  Well done."
vectors = model.embedding(tweet)
vectors.shape
# >>> (768,)

# Get sentence embedding (multiple inputs)
tweet_corpus = [
    "Free, fair elections are the lifeblood of our democracy. Charges of unfairness are serious. But calling an election unfair does not make it so. Charges require specific allegations and then proof. We have neither here.",
    "Trump appointed judge Stephanos Bibas ",
    "If your members can go to Puerto Rico they can get their asses back in the classroom. @CTULocal1",
    "@PolitiBunny @CTULocal1 Political leverage, science said schools could reopen, teachers and unions protested to keep'em closed and made demands for higher wages and benefits, they're usin Covid as a crutch at the expense of life and education.",
    "Congratulations to all the exporters on achieving record exports in Dec 2020 with a growth of 18 % over the previous year. Well done &amp; keep up this trend. A major pillar of our govt's economic policy is export enhancement &amp; we will provide full support to promote export culture.",
    "@ImranKhanPTI Pakistan seems a worst country in term of exporting facilities. I am a small business man and if I have to export a t-shirt having worth of $5 to USA or Europe. Postal cost will be around $30. How can we grow as an exporting country if this situation prevails. Think about it. #PM",
    "The thing that doesn’t sit right with me about “nothing good happened in 2020” is that it ignores the largest protest movement in our history. The beautiful, powerful Black Lives Matter uprising reached every corner of the country and should be central to our look back at 2020.",
    "@JoshuaPotash I kinda said that in the 2020 look back for @washingtonpost",
    "Is this a confirmation from Q that Lin is leaking declassified intelligence to the public? I believe so. If @realDonaldTrump didn’t approve of what @LLinWood is doing he would have let us know a lonnnnnng time ago. I’ve always wondered why Lin’s Twitter handle started with “LLin” https://t.co/0G7zClOmi2",
    "@ice_qued @realDonaldTrump @LLinWood Yeah 100%",
    "Tomorrow is my last day as Senator from Alabama.  I believe our opportunities are boundless when we find common ground. As we swear in a new Congress &amp; a new President, demand from them that they do just that &amp; build a stronger, more just society.  It’s been an honor to serve you."
    # Note: there is no comma after the string above, so Python joins it with the next string into a single corpus entry (12 entries in total).
    "The mask cult can’t ever admit masks don’t work because their ideology is based on feeling like a “good person”  Wearing a mask makes them a “good person” &amp; anyone who disagrees w/them isn’t  They can’t tolerate any idea that makes them feel like their self-importance is unearned",
    "@ianmSC Beyond that, they put such huge confidence in masks so early with no strong evidence that they have any meaningful benefit, they don’t want to backtrack or admit they were wrong. They put the cart before the horse, now desperate to find any results that match their hypothesis.",
]
vectors = model.embedding(tweet_corpus, batch_size=4)
vectors.shape
# >>> (12, 768)

Similarity Search

sims = []
for n, i in enumerate(tweet_corpus):
    _sim = model.similarity(tweet, i)
    sims.append([n, _sim])
print(f'anchor tweet: {tweet}\n')
# Print the three corpus tweets most similar to the anchor tweet
for m, (n, s) in enumerate(sorted(sims, key=lambda x: x[1], reverse=True)[:3]):
    print(f' - top {m}: {tweet_corpus[n]}\n - similarity: {s}\n')

# >>> anchor tweet: I will never understand the decision making of the people of Alabama. Their new Senator is a definite downgrade. You have served with honor.  Well done.
#
#  - top 0: Tomorrow is my last day as Senator from Alabama.  I believe our opportunities are boundless when we find common ground. As we swear in a new Congress &amp; a new President, demand from them that they do just that &amp; build a stronger, more just society.  It’s been an honor to serve you.The mask cult can’t ever admit masks don’t work because their ideology is based on feeling like a “good person”  Wearing a mask makes them a “good person” &amp; anyone who disagrees w/them isn’t  They can’t tolerate any idea that makes them feel like their self-importance is unearned
#  - similarity: 0.7480925982953287
#
#  - top 1: Trump appointed judge Stephanos Bibas 
#  - similarity: 0.6289173306344258
#
#  - top 2: Free, fair elections are the lifeblood of our democracy. Charges of unfairness are serious. But calling an election unfair does not make it so. Charges require specific allegations and then proof. We have neither here.
#  - similarity: 0.6017154109745276
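
The search above relies on model.similarity. As a rough illustration of what a similarity search over embeddings does, here is a sketch that ranks vectors by cosine similarity. Cosine is an assumption made for this example (the text does not state which measure the library uses), and the toy 3-dimensional vectors stand in for the 768-dimensional tweet embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(anchor, corpus, top_n=3):
    """Return the top_n (index, similarity) pairs, most similar first."""
    scored = [(i, cosine_similarity(anchor, vec)) for i, vec in enumerate(corpus)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy vectors; real embeddings would come from model.embedding(...).
anchor = [1.0, 0.0, 1.0]
corpus = [[0.0, 1.0, 0.0], [1.0, 0.1, 0.9], [0.5, 0.5, 0.5]]

ranking = rank_by_similarity(anchor, corpus, top_n=2)
for index, score in ranking:
    print(f"corpus[{index}] similarity: {score:.4f}")
```

For a larger corpus, embedding all tweets once with model.embedding and comparing the stored vectors avoids re-encoding the anchor for every pair.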

Resources & Custom Model Loading

Here is a table of the default model used in each task.

Task	Model	Dataset
Topic Classification (single-label)	cardiffnlp/twitter-roberta-base-dec2021-tweet-topic-single-all-all	cardiffnlp/tweet_topic_single
Topic Classification (multi-label)	cardiffnlp/twitter-roberta-base-dec2021-tweet-topic-multi-all	cardiffnlp/tweet_topic_multi
Sentiment Analysis (multilingual)	cardiffnlp/twitter-xlm-roberta-base-sentiment	cardiffnlp/tweet_sentiment_multilingual
Sentiment Analysis	cardiffnlp/twitter-roberta-base-sentiment-latest	tweet_eval
Irony Detection	cardiffnlp/twitter-roberta-base-irony	tweet_eval
Hate Detection	cardiffnlp/twitter-roberta-base-hate-hate-latest	tweet_eval
Offensive Detection	cardiffnlp/twitter-roberta-base	tweet_eval
Emoji Prediction	cardiffnlp/twitter-roberta-base-emoji	tweet_eval
Emotion Analysis (single-label)	cardiffnlp/twitter-roberta-base-emotion	tweet_eval
Emotion Analysis (multi-label)	cardiffnlp/twitter-roberta-base-emotion-multilabel-latest	TBA
Named Entity Recognition	tner/roberta-large-tweetner7-all	tner/tweetner7
Question Answering	lmqg/t5-small-tweetqa-qa	lmqg/qg_tweetqa
Question Answer Generation	lmqg/t5-base-tweetqa-qag	lmqg/qag_tweetqa
Language Modeling	cardiffnlp/twitter-roberta-base-2021-124m	TBA
Tweet Embedding	cambridgeltl/tweet-roberta-base-embeddings-v1	TBA

To use another model, either a local one or one from the HuggingFace model hub, simply provide the model path or alias to the load_model function. Below is an example of loading a model for NER.

import tweetnlp
tweetnlp.load_model('ner', model_name='tner/twitter-roberta-base-2019-90m-tweetner7-continuous')

Fine-tune Model

TweetNLP provides an easy interface to fine-tune language models on the supported datasets, using HuggingFace for model hosting and fine-tuning and Ray Tune for parameter search.

Supported tasks: sentiment, offensive, irony, hate, emotion, topic_classification

The experimental results obtained with tweetnlp's trainer can be found in the following table. The results are competitive and can be used as baselines for each task. See the leaderboard page to learn more about the results.

task	language_model	eval_f1	eval_f1_macro	eval_accuracy	link
emoji	cardiffnlp/twitter-roberta-base-2021-124m	0.46	0.35	0.46	cardiffnlp/twitter-roberta-base-2021-124m-emoji
emotion	cardiffnlp/twitter-roberta-base-2021-124m	0.83	0.79	0.83	cardiffnlp/twitter-roberta-base-2021-124m-emotion
hate	cardiffnlp/twitter-roberta-base-2021-124m	0.56	0.53	0.56	cardiffnlp/twitter-roberta-base-2021-124m-hate
irony	cardiffnlp/twitter-roberta-base-2021-124m	0.79	0.78	0.79	cardiffnlp/twitter-roberta-base-2021-124m-irony
offensive	cardiffnlp/twitter-roberta-base-2021-124m	0.86	0.82	0.86	cardiffnlp/twitter-roberta-base-2021-124m
sentiment	cardiffnlp/twitter-roberta-base-2021-124m	0.71	0.72	0.71	cardiffnlp/twitter-roberta-base-2021-124m-sentiment
topic_classification (single)	cardiffnlp/twitter-roberta-base-2021-124m	0.9	0.8	0.9	cardiffnlp/twitter-roberta-base-2021-124m-topic-single
topic_classification (multi)	cardiffnlp/twitter-roberta-base-2021-124m	0.75	0.56	0.54	cardiffnlp/twitter-roberta-base-2021-124m-topic-multi
sentiment (multilingual)	cardiffnlp/twitter-xlm-roberta-base	0.69	0.69	0.69	cardiffnlp/twitter-xlm-roberta-base-sentiment-multilingual
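
In the single-label rows of the table, eval_f1 and eval_accuracy are identical, which is what one would expect if eval_f1 is micro-averaged; the source does not say so explicitly, so treat that as an assumption. The sketch below computes micro and macro F1 on invented labels to show why micro F1 coincides with accuracy in single-label classification while macro F1 does not.

```python
def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label classification."""
    labels = sorted(set(y_true) | set(y_pred))
    tp_total = fp_total = fn_total = 0
    per_class_f1 = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        tp_total, fp_total, fn_total = tp_total + tp, fp_total + fp, fn_total + fn
        denom = 2 * tp + fp + fn
        per_class_f1.append(2 * tp / denom if denom else 0.0)
    micro = 2 * tp_total / (2 * tp_total + fp_total + fn_total)  # pooled counts
    macro = sum(per_class_f1) / len(per_class_f1)  # unweighted mean over classes
    return micro, macro


# Invented gold labels and predictions; not taken from any dataset.
y_true = ["irony", "irony", "non_irony", "non_irony", "non_irony"]
y_pred = ["irony", "non_irony", "non_irony", "non_irony", "non_irony"]

micro, macro = f1_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}, accuracy = {accuracy:.3f}")
```

Macro F1 weights every class equally, so it drops when a minority class is predicted poorly, which is consistent with eval_f1_macro being lower than eval_f1 in most rows.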

Example

The following example reproduces our irony model cardiffnlp/twitter-roberta-base-2021-124m-irony.

import logging
import tweetnlp

logging.basicConfig(format='%(asctime)s %(levelname)-8s %(message)s', level=logging.INFO, datefmt='%Y-%m-%d %H:%M:%S')

# load dataset
dataset, label_to_id = tweetnlp.load_dataset("irony")
# load trainer class
trainer_class = tweetnlp.load_trainer("irony")
# setup trainer
trainer = trainer_class(
    language_model='cardiffnlp/twitter-roberta-base-2021-124m',  # language model to fine-tune
    dataset=dataset,
    label_to_id=label_to_id,
    max_length=128,
    split_test='test',
    split_train='train',
    split_validation='validation',
    output_dir='model_ckpt/irony'
)
# start model fine-tuning with parameter optimization
trainer.train(
    eval_step=50,  # every `eval_step` steps, models are validated on the validation set
    n_trials=10,  # number of trials for parameter optimization
    search_range_lr=[1e-6, 1e-4],  # search space for the learning rate (min and max value)
    search_range_epoch=[1, 6],  # search space for the number of epochs (min and max value)
    search_list_batch=[4, 8, 16, 32, 64]  # search space for the batch size (list of integers to test)
)
# evaluate model on the test set
trainer.evaluate()
# >>> {
#   "eval_loss": 1.3228046894073486,
#   "eval_f1": 0.7959183673469388,
#   "eval_f1_macro": 0.791350632069195,
#   "eval_accuracy": 0.7959183673469388,
#   "eval_runtime": 2.2267,
#   "eval_samples_per_second": 352.084,
#   "eval_steps_per_second": 44.01
# }
# save model locally (saved at `{output_dir}/best_model` by default)
trainer.save_model()
# run prediction
trainer.predict('If you wanna look like a badass, have drama on social media')
# >>> {'label': 'irony'}
# push your model to the HuggingFace hub
trainer.push_to_hub(hf_organization='cardiffnlp', model_alias='twitter-roberta-base-2021-124m-irony')

The saved checkpoint can be loaded as a custom model as follows.

import tweetnlp
model = tweetnlp.load_model('irony', model_name="model_ckpt/irony/best_model")

If split_validation is not given, the trainer does a single run with default parameters, without any parameter search.
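
To make the search-space arguments of trainer.train more concrete, the sketch below draws trial configurations from the same ranges with a plain random search. This is only an illustration: tweetnlp delegates the actual search to Ray Tune, `sample_trial` is a hypothetical helper, and the log-uniform sampling of the learning rate is an assumption, not the library's documented behavior.

```python
import math
import random


def sample_trial(rng, search_range_lr, search_range_epoch, search_list_batch):
    """Draw one hyperparameter configuration from the given search space."""
    lr_min, lr_max = search_range_lr
    # Assumption: sample the learning rate log-uniformly between min and max.
    learning_rate = math.exp(rng.uniform(math.log(lr_min), math.log(lr_max)))
    epoch = rng.randint(search_range_epoch[0], search_range_epoch[1])  # inclusive bounds
    batch_size = rng.choice(search_list_batch)
    return {"learning_rate": learning_rate, "epoch": epoch, "batch_size": batch_size}


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_trials = 10
trials = [
    sample_trial(rng, [1e-6, 1e-4], [1, 6], [4, 8, 16, 32, 64])
    for _ in range(n_trials)
]
for trial in trials[:3]:
    print(trial)
```

In a real search, each configuration would be trained and scored on the validation split, and the best-scoring one kept, which is why a validation split is needed for the search to run.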

Reference Paper

For more details, please read the accompanying TweetNLP reference paper. If you use TweetNLP in your research, please use the following bib entry to cite the reference paper:

@inproceedings{camacho-collados-etal-2022-tweetnlp,
    title={{T}weet{NLP}: {C}utting-{E}dge {N}atural {L}anguage {P}rocessing for {S}ocial {M}edia},
    author={Camacho-Collados, Jose and Rezaee, Kiamehr and Riahi, Talayeh and Ushio, Asahi and Loureiro, Daniel and Antypas, Dimosthenis and Boisson, Joanne and Espinosa-Anke, Luis and Liu, Fangyu and Mart{\'i}nez-C{\'a}mara, Eugenio and others},
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = nov,
    year = "2022",
    address = "Abu Dhabi, U.A.E.",
    publisher = "Association for Computational Linguistics",
}