TweetNLP

TweetNLP for all the NLP enthusiasts working on Twitter and social media! The Python library tweetnlp provides a collection of useful tools to analyze and understand tweets, such as sentiment analysis, emoji prediction, and named-entity recognition, powered by state-of-the-art language models specialized on social media.

News (December 2022): We presented a TweetNLP demo paper ("TweetNLP: Cutting-Edge Natural Language Processing for Social Media") at EMNLP 2022. The final version can be found here.

TweetNLP Hugging Face page: All the main TweetNLP models can be found here on Hugging Face.

Resources:

Quick tour with Colab notebook:
Play with the TweetNLP online demo: link
EMNLP 2022 paper: link
2nd Cardiff NLP Summer Workshop tutorial:
2nd Cardiff NLP Summer Workshop tutorial (solution):

Table of contents:

Load Model & Dataset
Fine-tune Model

Get Started

Install TweetNLP via pip on your console.

pip install tweetnlp

Model & Dataset

In this section, you will learn how to get the models and datasets with tweetnlp. The models follow the HuggingFace model format and the datasets are in the HuggingFace datasets format. Easy introductions to HuggingFace models and datasets can be found on the HuggingFace webpage, so please check them out if you are new to HuggingFace.

Tweet Classification

The classification module consists of seven different tasks (topic classification, sentiment analysis, irony detection, hate speech detection, offensive language detection, emoji prediction, and emotion analysis). In each example, the model is instantiated by tweetnlp.load_model("task-name"), and the prediction is run by passing a text or a list of texts as an argument to the corresponding function.

Topic Classification: The aim of this task is, given a tweet, to assign the topics related to its content. The task is formulated as a supervised multi-label classification problem where each tweet is assigned one or more topics from a total of 19 available topics. The topics were carefully curated based on Twitter trends with the aim of being broad and general, and consist of classes such as arts and culture, music, or sports. Our internally annotated dataset contains over 10K manually labeled tweets (check the paper here, or the HuggingFace dataset page).

import tweetnlp

# MULTI-LABEL MODEL
model = tweetnlp.load_model('topic_classification')  # Or `model = tweetnlp.TopicClassification()`
model.topic("Jacob Collier is a Grammy-awarded English artist from London.")  # Or `model.predict`
>>> {'label': ['celebrity_&_pop_culture', 'music']}
# Note: the probability of the multi-label model is the output of a sigmoid function on the binary prediction of whether each topic is positive or negative.
model.topic("Jacob Collier is a Grammy-awarded English artist from London.", return_probability=True)
>>> {'label': ['celebrity_&_pop_culture', 'music'],
 'probability': {'arts_&_culture': 0.037371691316366196,
  'business_&_entrepreneurs': 0.010188567452132702,
  'celebrity_&_pop_culture': 0.92448890209198,
  'diaries_&_daily_life': 0.03425711765885353,
  'family': 0.00796138122677803,
  'fashion_&_style': 0.020642118528485298,
  'film_tv_&_video': 0.08062587678432465,
  'fitness_&_health': 0.006343095097690821,
  'food_&_dining': 0.0042883665300905704,
  'gaming': 0.004327300935983658,
  'learning_&_educational': 0.010652057826519012,
  'music': 0.8291937112808228,
  'news_&_social_concern': 0.24688217043876648,
  'other_hobbies': 0.020671198144555092,
  'relationships': 0.020371075719594955,
  'science_&_technology': 0.0170074962079525,
  'sports': 0.014291072264313698,
  'travel_&_adventure': 0.010423899628221989,
  'youth_&_student_life': 0.008605164475739002}}

# SINGLE-LABEL MODEL
model = tweetnlp.load_model('topic_classification', multi_label=False)  # Or `model = tweetnlp.TopicClassification(multi_label=False)`
model.topic("Jacob Collier is a Grammy-awarded English artist from London.")
>>> {'label': 'pop_culture'}
# Note: the probability of the single-label model is the softmax over the labels.
model.topic("Jacob Collier is a Grammy-awarded English artist from London.", return_probability=True)
>>> {'label': 'pop_culture',
 'probability': {'arts_&_culture': 9.20625461731106e-05,
  'business_&_entrepreneurs': 6.916998972883448e-05,
  'pop_culture': 0.9995898604393005,
  'daily_life': 0.00011083036952186376,
  'sports_&_gaming': 8.668467489769682e-05,
  'science_&_technology': 5.152115045348182e-05}}

# GET DATASET
dataset_multi_label, label2id_multi_label = tweetnlp.load_dataset('topic_classification')
dataset_single_label, label2id_single_label = tweetnlp.load_dataset('topic_classification', multi_label=False)
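
As a rough illustration of the two probability notes above, the sketch below shows how independent sigmoid-style scores with a cut-off yield a list of labels, whereas a softmax-style distribution yields a single label. The 0.5 threshold and all helper names are assumptions made for illustration; they are not taken from the tweetnlp source.

```python
import math


def sigmoid(score):
    """Map one raw score to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))


def softmax(scores):
    """Map raw scores to probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_label_decision(probability, threshold=0.5):
    """Keep every label whose probability reaches the (assumed) threshold."""
    return [label for label, p in probability.items() if p >= threshold]


def single_label_decision(probability):
    """Keep only the most probable label."""
    return max(probability, key=probability.get)


# A few of the multi-label probabilities printed above.
multi_probs = {
    'arts_&_culture': 0.037371691316366196,
    'celebrity_&_pop_culture': 0.92448890209198,
    'music': 0.8291937112808228,
    'news_&_social_concern': 0.24688217043876648,
}
print(multi_label_decision(multi_probs))

# Two of the single-label probabilities printed above.
single_probs = {'pop_culture': 0.9995898604393005, 'daily_life': 0.00011083036952186376}
print(single_label_decision(single_probs))
```

Because each sigmoid score is independent, the multi-label probabilities do not need to sum to one, which is why both 'celebrity_&_pop_culture' and 'music' can score highly at the same time.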

Sentiment Analysis: The sentiment analysis task integrated in TweetNLP is a simplified version where the goal is to predict the sentiment of a tweet with one of three labels: positive, neutral, or negative. The base dataset for English is the unified TweetEval version of the SemEval-2017 dataset from the task on Sentiment Analysis in Twitter (check the paper here).

import tweetnlp

# ENGLISH MODEL
model = tweetnlp.load_model('sentiment')  # Or `model = tweetnlp.Sentiment()`
model.sentiment("Yes, including Medicare and social security saving?")  # Or `model.predict`
>>> {'label': 'positive'}
model.sentiment("Yes, including Medicare and social security saving?", return_probability=True)
>>> {'label': 'positive', 'probability': {'negative': 0.004584966693073511, 'neutral': 0.19360853731632233, 'positive': 0.8018065094947815}}

# MULTILINGUAL MODEL
model = tweetnlp.load_model('sentiment', multilingual=True)  # Or `model = tweetnlp.Sentiment(multilingual=True)`
model.sentiment("天気が良いとやっぱり気持ち良いなあ")
>>> {'label': 'positive'}
model.sentiment("天気が良いとやっぱり気持ち良いなあ", return_probability=True)
>>> {'label': 'positive', 'probability': {'negative': 0.028369612991809845, 'neutral': 0.08128828555345535, 'positive': 0.8903420567512512}}

# GET DATASET (ENGLISH)
dataset, label2id = tweetnlp.load_dataset('sentiment')
# GET DATASET (MULTILINGUAL)
for l in ['all', 'arabic', 'english', 'french', 'german', 'hindi', 'italian', 'portuguese', 'spanish']:
    dataset_multilingual, label2id_multilingual = tweetnlp.load_dataset('sentiment', multilingual=True, task_language=l)
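
The dictionary returned with return_probability=True can be post-processed with plain Python. The sketch below ranks the labels and measures the gap between the two best ones, which can serve as a crude confidence signal. The helper names top_k and margin are hypothetical and not part of tweetnlp; the input is the English-model output printed above.

```python
def top_k(prediction, k=2):
    """Return the k most probable (label, probability) pairs, highest first."""
    ranked = sorted(prediction['probability'].items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


def margin(prediction):
    """Gap between the best and the second-best probability."""
    (_, best), (_, runner_up) = top_k(prediction, k=2)
    return best - runner_up


# Output of the English sentiment model shown above.
prediction = {
    'label': 'positive',
    'probability': {
        'negative': 0.004584966693073511,
        'neutral': 0.19360853731632233,
        'positive': 0.8018065094947815,
    },
}
print(top_k(prediction))
print(round(margin(prediction), 3))
```

A small margin suggests the model is torn between two labels, so such tweets may deserve manual review.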

Irony Detection: This is a binary classification task where, given a tweet, the goal is to detect whether it is ironic or not. It is based on the irony detection dataset from the SemEval 2018 task (check the paper here).

import tweetnlp

# MODEL
model = tweetnlp.load_model('irony')  # Or `model = tweetnlp.Irony()`
model.irony('If you wanna look like a badass, have drama on social media')  # Or `model.predict`
>>> {'label': 'irony'}
model.irony('If you wanna look like a badass, have drama on social media', return_probability=True)
>>> {'label': 'irony', 'probability': {'non_irony': 0.08390884101390839, 'irony': 0.9160911440849304}}

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('irony')

Hate Speech Detection: The hate speech detection task consists of detecting whether a tweet is hateful towards a target community. The underlying model is based on a suite of unified hate speech detection datasets (see the reference paper).

import tweetnlp

# MODEL
model = tweetnlp.load_model('hate')  # Or `model = tweetnlp.Hate()`
model.hate('Whoever just unfollowed me you a bitch')  # Or `model.predict`
>>> {'label': 'not-hate'}
model.hate('Whoever just unfollowed me you a bitch', return_probability=True)
>>> {'label': 'non-hate', 'probability': {'non-hate': 0.7263831496238708, 'hate': 0.27361682057380676}}

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('hate')

Offensive Language Identification: This task consists of identifying whether some form of offensive language is present in a tweet. For our benchmark, we rely on the SemEval-2019 OffensEval dataset (check the paper here).

import tweetnlp

# MODEL
model = tweetnlp.load_model('offensive')  # Or `model = tweetnlp.Offensive()`
model.offensive("All two of them taste like ass.")  # Or `model.predict`
>>> {'label': 'offensive'}
model.offensive("All two of them taste like ass.", return_probability=True)
>>> {'label': 'offensive', 'probability': {'non-offensive': 0.16420328617095947, 'offensive': 0.8357967734336853}}

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('offensive')

Emoji Prediction: The goal of emoji prediction is to predict the final emoji of a given tweet. The dataset used to fine-tune our models is the TweetEval adaptation of the SemEval 2018 task on emoji prediction (check the paper here), including 20 emoji as labels (❤, 😍, 😂, 💕, 🔥, 😊, 😎, ✨, 💙, 😘, 📷, 🇺🇸, ☀, 💜, 😉, 💯, 😁, 🎄, 📸, 😜).

import tweetnlp

# MODEL
model = tweetnlp.load_model('emoji')  # Or `model = tweetnlp.Emoji()`
model.emoji('Beautiful sunset last night from the pontoon @TupperLakeNY')  # Or `model.predict`
>>> {'label': '📷'}
model.emoji('Beautiful sunset last night from the pontoon @TupperLakeNY', return_probability=True)
>>> {'label': '📷',
 'probability': {'❤': 0.13197319209575653,
  '😍': 0.11246423423290253,
  '😂': 0.008415069431066513,
  '💕': 0.04842926934361458,
  '🔥': 0.014528146013617516,
  '😊': 0.1509675830602646,
  '😎': 0.08625403046607971,
  '✨': 0.01616635173559189,
  '💙': 0.07396604865789413,
  '😘': 0.03033279813826084,
  '📷': 0.16525287926197052,
  '🇺🇸': 0.020336611196398735,
  '☀': 0.00799981877207756,
  '💜': 0.016111424192786217,
  '😉': 0.012984540313482285,
  '💯': 0.012557178735733032,
  '😁': 0.031386848539114,
  '🎄': 0.006829539313912392,
  '📸': 0.04188741743564606,
  '😜': 0.011156936176121235}}

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('emoji')

Emotion Recognition: Given a tweet, this task consists of associating it with its most appropriate emotion. As a reference dataset we use the SemEval 2018 task on Affect in Tweets (check the paper here). The latest multi-label model covers eleven emotion types.

import tweetnlp

# MULTI-LABEL MODEL
model = tweetnlp.load_model('emotion')  # Or `model = tweetnlp.Emotion()`
model.emotion('I love swimming for the same reason I love meditating...the feeling of weightlessness.')  # Or `model.predict`
>>> {'label': 'joy'}
# Note: the probability of the multi-label model is the output of a sigmoid function on the binary prediction of whether each emotion is positive or negative.
model.emotion('I love swimming for the same reason I love meditating...the feeling of weightlessness.', return_probability=True)
>>> {'label': 'joy',
 'probability': {'anger': 0.00025800734874792397,
  'anticipation': 0.0005329723935574293,
  'disgust': 0.00026112011983059347,
  'fear': 0.00027552215033210814,
  'joy': 0.7721399068832397,
  'love': 0.1806265264749527,
  'optimism': 0.04208092764019966,
  'pessimism': 0.00025325192837044597,
  'sadness': 0.0006160663324408233,
  'surprise': 0.0005619609728455544,
  'trust': 0.002393839880824089}}

# SINGLE-LABEL MODEL
model = tweetnlp.load_model('emotion')  # Or `model = tweetnlp.Emotion()`
model.emotion('I love swimming for the same reason I love meditating...the feeling of weightlessness.')  # Or `model.predict`
>>> {'label': 'joy'}
# Note: the probability of the single-label model is the softmax over the labels.
model.emotion('I love swimming for the same reason I love meditating...the feeling of weightlessness.', return_probability=True)
>>> {'label': 'optimism', 'probability': {'joy': 0.01367587223649025, 'optimism': 0.7345258593559265, 'anger': 0.1770714670419693, 'sadness': 0.07472680509090424}}

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('emotion')

WARNING: The single-label and multi-label emotion models have different label sets (the single-label model has four classes, 'joy'/'optimism'/'anger'/'sadness', while the multi-label model has eleven classes, 'joy'/'optimism'/'anger'/'sadness'/'love'/'trust'/'fear'/'surprise'/'anticipation'/'disgust'/'pessimism').
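
Code that switches between the two emotion models may want to check which labels it can rely on in both cases. A minimal sketch, using the label names that appear in the probability outputs above (the variable names are illustrative only):

```python
# Label names copied from the single-label and multi-label outputs above.
single_label_emotions = {'joy', 'optimism', 'anger', 'sadness'}
multi_label_emotions = {
    'anger', 'anticipation', 'disgust', 'fear', 'joy', 'love',
    'optimism', 'pessimism', 'sadness', 'surprise', 'trust',
}

# Labels that both models can produce.
shared = sorted(single_label_emotions & multi_label_emotions)
# Labels that only the multi-label model can produce.
only_multi = sorted(multi_label_emotions - single_label_emotions)

print(shared)
print(only_multi)
```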

Named Entity Recognition

This module consists of a named-entity recognition (NER) model specifically trained for tweets. The model is instantiated by tweetnlp.load_model("ner"), and the prediction is run by giving a text or a list of texts as an argument to the ner function (check the paper here, or the HuggingFace dataset page).

import tweetnlp

# MODEL
model = tweetnlp.load_model('ner')  # Or `model = tweetnlp.NER()`
model.ner('Jacob Collier is a Grammy-awarded English artist from London.')  # Or `model.predict`
>>> [{'type': 'person', 'entity': 'Jacob Collier'}, {'type': 'event', 'entity': ' Grammy'}, {'type': 'location', 'entity': ' London'}]
# Note: the probability for the predicted entity is the mean of the probabilities over the sub-tokens representing the entity.
model.ner('Jacob Collier is a Grammy-awarded English artist from London.', return_probability=True)  # Or `model.predict`
>>> [
  {'type': 'person', 'entity': 'Jacob Collier', 'probability': 0.9905318220456442},
  {'type': 'event', 'entity': ' Grammy', 'probability': 0.19164378941059113},
  {'type': 'location', 'entity': ' London', 'probability': 0.9607000350952148}
]

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('ner')
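
The note above says that an entity's probability is the mean over its sub-token probabilities. The sketch below illustrates that averaging and shows one way to drop low-confidence entities from the output. The sub-token values, the 0.5 cut-off, and the helper names are assumptions made for illustration.

```python
from statistics import mean


def entity_probability(sub_token_probabilities):
    """Average the per-sub-token probabilities of one entity."""
    return mean(sub_token_probabilities)


def confident_entities(entities, min_probability=0.5):
    """Keep only entities whose probability reaches the (assumed) cut-off."""
    return [entity for entity in entities if entity['probability'] >= min_probability]


# Hypothetical sub-token probabilities for a two-token entity.
print(entity_probability([0.9, 0.7]))

# NER output with probabilities, as printed above.
entities = [
    {'type': 'person', 'entity': 'Jacob Collier', 'probability': 0.9905318220456442},
    {'type': 'event', 'entity': ' Grammy', 'probability': 0.19164378941059113},
    {'type': 'location', 'entity': ' London', 'probability': 0.9607000350952148},
]
print([entity['entity'] for entity in confident_entities(entities)])
```

With this cut-off the low-scoring 'event' prediction is discarded while the person and location are kept.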

Question Answering

This module consists of a question answering model specifically trained for tweets. The model is instantiated by tweetnlp.load_model("question_answering"), and the prediction is run by giving a question or a list of questions, along with a context or a list of contexts, as arguments to the question_answering function (check the paper here, or the HuggingFace dataset page).

import tweetnlp

# MODEL
model = tweetnlp.load_model('question_answering')  # Or `model = tweetnlp.QuestionAnswering()`
model.question_answering(
  question='who created the post as we know it today?',
  context="'So much of The Post is Ben,' Mrs. Graham said in 1994, three years after Bradlee retired as editor. 'He created it as we know it today.'— Ed O'Keefe (@edatpost) October 21, 2014"
)  # Or `model.predict`
>>> {'generated_text': 'ben'}

# GET DATASET
dataset = tweetnlp.load_dataset('question_answering')

Question Answer Generation

This module consists of a question-and-answer pair generation model specifically trained for tweets. The model is instantiated by tweetnlp.load_model("question_answer_generation"), and the prediction is run by giving a context or a list of contexts as an argument to the question_answer_generation function (check the paper here, or the HuggingFace dataset page).

import tweetnlp

# MODEL
model = tweetnlp.load_model('question_answer_generation')  # Or `model = tweetnlp.QuestionAnswerGeneration()`
model.question_answer_generation(
  text="'So much of The Post is Ben,' Mrs. Graham said in 1994, three years after Bradlee retired as editor. 'He created it as we know it today.'— Ed O'Keefe (@edatpost) October 21, 2014"
)  # Or `model.predict`
>>> [
    {'question': 'who created the post?', 'answer': 'ben'},
    {'question': 'what did ben do in 1994?', 'answer': 'he retired as editor'}
]

# GET DATASET
dataset = tweetnlp.load_dataset('question_answer_generation')

Language Modeling

The masked language model predicts the masked token in a given sentence. It is instantiated by tweetnlp.load_model('language_model'), and the prediction is run by giving a text or a list of texts as an argument to the mask_prediction function. Please make sure that each text contains a <mask> token, since that is the token the model is trained to predict.

import tweetnlp
model = tweetnlp.load_model('language_model')  # Or `model = tweetnlp.LanguageModel()`
model.mask_prediction("How many more <mask> until opening day? ?", best_n=2)  # Or `model.predict`
>>> {'best_tokens': ['days', 'hours'],
 'best_scores': [5.498564104033932e-11, 4.906026140893971e-10],
 'best_sentences': ['How many more days until opening day? ?',
  'How many more hours until opening day? ?']}
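
The relationship between best_tokens and best_sentences in the output above can be pictured as filling the <mask> slot with each candidate. The sketch below reproduces only that string substitution, not the model's scoring; the fill_mask helper is hypothetical and not part of tweetnlp.

```python
def fill_mask(text, candidates, mask_token='<mask>'):
    """Return one sentence per candidate, with the first mask token replaced."""
    if mask_token not in text:
        raise ValueError(f'the input text must contain a {mask_token} token')
    return [text.replace(mask_token, candidate, 1) for candidate in candidates]


text = 'How many more <mask> until opening day?'
for sentence in fill_mask(text, ['days', 'hours']):
    print(sentence)
```

Raising an error when the mask token is missing mirrors the requirement stated above that every input text must contain <mask>.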

Tweet Embedding

The tweet embedding model produces a fixed-length embedding for a tweet. The embedding represents the semantics of the tweet, and it can be used for semantic search over tweets by computing the similarity between embeddings. The model is instantiated by tweetnlp.load_model('sentence_embedding'), and the prediction is run by passing a text or a list of texts as an argument to the embedding function.

Get Embedding

import tweetnlp
model = tweetnlp.load_model('sentence_embedding')  # Or `model = tweetnlp.SentenceEmbedding()`

# Get sentence embedding
tweet = "I will never understand the decision making of the people of Alabama. Their new Senator is a definite downgrade. You have served with honor.  Well done."
vectors = model.embedding(tweet)
vectors.shape
>>> (768,)

# Get sentence embedding (multiple inputs)
tweet_corpus = [
    "Free, fair elections are the lifeblood of our democracy. Charges of unfairness are serious. But calling an election unfair does not make it so. Charges require specific allegations and then proof. We have neither here." ,
    "Trump appointed judge Stephanos Bibas " ,
    "If your members can go to Puerto Rico they can get their asses back in the classroom. @CTULocal1" ,
    "@PolitiBunny @CTULocal1 Political leverage, science said schools could reopen, teachers and unions protested to keep'em closed and made demands for higher wages and benefits, they're usin Covid as a crutch at the expense of life and education." ,
    "Congratulations to all the exporters on achieving record exports in Dec 2020 with a growth of 18 % over the previous year. Well done &amp; keep up this trend. A major pillar of our govt's economic policy is export enhancement &amp; we will provide full support to promote export culture." ,
    "@ImranKhanPTI Pakistan seems a worst country in term of exporting facilities. I am a small business man and if I have to export a t-shirt having worth of $5 to USA or Europe. Postal cost will be around $30. How can we grow as an exporting country if this situation prevails. Think about it. #PM" ,
    "The thing that doesn’t sit right with me about “nothing good happened in 2020” is that it ignores the largest protest movement in our history. The beautiful, powerful Black Lives Matter uprising reached every corner of the country and should be central to our look back at 2020." ,
    "@JoshuaPotash I kinda said that in the 2020 look back for @washingtonpost" ,
    "Is this a confirmation from Q that Lin is leaking declassified intelligence to the public? I believe so. If @realDonaldTrump didn’t approve of what @LLinWood is doing he would have let us know a lonnnnnng time ago. I’ve always wondered why Lin’s Twitter handle started with “LLin” https://t.co/0G7zClOmi2" ,
    "@ice_qued @realDonaldTrump @LLinWood Yeah 100%" ,
    "Tomorrow is my last day as Senator from Alabama.  I believe our opportunities are boundless when we find common ground. As we swear in a new Congress &amp; a new President, demand from them that they do just that &amp; build a stronger, more just society.  It’s been an honor to serve you." 
    "The mask cult can’t ever admit masks don’t work because their ideology is based on feeling like a “good person”  Wearing a mask makes them a “good person” &amp; anyone who disagrees w/them isn’t  They can’t tolerate any idea that makes them feel like their self-importance is unearned" ,
    "@ianmSC Beyond that, they put such huge confidence in masks so early with no strong evidence that they have any meaningful benefit, they don’t want to backtrack or admit they were wrong. They put the cart before the horse, now desperate to find any results that match their hypothesis." ,
]
vectors = model.embedding(tweet_corpus, batch_size=4)
vectors.shape
>>> (12, 768)

Similarity Search

sims = []
for n, i in enumerate(tweet_corpus):
  _sim = model.similarity(tweet, i)
  sims.append([n, _sim])
print(f'anchor tweet: {tweet}\n')
for m, (n, s) in enumerate(sorted(sims, key=lambda x: x[1], reverse=True)[:3]):
  print(f' - top {m}: {tweet_corpus[n]}\n - similarity: {s}\n')

>>> anchor tweet: I will never understand the decision making of the people of Alabama. Their new Senator is a definite downgrade. You have served with honor.  Well done.

 - top 0: Tomorrow is my last day as Senator from Alabama.  I believe our opportunities are boundless when we find common ground. As we swear in a new Congress &amp; a new President, demand from them that they do just that &amp; build a stronger, more just society.  It’s been an honor to serve you.The mask cult can’t ever admit masks don’t work because their ideology is based on feeling like a “good person”  Wearing a mask makes them a “good person” &amp; anyone who disagrees w/them isn’t  They can’t tolerate any idea that makes them feel like their self-importance is unearned
 - similarity: 0.7480925982953287

 - top 1: Trump appointed judge Stephanos Bibas 
 - similarity: 0.6289173306344258

 - top 2: Free, fair elections are the lifeblood of our democracy. Charges of unfairness are serious. But calling an election unfair does not make it so. Charges require specific allegations and then proof. We have neither here.
 - similarity: 0.6017154109745276
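
This section does not state how the similarity score is computed. A common choice for comparing embeddings is cosine similarity, and the sketch below assumes that measure purely for illustration. It ranks a toy corpus of three-dimensional vectors against an anchor vector; the vectors and helper names are made up and do not come from the model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(anchor, corpus, top_n=3):
    """Return (index, score) pairs for the top_n most similar corpus vectors."""
    scored = [(index, cosine_similarity(anchor, vector)) for index, vector in enumerate(corpus)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy embeddings standing in for the 768-dimensional tweet vectors.
anchor = [1.0, 0.0, 1.0]
corpus = [[1.0, 0.0, 0.9], [0.0, 1.0, 0.0], [0.5, 0.5, 0.5]]
for index, score in rank_by_similarity(anchor, corpus):
    print(index, round(score, 3))
```

For a large corpus it is usually cheaper to embed all tweets once in batches and compare the stored vectors, rather than re-encoding each pair of texts.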

Resources & Custom Model Loading

Here is a table of the default model used in each task.

Task	Model	Dataset
Topic Classification (single-label)	cardiffnlp/twitter-roberta-base-dec2021-tweet-topic-single-all	cardiffnlp/tweet_topic_single
Topic Classification (multi-label)	cardiffnlp/twitter-roberta-base-dec2021-tweet-topic-multi-all	cardiffnlp/tweet_topic_multi
Sentiment Analysis (multilingual)	cardiffnlp/twitter-xlm-roberta-base-sentiment	cardiffnlp/tweet_sentiment_multilingual
Sentiment Analysis	cardiffnlp/twitter-roberta-base-sentiment-latest	tweet_eval
Irony Detection	cardiffnlp/twitter-roberta-base-irony	tweet_eval
Hate Detection	cardiffnlp/twitter-roberta-base-hate-latest	tweet_eval
Offensive Detection	cardiffnlp/twitter-roberta-base-offensive	tweet_eval
Emoji Prediction	cardiffnlp/twitter-roberta-base-emoji	tweet_eval
Emotion Analysis (single-label)	cardiffnlp/twitter-roberta-base-emotion	tweet_eval
Emotion Analysis (multi-label)	cardiffnlp/twitter-roberta-base-emotion-multilabel-latest	TBA
Named Entity Recognition	tner/roberta-large-tweetner7-all	tner/tweetner7
Question Answering	lmqg/t5-small-tweetqa-qa	lmqg/qg_tweetqa
Question Answer Generation	lmqg/t5-base-tweetqa-qag	lmqg/qag_tweetqa
Language Modeling	cardiffnlp/twitter-roberta-base-2021-124m	TBA
Tweet Embedding	cambridgeltl/tweet-roberta-base-embeddings-v1	TBA

To use another model, either from a local path or from the HuggingFace model hub, simply provide the model path or alias to the load_model function. Below is an example that loads a model for NER.

import tweetnlp
tweetnlp.load_model('ner', model_name='tner/twitter-roberta-base-2019-90m-tweetner7-continuous')

Model Fine-tuning

TweetNLP provides an easy interface to fine-tune language models on the supported datasets, using HuggingFace for model hosting and fine-tuning and Ray Tune for parameter search.

Supported tasks: sentiment, offensive, irony, hate, emotion, topic_classification

Experiment results with the tweetnlp trainer can be found in the following table. The results are competitive and can be used as baselines for each task. See the leaderboard page to learn more about the results.

task	language_model	eval_f1	eval_f1_macro	eval_accuracy	link
emoji	cardiffnlp/twitter-roberta-base-2021-124m	0.46	0.35	0.46	cardiffnlp/twitter-roberta-base-2021-124m-emoji
emotion	cardiffnlp/twitter-roberta-base-2021-124m	0.83	0.79	0.83	cardiffnlp/twitter-roberta-base-2021-124m-emotion
hate	cardiffnlp/twitter-roberta-base-2021-124m	0.56	0.53	0.56	cardiffnlp/twitter-roberta-base-2021-124m-hate
irony	cardiffnlp/twitter-roberta-base-2021-124m	0.79	0.78	0.79	cardiffnlp/twitter-roberta-base-2021-124m-irony
offensive	cardiffnlp/twitter-roberta-base-2021-124m	0.86	0.82	0.86	cardiffnlp/twitter-roberta-base-2021-124m-offensive
sentiment	cardiffnlp/twitter-roberta-base-2021-124m	0.71	0.72	0.71	cardiffnlp/twitter-roberta-base-2021-124m-sentiment
topic_classification (single)	cardiffnlp/twitter-roberta-base-2021-124m	0.9	0.8	0.9	cardiffnlp/twitter-roberta-base-2021-124m-topic-single
topic_classification (multi)	cardiffnlp/twitter-roberta-base-2021-124m	0.75	0.56	0.54	cardiffnlp/twitter-roberta-base-2021-124m-topic-multi
sentiment (multilingual)	cardiffnlp/twitter-xlm-roberta-base	0.69	0.69	0.69	cardiffnlp/twitter-xlm-roberta-base-sentiment-multilingual
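
In the single-label rows of the table, eval_f1 equals eval_accuracy. That is what one would expect if eval_f1 is a micro-averaged F1, which is an assumption here rather than something this section states. The sketch below computes micro F1, macro F1, and accuracy for a small made-up set of irony predictions, to show why micro F1 and accuracy coincide in single-label classification while macro F1 can differ.

```python
def f1_scores(gold, predicted):
    """Compute micro F1, macro F1, and accuracy for single-label predictions."""
    pairs = list(zip(gold, predicted))
    labels = sorted(set(gold) | set(predicted))
    tp_total = fp_total = fn_total = 0
    per_label_f1 = []
    for label in labels:
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        denominator = 2 * tp + fp + fn
        per_label_f1.append(2 * tp / denominator if denominator else 0.0)
        tp_total += tp
        fp_total += fp
        fn_total += fn
    return {
        'f1_micro': 2 * tp_total / (2 * tp_total + fp_total + fn_total),
        'f1_macro': sum(per_label_f1) / len(per_label_f1),
        'accuracy': sum(1 for g, p in pairs if g == p) / len(pairs),
    }


# Made-up gold labels and predictions, for illustration only.
gold = ['irony', 'irony', 'non_irony', 'non_irony', 'non_irony', 'non_irony']
predicted = ['irony', 'non_irony', 'non_irony', 'non_irony', 'non_irony', 'irony']
print(f1_scores(gold, predicted))
```

Macro F1 weights every class equally, so it drops when a minority class is predicted poorly, which may explain why eval_f1_macro is lower than eval_f1 in most rows.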

Examples

The following example reproduces our irony model cardiffnlp/twitter-roberta-base-2021-124m-irony.

import logging
import tweetnlp

logging.basicConfig(format='%(asctime)s %(levelname)-8s %(message)s', level=logging.INFO, datefmt='%Y-%m-%d %H:%M:%S')

# load dataset
dataset, label_to_id = tweetnlp.load_dataset("irony")
# load trainer class
trainer_class = tweetnlp.load_trainer("irony")
# setup trainer
trainer = trainer_class(
    language_model='cardiffnlp/twitter-roberta-base-2021-124m',  # language model to fine-tune
    dataset=dataset,
    label_to_id=label_to_id,
    max_length=128,
    split_test='test',
    split_train='train',
    split_validation='validation',
    output_dir='model_ckpt/irony'
)
# start model fine-tuning with parameter optimization
trainer.train(
  eval_step=50,  # each `eval_step`, models are validated on the validation set
  n_trials=10,  # number of trials for parameter optimization
  search_range_lr=[1e-6, 1e-4],  # define the search space for learning rate (min and max value)
  search_range_epoch=[1, 6],  # define the search space for epoch (min and max value)
  search_list_batch=[4, 8, 16, 32, 64]  # define the search space for batch size (list of integers to test)
)
# evaluate model on the test set
trainer.evaluate()
>>> {
  "eval_loss": 1.3228046894073486,
  "eval_f1": 0.7959183673469388,
  "eval_f1_macro": 0.791350632069195,
  "eval_accuracy": 0.7959183673469388,
  "eval_runtime": 2.2267,
  "eval_samples_per_second": 352.084,
  "eval_steps_per_second": 44.01
}
# save model locally (saved at `{output_dir}/best_model` as default)
trainer.save_model()
# run prediction
trainer.predict('If you wanna look like a badass, have drama on social media')
>>> {'label': 'irony'}
# push your model on huggingface hub
trainer.push_to_hub(hf_organization='cardiffnlp', model_alias='twitter-roberta-base-2021-124m-irony')

The saved checkpoint can be loaded as a custom model as shown below.

import tweetnlp
model = tweetnlp.load_model('irony', model_name="model_ckpt/irony/best_model")

If split_validation is not given, the trainer will do a single run with the default parameters, without parameter search.
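
To make the search-space arguments of trainer.train more concrete, the sketch below draws trial configurations from the same ranges using plain random sampling. It only illustrates what a search space of this shape means. It is not how Ray Tune or the tweetnlp trainer selects trials, and the log-uniform learning rate, the inclusive epoch range, and the output key names are all assumptions.

```python
import math
import random


def sample_trial(search_range_lr, search_range_epoch, search_list_batch, rng):
    """Draw one hypothetical trial configuration from the given search space."""
    lr_min, lr_max = search_range_lr
    epoch_min, epoch_max = search_range_epoch
    # Assumption: sample the learning rate log-uniformly between its bounds.
    learning_rate = math.exp(rng.uniform(math.log(lr_min), math.log(lr_max)))
    return {
        'learning_rate': learning_rate,
        'epoch': rng.randint(epoch_min, epoch_max),  # assumption: both ends inclusive
        'batch_size': rng.choice(search_list_batch),
    }


rng = random.Random(0)  # fixed seed so the sketch is repeatable
trials = [sample_trial([1e-6, 1e-4], [1, 6], [4, 8, 16, 32, 64], rng) for _ in range(10)]
for trial in trials[:3]:
    print(trial)
```

Each trial would then be trained and scored on the validation split, which is why a parameter search needs split_validation to be set.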

Reference Paper

For more details, please read the accompanying TweetNLP reference paper. If you use TweetNLP in your research, please use the following bib entry to cite the reference paper:

@inproceedings{camacho-collados-etal-2022-tweetnlp,
    title={{T}weet{NLP}: {C}utting-{E}dge {N}atural {L}anguage {P}rocessing for {S}ocial {M}edia},
    author={Camacho-Collados, Jose and Rezaee, Kiamehr and Riahi, Talayeh and Ushio, Asahi and Loureiro, Daniel and Antypas, Dimosthenis and Boisson, Joanne and Espinosa-Anke, Luis and Liu, Fangyu and Mart{\'i}nez-C{\'a}mara, Eugenio and others},
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = nov,
    year = "2022",
    address = "Abu Dhabi, U.A.E.",
    publisher = "Association for Computational Linguistics",
}
