tweetnlp

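TweetNLP for all the NLP enthusiasts working on Twitter and social media! The Python library tweetnlp provides a collection of useful tools for analyzing and understanding tweets, such as sentiment analysis, emoji prediction, and named entity recognition, powered by state-of-the-art language models specialized for social media.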

News (December 2022): We presented the TweetNLP demo paper ("TweetNLP: Cutting-Edge Natural Language Processing for Social Media") at EMNLP 2022. The final version is available here.

TweetNLP Hugging Face page: all the main TweetNLP models are available on Hugging Face.

Resources:

Quick tour with a Colab notebook:
Play with the TweetNLP online demo: link
EMNLP 2022 paper: link
2nd Cardiff NLP Summer Workshop tutorial:
2nd Cardiff NLP Summer Workshop tutorial (solution):

Load models and datasets
Fine-tune models

Get Started

Install TweetNLP via pip from your console.

pip install tweetnlp

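Model & Dataset

In this section, you will learn how to get models and datasets with tweetnlp. The models follow the Hugging Face model format, and the datasets are in the Hugging Face datasets format. A short introduction to Hugging Face models and datasets is available on the Hugging Face web page, so check it if you have not used Hugging Face before.

Tweet Classification

The classification module consists of seven different tasks (topic classification, sentiment analysis, irony detection, hate speech detection, offensive language detection, emoji prediction, and emotion analysis). In each example, the model is instantiated with tweetnlp.load_model("task-name"), and predictions are made by passing a text or a list of texts as the argument of the corresponding function.

Topic classification: The aim of this task is, given a tweet, to assign the topics related to its content. The task is formulated as a supervised multi-label classification problem in which each tweet is assigned one or more topics from a total of 19 available topics. The topics were carefully curated based on Twitter trends, with the aim of being broad and general, and consist of classes such as arts and culture, music, or sports. The internally annotated dataset contains over 10k manually labeled tweets (see the paper here, or the Hugging Face dataset page).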

import tweetnlp

# MULTI-LABEL MODEL
model = tweetnlp.load_model('topic_classification')  # Or `model = tweetnlp.TopicClassification()`
model.topic("Jacob Collier is a Grammy-awarded English artist from London.")  # Or `model.predict`
# >>> {'label': ['celebrity_&_pop_culture', 'music']}
# Note: the probability of the multi-label model is the output of a sigmoid function on the binary prediction of whether each topic is positive or negative.
model.topic("Jacob Collier is a Grammy-awarded English artist from London.", return_probability=True)
# >>> {'label': ['celebrity_&_pop_culture', 'music'],
#      'probability': {'arts_&_culture': 0.037371691316366196,
#       'business_&_entrepreneurs': 0.010188567452132702,
#       'celebrity_&_pop_culture': 0.92448890209198,
#       'diaries_&_daily_life': 0.03425711765885353,
#       'family': 0.00796138122677803,
#       'fashion_&_style': 0.020642118528485298,
#       'film_tv_&_video': 0.08062587678432465,
#       'fitness_&_health': 0.006343095097690821,
#       'food_&_dining': 0.0042883665300905704,
#       'gaming': 0.004327300935983658,
#       'learning_&_educational': 0.010652057826519012,
#       'music': 0.8291937112808228,
#       'news_&_social_concern': 0.24688217043876648,
#       'other_hobbies': 0.020671198144555092,
#       'relationships': 0.020371075719594955,
#       'science_&_technology': 0.0170074962079525,
#       'sports': 0.014291072264313698,
#       'travel_&_adventure': 0.010423899628221989,
#       'youth_&_student_life': 0.008605164475739002}}

# SINGLE-LABEL MODEL
model = tweetnlp.load_model('topic_classification', multi_label=False)  # Or `model = tweetnlp.TopicClassification(multi_label=False)`
model.topic("Jacob Collier is a Grammy-awarded English artist from London.")
# >>> {'label': 'pop_culture'}
# Note: the probability of the single-label model is the softmax over the labels.
model.topic("Jacob Collier is a Grammy-awarded English artist from London.", return_probability=True)
# >>> {'label': 'pop_culture',
#      'probability': {'arts_&_culture': 9.20625461731106e-05,
#       'business_&_entrepreneurs': 6.916998972883448e-05,
#       'pop_culture': 0.9995898604393005,
#       'daily_life': 0.00011083036952186376,
#       'sports_&_gaming': 8.668467489769682e-05,
#       'science_&_technology': 5.152115045348182e-05}}

# GET DATASET
dataset_multi_label, label2id_multi_label = tweetnlp.load_dataset('topic_classification')
dataset_single_label, label2id_single_label = tweetnlp.load_dataset('topic_classification', multi_label=False)
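
The multi-label output can be read as one independent score per topic. A minimal sketch of how a label list could be derived from such scores is shown below. It assumes a cut-off of 0.5, which is an assumption (the source does not state the threshold tweetnlp applies), and `labels_above_threshold` is a hypothetical helper, not part of the library.

```python
def labels_above_threshold(probability, threshold=0.5):
    """Return every label whose independent (sigmoid) score reaches the threshold."""
    return sorted(label for label, score in probability.items() if score >= threshold)


# A few of the scores from the multi-label topic output shown in this section.
probability = {
    'celebrity_&_pop_culture': 0.92448890209198,
    'music': 0.8291937112808228,
    'news_&_social_concern': 0.24688217043876648,
    'film_tv_&_video': 0.08062587678432465,
}

# Assumed cut-off of 0.5; a lower value admits more topics per tweet.
selected = labels_above_threshold(probability)
print(selected)
```

With the scores above, the 0.5 cut-off keeps the two topics reported in the example output. Lowering the threshold trades precision for recall.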

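Sentiment analysis: The sentiment analysis task integrated in TweetNLP is a simplified version in which the goal is to predict the sentiment of a tweet with one of three labels: positive, neutral, or negative. The base dataset for English is the unified TweetEval version of the SemEval-2017 dataset from the task on sentiment analysis in Twitter (see the paper here).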

import tweetnlp

# ENGLISH MODEL
model = tweetnlp.load_model('sentiment')  # Or `model = tweetnlp.Sentiment()`
model.sentiment("Yes, including Medicare and social security saving?")  # Or `model.predict`
# >>> {'label': 'positive'}
model.sentiment("Yes, including Medicare and social security saving?", return_probability=True)
# >>> {'label': 'positive', 'probability': {'negative': 0.004584966693073511, 'neutral': 0.19360853731632233, 'positive': 0.8018065094947815}}

# MULTILINGUAL MODEL
model = tweetnlp.load_model('sentiment', multilingual=True)  # Or `model = tweetnlp.Sentiment(multilingual=True)`
model.sentiment("天気が良いとやっぱり気持ち良いなあ")
# >>> {'label': 'positive'}
model.sentiment("天気が良いとやっぱり気持ち良いなあ", return_probability=True)
# >>> {'label': 'positive', 'probability': {'negative': 0.028369612991809845, 'neutral': 0.08128828555345535, 'positive': 0.8903420567512512}}

# GET DATASET (ENGLISH)
dataset, label2id = tweetnlp.load_dataset('sentiment')
# GET DATASET (MULTILINGUAL)
for l in ['all', 'arabic', 'english', 'french', 'german', 'hindi', 'italian', 'portuguese', 'spanish']:
    dataset_multilingual, label2id_multilingual = tweetnlp.load_dataset('sentiment', multilingual=True, task_language=l)
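
Single-label models such as the sentiment classifier report probabilities that sum to one. As a rough illustration of how a softmax turns raw scores into such a distribution and how the label follows from it, here is a minimal sketch in plain Python. The logits are hypothetical values chosen for the example, not real model output.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to one."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ['negative', 'neutral', 'positive']
logits = [-2.0, 1.5, 3.0]  # hypothetical raw scores, not taken from a real model

probs = softmax(logits)
probability = dict(zip(labels, probs))
# The predicted label is the one with the highest probability.
label = max(probability, key=probability.get)
print({'label': label, 'probability': probability})
```

Unlike the per-topic sigmoid scores of the multi-label model, these values compete with each other: raising one probability necessarily lowers the others.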

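Irony detection: This is a binary classification task in which, given a tweet, the goal is to detect whether it is ironic or not. It is based on the irony detection dataset from the SemEval 2018 task (see the paper here).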

import tweetnlp

# MODEL
model = tweetnlp.load_model('irony')  # Or `model = tweetnlp.Irony()`
model.irony('If you wanna look like a badass, have drama on social media')  # Or `model.predict`
# >>> {'label': 'irony'}
model.irony('If you wanna look like a badass, have drama on social media', return_probability=True)
# >>> {'label': 'irony', 'probability': {'non_irony': 0.08390884101390839, 'irony': 0.9160911440849304}}

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('irony')

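Hate speech detection: The hate speech detection task consists of detecting whether a tweet is hateful towards a target community. The underlying model is based on a suite of unified hate speech detection datasets (see the reference paper).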

import tweetnlp

# MODEL
model = tweetnlp.load_model('hate')  # Or `model = tweetnlp.Hate()`
model.hate('Whoever just unfollowed me you a bitch')  # Or `model.predict`
# >>> {'label': 'non-hate'}
model.hate('Whoever just unfollowed me you a bitch', return_probability=True)
# >>> {'label': 'non-hate', 'probability': {'non-hate': 0.7263831496238708, 'hate': 0.27361682057380676}}

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('hate')

Offensive language identification: This task consists of identifying whether some form of offensive language is present in a tweet. For the benchmark, we rely on the SemEval 2019 OffensEval dataset (see the paper here).

import tweetnlp

# MODEL
model = tweetnlp.load_model('offensive')  # Or `model = tweetnlp.Offensive()`
model.offensive("All two of them taste like ass.")  # Or `model.predict`
# >>> {'label': 'offensive'}
model.offensive("All two of them taste like ass.", return_probability=True)
# >>> {'label': 'offensive', 'probability': {'non-offensive': 0.16420328617095947, 'offensive': 0.8357967734336853}}

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('offensive')

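Emoji prediction: The goal of emoji prediction is to predict the final emoji of a given tweet. The dataset used to fine-tune the models is the TweetEval adaptation of the SemEval 2018 task on emoji prediction (see the paper here).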

import tweetnlp

# MODEL
model = tweetnlp.load_model('emoji')  # Or `model = tweetnlp.Emoji()`
model.emoji('Beautiful sunset last night from the pontoon @TupperLakeNY')  # Or `model.predict`
# >>> {'label': '📷'}
model.emoji('Beautiful sunset last night from the pontoon @TupperLakeNY', return_probability=True)
# >>> {'label': '📷',
#      'probability': {'❤': 0.13197319209575653,
#       '😍': 0.11246423423290253,
#       '😂': 0.008415069431066513,
#       '💕': 0.04842926934361458,
#       '🔥': 0.014528146013617516,
#       '😊': 0.1509675830602646,
#       '😎': 0.08625403046607971,
#       '✨': 0.01616635173559189,
#       '💙': 0.07396604865789413,
#       '😘': 0.03033279813826084,
#       '📷': 0.16525287926197052,
#       '🇺🇸': 0.020336611196398735,
#       '☀': 0.00799981877207756,
#       '💜': 0.016111424192786217,
#       '😉': 0.012984540313482285,
#       '💯': 0.012557178735733032,
#       '😁': 0.031386848539114,
#       '🎄': 0.006829539313912392,
#       '📸': 0.04188741743564606,
#       '😜': 0.011156936176121235}}

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('emoji')

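Emotion recognition: Given a tweet, this task consists of associating it with its most appropriate emotion. As a reference dataset, we use the SemEval 2018 task on Affect in Tweets (see the paper here). The latest multi-label model covers 11 emotion types.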

import tweetnlp

# MULTI-LABEL MODEL
model = tweetnlp.load_model('emotion')  # Or `model = tweetnlp.Emotion()`
model.emotion('I love swimming for the same reason I love meditating...the feeling of weightlessness.')  # Or `model.predict`
# >>> {'label': 'joy'}
# Note: the probability of the multi-label model is the output of a sigmoid function on the binary prediction of whether each emotion is positive or negative.
model.emotion('I love swimming for the same reason I love meditating...the feeling of weightlessness.', return_probability=True)
# >>> {'label': 'joy',
#      'probability': {'anger': 0.00025800734874792397,
#       'anticipation': 0.0005329723935574293,
#       'disgust': 0.00026112011983059347,
#       'fear': 0.00027552215033210814,
#       'joy': 0.7721399068832397,
#       'love': 0.1806265264749527,
#       'optimism': 0.04208092764019966,
#       'pessimism': 0.00025325192837044597,
#       'sadness': 0.0006160663324408233,
#       'surprise': 0.0005619609728455544,
#       'trust': 0.002393839880824089}}

# SINGLE-LABEL MODEL
# Assumption: `multi_label=False` selects the single-label model, as it does for `topic_classification`.
model = tweetnlp.load_model('emotion', multi_label=False)  # Or `model = tweetnlp.Emotion(multi_label=False)`
model.emotion('I love swimming for the same reason I love meditating...the feeling of weightlessness.')  # Or `model.predict`
# >>> {'label': 'optimism'}
# Note: the probability of the single-label model is the softmax over the labels.
model.emotion('I love swimming for the same reason I love meditating...the feeling of weightlessness.', return_probability=True)
# >>> {'label': 'optimism', 'probability': {'joy': 0.01367587223649025, 'optimism': 0.7345258593559265, 'anger': 0.1770714670419693, 'sadness': 0.07472680509090424}}

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('emotion')

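WARNING: The single-label and multi-label emotion models have different label sets (the single-label model has four classes: 'joy', 'optimism', 'anger', and 'sadness', while the multi-label model has 11 classes).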
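
Because the two emotion models use different label sets, downstream code that switches between them may need to know which labels are shared. The following minimal sketch compares the two sets using the class names that appear in the example outputs of this section; the variable names are illustrative only.

```python
# Class names as they appear in the example outputs of this section.
multi_label_classes = {
    'anger', 'anticipation', 'disgust', 'fear', 'joy', 'love',
    'optimism', 'pessimism', 'sadness', 'surprise', 'trust',
}
single_label_classes = {'joy', 'optimism', 'anger', 'sadness'}

# Labels that both models can predict.
shared = sorted(multi_label_classes & single_label_classes)
# Labels only the multi-label model can predict.
multi_only = sorted(multi_label_classes - single_label_classes)

print('shared:', shared)
print('multi-label only:', multi_only)
```

Any evaluation or dashboard that mixes predictions from both models should restrict itself to the shared labels or map the extra ones explicitly.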

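Named Entity Recognition

This module consists of a named entity recognition (NER) model trained specifically for tweets. The model is instantiated with tweetnlp.load_model("ner"), and predictions are made by passing a text or a list of texts as the argument of the ner function (see the paper here, or the Hugging Face dataset page).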

import tweetnlp

# MODEL
model = tweetnlp.load_model('ner')  # Or `model = tweetnlp.NER()`
model.ner('Jacob Collier is a Grammy-awarded English artist from London.')  # Or `model.predict`
# >>> [{'type': 'person', 'entity': 'Jacob Collier'}, {'type': 'event', 'entity': ' Grammy'}, {'type': 'location', 'entity': ' London'}]
# Note: the probability for the predicted entity is the mean of the probabilities over the sub-tokens representing the entity.
model.ner('Jacob Collier is a Grammy-awarded English artist from London.', return_probability=True)  # Or `model.predict`
# >>> [
#   {'type': 'person', 'entity': 'Jacob Collier', 'probability': 0.9905318220456442},
#   {'type': 'event', 'entity': ' Grammy', 'probability': 0.19164378941059113},
#   {'type': 'location', 'entity': ' London', 'probability': 0.9607000350952148}
# ]

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('ner')
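
The note in the example says an entity's probability is the mean over the sub-tokens that make up the entity. A minimal sketch of that averaging step follows. The sub-token split and the scores are hypothetical values chosen for illustration, and `entity_probability` is not a tweetnlp function.

```python
from statistics import mean


def entity_probability(sub_token_probabilities):
    """Average the per-sub-token probabilities of one predicted entity."""
    if not sub_token_probabilities:
        raise ValueError('an entity needs at least one sub-token')
    return mean(sub_token_probabilities)


# Hypothetical sub-token split and scores for the entity 'Jacob Collier'.
sub_tokens = [('Jacob', 0.99), ('Col', 0.98), ('lier', 0.97)]
probability = entity_probability([score for _, score in sub_tokens])
print({'type': 'person', 'entity': 'Jacob Collier', 'probability': probability})
```

Averaging means a single uncertain sub-token lowers the score of the whole entity, which is one plausible reason a short span can receive a low probability.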

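Question Answering

This module consists of a question answering model trained specifically for tweets. The model is instantiated with tweetnlp.load_model("question_answering"), and predictions are made by passing a question or a list of questions, together with a context or a list of contexts, as arguments of the question_answering function (see the paper here, or the Hugging Face dataset page).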

import tweetnlp

# MODEL
model = tweetnlp.load_model('question_answering')  # Or `model = tweetnlp.QuestionAnswering()`
model.question_answering(
    question='who created the post as we know it today?',
    context="'So much of The Post is Ben,' Mrs. Graham said in 1994, three years after Bradlee retired as editor. 'He created it as we know it today.'— Ed O'Keefe (@edatpost) October 21, 2014"
)  # Or `model.predict`
# >>> {'generated_text': 'ben'}

# GET DATASET
dataset = tweetnlp.load_dataset('question_answering')

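Question Answer Generation

This module consists of a question and answer pair generation model trained specifically for tweets. The model is instantiated with tweetnlp.load_model("question_answer_generation"), and predictions are made by passing a context or a list of contexts as the argument of the question_answer_generation function (see the paper here, or the Hugging Face dataset page).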

import tweetnlp

# MODEL
model = tweetnlp.load_model('question_answer_generation')  # Or `model = tweetnlp.QuestionAnswerGeneration()`
model.question_answer_generation(
    text="'So much of The Post is Ben,' Mrs. Graham said in 1994, three years after Bradlee retired as editor. 'He created it as we know it today.'— Ed O'Keefe (@edatpost) October 21, 2014"
)  # Or `model.predict`
# >>> [
#     {'question': 'who created the post?', 'answer': 'ben'},
#     {'question': 'what did ben do in 1994?', 'answer': 'he retired as editor'}
# ]

# GET DATASET
dataset = tweetnlp.load_dataset('question_answer_generation')

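Language Modeling

The masked language model predicts the masked token in a given sentence. It is instantiated with tweetnlp.load_model('language_model'), and predictions are made by passing a text or a list of texts as the argument of the mask_prediction function. Make sure that each text contains a <mask> token, since that is the token the model is trained to predict.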

import tweetnlp
model = tweetnlp.load_model('language_model')  # Or `model = tweetnlp.LanguageModel()`
model.mask_prediction("How many more <mask> until opening day? ?", best_n=2)  # Or `model.predict`
# >>> {'best_tokens': ['days', 'hours'],
#      'best_scores': [5.498564104033932e-11, 4.906026140893971e-10],
#      'best_sentences': ['How many more days until opening day? ?',
#       'How many more hours until opening day? ?']}
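
The `best_sentences` field in the output looks like the input with `<mask>` replaced by each candidate token. A minimal sketch of that substitution step, including a check that the input actually contains a mask, might look as follows. `fill_mask` is a hypothetical helper written for illustration; how tweetnlp builds the sentences internally is an assumption.

```python
MASK = '<mask>'


def fill_mask(text, candidates):
    """Substitute each candidate token for the single <mask> in the text."""
    if text.count(MASK) != 1:
        raise ValueError(f'expected exactly one {MASK} token in: {text!r}')
    return [text.replace(MASK, token) for token in candidates]


sentences = fill_mask('How many more <mask> until opening day?', ['days', 'hours'])
for sentence in sentences:
    print(sentence)
```

Validating the input up front gives a clear error when a text is passed without a mask, instead of a silent pass-through.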

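Tweet Embedding

The tweet embedding model produces a fixed-length embedding for a tweet. The embedding represents the semantics of the tweet, so it can be used for semantic search over tweets by comparing the similarity between embeddings. The model is instantiated with tweetnlp.load_model('sentence_embedding'), and predictions are made by passing a text or a list of texts as the argument of the embedding function.

Get embeddings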

import tweetnlp
model = tweetnlp.load_model('sentence_embedding')  # Or `model = tweetnlp.SentenceEmbedding()`

# Get sentence embedding
tweet = "I will never understand the decision making of the people of Alabama. Their new Senator is a definite downgrade. You have served with honor.  Well done."
vectors = model.embedding(tweet)
vectors.shape
# >>> (768,)

# Get sentence embedding (multiple inputs)
tweet_corpus = [
    "Free, fair elections are the lifeblood of our democracy. Charges of unfairness are serious. But calling an election unfair does not make it so. Charges require specific allegations and then proof. We have neither here." ,
    "Trump appointed judge Stephanos Bibas " ,
    "If your members can go to Puerto Rico they can get their asses back in the classroom. @CTULocal1" ,
    "@PolitiBunny @CTULocal1 Political leverage, science said schools could reopen, teachers and unions protested to keep'em closed and made demands for higher wages and benefits, they're usin Covid as a crutch at the expense of life and education." ,
    "Congratulations to all the exporters on achieving record exports in Dec 2020 with a growth of 18 % over the previous year. Well done &amp; keep up this trend. A major pillar of our govt's economic policy is export enhancement &amp; we will provide full support to promote export culture." ,
    "@ImranKhanPTI Pakistan seems a worst country in term of exporting facilities. I am a small business man and if I have to export a t-shirt having worth of $5 to USA or Europe. Postal cost will be around $30. How can we grow as an exporting country if this situation prevails. Think about it. #PM" ,
    "The thing that doesn’t sit right with me about “nothing good happened in 2020” is that it ignores the largest protest movement in our history. The beautiful, powerful Black Lives Matter uprising reached every corner of the country and should be central to our look back at 2020." ,
    "@JoshuaPotash I kinda said that in the 2020 look back for @washingtonpost" ,
    "Is this a confirmation from Q that Lin is leaking declassified intelligence to the public? I believe so. If @realDonaldTrump didn’t approve of what @LLinWood is doing he would have let us know a lonnnnnng time ago. I’ve always wondered why Lin’s Twitter handle started with “LLin” https://t.co/0G7zClOmi2" ,
    "@ice_qued @realDonaldTrump @LLinWood Yeah 100%" ,
    "Tomorrow is my last day as Senator from Alabama.  I believe our opportunities are boundless when we find common ground. As we swear in a new Congress &amp; a new President, demand from them that they do just that &amp; build a stronger, more just society.  It’s been an honor to serve you." 
    "The mask cult can’t ever admit masks don’t work because their ideology is based on feeling like a “good person”  Wearing a mask makes them a “good person” &amp; anyone who disagrees w/them isn’t  They can’t tolerate any idea that makes them feel like their self-importance is unearned" ,
    "@ianmSC Beyond that, they put such huge confidence in masks so early with no strong evidence that they have any meaningful benefit, they don’t want to backtrack or admit they were wrong. They put the cart before the horse, now desperate to find any results that match their hypothesis." ,
]
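# Note: the corpus holds 12 strings, because the entry starting "Tomorrow is my last day" has no trailing comma and Python joins it with the next string.
vectors = model.embedding(tweet_corpus, batch_size=4)
vectors.shape
# >>> (12, 768)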

Similarity search

sims = []
for n, i in enumerate(tweet_corpus):
  _sim = model.similarity(tweet, i)
  sims.append([n, _sim])
print(f'anchor tweet: {tweet}\n')
for m, (n, s) in enumerate(sorted(sims, key=lambda x: x[1], reverse=True)[:3]):
  print(f' - top {m}: {tweet_corpus[n]}\n - similarity: {s}\n')

# >>> anchor tweet: I will never understand the decision making of the people of Alabama. Their new Senator is a definite downgrade. You have served with honor.  Well done.
#
#  - top 0: Tomorrow is my last day as Senator from Alabama.  I believe our opportunities are boundless when we find common ground. As we swear in a new Congress &amp; a new President, demand from them that they do just that &amp; build a stronger, more just society.  It’s been an honor to serve you.The mask cult can’t ever admit masks don’t work because their ideology is based on feeling like a “good person”  Wearing a mask makes them a “good person” &amp; anyone who disagrees w/them isn’t  They can’t tolerate any idea that makes them feel like their self-importance is unearned
#  - similarity: 0.7480925982953287
#
#  - top 1: Trump appointed judge Stephanos Bibas 
#  - similarity: 0.6289173306344258
#
#  - top 2: Free, fair elections are the lifeblood of our democracy. Charges of unfairness are serious. But calling an election unfair does not make it so. Charges require specific allegations and then proof. We have neither here.
#  - similarity: 0.6017154109745276
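
To make the ranking step concrete without loading a model, the sketch below ranks a few toy vectors against an anchor. It assumes the similarity score is cosine similarity, which the source does not state explicitly, and the three-dimensional vectors stand in for the 768-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for tweet embeddings (hypothetical values).
anchor = [1.0, 0.0, 1.0]
corpus = {
    'close': [0.9, 0.1, 1.1],
    'orthogonal': [0.0, 1.0, 0.0],
    'opposite': [-1.0, 0.0, -1.0],
}

# Rank corpus entries from most to least similar to the anchor.
ranked = sorted(corpus, key=lambda name: cosine_similarity(anchor, corpus[name]), reverse=True)
for name in ranked:
    print(name, round(cosine_similarity(anchor, corpus[name]), 4))
```

For a large corpus it would be cheaper to embed everything once with `model.embedding` and compare the stored vectors than to score pairs of raw texts one at a time.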

Resources and Custom Model Loading

Here is a table of the default model used in each task.

Task	Model	Dataset
Topic Classification (single-label)	cardiffnlp/twitter-roberta-base-dec2021-tweet-topic-single-all	cardiffnlp/tweet_topic_single
Topic Classification (multi-label)	cardiffnlp/twitter-roberta-base-dec2021-tweet-topic-multi-all	cardiffnlp/tweet_topic_multi
Sentiment Analysis (multilingual)	cardiffnlp/twitter-xlm-roberta-base-sentiment	cardiffnlp/tweet_sentiment_multilingual
Sentiment Analysis	cardiffnlp/twitter-roberta-base-sentiment-latest	tweet_eval
Irony Detection	cardiffnlp/twitter-roberta-base-irony	tweet_eval
Hate Detection	cardiffnlp/twitter-roberta-base-hate-latest	tweet_eval
Offensive Detection	cardiffnlp/twitter-roberta-base-offensive	tweet_eval
Emoji Prediction	cardiffnlp/twitter-roberta-base-emoji	tweet_eval
Emotion Analysis (single-label)	cardiffnlp/twitter-roberta-base-emotion	tweet_eval
Emotion Analysis (multi-label)	cardiffnlp/twitter-roberta-base-emotion-multilabel-latest	TBA
Named Entity Recognition	tner/roberta-large-tweetner7-all	tner/tweetner7
Question Answering	lmqg/t5-small-tweetqa-qa	lmqg/qg_tweetqa
Question Answer Generation	lmqg/t5-base-tweetqa-qag	lmqg/qag_tweetqa
Language Modeling	cardiffnlp/twitter-roberta-base-2021-124m	TBA
Tweet Embedding	cambridgeltl/tweet-roberta-base-embeddings-v1	TBA

To use another model from a local path or the Hugging Face model hub, simply pass the model path or alias to the load_model function. Below is an example that loads a model for NER.

import tweetnlp
tweetnlp.load_model('ner', model_name='tner/twitter-roberta-base-2019-90m-tweetner7-continuous')
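
The relationship between the default table and the `model_name` override can be pictured as a simple lookup with a fallback. The sketch below is only an illustration of that idea: `DEFAULT_MODELS` and `resolve_model_name` are hypothetical names, and tweetnlp's internal logic may differ.

```python
# Two default entries copied from the table above (hypothetical constant name).
DEFAULT_MODELS = {
    'irony': 'cardiffnlp/twitter-roberta-base-irony',
    'ner': 'tner/roberta-large-tweetner7-all',
}


def resolve_model_name(task, model_name=None):
    """Return the explicit model path/alias if given, else the task default."""
    if model_name is not None:
        return model_name
    if task not in DEFAULT_MODELS:
        raise ValueError(f'no default model registered for task {task!r}')
    return DEFAULT_MODELS[task]


default_ner = resolve_model_name('ner')
custom_ner = resolve_model_name('ner', model_name='tner/twitter-roberta-base-2019-90m-tweetner7-continuous')
print(default_ner)
print(custom_ner)
```

The same pattern applies to local checkpoints: a filesystem path passed as `model_name` takes precedence over the task default.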

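Fine-tuning Models

TweetNLP provides a simple interface for fine-tuning language models on the supported datasets, covering both fine-tuning and model hosting.

Supported tasks: sentiment, offensive, irony, hate, emotion, topic_classification

The results of experiments with tweetnlp's trainer are reported in the following table. The results are competitive and can be used as baselines for each task. See the leaderboard page for more details on the results.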

task	language_model	eval_f1	eval_f1_macro	eval_accuracy	link
emoji	cardiffnlp/twitter-roberta-base-2021-124m	0.46	0.35	0.46	cardiffnlp/twitter-roberta-base-2021-124m-emoji
emotion	cardiffnlp/twitter-roberta-base-2021-124m	0.83	0.79	0.83	cardiffnlp/twitter-roberta-base-2021-124m-emotion
hate	cardiffnlp/twitter-roberta-base-2021-124m	0.56	0.53	0.56	cardiffnlp/twitter-roberta-base-2021-124m-hate
irony	cardiffnlp/twitter-roberta-base-2021-124m	0.79	0.78	0.79	cardiffnlp/twitter-roberta-base-2021-124m-irony
offensive	cardiffnlp/twitter-roberta-base-2021-124m	0.86	0.82	0.86	cardiffnlp/twitter-roberta-base-2021-124m-offensive
sentiment	cardiffnlp/twitter-roberta-base-2021-124m	0.71	0.72	0.71	cardiffnlp/twitter-roberta-base-2021-124m-sentiment
topic_classification (single)	cardiffnlp/twitter-roberta-base-2021-124m	0.9	0.8	0.9	cardiffnlp/twitter-roberta-base-2021-124m-topic-single
topic_classification (multi)	cardiffnlp/twitter-roberta-base-2021-124m	0.75	0.56	0.54	cardiffnlp/twitter-roberta-base-2021-124m-topic-multi
sentiment (multilingual)	cardiffnlp/twitter-xlm-roberta-base	0.69	0.69	0.69	cardiffnlp/twitter-xlm-roberta-base-sentiment-multilingual

Example

The following example reproduces the irony model cardiffnlp/twitter-roberta-base-2021-124m-irony.

import logging
import tweetnlp

logging.basicConfig(format='%(asctime)s %(levelname)-8s %(message)s', level=logging.INFO, datefmt='%Y-%m-%d %H:%M:%S')

# load dataset
dataset, label_to_id = tweetnlp.load_dataset("irony")
# load trainer class
trainer_class = tweetnlp.load_trainer("irony")
# setup trainer
trainer = trainer_class(
    language_model='cardiffnlp/twitter-roberta-base-2021-124m',  # language model to fine-tune
    dataset=dataset,
    label_to_id=label_to_id,
    max_length=128,
    split_test='test',
    split_train='train',
    split_validation='validation',
    output_dir='model_ckpt/irony'
)
# start model fine-tuning with parameter optimization
trainer.train(
    eval_step=50,  # every `eval_step` steps, the model is validated on the validation set
    n_trials=10,  # number of trials for parameter optimization
    search_range_lr=[1e-6, 1e-4],  # search space for the learning rate (min and max value)
    search_range_epoch=[1, 6],  # search space for the number of epochs (min and max value)
    search_list_batch=[4, 8, 16, 32, 64]  # search space for the batch size (list of integers to test)
)
# evaluate model on the test set
trainer.evaluate()
# >>> {
#   "eval_loss": 1.3228046894073486,
#   "eval_f1": 0.7959183673469388,
#   "eval_f1_macro": 0.791350632069195,
#   "eval_accuracy": 0.7959183673469388,
#   "eval_runtime": 2.2267,
#   "eval_samples_per_second": 352.084,
#   "eval_steps_per_second": 44.01
# }
# save model locally (saved at `{output_dir}/best_model` by default)
trainer.save_model()
# run prediction
trainer.predict('If you wanna look like a badass, have drama on social media')
# >>> {'label': 'irony'}
# push your model to the Hugging Face hub
trainer.push_to_hub(hf_organization='cardiffnlp', model_alias='twitter-roberta-base-2021-124m-irony')

The saved checkpoint can be loaded as a custom model, as shown below.

import tweetnlp
model = tweetnlp.load_model('irony', model_name="model_ckpt/irony/best_model")

If split_validation is not given, the trainer performs a single run with the default parameters, without parameter search.
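
To give a feel for what the search arguments passed to `trainer.train` describe, the sketch below draws `n_trials` random configurations from the same ranges. This is purely illustrative: the source does not say how the trainer samples its search space, so the log-uniform learning rate, the inclusive epoch range, and the `sample_trial` helper are all assumptions.

```python
import math
import random


def sample_trial(rng, search_range_lr, search_range_epoch, search_list_batch):
    """Draw one hypothetical hyperparameter configuration from the search space."""
    lo, hi = search_range_lr
    # Assumption: sample the learning rate log-uniformly between the bounds.
    lr = math.exp(rng.uniform(math.log(lo), math.log(hi)))
    lr = min(max(lr, lo), hi)  # guard against floating-point drift at the edges
    epoch = rng.randint(*search_range_epoch)  # assumption: both ends inclusive
    batch = rng.choice(search_list_batch)
    return {'learning_rate': lr, 'epoch': epoch, 'batch_size': batch}


rng = random.Random(0)  # fixed seed so the sketch is repeatable
trials = [
    sample_trial(rng, [1e-6, 1e-4], [1, 6], [4, 8, 16, 32, 64])
    for _ in range(10)  # mirrors n_trials=10 in the example above
]
for trial in trials:
    print(trial)
```

Each sampled configuration would then be trained and scored on the validation split, and the best one kept, which is why parameter search needs `split_validation`.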

Reference Paper

For more details, please read the accompanying TweetNLP reference paper. If you use TweetNLP in your research, please cite the reference paper using the following bib entry.

@inproceedings{camacho-collados-etal-2022-tweetnlp,
    title = "{T}weet{NLP}: {C}utting-{E}dge {N}atural {L}anguage {P}rocessing for {S}ocial {M}edia",
    author = "Camacho-Collados, Jose and Rezaee, Kiamehr and Riahi, Talayeh and Ushio, Asahi and Loureiro, Daniel and Antypas, Dimosthenis and Boisson, Joanne and Espinosa-Anke, Luis and Liu, Fangyu and Mart{\'i}nez-C{\'a}mara, Eugenio and others",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = nov,
    year = "2022",
    address = "Abu Dhabi, U.A.E.",
    publisher = "Association for Computational Linguistics",
}