
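TweetNLP

TweetNLP is for all the NLP enthusiasts working on Twitter and social media! The Python library tweetnlp provides a collection of useful tools to analyze and understand tweets, such as sentiment analysis, emoji prediction, and named-entity recognition, powered by state-of-the-art language models specialized for social media.

News (December 2022): We presented a TweetNLP demo paper ("TweetNLP: Cutting-Edge Natural Language Processing for Social Media") at EMNLP 2022. The final version is available online.

TweetNLP Hugging Face page: all the main TweetNLP models can be found on Hugging Face.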

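Resources:

Quick tour with a Colab notebook
Play with the TweetNLP online demo: link
EMNLP 2022 paper: link
2nd Cardiff NLP Summer Workshop tutorial
2nd Cardiff NLP Summer Workshop tutorial (solution)

Load Model & Dataset
Fine-tune Model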

Get Started

Install TweetNLP via pip from your console.

pip install tweetnlp

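Model & Dataset

In this section, you will learn how to get the models and datasets with tweetnlp. The models follow the Hugging Face model format, and the datasets are in the Hugging Face datasets format. Introductions to Hugging Face models and datasets can be found on the Hugging Face website, so please check them if you are new to Hugging Face.

Tweet Classification

The classification module consists of seven different tasks (topic classification, sentiment analysis, irony detection, hate speech detection, offensive language detection, emoji prediction, and emotion analysis). In each example, the model is instantiated by tweetnlp.load_model("task-name") and runs the prediction by passing a text or a list of texts to the corresponding function.

Topic Classification: The aim of this task is, given a tweet, to assign topics related to its content. The task is framed as a supervised multi-label classification problem in which each tweet is assigned one or more of the available topics. The topics were carefully curated based on Twitter trends with the aim of being broad and general, and include categories such as arts & culture, music, or sports. Our internally annotated dataset contains over 10K manually labeled tweets (see the paper or the Hugging Face dataset page).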

import tweetnlp

# MULTI-LABEL MODEL
model = tweetnlp.load_model('topic_classification')  # Or `model = tweetnlp.TopicClassification()`
model.topic("Jacob Collier is a Grammy-awarded English artist from London.")  # Or `model.predict`
>>> {'label': ['celebrity_&_pop_culture', 'music']}
# Note: the probability of the multi-label model is the output of a sigmoid function on the binary prediction of whether each topic is positive or negative.
model.topic("Jacob Collier is a Grammy-awarded English artist from London.", return_probability=True)
>>> {'label': ['celebrity_&_pop_culture', 'music'],
 'probability': {'arts_&_culture': 0.037371691316366196,
  'business_&_entrepreneurs': 0.010188567452132702,
  'celebrity_&_pop_culture': 0.92448890209198,
  'diaries_&_daily_life': 0.03425711765885353,
  'family': 0.00796138122677803,
  'fashion_&_style': 0.020642118528485298,
  'film_tv_&_video': 0.08062587678432465,
  'fitness_&_health': 0.006343095097690821,
  'food_&_dining': 0.0042883665300905704,
  'gaming': 0.004327300935983658,
  'learning_&_educational': 0.010652057826519012,
  'music': 0.8291937112808228,
  'news_&_social_concern': 0.24688217043876648,
  'other_hobbies': 0.020671198144555092,
  'relationships': 0.020371075719594955,
  'science_&_technology': 0.0170074962079525,
  'sports': 0.014291072264313698,
  'travel_&_adventure': 0.010423899628221989,
  'youth_&_student_life': 0.008605164475739002}}

# SINGLE-LABEL MODEL
model = tweetnlp.load_model('topic_classification', multi_label=False)  # Or `model = tweetnlp.TopicClassification(multi_label=False)`
model.topic("Jacob Collier is a Grammy-awarded English artist from London.")
>>> {'label': 'pop_culture'}
# Note: the probability of the single-label model is the softmax over the labels.
model.topic("Jacob Collier is a Grammy-awarded English artist from London.", return_probability=True)
>>> {'label': 'pop_culture',
 'probability': {'arts_&_culture': 9.20625461731106e-05,
  'business_&_entrepreneurs': 6.916998972883448e-05,
  'pop_culture': 0.9995898604393005,
  'daily_life': 0.00011083036952186376,
  'sports_&_gaming': 8.668467489769682e-05,
  'science_&_technology': 5.152115045348182e-05}}

# GET DATASET
dataset_multi_label, label2id_multi_label = tweetnlp.load_dataset('topic_classification')
dataset_single_label, label2id_single_label = tweetnlp.load_dataset('topic_classification', multi_label=False)
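
The difference between the two kinds of probability can be sketched in plain Python. This is only an illustration, not the library's implementation: the logits are invented, and the 0.5 cut-off used to pick multi-label topics is an assumption.

```python
import math


def sigmoid_scores(logits):
    """Independent per-label probabilities, as in a multi-label model."""
    return {label: 1.0 / (1.0 + math.exp(-z)) for label, z in logits.items()}


def softmax_scores(logits):
    """Probabilities that compete across labels and sum to 1 (single-label)."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(z - peak) for label, z in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Hypothetical logits, invented for illustration only.
logits = {"music": 1.6, "celebrity_&_pop_culture": 2.5, "sports": -4.2}

multi = sigmoid_scores(logits)
# Assumed decision rule: keep every label whose sigmoid score exceeds 0.5.
multi_labels = sorted(label for label, p in multi.items() if p > 0.5)

single = softmax_scores(logits)
single_label = max(single, key=single.get)  # exactly one winner

print(multi_labels)
print(single_label)
```

Sigmoid scores are independent, so several topics can be selected at once and the scores need not sum to 1. Softmax scores always sum to 1 and yield a single label.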

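Sentiment Analysis: The sentiment analysis task integrated in TweetNLP is a simplified version where the goal is to predict the sentiment of a tweet with one of three labels: positive, neutral, or negative. The base dataset for English is the unified TweetEval version of the SemEval-2017 dataset from the task on Sentiment Analysis in Twitter (see the task paper).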

import tweetnlp

# ENGLISH MODEL
model = tweetnlp.load_model('sentiment')  # Or `model = tweetnlp.Sentiment()`
model.sentiment("Yes, including Medicare and social security saving?")  # Or `model.predict`
>>> {'label': 'positive'}
model.sentiment("Yes, including Medicare and social security saving?", return_probability=True)
>>> {'label': 'positive', 'probability': {'negative': 0.004584966693073511, 'neutral': 0.19360853731632233, 'positive': 0.8018065094947815}}

# MULTILINGUAL MODEL
model = tweetnlp.load_model('sentiment', multilingual=True)  # Or `model = tweetnlp.Sentiment(multilingual=True)`
model.sentiment("天気が良いとやっぱり気持ち良いなあ")
>>> {'label': 'positive'}
model.sentiment("天気が良いとやっぱり気持ち良いなあ", return_probability=True)
>>> {'label': 'positive', 'probability': {'negative': 0.028369612991809845, 'neutral': 0.08128828555345535, 'positive': 0.8903420567512512}}

# GET DATASET (ENGLISH)
dataset, label2id = tweetnlp.load_dataset('sentiment')
# GET DATASET (MULTILINGUAL)
for l in ['all', 'arabic', 'english', 'french', 'german', 'hindi', 'italian', 'portuguese', 'spanish']:
    dataset_multilingual, label2id_multilingual = tweetnlp.load_dataset('sentiment', multilingual=True, task_language=l)
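
When return_probability=True is used, the returned dictionary can be post-processed with ordinary Python. The sketch below ranks the labels of the English sentiment output shown above. `top_k_labels` is a hypothetical helper written for this illustration, not part of tweetnlp.

```python
def top_k_labels(prediction, k=2):
    """Return the k most probable (label, probability) pairs, best first."""
    ranked = sorted(
        prediction["probability"].items(),
        key=lambda item: item[1],
        reverse=True,
    )
    return ranked[:k]


# Output copied from the English sentiment example.
prediction = {
    "label": "positive",
    "probability": {
        "negative": 0.004584966693073511,
        "neutral": 0.19360853731632233,
        "positive": 0.8018065094947815,
    },
}

best_two = top_k_labels(prediction, k=2)
print(best_two)
```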

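Irony Detection: This is a binary classification task where, given a tweet, the goal is to detect whether it is ironic or not. It is based on the Irony Detection dataset from the SemEval 2018 task (see the task paper).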

import tweetnlp

# MODEL
model = tweetnlp.load_model('irony')  # Or `model = tweetnlp.Irony()`
model.irony('If you wanna look like a badass, have drama on social media')  # Or `model.predict`
>>> {'label': 'irony'}
model.irony('If you wanna look like a badass, have drama on social media', return_probability=True)
>>> {'label': 'irony', 'probability': {'non_irony': 0.08390884101390839, 'irony': 0.9160911440849304}}

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('irony')

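Hate Speech Detection: The hate speech detection task consists of detecting whether a tweet is hateful towards a target community. The underlying model is based on a suite of unified hate speech detection datasets (see the reference paper).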

import tweetnlp

# MODEL
model = tweetnlp.load_model('hate')  # Or `model = tweetnlp.Hate()`
model.hate('Whoever just unfollowed me you a bitch')  # Or `model.predict`
>>> {'label': 'not-hate'}
model.hate('Whoever just unfollowed me you a bitch', return_probability=True)
>>> {'label': 'non-hate', 'probability': {'non-hate': 0.7263831496238708, 'hate': 0.27361682057380676}}

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('hate')

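Offensive Language Identification: This task consists of identifying whether some form of offensive language is present in a tweet. For our benchmark, we rely on the SemEval 2019 OffensEval dataset (see the task paper).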

import tweetnlp

# MODEL
model = tweetnlp.load_model('offensive')  # Or `model = tweetnlp.Offensive()`
model.offensive("All two of them taste like ass.")  # Or `model.predict`
>>> {'label': 'offensive'}
model.offensive("All two of them taste like ass.", return_probability=True)
>>> {'label': 'offensive', 'probability': {'non-offensive': 0.16420328617095947, 'offensive': 0.8357967734336853}}

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('offensive')

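Emoji Prediction: The goal of emoji prediction is to predict the final emoji of a given tweet. The dataset used to fine-tune our models is the TweetEval adaptation of the SemEval 2018 task on Emoji Prediction (see the task paper), which includes 20 emoji as labels (❤, 😍, 😂, 💕, 🔥, 😊, 😎, ✨, 💙, 😘, 📷, 🇺🇸, ☀, 💜, 😉, 💯, 😁, 🎄, 📸, 😜).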

import tweetnlp

# MODEL
model = tweetnlp.load_model('emoji')  # Or `model = tweetnlp.Emoji()`
model.emoji('Beautiful sunset last night from the pontoon @TupperLakeNY')  # Or `model.predict`
>>> {'label': '📷'}
model.emoji('Beautiful sunset last night from the pontoon @TupperLakeNY', return_probability=True)
>>> {'label': '📷',
 'probability': {'❤': 0.13197319209575653,
  '😍': 0.11246423423290253,
  '😂': 0.008415069431066513,
  '💕': 0.04842926934361458,
  '🔥': 0.014528146013617516,
  '😊': 0.1509675830602646,
  '😎': 0.08625403046607971,
  '✨': 0.01616635173559189,
  '💙': 0.07396604865789413,
  '😘': 0.03033279813826084,
  '📷': 0.16525287926197052,
  '🇺🇸': 0.020336611196398735,
  '☀': 0.00799981877207756,
  '💜': 0.016111424192786217,
  '😉': 0.012984540313482285,
  '💯': 0.012557178735733032,
  '😁': 0.031386848539114,
  '🎄': 0.006829539313912392,
  '📸': 0.04188741743564606,
  '😜': 0.011156936176121235}}

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('emoji')

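Emotion Recognition: Given a tweet, this task consists of associating it with its most appropriate emotion. As a reference dataset, we use the SemEval 2018 task on Affect in Tweets (see the task paper). The latest multi-label model covers eleven emotion types.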

import tweetnlp

# MULTI-LABEL MODEL
model = tweetnlp.load_model('emotion')  # Or `model = tweetnlp.Emotion()`
model.emotion('I love swimming for the same reason I love meditating...the feeling of weightlessness.')  # Or `model.predict`
>>> {'label': 'joy'}
# Note: the probability of the multi-label model is the output of a sigmoid function on the binary prediction of whether each label is positive or negative.
model.emotion('I love swimming for the same reason I love meditating...the feeling of weightlessness.', return_probability=True)
>>> {'label': 'joy',
 'probability': {'anger': 0.00025800734874792397,
  'anticipation': 0.0005329723935574293,
  'disgust': 0.00026112011983059347,
  'fear': 0.00027552215033210814,
  'joy': 0.7721399068832397,
  'love': 0.1806265264749527,
  'optimism': 0.04208092764019966,
  'pessimism': 0.00025325192837044597,
  'sadness': 0.0006160663324408233,
  'surprise': 0.0005619609728455544,
  'trust': 0.002393839880824089}}

# SINGLE-LABEL MODEL
# Note: as written, this call is identical to the multi-label one above; selecting the single-label model
# presumably needs an explicit option (compare `multi_label=False` in topic classification).
model = tweetnlp.load_model('emotion')  # Or `model = tweetnlp.Emotion()`
model.emotion('I love swimming for the same reason I love meditating...the feeling of weightlessness.')  # Or `model.predict`
>>> {'label': 'joy'}
# Note: the probability of the single-label model is the softmax over the labels.
model.emotion('I love swimming for the same reason I love meditating...the feeling of weightlessness.', return_probability=True)
>>> {'label': 'optimism', 'probability': {'joy': 0.01367587223649025, 'optimism': 0.7345258593559265, 'anger': 0.1770714670419693, 'sadness': 0.07472680509090424}}

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('emotion')

WARNING: The single-label and multi-label emotion models have different label sets. The single-label model has four classes ('joy'/'optimism'/'anger'/'sadness'), while the multi-label model has eleven ('joy'/'optimism'/'anger'/'sadness'/'love'/'trust'/'fear'/'surprise'/'anticipation'/'disgust'/'pessimism').
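
When comparing outputs of the two emotion models, it can help to make the label-set mismatch explicit. The sketch below uses only the label names listed in the warning. `comparable_labels` is a hypothetical helper, not a tweetnlp function.

```python
# Label sets as listed in the warning above.
SINGLE_LABEL_EMOTIONS = {"joy", "optimism", "anger", "sadness"}
MULTI_LABEL_EMOTIONS = SINGLE_LABEL_EMOTIONS | {
    "love", "trust", "fear", "surprise", "anticipation", "disgust", "pessimism",
}


def comparable_labels(multi_label_prediction):
    """Keep only the predicted labels that also exist in the single-label set."""
    return sorted(set(multi_label_prediction) & SINGLE_LABEL_EMOTIONS)


# Labels that only the multi-label model can produce.
multi_only = sorted(MULTI_LABEL_EMOTIONS - SINGLE_LABEL_EMOTIONS)

print(multi_only)
print(comparable_labels(["joy", "love"]))
```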

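Named Entity Recognition

This module consists of a named-entity recognition (NER) model specifically trained for tweets. The model is instantiated by tweetnlp.load_model("ner") and runs the prediction by passing a text or a list of texts to the ner function (see the paper or the Hugging Face dataset page).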

import tweetnlp

# MODEL
model = tweetnlp.load_model('ner')  # Or `model = tweetnlp.NER()`
model.ner('Jacob Collier is a Grammy-awarded English artist from London.')  # Or `model.predict`
>>> [{'type': 'person', 'entity': 'Jacob Collier'}, {'type': 'event', 'entity': ' Grammy'}, {'type': 'location', 'entity': ' London'}]
# Note: the probability for the predicted entity is the mean of the probabilities over the sub-tokens representing the entity.
model.ner('Jacob Collier is a Grammy-awarded English artist from London.', return_probability=True)  # Or `model.predict`
>>> [
  {'type': 'person', 'entity': 'Jacob Collier', 'probability': 0.9905318220456442},
  {'type': 'event', 'entity': ' Grammy', 'probability': 0.19164378941059113},
  {'type': 'location', 'entity': ' London', 'probability': 0.9607000350952148}
]

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('ner')
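
The note on entity probability (the mean over sub-token probabilities) can be illustrated with a few lines of Python. The sub-token values below are invented, and `entity_probability` is a hypothetical helper. The library's internal aggregation may differ in detail.

```python
from statistics import mean


def entity_probability(subtoken_probabilities):
    """Average the per-sub-token probabilities of one predicted entity."""
    if not subtoken_probabilities:
        raise ValueError("an entity needs at least one sub-token")
    return mean(subtoken_probabilities)


# Hypothetical probabilities for an entity split into three sub-tokens.
score = entity_probability([0.99, 0.98, 0.97])
print(round(score, 2))
```

One consequence of averaging is that a single uncertain sub-token lowers the score of the whole entity.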

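Question Answering

This module consists of a question answering model specifically trained for tweets. The model is instantiated by tweetnlp.load_model("question_answering") and runs the prediction by passing a question or a list of questions, along with a context or a list of contexts, to the question_answering function (see the paper or the Hugging Face dataset page).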

import tweetnlp

# MODEL
model = tweetnlp.load_model('question_answering')  # Or `model = tweetnlp.QuestionAnswering()`
model.question_answering(
  question='who created the post as we know it today?',
  context="'So much of The Post is Ben,' Mrs. Graham said in 1994, three years after Bradlee retired as editor. 'He created it as we know it today.'— Ed O'Keefe (@edatpost) October 21, 2014"
)  # Or `model.predict`
>>> {'generated_text': 'ben'}

# GET DATASET
dataset = tweetnlp.load_dataset('question_answering')

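Question Answer Generation

This module consists of a question-and-answer pair generation model specifically trained for tweets. The model is instantiated by tweetnlp.load_model("question_answer_generation") and runs the prediction by passing a context or a list of contexts to the question_answer_generation function (see the paper or the Hugging Face dataset page).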

import tweetnlp

# MODEL
model = tweetnlp.load_model('question_answer_generation')  # Or `model = tweetnlp.QuestionAnswerGeneration()`
model.question_answer_generation(
  text="'So much of The Post is Ben,' Mrs. Graham said in 1994, three years after Bradlee retired as editor. 'He created it as we know it today.'— Ed O'Keefe (@edatpost) October 21, 2014"
)  # Or `model.predict`
>>> [
    {'question': 'who created the post?', 'answer': 'ben'},
    {'question': 'what did ben do in 1994?', 'answer': 'he retired as editor'}
]

# GET DATASET
dataset = tweetnlp.load_dataset('question_answer_generation')

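Language Modeling

The masked language model predicts the masked token in a given sentence. It is instantiated by tweetnlp.load_model('language_model') and runs the prediction by passing a text or a list of texts to the mask_prediction function. Make sure that each text contains a <mask> token, since that is the token the model is asked to predict.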

import tweetnlp
model = tweetnlp.load_model('language_model')  # Or `model = tweetnlp.LanguageModel()`
model.mask_prediction("How many more <mask> until opening day? ?", best_n=2)  # Or `model.predict`
>>> {'best_tokens': ['days', 'hours'],
 'best_scores': [5.498564104033932e-11, 4.906026140893971e-10],
 'best_sentences': ['How many more days until opening day? ?',
  'How many more hours until opening day? ?']}
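
Because every input needs a <mask> token, a small pre-check can catch malformed inputs before the model is called. The sketch below is a hypothetical helper, not part of tweetnlp. It assumes that exactly one <mask> per text is wanted, which is stricter than the requirement stated above.

```python
MASK_TOKEN = "<mask>"


def check_mask(texts):
    """Accept a string or a list of strings and verify the mask token.

    Returns the inputs as a list; raises ValueError if any text does not
    contain exactly one mask token (an assumption made for this sketch).
    """
    if isinstance(texts, str):
        texts = [texts]
    for text in texts:
        count = text.count(MASK_TOKEN)
        if count != 1:
            raise ValueError(
                f"expected exactly one {MASK_TOKEN} token, found {count}: {text!r}"
            )
    return texts


checked = check_mask("How many more <mask> until opening day?")
print(checked)
```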

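Tweet Embedding

The tweet embedding model produces a fixed-length embedding for a tweet. The embedding represents the semantics of the tweet, so the similarity between embeddings can be used for semantic search over tweets. The model is instantiated by tweetnlp.load_model('sentence_embedding') and runs the prediction by passing a text or a list of texts to the embedding function.

Get Embedding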

import tweetnlp
model = tweetnlp.load_model('sentence_embedding')  # Or `model = tweetnlp.SentenceEmbedding()`

# Get sentence embedding
tweet = "I will never understand the decision making of the people of Alabama. Their new Senator is a definite downgrade. You have served with honor.  Well done."
vectors = model.embedding(tweet)
vectors.shape
>>> (768,)

# Get sentence embedding (multiple inputs)
tweet_corpus = [
    "Free, fair elections are the lifeblood of our democracy. Charges of unfairness are serious. But calling an election unfair does not make it so. Charges require specific allegations and then proof. We have neither here." ,
    "Trump appointed judge Stephanos Bibas " ,
    "If your members can go to Puerto Rico they can get their asses back in the classroom. @CTULocal1" ,
    "@PolitiBunny @CTULocal1 Political leverage, science said schools could reopen, teachers and unions protested to keep'em closed and made demands for higher wages and benefits, they're usin Covid as a crutch at the expense of life and education." ,
    "Congratulations to all the exporters on achieving record exports in Dec 2020 with a growth of 18 % over the previous year. Well done &amp; keep up this trend. A major pillar of our govt's economic policy is export enhancement &amp; we will provide full support to promote export culture." ,
    "@ImranKhanPTI Pakistan seems a worst country in term of exporting facilities. I am a small business man and if I have to export a t-shirt having worth of $5 to USA or Europe. Postal cost will be around $30. How can we grow as an exporting country if this situation prevails. Think about it. #PM" ,
    "The thing that doesn’t sit right with me about “nothing good happened in 2020” is that it ignores the largest protest movement in our history. The beautiful, powerful Black Lives Matter uprising reached every corner of the country and should be central to our look back at 2020." ,
    "@JoshuaPotash I kinda said that in the 2020 look back for @washingtonpost" ,
    "Is this a confirmation from Q that Lin is leaking declassified intelligence to the public? I believe so. If @realDonaldTrump didn’t approve of what @LLinWood is doing he would have let us know a lonnnnnng time ago. I’ve always wondered why Lin’s Twitter handle started with “LLin” https://t.co/0G7zClOmi2" ,
    "@ice_qued @realDonaldTrump @LLinWood Yeah 100%" ,
    "Tomorrow is my last day as Senator from Alabama.  I believe our opportunities are boundless when we find common ground. As we swear in a new Congress &amp; a new President, demand from them that they do just that &amp; build a stronger, more just society.  It’s been an honor to serve you."  # Note: no trailing comma here, so Python joins this string with the next one and the corpus has 12 entries, not 13
    "The mask cult can’t ever admit masks don’t work because their ideology is based on feeling like a “good person”  Wearing a mask makes them a “good person” &amp; anyone who disagrees w/them isn’t  They can’t tolerate any idea that makes them feel like their self-importance is unearned" ,
    "@ianmSC Beyond that, they put such huge confidence in masks so early with no strong evidence that they have any meaningful benefit, they don’t want to backtrack or admit they were wrong. They put the cart before the horse, now desperate to find any results that match their hypothesis." ,
]
vectors = model.embedding(tweet_corpus, batch_size=4)
vectors.shape
>>> (12, 768)

Similarity Search

sims = []
for n, i in enumerate(tweet_corpus):
  _sim = model.similarity(tweet, i)
  sims.append([n, _sim])
print(f'anchor tweet: {tweet}\n')
for m, (n, s) in enumerate(sorted(sims, key=lambda x: x[1], reverse=True)[:3]):
  print(f' - top {m}: {tweet_corpus[n]}\n - similarity: {s}\n')

>>> anchor tweet: I will never understand the decision making of the people of Alabama. Their new Senator is a definite downgrade. You have served with honor.  Well done.

 - top 0: Tomorrow is my last day as Senator from Alabama.  I believe our opportunities are boundless when we find common ground. As we swear in a new Congress &amp; a new President, demand from them that they do just that &amp; build a stronger, more just society.  It’s been an honor to serve you.The mask cult can’t ever admit masks don’t work because their ideology is based on feeling like a “good person”  Wearing a mask makes them a “good person” &amp; anyone who disagrees w/them isn’t  They can’t tolerate any idea that makes them feel like their self-importance is unearned
 - similarity: 0.7480925982953287

 - top 1: Trump appointed judge Stephanos Bibas 
 - similarity: 0.6289173306344258

 - top 2: Free, fair elections are the lifeblood of our democracy. Charges of unfairness are serious. But calling an election unfair does not make it so. Charges require specific allegations and then proof. We have neither here.
 - similarity: 0.6017154109745276
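
The ranking step can also be written directly on embedding vectors. The sketch below uses cosine similarity on toy 3-dimensional vectors that stand in for the 768-dimensional embeddings. Whether model.similarity uses cosine similarity internally is an assumption, and `top_matches` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_matches(anchor, corpus_vectors, k=3):
    """Return (index, similarity) pairs for the k vectors closest to the anchor."""
    scored = [(i, cosine_similarity(anchor, v)) for i, v in enumerate(corpus_vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors, invented for illustration only.
anchor = [1.0, 0.0, 0.0]
corpus = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]

matches = top_matches(anchor, corpus, k=2)
print(matches)
```

Cosine similarity ignores vector length, so the second toy vector matches the anchor perfectly even though it is twice as long.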

Resource & Custom Model Loading

Here is a table of the default model used in each task.

Task	Model	Dataset
Topic Classification (single-label)	cardiffnlp/twitter-roberta-base-dec2021-tweet-topic-single-all	cardiffnlp/tweet_topic_single
Topic Classification (multi-label)	cardiffnlp/twitter-roberta-base-dec2021-tweet-topic-multi-all	cardiffnlp/tweet_topic_multi
Sentiment Analysis (multilingual)	cardiffnlp/twitter-xlm-roberta-base-sentiment	cardiffnlp/tweet_sentiment_multilingual
Sentiment Analysis	cardiffnlp/twitter-roberta-base-sentiment-latest	tweet_eval
Irony Detection	cardiffnlp/twitter-roberta-base-irony	tweet_eval
Hate Detection	cardiffnlp/twitter-roberta-base-hate-latest	tweet_eval
Offensive Detection	cardiffnlp/twitter-roberta-base-offensive	tweet_eval
Emoji Prediction	cardiffnlp/twitter-roberta-base-emoji	tweet_eval
Emotion Analysis (single-label)	cardiffnlp/twitter-roberta-base-emotion	tweet_eval
Emotion Analysis (multi-label)	cardiffnlp/twitter-roberta-base-emotion-multilabel-latest	TBA
Named Entity Recognition	tner/roberta-large-tweetner7-all	tner/tweetner7
Question Answering	lmqg/t5-small-tweetqa-qa	lmqg/qg_tweetqa
Question Answer Generation	lmqg/t5-base-tweetqa-qag	lmqg/qag_tweetqa
Language Modeling	cardiffnlp/twitter-roberta-base-2021-124m	TBA
Tweet Embedding	cambridgeltl/tweet-roberta-base-embeddings-v1	TBA

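To use another model from a local path or the Hugging Face model hub, simply pass the model path or alias to the load_model function. Below is an example that loads a different NER model.

import tweetnlp
tweetnlp.load_model('ner', model_name='tner/twitter-roberta-base-2019-90m-tweetner7-continuous')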

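Model Fine-tuning

TweetNLP provides an easy interface to fine-tune language models on the datasets it supports, with model hosting on the Hugging Face hub and hyperparameter search via Ray Tune.

Supported tasks: sentiment, offensive, irony, hate, emotion, topic_classification

The results of the experiments with the tweetnlp trainer can be found in the following table. The results are competitive and can be used as baselines for each task. See the leaderboard page for more details about the results.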

task	language_model	eval_f1	eval_f1_macro	eval_accuracy	link
emoji	cardiffnlp/twitter-roberta-base-2021-124m	0.46	0.35	0.46	cardiffnlp/twitter-roberta-base-2021-124m-emoji
emotion	cardiffnlp/twitter-roberta-base-2021-124m	0.83	0.79	0.83	cardiffnlp/twitter-roberta-base-2021-124m-emotion
hate	cardiffnlp/twitter-roberta-base-2021-124m	0.56	0.53	0.56	cardiffnlp/twitter-roberta-base-2021-124m-hate
irony	cardiffnlp/twitter-roberta-base-2021-124m	0.79	0.78	0.79	cardiffnlp/twitter-roberta-base-2021-124m-irony
offensive	cardiffnlp/twitter-roberta-base-2021-124m	0.86	0.82	0.86	cardiffnlp/twitter-roberta-base-2021-124m-offensive
sentiment	cardiffnlp/twitter-roberta-base-2021-124m	0.71	0.72	0.71	cardiffnlp/twitter-roberta-base-2021-124m-sentiment
topic_classification (single)	cardiffnlp/twitter-roberta-base-2021-124m	0.9	0.8	0.9	cardiffnlp/twitter-roberta-base-2021-124m-topic-single
topic_classification (multi)	cardiffnlp/twitter-roberta-base-2021-124m	0.75	0.56	0.54	cardiffnlp/twitter-roberta-base-2021-124m-topic-multi
sentiment (multilingual)	cardiffnlp/twitter-xlm-roberta-base	0.69	0.69	0.69	cardiffnlp/twitter-xlm-roberta-base-sentiment-multilingual
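
In the single-label rows, eval_f1 and eval_accuracy coincide, which is what one would expect if eval_f1 is micro-averaged. That reading is an assumption, not something the table states. The sketch below computes micro F1, macro F1, and accuracy on invented labels to show how the three metrics relate.

```python
def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1, accuracy) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = {label: 0 for label in labels}
    fp = {label: 0 for label in labels}
    fn = {label: 0 for label in labels}
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative

    def f1(t, p, n):
        denominator = 2 * t + p + n
        return 2 * t / denominator if denominator else 0.0

    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    accuracy = sum(tp.values()) / len(y_true)
    return micro, macro, accuracy


# Invented, imbalanced toy data: one 'irony' example is missed.
y_true = ["irony", "irony", "non_irony", "non_irony", "non_irony", "non_irony"]
y_pred = ["irony", "non_irony", "non_irony", "non_irony", "non_irony", "non_irony"]

micro, macro, accuracy = f1_scores(y_true, y_pred)
print(micro, macro, accuracy)
```

For single-label data, micro F1 equals accuracy, while macro F1 weights every class equally and therefore drops when a rare class is predicted poorly.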

Example

The following example reproduces our irony model cardiffnlp/twitter-roberta-base-2021-124m-irony.

import logging
import tweetnlp

logging.basicConfig(format='%(asctime)s %(levelname)-8s %(message)s', level=logging.INFO, datefmt='%Y-%m-%d %H:%M:%S')

# load dataset
dataset, label_to_id = tweetnlp.load_dataset("irony")
# load trainer class
trainer_class = tweetnlp.load_trainer("irony")
# setup trainer
trainer = trainer_class(
    language_model='cardiffnlp/twitter-roberta-base-2021-124m',  # language model to fine-tune
    dataset=dataset,
    label_to_id=label_to_id,
    max_length=128,
    split_test='test',
    split_train='train',
    split_validation='validation',
    output_dir='model_ckpt/irony'
)
# start model fine-tuning with parameter optimization
trainer.train(
  eval_step=50,  # every `eval_step` steps, the model is validated on the validation set
  n_trials=10,  # number of trials for parameter optimization
  search_range_lr=[1e-6, 1e-4],  # search space for the learning rate (min and max value)
  search_range_epoch=[1, 6],  # search space for the number of epochs (min and max value)
  search_list_batch=[4, 8, 16, 32, 64]  # search space for the batch size (list of integers to test)
)
# evaluate model on the test set
trainer.evaluate()
>>> {
  "eval_loss": 1.3228046894073486,
  "eval_f1": 0.7959183673469388,
  "eval_f1_macro": 0.791350632069195,
  "eval_accuracy": 0.7959183673469388,
  "eval_runtime": 2.2267,
  "eval_samples_per_second": 352.084,
  "eval_steps_per_second": 44.01
}
# save model locally (saved at `{output_dir}/best_model` by default)
trainer.save_model()
# run prediction
trainer.predict('If you wanna look like a badass, have drama on social media')
>>> {'label': 'irony'}
# push your model to the Hugging Face hub
trainer.push_to_hub(hf_organization='cardiffnlp', model_alias='twitter-roberta-base-2021-124m-irony')

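The saved checkpoint can be loaded as a custom model as shown below.

import tweetnlp
model = tweetnlp.load_model('irony', model_name="model_ckpt/irony/best_model")

If split_validation is not given, the trainer does a single run with default parameters, without parameter search.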
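
To make the search space in the fine-tuning example concrete, the sketch below draws random configurations from the same ranges. It is only an illustration of what such a search space describes, not how the tweetnlp trainer or Ray Tune samples. The log-uniform learning-rate draw, the inclusive epoch bounds, and the `sample_trial` helper are all assumptions.

```python
import math
import random


def sample_trial(rng, search_range_lr, search_range_epoch, search_list_batch):
    """Draw one hyperparameter configuration from the search space."""
    lr_min, lr_max = search_range_lr
    # Log-uniform draw, a common choice for learning rates (assumed here).
    learning_rate = math.exp(rng.uniform(math.log(lr_min), math.log(lr_max)))
    epoch_min, epoch_max = search_range_epoch
    return {
        "learning_rate": learning_rate,
        "num_train_epochs": rng.randint(epoch_min, epoch_max),  # inclusive bounds
        "batch_size": rng.choice(search_list_batch),
    }


rng = random.Random(0)  # fixed seed so the sketch is repeatable
trials = [
    sample_trial(rng, [1e-6, 1e-4], [1, 6], [4, 8, 16, 32, 64])
    for _ in range(10)  # mirrors n_trials=10 in the example
]
for trial in trials[:3]:
    print(trial)
```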

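Reference Paper

For more details, please read the accompanying TweetNLP reference paper. If you use TweetNLP in your research, please cite the reference paper using the following bib entry: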

@inproceedings{camacho-collados-etal-2022-tweetnlp,
    title = "{T}weet{NLP}: {C}utting-{E}dge {N}atural {L}anguage {P}rocessing for {S}ocial {M}edia",
    author = "Camacho-Collados, Jose and Rezaee, Kiamehr and Riahi, Talayeh and Ushio, Asahi and Loureiro, Daniel and Antypas, Dimosthenis and Boisson, Joanne and Espinosa-Anke, Luis and Liu, Fangyu and Mart{\'i}nez-C{\'a}mara, Eugenio and others",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = nov,
    year = "2022",
    address = "Abu Dhabi, U.A.E.",
    publisher = "Association for Computational Linguistics",
}
