TweetNLP

TweetNLP is for all NLP enthusiasts working on Twitter and social media! The Python library tweetnlp provides a collection of useful tools to analyze and understand tweets, such as sentiment analysis, emoji prediction, and named-entity recognition, powered by state-of-the-art language modeling specialized for social media.

News (December 2022): We presented a TweetNLP demo paper ("TweetNLP: Cutting-Edge Natural Language Processing for Social Media") at EMNLP 2022. The final version can be found here.

TweetNLP Hugging Face page: all the main TweetNLP models can be found here on Hugging Face.

Resources:

Quick tour with Colab notebook:
Play with the TweetNLP online demo: link
EMNLP 2022 paper: link
2nd Cardiff NLP Summer Workshop tutorial:
2nd Cardiff NLP Summer Workshop tutorial (solutions):

Table of contents:

Load Model & Dataset
Fine-tune Model

Get Started

Install tweetnlp via pip in your console.

pip install tweetnlp

Model & Dataset

In this section, you will learn how to get the models and datasets with tweetnlp. The models follow the Hugging Face model format and the datasets are in the Hugging Face datasets format. Easy introductions to Hugging Face models and datasets can be found on the Hugging Face website, so please check them if you are new to Hugging Face.

Tweet Classification

The classification module consists of seven different tasks (topic classification, sentiment analysis, irony detection, hate speech detection, offensive language detection, emoji prediction, and emotion analysis). In each example, the model is instantiated by tweetnlp.load_model("task-name"), and prediction is run by passing a text or a list of texts as the argument to the corresponding function.

Topic Classification: The aim of this task is, given a tweet, to assign topics related to its content. The task is framed as a supervised multi-label classification problem, where each tweet is assigned one or more topics from a total of 19 available topics. The topics were carefully curated based on Twitter trends, with the aim of being broad and general, and consist of classes such as arts and culture, music, or sports. Our internally annotated dataset contains over 10K manually labeled tweets (check the paper here, or the Hugging Face dataset page).

import tweetnlp

# MULTI-LABEL MODEL
model = tweetnlp.load_model('topic_classification')  # Or `model = tweetnlp.TopicClassification()`
model.topic("Jacob Collier is a Grammy-awarded English artist from London.")  # Or `model.predict`
>>> {'label': ['celebrity_&_pop_culture', 'music']}
# Note: the probability of the multi-label model is the output of a sigmoid function on the binary prediction of whether each topic is positive or negative.
model.topic("Jacob Collier is a Grammy-awarded English artist from London.", return_probability=True)
>>> {'label': ['celebrity_&_pop_culture', 'music'],
 'probability': {'arts_&_culture': 0.037371691316366196,
  'business_&_entrepreneurs': 0.010188567452132702,
  'celebrity_&_pop_culture': 0.92448890209198,
  'diaries_&_daily_life': 0.03425711765885353,
  'family': 0.00796138122677803,
  'fashion_&_style': 0.020642118528485298,
  'film_tv_&_video': 0.08062587678432465,
  'fitness_&_health': 0.006343095097690821,
  'food_&_dining': 0.0042883665300905704,
  'gaming': 0.004327300935983658,
  'learning_&_educational': 0.010652057826519012,
  'music': 0.8291937112808228,
  'news_&_social_concern': 0.24688217043876648,
  'other_hobbies': 0.020671198144555092,
  'relationships': 0.020371075719594955,
  'science_&_technology': 0.0170074962079525,
  'sports': 0.014291072264313698,
  'travel_&_adventure': 0.010423899628221989,
  'youth_&_student_life': 0.008605164475739002}}

# SINGLE-LABEL MODEL
model = tweetnlp.load_model('topic_classification', multi_label=False)  # Or `model = tweetnlp.TopicClassification(multi_label=False)`
model.topic("Jacob Collier is a Grammy-awarded English artist from London.")
>>> {'label': 'pop_culture'}
# Note: the probability of the single-label model is the softmax over the labels.
model.topic("Jacob Collier is a Grammy-awarded English artist from London.", return_probability=True)
>>> {'label': 'pop_culture',
 'probability': {'arts_&_culture': 9.20625461731106e-05,
  'business_&_entrepreneurs': 6.916998972883448e-05,
  'pop_culture': 0.9995898604393005,
  'daily_life': 0.00011083036952186376,
  'sports_&_gaming': 8.668467489769682e-05,
  'science_&_technology': 5.152115045348182e-05}}

# GET DATASET
dataset_multi_label, label2id_multi_label = tweetnlp.load_dataset('topic_classification')
dataset_single_label, label2id_single_label = tweetnlp.load_dataset('topic_classification', multi_label=False)
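
The notes in the example above distinguish sigmoid-based probabilities (multi-label) from softmax-based probabilities (single-label). The following is a minimal standalone sketch of that difference, not the library's implementation: the logits are made up, and the 0.5 decision threshold for the multi-label case is an assumption.

```python
import math


def sigmoid(x: float) -> float:
    """Independent probability for one label."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits: list[float]) -> list[float]:
    """Probabilities over mutually exclusive labels (sums to 1)."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multi_label_predict(logits: dict[str, float], threshold: float = 0.5) -> dict:
    """Each label is scored independently; several labels can be returned."""
    probs = {label: sigmoid(v) for label, v in logits.items()}
    labels = [label for label, p in probs.items() if p >= threshold]  # threshold is an assumption
    return {"label": labels, "probability": probs}


def single_label_predict(logits: dict[str, float]) -> dict:
    """Labels compete with each other; exactly one label is returned."""
    names = list(logits)
    probs = dict(zip(names, softmax([logits[n] for n in names])))
    return {"label": max(probs, key=probs.get), "probability": probs}


# Made-up logits for illustration only.
toy_logits = {"music": 2.0, "sports": -3.0, "celebrity_&_pop_culture": 1.0}
multi = multi_label_predict(toy_logits)
single = single_label_predict(toy_logits)
print(multi["label"])
print(single["label"])
```

In the multi-label case the probabilities need not sum to one, which is why two topics can both score above 0.8 in the output shown earlier.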

Sentiment Analysis: The sentiment analysis task integrated in TweetNLP is a simplified version in which the goal is to predict the sentiment of a tweet with one of three labels: positive, neutral, or negative. The base dataset for English is the unified TweetEval version of the SemEval-2017 dataset from the task on Sentiment Analysis in Twitter (check the paper here).

import tweetnlp

# ENGLISH MODEL
model = tweetnlp.load_model('sentiment')  # Or `model = tweetnlp.Sentiment()`
model.sentiment("Yes, including Medicare and social security saving?")  # Or `model.predict`
>>> {'label': 'positive'}
model.sentiment("Yes, including Medicare and social security saving?", return_probability=True)
>>> {'label': 'positive', 'probability': {'negative': 0.004584966693073511, 'neutral': 0.19360853731632233, 'positive': 0.8018065094947815}}

# MULTILINGUAL MODEL
model = tweetnlp.load_model('sentiment', multilingual=True)  # Or `model = tweetnlp.Sentiment(multilingual=True)`
model.sentiment("天気が良いとやっぱり気持ち良いなあ")
>>> {'label': 'positive'}
model.sentiment("天気が良いとやっぱり気持ち良いなあ", return_probability=True)
>>> {'label': 'positive', 'probability': {'negative': 0.028369612991809845, 'neutral': 0.08128828555345535, 'positive': 0.8903420567512512}}

# GET DATASET (ENGLISH)
dataset, label2id = tweetnlp.load_dataset('sentiment')
# GET DATASET (MULTILINGUAL)
for l in ['all', 'arabic', 'english', 'french', 'german', 'hindi', 'italian', 'portuguese', 'spanish']:
    dataset_multilingual, label2id_multilingual = tweetnlp.load_dataset('sentiment', multilingual=True, task_language=l)

Irony Detection: This is a binary classification task where, given a tweet, the goal is to detect whether it is ironic or not. It is based on the Irony Detection dataset from the SemEval 2018 task (check the paper here).

import tweetnlp

# MODEL
model = tweetnlp.load_model('irony')  # Or `model = tweetnlp.Irony()`
model.irony('If you wanna look like a badass, have drama on social media')  # Or `model.predict`
>>> {'label': 'irony'}
model.irony('If you wanna look like a badass, have drama on social media', return_probability=True)
>>> {'label': 'irony', 'probability': {'non_irony': 0.08390884101390839, 'irony': 0.9160911440849304}}

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('irony')

Hate Speech Detection: The hate speech detection task consists of detecting whether a tweet is hateful towards a target community. The underlying model is based on a suite of unified hate speech detection datasets (see the reference paper).

import tweetnlp

# MODEL
model = tweetnlp.load_model('hate')  # Or `model = tweetnlp.Hate()`
model.hate('Whoever just unfollowed me you a bitch')  # Or `model.predict`
>>> {'label': 'not-hate'}
model.hate('Whoever just unfollowed me you a bitch', return_probability=True)
>>> {'label': 'non-hate', 'probability': {'non-hate': 0.7263831496238708, 'hate': 0.27361682057380676}}

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('hate')

Offensive Language Identification: This task consists of identifying whether some form of offensive language is present in a tweet. For our benchmark, we rely on the SemEval-2019 OffensEval dataset (check the paper here).

import tweetnlp

# MODEL
model = tweetnlp.load_model('offensive')  # Or `model = tweetnlp.Offensive()`
model.offensive("All two of them taste like ass.")  # Or `model.predict`
>>> {'label': 'offensive'}
model.offensive("All two of them taste like ass.", return_probability=True)
>>> {'label': 'offensive', 'probability': {'non-offensive': 0.16420328617095947, 'offensive': 0.8357967734336853}}

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('offensive')

Emoji Prediction: The goal of emoji prediction is to predict the final emoji of a given tweet. The dataset used to fine-tune our models is the TweetEval adaptation of the SemEval 2018 task on Emoji Prediction (check the paper here), including 20 emoji as labels (❤, 😍, 😂, 💕, 🔥, 😊, 😎, ✨, 💙, 😘, 📷, 🇺🇸, ☀, 💜, 😉, 💯, 😁, 🎄, 📸, 😜).

import tweetnlp

# MODEL
model = tweetnlp.load_model('emoji')  # Or `model = tweetnlp.Emoji()`
model.emoji('Beautiful sunset last night from the pontoon @TupperLakeNY')  # Or `model.predict`
>>> {'label': '📷'}
model.emoji('Beautiful sunset last night from the pontoon @TupperLakeNY', return_probability=True)
>>> {'label': '📷',
 'probability': {'❤': 0.13197319209575653,
  '😍': 0.11246423423290253,
  '😂': 0.008415069431066513,
  '💕': 0.04842926934361458,
  '🔥': 0.014528146013617516,
  '😊': 0.1509675830602646,
  '😎': 0.08625403046607971,
  '✨': 0.01616635173559189,
  '💙': 0.07396604865789413,
  '😘': 0.03033279813826084,
  '📷': 0.16525287926197052,
  '🇺🇸': 0.020336611196398735,
  '☀': 0.00799981877207756,
  '💜': 0.016111424192786217,
  '😉': 0.012984540313482285,
  '💯': 0.012557178735733032,
  '😁': 0.031386848539114,
  '🎄': 0.006829539313912392,
  '📸': 0.04188741743564606,
  '😜': 0.011156936176121235}}

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('emoji')

Emotion Recognition: Given a tweet, this task consists of associating it with its most appropriate emotion. As a reference dataset we use the SemEval 2018 task on Affect in Tweets (check the paper here). The latest multi-label model includes eleven emotion types.

import tweetnlp

# MULTI-LABEL MODEL
model = tweetnlp.load_model('emotion')  # Or `model = tweetnlp.Emotion()`
model.emotion('I love swimming for the same reason I love meditating...the feeling of weightlessness.')  # Or `model.predict`
>>> {'label': 'joy'}
# Note: the probability of the multi-label model is the output of a sigmoid function on the binary prediction of whether each emotion is positive or negative.
model.emotion('I love swimming for the same reason I love meditating...the feeling of weightlessness.', return_probability=True)
>>> {'label': 'joy',
 'probability': {'anger': 0.00025800734874792397,
  'anticipation': 0.0005329723935574293,
  'disgust': 0.00026112011983059347,
  'fear': 0.00027552215033210814,
  'joy': 0.7721399068832397,
  'love': 0.1806265264749527,
  'optimism': 0.04208092764019966,
  'pessimism': 0.00025325192837044597,
  'sadness': 0.0006160663324408233,
  'surprise': 0.0005619609728455544,
  'trust': 0.002393839880824089}}

# SINGLE-LABEL MODEL
model = tweetnlp.load_model('emotion')  # Or `model = tweetnlp.Emotion()`
model.emotion('I love swimming for the same reason I love meditating...the feeling of weightlessness.')  # Or `model.predict`
>>> {'label': 'joy'}
# Note: the probability of the single-label model is the softmax over the labels.
model.emotion('I love swimming for the same reason I love meditating...the feeling of weightlessness.', return_probability=True)
>>> {'label': 'optimism', 'probability': {'joy': 0.01367587223649025, 'optimism': 0.7345258593559265, 'anger': 0.1770714670419693, 'sadness': 0.07472680509090424}}

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('emotion')

Warning: The single-label and multi-label emotion models have different label sets (single-label has four classes, 'joy'/'optimism'/'anger'/'sadness', while multi-label has eleven classes, 'joy'/'optimism'/'anger'/'sadness'/'love'/'trust'/'fear'/'surprise'/'anticipation'/'disgust'/'pessimism').

Named Entity Recognition

This module consists of a named-entity recognition (NER) model specifically trained for tweets. The model is instantiated by tweetnlp.load_model("ner"), and prediction is run by giving a text or a list of texts as the argument to the ner function (check the paper here, or the Hugging Face dataset page).

import tweetnlp

# MODEL
model = tweetnlp.load_model('ner')  # Or `model = tweetnlp.NER()`
model.ner('Jacob Collier is a Grammy-awarded English artist from London.')  # Or `model.predict`
>>> [{'type': 'person', 'entity': 'Jacob Collier'}, {'type': 'event', 'entity': ' Grammy'}, {'type': 'location', 'entity': ' London'}]
# Note: the probability for the predicted entity is the mean of the probabilities over the sub-tokens representing the entity.
model.ner('Jacob Collier is a Grammy-awarded English artist from London.', return_probability=True)  # Or `model.predict`
>>> [
  {'type': 'person', 'entity': 'Jacob Collier', 'probability': 0.9905318220456442},
  {'type': 'event', 'entity': ' Grammy', 'probability': 0.19164378941059113},
  {'type': 'location', 'entity': ' London', 'probability': 0.9607000350952148}
]

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('ner')
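
The note in the NER example says that an entity's probability is the mean of the probabilities of its sub-tokens. A minimal sketch of that aggregation follows; the sub-token split and the per-token probabilities are hypothetical values chosen for illustration, not output of the library.

```python
from statistics import mean


def entity_probability(sub_token_probs: list[float]) -> float:
    """Average the per-sub-token probabilities of one predicted entity."""
    if not sub_token_probs:
        raise ValueError("an entity needs at least one sub-token")
    return mean(sub_token_probs)


# Hypothetical sub-tokens of "Jacob Collier" with made-up probabilities.
sub_tokens = {"Jacob": 0.99, "Col": 0.98, "lier": 0.97}
prob = entity_probability(list(sub_tokens.values()))
print({"type": "person", "entity": "Jacob Collier", "probability": round(prob, 2)})
```

One consequence of averaging is that a single low-confidence sub-token pulls down the score of the whole entity.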

Question Answering

This module consists of a question answering model specifically trained for tweets. The model is instantiated by tweetnlp.load_model("question_answering"), and prediction is run by giving a question or a list of questions along with a context or a list of contexts as the arguments to the question_answering function (check the paper here, or the Hugging Face dataset page).

import tweetnlp

# MODEL
model = tweetnlp.load_model('question_answering')  # Or `model = tweetnlp.QuestionAnswering()`
model.question_answering(
    question='who created the post as we know it today?',
    context="'So much of The Post is Ben,' Mrs. Graham said in 1994, three years after Bradlee retired as editor. 'He created it as we know it today.'— Ed O'Keefe (@edatpost) October 21, 2014"
)  # Or `model.predict`
>>> {'generated_text': 'ben'}

# GET DATASET
dataset = tweetnlp.load_dataset('question_answering')

Question Answer Generation

This module consists of a question and answer pair generation model specifically trained for tweets. The model is instantiated by tweetnlp.load_model("question_answer_generation"), and prediction is run by giving a context or a list of contexts as the argument to the question_answer_generation function (check the paper here, or the Hugging Face dataset page).

import tweetnlp

# MODEL
model = tweetnlp.load_model('question_answer_generation')  # Or `model = tweetnlp.QuestionAnswerGeneration()`
model.question_answer_generation(
    text="'So much of The Post is Ben,' Mrs. Graham said in 1994, three years after Bradlee retired as editor. 'He created it as we know it today.'— Ed O'Keefe (@edatpost) October 21, 2014"
)  # Or `model.predict`
>>> [
    {'question': 'who created the post?', 'answer': 'ben'},
    {'question': 'what did ben do in 1994?', 'answer': 'he retired as editor'}
]

# GET DATASET
dataset = tweetnlp.load_dataset('question_answer_generation')

Language Modeling

The masked language model predicts the masked token in the given sentence. It is instantiated by tweetnlp.load_model('language_model'), and prediction is run by giving a text or a list of texts as the argument to the mask_prediction function. Make sure that each text has a <mask> token, since that token is the one the model predicts.

import tweetnlp
model = tweetnlp.load_model('language_model')  # Or `model = tweetnlp.LanguageModel()`
model.mask_prediction("How many more <mask> until opening day? ?", best_n=2)  # Or `model.predict`
>>> {'best_tokens': ['days', 'hours'],
 'best_scores': [5.498564104033932e-11, 4.906026140893971e-10],
 'best_sentences': ['How many more days until opening day? ?',
  'How many more hours until opening day? ?']}
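
Since every input needs a <mask> token, it can help to check texts before sending them to the model. The sketch below is a standalone helper, not part of tweetnlp: the function names are hypothetical, and requiring exactly one mask per text is an assumption. It also shows how the best_sentences entries in the output above relate to best_tokens, namely by substituting each token for the mask.

```python
MASK_TOKEN = "<mask>"


def validate_masked_texts(texts):
    """Return the inputs as a list; raise if a text does not hold exactly one mask token."""
    if isinstance(texts, str):
        texts = [texts]
    for text in texts:
        count = text.count(MASK_TOKEN)
        if count != 1:  # "exactly one" is an assumption of this sketch
            raise ValueError(f"expected exactly one {MASK_TOKEN} token, found {count}: {text!r}")
    return list(texts)


def fill_mask(text: str, token: str) -> str:
    """Build a completed sentence by substituting a candidate token for the mask."""
    return text.replace(MASK_TOKEN, token, 1)


texts = validate_masked_texts("How many more <mask> until opening day?")
candidates = ["days", "hours"]  # stand-ins for the model's best tokens
best_sentences = [fill_mask(texts[0], token) for token in candidates]
print(best_sentences)
```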

Tweet Embedding

The tweet embedding model produces a fixed-length embedding for a tweet. The embedding represents the semantics of the tweet, and it can be used for semantic search of tweets by using the similarity between embeddings. The model is instantiated by tweetnlp.load_model('sentence_embedding'), and prediction is run by passing a text or a list of texts as the argument to the embedding function.

Get Embedding

import tweetnlp
model = tweetnlp.load_model('sentence_embedding')  # Or `model = tweetnlp.SentenceEmbedding()`

# Get sentence embedding
tweet = "I will never understand the decision making of the people of Alabama. Their new Senator is a definite downgrade. You have served with honor.  Well done."
vectors = model.embedding(tweet)
vectors.shape
>>> (768,)

# Get sentence embedding (multiple inputs)
tweet_corpus = [
    "Free, fair elections are the lifeblood of our democracy. Charges of unfairness are serious. But calling an election unfair does not make it so. Charges require specific allegations and then proof. We have neither here.",
    "Trump appointed judge Stephanos Bibas ",
    "If your members can go to Puerto Rico they can get their asses back in the classroom. @CTULocal1",
    "@PolitiBunny @CTULocal1 Political leverage, science said schools could reopen, teachers and unions protested to keep'em closed and made demands for higher wages and benefits, they're usin Covid as a crutch at the expense of life and education.",
    "Congratulations to all the exporters on achieving record exports in Dec 2020 with a growth of 18 % over the previous year. Well done &amp; keep up this trend. A major pillar of our govt's economic policy is export enhancement &amp; we will provide full support to promote export culture.",
    "@ImranKhanPTI Pakistan seems a worst country in term of exporting facilities. I am a small business man and if I have to export a t-shirt having worth of $5 to USA or Europe. Postal cost will be around $30. How can we grow as an exporting country if this situation prevails. Think about it. #PM",
    "The thing that doesn’t sit right with me about “nothing good happened in 2020” is that it ignores the largest protest movement in our history. The beautiful, powerful Black Lives Matter uprising reached every corner of the country and should be central to our look back at 2020.",
    "@JoshuaPotash I kinda said that in the 2020 look back for @washingtonpost",
    "Is this a confirmation from Q that Lin is leaking declassified intelligence to the public? I believe so. If @realDonaldTrump didn’t approve of what @LLinWood is doing he would have let us know a lonnnnnng time ago. I’ve always wondered why Lin’s Twitter handle started with “LLin” https://t.co/0G7zClOmi2",
    "@ice_qued @realDonaldTrump @LLinWood Yeah 100%",
    # NOTE: the next string is not followed by a comma, so Python joins it with the string after it; the corpus therefore has 12 entries.
    "Tomorrow is my last day as Senator from Alabama.  I believe our opportunities are boundless when we find common ground. As we swear in a new Congress &amp; a new President, demand from them that they do just that &amp; build a stronger, more just society.  It’s been an honor to serve you."
    "The mask cult can’t ever admit masks don’t work because their ideology is based on feeling like a “good person”  Wearing a mask makes them a “good person” &amp; anyone who disagrees w/them isn’t  They can’t tolerate any idea that makes them feel like their self-importance is unearned",
    "@ianmSC Beyond that, they put such huge confidence in masks so early with no strong evidence that they have any meaningful benefit, they don’t want to backtrack or admit they were wrong. They put the cart before the horse, now desperate to find any results that match their hypothesis.",
]
vectors = model.embedding(tweet_corpus, batch_size=4)
vectors.shape
>>> (12, 768)

Similarity Search

sims = []
for n, i in enumerate(tweet_corpus):
    _sim = model.similarity(tweet, i)
    sims.append([n, _sim])
print(f'anchor tweet: {tweet}\n')
for m, (n, s) in enumerate(sorted(sims, key=lambda x: x[1], reverse=True)[:3]):
    print(f' - top {m}: {tweet_corpus[n]}\n - similarity: {s}\n')

>>> anchor tweet: I will never understand the decision making of the people of Alabama. Their new Senator is a definite downgrade. You have served with honor.  Well done.

 - top 0: Tomorrow is my last day as Senator from Alabama.  I believe our opportunities are boundless when we find common ground. As we swear in a new Congress &amp; a new President, demand from them that they do just that &amp; build a stronger, more just society.  It’s been an honor to serve you.The mask cult can’t ever admit masks don’t work because their ideology is based on feeling like a “good person”  Wearing a mask makes them a “good person” &amp; anyone who disagrees w/them isn’t  They can’t tolerate any idea that makes them feel like their self-importance is unearned
 - similarity: 0.7480925982953287

 - top 1: Trump appointed judge Stephanos Bibas 
 - similarity: 0.6289173306344258

 - top 2: Free, fair elections are the lifeblood of our democracy. Charges of unfairness are serious. But calling an election unfair does not make it so. Charges require specific allegations and then proof. We have neither here.
 - similarity: 0.6017154109745276
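
To make the ranking step concrete without loading a model, here is a minimal sketch of a top-k similarity search over precomputed embedding vectors. It assumes a cosine-style score; whether model.similarity uses cosine similarity is an assumption, and the three-dimensional vectors are toy values rather than real 768-dimensional embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(anchor: list[float], corpus: list[list[float]], k: int = 3) -> list[tuple[int, float]]:
    """Return (corpus index, score) pairs for the k vectors most similar to the anchor."""
    scored = [(i, cosine_similarity(anchor, vec)) for i, vec in enumerate(corpus)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy embeddings for illustration only.
anchor = [1.0, 0.0, 1.0]
corpus = [
    [1.0, 0.0, 0.9],    # nearly parallel to the anchor
    [0.0, 1.0, 0.0],    # orthogonal
    [0.5, 0.5, 0.5],    # partially aligned
    [-1.0, 0.0, -1.0],  # opposite direction
]
ranking = top_k(anchor, corpus, k=2)
for index, score in ranking:
    print(f" - corpus[{index}] similarity: {score:.4f}")
```

For a large corpus, embedding all tweets once with model.embedding and scoring the stored vectors avoids re-encoding the corpus for every query.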

Resources & Custom Model Loading

Here is a table of the default model used in each task.

Task	Model	Dataset
Topic Classification (single-label)	cardiffnlp/twitter-roberta-base-dec2021-tweet-topic-single-all	cardiffnlp/tweet_topic_single
Topic Classification (multi-label)	cardiffnlp/twitter-roberta-base-dec2021-tweet-topic-multi-all	cardiffnlp/tweet_topic_multi
Sentiment Analysis (multilingual)	cardiffnlp/twitter-xlm-roberta-base-sentiment	cardiffnlp/tweet_sentiment_multilingual
Sentiment Analysis	cardiffnlp/twitter-roberta-base-sentiment-latest	tweet_eval
Irony Detection	cardiffnlp/twitter-roberta-base-irony	tweet_eval
Hate Detection	cardiffnlp/twitter-roberta-base-hate-latest	tweet_eval
Offensive Detection	cardiffnlp/twitter-roberta-base-offensive	tweet_eval
Emoji Prediction	cardiffnlp/twitter-roberta-base-emoji	tweet_eval
Emotion Analysis (single-label)	cardiffnlp/twitter-roberta-base-emotion	tweet_eval
Emotion Analysis (multi-label)	cardiffnlp/twitter-roberta-base-emotion-multilabel-latest	TBA
Named Entity Recognition	tner/roberta-large-tweetner7-all	tner/tweetner7
Question Answering	lmqg/t5-small-tweetqa-qa	lmqg/qg_tweetqa
Question Answer Generation	lmqg/t5-base-tweetqa-qag	lmqg/qag_tweetqa
Language Modeling	cardiffnlp/twitter-roberta-base-2021-124m	TBA
Tweet Embedding	cambridgeltl/tweet-roberta-base-embeddings-v1	TBA

To use another model from a local path or the Hugging Face model hub, simply provide the model path/alias to the load_model function. Below is an example to load a model for NER.

import tweetnlp
tweetnlp.load_model('ner', model_name='tner/twitter-roberta-base-2019-90m-tweetner7-continuous')

Fine-tune Model

TweetNLP provides an easy interface to fine-tune language models on the supported datasets, with Hugging Face used for model hosting and fine-tuning and Ray Tune for parameter search.

Supported tasks: sentiment, offensive, irony, hate, emotion, topic_classification

The results of experiments with the tweetnlp trainer can be found in the following table. The results are competitive and can be used as baselines for each task. See the leaderboard page to learn more about the results.

task	language_model	eval_f1	eval_f1_macro	eval_accuracy	link
emoji	cardiffnlp/twitter-roberta-base-2021-124m	0.46	0.35	0.46	cardiffnlp/twitter-roberta-base-2021-124m-emoji
emotion	cardiffnlp/twitter-roberta-base-2021-124m	0.83	0.79	0.83	cardiffnlp/twitter-roberta-base-2021-124m-emotion
hate	cardiffnlp/twitter-roberta-base-2021-124m	0.56	0.53	0.56	cardiffnlp/twitter-roberta-base-2021-124m-hate
irony	cardiffnlp/twitter-roberta-base-2021-124m	0.79	0.78	0.79	cardiffnlp/twitter-roberta-base-2021-124m-irony
offensive	cardiffnlp/twitter-roberta-base-2021-124m	0.86	0.82	0.86	cardiffnlp/twitter-roberta-base-2021-124m-offensive
sentiment	cardiffnlp/twitter-roberta-base-2021-124m	0.71	0.72	0.71	cardiffnlp/twitter-roberta-base-2021-124m-sentiment
topic_classification (single)	cardiffnlp/twitter-roberta-base-2021-124m	0.9	0.8	0.9	cardiffnlp/twitter-roberta-base-2021-124m-topic-single
topic_classification (multi)	cardiffnlp/twitter-roberta-base-2021-124m	0.75	0.56	0.54	cardiffnlp/twitter-roberta-base-2021-124m-topic-multi
sentiment (multilingual)	cardiffnlp/twitter-xlm-roberta-base	0.69	0.69	0.69	cardiffnlp/twitter-xlm-roberta-base-sentiment-multilingual
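
The table reports both an F1 score and a macro F1 score. In the single-label rows the F1 column equals accuracy, which is what a micro-averaged F1 would give, so the sketch below assumes eval_f1 is micro-averaged; that reading is an assumption, not something the table states. The labels and predictions are made up for illustration.

```python
from collections import Counter


def micro_macro_f1(y_true: list[str], y_pred: list[str]) -> tuple[float, float]:
    """Compute micro- and macro-averaged F1 for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative

    def f1(t: int, p: int, n: int) -> float:
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))  # pools counts over labels
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)  # averages per-label F1
    return micro, macro


# Made-up gold labels and predictions.
y_true = ["irony", "irony", "irony", "non_irony"]
y_pred = ["irony", "irony", "non_irony", "non_irony"]
micro, macro = micro_macro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"micro F1: {micro:.4f}, macro F1: {macro:.4f}, accuracy: {accuracy:.4f}")
```

Macro F1 weights every class equally, so it drops below micro F1 when rare classes are predicted poorly, as in the emoji row of the table.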

Example

The following example reproduces our irony model cardiffnlp/twitter-roberta-base-2021-124m-irony.

import logging
import tweetnlp

logging.basicConfig(format='%(asctime)s %(levelname)-8s %(message)s', level=logging.INFO, datefmt='%Y-%m-%d %H:%M:%S')

# load dataset
dataset, label_to_id = tweetnlp.load_dataset("irony")
# load trainer class
trainer_class = tweetnlp.load_trainer("irony")
# setup trainer
trainer = trainer_class(
    language_model='cardiffnlp/twitter-roberta-base-2021-124m',  # language model to fine-tune
    dataset=dataset,
    label_to_id=label_to_id,
    max_length=128,
    split_test='test',
    split_train='train',
    split_validation='validation',
    output_dir='model_ckpt/irony'
)
# start model fine-tuning with parameter optimization
trainer.train(
    eval_step=50,  # every `eval_step` steps, models are validated on the validation set
    n_trials=10,  # number of trials for parameter optimization
    search_range_lr=[1e-6, 1e-4],  # define the search space for learning rate (min and max value)
    search_range_epoch=[1, 6],  # define the search space for epoch (min and max value)
    search_list_batch=[4, 8, 16, 32, 64]  # define the search space for batch size (list of integers to test)
)
# evaluate model on the test set
trainer.evaluate()
>>> {
  "eval_loss": 1.3228046894073486,
  "eval_f1": 0.7959183673469388,
  "eval_f1_macro": 0.791350632069195,
  "eval_accuracy": 0.7959183673469388,
  "eval_runtime": 2.2267,
  "eval_samples_per_second": 352.084,
  "eval_steps_per_second": 44.01
}
# save model locally (saved at `{output_dir}/best_model` as default)
trainer.save_model()
# run prediction
trainer.predict('If you wanna look like a badass, have drama on social media')
>>> {'label': 'irony'}
# push your model on huggingface hub
trainer.push_to_hub(hf_organization='cardiffnlp', model_alias='twitter-roberta-base-2021-124m-irony')

The saved checkpoint can be loaded as a custom model, as shown below.

import tweetnlp
model = tweetnlp.load_model('irony', model_name="model_ckpt/irony/best_model")

If split_validation is not given, the trainer will do a single run with default parameters without parameter search.
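
The search-space arguments passed to trainer.train in the example (a learning-rate range, an epoch range, and a list of batch sizes) can be pictured as a space from which each trial draws one configuration. The sketch below uses plain random search to illustrate this; it is not how Ray Tune or the tweetnlp trainer implement the search. The function name, the log-uniform learning-rate sampling, and the inclusive epoch bounds are all assumptions.

```python
import math
import random


def sample_trial(rng: random.Random, search_range_lr, search_range_epoch, search_list_batch) -> dict:
    """Draw one hyperparameter configuration from the search space."""
    lr_min, lr_max = search_range_lr
    # Log-uniform sampling spreads trials evenly across orders of magnitude (assumption).
    learning_rate = math.exp(rng.uniform(math.log(lr_min), math.log(lr_max)))
    epoch = rng.randint(*search_range_epoch)  # both bounds inclusive (assumption)
    batch_size = rng.choice(search_list_batch)
    return {"learning_rate": learning_rate, "epoch": epoch, "batch_size": batch_size}


rng = random.Random(0)  # fixed seed so the sketch is repeatable
trials = [
    sample_trial(rng, [1e-6, 1e-4], [1, 6], [4, 8, 16, 32, 64])
    for _ in range(10)  # mirrors n_trials=10 in the example
]
for trial in trials[:3]:
    print(trial)
```

In a real search, each sampled configuration would be trained and scored on the validation split, and the best-scoring one kept, which is why the parameter search needs split_validation.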

Reference Paper

For more details, please read the accompanying TweetNLP reference paper. If you use TweetNLP in your research, please use the following bib entry to cite the reference paper:

 @inproceedings{camacho-collados-etal-2022-tweetnlp,
    title={{T}weet{NLP}: {C}utting-{E}dge {N}atural {L}anguage {P}rocessing for {S}ocial {M}edia},
    author={Camacho-Collados, Jose and Rezaee, Kiamehr and Riahi, Talayeh and Ushio, Asahi and Loureiro, Daniel and Antypas, Dimosthenis and Boisson, Joanne and Espinosa-Anke, Luis and Liu, Fangyu and Mart{\'i}nez-C{\'a}mara, Eugenio and others},
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = nov,
    year = "2022",
    address = "Abu Dhabi, U.A.E.",
    publisher = "Association for Computational Linguistics",
}
