
TweetNLP

TweetNLP for all the NLP enthusiasts working on Twitter and social media! The Python library tweetnlp provides a collection of useful tools to analyze and understand tweets, such as sentiment analysis, emoji prediction, and named entity recognition, powered by state-of-the-art language modeling specialized on social media.

News (December 2022): We presented a TweetNLP demo paper ("TweetNLP: Cutting-Edge Natural Language Processing for Social Media") at EMNLP 2022. The final version can be found here.

TweetNLP Hugging Face page: all the main TweetNLP models can be found here on Hugging Face.

Resources:

Quick tour with Colab notebook:
Play with the TweetNLP online demo: link
EMNLP 2022 paper: link
2nd Cardiff NLP Summer Workshop tutorial:
2nd Cardiff NLP Summer Workshop tutorial (solutions):

Table of contents:

Load Model & Dataset
Fine-tune Model

Get Started

Install tweetnlp via pip on your console.

pip install tweetnlp

Model & Dataset

In this section, you will learn how to get the models and datasets with tweetnlp. The models follow the Hugging Face model format and the datasets are in the Hugging Face datasets format. Easy introductions to Hugging Face models and datasets can be found on the Hugging Face webpage, so please check them if you are new to Hugging Face.

Tweet Classification

The classification module consists of seven different tasks (topic classification, sentiment analysis, irony detection, hate speech detection, offensive language detection, emoji prediction, and emotion analysis). In each example, the model is instantiated by tweetnlp.load_model("task-name"), and runs the prediction by passing a text or a list of texts as argument to the corresponding function.

Topic Classification: the aim of this task is, given a tweet, to assign topics related to its content. The task is formed as a supervised multi-label classification problem where each tweet is assigned one or more topics from a total of 19 available topics. The topics were carefully curated based on Twitter trends with the aim of being broad and general, and consist of classes such as arts and culture, music, or sports. Our internally annotated dataset contains over 10k manually labeled tweets (check the paper here, or the Hugging Face dataset page).

import tweetnlp

# MULTI-LABEL MODEL
model = tweetnlp.load_model('topic_classification')  # Or `model = tweetnlp.TopicClassification()`
model.topic("Jacob Collier is a Grammy-awarded English artist from London.")  # Or `model.predict`
# >>> {'label': ['celebrity_&_pop_culture', 'music']}
# Note: the probability of the multi-label model is the output of a sigmoid function on the binary prediction of whether each topic is positive or negative.
model.topic("Jacob Collier is a Grammy-awarded English artist from London.", return_probability=True)
# >>> {'label': ['celebrity_&_pop_culture', 'music'],
#      'probability': {'arts_&_culture': 0.037371691316366196,
#       'business_&_entrepreneurs': 0.010188567452132702,
#       'celebrity_&_pop_culture': 0.92448890209198,
#       'diaries_&_daily_life': 0.03425711765885353,
#       'family': 0.00796138122677803,
#       'fashion_&_style': 0.020642118528485298,
#       'film_tv_&_video': 0.08062587678432465,
#       'fitness_&_health': 0.006343095097690821,
#       'food_&_dining': 0.0042883665300905704,
#       'gaming': 0.004327300935983658,
#       'learning_&_educational': 0.010652057826519012,
#       'music': 0.8291937112808228,
#       'news_&_social_concern': 0.24688217043876648,
#       'other_hobbies': 0.020671198144555092,
#       'relationships': 0.020371075719594955,
#       'science_&_technology': 0.0170074962079525,
#       'sports': 0.014291072264313698,
#       'travel_&_adventure': 0.010423899628221989,
#       'youth_&_student_life': 0.008605164475739002}}

# SINGLE-LABEL MODEL
model = tweetnlp.load_model('topic_classification', multi_label=False)  # Or `model = tweetnlp.TopicClassification(multi_label=False)`
model.topic("Jacob Collier is a Grammy-awarded English artist from London.")
# >>> {'label': 'pop_culture'}
# Note: the probability of the single-label model is the softmax over the labels.
model.topic("Jacob Collier is a Grammy-awarded English artist from London.", return_probability=True)
# >>> {'label': 'pop_culture',
#      'probability': {'arts_&_culture': 9.20625461731106e-05,
#       'business_&_entrepreneurs': 6.916998972883448e-05,
#       'pop_culture': 0.9995898604393005,
#       'daily_life': 0.00011083036952186376,
#       'sports_&_gaming': 8.668467489769682e-05,
#       'science_&_technology': 5.152115045348182e-05}}

# GET DATASET
dataset_multi_label, label2id_multi_label = tweetnlp.load_dataset('topic_classification')
dataset_single_label, label2id_single_label = tweetnlp.load_dataset('topic_classification', multi_label=False)
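
The two notes in the snippet above describe different normalizations. As a minimal sketch, assuming hypothetical logit values and a 0.5 decision threshold for the multi-label case (neither comes from TweetNLP), the difference can be illustrated in plain Python: sigmoid scores are independent per topic and need not sum to 1, while softmax scores form a single distribution over all topics.

```python
import math


def sigmoid_probabilities(logits):
    """Independent per-label probabilities, as in a multi-label classifier."""
    return {label: 1.0 / (1.0 + math.exp(-z)) for label, z in logits.items()}


def softmax_probabilities(logits):
    """Probabilities that sum to 1, as in a single-label classifier."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(z - peak) for label, z in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Hypothetical logits, chosen only for illustration.
logits = {"music": 1.6, "celebrity_&_pop_culture": 2.5, "sports": -4.2}

multi = sigmoid_probabilities(logits)
# Assumed decision rule: keep every label whose probability exceeds 0.5.
multi_labels = sorted(label for label, p in multi.items() if p > 0.5)

single = softmax_probabilities(logits)
# A single-label model reports only the most probable label.
single_label = max(single, key=single.get)

print(multi_labels)
print(single_label)
```

This is why the multi-label output can list several topics at once, whereas the single-label output always names exactly one.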

Sentiment Analysis: the sentiment analysis task integrated in TweetNLP is a simplified version where the goal is to predict the sentiment of a tweet with one of the three following labels: positive, neutral, or negative. The base dataset for English is the unified TweetEval version of the SemEval-2017 dataset from the task on Sentiment Analysis in Twitter (check the paper here).

import tweetnlp

# ENGLISH MODEL
model = tweetnlp.load_model('sentiment')  # Or `model = tweetnlp.Sentiment()`
model.sentiment("Yes, including Medicare and social security saving?")  # Or `model.predict`
# >>> {'label': 'positive'}
model.sentiment("Yes, including Medicare and social security saving?", return_probability=True)
# >>> {'label': 'positive', 'probability': {'negative': 0.004584966693073511, 'neutral': 0.19360853731632233, 'positive': 0.8018065094947815}}

# MULTILINGUAL MODEL
model = tweetnlp.load_model('sentiment', multilingual=True)  # Or `model = tweetnlp.Sentiment(multilingual=True)`
model.sentiment("天気が良いとやっぱり気持ち良いなあ")
# >>> {'label': 'positive'}
model.sentiment("天気が良いとやっぱり気持ち良いなあ", return_probability=True)
# >>> {'label': 'positive', 'probability': {'negative': 0.028369612991809845, 'neutral': 0.08128828555345535, 'positive': 0.8903420567512512}}

# GET DATASET (ENGLISH)
dataset, label2id = tweetnlp.load_dataset('sentiment')
# GET DATASET (MULTILINGUAL)
for language in ['all', 'arabic', 'english', 'french', 'german', 'hindi', 'italian', 'portuguese', 'spanish']:
    dataset_multilingual, label2id_multilingual = tweetnlp.load_dataset('sentiment', multilingual=True, task_language=language)
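
With return_probability=True the result carries the full distribution rather than only the winning label, which is handy when the runner-up matters. The sketch below ranks labels from a result dictionary shaped like the English sentiment output above; the helper name top_k_labels is hypothetical and not part of tweetnlp.

```python
def top_k_labels(prediction, k=2):
    """Return the k most probable (label, probability) pairs, best first."""
    ranked = sorted(
        prediction["probability"].items(),
        key=lambda item: item[1],
        reverse=True,
    )
    return ranked[:k]


# Result copied from the English sentiment example above.
prediction = {
    "label": "positive",
    "probability": {
        "negative": 0.004584966693073511,
        "neutral": 0.19360853731632233,
        "positive": 0.8018065094947815,
    },
}

top = top_k_labels(prediction, k=2)
for label, probability in top:
    print(f"{label}: {probability:.3f}")
```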

Irony Detection: this is a binary classification task where, given a tweet, the goal is to detect whether it is ironic or not. It is based on the Irony Detection dataset from the SemEval 2018 task (check the paper here).

import tweetnlp

# MODEL
model = tweetnlp.load_model('irony')  # Or `model = tweetnlp.Irony()`
model.irony('If you wanna look like a badass, have drama on social media')  # Or `model.predict`
# >>> {'label': 'irony'}
model.irony('If you wanna look like a badass, have drama on social media', return_probability=True)
# >>> {'label': 'irony', 'probability': {'non_irony': 0.08390884101390839, 'irony': 0.9160911440849304}}

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('irony')

Hate Speech Detection: the hate speech detection task consists of detecting whether a tweet is hateful towards a target community. The underlying model is based on a suite of unified hate speech detection datasets (see reference paper).

import tweetnlp

# MODEL
model = tweetnlp.load_model('hate')  # Or `model = tweetnlp.Hate()`
model.hate('Whoever just unfollowed me you a bitch')  # Or `model.predict`
# >>> {'label': 'not-hate'}
model.hate('Whoever just unfollowed me you a bitch', return_probability=True)
# >>> {'label': 'non-hate', 'probability': {'non-hate': 0.7263831496238708, 'hate': 0.27361682057380676}}

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('hate')

Offensive Language Identification: this task consists of identifying whether some form of offensive language is present in a tweet. For our benchmark, we rely on the SemEval-2019 OffensEval dataset (check the paper here).

import tweetnlp

# MODEL
model = tweetnlp.load_model('offensive')  # Or `model = tweetnlp.Offensive()`
model.offensive("All two of them taste like ass.")  # Or `model.predict`
# >>> {'label': 'offensive'}
model.offensive("All two of them taste like ass.", return_probability=True)
# >>> {'label': 'offensive', 'probability': {'non-offensive': 0.16420328617095947, 'offensive': 0.8357967734336853}}

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('offensive')

Emoji Prediction: the goal of emoji prediction is to predict the final emoji of a given tweet. The dataset used to fine-tune our models is the TweetEval adaptation of the SemEval 2018 task on Emoji Prediction (check the paper here), including 20 emoji as labels (❤, 😍, 😂, 💕, 🔥, 😊, 😎, ✨, 💙, 😘, 📷, 🇺🇸, ☀, 💜, 😉, 💯, 😁, 🎄, 📸, 😜).

import tweetnlp

# MODEL
model = tweetnlp.load_model('emoji')  # Or `model = tweetnlp.Emoji()`
model.emoji('Beautiful sunset last night from the pontoon @TupperLakeNY')  # Or `model.predict`
# >>> {'label': '📷'}
model.emoji('Beautiful sunset last night from the pontoon @TupperLakeNY', return_probability=True)
# >>> {'label': '📷',
#      'probability': {'❤': 0.13197319209575653,
#       '😍': 0.11246423423290253,
#       '😂': 0.008415069431066513,
#       '💕': 0.04842926934361458,
#       '🔥': 0.014528146013617516,
#       '😊': 0.1509675830602646,
#       '😎': 0.08625403046607971,
#       '✨': 0.01616635173559189,
#       '💙': 0.07396604865789413,
#       '😘': 0.03033279813826084,
#       '📷': 0.16525287926197052,
#       '🇺🇸': 0.020336611196398735,
#       '☀': 0.00799981877207756,
#       '💜': 0.016111424192786217,
#       '😉': 0.012984540313482285,
#       '💯': 0.012557178735733032,
#       '😁': 0.031386848539114,
#       '🎄': 0.006829539313912392,
#       '📸': 0.04188741743564606,
#       '😜': 0.011156936176121235}}

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('emoji')

Emotion Recognition: given a tweet, this task consists of associating it with its most appropriate emotion. As a reference dataset we use the SemEval 2018 task on Affect in Tweets (check the paper here). The latest multi-label model includes eleven emotion types.

import tweetnlp

# MULTI-LABEL MODEL
model = tweetnlp.load_model('emotion')  # Or `model = tweetnlp.Emotion()`
model.emotion('I love swimming for the same reason I love meditating...the feeling of weightlessness.')  # Or `model.predict`
# >>> {'label': 'joy'}
# Note: the probability of the multi-label model is the output of a sigmoid function on the binary prediction of whether each emotion is positive or negative.
model.emotion('I love swimming for the same reason I love meditating...the feeling of weightlessness.', return_probability=True)
# >>> {'label': 'joy',
#      'probability': {'anger': 0.00025800734874792397,
#       'anticipation': 0.0005329723935574293,
#       'disgust': 0.00026112011983059347,
#       'fear': 0.00027552215033210814,
#       'joy': 0.7721399068832397,
#       'love': 0.1806265264749527,
#       'optimism': 0.04208092764019966,
#       'pessimism': 0.00025325192837044597,
#       'sadness': 0.0006160663324408233,
#       'surprise': 0.0005619609728455544,
#       'trust': 0.002393839880824089}}

# SINGLE-LABEL MODEL
model = tweetnlp.load_model('emotion')  # Or `model = tweetnlp.Emotion()`
model.emotion('I love swimming for the same reason I love meditating...the feeling of weightlessness.')  # Or `model.predict`
# >>> {'label': 'joy'}
# Note: the probability of the single-label model is the softmax over the labels.
model.emotion('I love swimming for the same reason I love meditating...the feeling of weightlessness.', return_probability=True)
# >>> {'label': 'optimism', 'probability': {'joy': 0.01367587223649025, 'optimism': 0.7345258593559265, 'anger': 0.1770714670419693, 'sadness': 0.07472680509090424}}

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('emotion')

WARNING: The single-label and multi-label emotion models have different label sets (single-label has four classes: 'joy'/'optimism'/'anger'/'sadness', while multi-label has eleven classes: 'joy'/'optimism'/'anger'/'sadness'/'love'/'trust'/'fear'/'surprise'/'anticipation'/'disgust'/'pessimism').
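
Because the two label sets differ, predictions from the two emotion models are only directly comparable on the labels they share. A small sketch of that check follows, with the label names taken from the example outputs above; the constant and function names are hypothetical.

```python
# Label sets as they appear in the single-label and multi-label example outputs.
SINGLE_LABEL_EMOTIONS = {"joy", "optimism", "anger", "sadness"}
MULTI_LABEL_EMOTIONS = {
    "anger", "anticipation", "disgust", "fear", "joy", "love",
    "optimism", "pessimism", "sadness", "surprise", "trust",
}

# Labels both models can produce, and labels only the multi-label model knows.
shared = sorted(SINGLE_LABEL_EMOTIONS & MULTI_LABEL_EMOTIONS)
multi_only = sorted(MULTI_LABEL_EMOTIONS - SINGLE_LABEL_EMOTIONS)


def is_comparable(label):
    """True if both emotion models can emit this label."""
    return label in SINGLE_LABEL_EMOTIONS and label in MULTI_LABEL_EMOTIONS


print(shared)
print(multi_only)
```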

Named Entity Recognition

This module consists of a named entity recognition (NER) model specifically trained for tweets. The model is instantiated by tweetnlp.load_model("ner"), and runs the prediction by giving a text or a list of texts as argument to the ner function (check the paper here, or the Hugging Face dataset page).

import tweetnlp

# MODEL
model = tweetnlp.load_model('ner')  # Or `model = tweetnlp.NER()`
model.ner('Jacob Collier is a Grammy-awarded English artist from London.')  # Or `model.predict`
# >>> [{'type': 'person', 'entity': 'Jacob Collier'}, {'type': 'event', 'entity': ' Grammy'}, {'type': 'location', 'entity': ' London'}]
# Note: the probability for the predicted entity is the mean of the probabilities over the sub-tokens representing the entity.
model.ner('Jacob Collier is a Grammy-awarded English artist from London.', return_probability=True)  # Or `model.predict`
# >>> [
#   {'type': 'person', 'entity': 'Jacob Collier', 'probability': 0.9905318220456442},
#   {'type': 'event', 'entity': ' Grammy', 'probability': 0.19164378941059113},
#   {'type': 'location', 'entity': ' London', 'probability': 0.9607000350952148}
# ]

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('ner')
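
The note in the snippet above says an entity's probability is the mean over its sub-tokens. A minimal illustration of that aggregation is sketched below; the sub-token split and the per-token probabilities are invented for the example and are not real model output.

```python
from statistics import mean


def entity_probability(sub_token_probabilities):
    """Aggregate per-sub-token probabilities into one entity-level score."""
    return mean(sub_token_probabilities)


# Hypothetical sub-tokens of one entity with hypothetical probabilities.
sub_tokens = [("Jac", 0.99), ("ob", 0.98), ("Collier", 0.97)]

probability = entity_probability([p for _, p in sub_tokens])
print(round(probability, 4))
```

One consequence of averaging is that a single low-confidence sub-token pulls down the score of the whole entity.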

Question Answering

This module consists of a question answering model specifically trained for tweets. The model is instantiated by tweetnlp.load_model("question_answering"), and runs the prediction by giving a question or a list of questions, along with a context or a list of contexts, as argument to the question_answering function (check the paper here, or the Hugging Face dataset page).

import tweetnlp

# MODEL
model = tweetnlp.load_model('question_answering')  # Or `model = tweetnlp.QuestionAnswering()`
model.question_answering(
    question='who created the post as we know it today?',
    context="'So much of The Post is Ben,' Mrs. Graham said in 1994, three years after Bradlee retired as editor. 'He created it as we know it today.'— Ed O'Keefe (@edatpost) October 21, 2014"
)  # Or `model.predict`
# >>> {'generated_text': 'ben'}

# GET DATASET
dataset = tweetnlp.load_dataset('question_answering')

Question Answer Generation

This module consists of a question-answer pair generation model specifically trained for tweets. The model is instantiated by tweetnlp.load_model("question_answer_generation"), and runs the prediction by giving a context or a list of contexts as argument to the question_answer_generation function (check the paper here, or the Hugging Face dataset page).

import tweetnlp

# MODEL
model = tweetnlp.load_model('question_answer_generation')  # Or `model = tweetnlp.QuestionAnswerGeneration()`
model.question_answer_generation(
    text="'So much of The Post is Ben,' Mrs. Graham said in 1994, three years after Bradlee retired as editor. 'He created it as we know it today.'— Ed O'Keefe (@edatpost) October 21, 2014"
)  # Or `model.predict`
# >>> [
#     {'question': 'who created the post?', 'answer': 'ben'},
#     {'question': 'what did ben do in 1994?', 'answer': 'he retired as editor'}
# ]

# GET DATASET
dataset = tweetnlp.load_dataset('question_answer_generation')

Language Modeling

The masked language model predicts the masked token in the given sentence. It is instantiated by tweetnlp.load_model('language_model'), and runs the prediction by giving a text or a list of texts as argument to the mask_prediction function. Make sure each text contains a <mask> token, since that is the token the model is asked to predict.

import tweetnlp
model = tweetnlp.load_model('language_model')  # Or `model = tweetnlp.LanguageModel()`
model.mask_prediction("How many more <mask> until opening day? ?", best_n=2)  # Or `model.predict`
# >>> {'best_tokens': ['days', 'hours'],
#      'best_scores': [5.498564104033932e-11, 4.906026140893971e-10],
#      'best_sentences': ['How many more days until opening day? ?',
#       'How many more hours until opening day? ?']}
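
Since every input needs a <mask> token, it can help to validate a batch before sending it to the model. The following is a small sketch of such a pre-check; check_masked_texts is a hypothetical helper, not part of tweetnlp, and it only tests that the token is present.

```python
MASK_TOKEN = "<mask>"


def check_masked_texts(texts):
    """Return the inputs as a list, raising ValueError if any text lacks the mask token."""
    if isinstance(texts, str):
        texts = [texts]  # accept a single text as well as a list of texts
    missing = [text for text in texts if MASK_TOKEN not in text]
    if missing:
        raise ValueError(f"{len(missing)} text(s) have no {MASK_TOKEN} token: {missing!r}")
    return list(texts)


checked = check_masked_texts("How many more <mask> until opening day?")
print(checked)
```

The returned list could then be passed to mask_prediction in a single call.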

Tweet Embedding

The tweet embedding model produces a fixed-length embedding for a tweet. The embedding represents the semantics of the tweet, and it can be used for semantic search over tweets by using the similarity between embeddings. The model is instantiated by tweetnlp.load_model('sentence_embedding'), and runs the prediction by passing a text or a list of texts as argument to the embedding function.

Get Embedding

import tweetnlp
model = tweetnlp.load_model('sentence_embedding')  # Or `model = tweetnlp.SentenceEmbedding()`

# Get sentence embedding
tweet = "I will never understand the decision making of the people of Alabama. Their new Senator is a definite downgrade. You have served with honor.  Well done."
vectors = model.embedding(tweet)
vectors.shape
# >>> (768,)

# Get sentence embedding (multiple inputs)
tweet_corpus = [
    "Free, fair elections are the lifeblood of our democracy. Charges of unfairness are serious. But calling an election unfair does not make it so. Charges require specific allegations and then proof. We have neither here.",
    "Trump appointed judge Stephanos Bibas ",
    "If your members can go to Puerto Rico they can get their asses back in the classroom. @CTULocal1",
    "@PolitiBunny @CTULocal1 Political leverage, science said schools could reopen, teachers and unions protested to keep'em closed and made demands for higher wages and benefits, they're usin Covid as a crutch at the expense of life and education.",
    "Congratulations to all the exporters on achieving record exports in Dec 2020 with a growth of 18 % over the previous year. Well done &amp; keep up this trend. A major pillar of our govt's economic policy is export enhancement &amp; we will provide full support to promote export culture.",
    "@ImranKhanPTI Pakistan seems a worst country in term of exporting facilities. I am a small business man and if I have to export a t-shirt having worth of $5 to USA or Europe. Postal cost will be around $30. How can we grow as an exporting country if this situation prevails. Think about it. #PM",
    "The thing that doesn’t sit right with me about “nothing good happened in 2020” is that it ignores the largest protest movement in our history. The beautiful, powerful Black Lives Matter uprising reached every corner of the country and should be central to our look back at 2020.",
    "@JoshuaPotash I kinda said that in the 2020 look back for @washingtonpost",
    "Is this a confirmation from Q that Lin is leaking declassified intelligence to the public? I believe so. If @realDonaldTrump didn’t approve of what @LLinWood is doing he would have let us know a lonnnnnng time ago. I’ve always wondered why Lin’s Twitter handle started with “LLin” https://t.co/0G7zClOmi2",
    "@ice_qued @realDonaldTrump @LLinWood Yeah 100%",
    "Tomorrow is my last day as Senator from Alabama.  I believe our opportunities are boundless when we find common ground. As we swear in a new Congress &amp; a new President, demand from them that they do just that &amp; build a stronger, more just society.  It’s been an honor to serve you."  # no comma here: Python joins this string with the next one, so the list has 12 items
    "The mask cult can’t ever admit masks don’t work because their ideology is based on feeling like a “good person”  Wearing a mask makes them a “good person” &amp; anyone who disagrees w/them isn’t  They can’t tolerate any idea that makes them feel like their self-importance is unearned",
    "@ianmSC Beyond that, they put such huge confidence in masks so early with no strong evidence that they have any meaningful benefit, they don’t want to backtrack or admit they were wrong. They put the cart before the horse, now desperate to find any results that match their hypothesis.",
]
vectors = model.embedding(tweet_corpus, batch_size=4)
vectors.shape
# >>> (12, 768)

Similarity Search

sims = []
for n, i in enumerate(tweet_corpus):
    _sim = model.similarity(tweet, i)
    sims.append([n, _sim])
print(f'anchor tweet: {tweet}\n')
for m, (n, s) in enumerate(sorted(sims, key=lambda x: x[1], reverse=True)[:3]):
    print(f' - top {m}: {tweet_corpus[n]}\n - similarity: {s}\n')

# >>> anchor tweet: I will never understand the decision making of the people of Alabama. Their new Senator is a definite downgrade. You have served with honor.  Well done.
#
#  - top 0: Tomorrow is my last day as Senator from Alabama.  I believe our opportunities are boundless when we find common ground. As we swear in a new Congress &amp; a new President, demand from them that they do just that &amp; build a stronger, more just society.  It’s been an honor to serve you.The mask cult can’t ever admit masks don’t work because their ideology is based on feeling like a “good person”  Wearing a mask makes them a “good person” &amp; anyone who disagrees w/them isn’t  They can’t tolerate any idea that makes them feel like their self-importance is unearned
#  - similarity: 0.7480925982953287
#
#  - top 1: Trump appointed judge Stephanos Bibas 
#  - similarity: 0.6289173306344258
#
#  - top 2: Free, fair elections are the lifeblood of our democracy. Charges of unfairness are serious. But calling an election unfair does not make it so. Charges require specific allegations and then proof. We have neither here.
#  - similarity: 0.6017154109745276
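
The text does not say which measure model.similarity computes. Assuming it is cosine similarity, a common choice for sentence embeddings, the ranking step can be sketched in plain Python on toy vectors; the three-dimensional vectors and the helper names below are made up for illustration, whereas real embeddings here have 768 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k_similar(anchor, corpus_vectors, k=3):
    """Return (index, similarity) pairs for the k vectors closest to the anchor."""
    scored = [
        (index, cosine_similarity(anchor, vector))
        for index, vector in enumerate(corpus_vectors)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors standing in for tweet embeddings.
anchor = [1.0, 0.0, 1.0]
corpus = [[1.0, 0.0, 0.9], [0.0, 1.0, 0.0], [0.5, 0.5, 0.5]]

ranking = top_k_similar(anchor, corpus, k=2)
for index, similarity in ranking:
    print(index, round(similarity, 4))
```

Embedding the corpus once and ranking the stored vectors avoids re-encoding every tweet for each new query.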

Resources & Custom Model Loading

Here is a table of the default model used in each task.

Task	Model	Dataset
Topic Classification (single-label)	cardiffnlp/twitter-roberta-base-dec2021-tweet-topic-single-all	cardiffnlp/tweet_topic_single
Topic Classification (multi-label)	cardiffnlp/twitter-roberta-base-dec2021-tweet-topic-multi-all	cardiffnlp/tweet_topic_multi
Sentiment Analysis (multilingual)	cardiffnlp/twitter-xlm-roberta-base-sentiment	cardiffnlp/tweet_sentiment_multilingual
Sentiment Analysis	cardiffnlp/twitter-roberta-base-sentiment-latest	tweet_eval
Irony Detection	cardiffnlp/twitter-roberta-base-irony	tweet_eval
Hate Detection	cardiffnlp/twitter-roberta-base-hate-latest	tweet_eval
Offensive Detection	cardiffnlp/twitter-roberta-base-offensive	tweet_eval
Emoji Prediction	cardiffnlp/twitter-roberta-base-emoji	tweet_eval
Emotion Analysis (single-label)	cardiffnlp/twitter-roberta-base-emotion	tweet_eval
Emotion Analysis (multi-label)	cardiffnlp/twitter-roberta-base-emotion-multilabel-latest	TBA
Named Entity Recognition	tner/roberta-large-tweetner7-all	tner/tweetner7
Question Answering	lmqg/t5-small-tweetqa-qa	lmqg/qg_tweetqa
Question Answer Generation	lmqg/t5-base-tweetqa-qag	lmqg/qag_tweetqa
Language Modeling	cardiffnlp/twitter-roberta-base-2021-124m	TBA
Tweet Embedding	cambridgeltl/tweet-roberta-base-embeddings-v1	TBA

To use another model, either local or from the Hugging Face model hub, simply provide the model path or alias to the load_model function. Below is an example that loads a model for NER.

import tweetnlp
tweetnlp.load_model('ner', model_name='tner/twitter-roberta-base-2019-90m-tweetner7-continuous')

Fine-tune Model

TweetNLP provides an easy interface to fine-tune language models on the datasets supported by Hugging Face for model hosting and fine-tuning, with Ray Tune for parameter search.

Supported tasks: sentiment, offensive, irony, hate, emotion, topic_classification

The results of experiments with the tweetnlp trainer can be found in the following table. The results are competitive and can be used as baselines for each task. See the leaderboard page for more about the results.

task	language_model	eval_f1	eval_f1_macro	eval_accuracy	link
emoji	cardiffnlp/twitter-roberta-base-2021-124m	0.46	0.35	0.46	cardiffnlp/twitter-roberta-base-2021-124m-emoji
emotion	cardiffnlp/twitter-roberta-base-2021-124m	0.83	0.79	0.83	cardiffnlp/twitter-roberta-base-2021-124m-emotion
hate	cardiffnlp/twitter-roberta-base-2021-124m	0.56	0.53	0.56	cardiffnlp/twitter-roberta-base-2021-124m-hate
irony	cardiffnlp/twitter-roberta-base-2021-124m	0.79	0.78	0.79	cardiffnlp/twitter-roberta-base-2021-124m-irony
offensive	cardiffnlp/twitter-roberta-base-2021-124m	0.86	0.82	0.86	cardiffnlp/twitter-roberta-base-2021-124m-offensive
sentiment	cardiffnlp/twitter-roberta-base-2021-124m	0.71	0.72	0.71	cardiffnlp/twitter-roberta-base-2021-124m-sentiment
topic_classification (single)	cardiffnlp/twitter-roberta-base-2021-124m	0.9	0.8	0.9	cardiffnlp/twitter-roberta-base-2021-124m-topic-single
topic_classification (multi)	cardiffnlp/twitter-roberta-base-2021-124m	0.75	0.56	0.54	cardiffnlp/twitter-roberta-base-2021-124m-topic-multi
sentiment (multilingual)	cardiffnlp/twitter-xlm-roberta-base	0.69	0.69	0.69	cardiffnlp/twitter-xlm-roberta-base-sentiment-multilingual
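
In the single-label rows of the table above, the first F1 column matches the accuracy column, which is the behaviour of micro-averaged F1 in single-label classification. Assuming that column is micro-averaged (the text does not say so explicitly), the sketch below shows how micro and macro F1 can diverge on an imbalanced toy example; the gold and predicted labels are invented.

```python
def f1_scores(gold, predicted):
    """Return (micro_f1, macro_f1) for single-label predictions."""
    labels = sorted(set(gold) | set(predicted))
    tp = {label: 0 for label in labels}
    fp = {label: 0 for label in labels}
    fn = {label: 0 for label in labels}
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # gold label was missed

    def f1(true_pos, false_pos, false_neg):
        denominator = 2 * true_pos + false_pos + false_neg
        return 2 * true_pos / denominator if denominator else 0.0

    # Micro: pool the counts over all labels, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro


# Invented, imbalanced toy data.
gold = ["irony", "irony", "non_irony", "non_irony", "non_irony", "non_irony"]
predicted = ["irony", "non_irony", "non_irony", "non_irony", "non_irony", "non_irony"]

micro_f1, macro_f1 = f1_scores(gold, predicted)
print(round(micro_f1, 4), round(macro_f1, 4))
```

Macro F1 weights every class equally, so a weak minority class lowers it even when overall accuracy stays high.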

Example

The following example reproduces our irony model cardiffnlp/twitter-roberta-base-2021-124m-irony.

import logging
import tweetnlp

logging.basicConfig(format='%(asctime)s %(levelname)-8s %(message)s', level=logging.INFO, datefmt='%Y-%m-%d %H:%M:%S')

# load dataset
dataset, label_to_id = tweetnlp.load_dataset("irony")
# load trainer class
trainer_class = tweetnlp.load_trainer("irony")
# setup trainer
trainer = trainer_class(
    language_model='cardiffnlp/twitter-roberta-base-2021-124m',  # language model to fine-tune
    dataset=dataset,
    label_to_id=label_to_id,
    max_length=128,
    split_test='test',
    split_train='train',
    split_validation='validation',
    output_dir='model_ckpt/irony'
)
# start model fine-tuning with parameter optimization
trainer.train(
    eval_step=50,  # every `eval_step` steps, the model is validated on the validation set
    n_trials=10,  # number of trials for parameter optimization
    search_range_lr=[1e-6, 1e-4],  # search space for the learning rate (min and max value)
    search_range_epoch=[1, 6],  # search space for the number of epochs (min and max value)
    search_list_batch=[4, 8, 16, 32, 64]  # search space for the batch size (list of integers to test)
)
# evaluate model on the test set
trainer.evaluate()
# >>> {
#   "eval_loss": 1.3228046894073486,
#   "eval_f1": 0.7959183673469388,
#   "eval_f1_macro": 0.791350632069195,
#   "eval_accuracy": 0.7959183673469388,
#   "eval_runtime": 2.2267,
#   "eval_samples_per_second": 352.084,
#   "eval_steps_per_second": 44.01
# }
# save model locally (saved at `{output_dir}/best_model` by default)
trainer.save_model()
# run prediction
trainer.predict('If you wanna look like a badass, have drama on social media')
# >>> {'label': 'irony'}
# push your model to the Hugging Face hub
trainer.push_to_hub(hf_organization='cardiffnlp', model_alias='twitter-roberta-base-2021-124m-irony')

The saved checkpoint can be loaded as a custom model, as shown below.

import tweetnlp
model = tweetnlp.load_model('irony', model_name="model_ckpt/irony/best_model")

If split_validation is not given, the trainer does a single run with default parameters, without parameter search.
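
The search_range_lr, search_range_epoch and search_list_batch arguments in the example above define the space explored over n_trials. A rough sketch of what drawing trials from such a space could look like follows. The log-uniform sampling of the learning rate, the inclusive epoch bounds, the helper name sample_trial and the result keys are all assumptions made for illustration, not a description of how the TweetNLP trainer or Ray Tune actually samples.

```python
import math
import random


def sample_trial(rng, search_range_lr, search_range_epoch, search_list_batch):
    """Draw one hypothetical configuration from the search space."""
    low, high = search_range_lr
    # Assumption: sample the learning rate uniformly in log space.
    learning_rate = math.exp(rng.uniform(math.log(low), math.log(high)))
    # Assumption: both epoch bounds are inclusive.
    epoch = rng.randint(search_range_epoch[0], search_range_epoch[1])
    batch_size = rng.choice(search_list_batch)
    return {"learning_rate": learning_rate, "epoch": epoch, "batch_size": batch_size}


rng = random.Random(0)  # fixed seed so the sketch is repeatable
trials = [
    sample_trial(rng, [1e-6, 1e-4], [1, 6], [4, 8, 16, 32, 64])
    for _ in range(10)  # mirrors n_trials=10 in the example above
]
for trial in trials[:3]:
    print(trial)
```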

Reference Paper

For more details, please read the TweetNLP reference paper. If you use TweetNLP in your research, please use the following bib entry to cite the reference paper:

@inproceedings{camacho-collados-etal-2022-tweetnlp,
    title={{T}weet{NLP}: {C}utting-{E}dge {N}atural {L}anguage {P}rocessing for {S}ocial {M}edia},
    author={Camacho-Collados, Jose and Rezaee, Kiamehr and Riahi, Talayeh and Ushio, Asahi and Loureiro, Daniel and Antypas, Dimosthenis and Boisson, Joanne and Espinosa-Anke, Luis and Liu, Fangyu and Mart{\'i}nez-C{\'a}mara, Eugenio and others},
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = nov,
    year = "2022",
    address = "Abu Dhabi, U.A.E.",
    publisher = "Association for Computational Linguistics",
}
