TweetNLP for all the NLP enthusiasts working on Twitter and social media! The Python library tweetnlp provides a collection of useful tools to analyze/understand tweets, such as sentiment analysis, emoji prediction, and named entity recognition, powered by state-of-the-art language models specialized for social media.
News (December 2022): We presented the TweetNLP demo paper ("TweetNLP: Cutting-Edge Natural Language Processing for Social Media") at EMNLP 2022. The final version can be found here.
TweetNLP Hugging Face page: all the main TweetNLP models can be found here on Hugging Face.
Resources:
Table of Contents:
Install tweetnlp via pip on your console.
pip install tweetnlp

In this section, you will learn how to get the models and datasets with tweetnlp. The models follow the Hugging Face model format, and the datasets are in the Hugging Face datasets format. Easy introductions to Hugging Face models and datasets can be found on the Hugging Face webpage, so please check them out if you are new to Hugging Face.
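As a minimal sketch of what this means in practice (the printed field names are an assumption, as they vary by task), the objects returned by tweetnlp.load_dataset can be handled with the usual Hugging Face datasets API:

import tweetnlp

# `load_dataset` returns a Hugging Face datasets object plus a label mapping.
dataset, label2id = tweetnlp.load_dataset('sentiment')
print(dataset['train'][0])  # one record; split and field names depend on the task
print(label2id)             # mapping from label name to integer id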
The classification module consists of seven different tasks (topic classification, sentiment analysis, irony detection, hate speech detection, offensive language detection, emoji prediction, and emotion analysis). In each example, the model is instantiated by tweetnlp.load_model("task-name"), and runs the prediction by passing a text or a list of texts as argument to the corresponding task function.
import tweetnlp
# MULTI-LABEL MODEL
model = tweetnlp.load_model('topic_classification')  # Or `model = tweetnlp.TopicClassification()`
model.topic("Jacob Collier is a Grammy-awarded English artist from London.")  # Or `model.predict`
>>> {'label': ['celebrity_&_pop_culture', 'music']}
# Note: the probability of the multi-label model is the output of a sigmoid function on a binary prediction of whether each topic is positive or negative.
model.topic("Jacob Collier is a Grammy-awarded English artist from London.", return_probability=True)
>>> {'label': ['celebrity_&_pop_culture', 'music'],
 'probability': {'arts_&_culture': 0.037371691316366196,
  'business_&_entrepreneurs': 0.010188567452132702,
  'celebrity_&_pop_culture': 0.92448890209198,
  'diaries_&_daily_life': 0.03425711765885353,
  'family': 0.00796138122677803,
  'fashion_&_style': 0.020642118528485298,
  'film_tv_&_video': 0.08062587678432465,
  'fitness_&_health': 0.006343095097690821,
  'food_&_dining': 0.0042883665300905704,
  'gaming': 0.004327300935983658,
  'learning_&_educational': 0.010652057826519012,
  'music': 0.8291937112808228,
  'news_&_social_concern': 0.24688217043876648,
  'other_hobbies': 0.020671198144555092,
  'relationships': 0.020371075719594955,
  'science_&_technology': 0.0170074962079525,
  'sports': 0.014291072264313698,
  'travel_&_adventure': 0.010423899628221989,
  'youth_&_student_life': 0.008605164475739002}}
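# A sketch of how the returned labels relate to the probabilities above: labels
# whose sigmoid probability exceeds a threshold (0.5 is an assumption here) are
# kept, matching 'celebrity_&_pop_culture' (0.92) and 'music' (0.83).
out = model.topic("Jacob Collier is a Grammy-awarded English artist from London.", return_probability=True)
[topic for topic, p in out['probability'].items() if p > 0.5]
>>> ['celebrity_&_pop_culture', 'music']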
# SINGLE-LABEL MODEL
model = tweetnlp.load_model('topic_classification', multi_label=False)  # Or `model = tweetnlp.TopicClassification(multi_label=False)`
model.topic("Jacob Collier is a Grammy-awarded English artist from London.")
>>> {'label': 'pop_culture'}
# NOTE: the probability of the single-label model is the softmax over the labels.
model.topic("Jacob Collier is a Grammy-awarded English artist from London.", return_probability=True)
>>> {'label': 'pop_culture',
 'probability': {'arts_&_culture': 9.20625461731106e-05,
  'business_&_entrepreneurs': 6.916998972883448e-05,
  'pop_culture': 0.9995898604393005,
  'daily_life': 0.00011083036952186376,
  'sports_&_gaming': 8.668467489769682e-05,
  'science_&_technology': 5.152115045348182e-05}}
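# Batch prediction (a sketch): as noted above, the task functions also accept a
# list of texts; the output is assumed to be one prediction dict per input text.
model.topic([
    "Jacob Collier is a Grammy-awarded English artist from London.",
    "Beautiful sunset last night from the pontoon @TupperLakeNY",
])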
# GET DATASET
dataset_multi_label, label2id_multi_label = tweetnlp.load_dataset('topic_classification')
dataset_single_label, label2id_single_label = tweetnlp.load_dataset('topic_classification', multi_label=False)

import tweetnlp
# ENGLISH MODEL
model = tweetnlp.load_model('sentiment')  # Or `model = tweetnlp.Sentiment()`
model.sentiment("Yes, including Medicare and social security saving?")  # Or `model.predict`
>>> {'label': 'positive'}
model.sentiment("Yes, including Medicare and social security saving?", return_probability=True)
>>> {'label': 'positive', 'probability': {'negative': 0.004584966693073511, 'neutral': 0.19360853731632233, 'positive': 0.8018065094947815}}
# MULTILINGUAL MODEL
model = tweetnlp.load_model('sentiment', multilingual=True)  # Or `model = tweetnlp.Sentiment(multilingual=True)`
model.sentiment("天気が良いとやっぱり気持ち良いなあ")
>>> {'label': 'positive'}
model.sentiment("天気が良いとやっぱり気持ち良いなあ", return_probability=True)
>>> {'label': 'positive', 'probability': {'negative': 0.028369612991809845, 'neutral': 0.08128828555345535, 'positive': 0.8903420567512512}}
# GET DATASET (ENGLISH)
dataset, label2id = tweetnlp.load_dataset('sentiment')
# GET DATASET (MULTILINGUAL)
for l in ['all', 'arabic', 'english', 'french', 'german', 'hindi', 'italian', 'portuguese', 'spanish']:
    dataset_multilingual, label2id_multilingual = tweetnlp.load_dataset('sentiment', multilingual=True, task_language=l)

import tweetnlp
# MODEL
model = tweetnlp.load_model('irony')  # Or `model = tweetnlp.Irony()`
model.irony('If you wanna look like a badass, have drama on social media')  # Or `model.predict`
>>> {'label': 'irony'}
model.irony('If you wanna look like a badass, have drama on social media', return_probability=True)
>>> {'label': 'irony', 'probability': {'non_irony': 0.08390884101390839, 'irony': 0.9160911440849304}}
# GET DATASET
dataset, label2id = tweetnlp.load_dataset('irony')

import tweetnlp
# MODEL
model = tweetnlp.load_model('hate')  # Or `model = tweetnlp.Hate()`
model.hate('Whoever just unfollowed me you a bitch')  # Or `model.predict`
>>> {'label': 'not-hate'}
model.hate('Whoever just unfollowed me you a bitch', return_probability=True)
>>> {'label': 'non-hate', 'probability': {'non-hate': 0.7263831496238708, 'hate': 0.27361682057380676}}
# GET DATASET
dataset, label2id = tweetnlp.load_dataset('hate')

import tweetnlp
# MODEL
model = tweetnlp.load_model('offensive')  # Or `model = tweetnlp.Offensive()`
model.offensive("All two of them taste like ass.")  # Or `model.predict`
>>> {'label': 'offensive'}
model.offensive("All two of them taste like ass.", return_probability=True)
>>> {'label': 'offensive', 'probability': {'non-offensive': 0.16420328617095947, 'offensive': 0.8357967734336853}}
# GET DATASET
dataset, label2id = tweetnlp.load_dataset('offensive')

import tweetnlp
# MODEL
model = tweetnlp.load_model('emoji')  # Or `model = tweetnlp.Emoji()`
model.emoji('Beautiful sunset last night from the pontoon @TupperLakeNY')  # Or `model.predict`
>>> {'label': '?'}
model.emoji('Beautiful sunset last night from the pontoon @TupperLakeNY', return_probability=True)
>>> {'label': '?',
 'probability': {'❤': 0.13197319209575653,
  '?': 0.11246423423290253,
  '?': 0.008415069431066513,
  '?': 0.04842926934361458,
  '': 0.014528146013617516,
  '?': 0.1509675830602646,
  '?': 0.08625403046607971,
  '': 0.01616635173559189,
  '?': 0.07396604865789413,
  '?': 0.03033279813826084,
  '?': 0.16525287926197052,
  '??': 0.020336611196398735,
  '☀': 0.00799981877207756,
  '?': 0.016111424192786217,
  '': 0.012984540313482285,
  '?': 0.012557178735733032,
  '?': 0.031386848539114,
  '?': 0.006829539313912392,
  '?': 0.04188741743564606,
  '?': 0.011156936176121235}}
# GET DATASET
dataset, label2id = tweetnlp.load_dataset('emoji')

import tweetnlp
# MULTI-LABEL MODEL
model = tweetnlp.load_model('emotion')  # Or `model = tweetnlp.Emotion()`
model.emotion('I love swimming for the same reason I love meditating...the feeling of weightlessness.')  # Or `model.predict`
>>> {'label': 'joy'}
# Note: the probability of the multi-label model is the output of a sigmoid function on a binary prediction of whether each emotion is positive or negative.
model.emotion('I love swimming for the same reason I love meditating...the feeling of weightlessness.', return_probability=True)
>>> {'label': 'joy',
 'probability': {'anger': 0.00025800734874792397,
  'anticipation': 0.0005329723935574293,
  'disgust': 0.00026112011983059347,
  'fear': 0.00027552215033210814,
  'joy': 0.7721399068832397,
  'love': 0.1806265264749527,
  'optimism': 0.04208092764019966,
  'pessimism': 0.00025325192837044597,
  'sadness': 0.0006160663324408233,
  'surprise': 0.0005619609728455544,
  'trust': 0.002393839880824089}}
# SINGLE-LABEL MODEL
model = tweetnlp.load_model('emotion')  # Or `model = tweetnlp.Emotion()`
model.emotion('I love swimming for the same reason I love meditating...the feeling of weightlessness.')  # Or `model.predict`
>>> {'label': 'joy'}
# NOTE: the probability of the single-label model is the softmax over the labels.
model.emotion('I love swimming for the same reason I love meditating...the feeling of weightlessness.', return_probability=True)
>>> {'label': 'optimism', 'probability': {'joy': 0.01367587223649025, 'optimism': 0.7345258593559265, 'anger': 0.1770714670419693, 'sadness': 0.07472680509090424}}
# GET DATASET
dataset, label2id = tweetnlp.load_dataset('emotion')

WARNING: the single-label and multi-label emotion models have different label sets (the single-label model has the four classes 'joy'/'optimism'/'anger'/'sadness', while the multi-label model has the eleven classes 'anger'/'anticipation'/'disgust'/'fear'/'joy'/'love'/'optimism'/'pessimism'/'sadness'/'surprise'/'trust').
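To check which label set you are working with, one option (a sketch; it simply relies on the label mapping returned throughout this section) is to inspect the keys of the mapping returned by load_dataset:

import tweetnlp

# The keys of the label mapping reveal whether this configuration uses the
# 4-class or the 11-class emotion label set described in the warning above.
dataset, label_to_id = tweetnlp.load_dataset('emotion')
print(sorted(label_to_id.keys()))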
This module consists of a named-entity recognition (NER) model specifically trained for tweets. The model is instantiated by tweetnlp.load_model("ner"), and runs the prediction by giving a text or a list of texts as argument to the ner function (check the paper here or the Hugging Face dataset page for more detail).
import tweetnlp
# MODEL
model = tweetnlp.load_model('ner')  # Or `model = tweetnlp.NER()`
model.ner('Jacob Collier is a Grammy-awarded English artist from London.')  # Or `model.predict`
>>> [{'type': 'person', 'entity': 'Jacob Collier'}, {'type': 'event', 'entity': ' Grammy'}, {'type': 'location', 'entity': ' London'}]
# Note: the probability for the predicted entity is the mean of the probabilities over the sub-tokens representing the entity.
model.ner('Jacob Collier is a Grammy-awarded English artist from London.', return_probability=True)  # Or `model.predict`
>>> [
  {'type': 'person', 'entity': 'Jacob Collier', 'probability': 0.9905318220456442},
  {'type': 'event', 'entity': ' Grammy', 'probability': 0.19164378941059113},
  {'type': 'location', 'entity': ' London', 'probability': 0.9607000350952148}
]
# GET DATASET
dataset, label2id = tweetnlp.load_dataset('ner')

This module consists of a question-answering model specifically trained for tweets. The model is instantiated by tweetnlp.load_model("question_answering"), and runs the prediction by giving a question or a list of questions, along with a context or a list of contexts, as argument to the question_answering function (check the paper here or the Hugging Face dataset page for more detail).
import tweetnlp
# MODEL
model = tweetnlp.load_model('question_answering')  # Or `model = tweetnlp.QuestionAnswering()`
model.question_answering(
    question='who created the post as we know it today?',
    context="'So much of The Post is Ben,' Mrs. Graham said in 1994, three years after Bradlee retired as editor. 'He created it as we know it today.'— Ed O'Keefe (@edatpost) October 21, 2014"
)  # Or `model.predict`
>>> {'generated_text': 'ben'}
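# A sketch of batch prediction: the intro above notes that a list of questions
# and a list of contexts are accepted; element-wise pairing of the two lists is
# an assumption here.
context = "'So much of The Post is Ben,' Mrs. Graham said in 1994, three years after Bradlee retired as editor. 'He created it as we know it today.'— Ed O'Keefe (@edatpost) October 21, 2014"
model.question_answering(
    question=['who created the post as we know it today?', 'who retired as editor?'],
    context=[context, context]
)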
# GET DATASET
dataset = tweetnlp.load_dataset('question_answering')

This module consists of a question & answer pair generation model specifically trained for tweets. The model is instantiated by tweetnlp.load_model("question_answer_generation"), and runs the prediction by giving a context or a list of contexts as argument to the question_answer_generation function (check the paper here or the Hugging Face dataset page for more detail).
import tweetnlp
# MODEL
model = tweetnlp.load_model('question_answer_generation')  # Or `model = tweetnlp.QuestionAnswerGeneration()`
model.question_answer_generation(
    text="'So much of The Post is Ben,' Mrs. Graham said in 1994, three years after Bradlee retired as editor. 'He created it as we know it today.'— Ed O'Keefe (@edatpost) October 21, 2014"
)  # Or `model.predict`
>>> [
  {'question': 'who created the post?', 'answer': 'ben'},
  {'question': 'what did ben do in 1994?', 'answer': 'he retired as editor'}
]
# GET DATASET
dataset = tweetnlp.load_dataset('question_answer_generation')

The masked language model predicts the masked token in a given sentence. It is instantiated by tweetnlp.load_model('language_model'), and runs the prediction by giving a text or a list of texts as argument to the mask_prediction function. Make sure that each text contains a <mask> token, since that is what the model is expected to predict.
import tweetnlp
model = tweetnlp.load_model('language_model')  # Or `model = tweetnlp.LanguageModel()`
model.mask_prediction("How many more <mask> until opening day? ?", best_n=2)  # Or `model.predict`
>>> {'best_tokens': ['days', 'hours'],
 'best_scores': [5.498564104033932e-11, 4.906026140893971e-10],
 'best_sentences': ['How many more days until opening day? ?',
  'How many more hours until opening day? ?']}

The tweet embedding model produces a fixed-length embedding for each tweet. The embedding represents the semantics of the tweet and can be used for semantic search over tweets via the similarity between embeddings. The model is instantiated by tweetnlp.load_model('sentence_embedding'), and runs the prediction by passing a text or a list of texts as argument to the embedding function.
import tweetnlp
model = tweetnlp.load_model('sentence_embedding')  # Or `model = tweetnlp.SentenceEmbedding()`
# Get sentence embedding
tweet = "I will never understand the decision making of the people of Alabama. Their new Senator is a definite downgrade. You have served with honor. Well done."
vectors = model.embedding(tweet)
vectors.shape
>>> (768,)
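# A sketch (assumes numpy is available): the embedding is a plain vector, so
# similarity can also be computed directly, e.g. cosine similarity between two
# tweets (the library's own `similarity` helper is shown further below).
import numpy as np
v1 = model.embedding("I love this track")
v2 = model.embedding("This song is amazing")
float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))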
# Get sentence embedding (multiple inputs)
tweet_corpus = [
    "Free, fair elections are the lifeblood of our democracy. Charges of unfairness are serious. But calling an election unfair does not make it so. Charges require specific allegations and then proof. We have neither here.",
    "Trump appointed judge Stephanos Bibas ",
    "If your members can go to Puerto Rico they can get their asses back in the classroom. @CTULocal1",
    "@PolitiBunny @CTULocal1 Political leverage, science said schools could reopen, teachers and unions protested to keep'em closed and made demands for higher wages and benefits, they're usin Covid as a crutch at the expense of life and education.",
    "Congratulations to all the exporters on achieving record exports in Dec 2020 with a growth of 18 % over the previous year. Well done & keep up this trend. A major pillar of our govt's economic policy is export enhancement & we will provide full support to promote export culture.",
    "@ImranKhanPTI Pakistan seems a worst country in term of exporting facilities. I am a small business man and if I have to export a t-shirt having worth of $5 to USA or Europe. Postal cost will be around $30. How can we grow as an exporting country if this situation prevails. Think about it. #PM",
    "The thing that doesn’t sit right with me about “nothing good happened in 2020” is that it ignores the largest protest movement in our history. The beautiful, powerful Black Lives Matter uprising reached every corner of the country and should be central to our look back at 2020.",
    "@JoshuaPotash I kinda said that in the 2020 look back for @washingtonpost",
    "Is this a confirmation from Q that Lin is leaking declassified intelligence to the public? I believe so. If @realDonaldTrump didn’t approve of what @LLinWood is doing he would have let us know a lonnnnnng time ago. I’ve always wondered why Lin’s Twitter handle started with “LLin” https://t.co/0G7zClOmi2",
    "@ice_qued @realDonaldTrump @LLinWood Yeah 100%",
    "Tomorrow is my last day as Senator from Alabama. I believe our opportunities are boundless when we find common ground. As we swear in a new Congress & a new President, demand from them that they do just that & build a stronger, more just society. It’s been an honor to serve you."
    # NOTE: there is no comma after the string above, so Python concatenates it with the next string; the corpus therefore contains 12 items, matching the shape below.
    "The mask cult can’t ever admit masks don’t work because their ideology is based on feeling like a “good person” Wearing a mask makes them a “good person” & anyone who disagrees w/them isn’t They can’t tolerate any idea that makes them feel like their self-importance is unearned",
    "@ianmSC Beyond that, they put such huge confidence in masks so early with no strong evidence that they have any meaningful benefit, they don’t want to backtrack or admit they were wrong. They put the cart before the horse, now desperate to find any results that match their hypothesis.",
]
vectors = model.embedding(tweet_corpus, batch_size=4)
vectors.shape
>>> (12, 768)

sims = []
for n, i in enumerate(tweet_corpus):
    _sim = model.similarity(tweet, i)
    sims.append([n, _sim])
print(f'anchor tweet: {tweet}\n')
for m, (n, s) in enumerate(sorted(sims, key=lambda x: x[1], reverse=True)[:3]):
    print(f' - top {m}: {tweet_corpus[n]}\n - similarity: {s}\n')
>>> anchor tweet: I will never understand the decision making of the people of Alabama. Their new Senator is a definite downgrade. You have served with honor. Well done.

 - top 0: Tomorrow is my last day as Senator from Alabama. I believe our opportunities are boundless when we find common ground. As we swear in a new Congress &amp; a new President, demand from them that they do just that &amp; build a stronger, more just society. It’s been an honor to serve you. The mask cult can’t ever admit masks don’t work because their ideology is based on feeling like a “good person” Wearing a mask makes them a “good person” &amp; anyone who disagrees w/them isn’t They can’t tolerate any idea that makes them feel like their self-importance is unearned
 - similarity: 0.7480925982953287
 - top 1: Trump appointed judge Stephanos Bibas
 - similarity: 0.6289173306344258
 - top 2: Free, fair elections are the lifeblood of our democracy. Charges of unfairness are serious. But calling an election unfair does not make it so. Charges require specific allegations and then proof. We have neither here.
 - similarity: 0.6017154109745276

Below is a table of the default models used in each task.
| Task | Model | Dataset |
|---|---|---|
| Topic classification (single-label) | cardiffnlp/twitter-roberta-base-dec2021-tweet-topic-single-all | cardiffnlp/tweet_topic_single |
| Topic classification (multi-label) | cardiffnlp/twitter-roberta-base-dec2021-tweet-topic-multi-all | cardiffnlp/tweet_topic_multi |
| Sentiment analysis (multilingual) | cardiffnlp/twitter-xlm-roberta-base-sentiment | cardiffnlp/tweet_sentiment_multilingual |
| Sentiment analysis | cardiffnlp/twitter-roberta-base-sentiment-latest | tweet_eval |
| Irony detection | cardiffnlp/twitter-roberta-base-irony | tweet_eval |
| Hate speech detection | cardiffnlp/twitter-roberta-base-hate-latest | tweet_eval |
| Offensive language detection | cardiffnlp/twitter-roberta-base-offensive | tweet_eval |
| Emoji prediction | cardiffnlp/twitter-roberta-base-emoji | tweet_eval |
| Emotion analysis (single-label) | cardiffnlp/twitter-roberta-base-emotion | tweet_eval |
| Emotion analysis (multi-label) | cardiffnlp/twitter-roberta-base-emotion-multilabel-latest | TBA |
| Named entity recognition | tner/roberta-large-tweetner7-all | tner/tweetner7 |
| Question answering | lmqg/t5-small-tweetqa-qa | lmqg/qg_tweetqa |
| Question answer generation | lmqg/t5-base-tweetqa-qag | lmqg/qag_tweetqa |
| Language modeling | cardiffnlp/twitter-roberta-base-2021-124m | TBA |
| Tweet embedding | cambridgeltl/tweet-roberta-base-embeddings-v1 | TBA |
To use a different model, either local or from the Hugging Face ModelHub, you can provide the model path/alias to the load_model function. Below is an example of loading a model for NER.
import tweetnlp
tweetnlp.load_model('ner', model_name='tner/twitter-roberta-base-2019-90m-tweetner7-continuous')

TweetNLP provides an easy interface for fine-tuning language models on the datasets it supports, with Ray Tune for parameter search.
Supported tasks: sentiment, offensive, irony, hate, emotion, topic_classification.

Experimental results obtained with the tweetnlp trainer can be found in the following table. The results are competitive and can serve as a baseline for each task. Check the leaderboard page for more detail on the results.
| Task | language_model | eval_f1 | eval_f1_macro | eval_accuracy | Link |
|---|---|---|---|---|---|
| emoji | cardiffnlp/twitter-roberta-base-2021-124m | 0.46 | 0.35 | 0.46 | cardiffnlp/twitter-roberta-base-2021-124m-emoji |
| emotion | cardiffnlp/twitter-roberta-base-2021-124m | 0.83 | 0.79 | 0.83 | cardiffnlp/twitter-roberta-base-2021-124m-emotion |
| hate | cardiffnlp/twitter-roberta-base-2021-124m | 0.56 | 0.53 | 0.56 | cardiffnlp/twitter-roberta-base-2021-124m-hate |
| irony | cardiffnlp/twitter-roberta-base-2021-124m | 0.79 | 0.78 | 0.79 | cardiffnlp/twitter-roberta-base-2021-124m-irony |
| offensive | cardiffnlp/twitter-roberta-base-2021-124m | 0.86 | 0.82 | 0.86 | cardiffnlp/twitter-roberta-base-2021-124m-offensive |
| sentiment | cardiffnlp/twitter-roberta-base-2021-124m | 0.71 | 0.72 | 0.71 | cardiffnlp/twitter-roberta-base-2021-124m-sentiment |
| topic_classification (single) | cardiffnlp/twitter-roberta-base-2021-124m | 0.9 | 0.8 | 0.9 | cardiffnlp/twitter-roberta-base-2021-124m-topic-single |
| topic_classification (multi) | cardiffnlp/twitter-roberta-base-2021-124m | 0.75 | 0.56 | 0.54 | cardiffnlp/twitter-roberta-base-2021-124m-topic-multi |
| sentiment (multilingual) | cardiffnlp/twitter-xlm-roberta-base | 0.69 | 0.69 | 0.69 | cardiffnlp/twitter-xlm-roberta-base-sentiment-multilingual |
The following example reproduces the irony model cardiffnlp/twitter-roberta-base-2021-124m-irony.
import logging
import tweetnlp
logging.basicConfig(format='%(asctime)s %(levelname)-8s %(message)s', level=logging.INFO, datefmt='%Y-%m-%d %H:%M:%S')
# load dataset
dataset, label_to_id = tweetnlp.load_dataset("irony")
# load trainer class
trainer_class = tweetnlp.load_trainer("irony")
# setup trainer
trainer = trainer_class(
    language_model='cardiffnlp/twitter-roberta-base-2021-124m',  # language model to fine-tune
    dataset=dataset,
    label_to_id=label_to_id,
    max_length=128,
    split_test='test',
    split_train='train',
    split_validation='validation',
    output_dir='model_ckpt/irony'
)
# start model fine-tuning with parameter optimization
trainer.train(
    eval_step=50,  # at each `eval_step`, models are validated on the validation set
    n_trials=10,  # number of trials at parameter optimization
    search_range_lr=[1e-6, 1e-4],  # define the search space for learning rate (min and max value)
    search_range_epoch=[1, 6],  # define the search space for epoch (min and max value)
    search_list_batch=[4, 8, 16, 32, 64]  # define the search space for batch size (list of integers to test)
)
# evaluate model on the test set
trainer.evaluate()
>>> {
    "eval_loss": 1.3228046894073486,
    "eval_f1": 0.7959183673469388,
    "eval_f1_macro": 0.791350632069195,
    "eval_accuracy": 0.7959183673469388,
    "eval_runtime": 2.2267,
    "eval_samples_per_second": 352.084,
    "eval_steps_per_second": 44.01
}
# save model locally (saved at `{output_dir}/best_model` as default)
trainer.save_model()
# run prediction
trainer.predict('If you wanna look like a badass, have drama on social media')
>>> {'label': 'irony'}
# push your model on huggingface hub
trainer.push_to_hub(hf_organization='cardiffnlp', model_alias='twitter-roberta-base-2021-124m-irony')

The saved checkpoint can then be loaded as a custom model, as below.
import tweetnlp
model = tweetnlp.load_model('irony', model_name="model_ckpt/irony/best_model")

If split_validation is not given, the trainer performs a single fine-tuning run with the default parameters, without parameter search.
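For example, a minimal single-run setup could look as follows (a sketch assuming the same trainer API as above; the output_dir value is an arbitrary placeholder):

import tweetnlp

dataset, label_to_id = tweetnlp.load_dataset("irony")
trainer_class = tweetnlp.load_trainer("irony")
trainer = trainer_class(
    language_model='cardiffnlp/twitter-roberta-base-2021-124m',
    dataset=dataset,
    label_to_id=label_to_id,
    split_train='train',
    split_test='test',
    output_dir='model_ckpt/irony_default'  # hypothetical path
)
trainer.train()  # no `split_validation`: a single run with default parameters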
Please read the TweetNLP reference paper for more detail. If you use tweetnlp in your research, please cite the reference paper with the following bib entry:
@inproceedings{camacho-collados-etal-2022-tweetnlp,
title={{T}weet{NLP}: {C}utting-{E}dge {N}atural {L}anguage {P}rocessing for {S}ocial {M}edia},
author={Camacho-Collados, Jose and Rezaee, Kiamehr and Riahi, Talayeh and Ushio, Asahi and Loureiro, Daniel and Antypas, Dimosthenis and Boisson, Joanne and Espinosa-Anke, Luis and Liu, Fangyu and Mart{\'i}nez-C{\'a}mara, Eugenio and others},
booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
month = nov,
year = "2022",
address = "Abu Dhabi, U.A.E.",
publisher = "Association for Computational Linguistics",
}