TweetNLP for all the NLP enthusiasts working on Twitter and social media! The Python library tweetnlp provides a collection of useful tools to analyze/understand tweets, such as sentiment analysis, emoji prediction, and named entity recognition, powered by state-of-the-art language models specialized for social media.
News (December 2022): We presented the TweetNLP demo paper ("TweetNLP: Cutting-Edge Natural Language Processing for Social Media") at EMNLP 2022. The final version can be found here.
TweetNLP Hugging Face page: all the main TweetNLP models can be found here on Hugging Face.
Resources:
Table of Contents:
Install tweetnlp via pip on your console.
pip install tweetnlp

In this section, you will learn how to get the models and datasets with tweetnlp. The models follow the Hugging Face model format, and the datasets are in the Hugging Face datasets format. Easy introductions to Hugging Face models and datasets can be found on the Hugging Face webpage, so please check them out if you are new to Hugging Face.
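As a minimal sketch of what this means in practice (the printed field names are an assumption, as they vary by task), the objects returned by tweetnlp.load_dataset can be handled with the usual Hugging Face datasets API:

import tweetnlp

# `load_dataset` returns a Hugging Face datasets object plus a label mapping.
dataset, label2id = tweetnlp.load_dataset('sentiment')
print(dataset['train'][0])  # one record; split and field names depend on the task
print(label2id)             # mapping from label name to integer id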
The classification module consists of seven different tasks (topic classification, sentiment analysis, irony detection, hate speech detection, offensive language detection, emoji prediction, and emotion analysis). In each example, the model is instantiated by tweetnlp.load_model("task-name"), and runs the prediction by passing a text or a list of texts as argument to the corresponding task function.
import tweetnlp
# MULTI-LABEL MODEL
model = tweetnlp.load_model('topic_classification')  # Or `model = tweetnlp.TopicClassification()`
model.topic("Jacob Collier is a Grammy-awarded English artist from London.")  # Or `model.predict`
>>> {'label': ['celebrity_&_pop_culture', 'music']}
# Note: the probability of the multi-label model is the output of a sigmoid function on a binary prediction of whether each topic is positive or negative.
model.topic("Jacob Collier is a Grammy-awarded English artist from London.", return_probability=True)
>>> {'label': ['celebrity_&_pop_culture', 'music'],
 'probability': {'arts_&_culture': 0.037371691316366196,
  'business_&_entrepreneurs': 0.010188567452132702,
  'celebrity_&_pop_culture': 0.92448890209198,
  'diaries_&_daily_life': 0.03425711765885353,
  'family': 0.00796138122677803,
  'fashion_&_style': 0.020642118528485298,
  'film_tv_&_video': 0.08062587678432465,
  'fitness_&_health': 0.006343095097690821,
  'food_&_dining': 0.0042883665300905704,
  'gaming': 0.004327300935983658,
  'learning_&_educational': 0.010652057826519012,
  'music': 0.8291937112808228,
  'news_&_social_concern': 0.24688217043876648,
  'other_hobbies': 0.020671198144555092,
  'relationships': 0.020371075719594955,
  'science_&_technology': 0.0170074962079525,
  'sports': 0.014291072264313698,
  'travel_&_adventure': 0.010423899628221989,
  'youth_&_student_life': 0.008605164475739002}}
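# A sketch of how the returned labels relate to the probabilities above: labels
# whose sigmoid probability exceeds a threshold (0.5 is an assumption here) are
# kept, matching 'celebrity_&_pop_culture' (0.92) and 'music' (0.83).
out = model.topic("Jacob Collier is a Grammy-awarded English artist from London.", return_probability=True)
[topic for topic, p in out['probability'].items() if p > 0.5]
>>> ['celebrity_&_pop_culture', 'music']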
# SINGLE-LABEL MODEL
model = tweetnlp.load_model('topic_classification', multi_label=False)  # Or `model = tweetnlp.TopicClassification(multi_label=False)`
model.topic("Jacob Collier is a Grammy-awarded English artist from London.")
>>> {'label': 'pop_culture'}
# NOTE: the probability of the single-label model is the softmax over the labels.
model.topic("Jacob Collier is a Grammy-awarded English artist from London.", return_probability=True)
>>> {'label': 'pop_culture',
 'probability': {'arts_&_culture': 9.20625461731106e-05,
  'business_&_entrepreneurs': 6.916998972883448e-05,
  'pop_culture': 0.9995898604393005,
  'daily_life': 0.00011083036952186376,
  'sports_&_gaming': 8.668467489769682e-05,
  'science_&_technology': 5.152115045348182e-05}}
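# Batch prediction (a sketch): as noted above, the task functions also accept a
# list of texts; the output is assumed to be one prediction dict per input text.
model.topic([
    "Jacob Collier is a Grammy-awarded English artist from London.",
    "Beautiful sunset last night from the pontoon @TupperLakeNY",
])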
# GET DATASET
dataset_multi_label, label2id_multi_label = tweetnlp.load_dataset('topic_classification')
dataset_single_label, label2id_single_label = tweetnlp.load_dataset('topic_classification', multi_label=False)

import tweetnlp
# ENGLISH MODEL
model = tweetnlp.load_model('sentiment')  # Or `model = tweetnlp.Sentiment()`
model.sentiment("Yes, including Medicare and social security saving?")  # Or `model.predict`
>>> {'label': 'positive'}
model.sentiment("Yes, including Medicare and social security saving?", return_probability=True)
>>> {'label': 'positive', 'probability': {'negative': 0.004584966693073511, 'neutral': 0.19360853731632233, 'positive': 0.8018065094947815}}
# MULTILINGUAL MODEL
model = tweetnlp.load_model('sentiment', multilingual=True)  # Or `model = tweetnlp.Sentiment(multilingual=True)`
model.sentiment("天気が良いとやっぱり気持ち良いなあ")
>>> {'label': 'positive'}
model.sentiment("天気が良いとやっぱり気持ち良いなあ", return_probability=True)
>>> {'label': 'positive', 'probability': {'negative': 0.028369612991809845, 'neutral': 0.08128828555345535, 'positive': 0.8903420567512512}}
# GET DATASET (ENGLISH)
dataset, label2id = tweetnlp.load_dataset('sentiment')
# GET DATASET (MULTILINGUAL)
for l in ['all', 'arabic', 'english', 'french', 'german', 'hindi', 'italian', 'portuguese', 'spanish']:
    dataset_multilingual, label2id_multilingual = tweetnlp.load_dataset('sentiment', multilingual=True, task_language=l)

import tweetnlp
# MODEL
model = tweetnlp.load_model('irony')  # Or `model = tweetnlp.Irony()`
model.irony('If you wanna look like a badass, have drama on social media')  # Or `model.predict`
>>> {'label': 'irony'}
model.irony('If you wanna look like a badass, have drama on social media', return_probability=True)
>>> {'label': 'irony', 'probability': {'non_irony': 0.08390884101390839, 'irony': 0.9160911440849304}}
# GET DATASET
dataset, label2id = tweetnlp.load_dataset('irony')

import tweetnlp
# MODEL
model = tweetnlp.load_model('hate')  # Or `model = tweetnlp.Hate()`
model.hate('Whoever just unfollowed me you a bitch')  # Or `model.predict`
>>> {'label': 'not-hate'}
model.hate('Whoever just unfollowed me you a bitch', return_probability=True)
>>> {'label': 'non-hate', 'probability': {'non-hate': 0.7263831496238708, 'hate': 0.27361682057380676}}
# GET DATASET
dataset, label2id = tweetnlp.load_dataset('hate')

import tweetnlp
# MODEL
model = tweetnlp.load_model('offensive')  # Or `model = tweetnlp.Offensive()`
model.offensive("All two of them taste like ass.")  # Or `model.predict`
>>> {'label': 'offensive'}
model.offensive("All two of them taste like ass.", return_probability=True)
>>> {'label': 'offensive', 'probability': {'non-offensive': 0.16420328617095947, 'offensive': 0.8357967734336853}}
# GET DATASET
dataset, label2id = tweetnlp.load_dataset('offensive')

import tweetnlp
# MODEL
model = tweetnlp.load_model('emoji')  # Or `model = tweetnlp.Emoji()`
model.emoji('Beautiful sunset last night from the pontoon @TupperLakeNY')  # Or `model.predict`
>>> {'label': '?'}
model.emoji('Beautiful sunset last night from the pontoon @TupperLakeNY', return_probability=True)
>>> {'label': '?',
 'probability': {'❤': 0.13197319209575653,
  '?': 0.11246423423290253,
  '?': 0.008415069431066513,
  '?': 0.04842926934361458,
  '': 0.014528146013617516,
  '?': 0.1509675830602646,
  '?': 0.08625403046607971,
  '': 0.01616635173559189,
  '?': 0.07396604865789413,
  '?': 0.03033279813826084,
  '?': 0.16525287926197052,
  '??': 0.020336611196398735,
  '☀': 0.00799981877207756,
  '?': 0.016111424192786217,
  '': 0.012984540313482285,
  '?': 0.012557178735733032,
  '?': 0.031386848539114,
  '?': 0.006829539313912392,
  '?': 0.04188741743564606,
  '?': 0.011156936176121235}}
# GET DATASET
dataset, label2id = tweetnlp.load_dataset('emoji')

import tweetnlp
# MULTI-LABEL MODEL
model = tweetnlp.load_model('emotion')  # Or `model = tweetnlp.Emotion()`
model.emotion('I love swimming for the same reason I love meditating...the feeling of weightlessness.')  # Or `model.predict`
>>> {'label': 'joy'}
# Note: the probability of the multi-label model is the output of a sigmoid function on a binary prediction of whether each emotion is positive or negative.
model.emotion('I love swimming for the same reason I love meditating...the feeling of weightlessness.', return_probability=True)
>>> {'label': 'joy',
 'probability': {'anger': 0.00025800734874792397,
  'anticipation': 0.0005329723935574293,
  'disgust': 0.00026112011983059347,
  'fear': 0.00027552215033210814,
  'joy': 0.7721399068832397,
  'love': 0.1806265264749527,
  'optimism': 0.04208092764019966,
  'pessimism': 0.00025325192837044597,
  'sadness': 0.0006160663324408233,
  'surprise': 0.0005619609728455544,
  'trust': 0.002393839880824089}}
# SINGLE-LABEL MODEL
model = tweetnlp.load_model('emotion')  # Or `model = tweetnlp.Emotion()`
model.emotion('I love swimming for the same reason I love meditating...the feeling of weightlessness.')  # Or `model.predict`
>>> {'label': 'joy'}
# NOTE: the probability of the single-label model is the softmax over the labels.
model.emotion('I love swimming for the same reason I love meditating...the feeling of weightlessness.', return_probability=True)
>>> {'label': 'optimism', 'probability': {'joy': 0.01367587223649025, 'optimism': 0.7345258593559265, 'anger': 0.1770714670419693, 'sadness': 0.07472680509090424}}
# GET DATASET
dataset, label2id = tweetnlp.load_dataset('emotion')

WARNING: the single-label and multi-label emotion models have different label sets (the single-label model has the four classes 'joy'/'optimism'/'anger'/'sadness', while the multi-label model has the eleven classes 'anger'/'anticipation'/'disgust'/'fear'/'joy'/'love'/'optimism'/'pessimism'/'sadness'/'surprise'/'trust').
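To check which label set you are working with, one option (a sketch; it simply relies on the label mapping returned throughout this section) is to inspect the keys of the mapping returned by load_dataset:

import tweetnlp

# The keys of the label mapping reveal whether this configuration uses the
# 4-class or the 11-class emotion label set described in the warning above.
dataset, label_to_id = tweetnlp.load_dataset('emotion')
print(sorted(label_to_id.keys()))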
This module consists of a named-entity recognition (NER) model specifically trained for tweets. The model is instantiated by tweetnlp.load_model("ner"), and runs the prediction by giving a text or a list of texts as argument to the ner function (check the paper here or the Hugging Face dataset page for more detail).
import tweetnlp
# MODEL
model = tweetnlp.load_model('ner')  # Or `model = tweetnlp.NER()`
model.ner('Jacob Collier is a Grammy-awarded English artist from London.')  # Or `model.predict`
>>> [{'type': 'person', 'entity': 'Jacob Collier'}, {'type': 'event', 'entity': ' Grammy'}, {'type': 'location', 'entity': ' London'}]
# Note: the probability for the predicted entity is the mean of the probabilities over the sub-tokens representing the entity.
model.ner('Jacob Collier is a Grammy-awarded English artist from London.', return_probability=True)  # Or `model.predict`
>>> [
  {'type': 'person', 'entity': 'Jacob Collier', 'probability': 0.9905318220456442},
  {'type': 'event', 'entity': ' Grammy', 'probability': 0.19164378941059113},
  {'type': 'location', 'entity': ' London', 'probability': 0.9607000350952148}
]
# GET DATASET
dataset, label2id = tweetnlp.load_dataset('ner')

This module consists of a question-answering model specifically trained for tweets. The model is instantiated by tweetnlp.load_model("question_answering"), and runs the prediction by giving a question or a list of questions, along with a context or a list of contexts, as argument to the question_answering function (check the paper here or the Hugging Face dataset page for more detail).
import tweetnlp
# MODEL
model = tweetnlp.load_model('question_answering')  # Or `model = tweetnlp.QuestionAnswering()`
model.question_answering(
    question='who created the post as we know it today?',
    context="'So much of The Post is Ben,' Mrs. Graham said in 1994, three years after Bradlee retired as editor. 'He created it as we know it today.'— Ed O'Keefe (@edatpost) October 21, 2014"
)  # Or `model.predict`
>>> {'generated_text': 'ben'}
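# A sketch of batch prediction: the intro above notes that a list of questions
# and a list of contexts are accepted; element-wise pairing of the two lists is
# an assumption here.
context = "'So much of The Post is Ben,' Mrs. Graham said in 1994, three years after Bradlee retired as editor. 'He created it as we know it today.'— Ed O'Keefe (@edatpost) October 21, 2014"
model.question_answering(
    question=['who created the post as we know it today?', 'who retired as editor?'],
    context=[context, context]
)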
# GET DATASET
dataset = tweetnlp.load_dataset('question_answering')

This module consists of a question & answer pair generation model specifically trained for tweets. The model is instantiated by tweetnlp.load_model("question_answer_generation"), and runs the prediction by giving a context or a list of contexts as argument to the question_answer_generation function (check the paper here or the Hugging Face dataset page for more detail).
import tweetnlp
# MODEL
model = tweetnlp.load_model('question_answer_generation')  # Or `model = tweetnlp.QuestionAnswerGeneration()`
model.question_answer_generation(
    text="'So much of The Post is Ben,' Mrs. Graham said in 1994, three years after Bradlee retired as editor. 'He created it as we know it today.'— Ed O'Keefe (@edatpost) October 21, 2014"
)  # Or `model.predict`
>>> [
  {'question': 'who created the post?', 'answer': 'ben'},
  {'question': 'what did ben do in 1994?', 'answer': 'he retired as editor'}
]
# GET DATASET
dataset = tweetnlp.load_dataset('question_answer_generation')

The masked language model predicts the masked token in a given sentence. It is instantiated by tweetnlp.load_model('language_model'), and runs the prediction by giving a text or a list of texts as argument to the mask_prediction function. Make sure that each text contains a <mask> token, since that is what the model is expected to predict.
import tweetnlp
model = tweetnlp.load_model('language_model')  # Or `model = tweetnlp.LanguageModel()`
model.mask_prediction("How many more <mask> until opening day? ?", best_n=2)  # Or `model.predict`
>>> {'best_tokens': ['days', 'hours'],
 'best_scores': [5.498564104033932e-11, 4.906026140893971e-10],
 'best_sentences': ['How many more days until opening day? ?',
  'How many more hours until opening day? ?']}

The tweet embedding model produces a fixed-length embedding for each tweet. The embedding represents the semantics of the tweet and can be used for semantic search over tweets via the similarity between embeddings. The model is instantiated by tweetnlp.load_model('sentence_embedding'), and runs the prediction by passing a text or a list of texts as argument to the embedding function.
import tweetnlp
model = tweetnlp.load_model('sentence_embedding')  # Or `model = tweetnlp.SentenceEmbedding()`
# Get sentence embedding
tweet = "I will never understand the decision making of the people of Alabama. Their new Senator is a definite downgrade. You have served with honor. Well done."
vectors = model.embedding(tweet)
vectors.shape
>>> (768,)
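# A sketch (assumes numpy is available): the embedding is a plain vector, so
# similarity can also be computed directly, e.g. cosine similarity between two
# tweets (the library's own `similarity` helper is shown further below).
import numpy as np
v1 = model.embedding("I love this track")
v2 = model.embedding("This song is amazing")
float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))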
# Get sentence embedding (multiple inputs)
tweet_corpus = [
    "Free, fair elections are the lifeblood of our democracy. Charges of unfairness are serious. But calling an election unfair does not make it so. Charges require specific allegations and then proof. We have neither here.",
    "Trump appointed judge Stephanos Bibas ",
    "If your members can go to Puerto Rico they can get their asses back in the classroom. @CTULocal1",
    "@PolitiBunny @CTULocal1 Political leverage, science said schools could reopen, teachers and unions protested to keep'em closed and made demands for higher wages and benefits, they're usin Covid as a crutch at the expense of life and education.",
    "Congratulations to all the exporters on achieving record exports in Dec 2020 with a growth of 18 % over the previous year. Well done & keep up this trend. A major pillar of our govt's economic policy is export enhancement & we will provide full support to promote export culture.",
    "@ImranKhanPTI Pakistan seems a worst country in term of exporting facilities. I am a small business man and if I have to export a t-shirt having worth of $5 to USA or Europe. Postal cost will be around $30. How can we grow as an exporting country if this situation prevails. Think about it. #PM",
    "The thing that doesn’t sit right with me about “nothing good happened in 2020” is that it ignores the largest protest movement in our history. The beautiful, powerful Black Lives Matter uprising reached every corner of the country and should be central to our look back at 2020.",
    "@JoshuaPotash I kinda said that in the 2020 look back for @washingtonpost",
    "Is this a confirmation from Q that Lin is leaking declassified intelligence to the public? I believe so. If @realDonaldTrump didn’t approve of what @LLinWood is doing he would have let us know a lonnnnnng time ago. I’ve always wondered why Lin’s Twitter handle started with “LLin” https://t.co/0G7zClOmi2",
    "@ice_qued @realDonaldTrump @LLinWood Yeah 100%",
    "Tomorrow is my last day as Senator from Alabama. I believe our opportunities are boundless when we find common ground. As we swear in a new Congress & a new President, demand from them that they do just that & build a stronger, more just society. It’s been an honor to serve you."
    # NOTE: there is no comma after the string above, so Python concatenates it with the next string; the corpus therefore contains 12 items, matching the shape below.
    "The mask cult can’t ever admit masks don’t work because their ideology is based on feeling like a “good person” Wearing a mask makes them a “good person” & anyone who disagrees w/them isn’t They can’t tolerate any idea that makes them feel like their self-importance is unearned",
    "@ianmSC Beyond that, they put such huge confidence in masks so early with no strong evidence that they have any meaningful benefit, they don’t want to backtrack or admit they were wrong. They put the cart before the horse, now desperate to find any results that match their hypothesis.",
]
vectors = model.embedding(tweet_corpus, batch_size=4)
vectors.shape
>>> (12, 768)

sims = []
for n, i in enumerate(tweet_corpus):
    _sim = model.similarity(tweet, i)
    sims.append([n, _sim])
print(f'anchor tweet: {tweet}\n')
for m, (n, s) in enumerate(sorted(sims, key=lambda x: x[1], reverse=True)[:3]):
    print(f' - top {m}: {tweet_corpus[n]}\n - similarity: {s}\n')
>>> anchor tweet: I will never understand the decision making of the people of Alabama. Their new Senator is a definite downgrade. You have served with honor. Well done.

 - top 0: Tomorrow is my last day as Senator from Alabama. I believe our opportunities are boundless when we find common ground. As we swear in a new Congress &amp; a new President, demand from them that they do just that &amp; build a stronger, more just society. It’s been an honor to serve you. The mask cult can’t ever admit masks don’t work because their ideology is based on feeling like a “good person” Wearing a mask makes them a “good person” &amp; anyone who disagrees w/them isn’t They can’t tolerate any idea that makes them feel like their self-importance is unearned
 - similarity: 0.7480925982953287
 - top 1: Trump appointed judge Stephanos Bibas
 - similarity: 0.6289173306344258
 - top 2: Free, fair elections are the lifeblood of our democracy. Charges of unfairness are serious. But calling an election unfair does not make it so. Charges require specific allegations and then proof. We have neither here.
 - similarity: 0.6017154109745276

Below is a table of the default models used in each task.
| Task | Model | Dataset |
|---|---|---|
| Topic classification (single-label) | cardiffnlp/twitter-roberta-base-dec2021-tweet-topic-single-all | cardiffnlp/tweet_topic_single |
| Topic classification (multi-label) | cardiffnlp/twitter-roberta-base-dec2021-tweet-topic-multi-all | cardiffnlp/tweet_topic_multi |
| Sentiment analysis (multilingual) | cardiffnlp/twitter-xlm-roberta-base-sentiment | cardiffnlp/tweet_sentiment_multilingual |
| Sentiment analysis | cardiffnlp/twitter-roberta-base-sentiment-latest | tweet_eval |
| Irony detection | cardiffnlp/twitter-roberta-base-irony | tweet_eval |
| Hate speech detection | cardiffnlp/twitter-roberta-base-hate-latest | tweet_eval |
| Offensive language detection | cardiffnlp/twitter-roberta-base-offensive | tweet_eval |
| Emoji prediction | cardiffnlp/twitter-roberta-base-emoji | tweet_eval |
| Emotion analysis (single-label) | cardiffnlp/twitter-roberta-base-emotion | tweet_eval |
| Emotion analysis (multi-label) | cardiffnlp/twitter-roberta-base-emotion-multilabel-latest | TBA |
| Named entity recognition | tner/roberta-large-tweetner7-all | tner/tweetner7 |
| Question answering | lmqg/t5-small-tweetqa-qa | lmqg/qg_tweetqa |
| Question answer generation | lmqg/t5-base-tweetqa-qag | lmqg/qag_tweetqa |
| Language modeling | cardiffnlp/twitter-roberta-base-2021-124m | TBA |
| Tweet embedding | cambridgeltl/tweet-roberta-base-embeddings-v1 | TBA |
To use a different model, either local or from the Hugging Face ModelHub, you can provide the model path/alias to the load_model function. Below is an example of loading a model for NER.
import tweetnlp
tweetnlp.load_model('ner', model_name='tner/twitter-roberta-base-2019-90m-tweetner7-continuous')

TweetNLP provides an easy interface for fine-tuning language models on the datasets it supports, with Ray Tune for parameter search.
Supported tasks: sentiment, offensive, irony, hate, emotion, topic_classification.

Experimental results obtained with the tweetnlp trainer can be found in the following table. The results are competitive and can serve as a baseline for each task. Check the leaderboard page for more detail on the results.
| Task | language_model | eval_f1 | eval_f1_macro | eval_accuracy | Link |
|---|---|---|---|---|---|
| emoji | cardiffnlp/twitter-roberta-base-2021-124m | 0.46 | 0.35 | 0.46 | cardiffnlp/twitter-roberta-base-2021-124m-emoji |
| emotion | cardiffnlp/twitter-roberta-base-2021-124m | 0.83 | 0.79 | 0.83 | cardiffnlp/twitter-roberta-base-2021-124m-emotion |
| hate | cardiffnlp/twitter-roberta-base-2021-124m | 0.56 | 0.53 | 0.56 | cardiffnlp/twitter-roberta-base-2021-124m-hate |
| irony | cardiffnlp/twitter-roberta-base-2021-124m | 0.79 | 0.78 | 0.79 | cardiffnlp/twitter-roberta-base-2021-124m-irony |
| offensive | cardiffnlp/twitter-roberta-base-2021-124m | 0.86 | 0.82 | 0.86 | cardiffnlp/twitter-roberta-base-2021-124m-offensive |
| sentiment | cardiffnlp/twitter-roberta-base-2021-124m | 0.71 | 0.72 | 0.71 | cardiffnlp/twitter-roberta-base-2021-124m-sentiment |
| topic_classification (single) | cardiffnlp/twitter-roberta-base-2021-124m | 0.9 | 0.8 | 0.9 | cardiffnlp/twitter-roberta-base-2021-124m-topic-single |
| topic_classification (multi) | cardiffnlp/twitter-roberta-base-2021-124m | 0.75 | 0.56 | 0.54 | cardiffnlp/twitter-roberta-base-2021-124m-topic-multi |
| sentiment (multilingual) | cardiffnlp/twitter-xlm-roberta-base | 0.69 | 0.69 | 0.69 | cardiffnlp/twitter-xlm-roberta-base-sentiment-multilingual |
The following example reproduces the irony model cardiffnlp/twitter-roberta-base-2021-124m-irony.
import logging
import tweetnlp
logging.basicConfig(format='%(asctime)s %(levelname)-8s %(message)s', level=logging.INFO, datefmt='%Y-%m-%d %H:%M:%S')
# load dataset
dataset, label_to_id = tweetnlp.load_dataset("irony")
# load trainer class
trainer_class = tweetnlp.load_trainer("irony")
# setup trainer
trainer = trainer_class(
    language_model='cardiffnlp/twitter-roberta-base-2021-124m',  # language model to fine-tune
    dataset=dataset,
    label_to_id=label_to_id,
    max_length=128,
    split_test='test',
    split_train='train',
    split_validation='validation',
    output_dir='model_ckpt/irony'
)
# start model fine-tuning with parameter optimization
trainer.train(
    eval_step=50,  # at each `eval_step`, models are validated on the validation set
    n_trials=10,  # number of trials at parameter optimization
    search_range_lr=[1e-6, 1e-4],  # define the search space for learning rate (min and max value)
    search_range_epoch=[1, 6],  # define the search space for epoch (min and max value)
    search_list_batch=[4, 8, 16, 32, 64]  # define the search space for batch size (list of integers to test)
)
# evaluate model on the test set
trainer.evaluate()
>>> {
    "eval_loss": 1.3228046894073486,
    "eval_f1": 0.7959183673469388,
    "eval_f1_macro": 0.791350632069195,
    "eval_accuracy": 0.7959183673469388,
    "eval_runtime": 2.2267,
    "eval_samples_per_second": 352.084,
    "eval_steps_per_second": 44.01
}
# save model locally (saved at `{output_dir}/best_model` as default)
trainer.save_model()
# run prediction
trainer.predict('If you wanna look like a badass, have drama on social media')
>>> {'label': 'irony'}
# push your model on huggingface hub
trainer.push_to_hub(hf_organization='cardiffnlp', model_alias='twitter-roberta-base-2021-124m-irony')

The saved checkpoint can then be loaded as a custom model, as below.
import tweetnlp
model = tweetnlp.load_model('irony', model_name="model_ckpt/irony/best_model")

If split_validation is not given, the trainer performs a single fine-tuning run with the default parameters, without parameter search.
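For example, a minimal single-run setup could look as follows (a sketch assuming the same trainer API as above; the output_dir value is an arbitrary placeholder):

import tweetnlp

dataset, label_to_id = tweetnlp.load_dataset("irony")
trainer_class = tweetnlp.load_trainer("irony")
trainer = trainer_class(
    language_model='cardiffnlp/twitter-roberta-base-2021-124m',
    dataset=dataset,
    label_to_id=label_to_id,
    split_train='train',
    split_test='test',
    output_dir='model_ckpt/irony_default'  # hypothetical path
)
trainer.train()  # no `split_validation`: a single run with default parameters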
Please read the TweetNLP reference paper for more detail. If you use tweetnlp in your research, please cite the reference paper with the following bib entry:
@inproceedings{camacho-collados-etal-2022-tweetnlp,
title={{T}weet{NLP}: {C}utting-{E}dge {N}atural {L}anguage {P}rocessing for {S}ocial {M}edia},
author={Camacho-Collados, Jose and Rezaee, Kiamehr and Riahi, Talayeh and Ushio, Asahi and Loureiro, Daniel and Antypas, Dimosthenis and Boisson, Joanne and Espinosa-Anke, Luis and Liu, Fangyu and Mart{\'i}nez-C{\'a}mara, Eugenio and others},
booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
month = nov,
year = "2022",
address = "Abu Dhabi, U.A.E.",
publisher = "Association for Computational Linguistics",
}