
TweetNLP

TweetNLP is for all the NLP enthusiasts working on Twitter and social media! The Python library tweetnlp provides a collection of useful tools to analyze and understand tweets, such as sentiment analysis, emoji prediction, and named-entity recognition, powered by state-of-the-art language models specialized on social media.

News (December 2022): We presented a TweetNLP demo paper ("TweetNLP: Cutting-Edge Natural Language Processing for Social Media") at EMNLP 2022. The final version can be found here.

TweetNLP Hugging Face page: all the main TweetNLP models can be found here on Hugging Face.

Resources:

Quick tour with a Colab notebook:
Play with the TweetNLP online demo: link
EMNLP 2022 paper: link
2nd Cardiff NLP Summer Workshop Tutorial:
2nd Cardiff NLP Summer Workshop Tutorial (Solutions):

Table of Contents:

Load Model & Dataset
Fine-tune Model

Get Started

Install TweetNLP via pip from your console.

pip install tweetnlp

Model & Dataset

In this section, you will learn how to get the models and datasets with tweetnlp. The models follow the Hugging Face model format, and the datasets are in the format of Hugging Face datasets. Easy introductions to Hugging Face models and datasets can be found on the Hugging Face website, so please check them if you are new to Hugging Face.

Tweet Classification

The classification module consists of seven different tasks (topic classification, sentiment analysis, irony detection, hate speech detection, offensive language detection, emoji prediction, and emotion analysis). In each example, the model is instantiated by tweetnlp.load_model("task-name"), and prediction is run by passing a text or a list of texts as an argument to the corresponding function.

Topic Classification: The aim of this task is, given a tweet, to assign topics related to its content. The task is formed as a supervised multi-label classification problem where each tweet is assigned one or more topics from a total of 19 available topics. The topics were carefully curated based on Twitter trends with the aim of being broad and general, and consist of classes such as arts and culture, music, or sports. The internally annotated dataset contains over 10K manually labeled tweets (check the paper here, or the Hugging Face dataset page).

import tweetnlp

# MULTI-LABEL MODEL
model = tweetnlp.load_model('topic_classification')  # Or `model = tweetnlp.TopicClassification()`
model.topic("Jacob Collier is a Grammy-awarded English artist from London.")  # Or `model.predict`
>>> {'label': ['celebrity_&_pop_culture', 'music']}
# Note: the probability of the multi-label model is the output of a sigmoid function on the binary prediction of whether each topic is positive or negative.
model.topic("Jacob Collier is a Grammy-awarded English artist from London.", return_probability=True)
>>> {'label': ['celebrity_&_pop_culture', 'music'],
 'probability': {'arts_&_culture': 0.037371691316366196,
  'business_&_entrepreneurs': 0.010188567452132702,
  'celebrity_&_pop_culture': 0.92448890209198,
  'diaries_&_daily_life': 0.03425711765885353,
  'family': 0.00796138122677803,
  'fashion_&_style': 0.020642118528485298,
  'film_tv_&_video': 0.08062587678432465,
  'fitness_&_health': 0.006343095097690821,
  'food_&_dining': 0.0042883665300905704,
  'gaming': 0.004327300935983658,
  'learning_&_educational': 0.010652057826519012,
  'music': 0.8291937112808228,
  'news_&_social_concern': 0.24688217043876648,
  'other_hobbies': 0.020671198144555092,
  'relationships': 0.020371075719594955,
  'science_&_technology': 0.0170074962079525,
  'sports': 0.014291072264313698,
  'travel_&_adventure': 0.010423899628221989,
  'youth_&_student_life': 0.008605164475739002}}

# SINGLE-LABEL MODEL
model = tweetnlp.load_model('topic_classification', multi_label=False)  # Or `model = tweetnlp.TopicClassification(multi_label=False)`
model.topic("Jacob Collier is a Grammy-awarded English artist from London.")
>>> {'label': 'pop_culture'}
# Note: the probability of the single-label model is the softmax over the labels.
model.topic("Jacob Collier is a Grammy-awarded English artist from London.", return_probability=True)
>>> {'label': 'pop_culture',
 'probability': {'arts_&_culture': 9.20625461731106e-05,
  'business_&_entrepreneurs': 6.916998972883448e-05,
  'pop_culture': 0.9995898604393005,
  'daily_life': 0.00011083036952186376,
  'sports_&_gaming': 8.668467489769682e-05,
  'science_&_technology': 5.152115045348182e-05}}

# GET DATASET
dataset_multi_label, label2id_multi_label = tweetnlp.load_dataset('topic_classification')
dataset_single_label, label2id_single_label = tweetnlp.load_dataset('topic_classification', multi_label=False)
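
The notes in the example above say that multi-label probabilities come from a sigmoid per topic, while single-label probabilities come from a softmax over all topics. The following is a minimal, standard-library sketch of that difference. The logits are made up for illustration, and the 0.5 decision threshold is an assumption, not something the library documents here.

```python
import math


def sigmoid_probs(logits):
    """Independent per-label probabilities, as in a multi-label classifier."""
    return {label: 1.0 / (1.0 + math.exp(-z)) for label, z in logits.items()}


def softmax_probs(logits):
    """Probabilities that compete across labels and sum to 1 (single-label)."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(z - peak) for label, z in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Hypothetical logits, not produced by any TweetNLP model.
logits = {"music": 1.6, "celebrity_&_pop_culture": 2.5, "sports": -4.2}

multi = sigmoid_probs(logits)
single = softmax_probs(logits)

# Multi-label: keep every label whose sigmoid output exceeds the assumed 0.5 threshold.
multi_labels = sorted(label for label, p in multi.items() if p > 0.5)
# Single-label: keep only the arg-max of the softmax.
single_label = max(single, key=single.get)

print(multi_labels, single_label)
```

Because each sigmoid is computed independently, the multi-label probabilities need not sum to 1, which is why several topics can be returned at once.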

Sentiment Analysis: The sentiment analysis task integrated in TweetNLP is a simplified version where the goal is to predict the sentiment of a tweet with one of three labels: positive, neutral, or negative. The base dataset for English is the unified TweetEval version of the SemEval-2017 dataset from the task on Sentiment Analysis in Twitter (check the paper here).

import tweetnlp

# ENGLISH MODEL
model = tweetnlp.load_model('sentiment')  # Or `model = tweetnlp.Sentiment()`
model.sentiment("Yes, including Medicare and social security saving?")  # Or `model.predict`
>>> {'label': 'positive'}
model.sentiment("Yes, including Medicare and social security saving?", return_probability=True)
>>> {'label': 'positive', 'probability': {'negative': 0.004584966693073511, 'neutral': 0.19360853731632233, 'positive': 0.8018065094947815}}

# MULTILINGUAL MODEL
model = tweetnlp.load_model('sentiment', multilingual=True)  # Or `model = tweetnlp.Sentiment(multilingual=True)`
model.sentiment("天気が良いとやっぱり気持ち良いなあ")
>>> {'label': 'positive'}
model.sentiment("天気が良いとやっぱり気持ち良いなあ", return_probability=True)
>>> {'label': 'positive', 'probability': {'negative': 0.028369612991809845, 'neutral': 0.08128828555345535, 'positive': 0.8903420567512512}}

# GET DATASET (ENGLISH)
dataset, label2id = tweetnlp.load_dataset('sentiment')
# GET DATASET (MULTILINGUAL)
for l in ['all', 'arabic', 'english', 'french', 'german', 'hindi', 'italian', 'portuguese', 'spanish']:
    dataset_multilingual, label2id_multilingual = tweetnlp.load_dataset('sentiment', multilingual=True, task_language=l)

Irony Detection: This is a binary classification task where, given a tweet, the goal is to detect whether it is ironic or not. It is based on the Irony Detection dataset from the SemEval 2018 task (check the paper here).

import tweetnlp

# MODEL
model = tweetnlp.load_model('irony')  # Or `model = tweetnlp.Irony()`
model.irony('If you wanna look like a badass, have drama on social media')  # Or `model.predict`
>>> {'label': 'irony'}
model.irony('If you wanna look like a badass, have drama on social media', return_probability=True)
>>> {'label': 'irony', 'probability': {'non_irony': 0.08390884101390839, 'irony': 0.9160911440849304}}

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('irony')

Hate Speech Detection: The hate speech detection task consists of detecting whether a tweet is hateful towards a target community. The underlying model is based on a suite of unified hate speech detection datasets (see the reference paper).

import tweetnlp

# MODEL
model = tweetnlp.load_model('hate')  # Or `model = tweetnlp.Hate()`
model.hate('Whoever just unfollowed me you a bitch')  # Or `model.predict`
>>> {'label': 'non-hate'}
model.hate('Whoever just unfollowed me you a bitch', return_probability=True)
>>> {'label': 'non-hate', 'probability': {'non-hate': 0.7263831496238708, 'hate': 0.27361682057380676}}

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('hate')

Offensive Language Identification: This task consists of identifying whether some form of offensive language is present in a tweet. For our benchmark, we rely on the SemEval-2019 OffensEval dataset (check the paper here).

import tweetnlp

# MODEL
model = tweetnlp.load_model('offensive')  # Or `model = tweetnlp.Offensive()`
model.offensive("All two of them taste like ass.")  # Or `model.predict`
>>> {'label': 'offensive'}
model.offensive("All two of them taste like ass.", return_probability=True)
>>> {'label': 'offensive', 'probability': {'non-offensive': 0.16420328617095947, 'offensive': 0.8357967734336853}}

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('offensive')

Emoji Prediction: The goal of emoji prediction is to predict the final emoji of a given tweet. The dataset used to fine-tune our models is the TweetEval adaptation of the SemEval 2018 task on Emoji Prediction (check the paper here), including 20 emoji (❤, 😍, 😂, 💕, 🔥, 😊, 😎, ✨, 💙, 😘, 📷, 🇺🇸, ☀, 💜, 😉, 💯, 😁, 🎄, 📸, 😜).

import tweetnlp

# MODEL
model = tweetnlp.load_model('emoji')  # Or `model = tweetnlp.Emoji()`
model.emoji('Beautiful sunset last night from the pontoon @TupperLakeNY')  # Or `model.predict`
>>> {'label': '📷'}
model.emoji('Beautiful sunset last night from the pontoon @TupperLakeNY', return_probability=True)
>>> {'label': '📷',
 'probability': {'❤': 0.13197319209575653,
  '😍': 0.11246423423290253,
  '😂': 0.008415069431066513,
  '💕': 0.04842926934361458,
  '🔥': 0.014528146013617516,
  '😊': 0.1509675830602646,
  '😎': 0.08625403046607971,
  '✨': 0.01616635173559189,
  '💙': 0.07396604865789413,
  '😘': 0.03033279813826084,
  '📷': 0.16525287926197052,
  '🇺🇸': 0.020336611196398735,
  '☀': 0.00799981877207756,
  '💜': 0.016111424192786217,
  '😉': 0.012984540313482285,
  '💯': 0.012557178735733032,
  '😁': 0.031386848539114,
  '🎄': 0.006829539313912392,
  '📸': 0.04188741743564606,
  '😜': 0.011156936176121235}}

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('emoji')

Emotion Recognition: Given a tweet, this task consists of associating it with its most appropriate emotion. As a reference dataset, we use the SemEval 2018 task on Affect in Tweets (check the paper here). The latest multi-label model includes eleven emotion types.

import tweetnlp

# MULTI-LABEL MODEL
model = tweetnlp.load_model('emotion')  # Or `model = tweetnlp.Emotion()`
model.emotion('I love swimming for the same reason I love meditating...the feeling of weightlessness.')  # Or `model.predict`
>>> {'label': 'joy'}
# Note: the probability of the multi-label model is the output of a sigmoid function on the binary prediction of whether each emotion is positive or negative.
model.emotion('I love swimming for the same reason I love meditating...the feeling of weightlessness.', return_probability=True)
>>> {'label': 'joy',
 'probability': {'anger': 0.00025800734874792397,
  'anticipation': 0.0005329723935574293,
  'disgust': 0.00026112011983059347,
  'fear': 0.00027552215033210814,
  'joy': 0.7721399068832397,
  'love': 0.1806265264749527,
  'optimism': 0.04208092764019966,
  'pessimism': 0.00025325192837044597,
  'sadness': 0.0006160663324408233,
  'surprise': 0.0005619609728455544,
  'trust': 0.002393839880824089}}

# SINGLE-LABEL MODEL
model = tweetnlp.load_model('emotion')  # Or `model = tweetnlp.Emotion()`
model.emotion('I love swimming for the same reason I love meditating...the feeling of weightlessness.')  # Or `model.predict`
>>> {'label': 'joy'}
# Note: the probability of the single-label model is the softmax over the labels.
model.emotion('I love swimming for the same reason I love meditating...the feeling of weightlessness.', return_probability=True)
>>> {'label': 'optimism', 'probability': {'joy': 0.01367587223649025, 'optimism': 0.7345258593559265, 'anger': 0.1770714670419693, 'sadness': 0.07472680509090424}}

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('emotion')

WARNING: The single-label and multi-label emotion models have different label sets (the single-label model has four classes, 'joy'/'optimism'/'anger'/'sadness', while the multi-label model has eleven classes, 'joy'/'optimism'/'anger'/'sadness'/'love'/'trust'/'fear'/'surprise'/'anticipation'/'disgust'/'pessimism').
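
When predictions from the two emotion models are mixed, it can help to check which labels are shared. The following sketch only uses the label names listed in the warning above; the helper name `comparable_with_single_label` is hypothetical and not part of tweetnlp.

```python
# Label sets as listed in the warning above.
SINGLE_LABEL_EMOTIONS = {"joy", "optimism", "anger", "sadness"}
MULTI_LABEL_EMOTIONS = SINGLE_LABEL_EMOTIONS | {
    "love", "trust", "fear", "surprise", "anticipation", "disgust", "pessimism",
}

# Labels that only the multi-label model can produce.
MULTI_ONLY = MULTI_LABEL_EMOTIONS - SINGLE_LABEL_EMOTIONS


def comparable_with_single_label(label):
    """Return True if a multi-label prediction also exists in the single-label set."""
    return label in SINGLE_LABEL_EMOTIONS


print(sorted(MULTI_ONLY))
print(comparable_with_single_label("joy"), comparable_with_single_label("love"))
```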

Named Entity Recognition

This module consists of a named-entity recognition (NER) model specifically trained for tweets. The model is instantiated by tweetnlp.load_model("ner"), and runs prediction by giving a text or a list of texts as an argument to the ner function (check the paper here, or the Hugging Face dataset page).

import tweetnlp

# MODEL
model = tweetnlp.load_model('ner')  # Or `model = tweetnlp.NER()`
model.ner('Jacob Collier is a Grammy-awarded English artist from London.')  # Or `model.predict`
>>> [{'type': 'person', 'entity': 'Jacob Collier'}, {'type': 'event', 'entity': ' Grammy'}, {'type': 'location', 'entity': ' London'}]
# Note: the probability for the predicted entity is the mean of the probabilities over the sub-tokens representing the entity.
model.ner('Jacob Collier is a Grammy-awarded English artist from London.', return_probability=True)  # Or `model.predict`
>>> [
  {'type': 'person', 'entity': 'Jacob Collier', 'probability': 0.9905318220456442},
  {'type': 'event', 'entity': ' Grammy', 'probability': 0.19164378941059113},
  {'type': 'location', 'entity': ' London', 'probability': 0.9607000350952148}
]

# GET DATASET
dataset, label2id = tweetnlp.load_dataset('ner')
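
The note in the NER example says an entity's probability is the mean of the probabilities of its sub-tokens. A minimal sketch of that aggregation follows. The sub-token split and the per-token probabilities are invented for illustration and do not come from the model's tokenizer.

```python
from statistics import mean


def entity_probability(subtoken_probs):
    """Average the per-sub-token probabilities of one predicted entity."""
    return mean(subtoken_probs)


# Hypothetical sub-tokens and probabilities for the entity "Jacob Collier".
subtokens = [("Jacob", 0.99), ("Coll", 0.98), ("ier", 0.97)]
probability = entity_probability([p for _, p in subtokens])

print(probability)
```

One consequence of averaging is that a single low-confidence sub-token pulls down the score of the whole entity.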

Question Answering

This module consists of a question answering model specifically trained for tweets. The model is instantiated by tweetnlp.load_model("question_answering"), and runs prediction by giving a question or a list of questions along with a context or a list of contexts as arguments to the question_answering function (check the paper here, or the Hugging Face dataset page).

import tweetnlp

# MODEL
model = tweetnlp.load_model('question_answering')  # Or `model = tweetnlp.QuestionAnswering()`
model.question_answering(
  question='who created the post as we know it today?',
  context="'So much of The Post is Ben,' Mrs. Graham said in 1994, three years after Bradlee retired as editor. 'He created it as we know it today.'— Ed O'Keefe (@edatpost) October 21, 2014"
)  # Or `model.predict`
>>> {'generated_text': 'ben'}

# GET DATASET
dataset = tweetnlp.load_dataset('question_answering')

Question Answer Generation

This module consists of a question-and-answer pair generation model specifically trained for tweets. The model is instantiated by tweetnlp.load_model("question_answer_generation"), and runs prediction by giving a context or a list of contexts as an argument to the question_answer_generation function (check the paper here, or the Hugging Face dataset page).

import tweetnlp

# MODEL
model = tweetnlp.load_model('question_answer_generation')  # Or `model = tweetnlp.QuestionAnswerGeneration()`
model.question_answer_generation(
  text="'So much of The Post is Ben,' Mrs. Graham said in 1994, three years after Bradlee retired as editor. 'He created it as we know it today.'— Ed O'Keefe (@edatpost) October 21, 2014"
)  # Or `model.predict`
>>> [
    {'question': 'who created the post?', 'answer': 'ben'},
    {'question': 'what did ben do in 1994?', 'answer': 'he retired as editor'}
]

# GET DATASET
dataset = tweetnlp.load_dataset('question_answer_generation')

Language Modeling

The masked language model predicts the masked token in the given sentence. It is instantiated by tweetnlp.load_model('language_model'), and runs prediction by giving a text or a list of texts as an argument to the mask_prediction function. Please make sure that each text has a <mask> token, since that is the token the model is trained to predict.

import tweetnlp
model = tweetnlp.load_model('language_model')  # Or `model = tweetnlp.LanguageModel()`
model.mask_prediction("How many more <mask> until opening day? ?", best_n=2)  # Or `model.predict`
>>> {'best_tokens': ['days', 'hours'],
 'best_scores': [5.498564104033932e-11, 4.906026140893971e-10],
 'best_sentences': ['How many more days until opening day? ?',
  'How many more hours until opening day? ?']}
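
The relation between best_tokens and best_sentences in the output above can be sketched without any model: each candidate token is substituted for the <mask> placeholder. This is only an illustration of the output shape. The helper `fill_mask` is hypothetical and performs no prediction itself.

```python
MASK = "<mask>"


def fill_mask(text, candidates):
    """Substitute each candidate token for the first <mask> placeholder in text."""
    if MASK not in text:
        # Mirrors the requirement that every input text contains a <mask> token.
        raise ValueError(f"input text must contain a {MASK} token")
    return [text.replace(MASK, token, 1) for token in candidates]


sentences = fill_mask("How many more <mask> until opening day?", ["days", "hours"])
print(sentences)
```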

Tweet Embedding

The tweet embedding model produces a fixed-length embedding for a tweet. The embedding represents the semantics of the tweet, and can be used for semantic search over tweets using the similarity between embeddings. The model is instantiated by tweetnlp.load_model('sentence_embedding'), and runs prediction by passing a text or a list of texts as an argument to the embedding function.

Get Embedding

import tweetnlp
model = tweetnlp.load_model('sentence_embedding')  # Or `model = tweetnlp.SentenceEmbedding()`

# Get sentence embedding
tweet = "I will never understand the decision making of the people of Alabama. Their new Senator is a definite downgrade. You have served with honor.  Well done."
vectors = model.embedding(tweet)
vectors.shape
>>> (768,)

# Get sentence embedding (multiple inputs)
tweet_corpus = [
    "Free, fair elections are the lifeblood of our democracy. Charges of unfairness are serious. But calling an election unfair does not make it so. Charges require specific allegations and then proof. We have neither here.",
    "Trump appointed judge Stephanos Bibas ",
    "If your members can go to Puerto Rico they can get their asses back in the classroom. @CTULocal1",
    "@PolitiBunny @CTULocal1 Political leverage, science said schools could reopen, teachers and unions protested to keep'em closed and made demands for higher wages and benefits, they're usin Covid as a crutch at the expense of life and education.",
    "Congratulations to all the exporters on achieving record exports in Dec 2020 with a growth of 18 % over the previous year. Well done &amp; keep up this trend. A major pillar of our govt's economic policy is export enhancement &amp; we will provide full support to promote export culture.",
    "@ImranKhanPTI Pakistan seems a worst country in term of exporting facilities. I am a small business man and if I have to export a t-shirt having worth of $5 to USA or Europe. Postal cost will be around $30. How can we grow as an exporting country if this situation prevails. Think about it. #PM",
    "The thing that doesn’t sit right with me about “nothing good happened in 2020” is that it ignores the largest protest movement in our history. The beautiful, powerful Black Lives Matter uprising reached every corner of the country and should be central to our look back at 2020.",
    "@JoshuaPotash I kinda said that in the 2020 look back for @washingtonpost",
    "Is this a confirmation from Q that Lin is leaking declassified intelligence to the public? I believe so. If @realDonaldTrump didn’t approve of what @LLinWood is doing he would have let us know a lonnnnnng time ago. I’ve always wondered why Lin’s Twitter handle started with “LLin” https://t.co/0G7zClOmi2",
    "@ice_qued @realDonaldTrump @LLinWood Yeah 100%",
    "Tomorrow is my last day as Senator from Alabama.  I believe our opportunities are boundless when we find common ground. As we swear in a new Congress &amp; a new President, demand from them that they do just that &amp; build a stronger, more just society.  It’s been an honor to serve you."  # no trailing comma: Python joins this string with the next one, so the list has 12 items
    "The mask cult can’t ever admit masks don’t work because their ideology is based on feeling like a “good person”  Wearing a mask makes them a “good person” &amp; anyone who disagrees w/them isn’t  They can’t tolerate any idea that makes them feel like their self-importance is unearned",
    "@ianmSC Beyond that, they put such huge confidence in masks so early with no strong evidence that they have any meaningful benefit, they don’t want to backtrack or admit they were wrong. They put the cart before the horse, now desperate to find any results that match their hypothesis.",
]
vectors = model.embedding(tweet_corpus, batch_size=4)
vectors.shape
>>> (12, 768)

Similarity Search

sims = []
for n, i in enumerate(tweet_corpus):
  _sim = model.similarity(tweet, i)
  sims.append([n, _sim])
print(f'anchor tweet: {tweet}\n')
for m, (n, s) in enumerate(sorted(sims, key=lambda x: x[1], reverse=True)[:3]):
  print(f' - top {m}: {tweet_corpus[n]}\n - similarity: {s}\n')

>>> anchor tweet: I will never understand the decision making of the people of Alabama. Their new Senator is a definite downgrade. You have served with honor.  Well done.

 - top 0: Tomorrow is my last day as Senator from Alabama.  I believe our opportunities are boundless when we find common ground. As we swear in a new Congress &amp; a new President, demand from them that they do just that &amp; build a stronger, more just society.  It’s been an honor to serve you.The mask cult can’t ever admit masks don’t work because their ideology is based on feeling like a “good person”  Wearing a mask makes them a “good person” &amp; anyone who disagrees w/them isn’t  They can’t tolerate any idea that makes them feel like their self-importance is unearned
 - similarity: 0.7480925982953287

 - top 1: Trump appointed judge Stephanos Bibas 
 - similarity: 0.6289173306344258

 - top 2: Free, fair elections are the lifeblood of our democracy. Charges of unfairness are serious. But calling an election unfair does not make it so. Charges require specific allegations and then proof. We have neither here.
 - similarity: 0.6017154109745276
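
The ranking loop above can be illustrated on plain vectors. This document does not state which measure model.similarity computes, so the sketch below assumes cosine similarity. The three-dimensional vectors are toy values standing in for real 768-dimensional embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(anchor, corpus_vectors, k=3):
    """Return (index, similarity) pairs for the k corpus vectors closest to anchor."""
    scored = [(i, cosine_similarity(anchor, v)) for i, v in enumerate(corpus_vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors, assumed for illustration only.
anchor = [1.0, 0.0, 1.0]
corpus = [
    [1.0, 0.0, 1.0],    # same direction as the anchor
    [0.0, 1.0, 0.0],    # orthogonal to the anchor
    [1.0, 1.0, 1.0],    # partially aligned
    [-1.0, 0.0, -1.0],  # opposite direction
]
ranking = top_k(anchor, corpus, k=2)
print(ranking)
```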

Resources & Custom Model Loading

Here is a table of the default model used in each task.

Task	Model	Dataset
Topic Classification (single-label)	cardiffnlp/twitter-roberta-base-dec2021-tweet-topic-single-all	cardiffnlp/tweet_topic_single
Topic Classification (multi-label)	cardiffnlp/twitter-roberta-base-dec2021-tweet-topic-multi-all	cardiffnlp/tweet_topic_multi
Sentiment Analysis (multilingual)	cardiffnlp/twitter-xlm-roberta-base-sentiment	cardiffnlp/tweet_sentiment_multilingual
Sentiment Analysis	cardiffnlp/twitter-roberta-base-sentiment-latest	tweet_eval
Irony Detection	cardiffnlp/twitter-roberta-base-irony	tweet_eval
Hate Detection	cardiffnlp/twitter-roberta-base-hate-latest	tweet_eval
Offensive Detection	cardiffnlp/twitter-roberta-base-offensive	tweet_eval
Emoji Prediction	cardiffnlp/twitter-roberta-base-emoji	tweet_eval
Emotion Analysis (single-label)	cardiffnlp/twitter-roberta-base-emotion	tweet_eval
Emotion Analysis (multi-label)	cardiffnlp/twitter-roberta-base-emotion-multilabel-latest	TBA
Named Entity Recognition	tner/roberta-large-tweetner7-all	tner/tweetner7
Question Answering	lmqg/t5-small-tweetqa-qa	lmqg/qg_tweetqa
Question Answer Generation	lmqg/t5-base-tweetqa-qag	lmqg/qag_tweetqa
Language Modeling	cardiffnlp/twitter-roberta-base-2021-124m	TBA
Tweet Embedding	cambridgeltl/tweet-roberta-base-embeddings-v1	TBA

To use another model, either local or from the Hugging Face model hub, simply pass the model path or alias to the load_model function. Below is an example that loads a model for NER.

import tweetnlp
tweetnlp.load_model('ner', model_name='tner/twitter-roberta-base-2019-90m-tweetner7-continuous')

Model Fine-tuning

TweetNLP provides an easy interface to fine-tune language models on the datasets supported by Hugging Face for model hosting and fine-tuning, with Ray Tune for parameter search.

Supported tasks: sentiment, offensive, irony, hate, emotion, topic_classification

The results of experiments with the tweetnlp trainer can be found in the following table. The results are competitive and can be used as baselines for each task. See the leaderboard page to learn more about the results.

task	language_model	eval_f1	eval_f1_macro	eval_accuracy	link
emoji	cardiffnlp/twitter-roberta-base-2021-124m	0.46	0.35	0.46	cardiffnlp/twitter-roberta-base-2021-124m-emoji
emotion	cardiffnlp/twitter-roberta-base-2021-124m	0.83	0.79	0.83	cardiffnlp/twitter-roberta-base-2021-124m-emotion
hate	cardiffnlp/twitter-roberta-base-2021-124m	0.56	0.53	0.56	cardiffnlp/twitter-roberta-base-2021-124m-hate
irony	cardiffnlp/twitter-roberta-base-2021-124m	0.79	0.78	0.79	cardiffnlp/twitter-roberta-base-2021-124m-irony
offensive	cardiffnlp/twitter-roberta-base-2021-124m	0.86	0.82	0.86	cardiffnlp/twitter-roberta-base-2021-124m-offensive
sentiment	cardiffnlp/twitter-roberta-base-2021-124m	0.71	0.72	0.71	cardiffnlp/twitter-roberta-base-2021-124m
topic_classification (single)	cardiffnlp/twitter-roberta-base-2021-124m	0.9	0.8	0.9	cardiffnlp/twitter-roberta-base-2021-124m-topic-single
topic_classification (multi)	cardiffnlp/twitter-roberta-base-2021-124m	0.75	0.56	0.54	cardiffnlp/twitter-roberta-base-2021-124m-topic-multi
sentiment (multilingual)	cardiffnlp/twitter-xlm-roberta-base	0.69	0.69	0.69	cardiffnlp/twitter-xlm-roberta-base-sentiment-multilingual
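
In the table, eval_f1 equals eval_accuracy for every single-label task. That is what one would expect if eval_f1 is micro-averaged, although the table does not say so and this is an assumption. The sketch below computes micro-F1, macro-F1, and accuracy by hand on a small invented set of irony predictions to show why the first and last coincide.

```python
def classification_scores(y_true, y_pred):
    """Compute micro-F1, macro-F1 and accuracy for single-label predictions."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    per_label_f1 = []
    tp_sum = fp_sum = fn_sum = 0
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        per_label_f1.append(2 * tp / denom if denom else 0.0)
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    micro_denom = 2 * tp_sum + fp_sum + fn_sum
    return {
        "f1_micro": 2 * tp_sum / micro_denom if micro_denom else 0.0,
        "f1_macro": sum(per_label_f1) / len(per_label_f1),
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
    }


# Invented gold labels and predictions, for illustration only.
y_true = ["irony", "irony", "irony", "irony", "non_irony", "non_irony"]
y_pred = ["irony", "irony", "irony", "non_irony", "non_irony", "irony"]
scores = classification_scores(y_true, y_pred)
print(scores)
```

In the single-label setting every error is one false positive for the predicted class and one false negative for the true class, so micro-F1 reduces to accuracy. Macro-F1 weights each class equally and therefore drops when a smaller class is predicted poorly.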

Example

The following example will reproduce the cardiffnlp/twitter-roberta-base-2021-124m-irony model.

import logging
import tweetnlp

logging.basicConfig(format='%(asctime)s %(levelname)-8s %(message)s', level=logging.INFO, datefmt='%Y-%m-%d %H:%M:%S')

# load dataset
dataset, label_to_id = tweetnlp.load_dataset("irony")
# load trainer class
trainer_class = tweetnlp.load_trainer("irony")
# setup trainer
trainer = trainer_class(
    language_model='cardiffnlp/twitter-roberta-base-2021-124m',  # language model to fine-tune
    dataset=dataset,
    label_to_id=label_to_id,
    max_length=128,
    split_test='test',
    split_train='train',
    split_validation='validation',
    output_dir='model_ckpt/irony'
)
# start model fine-tuning with parameter optimization
trainer.train(
  eval_step=50,  # every `eval_step` steps, the model is validated on the validation set
  n_trials=10,  # number of trials for parameter optimization
  search_range_lr=[1e-6, 1e-4],  # search space for the learning rate (min and max value)
  search_range_epoch=[1, 6],  # search space for the number of epochs (min and max value)
  search_list_batch=[4, 8, 16, 32, 64]  # search space for the batch size (list of integers to test)
)
# evaluate model on the test set
trainer.evaluate()
>>> {
  "eval_loss": 1.3228046894073486,
  "eval_f1": 0.7959183673469388,
  "eval_f1_macro": 0.791350632069195,
  "eval_accuracy": 0.7959183673469388,
  "eval_runtime": 2.2267,
  "eval_samples_per_second": 352.084,
  "eval_steps_per_second": 44.01
}
# save model locally (saved at `{output_dir}/best_model` by default)
trainer.save_model()
# run prediction
trainer.predict('If you wanna look like a badass, have drama on social media')
>>> {'label': 'irony'}
# push your model to the Hugging Face hub
trainer.push_to_hub(hf_organization='cardiffnlp', model_alias='twitter-roberta-base-2021-124m-irony')

The saved checkpoint can be loaded as a custom model as follows.

import tweetnlp
model = tweetnlp.load_model('irony', model_name="model_ckpt/irony/best_model")

If split_validation is not given, the trainer will do a single run with default parameters, without parameter search.
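
To make the search-space arguments of trainer.train more concrete, here is a plain random-search sketch over the same three ranges. It is not how tweetnlp or Ray Tune implement the search. The log-uniform sampling of the learning rate, the inclusive epoch bounds, the helper names, and the toy objective are all assumptions made for illustration.

```python
import math
import random


def sample_trial(rng, search_range_lr, search_range_epoch, search_list_batch):
    """Draw one hypothetical configuration from the search space."""
    low, high = search_range_lr
    # Assumption: learning rates are sampled log-uniformly between min and max.
    learning_rate = math.exp(rng.uniform(math.log(low), math.log(high)))
    epoch = rng.randint(*search_range_epoch)  # assumption: both bounds inclusive
    batch_size = rng.choice(search_list_batch)
    return {"learning_rate": learning_rate, "epoch": epoch, "batch_size": batch_size}


def random_search(objective, n_trials, search_range_lr, search_range_epoch,
                  search_list_batch, seed=0):
    """Sample n_trials configurations and return the one with the highest score."""
    rng = random.Random(seed)
    trials = [
        sample_trial(rng, search_range_lr, search_range_epoch, search_list_batch)
        for _ in range(n_trials)
    ]
    return max(trials, key=objective)


def toy_objective(config):
    """Stand-in for a validation score: prefers learning rates near 1e-5."""
    return -abs(math.log10(config["learning_rate"]) + 5)


best = random_search(
    toy_objective,
    n_trials=10,
    search_range_lr=[1e-6, 1e-4],
    search_range_epoch=[1, 6],
    search_list_batch=[4, 8, 16, 32, 64],
)
print(best)
```

In a real run the objective would be the validation metric measured every eval_step steps, which is why the search needs a validation split.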

Reference Paper

For more details, please read the accompanying TweetNLP reference paper. If you use TweetNLP in your research, please use the following bib entry to cite the reference paper:

@inproceedings{camacho-collados-etal-2022-tweetnlp,
    title={{T}weet{NLP}: {C}utting-{E}dge {N}atural {L}anguage {P}rocessing for {S}ocial {M}edia},
    author={Camacho-Collados, Jose and Rezaee, Kiamehr and Riahi, Talayeh and Ushio, Asahi and Loureiro, Daniel and Antypas, Dimosthenis and Boisson, Joanne and Espinosa-Anke, Luis and Liu, Fangyu and Mart{\'i}nez-C{\'a}mara, Eugenio and others},
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = nov,
    year = "2022",
    address = "Abu Dhabi, U.A.E.",
    publisher = "Association for Computational Linguistics",
}
