AraVec 3.0

Advances in neural networks have driven progress in fields such as computer vision, speech recognition, and natural language processing (NLP). One of the most influential recent developments in NLP is the use of word embeddings, in which words are represented as vectors in a continuous space that captures many of the syntactic and semantic relations among them.
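To make the idea of "relations captured in a vector space" concrete, the sketch below compares toy vectors by cosine similarity, the measure commonly used to rank neighbours in embedding models. The three-dimensional vectors and their labels are invented for illustration and do not come from any AraVec model.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # undefined for a zero vector; report 0 by convention
    return dot / (norm_u * norm_v)


# Toy vectors, invented for illustration only (real models use 100 or 300 dimensions)
toy_vectors = {
    "tunisia": [0.9, 0.1, 0.0],
    "libya": [0.8, 0.2, 0.1],
    "guitar": [0.0, 0.1, 0.9],
}

# Words used in similar contexts should score higher than unrelated ones
related = cosine_similarity(toy_vectors["tunisia"], toy_vectors["libya"])
unrelated = cosine_similarity(toy_vectors["tunisia"], toy_vectors["guitar"])
print(related > unrelated)
```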

AraVec is an open-source project that provides pre-trained distributed word representations (word embeddings) for Arabic. The first version of AraVec provided six different word embedding models built on three different Arabic content domains, including tweets and Wikipedia. The accompanying paper describes the resources used to build the models, the data cleaning techniques applied, the preprocessing steps carried out, and the details of the word embedding creation techniques employed.

The third version of AraVec provides 16 different word embedding models built on two different Arabic content domains: tweets and Arabic Wikipedia articles. The major difference between this version and the previous ones is that we produced two different types of models: unigrams models and n-grams models. We used a set of statistical techniques to generate the most commonly used n-grams of each data domain.
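The exact statistical techniques are not described here. As a rough illustration only, the sketch below uses one simple possibility: count adjacent word pairs and join the frequent ones with an underscore, which is how n-gram tokens appear in the usage code. The function names and the `min_count` threshold are hypothetical and are not part of AraVec.

```python
from collections import Counter


def frequent_bigrams(sentences, min_count=2):
    """Return the set of adjacent word pairs seen at least min_count times."""
    counts = Counter()
    for tokens in sentences:
        counts.update(zip(tokens, tokens[1:]))
    return {pair for pair, count in counts.items() if count >= min_count}


def merge_bigrams(tokens, bigrams):
    """Join each frequent pair into a single token such as 'new_york'."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2  # both words were consumed by the merged token
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Toy corpus, invented for illustration only
corpus = [
    ["new", "york", "city"],
    ["in", "new", "york"],
    ["city", "in", "town"],
]
bigrams = frequent_bigrams(corpus, min_count=2)
print(merge_bigrams(corpus[0], bigrams))
```

A model trained on text rewritten this way learns one vector per merged token, which is why multi-word queries are joined with "_" before lookup.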

Twitter tweets
Arabic Wikipedia articles

Together these amount to more than 1,169,075,128 tokens.

[Figure: how the n-grams models are represented]

See the results page for more queries.

Citation

Abu Bakr Soliman, Kareem Eisa, and Samhaa R. El-Beltagy, "AraVec: A set of Arabic Word Embedding Models for use in Arabic NLP", in Proceedings of the 3rd International Conference on Arabic Computational Linguistics (ACLing 2017), Dubai, UAE, 2017.

Read the full-text paper

How to use

These models were built using the gensim Python library. To load and use one of the models, follow these steps:

Install gensim >= 3.4 and nltk >= 3.2 using either pip or conda

pip install gensim nltk

conda install gensim nltk

Extract the compressed model files to a directory [e.g. Twitter-CBOW]
Keep the .npy files. You will load the file without the .npy extension, as shown in the following code.
Run this Python code to load and use the model
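The extraction step above can be sketched with the standard library. This assumes the download is a .zip archive, which is not stated here; `extract_model` and `split_model_files` are hypothetical helper names, not part of AraVec or gensim.

```python
import zipfile
from pathlib import Path


def extract_model(archive_path, target_dir):
    """Extract a model archive and return the paths of the extracted files."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
    return sorted(p for p in target.rglob("*") if p.is_file())


def split_model_files(paths):
    """Separate the .npy array files from the files that can be passed to load()."""
    npy_files = [p for p in paths if p.suffix == ".npy"]
    loadable = [p for p in paths if p.suffix != ".npy"]
    return npy_files, loadable
```

For example, `extract_model("twitter_cbow.zip", "models")` (a made-up archive name) would unpack everything into `models/`. The .npy files must stay next to the main model file, because the model is loaded from the file without the .npy extension and reads the arrays from the same directory.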

How to integrate AraVec with spaCy

Notebook code

Code samples

# -*- coding: utf-8 -*-
import gensim
import re
import numpy as np
from nltk import ngrams

from utilities import *  # utilities.py module, which provides clean_str and calc_vec

# ============================
# ====== N-Grams Models ======

# Load a pre-trained n-grams model (CBOW, 100 dimensions, Twitter)
t_model = gensim.models.Word2Vec.load('models/full_grams_cbow_100_twitter.mdl')

# Python 3.x: clean the query and join its words with "_" to form an n-gram token
token = clean_str(u'ابو تريكه').replace(" ", "_")
# Python 2.7: decode the byte string first
# token = clean_str('ابو تريكه'.decode('utf8', errors='ignore')).replace(" ", "_")

# Query the model only if the token exists in its vocabulary
if token in t_model.wv:
    most_similar = t_model.wv.most_similar(token, topn=10)
    for term, score in most_similar:
        term = clean_str(term).replace(" ", "_")
        if term != token:
            print(term, score)

# تريكه 0.752911388874054
# حسام_غالي 0.7516342401504517
# وائل_جمعه 0.7244222164154053
# وليد_سليمان 0.7177559733390808
# ...

# =========================================
# == Get the most similar tokens to a compound query
# Most similar to:
# عمرو دياب + الخليج - مصر

pos_tokens = [clean_str(t.strip()).replace(" ", "_") for t in ['عمرو دياب', 'الخليج'] if t.strip() != ""]
neg_tokens = [clean_str(t.strip()).replace(" ", "_") for t in ['مصر'] if t.strip() != ""]

# Combine the positive and negative tokens into a single query vector
vec = calc_vec(pos_tokens=pos_tokens, neg_tokens=neg_tokens, n_model=t_model, dim=t_model.vector_size)

most_sims = t_model.wv.similar_by_vector(vec, topn=10)
for term, score in most_sims:
    if term not in pos_tokens + neg_tokens:
        print(term, score)

# راشد_الماجد 0.7094649076461792
# ماجد_المهندس 0.6979793906211853
# عبدالله_رويشد 0.6942606568336487
# ...

# ==============================
# ====== Uni-Grams Models ======

# Load a pre-trained unigrams model (CBOW, 100 dimensions, Twitter)
t_model = gensim.models.Word2Vec.load('models/full_uni_cbow_100_twitter.mdl')

# Python 3.x
token = clean_str(u'تونس')
# Python 2.7
# token = clean_str('تونس'.decode('utf8', errors='ignore'))

most_similar = t_model.wv.most_similar(token, topn=10)
for term, score in most_similar:
    print(term, score)

# ليبيا 0.8864325284957886
# الجزائر 0.8783721327781677
# السودان 0.8573237061500549
# مصر 0.8277812600135803
# ...



# Get the vector of a single word
word_vector = t_model.wv[token]
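The sample code relies on `clean_str` and `calc_vec` from utilities.py, which is not shown here. The sketch below gives hypothetical stand-ins to illustrate what such helpers might do: a light Arabic normalization (suggested by spellings such as 'تريكه' in the queries) and a vector sum of positive tokens minus negative tokens. The real helpers may behave differently; the `_sketch` names, the normalization rules, and the plain-dict model are assumptions.

```python
import re

# Hypothetical stand-ins for illustration only; not the project's utilities.py.

_TASHKEEL = re.compile(r"[\u064B-\u0652]")  # Arabic short-vowel diacritics
_LETTER_MAP = str.maketrans({"أ": "ا", "إ": "ا", "آ": "ا", "ة": "ه", "ى": "ي"})


def clean_str_sketch(text):
    """Strip diacritics, unify common letter variants, collapse whitespace."""
    text = _TASHKEEL.sub("", text)
    text = text.translate(_LETTER_MAP)
    return " ".join(text.split())


def calc_vec_sketch(pos_tokens, neg_tokens, vectors, dim):
    """Add the vectors of pos_tokens and subtract those of neg_tokens.

    `vectors` is any mapping from token to a sequence of `dim` floats.
    Tokens missing from the mapping are skipped.
    """
    vec = [0.0] * dim
    for sign, tokens in ((1.0, pos_tokens), (-1.0, neg_tokens)):
        for token in tokens:
            if token in vectors:
                for i, value in enumerate(vectors[token]):
                    vec[i] += sign * value
    return vec


# Toy 2-dimensional vectors, invented for illustration only
toy = {"a": [1.0, 2.0], "b": [3.0, 4.0], "c": [1.0, 1.0]}
print(calc_vec_sketch(["a", "b"], ["c"], toy, dim=2))
```

With a real model, the result would need converting with `np.asarray(vec)` before being passed to `similar_by_vector`.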

Downloads

N-Grams Models

To see what can be retrieved from the n-grams models using most-similar queries, please view the results page.

Model	No. of documents	Vocabulary size	Vector size	Download
Twitter-CBOW	66,900,000	1,476,715	300	Download
Twitter-CBOW	66,900,000	1,476,715	100	Download
Twitter-SkipGram	66,900,000	1,476,715	300	Download
Twitter-SkipGram	66,900,000	1,476,715	100	Download
Wikipedia-CBOW	1,800,000	662,109	300	Download
Wikipedia-CBOW	1,800,000	662,109	100	Download
Wikipedia-SkipGram	1,800,000	662,109	300	Download
Wikipedia-SkipGram	1,800,000	662,109	100	Download

Unigrams Models

Model	No. of documents	Vocabulary size	Vector size	Download
Twitter-CBOW	66,900,000	1,259,756	300	Download
Twitter-CBOW	66,900,000	1,259,756	100	Download
Twitter-SkipGram	66,900,000	1,259,756	300	Download
Twitter-SkipGram	66,900,000	1,259,756	100	Download
Wikipedia-CBOW	1,800,000	320,636	300	Download
Wikipedia-CBOW	1,800,000	320,636	100	Download
Wikipedia-SkipGram	1,800,000	320,636	300	Download
Wikipedia-SkipGram	1,800,000	320,636	100	Download
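When choosing among these models, it can help to estimate how much memory the word vectors alone will need. The sketch below multiplies vocabulary size by vector size, assuming 4 bytes per value (32-bit floats); that storage size is an assumption, and a loaded model holds more than this one matrix, so treat the figures as a lower bound.

```python
def embedding_matrix_bytes(vocab_size, vector_size, bytes_per_value=4):
    """Size of a dense vocab_size x vector_size matrix, in bytes."""
    return vocab_size * vector_size * bytes_per_value


# Vocabulary sizes taken from the tables above
models = [
    ("N-grams Twitter", 1_476_715),
    ("N-grams Wikipedia", 662_109),
    ("Unigrams Twitter", 1_259_756),
    ("Unigrams Wikipedia", 320_636),
]

for name, vocab_size in models:
    for vector_size in (100, 300):
        gib = embedding_matrix_bytes(vocab_size, vector_size) / 2**30
        print(f"{name:20s} vector size {vector_size:3d}: about {gib:.2f} GiB")
```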


