AraVec 3.0

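Advances in neural networks have driven progress in fields such as computer vision, speech recognition, and natural language processing (NLP). One of the most influential recent developments in NLP is the use of word embeddings, in which words are represented as vectors in a continuous space that captures many of the syntactic and semantic relations among them.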
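To make the idea of "relations captured in a vector space" concrete, here is a minimal sketch of cosine similarity, the measure most embedding toolkits use to compare word vectors. The three-dimensional vectors and the English labels are invented for illustration and are not AraVec values.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional vectors, for illustration only (not AraVec values).
toy_vectors = {
    "tunisia": [0.9, 0.1, 0.2],
    "libya": [0.8, 0.2, 0.1],
    "guitar": [0.1, 0.9, 0.7],
}

sim_related = cosine_similarity(toy_vectors["tunisia"], toy_vectors["libya"])
sim_unrelated = cosine_similarity(toy_vectors["tunisia"], toy_vectors["guitar"])

# Vectors pointing in similar directions score closer to 1.
print(sim_related > sim_unrelated)
```

Real embeddings have 100 or 300 dimensions, but the comparison works the same way.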

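AraVec is an open-source project of pre-trained distributed word representations (word embeddings) that aims to provide the Arabic NLP research community with free-to-use, powerful word embedding models. The first version of AraVec provided six word embedding models built on three different Arabic content domains, including tweets and Wikipedia. The accompanying paper describes the resources used to build the models, the data cleaning techniques employed, the preprocessing steps carried out, and the details of the word embedding creation techniques used.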

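The third version of AraVec provides 16 word embedding models built on two Arabic content domains: tweets and Arabic Wikipedia articles. The main difference from previous versions is that we produced two types of models, unigram models and n-gram models. We used a set of statistical techniques to generate the most frequently used n-grams in each data domain.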

Twitter tweets
Wikipedia Arabic articles

In total, the data contains more than 1,169,075,128 tokens.
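The statistical techniques used to pick the most common n-grams are not spelled out here. As a rough illustration only, and not the authors' method, the sketch below counts adjacent word pairs and merges the frequent ones into single underscore-joined tokens. The function names, the `min_count` threshold, and the English toy corpus are all assumptions made for the example.

```python
from collections import Counter

def frequent_bigrams(sentences, min_count=2):
    """Count adjacent word pairs and keep those seen at least min_count times."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        counts.update(zip(words, words[1:]))
    return {pair: n for pair, n in counts.items() if n >= min_count}

def merge_bigrams(sentence, bigrams):
    """Rewrite a sentence so each kept bigram becomes one underscore-joined token."""
    words = sentence.split()
    merged, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in bigrams:
            merged.append(words[i] + "_" + words[i + 1])
            i += 2  # skip the second word of the merged pair
        else:
            merged.append(words[i])
            i += 1
    return " ".join(merged)

# Toy corpus for illustration only.
corpus = [
    "new york is large",
    "i visited new york",
    "the city is new",
]
kept = frequent_bigrams(corpus, min_count=2)
print(merge_bigrams("i love new york", kept))  # i love new_york
```

A raw count threshold is the simplest possible criterion; real pipelines often use association scores so that frequent but uninformative pairs are not merged.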

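In the n-grams models, a multi-word phrase is stored as a single token whose words are joined by underscores (for example, حسام_غالي).

Please view the results page for more queries.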
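As a small illustration of that token format, the hypothetical helper below turns a phrase into an underscore-joined lookup key. The name `to_ngram_token` is an assumption for this sketch; the project's own code samples also normalize the text with `clean_str` first, which is omitted here.

```python
def to_ngram_token(phrase):
    """Collapse runs of whitespace and join the words of a phrase with underscores."""
    return "_".join(phrase.split())

print(to_ngram_token("عمرو دياب"))  # عمرو_دياب
```

Splitting and re-joining, rather than a plain `replace(" ", "_")`, avoids doubled underscores when the input contains repeated or trailing spaces.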

Citation

Abu Bakr Soliman, Kareem Eisa, and Samhaa R. El-Beltagy, “AraVec: A set of Arabic Word Embedding Models for use in Arabic NLP”, in proceedings of the 3rd International Conference on Arabic Computational Linguistics (ACLing 2017), Dubai, UAE, 2017.

Read the full paper

How to use

These models were built with the gensim Python library. The steps below, followed by a short code sample, show how to load and use one of the models:

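Install gensim >= 3.4 and nltk >= 3.2 using either pip or conda:

pip install gensim nltk

conda install gensim nltk

Extract the compressed model files to a directory (e.g. Twitter-CBOW).
Keep the .npy files. You load only the main model file (the one without the .npy extension), as the code sample shows.
Run the Python code sample to load and use the model.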
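Because gensim stores large arrays in separate .npy files next to the main model file, a load fails if those files were moved or deleted. The following sketch checks what sits beside a model before loading it. The helper name `list_model_files` is hypothetical, and the assumption that companion files share the model file's name as a prefix reflects gensim's usual naming; exact file names can vary by gensim version.

```python
from pathlib import Path

def list_model_files(model_path):
    """Return the model file plus sibling .npy files whose names start with it."""
    model_path = Path(model_path)
    if not model_path.is_file():
        raise FileNotFoundError(f"model file not found: {model_path}")
    siblings = sorted(
        p for p in model_path.parent.iterdir()
        if p.name.startswith(model_path.name) and p.name.endswith(".npy")
    )
    return [model_path] + siblings

# Example (the path is the one used in the code samples):
# for f in list_model_files("models/full_grams_cbow_100_twitter.mdl"):
#     print(f.name)
```

If the list contains only the main file for a large model, the archive was probably extracted incompletely.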

How to integrate AraVec with spaCy.io

Notebook code

Code samples

# -*- coding: utf8 -*-
import gensim
import re
import numpy as np
from nltk import ngrams

from utilities import *  # utilities.py module (provides clean_str and calc_vec)

# ============================
# ====== N-Grams Models ======

t_model = gensim.models.Word2Vec.load('models/full_grams_cbow_100_twitter.mdl')

# python 3.X
token = clean_str(u'ابو تريكه').replace(" ", "_")
# python 2.7
# token = clean_str(u'ابو تريكه'.decode('utf8', errors='ignore')).replace(" ", "_")

# query only if the token is in the model's vocabulary
if token in t_model.wv:
    most_similar = t_model.wv.most_similar(token, topn=10)
    for term, score in most_similar:
        term = clean_str(term).replace(" ", "_")
        if term != token:
            print(term, score)

# تريكه 0.752911388874054
# حسام_غالي 0.7516342401504517
# وائل_جمعه 0.7244222164154053
# وليد_سليمان 0.7177559733390808
# ...

# =========================================
# == Get the most similar tokens to a compound query
# most similar to 
# عمرو دياب + الخليج - مصر

pos_tokens = [clean_str(t.strip()).replace(" ", "_") for t in ['عمرو دياب', 'الخليج'] if t.strip() != ""]
neg_tokens = [clean_str(t.strip()).replace(" ", "_") for t in ['مصر'] if t.strip() != ""]

# combine the positive and negative token vectors into one query vector
vec = calc_vec(pos_tokens=pos_tokens, neg_tokens=neg_tokens, n_model=t_model, dim=t_model.vector_size)

most_sims = t_model.wv.similar_by_vector(vec, topn=10)
for term, score in most_sims:
    if term not in pos_tokens + neg_tokens:
        print(term, score)

# راشد_الماجد 0.7094649076461792
# ماجد_المهندس 0.6979793906211853
# عبدالله_رويشد 0.6942606568336487
# ...

# ==============================
# ====== Uni-Grams Models ======

t_model = gensim.models.Word2Vec.load('models/full_uni_cbow_100_twitter.mdl')

# python 3.X
token = clean_str(u'تونس')
# python 2.7
# token = clean_str('تونس'.decode('utf8', errors='ignore'))

most_similar = t_model.wv.most_similar(token, topn=10)
for term, score in most_similar:
    print(term, score)

# ليبيا 0.8864325284957886
# الجزائر 0.8783721327781677
# السودان 0.8573237061500549
# مصر 0.8277812600135803
# ...



# get a word vector
word_vector = t_model.wv[token]
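The `calc_vec` helper used in the compound query comes from utilities.py and is not shown here. As a hedged guess at the idea behind it, and not the project's actual implementation, the sketch below adds the vectors of the positive tokens and subtracts those of the negative tokens. The name `calc_vec_sketch`, its `vectors` mapping argument, and the toy values are assumptions for illustration.

```python
def calc_vec_sketch(pos_tokens, neg_tokens, vectors, dim):
    """Add the vectors of pos_tokens and subtract those of neg_tokens.

    `vectors` is any mapping from token to a sequence of `dim` floats.
    Tokens missing from the mapping are skipped.
    """
    result = [0.0] * dim
    for sign, tokens in ((1.0, pos_tokens), (-1.0, neg_tokens)):
        for token in tokens:
            if token in vectors:
                for i, value in enumerate(vectors[token]):
                    result[i] += sign * value
    return result

# Toy 2-dimensional vectors, for illustration only.
toy = {"a": [1.0, 2.0], "b": [0.5, 0.5], "c": [1.0, 0.0]}
print(calc_vec_sketch(["a", "b"], ["c"], toy, dim=2))  # [0.5, 2.5]
```

With a loaded gensim model, `t_model.wv` supports both `in` and `[]`, so it could be passed as the `vectors` argument of this sketch.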

Download

N-Grams Models

To see what the n-grams models can retrieve for some most-similar queries, please view the results page.

Model	No. of documents	Vocabulary size	Vector size
Twitter-CBOW	66,900,000	1,476,715	300
Twitter-CBOW	66,900,000	1,476,715	100
Twitter-SkipGram	66,900,000	1,476,715	300
Twitter-SkipGram	66,900,000	1,476,715	100
Wikipedia-CBOW	1,800,000	662,109	300
Wikipedia-CBOW	1,800,000	662,109	100
Wikipedia-SkipGram	1,800,000	662,109	300
Wikipedia-SkipGram	1,800,000	662,109	100

Unigrams Models

Model	No. of documents	Vocabulary size	Vector size
Twitter-CBOW	66,900,000	1,259,756	300
Twitter-CBOW	66,900,000	1,259,756	100
Twitter-SkipGram	66,900,000	1,259,756	300
Twitter-SkipGram	66,900,000	1,259,756	100
Wikipedia-CBOW	1,800,000	320,636	300
Wikipedia-CBOW	1,800,000	320,636	100
Wikipedia-SkipGram	1,800,000	320,636	300
Wikipedia-SkipGram	1,800,000	320,636	100
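When choosing between the 100- and 300-dimensional variants, it can help to estimate how much memory the word-vector matrix alone will need. The sketch below multiplies vocabulary size by vector size, assuming 4-byte (float32) values; that storage type is an assumption, and a full model file also holds training weights, so the real footprint is larger.

```python
def embedding_matrix_bytes(vocab_size, vec_size, bytes_per_value=4):
    """Size of a dense vocab_size x vec_size matrix, assuming 4-byte floats."""
    return vocab_size * vec_size * bytes_per_value

# (label, vocabulary size, vector size) taken from the tables above
for name, vocab, dim in [
    ("Twitter n-grams, 300", 1_476_715, 300),
    ("Wikipedia unigrams, 100", 320_636, 100),
]:
    size_gib = embedding_matrix_bytes(vocab, dim) / 2**30
    print(f"{name}: {size_gib:.2f} GiB")
```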

