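BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia. Its intended use is as input for neural models in natural language processing.

Website ・ Usage ・ Download ・ MultiBPEmb ・ Paper (PDF) ・ Citing BPEmb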
Install BPEmb with pip:

    pip install bpemb

Embeddings and SentencePiece models will be downloaded automatically the first time you use them.

    >>> from bpemb import BPEmb
    # load English BPEmb model with default vocabulary size (10k) and 50-dimensional embeddings
    >>> bpemb_en = BPEmb(lang="en", dim=50)
    downloading https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.model
    downloading https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.d50.w2v.bin.tar.gz

You can do two things with BPEmb. The first is subword segmentation:

    # apply English BPE subword segmentation model
    >>> bpemb_en.encode("Stratford")
    ['▁strat', 'ford']
    # load Chinese BPEmb model with vocabulary size 100k and default (100-dim) embeddings
    >>> bpemb_zh = BPEmb(lang="zh", vs=100000)
    # apply Chinese BPE subword segmentation model
    >>> bpemb_zh.encode("这是一个中文句子")  # "This is a Chinese sentence."
    ['▁这是一个', '中文', '句子']  # ["This is a", "Chinese", "sentence"]

Whether and how a word is split depends on the vocabulary size. In general, a smaller vocabulary size yields a segmentation into many subwords, while a larger vocabulary size leaves frequent words unsplit:

| Vocabulary size | Segmentation |
|---|---|
| 1000 | ['▁str', 'at', 'f', 'ord'] |
| 3000 | ['▁str', 'at', 'ford'] |
| 5000 | ['▁str', 'at', 'ford'] |
| 10000 | ['▁strat', 'ford'] |
| 25000 | ['▁stratford'] |
| 50000 | ['▁stratford'] |
| 100000 | ['▁stratford'] |
| 200000 | ['▁stratford'] |
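
The relationship between vocabulary size and segmentation can be illustrated with a toy BPE learner. The sketch below is not BPEmb's implementation (BPEmb relies on SentencePiece models trained on Wikipedia); the function names `learn_merges` and `segment`, the tiny corpus, and the merge budgets are all assumptions made for illustration. Each learned merge adds one symbol to the vocabulary, so a larger merge budget plays the role of a larger vocabulary size.

```python
from collections import Counter

WORD_START = "▁"  # word-initial marker, as seen in the segmentations above


def merge_pair(symbols, pair):
    """Join every adjacent occurrence of `pair` in a tuple of symbols."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_merges(words, num_merges):
    """Learn up to `num_merges` merge rules by repeatedly joining the most frequent pair."""
    corpus = Counter(tuple(WORD_START + word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the new merge rule to every word in the corpus.
        updated = Counter()
        for symbols, freq in corpus.items():
            updated[merge_pair(symbols, best)] += freq
        corpus = updated
    return merges


def segment(word, merges):
    """Segment `word` by replaying the learned merges in order."""
    symbols = tuple(WORD_START + word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus in which "stratford" is the most frequent word.
toy_corpus = ["stratford"] * 5 + ["strat", "ford", "stream", "afford", "oxford"]
merges = learn_merges(toy_corpus, num_merges=50)

# A larger merge budget (larger vocabulary) never yields more pieces.
for budget in (0, 3, 6, len(merges)):
    print(budget, segment("stratford", merges[:budget]))
```

With a budget of zero the word falls apart into single characters, and once all learned merges are applied the frequent word comes out as one piece, which mirrors the trend in the table.
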
The second purpose of BPEmb is to provide pretrained subword embeddings:

    # Embeddings are wrapped in a gensim KeyedVectors object
    >>> type(bpemb_zh.emb)
    gensim.models.keyedvectors.Word2VecKeyedVectors
    # You can use BPEmb objects like gensim KeyedVectors
    >>> bpemb_en.most_similar("ford")
    [('bury', 0.8745079040527344),
     ('ton', 0.8725000619888306),
     ('well', 0.871537446975708),
     ('ston', 0.8701574206352234),
     ('worth', 0.8672043085098267),
     ('field', 0.859795331954956),
     ('ley', 0.8591548204421997),
     ('ington', 0.8126075267791748),
     ('bridge', 0.8099068999290466),
     ('brook', 0.7979353070259094)]
    >>> type(bpemb_en.vectors)
    numpy.ndarray
    >>> bpemb_en.vectors.shape
    (10000, 50)
    >>> bpemb_zh.vectors.shape
    (100000, 100)

To use subword embeddings in your neural network, either encode your input into subword IDs:

    >>> ids = bpemb_zh.encode_ids("这是一个中文句子")
    >>> ids
    [25950, 695, 20199]
    >>> bpemb_zh.vectors[ids].shape
    (3, 100)

Or use the `embed` method:

    # apply Chinese subword segmentation and perform embedding lookup
    >>> bpemb_zh.embed("这是一个中文句子").shape
    (3, 100)

Models are available for the following languages (language code, then name):

ab (Abkhazian) ・ ace (Achinese) ・ ady (Adyghe) ・ af (Afrikaans) ・ ak (Akan) ・ als (Alemannic) ・ am (Amharic) ・ ast (Asturian) ・ atj (Atikamekw) ・ av (Avaric)

ba (Bashkir) ・ bar (Bavarian) ・ bcl (Central Bikol) ・ be (Belarusian) ・ bg (Bulgarian) ・ bi (Bislama) ・ bjn (Banjar) ・ bm (Bambara) ・ bn (Bengali) ・ bs (Bosnian) ・ bug (Buginese) ・ bxr (Russia Buriat)

ca (Catalan) ・ cdo (Min Dong Chinese) ・ ce (Chechen) ・ ceb (Cebuano) ・ ch (Chamorro) ・ chr (Cherokee) ・ chy (Cheyenne) ・ ckb (Central Kurdish) ・ csb (Kashubian) ・ cu (Church Slavic) ・ cv (Chuvash) ・ cy (Welsh)

da (Danish) ・ de (German) ・ din (Dinka) ・ diq (Dimli) ・ dsb (Lower Sorbian) ・ dty (Dotyali) ・ dv (Dhivehi) ・ dz (Dzongkha)

ee (Ewe) ・ el (Modern Greek) ・ en (English) ・ eo (Esperanto) ・ es (Spanish) ・ et (Estonian) ・ eu (Basque) ・ ext (Extremaduran)

fa (Persian) ・ ff (Fulah) ・ fi (Finnish) ・ fj (Fijian) ・ fo (Faroese) ・ fr (French) ・ frp (Arpitan) ・ frr (Northern Frisian) ・ fur (Friulian)

ga (Irish) ・ gag (Gagauz) ・ gan (Gan Chinese) ・ gd (Scottish Gaelic) ・ gl (Galician) ・ glk (Gilaki) ・ gn (Guarani) ・ gom (Goan Konkani)

ha (Hausa) ・ hak (Hakka Chinese) ・ haw (Hawaiian) ・ he (Hebrew) ・ hi (Hindi) ・ hif (Fiji Hindi) ・ hr (Croatian) ・ hsb (Upper Sorbian)

ia (Interlingua) ・ id (Indonesian) ・ ie (Interlingue) ・ ig (Igbo) ・ ik (Inupiaq) ・ ilo (Iloko) ・ io (Ido) ・ is (Icelandic) ・ it (Italian) ・ iu (Inuktitut)

ja (Japanese) ・ jam (Jamaican Creole English) ・ jbo (Lojban) ・ jv (Javanese)

ka (Georgian) ・ kaa (Kara-Kalpak) ・ kab (Kabyle) ・ kbd (Kabardian) ・ kbp (Kabiyè) ・ kg (Kongo) ・ ki (Kikuyu) ・ ko (Korean) ・ koi (Komi-Permyak) ・ krc (Karachay-Balkar) ・ ks (Kashmiri) ・ ksh (Kölsch) ・ ku (Kurdish) ・ kv (Komi) ・ kw (Cornish)

la (Latin) ・ lad (Ladino) ・ lb (Luxembourgish) ・ lbe (Lak) ・ lez (Lezghian) ・ lg (Ganda) ・ li (Limburgan) ・ lij (Ligurian) ・ lmo (Lombard) ・ ln (Lingala) ・ lo (Lao) ・ lrc (Northern Luri) ・ ltg (Latgalian) ・ lv (Latvian)

mai (Maithili) ・ mdf (Moksha) ・ mg (Malagasy) ・ mh (Marshallese) ・ mhr (Eastern Mari) ・ mi (Maori) ・ min (Minangkabau) ・ mk (Macedonian) ・ ml (Malayalam) ・ mr (Marathi) ・ ms (Malay) ・ mt (Maltese) ・ mwl (Mirandese) ・ my (Burmese) ・ myv (Erzya) ・ mzn (Mazanderani)

na (Nauru) ・ nap (Neapolitan) ・ nds (Low German) ・ ne (Nepali) ・ new (Newari) ・ ng (Ndonga) ・ nl (Dutch) ・ nn (Norwegian Nynorsk) ・ ny (Nyanja)

oc (Occitan) ・ olo (Livvi) ・ om (Oromo) ・ or (Oriya) ・ os (Ossetian)

pa (Panjabi) ・ pag (Pangasinan) ・ pam (Pampanga) ・ pap (Papiamento) ・ pcd (Picard) ・ pdc (Pennsylvania German) ・ pfl (Pfaelzisch) ・ pnb (Western Panjabi) ・ pnt (Pontic) ・ ps (Pushto) ・ pt (Portuguese)

qu (Quechua)

rm (Romansh) ・ rmy (Vlax Romani) ・ rn (Rundi) ・ ro (Romanian) ・ ru (Russian) ・ rue (Rusyn) ・ rw (Kinyarwanda)

sa (Sanskrit) ・ sah (Yakut) ・ sc (Sardinian) ・ scn (Sicilian) ・ sco (Scots) ・ sd (Sindhi) ・ se (Northern Sami) ・ sg (Sango) ・ sh (Serbo-Croatian) ・ sk (Slovak) ・ sl (Slovenian) ・ sn (Shona) ・ so (Somali) ・ sr (Serbian) ・ srn (Sranan Tongo) ・ ss (Swati) ・ st (Southern Sotho) ・ stq (Saterfriesisch)

ta (Tamil) ・ tcy (Tulu) ・ te (Telugu) ・ tet (Tetum) ・ tg (Tajik) ・ th (Thai) ・ ti (Tigrinya) ・ tk (Turkmen) ・ tl (Tagalog) ・ tn (Tswana) ・ to (Tonga) ・ ts (Tsonga) ・ tt (Tatar) ・ tum (Tumbuka) ・ tw (Twi) ・ ty (Tahitian) ・ tyv (Tuvinian)

udm (Udmurt) ・ ug (Uighur) ・ uk (Ukrainian) ・ ur (Urdu) ・ uz (Uzbek)

ve (Venda) ・ vec (Venetian) ・ vep (Veps) ・ vi (Vietnamese) ・ vls (Vlaams) ・ vo (Volapük)

wa (Walloon) ・ war (Waray) ・ wo (Wolof) ・ wuu (Wu Chinese)

xal (Kalmyk) ・ xh (Xhosa) ・ xmf (Mingrelian)

yi (Yiddish) ・ yo (Yoruba)

za (Zhuang) ・ zea (Zeeuws) ・ zh (Chinese) ・ zu (Zulu)

multi (multilingual)
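
For any of the models listed above, the two lookup styles shown earlier (`vectors[ids]` and `embed`) in effect select rows of the embedding matrix, and gensim's `most_similar` ranks entries by cosine similarity. The following is a minimal NumPy sketch of those two operations on a random stand-in matrix, so it runs without downloading a model. The matrix contents, the reduced vocabulary size, and the helper `most_similar_rows` are assumptions for illustration and are not part of the BPEmb API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for `bpemb_zh.vectors`: random values, NOT real BPEmb embeddings.
# The vocabulary is smaller than the real 100k model to keep the sketch light.
vocab_size, dim = 30_000, 100
vectors = rng.normal(size=(vocab_size, dim)).astype(np.float32)

# Stand-in for the result of `encode_ids`: three subword IDs.
ids = [25950, 695, 20199]

# Indexing with a list of IDs selects one embedding row per subword.
subword_vectors = vectors[ids]
print(subword_vectors.shape)  # (3, 100)


def most_similar_rows(query_id, matrix, topn=5):
    """Return (row index, cosine similarity) for the `topn` rows closest to `query_id`."""
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = unit @ unit[query_id]
    sims[query_id] = -np.inf  # exclude the query row itself
    top = np.argsort(-sims)[:topn]
    return [(int(i), float(sims[i])) for i in top]


neighbours = most_similar_rows(695, vectors, topn=5)
```

With real embeddings, the row indices returned by such a helper would be mapped back to subword strings; with the random matrix here the neighbours carry no meaning and only demonstrate the mechanics.
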
If you use BPEmb in academic work, please cite:

    @InProceedings{heinzerling2018bpemb,
      author = {Benjamin Heinzerling and Michael Strube},
      title = "{BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages}",
      booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
      year = {2018},
      month = {May 7-12, 2018},
      address = {Miyazaki, Japan},
      editor = {Nicoletta Calzolari (Conference chair) and Khalid Choukri and Christopher Cieri and Thierry Declerck and Sara Goggi and Koiti Hasida and Hitoshi Isahara and Bente Maegaard and Joseph Mariani and Hélène Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis and Takenobu Tokunaga},
      publisher = {European Language Resources Association (ELRA)},
      isbn = {979-10-95546-00-9},
      language = {english}
    }