BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia. Its intended use is as input for neural models in natural language processing.
Install BPEmb with pip:

```
pip install bpemb
```

Embeddings and SentencePiece models will be downloaded automatically the first time you use them.
```python
>>> from bpemb import BPEmb
# load English BPEmb model with default vocabulary size (10k) and 50-dimensional embeddings
>>> bpemb_en = BPEmb(lang="en", dim=50)
downloading https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.model
downloading https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.d50.w2v.bin.tar.gz
```

You can do two things with BPEmb. The first is subword segmentation:
```python
# apply English BPE subword segmentation model
>>> bpemb_en.encode("Stratford")
['▁strat', 'ford']
# load Chinese BPEmb model with vocabulary size 100k and default (100-dim) embeddings
>>> bpemb_zh = BPEmb(lang="zh", vs=100000)
# apply Chinese BPE subword segmentation model
>>> bpemb_zh.encode("这是一个中文句子")  # "This is a Chinese sentence."
['▁这是一个', '中文', '句子']  # ["This is a", "Chinese", "sentence"]
```

Whether and how a word is split depends on the vocabulary size. Generally, a smaller vocabulary size yields a segmentation into many subwords, while with a larger vocabulary size frequent words are not split at all (a short sketch reproducing this comparison follows the table):
| vocabulary size | segmentation |
|---|---|
| 1000 | ['▁str', 'at', 'f', 'ord'] |
| 3000 | ['▁str', 'at', 'ford'] |
| 5000 | ['▁str', 'at', 'ford'] |
| 10000 | ['▁strat', 'ford'] |
| 25000 | ['▁stratford'] |
| 50000 | ['▁stratford'] |
| 100000 | ['▁stratford'] |
| 200000 | ['▁stratford'] |
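A minimal sketch of how such a comparison can be produced, assuming English models are available for each of these vocabulary sizes with 50-dimensional embeddings (the `vs` values are taken from the table above):

```python
from bpemb import BPEmb

# Compare how the same word is segmented under different BPE vocabulary sizes.
# Each model is downloaded (or loaded from cache) on first use.
for vs in [1000, 10000, 100000]:
    bpemb = BPEmb(lang="en", vs=vs, dim=50)  # dim=50 is an assumed, commonly available size
    print(vs, bpemb.encode("Stratford"))
```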
The second purpose of BPEmb is to provide pretrained subword embeddings:
```python
# Embeddings are wrapped in a gensim KeyedVectors object
>>> type(bpemb_zh.emb)
gensim.models.keyedvectors.Word2VecKeyedVectors
# You can use BPEmb objects like gensim KeyedVectors
>>> bpemb_en.most_similar("ford")
[('bury', 0.8745079040527344),
 ('ton', 0.8725000619888306),
 ('well', 0.871537446975708),
 ('ston', 0.8701574206352234),
 ('worth', 0.8672043085098267),
 ('field', 0.859795331954956),
 ('ley', 0.8591548204421997),
 ('ington', 0.8126075267791748),
 ('bridge', 0.8099068999290466),
 ('brook', 0.7979353070259094)]
>>> type(bpemb_en.vectors)
numpy.ndarray
>>> bpemb_en.vectors.shape
(10000, 50)
>>> bpemb_zh.vectors.shape
(100000, 100)
```

To use subword embeddings in your neural network, either encode your input into subword IDs:
```python
>>> ids = bpemb_zh.encode_ids("这是一个中文句子")
>>> ids
[25950, 695, 20199]
>>> bpemb_zh.vectors[ids].shape
(3, 100)
```

Or use the embed method:
```python
# apply Chinese subword segmentation and perform embedding lookup
>>> bpemb_zh.embed("这是一个中文句子").shape
(3, 100)
```
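To make the "input to a neural model" use case concrete, here is a small sketch (not from the original text) that loads the pretrained vectors into a PyTorch embedding layer; PyTorch and the variable names below are assumptions for illustration, not something BPEmb requires:

```python
import torch
import torch.nn as nn
from bpemb import BPEmb

bpemb_zh = BPEmb(lang="zh", vs=100000)  # 100-dimensional embeddings by default

# Initialize an embedding layer with the pretrained BPEmb vectors.
emb_layer = nn.Embedding.from_pretrained(
    torch.tensor(bpemb_zh.vectors, dtype=torch.float32),
    freeze=False,  # set True to keep the pretrained vectors fixed during training
)

# Encode a sentence into subword IDs and look the vectors up through the layer.
ids = torch.tensor(bpemb_zh.encode_ids("这是一个中文句子"))
print(emb_layer(ids).shape)  # torch.Size([3, 100])
```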
Models and embeddings are available for the following languages (a short usage example follows the list):

ab (Abkhazian) ・ ace (Achinese) ・ ady (Adyghe) ・ af (Afrikaans) ・ ak (Akan) ・ als (Alemannic) ・ am (Amharic) ・ ast (Asturian) ・ atj (Atikamekw) ・ av (Avaric)
ba (Bashkir) ・ bar (Bavarian) ・ bcl (Central Bikol) ・ be (Belarusian) ・ bg (Bulgarian) ・ bi (Bislama) ・ bjn (Banjar) ・ bm (Bambara) ・ bn (Bengali) ・ bs (Bosnian) ・ bug (Buginese) ・ bxr (Russia Buriat)
ca (Catalan) ・ cdo (Min Dong Chinese) ・ ce (Chechen) ・ ceb (Cebuano) ・ ch (Chamorro) ・ chr (Cherokee) ・ chy (Cheyenne) ・ ckb (Central Kurdish) ・ csb (Kashubian) ・ cu (Church Slavic) ・ cv (Chuvash) ・ cy (Welsh)
da (Danish) ・ de (German) ・ din (Dinka) ・ diq (Dimli) ・ dsb (Lower Sorbian) ・ dty (Dotyali) ・ dv (Dhivehi) ・ dz (Dzongkha)
ee (Ewe) ・ el (Modern Greek) ・ en (English) ・ eo (Esperanto) ・ es (Spanish) ・ et (Estonian) ・ eu (Basque) ・ ext (Extremaduran)
fa (Persian) ・ ff (Fulah) ・ fi (Finnish) ・ fj (Fijian) ・ fo (Faroese) ・ fr (French) ・ frp (Arpitan) ・ frr (Northern Frisian) ・ fur (Friulian)
ga (Irish) ・ gag (Gagauz) ・ gan (Gan Chinese) ・ gd (Scottish Gaelic) ・ gl (Galician) ・ glk (Gilaki) ・ gn (Guarani) ・ gom (Goan Konkani)
ha (Hausa) ・ hak (Hakka Chinese) ・ haw (Hawaiian) ・ he (Hebrew) ・ hi (Hindi) ・ hif (Fiji Hindi) ・ hr (Croatian) ・ hsb (Upper Sorbian)
ia (Interlingua) ・ id (Indonesian) ・ ie (Interlingue) ・ ig (Igbo) ・ ik (Inupiaq) ・ ilo (Iloko) ・ io (Ido) ・ is (Icelandic) ・ it (Italian) ・ iu (Inuktitut)
ja (Japanese) ・ jam (Jamaican Creole English) ・ jbo (Lojban) ・ jv (Javanese)
ka (Georgian) ・ kaa (Kara-Kalpak) ・ kab (Kabyle) ・ kbd (Kabardian) ・ kbp (Kabiyè) ・ kg (Kongo) ・ ki (Kikuyu) ・ ko (Korean) ・ koi (Komi-Permyak) ・ krc (Karachay-Balkar) ・ ks (Kashmiri) ・ ksh (Kölsch) ・ ku (Kurdish) ・ kv (Komi) ・ kw (Cornish)
la (Latin) ・ lad (Ladino) ・ lb (Luxembourgish) ・ lbe (Lak) ・ lez (Lezghian) ・ lg (Ganda) ・ li (Limburgan) ・ lij (Ligurian) ・ lmo (Lombard) ・ ln (Lingala) ・ lo (Lao) ・ lrc (Northern Luri) ・ lt (Lithuanian) ・ ltg (Latgalian) ・ lv (Latvian)
mai (Maithili) ・ mdf (Moksha) ・ mg (Malagasy) ・ mh (Marshallese) ・ mhr (Eastern Mari) ・ mi (Maori) ・ min (Minangkabau) ・ mk (Macedonian) ・ ml (Malayalam) ・ mr (Marathi) ・ ms (Malay) ・ mt (Maltese) ・ mwl (Mirandese) ・ my (Burmese) ・ myv (Erzya) ・ mzn (Mazanderani)
na (Nauru) ・ nap (Neapolitan) ・ nds (Low German) ・ ne (Nepali) ・ new (Newari) ・ ng (Ndonga) ・ nl (Dutch) ・ nn (Norwegian Nynorsk) ・ ny (Nyanja)
oc (Occitan) ・ olo (Livvi) ・ om (Oromo) ・ or (Oriya) ・ os (Ossetian)
pa (Panjabi) ・ pag (Pangasinan) ・ pam (Pampanga) ・ pap (Papiamento) ・ pcd (Picard) ・ pdc (Pennsylvania German) ・ pfl (Pfaelzisch) ・ pnb (Western Panjabi) ・ pnt (Pontic) ・ ps (Pushto) ・ pt (Portuguese)
qu (Quechua)
rm (Romansh) ・ rmy (Vlax Romani) ・ rn (Rundi) ・ ro (Romanian) ・ ru (Russian) ・ rue (Rusyn) ・ rw (Kinyarwanda)
sa (Sanskrit) ・ sah (Yakut) ・ sc (Sardinian) ・ scn (Sicilian) ・ sco (Scots) ・ sd (Sindhi) ・ se (Northern Sami) ・ sg (Sango) ・ sh (Serbo-Croatian) ・ sk (Slovak) ・ sl (Slovenian) ・ sn (Shona) ・ so (Somali) ・ sr (Serbian) ・ srn (Sranan Tongo) ・ ss (Swati) ・ st (Southern Sotho) ・ stq (Saterfriesisch)
ta (Tamil) ・ tcy (Tulu) ・ te (Telugu) ・ tet (Tetum) ・ tg (Tajik) ・ th (Thai) ・ ti (Tigrinya) ・ tk (Turkmen) ・ tl (Tagalog) ・ tn (Tswana) ・ to (Tonga) ・ ts (Tsonga) ・ tt (Tatar) ・ tum (Tumbuka) ・ tw (Twi) ・ ty (Tahitian) ・ tyv (Tuvinian)
udm (Udmurt) ・ ug (Uighur) ・ uk (Ukrainian) ・ ur (Urdu) ・ uz (Uzbek)
ve (Venda) ・ vec (Venetian) ・ vep (Veps) ・ vi (Vietnamese) ・ vls (Vlaams) ・ vo (Volapük)
wa (Walloon) ・ war (Waray) ・ wo (Wolof) ・ wuu (Wu Chinese)
xal (Kalmyk) ・ xh (Xhosa) ・ xmf (Mingrelian)
yi (Yiddish) ・ yo (Yoruba)
za (Zhuang) ・ zea (Zeeuws) ・ zh (Chinese) ・ zu (Zulu)
multi (multilingual)
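As a usage note (a small sketch, not from the original text): any code from the list above can be passed as the `lang` argument, with `vs` and `dim` chosen from the sizes BPEmb provides; Japanese and the particular sizes below are purely illustrative assumptions.

```python
from bpemb import BPEmb

# Load the Japanese model; vocabulary size and dimension are assumed choices.
bpemb_ja = BPEmb(lang="ja", vs=10000, dim=100)

print(bpemb_ja.encode("こんにちは世界"))       # subword segmentation
print(bpemb_ja.embed("こんにちは世界").shape)  # (number of subwords, 100)
```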
If you use BPEmb in academic work, please cite:
@InProceedings{heinzerling2018bpemb,
author = {Benjamin Heinzerling and Michael Strube},
title = "{BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages}",
booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
year = {2018},
month = {May 7-12, 2018},
address = {Miyazaki, Japan},
editor = {Nicoletta Calzolari (Conference chair) and Khalid Choukri and Christopher Cieri and Thierry Declerck and Sara Goggi and Koiti Hasida and Hitoshi Isahara and Bente Maegaard and Joseph Mariani and Hélène Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis and Takenobu Tokunaga},
publisher = {European Language Resources Association (ELRA)},
isbn = {979-10-95546-00-9},
language = {english}
}