BPEmb

BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia. Its intended use is as input for neural models in natural language processing.

Website ・ Usage ・ Download ・ MultiBPEmb ・ Paper (PDF) ・ Citing BPEmb

Usage

Install BPEmb with pip:

pip install bpemb

Embeddings and SentencePiece models will be downloaded automatically the first time you use them.

>>> from bpemb import BPEmb
# load English BPEmb model with default vocabulary size (10k) and 50-dimensional embeddings
>>> bpemb_en = BPEmb(lang="en", dim=50)
downloading https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.model
downloading https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.d50.w2v.bin.tar.gz
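The two URLs in the download log above suggest a naming scheme built from the language code, the vocabulary size, and the embedding dimension. The sketch below reconstructs that scheme. It is inferred only from those two logged URLs, and the helper names `model_url` and `embedding_url` are hypothetical and not part of the bpemb package. Not every combination of language, vocabulary size, and dimension is guaranteed to exist on the server.

```python
# Base URL copied from the download log; the path pattern is an assumption
# inferred from the two example URLs, not a documented API.
BASE_URL = "https://nlp.h-its.org/bpemb"


def model_url(lang: str, vs: int) -> str:
    """Guess the URL of the subword segmentation model for a language and vocabulary size."""
    return f"{BASE_URL}/{lang}/{lang}.wiki.bpe.vs{vs}.model"


def embedding_url(lang: str, vs: int, dim: int) -> str:
    """Guess the URL of the embedding archive for a language, vocabulary size, and dimension."""
    return f"{BASE_URL}/{lang}/{lang}.wiki.bpe.vs{vs}.d{dim}.w2v.bin.tar.gz"


print(model_url("en", 10000))
print(embedding_url("en", 10000, 50))
```

In normal use there is no need to build these URLs by hand, since `BPEmb(lang=..., vs=..., dim=...)` fetches the files itself.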

You can do two main things with BPEmb. The first is subword segmentation:

# apply English BPE subword segmentation model
>>> bpemb_en.encode("Stratford")
['▁strat', 'ford']
# load Chinese BPEmb model with vocabulary size 100k and default (100-dim) embeddings
>>> bpemb_zh = BPEmb(lang="zh", vs=100000)
# apply Chinese BPE subword segmentation model
>>> bpemb_zh.encode("这是一个中文句子")  # "This is a Chinese sentence."
['▁这是一个', '中文', '句子']  # ["This is a", "Chinese", "sentence"]

Whether and how a word is split depends on the vocabulary size. Generally, a smaller vocabulary size yields a segmentation into many subwords, while a large vocabulary size leaves frequent words unsplit:

vocabulary size	segmentation
1000	['▁str', 'at', 'f', 'ord']
3000	['▁str', 'at', 'ford']
5000	['▁str', 'at', 'ford']
10000	['▁strat', 'ford']
25000	['▁stratford']
50000	['▁stratford']
100000	['▁stratford']
200000	['▁stratford']
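The trend in the table can be reproduced with a toy BPE learner. The sketch below is a simplified illustration and not BPEmb's actual training code, which relies on SentencePiece models trained on Wikipedia. The tiny corpus and the function names are made up for the example. Here the number of learned merges plays the role of the vocabulary size: each merge adds one new symbol.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_merges(words, num_merges):
    """Learn up to `num_merges` merge rules by repeatedly joining the most frequent pair."""
    # Each word starts as a sequence of characters; "▁" marks the word start.
    corpus = Counter(tuple("▁" + w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:  # every word is already a single symbol
            break
        best = max(pairs, key=lambda p: (pairs[p], p))  # deterministic tie-break
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[apply_merge(symbols, best)] += freq
        corpus = merged
    return merges


def segment(word, merges):
    """Segment `word` by applying the learned merges in the order they were learned."""
    symbols = tuple("▁" + word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus, chosen only so that "stratford" shares pieces with other words.
toy_words = ["stratford"] * 5 + ["strand"] * 3 + ["ford"] * 4 + ["oxford"] * 2
all_merges = learn_merges(toy_words, 1000)

for k in (0, 3, 6, len(all_merges)):
    print(k, segment("stratford", all_merges[:k]))
```

With no merges the word falls apart into single characters, and with all merges it stays whole, which mirrors the small-vocabulary and large-vocabulary rows of the table.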

The second purpose of BPEmb is to provide pre-trained subword embeddings:

# Embeddings are wrapped in a gensim KeyedVectors object
>>> type(bpemb_zh.emb)
gensim.models.keyedvectors.Word2VecKeyedVectors
# You can use BPEmb objects like gensim KeyedVectors
>>> bpemb_en.most_similar("ford")
[('bury', 0.8745079040527344),
 ('ton', 0.8725000619888306),
 ('well', 0.871537446975708),
 ('ston', 0.8701574206352234),
 ('worth', 0.8672043085098267),
 ('field', 0.859795331954956),
 ('ley', 0.8591548204421997),
 ('ington', 0.8126075267791748),
 ('bridge', 0.8099068999290466),
 ('brook', 0.7979353070259094)]
>>> type(bpemb_en.vectors)
numpy.ndarray
>>> bpemb_en.vectors.shape
(10000, 50)
>>> bpemb_zh.vectors.shape
(100000, 100)
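The scores returned by `most_similar` are cosine similarities between embedding vectors, which is how gensim's KeyedVectors ranks neighbours. The sketch below shows the idea on a hand-made table. The three-dimensional vectors are invented for illustration and are not real BPEmb embeddings, and the function is a plain-Python stand-in rather than gensim's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, table, topn=3):
    """Rank all other entries of `table` by cosine similarity to `query`."""
    q = table[query]
    scored = [(key, cosine(q, vec)) for key, vec in table.items() if key != query]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:topn]


# Made-up vectors: the first three point in a similar direction, the last one does not.
toy_table = {
    "ford": [0.9, 0.1, 0.0],
    "bury": [0.8, 0.2, 0.1],
    "ton": [0.7, 0.3, 0.0],
    "▁the": [0.0, 0.1, 0.9],
}

for subword, score in most_similar("ford", toy_table):
    print(subword, round(score, 3))
```

On real embeddings, the same ranking groups subwords that occur in similar contexts, such as the English place-name suffixes in the output above.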

To use subword embeddings in your neural network, either encode your input into subword IDs:

>>> ids = bpemb_zh.encode_ids("这是一个中文句子")
>>> ids
[25950, 695, 20199]
>>> bpemb_zh.vectors[ids].shape
(3, 100)

Or use the embed method:

# apply Chinese subword segmentation and perform embedding lookup
>>> bpemb_zh.embed("这是一个中文句子").shape
(3, 100)
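Both routes above end in the same operation: each subword ID selects one row of the embedding matrix, so a sequence of n IDs becomes an n-by-dim matrix. The sketch below mimics that lookup with plain Python lists. The table is random and much smaller than a real BPEmb matrix, and the helper names are hypothetical.

```python
import random


def make_table(vocab_size, dim, seed=0):
    """Build a random stand-in for an embedding matrix with one row per subword ID."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]


def lookup(table, ids):
    """Return the embedding row for each ID, in input order."""
    return [table[i] for i in ids]


def shape(matrix):
    """Return (rows, columns) of a list-of-lists matrix."""
    return (len(matrix), len(matrix[0]) if matrix else 0)


toy_vectors = make_table(vocab_size=10, dim=4)
toy_ids = [7, 2, 5]  # stands in for the output of encode_ids
embedded = lookup(toy_vectors, toy_ids)
print(shape(embedded))
```

With NumPy arrays the list comprehension collapses to `vectors[ids]`, which is the indexing used in the transcript above.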

Downloads for each language

ab (Abkhazian) ・ ace (Achinese) ・ ady (Adyghe) ・ af (Afrikaans) ・ ak (Akan) ・ als (Alemannic) ・ am (Amharic) ・ an (Aragonese) ・ ang (Old English) ・ ar (Arabic) ・ arc (Official Aramaic) ・ arz (Egyptian Arabic) ・ ast (Asturian) ・ atj (Atikamekw) ・ av (Avaric) ・ ay (Aymara) ・ az (Azerbaijani) ・ azb (South Azerbaijani)

ba (Bashkir) ・ bar (Bavarian) ・ bcl (Central Bikol) ・ be (Belarusian) ・ bg (Bulgarian) ・ bi (Bislama) ・ bjn (Banjar) ・ bm (Bambara) ・ bn (Bengali) ・ bo (Tibetan) ・ bpy (Bishnupriya) ・ br (Breton) ・ bs (Bosnian) ・ bug (Buginese) ・ bxr (Russia Buriat)

ca (Catalan) ・ cdo (Min Dong Chinese) ・ ce (Chechen) ・ ceb (Cebuano) ・ ch (Chamorro) ・ chr (Cherokee) ・ chy (Cheyenne) ・ ckb (Central Kurdish) ・ co (Corsican) ・ cr (Cree) ・ crh (Crimean Tatar) ・ csb (Kashubian) ・ cu (Church Slavic) ・ cv (Chuvash) ・ cy (Welsh)

da (Danish) ・ de (German) ・ din (Dinka) ・ diq (Dimli) ・ dsb (Lower Sorbian) ・ dty (Dotyali) ・ dv (Dhivehi) ・ dz (Dzongkha)

ee (Ewe) ・ el (Modern Greek) ・ en (English) ・ eo (Esperanto) ・ es (Spanish) ・ et (Estonian) ・ eu (Basque) ・ ext (Extremaduran)

fa (Persian) ・ ff (Fulah) ・ fi (Finnish) ・ fj (Fijian) ・ fo (Faroese) ・ fr (French) ・ frp (Arpitan) ・ frr (Northern Frisian) ・ fur (Friulian) ・ fy (Western Frisian)

ga (Irish) ・ gag (Gagauz) ・ gan (Gan Chinese) ・ gd (Scottish Gaelic) ・ gl (Galician) ・ glk (Gilaki) ・ gn (Guarani) ・ gom (Goan Konkani) ・ got (Gothic) ・ gu (Gujarati) ・ gv (Manx)

ha (Hausa) ・ hak (Hakka Chinese) ・ haw (Hawaiian) ・ he (Hebrew) ・ hi (Hindi) ・ hif (Fiji Hindi) ・ hr (Croatian) ・ hsb (Upper Sorbian) ・ ht (Haitian) ・ hu (Hungarian) ・ hy (Armenian)

ia (Interlingua) ・ id (Indonesian) ・ ie (Interlingue) ・ ig (Igbo) ・ ik (Inupiaq) ・ ilo (Iloko) ・ io (Ido) ・ is (Icelandic) ・ it (Italian) ・ iu (Inuktitut)

ja (Japanese) ・ jam (Jamaican Creole English) ・ jbo (Lojban) ・ jv (Javanese)

ka (Georgian) ・ kaa (Kara-Kalpak) ・ kab (Kabyle) ・ kbd (Kabardian) ・ kbp (Kabiyè) ・ kg (Kongo) ・ ki (Kikuyu) ・ kk (Kazakh) ・ kl (Kalaallisut) ・ km (Central Khmer) ・ ko (Korean) ・ koi (Komi-Permyak) ・ krc (Karachay-Balkar) ・ ks (Kashmiri) ・ ksh (Kölsch) ・ ku (Kurdish) ・ kv (Komi) ・ kw (Cornish) ・ ky (Kirghiz)

la (Latin) ・ lad (Ladino) ・ lb (Luxembourgish) ・ lbe (Lak) ・ lez (Lezghian) ・ lg (Ganda) ・ li (Limburgan) ・ lij (Ligurian) ・ lmo (Lombard) ・ ln (Lingala) ・ lo (Lao) ・ lrc (Northern Luri) ・ lt (Lithuanian) ・ ltg (Latgalian) ・ lv (Latvian)

mai (Maithili) ・ mdf (Moksha) ・ mg (Malagasy) ・ mh (Marshallese) ・ mhr (Eastern Mari) ・ mi (Maori) ・ min (Minangkabau) ・ mk (Macedonian) ・ ml (Malayalam) ・ mn (Mongolian) ・ ms (Malay) ・ mt (Maltese) ・ mwl (Mirandese) ・ my (Burmese) ・ myv (Erzya) ・ mzn (Mazanderani)

na (Nauru) ・ nap (Neapolitan) ・ nds (Low German) ・ ne (Nepali) ・ new (Newari) ・ ng (Ndonga) ・ nl (Dutch) ・ nn (Norwegian Nynorsk) ・ ny (Nyanja)

oc (Occitan) ・ olo (Livvi) ・ om (Oromo) ・ or (Oriya) ・ os (Ossetian)

pa (Panjabi) ・ pag (Pangasinan) ・ pam (Pampanga) ・ pap (Papiamento) ・ pcd (Picard) ・ pdc (Pennsylvania German) ・ pfl (Pfaelzisch) ・ pi (Pali) ・ pih (Pitcairn-Norfolk) ・ pnb (Western Panjabi) ・ pnt (Pontic) ・ ps (Pushto) ・ pt (Portuguese)

qu (Quechua)

rm (Romansh) ・ rmy (Vlax Romani) ・ rn (Rundi) ・ ro (Romanian) ・ ru (Russian) ・ rue (Rusyn) ・ rw (Kinyarwanda)

sa (Sanskrit) ・ sah (Yakut) ・ sc (Sardinian) ・ scn (Sicilian) ・ sco (Scots) ・ sd (Sindhi) ・ se (Northern Sami) ・ sg (Sango) ・ sh (Serbo-Croatian) ・ si (Sinhala) ・ sk (Slovak) ・ sn (Shona) ・ so (Somali) ・ sq (Albanian) ・ sr (Serbian) ・ srn (Sranan Tongo) ・ ss (Swati) ・ st (Southern Sotho) ・ stq (Saterfriesisch) ・ su (Sundanese) ・ sv (Swedish)

ta (Tamil) ・ tcy (Tulu) ・ te (Telugu) ・ tet (Tetum) ・ tg (Tajik) ・ th (Thai) ・ ti (Tigrinya) ・ tk (Turkmen) ・ tl (Tagalog) ・ tn (Tswana) ・ to (Tonga) ・ tpi (Tok Pisin) ・ tr (Turkish) ・ ts (Tsonga) ・ tt (Tatar) ・ tum (Tumbuka) ・ tw (Twi) ・ ty (Tahitian) ・ tyv (Tuvinian)

udm (Udmurt) ・ ug (Uighur) ・ uk (Ukrainian) ・ ur (Urdu) ・ uz (Uzbek)

ve (Venda) ・ vec (Venetian) ・ vep (Veps) ・ vi (Vietnamese) ・ vls (Vlaams) ・ vo (Volapük)

wa (Walloon) ・ war (Waray) ・ wo (Wolof) ・ wuu (Wu Chinese)

xal (Kalmyk) ・ xh (Xhosa) ・ xmf (Mingrelian)

yi (Yiddish) ・ yo (Yoruba)

za (Zhuang) ・ zea (Zeeuws) ・ zh (Chinese) ・ zu (Zulu)

MultiBPEmb

multi (multilingual)

Citing BPEmb

If you use BPEmb in academic work, please cite:

 @InProceedings{heinzerling2018bpemb,
  author = {Benjamin Heinzerling and Michael Strube},
  title = "{BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages}",
  booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
  year = {2018},
  month = {May 7-12, 2018},
  address = {Miyazaki, Japan},
  editor = {Nicoletta Calzolari (Conference chair) and Khalid Choukri and Christopher Cieri and Thierry Declerck and Sara Goggi and Koiti Hasida and Hitoshi Isahara and Bente Maegaard and Joseph Mariani and Hélène Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis and Takenobu Tokunaga},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {979-10-95546-00-9},
  language = {english}
  }


Additional information

Version 1.0.0
Type Other source code
Last updated 2025-04-16
Size 22.46KB
Source GitHub
