fast langdetect

pypi_0.2.2 Support offline usage

Overview

fast-langdetect provides ultra-fast and highly accurate language detection based on FastText, a library developed by Facebook. This package is 80x faster than traditional methods and offers 95% accuracy.

It supports Python versions 3.9 to 3.12.

It supports offline usage.

This project builds on zafercavdar/fasttext-langdetect, with improvements to the packaging.

For more information on the underlying FastText model, refer to the official documentation: FastText Language Identification.

Note

This library requires over 200 MB of memory when used in low-memory mode.

Installation

To install fast-langdetect, you can use either pip or pdm:

Using pip

pip install fast-langdetect

Using pdm

pdm add fast-langdetect

Usage

For optimal performance and accuracy in language detection, use detect(text, low_memory=False) to load the larger model.

The model will be downloaded to the /tmp/fasttext-langdetect directory on first use.
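
To check whether anything has already been downloaded, for example before running on a machine without network access, one option is to inspect that directory. The following is only a sketch: the helper name list_cached_files is hypothetical and not part of fast-langdetect, and the only detail taken from the text above is the directory path. It makes no assumption about the model file names.

```python
from pathlib import Path
from typing import List

# Directory named in the text above; adjust if your setup differs.
DEFAULT_CACHE_DIR = Path("/tmp/fasttext-langdetect")


def list_cached_files(cache_dir: Path = DEFAULT_CACHE_DIR) -> List[str]:
    """Return the sorted names of files in cache_dir (hypothetical helper).

    Returns an empty list if the directory does not exist.
    """
    if not cache_dir.is_dir():
        return []
    return sorted(p.name for p in cache_dir.iterdir() if p.is_file())


files = list_cached_files()
if files:
    print("Files found in cache directory:", files)
else:
    print("No files found in", DEFAULT_CACHE_DIR)
```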

Native API (Recommended)

Note

This function assumes it is given a single line of text. Remove \n characters before passing the text. If the sample is too long or too short, accuracy decreases (for example, very short Chinese text may be predicted as Japanese).
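
One way to meet the single-line requirement is to collapse all whitespace before calling detect. The helper below is a sketch; to_single_line is a hypothetical name, not part of fast-langdetect. Unlike deleting newlines outright, it joins lines with a single space so that words on adjacent lines do not run together. For scripts written without spaces between words (such as Chinese or Japanese), the inserted spaces may or may not be desirable, so treat this as a starting point.

```python
def to_single_line(text: str) -> str:
    """Collapse every run of whitespace, including newlines, into one space.

    Hypothetical helper; leading and trailing whitespace is dropped as well.
    """
    return " ".join(text.split())


multiline_text = """
Hello, world!
This is a multiline text.
"""

single_line = to_single_line(multiline_text)
print(single_line)
# Prints: Hello, world! This is a multiline text.
# The cleaned string can then be passed to detect(single_line).
```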

from fast_langdetect import detect, detect_multilingual

# Single language detection
print(detect("Hello, world!"))
# Output: {'lang': 'en', 'score': 0.12450417876243591}

# `use_strict_mode` determines whether the model loading process should enforce strict conditions before using fallback options.
# If `use_strict_mode` is set to True, only the selected model is loaded, not the fallback model.
print(detect("Hello, world!", low_memory=False, use_strict_mode=True))

# How to deal with multiline text
multiline_text = """
Hello, world!
This is a multiline text.
But we need remove `\n` characters or it will raise an ValueError.
"""
multiline_text = multiline_text.replace("\n", "")  # NOTE: it is important to remove \n characters
print(detect(multiline_text))
# Output: {'lang': 'en', 'score': 0.8509423136711121}

print(detect("Привет, мир!")["lang"])
# Output: ru

# Multi-language detection
print(detect_multilingual("Hello, world!你好世界!Привет, мир!"))
# Output: [{'lang': 'ja', 'score': 0.32009604573249817}, {'lang': 'uk', 'score': 0.27781224250793457}, {'lang': 'zh', 'score': 0.17542070150375366}, {'lang': 'sr', 'score': 0.08751443773508072}, {'lang': 'bg', 'score': 0.05222449079155922}]

# Multi-language detection with low memory mode disabled
print(detect_multilingual("Hello, world!你好世界!Привет, мир!", low_memory=False))
# Output: [{'lang': 'ru', 'score': 0.39008623361587524}, {'lang': 'zh', 'score': 0.18235979974269867}, {'lang': 'ja', 'score': 0.08473210036754608}, {'lang': 'sr', 'score': 0.057975586503744125}, {'lang': 'en', 'score': 0.05422825738787651}]
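
The scores in the outputs above can be quite low (for instance 0.12 for "Hello, world!"), so callers may want to apply their own threshold. The sketch below works only on the result dictionaries in the shape shown above; confident_lang and filter_candidates are hypothetical helpers, not part of fast-langdetect, and the threshold values are arbitrary illustrations. The sample data is copied from the documented outputs so the block runs without the library installed.

```python
from typing import Dict, List, Union

Result = Dict[str, Union[str, float]]


def confident_lang(result: Result, min_score: float = 0.5, default: str = "unknown") -> str:
    """Return result['lang'] if its score reaches min_score, else default."""
    return str(result["lang"]) if float(result["score"]) >= min_score else default


def filter_candidates(results: List[Result], min_score: float = 0.1) -> List[Result]:
    """Keep candidates scoring at least min_score, highest score first."""
    kept = [r for r in results if float(r["score"]) >= min_score]
    return sorted(kept, key=lambda r: float(r["score"]), reverse=True)


# Sample results copied from the outputs shown above.
single = {"lang": "en", "score": 0.12450417876243591}
multi = [
    {"lang": "ru", "score": 0.39008623361587524},
    {"lang": "zh", "score": 0.18235979974269867},
    {"lang": "ja", "score": 0.08473210036754608},
    {"lang": "sr", "score": 0.057975586503744125},
    {"lang": "en", "score": 0.05422825738787651},
]

print(confident_lang(single))
# Prints: unknown
print([r["lang"] for r in filter_candidates(multi)])
# Prints: ['ru', 'zh']
```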

Convenient `detect_language` Function

from fast_langdetect import detect_language

# Single language detection
print(detect_language("Hello, world!"))
# Output: EN

print(detect_language("Привет, мир!"))
# Output: RU

print(detect_language("你好，世界！"))
# Output: ZH

Splitting Text by Language

For text splitting based on language, please refer to the split-lang repository.

Benchmark

For detailed benchmark results, refer to zafercavdar/fasttext-langdetect#benchmark.

References

[1] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

@article{joulin2016bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.01759},
  year={2016}
}

[2] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models

@article{joulin2016fasttext,
  title={FastText.zip: Compressing text classification models},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, H{\'e}rve and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1612.03651},
  year={2016}
}