fast langdetect

pypi_0.2.2 Support offline usage

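Overview

Fast-LangDetect provides ultra-fast and highly accurate language detection based on FastText, a library developed by Facebook. The package is 80x faster than traditional methods and offers 95% accuracy.

It supports Python versions 3.9 to 3.12.

Offline usage is supported.

This project builds upon Zafercavdar/fastText-langdetect, with enhancements in packaging.

For more information on the underlying FastText model, refer to the official documentation: FastText Language Identification.

Note

This library requires over 200MB of memory to use in low-memory mode.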

Installation

To install Fast-LangDetect, you can use either pip or pdm:

Using pip

pip install fast-langdetect

Using PDM

pdm add fast-langdetect

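Usage

For the best language detection performance and accuracy, use detect(text, low_memory=False) to load the larger model.

The model is downloaded to the /tmp/fasttext-langdetect directory on first use.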
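Before going offline, it can help to check whether anything has already been downloaded to that directory. The following is a minimal standard-library sketch; `cached_model_files` is a hypothetical helper, not part of fast-langdetect, and it deliberately makes no assumption about the model file names.

```python
from pathlib import Path
from typing import List

# Download location mentioned in this README.
DEFAULT_CACHE_DIR = Path("/tmp/fasttext-langdetect")


def cached_model_files(cache_dir: Path = DEFAULT_CACHE_DIR) -> List[Path]:
    """Return the regular files in the cache directory, sorted by name.

    Hypothetical helper: an empty list means nothing is cached there yet.
    """
    if not cache_dir.is_dir():
        return []
    return sorted(p for p in cache_dir.iterdir() if p.is_file())


if __name__ == "__main__":
    files = cached_model_files()
    if files:
        for path in files:
            # Report each cached file with its size in megabytes.
            print(f"{path.name}: {path.stat().st_size / 1e6:.1f} MB")
    else:
        print(f"No cached files found in {DEFAULT_CACHE_DIR}")
```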

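Native API (Recommended)

Note

This function assumes it is given a single line of text. Remove `\n` characters before passing the text. Accuracy drops if the sample is too long or too short (for example, very short Chinese text may be predicted as Japanese).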
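One way to satisfy the single-line requirement is to normalize whitespace before calling the detector. The sketch below uses only the standard library; `to_single_line` is a hypothetical helper name, not part of fast-langdetect.

```python
import re


def to_single_line(text: str) -> str:
    """Collapse every whitespace run (including newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()


sample = """
Hello, world!
This is a multiline text.
"""
print(to_single_line(sample))
# Output: Hello, world! This is a multiline text.
```

Replacing newlines with a space rather than an empty string keeps word boundaries intact, which may matter for languages written with spaces; the result can then be passed to detect.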

from fast_langdetect import detect, detect_multilingual

# Single language detection
print(detect("Hello, world!"))
# Output: {'lang': 'en', 'score': 0.12450417876243591}

# `use_strict_mode` determines whether the model loading process should enforce strict conditions before using fallback options.
# If `use_strict_mode` is set to True, only the selected model is loaded, not the fallback model.
print(detect("Hello, world!", low_memory=False, use_strict_mode=True))

# How to deal with multiline text
multiline_text = """
Hello, world!
This is a multiline text.
But we need remove `\n` characters or it will raise an ValueError.
"""
multiline_text = multiline_text.replace("\n", "")  # NOTE: it is important to remove \n characters
print(detect(multiline_text))
# Output: {'lang': 'en', 'score': 0.8509423136711121}

print(detect("Привет, мир!")["lang"])
# Output: ru

# Multi-language detection
print(detect_multilingual("Hello, world!你好世界!Привет, мир!"))
# Output: [{'lang': 'ja', 'score': 0.32009604573249817}, {'lang': 'uk', 'score': 0.27781224250793457}, {'lang': 'zh', 'score': 0.17542070150375366}, {'lang': 'sr', 'score': 0.08751443773508072}, {'lang': 'bg', 'score': 0.05222449079155922}]

# Multi-language detection with low memory mode disabled
print(detect_multilingual("Hello, world!你好世界!Привет, мир!", low_memory=False))
# Output: [{'lang': 'ru', 'score': 0.39008623361587524}, {'lang': 'zh', 'score': 0.18235979974269867}, {'lang': 'ja', 'score': 0.08473210036754608}, {'lang': 'sr', 'score': 0.057975586503744125}, {'lang': 'en', 'score': 0.05422825738787651}]
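Because detect_multilingual returns several low-scoring candidates, callers will often want to keep only the more plausible ones. The sketch below filters a result list by score; `filter_by_score` and the 0.1 threshold are illustrative assumptions, not part of fast-langdetect, and the sample data is the low_memory=False output above with scores rounded.

```python
from typing import Any, Dict, List


def filter_by_score(results: List[Dict[str, Any]], min_score: float = 0.1) -> List[Dict[str, Any]]:
    """Keep candidates scoring at least min_score, highest score first."""
    kept = [r for r in results if r["score"] >= min_score]
    return sorted(kept, key=lambda r: r["score"], reverse=True)


# Sample in the same shape as the detect_multilingual output (scores rounded).
sample = [
    {"lang": "ru", "score": 0.390},
    {"lang": "zh", "score": 0.182},
    {"lang": "ja", "score": 0.085},
    {"lang": "sr", "score": 0.058},
    {"lang": "en", "score": 0.054},
]
print([r["lang"] for r in filter_by_score(sample)])
# Output: ['ru', 'zh']
```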

Convenient `detect_language` Function

from fast_langdetect import detect_language

# Single language detection
print(detect_language("Hello, world!"))
# Output: EN

print(detect_language("Привет, мир!"))
# Output: RU

print(detect_language("你好，世界！"))
# Output: ZH
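As the outputs above show, detect_language yields only an upper-case code and no score. When the confidence matters, a similar convenience can be layered on the dictionary returned by detect. The sketch below is a hypothetical helper operating on such a dictionary; `code_or_default`, the 0.5 threshold, and the "UNKNOWN" fallback are assumptions for illustration and say nothing about how detect_language is implemented.

```python
from typing import Any, Dict


def code_or_default(result: Dict[str, Any], min_score: float = 0.5, default: str = "UNKNOWN") -> str:
    """Return the upper-cased language code, or `default` when the score is low."""
    if result["score"] < min_score:
        return default
    return result["lang"].upper()


# Dictionaries in the same shape as the detect() output.
print(code_or_default({"lang": "en", "score": 0.85}))
# Output: EN
print(code_or_default({"lang": "en", "score": 0.12}))
# Output: UNKNOWN
```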

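Splitting Text by Language

For text splitting based on language, refer to the split-lang repository.

Benchmark

For detailed benchmark results, refer to Zafercavdar/fastText-langdetect#benchmark.

References

[1] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification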

@article{joulin2016bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.01759},
  year={2016}
}

[2] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models

@article{joulin2016fasttext,
  title={FastText.zip: Compressing text classification models},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, H{\'e}rve and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1612.03651},
  year={2016}
}