fast-langdetect

pypi_0.2.2 Support offline usage

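Overview

fast-langdetect provides very fast and highly accurate language detection based on FastText, a library developed by Facebook. According to the project, the package is 80x faster than conventional methods and delivers 95% accuracy.

It supports Python 3.9 to 3.12.

It supports offline usage.

This project builds on zafercavdar/fasttext-langdetect, with improvements to packaging.

For more information on the underlying FastText model, see the official documentation: FastText Language Identification.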

Note

This library requires more than 200MB of memory when used in low memory mode.

Installation

To install fast-langdetect, you can use either pip or pdm.

Using pip

pip install fast-langdetect

Using pdm

pdm add fast-langdetect

Usage

For the best performance and accuracy in language detection, call detect(text, low_memory=False) to load the larger model.

The model is downloaded to the /tmp/fasttext-langdetect directory on first use.
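Before running in an environment without network access, it can help to check whether anything is already present in that directory. The following is a minimal sketch that only inspects the file system; `cached_model_files` and `DEFAULT_CACHE_DIR` are hypothetical names introduced for illustration and are not part of fast-langdetect.

```python
from pathlib import Path

# Cache location taken from the text above; adjust it if your setup differs.
DEFAULT_CACHE_DIR = Path("/tmp/fasttext-langdetect")


def cached_model_files(cache_dir: Path = DEFAULT_CACHE_DIR) -> list:
    """Return the regular files found in the cache directory, sorted by path.

    Hypothetical helper: it never calls fast-langdetect. An empty list means
    the directory is missing or contains no files.
    """
    if not cache_dir.is_dir():
        return []
    return sorted(p for p in cache_dir.iterdir() if p.is_file())


if __name__ == "__main__":
    files = cached_model_files()
    if files:
        for path in files:
            size_mb = path.stat().st_size / 1_000_000
            print(f"{path.name}: {size_mb:.1f} MB")
    else:
        print("No cached model files found.")
```

The sketch makes no assumption about the model file names, so it lists whatever files it finds.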

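Native API (Recommended)

Note

This function expects a single line of text. Remove \n characters before passing the text in. Accuracy drops if the sample is too long or too short (for example, very short Chinese text may be predicted as Japanese).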

from fast_langdetect import detect, detect_multilingual

# Single language detection
print(detect("Hello, world!"))
# Output: {'lang': 'en', 'score': 0.12450417876243591}

# `use_strict_mode` determines whether model loading enforces strict conditions before using fallback options.
# If `use_strict_mode` is set to True, only the selected model is loaded, never the fallback model.
print(detect("Hello, world!", low_memory=False, use_strict_mode=True))

# How to deal with multiline text
multiline_text = """
Hello, world!
This is a multiline text.
But we need remove `\n` characters or it will raise an ValueError.
"""
multiline_text = multiline_text.replace("\n", "")  # NOTE: it is important to remove "\n" characters
print(detect(multiline_text))
# Output: {'lang': 'en', 'score': 0.8509423136711121}

print(detect("Привет, мир!")["lang"])
# Output: ru

# Multi-language detection
print(detect_multilingual("Hello, world!你好世界!Привет, мир!"))
# Output: [{'lang': 'ja', 'score': 0.32009604573249817}, {'lang': 'uk', 'score': 0.27781224250793457}, {'lang': 'zh', 'score': 0.17542070150375366}, {'lang': 'sr', 'score': 0.08751443773508072}, {'lang': 'bg', 'score': 0.05222449079155922}]

# Multi-language detection with low memory mode disabled
print(detect_multilingual("Hello, world!你好世界!Привет, мир!", low_memory=False))
# Output: [{'lang': 'ru', 'score': 0.39008623361587524}, {'lang': 'zh', 'score': 0.18235979974269867}, {'lang': 'ja', 'score': 0.08473210036754608}, {'lang': 'sr', 'score': 0.057975586503744125}, {'lang': 'en', 'score': 0.05422825738787651}]
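The single-line requirement and the low scores in some of the outputs above suggest two small helpers around the API: one to normalize input, one to filter results. The sketch below uses only the standard library and does not import fast-langdetect; `to_single_line`, `keep_confident` and the 0.15 cutoff are hypothetical choices for illustration, and the sample scores are rounded copies of the `low_memory=False` output shown above.

```python
def to_single_line(text: str) -> str:
    """Collapse newlines, tabs and repeated spaces into single spaces."""
    return " ".join(text.split())


def keep_confident(results: list, min_score: float = 0.15) -> list:
    """Keep candidates whose score is at least min_score (an arbitrary cutoff)."""
    return [r for r in results if r["score"] >= min_score]


if __name__ == "__main__":
    print(to_single_line("\nHello, world!\nThis is a multiline text.\n"))
    # Hello, world! This is a multiline text.

    # Rounded copies of the scores from the low_memory=False output above.
    sample = [
        {"lang": "ru", "score": 0.390},
        {"lang": "zh", "score": 0.182},
        {"lang": "ja", "score": 0.085},
        {"lang": "sr", "score": 0.058},
        {"lang": "en", "score": 0.054},
    ]
    print([r["lang"] for r in keep_confident(sample)])
    # ['ru', 'zh']
```

Unlike a plain `replace("\n", "")`, `to_single_line` puts a space between the joined lines, so adjacent words do not run together. In practice the normalized string would be passed to `detect` or `detect_multilingual`, and the returned list to `keep_confident`.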

Convenient `detect_language` function

from fast_langdetect import detect_language

# Single language detection
print(detect_language("Hello, world!"))
# Output: EN

print(detect_language("Привет, мир!"))
# Output: RU

print(detect_language("你好，世界！"))
# Output: ZH
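Because `detect_language` returns a plain code string, it is easy to use as a grouping key. The sketch below groups a batch of texts by whatever code a detector returns; `group_by_language` and `stub_detector` are hypothetical names, and the stub merely stands in for `detect_language` so the example runs without the library or a model download.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def group_by_language(
    texts: Iterable[str], detector: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Group texts under the code returned by detector.

    Each text is collapsed to a single line before detection, since the
    detection functions expect input without newline characters.
    """
    grouped = defaultdict(list)
    for text in texts:
        single_line = " ".join(text.split())
        grouped[detector(single_line)].append(text)
    return dict(grouped)


def stub_detector(text: str) -> str:
    """Stand-in for a real detector: ASCII-only text is labeled EN."""
    return "EN" if text.isascii() else "OTHER"


if __name__ == "__main__":
    batch = ["Hello, world!", "Привет, мир!", "Hi\nthere"]
    print(group_by_language(batch, stub_detector))
    # {'EN': ['Hello, world!', 'Hi\nthere'], 'OTHER': ['Привет, мир!']}
```

With the library installed, `detect_language` can be passed in place of `stub_detector`.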

Splitting text by language

For splitting text based on language, see the split-lang repository.

Benchmark

For detailed benchmark results, see zafercavdar/fasttext-langdetect#benchmark.

References

[1] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

@article{joulin2016bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.01759},
  year={2016}
}

[2] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models

@article{joulin2016fasttext,
  title={FastText.zip: Compressing text classification models},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, H{\'e}rve and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1612.03651},
  year={2016}
}