fast-langdetect

pypi_0.2.2 Support offline usage

Overview

fast-langdetect provides fast, highly accurate language detection using FastText, a library developed by Facebook. This package is 80x faster than traditional methods and offers 95% accuracy.

It supports Python 3.9 to 3.12.

It supports offline usage.

This project builds on zafercavdar/fasttext-langdetect, with enhancements in packaging.

For more information about the underlying FastText model, see the official documentation: FastText Language Identification

Note

This library requires more than 200MB of memory when used in low-memory mode.

Installation

To install fast-langdetect, you can use either pip or pdm:

Using pip

pip install fast-langdetect

Using PDM

pdm add fast-langdetect

Usage

For the best performance and accuracy in language detection, use detect(text, low_memory=False) to load the larger model.

The model is downloaded to the /tmp/fasttext-langdetect directory on first use.
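Before relying on offline usage, it can help to confirm that something has actually been downloaded. The following is a minimal sketch that assumes only the cache directory named above; `list_cached_models` is a hypothetical helper, not part of fast-langdetect, and the file names it reports depend on the installed version.

```python
from pathlib import Path

# Download location stated in this README; adjust if your setup differs.
DEFAULT_CACHE_DIR = Path("/tmp/fasttext-langdetect")


def list_cached_models(cache_dir: Path = DEFAULT_CACHE_DIR) -> list[str]:
    """Return the sorted names of regular files in cache_dir.

    Hypothetical helper (not part of fast-langdetect). An empty list means
    the directory is missing or holds no files, so nothing is cached yet.
    """
    if not cache_dir.is_dir():
        return []
    return sorted(p.name for p in cache_dir.iterdir() if p.is_file())


if __name__ == "__main__":
    cached = list_cached_models()
    print(cached if cached else "No cached model files found")
```

An empty result suggests that the first call to the library will still need network access.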

Native API (Recommended)

Note

This function expects single-line text. Remove `\n` characters before passing the text. If the sample is too long or too short, accuracy decreases (for example, a very short Chinese sample may be predicted as Japanese).
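One way to meet the single-line requirement is to normalize the input before detection. The sketch below uses only the standard library; `to_single_line` is a hypothetical helper name, not part of fast-langdetect, and joining lines with a space is an assumption that suits space-delimited languages.

```python
def to_single_line(text: str) -> str:
    """Join the non-empty lines of text with single spaces.

    Hypothetical helper (not part of fast-langdetect).
    """
    stripped = (line.strip() for line in text.splitlines())
    return " ".join(line for line in stripped if line)


sample = """
Hello, world!
This is a multiline text.
"""
print(to_single_line(sample))
# Prints: Hello, world! This is a multiline text.
```

Joining with a space rather than an empty string keeps the last word of one line from fusing with the first word of the next.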

from fast_langdetect import detect, detect_multilingual

# Single language detection
print(detect("Hello, world!"))
# Output: {'lang': 'en', 'score': 0.12450417876243591}

# `use_strict_mode` determines whether the model loading process should enforce strict conditions before using fallback options.
# If `use_strict_mode` is set to True, we will load only the selected model, not the fallback model.
print(detect("Hello, world!", low_memory=False, use_strict_mode=True))

# How to deal with multiline text
multiline_text = """
Hello, world!
This is a multiline text.
But we need remove `\n` characters or it will raise an ValueError.
"""
multiline_text = multiline_text.replace("\n", "")  # NOTE: it is important to remove `\n` characters
print(detect(multiline_text))
# Output: {'lang': 'en', 'score': 0.8509423136711121}

print(detect("Привет, мир!")["lang"])
# Output: ru

# Multi-language detection
print(detect_multilingual("Hello, world!你好世界!Привет, мир!"))
# Output: [{'lang': 'ja', 'score': 0.32009604573249817}, {'lang': 'uk', 'score': 0.27781224250793457}, {'lang': 'zh', 'score': 0.17542070150375366}, {'lang': 'sr', 'score': 0.08751443773508072}, {'lang': 'bg', 'score': 0.05222449079155922}]

# Multi-language detection with low memory mode disabled
print(detect_multilingual("Hello, world!你好世界!Привет, мир!", low_memory=False))
# Output: [{'lang': 'ru', 'score': 0.39008623361587524}, {'lang': 'zh', 'score': 0.18235979974269867}, {'lang': 'ja', 'score': 0.08473210036754608}, {'lang': 'sr', 'score': 0.057975586503744125}, {'lang': 'en', 'score': 0.05422825738787651}]
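The multi-language output is a list of dictionaries with `lang` and `score` keys, so low-scoring candidates can be dropped with ordinary list handling. The sketch below works on a sample copied from the `low_memory=False` output above and does not call the library; `filter_by_score` and the 0.1 cutoff are illustrative assumptions, not part of fast-langdetect.

```python
def filter_by_score(results: list[dict], min_score: float) -> list[str]:
    """Return language codes whose score is at least min_score, best first.

    Hypothetical helper (not part of fast-langdetect).
    """
    kept = [r for r in results if r["score"] >= min_score]
    kept.sort(key=lambda r: r["score"], reverse=True)
    return [r["lang"] for r in kept]


# Sample copied from the low_memory=False output shown above.
sample_results = [
    {"lang": "ru", "score": 0.39008623361587524},
    {"lang": "zh", "score": 0.18235979974269867},
    {"lang": "ja", "score": 0.08473210036754608},
    {"lang": "sr", "score": 0.057975586503744125},
    {"lang": "en", "score": 0.05422825738787651},
]

print(filter_by_score(sample_results, min_score=0.1))
# Prints: ['ru', 'zh']
```

A suitable cutoff depends on the text length and the model in use, so it is worth tuning on representative samples.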

Convenient `detect_language` function

from fast_langdetect import detect_language

# Single language detection
print(detect_language("Hello, world!"))
# Output: EN

print(detect_language("Привет, мир!"))
# Output: RU

print(detect_language("你好，世界！"))
# Output: ZH
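The outputs above show only an upper-case code, with no score. When a low-confidence guess should be rejected, one option is to post-process the dictionary form of a result instead. The sketch below takes such a dictionary as plain input and does not call the library; `code_or_unknown`, the `"UNKNOWN"` label, and the 0.5 default are illustrative assumptions, not part of fast-langdetect.

```python
def code_or_unknown(result: dict, min_score: float = 0.5) -> str:
    """Return the upper-case language code, or "UNKNOWN" for a low score.

    Hypothetical helper (not part of fast-langdetect). `result` is expected
    to have the {'lang': ..., 'score': ...} shape shown in this README.
    """
    if result["score"] < min_score:
        return "UNKNOWN"
    return result["lang"].upper()


# Both inputs are copied from outputs shown earlier in this README.
print(code_or_unknown({"lang": "en", "score": 0.8509423136711121}))
# Prints: EN
print(code_or_unknown({"lang": "en", "score": 0.12450417876243591}))
# Prints: UNKNOWN
```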

Splitting text by language

For splitting text by language, please refer to the split-lang repository.

Benchmark

For detailed benchmark results, refer to zafercavdar/fasttext-langdetect#benchmark

References

[1] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

@article{joulin2016bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.01759},
  year={2016}
}

[2] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models

@article{joulin2016fasttext,
  title={FastText.zip: Compressing text classification models},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, H{\'e}rve and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1612.03651},
  year={2016}
}