
`clean-text`

User-generated content on the web and in social media is often dirty. Preprocess your scraped data with clean-text to create a normalized text representation. For instance, turn this corrupted input:

A bunch of \u2018new\u2019 references, including [Moana](https://en.wikipedia.org/wiki/Moana_%282016_film%29).


»Yóù àré     rïght &lt;3!«

into this clean output:

A bunch of 'new' references, including [moana](<URL>).

"you are right <3!"

clean-text uses ftfy, unidecode, and numerous hand-crafted rules, i.e., regular expressions.
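To give a feel for the kind of steps involved, here is a minimal sketch that reproduces the second example above using only the Python standard library. It is not how clean-text is implemented: `html.unescape` stands in for ftfy, `unicodedata` stands in for unidecode, and the function name `sketch_clean` and the quote table are assumptions made for illustration.

```python
import html
import re
import unicodedata

# Hypothetical quote table: map a few typographic quotes to ASCII quotes.
QUOTES = str.maketrans({
    "\u00bb": '"', "\u00ab": '"',   # » «
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2018": "'", "\u2019": "'",   # curly single quotes
})


def sketch_clean(text: str) -> str:
    """Illustrative pipeline only; clean-text itself does considerably more."""
    text = html.unescape(text)                      # "&lt;" becomes "<"
    text = text.translate(QUOTES)                   # normalize quotes
    text = unicodedata.normalize("NFKD", text)      # split base letters from accents
    text = text.encode("ascii", "ignore").decode("ascii")  # drop non-ASCII leftovers
    text = re.sub(r"\s+", " ", text).strip()        # collapse whitespace runs
    return text.lower()


print(sketch_clean("»Yóù àré     rïght &lt;3!«"))
```

The order matters in this sketch: quotes are mapped before the ASCII step, because characters such as » have no ASCII decomposition and would otherwise be dropped silently.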

Installation

To install clean-text together with the GPL-licensed package unidecode:

pip install clean-text[gpl]

If you want to avoid the GPL dependency:

pip install clean-text

NB: This package is named clean-text, not cleantext.

If unidecode is not available, clean-text falls back to Python's unicodedata for transliteration. Transliterating to the closest ASCII symbols relies on manual mappings, e.g., ê to e. unidecode's mappings are superior, but unicodedata's are sufficient. However, depending on your data and use case, you may want to disable this feature altogether.

To be clear: there are inconsistencies between processing text with and without unidecode.
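The following sketch shows one way such inconsistencies can arise. It assumes the fallback behaves roughly like an NFKD decomposition followed by dropping every non-ASCII character; the helper name `to_ascii_fallback` is hypothetical, and the actual fallback in clean-text may differ in detail.

```python
import unicodedata


def to_ascii_fallback(text: str) -> str:
    """Assumed fallback: decompose accented letters, then drop what is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii_fallback("ê"))       # accent removed, base letter kept
print(to_ascii_fallback("Straße"))  # ß has no decomposition and is dropped
```

With this approach "ê" becomes "e", but "ß" disappears entirely, whereas unidecode maps it to "ss". Differences of this kind are worth checking against your own data before deciding which installation to use.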

Usage

from cleantext import clean

clean("some input",
    fix_unicode=True,               # fix various unicode errors
    to_ascii=True,                  # transliterate to closest ASCII representation
    lower=True,                     # lowercase text
    no_line_breaks=False,           # fully strip line breaks as opposed to only normalizing them
    no_urls=False,                  # replace all URLs with a special token
    no_emails=False,                # replace all email addresses with a special token
    no_phone_numbers=False,         # replace all phone numbers with a special token
    no_numbers=False,               # replace all numbers with a special token
    no_digits=False,                # replace all digits with a special token
    no_currency_symbols=False,      # replace all currency symbols with a special token
    no_punct=False,                 # remove punctuation
    replace_with_punct="",          # instead of removing punctuation you may replace it
    replace_with_url="<URL>",
    replace_with_email="<EMAIL>",
    replace_with_phone_number="<PHONE>",
    replace_with_number="<NUMBER>",
    replace_with_digit="0",
    replace_with_currency_symbol="<CUR>",
    lang="en"                       # set to 'de' for German special handling
)

Carefully choose the arguments that fit your task. The default parameters are listed above.

You can also use only specific cleaning functions. For this, take a look at the source code.
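As a rough illustration of what options such as `no_urls` and `replace_with_url` do conceptually, the sketch below replaces URLs and email addresses with tokens using deliberately simplified regular expressions. The patterns and the function name `replace_tokens` are assumptions for this example and are not the expressions clean-text uses.

```python
import re

# Simplified, illustrative patterns; real-world URL and email matching is harder.
URL_RE = re.compile(r"https?://[^\s<>()\[\]]+")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def replace_tokens(text: str,
                   replace_with_url: str = "<URL>",
                   replace_with_email: str = "<EMAIL>") -> str:
    """Swap URLs and email addresses for placeholder tokens."""
    text = URL_RE.sub(replace_with_url, text)
    text = EMAIL_RE.sub(replace_with_email, text)
    return text


print(replace_tokens("[Moana](https://en.wikipedia.org/wiki/Moana_%282016_film%29)."))
print(replace_tokens("mail me at jane.doe@example.com now"))
```

Replacing rather than deleting keeps the position of the original item visible to downstream models, which is presumably why the library exposes the replacement tokens as parameters.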

Supported languages

So far, only English and German are fully supported. It should work for the majority of Western languages. If you need special handling for your language, feel free to contribute.

Using `clean-text` with `scikit-learn`

There is also a scikit-learn compatible API for use in your pipelines. All of the parameters above work here as well.

pip install clean-text[gpl,sklearn]
pip install clean-text[sklearn]

from cleantext.sklearn import CleanTransformer

cleaner = CleanTransformer(no_punct=False, lower=False)

cleaner.transform(['Happily clean your text!', 'Another Input'])
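For readers unfamiliar with what "scikit-learn compatible" means here, the sketch below shows the general shape of a stateless transformer: `fit` returns the object itself and `transform` maps a cleaning function over the inputs. The class `SketchCleanTransformer` and the helper `lower_and_squeeze` are hypothetical and use no third-party code; they are not the implementation of `CleanTransformer`.

```python
from typing import Callable, Iterable, List, Optional


class SketchCleanTransformer:
    """Hypothetical stand-in that follows the fit/transform convention."""

    def __init__(self, clean_fn: Callable[[str], str]):
        self.clean_fn = clean_fn

    def fit(self, X: Iterable[str], y: Optional[object] = None):
        # Nothing is learned from the data, so fitting is a no-op.
        return self

    def transform(self, X: Iterable[str]) -> List[str]:
        # Apply the cleaning function to every document.
        return [self.clean_fn(text) for text in X]

    def fit_transform(self, X: Iterable[str], y: Optional[object] = None) -> List[str]:
        return self.fit(X, y).transform(X)


def lower_and_squeeze(text: str) -> str:
    """Toy cleaning function: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


cleaner = SketchCleanTransformer(lower_and_squeeze)
result = cleaner.fit_transform(["Happily   clean your text!", "Another Input"])
print(result)
```

Because cleaning needs no training step, an object of this shape can sit as the first stage of a pipeline ahead of a vectorizer.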

Development

Use poetry.

Contributing

If you have a question, found a bug, or want to propose a new feature, have a look at the issues page.

Pull requests are especially welcome when they fix bugs or improve code quality.

If you don't like the output of clean-text, consider adding a test with your specific input and desired output.

Related Work

Generic text cleaning packages

https://github.com/pudo/normality
https://github.com/davidmogar/cucco
https://github.com/lyeoni/prenlp
https://github.com/s/preprocessor
https://github.com/artefactory/nlpretext
https://github.com/cbaziotis/ekphrasis

Full-blown NLP libraries with some text cleaning

https://github.com/chartbeat-labs/textacy
https://github.com/jbesomi/texthero

Remove or replace strings

https://github.com/vi3k6i5/flashtext
https://github.com/ddelange/retrie

Detect dates

https://github.com/scrapinghub/dateparser

Clean massive Common Crawl data

https://github.com/facebookresearch/cc_net

Acknowledgements

Built upon the work by Burton DeWilde for Textacy.

License

Apache


More information

Version: 1.0.0
Type: Other source code
Updated: 2025-04-17
Size: 33.96KB
Source: GitHub
