
`clean-text`

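User-generated content on the web and in social media is often dirty. Preprocess your scraped data with `clean-text` to create a normalized text representation. For instance, turn this corrupted input: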

A bunch of \u2018new\u2019 references, including [Moana](https://en.wikipedia.org/wiki/Moana_%282016_film%29).


»Yóù àré     rïght &lt;3!«

into this clean output:

A bunch of 'new' references, including [moana](<URL>).

"you are right <3!"

`clean-text` uses ftfy, unidecode, and numerous hand-crafted rules.

Installation

To install together with the GPL-licensed unidecode package:

pip install clean-text[gpl]

You may want to abstain from the GPL dependency:

pip install clean-text

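NB: This package is named `clean-text`, not `cleantext`.

If unidecode is not available, `clean-text` falls back to Python's `unicodedata.normalize` for transliteration. Transliterating to the closest ASCII symbols involves manual mappings, e.g. ê to e. unidecode's mapping is superior, but unicodedata's is sufficient. However, depending on your data and use case, you may want to disable this feature altogether.

To be clear: text processed with unidecode and text processed without it can differ.

Usage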
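The fallback described above can be illustrated with the standard library alone. This is a minimal sketch, assuming the NFKD normalization form and a helper name (`to_ascii_fallback`) invented here; the exact call `clean-text` makes internally may differ.

```python
import unicodedata


def to_ascii_fallback(text):
    """Hypothetical helper: strip diacritics using only the standard library."""
    # NFKD splits "ê" into "e" plus a separate combining circumflex accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" drops the combining marks, along with
    # any character that has no ASCII decomposition at all.
    return decomposed.encode("ascii", "ignore").decode("ascii")


accented = to_ascii_fallback("Yóù àré rïght")
eszett = to_ascii_fallback("Straße")
print(accented)  # You are right
print(eszett)    # Strae
```

The second result shows where the inconsistency comes from: `ß` has no Unicode decomposition, so this approach silently drops it, whereas a hand-built mapping table such as unidecode's can transliterate it to `ss`.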

from cleantext import clean

clean("some input",
    fix_unicode=True,               # fix various unicode errors
    to_ascii=True,                  # transliterate to closest ASCII representation
    lower=True,                     # lowercase text
    no_line_breaks=False,           # fully strip line breaks as opposed to only normalizing them
    no_urls=False,                  # replace all URLs with a special token
    no_emails=False,                # replace all email addresses with a special token
    no_phone_numbers=False,         # replace all phone numbers with a special token
    no_numbers=False,               # replace all numbers with a special token
    no_digits=False,                # replace all digits with a special token
    no_currency_symbols=False,      # replace all currency symbols with a special token
    no_punct=False,                 # remove punctuation
    replace_with_punct="",          # instead of removing punctuation, you may replace it
    replace_with_url="<URL>",
    replace_with_email="<EMAIL>",
    replace_with_phone_number="<PHONE>",
    replace_with_number="<NUMBER>",
    replace_with_digit="0",
    replace_with_currency_symbol="<CUR>",
    lang="en"                       # set to 'de' for German special handling
)

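Carefully choose the arguments that fit your task. The default parameters are listed above.

You can also use only specific functions for cleaning. For this, take a look at the source code.

Supported languages

So far, only English and German are fully supported. It should work for the majority of Western languages. If you need special handling for your language, feel free to contribute.

Using `clean-text` with `scikit-learn`

There is also a scikit-learn compatible API to use in your pipelines. All of the parameters above work here as well.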
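To make the `no_urls` / `replace_with_url` style of option more concrete, here is a rough standard-library sketch of token replacement. The function name `replace_tokens` and both regular expressions are assumptions made for illustration; they are far simpler than real URL or e-mail matching and are not the patterns `clean-text` uses.

```python
import re

# Deliberately simple patterns for illustration only.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def replace_tokens(text, replace_with_url="<URL>", replace_with_email="<EMAIL>"):
    """Hypothetical helper: swap URLs and e-mail addresses for placeholder tokens."""
    text = URL_RE.sub(replace_with_url, text)
    text = EMAIL_RE.sub(replace_with_email, text)
    return text


example = replace_tokens("Mail me@example.com or visit https://example.com/page now")
print(example)  # Mail <EMAIL> or visit <URL> now
```

Replacing such spans with a fixed token, rather than deleting them, keeps the information that a URL or address was present while removing the highly variable surface form.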

pip install clean-text[gpl,sklearn]
pip install clean-text[sklearn]

from cleantext.sklearn import CleanTransformer

cleaner = CleanTransformer(no_punct=False, lower=False)

cleaner.transform(['Happily clean your text!', 'Another Input'])
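For readers unfamiliar with what "scikit-learn compatible" implies, the sketch below shows the `fit` / `transform` convention using a hypothetical stand-in class. `SimpleCleaner` is not part of `clean-text` and only collapses whitespace and lowercases; it is meant to show the shape of the interface, not how `CleanTransformer` is implemented.

```python
class SimpleCleaner:
    """Hypothetical stand-in that follows the fit/transform convention."""

    def __init__(self, lower=True):
        self.lower = lower

    def fit(self, X, y=None):
        # Cleaning is stateless, so there is nothing to learn; return self
        # so that calls can be chained.
        return self

    def transform(self, X):
        # Collapse runs of whitespace, then optionally lowercase each text.
        cleaned = [" ".join(text.split()) for text in X]
        return [t.lower() for t in cleaned] if self.lower else cleaned

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


result = SimpleCleaner(lower=True).fit_transform(
    ["Happily   clean your text!", "Another Input"]
)
print(result)  # ['happily clean your text!', 'another input']
```

Because the cleaning step exposes the same methods as any other transformer, it can sit in front of a vectorizer or model in a pipeline and receive its options as constructor arguments.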

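Development

Use Poetry.

Contributing

If you have a question, found a bug, or want to propose a new feature, have a look at the issues page.

Pull requests are especially welcome when they fix bugs or improve code quality.

If you don't like the output of `clean-text`, consider adding a test with your specific input and desired output.

Acknowledgement

Built upon the work of Burton DeWilde.

License

Apache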


Additional information

Version: 1.0.0
Type: other source code
Updated: 2025-04-17
Size: 33.96 KB
Source: GitHub
