`clean-text`

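User-generated content on the web and in social media is often dirty. Preprocess your scraped data with `clean-text` to create a normalized text representation. For instance, turn this corrupted input: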

A bunch of \u2018new\u2019 references, including [Moana](https://en.wikipedia.org/wiki/Moana_%282016_film%29).


»Yóù àré     rïght &lt;3!«

into this clean output:

A bunch of 'new' references, including [moana](<URL>).

"you are right <3!"

`clean-text` uses ftfy, unidecode and numerous hand-crafted rules.

Installation

To install the GPL-licensed package unidecode alongside:

pip install clean-text[gpl]

You may want to abstain from the GPL dependency:

pip install clean-text

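NB: This package is named `clean-text`, not `cleantext`.

If unidecode is not available, `clean-text` falls back to Python's `unicodedata.normalize` for transliteration. Transliteration to the closest ASCII symbols relies on manual mappings, e.g., ê to e. unidecode's mapping is superior, but unicodedata's is sufficient. However, depending on your data and use case, you may want to disable this feature altogether.

To be clear: text processed with unidecode and text processed without it will not always come out the same.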
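To illustrate the kind of fallback described above, here is a minimal sketch that uses only the standard library. The helper name `to_ascii_fallback` is hypothetical, and the exact steps `clean-text` takes internally may differ; the sketch only assumes that `unicodedata.normalize` is used to decompose characters before non-ASCII code points are dropped.

```python
import unicodedata


def to_ascii_fallback(text: str) -> str:
    """Approximate ASCII transliteration without unidecode (illustrative only)."""
    # NFKD splits accented characters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with errors="ignore" drops the combining marks and
    # any character that has no ASCII decomposition at all.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii_fallback("Yóù àré rïght"))  # You are right
print(to_ascii_fallback("Straße"))         # Strae
```

The second call shows where inconsistencies can come from: `ß` has no Unicode decomposition, so this fallback simply drops it, whereas a hand-curated mapping such as unidecode's would typically substitute an ASCII approximation instead.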

Usage

from cleantext import clean

clean("some input",
    fix_unicode=True,               # fix various unicode errors
    to_ascii=True,                  # transliterate to closest ASCII representation
    lower=True,                     # lowercase text
    no_line_breaks=False,           # fully strip line breaks as opposed to only normalizing them
    no_urls=False,                  # replace all URLs with a special token
    no_emails=False,                # replace all email addresses with a special token
    no_phone_numbers=False,         # replace all phone numbers with a special token
    no_numbers=False,               # replace all numbers with a special token
    no_digits=False,                # replace all digits with a special token
    no_currency_symbols=False,      # replace all currency symbols with a special token
    no_punct=False,                 # remove punctuation
    replace_with_punct="",          # instead of removing punctuation you may replace it
    replace_with_url="<URL>",
    replace_with_email="<EMAIL>",
    replace_with_phone_number="<PHONE>",
    replace_with_number="<NUMBER>",
    replace_with_digit="0",
    replace_with_currency_symbol="<CUR>",
    lang="en"                       # set to 'de' for German special handling
)

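Carefully choose the arguments that fit your task. The default parameters are listed above.

You may also use only specific functions for cleaning. For this, take a look at the source code.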
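To make the "replace with a special token" options more concrete, the following standard-library sketch mimics what flags such as `no_urls`, `no_emails` and `no_digits` do conceptually. The function `replace_tokens` and both regular expressions are assumptions made for illustration; they are deliberately simple and are not the patterns `clean-text` ships with.

```python
import re

# Simplified patterns for illustration only; real-world URL and email
# matching needs considerably more care than this.
URL_PATTERN = re.compile(r"https?://\S+")
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def replace_tokens(
    text,
    no_urls=True,
    no_emails=True,
    no_digits=False,
    replace_with_url="<URL>",
    replace_with_email="<EMAIL>",
    replace_with_digit="0",
):
    """Replace URLs, email addresses and digits with placeholder tokens."""
    if no_urls:
        text = URL_PATTERN.sub(replace_with_url, text)
    if no_emails:
        text = EMAIL_PATTERN.sub(replace_with_email, text)
    if no_digits:
        # Every single digit is swapped for the same replacement character.
        text = re.sub(r"\d", replace_with_digit, text)
    return text


print(replace_tokens("Mail me@example.com or see https://example.com/page now"))
# Mail <EMAIL> or see <URL> now
```

Replacing values with fixed tokens keeps the signal that a URL or address was present while removing the specific value, which is often what downstream models need.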

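Supported languages

So far, only English and German are fully supported. It should work for the majority of Western languages. If you need some special handling for your language, feel free to contribute.

Using `clean-text` with `scikit-learn`

There is also a scikit-learn compatible API to use in your pipelines. All of the parameters above work here as well.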

pip install clean-text[gpl,sklearn]
pip install clean-text[sklearn]

from cleantext.sklearn import CleanTransformer

cleaner = CleanTransformer(no_punct=False, lower=False)

cleaner.transform(['Happily clean your text!', 'Another Input'])
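For readers unfamiliar with what "scikit-learn compatible" means in practice, here is a rough, dependency-free sketch of the transformer interface. `SimpleCleaner` is a hypothetical stand-in, not `CleanTransformer`: it does not inherit from scikit-learn's base classes, so it lacks features such as parameter introspection, and it performs only two trivial cleaning steps.

```python
import re


class SimpleCleaner:
    """Hypothetical, duck-typed transformer exposing fit/transform."""

    def __init__(self, lower=True, normalize_whitespace=True):
        self.lower = lower
        self.normalize_whitespace = normalize_whitespace

    def fit(self, X, y=None):
        # Cleaning is stateless, so there is nothing to learn from the data.
        return self

    def transform(self, X):
        return [self._clean_one(text) for text in X]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

    def _clean_one(self, text):
        if self.normalize_whitespace:
            # Collapse runs of whitespace and trim both ends.
            text = re.sub(r"\s+", " ", text).strip()
        if self.lower:
            text = text.lower()
        return text


cleaner = SimpleCleaner(lower=False)
print(cleaner.transform(["Happily   clean your text!", "Another Input"]))
# ['Happily clean your text!', 'Another Input']
```

Because the cleaner takes a list of strings and returns a list of strings, an object with this shape can sit in front of a vectorizer as a preprocessing step; in a real pipeline the provided `CleanTransformer` would be the one to use.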

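Development

Use Poetry.

Contributing

If you have a question, found a bug, or want to propose a new feature, have a look at the issues page.

Pull requests are especially welcome when they fix bugs or improve the code quality.

If you don't like the output of `clean-text`, consider adding a test with your specific input and desired output.

Acknowledgement

Built upon the work of Burton DeWilde.

License

Apache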

Additional information

Version: 1.0.0
Type: Other source code
Updated: 2025-04-17
Size: 33.96 KB
Source: GitHub
