`clean-text`

User-generated content on the web and in social media is often dirty. Preprocess your scraped data with clean-text to create a normalized text representation. For instance, turn this corrupted input:

A bunch of \u2018new\u2019 references, including [Moana](https://en.wikipedia.org/wiki/Moana_%282016_film%29).


»Yóù àré     rïght &lt;3!«

into this clean output:

A bunch of 'new' references, including [moana](<URL>).

"you are right <3!"

clean-text uses ftfy, unidecode, and numerous hand-crafted rules, i.e., regular expressions.

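Installation

To install clean-text together with the GPL-licensed package unidecode:

pip install clean-text[gpl]

You may want to avoid the GPL dependency:

pip install clean-text

NB: This package is named clean-text, not cleantext.

If unidecode is not available, clean-text falls back to Python's unicodedata module. Transliteration to the closest ASCII symbols relies on mappings from one character to another, e.g., ê to e. unidecode's mappings are superior, but those of unicodedata are sufficient. Depending on your data and use case, you may want to disable this feature altogether.

To be clear: text processed with unidecode and text processed without it will not always be identical.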
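To illustrate why results can differ without unidecode, here is a minimal standard-library sketch of transliteration via unicodedata. The helper name `to_ascii_fallback` is hypothetical and this is not necessarily how clean-text implements its fallback; it only shows the general technique of decomposing characters and dropping whatever is not ASCII.

```python
import unicodedata


def to_ascii_fallback(text: str) -> str:
    """Approximate ASCII transliteration using only the standard library.

    Hypothetical helper for illustration; not part of clean-text.
    """
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Dropping everything outside ASCII removes the combining marks, but
    # also any character without an ASCII decomposition (such as "ß").
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii_fallback("Yóù àré rïght"))  # You are right
print(to_ascii_fallback("Straße"))         # Strae
```

The second call shows the kind of loss a dedicated mapping table avoids: characters with no decomposition simply disappear in this sketch.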

Usage

from cleantext import clean

clean("some input",
    fix_unicode=True,               # fix various unicode errors
    to_ascii=True,                  # transliterate to closest ASCII representation
    lower=True,                     # lowercase text
    no_line_breaks=False,           # fully strip line breaks as opposed to only normalizing them
    no_urls=False,                  # replace all URLs with a special token
    no_emails=False,                # replace all email addresses with a special token
    no_phone_numbers=False,         # replace all phone numbers with a special token
    no_numbers=False,               # replace all numbers with a special token
    no_digits=False,                # replace all digits with a special token
    no_currency_symbols=False,      # replace all currency symbols with a special token
    no_punct=False,                 # remove punctuations
    replace_with_punct="",          # instead of removing punctuations you may replace them
    replace_with_url="<URL>",
    replace_with_email="<EMAIL>",
    replace_with_phone_number="<PHONE>",
    replace_with_number="<NUMBER>",
    replace_with_digit="0",
    replace_with_currency_symbol="<CUR>",
    lang="en"                       # set to 'de' for German special handling
)

Carefully choose the arguments that fit your task. The default parameters are listed above.

You may also use only specific functions for cleaning. For this, take a look at the source code.
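To make the token-replacement options more concrete, the following standard-library sketch mimics what flags such as no_urls, no_emails, no_numbers, and no_digits do conceptually. The function `replace_tokens` and its regular expressions are assumptions made for illustration; they are much simpler than the library's own patterns and are not its API.

```python
import re

# Deliberately simplified patterns for illustration only.
URL_RE = re.compile(r"https?://[^\s)>\]]+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")


def replace_tokens(text, *, no_urls=False, no_emails=False,
                   no_numbers=False, no_digits=False):
    """Hypothetical helper: swap URLs, emails and numbers for tokens."""
    if no_urls:
        text = URL_RE.sub("<URL>", text)
    if no_emails:
        text = EMAIL_RE.sub("<EMAIL>", text)
    if no_numbers:
        # A whole number becomes a single token.
        text = NUMBER_RE.sub("<NUMBER>", text)
    if no_digits:
        # Each digit is replaced individually, so the length is preserved.
        text = re.sub(r"\d", "0", text)
    return text


print(replace_tokens("Mail me@example.com, 42 items", no_emails=True, no_digits=True))
# Mail <EMAIL>, 00 items
print(replace_tokens("42 items", no_numbers=True))
# <NUMBER> items
```

The last two calls highlight the difference between the two numeric options: one token per number versus one replacement character per digit.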

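Supported languages

So far, only English and German are fully supported. It should work for the majority of Western languages. If you need special handling for your language, feel free to contribute.

Using `clean-text` with `scikit-learn`

There is also a scikit-learn compatible API to use in your pipelines. All of the parameters above work here as well.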

pip install clean-text[gpl,sklearn]
pip install clean-text[sklearn]

from cleantext.sklearn import CleanTransformer

cleaner = CleanTransformer(no_punct=False, lower=False)

cleaner.transform(['Happily clean your text!', 'Another Input'])

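Development

Use poetry.

Contributing

If you have a question, found a bug, or want to propose a new feature, have a look at the issues page.

Pull requests are especially welcome when they fix bugs or improve the code quality.

If you don't like the output of clean-text, consider adding a test with your specific input and desired output.

Acknowledgement

Built upon the work by Burton DeWilde for Textacy.

License

Apache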

Additional information

Version: 1.0.0
Type: Other source code
Updated: 2025-04-17
Size: 33.96 KB
Source: GitHub
