text dedup

Installation

pip install text-dedup

or

pip install git+https://github.com/ChenghaoMou/text-dedup

Documentation

GitHub Pages

Features

This repository contains a collection of text deduplication scripts that are ready to use or to modify for your needs:

RETSim/UniSim, an embedding-based near deduplication (WIP)
MinHash + MinHashLSH, including a Spark implementation suitable for large (TB) datasets
64 or 128 bit SimHash
Suffix Array Substring
Bloom Filter
Exact Hash (document-level, line-level/CCNet)

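I also have big plans for the future:

Memory benchmark for streaming processing
Inter-dataset deduplication
Rewrite suffix array in Python
A collection of other deduplication methods: SuperMinHash, ProbMinHash, TreeMinHash, BagMinHash, Optimal Densification for Fast and Accurate Minwise Hashing, Fast Similarity Sketching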

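However, I do not intend to build a general-purpose deduplication library. I will also gradually retire the PyPI package. The reason is that each use case can differ widely and requires careful design and consideration. I encourage you to read the scripts first (they are relatively short) so that you understand what is at stake when using them. You can use them to bootstrap your own script or as a reference.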

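Acknowledgements

This repository is inspired by the following projects and is heavily influenced by lessons learned from my participation in BigScience (Apache 2.0) and BigCode (Apache 2.0). There is a blog post about the journey. Feedback is welcome!

Datasketch (MIT)
simhash-py and simhash-cpp (MIT)
Deduplicating Training Data Makes Language Models Better (Apache 2.0)
Gaoya (MIT)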

Quick Examples

Native PySpark

Modify text_dedup/minhash_spark.py for your own project and dataset first!

Assuming you have a downloaded dataset (Parquet files) under "./temp-data", you can process the files with your local compute:

export PYSPARK_PYTHON="path to your python with scipy, xxhash, and numpy installed"
spark-submit --executor-memory 16g \
    --driver-memory 20g \
    --executor-cores 3 \
    --num-executors 2 \
    --packages graphframes:graphframes:0.8.2-spark3.2-s_2.12 \
    --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=./log4j.properties" \
    --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=./log4j.properties" \
    text_dedup/minhash_spark.py \
    --input "./temp-data" \
    --output "./temp-output" \
    --column "text" \
    --threshold 0.7

 DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------
DEBUG __main__ - Using B=25, R=10
DEBUG __main__ - Loaded documents: 88803
DEBUG __main__ - args.input='./temp-data'
DEBUG __main__ - args.output='./temp-output'
DEBUG __main__ - args.threshold=0.7
DEBUG __main__ - args.ngram_size=5
DEBUG __main__ - args.min_length=5
DEBUG __main__ - args.num_perm=250
DEBUG __main__ - args.column='text'
DEBUG __main__ - id                                                              : bigint
DEBUG __main__ - text                                                            : string
DEBUG __main__ - meta                                                            : struct<warc_headers:struct<warc-record-id:string,warc-date:string,content-type:string,content-length:int,warc-type:string,warc-identified-content-language:string,warc-refers-to:string,warc-target-uri:string,warc-block-digest:string>,identification:struct<label:string,prob:float>,annotations:array<string>,line_identifications:array<struct<label:string,prob:float>>>
DEBUG __main__ - __id__                                                          : bigint
DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------
DEBUG __main__ - Initial edges: 52102
DEBUG __main__ - Edges DataFrame: 52102
DEBUG __main__ - Vertices DataFrame: 50206
DEBUG __main__ - Assignment DataFrame: 50206
DEBUG __main__ - Merging records: 88803
INFO  __main__ - Saving with 1 partitions and 44092 rows each
DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------
DEBUG __main__ - Number of rows before:    88803
DEBUG __main__ - Number of rows after:     44092
DEBUG __main__ - Percentage of rows kept:  49.65%
DEBUG __main__ - Output:                   ./temp-output
DEBUG __main__ - Time:                     68.80s
DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------

Or take a look at bigcode-v2/run.sh to see how to run the job with GCP DataProc.
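The log line `Using B=25, R=10` refers to the LSH banding layout: the 250 permutations are split into 25 bands of 10 rows each. The sketch below is not taken from the script. It only evaluates the standard banding formula, under the assumption that B means bands and R means rows per band, to show how likely a pair of documents is to become a candidate at a given Jaccard similarity. The function name `candidate_probability` is made up for illustration.

```python
def candidate_probability(similarity: float, bands: int, rows: int) -> float:
    """Probability that two documents with the given Jaccard similarity
    agree on every row of at least one band (standard LSH banding formula)."""
    return 1.0 - (1.0 - similarity ** rows) ** bands


if __name__ == "__main__":
    # B and R as printed in the log above (25 * 10 = 250 = num_perm).
    bands, rows = 25, 10
    for similarity in (0.5, 0.6, 0.7, 0.8, 0.9):
        probability = candidate_probability(similarity, bands, rows)
        print(f"similarity={similarity:.1f} -> P(candidate)={probability:.3f}")
```

With this layout the curve crosses roughly one half near a similarity of 0.7, which is consistent with the `--threshold 0.7` used in the command.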

UniSim (WIP)

Based on Google's RETSim model (GitHub, arXiv), this is an embedding-based near-deduplication method.

For large datasets, GPUs are needed for fast inference.

python text_dedup/ann_unisim.py --path truthful_qa --name generation --split validation --output temp --column question

Output:

INFO     Load Dataset                    : 5.56s
INFO     Index Dataset                   : 8.13s
INFO     Clustering                      : 8.72s
INFO     Filtering                       : 0.35s
INFO     Saving                          : 0.01s
INFO     Cleaning                        : 0.00s
INFO     Total                           : 22.77s
INFO     Before                          : 817
INFO     After                           : 788

Suffix Array Substring Exact Deduplication

# input
python -m text_dedup.suffix_array \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/suffix_array/oscar_gl_dedup" \
    --column "text" \
    --google_repo_path "/Users/chenghao/Downloads/Projects/text-dedup/deduplicate-text-datasets" \
    --use_auth_token true

# output
INFO     Loading                       : 2.75 seconds
INFO     Preprocessing                 : 4.78 seconds
INFO     SuffixArray                   : 98.29 seconds
INFO     SelfSimilar                   : 4.24 seconds
INFO     Restore                       : 0.25 seconds
INFO     Deduplicate                   : 6.23 seconds
INFO     Saving                        : 8.91 seconds
INFO     Total                         : 125.45 seconds
INFO     Before                        : 180332342 bytes (88803)
INFO     After                         : 97646271 bytes (40404)
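To make the idea behind substring-level deduplication concrete, here is a toy sketch. It is not the implementation used by the script, which delegates to the Google repository passed via `--google_repo_path`. It builds a naive suffix array over a byte string and reports the byte ranges covered by any substring of at least `min_length` bytes that occurs more than once. The function names and the `min_length` parameter are assumptions for illustration, and the quadratic construction is only suitable for tiny inputs.

```python
from __future__ import annotations


def suffix_array(data: bytes) -> list[int]:
    """Naive suffix array: start offsets sorted by the suffix they begin."""
    return sorted(range(len(data)), key=lambda i: data[i:])


def duplicate_spans(data: bytes, min_length: int) -> list[tuple[int, int]]:
    """Merged (start, end) byte ranges covered by a repeated substring
    of at least `min_length` bytes."""
    order = suffix_array(data)
    spans = []
    for a, b in zip(order, order[1:]):
        # Longest common prefix of two neighbouring suffixes.
        lcp = 0
        while a + lcp < len(data) and b + lcp < len(data) and data[a + lcp] == data[b + lcp]:
            lcp += 1
        if lcp >= min_length:
            spans.append((a, a + lcp))
            spans.append((b, b + lcp))
    # Merge overlapping ranges so each duplicated region is reported once.
    merged: list[tuple[int, int]] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    text = b"hello world. hello world."
    print(duplicate_spans(text, min_length=8))
```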

MinHash Near Deduplication

# input
python -m text_dedup.minhash \
  --path "oscar-corpus/OSCAR-2201" \
  --name "gl" \
  --split "train" \
  --cache_dir "./cache" \
  --output "output/minhash/oscar_gl_dedup" \
  --column "text" \
  --batch_size 10000 \
  --use_auth_token true

# output
INFO     Loading                         : 2.62 seconds
INFO     MinHashing                      : 0.08 seconds
INFO     Clustering                      : 2.20 seconds
INFO     Filtering                       : 0.53 seconds
INFO     Saving                          : 9.86 seconds
INFO     Total                           : 15.29 seconds
INFO     Data Number (before)            : 88803
INFO     Data Number (after)             : 44124 (49.69%)
INFO     Duplicate Number                : 44679 (50.31%)
INFO     🤗 Happy Deduplicating 🤗
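For readers new to MinHash, the following is a minimal, standard-library-only sketch of the signature and similarity estimate. It is an illustration under stated assumptions, not the script's code: word 5-grams, 128 salted BLAKE2b hashes standing in for permutations, and the helper names `word_ngrams`, `minhash_signature`, and `estimated_jaccard` are all chosen for this example.

```python
from __future__ import annotations

import hashlib


def word_ngrams(text: str, ngram_size: int = 5) -> set[str]:
    """Set of lowercased word n-grams; short texts yield a single item."""
    tokens = text.lower().split()
    if len(tokens) <= ngram_size:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + ngram_size]) for i in range(len(tokens) - ngram_size + 1)}


def minhash_signature(items: set[str], num_perm: int = 128) -> list[int]:
    """One minimum per salted 64-bit hash function over all items."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of positions where the two signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog near the river bank today"
    doc_b = "the quick brown fox jumps over the lazy dog near the river bank tonight"
    sig_a = minhash_signature(word_ngrams(doc_a))
    sig_b = minhash_signature(word_ngrams(doc_b))
    print(f"estimated Jaccard: {estimated_jaccard(sig_a, sig_b):.2f}")
```

In a full pipeline the signatures would be split into bands for LSH and the resulting candidate pairs clustered; this sketch stops at the pairwise estimate.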

SimHash Near Deduplication

# input
python -m text_dedup.simhash \
  --path "oscar-corpus/OSCAR-2201" \
  --name "gl" \
  --split "train" \
  --cache_dir "./cache" \
  --output "output/simhash/oscar_gl_dedup" \
  --column "text" \
  --batch_size 10000 \
  --use_auth_token true

# output
INFO     Loading                         : 2.60 seconds
INFO     SimHashing                      : 0.04 seconds
INFO     Indexing                        : 28.88 seconds
INFO     Filtering                       : 0.88 seconds
INFO     Saving                          : 10.41 seconds
INFO     Total                           : 42.80 seconds
INFO     Data Number (before)            : 88803
INFO     Data Number (after)             : 46163 (51.98%)
INFO     Duplicate Number                : 42640 (48.02%)
INFO     🤗 Happy Deduplicating 🤗
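As a rough illustration of what a SimHash fingerprint is, the sketch below computes a 64-bit fingerprint from whitespace tokens and compares two fingerprints by Hamming distance. The tokenization, the BLAKE2b token hash, and the function names are assumptions made for this example and may differ from what the script does.

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """Bitwise majority vote over the hashes of all tokens in the text."""
    counts = [0] * bits
    for token in text.lower().split():
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=bits // 8).digest()
        value = int.from_bytes(digest, "big")
        for i in range(bits):
            counts[i] += 1 if (value >> i) & 1 else -1
    # A bit is set in the fingerprint when more tokens voted 1 than 0.
    return sum(1 << i for i, count in enumerate(counts) if count > 0)


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    first = simhash("the quick brown fox jumps over the lazy dog")
    second = simhash("the quick brown fox jumped over the lazy dog")
    print(hamming_distance(first, second))
```

Documents whose fingerprints differ in only a few bits are treated as near duplicates; the exact bit threshold is a tuning choice.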

Exact Hash Exact Deduplication

# input
python -m text_dedup.exact_hash \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/exact_hash/oscar_gl_dedup" \
    --column "text" \
    --batch_size 1000 \
    --use_auth_token true

# output
INFO     Loading                       : 2.95s
INFO     Processing                    : 3.79s
INFO     Filtering                     : 0.10s
INFO     Saving                        : 2.89s
INFO     Total                         : 9.72s
INFO     Before                        : 88803
INFO     After                         : 47049
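The two exact-hash granularities listed under Features (document-level and line-level) can be sketched in a few lines. This is an illustrative sketch, not the script itself: SHA-256 is an arbitrary choice of hash here, no text normalization is applied, and `dedup_documents` and `dedup_lines` are hypothetical names.

```python
from __future__ import annotations

import hashlib


def _digest(text: str) -> bytes:
    return hashlib.sha256(text.encode("utf-8")).digest()


def dedup_documents(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each byte-identical document."""
    seen: set[bytes] = set()
    kept = []
    for doc in docs:
        key = _digest(doc)
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept


def dedup_lines(docs: list[str]) -> list[str]:
    """Drop any line that has already appeared anywhere earlier in the corpus."""
    seen: set[bytes] = set()
    result = []
    for doc in docs:
        kept_lines = []
        for line in doc.splitlines():
            key = _digest(line)
            if key not in seen:
                seen.add(key)
                kept_lines.append(line)
        result.append("\n".join(kept_lines))
    return result


if __name__ == "__main__":
    print(dedup_documents(["a", "b", "a"]))
    print(dedup_lines(["header\nfirst body", "header\nsecond body"]))
```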

Bloom Filter Exact Deduplication

# input
python -m text_dedup.bloom_filter \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/bloom_filter/oscar_gl_dedup" \
    --error_rate 1e-5 \
    --column "text" \
    --use_auth_token true \
    --batch_size 1000

# output
INFO     Loading                       : 2.72s
INFO     Processing                    : 4.84s
INFO     Filtering                     : 0.10s
INFO     Saving                        : 2.88s
INFO     Total                         : 10.54s
INFO     Before                        : 88803
INFO     After                         : 47045
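A Bloom filter trades a small false-positive rate for fixed memory, which may explain why the count above (47045) is slightly lower than the exact-hash result (47049). The class below is a self-contained sketch of the data structure, sized from a target error rate with the textbook formulas. It is not the library the script relies on, and the `capacity` parameter and class layout are assumptions for illustration.

```python
from __future__ import annotations

import hashlib
import math


class BloomFilter:
    def __init__(self, capacity: int, error_rate: float) -> None:
        # Textbook sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.size = max(1, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.size / capacity * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item: str) -> list[int]:
        # Double hashing: derive k bit positions from two 64-bit values.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, item: str) -> bool:
        """Insert the item; return True if it was (probably) already present."""
        present = True
        for position in self._positions(item):
            byte, bit = divmod(position, 8)
            if not self.bits[byte] & (1 << bit):
                present = False
                self.bits[byte] |= 1 << bit
        return present


if __name__ == "__main__":
    bloom = BloomFilter(capacity=1000, error_rate=1e-5)
    docs = ["a", "b", "a"]
    print([doc for doc in docs if not bloom.add(doc)])
```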

Benchmarks

Note

The Spark implementation has some overhead for small datasets, so I recommend using the script only when you have a large dataset and enough compute resources.

pinecone/core-2020-05-10-deduplication

See tests/benchmark_core.py for reproduction.

Algorithm	Precision (Duplicates)	Recall (Duplicates)	Precision (Non Duplicates)	Recall (Non Duplicates)	Macro F1 score	Accuracy	Time
UniSim	0.9307	0.8924	0.9055	0.9394	0.9181	0.9054	1305.79s
MinHash Spark	0.957	0.9445	0.9471	0.959	0.952	0.9202	691.77s
MinHash	0.9594	0.9445	0.9474	0.9616	0.9534	0.924	18.88s
SimHash	0.9042	0.721	0.792	0.9329	0.8481	0.8321	644.36s
Exact Title	0.8302	0.5521	0.7098	0.9065	0.77	0.7456	-
Exact Title Matching ¹	0.830	0.50	0.709	0.992	0.757	0.746	-
Simhash Matching ¹	0.697	0.247	0.598	0.985	0.631	0.616	-
Document Vector Similarity ¹	0.912	0.779	0.861	0.986	0.885	0.883	-
Hybrid Method ¹	0.908	0.828	0.899	0.979	0.904	0.903	-
LaBSE ²	0.937	0.923	0.930	0.943	0.933	0.919	-
Multilingual USE ²	0.917	0.907	0.918	0.927	0.917	0.909	-
Multilingual E5-Base ²	0.931	0.908	0.919	0.939	0.924	0.920	-
MinHash + LSH ²	0.929	0.902	0.915	0.938	0.921	0.918	-
RETSim Partial-Dup ²	0.945	0.941	0.945	0.949	0.945	0.928	-
RETSim Near-Dup ²	0.928	0.937	0.942	0.934	0.935	0.926	-

NEWS-COPY

See tests/benchmark_news.py for reproduction.

Adjusted Rand Index (ARI) on the NEWS-COPY dataset:

Model/Algorithm	ARI
SimHash	0.612
MinHash (Spark)	0.740
MinHash	0.742
RETSim Near-Dup + ANN*	0.051
n-gram ³	0.440
SimHash ²	0.695
MinHash ³	0.737
MinHash ²	0.783
Multilingual USE ²	0.730
Multilingual E5-Base ²	0.742
S-BERT ³	0.700
RETSim Partial-Dup ²	0.831
RETSim Near-Dup ²	0.704
Re-ranking ³	0.937
Bi-encoder ³	0.915

*: I can't seem to reproduce the results from the paper.
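For reference, the Adjusted Rand Index compares a predicted clustering against ground-truth cluster labels and corrects for chance agreement. The function below is a small standard-library sketch of the usual pair-counting formula, written for illustration. It is not the code in tests/benchmark_news.py, and treating degenerate inputs as a score of 1.0 is a convention chosen here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting ARI between two labelings of the same items."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # convention for this sketch: nothing to disagree on
    # Pairs placed together in both clusterings, in the truth, and in the prediction.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = together_true * together_pred / comb(n, 2)
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0
    return (together_both - expected) / (maximum - expected)


if __name__ == "__main__":
    # Same partition under different label names scores 1.0.
    print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
    # A partition that splits every true pair scores below zero.
    print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```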

License

Apache 2.0

Citation

Generally, you can cite this repository as:

@software{chenghao_mou_2023_8364980,
  author       = {Chenghao Mou and
                  Chris Ha and
                  Kenneth Enevoldsen and
                  Peiyuan Liu},
  title        = {ChenghaoMou/text-dedup: Reference Snapshot},
  month        = sep,
  year         = 2023,
  publisher    = {Zenodo},
  version      = {2023.09.20},
  doi          = {10.5281/zenodo.8364980},
  url          = {https://doi.org/10.5281/zenodo.8364980}
}

The Spark version was born from BigCode (Apache 2.0) and BigScience (Apache 2.0), and you can cite the original paper if you want:

@article{
kocetkov2023the,
title={The Stack: 3 {TB} of permissively licensed source code},
author={Denis Kocetkov and Raymond Li and Loubna Ben allal and Jia LI and Chenghao Mou and Yacine Jernite and Margaret Mitchell and Carlos Mu{\~n}oz Ferrandis and Sean Hughes and Thomas Wolf and Dzmitry Bahdanau and Leandro Von Werra and Harm de Vries},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2023},
url={https://openreview.net/forum?id=pxpbTdUEpD},
note={}
}

¹ Deduplication of Scholarly Documents using Locality Sensitive Hashing and Word Embeddings
² RETSim: Resilient and Efficient Text Similarity
³ Noise-Robust De-Duplication at Scale

Additional information

Version: Reference Snapshot
Type: Other source code
Updated: 2025-04-19
Size: 194.73KB
Source: GitHub
