text-dedup

Installation

pip install text-dedup

or

pip install git+https://github.com/ChenghaoMou/text-dedup

Documentation

GitHub Pages

Features

This repository contains a collection of text deduplication scripts that are ready to use or to modify based on your needs:

RETSim/UniSim, an embedding-based near deduplication method (WIP)
MinHash + MinHashLSH, including a Spark implementation suitable for large (TB) datasets
64 or 128 bit SimHash
SuffixArray Substring
Bloom Filter
Exact Hash (document-level, line-level/ccnet)

I also have big plans for the future:

Memory benchmark for streaming processing
Inter-dataset deduplication
Rewrite the suffix array in Python
A collection of other deduplication methods: SuperMinHash, ProbMinHash, TreeMinHash, BagMinHash, Optimal Densification for Fast and Accurate Minwise Hashing, Fast Similarity Sketching

However, I do not intend to build a general-purpose deduplication library, which was the goal of this repo early on. I will gradually retire the PyPI package as well. The reason is that each use case can be wildly different and requires careful design and consideration. I encourage you to read the scripts first (they are relatively short) so that you understand what is at stake when using them. You can use them to bootstrap your own script, or just use them as a reference.

Acknowledgements

This repository is inspired by the following projects, and is heavily influenced by lessons learned from my own participation in BigScience (Apache 2.0) and BigCode (Apache 2.0). There is a blog post about the journey. Feedback is welcome!

Datasketch (MIT)
simhash-py and simhash-cpp (MIT)
Deduplicating Training Data Makes Language Models Better (Apache 2.0)
Gaoya (MIT)

Quick Examples

Native PySpark

Modify text_dedup/minhash_spark.py for your own project and dataset first!

Assuming you have a downloaded dataset (in Parquet files) under "./temp-data", you can process the files with your local compute by running:

export PYSPARK_PYTHON="path to your python with scipy, xxhash, and numpy installed"
spark-submit --executor-memory 16g \
    --driver-memory 20g \
    --executor-cores 3 \
    --num-executors 2 \
    --packages graphframes:graphframes:0.8.2-spark3.2-s_2.12 \
    --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=./log4j.properties" \
    --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=./log4j.properties" \
    text_dedup/minhash_spark.py \
    --input "./temp-data" \
    --output "./temp-output" \
    --column "text" \
    --threshold 0.7

 DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------
DEBUG __main__ - Using B=25, R=10
DEBUG __main__ - Loaded documents: 88803
DEBUG __main__ - args.input='./temp-data'
DEBUG __main__ - args.output='./temp-output'
DEBUG __main__ - args.threshold=0.7
DEBUG __main__ - args.ngram_size=5
DEBUG __main__ - args.min_length=5
DEBUG __main__ - args.num_perm=250
DEBUG __main__ - args.column='text'
DEBUG __main__ - id                                                              : bigint
DEBUG __main__ - text                                                            : string
DEBUG __main__ - meta                                                            : struct<warc_headers:struct<warc-record-id:string,warc-date:string,content-type:string,content-length:int,warc-type:string,warc-identified-content-language:string,warc-refers-to:string,warc-target-uri:string,warc-block-digest:string>,identification:struct<label:string,prob:float>,annotations:array<string>,line_identifications:array<struct<label:string,prob:float>>>
DEBUG __main__ - __id__                                                          : bigint
DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------
DEBUG __main__ - Initial edges: 52102
DEBUG __main__ - Edges DataFrame: 52102
DEBUG __main__ - Vertices DataFrame: 50206
DEBUG __main__ - Assignment DataFrame: 50206
DEBUG __main__ - Merging records: 88803
INFO  __main__ - Saving with 1 partitions and 44092 rows each
DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------
DEBUG __main__ - Number of rows before:    88803
DEBUG __main__ - Number of rows after:     44092
DEBUG __main__ - Percentage of rows kept:  49.65%
DEBUG __main__ - Output:                   ./temp-output
DEBUG __main__ - Time:                     68.80s
DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------

Or take a look at BigCode-V2/run.sh to see how to run the job with GCP DataProc.
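
The `Using B=25, R=10` line in the log above appears to reflect MinHash LSH banding: the 250 permutations (`args.num_perm=250`) split into 25 bands of 10 rows each. The sketch below evaluates the standard banding formula to show how likely two documents are to become a candidate pair at a given Jaccard similarity. The function names are illustrative and not part of text-dedup, and how the script itself chooses B and R is not covered here.

```python
def candidate_probability(similarity: float, bands: int, rows: int) -> float:
    """Probability that two documents agree on every row of at least one band."""
    return 1.0 - (1.0 - similarity ** rows) ** bands


def approximate_threshold(bands: int, rows: int) -> float:
    """Rule-of-thumb similarity at which the S-curve rises most steeply."""
    return (1.0 / bands) ** (1.0 / rows)


if __name__ == "__main__":
    B, R = 25, 10  # values reported in the log above
    for s in (0.5, 0.6, 0.7, 0.8, 0.9):
        p = candidate_probability(s, B, R)
        print(f"similarity={s:.1f} -> P(candidate)={p:.3f}")
    print(f"approximate threshold: {approximate_threshold(B, R):.3f}")
```

Raising R makes each band harder to match (fewer false positives), while raising B gives more chances to match (fewer false negatives).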

UniSim (WIP)

Based on Google's RETSim model (GitHub, arXiv), it is an embedding-based near-deduplication method.

For a large dataset, it would require GPU(s) for fast inference.

python text_dedup/ann_unisim.py --path truthful_qa --name generation --split validation --output temp --column question

Output:

 INFO     Load Dataset                    : 5.56s
INFO     Index Dataset                   : 8.13s
INFO     Clustering                      : 8.72s
INFO     Filtering                       : 0.35s
INFO     Saving                          : 0.01s
INFO     Cleaning                        : 0.00s
INFO     Total                           : 22.77s
INFO     Before                          : 817
INFO     After                           : 788

Suffix Array Substring Exact Deduplication

# input
python -m text_dedup.suffix_array \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/suffix_array/oscar_gl_dedup" \
    --column "text" \
    --google_repo_path "/Users/chenghao/Downloads/Projects/text-dedup/deduplicate-text-datasets" \
    --use_auth_token true

# output
INFO     Loading                       : 2.75 seconds
INFO     Preprocessing                 : 4.78 seconds
INFO     SuffixArray                   : 98.29 seconds
INFO     SelfSimilar                   : 4.24 seconds
INFO     Restore                       : 0.25 seconds
INFO     Deduplicate                   : 6.23 seconds
INFO     Saving                        : 8.91 seconds
INFO     Total                         : 125.45 seconds
INFO     Before                        : 180332342 bytes (88803)
INFO     After                         : 97646271 bytes (40404)
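
The command above delegates the heavy lifting to the external repository passed via `--google_repo_path`. To illustrate the underlying idea only, here is a deliberately naive pure-Python sketch: sort all suffixes, then compare neighbours to find substrings that occur more than once. All function names are hypothetical, and the quadratic construction is unsuitable for real datasets.

```python
def build_suffix_array(text: str) -> list[int]:
    """Naive construction: sort suffix start positions by the suffix text."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def common_prefix_length(a: str, b: str) -> int:
    """Number of leading characters shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def repeated_substrings(text: str, min_length: int) -> set[str]:
    """Shared prefixes of neighbouring suffixes that reach min_length.

    Suffixes that start with the same substring sit next to each other in
    sorted order, so scanning neighbours is enough to detect every repeat.
    """
    sa = build_suffix_array(text)
    found = set()
    for prev, curr in zip(sa, sa[1:]):
        lcp = common_prefix_length(text[prev:], text[curr:])
        if lcp >= min_length:
            found.add(text[prev:prev + lcp])
    return found


if __name__ == "__main__":
    print(repeated_substrings("the cat sat. the cat ran.", min_length=8))
```

A deduplication pass would then remove all but one occurrence of each reported span; that step is omitted here.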

MinHash Near Deduplication

# input
python -m text_dedup.minhash \
  --path "oscar-corpus/OSCAR-2201" \
  --name "gl" \
  --split "train" \
  --cache_dir "./cache" \
  --output "output/minhash/oscar_gl_dedup" \
  --column "text" \
  --batch_size 10000 \
  --use_auth_token true

# output
INFO     Loading                         : 2.62 seconds
INFO     MinHashing                      : 0.08 seconds
INFO     Clustering                      : 2.20 seconds
INFO     Filtering                       : 0.53 seconds
INFO     Saving                          : 9.86 seconds
INFO     Total                           : 15.29 seconds
INFO     Data Number (before)            : 88803
INFO     Data Number (after)             : 44124 (49.69%)
INFO     Duplicate Number                : 44679 (50.31%)
INFO     🤗 Happy Deduplicating 🤗
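
For readers new to MinHash, the following is a minimal standard-library sketch of the idea, not the code in `text_dedup.minhash`. It builds word n-grams, keeps the minimum hash per seeded hash function, and estimates Jaccard similarity as the fraction of matching signature slots. The defaults (`n=5`, `num_perm=250`) are borrowed from the Spark log earlier in this document purely for illustration, and all function names are hypothetical.

```python
import hashlib


def word_ngrams(text: str, n: int = 5) -> set[str]:
    """Set of word n-grams; short texts collapse to a single item."""
    tokens = text.split()
    if len(tokens) < n:
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def _seeded_hash(item: str, seed: int) -> int:
    """64-bit hash of item; the seed acts as a separate hash function."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt)
    return int.from_bytes(digest.digest(), "little")


def minhash_signature(items: set[str], num_perm: int = 250) -> list[int]:
    """Minimum hash value of the set under each of num_perm hash functions."""
    if not items:
        raise ValueError("cannot sketch an empty set")
    return [min(_seeded_hash(item, seed) for item in items)
            for seed in range(num_perm)]


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of signature slots on which the two sketches agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    a = minhash_signature({str(i) for i in range(100)})
    b = minhash_signature({str(i) for i in range(50, 150)})
    print(f"estimated Jaccard: {estimate_jaccard(a, b):.3f} (true value is 1/3)")
```

In practice the signatures are then grouped with LSH so that only likely near-duplicates are compared.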

SimHash Near Deduplication

# input
python -m text_dedup.simhash \
  --path "oscar-corpus/OSCAR-2201" \
  --name "gl" \
  --split "train" \
  --cache_dir "./cache" \
  --output "output/simhash/oscar_gl_dedup" \
  --column "text" \
  --batch_size 10000 \
  --use_auth_token true

# output
INFO     Loading                         : 2.60 seconds
INFO     SimHashing                      : 0.04 seconds
INFO     Indexing                        : 28.88 seconds
INFO     Filtering                       : 0.88 seconds
INFO     Saving                          : 10.41 seconds
INFO     Total                           : 42.80 seconds
INFO     Data Number (before)            : 88803
INFO     Data Number (after)             : 46163 (51.98%)
INFO     Duplicate Number                : 42640 (48.02%)
INFO     🤗 Happy Deduplicating 🤗
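
As a rough illustration of what a SimHash fingerprint is, the sketch below hashes each whitespace token, lets every token vote on each bit, and compares fingerprints by Hamming distance. It is a simplified stand-in under stated assumptions (unweighted word tokens, BLAKE2b as the token hash), not the implementation in `text_dedup.simhash`; the function names are hypothetical.

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """Fingerprint where bit i is set if most token hashes have bit i set."""
    votes = [0] * bits
    for token in text.split():
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=bits // 8)
        value = int.from_bytes(digest.digest(), "big")
        for i in range(bits):
            votes[i] += 1 if (value >> i) & 1 else -1
    fingerprint = 0
    for i, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions on which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
    doc_b = "the quick brown fox jumps over the lazy dog near the river shore"
    print(hamming_distance(simhash(doc_a), simhash(doc_b)))
```

Documents whose fingerprints differ in only a few bits are treated as near-duplicate candidates; the cut-off is a tuning choice.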

Exact Hash Exact Deduplication

# input
python -m text_dedup.exact_hash \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/exact_hash/oscar_gl_dedup" \
    --column "text" \
    --batch_size 1000 \
    --use_auth_token true

# output
INFO     Loading                       : 2.95s
INFO     Processing                    : 3.79s
INFO     Filtering                     : 0.10s
INFO     Saving                        : 2.89s
INFO     Total                         : 9.72s
INFO     Before                        : 88803
INFO     After                         : 47049
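
Exact-hash deduplication is simple enough to sketch in a few lines. The example below shows a document-level pass and a line-level pass in the spirit of the "document-level, line-level/ccnet" feature listed earlier. It is an assumption-laden illustration (SHA-256 digests, in-memory set, hypothetical function names), not the code in `text_dedup.exact_hash`.

```python
import hashlib


def _digest(text: str) -> bytes:
    """Stable fingerprint of a string."""
    return hashlib.sha256(text.encode("utf-8")).digest()


def dedup_documents(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each distinct document."""
    seen: set[bytes] = set()
    kept = []
    for doc in docs:
        key = _digest(doc)
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept


def dedup_lines(docs: list[str]) -> list[str]:
    """Drop any line already seen earlier in the corpus; drop emptied docs."""
    seen: set[bytes] = set()
    kept = []
    for doc in docs:
        fresh = []
        for line in doc.split("\n"):
            key = _digest(line)
            if key not in seen:
                seen.add(key)
                fresh.append(line)
        if fresh:
            kept.append("\n".join(fresh))
    return kept


if __name__ == "__main__":
    print(dedup_documents(["a", "b", "a"]))
    print(dedup_lines(["header\nfirst", "header\nsecond"]))
```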

Bloom Filter Exact Deduplication

# input
python -m text_dedup.bloom_filter \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/bloom_filter/oscar_gl_dedup" \
    --error_rate 1e-5 \
    --column "text" \
    --use_auth_token true \
    --batch_size 1000

# output
INFO     Loading                       : 2.72s
INFO     Processing                    : 4.84s
INFO     Filtering                     : 0.10s
INFO     Saving                        : 2.88s
INFO     Total                         : 10.54s
INFO     Before                        : 88803
INFO     After                         : 47045
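
A Bloom filter trades a small false-positive rate for a fixed memory footprint. The run above kept 47045 rows against 47049 for the exact-hash run; a small gap of this kind is what false positives would produce, although the source does not state the cause. The following is a minimal sketch using the textbook sizing formulas and double hashing. The class and its parameters are hypothetical and are not the filter used by `text_dedup.bloom_filter`.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter sized for a target capacity and error rate."""

    def __init__(self, capacity: int, error_rate: float = 1e-5):
        # Textbook sizing: m = -n ln(p) / (ln 2)^2, k = (m / n) ln 2
        self.size = max(1, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.size / capacity * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two 64-bit values.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size

    def add(self, item: str) -> bool:
        """Insert item; return True if it was (probably) already present."""
        present = True
        for pos in self._positions(item):
            byte, mask = pos >> 3, 1 << (pos & 7)
            if not self.bits[byte] & mask:
                present = False
                self.bits[byte] |= mask
        return present


if __name__ == "__main__":
    seen = BloomFilter(capacity=1000, error_rate=1e-5)
    docs = ["a", "b", "a", "c", "b"]
    print([d for d in docs if not seen.add(d)])
```

Because a false positive discards a document that was never seen, the error rate should be chosen with the dataset size in mind.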

Benchmarks

Note

The Spark implementation has some overhead for small datasets, so I recommend using the script only when you have a large dataset and enough compute resources.

pinecone/core-2020-05-10-deduplication

See tests/benchmark_core.py for reproduction.

Algorithm	Precision (Duplicates)	Recall (Duplicates)	Precision (Non Duplicates)	Recall (Non Duplicates)	Macro F1 score	Accuracy	Time
UniSim	0.9307	0.8924	0.9055	0.9394	0.9181	0.9054	1305.79s
MinHash Spark	0.957	0.9445	0.9471	0.959	0.952	0.9202	691.77s
MinHash	0.9594	0.9445	0.9474	0.9616	0.9534	0.924	18.88s
SimHash	0.9042	0.721	0.792	0.9329	0.8481	0.8321	644.36s
Exact Title	0.8302	0.5521	0.7098	0.9065	0.77	0.7456	-
Exact Title Matching ¹	0.830	0.50	0.709	0.992	0.757	0.746	-
Simhash Matching ¹	0.697	0.247	0.598	0.985	0.631	0.616	-
Document Vector Similarity ¹	0.912	0.779	0.861	0.986	0.885	0.883	-
Hybrid Method ¹	0.908	0.828	0.899	0.979	0.904	0.903	-
LaBSE ²	0.937	0.923	0.930	0.943	0.933	0.919	-
Multilingual USE ²	0.917	0.907	0.918	0.927	0.917	0.909	-
Multilingual E5-Base ²	0.931	0.908	0.919	0.939	0.924	0.920	-
MinHash + LSH ²	0.929	0.902	0.915	0.938	0.921	0.918	-
RETSim Partial-Dup ²	0.945	0.941	0.945	0.949	0.945	0.928	-
RETSim Near-Dup ²	0.928	0.937	0.942	0.934	0.935	0.926	-

NEWS-COPY

See tests/benchmark_news.py for reproduction.

Adjusted Rand Index (ARI) on the NEWS-COPY dataset:

Model/Algorithm	ARI
SimHash	0.612
MinHash (Spark)	0.740
MinHash	0.742
RETSim Near-Dup + ANN*	0.051
n-gram ³	0.440
SimHash ²	0.695
MinHash ³	0.737
MinHash ²	0.783
Multilingual USE ²	0.730
Multilingual E5-Base ²	0.742
S-BERT ³	0.700
RETSim Partial-Dup ²	0.831
RETSim Near-Dup ²	0.704
Re-ranking ³	0.937
Bi-encoder ³	0.915

*: I can't seem to reproduce the results from the paper.
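
For context on the metric, the Adjusted Rand Index compares two clusterings by counting pairs of items that are grouped together in both, corrected for chance agreement. The sketch below is a small standard-library implementation of the usual contingency-table formula; it is offered as an illustration only and is not taken from tests/benchmark_news.py.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred) -> float:
    """ARI between two labelings of the same items (1.0 = identical grouping)."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: nothing to disagree on

    # Pairs that share a cluster in both, in the truth, and in the prediction.
    both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    in_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    in_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = in_true * in_pred / total_pairs
    maximum = (in_true + in_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. every item in its own cluster
    return (both - expected) / (maximum - expected)


if __name__ == "__main__":
    print(adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2]))
```

Cluster label values themselves do not matter, only which items are grouped together.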

License

Apache 2.0

Citations

Generally, you can cite this repository as:

@software{chenghao_mou_2023_8364980,
  author       = {Chenghao Mou and
                  Chris Ha and
                  Kenneth Enevoldsen and
                  Peiyuan Liu},
  title        = {ChenghaoMou/text-dedup: Reference Snapshot},
  month        = sep,
  year         = 2023,
  publisher    = {Zenodo},
  version      = {2023.09.20},
  doi          = {10.5281/zenodo.8364980},
  url          = {https://doi.org/10.5281/zenodo.8364980}
}

The Spark version was born from BigCode (Apache 2.0) and BigScience (Apache 2.0), and you can cite the original paper if you want:

@article{
kocetkov2023the,
title={The Stack: 3 {TB} of permissively licensed source code},
author={Denis Kocetkov and Raymond Li and Loubna Ben allal and Jia LI and Chenghao Mou and Yacine Jernite and Margaret Mitchell and Carlos Mu{\~n}oz Ferrandis and Sean Hughes and Thomas Wolf and Dzmitry Bahdanau and Leandro Von Werra and Harm de Vries},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2023},
url={https://openreview.net/forum?id=pxpbTdUEpD},
note={}
}

¹ Deduplication of Scholarly Documents using Locality Sensitive Hashing and Word Embeddings
² RETSim: Resilient and Efficient Text Similarity
³ Noise-Robust De-Duplication at Scale