
```
pip install text-dedup
```

or

```
pip install git+https://github.com/ChenghaoMou/text-dedup
```
This repository contains a collection of text deduplication scripts that are ready to use out of the box, or to modify for your own needs.
I also have big plans for the future.

However, I have no intention of building a general-purpose deduplication library, which was the early goal of this repo. I will also gradually retire the PyPI package. The reason is that each use case can be wildly different and requires careful design and consideration. I sincerely encourage you to read the scripts first (they are relatively short) so that you understand the caveats of using them. You can use them to bootstrap your own script, or simply use them as a reference.
This repository draws inspiration from a number of existing projects and is heavily influenced by lessons learned from my participation in BigScience (Apache 2.0) and BigCode (Apache 2.0). There is a blog post about that journey. Feedback is welcome!
First, modify text_dedup/minhash_spark.py to fit your own project and data.

Assuming you have a downloaded dataset (in parquet files) under ./temp-data, you can process the files with local compute as follows:
```bash
export PYSPARK_PYTHON="path to your python with scipy, xxhash, and numpy installed"
spark-submit --executor-memory 16g \
    --driver-memory 20g \
    --executor-cores 3 \
    --num-executors 2 \
    --packages graphframes:graphframes:0.8.2-spark3.2-s_2.12 \
    --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=./log4j.properties" \
    --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=./log4j.properties" \
    text_dedup/minhash_spark.py \
    --input "./temp-data" \
    --output "./temp-output" \
    --column "text" \
    --threshold 0.7
```

```
DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------
DEBUG __main__ - Using B=25, R=10
DEBUG __main__ - Loaded documents: 88803
DEBUG __main__ - args.input='./temp-data'
DEBUG __main__ - args.output='./temp-output'
DEBUG __main__ - args.threshold=0.7
DEBUG __main__ - args.ngram_size=5
DEBUG __main__ - args.min_length=5
DEBUG __main__ - args.num_perm=250
DEBUG __main__ - args.column='text'
DEBUG __main__ - id : bigint
DEBUG __main__ - text : string
DEBUG __main__ - meta : struct<warc_headers:struct<warc-record-id:string,warc-date:string,content-type:string,content-length:int,warc-type:string,warc-identified-content-language:string,warc-refers-to:string,warc-target-uri:string,warc-block-digest:string>,identification:struct<label:string,prob:float>,annotations:array<string>,line_identifications:array<struct<label:string,prob:float>>>
DEBUG __main__ - __id__ : bigint
DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------
DEBUG __main__ - Initial edges: 52102
DEBUG __main__ - Edges DataFrame: 52102
DEBUG __main__ - Vertices DataFrame: 50206
DEBUG __main__ - Assignment DataFrame: 50206
DEBUG __main__ - Merging records: 88803
INFO __main__ - Saving with 1 partitions and 44092 rows each
DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------
DEBUG __main__ - Number of rows before: 88803
DEBUG __main__ - Number of rows after: 44092
DEBUG __main__ - Percentage of rows kept: 49.65%
DEBUG __main__ - Output: ./temp-output
DEBUG __main__ - Time: 68.80s
DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------
```
Or check out bigcode-v2/run.sh for how to run the job with GCP DataProc.
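For intuition on the `B=25, R=10` values in the log above: the script computes a MinHash signature of `num_perm = B × R = 250` hashes per document and splits it into B bands of R rows; two documents become a candidate pair when they agree on all R rows of at least one band. For a pair with Jaccard similarity `s`, that happens with probability `1 - (1 - s^R)^B`, which crosses roughly 0.5 around the chosen `--threshold` of 0.7. A minimal sketch (plain Python, not part of the repository) to inspect that curve:

```python
# Probability that two documents with Jaccard similarity s become an LSH
# candidate pair, given B bands of R rows each (B=25, R=10 as logged above).
B, R = 25, 10

def candidate_probability(s: float, b: int = B, r: int = R) -> float:
    return 1.0 - (1.0 - s**r) ** b

for s in (0.5, 0.6, 0.7, 0.8, 0.9):
    print(f"similarity={s:.1f} -> candidate probability={candidate_probability(s):.3f}")
```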
Based on Google's RETSim model (GitHub, arXiv), this is an embedding-based near-deduplication method.

For large datasets, it requires GPU(s) for fast inference.
```bash
python text_dedup/ann_unisim.py --path truthful_qa --name generation --split validation --output temp --column question
```

Output:

```
INFO Load Dataset : 5.56s
INFO Index Dataset : 8.13s
INFO Clustering : 8.72s
INFO Filtering : 0.35s
INFO Saving : 0.01s
INFO Cleaning : 0.00s
INFO Total : 22.77s
INFO Before : 817
INFO After : 788
```
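Conceptually, the embedding-based approach maps every document to a vector and treats pairs whose similarity exceeds a threshold as near-duplicates. The sketch below only illustrates that idea, with a hypothetical placeholder `embed` function and a brute-force similarity matrix; the actual `ann_unisim.py` script uses RETSim embeddings together with an approximate-nearest-neighbour index, which is what makes large datasets and GPU inference practical.

```python
# Conceptual sketch of embedding-based near-deduplication (NOT the actual
# ann_unisim.py implementation): embed every document, link pairs whose cosine
# similarity exceeds a threshold, and keep one document per group.
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    # Placeholder: random unit vectors standing in for a real model such as RETSim.
    rng = np.random.default_rng(0)
    vecs = rng.normal(size=(len(texts), 256))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def near_dedup(texts: list[str], threshold: float = 0.9) -> list[str]:
    vecs = embed(texts)
    sims = vecs @ vecs.T  # cosine similarity (vectors are unit-normalised)
    keep, removed = [], set()
    for i in range(len(texts)):
        if i in removed:
            continue
        keep.append(texts[i])
        # Drop every later document that is too similar to the kept one.
        removed.update(j for j in range(i + 1, len(texts)) if sims[i, j] >= threshold)
    return keep
```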
```
# input
python -m text_dedup.suffix_array \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/suffix_array/oscar_gl_dedup" \
    --column "text" \
    --google_repo_path "/Users/chenghao/Downloads/Projects/text-dedup/deduplicate-text-datasets" \
    --use_auth_token true
# output
INFO Loading : 2.75 seconds
INFO Preprocessing : 4.78 seconds
INFO SuffixArray : 98.29 seconds
INFO SelfSimilar : 4.24 seconds
INFO Restore : 0.25 seconds
INFO Deduplicate : 6.23 seconds
INFO Saving : 8.91 seconds
INFO Total : 125.45 seconds
INFO Before : 180332342 bytes (88803)
INFO After : 97646271 bytes (40404)
```
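The suffix array approach (from Google's deduplicate-text-datasets, hence the `--google_repo_path` flag) removes byte spans that occur verbatim more than once in the corpus. A toy illustration of why a suffix array makes this easy, using a naive construction and a hypothetical 15-character minimum span (the real pipeline works on the byte level of the concatenated corpus, with a much larger threshold and an efficient suffix array builder):

```python
# Toy illustration of suffix-array-based duplicate substring detection (NOT the
# actual implementation): sort all suffixes of the corpus; any substring that
# occurs twice shows up as a long common prefix of two adjacent sorted suffixes.
def duplicate_spans(corpus: str, min_len: int = 15) -> set[str]:
    suffixes = sorted(range(len(corpus)), key=lambda i: corpus[i:])  # naive suffix array
    spans = set()
    for a, b in zip(suffixes, suffixes[1:]):
        # Length of the common prefix of two neighbouring suffixes.
        lcp = 0
        while (a + lcp < len(corpus) and b + lcp < len(corpus)
               and corpus[a + lcp] == corpus[b + lcp]):
            lcp += 1
        if lcp >= min_len:
            spans.add(corpus[a:a + lcp])
    return spans

spans = duplicate_spans("the quick brown fox ... the quick brown fox jumps")
print(max(spans, key=len))  # -> 'the quick brown fox '
```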
```
# input
python -m text_dedup.minhash \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/minhash/oscar_gl_dedup" \
    --column "text" \
    --batch_size 10000 \
    --use_auth_token true
# output
INFO Loading : 2.62 seconds
INFO MinHashing : 0.08 seconds
INFO Clustering : 2.20 seconds
INFO Filtering : 0.53 seconds
INFO Saving : 9.86 seconds
INFO Total : 15.29 seconds
INFO Data Number (before) : 88803
INFO Data Number (after) : 44124 (49.69%)
INFO Duplicate Number : 44679 (50.31%)
INFO 🤗 Happy Deduplicating 🤗
```
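For intuition, a MinHash signature can be computed by hand as below: hash every n-gram of a document with many different seeds and keep the minimum per seed; the fraction of matching signature positions between two documents is an unbiased estimate of their Jaccard similarity. This is a simplified sketch only; the actual script builds on this with configurable n-grams and LSH banding so that it never compares all pairs directly.

```python
# Simplified MinHash: estimate the Jaccard similarity of two documents from the
# element-wise minimum of many seeded hashes over their word n-gram sets.
import hashlib

def shingles(text: str, n: int = 5) -> set[str]:
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def minhash_signature(text: str, num_perm: int = 128) -> list[int]:
    grams = shingles(text)
    return [
        min(
            int.from_bytes(hashlib.blake2b(f"{seed}:{g}".encode(), digest_size=8).digest(), "big")
            for g in grams
        )
        for seed in range(num_perm)
    ]

def estimated_jaccard(a: str, b: str) -> float:
    sa, sb = minhash_signature(a), minhash_signature(b)
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)

print(estimated_jaccard(
    "the quick brown fox jumps over the lazy dog near the river bank",
    "the quick brown fox jumps over the lazy dog near the old river bank",
))
```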
```
# input
python -m text_dedup.simhash \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/simhash/oscar_gl_dedup" \
    --column "text" \
    --batch_size 10000 \
    --use_auth_token true
# output
INFO Loading : 2.60 seconds
INFO SimHashing : 0.04 seconds
INFO Indexing : 28.88 seconds
INFO Filtering : 0.88 seconds
INFO Saving : 10.41 seconds
INFO Total : 42.80 seconds
INFO Data Number (before) : 88803
INFO Data Number (after) : 46163 (51.98%)
INFO Duplicate Number : 42640 (48.02%)
INFO 🤗 Happy Deduplicating 🤗
```
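SimHash takes a different route: each document is compressed into a single 64-bit fingerprint, and near-duplicates are pairs whose fingerprints differ in only a few bits (small Hamming distance). A minimal sketch of the fingerprinting step, assuming simple whitespace tokens; real SimHash deduplication additionally indexes fingerprint blocks so candidates can be found without comparing every pair.

```python
# Minimal 64-bit SimHash: hash every token, add +1/-1 votes per bit position,
# and keep the sign of each position as the fingerprint bit.
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    counts = [0] * bits
    for token in text.split():
        h = int.from_bytes(hashlib.blake2b(token.encode(), digest_size=8).digest(), "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)

def hamming_distance(a: int, b: int) -> int:
    return (a ^ b).bit_count()  # Python 3.10+

# Documents sharing most of their tokens tend to have a small Hamming distance.
print(hamming_distance(
    simhash("the quick brown fox jumps over the lazy dog"),
    simhash("the quick brown fox jumps over the lazy cat"),
))
```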
```
# input
python -m text_dedup.exact_hash \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/exact_hash/oscar_gl_dedup" \
    --column "text" \
    --batch_size 1000 \
    --use_auth_token true
# output
INFO Loading : 2.95s
INFO Processing : 3.79s
INFO Filtering : 0.10s
INFO Saving : 2.89s
INFO Total : 9.72s
INFO Before : 88803
INFO After : 47049
```
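Exact hash deduplication is the simplest of the methods: keep a document only if the hash of its content has not been seen before. A minimal sketch of the idea (the hash function here is an arbitrary illustrative choice, and the actual script processes the dataset in batches):

```python
# Minimal exact deduplication: keep a document only if its content hash is new.
# md5 is an arbitrary choice for illustration, not necessarily what the script uses.
import hashlib

def exact_dedup(docs: list[str]) -> list[str]:
    seen: set[bytes] = set()
    kept = []
    for doc in docs:
        digest = hashlib.md5(doc.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

print(exact_dedup(["a", "b", "a"]))  # -> ['a', 'b']
```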
```
# input
python -m text_dedup.bloom_filter \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/bloom_filter/oscar_gl_dedup" \
    --error_rate 1e-5 \
    --column "text" \
    --use_auth_token true \
    --batch_size 1000
# output
INFO Loading : 2.72s
INFO Processing : 4.84s
INFO Filtering : 0.10s
INFO Saving : 2.88s
INFO Total : 10.54s
INFO Before : 88803
INFO After : 47045
```
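A Bloom filter achieves the same exact-duplicate filtering in bounded memory: instead of storing every seen hash, membership is tracked in a fixed bit array with several hash functions, at the cost of a small false-positive rate (the `--error_rate` flag above), meaning a tiny fraction of unique documents may be discarded as duplicates. A hand-rolled sketch of the idea, not the script's actual implementation:

```python
# Minimal Bloom filter for exact deduplication: k hash functions set/check k bit
# positions in a fixed-size bit array. False positives (unique documents mistaken
# for duplicates) occur at a rate controlled by the array size and k.
import hashlib

class BloomFilter:
    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(item.encode(), digest_size=8, salt=bytes([seed])).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add_and_check(self, item: str) -> bool:
        """Add item; return True if it was (probably) seen before."""
        seen = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not (self.bits[byte] >> bit) & 1:
                seen = False
                self.bits[byte] |= 1 << bit
        return seen

bf = BloomFilter()
docs = ["a", "b", "a"]
print([d for d in docs if not bf.add_and_check(d)])  # -> ['a', 'b']
```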
Note: the Spark implementation has some overhead for small datasets, so I recommend using that script only when you have a large dataset and enough compute resources.

For reproduction, see tests/benchmark_core.py.
| Algorithm | Precision (Duplicates) | Recall (Duplicates) | Precision (Non-Duplicates) | Recall (Non-Duplicates) | Macro F1 Score | Accuracy | Time |
|---|---|---|---|---|---|---|---|
| UniSim | 0.9307 | 0.8924 | 0.9055 | 0.9394 | 0.9181 | 0.9054 | 1305.79 |
| MinHash Spark | 0.957 | 0.9445 | 0.9471 | 0.959 | 0.952 | 0.9202 | 691.77 |
| MinHash | 0.9594 | 0.9445 | 0.9474 | 0.9616 | 0.9534 | 0.924 | 18.88 |
| SimHash | 0.9042 | 0.721 | 0.792 | 0.9329 | 0.8481 | 0.8321 | 644.36 |
| Exact Title | 0.8302 | 0.5521 | 0.7098 | 0.9065 | 0.77 | 0.7456 | - |
| Exact Title Matching [^1] | 0.830 | 0.50 | 0.709 | 0.992 | 0.757 | 0.746 | - |
| SimHash Matching [^1] | 0.697 | 0.247 | 0.598 | 0.985 | 0.631 | 0.616 | - |
| Document Vector Similarity [^1] | 0.912 | 0.779 | 0.861 | 0.986 | 0.885 | 0.883 | - |
| Hybrid Method [^1] | 0.908 | 0.828 | 0.899 | 0.979 | 0.904 | 0.903 | - |
| LaBSE [^2] | 0.937 | 0.923 | 0.930 | 0.943 | 0.933 | 0.919 | - |
| Multilingual USE [^2] | 0.917 | 0.907 | 0.918 | 0.927 | 0.917 | 0.909 | - |
| Multilingual E5-base [^2] | 0.931 | 0.908 | 0.919 | 0.939 | 0.924 | 0.920 | - |
| MinHash + LSH [^2] | 0.929 | 0.902 | 0.915 | 0.938 | 0.921 | 0.918 | - |
| RETSim Partial-Dup [^2] | 0.945 | 0.941 | 0.945 | 0.949 | 0.945 | 0.928 | - |
| RETSim Near-Dup [^2] | 0.928 | 0.937 | 0.942 | 0.934 | 0.935 | 0.926 | - |
For reproduction, see tests/benchmark_news.py.

Adjusted Rand Index (ARI) on the NEWS-COPY dataset:
| Model/Algorithm | ARI |
|---|---|
| SimHash | 0.612 |
| MinHash (Spark) | 0.740 |
| MinHash | 0.742 |
| RETSim Near-Dup + ANN* | 0.051 |
| N-gram [^3] | 0.440 |
| SimHash [^2] | 0.695 |
| MinHash [^3] | 0.737 |
| MinHash [^2] | 0.783 |
| Multilingual USE [^2] | 0.730 |
| Multilingual E5-base [^2] | 0.742 |
| S-BERT [^3] | 0.700 |
| RETSim Partial-Dup [^2] | 0.831 |
| RETSim Near-Dup [^2] | 0.704 |
| Re-ranking [^3] | 0.937 |
| Bi-encoder [^3] | 0.915 |
*: I could not seem to reproduce the results from the paper.
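For reference, ARI scores a predicted grouping of documents into duplicate clusters against the gold grouping, corrected for chance agreement (1.0 means identical clusterings, values near 0 mean no better than random). It can be computed with scikit-learn; the labels below are made-up placeholders, not benchmark data:

```python
# ARI compares two clusterings of the same documents, corrected for chance.
from sklearn.metrics import adjusted_rand_score

gold_clusters      = [0, 0, 1, 1, 2, 2]  # hypothetical gold duplicate groups
predicted_clusters = [0, 0, 1, 2, 2, 2]  # hypothetical algorithm output

print(adjusted_rand_score(gold_clusters, predicted_clusters))
```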
License: Apache 2.0

In general, you can cite this repository as:
```bibtex
@software{chenghao_mou_2023_8364980,
  author    = {Chenghao Mou and
               Chris Ha and
               Kenneth Enevoldsen and
               Peiyuan Liu},
  title     = {ChenghaoMou/text-dedup: Reference Snapshot},
  month     = sep,
  year      = 2023,
  publisher = {Zenodo},
  version   = {2023.09.20},
  doi       = {10.5281/zenodo.8364980},
  url       = {https://doi.org/10.5281/zenodo.8364980}
}
```

The Spark version came out of BigCode (Apache 2.0) and BigScience (Apache 2.0); you can cite the original paper if you want:
```bibtex
@article{kocetkov2023the,
  title   = {The Stack: 3 {TB} of permissively licensed source code},
  author  = {Denis Kocetkov and Raymond Li and Loubna Ben allal and Jia LI and Chenghao Mou and Yacine Jernite and Margaret Mitchell and Carlos Mu{\~n}oz Ferrandis and Sean Hughes and Thomas Wolf and Dzmitry Bahdanau and Leandro Von Werra and Harm de Vries},
  journal = {Transactions on Machine Learning Research},
  issn    = {2835-8856},
  year    = {2023},
  url     = {https://openreview.net/forum?id=pxpbTdUEpD},
  note    = {}
}
```

[^1]: Deduplication of Scholarly Documents using Locality Sensitive Hashing and Word Embeddings
[^2]: RETSim: Resilient and Efficient Text Similarity
[^3]: Noise-Robust De-Duplication at Scale