
```
pip install text-dedup
```

or

```
pip install git+https://github.com/ChenghaoMou/text-dedup
```
This repository contains a collection of text deduplication scripts that are ready to use out of the box, or to modify for your own needs.
I also have big plans for the future.

However, I have no intention of building a general-purpose deduplication library, which was the early goal of this repo. I will also gradually retire the PyPI package. The reason is that each use case can be wildly different and requires careful design and consideration. I sincerely encourage you to read the scripts first (they are relatively short) so that you understand the caveats of using them. You can use them to bootstrap your own script, or simply use them as a reference.
This repository draws inspiration from a number of existing projects and is heavily influenced by lessons learned from my participation in BigScience (Apache 2.0) and BigCode (Apache 2.0). There is a blog post about that journey. Feedback is welcome!
First, modify text_dedup/minhash_spark.py to fit your own project and data.

Assuming you have a downloaded dataset (in parquet files) under ./temp-data, you can process the files with local compute as follows:
```bash
export PYSPARK_PYTHON="path to your python with scipy, xxhash, and numpy installed"
spark-submit --executor-memory 16g \
    --driver-memory 20g \
    --executor-cores 3 \
    --num-executors 2 \
    --packages graphframes:graphframes:0.8.2-spark3.2-s_2.12 \
    --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=./log4j.properties" \
    --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=./log4j.properties" \
    text_dedup/minhash_spark.py \
    --input "./temp-data" \
    --output "./temp-output" \
    --column "text" \
    --threshold 0.7
```

```
DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------
DEBUG __main__ - Using B=25, R=10
DEBUG __main__ - Loaded documents: 88803
DEBUG __main__ - args.input='./temp-data'
DEBUG __main__ - args.output='./temp-output'
DEBUG __main__ - args.threshold=0.7
DEBUG __main__ - args.ngram_size=5
DEBUG __main__ - args.min_length=5
DEBUG __main__ - args.num_perm=250
DEBUG __main__ - args.column='text'
DEBUG __main__ - id : bigint
DEBUG __main__ - text : string
DEBUG __main__ - meta : struct<warc_headers:struct<warc-record-id:string,warc-date:string,content-type:string,content-length:int,warc-type:string,warc-identified-content-language:string,warc-refers-to:string,warc-target-uri:string,warc-block-digest:string>,identification:struct<label:string,prob:float>,annotations:array<string>,line_identifications:array<struct<label:string,prob:float>>>
DEBUG __main__ - __id__ : bigint
DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------
DEBUG __main__ - Initial edges: 52102
DEBUG __main__ - Edges DataFrame: 52102
DEBUG __main__ - Vertices DataFrame: 50206
DEBUG __main__ - Assignment DataFrame: 50206
DEBUG __main__ - Merging records: 88803
INFO __main__ - Saving with 1 partitions and 44092 rows each
DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------
DEBUG __main__ - Number of rows before: 88803
DEBUG __main__ - Number of rows after: 44092
DEBUG __main__ - Percentage of rows kept: 49.65%
DEBUG __main__ - Output: ./temp-output
DEBUG __main__ - Time: 68.80s
DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------
```
Or check out bigcode-v2/run.sh for how to run the job with GCP DataProc.
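For intuition on the `B=25, R=10` values in the log above: the script computes a MinHash signature of `num_perm = B × R = 250` hashes per document and splits it into B bands of R rows; two documents become a candidate pair when they agree on all R rows of at least one band. For a pair with Jaccard similarity `s`, that happens with probability `1 - (1 - s^R)^B`, which crosses roughly 0.5 around the chosen `--threshold` of 0.7. A minimal sketch (plain Python, not part of the repository) to inspect that curve:

```python
# Probability that two documents with Jaccard similarity s become an LSH
# candidate pair, given B bands of R rows each (B=25, R=10 as logged above).
B, R = 25, 10

def candidate_probability(s: float, b: int = B, r: int = R) -> float:
    return 1.0 - (1.0 - s**r) ** b

for s in (0.5, 0.6, 0.7, 0.8, 0.9):
    print(f"similarity={s:.1f} -> candidate probability={candidate_probability(s):.3f}")
```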
Based on Google's RETSim model (GitHub, arXiv), this is an embedding-based near-deduplication method.

For large datasets, it requires GPU(s) for fast inference.
```bash
python text_dedup/ann_unisim.py --path truthful_qa --name generation --split validation --output temp --column question
```

Output:

```
INFO Load Dataset : 5.56s
INFO Index Dataset : 8.13s
INFO Clustering : 8.72s
INFO Filtering : 0.35s
INFO Saving : 0.01s
INFO Cleaning : 0.00s
INFO Total : 22.77s
INFO Before : 817
INFO After : 788
```
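Conceptually, the embedding-based approach maps every document to a vector and treats pairs whose similarity exceeds a threshold as near-duplicates. The sketch below only illustrates that idea, with a hypothetical placeholder `embed` function and a brute-force similarity matrix; the actual `ann_unisim.py` script uses RETSim embeddings together with an approximate-nearest-neighbour index, which is what makes large datasets and GPU inference practical.

```python
# Conceptual sketch of embedding-based near-deduplication (NOT the actual
# ann_unisim.py implementation): embed every document, link pairs whose cosine
# similarity exceeds a threshold, and keep one document per group.
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    # Placeholder: random unit vectors standing in for a real model such as RETSim.
    rng = np.random.default_rng(0)
    vecs = rng.normal(size=(len(texts), 256))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def near_dedup(texts: list[str], threshold: float = 0.9) -> list[str]:
    vecs = embed(texts)
    sims = vecs @ vecs.T  # cosine similarity (vectors are unit-normalised)
    keep, removed = [], set()
    for i in range(len(texts)):
        if i in removed:
            continue
        keep.append(texts[i])
        # Drop every later document that is too similar to the kept one.
        removed.update(j for j in range(i + 1, len(texts)) if sims[i, j] >= threshold)
    return keep
```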
```
# input
python -m text_dedup.suffix_array \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/suffix_array/oscar_gl_dedup" \
    --column "text" \
    --google_repo_path "/Users/chenghao/Downloads/Projects/text-dedup/deduplicate-text-datasets" \
    --use_auth_token true
# output
INFO Loading : 2.75 seconds
INFO Preprocessing : 4.78 seconds
INFO SuffixArray : 98.29 seconds
INFO SelfSimilar : 4.24 seconds
INFO Restore : 0.25 seconds
INFO Deduplicate : 6.23 seconds
INFO Saving : 8.91 seconds
INFO Total : 125.45 seconds
INFO Before : 180332342 bytes (88803)
INFO After : 97646271 bytes (40404)
```
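The suffix array approach (from Google's deduplicate-text-datasets, hence the `--google_repo_path` flag) removes byte spans that occur verbatim more than once in the corpus. A toy illustration of why a suffix array makes this easy, using a naive construction and a hypothetical 15-character minimum span (the real pipeline works on the byte level of the concatenated corpus, with a much larger threshold and an efficient suffix array builder):

```python
# Toy illustration of suffix-array-based duplicate substring detection (NOT the
# actual implementation): sort all suffixes of the corpus; any substring that
# occurs twice shows up as a long common prefix of two adjacent sorted suffixes.
def duplicate_spans(corpus: str, min_len: int = 15) -> set[str]:
    suffixes = sorted(range(len(corpus)), key=lambda i: corpus[i:])  # naive suffix array
    spans = set()
    for a, b in zip(suffixes, suffixes[1:]):
        # Length of the common prefix of two neighbouring suffixes.
        lcp = 0
        while (a + lcp < len(corpus) and b + lcp < len(corpus)
               and corpus[a + lcp] == corpus[b + lcp]):
            lcp += 1
        if lcp >= min_len:
            spans.add(corpus[a:a + lcp])
    return spans

spans = duplicate_spans("the quick brown fox ... the quick brown fox jumps")
print(max(spans, key=len))  # -> 'the quick brown fox '
```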
```
# input
python -m text_dedup.minhash \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/minhash/oscar_gl_dedup" \
    --column "text" \
    --batch_size 10000 \
    --use_auth_token true
# output
INFO Loading : 2.62 seconds
INFO MinHashing : 0.08 seconds
INFO Clustering : 2.20 seconds
INFO Filtering : 0.53 seconds
INFO Saving : 9.86 seconds
INFO Total : 15.29 seconds
INFO Data Number (before) : 88803
INFO Data Number (after) : 44124 (49.69%)
INFO Duplicate Number : 44679 (50.31%)
INFO 🤗 Happy Deduplicating 🤗
```
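For intuition, a MinHash signature can be computed by hand as below: hash every n-gram of a document with many different seeds and keep the minimum per seed; the fraction of matching signature positions between two documents is an unbiased estimate of their Jaccard similarity. This is a simplified sketch only; the actual script builds on this with configurable n-grams and LSH banding so that it never compares all pairs directly.

```python
# Simplified MinHash: estimate the Jaccard similarity of two documents from the
# element-wise minimum of many seeded hashes over their word n-gram sets.
import hashlib

def shingles(text: str, n: int = 5) -> set[str]:
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def minhash_signature(text: str, num_perm: int = 128) -> list[int]:
    grams = shingles(text)
    return [
        min(
            int.from_bytes(hashlib.blake2b(f"{seed}:{g}".encode(), digest_size=8).digest(), "big")
            for g in grams
        )
        for seed in range(num_perm)
    ]

def estimated_jaccard(a: str, b: str) -> float:
    sa, sb = minhash_signature(a), minhash_signature(b)
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)

print(estimated_jaccard(
    "the quick brown fox jumps over the lazy dog near the river bank",
    "the quick brown fox jumps over the lazy dog near the old river bank",
))
```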
```
# input
python -m text_dedup.simhash \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/simhash/oscar_gl_dedup" \
    --column "text" \
    --batch_size 10000 \
    --use_auth_token true
# output
INFO Loading : 2.60 seconds
INFO SimHashing : 0.04 seconds
INFO Indexing : 28.88 seconds
INFO Filtering : 0.88 seconds
INFO Saving : 10.41 seconds
INFO Total : 42.80 seconds
INFO Data Number (before) : 88803
INFO Data Number (after) : 46163 (51.98%)
INFO Duplicate Number : 42640 (48.02%)
INFO 🤗 Happy Deduplicating 🤗
```
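SimHash takes a different route: each document is compressed into a single 64-bit fingerprint, and near-duplicates are pairs whose fingerprints differ in only a few bits (small Hamming distance). A minimal sketch of the fingerprinting step, assuming simple whitespace tokens; real SimHash deduplication additionally indexes fingerprint blocks so candidates can be found without comparing every pair.

```python
# Minimal 64-bit SimHash: hash every token, add +1/-1 votes per bit position,
# and keep the sign of each position as the fingerprint bit.
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    counts = [0] * bits
    for token in text.split():
        h = int.from_bytes(hashlib.blake2b(token.encode(), digest_size=8).digest(), "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)

def hamming_distance(a: int, b: int) -> int:
    return (a ^ b).bit_count()  # Python 3.10+

# Documents sharing most of their tokens tend to have a small Hamming distance.
print(hamming_distance(
    simhash("the quick brown fox jumps over the lazy dog"),
    simhash("the quick brown fox jumps over the lazy cat"),
))
```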
```
# input
python -m text_dedup.exact_hash \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/exact_hash/oscar_gl_dedup" \
    --column "text" \
    --batch_size 1000 \
    --use_auth_token true
# output
INFO Loading : 2.95s
INFO Processing : 3.79s
INFO Filtering : 0.10s
INFO Saving : 2.89s
INFO Total : 9.72s
INFO Before : 88803
INFO After : 47049
```
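Exact hash deduplication is the simplest of the methods: keep a document only if the hash of its content has not been seen before. A minimal sketch of the idea (the hash function here is an arbitrary illustrative choice, and the actual script processes the dataset in batches):

```python
# Minimal exact deduplication: keep a document only if its content hash is new.
# md5 is an arbitrary choice for illustration, not necessarily what the script uses.
import hashlib

def exact_dedup(docs: list[str]) -> list[str]:
    seen: set[bytes] = set()
    kept = []
    for doc in docs:
        digest = hashlib.md5(doc.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

print(exact_dedup(["a", "b", "a"]))  # -> ['a', 'b']
```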
```
# input
python -m text_dedup.bloom_filter \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/bloom_filter/oscar_gl_dedup" \
    --error_rate 1e-5 \
    --column "text" \
    --use_auth_token true \
    --batch_size 1000
# output
INFO Loading : 2.72s
INFO Processing : 4.84s
INFO Filtering : 0.10s
INFO Saving : 2.88s
INFO Total : 10.54s
INFO Before : 88803
INFO After : 47045
```
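A Bloom filter achieves the same exact-duplicate filtering in bounded memory: instead of storing every seen hash, membership is tracked in a fixed bit array with several hash functions, at the cost of a small false-positive rate (the `--error_rate` flag above), meaning a tiny fraction of unique documents may be discarded as duplicates. A hand-rolled sketch of the idea, not the script's actual implementation:

```python
# Minimal Bloom filter for exact deduplication: k hash functions set/check k bit
# positions in a fixed-size bit array. False positives (unique documents mistaken
# for duplicates) occur at a rate controlled by the array size and k.
import hashlib

class BloomFilter:
    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(item.encode(), digest_size=8, salt=bytes([seed])).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add_and_check(self, item: str) -> bool:
        """Add item; return True if it was (probably) seen before."""
        seen = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not (self.bits[byte] >> bit) & 1:
                seen = False
                self.bits[byte] |= 1 << bit
        return seen

bf = BloomFilter()
docs = ["a", "b", "a"]
print([d for d in docs if not bf.add_and_check(d)])  # -> ['a', 'b']
```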
Note: the Spark implementation has some overhead for small datasets, so I recommend using that script only when you have a large dataset and enough compute resources.

For reproduction, see tests/benchmark_core.py.
| Algorithm | Precision (Duplicates) | Recall (Duplicates) | Precision (Non-Duplicates) | Recall (Non-Duplicates) | Macro F1 Score | Accuracy | Time |
|---|---|---|---|---|---|---|---|
| UniSim | 0.9307 | 0.8924 | 0.9055 | 0.9394 | 0.9181 | 0.9054 | 1305.79 |
| MinHash Spark | 0.957 | 0.9445 | 0.9471 | 0.959 | 0.952 | 0.9202 | 691.77 |
| MinHash | 0.9594 | 0.9445 | 0.9474 | 0.9616 | 0.9534 | 0.924 | 18.88 |
| SimHash | 0.9042 | 0.721 | 0.792 | 0.9329 | 0.8481 | 0.8321 | 644.36 |
| Exact Title | 0.8302 | 0.5521 | 0.7098 | 0.9065 | 0.77 | 0.7456 | - |
| Exact Title Matching [^1] | 0.830 | 0.50 | 0.709 | 0.992 | 0.757 | 0.746 | - |
| SimHash Matching [^1] | 0.697 | 0.247 | 0.598 | 0.985 | 0.631 | 0.616 | - |
| Document Vector Similarity [^1] | 0.912 | 0.779 | 0.861 | 0.986 | 0.885 | 0.883 | - |
| Hybrid Method [^1] | 0.908 | 0.828 | 0.899 | 0.979 | 0.904 | 0.903 | - |
| LaBSE [^2] | 0.937 | 0.923 | 0.930 | 0.943 | 0.933 | 0.919 | - |
| Multilingual USE [^2] | 0.917 | 0.907 | 0.918 | 0.927 | 0.917 | 0.909 | - |
| Multilingual E5-base [^2] | 0.931 | 0.908 | 0.919 | 0.939 | 0.924 | 0.920 | - |
| MinHash + LSH [^2] | 0.929 | 0.902 | 0.915 | 0.938 | 0.921 | 0.918 | - |
| RETSim Partial-Dup [^2] | 0.945 | 0.941 | 0.945 | 0.949 | 0.945 | 0.928 | - |
| RETSim Near-Dup [^2] | 0.928 | 0.937 | 0.942 | 0.934 | 0.935 | 0.926 | - |
For reproduction, see tests/benchmark_news.py.

Adjusted Rand Index (ARI) on the NEWS-COPY dataset:
| Model/Algorithm | ARI |
|---|---|
| SimHash | 0.612 |
| MinHash (Spark) | 0.740 |
| MinHash | 0.742 |
| RETSim Near-Dup + ANN* | 0.051 |
| N-gram [^3] | 0.440 |
| SimHash [^2] | 0.695 |
| MinHash [^3] | 0.737 |
| MinHash [^2] | 0.783 |
| Multilingual USE [^2] | 0.730 |
| Multilingual E5-base [^2] | 0.742 |
| S-BERT [^3] | 0.700 |
| RETSim Partial-Dup [^2] | 0.831 |
| RETSim Near-Dup [^2] | 0.704 |
| Re-ranking [^3] | 0.937 |
| Bi-encoder [^3] | 0.915 |
*: I could not seem to reproduce the results from the paper.
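For reference, ARI scores a predicted grouping of documents into duplicate clusters against the gold grouping, corrected for chance agreement (1.0 means identical clusterings, values near 0 mean no better than random). It can be computed with scikit-learn; the labels below are made-up placeholders, not benchmark data:

```python
# ARI compares two clusterings of the same documents, corrected for chance.
from sklearn.metrics import adjusted_rand_score

gold_clusters      = [0, 0, 1, 1, 2, 2]  # hypothetical gold duplicate groups
predicted_clusters = [0, 0, 1, 2, 2, 2]  # hypothetical algorithm output

print(adjusted_rand_score(gold_clusters, predicted_clusters))
```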
License: Apache 2.0

In general, you can cite this repository as:
```bibtex
@software{chenghao_mou_2023_8364980,
  author    = {Chenghao Mou and
               Chris Ha and
               Kenneth Enevoldsen and
               Peiyuan Liu},
  title     = {ChenghaoMou/text-dedup: Reference Snapshot},
  month     = sep,
  year      = 2023,
  publisher = {Zenodo},
  version   = {2023.09.20},
  doi       = {10.5281/zenodo.8364980},
  url       = {https://doi.org/10.5281/zenodo.8364980}
}
```

The Spark version came out of BigCode (Apache 2.0) and BigScience (Apache 2.0); you can cite the original paper if you want:
```bibtex
@article{kocetkov2023the,
  title   = {The Stack: 3 {TB} of permissively licensed source code},
  author  = {Denis Kocetkov and Raymond Li and Loubna Ben allal and Jia LI and Chenghao Mou and Yacine Jernite and Margaret Mitchell and Carlos Mu{\~n}oz Ferrandis and Sean Hughes and Thomas Wolf and Dzmitry Bahdanau and Leandro Von Werra and Harm de Vries},
  journal = {Transactions on Machine Learning Research},
  issn    = {2835-8856},
  year    = {2023},
  url     = {https://openreview.net/forum?id=pxpbTdUEpD},
  note    = {}
}
```

[^1]: Deduplication of Scholarly Documents using Locality Sensitive Hashing and Word Embeddings
[^2]: RETSim: Resilient and Efficient Text Similarity
[^3]: Noise-Robust De-Duplication at Scale