text dedup

Reference Snapshot

Installation

pip install text-dedup

or

pip install git+https://github.com/ChenghaoMou/text-dedup

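Documentation

GitHub Pages

Features

This repository contains a collection of ready-to-use text deduplication scripts:

RETSim/UniSim, an embedding-based near deduplication (WIP)
MinHash + MinHashLSH, including a Spark implementation suitable for large (TB) datasets
64 or 128 bit SimHash
Suffix array substring
Bloom filter
Exact hash (document-level, line-level/ccnet)

I also have big plans for the future:

Memory benchmark for streaming processing
Inter-dataset deduplication
Rewrite suffix array in Python
A collection of other deduplication methods: SuperMinHash, ProbMinHash, TreeMinHash, BagMinHash, Optimal Densification for Fast and Accurate Minwise Hashing, Fast Similarity Sketching

However, I do not intend to build a general-purpose deduplication library, which was the original goal of this repo. I will gradually retire the PyPI package as well. The reason is that each use case can be wildly different and requires careful design and consideration. I sincerely encourage you to read the scripts first (they are relatively short). You can use them to bootstrap your own script, or just use them as a reference.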

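Acknowledgements

This repository is inspired by the following projects, and is heavily influenced by lessons learned from my own participation in BigScience (Apache 2.0) and BigCode (Apache 2.0). There is a blog post about the journey. Feedback is welcome!

Datasketch (MIT)
simhash-py and simhash-cpp (MIT)
Deduplicating Training Data Makes Language Models Better (Apache 2.0)
Gaoya (MIT)

Quick Examples

Native PySpark

Modify text_dedup/minhash_spark.py for your own project and dataset first!

Assuming you have a downloaded dataset (Parquet files) under "./temp-data", you can process the files with your local compute as follows: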

export PYSPARK_PYTHON="path to your python with scipy, xxhash, and numpy installed"
spark-submit --executor-memory 16g \
    --driver-memory 20g \
    --executor-cores 3 \
    --num-executors 2 \
    --packages graphframes:graphframes:0.8.2-spark3.2-s_2.12 \
    --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=./log4j.properties" \
    --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=./log4j.properties" \
    text_dedup/minhash_spark.py \
    --input "./temp-data" \
    --output "./temp-output" \
    --column "text" \
    --threshold 0.7

 DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------
DEBUG __main__ - Using B=25, R=10
DEBUG __main__ - Loaded documents: 88803
DEBUG __main__ - args.input='./temp-data'
DEBUG __main__ - args.output='./temp-output'
DEBUG __main__ - args.threshold=0.7
DEBUG __main__ - args.ngram_size=5
DEBUG __main__ - args.min_length=5
DEBUG __main__ - args.num_perm=250
DEBUG __main__ - args.column='text'
DEBUG __main__ - id                                                              : bigint
DEBUG __main__ - text                                                            : string
DEBUG __main__ - meta                                                            : struct<warc_headers:struct<warc-record-id:string,warc-date:string,content-type:string,content-length:int,warc-type:string,warc-identified-content-language:string,warc-refers-to:string,warc-target-uri:string,warc-block-digest:string>,identification:struct<label:string,prob:float>,annotations:array<string>,line_identifications:array<struct<label:string,prob:float>>>
DEBUG __main__ - __id__                                                          : bigint
DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------
DEBUG __main__ - Initial edges: 52102
DEBUG __main__ - Edges DataFrame: 52102
DEBUG __main__ - Vertices DataFrame: 50206
DEBUG __main__ - Assignment DataFrame: 50206
DEBUG __main__ - Merging records: 88803
INFO  __main__ - Saving with 1 partitions and 44092 rows each
DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------
DEBUG __main__ - Number of rows before:    88803
DEBUG __main__ - Number of rows after:     44092
DEBUG __main__ - Percentage of rows kept:  49.65%
DEBUG __main__ - Output:                   ./temp-output
DEBUG __main__ - Time:                     68.80s
DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------

Or take a look at bigcode-v2/run.sh to see how to run the job with GCP DataProc.
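The log above reports `B=25, R=10` for `num_perm=250` and a threshold of 0.7. As a rough illustration of what those two numbers mean, the sketch below evaluates the standard MinHash LSH banding formula: two documents with Jaccard similarity s share at least one band with probability 1 - (1 - s^R)^B. The function name `candidate_probability` is made up for this example and is not part of text-dedup.

```python
def candidate_probability(s: float, bands: int, rows: int) -> float:
    """Probability that two documents with Jaccard similarity `s`
    collide in at least one of `bands` bands of `rows` hashes each."""
    return 1.0 - (1.0 - s ** rows) ** bands


if __name__ == "__main__":
    B, R = 25, 10  # values copied from the log output above
    for s in (0.5, 0.6, 0.7, 0.8, 0.9):
        p = candidate_probability(s, B, R)
        print(f"similarity={s:.1f} -> P(candidate pair)={p:.3f}")
```

The curve is steep around the threshold, which is why changing `--threshold` or `num_perm` changes which B and R are picked.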

UniSim (WIP)

Based on Google's RETSim model (GitHub, arXiv), this is an embedding-based near-deduplication method.

For a large dataset, GPU(s) would be required for fast inference.

python text_dedup/ann_unisim.py --path truthful_qa --name generation --split validation --output temp --column question

Output:

 INFO     Load Dataset                    : 5.56s
INFO     Index Dataset                   : 8.13s
INFO     Clustering                      : 8.72s
INFO     Filtering                       : 0.35s
INFO     Saving                          : 0.01s
INFO     Cleaning                        : 0.00s
INFO     Total                           : 22.77s
INFO     Before                          : 817
INFO     After                           : 788
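The run above embeds each question and then clusters and filters near neighbours. The sketch below only illustrates the filtering idea, using brute-force cosine similarity over made-up toy vectors. It does not call UniSim or RETSim, it swaps approximate nearest-neighbour search for an exhaustive comparison, and the name `keep_indices` and the 0.9 threshold are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keep_indices(embeddings: List[Sequence[float]], threshold: float = 0.9) -> List[int]:
    """Greedily keep a document only if it is not too similar to any kept one."""
    kept: List[int] = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


if __name__ == "__main__":
    # Toy 2-d "embeddings": the second vector is almost parallel to the first.
    toy = [(1.0, 0.0), (0.99, 0.05), (0.0, 1.0)]
    print(keep_indices(toy))
```

The exhaustive comparison is quadratic in the number of documents, which is the cost an index is meant to avoid on real datasets.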

Suffix Array Substring Exact Deduplication

# input
python -m text_dedup.suffix_array \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/suffix_array/oscar_gl_dedup" \
    --column "text" \
    --google_repo_path "/Users/chenghao/Downloads/Projects/text-dedup/deduplicate-text-datasets" \
    --use_auth_token true

# output
INFO     Loading                       : 2.75 seconds
INFO     Preprocessing                 : 4.78 seconds
INFO     SuffixArray                   : 98.29 seconds
INFO     SelfSimilar                   : 4.24 seconds
INFO     Restore                       : 0.25 seconds
INFO     Deduplicate                   : 6.23 seconds
INFO     Saving                        : 8.91 seconds
INFO     Total                         : 125.45 seconds
INFO     Before                        : 180332342 bytes (88803)
INFO     After                         : 97646271 bytes (40404)
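The command above points at an external suffix array implementation through `--google_repo_path`. To show the underlying idea on a toy scale, the sketch below builds a naive suffix array in pure Python and reports substrings that occur more than once. It is a slow illustration under my own assumptions (character-level text, helper names invented here), not the implementation the script uses.

```python
from typing import List, Set


def build_suffix_array(text: str) -> List[int]:
    """Start offsets of all suffixes in sorted order (naive, demo only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def repeated_substrings(text: str, min_len: int) -> Set[str]:
    """Substrings of at least `min_len` characters shared by adjacent suffixes."""
    sa = build_suffix_array(text)
    found: Set[str] = set()
    for prev, cur in zip(sa, sa[1:]):
        lcp = common_prefix_len(text[prev:], text[cur:])
        if lcp >= min_len:
            found.add(text[prev:prev + lcp])
    return found


if __name__ == "__main__":
    print(repeated_substrings("the cat sat. the cat ran.", min_len=8))
```

Sorting suffixes puts repeated substrings next to each other, so a single pass over adjacent entries is enough to find them.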

MinHash Near Deduplication

# input
python -m text_dedup.minhash \
  --path "oscar-corpus/OSCAR-2201" \
  --name "gl" \
  --split "train" \
  --cache_dir "./cache" \
  --output "output/minhash/oscar_gl_dedup" \
  --column "text" \
  --batch_size 10000 \
  --use_auth_token true

# output
INFO     Loading                         : 2.62 seconds
INFO     MinHashing                      : 0.08 seconds
INFO     Clustering                      : 2.20 seconds
INFO     Filtering                       : 0.53 seconds
INFO     Saving                          : 9.86 seconds
INFO     Total                           : 15.29 seconds
INFO     Data Number (before)            : 88803
INFO     Data Number (after)             : 44124 (49.69%)
INFO     Duplicate Number                : 44679 (50.31%)
INFO     🤗 Happy Deduplicating 🤗
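For readers new to MinHash, the sketch below shows the core estimate on two short strings: build word n-grams, keep the minimum of several salted hashes, and compare signatures position by position. It is a minimal illustration using only the standard library. The helper names, the salted BLAKE2b hashing, and the whitespace tokenisation are my assumptions and are not taken from `text_dedup.minhash`.

```python
import hashlib
from typing import List, Set


def word_ngrams(text: str, n: int = 5) -> Set[str]:
    """Split on whitespace and return the set of word n-grams."""
    tokens = text.split()
    if len(tokens) <= n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def minhash_signature(items: Set[str], num_perm: int = 128) -> List[int]:
    """One minimum per salted hash function; salts stand in for permutations."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching positions approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    a = word_ngrams("the quick brown fox jumps over the lazy dog near the river bank", n=3)
    b = word_ngrams("the quick brown fox jumps over the lazy dog near the river shore", n=3)
    exact = len(a & b) / len(a | b)
    approx = estimated_jaccard(minhash_signature(a), minhash_signature(b))
    print(f"exact Jaccard={exact:.3f}, MinHash estimate={approx:.3f}")
```

More hash functions give a tighter estimate at the cost of more hashing per document.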

SimHash Near Deduplication

# input
python -m text_dedup.simhash \
  --path "oscar-corpus/OSCAR-2201" \
  --name "gl" \
  --split "train" \
  --cache_dir "./cache" \
  --output "output/simhash/oscar_gl_dedup" \
  --column "text" \
  --batch_size 10000 \
  --use_auth_token true

# output
INFO     Loading                         : 2.60 seconds
INFO     SimHashing                      : 0.04 seconds
INFO     Indexing                        : 28.88 seconds
INFO     Filtering                       : 0.88 seconds
INFO     Saving                          : 10.41 seconds
INFO     Total                           : 42.80 seconds
INFO     Data Number (before)            : 88803
INFO     Data Number (after)             : 46163 (51.98%)
INFO     Duplicate Number                : 42640 (48.02%)
INFO     🤗 Happy Deduplicating 🤗
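As a companion to the run above, here is a minimal sketch of how a SimHash fingerprint can be computed and compared. Each token votes on every bit of the fingerprint, and two documents are compared by the Hamming distance between fingerprints. The unweighted tokens, the BLAKE2b hash, and the function names are assumptions made for this example rather than details of `text_dedup.simhash`.

```python
import hashlib
from typing import Iterable


def simhash(tokens: Iterable[str], bits: int = 64) -> int:
    """Bitwise majority vote over the hashes of all tokens."""
    weights = [0] * bits
    for tok in tokens:
        digest = hashlib.blake2b(tok.encode("utf-8"), digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when more tokens voted 1 than 0.
    return sum(1 << i for i, w in enumerate(weights) if w > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    a = simhash("the quick brown fox jumps over the lazy dog".split())
    b = simhash("the quick brown fox jumped over the lazy dog".split())
    print(f"hamming distance = {hamming(a, b)}")
```

Passing `bits=128` gives the wider fingerprint mentioned in the feature list.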

Exact Hash Exact Deduplication

# input
python -m text_dedup.exact_hash \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/exact_hash/oscar_gl_dedup" \
    --column "text" \
    --batch_size 1000 \
    --use_auth_token true

# output
INFO     Loading                       : 2.95s
INFO     Processing                    : 3.79s
INFO     Filtering                     : 0.10s
INFO     Saving                        : 2.89s
INFO     Total                         : 9.72s
INFO     Before                        : 88803
INFO     After                         : 47049
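Exact-hash deduplication is simple enough to sketch in a few lines. The example below keeps the first occurrence of each document, and shows a line-level variant in the spirit of the "line-level/ccnet" option listed under Features. The function names and the choice of SHA-256 are assumptions for illustration, and the script itself may hash differently.

```python
import hashlib
from typing import Iterable, List, Set


def _digest(text: str) -> str:
    """Hex digest used as the identity of a piece of text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def dedup_documents(docs: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each byte-identical document."""
    seen: Set[str] = set()
    kept: List[str] = []
    for doc in docs:
        key = _digest(doc)
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept


def dedup_lines(docs: Iterable[str]) -> List[str]:
    """Drop any line that already appeared earlier in the whole corpus."""
    seen: Set[str] = set()
    out: List[str] = []
    for doc in docs:
        kept_lines = []
        for line in doc.splitlines():
            key = _digest(line)
            if key not in seen:
                seen.add(key)
                kept_lines.append(line)
        out.append("\n".join(kept_lines))
    return out


if __name__ == "__main__":
    print(dedup_documents(["a", "b", "a"]))
    print(dedup_lines(["header\nfirst body", "header\nsecond body"]))
```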

Bloom Filter Exact Deduplication

# input
python -m text_dedup.bloom_filter \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/bloom_filter/oscar_gl_dedup" \
    --error_rate 1e-5 \
    --column "text" \
    --use_auth_token true \
    --batch_size 1000

# output
INFO     Loading                       : 2.72s
INFO     Processing                    : 4.84s
INFO     Filtering                     : 0.10s
INFO     Saving                        : 2.88s
INFO     Total                         : 10.54s
INFO     Before                        : 88803
INFO     After                         : 47045
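The `--error_rate 1e-5` flag above is the target false-positive rate of the filter. The sketch below is a small standard-library Bloom filter that sizes itself from a capacity and an error rate using the usual formulas m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2. The class, its method names, and the double-hashing scheme are assumptions made for this example, not the library the script relies on.

```python
import hashlib
import math
from typing import Iterator


class BloomFilter:
    def __init__(self, capacity: int, error_rate: float = 1e-5) -> None:
        # Number of bits (m) and hash functions (k) for the requested error rate.
        self.size = max(1, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.size / capacity * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Double hashing: derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size

    def add(self, item: str) -> bool:
        """Insert `item`; return True if it was (probably) already present."""
        present = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                present = False
                self.bits[byte] |= 1 << bit
        return present


if __name__ == "__main__":
    bf = BloomFilter(capacity=1000, error_rate=1e-5)
    docs = ["a", "b", "a", "c", "b"]
    print([d for d in docs if not bf.add(d)])
```

Unlike an exact hash set, a Bloom filter can report an unseen document as a duplicate with a small probability, in exchange for a fixed memory footprint.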

Benchmarks

Note

The Spark implementation has some overhead for small datasets, so I recommend using the script only when you have a large dataset and enough compute resources.

pinecone/core-2020-05-10-deduplication

See tests/benchmark_core.py for reproduction.

Algorithm	Precision (Duplicates)	Recall (Duplicates)	Precision (Non Duplicates)	Recall (Non Duplicates)	Macro F1 score	Accuracy	Time
UniSim	0.9307	0.8924	0.9055	0.9394	0.9181	0.9054	1305.79s
MinHash Spark	0.957	0.9445	0.9471	0.959	0.952	0.9202	691.77s
MinHash	0.9594	0.9445	0.9474	0.9616	0.9534	0.924	18.88s
SimHash	0.9042	0.721	0.792	0.9329	0.8481	0.8321	644.36s
Exact Title	0.8302	0.5521	0.7098	0.9065	0.77	0.7456	-
Exact Title Matching ¹	0.830	0.50	0.709	0.992	0.757	0.746	-
Simhash Matching ¹	0.697	0.247	0.598	0.985	0.631	0.616	-
Document Vector Similarity ¹	0.912	0.779	0.861	0.986	0.885	0.883	-
Hybrid Method ¹	0.908	0.828	0.899	0.979	0.904	0.903	-
LaBSE ²	0.937	0.923	0.930	0.943	0.933	0.919	-
Multilingual USE ²	0.917	0.907	0.918	0.927	0.917	0.909	-
Multilingual E5-Base ²	0.931	0.908	0.919	0.939	0.924	0.920	-
MinHash + LSH ²	0.929	0.902	0.915	0.938	0.921	0.918	-
RETSim Partial-Dup ²	0.945	0.941	0.945	0.949	0.945	0.928	-
RETSim Near-Dup ²	0.928	0.937	0.942	0.934	0.935	0.926	-

NEWS-COPY

See tests/benchmark_news.py for reproduction.

Adjusted Rand Index (ARI) on the NEWS-COPY dataset:

Model/Algorithm	ARI
SimHash	0.612
MinHash (Spark)	0.740
MinHash	0.742
RETSim Near-Dup + ANN*	0.051
n-gram ³	0.440
SimHash ²	0.695
MinHash ³	0.737
MinHash ²	0.783
Multilingual USE ²	0.730
Multilingual E5-Base ²	0.742
S-BERT ³	0.700
RETSim Partial-Dup ²	0.831
RETSim Near-Dup ²	0.704
Re-ranking ³	0.937
Bi-encoder ³	0.915

*: I can't seem to reproduce the results from the paper.
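For context on the metric in the table above, the Adjusted Rand Index compares two clusterings of the same items and corrects for chance agreement: 1.0 means identical clusterings and values near 0 mean agreement no better than random. The sketch below computes it from pair counts with the standard library only. It is an illustrative implementation written for this note, not the code used in tests/benchmark_news.py.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def adjusted_rand_index(truth: Sequence[Hashable], pred: Sequence[Hashable]) -> float:
    """ARI between two labelings of the same items."""
    if len(truth) != len(pred):
        raise ValueError("labelings must have the same length")
    n = len(truth)
    # Pairs of items that share a cluster in both, in truth only, in pred only.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred).values())
    total = comb(n, 2)
    expected = sum_rows * sum_cols / total if total else 0.0
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. every item in its own cluster in both
    return (sum_cells - expected) / (max_index - expected)


if __name__ == "__main__":
    truth = [0, 0, 1, 1, 2, 2]
    pred = [5, 5, 7, 7, 7, 9]
    print(f"ARI = {adjusted_rand_index(truth, pred):.3f}")
```

Cluster label values do not matter, only which items are grouped together, so deduplication cluster ids can be passed in directly.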

License

Apache 2.0

Citing

Generally, you can cite this repository as:

@software{chenghao_mou_2023_8364980,
  author       = {Chenghao Mou and
                  Chris Ha and
                  Kenneth Enevoldsen and
                  Peiyuan Liu},
  title        = {ChenghaoMou/text-dedup: Reference Snapshot},
  month        = sep,
  year         = 2023,
  publisher    = {Zenodo},
  version      = {2023.09.20},
  doi          = {10.5281/zenodo.8364980},
  url          = {https://doi.org/10.5281/zenodo.8364980}
}

The Spark version was born from BigCode (Apache 2.0) and BigScience (Apache 2.0), and you can cite the original paper if you want:

@article{kocetkov2023the,
  title   = {The Stack: 3 {TB} of permissively licensed source code},
  author  = {Denis Kocetkov and Raymond Li and Loubna Ben allal and Jia LI and Chenghao Mou and Yacine Jernite and Margaret Mitchell and Carlos Mu{\~n}oz Ferrandis and Sean Hughes and Thomas Wolf and Dzmitry Bahdanau and Leandro Von Werra and Harm de Vries},
  journal = {Transactions on Machine Learning Research},
  issn    = {2835-8856},
  year    = {2023},
  url     = {https://openreview.net/forum?id=pxpbTdUEpD},
  note    = {}
}

¹ Deduplication of Scholarly Documents using Locality Sensitive Hashing and Word Embeddings
² RETSim: Resilient and Efficient Text Similarity
³ Noise-Robust De-Duplication at Scale

Additional information

Version: Reference Snapshot
Type: Other source code
Updated: 2025-04-19
Size: 194.73KB
Source: GitHub
