text dedup

Reference Snapshot

Installation

pip install text-dedup

or

pip install git+https://github.com/ChenghaoMou/text-dedup

Documentation

GitHub Pages

Features

This repository contains a collection of text deduplication scripts that are ready to use, or to modify based on your needs:

RETSim/UniSim, an embedding-based near deduplication (WIP)
MinHash + MinHashLSH, including a Spark implementation suitable for large (TB) datasets
64- or 128-bit SimHash
Suffix Array Substring
Bloom Filter
Exact Hash (document-level, line-level/ccnet)

I also have big plans for the future:

Memory benchmark for streaming processing
Inter-dataset deduplication
Rewrite the suffix array in Python
A collection of other deduplication methods: SuperMinHash, ProbMinHash, TreeMinHash, BagMinHash, Optimal Densification for Fast and Accurate Minwise Hashing, Fast Similarity Sketching

However, I do not intend to build a general-purpose deduplication library, which was the goal of this repository early on. I will also gradually retire the PyPI package. The reason is that each use case can be wildly different and requires careful design and consideration. I sincerely encourage you to read the scripts first (they are relatively short) so you can understand what is at stake when you use them. You can use them to bootstrap your own script, or simply as a reference.

Acknowledgements

This repository is inspired by the following projects, and is heavily influenced by lessons learned from my own participation in BigScience (Apache 2.0) and BigCode (Apache 2.0). There is a blog post about the journey. Feedback is welcome!

Datasketch (MIT)
simhash-py and simhash-cpp (MIT)
Deduplicating Training Data Makes Language Models Better (Apache 2.0)
Gaoya (MIT)

Quick Examples

Native PySpark

Modify text_dedup/minhash_spark.py for your own project and dataset first!

Assuming you have a downloaded dataset (in parquet files) under "./temp-data", you can process the files with your local compute by:

export PYSPARK_PYTHON="path to your python with scipy, xxhash, and numpy installed"
spark-submit --executor-memory 16g \
    --driver-memory 20g \
    --executor-cores 3 \
    --num-executors 2 \
    --packages graphframes:graphframes:0.8.2-spark3.2-s_2.12 \
    --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=./log4j.properties" \
    --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=./log4j.properties" \
    text_dedup/minhash_spark.py \
    --input "./temp-data" \
    --output "./temp-output" \
    --column "text" \
    --threshold 0.7

 DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------
DEBUG __main__ - Using B=25, R=10
DEBUG __main__ - Loaded documents: 88803
DEBUG __main__ - args.input='./temp-data'
DEBUG __main__ - args.output='./temp-output'
DEBUG __main__ - args.threshold=0.7
DEBUG __main__ - args.ngram_size=5
DEBUG __main__ - args.min_length=5
DEBUG __main__ - args.num_perm=250
DEBUG __main__ - args.column='text'
DEBUG __main__ - id                                                              : bigint
DEBUG __main__ - text                                                            : string
DEBUG __main__ - meta                                                            : struct<warc_headers:struct<warc-record-id:string,warc-date:string,content-type:string,content-length:int,warc-type:string,warc-identified-content-language:string,warc-refers-to:string,warc-target-uri:string,warc-block-digest:string>,identification:struct<label:string,prob:float>,annotations:array<string>,line_identifications:array<struct<label:string,prob:float>>>
DEBUG __main__ - __id__                                                          : bigint
DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------
DEBUG __main__ - Initial edges: 52102
DEBUG __main__ - Edges DataFrame: 52102
DEBUG __main__ - Vertices DataFrame: 50206
DEBUG __main__ - Assignment DataFrame: 50206
DEBUG __main__ - Merging records: 88803
INFO  __main__ - Saving with 1 partitions and 44092 rows each
DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------
DEBUG __main__ - Number of rows before:    88803
DEBUG __main__ - Number of rows after:     44092
DEBUG __main__ - Percentage of rows kept:  49.65%
DEBUG __main__ - Output:                   ./temp-output
DEBUG __main__ - Time:                     68.80s
DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------

Or take a look at bigcode-v2/run.sh to see how to run the job with GCP DataProc.
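
The log above reports B=25 bands of R=10 rows for a threshold of 0.7 (25 x 10 = 250 permutations). As a rough illustration of what those numbers mean, the sketch below evaluates the standard LSH banding formula, which gives the probability that two documents with a given Jaccard similarity share at least one band. The function name `collision_probability` is made up for this example, and how minhash_spark.py itself selects B and R is not shown here.

```python
def collision_probability(similarity: float, bands: int, rows: int) -> float:
    """Probability that two documents with the given Jaccard similarity
    agree on every row of at least one band (standard LSH banding formula)."""
    return 1.0 - (1.0 - similarity ** rows) ** bands


# B and R as reported in the log above; everything else is illustrative.
B, R = 25, 10

for s in (0.5, 0.6, 0.7, 0.8, 0.9):
    p = collision_probability(s, B, R)
    print(f"similarity={s:.1f} -> candidate-pair probability={p:.3f}")
```

The curve is S-shaped: pairs well below the threshold rarely become candidates, while pairs well above it almost always do.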

UniSim (WIP)

Based on Google's RETSim model (GitHub, arXiv), it is an embedding-based near-deduplication method.

For a large dataset, it would require GPU(s) for fast inference.

python text_dedup/ann_unisim.py --path truthful_qa --name generation --split validation --output temp --column question

Output:

 INFO     Load Dataset                    : 5.56s
INFO     Index Dataset                   : 8.13s
INFO     Clustering                      : 8.72s
INFO     Filtering                       : 0.35s
INFO     Saving                          : 0.01s
INFO     Cleaning                        : 0.00s
INFO     Total                           : 22.77s
INFO     Before                          : 817
INFO     After                           : 788
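
To make the idea of embedding-based near deduplication concrete, here is a minimal sketch that flags pairs whose embedding vectors exceed a cosine-similarity threshold. It is not the UniSim API: the vectors are hand-written stand-ins for model output, the names `cosine` and `near_duplicate_pairs` and the 0.9 threshold are assumptions, and the brute-force pairwise loop stands in for the approximate nearest neighbour index that the script name ann_unisim.py suggests.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def near_duplicate_pairs(embeddings: list[list[float]], threshold: float = 0.9):
    """Return index pairs (i, j), i < j, whose cosine similarity is >= threshold.
    Brute force, O(n^2): fine for a toy example, not for a large dataset."""
    pairs = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                pairs.append((i, j))
    return pairs


# Hypothetical embeddings: the first two point in nearly the same direction.
toy_embeddings = [[1.0, 0.0, 0.1], [0.98, 0.05, 0.12], [0.0, 1.0, 0.0]]
print(near_duplicate_pairs(toy_embeddings))
```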

Suffix Array Substring Exact Deduplication

# input
python -m text_dedup.suffix_array \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/suffix_array/oscar_gl_dedup" \
    --column "text" \
    --google_repo_path "/Users/chenghao/Downloads/Projects/text-dedup/deduplicate-text-datasets" \
    --use_auth_token true

# output
INFO     Loading                       : 2.75 seconds
INFO     Preprocessing                 : 4.78 seconds
INFO     SuffixArray                   : 98.29 seconds
INFO     SelfSimilar                   : 4.24 seconds
INFO     Restore                       : 0.25 seconds
INFO     Deduplicate                   : 6.23 seconds
INFO     Saving                        : 8.91 seconds
INFO     Total                         : 125.45 seconds
INFO     Before                        : 180332342 bytes (88803)
INFO     After                         : 97646271 bytes (40404)
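
The command above delegates the heavy lifting to the repository passed via --google_repo_path. For intuition only, the following sketch shows how a suffix array exposes repeated substrings: once all suffixes are sorted, any substring that occurs twice shows up as a long common prefix between neighbouring suffixes. This is a naive pure-Python version with hypothetical function names, and it makes no claim to match the actual implementation or its performance.

```python
def suffix_array(text: str) -> list[int]:
    """Start offsets of all suffixes, sorted lexicographically (naive build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def repeated_substrings(text: str, min_len: int) -> set[str]:
    """Substrings of at least min_len characters shared by two suffixes
    that are adjacent in sorted order."""
    sa = suffix_array(text)
    found = set()
    for prev, cur in zip(sa, sa[1:]):
        k = common_prefix_len(text[prev:], text[cur:])
        if k >= min_len:
            found.add(text[cur:cur + k])
    return found


print(repeated_substrings("the cat sat. the cat ran.", min_len=8))
```

A deduplication pass would then remove all but one occurrence of each reported span; that step is left out to keep the sketch short.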

MinHash Near Deduplication

# input
python -m text_dedup.minhash \
  --path "oscar-corpus/OSCAR-2201" \
  --name "gl" \
  --split "train" \
  --cache_dir "./cache" \
  --output "output/minhash/oscar_gl_dedup" \
  --column "text" \
  --batch_size 10000 \
  --use_auth_token true

# output
INFO     Loading                         : 2.62 seconds
INFO     MinHashing                      : 0.08 seconds
INFO     Clustering                      : 2.20 seconds
INFO     Filtering                       : 0.53 seconds
INFO     Saving                          : 9.86 seconds
INFO     Total                           : 15.29 seconds
INFO     Data Number (before)            : 88803
INFO     Data Number (after)             : 44124 (49.69%)
INFO     Duplicate Number                : 44679 (50.31%)
INFO     🤗 Happy Deduplicating 🤗
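
As a hedged illustration of the "MinHashing" step, the sketch below builds a MinHash signature from word n-grams and estimates Jaccard similarity as the fraction of matching signature slots. The whitespace tokenisation, the use of salted BLAKE2b as the hash family, and all function names are assumptions made for this example; the script's own tokeniser and hash functions may differ.

```python
import hashlib


def word_ngrams(text: str, n: int = 5) -> set[str]:
    """Set of word n-grams; short texts fall back to a single item."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def minhash_signature(items: set[str], num_perm: int = 250) -> list[int]:
    """One minimum per seeded hash function; items must be non-empty."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt per "permutation"
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of signature slots on which the two documents agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river today"
print(estimated_jaccard(minhash_signature(word_ngrams(doc_a)),
                        minhash_signature(word_ngrams(doc_b))))
```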

SimHash Near Deduplication

# input
python -m text_dedup.simhash \
  --path "oscar-corpus/OSCAR-2201" \
  --name "gl" \
  --split "train" \
  --cache_dir "./cache" \
  --output "output/simhash/oscar_gl_dedup" \
  --column "text" \
  --batch_size 10000 \
  --use_auth_token true

# output
INFO     Loading                         : 2.60 seconds
INFO     SimHashing                      : 0.04 seconds
INFO     Indexing                        : 28.88 seconds
INFO     Filtering                       : 0.88 seconds
INFO     Saving                          : 10.41 seconds
INFO     Total                           : 42.80 seconds
INFO     Data Number (before)            : 88803
INFO     Data Number (after)             : 46163 (51.98%)
INFO     Duplicate Number                : 42640 (48.02%)
INFO     🤗 Happy Deduplicating 🤗
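
For readers new to SimHash, here is a small sketch of the fingerprinting idea: every token votes on each bit of the fingerprint, and near-duplicate documents end up a small Hamming distance apart. Unweighted whitespace tokens, BLAKE2b as the token hash, and the function names are assumptions for illustration only; the script's feature extraction and bit-difference threshold are not reproduced here.

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """SimHash fingerprint of the given width (64 or 128 bits)."""
    counts = [0] * bits
    for token in text.lower().split():
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when the positive votes win.
    return sum(1 << i for i, c in enumerate(counts) if c > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


fp_a = simhash("the quick brown fox jumps over the lazy dog")
fp_b = simhash("the quick brown fox jumped over the lazy dog")
print(hamming(fp_a, fp_b))
```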

Exact Hash Exact Deduplication

# input
python -m text_dedup.exact_hash \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/exact_hash/oscar_gl_dedup" \
    --column "text" \
    --batch_size 1000 \
    --use_auth_token true

# output
INFO     Loading                       : 2.95s
INFO     Processing                    : 3.79s
INFO     Filtering                     : 0.10s
INFO     Saving                        : 2.89s
INFO     Total                         : 9.72s
INFO     Before                        : 88803
INFO     After                         : 47049
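
Document-level exact-hash deduplication boils down to keeping the first document seen for each content hash. The sketch below shows that idea in a few lines; SHA-256 and the name `exact_hash_dedup` are choices made for this example, not necessarily what the script uses.

```python
import hashlib
from typing import Iterable


def exact_hash_dedup(docs: Iterable[str]) -> list[str]:
    """Keep the first occurrence of each distinct document, in input order."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


print(exact_hash_dedup(["a", "b", "a", "c", "b"]))
```

Storing digests rather than full texts keeps memory proportional to the number of distinct documents, not their total length.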

Bloom Filter Exact Deduplication

# input
python -m text_dedup.bloom_filter \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/bloom_filter/oscar_gl_dedup" \
    --error_rate 1e-5 \
    --column "text" \
    --use_auth_token true \
    --batch_size 1000

# output
INFO     Loading                       : 2.72s
INFO     Processing                    : 4.84s
INFO     Filtering                     : 0.10s
INFO     Saving                        : 2.88s
INFO     Total                         : 10.54s
INFO     Before                        : 88803
INFO     After                         : 47045
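
A Bloom filter trades a small false-positive rate for a fixed memory footprint, so it may drop a few documents that exact hashing would keep. The small gap between the two outputs above (47045 versus 47049 rows) is consistent with that, although the cause is not stated. The following is a minimal sketch, assuming the textbook sizing formulas and double hashing over a SHA-256 digest; the class, its parameters, and `bloom_dedup` are hypothetical and unrelated to the library the script relies on.

```python
import hashlib
import math
from typing import Iterable


class BloomFilter:
    def __init__(self, capacity: int, error_rate: float = 1e-5):
        # Textbook sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.size = max(1, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.size / capacity * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item: str) -> list[int]:
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step for double hashing
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, item: str) -> bool:
        """Insert item; return True if it was (probably) already present."""
        present = True
        for pos in self._positions(item):
            byte, mask = pos >> 3, 1 << (pos & 7)
            if not self.bits[byte] & mask:
                present = False
                self.bits[byte] |= mask
        return present


def bloom_dedup(docs: Iterable[str], capacity: int, error_rate: float = 1e-5) -> list[str]:
    """Keep documents the filter has not (probably) seen before."""
    bf = BloomFilter(capacity, error_rate)
    return [doc for doc in docs if not bf.add(doc)]


print(bloom_dedup(["a", "b", "a", "c", "b"], capacity=1000))
```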

Benchmarks

Note

The Spark implementation has some overhead for small datasets, so I recommend using the script only when you have a large dataset and enough compute resources.

pinecone/core-2020-05-10-deduplication

See tests/benchmark_core.py for reproduction.

Algorithm	Precision (Duplicates)	Recall (Duplicates)	Precision (Non Duplicates)	Recall (Non Duplicates)	Macro F1 score	Accuracy	Time
UniSim	0.9307	0.8924	0.9055	0.9394	0.9181	0.9054	1305.79s
MinHash Spark	0.957	0.9445	0.9471	0.959	0.952	0.9202	691.77s
MinHash	0.9594	0.9445	0.9474	0.9616	0.9534	0.924	18.88s
SimHash	0.9042	0.721	0.792	0.9329	0.8481	0.8321	644.36s
Exact Title	0.8302	0.5521	0.7098	0.9065	0.77	0.7456	-
Exact Title Matching ¹	0.830	0.50	0.709	0.992	0.757	0.746	-
SimHash Matching ¹	0.697	0.247	0.598	0.985	0.631	0.616	-
Document Vector Similarity ¹	0.912	0.779	0.861	0.986	0.885	0.883	-
Hybrid Method ¹	0.908	0.828	0.899	0.979	0.904	0.903	-
LaBSE ²	0.937	0.923	0.930	0.943	0.933	0.919	-
Multilingual USE ²	0.917	0.907	0.918	0.927	0.917	0.909	-
Multilingual E5-Base ²	0.931	0.908	0.919	0.939	0.924	0.920	-
MinHash + LSH ²	0.929	0.902	0.915	0.938	0.921	0.918	-
RETSim Partial-Dup ²	0.945	0.941	0.945	0.949	0.945	0.928	-
RETSim Near-Dup ²	0.928	0.937	0.942	0.934	0.935	0.926	-
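
The columns in the table above can be computed from per-document duplicate labels. The sketch below assumes the usual definitions, in particular that the Macro F1 score is the unweighted mean of the F1 scores of the two classes; the function names and the toy labels are made up, and tests/benchmark_core.py remains the reference for how the reported numbers were produced.

```python
def precision_recall(y_true: list[bool], y_pred: list[bool], positive: bool):
    """Precision and recall, treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def report(y_true: list[bool], y_pred: list[bool]) -> dict:
    """Per-class precision/recall, macro F1, and accuracy (True = duplicate)."""
    p_dup, r_dup = precision_recall(y_true, y_pred, True)
    p_non, r_non = precision_recall(y_true, y_pred, False)
    return {
        "precision_dup": p_dup, "recall_dup": r_dup,
        "precision_non": p_non, "recall_non": r_non,
        "macro_f1": (f1(p_dup, r_dup) + f1(p_non, r_non)) / 2,
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
    }


# Toy labels: one duplicate is missed by the prediction.
print(report([True, True, False, False], [True, False, False, False]))
```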

NEWS-COPY

See tests/benchmark_news.py for reproduction.

Adjusted Rand Index (ARI) on the NEWS-COPY dataset:

Model/Algorithm	ARI
SimHash	0.612
MinHash (Spark)	0.740
MinHash	0.742
RETSim Near-Dup + ANN *	0.051
n-gram ³	0.440
SimHash ²	0.695
MinHash ³	0.737
MinHash ²	0.783
Multilingual USE ²	0.730
Multilingual E5-Base ²	0.742
S-BERT ³	0.700
RETSim Partial-Dup ²	0.831
RETSim Near-Dup ²	0.704
Re-ranking ³	0.937
Bi-encoder ³	0.915

*: I can't seem to reproduce the results from the paper.
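
For reference, the Adjusted Rand Index compares a predicted clustering with the ground-truth clustering while correcting for chance agreement: 1.0 means identical partitions, and values near 0 mean agreement no better than random. The sketch below implements the standard pair-counting formula in pure Python; the function name is an assumption, and tests/benchmark_news.py may well rely on a library implementation instead.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true: list, labels_pred: list) -> float:
    """ARI from the contingency table of two labelings of the same items."""
    n = len(labels_true)
    # Pairs of items that share a cluster in both labelings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs that share a cluster in each labeling taken separately.
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    total_pairs = comb(n, 2)
    expected = sum_true * sum_pred / total_pairs if total_pairs else 0.0
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Cluster ids are arbitrary: a relabeled but identical partition still scores 1.0.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```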

License

Apache 2.0

Citations

Generally, you can cite this repository as:

@software{chenghao_mou_2023_8364980,
  author       = {Chenghao Mou and
                  Chris Ha and
                  Kenneth Enevoldsen and
                  Peiyuan Liu},
  title        = {ChenghaoMou/text-dedup: Reference Snapshot},
  month        = sep,
  year         = 2023,
  publisher    = {Zenodo},
  version      = {2023.09.20},
  doi          = {10.5281/zenodo.8364980},
  url          = {https://doi.org/10.5281/zenodo.8364980}
}

The Spark version was born from BigCode (Apache 2.0) and BigScience (Apache 2.0), and you can cite the original paper if you want:

@article{
kocetkov2023the,
title={The Stack: 3 {TB} of permissively licensed source code},
author={Denis Kocetkov and Raymond Li and Loubna Ben allal and Jia LI and Chenghao Mou and Yacine Jernite and Margaret Mitchell and Carlos Mu{\~n}oz Ferrandis and Sean Hughes and Thomas Wolf and Dzmitry Bahdanau and Leandro Von Werra and Harm de Vries},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2023},
url={https://openreview.net/forum?id=pxpbTdUEpD},
note={}
}

¹ Deduplication of Scholarly Documents using Locality Sensitive Hashing and Word Embeddings
² RETSim: Resilient and Efficient Text Similarity
³ Noise-Robust De-Duplication at Scale