NeuSpell: A Neural Spelling Correction Toolkit
neuspell is now available via pip; see Installation through pip. A pretrained BERT model is now also available as part of HuggingFace models, as murali1996/bert-base-cased-spell-correction. An example code snippet for curious practitioners is provided below.

```
git clone https://github.com/neuspell/neuspell ; cd neuspell
pip install -e .
```

To install extra requirements,

```
pip install -r extras-requirements.txt
```

or individually as:

```
pip install -e .[elmo]
pip install -e .[spacy]
```

NOTE: For zsh, use ".[elmo]" and ".[spacy]" instead.

Additionally, spacy models can be downloaded as:

```
python -m spacy download en_core_web_sm
```

Then, download pretrained NeuSpell models following Download Checkpoints below.
Here is a quick-start code snippet for using a checker model via the command line. See test_neuspell_correctors.py for more usage patterns.

```python
import neuspell
from neuspell import available_checkers, BertChecker

""" see available checkers """
print(f"available checkers: {neuspell.available_checkers()}")
# → available checkers: ['BertsclstmChecker', 'CnnlstmChecker', 'NestedlstmChecker', 'SclstmChecker', 'SclstmbertChecker', 'BertChecker', 'SclstmelmoChecker', 'ElmosclstmChecker']

""" select spell checkers & load """
checker = BertChecker()
checker.from_pretrained()

""" spell correction """
checker.correct("I luk foward to receving your reply")
# → "I look forward to receiving your reply"
checker.correct_strings(["I luk foward to receving your reply", ])
# → ["I look forward to receiving your reply"]
checker.correct_from_file(src="noisy_texts.txt")
# → "Found 450 mistakes in 322 lines, total_lines=350"

""" evaluation of models """
checker.evaluate(clean_file="bea60k.txt", corrupt_file="bea60k.noise.txt")
# → data size: 63044
# → total inference time for this data is: 998.13 secs
# → total token count: 1032061
# → confusion table: corr2corr:940937, corr2incorr:21060,
#                    incorr2corr:55889, incorr2incorr:14175
# → accuracy is 96.58%
# → word correction rate is 79.76%
```
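As an aside, the two summary numbers can be reconstructed from the confusion table that `evaluate()` prints. Below is a small worked check; the metric definitions are our reading of the reported numbers, not code taken from the toolkit:

```python
# Reconstructing the evaluate() summary above (definitions inferred, not toolkit code).
corr2corr, corr2incorr = 940937, 21060
incorr2corr, incorr2incorr = 55889, 14175

total = corr2corr + corr2incorr + incorr2corr + incorr2incorr   # 1032061 tokens
accuracy = (corr2corr + incorr2corr) / total                    # tokens correct after checking
correction_rate = incorr2corr / (incorr2corr + incorr2incorr)   # misspelled tokens fixed

print(accuracy)         # ≈ 0.9658..., reported above as 96.58%
print(correction_rate)  # ≈ 0.7976..., reported above as 79.76%
```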
Alternatively, a spell checker can be selected and loaded as follows:

```python
from neuspell import SclstmChecker
checker = SclstmChecker()
checker = checker.add_("elmo", at="input")  # "elmo" or "bert", "input" or "output"
checker.from_pretrained()
```

Currently, this functionality of adding an ELMO or BERT model is supported only for select models. See the list of neural models in the toolkit for details.
If interested, follow Additional requirements to install the non-neural spell checkers Aspell and Jamspell.
```
pip install neuspell
```

In v1.0, the allennlp library, which is required by models that include ELMO, is not installed automatically. Therefore, to use those checkers, follow Installation & Quick Start to install from source.
NeuSpell is an open-source toolkit for context-sensitive spelling correction in English. The toolkit comprises ten spell checkers, benchmarked on naturally occurring misspellings from multiple (publicly available) sources. To make neural models for spell checking context dependent, (i) we train neural models using spelling errors in context, synthetically constructed by reverse engineering isolated misspellings, and (ii) we use richer representations of the context. This toolkit enables NLP practitioners to use our proposed and existing spelling correction systems via a simple unified command line as well as a web interface. Among many potential applications, we demonstrate the utility of our spell checkers in combating adversarial misspellings.

List of neural models in the toolkit:

- CNN-LSTM
- SC-LSTM
- Nested-LSTM
- BERT
- SC-LSTM plus ELMO (at input)
- SC-LSTM plus ELMO (at output)
- SC-LSTM plus BERT (at input)
- SC-LSTM plus BERT (at output)
This pipeline corresponds to the `SC-LSTM plus ELMO (at input)` model.
| Spell Checker | Word Correction Rate | Time per sentence (in milliseconds) |
|---|---|---|
| Aspell | 48.7 | 7.3* |
| Jamspell | 68.9 | 2.6* |
| CNN-LSTM | 75.8 | 4.2 |
| SC-LSTM | 76.7 | 2.8 |
| Nested-LSTM | 77.3 | 6.4 |
| BERT | 79.1 | 7.1 |
| SC-LSTM plus ELMO (at input) | 79.8 | 15.8 |
| SC-LSTM plus ELMO (at output) | 78.5 | 16.3 |
| SC-LSTM plus BERT (at input) | 77.0 | 6.7 |
| SC-LSTM plus BERT (at output) | 76.0 | 7.2 |
Performance of the spell checkers in the NeuSpell toolkit on the BEA-60K dataset with real-world spelling mistakes. * indicates evaluation on a CPU (for all others we use a GeForce RTX 2080 Ti GPU).
To download a selected checkpoint, pick a checkpoint name from the table below and run the download snippet that follows. Each checkpoint is associated with a neural spell checker in the table.
| Spell Checker | Class | Checkpoint name | Disk space (approx.) |
|---|---|---|---|
| CNN-LSTM | CnnlstmChecker | 'cnn-lstm-probwordnoise' | 450 MB |
| SC-LSTM | SclstmChecker | 'scrnn-probwordnoise' | 450 MB |
| Nested-LSTM | NestedlstmChecker | 'lstm-lstm-probwordnoise' | 455 MB |
| BERT | BertChecker | 'subwordbert-probwordnoise' | 740 MB |
| SC-LSTM plus ELMO (at input) | ElmosclstmChecker | 'elmoscrnn-probwordnoise' | 840 MB |
| SC-LSTM plus BERT (at input) | BertsclstmChecker | 'bertscrnn-probwordnoise' | 900 MB |
| SC-LSTM plus BERT (at output) | SclstmbertChecker | 'scrnnbert-probwordnoise' | 1.19 GB |
| SC-LSTM plus ELMO (at output) | SclstmelmoChecker | 'scrnnelmo-probwordnoise' | 1.23 GB |
```python
import neuspell
neuspell.seq_modeling.downloads.download_pretrained_model("subwordbert-probwordnoise")
```

Alternatively, download all the NeuSpell neural models by running the following (available for versions after v1.0):

```python
import neuspell
neuspell.seq_modeling.downloads.download_pretrained_model("_all_")
```
We curate several synthetic and natural datasets for training/evaluating NeuSpell models. Check our paper for full details. Run the following to download all datasets:

```
cd data/traintest
python download_datafiles.py
```

For more details, see data/traintest/README.md.
The train files are dubbed with the suffixes .random, .word, .prob, and .probword, after the different noising strategies used to create them. For each strategy (see Synthetic Data Creation), we noise approximately 20% of the tokens in the clean corpus. We use 1.6 million sentences from the One Billion Word Benchmark dataset as our clean corpus.
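For illustration, such a (clean, corrupt) file pair can be loaded with the toolkit's `load_data` helper, which also appears in the fine-tuning walkthrough later in this README; the file names below are hypothetical placeholders for the downloaded splits:

```python
# A minimal sketch: pair a clean split with its .probword-noised counterpart.
# The file names are hypothetical placeholders, not guaranteed by this README.
from neuspell.seq_modeling.helpers import load_data

data_dir = "data/traintest"                 # location used by download_datafiles.py
clean_file = "train.1blm"                   # hypothetical clean split
corrupt_file = "train.1blm.noise.probword"  # hypothetical ~20%-noised split

train_data = load_data(data_dir, clean_file, corrupt_file)
print(f"{len(train_data)} (clean, corrupt) sentence pairs")
```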
To set up the demo, follow these steps:
pip install -e ".[flask]"CUDA_VISIBLE_DEVICES=0 python app.py (在gpu)或python app.py (在cpu上),在文件夾中啟動燒瓶服務器。 /scripts/flask-server。該工具包提供了3種人們物策略(從現有文獻中識別),以生成合成的並行訓練數據,以訓練神經模型進行咒語校正。這些策略包括簡單的基於查找的en-word-replacement-noise noise ),字符級噪聲誘導,例如en-char-replacement-noise noise ),以及基於混淆矩陣的en-probchar-replacement-noise )驅動。有關這些方法的完整詳細信息,請查看我們的論文。
Below are the classes corresponding to the above noising strategies. Since some noisers rely on pre-built data files, we also list their approximate disk space.
| Folder | Class name | Disk space (approx.) |
|---|---|---|
| en-word-replacement-noise | WordReplacementNoiser | 2 MB |
| en-char-replacement-noise | CharacterReplacementNoiser | - |
| en-probchar-replacement-noise | ProbabilisticCharacterReplacementNoiser | 80 MB |
Below is a snippet for using these noisers:

```python
from neuspell.noising import WordReplacementNoiser

example_texts = [
    "This is an example sentence to demonstrate noising in the neuspell repository.",
    "Here is another such amazing example !!"
]

word_repl_noiser = WordReplacementNoiser(language="english")
word_repl_noiser.load_resources()
noise_texts = word_repl_noiser.noise(example_texts)
print(noise_texts)
```
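The other two noisers from the table above can presumably be used the same way; the sketch below assumes they expose the same `load_resources()`/`noise()` interface as WordReplacementNoiser, which this README does not confirm:

```python
# Assumed usage of the remaining noisers; a shared interface is an assumption.
from neuspell.noising import CharacterReplacementNoiser, ProbabilisticCharacterReplacementNoiser

example_texts = ["Here is another such amazing example !!"]  # as in the snippet above

for noiser_cls in (CharacterReplacementNoiser, ProbabilisticCharacterReplacementNoiser):
    noiser = noiser_cls(language="english")
    noiser.load_resources()  # loads any pre-built data files (see disk sizes above)
    print(noiser_cls.__name__, noiser.noise(example_texts))
```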
Coming Soon ...

Fine-tune on top of NeuSpell pretrained models:

```python
from neuspell import BertChecker
checker = BertChecker()
checker.from_pretrained()
checker.finetune(clean_file="sample_clean.txt", corrupt_file="sample_corrupt.txt", data_dir="default")
```

This functionality is currently available only for BertChecker and ElmosclstmChecker.
We now also support initializing a Hugging Face model and fine-tuning it on your custom data. Here is a code snippet demonstrating this:
First, specify the files containing the clean and corrupt texts in a line-separated format:

```python
from neuspell.commons import DEFAULT_TRAINTEST_DATA_PATH

data_dir = DEFAULT_TRAINTEST_DATA_PATH
clean_file = "sample_clean.txt"
corrupt_file = "sample_corrupt.txt"
```

Then:

```python
from neuspell.seq_modeling.helpers import load_data, train_validation_split
from neuspell.seq_modeling.helpers import get_tokens
from neuspell import BertChecker

# Step-0: Load your train and test files, create a validation split
train_data = load_data(data_dir, clean_file, corrupt_file)
train_data, valid_data = train_validation_split(train_data, 0.8, seed=11690)

# Step-1: Create vocab file. This serves as the target vocab file and we use the defined
# model's default huggingface tokenizer to tokenize inputs appropriately.
vocab = get_tokens([i[0] for i in train_data], keep_simple=True, min_max_freq=(1, float("inf")), topk=100000)

# Step-2: Initialize a model
checker = BertChecker(device="cuda")
checker.from_huggingface(bert_pretrained_name_or_path="distilbert-base-cased", vocab=vocab)

# Step-3: Finetune the model on your dataset
checker.finetune(clean_file=clean_file, corrupt_file=corrupt_file, data_dir=data_dir)
```

You can further evaluate your model on custom data as follows:

```python
from neuspell import BertChecker

checker = BertChecker()
checker.from_pretrained(
    bert_pretrained_name_or_path="distilbert-base-cased",
    ckpt_path=f"{data_dir}/new_models/distilbert-base-cased"  # "<folder where the model is saved>"
)
checker.evaluate(clean_file=clean_file, corrupt_file=corrupt_file, data_dir=data_dir)
```

Following the above usage, one can seamlessly use multilingual models such as xlm-roberta-base, bert-base-multilingual-cased, and distilbert-base-multilingual-cased for non-English scripts.
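As a minimal sketch of that multilingual usage, mirroring the `from_huggingface()` walkthrough above; the file names are hypothetical, and the vocab should be rebuilt from your multilingual training data as in Step-1:

```python
# Hypothetical multilingual fine-tuning; only the model name and files differ
# from the walkthrough above.
from neuspell import BertChecker

checker = BertChecker(device="cuda")
checker.from_huggingface(bert_pretrained_name_or_path="bert-base-multilingual-cased",
                         vocab=vocab)                      # vocab from Step-1, rebuilt on multilingual data
checker.finetune(clean_file="sample_clean_multi.txt",      # hypothetical file
                 corrupt_file="sample_corrupt_multi.txt",  # hypothetical file
                 data_dir=data_dir)                        # data_dir as above
```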
The code for combating adversarial misspellings is available in the ./applications/Adversarial-Misspellings-arxiv folder; see its README.md.

Requirements for the Aspell checker:

```
wget https://files.pythonhosted.org/packages/53/30/d995126fe8c4800f7a9b31aa0e7e5b2896f5f84db4b7513df746b2a286da/aspell-python-py3-1.15.tar.bz2
tar -C . -xvf aspell-python-py3-1.15.tar.bz2
cd aspell-python-py3-1.15
python setup.py install
```
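As a quick sanity check that aspell-python installed correctly (`Speller`, `check()`, and `suggest()` are aspell-python's own API, independent of NeuSpell):

```python
# Verify the aspell-python installation (independent of NeuSpell).
import aspell

speller = aspell.Speller("lang", "en")
print(speller.check("receiving"))   # non-zero for a correctly spelled word
print(speller.suggest("receving"))  # candidate corrections, e.g. ['receiving', ...]
```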
Requirements for the Jamspell checker:

```
sudo apt-get install -y swig3.0
wget -P ./ https://github.com/bakwc/JamSpell-models/raw/master/en.tar.gz
tar xf ./en.tar.gz --directory ./
```
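And a quick check of the Jamspell setup, assuming the jamspell Python package is installed and the archive above extracts to en.bin (this is jamspell's own API, independent of NeuSpell):

```python
# Verify the Jamspell language model (assumes en.tar.gz extracted to en.bin
# and that the jamspell pip package is installed).
import jamspell

corrector = jamspell.TSpellCorrector()
corrector.LoadLangModel("en.bin")
print(corrector.FixFragment("I luk foward to receving your reply"))  # → corrected fragment
```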
If you use this toolkit, please consider citing:

```
@inproceedings{jayanthi-etal-2020-neuspell,
title = "{N}eu{S}pell: A Neural Spelling Correction Toolkit",
author = "Jayanthi, Sai Muralidhar and
Pruthi, Danish and
Neubig, Graham",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
month = oct,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.emnlp-demos.21",
doi = "10.18653/v1/2020.emnlp-demos.21",
pages = "158--164",
abstract = "We introduce NeuSpell, an open-source toolkit for spelling correction in English. Our toolkit comprises ten different models, and benchmarks them on naturally occurring misspellings from multiple sources. We find that many systems do not adequately leverage the context around the misspelt token. To remedy this, (i) we train neural models using spelling errors in context, synthetically constructed by reverse engineering isolated misspellings; and (ii) use richer representations of the context. By training on our synthetic examples, correction rates improve by 9{%} (absolute) compared to the case when models are trained on randomly sampled character perturbations. Using richer contextual representations boosts the correction rate by another 3{%}. Our toolkit enables practitioners to use our proposed and existing spelling correction systems, both via a simple unified command line, as well as a web interface. Among many potential applications, we demonstrate the utility of our spell-checkers in combating adversarial misspellings. The toolkit can be accessed at neuspell.github.io.",
}
```
Link to publication. For any queries or suggestions, please contact the authors at jsaimurali001 [at] gmail [dot] com.