JamSpell is a spell checking library with the following features:

- accurate - it considers the word's surroundings (context) for better correction
- fast - about 5K words per second (see the benchmarks below)
- multi-language - it's written in C++ and available for many languages via swig bindings
Colab example
jamspell.com - check out the new jamspell version with the following features:

- 13 languages: en, ru, de, fr, it, es, tr, uk, pl, nl, pt, hi, no
- Java, C#, and Ruby support

## Benchmarks

|          | Errors | Top 7 Errors | Fix Rate | Top 7 Fix Rate | Broken | Speed (words/second) |
| -------- | ------ | ------------ | -------- | -------------- | ------ | -------------------- |
| JamSpell | 3.25% | 1.27% | 79.53% | 84.10% | 0.64% | 4854 |
| Norvig | 7.62% | 5.00% | 46.58% | 66.51% | 0.69% | 395 |
| Hunspell | 13.10% | 10.33% | 47.52% | 68.56% | 7.14% | 163 |
| Dummy | 13.14% | 13.14% | 0.00% | 0.00% | 0.00% | - |
The model was trained on 300K Wikipedia sentences + 300K news sentences (English). 95% of the data was used for training and 5% for evaluation. An error model was used to generate text with errors from the original text. The JamSpell corrector was compared with Norvig's corrector, Hunspell, and a dummy corrector (no corrections at all).

We used the following metrics:

- Errors - percentage of words with errors after the spell checker processed the text
- Top 7 Errors - the same, but a word counts as correct if the right variant is among the top 7 candidates
- Fix Rate - percentage of misspelled words that were fixed
- Top 7 Fix Rate - the same, but a word counts as fixed if the right variant is among the top 7 candidates
- Broken - percentage of originally correct words that the spell checker broke
- Speed - words processed per second
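For illustration only, the table's main columns could be computed from three aligned word lists as below. This is not the project's actual evaluation code (that lives in evaluate/evaluate.py); the inputs and the exact denominators are assumptions:

```python
# Sketch only: compute the "Errors", "Fix Rate" and "Broken" columns from
# aligned word lists. Not JamSpell's evaluation script (see evaluate/evaluate.py).
def benchmark_metrics(originals, corrupted, fixed):
    assert len(originals) == len(corrupted) == len(fixed)
    total = len(originals)
    errored = sum(o != c for o, c in zip(originals, corrupted))
    correct = total - errored

    errors_after = sum(o != f for o, f in zip(originals, fixed))
    fixed_ok = sum(o != c and o == f
                   for o, c, f in zip(originals, corrupted, fixed))
    broken = sum(o == c and o != f
                 for o, c, f in zip(originals, corrupted, fixed))

    return {
        'errors': errors_after / total,                       # "Errors"
        'fix_rate': fixed_ok / errored if errored else 0.0,   # "Fix Rate"
        'broken': broken / correct if correct else 0.0,       # "Broken"
    }

# benchmark_metrics(['the', 'best'], ['the', 'begt'], ['the', 'best'])
# -> {'errors': 0.0, 'fix_rate': 1.0, 'broken': 0.0}
```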
To ensure that the model is not overfitted to the Wikipedia+news domain, we also checked it on the text of "The Adventures of Sherlock Holmes":
|          | Errors | Top 7 Errors | Fix Rate | Top 7 Fix Rate | Broken | Speed (words/second) |
| -------- | ------ | ------------ | -------- | -------------- | ------ | -------------------- |
| JamSpell | 3.56% | 1.27% | 72.03% | 79.73% | 0.50% | 5524 |
| Norvig | 7.60% | 5.30% | 35.43% | 56.06% | 0.45% | 647 |
| Hunspell | 9.36% | 6.44% | 39.61% | 65.77% | 2.95% | 284 |
| Dummy | 11.16% | 11.16% | 0.00% | 0.00% | 0.00% | - |
More details on reproducing these benchmarks are available in the "Train" section.
## Usage

### Python

Install swig3 (it is usually available in your distro's package manager).

Install jamspell:

```
pip install jamspell
```

Download or train a language model.

Use it:
```python
import jamspell

corrector = jamspell.TSpellCorrector()
corrector.LoadLangModel('en.bin')

corrector.FixFragment('I am the begt spell cherken!')
# u'I am the best spell checker!'

corrector.GetCandidates(['i', 'am', 'the', 'begt', 'spell', 'cherken'], 3)
# (u'best', u'beat', u'belt', u'bet', u'bent', ... )

corrector.GetCandidates(['i', 'am', 'the', 'begt', 'spell', 'cherken'], 5)
# (u'checker', u'chicken', u'checked', u'wherein', u'coherent', ...)
```
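To spell check a larger amount of text, for example a whole file, you can feed fragments to FixFragment one by one. A minimal sketch, assuming an en.bin model and hypothetical input.txt / output.txt file names:

```python
# Sketch only: batch-correct a text file line by line.
# 'en.bin', 'input.txt' and 'output.txt' are placeholder names.
import jamspell

corrector = jamspell.TSpellCorrector()
corrector.LoadLangModel('en.bin')

with open('input.txt', encoding='utf-8') as src, \
     open('output.txt', 'w', encoding='utf-8') as dst:
    for line in src:
        # FixFragment corrects the whole fragment, using word context
        dst.write(corrector.FixFragment(line.rstrip('\n')) + '\n')
```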
### C++

Add the jamspell and contrib dirs to your project.

Use it:
```cpp
#include <jamspell/spell_corrector.hpp>

int main(int argc, const char** argv) {
    NJamSpell::TSpellCorrector corrector;
    corrector.LoadLangModel("model.bin");

    corrector.FixFragment(L"I am the begt spell cherken!");
    // "I am the best spell checker!"

    corrector.GetCandidates({L"i", L"am", L"the", L"begt", L"spell", L"cherken"}, 3);
    // "best", "beat", "belt", "bet", "bent", ...

    corrector.GetCandidates({L"i", L"am", L"the", L"begt", L"spell", L"cherken"}, 5);
    // "checker", "chicken", "checked", "wherein", "coherent", ...

    return 0;
}
```

### Other languages

You can generate bindings for other languages using the swig tutorial. The swig interface file is jamspell.i. Pull requests with build scripts are welcome.
### HTTP API

Install cmake.

Clone and build jamspell (it includes an HTTP server):
```
git clone https://github.com/bakwc/JamSpell.git
cd JamSpell
mkdir build
cd build
cmake ..
make
```

Run the server, passing a language model, host, and port:

```
./web_server/web_server en.bin localhost 8080
```

Fix text with a GET request:

```
$ curl "http://localhost:8080/fix?text=I am the begt spell cherken"
I am the best spell checker
```

Or with a POST request:

```
$ curl -d "I am the begt spell cherken" http://localhost:8080/fix
I am the best spell checker
```

Get correction candidates:

```
curl "http://localhost:8080/candidates?text=I am the begt spell cherken"
# or
curl -d "I am the begt spell cherken" http://localhost:8080/candidates
```

The response looks like this:

```json
{
"results": [
{
"candidates": [
"best",
"beat",
"belt",
"bet",
"bent",
"beet",
"beit"
],
"len": 4,
"pos_from": 9
},
{
"candidates": [
"checker",
"chicken",
"checked",
"wherein",
"coherent",
"cheered",
"cherokee"
],
"len": 7,
"pos_from": 20
}
]
}
```

Here `pos_from` is the position of the first letter of the misspelled word, and `len` is its length.
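As an illustration of consuming this response, here is a minimal client sketch. It assumes the server above is running on localhost:8080 and uses the third-party requests library, which is not part of JamSpell:

```python
# Sketch only: query the /candidates endpoint and report misspelled spans,
# using 'pos_from' and 'len' to locate each word in the original text.
import requests

text = "I am the begt spell cherken"
resp = requests.get("http://localhost:8080/candidates", params={"text": text})

for result in resp.json()["results"]:
    start = result["pos_from"]      # first letter of the misspelled word
    end = start + result["len"]     # 'len' is the misspelled word's length
    print(text[start:end], "->", result["candidates"][:3])
```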
## Train

To train a custom model you need to:
Install cmake.

Clone and build jamspell:

```
git clone https://github.com/bakwc/JamSpell.git
cd JamSpell
mkdir build
cd build
cmake ..
make
```

Prepare a UTF-8 text file with sentences to train on (e.g. sherlockholmes.txt) and another file with the language's alphabet (e.g. alphabet_en.txt).
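The alphabet file lists the letters of the language. For English it would presumably contain the lowercase Latin letters, something like the line below; this is an assumption, so check test_data/alphabet_en.txt in the repository for the exact format:

```
abcdefghijklmnopqrstuvwxyz
```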
Train the model (a quick way to sanity-check the result is sketched at the end of this section):

```
./main/jamspell train ../test_data/alphabet_en.txt ../test_data/sherlockholmes.txt model_sherlock.bin
```

To evaluate a model you can use the evaluate/evaluate.py script:

```
python evaluate/evaluate.py -a alphabet_file.txt -jsp your_model.bin -mx 50000 your_test_data.txt
```

You can use evaluate/generate_dataset.py to generate your train/test data. It supports txt files, the Leipzig Corpora Collection format, and fb2 books.

## Download models

Here are a few simple models. They were trained on 300K news + 300K Wikipedia sentences. We strongly recommend training your own model on at least a few million sentences to achieve better quality. See the "Train" section above.
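Once training finishes, a quick sanity check is to load the new model through the Python bindings and try a correction. A sketch, where model_sherlock.bin is the file produced by the train command above (the exact corrections will depend on the training corpus):

```python
# Sketch only: load the freshly trained model and try one correction.
# model_sherlock.bin is the file produced by the train command above.
import jamspell

corrector = jamspell.TSpellCorrector()
corrector.LoadLangModel('model_sherlock.bin')
print(corrector.FixFragment('I am the begt spell cherken!'))
```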