Learns an n-gram language model from a given corpus. The corpus should be a text file with a single word per line, containing no spaces.
The learned quantities are the n-gram statistics estimated from this corpus.
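For illustration, here is a minimal sketch of the kind of estimation such a trainer performs, assuming a character bigram model and plain maximum-likelihood counting; the fixture path and the counting scheme are assumptions, not this script's actual implementation:

    from collections import Counter

    # Read the corpus: one word per line, no spaces within a line.
    with open("fixtures/corpus.txt") as f:  # hypothetical fixture name
        words = [line.strip() for line in f if line.strip()]

    # Count character unigrams, bigrams, and bigram histories per word.
    unigrams, bigrams, histories = Counter(), Counter(), Counter()
    for word in words:
        unigrams.update(word)
        bigrams.update(zip(word, word[1:]))
        histories.update(word[:-1])  # every character that starts a bigram

    # Maximum-likelihood estimate of P(b | a).
    def bigram_prob(a, b):
        return bigrams[(a, b)] / histories[a] if histories[a] else 0.0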
Test the script by running it with no arguments:
python3 ngramModelTrainer
Use the -h flag for details on how to use the tool with proper input:
python3 ngramModelTrainer -h
There are a few example inputs in fixtures/.
The output is saved as four MATLAB matrices.
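A minimal sketch of inspecting that output from Python, assuming the matrices are written to a single .mat file readable by scipy; the file name "model.mat" is a placeholder, not the script's actual output name:

    from scipy.io import loadmat

    # Load the saved model; the file name here is hypothetical.
    data = loadmat("model.mat")

    # Print each stored matrix and its shape (keys starting with "__"
    # are loadmat metadata, not model quantities).
    for name, value in data.items():
        if not name.startswith("__"):
            print(name, value.shape)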
An alphabet of acceptable unigrams must be defined. By default, an alphabet of 36 possible letters/digits is used. These are held, in a fixed order, in a Python list called 'alphabet'.
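Purely as an illustration, the default alphabet could look like the sketch below; the actual contents and ordering are defined in the script, and the lowercase-letters-then-digits order assumed here may not match it:

    import string

    # Assumed default: 26 lowercase letters followed by the 10 digits.
    alphabet = list(string.ascii_lowercase) + list(string.digits)
    assert len(alphabet) == 36

    # Map each character to its position, e.g. for indexing rows and
    # columns of the learned matrices (the indexing role is an assumption).
    char_to_index = {c: i for i, c in enumerate(alphabet)}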
Non-'standard' versions of the above alphabet may also be used. These include:

- dutta_extended: a number of extra characters, notably encodings of the characters and punctuation found in the George Washington handwritten document set.
- sophia: polytonic Greek characters.
- dummy: a limited testing set of 3 characters.
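One way such named alphabets could be organized is a simple registry keyed by name, as in this hypothetical sketch; the character sets shown are placeholders, and the real definitions live in the script:

    # Hypothetical registry of named alphabets; real definitions are in the script.
    ALPHABETS = {
        "standard": list("abcdefghijklmnopqrstuvwxyz0123456789"),
        "dummy": list("abc"),  # a limited 3-character testing set
        # "dutta_extended" and "sophia" would be defined similarly, covering the
        # George Washington encodings and polytonic Greek characters respectively.
    }

    def get_alphabet(name="standard"):
        return list(ALPHABETS[name])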