This experimental benchmark compares different ways of representing words in a simplified task of building a so-called Language Model.
A Language Model is a way to calculate the probability of a given fragment of text (a chain of words). NLP has traditionally made wide use of various ways of building an LM based on counting N-grams, with smoothing to compensate for the sparsity of N-grams of higher orders. For example, NLTK contains an implementation of the Kneser-Ney method. There are also many implementations that differ in the technical details of storing large volumes of N-gram statistics and in text processing speed, for example https://github.com/kpu/kenlm.
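As a quick illustration of this classic counting-plus-smoothing approach, here is a minimal sketch using NLTK's nltk.lm module (the toy sentences and model order are made up for the example, and a reasonably recent NLTK version is assumed):

```python
from nltk.lm import KneserNeyInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline

# Toy corpus: in practice these would be tokenized sentences from a large corpus.
sentences = [['the', 'cat', 'sat', 'on', 'the', 'mat'],
             ['the', 'dog', 'sat', 'on', 'the', 'rug']]

order = 3
train_ngrams, vocab = padded_everygram_pipeline(order, sentences)

lm = KneserNeyInterpolated(order)
lm.fit(train_ngrams, vocab)

# Probability of a word given its (order-1)-word context.
print(lm.score('sat', ['the', 'cat']))
```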
In the past couple of years, various neural language models (NLMs) have come into widespread use, employing different neural network architectures, including recurrent and convolutional ones. NLMs can be divided into two groups: those working at the level of words (word-aware NLMs) and those working at the level of characters (character-aware NLMs). The second option is especially interesting because it lets the model work at the level of morphemes. This is extremely significant for languages with rich morphology. For some words, hundreds of grammatical forms can potentially exist, as is the case for adjectives in Russian. Even a very large text corpus does not guarantee that every word-form variant will make it into the vocabulary of a word-aware NLM. On the other hand, morphemic structures almost always obey regular rules, so knowing the stem or base form of a word makes it possible to generate its grammatical and derivational variants, and to determine the semantic connection between words sharing the same root, even when a word form is encountered for the first time and no statistics on its usage contexts are available. This last remark is very important for NLP domains with dynamic word formation, first of all all kinds of social media.
Within the framework of the benchmark, all the necessary code has been prepared to study and compare both NLM options, character-aware and word-aware, along with other methods of building an LM.
Let us simplify our Language Model a little and set the task in the following form. There is an N-gram of a pre-selected length (see the ngram_order constant in the DatasetVectorizers module). If it is taken from the text corpus (the path to the UTF-8 corpus file is hard-coded in the _get_corpus_path method of the BaseVectorizer class), then we consider it a valid combination of words and the target value is y = 1. If the N-gram is obtained by randomly replacing one of the words and such a chain does not occur in the corpus, then the target value is y = 0.
Invalid N-grams are generated while parsing the corpus, in the same quantity as valid ones. The result is a balanced dataset, which simplifies the task. And since no manual labeling is needed, it is easy to "play" with parameters such as the number of records in the training dataset.
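A minimal sketch of how such a balanced dataset can be built (function and variable names are illustrative; the actual logic lives in the DatasetVectorizers module):

```python
import random

def build_balanced_ngram_dataset(lines, ngram_order=3):
    """Collect valid n-grams from the corpus (y=1) and corrupt each one
    by replacing a random word to get an invalid n-gram (y=0)."""
    valid = set()
    vocab = set()
    for line in lines:
        words = line.split()          # corpus is already tokenized and lowercased
        vocab.update(words)
        for i in range(len(words) - ngram_order + 1):
            valid.add(tuple(words[i:i + ngram_order]))

    vocab = list(vocab)
    samples = []
    for ngram in valid:
        samples.append((ngram, 1))    # valid chain of words
        # a few attempts to produce a corrupted n-gram not seen in the corpus
        for _ in range(10):
            corrupted = list(ngram)
            corrupted[random.randrange(ngram_order)] = random.choice(vocab)
            corrupted = tuple(corrupted)
            if corrupted not in valid:
                samples.append((corrupted, 0))
                break
    random.shuffle(samples)
    return samples
```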
Thus, a binary classification problem is solved. The classifiers are an implementation of gradient boosting (XGBoost), a neural network implemented with Keras, and several other variants in different programming languages and DL frameworks.
The subject of the study is how the way words are represented in the input matrix X affects classification accuracy.
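A minimal sketch of the XGBoost classification setup on an already-vectorized dataset (the random matrix below is only a stand-in; in the benchmark X is produced by one of the vectorizers described further down, and the hyperparameters are illustrative):

```python
import numpy as np
import xgboost as xgb

# Stand-in for a vectorized dataset: one row per n-gram, y = 1 for valid, 0 for corrupted.
rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 96))   # e.g. three concatenated 32-dim word vectors
y = rng.integers(0, 2, size=10000)

n_train = 8000
dtrain = xgb.DMatrix(X[:n_train], label=y[:n_train])
dtest = xgb.DMatrix(X[n_train:], label=y[n_train:])

params = {'objective': 'binary:logistic', 'eval_metric': 'error',
          'max_depth': 6, 'eta': 0.1}
booster = xgb.train(params, dtrain, num_boost_round=200,
                    evals=[(dtest, 'test')], early_stopping_rounds=20)

acc = np.mean((booster.predict(dtest) > 0.5).astype(int) == y[n_train:])
print('accuracy =', acc)
```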
A detailed analysis of various NLP approaches, including ones for similar tasks, can be found in the article "A Primer on Neural Network Models for Natural Language Processing".
The following representation options are tested with XGBoost:
W2V - a pre-trained Word2Vec model is used, and the word vectors of the N-gram are concatenated into one long vector (a sketch of this follows the list)
W2V_TAGS - an extension of the W2V representation: additional components of the final vector are derived from the morphological tags of the words. The result is something like a factored language model
SDR - sparse distributed representations of words obtained by factorizing the Word2Vec model
RANDOM_BITVECTOR - each word is assigned a random binary vector of fixed length with a given proportion of 0s and 1s
BC - vectors produced by Brown clustering (see the description at https://en.wikipedia.org/wiki/Brown_clustering and the implementation at https://github.com/percyliang/brown-cluster)
chars - each word is encoded as a chain of one-hot character representations
Hashing_trick - the hashing trick is used to encode words with a limited number of index bits (see the description at https://en.wikipedia.org/wiki/Feature_hashing and the implementation at https://radimrehurek.com/gensim/corpora/hashdictionary.html)
AE - word vectors are obtained as the activations of the inner layer of an autoencoder. Training the autoencoder and producing the word vectors is done by the word_autoencoder3.py script in the PyModels/WordAutoencoders folder.
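As an illustration of the W2V option, here is a minimal sketch of concatenating word2vec vectors for an N-gram, assuming a pre-trained model in gensim binary format (the model path and toy samples are made up for the example; the actual logic lives in the dataset vectorizers):

```python
import numpy as np
from gensim.models import KeyedVectors

# Path is illustrative; any pre-trained word2vec model in binary format will do.
wv = KeyedVectors.load_word2vec_format('w2v.model.bin', binary=True)
vec_dim = wv.vector_size

def w2v_features(ngram):
    """Concatenate the word2vec vectors of the N-gram words into one long row;
    out-of-vocabulary words are represented by a zero vector."""
    parts = [wv[w] if w in wv else np.zeros(vec_dim, dtype=np.float32) for w in ngram]
    return np.concatenate(parts)

# Toy labeled samples in the (ngram, y) format described above.
samples = [(('мама', 'мыла', 'раму'), 1), (('мама', 'стол', 'раму'), 0)]
X = np.vstack([w2v_features(ngram) for ngram, _ in samples])
y = np.array([label for _, label in samples])
```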
Two additional representation methods are available for the neural networks; they are converted into dense vector form by an embedding layer (or its equivalent in the library used).
word_indeces - a vocabulary is compiled and each word is assigned a unique integer index. A 3-gram is thus represented as a triple of integers.
char_indeces - an alphabet is compiled and each character is assigned a unique integer index. The chains of character indices are then padded with a filler index to the same length (the maximum word length), and the resulting chain of character indices is returned as the representation of the N-gram. Using an embedding layer (Embedding in Keras or its analogue), the characters are converted into a dense vector form with which the rest of the neural network works. The result is a variant of a character-aware neural language model.
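A minimal sketch of the char_indeces idea on Keras (the alphabet size, word length, and layer sizes below are illustrative, not the values used in wr_keras.py):

```python
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

# Illustrative sizes; the real values come from the corpus alphabet and word lengths.
alphabet_size = 40        # distinct characters + 1 for the padding index 0
max_word_len = 12
ngram_order = 3
seq_len = ngram_order * max_word_len

def encode_word(word, char2index):
    """Turn a word into a fixed-length chain of character indices, padded with 0."""
    idx = [char2index.get(c, 0) for c in word[:max_word_len]]
    return idx + [0] * (max_word_len - len(idx))

model = Sequential()
# The embedding layer maps every character index to a dense vector;
# the rest of the network works with these dense representations.
model.add(Embedding(input_dim=alphabet_size, output_dim=16, input_length=seq_len))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```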
To solve the problem, various machine learning, neural network, and matrix computation libraries were used, including:
Keras (with the Theano backend)
Lasagne (Theano)
nolearn (Theano)
TensorFlow
Python solutions:
PyModels/wr_xgboost.py - solver based on XGBoost
PyModels/wr_catboost.py - CatBoost solver on word indices, using the ability to mark which dataset features are categorical so that the booster handles them itself during training
PyModels/wr_keras.py - solver based on a feed-forward neural network implemented with Keras
PyModels/wr_keras_sdr2.py - a separate solution for testing sparse distributed representations of large dimensionality (from 1024), with batched generation of the training and validation matrices, on Keras
PyModels/wr_lasagne.py - solver based on a feed-forward neural network implemented with Lasagne (Theano)
PyModels/wr_nolearn.py - solver based on a feed-forward neural network implemented with nolearn + Lasagne (Theano)
PyModels/wr_tensorflow.py - solver based on a feed-forward neural network implemented with TensorFlow
C# solutions:
CSharpModels/WithAccordNet/Program.cs - solver based on Accord.NET feed-forward networks (C#, project for VS 2015)
CSharpModels/MyBaseline/Program.cs - solution based on my own vanilla MLP implementation (C#, project for VS 2015)
C++ solutions:
CXXModels/tinydnn_model/tinydnn_model.cpp - MLP solver implemented with the tiny-dnn library (C++, project for VS 2015)
CXXModels/singa_model/alexnet.cc - solver based on a neural network implemented with Apache SINGA (C++, project for VS 2015)
CXXModels/opennn_model/main.cpp - solver based on a neural network implemented with OpenNN (C++, project for VS 2015)
Java solutions:
JavaModels/With4J/src/main/java/WordRepresentationsTest.java - MLP solver implemented with the Deeplearning4j library
Internal classes and tools:
PyModels/CorpusReaders.py - classes for reading lines from text corpora in different formats (zipped | plain txt)
PyModels/DatasetVectorizers.py - dataset vectorizers and a factory for convenient selection
PyModels/store_dataset_file.py - generates the dataset and saves it for the C# and C++ models
For the neural networks, two architectures were implemented with Keras. MLP is a simple feed-forward network with fully connected layers. ConvNet is a network that uses convolutional layers.
Switching between the architectures is done via the NET_ARCH parameter in the wr_keras.py file.
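A minimal sketch of how the two architectures can sit behind one switch, in the spirit of the NET_ARCH parameter (layer sizes and the exact ConvNet layout are illustrative, not the ones in wr_keras.py):

```python
from keras.models import Sequential
from keras.layers import Dense, Dropout, Reshape, Conv1D, GlobalMaxPooling1D

NET_ARCH = 'MLP'   # or 'ConvNet'

def build_model(input_dim, net_arch=NET_ARCH):
    model = Sequential()
    if net_arch == 'MLP':
        # simple feed-forward net with fully connected layers
        model.add(Dense(256, activation='relu', input_dim=input_dim))
        model.add(Dropout(0.2))
        model.add(Dense(128, activation='relu'))
    else:
        # treat the input vector as a 1D sequence and apply convolutional layers
        model.add(Reshape((input_dim, 1), input_shape=(input_dim,)))
        model.add(Conv1D(64, kernel_size=3, activation='relu'))
        model.add(GlobalMaxPooling1D())
        model.add(Dense(128, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model
```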
It can be noted that the convolutional variant often gives a more accurate solution, but the corresponding models have higher variance; simply put, the final accuracy jumps around noticeably across several models trained with the same parameters.
A detailed description of the architecture and numerical results of the experiments can be found here: https://kelijah.livejournal.com/224925.html.
All benchmark variants obtain their N-gram lists from a text file in UTF-8 encoding. It is assumed that splitting the text into words and lowercasing have been done in advance by third-party code, so the scripts simply read lines from this file and split them into words by spaces.
For the convenience of reproducing the experiments, I placed a compressed excerpt of a large corpus, 100 thousand lines in size, in the data folder. This is enough to repeat all the measurements. To avoid bloating the repository with a large file, it is zipped. The BaseVectorizer._load_ngrams method unpacks its contents on the fly, so no manual manipulations are needed.
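A minimal sketch of reading such a zipped UTF-8 corpus line by line without unpacking it to disk (the archive path is illustrative; the real logic is in BaseVectorizer._load_ngrams):

```python
import io
import zipfile

def read_corpus_lines(zip_path):
    """Yield lines of a UTF-8 text corpus straight from a zip archive."""
    with zipfile.ZipFile(zip_path) as z:
        inner_name = z.namelist()[0]   # the single text file inside the archive
        with z.open(inner_name) as f:
            for line in io.TextIOWrapper(f, encoding='utf-8'):
                yield line.strip()

# Usage: the corpus is already tokenized and lowercased, so splitting by spaces is enough.
for line in read_corpus_lines('data/corpus.txt.zip'):
    words = line.split()
```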
In the W2V_TAGS vectorization variant, morphological features are added to the W2V vectors. These features for each word are taken from the word2tags_bin.zip file in the data subdirectory, which is obtained by converting my grammatical dictionary (http://solarix.ru/sql-dictionary-sdk.shtml). The result of the conversion comes to about 165 MB, which is a bit too much for a git repository. Therefore I zipped the resulting grammatical dictionary file, and the corresponding dataset vectorizer class unpacks it on the fly while running.
Please note that because of the homonymy of grammatical forms, many words have more than one set of tags. For example, the word 'bears' can be a singular form in the genitive case or a plural form in the nominative case. When forming the tags for a word, I merge the tags of its homonyms.
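A minimal sketch of this merging (the record format and tag names are made up for the example; the real data comes from word2tags_bin.zip):

```python
from collections import defaultdict

word2tags = defaultdict(set)

def add_record(word, tags):
    """Homonymous readings of the same word form contribute to one merged tag set."""
    word2tags[word].update(tags)

# Two hypothetical grammatical readings of the same word form.
add_record('bears', ['NOUN', 'Number=Sing', 'Case=Gen'])
add_record('bears', ['NOUN', 'Number=Plur', 'Case=Nom'])

print(sorted(word2tags['bears']))   # the merged set carries the tags of both readings
```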
A two-stage process is used to build the SDRs. First we obtain W2V word vectors; a small Python script dumps these vectors into a text file in CSV format.
Then, using the utility from https://github.com/mfaruqui/sparse-coding, the following command is run:
./nonneg.o input_vectors.txt 16 0.2 1e-5 4 out_vecs.txt
This generates the out_vecs.txt file containing the word SDRs with the given parameters: length = 16 × the W2V vector dimension, fill ratio = 0.2. This file is then loaded by the SDR_Vectorizer class.
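A minimal sketch of both stages on the Python side (the model path is illustrative, the exact text format expected by the sparse-coding tool should be checked against its README, and index2word is the older gensim attribute):

```python
import numpy as np
from gensim.models import KeyedVectors

# Stage 1: dump the dense word2vec vectors to a plain text file,
# one "word v1 v2 ..." line per word.
wv = KeyedVectors.load_word2vec_format('w2v.model.bin', binary=True)
with open('input_vectors.txt', 'w', encoding='utf-8') as f:
    for word in wv.index2word:          # index_to_key in gensim 4.x
        f.write(word + ' ' + ' '.join('%f' % x for x in wv[word]) + '\n')

# Stage 2: after running ./nonneg.o as shown above, load the sparse vectors back.
def load_sdr(path):
    word2sdr = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.split()
            word2sdr[parts[0]] = np.array([float(x) for x in parts[1:]])
    return word2sdr

sdr_vectors = load_sdr('out_vecs.txt')
```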
https://habrahabr.ru/post/335838/
http://kelijah.livejournal.com/217608.html
As a sanity check, we memorize the N-grams from the training set and look test-set N-grams up among the memorized ones. Because the training set is small, on the test set this model behaves simply as a random guesser, giving an accuracy of 0.50.
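A minimal sketch of this memorization baseline, using the (ngram, y) sample format from the dataset sketch above:

```python
def memorization_baseline(train_samples, test_samples):
    """Memorize valid training N-grams; predict 1 only for test N-grams seen among them."""
    known = {ngram for ngram, label in train_samples if label == 1}
    correct = sum(1 for ngram, label in test_samples
                  if int(ngram in known) == label)
    return correct / len(test_samples)
```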
The main results table is available at the link given above. The results below were obtained on a set of 1,000,000 3-grams.
For the XGBoost-based solver (wr_xgboost.py), the following results were obtained on a dataset of 1,000,000 trigrams:
W2V_TAGS (word2vec + morphology tags) ==> 0.81
W2V (word2vec, dim = 32) ==> 0.80
SDR (sparse distributed representations, length 512, 20% active units) ==> 0.78
BC (Brown clustering) ==> 0.70
chars (char one-hot encoding) ==> 0.71
RANDOM_BITVECTOR (random bit vectors, 16% of bits set) ==> 0.696
Hashing_trick (hashing trick with 32,000 slots) ==> 0.64
For the solver based on Lasagne MLP:
Accuracy = 0.71

For the solver based on nolearn + Lasagne MLP:
Accuracy = 0.64

For the solver based on tiny-dnn MLP:
Accuracy = 0.58

For the solver based on the Accord.NET feed-forward neural net:
Word2Vec ==> 0.68

For the CatBoost solver on the categorical "word index" features:
Accuracy = 0.50

For the solver based on Apache SINGA MLP:
Accuracy = 0.74
Baseline solution (memorizing N-grams from the training dataset):
Accuracy = 0.50