Language modeling. This codebase contains an implementation of the G-LSTM and F-LSTM cells from [1]. It may also contain some ongoing experiments.
This code was forked from https://github.com/rafaljozefowicz/lm and contains the "BIGLSTM" language model baseline from [2].
The current code runs on TensorFlow r1.5 and supports multi-GPU data parallelism using synchronized gradient updates.
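For intuition only, here is a minimal NumPy sketch (not code from this repository) of the two tricks from [1] applied to the concatenated LSTM gate matrix: F-LSTM approximates the large weight matrix by a product of two thin matrices (the inner dimension plays the role of `fact_size`), while G-LSTM splits the input/hidden concatenation into independent groups (the role of `num_of_groups`), i.e. a block-diagonal weight matrix. The sizes assume the `projected_size`/`state_size` values used in the commands further below.

```python
# Illustrative sketch only (NumPy); assumes the usual LSTM gate map
# gates = [x, h] @ W with W of shape [2p, 4n], biases omitted.
import numpy as np

p, n = 1024, 8192                      # projected_size and state_size
x_h = np.random.randn(1, 2 * p)        # concatenated [input, hidden] for one step

# Baseline BIGLSTM: one big gate matrix.
W = np.random.randn(2 * p, 4 * n)
gates = x_h @ W
print("LSTM params:  ", W.size)              # 2p * 4n = 67,108,864

# F-LSTM: factorize W ~= W1 @ W2 with a small inner dimension (cf. fact_size=512).
r = 512
W1, W2 = np.random.randn(2 * p, r), np.random.randn(r, 4 * n)
gates_f = (x_h @ W1) @ W2
print("F-LSTM params:", W1.size + W2.size)   # 2p*r + r*4n = 17,825,792

# G-LSTM: k independent groups (cf. num_of_groups=4), i.e. a block-diagonal W.
k = 4
W_g = [np.random.randn(2 * p // k, 4 * n // k) for _ in range(k)]
gates_g = np.concatenate(
    [x_h[:, i * 2 * p // k:(i + 1) * 2 * p // k] @ W_g[i] for i in range(k)], axis=1)
print("G-LSTM params:", sum(w.size for w in W_g))   # (2p * 4n) / k = 16,777,216
```

Both variants shrink the recurrent gate matrix several-fold, which is consistent with the higher WPS of the F-LSTM/G-LSTM rows in the table below.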
On the One Billion Word benchmark, using 8 GPUs in one DGX-1, BIG G-LSTM G4 was able to achieve a perplexity of 24.29 after 2 weeks of training and 23.36 after 3 weeks.
On 02/06/2018 we found an issue with our experimental setup which makes the perplexity numbers listed in the paper invalid. See the current numbers in the table below.
On a DGX Station, after 1 week of training using all 4 GPUs (Tesla V100) and a batch size of 256 per GPU:
| Model | Perplexity | Steps | WPS (words/sec) |
|---|---|---|---|
| BIGLSTM | 35.1 | ~0.99M | ~33.8K |
| BIG F-LSTM F512 | 36.3 | ~1.67M | ~56.5K |
| BIG G-LSTM G4 | 40.6 | ~1.65M | ~56K |
| BIG G-LSTM G2 | 36 | ~1.37M | ~47.1K |
| BIG G-LSTM G8 | 39.4 | ~1.7M | ~58.5K |
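Here perplexity is the usual per-word perplexity, i.e. the exponential of the average per-token negative log-likelihood; as a quick reference (generic sketch, not repo code):

```python
# Per-word perplexity = exp(mean negative log-likelihood of the target tokens).
import math

def perplexity(token_log_probs):
    """token_log_probs: natural-log probabilities the model assigned to each target token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Three tokens assigned probabilities 0.1, 0.02 and 0.05:
print(perplexity([math.log(0.1), math.log(0.02), math.log(0.05)]))  # ~21.5
```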
Assuming the data directory is in `/raid/okuchaiev/Data/LM1B/1-billion-word-language-modeling-benchmark-r13output/`, execute:
```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3
SECONDS=604800
LOGSUFFIX=FLSTM-F512-1week
python /home/okuchaiev/repos/f-lm/single_lm_train.py --logdir=/raid/okuchaiev/Workspace/LM/GLSTM-G4/$LOGSUFFIX --num_gpus=4 --datadir=/raid/okuchaiev/Data/LM/LM1B/1-billion-word-language-modeling-benchmark-r13output/ --hpconfig run_profiler=False,float16_rnn=False,max_time=$SECONDS,num_steps=20,num_shards=8,num_layers=2,learning_rate=0.2,max_grad_norm=1,keep_prob=0.9,emb_size=1024,projected_size=1024,state_size=8192,num_sampled=8192,batch_size=256,fact_size=512 >> train_$LOGSUFFIX.log 2>&1
```

To evaluate the trained model (`--mode=eval_full`) on a single GPU:

```bash
python /home/okuchaiev/repos/f-lm/single_lm_train.py --logdir=/raid/okuchaiev/Workspace/LM/GLSTM-G4/$LOGSUFFIX --num_gpus=1 --mode=eval_full --datadir=/raid/okuchaiev/Data/LM/LM1B/1-billion-word-language-modeling-benchmark-r13output/ --hpconfig run_profiler=False,float16_rnn=False,max_time=$SECONDS,num_steps=20,num_shards=8,num_layers=2,learning_rate=0.2,max_grad_norm=1,keep_prob=0.9,emb_size=1024,projected_size=1024,state_size=8192,num_sampled=8192,batch_size=1,fact_size=512
```
To use the G-LSTM cell, specify the `num_of_groups` parameter. To use the F-LSTM cell, specify the `fact_size` parameter.
Note that the current data reader may miss some tokens when constructing mini-batches, which can have a minor effect on the final perplexity. For the most accurate results, use `batch_size=1` and `num_steps=1` in evaluation. Thanks to Ciprian for noticing this.
The command accepts an additional argument, `--hpconfig`, which allows various hyper-parameters to be overridden as a comma-separated list of `name=value` pairs (see the commands above for the full set used in these experiments).
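Purely as an illustration of that format (these helpers are hypothetical and not part of this repo), such override strings can be built and parsed like this:

```python
# Hypothetical helpers for the comma-separated name=value format passed to --hpconfig.
import ast

def build_hpconfig(**overrides):
    """Turn keyword overrides into a 'k1=v1,k2=v2' string."""
    return ",".join(f"{k}={v}" for k, v in overrides.items())

def parse_hpconfig(s):
    """Parse a 'k1=v1,k2=v2' string back into a dict, guessing simple value types."""
    result = {}
    for pair in s.split(","):
        key, value = pair.split("=", 1)
        try:
            result[key] = ast.literal_eval(value)  # ints, floats, True/False
        except (ValueError, SyntaxError):
            result[key] = value                    # anything else stays a string
    return result

print(build_hpconfig(num_layers=2, learning_rate=0.2, fact_size=512))
# num_layers=2,learning_rate=0.2,fact_size=512
print(parse_hpconfig("keep_prob=0.9,num_of_groups=4,run_profiler=False"))
# {'keep_prob': 0.9, 'num_of_groups': 4, 'run_profiler': False}
```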
Contact for the forked code and G-LSTM/F-LSTM cells: [email protected]