Language modeling is the task of assigning probabilities to sequences of words or other linguistic units (e.g. characters, subwords, sentences, etc.). Language modeling is one of the most important problems in modern natural language processing (NLP) and is used in many NLP applications (e.g. speech recognition, machine translation, text summarization, spell correction, auto-completion, etc.). In the past few years, neural approaches have achieved better results than traditional statistical approaches on many language modeling benchmarks. Moreover, recent work has shown that language model pre-training can improve many NLP tasks in different ways, including feature-based strategies (e.g. ELMo) and fine-tuning strategies (e.g. OpenAI GPT, BERT), or even in a zero-shot setting (e.g. OpenAI GPT-2).

Figure 1: An example of auto-completion powered by language modeling
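As a toy illustration of what it means to assign a probability to a sequence, the sketch below scores a sentence with a made-up bigram model (the vocabulary and probability values are invented purely for this example):

```python
import math

# toy conditional probabilities p(word | previous word); purely illustrative numbers
toy_bigram_probs = {
    ("<s>", "the"): 0.20,
    ("the", "cat"): 0.05,
    ("cat", "sleeps"): 0.10,
    ("sleeps", "</s>"): 0.40,
}

def sequence_log_prob(tokens):
    """Score a sentence with a bigram (Markov) approximation of the chain rule:
    log p(w_1..w_N) ~= sum_k log p(w_k | w_{k-1})."""
    tokens = ["<s>"] + tokens + ["</s>"]
    return sum(math.log(toy_bigram_probs[(prev, cur)])
               for prev, cur in zip(tokens, tokens[1:]))

print(sequence_log_prob(["the", "cat", "sleeps"]))  # log-probability of the full sequence
```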
# convert raw data
python preprocess/convert_data.py --dataset wikipedia --input_dir data/wikipedia/raw --output_dir data/wikipedia/processed --min_seq_len 0 --max_seq_len 512
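The internals of convert_data.py are not shown here; a minimal sketch of the length filtering implied by the --min_seq_len/--max_seq_len flags might look like the following (the file name corpus.txt and whitespace tokenization are assumptions for illustration, not the script's actual behavior):

```python
def filter_sequences(lines, min_seq_len=0, max_seq_len=512):
    """Keep whitespace-tokenized sequences whose length falls inside [min_seq_len, max_seq_len]."""
    for line in lines:
        tokens = line.strip().split()
        if min_seq_len <= len(tokens) <= max_seq_len:
            yield " ".join(tokens)

# hypothetical usage: read raw text, write one filtered sequence per line
with open("data/wikipedia/raw/corpus.txt") as fin, \
     open("data/wikipedia/processed/corpus.txt", "w") as fout:
    for seq in filter_sequences(fin, min_seq_len=0, max_seq_len=512):
        fout.write(seq + "\n")
```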
# prepare vocab & embed files
python prepare_resource.py \
--input_dir data/wikipedia/processed --max_word_size 512 --max_char_size 16 \
--full_embedding_file data/glove/glove.840B.300d.txt --word_embedding_file data/wikipedia/resource/lm.word.embed --word_embed_dim 300 \
--word_vocab_file data/wikipedia/resource/lm.word.vocab --word_vocab_size 100000 \
--char_vocab_file data/wikipedia/resource/lm.char.vocab --char_vocab_size 1000
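Conceptually, this resource-preparation step builds a frequency-sorted vocabulary and keeps only the GloVe vectors for words in that vocabulary; the sketch below illustrates that idea (the function names and file handling are illustrative, not prepare_resource.py's actual API):

```python
from collections import Counter

def build_word_vocab(sequences, vocab_size=100000):
    """Collect the most frequent words as the word vocabulary."""
    counts = Counter(word for seq in sequences for word in seq.split())
    return [word for word, _ in counts.most_common(vocab_size)]

def extract_embeddings(vocab, full_embedding_file, embed_dim=300):
    """Keep only the pre-trained vectors whose word appears in the vocabulary."""
    vocab_set = set(vocab)
    kept = {}
    with open(full_embedding_file, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, vector = parts[0], [float(x) for x in parts[1:]]
            if word in vocab_set and len(vector) == embed_dim:
                kept[word] = vector
    return kept
```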
# run experiment in train + eval mode
python language_model_run.py --mode train_eval --config config/config_lm_template.xxx.json
# run experiment in train only mode
python language_model_run.py --mode train --config config/config_lm_template.xxx.json
# run experiment in eval only mode
python language_model_run.py --mode eval --config config/config_lm_template.xxx.json
# encode text as ELMo vector
python language_model_run.py --mode encode --config config/config_lm_template.xxx.json
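In encode mode the trained bi-directional language model is used to produce ELMo-style vectors. The sketch below follows the general ELMo recipe, a softmax-weighted sum of the biLM layer outputs scaled by a task-specific factor, rather than this repo's exact implementation:

```python
import numpy as np

def elmo_vector(layer_outputs, layer_weights, gamma=1.0):
    """Combine biLM layer outputs [num_layers, seq_len, dim] into one vector per token.

    layer_weights are softmax-normalized scalars and gamma is a task-specific scale,
    as in the ELMo paper: ELMo_k = gamma * sum_j s_j * h_{k,j}.
    """
    s = np.exp(layer_weights) / np.sum(np.exp(layer_weights))  # softmax over layers
    return gamma * np.tensordot(s, layer_outputs, axes=([0], [0]))  # [seq_len, dim]

# hypothetical usage with 3 layers, 5 tokens, 1024-dim representations
layers = np.random.randn(3, 5, 1024)
print(elmo_vector(layers, layer_weights=np.zeros(3)).shape)  # (5, 1024)
```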
# random search hyper-parameters
python hparam_search.py --base-config config/config_lm_template.xxx.json --search-config config/config_search_template.xxx.json --num-group 10 --random-seed 100 --output-dir config/search
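Random hyper-parameter search draws num-group candidate configurations by sampling overrides for a base config; the sketch below shows that idea with a made-up search-space format (the keys learning_rate and dropout are purely illustrative, not the repo's actual config schema):

```python
import copy
import json
import os
import random

def random_search(base_config, search_space, num_group, random_seed, output_dir):
    """Sample num_group configs by overriding base_config with randomly drawn hyper-parameters."""
    random.seed(random_seed)
    os.makedirs(output_dir, exist_ok=True)
    for i in range(num_group):
        config = copy.deepcopy(base_config)
        for name, candidates in search_space.items():
            config[name] = random.choice(candidates)
        with open(f"{output_dir}/config_{i}.json", "w") as f:
            json.dump(config, f, indent=4)

# hypothetical base config and search space
random_search(
    base_config={"learning_rate": 0.001, "dropout": 0.1},
    search_space={"learning_rate": [0.0005, 0.001, 0.002], "dropout": [0.1, 0.2, 0.3]},
    num_group=10, random_seed=100, output_dir="config/search")
```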
# visualize summary via tensorboard
tensorboard --logdir=output

Given a sequence of N tokens $(t_1, t_2, \dots, t_N)$, the bi-directional language model computes the probability of the sequence in the forward direction,

$$p(t_1, t_2, \dots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, t_2, \dots, t_{k-1}),$$

and in the backward direction,

$$p(t_1, t_2, \dots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_{k+1}, t_{k+2}, \dots, t_N).$$
Figure 2: bi-directional language model architecture (source: Generalized Language Models)
The model is trained by jointly minimizing the negative log likelihood of the forward and backward directions,

$$\mathcal{L} = -\sum_{k=1}^{N} \left( \log p(t_k \mid t_1, \dots, t_{k-1}) + \log p(t_k \mid t_{k+1}, \dots, t_N) \right).$$
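As a concrete sketch of that objective (a framework-agnostic NumPy illustration, not the repo's actual training code), the joint loss below sums the per-token negative log likelihoods from the two directions, assuming the model has already produced the probability of each target token:

```python
import numpy as np

def joint_nll(forward_probs, backward_probs):
    """Joint negative log likelihood of a bi-directional language model.

    forward_probs[k]  = p(t_k | t_1, ..., t_{k-1}) from the forward LM
    backward_probs[k] = p(t_k | t_{k+1}, ..., t_N) from the backward LM
    """
    forward_probs = np.asarray(forward_probs)
    backward_probs = np.asarray(backward_probs)
    return -(np.log(forward_probs).sum() + np.log(backward_probs).sum())

# hypothetical per-token probabilities for a 4-token sequence
print(joint_nll([0.2, 0.5, 0.3, 0.6], [0.4, 0.3, 0.5, 0.2]))
```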