ngram language model

1.0.0

N-gram Language Model

Python implementation of an N-gram language model with Laplace smoothing and sentence generation.
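As a rough illustration of Laplace (add-lambda) smoothing, the sketch below estimates a smoothed bigram probability from raw counts. The function name `smoothed_prob` and the toy corpus are hypothetical and are not taken from this project's code.

```python
from collections import Counter

def smoothed_prob(ngram, ngram_counts, context_counts, vocab_size, lam=0.01):
    """Add-lambda estimate of P(last word | preceding words)."""
    context = ngram[:-1]
    numerator = ngram_counts[ngram] + lam
    denominator = context_counts[context] + lam * vocab_size
    return numerator / denominator

# Toy corpus, for illustration only.
tokens = "<s> the cat sat </s> <s> the dog sat </s>".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
# Every token except the last one serves as a bigram context.
context_counts = Counter((t,) for t in tokens[:-1])
vocab = sorted(set(tokens))

# lam=1 gives classic add-1 smoothing.
p_seen = smoothed_prob(("the", "cat"), bigram_counts, context_counts, len(vocab), lam=1)
p_unseen = smoothed_prob(("the", "sat"), bigram_counts, context_counts, len(vocab), lam=1)
```

With smoothing, an unseen bigram such as ("the", "sat") receives a small nonzero probability, and the probabilities over the whole vocabulary for a given context still sum to 1.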

Some NLTK functions are used (nltk.ngrams, nltk.FreqDist), but almost everything else is implemented by hand.
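For readers unfamiliar with these two helpers, here is a minimal sketch of what they do. The token list is made up for illustration, and this is not necessarily how the project calls them.

```python
import nltk

# A single padded sentence; the padding convention here is an assumption.
tokens = ["<s>", "<s>", "the", "company", "said", "</s>"]

# nltk.ngrams yields successive n-token tuples from a sequence.
trigrams = list(nltk.ngrams(tokens, 3))

# nltk.FreqDist counts how often each item occurs (it behaves like a Counter).
counts = nltk.FreqDist(trigrams)
```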

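Note: the LanguageModel class expects to be given data that is already tokenized by sentence. If you use the included load_data function, the train.txt and test.txt files should already be processed such that:

punctuation has been removed
each sentence is on its own line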
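One way to get raw text into this shape is sketched below. It is only an assumption about a reasonable preprocessing step: the helper `preprocess` is hypothetical, the sentence splitting is deliberately naive, and lowercasing is inferred from the sample output rather than stated as a requirement.

```python
import re
import string

def preprocess(text):
    """Return one cleaned sentence per line, with punctuation stripped."""
    # Naive split: break after '.', '!' or '?' when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    strip_punct = str.maketrans("", "", string.punctuation)
    cleaned = []
    for sentence in sentences:
        # Lowercasing is an assumption, not a documented requirement.
        words = sentence.translate(strip_punct).lower().split()
        if words:
            cleaned.append(" ".join(words))
    return "\n".join(cleaned)
```

The result can be written to train.txt or test.txt as-is, since each line then holds exactly one punctuation-free sentence.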

See the data/ directory for examples.

Example output for a trigram model trained on data/train.txt and tested against data/test.txt:

 Loading 3-gram model...
Vocabulary size: 23505
Generating sentences...
...
<s> <s> the company said it has agreed to sell its shares in a statement </s> (0.03163)
<s> <s> he said the company also announced measures to boost its domestic economy and could be a long term debt </s> (0.01418)
<s> <s> this is a major trade bill that would be the first quarter of 1987 </s> (0.02182)
...
Model perplexity: 51.555

The number in parentheses beside each generated sentence is the cumulative probability of that sentence occurring.
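To make the link between per-token probabilities, sentence probability, and perplexity concrete, here is a small sketch under stated assumptions. `cond_prob` stands in for whatever conditional probability a model provides; it and the helper names are hypothetical. The code uses a plain chain-rule product and the standard perplexity definition, which is not necessarily how the numbers above were computed.

```python
import math

def sentence_log_prob(tokens, cond_prob, n):
    """Return (log probability, number of predicted tokens) for one sentence."""
    # Assumption: pad with max(1, n-1) start tokens and one end token.
    start = max(1, n - 1)
    padded = ["<s>"] * start + list(tokens) + ["</s>"]
    total = 0.0
    for i in range(start, len(padded)):
        context = tuple(padded[i - (n - 1):i]) if n > 1 else ()
        total += math.log(cond_prob(context, padded[i]))
    return total, len(padded) - start

def perplexity(sentences, cond_prob, n):
    """exp of the negative average log probability per predicted token."""
    log_sum, count = 0.0, 0
    for tokens in sentences:
        log_prob, predicted = sentence_log_prob(tokens, cond_prob, n)
        log_sum += log_prob
        count += predicted
    return math.exp(-log_sum / count)
```

As a sanity check, a model that assigns probability 0.25 to every token has a perplexity of 4 regardless of the test data.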

Usage information:

 usage: N-gram Language Model [-h] --data DATA --n N [--laplace LAPLACE] [--num NUM]

optional arguments:
  -h, --help         show this help message and exit
  --data DATA        Location of the data directory containing train.txt and test.txt
  --n N              Order of N-gram model to create (i.e. 1 for unigram, 2 for bigram, etc.)
  --laplace LAPLACE  Lambda parameter for Laplace smoothing (default is 0.01 -- use 1 for add-1 smoothing)
  --num NUM          Number of sentences to generate (default 10)
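A parser that produces a usage message like the one above could look roughly like this. It is reconstructed from the help text only, so the argument types and the `build_parser` helper are assumptions rather than the project's actual code.

```python
import argparse

def build_parser():
    # Option names, defaults, and help strings are copied from the usage text;
    # the types (str, int, float) are assumptions.
    parser = argparse.ArgumentParser(prog="N-gram Language Model")
    parser.add_argument("--data", type=str, required=True,
                        help="Location of the data directory containing train.txt and test.txt")
    parser.add_argument("--n", type=int, required=True,
                        help="Order of N-gram model to create (i.e. 1 for unigram, 2 for bigram, etc.)")
    parser.add_argument("--laplace", type=float, default=0.01,
                        help="Lambda parameter for Laplace smoothing (default is 0.01 -- use 1 for add-1 smoothing)")
    parser.add_argument("--num", type=int, default=10,
                        help="Number of sentences to generate (default 10)")
    return parser

# Example: a trigram model over the bundled data directory, defaults elsewhere.
args = build_parser().parse_args(["--data", "data/", "--n", "3"])
```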

Originally authored by Josh Loehr and Robin Cosbey, with slight modifications. Last edited Feb. 8, 2018.
