

N-gram Language Model

A Python implementation of an N-gram language model with Laplace smoothing and sentence generation.

Some NLTK functions are used (nltk.ngrams, nltk.FreqDist), but almost everything else is implemented by hand.
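As a rough illustration of what Laplace (add-lambda) smoothing does to n-gram counts, here is a minimal sketch. It is not the project's code: the names ngrams and train_laplace are hypothetical, and collections.Counter with zip stand in for nltk.FreqDist and nltk.ngrams so the example runs without extra dependencies.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples (stand-in for nltk.ngrams)."""
    return list(zip(*(tokens[i:] for i in range(n))))


def train_laplace(sentences, n, laplace=1.0):
    """Count n-grams and return a function giving smoothed probabilities.

    Hypothetical helper: P(w | context) = (count(context, w) + laplace)
                                          / (count(context) + laplace * |V|)
    """
    ngram_counts, context_counts, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        # Pad so the first real word has a full (n-1)-token context.
        tokens = ["<s>"] * max(1, n - 1) + sentence.split() + ["</s>"]
        vocab.update(tokens)
        ngram_counts.update(ngrams(tokens, n))
        if n > 1:
            context_counts.update(ngrams(tokens, n - 1))
        else:
            # Unigrams have an empty context; its count is the token total.
            context_counts[()] += len(tokens)

    def prob(gram):
        gram = tuple(gram)
        numerator = ngram_counts[gram] + laplace
        denominator = context_counts[gram[:-1]] + laplace * len(vocab)
        return numerator / denominator

    return prob


prob = train_laplace(["the cat sat", "the dog sat"], n=2, laplace=1.0)
print(prob(("the", "cat")))  # seen bigram
print(prob(("the", "sat")))  # unseen bigram still gets non-zero mass
```

With laplace set to 1 this is classic add-1 smoothing; smaller values move less probability mass to unseen n-grams.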

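Note: the LanguageModel class expects data that has already been tokenized by sentence. If you use the included load_data function, the train.txt and test.txt files should already be processed so that: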

punctuation is removed
each sentence is on its own line

See the data/ directory for examples.
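If your corpus is raw text, the two requirements above could be met with a small preprocessing step along these lines. This is only a sketch: prepare_lines is a hypothetical helper, not part of the project, the sentence splitter is deliberately naive, and lowercasing is an assumption based on the lowercase example output rather than a stated requirement.

```python
import re
import string


def prepare_lines(text):
    """Turn raw text into punctuation-free sentences, one string per sentence."""
    # Naive split: break after '.', '!' or '?' when followed by whitespace.
    # Abbreviations such as "U.S. trade" will be split incorrectly.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    strip_punct = str.maketrans("", "", string.punctuation)
    lines = []
    for sentence in sentences:
        # Lowercasing is an assumption, not something the text requires.
        words = sentence.translate(strip_punct).lower().split()
        if words:  # skip sentences that were only punctuation
            lines.append(" ".join(words))
    return lines


raw = "The company said it. He agreed, too!"
print("\n".join(prepare_lines(raw)))
```

Writing the returned lines to train.txt and test.txt, one per line, gives files in the shape described above.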

Example output for a trigram model trained on data/train.txt and tested against data/test.txt:

Loading 3-gram model...
Vocabulary size: 23505
Generating sentences...
...
<s> <s> the company said it has agreed to sell its shares in a statement </s> (0.03163)
<s> <s> he said the company also announced measures to boost its domestic economy and could be a long term debt </s> (0.01418)
<s> <s> this is a major trade bill that would be the first quarter of 1987 </s> (0.02182)
...
Model perplexity: 51.555

The numbers in parentheses beside the generated sentences are the cumulative probabilities of those sentences occurring.
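For readers unfamiliar with the two kinds of figures in the output, the sketch below shows the textbook definitions of a sentence's joint probability and of perplexity, computed from per-token conditional probabilities. The function names are hypothetical, and the project may normalize or scale its printed numbers differently; this only illustrates the standard formulas.

```python
import math


def sentence_log_prob(token_probs):
    """Sum of log-probabilities; exp() of this is the joint probability."""
    return sum(math.log(p) for p in token_probs)


def perplexity(token_probs):
    """Standard perplexity: exp of the negative mean log-probability."""
    return math.exp(-sentence_log_prob(token_probs) / len(token_probs))


# Made-up conditional probabilities for a four-token sequence.
probs = [0.25, 0.25, 0.25, 0.25]
print(math.exp(sentence_log_prob(probs)))  # joint probability of the sequence
print(perplexity(probs))                   # inverse of the per-token geometric mean
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, which matters on a full test set.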

Usage info:

usage: N-gram Language Model [-h] --data DATA --n N [--laplace LAPLACE] [--num NUM]

optional arguments:
  -h, --help         show this help message and exit
  --data DATA        Location of the data directory containing train.txt and test.txt
  --n N              Order of N-gram model to create (i.e. 1 for unigram, 2 for bigram, etc.)
  --laplace LAPLACE  Lambda parameter for Laplace smoothing (default is 0.01 -- use 1 for add-1 smoothing)
  --num NUM          Number of sentences to generate (default 10)
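An argparse parser that accepts the options listed above could look roughly like this. It is reconstructed from the usage text alone, so the argument types and the build_parser name are assumptions, not the project's actual code.

```python
import argparse


def build_parser():
    """Build a parser matching the documented flags (types are assumed)."""
    parser = argparse.ArgumentParser(prog="N-gram Language Model")
    parser.add_argument("--data", type=str, required=True,
                        help="Location of the data directory containing "
                             "train.txt and test.txt")
    parser.add_argument("--n", type=int, required=True,
                        help="Order of N-gram model to create "
                             "(i.e. 1 for unigram, 2 for bigram, etc.)")
    parser.add_argument("--laplace", type=float, default=0.01,
                        help="Lambda parameter for Laplace smoothing "
                             "(default is 0.01 -- use 1 for add-1 smoothing)")
    parser.add_argument("--num", type=int, default=10,
                        help="Number of sentences to generate (default 10)")
    return parser


# Equivalent to passing "--data data/ --n 3" on the command line.
args = build_parser().parse_args(["--data", "data/", "--n", "3"])
print(args.data, args.n, args.laplace, args.num)
```

Here --data and --n are required, while --laplace and --num fall back to the documented defaults of 0.01 and 10.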

Originally written by Josh Loehr and Robin Cosbey, with slight modifications. Last edited February 8, 2018.
