ngrams

1.0.0

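What is ngrams?

Ngrams is a simple N-gram index that can learn from a corpus of data and generate random output in the same style. The index and tokenization systems are implemented as interfaces, so you can roll your own solution.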
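
To make the "learn, then generate in the same style" idea concrete, here is a minimal, self-contained sketch of an n-gram index using only the standard library. It is not the library's implementation: the miniIndex type and its learn and babble methods are hypothetical names invented for this illustration, and tokenization is assumed to be a plain whitespace split.

```go
package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// miniIndex is a hypothetical, simplified n-gram index. It maps each
// prefix of n-1 tokens to the tokens that were observed to follow it.
type miniIndex struct {
	n    int
	next map[string][]string
}

// newMiniIndex returns an empty index. An n below 2 is raised to 2 to
// keep the sketch simple.
func newMiniIndex(n int) *miniIndex {
	if n < 2 {
		n = 2
	}
	return &miniIndex{n: n, next: make(map[string][]string)}
}

// learn splits text on whitespace, records every n-gram it contains,
// and returns the number of tokens parsed.
func (m *miniIndex) learn(text string) int {
	tokens := strings.Fields(text)
	for i := 0; i+m.n <= len(tokens); i++ {
		key := strings.Join(tokens[i:i+m.n-1], " ")
		m.next[key] = append(m.next[key], tokens[i+m.n-1])
	}
	return len(tokens)
}

// babble starts from a prefix and keeps appending a follower chosen by
// pick until limit tokens exist or no follower is known for the
// current prefix.
func (m *miniIndex) babble(start string, limit int, pick func(int) int) string {
	out := strings.Fields(start)
	for len(out) >= m.n-1 && len(out) < limit {
		key := strings.Join(out[len(out)-(m.n-1):], " ")
		followers := m.next[key]
		if len(followers) == 0 {
			break
		}
		out = append(out, followers[pick(len(followers))])
	}
	return strings.Join(out, " ")
}

func main() {
	idx := newMiniIndex(3)
	parsed := idx.learn("to be or not to be that is the question")
	fmt.Println("parsed tokens:", parsed)
	// rand.Intn picks a random follower, so the output varies per run.
	fmt.Println(idx.babble("to be", 8, rand.Intn))
}
```

Passing the picker as a function keeps the random choice swappable, which makes the generator easy to test with a deterministic picker.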

Quick Start

You can test ngrams by running the small REST web server in cmd/rest/trigrams.go. There is also a gRPC-proto example in cmd/grpc.

$ git clone https://github.com/mochi-co/ngrams.git
$ cd ngrams

$ go test -cover ./...

$ go build cmd/rest/trigrams.go
or
$ go run cmd/rest/trigrams.go

The web server provides two endpoints:

POST `localhost:8080/learn`

Index a plain-text body of data. Sample training texts can be found in the training directory.

$ curl --data-binary @"training/pride-prejudice.txt" localhost:8080/learn
# {"parsed_tokens":139394}

$ curl -d 'posting a string of text' -H "Content-Type: text/plain" -X POST localhost:8080/learn
# {"parsed_tokens":5}

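GET `localhost:8080/generate[?limit=n]`

Generate random output from the learned ngrams. The limit query parameter can be used to change the number of tokens used to create the output (default 50).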

$ curl localhost:8080/generate
# {
  "body": "They have solaced their wretchedness, however, and had a most conscientious
           and polite young man, that might be able to keep him quiet. The arrival of the
           fugitives. Not so much wickedness existed in the only one daughter will be
           having a daughter married.",
  "limit": 50
}
$ curl "localhost:8080/generate?limit=10"
# {
  "body": "Of its late possessor, she added, so untidy.",
  "limit": 10
}
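
The endpoint contract above can also be exercised from Go. The sketch below is a stand-in, not the server in cmd/rest/trigrams.go: newMux and parseLimit are hypothetical helpers, the handlers return placeholder data instead of calling the index, and falling back to 50 for a non-numeric or non-positive limit is an assumption. It uses httptest so it runs without a real server on port 8080.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strconv"
	"strings"
)

// parseLimit reads the optional limit query value, falling back to the
// documented default of 50 when it is missing, non-numeric, or not
// positive (the last two cases are assumptions of this sketch).
func parseLimit(raw string) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n <= 0 {
		return 50
	}
	return n
}

// newMux wires up stand-in /learn and /generate handlers. A real server
// would call into the ngram index where the placeholders are.
func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/learn", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "POST required", http.StatusMethodNotAllowed)
			return
		}
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		// Placeholder token count: whitespace-separated words.
		count := len(strings.Fields(string(body)))
		json.NewEncoder(w).Encode(map[string]int{"parsed_tokens": count})
	})
	mux.HandleFunc("/generate", func(w http.ResponseWriter, r *http.Request) {
		limit := parseLimit(r.URL.Query().Get("limit"))
		json.NewEncoder(w).Encode(map[string]interface{}{
			"body":  "placeholder output",
			"limit": limit,
		})
	})
	return mux
}

// fetch returns the response body as a string and closes it.
func fetch(resp *http.Response, err error) string {
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	data, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	return string(data)
}

func main() {
	srv := httptest.NewServer(newMux())
	defer srv.Close()

	text := strings.NewReader("posting a string of text")
	fmt.Print(fetch(http.Post(srv.URL+"/learn", "text/plain", text)))
	fmt.Print(fetch(http.Get(srv.URL + "/generate?limit=10")))
}
```

To talk to the real server instead, replace srv.URL with http://localhost:8080 once trigrams.go is running.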

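Basic Usage

An example of using ngrams as a library can be found in cmd/rest/trigrams.go. The trigrams example uses the tokenizers.DefaultWord tokenizer, which parses and formats ngrams based on general Latin-alphabet rules.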

import "github.com/mochi-co/ngrams"

// Initialize a new ngram index for 3-grams (trigrams), with default options.
index := ngrams.NewIndex(3, nil)

// Parse and index ngrams from a dataset.
tokens, err := index.Parse("to be or not to be, that is the question.")

// Generate a random sequence from the indexed ngrams.
out, err := index.Babble("to be", 50)

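Custom Index Initialization

Both the data store and the tokenization mechanism of the indexer can be replaced by satisfying the store and tokenizer interfaces, so the indexer can be adapted for different purposes. The index supports bigrams, trigrams, quadgrams, etc. by changing the n value passed to NewIndex (3 in these examples).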

// Initialize with custom tokenizers and memory stores.
// The DefaultWordTokenizer takes a bool to strip linebreaks in parsed text.
// Options is passed as a pointer here, on the assumption that NewIndex
// takes *Options (it accepts nil for defaults in the basic example).
index := ngrams.NewIndex(3, &ngrams.Options{
	Store:     stores.NewMemoryStore(),
	Tokenizer: tokenizers.NewDefaultWordTokenizer(true),
})
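
To illustrate what changing n does, the following standalone sketch lists the n-grams extracted from the same token sequence for n = 2, 3 and 4. The ngramsOf helper is hypothetical and not part of the library; it only shows the sliding-window idea behind bigrams, trigrams and quadgrams.

```go
package main

import "fmt"

// ngramsOf returns every run of n consecutive tokens. It returns nil
// when n is not positive or is larger than the number of tokens.
func ngramsOf(tokens []string, n int) [][]string {
	if n <= 0 || n > len(tokens) {
		return nil
	}
	out := make([][]string, 0, len(tokens)-n+1)
	for i := 0; i+n <= len(tokens); i++ {
		out = append(out, tokens[i:i+n])
	}
	return out
}

func main() {
	tokens := []string{"to", "be", "or", "not"}
	for n := 2; n <= 4; n++ {
		fmt.Printf("n=%d: %v\n", n, ngramsOf(tokens, n))
	}
	// Prints:
	// n=2: [[to be] [be or] [or not]]
	// n=3: [[to be or] [be or not]]
	// n=4: [[to be or not]]
}
```

A larger n gives each prefix more context, so generated text tends to follow the training data more closely, at the cost of needing more data to find followers.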

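Tokenizers

A tokenizer consists of a Tokenize method, which parses input data into ngram tokens, and a Format method, which pieces them back together in the expected format. By default the library uses the tokenizers.DefaultWord tokenizer, a simple tokenizer for parsing most Latin-based languages (English, French, etc.) into ngram tokens.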

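The Format method makes a best effort to piece together any selected ngram tokens in a grammatically correct manner (or in whatever manner suits the type of tokenization being performed).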
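
As a rough illustration of what "best effort" formatting could look like for word tokens, the sketch below capitalizes the first token, attaches punctuation tokens without a leading space, and closes the sentence with a period if needed. These rules and the formatWords helper are assumptions made for this example; the library's own formatter may behave differently.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// isPunct reports whether a token is a single punctuation mark that
// should attach directly to the preceding word.
func isPunct(tok string) bool {
	switch tok {
	case ",", ".", "!", "?", ";", ":":
		return true
	}
	return false
}

// capitalise upper-cases the first rune of a non-empty string.
func capitalise(s string) string {
	r, size := utf8.DecodeRuneInString(s)
	return string(unicode.ToUpper(r)) + s[size:]
}

// formatWords joins word tokens into a sentence using the assumed
// rules described above. Empty tokens are skipped.
func formatWords(tokens []string) string {
	var b strings.Builder
	for _, tok := range tokens {
		if tok == "" {
			continue
		}
		if b.Len() == 0 {
			tok = capitalise(tok)
		} else if !isPunct(tok) {
			b.WriteByte(' ')
		}
		b.WriteString(tok)
	}
	s := b.String()
	if s != "" && !strings.ContainsAny(s[len(s)-1:], ".!?") {
		s += "."
	}
	return s
}

func main() {
	fmt.Println(formatWords([]string{"to", "be", ",", "or", "not"}))
	// Prints: To be, or not.
}
```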

New tokenizers can be created by satisfying the tokenizers.Tokenizer interface, for example to tokenize CJK datasets or amino acid sequences.