prose

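prose is a natural language processing library (currently English only). It supports tokenization, segmentation, part-of-speech tagging, and named-entity extraction.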

You can find a more detailed summary of the library's performance here: Introducing prose v2.0.0: Bringing NLP to Go.

Installation

$ go get github.com/jdkato/prose/v2

Usage

Contents

Overview
Tokenizing
Segmenting
Tagging
NER

Overview

package main

import (
    "fmt"
    "log"

    "github.com/jdkato/prose/v2"
)

func main() {
    // Create a new document with the default configuration:
    doc, err := prose.NewDocument("Go is an open-source programming language created at Google.")
    if err != nil {
        log.Fatal(err)
    }

    // Iterate over the doc's tokens:
    for _, tok := range doc.Tokens() {
        fmt.Println(tok.Text, tok.Tag, tok.Label)
        // Go NNP B-GPE
        // is VBZ O
        // an DT O
        // ...
    }

    // Iterate over the doc's named-entities:
    for _, ent := range doc.Entities() {
        fmt.Println(ent.Text, ent.Label)
        // Go GPE
        // Google GPE
    }

    // Iterate over the doc's sentences:
    for _, sent := range doc.Sentences() {
        fmt.Println(sent.Text)
        // Go is an open-source programming language created at Google.
    }
}

The document-creation process follows this sequence of steps:

tokenization -> POS tagging -> NE extraction
            \
             segmentation

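Each step can be disabled (assuming no later step requires it) by passing the appropriate functional option. To disable named-entity extraction, for example, you would do the following: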

doc, err := prose.NewDocument(
    "Go is an open-source programming language created at Google.",
    prose.WithExtraction(false))
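
The functional-option mechanism described above can be illustrated with a small, self-contained sketch. Everything in it is a hypothetical stand-in for illustration rather than prose's actual implementation: the `pipeline` type, the lowercase `withSegmentation`, `withTagging`, and `withExtraction` helpers, and the assumed rule that disabling a step also skips the steps that depend on it.

```go
package main

import "fmt"

// pipeline is a hypothetical stand-in for a document's configuration;
// it is not part of prose.
type pipeline struct {
	tokenize, segment, tag, extract bool
}

// option mutates a pipeline before it runs (the functional-option pattern).
type option func(*pipeline)

func withSegmentation(on bool) option { return func(p *pipeline) { p.segment = on } }
func withTagging(on bool) option      { return func(p *pipeline) { p.tag = on } }
func withExtraction(on bool) option   { return func(p *pipeline) { p.extract = on } }

// newPipeline enables every step, then applies the caller's options in order.
func newPipeline(opts ...option) pipeline {
	p := pipeline{tokenize: true, segment: true, tag: true, extract: true}
	for _, opt := range opts {
		opt(&p)
	}
	return p
}

// steps lists the steps that would run. Assumption: a step is skipped
// whenever the step it depends on is disabled.
func (p pipeline) steps() []string {
	var out []string
	if !p.tokenize {
		return out
	}
	out = append(out, "tokenization")
	if p.segment {
		out = append(out, "segmentation")
	}
	if p.tag {
		out = append(out, "POS tagging")
		if p.extract {
			out = append(out, "NE extraction")
		}
	}
	return out
}

func main() {
	fmt.Println(newPipeline().steps())
	fmt.Println(newPipeline(withExtraction(false)).steps())
	fmt.Println(newPipeline(withTagging(false)).steps())
}
```

Because every option is an ordinary function value, callers name only the steps they want to change and the defaults cover the rest.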

Tokenizing

prose includes a tokenizer capable of handling modern text, including the non-word character spans shown below.

Type	Example
Email addresses	`Jane.Doe@example.com`
Hashtags	`#trending`
Mentions	`@jdkato`
URLs	`https://github.com/jdkato/prose`
Emoticons	`:-)`, `>:(`, `o_0`, etc.

package main

import (
    "fmt"
    "log"

    "github.com/jdkato/prose/v2"
)

func main() {
    // Create a new document with the default configuration:
    doc, err := prose.NewDocument("@jdkato, go to http://example.com thanks :).")
    if err != nil {
        log.Fatal(err)
    }

    // Iterate over the doc's tokens:
    for _, tok := range doc.Tokens() {
        fmt.Println(tok.Text, tok.Tag)
        // @jdkato NN
        // , ,
        // go VB
        // to TO
        // http://example.com NN
        // thanks NNS
        // :) SYM
        // . .
    }
}
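
To make the idea of non-word character spans more concrete, here is a rough sketch that labels whitespace-separated chunks as URLs, email addresses, hashtags, or mentions using the standard `regexp` package. The patterns and the `classify` helper are simplified assumptions made up for this illustration; they are not the rules prose's tokenizer uses, and they do not cover emoticons or trailing punctuation.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// patterns is a hypothetical, deliberately simple rule list checked in order.
var patterns = []struct {
	kind string
	re   *regexp.Regexp
}{
	{"url", regexp.MustCompile(`^https?://\S+$`)},
	{"email", regexp.MustCompile(`^[^@\s]+@[^@\s]+\.[^@\s]+$`)},
	{"hashtag", regexp.MustCompile(`^#\w+$`)},
	{"mention", regexp.MustCompile(`^@\w+$`)},
}

// classify returns the kind of the first pattern that matches chunk,
// or "word" when none does.
func classify(chunk string) string {
	for _, p := range patterns {
		if p.re.MatchString(chunk) {
			return p.kind
		}
	}
	return "word"
}

func main() {
	text := "ask @jdkato about #trending at https://github.com/jdkato/prose"
	for _, chunk := range strings.Fields(text) {
		fmt.Println(chunk, classify(chunk))
	}
}
```

Treating such a span as a single unit is what keeps a URL or a mention from being split apart at its punctuation.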

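Segmenting

prose includes one of the most accurate sentence segmenters available, according to the Golden Rules created by the developers of pragmatic_segmenter.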

Name	Language	License	GRS (English)	GRS (Other)	Speed†
Pragmatic Segmenter	Ruby	MIT	98.08% (51/52)	100.00%	3.84 s
prose	Go	MIT	75.00% (39/52)	N/A	0.96 s
TactfulTokenizer	Ruby	GNU GPLv3	65.38% (34/52)	48.57%	46.32 s
OpenNLP	Java	APLv2	59.62% (31/52)	45.71%	1.27 s
Stanford CoreNLP	Java	GNU GPLv3	59.62% (31/52)	31.43%	0.92 s
Splitta	Python	APLv2	55.77% (29/52)	37.14%	N/A
Punkt	Python	APLv2	46.15% (24/52)	48.57%	1.79 s
SRX English	Ruby	GNU GPLv3	30.77% (16/52)	28.57%	6.19 s
Scapel	Ruby	GNU GPLv3	28.85% (15/52)	20.00%	0.13 s

† The original tests were performed on a MacBook Pro 3.7 GHz Quad-Core Intel Xeon E5 running 10.9.5, while prose was timed on a MacBook Pro 2.9 GHz Intel Core i7 running 10.13.3.

package main

import (
    "fmt"
    "strings"

    "github.com/jdkato/prose/v2"
)

func main() {
    // Create a new document with the default configuration:
    doc, _ := prose.NewDocument(strings.Join([]string{
        "I can see Mt. Fuji from here.",
        "St. Michael's Church is on 5th st. near the light."}, " "))

    // Iterate over the doc's sentences:
    sents := doc.Sentences()
    fmt.Println(len(sents)) // 2
    for _, sent := range sents {
        fmt.Println(sent.Text)
        // I can see Mt. Fuji from here.
        // St. Michael's Church is on 5th st. near the light.
    }
}
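
The example above is tricky because "Mt." and "St." end in periods without ending a sentence. The toy splitter below shows one way to handle that case: break after terminal punctuation only when the word is not a known abbreviation and the next word starts with an uppercase letter. The abbreviation list and the `splitSentences` helper are assumptions made up for this sketch; prose's real segmenter follows the much larger Golden Rules set and is not implemented this way.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// abbreviations is a tiny, hypothetical list; a real segmenter needs far more.
var abbreviations = map[string]bool{"mt.": true, "st.": true, "dr.": true, "mr.": true}

// endsSentence reports whether w ends in terminal punctuation and is not
// a known abbreviation.
func endsSentence(w string) bool {
	if abbreviations[strings.ToLower(w)] {
		return false
	}
	return strings.HasSuffix(w, ".") || strings.HasSuffix(w, "!") || strings.HasSuffix(w, "?")
}

// startsUpper reports whether w begins with an uppercase letter.
func startsUpper(w string) bool {
	r, _ := utf8.DecodeRuneInString(w)
	return unicode.IsUpper(r)
}

// splitSentences breaks text after a sentence-ending word that is followed
// by a capitalized word, and always closes the final sentence.
func splitSentences(text string) []string {
	var sents, cur []string
	words := strings.Fields(text)
	for i, w := range words {
		cur = append(cur, w)
		last := i == len(words)-1
		if last || (endsSentence(w) && startsUpper(words[i+1])) {
			sents = append(sents, strings.Join(cur, " "))
			cur = nil
		}
	}
	return sents
}

func main() {
	text := "I can see Mt. Fuji from here. St. Michael's Church is on 5th st. near the light."
	for _, s := range splitSentences(text) {
		fmt.Println(s)
	}
}
```

A heuristic this small fails on many inputs (quotations, initials, and ellipses, for instance), which is why rule suites such as the Golden Rules exist as a benchmark.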

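Tagging

prose includes a tagger based on TextBlob's "fast and accurate" POS tagger. Below is a comparison of its performance against NLTK's implementation of the same tagger on the Treebank corpus: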

Library	Accuracy	5-Run Average (sec)
NLTK	0.893	7.224
`prose`	0.961	2.538

(See scripts/test_model.py for more information.)
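
For readers unfamiliar with the metric, tagging accuracy is conventionally the fraction of tokens whose predicted tag matches the gold-standard tag. The sketch below computes it for two short tag sequences. The `accuracy` helper and the sample tags are invented for illustration; this is not the evaluation script referenced above.

```go
package main

import "fmt"

// accuracy returns the fraction of positions at which the predicted tag
// equals the gold-standard tag. It returns 0 for empty or mismatched input.
func accuracy(gold, pred []string) float64 {
	if len(gold) == 0 || len(gold) != len(pred) {
		return 0
	}
	correct := 0
	for i := range gold {
		if gold[i] == pred[i] {
			correct++
		}
	}
	return float64(correct) / float64(len(gold))
}

func main() {
	// Made-up sample data: three of the four predictions match.
	gold := []string{"NNP", "VBZ", "DT", "NN"}
	pred := []string{"NNP", "VBZ", "DT", "JJ"}
	fmt.Printf("%.3f\n", accuracy(gold, pred)) // 0.750
}
```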

The full list of supported POS tags is given below.

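Tag	Description
`(`	left round bracket
`)`	right round bracket
`,`	comma
`:`	colon
`.`	period
`''`	closing quotation mark
``	opening quotation mark
`#`	number sign
`$`	currency
`CC`	conjunction, coordinating
`CD`	cardinal number
`DT`	determiner
`EX`	existential there
`FW`	foreign word
`IN`	conjunction, subordinating or preposition
`JJ`	adjective
`JJR`	adjective, comparative
`JJS`	adjective, superlative
`LS`	list item marker
`MD`	verb, modal auxiliary
`NN`	noun, singular or mass
`NNP`	noun, proper singular
`NNPS`	noun, proper plural
`NNS`	noun, plural
`PDT`	predeterminer
`POS`	possessive ending
`PRP`	pronoun, personal
`PRP$`	pronoun, possessive
`RB`	adverb
`RBR`	adverb, comparative
`RBS`	adverb, superlative
`RP`	adverb, particle
`SYM`	symbol
`TO`	infinitival to
`UH`	interjection
`VB`	verb, base form
`VBD`	verb, past tense
`VBG`	verb, gerund or present participle
`VBN`	verb, past participle
`VBP`	verb, non-3rd person singular present
`VBZ`	verb, 3rd person singular present
`WDT`	wh-determiner
`WP`	wh-pronoun, personal
`WP$`	wh-pronoun, possessive
`WRB`	wh-adverb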
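
Because related tags share a prefix in this scheme (all noun tags start with `NN`, all verb tags with `VB`), a prefix match is a convenient way to select a whole word class. The sketch below reuses the tags from the tokenizing example's expected output. The `token` struct is a local stand-in that mirrors only the `Text` and `Tag` fields, and `withTagPrefix` is a hypothetical helper, not part of prose.

```go
package main

import (
	"fmt"
	"strings"
)

// token mirrors only the two fields used in the examples (Text, Tag);
// it is a local stand-in, not prose's own token type.
type token struct {
	Text, Tag string
}

// withTagPrefix returns the text of every token whose tag starts with prefix.
func withTagPrefix(toks []token, prefix string) []string {
	var out []string
	for _, t := range toks {
		if strings.HasPrefix(t.Tag, prefix) {
			out = append(out, t.Text)
		}
	}
	return out
}

func main() {
	// Tags copied from the tokenizing example's expected output.
	toks := []token{
		{"@jdkato", "NN"}, {",", ","}, {"go", "VB"}, {"to", "TO"},
		{"http://example.com", "NN"}, {"thanks", "NNS"}, {":)", "SYM"}, {".", "."},
	}
	fmt.Println(withTagPrefix(toks, "NN")) // [@jdkato http://example.com thanks]
	fmt.Println(withTagPrefix(toks, "VB")) // [go]
}
```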

NER

prose v2.0.0 includes a much-improved version of v1.0.0's chunk package, which can identify people (PERSON) and geographical/political entities (GPE) by default.

package main

import (
    "fmt"

    "github.com/jdkato/prose/v2"
)

func main() {
    doc, _ := prose.NewDocument("Lebron James plays basketball in Los Angeles.")
    for _, ent := range doc.Entities() {
        fmt.Println(ent.Text, ent.Label)
        // Lebron James PERSON
        // Los Angeles GPE
    }
}
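
The overview example prints per-token labels such as `B-GPE` and `O`, while `doc.Entities()` returns whole spans such as "Los Angeles". The sketch below shows how labels of that kind are commonly merged into spans. It assumes the standard BIO convention, in which a `B-` label begins an entity, an `I-` label continues it, and `O` marks tokens outside any entity. The `I-` prefix, the per-token labels in `main`, and the `groupEntities` helper are assumptions for illustration, not a description of prose's internals.

```go
package main

import (
	"fmt"
	"strings"
)

// token and entity are local stand-ins, not prose's own types.
type token struct{ Text, Label string }
type entity struct{ Text, Label string }

// groupEntities merges BIO-labeled tokens into entity spans.
func groupEntities(toks []token) []entity {
	var ents []entity
	var words []string
	label := ""
	// flush emits the span collected so far, if any.
	flush := func() {
		if len(words) > 0 {
			ents = append(ents, entity{strings.Join(words, " "), label})
			words = nil
		}
	}
	for _, t := range toks {
		switch {
		case strings.HasPrefix(t.Label, "B-"):
			flush()
			label = t.Label[2:]
			words = []string{t.Text}
		case strings.HasPrefix(t.Label, "I-") && len(words) > 0 && t.Label[2:] == label:
			words = append(words, t.Text)
		default:
			flush()
		}
	}
	flush()
	return ents
}

func main() {
	// Hypothetical labels for the sentence used in the example above.
	toks := []token{
		{"Lebron", "B-PERSON"}, {"James", "I-PERSON"}, {"plays", "O"},
		{"basketball", "O"}, {"in", "O"}, {"Los", "B-GPE"}, {"Angeles", "I-GPE"}, {".", "O"},
	}
	for _, e := range groupEntities(toks) {
		fmt.Println(e.Text, e.Label)
	}
}
```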