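English Part-of-Speech Tagger Library; a Ruby port of Lingua::EN::Tagger
A Ruby port of Perl Lingua::EN::Tagger, a probability-based, corpus-trained tagger that assigns POS tags to English text based on a lookup dictionary and a set of probability values. The tagger assigns appropriate tags based on conditional probabilities: it examines the preceding tag to determine the appropriate tag for the current word. Unknown words are classified according to word morphology, or can be set to be treated as nouns or other parts of speech. The tagger also extracts as many nouns and noun phrases as it can, using a set of regular expressions.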
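To make the idea of tagging by conditional probability concrete, here is a minimal sketch of a greedy tagger that scores each candidate tag by combining a lexical score with the probability of that tag following the previous one. Everything in it is an assumption made for illustration: the `LEXICON` and `TRANSITIONS` numbers are invented, `toy_tag` is a hypothetical name, and this is not the algorithm or data that EngTagger itself ships.

```ruby
# Toy illustration only: all probabilities below are invented and
# are NOT taken from EngTagger's lexicon or tag tables.

# Lexical scores for each candidate tag of a known word.
LEXICON = {
  "the"    => { "det" => 1.0 },
  "can"    => { "md" => 0.7, "nn" => 0.3 },
  "rusted" => { "vbd" => 0.6, "vbn" => 0.4 }
}.freeze

# Score of a tag given the previous tag; missing pairs use SMOOTHING.
TRANSITIONS = {
  "start" => { "det" => 0.5, "nn" => 0.2, "md" => 0.1 },
  "det"   => { "nn" => 0.7, "md" => 0.01 },
  "nn"    => { "vbd" => 0.4, "vbn" => 0.1, "md" => 0.3 }
}.freeze

SMOOTHING = 0.001

# Greedily pick, for each word, the tag that maximizes
# lexical score * transition score from the previously chosen tag.
def toy_tag(words)
  prev = "start"
  words.map do |word|
    # Unknown words are simply treated as nouns in this sketch.
    candidates = LEXICON.fetch(word.downcase) { { "nn" => 1.0 } }
    best, = candidates.max_by do |tag, lexical|
      lexical * TRANSITIONS.fetch(prev, {}).fetch(tag, SMOOTHING)
    end
    prev = best
    [word, best]
  end
end

p toy_tag(%w[can])            # => [["can", "md"]]
p toy_tag(%w[The can rusted]) # => [["The", "det"], ["can", "nn"], ["rusted", "vbd"]]
```

The two calls show why the preceding tag matters: on its own "can" is scored as a modal verb, but after a determiner the noun reading wins.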
require 'engtagger'
# Create a parser object
tgr = EngTagger.new
# Sample text
text = "Alice chased the big fat cat."
# Add part-of-speech tags to text
tagged = tgr.add_tags(text)
#=> "<nnp>Alice</nnp> <vbd>chased</vbd> <det>the</det> <jj>big</jj> <jj>fat</jj><nn>cat</nn> <pp>.</pp>"
# Get a list of all nouns and noun phrases with occurrence counts
word_list = tgr.get_words(text)
#=> {"Alice"=>1, "cat"=>1, "fat cat"=>1, "big fat cat"=>1}
# Get a readable version of the tagged text
readable = tgr.get_readable(text)
#=> "Alice/NNP chased/VBD the/DET big/JJ fat/JJ cat/NN ./PP"
# Get all nouns from a tagged output
nouns = tgr.get_nouns(tagged)
#=> {"cat"=>1, "Alice"=>1}
# Get all proper nouns
proper = tgr.get_proper_nouns(tagged)
#=> {"Alice"=>1}
# Get all past tense verbs
pt_verbs = tgr.get_past_tense_verbs(tagged)
#=> {"chased"=>1}
# Get all the adjectives
adj = tgr.get_adjectives(tagged)
#=> {"big"=>1, "fat"=>1}
# Get all noun phrases of any syntactic level
# (same as word_list but takes a tagged input)
nps = tgr.get_noun_phrases(tagged)
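#=> {"Alice"=>1, "cat"=>1, "fat cat"=>1, "big fat cat"=>1}

The POS tag set used here is a modified version of the Penn Treebank tag set. Tags containing non-letter characters have been redefined to work better in our data structures. Also, the "determiner" tag has been changed from DT to DET in order to avoid confusion with the HTML tag <DT>.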
CC    Conjunction, coordinating                and, or
CD    Adjective, cardinal number               3, fifteen
DET   Determiner                               this, each, some
EX    Pronoun, existential there               there
FW    Foreign words
IN    Preposition / Conjunction                for, of, although, that
JJ    Adjective                                happy, bad
JJR   Adjective, comparative                   happier, worse
JJS   Adjective, superlative                   happiest, worst
LS    Symbol, list item                        A, A.
MD    Verb, modal                              can, could, 'll
NN    Noun                                     aircraft, data
NNP   Noun, proper                             London, Michael
NNPS  Noun, proper, plural                     Australians, Methodists
NNS   Noun, plural                             women, books
PDT   Determiner, prequalifier                 quite, all, half
POS   Possessive                               's, '
PRP   Determiner, possessive second            mine, yours
PRPS  Determiner, possessive                   their, your
RB    Adverb                                   often, not, very, here
RBR   Adverb, comparative                      faster
RBS   Adverb, superlative                      fastest
RP    Adverb, particle                         up, off, out
SYM   Symbol                                   *
TO    Preposition                              to
UH    Interjection                             oh, yes, mmm
VB    Verb, infinitive                         take, live
VBD   Verb, past tense                         took, lived
VBG   Verb, gerund                             taking, living
VBN   Verb, past/passive participle            taken, lived
VBP   Verb, base present form                  take, live
VBZ   Verb, present 3SG -s form                takes, lives
WDT   Determiner, question                     which, whatever
WP    Pronoun, question                        who, whoever
WPS   Determiner, possessive & question        whose
WRB   Adverb, question                         when, how, however
PP    Punctuation, sentence ender              ., !, ?
PPC   Punctuation, comma                       ,
PPD   Punctuation, dollar sign                 $
PPL   Punctuation, quotation mark left         ``
PPR   Punctuation, quotation mark right        ''
PPS   Punctuation, colon, semicolon, ellipsis  :, ..., -
LRB   Punctuation, left bracket                (, {, [
RRB   Punctuation, right bracket               ), }, ]
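The tags in the table can also be used to filter tagged output by hand. The sketch below works purely on the tagged string format shown in the usage example (copied verbatim from it) and needs no gem. The helper names `tagged_pairs`, `words_with_tag`, and `maximal_noun_phrases` are hypothetical, and the noun phrase pattern is a deliberately simplified assumption (adjectives followed by nouns), not the set of regular expressions the library uses internally.

```ruby
# Tagged string copied from the usage example; no gem is required here.
TAGGED = "<nnp>Alice</nnp> <vbd>chased</vbd> <det>the</det> " \
         "<jj>big</jj> <jj>fat</jj><nn>cat</nn> <pp>.</pp>"

# Split tagged output into [word, tag] pairs.
def tagged_pairs(tagged)
  tagged.scan(%r{<(\w+)>([^<]+)</\1>}).map { |tag, word| [word, tag] }
end

# Collect the words whose tag starts with the given prefix,
# so "nn" covers nn, nnp, nns and nnps.
def words_with_tag(tagged, prefix)
  tagged_pairs(tagged).select { |_word, tag| tag.start_with?(prefix) }.map(&:first)
end

# Simplified pattern: any run of adjectives followed by one or more nouns.
NOUN_PHRASE = %r{(?:<jj\w*>[^<]+</jj\w*>\s*)*(?:<nn\w*>[^<]+</nn\w*>\s*)+}

# Return only the longest adjective + noun runs, as plain text.
def maximal_noun_phrases(tagged)
  tagged.scan(NOUN_PHRASE).map do |chunk|
    tagged_pairs(chunk).map(&:first).join(" ")
  end
end

p words_with_tag(TAGGED, "nn")  # => ["Alice", "cat"]
p words_with_tag(TAGGED, "jj")  # => ["big", "fat"]
p maximal_noun_phrases(TAGGED)  # => ["Alice", "big fat cat"]
```

Unlike get_noun_phrases, which also reports the shorter phrases "cat" and "fat cat", this sketch returns only the maximal runs and does not count occurrences.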
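Recommended method (without sudo):
It is recommended to install the engtagger gem in your user environment, without root privileges. This ensures proper file permissions and avoids potential issues. You can achieve this with a Ruby version manager such as rbenv or rvm, which manages both Ruby versions and gems.
To install without sudo, simply run: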
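gem install engtagger
Alternative method (with sudo):
If you must install with sudo, you need to adjust the file permissions afterward to ensure accessibility.
Install the gem with sudo, then change the ownership of the installed files to your user:
sudo gem install engtagger
sudo chown -R $(whoami) /Library/Ruby/Gems/2.6.0/gems/engtagger-0.4.1
Note: The path above assumes Ruby version 2.6.0 and engtagger 0.4.1. If you are using different versions, modify the path accordingly. You can find your Ruby version by running ruby -v.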
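Permission issues:
If you encounter a "cannot load such file" error after installation, it may be due to incorrect file permissions. If you used sudo during installation, make sure you followed the instructions for adjusting permissions.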
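When diagnosing such an error, it can help to ask RubyGems where the gem actually lives instead of guessing the path. The following is a small diagnostic sketch, not part of engtagger: `gem_status` is a hypothetical helper name, and it relies only on the standard RubyGems call `Gem::Specification.find_by_name` and on `File.readable?`. If permissions are so restrictive that RubyGems cannot read the gem's specification, the gem may be reported as not installed.

```ruby
require "rubygems"

# Report where a gem is installed and which of its files the current
# user cannot read. `gem_status` is a hypothetical helper name.
def gem_status(name)
  spec = Gem::Specification.find_by_name(name)
  files = Dir.glob(File.join(spec.gem_dir, "**", "*")).select { |f| File.file?(f) }
  unreadable = files.reject { |f| File.readable?(f) }
  { installed: true, dir: spec.gem_dir, unreadable: unreadable }
rescue Gem::LoadError
  # Raised when RubyGems cannot find the gem for this Ruby installation.
  { installed: false, dir: nil, unreadable: [] }
end

status = gem_status("engtagger")
if !status[:installed]
  puts "engtagger was not found for this Ruby installation"
elsif status[:unreadable].empty?
  puts "all files under #{status[:dir]} are readable"
else
  puts "unreadable files under #{status[:dir]}:"
  puts status[:unreadable]
end
```

The directory printed here is the one to pass to chown if ownership needs fixing, which avoids hard-coding the Ruby and gem version numbers.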
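Yoichiro Hasebe (yohasebe [at] gmail.com)
Many thanks to the collaborators listed in the right column of this GitHub page.
This Ruby library is a direct port of Lingua::EN::Tagger, available at CPAN. Credit for the crucial parts of its algorithm and design therefore goes to Aaron Coburn, the author of the original Perl version.
This library is distributed under the GPL. Please see the LICENSE file.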