English Part-of-Speech Tagger Library; a Ruby port of Lingua::EN::Tagger
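A Ruby port of Perl Lingua::EN::Tagger, a probability-based, corpus-trained tagger that assigns POS tags to English text based on a lookup dictionary and a set of probability values. The tagger assigns tags based on conditional probabilities: it examines the preceding tag to determine the appropriate tag for the current word. Unknown words are classified according to word morphology, or can be set to be treated as nouns or other parts of speech. The tagger also extracts as many nouns and noun phrases as it can, using a set of regular expressions.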
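To make the idea of "examining the preceding tag" concrete, here is a minimal greedy sketch in plain Ruby. It is not EngTagger's implementation: the `LEXICON` and `TRANSITIONS` tables, their probabilities, and the `toy_tag` helper are all invented for illustration, whereas the real tagger relies on data trained from a corpus.

```ruby
# Toy illustration only. All names and numbers below are hypothetical
# and are NOT taken from EngTagger or its trained data.

# P(tag | word): a tiny stand-in for the lookup dictionary
LEXICON = {
  "the"   => { "det" => 1.0 },
  "dog"   => { "nn" => 0.9, "vb" => 0.1 },
  "barks" => { "vbz" => 0.6, "nns" => 0.4 }
}.freeze

# P(tag | previous tag): a tiny stand-in for the transition probabilities
TRANSITIONS = {
  "start" => { "det" => 0.5, "nn" => 0.3, "vb" => 0.2 },
  "det"   => { "nn" => 0.8, "nns" => 0.15, "vb" => 0.05 },
  "nn"    => { "vbz" => 0.6, "nns" => 0.2, "nn" => 0.2 }
}.freeze

# Unknown words fall back to "noun" in this sketch
UNKNOWN_TAGS = { "nn" => 1.0 }.freeze

# Greedily pick, for each word, the tag that maximises
# P(tag | word) * P(tag | previous tag).
def toy_tag(words)
  prev = "start"
  words.map do |word|
    candidates = LEXICON.fetch(word.downcase, UNKNOWN_TAGS)
    best_tag, = candidates.max_by do |tag, p_word|
      # Unseen transitions get a small default weight instead of zero
      p_word * TRANSITIONS.fetch(prev, {}).fetch(tag, 0.01)
    end
    prev = best_tag
    [word, best_tag]
  end
end

p toy_tag(%w[The dog barks])
#=> [["The", "det"], ["dog", "nn"], ["barks", "vbz"]]
```

Note how "barks" could be a plural noun or a verb in the toy lexicon; the preceding `nn` tag is what tips the choice toward `vbz`.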
require 'engtagger'
# Create a parser object
tgr = EngTagger.new
# Sample text
text = "Alice chased the big fat cat."
# Add part-of-speech tags to text
tagged = tgr.add_tags(text)
#=> "<nnp>Alice</nnp> <vbd>chased</vbd> <det>the</det> <jj>big</jj> <jj>fat</jj><nn>cat</nn> <pp>.</pp>"
# Get a list of all nouns and noun phrases with occurrence counts
word_list = tgr.get_words(text)
#=> {"Alice"=>1, "cat"=>1, "fat cat"=>1, "big fat cat"=>1}
# Get a readable version of the tagged text
readable = tgr.get_readable(text)
#=> "Alice/NNP chased/VBD the/DET big/JJ fat/JJ cat/NN ./PP"
# Get all nouns from a tagged output
nouns = tgr.get_nouns(tagged)
#=> {"cat"=>1, "Alice"=>1}
# Get all proper nouns
proper = tgr.get_proper_nouns(tagged)
#=> {"Alice"=>1}
# Get all past tense verbs
pt_verbs = tgr.get_past_tense_verbs(tagged)
#=> {"chased"=>1}
# Get all the adjectives
adj = tgr.get_adjectives(tagged)
#=> {"big"=>1, "fat"=>1}
# Get all noun phrases of any syntactic level
# (same as word_list but takes a tagged input)
nps = tgr.get_noun_phrases(tagged)
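#=> {"Alice"=>1, "cat"=>1, "fat cat"=>1, "big fat cat"=>1}
The set of POS tags used here is a modified version of the Penn Treebank tagset. Tags with non-letter characters have been redefined to work better in our data structures. In addition, the "Determiner" tag (DET) has been changed from 'DT' to avoid confusion with the HTML tag <DT>.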
Tag   Description                              Examples
CC    Conjunction, coordinating                and, or
CD    Adjective, cardinal number               3, fifteen
DET   Determiner                               this, each, some
EX    Pronoun, existential there               there
FW    Foreign words
IN    Preposition / Conjunction                for, of, although, that
JJ    Adjective                                happy, bad
JJR   Adjective, comparative                   happier, worse
JJS   Adjective, superlative                   happiest, worst
LS    Symbol, list item                        A, A.
MD    Verb, modal                              can, could, 'll
NN    Noun                                     aircraft, data
NNP   Noun, proper                             London, Michael
NNPS  Noun, proper, plural                     Australians, Methodists
NNS   Noun, plural                             women, books
PDT   Determiner, prequalifier                 quite, all, half
POS   Possessive                               's, '
PRP   Determiner, possessive second            mine, yours
PRPS  Determiner, possessive                   their, your
RB    Adverb                                   often, not, very, here
RBR   Adverb, comparative                      faster
RBS   Adverb, superlative                      fastest
RP    Adverb, particle                         up, off, out
SYM   Symbol                                   *
TO    Preposition                              to
UH    Interjection                             oh, yes, mmm
VB    Verb, infinitive                         take, live
VBD   Verb, past tense                         took, lived
VBG   Verb, gerund                             taking, living
VBN   Verb, past/passive participle            taken, lived
VBP   Verb, base present form                  take, live
VBZ   Verb, present 3SG -s form                takes, lives
WDT   Determiner, question                     which, whatever
WP    Pronoun, question                        who, whoever
WPS   Determiner, possessive & question        whose
WRB   Adverb, question                         when, how, however
PP    Punctuation, sentence ender              ., !, ?
PPC   Punctuation, comma                       ,
PPD   Punctuation, dollar sign                 $
PPL   Punctuation, quotation mark left         ``
PPR   Punctuation, quotation mark right        ''
PPS   Punctuation, colon, semicolon, ellipsis  :, ..., -
LRB   Punctuation, left bracket                (, {, [
RRB   Punctuation, right bracket               ), }, ]
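Because the tags are plain letter-only names wrapped in angle brackets, the tagged string can be post-processed with ordinary string tools. The following sketch uses only the Ruby standard library and the sample output shown earlier in this README; the helper names `tagged_pairs`, `readable_from`, and `words_by_tag` are hypothetical and not part of EngTagger. It assumes the output has exactly the `<tag>word</tag>` shape of that sample.

```ruby
# Tagged string copied from the add_tags example earlier in this README.
SAMPLE_TAGGED = "<nnp>Alice</nnp> <vbd>chased</vbd> <det>the</det> " \
                "<jj>big</jj> <jj>fat</jj><nn>cat</nn> <pp>.</pp>"

# Split tagged text into [word, tag] pairs.
# The backreference \1 requires the closing tag to match the opening tag.
def tagged_pairs(tagged)
  tagged.scan(%r{<(\w+)>([^<]+)</\1>}).map { |tag, word| [word, tag] }
end

# Rebuild the "word/TAG" form from the tagged text.
def readable_from(tagged)
  tagged_pairs(tagged).map { |word, tag| "#{word}/#{tag.upcase}" }.join(" ")
end

# Group words by upper-cased tag, preserving the order of appearance.
def words_by_tag(tagged)
  groups = Hash.new { |hash, key| hash[key] = [] }
  tagged_pairs(tagged).each { |word, tag| groups[tag.upcase] << word }
  groups
end

puts readable_from(SAMPLE_TAGGED)
#=> Alice/NNP chased/VBD the/DET big/JJ fat/JJ cat/NN ./PP

p words_by_tag(SAMPLE_TAGGED)["JJ"]
#=> ["big", "fat"]
```

In practice the library's own `get_readable` and `get_adjectives` methods cover these cases; the sketch is only meant to show how the tag names in the table map onto the tagged output.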
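Recommended approach (without sudo):
It is recommended to install the engtagger gem in your user environment, without root privileges. This ensures proper file permissions and avoids potential issues. You can achieve this by using a Ruby version manager such as rbenv or rvm to manage your Ruby versions and gems.
To install without sudo, simply run:
gem install engtagger
Alternative approach (with sudo):
If you must use sudo for installation, you will need to adjust the file permissions afterward to ensure accessibility.
Install the gem with sudo:
sudo gem install engtagger
Adjust the file permissions:
sudo chown -R $(whoami) /Library/Ruby/Gems/2.6.0/gems/engtagger-0.4.1
Note: The path above assumes you are using Ruby version 2.6.0. If you are using a different version, you will need to modify the path accordingly. You can find your Ruby version by running ruby -v.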
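Rather than guessing the versioned path by hand, you can ask RubyGems where a gem is installed. The sketch below uses `Gem::Specification.find_by_name` from the standard RubyGems API; the `gem_dir_for` helper is a hypothetical name introduced here for illustration.

```ruby
require "rubygems"

# Return the installation directory of a gem,
# or nil when the gem is not installed for the current Ruby.
def gem_dir_for(name)
  Gem::Specification.find_by_name(name).gem_dir
rescue Gem::LoadError
  # Raised (as Gem::MissingSpecError on recent RubyGems) when no such gem exists
  nil
end

dir = gem_dir_for("engtagger")
if dir
  puts "engtagger is installed at: #{dir}"
else
  puts "engtagger is not installed for Ruby #{RUBY_VERSION}"
end
```

The printed directory is the one to pass to chown if the gem was installed with sudo.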
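Permission issues:
If you encounter a "cannot load such file" error after installation, it may be due to incorrect file permissions. If you used sudo during installation, make sure you followed the instructions for adjusting permissions.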
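One way to check whether permissions are the cause is to list the files under the gem directory that the current user cannot read. This is a minimal sketch using Ruby's standard `find` library; `unreadable_paths` is a hypothetical helper, and the directory you pass to it is whatever location your installation actually uses.

```ruby
require "find"

# Walk a directory tree and collect every path
# that the current user is not allowed to read.
def unreadable_paths(root)
  unreadable = []
  Find.find(root) do |path|
    unreadable << path unless File.readable?(path)
  end
  unreadable
end
```

Calling `unreadable_paths` with the gem's installation directory should return an empty array once the permissions are correct. A non-empty result points at the files that still need their ownership or mode adjusted.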
Author: Yoichiro Hasebe (yohasebe [at] gmail.com)
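Many thanks to the collaborators listed in the right column of this GitHub page.
This Ruby library is a direct port of Lingua::EN::Tagger, available on CPAN. Credit for the crucial parts of its algorithm and design therefore goes to Aaron Coburn, the author of the original Perl version.
This library is distributed under the GPL. Please see the LICENSE file.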