soynlp is a pure-Python library for Korean text analysis. It takes an unsupervised approach: it can discover words in data without labeled training data, segment sentences into word sequences, and determine parts of speech.
The WordExtractor and NounExtractor provided by soynlp work from statistical information learned over many documents. Because unsupervised approaches extract words from statistical patterns, they work better on a reasonably large set of homogeneous documents than on a single sentence or document. It is best to train the extractors on collections that share vocabulary, such as movie comments or one day's news articles. Extraction quality drops when heterogeneous document collections are mixed together.
Up to soynlp 0.0.46 there was no consistent rule for parameter names expressing a minimum or maximum value, such as min_score, minimum_score, and l_len_min. This may confuse users who have set these parameters directly in existing code, but the variable names have now been cleaned up to reduce confusion later on.
From 0.0.47 on, variable names meaning minimum or maximum are abbreviated to min and max, followed by the name of the item the threshold applies to. Parameter names follow the pattern {min, max}_{noun, word}_{score, threshold}; the item may be omitted when it is obvious.
soynlp does a lot of substring counting. Parameters related to frequency are unified as frequency, not count.
index and idx are unified as idx.
num and n, both meaning a number, are unified as num.
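For example, the WordExtractor constructor used later in this document already follows this convention; the sketch below simply annotates the pattern:

from soynlp.word import WordExtractor

# {min}_{item} naming: frequency rather than count, and the thresholded
# item spelled out after min
word_extractor = WordExtractor(
    min_frequency=100,                # frequency, not count
    min_cohesion_forward=0.05,        # min + item (forward cohesion)
    min_right_branching_entropy=0.0   # min + item (right branching entropy)
)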
$ pip install soynlp

Various attempts at noun extraction have produced three versions: v1, news, and v2. v2 performs best.
WordExtractor learns word boundary scores from statistics but does not determine the part of speech of each word. Sometimes you need to know the part of speech, and new words appear more often as nouns than as any other part of speech. Certain characters, such as the josa -은 and -에, often appear on the right side of nouns. By examining the distribution of characters appearing to the right of the left-side substrings of eojeols (space-separated units), you can decide whether a substring is a noun. soynlp provides two kinds of noun extractors. Both are still under development, so it is hard to say one is strictly better, but NewsNounExtractor contains more features. Eventually the noun extractors will be consolidated into a single class.
from soynlp.noun import LRNounExtractor

noun_extractor = LRNounExtractor()
nouns = noun_extractor.train_extract(sentences)  # list of str like
from soynlp.noun import NewsNounExtractor

noun_extractor = NewsNounExtractor()
nouns = noun_extractor.train_extract(sentences)  # list of str like

Below are examples of nouns learned from news articles of 2016-10-20.
덴마크 웃돈 너무너무너무 가락동 매뉴얼 지도교수
전망치 강구 언니들 신산업 기뢰전 노스
할리우드 플라자 불법조업 월스트리트저널 2022년 불허
고씨 어플 1987년 불씨 적기 레스
스퀘어 충당금 건축물 뉴질랜드 사각 하나씩
근대 투자주체별 4위 태권 네트웍스 모바일게임
연동 런칭 만성 손질 제작법 현실화
오해영 심사위원들 단점 부장조리 차관급 게시물
인터폰 원화 단기간 편곡 무산 외국인들
세무조사 석유화학 워킹 원피스 서장 공범
More detailed explanations are in the tutorial.
soynlp >= 0.0.46 provides noun extractor version 2. It improves the noun extraction accuracy and compound-noun recognition of the previous version and fixes errors in the output information. Usage is similar to version 1.
from soynlp.utils import DoublespaceLineCorpus
from soynlp.noun import LRNounExtractor_v2

corpus_path = '2016-10-20-news'
sents = DoublespaceLineCorpus(corpus_path, iter_sent=True)

noun_extractor = LRNounExtractor_v2(verbose=True)
nouns = noun_extractor.train_extract(sents)

The extracted nouns are returned as a {str: namedtuple} dict.
print(nouns['뉴스'])  # NounScore(frequency=4319, score=1.0)

_compounds_components stores the single nouns that make up each compound noun. Words such as '대한민국' and '녹색성장' are actually compound forms, but because they are used like single nouns they are recognized as single nouns.
list(noun_extractor._compounds_components.items())[:5]

# [('잠수함발사탄도미사일', ('잠수함', '발사', '탄도미사일')),
#  ('미사일대응능력위원회', ('미사일', '대응', '능력', '위원회')),
#  ('글로벌녹색성장연구소', ('글로벌', '녹색성장', '연구소')),
#  ('시카고옵션거래소', ('시카고', '옵션', '거래소')),
#  ('대한민국특수임무유공', ('대한민국', '특수', '임무', '유공'))]

lrgraph stores the L-R structure of the eojeols in the training corpus. You can inspect it with get_r and get_l.
noun_extractor.lrgraph.get_r('아이오아이')
# [('', 123),
# ('의', 47),
# ('는', 40),
# ('와', 18),
# ('가', 18),
# ('에', 7),
# ('에게', 6),
# ('까지', 2),
# ('랑', 2),
#  ('부터', 1)]

More details are in tutorial 2.
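get_l works in the mirror direction: given an R part, it returns the (L part, frequency) pairs observed to its left. A sketch reusing the josa '까지' from the output above; the actual output depends on the training corpus:

# List the L parts observed to the left of '까지'
noun_extractor.lrgraph.get_l('까지')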
Entertainment news from October 2016 contains words such as '트와이스' and '아이오아이'. A model trained earlier has never seen these words. Because new words are created all the time, there is an out-of-vocabulary (OOV) problem: words absent from the training data are not recognized. However, a person reading several entertainment news articles written at that time can learn that sequences such as '트와이스' and '아이오아이' are words. If we define a word as a sequence of characters that appears frequently in a document collection, we can extract words using statistics. There are several ways to learn word boundaries from statistics; soynlp provides cohesion score, branching entropy, and accessor variety.
from soynlp.word import WordExtractor

word_extractor = WordExtractor(
    min_frequency=100,
    min_cohesion_forward=0.05,
    min_right_branching_entropy=0.0
)
word_extractor.train(sentences)  # list of str like
words = word_extractor.extract()

words is a {str: Scores} dict, where Scores is a namedtuple of word scores.
words['아이오아이']

Scores(cohesion_forward=0.30063636035733476,
       cohesion_backward=0,
       left_branching_entropy=0,
       right_branching_entropy=0,
       left_accessor_variety=0,
       right_accessor_variety=0,
       leftside_frequency=270,
       rightside_frequency=0)

Below are words learned from news articles of 2016-10-26, sorted by word score (cohesion * branching entropy). A sketch of computing this ranking follows the table.
word (frequency, cohesion, branching entropy)
촬영 (2222, 1.000, 1.823)
서울 (25507, 0.657, 2.241)
들어 (3906, 0.534, 2.262)
롯데 (1973, 0.999, 1.542)
한국 (9904, 0.286, 2.729)
북한 (4954, 0.766, 1.729)
투자 (4549, 0.630, 1.889)
떨어 (1453, 0.817, 1.515)
진행 (8123, 0.516, 1.970)
얘기 (1157, 0.970, 1.328)
운영 (4537, 0.592, 1.768)
프로그램 (2738, 0.719, 1.527)
클린턴 (2361, 0.751, 1.420)
뛰어 (927, 0.831, 1.298)
드라마 (2375, 0.609, 1.606)
우리 (7458, 0.470, 1.827)
준비 (1736, 0.639, 1.513)
루이 (1284, 0.743, 1.354)
트럼프 (3565, 0.712, 1.355)
생각 (3963, 0.335, 2.024)
팬들 (999, 0.626, 1.341)
산업 (2203, 0.403, 1.769)
10 (18164, 0.256, 2.210)
확인 (3575, 0.306, 2.016)
필요 (3428, 0.635, 1.279)
문제 (4737, 0.364, 1.808)
혐의 (2357, 0.962, 0.830)
평가 (2749, 0.362, 1.787)
20 (59317, 0.667, 1.171)
스포츠 (3422, 0.428, 1.604)
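A minimal sketch of producing such a ranking from the words dict above, assuming the combined score is cohesion_forward multiplied by right_branching_entropy:

# Rank words by cohesion * branching entropy
word_score = {
    word: score.cohesion_forward * score.right_branching_entropy
    for word, score in words.items()
}

# Print the top 30 as: word (frequency, cohesion, branching entropy)
for word, _ in sorted(word_score.items(), key=lambda x: -x[1])[:30]:
    s = words[word]
    print('%s (%d, %.3f, %.3f)' % (
        word, s.leftside_frequency, s.cohesion_forward, s.right_branching_entropy))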
More details are in the word extraction tutorial. The current version provides cohesion score, branching entropy, and accessor variety.
Once word scores have been learned with WordExtractor, you can use them to segment a sentence into a word sequence along word boundaries. soynlp provides three tokenizers. If spacing is well observed, LTokenizer can be used. A Korean eojeol has an "L + [R]" structure, such as "noun + josa". The L part can be a noun, verb, adjective, or adverb; once the L part of an eojeol is recognized, the remainder is the R part. LTokenizer takes the word scores of the L parts.
from soynlp.tokenizer import LTokenizer

scores = {'데이': 0.5, '데이터': 0.5, '데이터마이닝': 0.5, '공부': 0.5, '공부중': 0.45}
tokenizer = LTokenizer(scores=scores)

sent = '데이터마이닝을 공부중이다'

print(tokenizer.tokenize(sent, flatten=False))
# [['데이터마이닝', '을'], ['공부', '중이다']]

print(tokenizer.tokenize(sent))
# ['데이터마이닝', '을', '공부', '중이다']

If you have computed word scores with WordExtractor, you can build the scores dict from one of its score fields. The example below uses only the forward cohesion score; other combined word scores can be defined and used as well.
from soynlp.word import WordExtractor
from soynlp.utils import DoublespaceLineCorpus
from soynlp.tokenizer import LTokenizer

file_path = 'your file path'
corpus = DoublespaceLineCorpus(file_path, iter_sent=True)

word_extractor = WordExtractor(
    min_frequency=100,  # example
    min_cohesion_forward=0.05,
    min_right_branching_entropy=0.0
)
word_extractor.train(corpus)
words = word_extractor.extract()

cohesion_score = {word: score.cohesion_forward for word, score in words.items()}
tokenizer = LTokenizer(scores=cohesion_score)

You can also combine the noun scores from the noun extractor with cohesion. For example, to use "cohesion score + noun score" as the word score:
from soynlp.noun import LRNounExtractor_v2

noun_extractor = LRNounExtractor_v2()
nouns = noun_extractor.train_extract(corpus)  # list of str like
noun_scores = {noun: score.score for noun, score in nouns.items()}

combined_scores = {noun: score + cohesion_score.get(noun, 0)
                   for noun, score in noun_scores.items()}
combined_scores.update(
    {subword: cohesion for subword, cohesion in cohesion_score.items()
     if not (subword in combined_scores)}
)

tokenizer = LTokenizer(scores=combined_scores)

If spacing is not properly observed, you cannot assume that each space-separated unit of a sentence has the L + [R] structure. People read such sentences by recognizing familiar words first. MaxScoreTokenizer, which models this process, also uses word scores.
from soynlp.tokenizer import MaxScoreTokenizer

scores = {'파스': 0.3, '파스타': 0.7, '좋아요': 0.2, '좋아': 0.5}
tokenizer = MaxScoreTokenizer(scores=scores)

print(tokenizer.tokenize('난파스타가좋아요'))
# ['난', '파스타', '가', '좋아', '요']

print(tokenizer.tokenize('난파스타가 좋아요', flatten=False))
# [[('난', 0, 1, 0.0, 1), ('파스타', 1, 4, 0.7, 3), ('가', 4, 5, 0.0, 1)],
#  [('좋아', 0, 2, 0.5, 2), ('요', 2, 3, 0.0, 1)]]

To use MaxScoreTokenizer with WordExtractor results, prepare the scores as in the examples above. If you already have a dictionary of known words, give those words a higher score than any other word so that each one is kept as a single token, as sketched below.
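A minimal sketch of this boosting, assuming the cohesion_score dict from the WordExtractor example above; the known-word list is hypothetical:

known_words = {'아이오아이', '트와이스'}  # hypothetical dictionary

# Give dictionary words a score above every learned score so the
# tokenizer never splits them.
boosted_scores = dict(cohesion_score)
max_score = max(boosted_scores.values(), default=1.0)
boosted_scores.update({word: max_score + 1.0 for word in known_words})

tokenizer = MaxScoreTokenizer(scores=boosted_scores)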
Word sequences can also be produced with rules. We recognize word boundaries at points where the type of character changes. For example, '아이고ㅋㅋㅜㅜ진짜?' is easily split into [아이고, ㅋㅋ, ㅜㅜ, 진짜, ?].
from soynlp.tokenizer import RegexTokenizer

tokenizer = RegexTokenizer()

print(tokenizer.tokenize('이렇게연속된문장은잘리지않습니다만'))
# ['이렇게연속된문장은잘리지않습니다만']

print(tokenizer.tokenize('숫자123이영어abc에섞여있으면ㅋㅋ잘리겠죠'))
# ['숫자', '123', '이영어', 'abc', '에섞여있으면', 'ㅋㅋ', '잘리겠죠']

If a word dictionary is well constructed, you can use it to build a dictionary-based part-of-speech tagger. Note that this is not morphological analysis: '하는', '하다', and '하고' are all simply verbs. A Lemmatizer is currently under development.
pos_dict = {
    'Adverb': {'너무', '매우'},
    'Noun': {'너무너무너무', '아이오아이', '아이', '노래', '오', '이', '고양'},
    'Josa': {'는', '의', '이다', '입니다', '이', '이는', '를', '라', '라는'},
    'Verb': {'하는', '하다', '하고'},
    'Adjective': {'예쁜', '예쁘다'},
    'Exclamation': {'우와'}
}
from soynlp.postagger import Dictionary
from soynlp.postagger import LRTemplateMatcher
from soynlp.postagger import LREvaluator
from soynlp.postagger import SimpleTagger
from soynlp.postagger import UnknowLRPostprocessor

dictionary = Dictionary(pos_dict)
generator = LRTemplateMatcher(dictionary)
evaluator = LREvaluator()
postprocessor = UnknowLRPostprocessor()
tagger = SimpleTagger(generator, evaluator, postprocessor)

sent = '너무너무너무는아이오아이의노래입니다!!'
print(tagger.tag(sent))
# [('너무너무너무', 'Noun'),
# ('는', 'Josa'),
# ('아이오아이', 'Noun'),
# ('의', 'Josa'),
# ('노래', 'Noun'),
# ('입니다', 'Josa'),
# ('!!', None)]More detailed usage is described in the tutorial, and the development process notes are described here.
Documents can be turned into a sparse matrix using a tokenizer trained as above (or any other tokenizer). The minimum/maximum term frequency and document frequency can be adjusted. In verbose mode the current vectorizing progress is printed.
from soynlp.vectorizer import BaseVectorizer

vectorizer = BaseVectorizer(
    tokenizer=tokenizer,
    min_tf=0,
    max_tf=10000,
    min_df=0,
    max_df=1.0,
    stopwords=None,
    lowercase=True,
    verbose=True
)

corpus.iter_sent = False
x = vectorizer.fit_transform(corpus)

If the corpus is large, or the sparse matrix is not needed right away, it can be written to a file instead of being held in memory. The fit_to_file() (or to_file()) function writes the term frequency vector of each document as soon as it is computed. The available parameters are the same as for BaseVectorizer.
vectorizer = BaseVectorizer(min_tf=1, tokenizer=tokenizer)
corpus.iter_sent = False

matrix_path = 'YOURS'
vectorizer.fit_to_file(corpus, matrix_path)

A single document can also be encoded as a bag of words rather than a sparse matrix. Words not present in vectorizer.vocabulary_ are not encoded.
vectorizer.encode_a_doc_to_bow('오늘 뉴스는 이것이 전부다')
# {3: 1, 258: 1, 428: 1, 1814: 1}

The {term index: frequency} bag of words can be decoded back into words.
vectorizer.decode_from_bow({3: 1, 258: 1, 428: 1, 1814: 1})
# {'뉴스': 1, '는': 1, '오늘': 1, '이것이': 1}

A document can also be encoded as a list of term indices.
vectorizer.encode_a_doc_to_list('오늘의 뉴스는 매우 심각합니다')
# [258, 4, 428, 3, 333]

A list of term indices can be decoded back into words.
vectorizer.decode_from_list([258, 4, 428, 3, 333])
# ['오늘', '의', '뉴스', '는', '매우']

soynlp also provides functions for collapsing repeated emoticon characters in conversational data and comments, and for keeping only Hangul, or only text.
from soynlp.normalizer import *

emoticon_normalize('ㅋㅋㅋㅋㅋㅋㅋㅋㅋㅋㅋㅋㅋ쿠ㅜㅜㅜㅜㅜㅜ', num_repeats=3)
# 'ㅋㅋㅋㅜㅜㅜ'

repeat_normalize('와하하하하하하하하하핫', num_repeats=2)
# '와하하핫'

only_hangle('가나다ㅏㅑㅓㅋㅋ쿠ㅜㅜㅜabcd123!!아핫')
# '가나다ㅏㅑㅓㅋㅋ쿠ㅜㅜㅜ 아핫'

only_hangle_number('가나다ㅏㅑㅓㅋㅋ쿠ㅜㅜㅜabcd123!!아핫')
# '가나다ㅏㅑㅓㅋㅋ쿠ㅜㅜㅜ 123 아핫'

only_text('가나다ㅏㅑㅓㅋㅋ쿠ㅜㅜㅜabcd123!!아핫')
# '가나다ㅏㅑㅓㅋㅋ쿠ㅜㅜㅜabcd123!!아핫'

More detailed explanations are in the tutorial.
For association analysis, functions are provided to compute a co-occurrence matrix and point-wise mutual information (PMI).
You can build a (word, context words) matrix with the sent_to_word_contexts_matrix function below. x is a scipy.sparse.csr_matrix of size (n_vocabs, n_vocabs). idx2vocab is a list of str containing the word corresponding to each row and column of x. Words within windows positions before and after a word in a sentence are treated as its contexts, and only words appearing with frequency min_tf or higher are counted. dynamic_weight weights co-occurrences in inverse proportion to distance: with windows = 3, words 1, 2, and 3 positions away are counted as 1, 2/3, and 1/3.
from soynlp.vectorizer import sent_to_word_contexts_matrix

x, idx2vocab = sent_to_word_contexts_matrix(
    corpus,
    windows=3,
    min_tf=10,
    tokenizer=tokenizer,  # (default) lambda x:x.split(),
    dynamic_weight=False,
    verbose=True
)

Passing the co-occurrence matrix x to pmi computes PMI along the row and column axes. pmi_dok is in scipy.sparse.dok_matrix format. Only values of at least min_pmi are stored; since the default is min_pmi = 0, the result is positive PMI (PPMI). alpha is a smoothing parameter in PMI(x, y) = log( p(x, y) / ( p(x) * ( p(y) + alpha ) ) ). The computation takes a while, so set verbose = True to print the current progress.
from soynlp.word import pmi

pmi_dok = pmi(
    x,
    min_pmi=0,
    alpha=0.0001,
    verbose=True
)

More detailed explanations are in the tutorial.
Functions are provided for refining the Sejong corpus into training data for natural language processing models: a function that builds training data refined into morpheme/part-of-speech form, a function that turns the corpus into a table, and a function that simplifies the Sejong part-of-speech tag set.
Text with spacing errors becomes much easier to analyze once those errors are removed. You can train a spacing-correction engine on the data you want to analyze and use it to correct spacing errors.
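This appears to refer to the companion soyspacing package. A minimal sketch, assuming its CountSpace model; the corpus path and sentence are placeholders:

from soyspacing.countbase import CountSpace

corpus_fname = 'your corpus path'  # placeholder
model = CountSpace()
model.train(corpus_fname)

# correct() returns the corrected sentence and the spacing tags
sent_corrected, tags = model.correct('이건진짜좋은영화')  # placeholder sentence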
Without training a tokenizer or word extractor, keywords can be extracted from a substring graph using the HITS algorithm.
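This likely refers to the companion KR-WordRank package; a minimal sketch assuming its KRWordRank API, with a placeholder document list:

from krwordrank.word import KRWordRank

texts = ['예시 문서 한 줄']  # placeholder: list of documents (str)

wordrank_extractor = KRWordRank(
    min_count=5,   # minimum substring frequency
    max_length=10  # maximum substring length
)

# HITS-style iterations over the substring graph
keywords, rank, graph = wordrank_extractor.extract(texts, beta=0.85, max_iter=10)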
A keyword extractor is also provided, with two kinds of models: one based on logistic regression and one statistics-based. It supports the scipy.sparse sparse matrix format and text file formats.