soynlp is a pure-Python library for Korean text analysis. It takes an unsupervised approach: it can discover words in data without labeled training data, segment sentences into word sequences, and determine parts of speech.
The WordExtractor and NounExtractor provided by soynlp work from statistical information learned over many documents. Because unsupervised approaches extract words from statistical patterns, they work better on a reasonably large set of homogeneous documents than on a single sentence or document. It is best to train the extractors on collections that share vocabulary, such as movie comments or one day's news articles. Extraction quality drops when heterogeneous document collections are mixed together.
Up to soynlp 0.0.46 there was no consistent rule for parameter names expressing a minimum or maximum value, such as min_score, minimum_score, and l_len_min. This may confuse users who have set these parameters directly in existing code, but the variable names have now been cleaned up to reduce confusion later on.
From 0.0.47 on, variable names meaning minimum or maximum are abbreviated to min and max, followed by the name of the item the threshold applies to. Parameter names follow the pattern {min, max}_{noun, word}_{score, threshold}; the item may be omitted when it is obvious.
soynlp does a lot of substring counting. Parameters related to frequency are unified as frequency, not count.
index and idx are unified as idx.
num and n, both meaning a number, are unified as num.
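For example, the WordExtractor constructor used later in this document already follows this convention; the sketch below simply annotates the pattern:

from soynlp.word import WordExtractor

# {min}_{item} naming: frequency rather than count, and the thresholded
# item spelled out after min
word_extractor = WordExtractor(
    min_frequency=100,                # frequency, not count
    min_cohesion_forward=0.05,        # min + item (forward cohesion)
    min_right_branching_entropy=0.0   # min + item (right branching entropy)
)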
$ pip install soynlp

Various attempts at noun extraction have produced three versions: v1, news, and v2. v2 performs best.
WordExtractor learns word boundary scores from statistics but does not determine the part of speech of each word. Sometimes you need to know the part of speech, and new words appear more often as nouns than as any other part of speech. Certain characters, such as the josa -은 and -에, often appear on the right side of nouns. By examining the distribution of characters appearing to the right of the left-side substrings of eojeols (space-separated units), you can decide whether a substring is a noun. soynlp provides two kinds of noun extractors. Both are still under development, so it is hard to say one is strictly better, but NewsNounExtractor contains more features. Eventually the noun extractors will be consolidated into a single class.
from soynlp.noun import LRNounExtractor

noun_extractor = LRNounExtractor()
nouns = noun_extractor.train_extract(sentences)  # list of str like
from soynlp.noun import NewsNounExtractor

noun_extractor = NewsNounExtractor()
nouns = noun_extractor.train_extract(sentences)  # list of str like

Below are examples of nouns learned from news articles of 2016-10-20.
덴마크 웃돈 너무너무너무 가락동 매뉴얼 지도교수
전망치 강구 언니들 신산업 기뢰전 노스
할리우드 플라자 불법조업 월스트리트저널 2022년 불허
고씨 어플 1987년 불씨 적기 레스
스퀘어 충당금 건축물 뉴질랜드 사각 하나씩
근대 투자주체별 4위 태권 네트웍스 모바일게임
연동 런칭 만성 손질 제작법 현실화
오해영 심사위원들 단점 부장조리 차관급 게시물
인터폰 원화 단기간 편곡 무산 외국인들
세무조사 석유화학 워킹 원피스 서장 공범
More detailed explanations are in the tutorial.
soynlp >= 0.0.46 provides noun extractor version 2. It improves the noun extraction accuracy and compound-noun recognition of the previous version and fixes errors in the output information. Usage is similar to version 1.
from soynlp.utils import DoublespaceLineCorpus
from soynlp.noun import LRNounExtractor_v2

corpus_path = '2016-10-20-news'
sents = DoublespaceLineCorpus(corpus_path, iter_sent=True)

noun_extractor = LRNounExtractor_v2(verbose=True)
nouns = noun_extractor.train_extract(sents)

The extracted nouns are returned as a {str: namedtuple} dict.
print(nouns['뉴스'])  # NounScore(frequency=4319, score=1.0)

_compounds_components stores the single nouns that make up each compound noun. Words such as '대한민국' and '녹색성장' are actually compound forms, but because they are used like single nouns they are recognized as single nouns.
list(noun_extractor._compounds_components.items())[:5]

# [('잠수함발사탄도미사일', ('잠수함', '발사', '탄도미사일')),
#  ('미사일대응능력위원회', ('미사일', '대응', '능력', '위원회')),
#  ('글로벌녹색성장연구소', ('글로벌', '녹색성장', '연구소')),
#  ('시카고옵션거래소', ('시카고', '옵션', '거래소')),
#  ('대한민국특수임무유공', ('대한민국', '특수', '임무', '유공'))]

lrgraph stores the L-R structure of the eojeols in the training corpus. You can inspect it with get_r and get_l.
noun_extractor.lrgraph.get_r('아이오아이')
# [('', 123),
# ('의', 47),
# ('는', 40),
# ('와', 18),
# ('가', 18),
# ('에', 7),
# ('에게', 6),
# ('까지', 2),
# ('랑', 2),
#  ('부터', 1)]

More details are in tutorial 2.
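get_l works in the mirror direction: given an R part, it returns the (L part, frequency) pairs observed to its left. A sketch reusing the josa '까지' from the output above; the actual output depends on the training corpus:

# List the L parts observed to the left of '까지'
noun_extractor.lrgraph.get_l('까지')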
Entertainment news from October 2016 contains words such as '트와이스' and '아이오아이'. A model trained earlier has never seen these words. Because new words are created all the time, there is an out-of-vocabulary (OOV) problem: words absent from the training data are not recognized. However, a person reading several entertainment news articles written at that time can learn that sequences such as '트와이스' and '아이오아이' are words. If we define a word as a sequence of characters that appears frequently in a document collection, we can extract words using statistics. There are several ways to learn word boundaries from statistics; soynlp provides cohesion score, branching entropy, and accessor variety.
from soynlp.word import WordExtractor

word_extractor = WordExtractor(
    min_frequency=100,
    min_cohesion_forward=0.05,
    min_right_branching_entropy=0.0
)
word_extractor.train(sentences)  # list of str like
words = word_extractor.extract()

words is a {str: Scores} dict, where Scores is a namedtuple of word scores.
words['아이오아이']

Scores(cohesion_forward=0.30063636035733476,
       cohesion_backward=0,
       left_branching_entropy=0,
       right_branching_entropy=0,
       left_accessor_variety=0,
       right_accessor_variety=0,
       leftside_frequency=270,
       rightside_frequency=0)

Below are words learned from news articles of 2016-10-26, sorted by word score (cohesion * branching entropy). A sketch of computing this ranking follows the table.
word (frequency, cohesion, branching entropy)
촬영 (2222, 1.000, 1.823)
서울 (25507, 0.657, 2.241)
들어 (3906, 0.534, 2.262)
롯데 (1973, 0.999, 1.542)
한국 (9904, 0.286, 2.729)
북한 (4954, 0.766, 1.729)
투자 (4549, 0.630, 1.889)
떨어 (1453, 0.817, 1.515)
진행 (8123, 0.516, 1.970)
얘기 (1157, 0.970, 1.328)
운영 (4537, 0.592, 1.768)
프로그램 (2738, 0.719, 1.527)
클린턴 (2361, 0.751, 1.420)
뛰어 (927, 0.831, 1.298)
드라마 (2375, 0.609, 1.606)
우리 (7458, 0.470, 1.827)
준비 (1736, 0.639, 1.513)
루이 (1284, 0.743, 1.354)
트럼프 (3565, 0.712, 1.355)
생각 (3963, 0.335, 2.024)
팬들 (999, 0.626, 1.341)
산업 (2203, 0.403, 1.769)
10 (18164, 0.256, 2.210)
확인 (3575, 0.306, 2.016)
필요 (3428, 0.635, 1.279)
문제 (4737, 0.364, 1.808)
혐의 (2357, 0.962, 0.830)
평가 (2749, 0.362, 1.787)
20 (59317, 0.667, 1.171)
스포츠 (3422, 0.428, 1.604)
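A minimal sketch of producing such a ranking from the words dict above, assuming the combined score is cohesion_forward multiplied by right_branching_entropy:

# Rank words by cohesion * branching entropy
word_score = {
    word: score.cohesion_forward * score.right_branching_entropy
    for word, score in words.items()
}

# Print the top 30 as: word (frequency, cohesion, branching entropy)
for word, _ in sorted(word_score.items(), key=lambda x: -x[1])[:30]:
    s = words[word]
    print('%s (%d, %.3f, %.3f)' % (
        word, s.leftside_frequency, s.cohesion_forward, s.right_branching_entropy))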
More details are in the word extraction tutorial. The current version provides cohesion score, branching entropy, and accessor variety.
Once word scores have been learned with WordExtractor, you can use them to segment a sentence into a word sequence along word boundaries. soynlp provides three tokenizers. If spacing is well observed, LTokenizer can be used. A Korean eojeol has an "L + [R]" structure, such as "noun + josa". The L part can be a noun, verb, adjective, or adverb; once the L part of an eojeol is recognized, the remainder is the R part. LTokenizer takes the word scores of the L parts.
from soynlp.tokenizer import LTokenizer

scores = {'데이': 0.5, '데이터': 0.5, '데이터마이닝': 0.5, '공부': 0.5, '공부중': 0.45}
tokenizer = LTokenizer(scores=scores)

sent = '데이터마이닝을 공부중이다'

print(tokenizer.tokenize(sent, flatten=False))
# [['데이터마이닝', '을'], ['공부', '중이다']]

print(tokenizer.tokenize(sent))
# ['데이터마이닝', '을', '공부', '중이다']

If you have computed word scores with WordExtractor, you can build the scores dict from one of its score fields. The example below uses only the forward cohesion score; other combined word scores can be defined and used as well.
from soynlp.word import WordExtractor
from soynlp.utils import DoublespaceLineCorpus
from soynlp.tokenizer import LTokenizer

file_path = 'your file path'
corpus = DoublespaceLineCorpus(file_path, iter_sent=True)

word_extractor = WordExtractor(
    min_frequency=100,  # example
    min_cohesion_forward=0.05,
    min_right_branching_entropy=0.0
)
word_extractor.train(corpus)
words = word_extractor.extract()

cohesion_score = {word: score.cohesion_forward for word, score in words.items()}
tokenizer = LTokenizer(scores=cohesion_score)

You can also combine the noun scores from the noun extractor with cohesion. For example, to use "cohesion score + noun score" as the word score:
from soynlp.noun import LRNounExtractor_v2

noun_extractor = LRNounExtractor_v2()
nouns = noun_extractor.train_extract(corpus)  # list of str like
noun_scores = {noun: score.score for noun, score in nouns.items()}

combined_scores = {noun: score + cohesion_score.get(noun, 0)
                   for noun, score in noun_scores.items()}
combined_scores.update(
    {subword: cohesion for subword, cohesion in cohesion_score.items()
     if not (subword in combined_scores)}
)

tokenizer = LTokenizer(scores=combined_scores)

If spacing is not properly observed, you cannot assume that each space-separated unit of a sentence has the L + [R] structure. People read such sentences by recognizing familiar words first. MaxScoreTokenizer, which models this process, also uses word scores.
from soynlp.tokenizer import MaxScoreTokenizer

scores = {'파스': 0.3, '파스타': 0.7, '좋아요': 0.2, '좋아': 0.5}
tokenizer = MaxScoreTokenizer(scores=scores)

print(tokenizer.tokenize('난파스타가좋아요'))
# ['난', '파스타', '가', '좋아', '요']

print(tokenizer.tokenize('난파스타가 좋아요', flatten=False))
# [[('난', 0, 1, 0.0, 1), ('파스타', 1, 4, 0.7, 3), ('가', 4, 5, 0.0, 1)],
#  [('좋아', 0, 2, 0.5, 2), ('요', 2, 3, 0.0, 1)]]

To use MaxScoreTokenizer with WordExtractor results, prepare the scores as in the examples above. If you already have a dictionary of known words, give those words a higher score than any other word so that each one is kept as a single token, as sketched below.
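A minimal sketch of this boosting, assuming the cohesion_score dict from the WordExtractor example above; the known-word list is hypothetical:

known_words = {'아이오아이', '트와이스'}  # hypothetical dictionary

# Give dictionary words a score above every learned score so the
# tokenizer never splits them.
boosted_scores = dict(cohesion_score)
max_score = max(boosted_scores.values(), default=1.0)
boosted_scores.update({word: max_score + 1.0 for word in known_words})

tokenizer = MaxScoreTokenizer(scores=boosted_scores)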
Word sequences can also be produced with rules. We recognize word boundaries at points where the type of character changes. For example, '아이고ㅋㅋㅜㅜ진짜?' is easily split into [아이고, ㅋㅋ, ㅜㅜ, 진짜, ?].
from soynlp.tokenizer import RegexTokenizer

tokenizer = RegexTokenizer()

print(tokenizer.tokenize('이렇게연속된문장은잘리지않습니다만'))
# ['이렇게연속된문장은잘리지않습니다만']

print(tokenizer.tokenize('숫자123이영어abc에섞여있으면ㅋㅋ잘리겠죠'))
# ['숫자', '123', '이영어', 'abc', '에섞여있으면', 'ㅋㅋ', '잘리겠죠']

If a word dictionary is well constructed, you can use it to build a dictionary-based part-of-speech tagger. Note that this is not morphological analysis: '하는', '하다', and '하고' are all simply verbs. A Lemmatizer is currently under development.
pos_dict = {
    'Adverb': {'너무', '매우'},
    'Noun': {'너무너무너무', '아이오아이', '아이', '노래', '오', '이', '고양'},
    'Josa': {'는', '의', '이다', '입니다', '이', '이는', '를', '라', '라는'},
    'Verb': {'하는', '하다', '하고'},
    'Adjective': {'예쁜', '예쁘다'},
    'Exclamation': {'우와'}
}
from soynlp.postagger import Dictionary
from soynlp.postagger import LRTemplateMatcher
from soynlp.postagger import LREvaluator
from soynlp.postagger import SimpleTagger
from soynlp.postagger import UnknowLRPostprocessor

dictionary = Dictionary(pos_dict)
generator = LRTemplateMatcher(dictionary)
evaluator = LREvaluator()
postprocessor = UnknowLRPostprocessor()
tagger = SimpleTagger(generator, evaluator, postprocessor)

sent = '너무너무너무는아이오아이의노래입니다!!'
print(tagger.tag(sent))
# [('너무너무너무', 'Noun'),
# ('는', 'Josa'),
# ('아이오아이', 'Noun'),
# ('의', 'Josa'),
# ('노래', 'Noun'),
# ('입니다', 'Josa'),
# ('!!', None)]More detailed usage is described in the tutorial, and the development process notes are described here.
Documents can be turned into a sparse matrix using a tokenizer trained as above (or any other tokenizer). The minimum/maximum term frequency and document frequency can be adjusted. In verbose mode the current vectorizing progress is printed.
from soynlp.vectorizer import BaseVectorizer

vectorizer = BaseVectorizer(
    tokenizer=tokenizer,
    min_tf=0,
    max_tf=10000,
    min_df=0,
    max_df=1.0,
    stopwords=None,
    lowercase=True,
    verbose=True
)

corpus.iter_sent = False
x = vectorizer.fit_transform(corpus)

If the corpus is large, or the sparse matrix is not needed right away, it can be written to a file instead of being held in memory. The fit_to_file() (or to_file()) function writes the term frequency vector of each document as soon as it is computed. The available parameters are the same as for BaseVectorizer.
vectorizer = BaseVectorizer(min_tf=1, tokenizer=tokenizer)
corpus.iter_sent = False

matrix_path = 'YOURS'
vectorizer.fit_to_file(corpus, matrix_path)

A single document can also be encoded as a bag of words rather than a sparse matrix. Words not present in vectorizer.vocabulary_ are not encoded.
vectorizer.encode_a_doc_to_bow('오늘 뉴스는 이것이 전부다')
# {3: 1, 258: 1, 428: 1, 1814: 1}

The {term index: frequency} bag of words can be decoded back into words.
vectorizer.decode_from_bow({3: 1, 258: 1, 428: 1, 1814: 1})
# {'뉴스': 1, '는': 1, '오늘': 1, '이것이': 1}

A document can also be encoded as a list of term indices.
vectorizer.encode_a_doc_to_list('오늘의 뉴스는 매우 심각합니다')
# [258, 4, 428, 3, 333]

A list of term indices can be decoded back into words.
vectorizer.decode_from_list([258, 4, 428, 3, 333])
# ['오늘', '의', '뉴스', '는', '매우']

soynlp also provides functions for collapsing repeated emoticon characters in conversational data and comments, and for keeping only Hangul, or only text.
from soynlp.normalizer import *

emoticon_normalize('ㅋㅋㅋㅋㅋㅋㅋㅋㅋㅋㅋㅋㅋ쿠ㅜㅜㅜㅜㅜㅜ', num_repeats=3)
# 'ㅋㅋㅋㅜㅜㅜ'

repeat_normalize('와하하하하하하하하하핫', num_repeats=2)
# '와하하핫'

only_hangle('가나다ㅏㅑㅓㅋㅋ쿠ㅜㅜㅜabcd123!!아핫')
# '가나다ㅏㅑㅓㅋㅋ쿠ㅜㅜㅜ 아핫'

only_hangle_number('가나다ㅏㅑㅓㅋㅋ쿠ㅜㅜㅜabcd123!!아핫')
# '가나다ㅏㅑㅓㅋㅋ쿠ㅜㅜㅜ 123 아핫'

only_text('가나다ㅏㅑㅓㅋㅋ쿠ㅜㅜㅜabcd123!!아핫')
# '가나다ㅏㅑㅓㅋㅋ쿠ㅜㅜㅜabcd123!!아핫'

More detailed explanations are in the tutorial.
For association analysis, functions are provided to compute a co-occurrence matrix and point-wise mutual information (PMI).
You can build a (word, context words) matrix with the sent_to_word_contexts_matrix function below. x is a scipy.sparse.csr_matrix of size (n_vocabs, n_vocabs). idx2vocab is a list of str containing the word corresponding to each row and column of x. Words within windows positions before and after a word in a sentence are treated as its contexts, and only words appearing with frequency min_tf or higher are counted. dynamic_weight weights co-occurrences in inverse proportion to distance: with windows = 3, words 1, 2, and 3 positions away are counted as 1, 2/3, and 1/3.
from soynlp.vectorizer import sent_to_word_contexts_matrix

x, idx2vocab = sent_to_word_contexts_matrix(
    corpus,
    windows=3,
    min_tf=10,
    tokenizer=tokenizer,  # (default) lambda x:x.split(),
    dynamic_weight=False,
    verbose=True
)

Passing the co-occurrence matrix x to pmi computes PMI along the row and column axes. pmi_dok is in scipy.sparse.dok_matrix format. Only values of at least min_pmi are stored; since the default is min_pmi = 0, the result is positive PMI (PPMI). alpha is a smoothing parameter in PMI(x, y) = log( p(x, y) / ( p(x) * ( p(y) + alpha ) ) ). The computation takes a while, so set verbose = True to print the current progress.
from soynlp.word import pmi

pmi_dok = pmi(
    x,
    min_pmi=0,
    alpha=0.0001,
    verbose=True
)

More detailed explanations are in the tutorial.
Functions are provided for refining the Sejong corpus into training data for natural language processing models: a function that builds training data refined into morpheme/part-of-speech form, a function that turns the corpus into a table, and a function that simplifies the Sejong part-of-speech tag set.
Text with spacing errors becomes much easier to analyze once those errors are removed. You can train a spacing-correction engine on the data you want to analyze and use it to correct spacing errors.
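This appears to refer to the companion soyspacing package. A minimal sketch, assuming its CountSpace model; the corpus path and sentence are placeholders:

from soyspacing.countbase import CountSpace

corpus_fname = 'your corpus path'  # placeholder
model = CountSpace()
model.train(corpus_fname)

# correct() returns the corrected sentence and the spacing tags
sent_corrected, tags = model.correct('이건진짜좋은영화')  # placeholder sentence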
Without training a tokenizer or word extractor, keywords can be extracted from a substring graph using the HITS algorithm.
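This likely refers to the companion KR-WordRank package; a minimal sketch assuming its KRWordRank API, with a placeholder document list:

from krwordrank.word import KRWordRank

texts = ['예시 문서 한 줄']  # placeholder: list of documents (str)

wordrank_extractor = KRWordRank(
    min_count=5,   # minimum substring frequency
    max_length=10  # maximum substring length
)

# HITS-style iterations over the substring graph
keywords, rank, graph = wordrank_extractor.extract(texts, beta=0.85, max_iter=10)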
A keyword extractor is also provided, with two kinds of models: one based on logistic regression and one statistics-based. It supports the scipy.sparse sparse matrix format and text file formats.