A tool for finding distinguishing terms in corpora and presenting them in an interactive HTML scatter plot. Points corresponding to terms are selectively labeled so that they do not overlap with other labels or points.
Cite as: Jason S. Kessler. Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ. ACL System Demonstrations. 2017.
Below is an example of using Scattertext to create a visualization of the terms used in the 2012 US political conventions. The 2,000 most party-associated unigrams are displayed as points in the scatter plot. Their x- and y-axes are the dense ranks of their usage by Republican and Democratic speakers, respectively.
import scattertext as st
df = st.SampleCorpora.ConventionData2012.get_data().assign(
parse=lambda df: df.text.apply(st.whitespace_nlp_with_sentences)
)
corpus = st.CorpusFromParsedDocuments(
df, category_col='party', parsed_col='parse'
).build().get_unigram_corpus().compact(st.AssociationCompactor(2000))
html = st.produce_scattertext_explorer(
corpus,
category='democrat',
category_name='Democratic',
not_category_name='Republican',
minimum_term_frequency=0,
pmi_threshold_coefficient=0,
width_in_pixels=1000,
metadata=corpus.get_df()['speaker'],
transform=st.Scalers.dense_rank,
include_gradient=True,
left_gradient_term='More Republican',
middle_gradient_term='Metric: Dense Rank Difference',
right_gradient_term='More Democratic',
)
open('./demo_compact.html', 'w').write(html)
The resulting HTML file looks like the image below. Click it for the full interactive visualization.
Jason S. Kessler. Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ. ACL System Demonstrations. 2017. Link to paper: arxiv.org/abs/1703.00565
@article{kessler2017scattertext,
author = {Kessler, Jason S.},
title = {Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ},
booktitle = {Proceedings of ACL-2017 System Demonstrations},
year = {2017},
address = {Vancouver, Canada},
publisher = {Association for Computational Linguistics},
}
Table of Contents
Installation
Overview
Customizing the Visualization and Plotting Dispersion
Tutorial
Understanding Scaled F-Score
Alternative Term Scoring Methods
The Position-Select-Plot Process
Advanced Uses
Examples
A Note on Chart Layout
What's New
Sources
Install Python 3.11 or higher and run:
$ pip install scattertext
If you cannot (or do not want to) install spaCy, substitute nlp = spacy.load('en') lines with nlp = scattertext.WhitespaceNLP.whitespace_nlp. Note that this is not compatible with word_similarity_explorer, and the tokenization and sentence-boundary detection capabilities will be low-performance regular expressions. See demo_without_spacy.py for an example.
To take full advantage of Scattertext, it is recommended that you also install jieba, spacy, empath, astropy, flashtext, gensim, and umap-learn.
Scattertext should mostly work with Python 2.7, but it may not.
The HTML outputs look best in Chrome and Safari.
The name of this project is Scattertext. "Scattertext" is written as a single word and should be capitalized. When used in Python, the package scattertext should be bound to the name st, i.e., import scattertext as st.
This is a tool intended for visualizing which words and phrases are more characteristic of one category than of others.
Consider the example at the top of this page.
Looking at it seems overwhelming, but it is actually a relatively simple visualization of word use during the 2012 political conventions. Each dot corresponds to a word or phrase mentioned by Republicans or Democrats during their conventions. The closer a dot is to the top of the plot, the more frequently it was used by Democrats. The further right a dot is, the more that word or phrase was used by Republicans. Words frequently used by both parties, like "of", "the", and "Mitt", tend to occur in the upper-right-hand corner. Very low-frequency words have been hidden to preserve computing resources, but a word that neither party used, like "giraffe", would be in the bottom-left-hand corner.
The interesting things happen close to the upper-left and lower-right corners. In the upper-left corner, words like "auto" (as in "auto bailout") and "millionaires" are frequently used by Democrats but infrequently or never used by Republicans. Likewise, terms frequently used by Republicans and infrequently by Democrats occupy the bottom-right corner. These include "big government" and "olympics", referring to the Salt Lake City Olympics in which Governor Romney was involved.
Terms are colored by their association: those more associated with Democrats are blue, and those more associated with Republicans are red.
Terms that are most characteristic of both sets of documents are displayed on the far right of the visualization.
The inspiration for this visualization came from Dataclysm (Rudder, 2014).
Scattertext is designed to help you build these graphs and efficiently label points on them.
The documentation (including this readme) is a work in progress. Please see the tutorial below as well as the PyData 2017 tutorial.
Poking around the code and tests should give you a good idea of how things work.
The library covers some novel and effective term-importance formulas, including Scaled F-Score.
New in Scattertext 0.1.0, you can use a dataframe for term/metadata positions and other term-specific data. It can also be used to determine the term-specific information that is shown after a term is clicked.
Note that, as this example shows, it is possible to disable the use of document categories in Scattertext.
This example covers plotting term dispersion against word frequency and identifying the terms which are most and least dispersed given their frequencies. Under Rosengren's S dispersion measure (Gries 2021), terms tend to receive higher dispersion scores as they occur more frequently. We will look at how to plot this effect and how to factor out the influence of frequency.
This measure, along with a number of other dispersion metrics presented in Gries (2021), is available and documented in the Dispersion class, which we will use later in this section.
Let's start by creating a convention corpus, but we'll use the CorpusWithoutCategoriesFromParsedDocuments factory to ensure that the corpus we create contains no categories. If we try to find the document categories, we'll see that all documents have the category '_'.
import scattertext as st

df = st.SampleCorpora.ConventionData2012.get_data().assign(
    parse=lambda df: df.text.apply(st.whitespace_nlp_with_sentences))
corpus = st.CorpusWithoutCategoriesFromParsedDocuments(
    df, parsed_col='parse'
).build().get_unigram_corpus().remove_infrequent_words(minimum_term_count=6)
corpus.get_categories()
# Returns ['_']

Next, we'll create a dataframe for all of the terms we will plot. We start by building a dataframe which captures each term's frequency along with a variety of dispersion metrics. These will be displayed after a term is activated in the plot.
dispersion = st.Dispersion(corpus)

dispersion_df = dispersion.get_df()
dispersion_df.head(3)

which returns:

```
       Frequency  Range         SD        VC  Juilland's D  Rosengren's S        DP   DP norm  KL-divergence  Dissemination
thank        363    134   3.108113  1.618274      0.707416       0.694898  0.391548  0.391560       0.748808       0.972954
you         1630    177  12.383708  1.435902      0.888596       0.898805  0.233627  0.233635       0.263337       0.963905
so           549    155   3.523380  1.212967      0.774299       0.822244  0.283151  0.283160       0.411750       0.986423
```
These are discussed in detail in [Gries 2021](http://www.stgries.info/research/ToApp_STG_Dispersion_PHCL.pdf).
Dissemination is presented in Altmann et al. (2011).
We'll use Rosengren's S to find the dispersion of each term. It is a metric designed for corpus parts
(convention speeches, in our case) of varying length. Let n be the number of documents in the corpus, s_i the
share of the corpus's tokens found in document i, v_i the term's count in document i, and f the term's total
frequency in the corpus.
Rosengren's S:

$$S = \frac{\left(\sum_{i=1}^{n} \sqrt{s_i \cdot v_i}\right)^{2}}{f}$$
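To make the definition concrete, here is a minimal sketch of this computation in NumPy. The function name and array layout are illustrative, not part of Scattertext's API:

```python
import numpy as np

def rosengrens_s(doc_term_counts: np.ndarray, doc_token_counts: np.ndarray) -> np.ndarray:
    # doc_term_counts: (n_docs, n_terms) matrix of per-document term counts (v_i for each term)
    # doc_token_counts: (n_docs,) total token counts per document, used to form the shares s_i
    s_i = doc_token_counts / doc_token_counts.sum()   # share of the corpus's tokens in each document
    f = doc_term_counts.sum(axis=0)                   # each term's total frequency in the corpus
    return np.sqrt(s_i[:, None] * doc_term_counts).sum(axis=0) ** 2 / f
```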
In order to start plotting, we'll need to add coordinates for each term to the data frame.
To use the `dataframe_scattertext` function, you need, at a minimum, a dataframe with 'X' and 'Y' columns.
The `Xpos` and `Ypos` columns indicate the positions of the original `X` and `Y` values on the scatterplot, and
need to be between 0 and 1. Functions in `st.Scalers` perform this scaling. Absent `Xpos` or `Ypos`,
`st.Scalers.scale` would be used.
Here is a sample of values:
* `st.Scalers.scale(vec)` Rescales the vector to where the minimum value is 0 and the maximum is 1.
* `st.Scalers.log_scale(vec)` Rescales the log of the vector
* `st.Scalers.dense_rank(vec)` Rescales the dense rank of the vector
* `st.Scalers.scale_center_zero_abs(vec)` Rescales a vector with both positive and negative values such that the 0 value
in the original vector is plotted at 0.5, negative values are projected from [-argmax(abs(vec)), 0] to [0, 0.5] and
positive values projected from [0, argmax(abs(vec))] to [0.5, 1].
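As a quick, hedged illustration of how these scalers behave (assuming they accept a plain NumPy array; exact return types may differ):

```python
import numpy as np
import scattertext as st

vec = np.array([1, 10, 100, 100, 1000])
st.Scalers.scale(vec)        # linearly rescaled so the minimum maps to 0 and the maximum to 1
st.Scalers.log_scale(vec)    # the same rescaling applied to the log of the values
st.Scalers.dense_rank(vec)   # rescaled dense ranks; the tied 100s share a rank
```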
```python
dispersion_df = dispersion_df.assign(
X=lambda df: df.Frequency,
Xpos=lambda df: st.Scalers.log_scale(df.X),
Y=lambda df: df["Rosengren's S"],
Ypos=lambda df: st.Scalers.scale(df.Y),
)
```

Since Y would be scaled automatically, the Ypos column is not strictly necessary here.
Finally, since we are not distinguishing between categories, we can set ignore_categories=True.
We can now plot this graph using the dataframe_scattertext function:
html = st.dataframe_scattertext(
    corpus,
    plot_df=dispersion_df,
    metadata=corpus.get_df()['speaker'] + ' (' + corpus.get_df()['party'].str.upper() + ')',
    ignore_categories=True,
    x_label='Log Frequency',
    y_label="Rosengren's S",
    y_axis_labels=['Less Dispersion', 'Medium', 'More Dispersion'],
)

which yields (click for an interactive version):
In addition to the standard usage statistics, the various dispersion statistics are shown along with a term's name. To customize which statistics are displayed, set the term_description_columns=[...] parameter to a list of column names to show.
One issue with this dispersion chart, common to dispersion metrics in general, is that dispersion and frequency tend to be highly correlated, but with a complex, non-linear relationship. Depending on the metric, this correlation curve can be power-shaped, linear, sigmoidal, or something else entirely.
To account for this correlation, we can use a non-parametric regressor to predict dispersion from frequency, and then see which terms have the highest and lowest residuals relative to the dispersion expected for their frequency.
In this case, we use a KNN regressor with 10 neighbors to predict Rosengren's S from term frequency (dispersion_df.X and .Y, respectively) and compute the residuals.
We use color to show the residuals, with a neutral color for residuals around 0 and distinct colors for positive and negative values. We add a column named ColorScore to the plot dataframe for the point color. It is filled with values between 0 and 1, where 0.5 maps to the neutral color on the d3 interpolateWarm color scale. We use st.Scalers.scale_center_zero_abs, discussed above, to perform this transformation.
from sklearn.neighbors import KNeighborsRegressor

dispersion_df = dispersion_df.assign(
    Expected=lambda df: KNeighborsRegressor(n_neighbors=10).fit(
        df.X.values.reshape(-1, 1), df.Y
    ).predict(df.X.values.reshape(-1, 1)),
    Residual=lambda df: df.Y - df.Expected,
    ColorScore=lambda df: st.Scalers.scale_center_zero_abs(df.Residual)
)

Now we are ready to plot the colored dispersion chart. We assign the ColorScore column name to the color_score_column parameter of dataframe_scattertext.
Additionally, we populate the two term lists on the left-hand side with the terms that have the highest and lowest residuals, i.e., the terms with the most and least dispersion relative to what their frequency would predict. This is done with the left_list_column parameter. The header_names parameter lets us name the upper and lower term lists. Finally, we can add an appealing background color to spruce up the plot.
html = st.dataframe_scattertext(
    corpus,
    plot_df=dispersion_df,
    metadata=corpus.get_df()['speaker'] + ' (' + corpus.get_df()['party'].str.upper() + ')',
    ignore_categories=True,
    x_label='Log Frequency',
    y_label="Rosengren's S",
    y_axis_labels=['Less Dispersion', 'Medium', 'More Dispersion'],
    color_score_column='ColorScore',
    header_names={'upper': 'Lower than Expected', 'lower': 'More than Expected'},
    left_list_column='Residual',
    background_color='#e5e5e3'
)

which yields (click for an interactive version):
While you will need to use Scattertext from Python to take full advantage of it, some of its basic functionality is available through a command-line tool. The tool is installed when you follow the procedure laid out above.
Run $ scattertext --help for full usage information. Here's a quick example of how to use vanilla Scattertext on a CSV file. The file needs to have at least two columns: one containing the text to be analyzed, and another containing the category. In the example below, the columns are text and party, respectively.
The example below processes the CSV file and writes the resulting HTML visualization to cli_demo.html.
The parameter --minimum_term_frequency=8 omits terms that occur fewer than 8 times, and --regex_parser indicates that a simple regular-expression parser should be used instead of spaCy. The flag --one_use_per_doc indicates that each term should be counted at most once per document.
To parse non-English text, use the --spacy_language_model argument to configure which spaCy language model the tool uses. The default is 'en'; the others are listed at https://spacy.io/docs/api/language-models.
$ curl -s https://cdn.rawgit.com/JasonKessler/scattertext/master/scattertext/data/political_data.csv | head -2
party,speaker,text
democrat,BARACK OBAMA, " Thank you. Thank you. Thank you. Thank you so much.Thank you.Thank you so much. Thank you. Thank you very much, everybody. Thank you.
$
$ scattertext --datafile=https://cdn.rawgit.com/JasonKessler/scattertext/master/scattertext/data/political_data.csv
> --text_column=text --category_column=party --metadata_column=speaker --positive_category=democrat
> --category_display_name=Democratic --not_category_display_name=Republican --minimum_term_frequency=8
> --one_use_per_doc --regex_parser --outputfile=cli_demo.html

The following code creates a stand-alone HTML file that analyzes words used by Democrats and Republicans in the 2012 party conventions and outputs some notable term associations.
First, import Scattertext and spaCy.
>>> import scattertext as st
>>> import spacy
>>> from pprint import pprint
Next, assemble the data you want to analyze into a Pandas dataframe. It should have at least two columns: the text you'd like to analyze and the category you'd like to study. Here, the text column contains the convention speeches and the party column contains the speaker's party. We'll eventually use the speaker column to label snippets in the visualization.
>>> convention_df = st.SampleCorpora.ConventionData2012.get_data()
>>> convention_df.iloc[0]
party democrat
speaker BARACK OBAMA
text Thank you. Thank you. Thank you. Thank you so ...
Name: 0, dtype: object
Turn the dataframe into a Scattertext corpus to begin the analysis. To look for differences between the parties, set the category_col parameter to 'party', and use the speeches, found in the text column, as the texts to analyze by setting the text_col parameter. Finally, pass a spaCy model in through the nlp argument and call build() to construct the corpus.
# Turn it into a Scattertext Corpus
>>> nlp = spacy.load('en')
>>> corpus = st.CorpusFromPandas(convention_df,
... category_col='party',
... text_col='text',
... nlp=nlp).build()
Let's look at the terms that are most characteristic of the corpus as a whole, and at those most associated with Democrats and Republicans. For more information on these approaches, see slides 52 to 59 of the "Turning Unstructured Content into Kernels of Ideas" talk.
Here are the terms that differentiate the corpus from a general English corpus:
>>> print(list(corpus.get_scaled_f_scores_vs_background().index[:10]))
['obama',
'romney',
'barack',
'mitt',
'obamacare',
'biden',
'romneys',
'hardworking',
'bailouts',
'autoworkers']
Here are the terms most associated with Democrats:
>>> term_freq_df = corpus.get_term_freq_df()
>>> term_freq_df['Democratic Score'] = corpus.get_scaled_f_scores('democrat')
>>> pprint(list(term_freq_df.sort_values(by='Democratic Score', ascending=False).index[:10]))
['auto',
'america forward',
'auto industry',
'insurance companies',
'pell',
'last week',
'pell grants',
"women 's",
'platform',
'millionaires']
And with Republicans:
>>> term_freq_df['Republican Score'] = corpus.get_scaled_f_scores('republican')
>>> pprint(list(term_freq_df.sort_values(by='Republican Score', ascending=False).index[:10]))
['big government',
"n't build",
'mitt was',
'the constitution',
'he wanted',
'hands that',
'of mitt',
'16 trillion',
'turned around',
'in florida']
Now, let's write the scatter plot to a stand-alone HTML file. We'll make the y-axis category 'democrat' and, for presentation purposes, name it "Democratic" with a capital "D". We'll name the other category "Republican" with a capital "R". All documents in the corpus without the 'democrat' category are considered Republican. We set the width of the visualization in pixels, label each excerpt with its speaker via the metadata parameter, and finally write the visualization to an HTML file.
>>> html = st.produce_scattertext_explorer(corpus,
... category='democrat',
... category_name='Democratic',
... not_category_name='Republican',
... width_in_pixels=1000,
... metadata=convention_df['speaker'])
>>> open("Convention-Visualization.html", 'wb').write(html.encode('utf-8'))
Below is what the web page looks like. Click it, and wait a few minutes for the interactive version to load.
Scattertext can also be used to visualize the category association of a variety of other phrase types. Here, the word "phrase" denotes any single- or multi-word collocation.
PyTextRank, created by Paco Nathan, is an implementation of a modified version of the TextRank algorithm (Mihalcea and Tarau 2004). It uses a graph-centrality algorithm to extract a scored list of the most prominent phrases in a document; here, these are the named entities recognized by spaCy. As of spaCy version 2.2, these come from an NER system trained on OntoNotes 5.
Before continuing with this part of the tutorial, install pytextrank via $ pip3 install pytextrank.
To use it, build a corpus as usual, but make sure each document is parsed with spaCy rather than with the built-in whitespace_nlp-style tokenizers. It is not necessary to add pytextrank to the spaCy pipeline here, since it is run separately by the PyTextRankPhrases object. We reduce the number of phrases displayed in the chart to 2,000 using the AssociationCompactor. The phrases produced are treated as non-text features, since their document scores do not correspond to word counts.
import pytextrank, spacy
import scattertext as st
nlp = spacy.load('en')
nlp.add_pipe("textrank", last=True)
convention_df = st.SampleCorpora.ConventionData2012.get_data().assign(
parse=lambda df: df.text.apply(nlp),
party=lambda df: df.party.apply({'democrat': 'Democratic', 'republican': 'Republican'}.get)
)
corpus = st.CorpusFromParsedDocuments(
convention_df,
category_col='party',
parsed_col='parse',
feats_from_spacy_doc=st.PyTextRankPhrases()
).build(
).compact(
st.AssociationCompactor(2000, use_non_text_features=True)
)
The terms present in the corpus are named entities, and, unlike frequency counts, their scores are the scores assigned to them by the TextRank algorithm. Running corpus.get_metadata_freq_df('') returns, for each category, the sum of the terms' TextRank scores. The dense ranks of these scores are used to construct the scatter plot.
term_category_scores = corpus.get_metadata_freq_df('')
print(term_category_scores)
'''
Democratic Republican
term
our future 1.113434 0.699103
your country 0.314057 0.000000
their home 0.385925 0.000000
our government 0.185483 0.462122
our workers 0.199704 0.210989
her family 0.540887 0.405552
our time 0.510930 0.410058
...
'''
Since the aggregate TextRank scores are not particularly interpretable, let's create some helper variables before constructing the plot. We will display each score's per-category rank in the metadata_descriptions field; these are shown after a term is clicked.
import numpy as np
import pandas as pd

term_ranks = pd.DataFrame(
np.argsort(np.argsort(-term_category_scores, axis=0), axis=0) + 1,
columns=term_category_scores.columns,
index=term_category_scores.index)
metadata_descriptions = {
term: '<br/>' + '<br/>'.join(
'<b>%s</b> TextRank score rank: %s/%s' % (cat, term_ranks.loc[term, cat], corpus.get_num_metadata())
for cat in corpus.get_categories())
for term in corpus.get_metadata()
}
We can construct term scores in a couple of ways. One is the standard dense-rank difference, the score used in most of the two-category contrastive plots here, which gives us the most category-associated phrases. Another is to use the maximum category-specific score, which gives us the most prominent phrases in each category regardless of their prominence in the other category. We will take both approaches in this tutorial. Let's compute the second kind of score, category-specific prominence, below.
category_specific_prominence = term_category_scores.apply(
lambda r: r.Democratic if r.Democratic > r.Republican else -r.Republican,
axis=1
)
Now we can output the chart. Note that we use the dense_rank transform, which places identically scored phrases atop each other. We pass the category-specific prominence through the scores parameter and set sort_by_dist to False. Since matching phrases are treated as non-text features, we encode them as a single-phrase topic model and set topic_model_preview_size to 0 so that the topic-model list is not shown. Finally, we ensure that the full documents are displayed; they will be shown in order of their phrase-specific scores.
html = st.produce_scattertext_explorer(
    corpus,
    category='Democratic',
    not_category_name='Republican',
    minimum_term_frequency=0,
    pmi_threshold_coefficient=0,
    width_in_pixels=1000,
    transform=st.Scalers.dense_rank,
    metadata=corpus.get_df()['speaker'],
    scores=category_specific_prominence,
    sort_by_dist=False,
    use_non_text_features=True,
    topic_model_term_lists={term: [term] for term in corpus.get_metadata()},
    topic_model_preview_size=0,
    metadata_descriptions=metadata_descriptions,
    use_full_doc=True
)
The phrases most associated with each category make sense, at least on post-hoc analysis. When referring to Governor Romney, Democrats used his surname "Romney" in their most central mentions, while Republicans used the more familiar and humanizing "Mitt". In President Obama's case, the phrase "Obama" did not show up as a top term, but the first name "Barack" appears as one of the most central phrases in Democratic speeches, mirroring "Mitt".
Alternatively, we can use the dense rank difference of the scores to color the phrase points and to determine the top phrases displayed on the right-hand side of the chart. Instead of setting scores to the category-specific prominence, we set term_scorer=RankDifference() to inject a way of scoring terms into the scatter-plot creation process.
html = st.produce_scattertext_explorer(
    corpus,
    category='Democratic',
    not_category_name='Republican',
    minimum_term_frequency=0,
    pmi_threshold_coefficient=0,
    width_in_pixels=1000,
    transform=st.Scalers.dense_rank,
    use_non_text_features=True,
    metadata=corpus.get_df()['speaker'],
    term_scorer=st.RankDifference(),
    sort_by_dist=False,
    topic_model_term_lists={term: [term] for term in corpus.get_metadata()},
    topic_model_preview_size=0,
    metadata_descriptions=metadata_descriptions,
    use_full_doc=True
)
Phrasemachine, by Abe Handler (Handler et al. 2016), uses regular expressions over sequences of part-of-speech tags to identify noun phrases. This has an advantage over using spaCy's NP-chunking: it tends to isolate meaningful, large noun phrases which are free of appositives.
In contrast to PyTextRank, we use counts of these phrases and treat them like any other term.
import spacy
from scattertext import SampleCorpora, PhraseMachinePhrases, dense_rank, RankDifference, AssociationCompactor, produce_scattertext_explorer
from scattertext.CorpusFromPandas import CorpusFromPandas
corpus = (CorpusFromPandas(SampleCorpora.ConventionData2012.get_data(),
category_col='party',
text_col='text',
feats_from_spacy_doc=PhraseMachinePhrases(),
nlp=spacy.load('en', parser=False))
.build().compact(AssociationCompactor(4000)))
html = produce_scattertext_explorer(corpus,
category='democrat',
category_name='Democratic',
not_category_name='Republican',
minimum_term_frequency=0,
pmi_threshold_coefficient=0,
transform=dense_rank,
metadata=corpus.get_df()['speaker'],
term_scorer=RankDifference(),
width_in_pixels=1000)
In Scattertext, a variety of metrics, including term associations, are often displayed in two ways. The first, and most important, is position on the chart. The second is the color of the points or text. Scattertext 0.2.21 introduces a way of visualizing the semantics of these scores: a gradient that serves as a key.
By default, the gradient follows the d3_color_scale parameter of produce_scattertext_explorer, which defaults to d3.interpolateRdYlBu.
The following additional parameters to produce_scattertext_explorer (and similar functions) allow the gradient to be manipulated:
* include_gradient: bool (default False) is the flag which triggers the appearance of the gradient.
* left_gradient_term: Optional[str] indicates the text written on the left-hand side of the gradient. It is written in gradient_text_color and defaults to category_name.
* right_gradient_term: Optional[str] indicates the text written on the right-hand side of the gradient. It is written in gradient_text_color and defaults to not_category_name.
* middle_gradient_term: Optional[str] indicates the text written in the middle of the gradient. It is the opposite color of the middle gradient color and is empty by default.
* gradient_text_color: Optional[str] indicates a fixed color for the text written on the gradient. If not given, the text defaults to the opposite color of the gradient.
* left_text_color: Optional[str] overrides gradient_text_color for the left gradient term.
* middle_text_color: Optional[str] overrides gradient_text_color for the middle gradient term.
* right_text_color: Optional[str] overrides gradient_text_color for the right gradient term.
* gradient_colors: Optional[List[str]] is an optional list of hex colors, including the '#' (e.g., ['#0000ff', '#980067', '#cc3300', '#32cd00']). If given, it overrides d3_color_scale.

A simple example is below. Term colors are defined as a mapping between term names and #RRGGBB colors via the term_colors parameter, and the color gradient is defined in gradient_colors.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

df = st.SampleCorpora.ConventionData2012.get_data().assign(
    parse=lambda df: df.text.apply(st.whitespace_nlp_with_sentences)
)
corpus = st.CorpusFromParsedDocuments(
    df, category_col='party', parsed_col='parse'
).build().get_unigram_corpus().compact(st.AssociationCompactor(2000))
html = st.produce_scattertext_explorer(
    corpus,
    category='democrat',
    category_name='Democratic',
    not_category_name='Republican',
    minimum_term_frequency=0,
    pmi_threshold_coefficient=0,
    width_in_pixels=1000,
    metadata=corpus.get_df()['speaker'],
    transform=st.Scalers.dense_rank,
    include_gradient=True,
    left_gradient_term="More Democratic",
    right_gradient_term="More Republican",
    middle_gradient_term='Metric: Dense Rank Difference',
    gradient_text_color="white",
    term_colors=dict(zip(
        corpus.get_terms(),
        [
            mpl.colors.to_hex(x) for x in plt.get_cmap('brg')(
                st.Scalers.scale_center_zero_abs(
                    st.RankDifferenceScorer(corpus).set_categories('democrat').get_scores()).values
            )
        ]
    )),
    gradient_colors=[mpl.colors.to_hex(x) for x in plt.get_cmap('brg')(np.arange(1., 0., -0.01))],
)

To visualize Empath (Fast et al., 2016) topics and categories instead of terms, we need to create a Corpus of extracted topics and categories rather than unigrams and bigrams. To do so, use the FeatsFromOnlyEmpath feature extractor. See the source code for examples of how to create your own.
When creating the visualization, pass the use_non_text_features=True argument to produce_scattertext_explorer. This instructs it to use the labeled Empath topics and categories instead of looking for terms. Since the documents returned when a topic or category label is clicked are shown in order of their document-level category-association strength, setting use_full_doc=True makes sense unless you have enormous documents; otherwise, only the first 300 characters will be shown.
(New in 0.0.26.) To ensure that the snippets shown by produce_scattertext_explorer have their topic-model-matching phrases bolded, include topic_model_term_lists=feat_builder.get_top_model_term_lists().
>>> feat_builder = st.FeatsFromOnlyEmpath()
>>> empath_corpus = st.CorpusFromParsedDocuments(convention_df,
... category_col='party',
... feats_from_spacy_doc=feat_builder,
... parsed_col='text').build()
>>> html = st.produce_scattertext_explorer(empath_corpus,
... category='democrat',
... category_name='Democratic',
... not_category_name='Republican',
... width_in_pixels=1000,
... metadata=convention_df['speaker'],
... use_non_text_features=True,
... use_full_doc=True,
... topic_model_term_lists=feat_builder.get_top_model_term_lists())
>>> open("Convention-Visualization-Empath.html", 'wb').write(html.encode('utf-8'))
Scattertext also includes a feature builder for exploring the relationship between General Inquirer tag categories and document categories. We'll use the Z-scores of the log-odds-ratio with uninformative Dirichlet priors (Monroe 2008) to examine the relationship between GI tag categories and political parties. We'll use the produce_frequency_explorer plot variant to visualize this relationship, setting the x-axis to the number of times a word in the tag category occurs and the y-axis to the z-score.
For more information on the General Inquirer, please see the General Inquirer home page.
We'll use the same data set as before, except this time with the FeatsFromGeneralInquirer feature builder.
>>> general_inquirer_feature_builder = st.FeatsFromGeneralInquirer()
>>> corpus = st.CorpusFromPandas(convention_df,
... category_col='party',
... text_col='text',
... nlp=st.whitespace_nlp_with_sentences,
... feats_from_spacy_doc=general_inquirer_feature_builder).build()
Next, we call produce_frequency_explorer in much the same way we called produce_scattertext_explorer in the previous section, with a few differences. First, we specify the LogOddsRatioUninformativeDirichletPrior term scorer, which scores the relationship between the categories. The grey_threshold argument indicates that points scoring between -1.96 and 1.96 (i.e., p > 0.05) should be colored gray. The argument metadata_descriptions=general_inquirer_feature_builder.get_definitions() passes in a dictionary mapping each tag name to a string definition. When a tag is clicked, its definition from the dictionary is displayed below the plot, as shown in the image following the snippet.
>>> html = st.produce_frequency_explorer(corpus,
... category='democrat',
... category_name='Democratic',
... not_category_name='Republican',
... metadata=convention_df['speaker'],
... use_non_text_features=True,
... use_full_doc=True,
... term_scorer=st.LogOddsRatioUninformativeDirichletPrior(),
... grey_threshold=1.96,
... width_in_pixels=1000,
... topic_model_term_lists=general_inquirer_feature_builder.get_top_model_term_lists(),
... metadata_descriptions=general_inquirer_feature_builder.get_definitions())
The resulting chart looks like this:
Moral Foundations Theory was proposed by Graham et al. (2013). The foundations are, as described on moralfoundations.org: care/harm, fairness/cheating, loyalty/betrayal, authority/subversion, sanctity/degradation, and liberty/oppression. Please see the site for a more in-depth discussion of these foundations.
Frimer et al. (2019) created the Moral Foundations Dictionary 2.0, a lexicon of terms which invoke a moral foundation either as a virtue (in favor of the foundation) or as a vice (against it).
This dictionary can be used in the same way as the General Inquirer. In this example, we plot each foundation's Cohen's d score against the frequency with which the foundation was invoked.
First, let's load the corpus as usual and extract features using st.FeatsFromMoralFoundationsDictionary().
import scattertext as st

convention_df = st.SampleCorpora.ConventionData2012.get_data()
moral_foundations_feats = st.FeatsFromMoralFoundationsDictionary()
corpus = st.CorpusFromPandas(convention_df,
                             category_col='party',
                             text_col='text',
                             nlp=st.whitespace_nlp_with_sentences,
                             feats_from_spacy_doc=moral_foundations_feats).build()

Next, let's use the Cohen's d term scorer to analyze the corpus and look at the resulting set of Cohen's d association scores.
cohens_d_scorer = st.CohensD(corpus).use_metadata()
term_scorer = cohens_d_scorer.set_categories('democrat', ['republican']).term_scorer.get_score_df()

which produces the following data frame:
| | cohens_d | cohens_d_se | cohens_d_z | cohens_d_p | hedges_g | hedges_g_se | hedges_g_z | hedges_g_p | m1 | m2 | count1 | count2 | docs1 | docs2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Care.virtue | 0.662891 | 0.149425 | 4.43629 | 4.57621E-06 | 0.660257 | 0.159049 | 4.15129 | 1.65302E-05 | 0.195049 | 0.12164 | 760 | 379 | 115 | 54 |
| Care.vice | 0.24435 | 0.146025 | 1.67335 | 0.0471292 | 0.243379 | 0.152654 | 1.59432 | 0.0554325 | 0.0580005 | 0.0428358 | 244 | 121 | 80 | 41 |
| Fairness.virtue | 0.176794 | 0.145767 | 1.21286 | 0.112592 | 0.176092 | 0.152164 | 1.15725 | 0.123586 | 0.0502469 | 0.0403369 | 225 | 107 | 71 | 39 |
| Fairness.vice | 0.0707162 | 0.145528 | 0.485928 | 0.313509 | 0.0704352 | 0.151711 | 0.464273 | 0.321226 | 0.00718627 | 0.00573227 | 32 | 14 | 21 | 10 |
| Authority.virtue | -0.0187793 | 0.145486 | -0.12908 | 0.551353 | -0.0187047 | 0.15163 | -0.123357 | 0.549088 | 0.358192 | 0.361191 | 1281 | 788 | 122 | 66 |
| Authority.vice | -0.0354164 | 0.145494 | -0.243422 | 0.596161 | -0.0352757 | 0.151646 | -0.232619 | 0.591971 | 0.00353465 | 0.00390602 | 20 | 14 | 14 | 10 |
| Sanctity.virtue | -0.512145 | 0.147848 | -3.46399 | 0.999734 | -0.51011 | 0.156098 | -3.26788 | 0.999458 | 0.0587987 | 0.101677 | 265 | 309 | 74 | 48 |
| Sanctity.vice | -0.108011 | 0.145589 | -0.74189 | 0.770923 | -0.107582 | 0.151826 | -0.708585 | 0.760709 | 0.00845048 | 0.0109339 | 35 | 28 | 23 | 20 |
| Loyalty.virtue | -0.413696 | 0.147031 | -2.81367 | 0.997551 | -0.412052 | 0.154558 | -2.666 | 0.996162 | 0.259296 | 0.309776 | 1056 | 717 | 119 | 66 |
| Loyalty.vice | -0.0854683 | 0.145549 | -0.587213 | 0.72147 | -0.0851287 | 0.151751 | -0.560978 | 0.712594 | 0.00124518 | 0.00197022 | 5 | 5 | 5 | 4 |
This data frame gives the Cohen's d scores (along with their standard errors, z-scores, and p-values), the corresponding Hedges' g statistics, the mean per-document frequencies of each topic in the two categories (m1 and m2), the raw counts of each topic in each category (count1, count2), and the number of documents in each category containing it (docs1, docs2).
Cohen's d is the difference between m1 and m2 divided by the pooled standard deviation.
Now, let's plot each foundation's d-score against its frequency.
html = st.produce_frequency_explorer(
    corpus,
    category='democrat',
    category_name='Democratic',
    not_category_name='Republican',
    metadata=convention_df['speaker'],
    use_non_text_features=True,
    use_full_doc=True,
    term_scorer=st.CohensD(corpus).use_metadata(),
    grey_threshold=0,
    width_in_pixels=1000,
    topic_model_term_lists=moral_foundations_feats.get_top_model_term_lists(),
    metadata_descriptions=moral_foundations_feats.get_definitions()
)

Often the terms of most interest are the ones that are characteristic of the corpus as a whole. These are terms which occur frequently in all of the document sets being studied, but relatively infrequently compared to general term frequencies.
We can produce a plot whose x-axis is a characteristicness score using produce_characteristic_explorer.
Corpus characteristicness is the dense rank difference between the frequencies of the words in all of the documents under study and their frequencies in a general English word-frequency list. See this talk on term-class association scores for a more thorough explanation.
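Roughly, the idea looks like the sketch below. This is only an illustration of the definition above, not the library's internal implementation; background_freqs is an assumed Series of general-English term counts:

```python
import pandas as pd
from scipy.stats import rankdata

def characteristicness_sketch(corpus_freqs: pd.Series, background_freqs: pd.Series) -> pd.Series:
    # Dense-rank the corpus counts and the background counts, normalize each rank
    # vector to [0, 1], and take the difference: terms common in the corpus but
    # rare in general English score highest.
    bg = background_freqs.reindex(corpus_freqs.index).fillna(0)
    corpus_rank = rankdata(corpus_freqs, method='dense')
    background_rank = rankdata(bg, method='dense')
    return pd.Series(corpus_rank / corpus_rank.max() - background_rank / background_rank.max(),
                     index=corpus_freqs.index)
```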
import scattertext as st

corpus = (st.CorpusFromPandas(st.SampleCorpora.ConventionData2012.get_data(),
                              category_col='party',
                              text_col='text',
                              nlp=st.whitespace_nlp_with_sentences)
          .build()
          .get_unigram_corpus()
          .compact(st.ClassPercentageCompactor(term_count=2,
                                               term_ranker=st.OncePerDocFrequencyRanker)))
html = st.produce_characteristic_explorer(
    corpus,
    category='democrat',
    category_name='Democratic',
    not_category_name='Republican',
    metadata=corpus.get_df()['speaker']
)
open('demo_characteristic_chart.html', 'wb').write(html.encode('utf-8'))

In addition to words, phrases, and topics, each point can correspond to a document. Let's first create a corpus object for the 2012 conventions data set. This explanation follows demo_pca_documents.py.
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer
from scipy.sparse.linalg import svds
import scattertext as st

convention_df = st.SampleCorpora.ConventionData2012.get_data()
convention_df['parse'] = convention_df['text'].apply(st.whitespace_nlp_with_sentences)
corpus = (st.CorpusFromParsedDocuments(convention_df,
                                       category_col='party',
                                       parsed_col='parse')
          .build()
          .get_stoplisted_unigram_corpus())

Next, let's add the document names as metadata on the corpus object. The add_doc_names_as_metadata function takes an array of document names and populates a new corpus' metadata with those names. If two documents have the same name, a number (starting with 1) is appended to the name.
corpus = corpus.add_doc_names_as_metadata(corpus.get_df()['speaker'])

Next, we compute tf-idf scores for the corpus' term-document matrix, run a sparse SVD, and build a projection data frame whose x- and y-coordinates are the first two singular-value components, indexed on the corpus' metadata, which corresponds to the document names.
embeddings = TfidfTransformer().fit_transform(corpus.get_term_doc_mat())
u, s, vt = svds(embeddings, k=3, maxiter=20000, which='LM')
projection = pd.DataFrame({'term': corpus.get_metadata(), 'x': u.T[0], 'y': u.T[1]}).set_index('term')

Finally, we set the scores to 1 for Democratic documents and 0 for Republican ones, which renders Republican documents as red points and Democratic documents as blue. For more on the produce_pca_explorer function, see "Using SVD to visualize any kind of word embeddings."
category = 'democrat'
scores = (corpus.get_category_ids() == corpus.get_categories().index(category)).astype(int)
html = st.produce_pca_explorer(corpus,
                               category=category,
                               category_name='Democratic',
                               not_category_name='Republican',
                               metadata=convention_df['speaker'],
                               width_in_pixels=1000,
                               show_axes=False,
                               use_non_text_features=True,
                               use_full_doc=True,
                               projection=projection,
                               scores=scores,
                               show_top_terms=False)

Click for an interactive version.
Cohen's d is a popular metric used to measure effect size. The standard definitions of Cohen's d and of Hedges' g, a less biased variant, are given below.
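For reference, these are the textbook formulas, with group means $m_1, m_2$, standard deviations $s_1, s_2$, and group sizes $n_1, n_2$; the exact small-sample correction Scattertext applies may differ slightly:

$$d = \frac{m_1 - m_2}{s_{\mathrm{pooled}}}, \qquad s_{\mathrm{pooled}} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}, \qquad g \approx d\left(1 - \frac{3}{4(n_1 + n_2) - 9}\right)$$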
>>> convention_df = st.SampleCorpora.ConventionData2012.get_data()
>>> corpus = (st.CorpusFromPandas(convention_df,
...                               category_col='party',
...                               text_col='text',
...                               nlp=st.whitespace_nlp_with_sentences)
...           .build()
...           .get_unigram_corpus())

We can create a term scorer object to examine the effect sizes and other metrics.
>>> term_scorer = st.CohensD(corpus).set_categories('democrat', ['republican'])
>>> term_scorer.get_score_df().sort_values(by='cohens_d', ascending=False).head()
           cohens_d  cohens_d_se  cohens_d_z     cohens_d_p  hedges_g  hedges_g_se  hedges_g_z  hedges_g_p        m1        m2
obama      1.187378     0.024588   48.290444   0.000000e+00  1.187322     0.018419   64.461363         0.0  0.007778  0.002795
class      0.855859     0.020848   41.052045   0.000000e+00  0.855818     0.017227   49.677688         0.0  0.002222  0.000375
middle     0.826895     0.020553   40.232746   0.000000e+00  0.826857     0.017138   48.245626         0.0  0.002316  0.000400
president  0.820825     0.020492   40.056541   0.000000e+00  0.820786     0.017120   47.942661         0.0  0.010231  0.005369
barack     0.730624     0.019616   37.245725  6.213052e-304  0.730589     0.016862   43.327800         0.0  0.002547  0.000725

Our Cohen's d calculation is not directly based on term counts. Rather, before computing the statistic, each document's term counts are divided by the document's total number of terms. m1 and m2 are, respectively, the average portions of words in Democratic and Republican speeches which were the term in question. The effect size (cohens_d) is the difference between these means divided by the pooled standard deviation. cohens_d_se is the standard error of the statistic, while cohens_d_z and cohens_d_p are the z-score and p-value indicating the statistical significance of the effect. Corresponding columns are present for Hedges' g.
>>> st.produce_frequency_explorer(
    corpus,
    category='democrat',
    category_name='Democratic',
    not_category_name='Republican',
    term_scorer=st.CohensD(corpus),
    metadata=convention_df['speaker'],
    grey_threshold=0
)

Click for an interactive version.
Cliff's delta (Cliff 1993) computes effect size using a non-parametric approach. In our setting, the per-document term-frequency percentage of each document in the focus set is compared to that of each document in the background set. For each pair of documents, a score of 1 is given if the frequency percentage in the focus document is higher than in the background document, 0 if they are equal, and -1 otherwise. This assumes that document lengths are similarly distributed in the focus and background corpora.
For the formula used in CliffsDelta, see https://real-statistics.com/non-parametric-tests/mann-whitney-test/cliffs-delta/.
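In symbols, with $x_1, \ldots, x_m$ the per-document frequency percentages of a term in the focus set and $y_1, \ldots, y_n$ those in the background set, the pairwise comparison described above amounts to:

$$\delta = \frac{\#\{(i, j) : x_i > y_j\} - \#\{(i, j) : x_i < y_j\}}{m \cdot n}$$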
Below is an example of how to find and plot term scores using CliffsDelta.
nlp = spacy.blank('en')
nlp.add_pipe('sentencizer')
convention_df = st.SampleCorpora.ConventionData2012.get_data().assign(
    party=lambda df: df.party.apply(
        lambda x: {'democrat': 'Dem', 'republican': 'Rep'}[x]),
    SpacyParse=lambda df: df.text.progress_apply(nlp)
)
corpus = st.CorpusFromParsedDocuments(convention_df, category_col='party', parsed_col='SpacyParse').build(
).remove_terms_used_in_less_than_num_docs(10)
st.CliffsDelta(corpus).set_categories('Dem').get_score_df().sort_values(by='Dem', ascending=False).iloc[:10]

| Term | Metric | Stddev | Low-95.0% CI | High-95.0% CI | TermCount1 | TermCount2 | DocCount1 | DocCount2 |
|---|---|---|---|---|---|---|---|---|
| obama | 0.597191 | 0.0266606 | -1.35507 | -1.03477 | 537 | 165 | 113 | 40 |
| president obama | 0.565903 | 0.0314348 | -2.37978 | -1.74131 | 351 | 78 | 100 | 30 |
| president | 0.426337 | 0.0293418 | 1.22784 | 0.909226 | 740 | 301 | 113 | 53 |
| middle | 0.417591 | 0.0267365 | 1.10791 | 0.840932 | 164 | 27 | 68 | 12 |
| class | 0.415373 | 0.0280622 | 1.09032 | 0.815649 | 161 | 25 | 69 | 14 |
| barack | 0.406997 | 0.0281692 | 1.00765 | 0.750963 | 202 | 46 | 76 | 16 |
| barack obama | 0.402562 | 0.027512 | 0.965359 | 0.723403 | 164 | 45 | 76 | 16 |
| that 's | 0.384085 | 0.0227344 | 0.809747 | 0.634705 | 236 | 91 | 89 | 31 |
| obama . | 0.356245 | 0.0237453 | 0.664688 | 0.509631 | 70 | 5 | 49 | 4 |
| for | 0.35526 | 0.0364138 | 0.70142 | 0.46487 | 1020 | 542 | 119 | 62 |
We can use dataframe_scattertext to display these Cliff's delta scores more elegantly, and use the include_gradient=True parameter to explain how the points are colored. The left_gradient_term, middle_gradient_term, and right_gradient_term parameters are set to strings which are displayed at the corresponding positions along the gradient.
plot_df = st.CliffsDelta(
    corpus
).set_categories(
    category_name='Dem'
).get_score_df().rename(columns={'Metric': 'CliffsDelta'}).assign(
    Frequency=lambda df: df.TermCount1 + df.TermCount2,
    X=lambda df: df.Frequency,
    Y=lambda df: df.CliffsDelta,
    Xpos=lambda df: st.Scalers.dense_rank(df.X),
    Ypos=lambda df: st.Scalers.scale_center_zero_abs(df.Y),
    ColorScore=lambda df: df.Ypos,
)
html = st.dataframe_scattertext(
    corpus,
    plot_df=plot_df,
    category='Dem',
    category_name='Dem',
    not_category_name='Rep',
    width_in_pixels=1000,
    ignore_categories=False,
    metadata=lambda corpus: corpus.get_df()['speaker'],
    color_score_column='ColorScore',
    left_list_column='ColorScore',
    show_characteristic=False,
    y_label="Cliff's Delta",
    x_label='Frequency Ranks',
    y_axis_labels=[f'More Rep: delta={plot_df.CliffsDelta.max():.3f}',
                   '',
                   f'More Dem: delta={-plot_df.CliffsDelta.max():.3f}'],
    tooltip_columns=['Frequency', 'CliffsDelta'],
    term_description_columns=['CliffsDelta', 'Stddev', 'Low-95.0% CI', 'High-95.0% CI'],
    header_names={'upper': 'Top Dem', 'lower': 'Top Reps'},
    horizontal_line_y_position=0,
    include_gradient=True,
    left_gradient_term='More Republican',
    right_gradient_term='More Democratic',
    middle_gradient_term="Metric: Cliff's Delta",
)

Bi-Normal Separation (BNS) (Forman, 2008) was added in version 0.1.8. The variant of BNS used here incorporates a prior count (alpha), which appears in the y-axis label below.
corpus = (st.CorpusFromPandas(convention_df,
                              category_col='party',
                              text_col='text',
                              nlp=st.whitespace_nlp_with_sentences)
          .build()
          .get_unigram_corpus()
          .remove_infrequent_words(3, term_ranker=st.OncePerDocFrequencyRanker))
term_scorer = (st.BNSScorer(corpus).set_categories('democrat'))
print(term_scorer.get_score_df().sort_values(by='democrat BNS'))
html = st.produce_frequency_explorer(
    corpus,
    category='democrat',
    category_name='Democratic',
    not_category_name='Republican',
    scores=term_scorer.get_score_df()['democrat BNS'].reindex(corpus.get_terms()).values,
    metadata=lambda c: c.get_df()['speaker'],
    minimum_term_frequency=0,
    grey_threshold=0,
    y_label=f'Bi-normal Separation (alpha={term_scorer.prior_counts})'
)

BNS scored the terms using an algorithmically found alpha.

![bns](https://raw.githubusercontent.com/jasonkessler/jasonkessler.github.io/master/demo_bi_normal_separation.png)
We can train a classifier to produce a prediction score for each document. Classifiers and regressors often use feature sets which go beyond the features displayed by Scattertext.
We can use Scattertext to visualize the correlation between unigrams (or, really, any feature representation) and the document scores produced by such a model.
In the following example, we train a linear SVM on unigram and bigram features over the entire convention data set, use the model to produce a prediction score for each document, and then use the Pearson correlation between each term and those document scores.
from sklearn.svm import LinearSVC
import scattertext as st

df = st.SampleCorpora.ConventionData2012.get_data().assign(
    parse=lambda df: df.text.apply(st.whitespace_nlp_with_sentences)
)
corpus = st.CorpusFromParsedDocuments(
    df, category_col='party', parsed_col='parse'
).build()

X = corpus.get_term_doc_mat()
y = corpus.get_category_ids()
clf = LinearSVC()
clf.fit(X=X, y=y == corpus.get_categories().index('democrat'))
doc_scores = clf.decision_function(X=X)

compactcorpus = corpus.get_unigram_corpus().compact(st.AssociationCompactor(2000))

plot_df = st.Correlations().set_correlation_type(
    'pearsonr'
).get_correlation_df(
    corpus=compactcorpus,
    document_scores=doc_scores
).reindex(compactcorpus.get_terms()).assign(
    X=lambda df: df.Frequency,
    Y=lambda df: df['r'],
    Xpos=lambda df: st.Scalers.dense_rank(df.X),
    Ypos=lambda df: st.Scalers.scale_center_zero_abs(df.Y),
    SuppressDisplay=False,
    ColorScore=lambda df: df.Ypos,
)
html = st.dataframe_scattertext(
    compactcorpus,
    plot_df=plot_df,
    category='democrat',
    category_name='Democratic',
    not_category_name='Republican',
    width_in_pixels=1000,
    metadata=lambda c: c.get_df()['speaker'],
    unified_context=False,
    ignore_categories=False,
    color_score_column='ColorScore',
    left_list_column='ColorScore',
    y_label="Pearson r (correlation to SVM document score)",
    x_label='Frequency Ranks',
    header_names={'upper': 'Top Democratic',
                  'lower': 'Top Republican'},
)

Scattertext relies on a set of general-domain English word frequencies when computing single-term characteristicness scores.
Running Scattertext on non-English or domain-specific data will degrade the quality of these scores.
Make sure you are on Scattertext 0.1.6 or higher.
To remedy this, you can add custom background term counts to a Corpus-like object using the Corpus.set_background_corpus function. It takes a pd.Series object indexed on terms, with numeric count values.
By default, Scaled F-Score (see Understanding Scaled F-Score) is used to assess how characteristic terms are.
The example below shows how to use Polish background word frequencies.
First, we create a Series object mapping Polish words to their frequencies, using the list from the https://github.com/oprogramador/most-common-words-by-language repo.
polish_word_frequencies = pd.read_csv(
    'https://raw.githubusercontent.com/hermitdave/FrequencyWords/master/content/2016/pl/pl_50k.txt',
    sep=' ',
    names=['Word', 'Frequency']
).set_index('Word')['Frequency']

Note the structure of the Series:

>>> polish_word_frequencies
Word
nie    5875385
to     4388099
się    3507076
w      2723767
na     2309765
Name: Frequency, dtype: int64

Next, we build review_df, a dataframe consisting of the documents which appear as positive and negative hotel reviews in the corpus from https://klejbenchmark.com/tasks/ (Kocoń et al. 2019). This data is distributed under the CC BY-NC-SA 4.0 license. The reviews are labeled as "__label__meta_plus_m" and "__label__meta_minus_m". We will use Scattertext to compare these reviews and determine which terms characterize each label.
import io, re
import pandas as pd
import spacy
from urllib.request import urlopen
from zipfile import ZipFile

nlp = spacy.blank('pl')
nlp.add_pipe('sentencizer')
with ZipFile(io.BytesIO(urlopen(
        'https://klejbenchmark.com/static/data/klej_polemo2.0-in.zip'
).read())) as zf:
    review_df = pd.read_csv(zf.open('train.tsv'), sep='\t')[
        lambda df: df.target.isin(['__label__meta_plus_m', '__label__meta_minus_m'])
    ].assign(
        Parse=lambda df: df.sentence.apply(nlp)
    )

Next, we wish to create a ParsedCorpus object from review_df. In preparation, we first assemble a list of Polish stopwords from the stopwords repository. We also create the not_a_word regular expression to filter out terms which do not contain a letter.
polish_stopwords = {
    stopword for stopword in
    urlopen(
        'https://raw.githubusercontent.com/bieli/stopwords/master/polish.stopwords.txt'
    ).read().decode('utf-8').split('\n')
    if stopword.strip()
}
not_a_word = re.compile(r'^\W+$')

With these in place, we can build a corpus from review_df with the category being the binary "target" column. We reduce the term space to unigrams and then run filter_out, which takes a function that determines whether a term should be removed from the corpus. The function identifies terms which are in the Polish stoplist or do not contain a letter. Finally, terms occurring fewer than 20 times in the corpus are removed.
We set the background frequency Series we created earlier as the background corpus.
corpus = st.CorpusFromParsedDocuments(
    review_df,
    category_col='target',
    parsed_col='Parse'
).build(
).get_unigram_corpus(
).filter_out(
    lambda term: term in polish_stopwords or not_a_word.match(term) is not None
).remove_infrequent_words(
    minimum_term_count=20
).set_background_corpus(
    polish_word_frequencies
)

Note that a minimum word count of 20 was chosen to ensure that only around 2,000 terms would be displayed:

>>> corpus.get_num_terms()
2023

Running get_term_and_background_counts shows us the total term counts in the corpus compared to the background frequency counts. We limit this to terms which occur in the corpus.
>>> corpus.get_term_and_background_counts()[
...     lambda df: df.corpus > 0
... ].sort_values(by='corpus', ascending=False)
           background  corpus
m         341583838.0  4819.0
hotelu        33108.0  1812.0
hotel     297974790.0  1651.0
doktor       154840.0  1534.0
polecam           0.0  1438.0
...               ...     ...
szoku             0.0    21.0
badaniem          0.0    21.0
balkonu           0.0    21.0
stopnia           0.0    21.0
wobec             0.0    21.0

Interestingly, the term "polecam" appears very frequently in the corpus, but does not appear at all in the background corpus, making it highly characteristic. Judging from Google Translate, it appears to mean something related to "recommend".
We are now ready to display the plot.
html = st.produce_scattertext_explorer(
    corpus,
    category='__label__meta_plus_m',
    category_name='Plus-M',
    not_category_name='Minus-M',
    minimum_term_frequency=1,
    width_in_pixels=1000,
    transform=st.Scalers.dense_rank
)

We can change the formula used to produce the characteristicness scores via the characteristic_scorer parameter of produce_scattertext_explorer.
It takes an instance of a descendant of the CharacteristicScorer class. See DenseRankCharacteristicness.py for an example of how to make your own.
Here is an example of plotting with a modified characteristic scorer:
html = st.produce_scattertext_explorer(
    corpus,
    category='__label__meta_plus_m',
    category_name='Plus-M',
    not_category_name='Minus-M',
    minimum_term_frequency=1,
    transform=st.Scalers.dense_rank,
    characteristic_scorer=st.DenseRankCharacteristicness(),
    term_ranker=st.termranking.AbsoluteFrequencyRanker,
    term_scorer=st.ScaledFScorePresets(beta=1, one_to_neg_one=True)
)
fn = 'demo_dense_rank_characteristic.html'  # output file name (illustrative)
open(fn, 'wb').write(html.encode('utf-8'))
print('open ' + fn)

Note that numbers show up as more characteristic when using the dense rank difference. It may be that they occur unusually frequently in this corpus, or perhaps the background word frequencies under-counted numbers.
Word productivity is one strategy for plotting word-based charts describing an uncategorized corpus.
Productivity is defined in Schumann (2016) (Jason: check this) as the entropy of the n-grams which contain a term. For the entropy computation, the probability of an n-gram with respect to the term whose productivity is being calculated is the frequency of the n-gram divided by the term's frequency.
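A minimal sketch of that definition follows. The function name is illustrative and the log base is an assumption; Scattertext's own implementation is in st.whole_corpus_productivity_scores, used below:

```python
import numpy as np

def productivity_sketch(term_frequency: float, containing_ngram_frequencies: list) -> float:
    # Entropy of the n-grams containing the term: each n-gram's probability is its
    # frequency divided by the term's frequency (log base 2 is an assumption here).
    probs = np.array(containing_ngram_frequencies, dtype=float) / term_frequency
    probs = probs[probs > 0]
    return float(-(probs * np.log2(probs)).sum())
```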
Since productivity highly correlates with frequency, the recommended metric to plot is the dense rank difference between frequency and productivity.
The snippet below plots words in the convention corpus based on their log frequency and their productivity.
The function st.whole_corpus_productivity_scores returns a DataFrame giving each word's productivity.
Productivity scores should be calculated on a Corpus-like object which contains a complete set of unigrams and at least bigrams. This corpus should not be compacted before the productivity scores are calculated.
For example, in the convention corpus, the terms with lower productivity have more limited usage (e.g., "thank" mostly in "thank you", "united" mostly in "united states"), while the terms with higher productivity occur in a wider variety of contexts ("getting", "actually", "political", etc.).
import spacy
import scattertext as st

corpus_no_cat = st.CorpusWithoutCategoriesFromParsedDocuments(
    st.SampleCorpora.ConventionData2012.get_data().assign(
        Parse=lambda df: [x for x in spacy.load('en_core_web_sm').pipe(df.text)]),
    parsed_col='Parse'
).build()
compact_corpus_no_cat = corpus_no_cat.get_stoplisted_unigram_corpus().remove_infrequent_words(9)
plot_df = st.whole_corpus_productivity_scores(corpus_no_cat).assign(
    RankDelta=lambda df: st.RankDifference().get_scores(
        a=df.Productivity,
        b=df.Frequency
    )
).reindex(
    compact_corpus_no_cat.get_terms()
).dropna().assign(
    X=lambda df: df.Frequency,
    Xpos=lambda df: st.Scalers.log_scale(df.Frequency),
    Y=lambda df: df.RankDelta,
    Ypos=lambda df: st.Scalers.scale(df.RankDelta),
)
html = st.dataframe_scattertext(
    compact_corpus_no_cat.whitelist_terms(plot_df.index),
    plot_df=plot_df,
    metadata=lambda df: df.get_df()['speaker'],
    ignore_categories=True,
    x_label='Rank Frequency',
    y_label="Productivity",
    left_list_column='Ypos',
    color_score_column='Ypos',
    y_axis_labels=['Least Productive', 'Average Productivity', 'Most Productive'],
    header_names={'upper': 'Most Productive', 'lower': 'Least Productive', 'right': 'Characteristic'},
    horizontal_line_y_position=0
)

Let's now turn our attention to a novel term scoring metric, Scaled F-Score. We'll examine it on a unigram version of the Rotten Tomatoes corpus (Pang et al. 2002), which contains excerpts of positive and negative movie reviews.
Please see Scaled F Score Explanation for a notebook version of this analysis.
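As a preview of where the derivation below ends up: the positive-class Scaled F-Score of a term combines its precision and its frequency share, each rescaled by a normal CDF fit to that statistic's mean and standard deviation, via a harmonic mean (this summarizes the computation carried out step by step in the code that follows):

$$\mathcal{H}(a, b) = \frac{2ab}{a + b}, \qquad \mathrm{SFS}_{\mathrm{pos}}(t) = \mathcal{H}\big(\Phi(\mathrm{prec}(t)),\ \Phi(\mathrm{freq\%}(t))\big)$$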
from scipy.stats import hmean

term_freq_df = corpus.get_unigram_corpus().get_term_freq_df()[['Positive freq', 'Negative freq']]
term_freq_df = term_freq_df[term_freq_df.sum(axis=1) > 0]

term_freq_df['pos_precision'] = (term_freq_df['Positive freq'] * 1. /
                                 (term_freq_df['Positive freq'] + term_freq_df['Negative freq']))
term_freq_df['pos_freq_pct'] = (term_freq_df['Positive freq'] * 1.
                                / term_freq_df['Positive freq'].sum())
term_freq_df['pos_hmean'] = (term_freq_df
                             .apply(lambda x: (hmean([x['pos_precision'], x['pos_freq_pct']])
                                               if x['pos_precision'] > 0 and x['pos_freq_pct'] > 0
                                               else 0), axis=1))
term_freq_df.sort_values(by='pos_hmean', ascending=False).iloc[:10]

If we plot term frequency on the x-axis and the percentage of a term's occurrences which are in positive documents (i.e., its precision) on the y-axis, we can see that low-frequency terms have much higher variation in precision. Given that these terms have low frequencies, their harmonic means are low. Thus, the only terms with a high harmonic mean are extremely frequent words, which tend to all have near-average precisions.
freq = term_freq_df.pos_freq_pct.values
prec = term_freq_df.pos_precision.values
html = st.produce_scattertext_explorer(
    corpus.remove_terms(set(corpus.get_terms()) - set(term_freq_df.index)),
    category='Positive',
    not_category_name='Negative',
    not_categories=['Negative'],
    x_label='Portion of words used in positive reviews',
    original_x=freq,
    x_coords=(freq - freq.min()) / freq.max(),
    x_axis_values=[int(freq.min() * 1000) / 1000.,
                   int(freq.max() * 1000) / 1000.],
    y_label='Portion of documents containing word that are positive',
    original_y=prec,
    y_coords=(prec - prec.min()) / prec.max(),
    y_axis_values=[int(prec.min() * 1000) / 1000.,
                   int((prec.max() / 2.) * 1000) / 1000.,
                   int(prec.max() * 1000) / 1000.],
    scores=term_freq_df.pos_hmean.values,
    sort_by_dist=False,
    show_characteristic=False
)
file_name = 'not_normed_freq_prec.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width=1300, height=700)

from scipy.stats import norm
def normcdf(x):
    return norm.cdf(x, x.mean(), x.std())

term_freq_df['pos_precision_normcdf'] = normcdf(term_freq_df.pos_precision)
term_freq_df['pos_freq_pct_normcdf'] = normcdf(term_freq_df.pos_freq_pct.values)
term_freq_df['pos_scaled_f_score'] = hmean(
    [term_freq_df['pos_precision_normcdf'], term_freq_df['pos_freq_pct_normcdf']])
term_freq_df.sort_values(by='pos_scaled_f_score', ascending=False).iloc[:10]

freq = term_freq_df.pos_freq_pct_normcdf.values
prec = term_freq_df.pos_precision_normcdf.values
html = st.produce_scattertext_explorer(
    corpus.remove_terms(set(corpus.get_terms()) - set(term_freq_df.index)),
    category='Positive',
    not_category_name='Negative',
    not_categories=['Negative'],
    x_label='Portion of words used in positive reviews (norm-cdf)',
    original_x=freq,
    x_coords=(freq - freq.min()) / freq.max(),
    x_axis_values=[int(freq.min() * 1000) / 1000.,
                   int(freq.max() * 1000) / 1000.],
    y_label='documents containing word that are positive (norm-cdf)',
    original_y=prec,
    y_coords=(prec - prec.min()) / prec.max(),
    y_axis_values=[int(prec.min() * 1000) / 1000.,
                   int((prec.max() / 2.) * 1000) / 1000.,
                   int(prec.max() * 1000) / 1000.],
    scores=term_freq_df.pos_scaled_f_score.values,
    sort_by_dist=False,
    show_characteristic=False
)

term_freq_df['neg_precision_normcdf'] = normcdf((term_freq_df['Negative freq'] * 1. /
                                                 (term_freq_df['Negative freq'] + term_freq_df['Positive freq'])))
term_freq_df['neg_freq_pct_normcdf'] = normcdf((term_freq_df['Negative freq'] * 1.
                                                / term_freq_df['Negative freq'].sum()))
term_freq_df['neg_scaled_f_score'] = hmean(
    [term_freq_df['neg_precision_normcdf'], term_freq_df['neg_freq_pct_normcdf']])

term_freq_df['scaled_f_score'] = 0
term_freq_df.loc[term_freq_df['pos_scaled_f_score'] > term_freq_df['neg_scaled_f_score'],
                 'scaled_f_score'] = term_freq_df['pos_scaled_f_score']
term_freq_df.loc[term_freq_df['pos_scaled_f_score'] < term_freq_df['neg_scaled_f_score'],
                 'scaled_f_score'] = 1 - term_freq_df['neg_scaled_f_score']
term_freq_df['scaled_f_score'] = 2 * (term_freq_df['scaled_f_score'] - 0.5)
term_freq_df.sort_values(by='scaled_f_score', ascending=True).iloc[:10]

is_pos = term_freq_df.pos_scaled_f_score > term_freq_df.neg_scaled_f_score
freq = term_freq_df.pos_freq_pct_normcdf * is_pos - term_freq_df.neg_freq_pct_normcdf * ~is_pos
prec = term_freq_df.pos_precision_normcdf * is_pos - term_freq_df.neg_precision_normcdf * ~is_pos

def scale(ar):
    return (ar - ar.min()) / (ar.max() - ar.min())

def close_gap(ar):
    ar[ar > 0] -= ar[ar > 0].min()
    ar[ar < 0] -= ar[ar < 0].max()
    return ar

html = st.produce_scattertext_explorer(
    corpus.remove_terms(set(corpus.get_terms()) - set(term_freq_df.index)),
    category='Positive',
    not_category_name='Negative',
    not_categories=['Negative'],
    x_label='Frequency',
    original_x=freq,
    x_coords=scale(close_gap(freq)),
    x_axis_labels=['Frequent in Neg',
                   'Not Frequent',
                   'Frequent in Pos'],
    y_label='Precision',
    original_y=prec,
    y_coords=scale(close_gap(prec)),
    y_axis_labels=['Neg Precise',
                   'Imprecise',
                   'Pos Precise'],
    scores=(term_freq_df.scaled_f_score.values + 1) / 2,
    sort_by_dist=False,
    show_characteristic=False
)

We can use st.ScaledFScorePresets as a term scorer to display terms' Scaled F-Scores on the y-axis and term frequencies on the x-axis.
html = st.produce_frequency_explorer(
    corpus.remove_terms(set(corpus.get_terms()) - set(term_freq_df.index)),
    category='Positive',
    not_category_name='Negative',
    not_categories=['Negative'],
    term_scorer=st.ScaledFScorePresets(beta=1, one_to_neg_one=True),
    metadata=rdf['movie_name'],
    grey_threshold=0
)

Scaled F-Score is not the only scoring method included in Scattertext. Please click on one of the links below to view a notebook which describes how other class association scores work and how they can be visualized through Scattertext.
New in 0.0.2.73 is the delta JS-Divergence scorer, DeltaJSDivergence (Gallagher et al. 2020), and its corresponding compactor, JSDCompactor. See demo_deltajsd.py for an example usage.
New in 0.0.2.72
Scattertext was originally set up to visualize corpus objects: connected sets of documents and terms. The "compaction" process allows users to eliminate terms which may not be associated with a category using a variety of feature selection methods. The issue with this is that the terms eliminated during selection are not taken into account when scaling term positions.
This issue can be mitigated by using the position-select-plot process, where term positions are determined before the selection step is made.
Let's first use the 2012 conventions corpus, update the category names, and create a unigram corpus.
import scattertext as st
import numpy as np

df = st.SampleCorpora.ConventionData2012.get_data().assign(
    parse=lambda df: df.text.apply(st.whitespace_nlp_with_sentences)
).assign(party=lambda df: df['party'].apply({'democrat': 'Democratic', 'republican': 'Republican'}.get))

corpus = st.CorpusFromParsedDocuments(
    df, category_col='party', parsed_col='parse'
).build().get_unigram_corpus()

category_name = 'Democratic'
not_category_name = 'Republican'

Next, let's create a dataframe consisting of the original counts and their log-scaled positions.
def get_log_scale_df(corpus, y_category, x_category):
    term_coord_df = corpus.get_term_freq_df('')

    # Log scale term counts (with a smoothing constant) as the initial coordinates
    coord_columns = []
    for category in [y_category, x_category]:
        col_name = category + '_coord'
        term_coord_df[col_name] = np.log(term_coord_df[category] + 1e-6) / np.log(2)
        coord_columns.append(col_name)

    # Scale these coordinates to between 0 and 1
    min_offset = term_coord_df[coord_columns].min(axis=0).min()
    for coord_column in coord_columns:
        term_coord_df[coord_column] -= min_offset
    max_offset = term_coord_df[coord_columns].max(axis=0).max()
    for coord_column in coord_columns:
        term_coord_df[coord_column] /= max_offset
    return term_coord_df

# Get term coordinates from original corpus
term_coordinates = get_log_scale_df(corpus, category_name, not_category_name)
print(term_coordinates)

Here is a preview of the term_coordinates dataframe. The Democratic and Republican columns contain the term counts, while the _coord columns contain their logged coordinates. Visualizing 7,973 terms is difficult (but possible) for people running Scattertext on most computers.
Democratic Republican Democratic_coord Republican_coord
term
thank 158 205 0.860166 0.872032
you 836 794 0.936078 0.933729
so 337 212 0.894681 0.873562
much 84 76 0.831380 0.826820
very 62 75 0.817543 0.826216
... ... ... ... ...
precinct 0 2 0.000000 0.661076
godspeed 0 1 0.000000 0.629493
beauty 0 1 0.000000 0.629493
bumper 0 1 0.000000 0.629493
sticker 0 1 0.000000 0.629493
[7973 rows x 4 columns]
We can visualize this full data set by running the following code block. We'll create a custom Javascript function to populate the tooltip with the original term counts, and create a Scattertext Explorer where the x and y coordinates and original values are specified from the data frame. Additionally, we can use show_diagonal=True to draw a dashed diagonal line across the plot area.
You can click the chart below to see the interactive version. Note that it will take a while to load.
# The tooltip JS function. Note that d is is the term data object, and ox and oy are the original x- and y-
# axis counts.
get_tooltip_content = ('(function(d) {return d.term + "<br/>' + not_category_name + ' Count: " ' +
'+ d.ox +"<br/>' + category_name + ' Count: " + d.oy})')
html_orig = st.produce_scattertext_explorer(
corpus,
category=category_name,
not_category_name=not_category_name,
minimum_term_frequency=0,
pmi_threshold_coefficient=0,
width_in_pixels=1000,
metadata=corpus.get_df()['speaker'],
show_diagonal=True,
original_y=term_coordinates[category_name],
original_x=term_coordinates[not_category_name],
x_coords=term_coordinates[not_category_name + '_coord'],
y_coords=term_coordinates[category_name + '_coord'],
max_overlapping=3,
use_global_scale=True,
get_tooltip_content=get_tooltip_content,
)
Next, we can visualize the compacted version of the corpus. The compaction, performed using ClassPercentageCompactor, selects terms which appear frequently in each category. The term_count parameter, set to 2, is used to determine the percentile threshold for terms to keep in a particular category. This is done by calculating the percentile of terms (types) in each category which appear more than two times. We find the smallest percentile, and only include terms which occur above that percentile in a given category. A rough sketch of this percentile logic follows.
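As a rough, illustrative sketch of that percentile rule (not the library's exact implementation; the thresholding and ranking details here are assumptions), the selection could be approximated with pandas as follows:
# Rough sketch of the ClassPercentageCompactor idea (illustrative only; the
# library's internals may differ).
term_freq_df = corpus.get_term_freq_df('')  # one count column per category
term_count = 2

# For each category, the percentile rank at or below which terms appear <= term_count times
percentiles = {category: (term_freq_df[category] <= term_count).mean()
               for category in term_freq_df.columns}
threshold = min(percentiles.values())

# Keep a term if its frequency rank exceeds that percentile in some category
keep = np.zeros(len(term_freq_df), dtype=bool)
for category in term_freq_df.columns:
    keep |= term_freq_df[category].rank(pct=True).values > threshold
print(term_freq_df.index[keep])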
Note that this compaction leaves only 2,828 terms. This number is much easier for Scattertext to display in a browser.
# Select terms which appear above a minimum threshold in both corpora
compact_corpus = corpus.compact(st.ClassPercentageCompactor(term_count=2))

# Only take term coordinates of terms remaining in the corpus
term_coordinates = term_coordinates.loc[compact_corpus.get_terms()]

html_compact = st.produce_scattertext_explorer(
    compact_corpus,
    category=category_name,
    not_category_name=not_category_name,
    minimum_term_frequency=0,
    pmi_threshold_coefficient=0,
    width_in_pixels=1000,
    metadata=corpus.get_df()['speaker'],
    show_diagonal=True,
    original_y=term_coordinates[category_name],
    original_x=term_coordinates[not_category_name],
    x_coords=term_coordinates[not_category_name + '_coord'],
    y_coords=term_coordinates[category_name + '_coord'],
    max_overlapping=3,
    use_global_scale=True,
    get_tooltip_content=get_tooltip_content,
)
Occasionally, only term frequency statistics are available. This may happen in the case of very large, lost, or proprietary data sets. TermCategoryFrequencies is a corpus representation that can accept this sort of data, along with any categorized documents that happen to be available.
Let's use the Corpus of Contemporary American English (COCA) as an example.
We'll construct a visualization to analyze the difference between spoken American English and English that occurs in fiction.
import pandas as pd

df = (pd.read_excel('https://www.wordfrequency.info/files/genres_sample.xls')
      .dropna()
      .set_index('lemma')[['SPOKEN', 'FICTION']]
      .iloc[:1000])
df.head()
'''
          SPOKEN    FICTION
lemma
the    3859682.0  4092394.0
I      1346545.0  1382716.0
they    609735.0   352405.0
she     212920.0   798208.0
would   233766.0   229865.0
'''
Transforming this into a visualization is extremely easy. Just pass a dataframe indexed on terms, with columns indicating category counts, into the TermCategoryFrequencies constructor.
term_cat_freq = st.TermCategoryFrequencies(df)
And call produce_scattertext_explorer normally:
html = st.produce_scattertext_explorer(
    term_cat_freq,
    category='SPOKEN',
    category_name='Spoken',
    not_category_name='Fiction',
)
If you'd like to incorporate some documents into the visualization, you can add them to the TermCategoryFrequencies object.
First, let's extract some example Fiction and Spoken documents from the sample COCA corpus.
import requests, zipfile, io

coca_sample_url = 'http://corpus.byu.edu/cocatext/samples/text.zip'
zip_file = zipfile.ZipFile(io.BytesIO(requests.get(coca_sample_url).content))

document_df = pd.DataFrame(
    [{'text': zip_file.open(fn).read().decode('utf-8'),
      'category': 'SPOKEN'}
     for fn in zip_file.filelist if fn.filename.startswith('w_spok')][:2]
    + [{'text': zip_file.open(fn).read().decode('utf-8'),
        'category': 'FICTION'}
       for fn in zip_file.filelist if fn.filename.startswith('w_fic')][:2])
We'll pass the document_df dataframe into TermCategoryFrequencies via the document_category_df parameter. Ensure the dataframe has two columns, 'text' and 'category'. Afterward, we can call produce_scattertext_explorer (or your visualization function of choice) normally.
doc_term_cat_freq = st.TermCategoryFrequencies(df, document_category_df=document_df)

html = st.produce_scattertext_explorer(
    doc_term_cat_freq,
    category='SPOKEN',
    category_name='Spoken',
    not_category_name='Fiction',
)
Word representations have recently become a hot topic in NLP. While lots of work has been done visualizing how terms relate to one another given their scores (eg, http://projector.tensorflow.org/), none to my knowledge has been done visualizing how we can use these to examine how document categories differ.
In this example, given the query term "jobs", we can see how Republicans and Democrats talk about it differently.
In this configuration of Scattertext, words are colored by their similarity to a query phrase.
This is done using spaCy-provided GloVe word vectors (trained on the Common Crawl corpus). The cosine distance between vectors is used, with mean vectors used for phrases.
The calculation of the most similar terms associated with each category is a simple heuristic. First, sets of terms closely associated with a category are found. Second, these terms are ranked based on their similarity to the query, and the top-ranked terms are displayed to the right of the scatterplot.
A term is considered associated if its p-value is less than 0.05. P-values are determined using Monroe et al. (2008)'s difference in the weighted log-odds-ratios with an uninformative Dirichlet prior. This is the only model-based method discussed in Monroe et al. that does not rely on a large, in-domain background corpus. Since we are scoring bigrams in addition to the unigrams scored by Monroe, the size of the corpus would have to be larger to have high enough bigram counts for proper penalization. This function relies on the Dirichlet distribution's parameter alpha, a vector, which is uniformly set to 0.01.
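For intuition, here is a minimal NumPy sketch of the z-scored log-odds-ratio with an uninformative Dirichlet prior described above. It only illustrates the statistic; it is not the library's implementation, and the function name is made up for this example.
import numpy as np
from scipy.stats import norm

def log_odds_ratio_uninformative_dirichlet_prior(y_i, y_j, alpha=0.01):
    # y_i, y_j: arrays of term counts in the two categories; alpha: uniform Dirichlet prior
    n_i, n_j, a0 = y_i.sum(), y_j.sum(), alpha * len(y_i)
    # Difference in smoothed log-odds between the two categories (Monroe et al. 2008)
    delta = (np.log((y_i + alpha) / (n_i + a0 - y_i - alpha))
             - np.log((y_j + alpha) / (n_j + a0 - y_j - alpha)))
    # Approximate variance of the difference, and its z-score
    z = delta / np.sqrt(1. / (y_i + alpha) + 1. / (y_j + alpha))
    # Two-sided p-values; terms with p < 0.05 count as category-associated
    return z, 2 * norm.sf(np.abs(z))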
Here is the code to produce such a visualization.
>>> from scattertext import word_similarity_explorer
>>> html = word_similarity_explorer(corpus,
... category='democrat',
... category_name='Democratic',
... not_category_name='Republican',
... target_term='jobs',
... minimum_term_frequency=5,
... pmi_threshold_coefficient=4,
... width_in_pixels=1000,
... metadata=convention_df['speaker'],
... alpha=0.01,
... max_p_val=0.05,
... save_svg_button=True)
>>> open("Convention-Visualization-Jobs.html", 'wb').write(html.encode('utf-8'))
Scattertext can interface with Gensim Word2Vec models. For example, here's a snippet from demo_gensim_similarity.py which illustrates how to train and use a word2vec model on a corpus. Note the similarities produced reflect quirks of the corpus, eg, "8" tends to refer to the 8% unemployment rate at the time of the convention.
import spacy
from gensim.models import word2vec
from scattertext import SampleCorpora, word_similarity_explorer_gensim, Word2VecFromParsedCorpus
from scattertext.CorpusFromParsedDocuments import CorpusFromParsedDocuments
nlp = spacy.load('en')
convention_df = SampleCorpora.ConventionData2012.get_data()
convention_df['parsed'] = convention_df.text.apply(nlp)
corpus = CorpusFromParsedDocuments(convention_df, category_col='party', parsed_col='parsed').build()
model = word2vec.Word2Vec(size=300,
                          alpha=0.025,
                          window=5,
                          min_count=5,
                          max_vocab_size=None,
                          sample=0,
                          seed=1,
                          workers=1,
                          min_alpha=0.0001,
                          sg=1,
                          hs=1,
                          negative=0,
                          cbow_mean=0,
                          iter=1,
                          null_word=0,
                          trim_rule=None,
                          sorted_vocab=1)
html = word_similarity_explorer_gensim(corpus,
                                       category='democrat',
                                       category_name='Democratic',
                                       not_category_name='Republican',
                                       target_term='jobs',
                                       minimum_term_frequency=5,
                                       pmi_threshold_coefficient=4,
                                       width_in_pixels=1000,
                                       metadata=convention_df['speaker'],
                                       word2vec=Word2VecFromParsedCorpus(corpus, model).train(),
                                       max_p_val=0.05,
                                       save_svg_button=True)
open('./demo_gensim_similarity.html', 'wb').write(html.encode('utf-8'))
How Democrats and Republicans talked differently about "jobs" in their 2012 convention speeches.
We can use Scattertext to visualize alternative types of word scores and ensure that zero scores are greyed out. Use the sparse_explorer function to accomplish this, and see its source code for more details.
>>> from sklearn.linear_model import Lasso
>>> from scattertext import sparse_explorer
>>> html = sparse_explorer(corpus,
... category='democrat',
... category_name='Democratic',
... not_category_name='Republican',
... scores = corpus.get_regression_coefs('democrat', Lasso(max_iter=10000)),
... minimum_term_frequency=5,
... pmi_threshold_coefficient=4,
... width_in_pixels=1000,
... metadata=convention_df['speaker'])
>>> open('./Convention-Visualization-Sparse.html', 'wb').write(html.encode('utf-8'))
You can also use custom term positions and axis labels. For example, you can base terms' y-axis positions on a regression coefficient and their x-axis on term frequency and label the axes accordingly. The one catch is that axis positions must be scaled between 0 and 1.
First, let's define two scaling functions: scale, which projects positive values to [0,1], and zero_centered_scale, which projects real values to [0,1] such that negative values are always <0.5 and positive values are always >0.5.
>>> def scale(ar):
... return (ar - ar.min()) / (ar.max() - ar.min())
...
>>> def zero_centered_scale(ar):
... ar[ar > 0] = scale(ar[ar > 0])
... ar[ar < 0] = -scale(-ar[ar < 0])
... return (ar + 1) / 2.
Next, let's compute and scale term frequencies and L2-penalized regression coefficients. We'll hang on to the original coefficients and allow users to view them by mousing over terms.
>>> from sklearn.linear_model import LogisticRegression
>>> import numpy as np
>>>
>>> term_freq_df = corpus.get_term_freq_df()
>>> frequencies_scaled = scale(np.log(term_freq_df.sum(axis=1).values))
>>> scores = corpus.get_logreg_coefs('democrat',
... LogisticRegression(penalty='l2', C=10, max_iter=10000, n_jobs=-1))
>>> scores_scaled = zero_centered_scale(scores)
Finally, we can write the visualization. Note the use of the x_coords and y_coords parameters to store the respective coordinates, the scores and sort_by_dist arguments to register the original coefficients and use them to rank the terms in the right-hand list, and the x_label and y_label arguments to label axes.
>>> html = produce_scattertext_explorer(corpus,
... category='democrat',
... category_name='Democratic',
... not_category_name='Republican',
... minimum_term_frequency=5,
... pmi_threshold_coefficient=4,
... width_in_pixels=1000,
... x_coords=frequencies_scaled,
... y_coords=scores_scaled,
... scores=scores,
... sort_by_dist=False,
... metadata=convention_df['speaker'],
... x_label='Log frequency',
... y_label='L2-penalized logistic regression coef')
>>> open('demo_custom_coordinates.html', 'wb').write(html.encode('utf-8'))
The Emoji analysis capability displays a chart of the category-specific distribution of Emoji. Let's look at a new corpus, a set of tweets. We'll build a visualization showing how men and women use emoji differently.
Note: the following example is implemented in demo_emoji.py .
First, we'll load the dataset and parse it using NLTK's tweet tokenizer. Note, install NLTK before running this example. It will take some time for the dataset to download.
import nltk, urllib.request, io, agefromname, zipfile
import scattertext as st
import pandas as pd

with zipfile.ZipFile(io.BytesIO(urllib.request.urlopen(
    'http://followthehashtag.com/content/uploads/USA-Geolocated-tweets-free-dataset-Followthehashtag.zip'
).read())) as zf:
    df = pd.read_excel(zf.open('dashboard_x_usa_x_filter_nativeretweets.xlsx'))

nlp = st.tweet_tokenzier_factory(nltk.tokenize.TweetTokenizer())
df['parse'] = df['Tweet content'].apply(nlp)
df.iloc[0]
'''
Tweet Id 721318437075685382
Date 2016-04-16
Hour 12:44
User Name Bill Schulhoff
Nickname BillSchulhoff
Bio Husband,Dad,GrandDad,Ordained Minister, Umpire...
Tweet content Wind 3.2 mph NNE. Barometer 30.20 in, Rising s...
Favs NaN
RTs NaN
Latitude 40.7603
Longitude -72.9547
Country US
Place (as appears on Bio) East Patchogue, NY
Profile picture http://pbs.twimg.com/profile_images/3788000007...
Followers 386
Following 705
Listed 24
Tweet language (ISO 639-1) en
Tweet Url http://www.twitter.com/BillSchulhoff/status/72...
parse Wind 3.2 mph NNE. Barometer 30.20 in, Rising s...
Name: 0, dtype: object
'''
Next, we'll use the AgeFromName package to find the probability of each user's gender given their first name. First, we'll build a dataframe, indexed on first names, that contains the probability that someone with that first name is male (male_prob).
male_prob = agefromname.AgeFromName().get_all_name_male_prob()
male_prob.iloc[0]
'''
hi 1.00000
lo 0.95741
prob 1.00000
Name: aaban, dtype: float64
'''
Next, we'll extract the first name of each user, and use the male_prob data frame to find users whose names indicate there is at least a 90% chance they are either male or female, label those users, and create a new data frame df_mf containing only those users.
df['first_name'] = df['User Name'].apply(
    lambda x: x.split()[0].lower() if type(x) == str and len(x.split()) > 0 else x)
df_aug = pd.merge(df, male_prob, left_on='first_name', right_index=True)
df_aug['gender'] = df_aug['prob'].apply(lambda x: 'm' if x > 0.9 else 'f' if x < 0.1 else '?')
df_mf = df_aug[df_aug['gender'].isin(['m', 'f'])]
The key to this analysis is to construct the corpus using only the emoji extractor st.FeatsFromSpacyDocOnlyEmoji, which builds the corpus from emoji alone and ignores everything else.
corpus = st.CorpusFromParsedDocuments(
    df_mf,
    parsed_col='parse',
    category_col='gender',
    feats_from_spacy_doc=st.FeatsFromSpacyDocOnlyEmoji()
).build()
Next, we'll run this through a standard produce_scattertext_explorer visualization.
html = st.produce_scattertext_explorer(
    corpus,
    category='f',
    category_name='Female',
    not_category_name='Male',
    use_full_doc=True,
    term_ranker=st.OncePerDocFrequencyRanker,
    sort_by_dist=False,
    metadata=(df_mf['User Name']
              + ' (@' + df_mf['Nickname'] + ') '
              + df_mf['Date'].astype(str)),
    width_in_pixels=1000
)
open("EmojiGender.html", 'wb').write(html.encode('utf-8'))
SentencePiece tokenization is a subword tokenization technique which relies on a language model to produce an optimized tokenization. It has been used in large, transformer-based contextual language models.
Make sure to run $ pip install sentencepiece before running this example.
First, let's load the political convention data set as normal.
import tempfile
import re
import scattertext as st

convention_df = st.SampleCorpora.ConventionData2012.get_data()
convention_df['parse'] = convention_df.text.apply(st.whitespace_nlp_with_sentences)
Next, let's train a SentencePiece tokenizer on this data. The train_sentence_piece_tokenizer function trains a SentencePieceProcessor on the data set and returns it. You can of course use any SentencePieceProcessor.
def train_sentence_piece_tokenizer(documents, vocab_size):
    '''
    :param documents: list-like, a list of str documents
    :param vocab_size: int, the size of the vocabulary to output
    :return: sentencepiece.SentencePieceProcessor
    '''
    import sentencepiece as spm
    sp = None
    with tempfile.NamedTemporaryFile(delete=True) as tempf:
        with tempfile.NamedTemporaryFile(delete=True) as tempm:
            tempf.write(('\n'.join(documents)).encode())
            tempf.flush()  # make sure the documents are written to disk before training
            spm.SentencePieceTrainer.Train(
                '--input=%s --model_prefix=%s --vocab_size=%s' % (tempf.name, tempm.name, vocab_size)
            )
            sp = spm.SentencePieceProcessor()
            sp.load(tempm.name + '.model')
    return sp

sp = train_sentence_piece_tokenizer(convention_df.text.values, vocab_size=2000)
Next, let's add the SentencePiece tokens as metadata when creating our corpus. In order to do this, pass a FeatsFromSentencePiece instance into the feats_from_spacy_doc parameter, and pass the SentencePieceProcessor into its constructor.
corpus = st.CorpusFromParsedDocuments(convention_df,
                                      parsed_col='parse',
                                      category_col='party',
                                      feats_from_spacy_doc=st.FeatsFromSentencePiece(sp)).build()
Now we can create the SentencePiece token scatter plot.
html = st.produce_scattertext_explorer(
    corpus,
    category='democrat',
    category_name='Democratic',
    not_category_name='Republican',
    sort_by_dist=False,
    metadata=convention_df['party'] + ': ' + convention_df['speaker'],
    term_scorer=st.RankDifference(),
    transform=st.Scalers.dense_rank,
    use_non_text_features=True,
    use_full_doc=True,
)
Suppose you'd like to audit or better understand the weights or importances a classifier gives to bag-of-words features.
It's easy to do this with Scattertext if you use a scikit-learn-style classifier.
For example, the Lightning package provides high-performance linear classifiers with scikit-learn-compatible interfaces.
First, let's import sklearn 's text feature extraction classes, the 20 Newsgroup corpus, Lightning's Primal Coordinate Descent classifier, and Scattertext. We'll also fetch the training portion of the Newsgroup corpus.
from lightning.classification import CDClassifier
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import scattertext as st

newsgroups_train = fetch_20newsgroups(
    subset='train',
    remove=('headers', 'footers', 'quotes')
)
Next, we'll tokenize our corpus twice: once into tf-idf features which will be used to train the classifier, and another time into n-gram counts that will be used by Scattertext. It's important that both vectorizers share the same vocabulary, since we'll need to apply the weight vector from the model onto our Scattertext Corpus.
vectorizer = TfidfVectorizer()
tfidf_X = vectorizer.fit_transform(newsgroups_train.data)
count_vectorizer = CountVectorizer(vocabulary=vectorizer.vocabulary_)
Next, we use the CorpusFromScikit factory to build a Scattertext Corpus object. The X parameter must be a document-by-feature matrix. The argument to the y parameter is an array of class labels, each an integer representing a different newsgroup. The feature_vocabulary is the vocabulary used by the vectorizers. The category_names parameter is a list of the 20 newsgroup names, which serves as the class-label list, and raw_texts is a list of the newsgroup texts.
corpus = st.CorpusFromScikit(
    X=count_vectorizer.fit_transform(newsgroups_train.data),
    y=newsgroups_train.target,
    feature_vocabulary=vectorizer.vocabulary_,
    category_names=newsgroups_train.target_names,
    raw_texts=newsgroups_train.data
).build()
Now, we can train the model on tfidf_X and the categorical response variable, and capture the feature weights for category 0 ("alt.atheism").
clf = CDClassifier(penalty="l1/l2",
                   loss="squared_hinge",
                   multiclass=True,
                   max_iter=20,
                   alpha=1e-4,
                   C=1.0 / tfidf_X.shape[0],
                   tol=1e-3)
clf.fit(tfidf_X, newsgroups_train.target)
term_scores = clf.coef_[0]
Finally, we can create a Scattertext plot. We'll use the Monroe-style visualization, and automatically select around 4,000 terms that encompass the set of frequent terms, terms with high absolute scores, and terms that are characteristic of the corpus.
html = st.produce_frequency_explorer(
    corpus,
    'alt.atheism',
    scores=term_scores,
    use_term_significance=False,
    terms_to_include=st.AutoTermSelector.get_selected_terms(corpus, term_scores, 4000),
    metadata=['/'.join(fn.split('/')[-2:]) for fn in newsgroups_train.filenames]
)
Let's take a look at the performance of the classifier:
from sklearn.metrics import f1_score

newsgroups_test = fetch_20newsgroups(subset='test',
                                     remove=('headers', 'footers', 'quotes'))
X_test = vectorizer.transform(newsgroups_test.data)
pred = clf.predict(X_test)
f1 = f1_score(newsgroups_test.target, pred, average='micro')
print("Microaveraged F1 score", f1)
Microaveraged F1 score 0.662108337759. Not bad over a ~0.05 baseline.
Please see Signo for an introduction to semiotic squares.
Some variants of the semiotic square creator can be seen in this notebook, which studies words and phrases in headlines that had low or high Facebook engagement and were published by either BuzzFeed or the New York Times: http://nbviewer.jupyter.org/github/JasonKessler/PuPPyTalk/blob/master/notebooks/Explore-Headlines.ipynb
The idea behind the semiotic square is to express the relationship between two opposing concepts, and related concepts, within a larger domain of discourse. Examples of opposed concepts are life and death, male and female, or, in our example, positive and negative sentiment. Semiotic squares are composed of four "corners": the upper two corners are the opposing concepts, while the bottom corners are the negations of those concepts.
Circumscribing the negation of a concept involves finding everything in the domain of discourse that isn't associated with the concept. For example, in the life-death opposition, one can consider the universe of discourse to be all animate beings, real and hypothetical. The not-alive category will cover dead things, but also hypothetical entities like fictional characters or sentient AIs.
In building lexicalized semiotic squares, we consider concepts to be documents labeled in a corpus. Documents, in this setting, can belong to one of three categories: one of two labels corresponding to the opposing concepts, or a neutral category indicating that a document is in the same domain as the opposition but does not fall into either opposing category.
In the example below positive and negative movie reviews are treated as the opposing categories, while plot descriptions of the same movies are treated as the neutral category.
Terms associated with one of the two opposing categories (relative only to the other) are listed as being associated with that category. Terms associated with a neutral category (eg, not-positive) are terms which are associated with the disjunction of the opposite category and the neutral category. For example, not-positive terms are those most associated with the set of negative reviews and plot descriptions vs. positive reviews.
Common terms among adjacent corners of the square are also listed.
An HTML-rendered square is accompanied by a scatter plot. Points on the plot are terms. The x-axis is the Z-score of the association to one of the opposed concepts. The y-axis is the Z-score of how associated a term is with the neutral set of documents relative to the opposed set. A point's red-blue color indicates the term's opposed-association, while the more desaturated a term is, the more it is associated with the neutral set of documents.
Update to version 2.2: terms are colored by their nearest semiotic categories across the eight corresponding radial sectors.
import scattertext as st

movie_df = st.SampleCorpora.RottenTomatoes.get_data()
movie_df.category = movie_df.category.apply(
    lambda x: {'rotten': 'Negative', 'fresh': 'Positive', 'plot': 'Plot'}[x])

corpus = st.CorpusFromPandas(
    movie_df,
    category_col='category',
    text_col='text',
    nlp=st.whitespace_nlp_with_sentences
).build().get_unigram_corpus()

semiotic_square = st.SemioticSquare(
    corpus,
    category_a='Positive',
    category_b='Negative',
    neutral_categories=['Plot'],
    scorer=st.RankDifference(),
    labels={'not_a_and_not_b': 'Plot Descriptions', 'a_and_b': 'Reviews'}
)

html = st.produce_semiotic_square_explorer(semiotic_square,
                                           category_name='Positive',
                                           not_category_name='Negative',
                                           x_label='Fresh-Rotten',
                                           y_label='Plot-Review',
                                           neutral_category_name='Plot Description',
                                           metadata=movie_df['movie_name'])
There are a number of other semiotic square construction functions. Again, please see https://nbviewer.org/github/JasonKessler/PuPPyTalk/blob/master/notebooks/Explore-Headlines.ipynb for an overview of these.
A frequently requested feature of Scattertext has been the ability to visualize topic models. While this capability has existed in some forms (eg, the Empath visualization), I've finally gotten around to implementing a concise API for such a visualization. There are three main ways to visualize topic models using Scattertext. The first is the simplest: manually entering topic models and visualizing them. The second uses a Scikit-Learn pipeline to produce the topic models for visualization. The third is a novel topic modeling technique, based on finding terms similar to a custom set of seed terms.
If you have already created a topic model, simply structure it as a dictionary. This dictionary is keyed on strings which serve as topic titles and are displayed in the main scatterplot. The values are lists of words that belong to each topic. The words in each topic list are bolded when they appear in a snippet.
Note that currently, there is no support for keyword scores.
For example, one might manually define the following topic model to explore in the Convention corpus:
topic_model = {
    'money': ['money', 'bank', 'banks', 'finances', 'financial', 'loan', 'dollars', 'income'],
    'jobs': ['jobs', 'workers', 'labor', 'employment', 'worker', 'employee', 'job'],
    'patriotic': ['america', 'country', 'flag', 'americans', 'patriotism', 'patriotic'],
    'family': ['mother', 'father', 'mom', 'dad', 'sister', 'brother', 'grandfather', 'grandmother', 'son', 'daughter']
}
We can use the FeatsFromTopicModel class to transform this topic model into one which can be visualized using Scattertext. This is used just like any other feature builder, and we pass the topic model object into produce_scattertext_explorer.
import scattertext as st
topic_feature_builder = st.FeatsFromTopicModel(topic_model)
topic_corpus = st.CorpusFromParsedDocuments(
convention_df,
category_col='party',
parsed_col='parse',
feats_from_spacy_doc=topic_feature_builder
).build()
html = st.produce_scattertext_explorer(
topic_corpus,
category='democrat',
category_name='Democratic',
not_category_name='Republican',
width_in_pixels=1000,
metadata=convention_df['speaker'],
use_non_text_features=True,
use_full_doc=True,
pmi_threshold_coefficient=0,
topic_model_term_lists=topic_feature_builder.get_top_model_term_lists()
)
Since topic modeling using document-level cooccurrence generally produces poor results, I've added a SentencesForTopicModeling class which allows clustering by cooccurrence at the sentence level. It requires a ParsedCorpus object to be passed to its constructor, and creates a term-sentence matrix internally.
Next, you can create a topic model dictionary like the one above by passing in a Scikit-Learn clustering or dimensionality reduction pipeline. The only constraint is that the last transformer in the pipeline must populate a components_ attribute.
The num_terms_per_topic parameter specifies how many terms should be added to each topic's list.
In the following example, we'll use NMF to cluster a stoplisted, unigram corpus of documents, and use the topic model dictionary to create a FeatsFromTopicModel , just like before.
Note that in produce_scattertext_explorer, we set topic_model_preview_size to 20 in order to show a preview of the first 20 terms in each topic in the snippet view, as opposed to the default 10.
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline

convention_df = st.SampleCorpora.ConventionData2012.get_data()
convention_df['parse'] = convention_df['text'].apply(st.whitespace_nlp_with_sentences)
unigram_corpus = (st.CorpusFromParsedDocuments(convention_df,
                                               category_col='party',
                                               parsed_col='parse')
                  .build().get_stoplisted_unigram_corpus())

topic_model = st.SentencesForTopicModeling(unigram_corpus).get_topics_from_model(
    Pipeline([
        ('tfidf', TfidfTransformer(sublinear_tf=True)),
        ('nmf', (NMF(n_components=100, alpha=.1, l1_ratio=.5, random_state=0)))
    ]),
    num_terms_per_topic=20
)
topic_feature_builder = st.FeatsFromTopicModel(topic_model)

topic_corpus = st.CorpusFromParsedDocuments(
    convention_df,
    category_col='party',
    parsed_col='parse',
    feats_from_spacy_doc=topic_feature_builder
).build()

html = st.produce_scattertext_explorer(
    topic_corpus,
    category='democrat',
    category_name='Democratic',
    not_category_name='Republican',
    width_in_pixels=1000,
    metadata=convention_df['speaker'],
    use_non_text_features=True,
    use_full_doc=True,
    pmi_threshold_coefficient=0,
    topic_model_term_lists=topic_feature_builder.get_top_model_term_lists(),
    topic_model_preview_size=20
)
A surprisingly easy way to generate good topic models is to use a term scoring formula to find words that are associated with sentences where a seed word occurs vs. where one doesn't occur.
Given a custom term list, the SentencesForTopicModeling.get_topics_from_terms method will generate a series of topics. Note that the dense rank difference (RankDifference) works particularly well for this task and is the default scorer.
term_list = ['obama', 'romney', 'democrats', 'republicans', 'health', 'military', 'taxes',
             'education', 'olympics', 'auto', 'iraq', 'iran', 'israel']

unigram_corpus = (st.CorpusFromParsedDocuments(convention_df,
                                               category_col='party',
                                               parsed_col='parse')
                  .build().get_stoplisted_unigram_corpus())

topic_model = (st.SentencesForTopicModeling(unigram_corpus)
               .get_topics_from_terms(term_list,
                                      scorer=st.RankDifference(),
                                      num_terms_per_topic=20))

topic_feature_builder = st.FeatsFromTopicModel(topic_model)

# The remaining code is identical to the two examples above. See demo_word_list_topic_model.py
# for the complete example.
Scattertext makes it easy to create word-similarity plots using projections of word embeddings as the x- and y-axes. In the example below, we create a stop-listed Corpus with only unigram terms. The produce_projection_explorer function by default uses Gensim to create word embeddings and then projects them to two dimensions using Uniform Manifold Approximation and Projection (UMAP).
UMAP is chosen over t-SNE because it can employ the cosine similarity between two word vectors instead of just the Euclidean distance.
convention_df = st.SampleCorpora.ConventionData2012.get_data()
convention_df['parse'] = convention_df['text'].apply(st.whitespace_nlp_with_sentences)
corpus = (st.CorpusFromParsedDocuments(convention_df, category_col='party', parsed_col='parse')
          .build().get_stoplisted_unigram_corpus())
html = st.produce_projection_explorer(corpus, category='democrat', category_name='Democratic',
                                      not_category_name='Republican', metadata=convention_df.speaker)
In order to use custom word-embedding or projection functions, pass models into the word2vec_model and projection_model parameters. In order to use t-SNE, for example, use projection_model=sklearn.manifold.TSNE().
import umap
from gensim.models.word2vec import Word2Vec

html = st.produce_projection_explorer(corpus,
                                      word2vec_model=Word2Vec(size=100, window=5, min_count=10, workers=4),
                                      projection_model=umap.UMAP(min_dist=0.5, metric='cosine'),
                                      category='democrat',
                                      category_name='Democratic',
                                      not_category_name='Republican',
                                      metadata=convention_df.speaker)
Term positions can also be determined by the output of principal component analysis, which produce_projection_explorer also supports. We'll look at how axis transformations ("scalers" in Scattertext terminology) can make it easier to inspect the output of PCA.
We'll use the 2012 Conventions corpus for these visualizations. Only unigrams occurring in at least three documents will be considered.
>>> convention_df = st.SampleCorpora.ConventionData2012.get_data()
>>> convention_df['parse'] = convention_df['text'].apply(st.whitespace_nlp_with_sentences)
>>> corpus = (st.CorpusFromParsedDocuments(convention_df,
... category_col='party',
... parsed_col='parse')
... .build()
... .get_stoplisted_unigram_corpus()
... .remove_infrequent_words(minimum_term_count=3, term_ranker=st.OncePerDocFrequencyRanker))
Next, we use scikit-learn's tf-idf transformer to find very simple, sparse embeddings for all of these words. Since we input a #docs x #terms matrix into the transformer, we transpose the output to get a proper term-embedding matrix, where each row corresponds to a term and the columns correspond to document-specific tf-idf scores.
>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> embeddings = TfidfTransformer().fit_transform(corpus.get_term_doc_mat())
>>> embeddings.shape
(189, 2159)
>>> corpus.get_num_docs(), corpus.get_num_terms()
(189, 2159)
>>> embeddings = embeddings.T
>>> embeddings.shape
(2159, 189)
Given these sparse embeddings, we can apply sparse singular value decomposition to extract three factors. SVD factorizes the term-embedding matrix into three matrices: U, Σ, and VT. Importantly, the rows of U give the factor loadings for each term, the columns of VT give them for each document, and Σ is a vector of singular values.
>>> from scipy.sparse.linalg import svds
>>> U, S, VT = svds(embeddings, k = 3, maxiter=20000, which='LM')
>>> U.shape
(2159, 3)
>>> S.shape
(3,)
>>> VT.shape
(3, 189)
We'll look at the first two factors, plotting each term such that its x-axis position is its component in the first factor (the first column of U) and its y-axis position is its component in the second. To do this, we make a "projection" data frame, where the x and y columns store these two components and the data frame is keyed on each term. This controls the term positions on the chart.
>>> x_dim = 0; y_dim = 1;
>>> projection = pd.DataFrame({'term':corpus.get_terms(),
... 'x':U.T[x_dim],
... 'y':U.T[y_dim]}).set_index('term')
We'll use the produce_pca_explorer function to visualize these. Note we include the projection object and specify which factors were used for x and y (x_dim and y_dim) so they can be labeled in the interactive visualization.
html = st.produce_pca_explorer(corpus,
category='democrat',
category_name='Democratic',
not_category_name='Republican',
projection=projection,
metadata=convention_df['speaker'],
width_in_pixels=1000,
x_dim=x_dim,
y_dim=y_dim)
Click for an interactive visualization.
We can easily re-scale the plot in order to make more efficient use of space. For example, passing in scaler=scale_neg_1_to_1_with_zero_mean will make all four quadrants take equal area.
html = st.produce_pca_explorer(corpus,
category='democrat',
category_name='Democratic',
not_category_name='Republican',
projection=projection,
metadata=convention_df['speaker'],
width_in_pixels=1000,
scaler=st.scale_neg_1_to_1_with_zero_mean,
x_dim=x_dim,
y_dim=y_dim)
Click for an interactive visualization.
To export the content of a scattertext explorer object (ScattertextStructure) to matplotlib you can use produce_scattertext_pyplot . The function returns a matplotlib.figure.Figure object which can be visualized using plt.show or plt.savefig as in the example below.
Note that installation of textalloc==0.0.3 and matplotlib>=3.6.0 is required before running this.
convention_df = st.SampleCorpora.ConventionData2012.get_data().assign(
parse = lambda df: df.text.apply(st.whitespace_nlp_with_sentences)
)
corpus = st.CorpusFromParsedDocuments(convention_df, category_col='party', parsed_col='parse').build()
scattertext_structure = st.produce_scattertext_explorer(
corpus,
category='democrat',
category_name='Democratic',
not_category_name='Republican',
minimum_term_frequency=5,
pmi_threshold_coefficient=8,
width_in_pixels=1000,
return_scatterplot_structure=True,
)
fig = st.produce_scattertext_pyplot(scattertext_structure)
fig.savefig('pyplot_export.png', format='png')
Please see the examples in the PyData 2017 Tutorial on Scattertext.
Cozy: The Collection Synthesizer (Loncaric 2016) was used to help determine which terms could be labeled without overlapping a circle or another label. It automatically built a data structure to efficiently store and query the locations of each circle and labeled term.
The script to build rectangle-holder.js was
fields ax1 : long, ay1 : long, ax2 : long, ay2 : long
assume ax1 < ax2 and ay1 < ay2
query findMatchingRectangles(bx1 : long, by1 : long, bx2 : long, by2 : long)
assume bx1 < bx2 and by1 < by2
ax1 < bx2 and ax2 > bx1 and ay1 < by2 and ay2 > by1
And it was called using
$ python2.7 src/main.py <script file name> --enable-volume-trees
--js-class RectangleHolder --enable-hamt --enable-arrays --js rectangle_holder.js
Added code to ensure that term statistics show up even if no documents are present in the visualization.
Better axis labeling (see demo_axis_crossbars_and_labels.py).
Pytextrank compatibility
Ensured Pandas 1.0 compatibility, fixing Issue #51, and fixed the scikit-learn stopwords import issue in #49.
Added AssociationCompactorByRank, TermCategoryRanker, and the terms_to_show parameter.
Added use_categories_as_metadata_and_replace_terms to TermDocMatrix.
Added get_metadata_doc_count_df and get_metadata_count_mat to TermDocMatrix, and added produce_pairplot.
Added ScatterChart.hide_terms(terms: iter[str]), which enables selected terms to be hidden from the chart.
Added ScatterChartData.score_transform to specify the function which changes an original score into a value between 0 and 1 used for term coloring.
Added alternative_term_func to produce_scattertext_explorer, which allows you to inject a function that activates when a term is clicked.
Added HedgesG, an unbiased version of Cohen's d, which is a subclass of CohensD.
Added the frequency_transform parameter to produce_frequency_explorer. This defaults to a log transform, but allows you to order terms along the x-axis any way your heart desires.
Added show_category_headings=True to produce_scattertext_explorer. Setting this to False suppresses the list of categories displayed in the term context area.
Added the div_name argument to produce_scattertext_explorer, and name-spaced important divs and classes by div_name in HTML templates and Javascript.
Added show_cross_axes=True to produce_scattertext_explorer. Setting this to False prevents the cross axes from being displayed if show_axes is True.
TermDocMatrix.get_metadata_freq_df now accepts the label_append argument, which by default adds ' freq' to the end of each column.
TermDocMatrix.get_num_categories returns the number of categories in a term-document matrix.
Added the methods TermDocMatrixWithoutCategories.get_num_metadata and TermDocMatrix.use_metadata_as_categories.
The unified_context argument in produce_scattertext_explorer lists all contexts in a single column. This lets you see snippets organized by multiple categories in a single column. See demo_unified_context.py for an example.
Added a series of objects to handle uncategorized corpora.
Added a section on document-based scatterplots and the add_doc_names_as_metadata function. CategoryColorAssigner was also added to assign colors to qualitative categories.
Added a number of new term scoring approaches, including RelativeEntropy (a direct implementation of Frankhauser et al. (2014)) and ZScores, an implementation of the Z-Score model used in Frankhauser et al.
TermDocMatrix.get_metadata_freq_df() returns a metadata-document frequency data frame.
CorpusBasedTermScorer.set_ranker allows you to use a different term ranker when finding corpus-based scores. This not only lets these scorers work with metadata, but also allows you to integrate once-per-document counts.
Fixed produce_projection_explorer such that it can work with a predefined set of term embeddings. This allows, for example, easy exploration of one-hot-encoded term embeddings in addition to arbitrary lower-dimensional embeddings.
Added add_metadata to TermDocMatrix in order to inject metadata after a TermDocMatrix object has been created.
Made sure tooltip never started above the top of the web page.
Added DomainCompactor .
Fixed bug #31, enabling context to show when metadata value is clicked.
Enabled the display of terms in topic models in the explorer, along with the display of customized topic models. Please see Visualizing topic models for an overview of the additions.
Removed pkg_resources from Phrasemachine, corrected demo_phrase_machine.py
Now compatible with Gensim 3.4.0.
Added characteristic explorer, produce_characteristic_explorer , to plot terms with their characteristic scores on the x-axis and their class-association scores on the y-axis. See Ordering Terms by Corpus Characteristicness for more details.
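A hedged sketch of a call to produce_characteristic_explorer with the usual convention-corpus arguments (any parameters beyond those named in the entry above are assumed to follow the other explorer functions):
# Characteristic score on the x-axis, class association on the y-axis (sketch only)
html = st.produce_characteristic_explorer(
    corpus,
    category='democrat',
    category_name='Democratic',
    not_category_name='Republican',
    metadata=convention_df['speaker']
)
open('demo_characteristic_chart.html', 'w').write(html)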
Added TermCategoryFrequencies in response to Issue 23. Please see Visualizing differences based on only term frequencies for more details.
Added x_axis_labels and y_axis_labels parameters to produce_scattertext_explorer. These let you include evenly-spaced string axis labels on the chart, as opposed to just "Low", "Medium" and "High". These rely on d3's ticks function, which can behave unpredictably. Caveat usor.
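A minimal, hedged sketch of how these parameters might be passed (the label strings here are made up for illustration):
# Hypothetical labels; any evenly-spaced list of strings should work
html = st.produce_scattertext_explorer(
    corpus,
    category='democrat',
    category_name='Democratic',
    not_category_name='Republican',
    x_axis_labels=['Rare', 'Occasional', 'Frequent'],
    y_axis_labels=['More Republican', 'Even', 'More Democratic'],
    metadata=convention_df['speaker']
)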
Semiotic Squares now look better, and have customizable labels.
Incorporated the General Inquirer lexicon, for non-commercial use only. The lexicon is downloaded from its homepage at the start of each use. See demo_general_inquierer.py.
Incorporated Phrasemachine from AbeHandler (Handler et al. 2016). For the license, please see PhraseMachineLicense.txt . For an example, please see demo_phrase_machine.py .
Added CompactTerms for removing redundant and infrequent terms from term document matrices. These occur if a word or phrase is always part of a larger phrase; the shorter phrase is considered redundant and removed from the corpus. See demo_phrase_machine.py for an example.
Added FourSquare , a pattern that allows for the creation of a semiotic square with separate categories for each corner. Please see demo_four_square.py for an early example.
Finally, added a way to easily perform T-SNE-style visualizations on a categorized corpus. This uses, by default, the umap-learn package. Please see demo_tsne_style.py.
Fixes to ScaledFScorePresets(one_to_neg_one=True); added UnigramsFromSpacyDoc.
Now, when using CorpusFromPandas , a CorpusDF object is returned, instead of a Corpus object. This new type of object keeps a reference to the source data frame, and returns it via the CorpusDF.get_df() method.
The factory CorpusFromFeatureDict was added. It allows you to directly specify term counts and metadata item counts within the dataframe. Please see test_corpusFromFeatureDict.py for an example.
Added a semiotic square creator.
The idea is to build a semiotic square that contrasts two categories in a Term Document Matrix while using other categories as neutral categories.
See Creating semiotic squares for an overview on how to use this functionality and semiotic squares.
Added a parameter to disable the display of the top-terms sidebar, eg, produce_scattertext_explorer(..., show_top_terms=False, ...) .
An interface to part of the subjectivity/sentiment dataset from Bo Pang and Lillian Lee. ``A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts''. ACL. 2004. See SampleCorpora.RottenTomatoes .
Fixed bug that caused tooltip placement to be off after scrolling.
Made category_name and not_category_name optional in produce_scattertext_explorer etc.
Created the ability to customize tooltips via the get_tooltip_content argument to produce_scattertext_explorer etc., and to control axis labels via x_axis_values and y_axis_values. The color_func parameter is a Javascript function that controls the color of a point; it takes a dictionary entry produced by ScatterChartExplorer.to_dict and returns a string.
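For illustration, a hedged sketch of how such Javascript snippets can be passed as Python strings (the d.s field is assumed here to be the scaled score used for coloring; adjust to whatever fields your data dictionary actually contains):
# Illustrative only: show a bare-bones tooltip and color points by an assumed scaled-score field d.s
html = st.produce_scattertext_explorer(
    corpus,
    category='democrat',
    category_name='Democratic',
    not_category_name='Republican',
    get_tooltip_content='(function(d) {return d.term})',
    color_func='(function(d) {return d.s > 0.5 ? "blue" : "red"})',
    metadata=convention_df['speaker']
)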
Integration with Scikit-Learn's text-analysis pipeline led the creation of the CorpusFromScikit and TermDocMatrixFromScikit classes.
The AutoTermSelector class to automatically suggest terms to appear in the visualization.
This can make it easier to show large data sets, and remove fiddling with the various minimum term frequency parameters.
For an example of how to use CorpusFromScikit and AutoTermSelector , please see demo_sklearn.py
Also, I updated the library and examples to be compatible with spaCy 2.
Fixed bug when processing single-word documents, and set the default beta to 2.
Added the produce_frequency_explorer function, and added the PEP 396-compliant __version__ attribute as mentioned in #19. Fixed a bug when creating visualizations with more than two possible categories. Now, by default, category names will not be title-cased in the visualization, but will retain their original case.
If you'd still like to do this, use ScatterChart (or a descendant).to_dict(..., title_case_names=True). Fixed DocsAndLabelsFromCorpus for Py 2 compatibility.
Fixed bugs in chinese_nlp when jieba has already been imported and in p-value computation when performing log-odds-ratio w/ prior scoring.
Added a demo for performing a Monroe et al. (2008)-style visualization of log-odds-ratio scores in demo_log_odds_ratio_prior.py.
Breaking change: pmi_filter_thresold has been replaced with pmi_threshold_coefficient .
Added Emoji and Tweet analysis. See Emoji analysis.
Characteristic terms now fall back to "Most frequent" if no terms used in the chart are present in the background corpus.
Fixed top-term calculation for custom scores.
Set scaled f-score's default beta to 0.5.
Added --spacy_language_model argument to the CLI.
Added the alternative_text_field option in produce_scattertext_explorer to show an alternative text field when showing contexts in the interactive HTML visualization.
Updated ParsedCorpus.get_unigram_corpus to allow for continued alternative_text_field functionality.
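A hedged sketch of how alternative_text_field might be used with the tweet corpus from the emoji example above (assuming the argument names a column in the source dataframe, as in that example):
# Show the raw 'Tweet content' column in the context panel instead of the parsed text (sketch only)
html = st.produce_scattertext_explorer(
    corpus,
    category='f',
    category_name='Female',
    not_category_name='Male',
    use_full_doc=True,
    alternative_text_field='Tweet content',
    metadata=df_mf['User Name']
)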
Added the ability for Scattertext to use noun chunks instead of unigrams and bigrams through the FeatsFromSpacyDocOnlyNounChunks class. In order to use it, run your favorite Corpus or TermDocMatrix factory, and pass in an instance of the class as a parameter:
st.CorpusFromParsedDocuments(..., feats_from_spacy_doc=st.FeatsFromSpacyDocOnlyNounChunks())
Fixed a bug in corpus construction that occurs when the last document has no features.
Now you don't have to install tinysegmenter to use Scattertext, but you do need to install it if you want to parse Japanese. (Requiring it caused a problem when Scattertext was being installed on Windows.)
Added TermDocMatrix.get_corner_score, giving an improved version of the Rudder Score. Exposed whitespace_nlp_with_sentences. It's a lightweight, bad-regex sentence splitter built atop a bad-regex tokenizer that somewhat apes spaCy's API. Use it if you don't have spaCy and the English model downloaded, or if you care more about memory footprint and speed than accuracy.
It's not compatible with word_similarity_explorer, but is compatible with word_similarity_explorer_gensim.
Tweaked scaled f-score normalization.
Fixed Javascript bug when clicking on '$'.
Fixed bug in Scaled F-Score computations, and changed computation to better score words that are inversely correlated to category.
Added Word2VecFromParsedCorpus to automate training Gensim word vectors from a corpus, and
word_similarity_explorer_gensim to produce the visualization.
See demo_gensim_similarity.py for an example.
Added the d3_url and d3_scale_chromatic_url parameters to produce_scattertext_explorer . This provides a way to manually specify the paths to "d3.js" (ie, the file from "https://cdnjs.cloudflare.com/ajax/libs/d3/4.6.0/d3.min.js") and "d3-scale-chromatic.v1.js" (ie, the file from "https://d3js.org/d3-scale-chromatic.v1.min.js").
This is important if you're getting the error:
Javascript error adding output!
TypeError: d3.scaleLinear is not a function
See your browser Javascript console for more details.
It also lets you use Scattertext if you're serving in an environment with no (or a restricted) external Internet connection.
For example, if "d3.min.js" and "d3-scale-chromatic.v1.min.js" were present in the current working directory, calling the following code would reference them locally instead of the remote Javascript files. See Visualizing term associations for code context.
>>> html = st.produce_scattertext_explorer(corpus,
... category='democrat',
... category_name='Democratic',
... not_category_name='Republican',
... width_in_pixels=1000,
... metadata=convention_df['speaker'],
... d3_url='d3.min.js',
... d3_scale_chromatic_url='d3-scale-chromatic.v1.min.js')
Fixed a bug in 0.0.2.6.0 that transposed default axis labels.
Added a Japanese mode to Scattertext. See demo_japanese.py for an example of how to use Japanese. Please run pip install tinysegmenter to parse Japanese.
Also, the chinese_mode boolean parameter in produce_scattertext_explorer has been renamed to asian_mode.
For example, the output of demo_japanese.py is:
Custom term positions and axis labels. Although not recommended, you can visualize different metrics on each axis in visualizations similar to Monroe et al. (2008). Please see Custom term positions for more info.
Enhanced the visualization of query-based categorical differences, aka the word_similarity_explorer function. When run, a plot is produced that contains category-associated terms colored in either red or blue hues, and terms not associated with either class colored in greyscale and slightly smaller. The intensity of each color indicates association with the query term. For example:
Some minor bug fixes, and added a minimum_not_category_term_frequency parameter. This fixes a problem with visualizing imbalanced datasets. It sets the minimum number of times a word that does not appear in the target category must appear before it is displayed.
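A hedged sketch of how this parameter might be combined with the usual explorer arguments on an imbalanced corpus (the threshold value here is arbitrary):
# Require out-of-category words to appear at least 10 times before plotting (sketch only)
html = st.produce_scattertext_explorer(
    corpus,
    category='democrat',
    category_name='Democratic',
    not_category_name='Republican',
    minimum_term_frequency=5,
    minimum_not_category_term_frequency=10,
    metadata=convention_df['speaker']
)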
Added TermDocMatrix.remove_entity_tags method to remove entity type tags from the analysis.
Fixed the matched-snippet display problem in issue #9, and fixed a Python 2 issue in creating a visualization using a ParsedCorpus prepared via CorpusFromParsedDocuments, mentioned in the latter part of the issue #8 discussion.
Again, Python 2 is supported in experimental mode only.
Corrected example links on this Readme.
Fixed a bug in Issue 8 where the HTML visualization produced by produce_scattertext_html would fail.
Fixed a couple issues that rendered Scattertext broken in Python 2. Chinese processing still does not work.
Note: Use Python 3.4+ if you can.
Fixed links in Readme, and made regex NLP available in CLI.
Added the command line tool, and fixed a bug related to Empath visualizations.
Ability to see how a particular term is discussed differently between categories through the word_similarity_explorer function.
Specialized mode to view sparse term scores.
Fixed a bug that was caused by repeated values in background unigram counts.
Added true alphabetical term sorting in visualizations.
Added an optional save-as-SVG button.
Added the option of showing characteristic terms (from the full set of documents being considered). The option (show_characteristic in produce_scattertext_explorer) is on by default, but currently unavailable for Chinese. If you know of a good Chinese word-count list, please let me know. The algorithm used to produce these is F-Score.
See this and the following slide for more details
Added document and word count statistics to main visualization.
Added preliminary support for visualizing Empath (Fast 2016) topics and categories instead of emotions. See the tutorial for more information.
Improved term-labeling.
Addition of strip_final_period param to FeatsFromSpacyDoc to deal with spaCy tokenization of all-caps documents that can leave periods at the end of terms.
I've added support for Chinese, including the ChineseNLP class, which uses a RegExp-based sentence splitter and Jieba for word segmentation. To use it, see the demo_chinese.py file. Note that CorpusFromPandas currently does not support ChineseNLP.
In order for the visualization to work, set the asian_mode flag to True in produce_scattertext_explorer .
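For reference, a hedged sketch of the Chinese workflow along the lines of demo_chinese.py (the zh_df dataframe with 'text' and 'category' columns is hypothetical, and st.chinese_nlp is assumed to be the package-level parser referred to above):
import scattertext as st

# zh_df is a hypothetical dataframe with 'text' and 'category' columns
zh_df['parse'] = zh_df['text'].apply(st.chinese_nlp)  # regex sentence splitting + jieba segmentation
zh_corpus = st.CorpusFromParsedDocuments(
    zh_df, category_col='category', parsed_col='parse'
).build()
html = st.produce_scattertext_explorer(
    zh_corpus,
    category='category_a',  # hypothetical category label
    asian_mode=True         # per the note above, needed for the visualization to work with Chinese
)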