scattertext Download - scattertext Source Code herunterladen

Streuung 0.2.2

Ein Werkzeug zum Auffinden von Unterschieden in Korpora und zum Anzeigen in einem interaktiven HTML -Streudiagramm. Punkte, die den Begriffen entsprechen, werden selektiv gekennzeichnet, damit sie sich nicht mit anderen Beschriftungen oder Punkten überlappen.

Zitieren als: Jason S. Kessler. STRATETTEXT: Ein browserbasiertes Tool zur Visualisierung, wie sich Korpora unterscheidet. ACL -Systemdemonstrationen. 2017.

Im Folgenden finden Sie ein Beispiel für die Verwendung von Streutentext, um visualisierte Begriffe zu erstellen, die 2012 in den amerikanischen politischen Konventionen verwendet werden. Die 2.000 am meisten Party-assoziierten UNI-Gramme werden als Punkte im Streudiagramm angezeigt. Ihre X- und Y-Achsen sind die dichten Reihen ihrer Verwendung durch republikanische und demokratische Sprecher.

 import scattertext as st

df = st.SampleCorpora.ConventionData2012.get_data().assign(
    parse=lambda df: df.text.apply(st.whitespace_nlp_with_sentences)
)

corpus = st.CorpusFromParsedDocuments(
    df, category_col='party', parsed_col='parse'
).build().get_unigram_corpus().compact(st.AssociationCompactor(2000))

html = st.produce_scattertext_explorer(
    corpus,
    category='democrat',
    category_name='Democratic',
    not_category_name='Republican',
    minimum_term_frequency=0, 
    pmi_threshold_coefficient=0,
    width_in_pixels=1000, 
    metadata=corpus.get_df()['speaker'],
    transform=st.Scalers.dense_rank,
    include_gradient=True,
    left_gradient_term='More Republican',
    middle_gradient_term='Metric: Dense Rank Difference',
    right_gradient_term='More Democratic',
)
open('./demo_compact.html', 'w').write(html)

Die geschriebene HTML -Datei würde wie das Bild unten aussehen. Klicken Sie darauf, um die tatsächliche interaktive Visualisierung zu erhalten.

Zitat

Jason S. Kessler. STRATETTEXT: Ein browserbasiertes Tool zur Visualisierung, wie sich Korpora unterscheidet. ACL -Systemdemonstrationen. 2017. Link zu Papier: arxiv.org/abs/1703.00565

 @article{kessler2017scattertext,
  author    = {Kessler, Jason S.},
  title     = {Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ},
  booktitle = {Proceedings of ACL-2017 System Demonstrations},
  year      = {2017},
  address   = {Vancouver, Canada},
  publisher = {Association for Computational Linguistics},
}

Inhaltsverzeichnis

Installation
Überblick
Anpassen der Visualisierung und der Dispersion
Tutorial
- Helfen! Ich kenne Python nicht, aber ich möchte immer noch Streuung verwenden
- Verwenden von Streutext als Textanalyse -Bibliothek: charakteristische Begriffe finden und deren Assoziationen
- Visualisierung von Term Assoziationen
- Visualisierung von Phrasenassoziationen
- Hinzufügen von Farbgradienten, um die Ergebnisse zu erklären
- Visualisieren von Empath -Themen und -kategorien
- Visualisierung des moralischen Fundaments 2.0 -Wörterbuchs
- Bestellbegriffe nach Corpus -Charakteristik
- Dokumentbasierte Streudiagramme
- Verwenden Sie Cohens D oder Hedge's G, um die Effektgröße zu visualisieren
- Verwenden Sie das Delta von Cliff, um die Effektgröße zu visualisieren
- Verwenden der Bi-Normal-Trennung (BNS), um Begriffe zu bewerten
- Verwenden von Korrelationen zur Erklärung von Klassifikatoren
- Verwenden von benutzerdefinierten Hintergrundwortfrequenzen
- Darstellung von Wortproduktivität
Verständnis von skaliertem F-Score
Alternative Bewertungsmethoden
Der Positionsauswahl-Plot-Prozess
Erweiterte Verwendung
- Visualisierung von Unterschieden basierend nur auf Termfrequenzen
- Visualisierung von kategorialen Unterschiede auf abfragen basierende
- Visualisierung jeglicher Art von Term Score
- Benutzerdefinierte Begriff Positionen
- Emoji -Analyse
- Visualisieren von Satzstück -Token
- Visualisieren von Scikit-Learn-Textklassifizierungsgewichten
- Schaffung lexikalisierter semiotischer Quadrate
- Visualisierung von Themenmodellen
- Erstellen von Word-Einbetten von T-SNE-Stilprojektionsplänen
- Verwenden von SVD, um jede Art von Worteinbettungen zu visualisieren
- Exportieren des Diagramms in Matplotlib
- Verwenden Sie dieselbe Skala für beide Achsen
Beispiele
Eine Notiz im Diagrammlayout
Was ist neu
Quellen

Installation

Installieren Sie Python 3.11 oder höher und rennen Sie:

$ pip install scattertext

Wenn Sie Spacy nicht installieren können (oder nicht), ersetzen Sie nlp = spacy.load('en') Zeilen mit nlp = scattertext.WhitespaceNLP.whitespace_nlp . Beachten Sie, dass dies nicht mit word_similarity_explorer kompatibel ist, und die Fähigkeiten zur Tokenisierung und der Satzgrenze sind regelmäßige Ausdrücke mit geringer Leistung. Ein Beispiel siehe demo_without_spacy.py .

Es wird empfohlen, jieba , spacy , empath , astropy , flashtext , gensim und umap-learn zu installieren, um den Streuungspunkt voll auszunutzen.

Der Streutext sollte meistens mit Python 2.7 funktionieren, es kann jedoch nicht.

Die HTML -Ausgänge sehen in Chrom und Safari am besten aus.

Stilhandbuch

Der Name dieses Projekts ist ScatterText. "ScatterText" wird als einzelnes Wort geschrieben und sollte aktiviert werden. Bei der Verwendung in Python sollte der scattertext auf den Namen st definiert werden, dh import scattertext as st .

Überblick

Dies ist ein Tool, das zur Visualisierung der Wörter und Phrasen für eine Kategorie vorgesehen ist als andere.

Betrachten Sie das Beispiel oben auf der Seite.

Wenn man das betrachtet, scheint es überwältigend. Tatsächlich ist es eine relativ einfache Visualisierung der Wortnutzung während des politischen Konvents 2012. Jeder Punkt entspricht einem Wort oder einer Phrase, die von Republikanern oder Demokraten während ihrer Konventionen erwähnt wird. Je näher ein Punkt an der Spitze der Handlung liegt, desto häufiger wurde er von Demokraten verwendet. Je weiter ein Punkt ein Punkt, desto mehr wurde dieses Wort oder die Phrase von Republikanern verwendet. Wörter, die häufig von beiden Parteien verwendet werden, wie "von" und "The" und sogar "Seits", treten in der oberen rechten Ecke tendenziell auf. Obwohl sehr niedrige Frequenzwörter versteckt waren, um die Rechenressourcen zu erhalten, würde ein Wort, das keine Partei verwendet hat, wie "Giraffe", in der unteren linken Ecke.

Die interessanten Dinge passieren in der Nähe der oberen linken und niedrigeren Ecken. In der oberen linken Ecke werden Wörter wie "Auto" (wie in automatischer Rettungspaket) und "Millionäre" häufig von Demokraten verwendet, aber selten oder nie von Republikanern verwendet. Ebenso besetzen Begriffe, die häufig von Republikanern und selten von Demokraten verwendet werden, die untere rechte Ecke. Dazu gehören "große Regierung" und "Olympische Spiele", unter Bezugnahme auf die Olympischen Spiele in Salt Lake City, an denen Gouverneur Romney beteiligt war.

Begriffe werden von ihrer Vereinigung gefärbt. Diejenigen, die mehr mit Demokraten in Verbindung gebracht werden, sind blau und diejenigen, die mehr mit den Republikanern rot verbunden sind.

Begriffe, die für die beiden Dokumentensätze am charakteristischsten sind, werden in rechtsextremer Visualisierung angezeigt.

Die Inspiration für diese Visualisierung kam von Dataclysm (Ruder, 2014).

ScatterText soll Ihnen helfen, diese Grafiken zu erstellen und Punkte effizient zu beschriften.

Die Dokumentation (einschließlich dieses Readme) ist eine laufende Arbeit. Bitte beachten Sie das Tutorial unten sowie das Pydata 2017 -Tutorial.

Wenn Sie den Code und die Tests durchsuchen, sollten Sie eine gute Vorstellung davon geben, wie die Dinge funktionieren.

Die Bibliothek deckt einige neuartige und effektive Term-Importanzformeln ab, einschließlich skalierter F-Score .

Anpassen der Visualisierung und der Dispersion

Neu in ScatterText 0.1.0, kann man einen Datenrahmen für Term-/Metadatenpositionen und andere Term-spezifische Daten verwenden. Wir können es auch verwenden, um Term-spezifische Informationen zu ermitteln, die nach dem Klicken eines Begriffs angezeigt werden.

Beachten Sie, dass es möglich ist, die Verwendung von Dokumentkategorien in Streutext zu deaktivieren, wie wir in diesem Beispiel sehen werden.

Dieses Beispiel deckt mit der Darstellung der Term Dispersion gegen Worthäufigkeit und der Ermittlung der Begriffe ab, die aufgrund ihrer Frequenzen am meisten und am wenigsten verteilt sind. Unter Verwendung der Rosengren -Dispersionsmaßnahme (Gries 2021) neigen die Begriffe dazu, ihre Dispersionswerte zu erhöhen, wenn sie häufiger werden. Wir werden sehen, wie wir diesen Effekt sowohl zeichnen als auch den Effekt der Frequenz aufnehmen können.

Dies zusammen mit einer Reihe anderer Dispersionsmetriken, die in Gries (2021) vorgestellt werden, ist in der Dispersion verfügbar und dokumentiert, die wir später im Abschnitt verwenden werden.

Beginnen wir mit der Erstellung eines Kongresskorpus, aber wir werden die CorpusWithoutCategoriesFromParsedDocuments der Fabrik der Dokumente verwenden, um sicherzustellen, dass keine Kategorien im Korpus enthalten sind. Wenn wir versuchen, Dokumentenkategorien zu finden, werden wir feststellen, dass alle Dokumente die Kategorie '_' haben.

 import scattertext as st

df = st . SampleCorpora . ConventionData2012 . get_data (). assign (
    parse = lambda df : df . text . apply ( st . whitespace_nlp_with_sentences ))
corpus = st . CorpusWithoutCategoriesFromParsedDocuments (
    df , parsed_col = 'parse'
). build (). get_unigram_corpus (). remove_infrequent_words ( minimum_term_count = 6 )

corpus . get_categories ()
# Returns ['_']

Als nächstes erstellen wir einen Datenrahmen für alle Begriffe, die wir zeichnen werden. Wir werden mit dem Erstellen eines Datenrahmens beginnen, in dem wir die Häufigkeit jedes Begriffs und verschiedene Dispersionsmetriken erfassen. Diese werden gezeigt, nachdem ein Begriff im Diagramm aktiviert wurde.

 dispersion = st . Dispersion ( corpus )

dispersion_df = dispersion . get_df ()
dispersion_df . head ( 3 )

Was kehrt zurück

       Frequency  Range         SD        VC  Juilland's D  Rosengren's S        DP   DP norm  KL-divergence  Dissemination
thank        363    134   3.108113  1.618274      0.707416       0.694898  0.391548  0.391560       0.748808       0.972954
you         1630    177  12.383708  1.435902      0.888596       0.898805  0.233627  0.233635       0.263337       0.963905
so           549    155   3.523380  1.212967      0.774299       0.822244  0.283151  0.283160       0.411750       0.986423```

These are discussed in detail in [Gries 2021](http://www.stgries.info/research/ToApp_STG_Dispersion_PHCL.pdf). 
Dissementation is presented in Altmann et al. (2011).

We'll use Rosengren's S to find the dispersion of each term. It's which a metric designed for corpus parts
(convention speeches in our case) of varying length. Where n is the number of documents in the corpus, s_i is the
percentage of tokens in the corpus found in document i, v_i is term count in document i, and f is the total number
of tokens in the corpus of type term type.

Rosengren's
S: [![Rosengren's S](https://render.githubusercontent.com/render/math?math=frac{Sum_{i=1}^{n}sqrt{s_i%20cdot%20v_i})^2}{f})](https://render.githubusercontent.com/render/math?math=frac{Sum_{i=1}^{n}sqrt{s_i%20cdot%20v_i})
^2}{f})

In order to start plotting, we'll need to add coordinates for each term to the data frame.

To use the `dataframe_scattertext` function, you need, at a minimum a dataframe with 'X' and 'Y' columns.

The `Xpos` and `Ypos` columns indicate the positions of the original `X` and `Y` values on the scatterplot, and
need to be between 0 and 1. Functions in `st.Scalers` perform this scaling. Absent `Xpos` or `Ypos`,
`st.Scalers.scale` would be used.

Here is a sample of values:

* `st.Scalers.scale(vec)` Rescales the vector to where the minimum value is 0 and the maximum is 1.
* `st.Scalers.log_scale(vec)` Rescales the lgo of the vector
* `st.Scalers.dense_ranke(vec)` Rescales the dense rank of the vector
* `st.Scalers.scale_center_zero_abs(vec)` Rescales a vector with both positive and negative values such that the 0 value
  in the original vector is plotted at 0.5, negative values are projected from [-argmax(abs(vec)), 0] to [0, 0.5] and
  positive values projected from [0, argmax(abs(vec))] to [0.5, 1].

```python
dispersion_df = dispersion_df.assign(
    X=lambda df: df.Frequency,
    Xpos=lambda df: st.Scalers.log_scale(df.X),
    Y=lambda df: df["Rosengren's S"],
    Ypos=lambda df: st.Scalers.scale(df.Y),
)

Beachten Sie, dass die Ypos -Spalte hier nicht erforderlich ist, da Y automatisch skaliert wird.

Da wir nicht zwischen Kategorien unterscheiden, können wir ignore_categories=True festlegen.

Wir können diesen Diagramm jetzt mit der Funktion dataframe_scattertext zeichnen:

 html = st . dataframe_scattertext (
    corpus ,
    plot_df = dispersion_df ,
    metadata = corpus . get_df ()[ 'speaker' ] + ' (' + corpus . get_df ()[ 'party' ]. str . upper () + ')' ,
    ignore_categories = True ,
    x_label = 'Log Frequency' ,
    y_label = "Rosengren's S" ,
    y_axis_labels = [ 'Less Dispersion' , 'Medium' , 'More Dispersion' ],
)

Welches ergibt (klicken Sie auf eine interaktive Version):

Beachten Sie, dass wir neben den Standardnutzungsstatistiken verschiedene Dispersionsstatistiken unter dem Namen eines Begriffs sehen können. Um die angezeigten Statistiken anzupassen, legen Sie den Parameter term_description_column=[...] mit einer Liste der zu angezeigten Spaltennamen fest.

Ein Problem in diesem Dispersionsdiagramm, das im Allgemeinen in der Regel Dispersionsmetriken auftritt, ist, dass Dispersion und Frequenz tendenziell eine hohe Korrelation aufweisen, jedoch mit einer komplexen, nichtlinearen Kurve. Je nach Metrik könnte diese Korrelationskurve Leistung, linear, sigmoidal oder typischerweise etwas anderes sein.

Um diese Korrelation zu berücksichtigen, können wir die Dispersion mithilfe eines nichtparametrischen Regressors von Frequenz vorhersagen und feststellen, welche Begriffe die höchsten und niedrigsten Residuen in Bezug auf ihre erwarteten Dispersionen auf der Grundlage ihrer Frequenzen aufweisen.

In diesem Fall werden wir einen KNN -Regressor mit 10 Nachbarn verwenden, um Rosengren von Term Frequenzen ( dispersion_df.X bzw. .Y ) vorherzusagen und den Rest zu berechnen.

Wir werden die Residual zu färben, mit einer neutralen Farbe für Residuen um 0 und andere Farben für positive und negative Werte. Wir fügen eine Spalte im Datenrahmen für Punktfarben hinzu und nennen es ColorScore. Es wird mit Werten zwischen 0 und 1 besiedelt, wobei 0,5 als Neturalfarbe auf der d3 interpolateWarm -Farbskala geprägt sind. Wir verwenden st.Scalers.scale_center_zero_abs , die oben diskutiert wurden, um diese Transformation vorzunehmen.

 from sklearn . neighbors import KNeighborsRegressor

dispersion_df = dispersion_df . assign (
    Expected = lambda df : KNeighborsRegressor ( n_neighbors = 10 ). fit (
        df . X . values . reshape ( - 1 , 1 ), df . Y
    ). predict ( df . X . values . reshape ( - 1 , 1 )),
    Residual = lambda df : df . Y - df . Expected ,
    ColorScore = lambda df : st . Scalers . scale_center_zero_abs ( df . Residual )
)

Jetzt sind wir bereit, unser farbiges Dispersionsdiagramm zu zeichnen. Wir weisen den ColorsCore -Spaltennamen dem Parameter color_score_column in dataframe_scattertext zu.

Darüber hinaus möchten wir die beiden Lauflisten auf der linken Seite mit hohen und niedrigen Restwerten bevölkern, die Begriffe angeben, die die am stärksten dispergierte relativ zu ihrem Frequenz-erwarteten Niveau und die niedrigsten aufweisen. Wir können dies mit dem Parameter left_list_column tun. Wir können die Namen der oberen und unteren Term mithilfe des Parameters header_names angeben. Schließlich können wir die Handlung anspannen, indem wir eine ansprechende Hintergrundfarbe hinzufügen.

 html = st . dataframe_scattertext (
    corpus ,
    plot_df = dispersion_df ,
    metadata = corpus . get_df ()[ 'speaker' ] + ' (' + corpus . get_df ()[ 'party' ]. str . upper () + ')' ,
    ignore_categories = True ,
    x_label = 'Log Frequency' ,
    y_label = "Rosengren's S" ,
    y_axis_labels = [ 'Less Dispersion' , 'Medium' , 'More Dispersion' ],
    color_score_column = 'ColorScore' ,
    header_names = { 'upper' : 'Lower than Expected' , 'lower' : 'More than Expected' },
    left_list_column = 'Residual' ,
    background_color = '#e5e5e3'
)

Welches ergibt (klicken Sie auf eine interaktive Version):

Tutorial

Helfen! Ich kenne Python nicht, aber ich möchte immer noch Streuung verwenden.

Während Sie Python lernen sollten, verwenden Sie Streuungspunkte voll und ganz, ich habe einige der grundlegenden Funktionen in ein Befehlszeilen -Tool eingebracht. Das Tool ist installiert, wenn Sie die oben festgelegte Prozedur befolgen.

Führen Sie $ scattertext --help aus der Befehlszeile aus, um die vollständigen Nutzungsinformationen anzuzeigen. Hier ist ein kurzes Beispiel für die Verwendung von Vanille -ScatterText in einer CSV -Datei. Die Datei muss mindestens zwei Spalten haben, einen, der den zu analysierenden Text enthält, und eine andere, die die Kategorie enthält. Im folgenden Beispiel CSV sind die Spalten Text bzw. Party.

Das folgende Beispiel verarbeitet die CSV -Datei und die resultierende HTML -Visualisierung in cli_demo.html.

Beachten Sie, dass der Parameter --minimum_term_frequency=8 Begriffe auslässt, die weniger als das 8 -fache auftreten, und --regex_parser zeigt an, dass anstelle von Spacy ein einfacher regulärer Ausdrucksparser verwendet werden sollte. Das Flag --one_use_per_doc gibt an, dass die Termfrequenz berechnet werden sollte, indem nur nicht mehr als ein Vorkommen eines Terms in einem Dokument gezählt wird.

Wenn Sie nicht englischerweise Text analysieren möchten, können Sie das Argument --spacy_language_model verwenden, um zu konfigurieren, welches Spacy-Sprachmodell das Tool verwendet. Die Standardeinstellung ist 'EN' und Sie können die anderen unter https://spacy.io/docs/api/glanguage-models sehen.

$ curl -s https://cdn.rawgit.com/JasonKessler/scattertext/master/scattertext/data/political_data.csv | head -2
party,speaker,text
democrat,BARACK OBAMA, " Thank you. Thank you. Thank you. Thank you so much.Thank you.Thank you so much. Thank you. Thank you very much, everybody. Thank you.
$
$ scattertext --datafile=https://cdn.rawgit.com/JasonKessler/scattertext/master/scattertext/data/political_data.csv 
> --text_column=text --category_column=party --metadata_column=speaker --positive_category=democrat 
> --category_display_name=Democratic --not_category_display_name=Republican --minimum_term_frequency=8 
> --one_use_per_doc --regex_parser --outputfile=cli_demo.html

Verwenden von Streutext als Textanalyse -Bibliothek: charakteristische Begriffe finden und deren Assoziationen

Der folgende Code erstellt eine eigenständige HTML-Datei, die Wörter analysiert, die von Demokraten und Republikanern in den Parteikongressen 2012 verwendet werden, und einige bemerkenswerte Begriffsverbände ausgibt.

Importieren Sie zunächst Streuung und Spacy.

 >>> import scattertext as st
>>> import spacy
>>> from pprint import pprint

Zusammenstellen Sie als Nächstes die Daten zusammen, die Sie in einem Pandas -Datenrahmen analysieren möchten. Es sollte mindestens zwei Spalten haben, den Text, den Sie analysieren möchten, und die Kategorie, die Sie studieren möchten. Hier enthält die text Kongressreden, während die party die Partei des Sprechers enthält. Wir werden schließlich die speaker verwenden, um Snippets in der Visualisierung zu kennzeichnen.

 >>> convention_df = st.SampleCorpora.ConventionData2012.get_data()  
>>> convention_df.iloc[0]
party                                               democrat
speaker                                         BARACK OBAMA
text       Thank you. Thank you. Thank you. Thank you so ...
Name: 0, dtype: object

Verwandeln Sie den Datenrahmen in einen Scattertext -Korpus, um ihn zu analysieren. Um nach Unterschieden in den Parteien zu suchen, setzen Sie den Parameter category_col auf 'party' und verwenden Sie die Reden, die in der text vorhanden sind, als die Texte, indem Sie den text -COL -Parameter festlegen. Geben Sie schließlich ein Spacy -Modell in das nlp -Argument weiter und rufen Sie build() auf, um den Korpus zu konstruieren.

 # Turn it into a Scattertext Corpus 
>>> nlp = spacy.load('en')
>>> corpus = st.CorpusFromPandas(convention_df, 
...                              category_col='party', 
...                              text_col='text',
...                              nlp=nlp).build()

Sehen wir uns charakteristische Begriffe im Korpus und Begriffe an, die die am meisten verbundenen Demokraten und Republikaner sind. Siehe die Folien 52 bis 59 der wende unstrukturierten Inhalte der Kernel von Ideen sprechen für weitere Informationen zu diesen Ansätzen.

Hier sind die Begriffe, die den Korpus von einem allgemeinen englischen Korpus unterscheiden.

 >>> print(list(corpus.get_scaled_f_scores_vs_background().index[:10]))
['obama',
 'romney',
 'barack',
 'mitt',
 'obamacare',
 'biden',
 'romneys',
 'hardworking',
 'bailouts',
 'autoworkers']

Hier sind die Begriffe, die mit Demokraten am meisten verbunden sind:

 >>> term_freq_df = corpus.get_term_freq_df()
>>> term_freq_df['Democratic Score'] = corpus.get_scaled_f_scores('democrat')
>>> pprint(list(term_freq_df.sort_values(by='Democratic Score', ascending=False).index[:10]))
['auto',
 'america forward',
 'auto industry',
 'insurance companies',
 'pell',
 'last week',
 'pell grants',
 "women 's",
 'platform',
 'millionaires']

Und Republikaner:

 >>> term_freq_df['Republican Score'] = corpus.get_scaled_f_scores('republican')
>>> pprint(list(term_freq_df.sort_values(by='Republican Score', ascending=False).index[:10]))
['big government',
 "n't build",
 'mitt was',
 'the constitution',
 'he wanted',
 'hands that',
 'of mitt',
 '16 trillion',
 'turned around',
 'in florida']

Visualisierung von Term Assoziationen

Lassen Sie uns nun dem Streudiagramm eine eigenständige HTML-Datei schreiben. Wir werden die Kategorie Y-Achse "Demokrat" machen und die Kategorie "Demokrat" mit einem Kapital "D" für Präsentationszwecke nennen. Wir werden die andere Kategorie "Republikaner" mit einer Hauptstadt "R" nennen. Alle Dokumente im Korpus ohne die Kategorie "Demokrat" werden als republikanisch angesehen. Wir setzen die Breite der Visualisierung in Pixeln und kennzeichnen jeden Auszug mit dem Lautsprecher mit dem metadata . Schließlich schreiben wir die Visualisierung in eine HTML -Datei.

 >>> html = st.produce_scattertext_explorer(corpus,
...          category='democrat',
...          category_name='Democratic',
...          not_category_name='Republican',
...          width_in_pixels=1000,
...          metadata=convention_df['speaker'])
>>> open("Convention-Visualization.html", 'wb').write(html.encode('utf-8'))

Unten ist die Webseite aussieht. Klicken Sie darauf und warten Sie einige Minuten auf die interaktive Version.

Visualisierung von Phrasenassoziationen

Der Streutext kann auch verwendet werden, um die Kategorie -Assoziation einer Vielzahl verschiedener Phrasentypen zu visualisieren. Das Wort "Phrase" bezeichnet eine einzelne oder mehrstufige Kollokation.

Mit Pytextrank

Pytextrank, erstellt von Paco Nathan, ist eine Implementierung einer modifizierten Version des Textrank -Algorithmus (Mihalcea und Tarau 2004). Es beinhaltet einen Graph -Zentralitätsalgorithmus, um eine bewertete Liste der bekanntesten Phrasen in einem Dokument zu extrahieren. Hier, genannte Unternehmen, die von Spacy anerkannt wurden. Ab der Spacy Version 2.2 stammen diese aus einem NER -System, das auf Ontonotes 5 trainiert wurde.

Bitte installieren Sie Pytextrank $ pip3 install pytextrank bevor Sie dieses Tutorial fortsetzen.

Erstellen Sie ein Korpus wie gewohnt, aber stellen Sie sicher, dass Sie mit Spacy jedes Dokument analysieren, im Gegensatz zu einem integrierten Tokenizer mit whitespace_nlp -Typ. Beachten Sie, dass keine Pytextrank in die Spacy -Pipeline erforderlich ist, da sie vom Objekt PyTextRankPhrases getrennt ausgeführt wird. Wir werden die Anzahl der im Diagramm angezeigten Phrasen auf 2000 mit dem AssociationCompactor reduzieren. Die generierten Phrasen werden wie nicht-textuelle Funktionen behandelt, da ihre Dokumentbewertungen nicht den Wortzählungen entsprechen.

 import pytextrank, spacy
import scattertext as st

nlp = spacy.load('en')
nlp.add_pipe("textrank", last=True)

convention_df = st.SampleCorpora.ConventionData2012.get_data().assign(
    parse=lambda df: df.text.apply(nlp),
    party=lambda df: df.party.apply({'democrat': 'Democratic', 'republican': 'Republican'}.get)
)
corpus = st.CorpusFromParsedDocuments(
    convention_df,
    category_col='party',
    parsed_col='parse',
    feats_from_spacy_doc=st.PyTextRankPhrases()
).build(
).compact(
    AssociationCompactor(2000, use_non_text_features=True)
)

Beachten Sie, dass die im Korpus vorhandenen Begriffe als Entitäten bezeichnet werden, und im Gegensatz zu Frequenzzählungen sind ihre Bewertungen die ihnen vom Textrank -Algorithmus zugewiesenen Eigenschaften. Ausführen corpus.get_metadata_freq_df('') kehrt für jede Kategorie die Summen der Begriffe 'Textrank -Scores zurück. Die dichten Reihen dieser Bewertungen werden verwendet, um das Streudiagramm zu konstruieren.

 term_category_scores = corpus.get_metadata_freq_df('')
print(term_category_scores)
'''
                                         Democratic  Republican
term
our future                                 1.113434    0.699103
your country                               0.314057    0.000000
their home                                 0.385925    0.000000
our government                             0.185483    0.462122
our workers                                0.199704    0.210989
her family                                 0.540887    0.405552
our time                                   0.510930    0.410058
...
'''

Bevor wir das Diagramm konstruieren, lassen Sie uns einige Helfervariablen, da die aggregierten Textrank-Scores nicht besonders interpretierbar sind. Wir werden den Rang pro Kategorie jeder Punktzahl im Feld metadata_description anstellen. Diese werden angezeigt, nachdem ein Begriff geklickt wurde.

 term_ranks = pd.DataFrame(
    np.argsort(np.argsort(-term_category_scores, axis=0), axis=0) + 1,
    columns=term_category_scores.columns,
    index=term_category_scores.index)

metadata_descriptions = {
    term: '<br/>' + '<br/>'.join(
        '<b>%s</b> TextRank score rank: %s/%s' % (cat, term_ranks.loc[term, cat], corpus.get_num_metadata())
        for cat in corpus.get_categories())
    for term in corpus.get_metadata()
}

Wir können Termwerte auf verschiedene Arten erstellen. Einer ist ein Standardunterschied zwischen Dichtemittel, eine Punktzahl, die in den meisten kontrastiven Diagrammen mit zwei Kategorien verwendet wird, die uns die am meisten kategorienassoziierten Phrasen verleihen. Ein weiteres Bestehen besteht darin, die maximale kategoriespezifische Punktzahl zu verwenden. Dies gibt uns die bekanntesten Phrasen in jeder Kategorie, unabhängig von der Bekanntheit in der anderen Kategorie. Wir werden beide Ansätze in diesem Tutorial verfolgen. Lassen Sie uns die zweite Art von Punktzahl, die kategorienspezifische Bedeutung unten, berechnen.

 category_specific_prominence = term_category_scores.apply(
    lambda r: r.Democratic if r.Democratic > r.Republican else -r.Republican,
    axis=1
)

Jetzt geben wir dieses Diagramm bereit. Beachten Sie, dass wir eine dense_rank -Transformation verwenden, die identisch versenkte Sätze aufeinander platziert. Wir verwenden category_specific_prominence als Bewertungen und setzen sort_by_dist als False , um sicherzustellen, dass die auf der rechten Seite des Diagramms angezeigten Phrasen von den Punktzahlen und nicht der Entfernung zum oberen linken oder unteren Rechten eingestuft werden. Da übereinstimmende Phrasen als Nicht-Text-Funktionen behandelt werden, codieren wir sie als Einzelphrase-Themenmodelle und setzen die topic_model_preview_size auf 0 , um anzuzeigen, dass die Themenmodellliste nicht angezeigt werden sollte. Schließlich stellen wir fest, dass die vollständigen Dokumente angezeigt werden. Beachten Sie, dass die Dokumente in der Reihenfolge der phrasenspezifischen Punktzahl angezeigt werden.

 html = produce_scattertext_explorer(
    corpus,
    category='Democratic',
    not_category_name='Republican',
    minimum_term_frequency=0,
    pmi_threshold_coefficient=0,
    width_in_pixels=1000,
    transform=dense_rank,
    metadata=corpus.get_df()['speaker'],
    scores=category_specific_prominence,
    sort_by_dist=False,
    use_non_text_features=True,
    topic_model_term_lists={term: [term] for term in corpus.get_metadata()},
    topic_model_preview_size=0,
    metadata_descriptions=metadata_descriptions,
    use_full_doc=True
)

Die am meisten zugeordneten Begriffe in jeder Kategorie machen zumindest bei einer Post -hoc -Analyse einen Sinn. Als sie sich auf (damals) Gouverneur Romney bezogen, verwendeten die Demokraten seinen Nachnamen "Romney" in ihren zentralsten Erwähnungen von ihm, während die Republikaner die vertrautere und humanisierendere "Handschuh" verwendeten. In Bezug auf den Präsidenten Obama wurde der Ausdruck "Obama" auch nicht als Top -Amtszeit angezeigt, der aber der Vorname "Barack" war eine der zentralsten Phrasen in demokratischen Reden, die "Mitt."

Alternativ können wir einen dichten Rangunterschied in den Bewertungen zu Farbphrasenpunkten dichten und feststellen, dass auf der rechten Seite des Diagramms die Top-Phrasen angezeigt werden. Anstatt scores als kategoriespezifische Bewertungswerte festzulegen, setzen wir term_scorer=RankDifference() so, dass die Erstellung von Termbewertungen in den Erstellungsprozess des Streudiagramms injiziert werden.

 html = produce_scattertext_explorer(
    corpus,
    category='Democratic',
    not_category_name='Republican',
    minimum_term_frequency=0,
    pmi_threshold_coefficient=0,
    width_in_pixels=1000,
    transform=dense_rank,
    use_non_text_features=True,
    metadata=corpus.get_df()['speaker'],
    term_scorer=RankDifference(),
    sort_by_dist=False,
    topic_model_term_lists={term: [term] for term in corpus.get_metadata()},
    topic_model_preview_size=0, 
    metadata_descriptions=metadata_descriptions,
    use_full_doc=True
)

Verwenden von Phrasemachine, um Phrasen zu finden.

Phrasemachine von AbeHandler (Handler et al. 2016) verwendet regelmäßige Ausdrücke über Sequenzen von Teil der Speech-Tags, um Substantivphrasen zu identifizieren. Dies hat den Vorteil, dass das NP-Chunking von Spacy dadurch verwendet wird, dass es sinnvolle, große Substantivphasen isoliert, die kostenlos von Appositiven sind.

Im Gegensatz zu Pytextrank werden wir nur Zählungen dieser Sätze verwenden und sie wie jede andere Begriff behandeln.

 import spacy
from scattertext import SampleCorpora, PhraseMachinePhrases, dense_rank, RankDifference, AssociationCompactor, produce_scattertext_explorer
from scattertext.CorpusFromPandas import CorpusFromPandas

corpus = (CorpusFromPandas(SampleCorpora.ConventionData2012.get_data(),
                           category_col='party',
                           text_col='text',
                           feats_from_spacy_doc=PhraseMachinePhrases(),
                           nlp=spacy.load('en', parser=False))
          .build().compact(AssociationCompactor(4000)))

html = produce_scattertext_explorer(corpus,
                                    category='democrat',
                                    category_name='Democratic',
                                    not_category_name='Republican',
                                    minimum_term_frequency=0,
                                    pmi_threshold_coefficient=0,
                                    transform=dense_rank,
                                    metadata=corpus.get_df()['speaker'],
                                    term_scorer=RankDifference(),
                                    width_in_pixels=1000)

Hinzufügen von Farbgradienten, um die Ergebnisse zu erklären

Im Streutext werden verschiedene Metriken, einschließlich Begriffsverbände, häufig auf zwei Arten gezeigt. Das erste und Wichtigste ist die Position in der Tabelle. Die zweite ist die Farbe eines Punktes oder Textes. In ScatterText 0.2.21 wird ein Weg zur Visualisierung der Semantik dieser Bewertungen eingeführt: der Gradient als Schlüssel.

Der Gradient folgt standardmäßig dem Parameter d3_color_scale von produce_scattertext_explorer , der standardmäßig d3.interpolateRdYlBu ist.

Die folgenden zusätzlichen Parameter für produce_scattertext_explorer (und ähnliche Funktionen) ermöglichen die Manipulationsgradienten.

include_gradient: bool (standardmäßig False ) ist ein Flag, das das Erscheinungsbild eines Gradienten auslöst.
left_gradient_term: Optional[str] zeigt den Text an, der auf der fernen linken Seite des Gradienten geschrieben wurde. Es ist in gradient_text_color geschrieben und ist standardmäßig category_name .
right_gradient_term: Optional[str] zeigt den Text an, der auf der fernen linken Seite des Gradienten geschrieben wurde. Es ist in gradient_text_color geschrieben und standardmäßig not_category_name .
middle_gradient_term: Optional[str] zeigt den in der Mitte des Gradienten geschriebenen Text an. Es ist die entgegengesetzte Farbe der Center -Gradientenfarbe und standardmäßig leer.
gradient_text_color: Optional[str] Zeigt die feste Farbe des auf dem Gradienten geschriebenen Text an. Wenn keine, ist es standardmäßig an der entgegengesetzten Farbe des Gradienten.
left_text_color: Optional[str] überschreibt gradient_text_color für den linken Gradientenbegriff
middle_text_color: Optional[str] überschreibt gradient_text_color für den Middle Gradientenbegriff
right_text_color: Optional[str] überschreibt gradient_text_color für den richtigen Gradientenbegriff
gradient_colors: Optional[List[str]] Liste der Hex -Farben, einschließlich '#', (z ['#0000ff', '#980067', '#cc3300', '#32cd00'] Wenn angegeben, überschreiben diese d3_color_scale .

Ein unkompliziertes Beispiel ist wie folgt. Die Begriffsfarben werden als Zuordnung zwischen einem Termnamen und einer #RRGGBB -Farbe als Teil des term_color -Parameters definiert, und der Farbgradient ist in gradient_colors definiert. Der

 import matplotlib . pyplot as plt
import matplotlib as mpl

df = st . SampleCorpora . ConventionData2012 . get_data (). assign (
    parse = lambda df : df . text . apply ( st . whitespace_nlp_with_sentences )
)

corpus = st . CorpusFromParsedDocuments (
    df , category_col = 'party' , parsed_col = 'parse'
). build (). get_unigram_corpus (). compact ( st . AssociationCompactor ( 2000 ))

html = st . produce_scattertext_explorer (
    corpus ,
    category = 'democrat' ,
    category_name = 'Democratic' ,
    not_category_name = 'Republican' ,
    minimum_term_frequency = 0 ,
    pmi_threshold_coefficient = 0 ,
    width_in_pixels = 1000 ,
    metadata = corpus . get_df ()[ 'speaker' ],
    transform = st . Scalers . dense_rank ,
    include_gradient = True ,
    left_gradient_term = "More Democratic" ,
    right_gradient_term = "More Republican" ,
    middle_gradient_term = 'Metric: Dense Rank Difference' ,
    gradient_text_color = "white" ,
    term_colors = dict ( zip (
        corpus . get_terms (),
        [
            mpl . colors . to_hex ( x ) for x in plt . get_cmap ( 'brg' )(
                st . Scalers . scale_center_zero_abs (
                    st . RankDifferenceScorer ( corpus ). set_categories ( 'democrat' ). get_scores ()). values
            )
        ]
    )),
    gradient_colors = [ mpl . colors . to_hex ( x ) for x in plt . get_cmap ( 'brg' )( np . arange ( 1. , 0. , - 0.01 ))],
)

Visualisieren von Empath -Themen und -kategorien

Um Empath (Fast et al., 2016) Themen und Kategorien anstelle von Begriffen zu visualisieren, müssen wir ein Corpus extrahierter Themen und Kategorien anstelle von Unigrams und Bigrams erstellen. Verwenden Sie dazu den FeatsOnlyFromEmpath -Feature -Extraktor. In dem Quellcode finden Sie Beispiele, wie Sie Ihre eigenen herstellen können.

Übergeben Sie beim Erstellen der Visualisierung das Argument der use_non_text_features=True produce_scattertext_explorer . Dadurch wird es angewiesen, die gekennzeichneten Empath -Themen und -kategorien zu verwenden, anstatt nach Begriffen zu suchen. Da die Dokumente zurückgegeben werden, wenn ein Thema oder ein Kategorie-Etikett angeklickt wird, ist die Kategorie-Assoziationsstärke auf Dokumentenebene in der Reihenfolge der Kategorie der Kategorie auf Dokumente, sofern Sie die Verwendung use_full_doc=True sinnvoll sind, es sei denn, Sie haben enorme Dokumente. Andernfalls werden die ersten 300 Zeichen angezeigt.

(Neu in 0,0,26). Stellen Sie sicher, dass Sie topic_model_term_lists=feat_builder.get_top_model_term_lists() in produce_scattertext_explorer einschließen, um sicherzustellen, dass es fettgedruckte Passagen von Snippets, die dem Themenmodell entsprechen.

 >>> feat_builder = st.FeatsFromOnlyEmpath()
>>> empath_corpus = st.CorpusFromParsedDocuments(convention_df,
...                                              category_col='party',
...                                              feats_from_spacy_doc=feat_builder,
...                                              parsed_col='text').build()
>>> html = st.produce_scattertext_explorer(empath_corpus,
...                                        category='democrat',
...                                        category_name='Democratic',
...                                        not_category_name='Republican',
...                                        width_in_pixels=1000,
...                                        metadata=convention_df['speaker'],
...                                        use_non_text_features=True,
...                                        use_full_doc=True,
...                                        topic_model_term_lists=feat_builder.get_top_model_term_lists())
>>> open("Convention-Visualization-Empath.html", 'wb').write(html.encode('utf-8'))

c ScatterText enthält außerdem einen Feature -Builder, der die Beziehung zwischen allgemeinem Inquirer -Tag -Categoires und Dokumentenkategorien untersucht. Wir werden einen etwas anderen Ansatz verwenden und die Beziehung der GI-Tag-Kategorien zu politischen Parteien untersuchen, indem wir die Z-Scores des Protokoll-Odds-Verhältnisses mit uninformativen Dirichlet-Priors verwenden (Monroe 2008). Wir werden die Plot-Variation produce_frequency_explorer Plot verwenden, um diese Beziehung zu visualisieren und die x-Achse so festzustellen, wie oft ein Wort in der Tag-Kategorie auftritt, und die y-Achse als Z-Score.

Weitere Informationen zum allgemeinen Inquirer finden Sie auf der Homepage für allgemeine Anfragen.

Wir werden denselben Datensatz wie zuvor verwenden, außer dass wir die FeatsFromGeneralInquirer -Builder verwenden werden.

 >>> general_inquirer_feature_builder = st.FeatsFromGeneralInquirer()
>>> corpus = st.CorpusFromPandas(convention_df,
...                              category_col='party',
...                              text_col='text',
...                              nlp=st.whitespace_nlp_with_sentences,
...                              feats_from_spacy_doc=general_inquirer_feature_builder).build()

Als nächstes werden wir in ähnlicher Weise produce_frequency_explorer auf nennen, die wir im vorherigen Abschnitt produce_scattertext_explorer bezeichnet haben. Es gibt jedoch einige Unterschiede. Zunächst geben wir den Begriff Torschützenkönig LogOddsRatioUninformativeDirichletPrior an, die die Beziehungen zwischen den Kategorien bewertet. Das grey_threshold gibt an, dass die Punkte zwischen [-1,96, 1,96] (dh p> 0,05) grau gefärbt werden sollten. Das Argument metadata_descriptions=general_inquirer_feature_builder.get_definitions() zeigt an, dass ein Wörterbuchzugriff, der den Tag -Namen zu einer String -Definition übergeben wird, übergeben wird. Wenn ein Tag angeklickt wird, wird die Definition im Wörterbuch unter dem Diagramm angezeigt, wie im Bild nach dem Ausschnitt gezeigt.

 >>> html = st.produce_frequency_explorer(corpus,
...                                      category='democrat',
...                                      category_name='Democratic',
...                                      not_category_name='Republican',
...                                      metadata=convention_df['speaker'],
...                                      use_non_text_features=True,
...                                      use_full_doc=True,
...                                      term_scorer=st.LogOddsRatioUninformativeDirichletPrior(),
...                                      grey_threshold=1.96,
...                                      width_in_pixels=1000,
...                                      topic_model_term_lists=general_inquirer_feature_builder.get_top_model_term_lists(),
...                                      metadata_descriptions=general_inquirer_feature_builder.get_definitions())

Hier ist das resultierende Diagramm.

Visualisierung des moralischen Fundaments 2.0 -Wörterbuchs

Die [moralische Fundamentstheorie] schlägt sechs psychologische Konstrukte als Bausteine des moralischen Denkens vor, wie in Graham et al. (2013). Diese Grundlagen werden, wie in [MoralFoundations.org] beschrieben: Pflege/Schaden, Fairness/Betrug, Loyalität/Verrat, Autorität/Subversion, Heiligkeit/Verschlechterung und Freiheit/Unterdrückung/Unterdrückung beschrieben. In der Website finden Sie eine eingehendere Diskussion dieser Stiftungen.

Frimer et al. (2019) schuf das moralische Fundament Dictionary 2.0 oder ein Lexikon von Begriffen, die sich auf eine moralische Grundlage als Tugend (günstig gegenüber der Stiftung) oder als Vize (im Gegensatz zur Stiftung) berufen.

Dieses Wörterbuch kann auf die gleiche Weise wie der allgemeine Untersucher verwendet werden. In diesem Beispiel können wir die zahlreichen Zahlen von Cohen's D Foundation-Wort in Bezug auf die Frequenzwörter, die diese Fundamente beteiligen, aufnehmen.

Wir können den Korpus zuerst wie gewohnt laden und st.FeatsFromMoralFoundationsDictionary() verwenden, um Merkmale zu extrahieren.

 import scattertext as st

convention_df = st . SampleCorpora . ConventionData2012 . get_data ()
moral_foundations_feats = st . FeatsFromMoralFoundationsDictionary ()
corpus = st . CorpusFromPandas ( convention_df ,
                             category_col = 'party' ,
                             text_col = 'text' ,
                             nlp = st . whitespace_nlp_with_sentences ,
                             feats_from_spacy_doc = moral_foundations_feats ). build ()

Lassen Sie uns als nächstes Cohens D -Begriff Torschütze verwenden, um den Korpus zu analysieren und eine Reihe von Cohens D -Assoziationswerten zu beschreiben.

 cohens_d_scorer = st . CohensD ( corpus ). use_metadata ()
term_scorer = cohens_d_scorer . set_categories ( 'democrat' , [ 'republican' ]). term_scorer . get_score_df ()

Dies ergibt den folgenden Datenrahmen:

	cohens_d	cohens_d_se	cohens_d_z	cohens_d_p	Hedges_G	hedges_g_se	hedges_g_z	hedges_g_p	M1	M2	count1	count2	docs1	docs2
Care.Virtue	0,662891	0,149425	4.43629	4.57621E-06	0,660257	0,159049	4.15129	1.65302E-05	0,195049	0,12164	760	379	115	54
Care.vice	0,24435	0,146025	1.67335	0,0471292	0,243379	0,152654	1,59432	0,0554325	0,0580005	0,0428358	244	121	80	41
Fairness.Virtue	0,176794	0,145767	1.21286	0,112592	0,176092	0,152164	1.15725	0,123586	0,0502469	0,0403369	225	107	71	39
Fairness.Vice	0,0707162	0,145528	0,485928	0,313509	0,0704352	0,151711	0,464273	0,321226	0,00718627	0,00573227	32	14	21	10
Autorität.Virtue	-0.0187793	0,145486	-0.12908	0,551353	-0.0187047	0,15163	-0.123357	0,549088	0,358192	0,361191	1281	788	122	66
Autorität.Vice	-0.0354164	0,145494	-0.243422	0,596161	-0.0352757	0,151646	-0.232619	0,591971	0,00353465	0,00390602	20	14	14	10
Heiligkeit.Virtue	-0.512145	0,147848	-3.46399	0,999734	-0.51011	0,156098	-3.26788	0,999458	0,0587987	0,101677	265	309	74	48
Heiligkeit.Vice	-0.108011	0,145589	-0.74189	0,770923	-0.107582	0,151826	-0.708585	0,760709	0,00845048	0.0109339	35	28	23	20
Loyalität.Virtue	-0.413696	0,147031	-2.81367	0,997551	-0.412052	0,154558	-2.666	0,996162	0,259296	0,309776	1056	717	119	66
Loyalität.Vice	-0.0854683	0,145549	-0.587213	0,72147	-0.0851287	0,151751	-0.560978	0,712594	0,00124518	0,00197022	5	5	5	4

Dieser Datenrahmen gibt uns Cohens D-Scores (und deren Standardfehler und Z-Scores), Hedge's $ g $ Scores (Ditto), die mittlere normalisierte Themenverwendung von Dokumentenlängen pro Kategorie (wobei die In-Fokus-Kategorie M1 ist [in diesem Fall Demokraten] und die Out-of-Focus-Kategorie M2), die rohe Anzahl der für jedes Thema verwendeten Wörter (COUNT1 und COUNT2) und die Anzahl der Dokumente in jeder Kategorie mit dem Thema (Docss1 und Docs2 und DOCS2).

Beachten Sie, dass Cohens D die Differenz von M1 und M2 geteilt durch ihre gepoolte Standardabweichung ist.

Lassen Sie uns nun die D-Scores von Fundamenten gegenüber ihren Frequenzen zeichnen.

 html = st . produce_frequency_explorer (
    corpus ,
    category = 'democrat' ,
    category_name = 'Democratic' ,
    not_category_name = 'Republican' ,
    metadata = convention_df [ 'speaker' ],
    use_non_text_features = True ,
    use_full_doc = True ,
    term_scorer = st . CohensD ( corpus ). use_metadata (),
    grey_threshold = 0 ,
    width_in_pixels = 1000 ,
    topic_model_term_lists = moral_foundations_feats . get_top_model_term_lists (),
    metadata_descriptions = moral_foundations_feats . get_definitions ()
)

Bestellbegriffe nach Corpus -Charakteristik

Oft sind die Begriffe des meisten Interesses solche, die für den gesamten Korpus charakteristisch sind. Dies sind Begriffe, die in allen untersuchten Dokumenten häufig auftreten, im Vergleich zu allgemeinen Lauffrequenzen jedoch relativ selten.

Wir können eine Handlung mit einer charakteristischen Punktzahl für die X-Achse- und Klassen-Assoziation-Scores auf der y-Achse unter Verwendung der Funktion produce_characteristic_explorer erstellen.

Die Corpus-Charakterität ist der Unterschied in der dichten Begriff zwischen den Wörtern in allen Dokumenten in der Studie und einer allgemeinen englischsprachigen Frequenzliste. In diesem Vortrag finden Sie eine gründlichere Erklärung.

 import scattertext as st

corpus = ( st . CorpusFromPandas ( st . SampleCorpora . ConventionData2012 . get_data (),
                              category_col = 'party' ,
                              text_col = 'text' ,
                              nlp = st . whitespace_nlp_with_sentences )
          . build ()
          . get_unigram_corpus ()
          . compact ( st . ClassPercentageCompactor ( term_count = 2 ,
                                               term_ranker = st . OncePerDocFrequencyRanker )))
html = st . produce_characteristic_explorer (
    corpus ,
    category = 'democrat' ,
    category_name = 'Democratic' ,
    not_category_name = 'Republican' ,
    metadata = corpus . get_df ()[ 'speaker' ]
)
open ( 'demo_characteristic_chart.html' , 'wb' ). write ( html . encode ( 'utf-8' ))

Dokumentbasierte Streudiagramme

Zusätzlich zu Wörtern, Phasen und Themen können wir jeden Punkt einem Dokument entsprechen. Lassen Sie uns zunächst ein Corpus -Objekt für den Datensatz von Conventions 2012 erstellen. Diese Erklärung folgt demo_pca_documents.py

 import pandas as pd
from sklearn . feature_extraction . text import TfidfTransformer
import scattertext as st
from scipy . sparse . linalg import svds

convention_df = st . SampleCorpora . ConventionData2012 . get_data ()
convention_df [ 'parse' ] = convention_df [ 'text' ]. apply ( st . whitespace_nlp_with_sentences )
corpus = ( st . CorpusFromParsedDocuments ( convention_df ,
                                       category_col = 'party' ,
                                       parsed_col = 'parse' )
          . build ()
          . get_stoplisted_unigram_corpus ())

Fügen wir als nächstes die Dokumentnamen als Meta -Daten im Corpus -Objekt hinzu. Die Funktion add_doc_names_as_metadata nimmt ein Array von Dokumentnamen und bevölkert die Meta -Daten von neuen Corpus mit diesen Namen. Wenn zwei Dokumente denselben Namen haben, findet eine Nummer (beginnend mit 1) zum Namen an.

 corpus = corpus . add_doc_names_as_metadata ( corpus . get_df ()[ 'speaker' ])

Als nächstes finden wir TF.IDF-Bewertungen für die Term-Document-Matrix der Corpus, führen spärliche SVD aus und fügen sie zu einem Projektionsdatenrahmen hinzu, wodurch die X- und Y-Achsen die ersten beiden Singularwerte gestaltet werden und sie auf den Meta-Daten der Korpus indizieren, was den Dokumentnamen entspricht.

 embeddings = TfidfTransformer (). fit_transform ( corpus . get_term_doc_mat ())
u , s , vt = svds ( embeddings , k = 3 , maxiter = 20000 , which = 'LM' )
projection = pd . DataFrame ({ 'term' : corpus . get_metadata (), 'x' : u . T [ 0 ], 'y' : u . T [ 1 ]}). set_index ( 'term' )

Setzen Sie schließlich die Punktzahlen als 1 für Demokraten und 0 für Republikaner, wodurch republikanische Dokumente als rote Punkte und demokratische Dokumente als blau gemacht werden. Weitere Informationen zur Funktion produce_pca_explorer finden Sie unter Verwendung von SVD, um jede Art von Worteinbettungen zu visualisieren.

 category = 'democrat'
scores = ( corpus . get_category_ids () == corpus . get_categories (). index ( category )). astype ( int )
html = st . produce_pca_explorer ( corpus ,
                               category = category ,
                               category_name = 'Democratic' ,
                               not_category_name = 'Republican' ,
                               metadata = convention_df [ 'speaker' ],
                               width_in_pixels = 1000 ,
                               show_axes = False ,
                               use_non_text_features = True ,
                               use_full_doc = True ,
                               projection = projection ,
                               scores = scores ,
                               show_top_terms = False )

Klicken Sie für eine interaktive Version

Verwenden Sie Cohens D oder Hedge's G, um die Effektgröße zu visualisieren.

Cohens D ist eine beliebte Metrik, die zur Messung der Effektgröße verwendet wird. Die Definitionen von Cohens D und Hedge's $ g $ Von (Shinichi und Cuthill 2017) werden in ScatterText implementiert.

 > >> convention_df = st . SampleCorpora . ConventionData2012 . get_data ()
> >> corpus = ( st . CorpusFromPandas ( convention_df ,
...                               category_col = 'party' ,
               ... text_col = 'text' ,
               ... nlp = st . whitespace_nlp_with_sentences )
.... build ()
.... get_unigram_corpus ())

Wir können einen Begriffsbilderobjekt erstellen, um die Effektgrößen und andere Metriken zu untersuchen.

 >> > term_scorer = st . CohensD ( corpus ). set_categories ( 'democrat' , [ 'republican' ])
>> > term_scorer . get_score_df (). sort_values ( by = 'cohens_d' , ascending = False ). head ()
cohens_d
cohens_d_se
cohens_d_z
cohens_d_p
hedges_g
hedges_g_se
hedges_g_z
hedges_g_p
m1
m2
obama
1.187378
0.024588
48.290444
0.000000e+00
1.187322
0.018419
64.461363
0.0
0.007778
0.002795


class 0.855859     0.020848   41.052045   0.000000e+00  0.855818     0.017227   49.677688         0.0  0.002222  0.000375


middle
0.826895
0.020553
40.232746
0.000000e+00
0.826857
0.017138
48.245626
0.0
0.002316
0.000400
president
0.820825
0.020492
40.056541
0.000000e+00
0.820786
0.017120
47.942661
0.0
0.010231
0.005369
barack
0.730624
0.019616
37.245725
6.213052e-304
0.730589
0.016862
43.327800
0.0
0.002547
0.000725

Unsere Berechnung von Cohens D basiert nicht direkt auf Begriffszählungen. Vielmehr teilen wir die Laufzeit jedes Dokuments durch die Gesamtzahl der Begriffe im Dokument vor der Berechnung der Statistiken. m1 und m2 sind jeweils die mittleren Teile der Wörter in Reden von Demokraten und Republikanern, die der fragliche Begriff waren. Die Effektgröße ( cohens_d ) ist der Unterschied zwischen diesen Mitteln geteilt durch die gepoolte Standardabweichung. cohens_d_se ist der Standardfehler der Statistik, während cohens_d_z und cohens_d_p die Z-Scores und P-Werte sind, die die statistische Bedeutung des Effekts angeben. Entsprechende Spalten sind für Hedge's vorhanden $ g $ Eine Version von Cohens D für die Datensatzgröße angepasst.

 > >> st . produce_frequency_explorer (
    corpus ,
    category = 'democrat' ,
    category_name = 'Democratic' ,
    not_category_name = 'Republican' ,
    term_scorer = st . CohensD ( corpus ),
    metadata = convention_df [ 'speaker' ],
    grey_threshold = 0
)

Klicken Sie für eine interaktive Version.

Verwenden Sie das Delta von Cliff, um die Effektgröße zu visualisieren

Das Delta von Cliff (Cliff 1993) verwendet einen nichtparametrischen Ansatz zur Recheneffektgröße. In unserer Einstellung wird der Frequenzprozentsatz des Begriffs jedes Dokuments im Fokussatz mit dem des Hintergrundsatzes verglichen. Für jedes Dokumentepaar wird eine Punktzahl von 1 angegeben, wenn der Frequenzprozentsatz des Fokusdokuments größer als der Hintergrund ist, 0, wenn identisch und -1, wenn es unterschiedlich ist. Beachten Sie, dass die Dokumentlängen davon ausgehen, dass die Längen des Fokus und der Hintergrundinformationen ähnlich verteilt sind.

Siehe [https://real-statistics.com/non-parametric-teests/mann-whitney-test/cliffss-delta/] für die in CliffsDelta verwendeten Formeln.

Nachfolgend finden Sie ein Beispiel dafür, wie man CliffsDelta verwendet, um Begriffe zu finden und zu zeichnen:

 nlp = spacy . blank ( 'en' )
nlp . add_pipe ( 'sentencizer' )
convention_df = st . SampleCorpora . ConventionData2012 . get_data (). assign (
   party = lambda df : df . party . apply (
       lambda x : { 'democrat' : 'Dem' , 'republican' : 'Rep' }[ x ]),
   SpacyParse = lambda df : df . text . progress_apply ( nlp )
)
corpus = st . CorpusFromParsedDocuments ( convention_df , category_col = 'party' , parsed_col = 'SpacyParse' ). build (
). remove_terms_used_in_less_than_num_docs ( 10 )
st . CliffsDelta ( corpus ). set_categories ( 'Dem' ). get_score_df (). sort_values ( by = 'Dem' , ascending = False ). iloc [: 10 ]

Begriff	Metrisch	Stddev	Niedrig-5,0% CI	Hoher 5,0% CI	TermCount1	TermCount2	Doccount1	Doccount2
Obama	0,597191	0,0266606	-1.35507	-1.03477	537	165	113	40
Präsident Obama	0,565903	0,0314348	-2.37978	-1.74131	351	78	100	30
Präsident	0,426337	0,0293418	1.22784	0,909226	740	301	113	53
Mitte	0,417591	0,0267365	1.10791	0,840932	164	27	68	12
Klasse	0,415373	0,0280622	1.09032	0,815649	161	25	69	14
Barack	0,406997	0,0281692	1.00765	0,750963	202	46	76	16
Barack Obama	0,402562	0,027512	0,965359	0,723403	164	45	76	16
Das ist	0,384085	0,0227344	0,809747	0,634705	236	91	89	31
Obama.	0,356245	0,0237453	0,664688	0,509631	70	5	49	4
für	0,35526	0,0364138	0,70142	0,46487	1020	542	119	62

Wir können die Delta -Bewertungen des Cliffs mithilfe von dataframe_scattertext elegant anzeigen und das Punktfärbungsschema unter Verwendung des Parameters include_gradient=True beschreiben. Wir setzen die Parameter left_gradient_term , middle_gradient_term und right_gradient_term auf Zeichenfolgen, die in ihren korrigierenden Werten angezeigt werden.

 plot_df = st . CliffsDelta (
    corpus
). set_categories (
    category_name = 'Dem'
). get_score_df (). rename ( columns = { 'Metric' : 'CliffsDelta' }). assign (
    Frequency = lambda df : df . TermCount1 + df . TermCount1 ,
    X = lambda df : df . Frequency ,
    Y = lambda df : df . CliffsDelta ,
    Xpos = lambda df : st . Scalers . dense_rank ( df . X ),
    Ypos = lambda df : st . Scalers . scale_center_zero_abs ( df . Y ),
    ColorScore = lambda df : df . Ypos ,
)

html = st . dataframe_scattertext (
    corpus ,
    plot_df = plot_df ,
    category = 'Dem' , 
    category_name = 'Dem' ,
    not_category_name = 'Rep' ,
    width_in_pixels = 1000 , 
    ignore_categories = False ,
    metadata = lambda corpus : corpus . get_df ()[ 'speaker' ],
    color_score_column = 'ColorScore' ,
    left_list_column = 'ColorScore' ,
    show_characteristic = False ,
    y_label = "Cliff's Delta" ,
    x_label = 'Frequency Ranks' ,
    y_axis_labels = [ f'More Rep: delta= { plot_df . CliffsDelta . max ():.3f } ' ,
                   '' ,
                   f'More Dem: delta= { - plot_df . CliffsDelta . max ():.3f } ' ],
    tooltip_columns = [ 'Frequency' , 'CliffsDelta' ],
    term_description_columns = [ 'CliffsDelta' , 'Stddev' , 'Low-95.0% CI' , 'High-95.0% CI' ],
    header_names = { 'upper' : 'Top Dem' , 'lower' : 'Top Reps' },
    horizontal_line_y_position = 0 ,
    include_gradient = True ,
    left_gradient_term = 'More Republican' ,
    right_gradient_term = 'More Democratic' ,
    middle_gradient_term = "Metric: Cliff's Delta" ,
)

Verwenden der Bi-Normal-Trennung (BNS), um Begriffe zu bewerten

Die Bi-normale Trennung (BNS) (Forman, 2008) wurde in Version 0.1.8 hinzugefügt. Eine Variation von (BNS) wird verwendet, wo $ F^{-1} (tpr)-f^{-1} (fpr) $ wird nicht als absoluter Wert verwendet, sondern als Unterschied aufbewahrt. Dies ermöglicht Begriffe, die stark auf wahre Positive und falsch positive Aspekte hinweisen, um eine hohe oder niedrige Punktzahl zu erzielen. Beachten Sie, dass TPR und FPR dazwischen skaliert sind $ [ alpha, 1- alpha] $ Wo Alpha ist $ in [0, 1] $ . In Forman (2008) und früherer Literatur $ alpha = 0,0005 $ . In der persönlichen Korrespondenz mit Forman schlug er freundlicherweise die Verwendung vor $ frac {1.} { mbox {minimum (positive, negative)}} $ . Ich habe dies als implementiert $ alpha = frac {1.} { mbox {Minimum Dokumente in der am wenigsten häufigen Kategorie}} $

 corpus = ( st . CorpusFromPandas ( convention_df ,
                              category_col = 'party' ,
                              text_col = 'text' ,
                              nlp = st . whitespace_nlp_with_sentences )
          . build ()
          . get_unigram_corpus ()
          . remove_infrequent_words ( 3 , term_ranker = st . OncePerDocFrequencyRanker ))

term_scorer = ( st . BNSScorer ( corpus ). set_categories ( 'democrat' ))
print ( term_scorer . get_score_df (). sort_values ( by = 'democrat BNS' ))

html = st . produce_frequency_explorer (
    corpus ,
    category = 'democrat' ,
    category_name = 'Democratic' ,
    not_category_name = 'Republican' ,
    scores = term_scorer . get_score_df ()[ 'democrat BNS' ]. reindex ( corpus . get_terms ()). values ,
    metadata = lambda c : c . get_df ()[ 'speaker' ],
    minimum_term_frequency = 0 ,
    grey_threshold = 0 ,
    y_label = f'Bi-normal Separation (alpha= { term_scorer . prior_counts } )'
)

BNS erzielte Begriffe mit einem algorithmisch gefundenen Alpha. ! [BNS] (https://raw.githubuSercontent.com/jasonkessler/jasonkessler.github.io/master/d emo_bi_normal_separation.png)

Verwenden von Korrelationen zur Erklärung von Klassifikatoren

Wir können einen Klassifizierer für jedes Dokument für die Erstellung einer Vorhersage -Punktzahl ausbilden. Oft verwenden Klassifikatoren oder Regressoren Funktionen, die Merkmale berücksichtigen, die über die durch Streuung dargestellten Merkmale berücksichtigen, sei es n-Gramm, Thema, extra-linguistisch, neuronal usw.

Wir können CratterText verwenden, um die Korrelationen zwischen Unigramms (oder wirklich jede Feature -Darstellung) und die von einem Modell erzeugten Dokumentwerte zu visualisieren.

Im folgenden Beispiel schulen wir eine lineare SVM mit Unigram- und BI-Gramm-Funktionen im gesamten Kongressdatensatz und verwenden das Modell, um in jedem Dokument eine Vorhersage zu machen und schließlich mit Pearson's zu prognostizieren $ r $ Um die Unigrammmerkmale mit der Entfernung von der SVM -Entscheidungsgrenze zu korrelieren.

 from sklearn . svm import LinearSVC

import scattertext as st

df = st . SampleCorpora . ConventionData2012 . get_data (). assign (
    parse = lambda df : df . text . apply ( st . whitespace_nlp_with_sentences )
)

corpus = st . CorpusFromParsedDocuments (
    df , category_col = 'party' , parsed_col = 'parse'
). build ()

X = corpus . get_term_doc_mat ()
y = corpus . get_category_ids ()

clf = LinearSVC ()
clf . fit ( X = X , y = y == corpus . get_categories (). index ( 'democrat' ))
doc_scores = clf . decision_function ( X = X )

compactcorpus = corpus . get_unigram_corpus (). compact ( st . AssociationCompactor ( 2000 ))

plot_df = st . Correlations (). set_correlation_type (
    'pearsonr'
). get_correlation_df (
    corpus = compactcorpus ,
    document_scores = doc_scores
). reindex ( compactcorpus . get_terms ()). assign (
    X = lambda df : df . Frequency ,
    Y = lambda df : df [ 'r' ],
    Xpos = lambda df : st . Scalers . dense_rank ( df . X ),
    Ypos = lambda df : st . Scalers . scale_center_zero_abs ( df . Y ),
    SuppressDisplay = False ,
    ColorScore = lambda df : df . Ypos ,
)

html = st . dataframe_scattertext (
    compactcorpus ,
    plot_df = plot_df ,
    category = 'democrat' ,
    category_name = 'Democratic' ,
    not_category_name = 'Republican' ,
    width_in_pixels = 1000 ,
    metadata = lambda c : c . get_df ()[ 'speaker' ],
    unified_context = False ,
    ignore_categories = False ,
    color_score_column = 'ColorScore' ,
    left_list_column = 'ColorScore' ,
    y_label = "Pearson r (correlation to SVM document score)" ,
    x_label = 'Frequency Ranks' ,
    header_names = { 'upper' : 'Top Democratic' ,
                  'lower' : 'Top Republican' },
)

Verwenden von benutzerdefinierten Hintergrundwortfrequenzen

Der Streutext basiert auf einer Reihe von englischen Worthäufigkeiten der allgemeinen Domäne bei der Berechnung von Unigram-Charakteristik
Punktzahlen. Bei Verwendung des Ausführens von Streutext auf nicht englischen Daten oder in einer bestimmten Domäne wird die Qualität der Punktzahlen abgebaut.

Stellen Sie sicher, dass Sie in Streuung von 0,1,6 oder höher sind.

Um dies zu beheben, kann man einem korpusähnlichen Objekt einen benutzerdefinierten Hintergrundbewertungen unter Verwendung der Funktion Corpus.set_background_corpus hinzufügen. Die Funktion nimmt ein pd.Series -Objekt an, das zu Begriffen mit numerischen Zählwerten indiziert ist.

Standardmäßig wird [! Verständnis scaled-f-score] (skaliertes F-Score) verwendet, um zu bewerten, wie charakteristische Begriffe sind.

Das folgende Beispiel zeigt die Verwendung von polnischen Hintergrund -Wortfrequenzen.

Zunächst produzieren wir ein Serienobjektzuordnungs-polnischer Wörter zu ihren Frequenzen mit einer Liste der https://github.com/oprogramador/most-common-words- by-language Repo.

 polish_word_frequencies = pd . read_csv (
    'https://raw.githubusercontent.com/hermitdave/FrequencyWords/master/content/2016/pl/pl_50k.txt' ,
    sep = ' ' ,
    names = [ 'Word' , 'Frequency' ]
). set_index ( 'Word' )[ 'Frequency' ]

Beachten Sie die Zusammensetzung der Serie

 >> > polish_word_frequencies
Word
nie
5875385
to
4388099
się
3507076
w
2723767
na
2309765
Name : Frequency , dtype : int64

Als nächstes erstellen wir einen DataFrame, reviews_df , der aus Dokumenten besteht, das (an einen nicht-polischen Sprecher) als positive und negative Hotelbewertungen aus den https://klejbenchmark.com/tasks/ Corpus (Kocoń, et al. 2019) erscheint. Beachten Sie, dass diese Daten unter einer CC BY-NC-SA 4.0-Lizenz erfolgen. These are labeled as "__label__meta_plus_m" and "__label__meta_minus_m". We will use Scattertext to compare those reviews and determine

 nlp = spacy . blank ( 'pl' )
nlp . add_pipe ( 'sentencizer' )

with ZipFile ( io . BytesIO ( urlopen (
        'https://klejbenchmark.com/static/data/klej_polemo2.0-in.zip'
). read ())) as zf :
    review_df = pd . read_csv ( zf . open ( 'train.tsv' ), sep = ' t ' )[
        lambda df : df . target . isin ([ '__label__meta_plus_m' , '__label__meta_minus_m' ])
    ]. assign (
        Parse = lambda df : df . sentence . apply ( nlp )
    )

Next, we wish to create a ParsedCorpus object from review_df . In preparation, we first assemble a list of Polish stopwords from the stopwords repository. We also create the not_a_word regular expression to filter out terms which do not contain a letter.

 polish_stopwords = {
    stopword for stopword in
    urlopen (
        'https://raw.githubusercontent.com/bieli/stopwords/master/polish.stopwords.txt'
    ). read (). decode ( 'utf-8' ). split ( ' n ' )
    if stopword . strip ()
}

not_a_word = re . compile ( r'^W+$' )

With these present, we can build a corpus from review_df with the category being the binary "target" column. We reduce the term space to unigrams and then run the filter_out which takes a function to determine if a term should be removed from the corpus. The function identifies terms which are in the Polish stoplist or do not contain a letter. Finally, terms occurring less than 20 times in the corpus are removed.

We set the background frequency Series we created early as the background corpus.

 corpus = st . CorpusFromParsedDocuments (
    review_df ,
    category_col = 'target' ,
    parsed_col = 'Parse'
). build (
). get_unigram_corpus (
). filter_out (
    lambda term : term in polish_stopwords or not_a_word . match ( term ) is not None
). remove_infrequent_words (
    minimum_term_count = 20
). set_background_corpus (
    polish_word_frequencies
)

Note that a minimum word count of 20 was chosen to ensure that only around 2,000 terms would be displayed

 >> > corpus . get_num_terms ()
2023

Running get_term_and_background_counts shows us total term counts in the corpus compare to background frequency counts. We limit this to terms which only occur in the corpus.

 >> > corpus . get_term_and_background_counts ()[
    ...
lambda df : df . corpus > 0
...]. sort_values ( by = 'corpus' , ascending = False )

background
corpus
m
341583838.0
4819.0
hotelu
33108.0
1812.0
hotel
297974790.0
1651.0
doktor
154840.0
1534.0
polecam
0.0
1438.0
.........
szoku
0.0
21.0
badaniem
0.0
21.0
balkonu
0.0
21.0
stopnia
0.0
21.0
wobec
0.0
21.0

Interesting, the term "polecam" appears very frequently in the corpus, but does not appear at all in the background corpus, making it highly characteristic. Judging from Google Translate, it appears to mean something related to "recommend".

We are now ready to display the plot.

 html = st . produce_scattertext_explorer (
    corpus ,
    category = '__label__meta_plus_m' ,
    category_name = 'Plus-M' ,
    not_category_name = 'Minus-M' ,
    minimum_term_frequency = 1 ,
    width_in_pixels = 1000 ,
    transform = st . Scalers . dense_rank
)

We can change the formula which is used to produce the Characteristic scores using the characteristic_scorer parameter to produce_scattertext_explorer .

It takes a instance of a descendant of the CharacteristicScorer class. See DenseRankCharacteristicness.py for an example of how to make your own.

Example of plotting with a modified characteristic scorer,

 html = st . produce_scattertext_explorer (
    corpus ,
    category = '__label__meta_plus_m' ,
    category_name = 'Plus-M' ,
    not_category_name = 'Minus-M' ,
    minimum_term_frequency = 1 ,
    transform = st . Scalers . dense_rank ,
    characteristic_scorer = st . DenseRankCharacteristicness (),
  	term_ranker = st . termranking . AbsoluteFrequencyRanker ,
	term_scorer = st . ScaledFScorePresets ( beta = 1 , one_to_neg_one = True )
). encode ( 'utf-8' ))
print ( 'open ' + fn )

Note that numbers show up as more characteristic using the Dense Rank Difference. It may be they occur unusually frequently in this corpus, or perhaps the background word frequencies under counted mumbers.

Plotting word productivity

Word productivity is one strategy for plotting word-based charts describing an uncategorized corpus.

Productivity is defined in Schumann (2016) (Jason: check this) as the entropy of ngrams which contain a term. For the entropy computation, the probability of an n-gram wrt the term whose productivity is being calculated is the frequency of the n-gram divided by the term's frequency.

Since productivity highly correlates with frequency, the recommended metric to plot is the dense rank difference between frequency and productivity.

The snippet below plots words in the convention corpus based on their log frequency and their productivity.

The function st.whole_corpus_productivity_scores returns a DataFrame giving each word's productivity. For example, in the convention corpus,

Productivity scores should be calculated on a Corpus -like object which contains a complete set of unigrams and at least bigrams. This corpus should not be compacted before the productivity score calculation.

The terms with lower productivity have more limited usage (eg, "thank" for "thank you", "united" for "united steates") while the terms with higher productivity occurr in a wider varity of contexts ("getting", "actually", "political", etc.).

 import spacy
import scattertext as st

corpus_no_cat = st . CorpusWithoutCategoriesFromParsedDocuments (
    st . SampleCorpora . ConventionData2012 . get_data (). assign (
        Parse = lambda df : [ x for x in spacy . load ( 'en_core_web_sm' ). pipe ( df . text )]),
    parsed_col = 'Parse'
). build ()

compact_corpus_no_cat = corpus_no_cat . get_stoplisted_unigram_corpus (). remove_infrequent_words ( 9 )

plot_df = st . whole_corpus_productivity_scores ( corpus_no_cat ). assign (
    RankDelta = lambda df : st . RankDifference (). get_scores (
        a = df . Productivity ,
        b = df . Frequency
    )
). reindex (
    compact_corpus_no_cat . get_terms ()
). dropna (). assign (
    X = lambda df : df . Frequency ,
    Xpos = lambda df : st . Scalers . log_scale ( df . Frequency ),
    Y = lambda df : df . RankDelta ,
    Ypos = lambda df : st . Scalers . scale ( df . RankDelta ),
)

html = st . dataframe_scattertext (
    compact_corpus_no_cat . whitelist_terms ( plot_df . index ),
    plot_df = plot_df ,
    metadata = lambda df : df . get_df ()[ 'speaker' ],
    ignore_categories = True ,
    x_label = 'Rank Frequency' ,
    y_label = "Productivity" ,
    left_list_column = 'Ypos' ,
    color_score_column = 'Ypos' ,
    y_axis_labels = [ 'Least Productive' , 'Average Productivity' , 'Most Productive' ],
    header_names = { 'upper' : 'Most Productive' , 'lower' : 'Least Productive' , 'right' : 'Characteristic' },
    horizontal_line_y_position = 0
)

Understanding Scaled F-Score

Let's now turn our attention to a novel term scoring metric, Scaled F-Score. We'll examine this on a unigram version of the Rotten Tomatoes corpus (Pang et al. 2002). It contains excerpts of positive and negative movie reviews.

Please see Scaled F Score Explanation for a notebook version of this analysis.

 from scipy . stats import hmean

term_freq_df = corpus . get_unigram_corpus (). get_term_freq_df ()[[ 'Positive freq' , 'Negative freq' ]]
term_freq_df = term_freq_df [ term_freq_df . sum ( axis = 1 ) > 0 ]

term_freq_df [ 'pos_precision' ] = ( term_freq_df [ 'Positive freq' ] * 1. /
                                 ( term_freq_df [ 'Positive freq' ] + term_freq_df [ 'Negative freq' ]))

term_freq_df [ 'pos_freq_pct' ] = ( term_freq_df [ 'Positive freq' ] * 1.
                                / term_freq_df [ 'Positive freq' ]. sum ())

term_freq_df [ 'pos_hmean' ] = ( term_freq_df
                             . apply ( lambda x : ( hmean ([ x [ 'pos_precision' ], x [ 'pos_freq_pct' ]])
                                               if x [ 'pos_precision' ] > 0 and x [ 'pos_freq_pct' ] > 0
                                               else 0 ), axis = 1 ))
term_freq_df . sort_values ( by = 'pos_hmean' , ascending = False ). iloc [: 10 ]

If we plot term frequency on the x-axis and the percentage of a term's occurrences which are in positive documents (ie, its precision) on the y-axis, we can see that low-frequency terms have a much higher variation in the precision. Given these terms have low frequencies, the harmonic means are low. Thus, the only terms which have a high harmonic mean are extremely frequent words which tend to all have near average precisions.

 freq = term_freq_df . pos_freq_pct . values
prec = term_freq_df . pos_precision . values
html = st . produce_scattertext_explorer (
    corpus . remove_terms ( set ( corpus . get_terms ()) - set ( term_freq_df . index )),
    category = 'Positive' ,
    not_category_name = 'Negative' ,
    not_categories = [ 'Negative' ],

    x_label = 'Portion of words used in positive reviews' ,
    original_x = freq ,
    x_coords = ( freq - freq . min ()) / freq . max (),
    x_axis_values = [ int ( freq . min () * 1000 ) / 1000. ,
                   int ( freq . max () * 1000 ) / 1000. ],

    y_label = 'Portion of documents containing word that are positive' ,
    original_y = prec ,
    y_coords = ( prec - prec . min ()) / prec . max (),
    y_axis_values = [ int ( prec . min () * 1000 ) / 1000. ,
                   int (( prec . max () / 2. ) * 1000 ) / 1000. ,
                   int ( prec . max () * 1000 ) / 1000. ],
    scores = term_freq_df . pos_hmean . values ,

    sort_by_dist = False ,
    show_characteristic = False
)
file_name = 'not_normed_freq_prec.html'
open ( file_name , 'wb' ). write ( html . encode ( 'utf-8' ))
IFrame ( src = file_name , width = 1300 , height = 700 )

 from scipy . stats import norm


def normcdf ( x ):
    return norm . cdf ( x , x . mean (), x . std ())


term_freq_df [ 'pos_precision_normcdf' ] = normcdf ( term_freq_df . pos_precision )

term_freq_df [ 'pos_freq_pct_normcdf' ] = normcdf ( term_freq_df . pos_freq_pct . values )

term_freq_df [ 'pos_scaled_f_score' ] = hmean (
    [ term_freq_df [ 'pos_precision_normcdf' ], term_freq_df [ 'pos_freq_pct_normcdf' ]])

term_freq_df . sort_values ( by = 'pos_scaled_f_score' , ascending = False ). iloc [: 10 ]

 freq = term_freq_df . pos_freq_pct_normcdf . values
prec = term_freq_df . pos_precision_normcdf . values
html = st . produce_scattertext_explorer (
    corpus . remove_terms ( set ( corpus . get_terms ()) - set ( term_freq_df . index )),
    category = 'Positive' ,
    not_category_name = 'Negative' ,
    not_categories = [ 'Negative' ],

    x_label = 'Portion of words used in positive reviews (norm-cdf)' ,
    original_x = freq ,
    x_coords = ( freq - freq . min ()) / freq . max (),
    x_axis_values = [ int ( freq . min () * 1000 ) / 1000. ,
                   int ( freq . max () * 1000 ) / 1000. ],

    y_label = 'documents containing word that are positive (norm-cdf)' ,
    original_y = prec ,
    y_coords = ( prec - prec . min ()) / prec . max (),
    y_axis_values = [ int ( prec . min () * 1000 ) / 1000. ,
                   int (( prec . max () / 2. ) * 1000 ) / 1000. ,
                   int ( prec . max () * 1000 ) / 1000. ],
    scores = term_freq_df . pos_scaled_f_score . values ,

    sort_by_dist = False ,
    show_characteristic = False
)

 term_freq_df [ 'neg_precision_normcdf' ] = normcdf (( term_freq_df [ 'Negative freq' ] * 1. /
                                                 ( term_freq_df [ 'Negative freq' ] + term_freq_df [ 'Positive freq' ])))

term_freq_df [ 'neg_freq_pct_normcdf' ] = normcdf (( term_freq_df [ 'Negative freq' ] * 1.
                                                / term_freq_df [ 'Negative freq' ]. sum ()))

term_freq_df [ 'neg_scaled_f_score' ] = hmean (
    [ term_freq_df [ 'neg_precision_normcdf' ], term_freq_df [ 'neg_freq_pct_normcdf' ]])

term_freq_df [ 'scaled_f_score' ] = 0
term_freq_df . loc [ term_freq_df [ 'pos_scaled_f_score' ] > term_freq_df [ 'neg_scaled_f_score' ],
                 'scaled_f_score' ] = term_freq_df [ 'pos_scaled_f_score' ]
term_freq_df . loc [ term_freq_df [ 'pos_scaled_f_score' ] < term_freq_df [ 'neg_scaled_f_score' ],
                 'scaled_f_score' ] = 1 - term_freq_df [ 'neg_scaled_f_score' ]
term_freq_df [ 'scaled_f_score' ] = 2 * ( term_freq_df [ 'scaled_f_score' ] - 0.5 )
term_freq_df . sort_values ( by = 'scaled_f_score' , ascending = True ). iloc [: 10 ]

 is_pos = term_freq_df . pos_scaled_f_score > term_freq_df . neg_scaled_f_score
freq = term_freq_df . pos_freq_pct_normcdf * is_pos - term_freq_df . neg_freq_pct_normcdf * ~ is_pos
prec = term_freq_df . pos_precision_normcdf * is_pos - term_freq_df . neg_precision_normcdf * ~ is_pos


def scale ( ar ):
    return ( ar - ar . min ()) / ( ar . max () - ar . min ())


def close_gap ( ar ):
    ar [ ar > 0 ] -= ar [ ar > 0 ]. min ()
    ar [ ar < 0 ] -= ar [ ar < 0 ]. max ()
    return ar


html = st . produce_scattertext_explorer (
    corpus . remove_terms ( set ( corpus . get_terms ()) - set ( term_freq_df . index )),
    category = 'Positive' ,
    not_category_name = 'Negative' ,
    not_categories = [ 'Negative' ],

    x_label = 'Frequency' ,
    original_x = freq ,
    x_coords = scale ( close_gap ( freq )),
    x_axis_labels = [ 'Frequent in Neg' ,
                   'Not Frequent' ,
                   'Frequent in Pos' ],

    y_label = 'Precision' ,
    original_y = prec ,
    y_coords = scale ( close_gap ( prec )),
    y_axis_labels = [ 'Neg Precise' ,
                   'Imprecise' ,
                   'Pos Precise' ],

    scores = ( term_freq_df . scaled_f_score . values + 1 ) / 2 ,
    sort_by_dist = False ,
    show_characteristic = False
)

We can use st.ScaledFScorePresets as a term scorer to display terms' Scaled F-Score on the y-axis and term frequencies on the x-axis.

 html = st . produce_frequency_explorer (
    corpus . remove_terms ( set ( corpus . get_terms ()) - set ( term_freq_df . index )),
    category = 'Positive' ,
    not_category_name = 'Negative' ,
    not_categories = [ 'Negative' ],
    term_scorer = st . ScaledFScorePresets ( beta = 1 , one_to_neg_one = True ),
    metadata = rdf [ 'movie_name' ],
    grey_threshold = 0
)

Alternative term scoring methods

Scaled F-Score is not the only scoring method included in Scattertext. Please click on one of the links below to view a notebook which describes how other class association scores work and can be visualized through Scattertext.

Google Colab Notebook (recommend).
Jupyter Notebook via NBViewer.

New in 0.0.2.73 is the delta JS-Divergence scorer DeltaJSDivergence scorer (Gallagher et al. 2020), and its corresponding compactor (JSDCompactor.) See demo_deltajsd.py for an example usage.

The position-select-plot process

New in 0.0.2.72

Scattertext was originally set up to visualize corpora objects, which are connected sets of documents and terms to visualize. The "compaction" process allows users to eliminate terms which may not be associated with a category using a variety of feature selection methods. The issue with this is that the terms eliminated during the selection process are not taken into account when scaling term positions.

This issue can be mitigated by using the position-select-plot process, where term positions are pre-determined before the selection process is made.

Let's first use the 2012 conventions corpus, update the category names, and create a unigram corpus.

 import scattertext as st
import numpy as np

df = st . SampleCorpora . ConventionData2012 . get_data (). assign (
    parse = lambda df : df . text . apply ( st . whitespace_nlp_with_sentences )
). assign ( party = lambda df : df [ 'party' ]. apply ({ 'democrat' : 'Democratic' , 'republican' : 'Republican' }. get ))

corpus = st . CorpusFromParsedDocuments (
    df , category_col = 'party' , parsed_col = 'parse'
). build (). get_unigram_corpus ()

category_name = 'Democratic'
not_category_name = 'Republican'

Next, let's create a dataframe consisting of the original counts and their log-scale positions.

 def get_log_scale_df ( corpus , y_category , x_category ):
    term_coord_df = corpus . get_term_freq_df ( '' )

    # Log scale term counts (with a smoothing constant) as the initial coordinates
    coord_columns = []
    for category in [ y_category , x_category ]:
        col_name = category + '_coord'
        term_coord_df [ col_name ] = np . log ( term_coord_df [ category ] + 1e-6 ) / np . log ( 2 )
        coord_columns . append ( col_name )

    # Scale these coordinates to between 0 and 1
    min_offset = term_coord_df [ coord_columns ]. min ( axis = 0 ). min ()
    for coord_column in coord_columns :
        term_coord_df [ coord_column ] -= min_offset
    max_offset = term_coord_df [ coord_columns ]. max ( axis = 0 ). max ()
    for coord_column in coord_columns :
        term_coord_df [ coord_column ] /= max_offset
    return term_coord_df


# Get term coordinates from original corpus
term_coordinates = get_log_scale_df ( corpus , category_name , not_category_name )
print ( term_coordinates )

Here is a preview of the term_coordinates dataframe. The Democrat and Republican columns contain the term counts, while the _coord columns contain their logged coordinates. Visualizing 7,973 terms is difficult (but possible) for people running Scattertext on most computers.

          Democratic  Republican  Democratic_coord  Republican_coord
term
thank            158         205          0.860166          0.872032
you              836         794          0.936078          0.933729
so               337         212          0.894681          0.873562
much              84          76          0.831380          0.826820
very              62          75          0.817543          0.826216
...              ...         ...               ...               ...
precinct           0           2          0.000000          0.661076
godspeed           0           1          0.000000          0.629493
beauty             0           1          0.000000          0.629493
bumper             0           1          0.000000          0.629493
sticker            0           1          0.000000          0.629493

[7973 rows x 4 columns]

We can visualize this full data set by running the following code block. We'll create a custom Javascript function to populate the tooltip with the original term counts, and create a Scattertext Explorer where the x and y coordinates and original values are specified from the data frame. Additionally, we can use show_diagonal=True to draw a dashed diagonal line across the plot area.

You can click the chart below to see the interactive version. Note that it will take a while to load.

 # The tooltip JS function. Note that d is is the term data object, and ox and oy are the original x- and y-
# axis counts.
get_tooltip_content = ('(function(d) {return d.term + "<br/>' + not_category_name + ' Count: " ' +
                       '+ d.ox +"<br/>' + category_name + ' Count: " + d.oy})')


html_orig = st.produce_scattertext_explorer(
    corpus,
    category=category_name,
    not_category_name=not_category_name,
    minimum_term_frequency=0,
    pmi_threshold_coefficient=0,
    width_in_pixels=1000,
    metadata=corpus.get_df()['speaker'],
    show_diagonal=True,
    original_y=term_coordinates[category_name],
    original_x=term_coordinates[not_category_name],
    x_coords=term_coordinates[category_name + '_coord'],
    y_coords=term_coordinates[not_category_name + '_coord'],
    max_overlapping=3,
    use_global_scale=True,
    get_tooltip_content=get_tooltip_content,
)

Next, we can visualize the compacted version of the corpus. The compaction, using ClassPercentageCompactor , selects terms which frequently in each category. The term_count parameter, set to 2, is used to determine the percentage threshold for terms to keep in a particular category. This is done using by calculating the percentile of terms (types) in each category which appear more than two times. We find the smallest percentile, and only include terms which occur above that percentile in a given category.

Note that this compaction leaves only 2,828 terms. This number is much easier for Scattertext to display in a browser.

 # Select terms which appear a minimum threshold in both corpora
compact_corpus = corpus . compact ( st . ClassPercentageCompactor ( term_count = 2 ))

# Only take term coordinates of terms remaining in corpus
term_coordinates = term_coordinates . loc [ compact_corpus . get_terms ()]

html_compact = st . produce_scattertext_explorer (
    compact_corpus ,
    category = category_name ,
    not_category_name = not_category_name ,
    minimum_term_frequency = 0 ,
    pmi_threshold_coefficient = 0 ,
    width_in_pixels = 1000 ,
    metadata = corpus . get_df ()[ 'speaker' ],
    show_diagonal = True ,
    original_y = term_coordinates [ category_name ],
    original_x = term_coordinates [ not_category_name ],
    x_coords = term_coordinates [ category_name + '_coord' ],
    y_coords = term_coordinates [ not_category_name + '_coord' ],
    max_overlapping = 3 ,
    use_global_scale = True ,
    get_tooltip_content = get_tooltip_content ,
)

Advanced uses

Visualizing differences based on only term frequencies

Occasionally, only term frequency statistics are available. This may happen in the case of very large, lost, or proprietary data sets. TermCategoryFrequencies is a corpus representation,that can accept this sort of data, along with any categorized documents that happen to be available.

Let use the Corpus of Contemporary American English as an example.
We'll construct a visualization to analyze the difference between spoken American English and English that occurs in fiction.

 df = ( pd . read_excel ( 'https://www.wordfrequency.info/files/genres_sample.xls' )
      . dropna ()
      . set_index ( 'lemma' )[[ 'SPOKEN' , 'FICTION' ]]
      . iloc [: 1000 ])
df . head ()
'''
       SPOKEN    FICTION
lemma
the    3859682.0  4092394.0
I      1346545.0  1382716.0
they   609735.0   352405.0
she    212920.0   798208.0
would  233766.0   229865.0
'''

Transforming this into a visualization is extremely easy. Just pass a dataframe indexed on terms with columns indicating category-counts into the the TermCategoryFrequencies constructor.

 term_cat_freq = st . TermCategoryFrequencies ( df )

And call produce_scattertext_explorer normally:

 html = st . produce_scattertext_explorer (
    term_cat_freq ,
    category = 'SPOKEN' ,
    category_name = 'Spoken' ,
    not_category_name = 'Fiction' ,
)

If you'd like to incorporate some documents into the visualization, you can add them into to the TermCategoyFrequencies object.

First, let's extract some example Fiction and Spoken documents from the sample COCA corpus.

 import requests , zipfile , io

coca_sample_url = 'http://corpus.byu.edu/cocatext/samples/text.zip'
zip_file = zipfile . ZipFile ( io . BytesIO ( requests . get ( coca_sample_url ). content ))

document_df = pd . DataFrame (
    [{ 'text' : zip_file . open ( fn ). read (). decode ( 'utf-8' ),
      'category' : 'SPOKEN' }
     for fn in zip_file . filelist if fn . filename . startswith ( 'w_spok' )][: 2 ]
    + [{ 'text' : zip_file . open ( fn ). read (). decode ( 'utf-8' ),
        'category' : 'FICTION' }
       for fn in zip_file . filelist if fn . filename . startswith ( 'w_fic' )][: 2 ])

And we'll pass the documents_df dataframe into TermCategoryFrequencies via the document_category_df parameter. Ensure the dataframe has two columns, 'text' and 'category'. Afterward, we can call produce_scattertext_explorer (or your visualization function of choice) normally.

 doc_term_cat_freq = st . TermCategoryFrequencies ( df , document_category_df = document_df )

html = st . produce_scattertext_explorer (
    doc_term_cat_freq ,
    category = 'SPOKEN' ,
    category_name = 'Spoken' ,
    not_category_name = 'Fiction' ,
)

Visualizing query-based categorical differences

Word representations have recently become a hot topic in NLP. While lots of work has been done visualizing how terms relate to one another given their scores (eg, http://projector.tensorflow.org/), none to my knowledge has been done visualizing how we can use these to examine how document categories differ.

In this example given a query term, "jobs", we can see how Republicans and Democrats talk about it differently.

In this configuration of Scattertext, words are colored by their similarity to a query phrase.
This is done using spaCy-provided GloVe word vectors (trained on the Common Crawl corpus). The cosine distance between vectors is used, with mean vectors used for phrases.

The calculation of the most similar terms associated with each category is a simple heuristic. First, sets of terms closely associated with a category are found. Second, these terms are ranked based on their similarity to the query, and the top rank terms are displayed to the right of the scatterplot.

A term is considered associated if its p-value is less than 0.05. P-values are determined using Monroe et al. (2008)'s difference in the weighted log-odds-ratios with an uninformative Dirichlet prior. This is the only model-based method discussed in Monroe et al. that does not rely on a large, in-domain background corpus. Since we are scoring bigrams in addition to the unigrams scored by Monroe, the size of the corpus would have to be larger to have high enough bigram counts for proper penalization. This function relies the Dirichlet distribution's parameter alpha, a vector, which is uniformly set to 0.01.

Here is the code to produce such a visualization.

 >>> from scattertext import word_similarity_explorer
>>> html = word_similarity_explorer(corpus,
...                                 category='democrat',
...                                 category_name='Democratic',
...                                 not_category_name='Republican',
...                                 target_term='jobs',
...                                 minimum_term_frequency=5,
...                                 pmi_threshold_coefficient=4,
...                                 width_in_pixels=1000,
...                                 metadata=convention_df['speaker'],
...                                 alpha=0.01,
...                                 max_p_val=0.05,
...                                 save_svg_button=True)
>>> open("Convention-Visualization-Jobs.html", 'wb').write(html.encode('utf-8'))

Developing and using bespoke word representations

Scattertext can interface with Gensim Word2Vec models. For example, here's a snippet from demo_gensim_similarity.py which illustrates how to train and use a word2vec model on a corpus. Note the similarities produced reflect quirks of the corpus, eg, "8" tends to refer to the 8% unemployment rate at the time of the convention.

 import spacy
from gensim . models import word2vec
from scattertext import SampleCorpora , word_similarity_explorer_gensim , Word2VecFromParsedCorpus
from scattertext . CorpusFromParsedDocuments import CorpusFromParsedDocuments

nlp = spacy . en . English ()
convention_df = SampleCorpora . ConventionData2012 . get_data ()
convention_df [ 'parsed' ] = convention_df . text . apply ( nlp )
corpus = CorpusFromParsedDocuments ( convention_df , category_col = 'party' , parsed_col = 'parsed' ). build ()
model = word2vec . Word2Vec ( size = 300 ,
                          alpha = 0.025 ,
                          window = 5 ,
                          min_count = 5 ,
                          max_vocab_size = None ,
                          sample = 0 ,
                          seed = 1 ,
                          workers = 1 ,
                          min_alpha = 0.0001 ,
                          sg = 1 ,
                          hs = 1 ,
                          negative = 0 ,
                          cbow_mean = 0 ,
                          iter = 1 ,
                          null_word = 0 ,
                          trim_rule = None ,
                          sorted_vocab = 1 )
html = word_similarity_explorer_gensim ( corpus ,
                                       category = 'democrat' ,
                                       category_name = 'Democratic' ,
                                       not_category_name = 'Republican' ,
                                       target_term = 'jobs' ,
                                       minimum_term_frequency = 5 ,
                                       pmi_threshold_coefficient = 4 ,
                                       width_in_pixels = 1000 ,
                                       metadata = convention_df [ 'speaker' ],
                                       word2vec = Word2VecFromParsedCorpus ( corpus , model ). train (),
                                       max_p_val = 0.05 ,
                                       save_svg_button = True )
open ( './demo_gensim_similarity.html' , 'wb' ). write ( html . encode ( 'utf-8' ))

How Democrats and Republicans talked differently about "jobs" in their 2012 convention speeches.

Visualizing any kind of term score

We can use Scattertext to visualize alternative types of word scores, and ensure that 0 scores are greyed out. Use the sparse_explroer function to acomplish this, and see its source code for more details.

 >>> from sklearn.linear_model import Lasso
>>> from scattertext import sparse_explorer
>>> html = sparse_explorer(corpus,
...                        category='democrat',
...                        category_name='Democratic',
...                        not_category_name='Republican',
...                        scores = corpus.get_regression_coefs('democrat', Lasso(max_iter=10000)),
...                        minimum_term_frequency=5,
...                        pmi_threshold_coefficient=4,
...                        width_in_pixels=1000,
...                        metadata=convention_df['speaker'])
>>> open('./Convention-Visualization-Sparse.html', 'wb').write(html.encode('utf-8'))

Custom term positions

You can also use custom term positions and axis labels. For example, you can base terms' y-axis positions on a regression coefficient and their x-axis on term frequency and label the axes accordingly. The one catch is that axis positions must be scaled between 0 and 1.

First, let's define two scaling functions: scale to project positive values to [0,1], and zero_centered_scale project real values to [0,1], with negative values always <0.5, and positive values always >0.5.

 >>> def scale(ar):
...     return (ar - ar.min()) / (ar.max() - ar.min())
...
>>> def zero_centered_scale(ar):
...     ar[ar > 0] = scale(ar[ar > 0])
...     ar[ar < 0] = -scale(-ar[ar < 0])
...     return (ar + 1) / 2.

Next, let's compute and scale term frequencies and L2-penalized regression coefficients. We'll hang on to the original coefficients and allow users to view them by mousing over terms.

 >>> from sklearn.linear_model import LogisticRegression
>>> import numpy as np
>>>
>>> frequencies_scaled = scale(np.log(term_freq_df.sum(axis=1).values))
>>> scores = corpus.get_logreg_coefs('democrat',
...                                  LogisticRegression(penalty='l2', C=10, max_iter=10000, n_jobs=-1))
>>> scores_scaled = zero_centered_scale(scores)

Finally, we can write the visualization. Note the use of the x_coords and y_coords parameters to store the respective coordinates, the scores and sort_by_dist arguments to register the original coefficients and use them to rank the terms in the right-hand list, and the x_label and y_label arguments to label axes.

 >>> html = produce_scattertext_explorer(corpus,
...                                     category='democrat',
...                                     category_name='Democratic',
...                                     not_category_name='Republican',
...                                     minimum_term_frequency=5,
...                                     pmi_threshold_coefficient=4,
...                                     width_in_pixels=1000,
...                                     x_coords=frequencies_scaled,
...                                     y_coords=scores_scaled,
...                                     scores=scores,
...                                     sort_by_dist=False,
...                                     metadata=convention_df['speaker'],
...                                     x_label='Log frequency',
...                                     y_label='L2-penalized logistic regression coef')
>>> open('demo_custom_coordinates.html', 'wb').write(html.encode('utf-8'))

Emoji analysis

The Emoji analysis capability displays a chart of the category-specific distribution of Emoji. Let's look at a new corpus, a set of tweets. We'll build a visualization showing how men and women use emoji differently.

Note: the following example is implemented in demo_emoji.py .

First, we'll load the dataset and parse it using NLTK's tweet tokenizer. Note, install NLTK before running this example. It will take some time for the dataset to download.

 import nltk , urllib . request , io , agefromname , zipfile
import scattertext as st
import pandas as pd

with zipfile . ZipFile ( io . BytesIO ( urllib . request . urlopen (
        'http://followthehashtag.com/content/uploads/USA-Geolocated-tweets-free-dataset-Followthehashtag.zip'
). read ())) as zf :
    df = pd . read_excel ( zf . open ( 'dashboard_x_usa_x_filter_nativeretweets.xlsx' ))

nlp = st . tweet_tokenzier_factory ( nltk . tokenize . TweetTokenizer ())
df [ 'parse' ] = df [ 'Tweet content' ]. apply ( nlp )

df . iloc [ 0 ]
'''
Tweet Id                                                     721318437075685382
Date                                                                 2016-04-16
Hour                                                                      12:44
User Name                                                        Bill Schulhoff
Nickname                                                          BillSchulhoff
Bio                           Husband,Dad,GrandDad,Ordained Minister, Umpire...
Tweet content                 Wind 3.2 mph NNE. Barometer 30.20 in, Rising s...
Favs                                                                        NaN
RTs                                                                         NaN
Latitude                                                                40.7603
Longitude                                                              -72.9547
Country                                                                      US
Place (as appears on Bio)                                    East Patchogue, NY
Profile picture               http://pbs.twimg.com/profile_images/3788000007...
Followers                                                                   386
Following                                                                   705
Listed                                                                       24
Tweet language (ISO 639-1)                                                   en
Tweet Url                     http://www.twitter.com/BillSchulhoff/status/72...
parse                         Wind 3.2 mph NNE. Barometer 30.20 in, Rising s...
Name: 0, dtype: object
'''

Next, we'll use the AgeFromName package to find the probabilities of the gender of each user given their first name. First, we'll find a dataframe indexed on first names that contains the probability that each someone with that first name is male ( male_prob ).

 male_prob = agefromname . AgeFromName (). get_all_name_male_prob ()
male_prob . iloc [ 0 ]
'''
hi      1.00000
lo      0.95741
prob    1.00000
Name: aaban, dtype: float64
'''

Next, we'll extract the first names of each user, and use the male_prob data frame to find users whose names indicate there is at least a 90% chance they are either male or female, label those users, and create new data frame df_mf with only those users.

 df [ 'first_name' ] = df [ 'User Name' ]. apply ( lambda x : x . split ()[ 0 ]. lower () if type ( x ) == str and len ( x . split ()) > 0 else x )
df_aug = pd . merge ( df , male_prob , left_on = 'first_name' , right_index = True )
df_aug [ 'gender' ] = df_aug [ 'prob' ]. apply ( lambda x : 'm' if x > 0.9 else 'f' if x < 0.1 else '?' )
df_mf = df_aug [ df_aug [ 'gender' ]. isin ([ 'm' , 'f' ])]

The key to this analysis is to construct a corpus using only the emoji extractor st.FeatsFromSpacyDocOnlyEmoji which builds a corpus only from emoji and not from anything else.

 corpus = st . CorpusFromParsedDocuments (
    df_mf ,
    parsed_col = 'parse' ,
    category_col = 'gender' ,
    feats_from_spacy_doc = st . FeatsFromSpacyDocOnlyEmoji ()
). build ()

Next, we'll run this through a standard produce_scattertext_explorer visualization generation.

 html = st . produce_scattertext_explorer (
    corpus ,
    category = 'f' ,
    category_name = 'Female' ,
    not_category_name = 'Male' ,
    use_full_doc = True ,
    term_ranker = st . OncePerDocFrequencyRanker ,
    sort_by_dist = False ,
    metadata = ( df_mf [ 'User Name' ]
              + ' (@' + df_mf [ 'Nickname' ] + ') '
              + df_mf [ 'Date' ]. astype ( str )),
    width_in_pixels = 1000
)
open ( "EmojiGender.html" , 'wb' ). write ( html . encode ( 'utf-8' ))

Visualizing SentencePiece Tokens

SentencePiece tokenization is a subword tokenization technique which relies on a language-model to produce optimized tokenization. It has been used in large, transformer-based contextual language models.

Ensure to run $ pip install sentencepiece before running this example.

First, let's load the political convention data set as normal.

 import tempfile
import re
import scattertext as st

convention_df = st . SampleCorpora . ConventionData2012 . get_data ()
convention_df [ 'parse' ] = convention_df . text . apply ( st . whitespace_nlp_with_sentences )

Next, let's train a SentencePiece tokenizer based on this data. The train_sentence_piece_tokenizer function trains a SentencePieceProcessor on the data set and returns it. You can of course use any SentencePieceProcessor.

 def train_sentence_piece_tokenizer ( documents , vocab_size ):
    '''
    :param documents: list-like, a list of str documents
    :vocab_size int: the size of the vocabulary to output
    
    :return sentencepiece.SentencePieceProcessor
    '''
    import sentencepiece as spm
    sp = None
    with tempfile . NamedTemporaryFile ( delete = True ) as tempf :
        with tempfile . NamedTemporaryFile ( delete = True ) as tempm :
            tempf . write (( ' n ' . join ( documents )). encode ())
            spm . SentencePieceTrainer . Train (
                '--input=%s --model_prefix=%s --vocab_size=%s' % ( tempf . name , tempm . name , vocab_size )
            )
            sp = spm . SentencePieceProcessor ()
            sp . load ( tempm . name + '.model' )
    return sp


sp = train_sentence_piece_tokenizer ( convention_df . text . values , vocab_size = 2000 )

Next, let's add the SentencePiece tokens as metadata when creating our corpus. In order to do this, pass a FeatsFromSentencePiece instance into the feats_from_spacy_doc parameter. Pass the SentencePieceProcessor into the constructor.

 corpus = st . CorpusFromParsedDocuments ( convention_df ,
                                      parsed_col = 'parse' ,
                                      category_col = 'party' ,
                                      feats_from_spacy_doc = st . FeatsFromSentencePiece ( sp )). build ()

Now we can create the SentencePiece token scatter plot.

 html = st . produce_scattertext_explorer (
    corpus ,
    category = 'democrat' ,
    category_name = 'Democratic' ,
    not_category_name = 'Republican' ,
    sort_by_dist = False ,
    metadata = convention_df [ 'party' ] + ': ' + convention_df [ 'speaker' ],
    term_scorer = st . RankDifference (),
    transform = st . Scalers . dense_rank ,
    use_non_text_features = True ,
    use_full_doc = True ,
)

Visualizing scikit-learn text classification weights

Suppose you'd like to audit or better understand weights or importances given to bag-of-words features by a classifier.

It's easy to use Scattertext to do, if you use a Scikit-learn-style classifier.

For example the Lighting package makes available high-performance linear classifiers which are have Scikit-compatible interfaces.

First, let's import sklearn 's text feature extraction classes, the 20 Newsgroup corpus, Lightning's Primal Coordinate Descent classifier, and Scattertext. We'll also fetch the training portion of the Newsgroup corpus.

 from lightning . classification import CDClassifier
from sklearn . datasets import fetch_20newsgroups
from sklearn . feature_extraction . text import CountVectorizer , TfidfVectorizer

import scattertext as st

newsgroups_train = fetch_20newsgroups (
    subset = 'train' ,
    remove = ( 'headers' , 'footers' , 'quotes' )
)

Next, we'll tokenize our corpus twice. Once into tfidf features which will be used to train the classifier, an another time into ngram counts that will be used by Scattertext. It's important that both vectorizers share the same vocabulary, since we'll need to apply the weight vector from the model onto our Scattertext Corpus.

 vectorizer = TfidfVectorizer ()
tfidf_X = vectorizer . fit_transform ( newsgroups_train . data )
count_vectorizer = CountVectorizer ( vocabulary = vectorizer . vocabulary_ )

Next, we use the CorpusFromScikit factory to build a Scattertext Corpus object. Ensure the X parameter is a document-by-feature matrix. The argument to the y parameter is an array of class labels. Each label is an integer representing a different news group. We the feature_vocabulary is the vocabulary used by the vectorizers. The category_names are a list of the 20 newsgroup names which as a class-label list. The raw_texts is a list of the text of newsgroup texts.

 corpus = st . CorpusFromScikit (
    X = count_vectorizer . fit_transform ( newsgroups_train . data ),
    y = newsgroups_train . target ,
    feature_vocabulary = vectorizer . vocabulary_ ,
    category_names = newsgroups_train . target_names ,
    raw_texts = newsgroups_train . data
). build ()

Now, we can train the model on tfidf_X and the categoricla response variable, and capture feature weights for category 0 ("alt.atheism").

 clf = CDClassifier ( penalty = "l1/l2" ,
                   loss = "squared_hinge" ,
                   multiclass = True ,
                   max_iter = 20 ,
                   alpha = 1e-4 ,
                   C = 1.0 / tfidf_X . shape [ 0 ],
                   tol = 1e-3 )
clf . fit ( tfidf_X , newsgroups_train . target )
term_scores = clf . coef_ [ 0 ]

Finally, we can create a Scattertext plot. We'll use the Monroe-style visualization, and automatically select around 4000 terms that encompass the set of frequent terms, terms with high absolute scores, and terms that are characteristic of the corpus.

 html = st . produce_frequency_explorer (
    corpus ,
    'alt.atheism' ,
    scores = term_scores ,
    use_term_significance = False ,
    terms_to_include = st . AutoTermSelector . get_selected_terms ( corpus , term_scores , 4000 ),
    metadata = [ '/' . join ( fn . split ( '/' )[ - 2 :]) for fn in newsgroups_train . filenames ]
)

Let's take a look at the performance of the classifier:

 newsgroups_test = fetch_20newsgroups ( subset = 'test' ,
                                     remove = ( 'headers' , 'footers' , 'quotes' ))
X_test = vectorizer . transform ( newsgroups_test . data )
pred = clf . predict ( X_test )
f1 = f1_score ( pred , newsgroups_test . target , average = 'micro' )
print ( "Microaveraged F1 score" , f1 )

Microaveraged F1 score 0.662108337759. Not bad over a ~0.05 baseline.

Creating lexicalized semiotic squares

Please see Signo for an introduction to semiotic squares.

Some variants of the semiotic square-creator are can be seen in this notebook, which studies words and phrases in headlines that had low or high Facebook engagement and were published by either BuzzFeed or the New York Times: [http://nbviewer.jupyter.org/github/JasonKessler/PuPPyTalk/blob/master/notebooks/Explore-Headlines.ipynb]

The idea behind the semiotic square is to express the relationship between two opposing concepts and concepts things within a larger domain of a discourse. Examples of opposed concepts life or death, male or female, or, in our example, positive or negative sentiment. Semiotics squares are comprised of four "corners": the upper two corners are the opposing concepts, while the bottom corners are the negation of the concepts.

Circumscribing the negation of a concept involves finding everything in the domain of discourse that isn't associated with the concept. For example, in the life-death opposition, one can consider the universe of discourse to be all animate beings, real and hypothetical. The not-alive category will cover dead things, but also hypothetical entities like fictional characters or sentient AIs.

In building lexicalized semiotic squares, we consider concepts to be documents labeled in a corpus. Documents, in this setting, can belong to one of three categories: two labels corresponding to the opposing concepts, a neutral category, indicating a document is in the same domain as the opposition, but cannot fall into one of opposing categories.

In the example below positive and negative movie reviews are treated as the opposing categories, while plot descriptions of the same movies are treated as the neutral category.

Terms associated with one of the two opposing categories (relative only to the other) are listed as being associated with that category. Terms associated with a netural category (eg, not positive) are terms which are associated with the disjunction of the opposite category and the neutral category. For example, not-positive terms are those most associated with the set of negative reviews and plot descriptions vs. positive reviews.

Common terms among adjacent corners of the square are also listed.

An HTML-rendered square is accompanied by a scatter plot. Points on the plot are terms. The x-axis is the Z-score of the association to one of the opposed concepts. The y-axis is the Z-score how associated a term is with the neutral set of documents relative to the opposed set. A point's red-blue color indicate the term's opposed-association, while the more desaturated a term is, the more it is associated with the neutral set of documents.

Update to version 2.2: terms are colored by their nearest semiotic categories across the eight corresponding radial sectors.

 import scattertext as st

movie_df = st . SampleCorpora . RottenTomatoes . get_data ()
movie_df . category = movie_df . category . apply
    ( lambda x : { 'rotten' : 'Negative' , 'fresh' : 'Positive' , 'plot' : 'Plot' }[ x ])
corpus = st . CorpusFromPandas (
    movie_df ,
    category_col = 'category' ,
    text_col = 'text' ,
    nlp = st . whitespace_nlp_with_sentences
). build (). get_unigram_corpus ()

semiotic_square = st . SemioticSquare (
    corpus ,
    category_a = 'Positive' ,
    category_b = 'Negative' ,
    neutral_categories = [ 'Plot' ],
    scorer = st . RankDifference (),
    labels = { 'not_a_and_not_b' : 'Plot Descriptions' , 'a_and_b' : 'Reviews' }
)

html = st . produce_semiotic_square_explorer ( semiotic_square ,
                                           category_name = 'Positive' ,
                                           not_category_name = 'Negative' ,
                                           x_label = 'Fresh-Rotten' ,
                                           y_label = 'Plot-Review' ,
                                           neutral_category_name = 'Plot Description' ,
                                           metadata = movie_df [ 'movie_name' ])

There are a number of other types of semiotic square construction functions. Again, please see https://nbviewer.org/github/JasonKessler/PuPPyTalk/blob/master/notebooks/Explore-Headlines.ipynb for an overview of these.

Visualizing Topic Models

A frequently requested feature of Scattertext has been the ability to visualize topic models. While this capability has existed in some forms (eg, the Empath visualization), I've finally gotten around to implementing a concise API for such a visualization. There are three main ways to visualize topic models using Scattertext. The first is the simplest: manually entering topic models and visualizing them. The second uses a Scikit-Learn pipeline to produce the topic models for visualization. The third is a novel topic modeling technique, based on finding terms similar to a custom set of seed terms.

Manually entered topic models

If you have already created a topic model, simply structure it as a dictionary. This dictionary is keyed on string which serve as topic titles and are displayed in the main scatterplot. The values are lists of words that belong to that topic. The words that are in each topic list are bolded when they appear in a snippet.

Note that currently, there is no support for keyword scores.

For example, one might manually the following topic models to explore in the Convention corpus:

 topic_model = {
    'money' : [ 'money' , 'bank' , 'banks' , 'finances' , 'financial' , 'loan' , 'dollars' , 'income' ],
    'jobs' : [ 'jobs' , 'workers' , 'labor' , 'employment' , 'worker' , 'employee' , 'job' ],
    'patriotic' : [ 'america' , 'country' , 'flag' , 'americans' , 'patriotism' , 'patriotic' ],
    'family' : [ 'mother' , 'father' , 'mom' , 'dad' , 'sister' , 'brother' , 'grandfather' , 'grandmother' , 'son' , 'daughter' ]
}

We can use the FeatsFromTopicModel class to transform this topic model into one which can be visualized using Scattertext. This is used just like any other feature builder, and we pass the topic model object into produce_scattertext_explorer .

 import scattertext as st

topic_feature_builder = st.FeatsFromTopicModel(topic_model)

topic_corpus = st.CorpusFromParsedDocuments(
	convention_df,
	category_col='party',
	parsed_col='parse',
	feats_from_spacy_doc=topic_feature_builder
).build()

html = st.produce_scattertext_explorer(
	topic_corpus,
	category='democrat',
	category_name='Democratic',
	not_category_name='Republican',
	width_in_pixels=1000,
	metadata=convention_df['speaker'],
	use_non_text_features=True,
	use_full_doc=True,
	pmi_threshold_coefficient=0,
	topic_model_term_lists=topic_feature_builder.get_top_model_term_lists()
)

Using Scikit-Learn for Topic Modeling

Since topic modeling using document-level coocurence generally produces poor results, I've added a SentencesForTopicModeling class which allows clusterting by coocurence at the sentence-level. It requires a ParsedCorpus object to be passed to its constructor, and creates a term-sentence matrix internally.

Next, you can create a topic model dictionary like the one above by passing in a Scikit-Learn clustering or dimensionality reduction pipeline. The only constraint is the last transformer in the pipeline must populate a components_ attribute.

The num_topics_per_term attribute specifies how many terms should be added to a list.

In the following example, we'll use NMF to cluster a stoplisted, unigram corpus of documents, and use the topic model dictionary to create a FeatsFromTopicModel , just like before.

Note that in produce_scattertext_explorer , we make the topic_model_preview_size 20 in order to show a preview of the first 20 terms in the topic in the snippet view as opposed to the default 10.

 from sklearn . decomposition import NMF
from sklearn . feature_extraction . text import TfidfTransformer
from sklearn . pipeline import Pipeline

convention_df = st . SampleCorpora . ConventionData2012 . get_data ()
convention_df [ 'parse' ] = convention_df [ 'text' ]. apply ( st . whitespace_nlp_with_sentences )

unigram_corpus = ( st . CorpusFromParsedDocuments ( convention_df ,
                                               category_col = 'party' ,
                                               parsed_col = 'parse' )
                  . build (). get_stoplisted_unigram_corpus ())
topic_model = st . SentencesForTopicModeling ( unigram_corpus ). get_topics_from_model (
    Pipeline ([
        ( 'tfidf' , TfidfTransformer ( sublinear_tf = True )),
        ( 'nmf' , ( NMF ( n_components = 100 , alpha = .1 , l1_ratio = .5 , random_state = 0 )))
    ]),
    num_terms_per_topic = 20
)

topic_feature_builder = st . FeatsFromTopicModel ( topic_model )

topic_corpus = st . CorpusFromParsedDocuments (
    convention_df ,
    category_col = 'party' ,
    parsed_col = 'parse' ,
    feats_from_spacy_doc = topic_feature_builder
). build ()

html = st . produce_scattertext_explorer (
    topic_corpus ,
    category = 'democrat' ,
    category_name = 'Democratic' ,
    not_category_name = 'Republican' ,
    width_in_pixels = 1000 ,
    metadata = convention_df [ 'speaker' ],
    use_non_text_features = True ,
    use_full_doc = True ,
    pmi_threshold_coefficient = 0 ,
    topic_model_term_lists = topic_feature_builder . get_top_model_term_lists (),
    topic_model_preview_size = 20
)

Using a Word List to Generate a Series of Topics

A surprisingly easy way to generate good topic models is to use a term scoring formula to find words that are associated with sentences where a seed word occurs vs. where one doesn't occur.

Given a custom term list, the SentencesForTopicModeling.get_topics_from_terms will generate a series of topics. Note that the dense rank difference ( RankDifference ) works particularly well for this task, and is the default parameter.

 term_list = [ 'obama' , 'romney' , 'democrats' , 'republicans' , 'health' , 'military' , 'taxes' ,
             'education' , 'olympics' , 'auto' , 'iraq' , 'iran' , 'israel' ]

unigram_corpus = ( st . CorpusFromParsedDocuments ( convention_df ,
                                               category_col = 'party' ,
                                               parsed_col = 'parse' )
                  . build (). get_stoplisted_unigram_corpus ())

topic_model = ( st . SentencesForTopicModeling ( unigram_corpus )
               . get_topics_from_terms ( term_list ,
                                      scorer = st . RankDifference (),
                                      num_terms_per_topic = 20 ))

topic_feature_builder = st . FeatsFromTopicModel ( topic_model )
# The remaining code is identical to two examples above. See demo_word_list_topic_model.py
# for the complete example.

Creating T-SNE-style word embedding projection plots

Scattertext makes it easy to create word-similarity plots using projections of word embeddings as the x and y-axes. In the example below, we create a stop-listed Corpus with only unigram terms. The produce_projection_explorer function by uses Gensim to create word embeddings and then projects them to two dimentions using Uniform Manifold Approximation and Projection (UMAP).

UMAP is chosen over T-SNE because it can employ the cosine similarity between two word vectors instead of just the euclidean distance.

 convention_df = st . SampleCorpora . ConventionData2012 . get_data ()
convention_df [ 'parse' ] = convention_df [ 'text' ]. apply ( st . whitespace_nlp_with_sentences )

corpus = ( st . CorpusFromParsedDocuments ( convention_df , category_col = 'party' , parsed_col = 'parse' )
          . build (). get_stoplisted_unigram_corpus ())

html = st . produce_projection_explorer ( corpus , category = 'democrat' , category_name = 'Democratic' ,
                                      not_category_name = 'Republican' , metadata = convention_df . speaker )

In order to use custom word embedding functions or projection functions, pass models into the word2vec_model and projection_model parameters. In order to use T-SNE, for example, use projection_model=sklearn.manifold.TSNE() .

 import umap
from gensim . models . word2vec import Word2Vec

html = st . produce_projection_explorer ( corpus ,
                                      word2vec_model = Word2Vec ( size = 100 , window = 5 , min_count = 10 , workers = 4 ),
                                      projection_model = umap . UMAP ( min_dist = 0.5 , metric = 'cosine' ),
                                      category = 'democrat' ,
                                      category_name = 'Democratic' ,
                                      not_category_name = 'Republican' ,
                                      metadata = convention_df . speaker )

Using SVD to visualize any kind of word embeddings

Term positions can also be determined by the positions of terms according to the output of principal component analysis, and produce_projection_explorer also supports this functionality. We'll look at how axes transformations ("scalers" in Scattertext terminology) can make it easier to inspect the output of PCA.

We'll use the 2012 Conventions corpus for these visualizations. Only unigrams occurring in at least three documents will be considered.

 >>> convention_df = st.SampleCorpora.ConventionData2012.get_data()
>>> convention_df['parse'] = convention_df['text'].apply(st.whitespace_nlp_with_sentences)
>>> corpus = (st.CorpusFromParsedDocuments(convention_df,
...                                        category_col='party',
...                                        parsed_col='parse')
...           .build()
...           .get_stoplisted_unigram_corpus()
...           .remove_infrequent_words(minimum_term_count=3, term_ranker=st.OncePerDocFrequencyRanker))

Next, we use scikit-learn's tf-idf transformer to find very simple, sparse embeddings for all of these words. Since, we input a #docs x #terms matrix to the transformer, we can transpose it to get a proper term-embeddings matrix, where each row corresponds to a term, and the columns correspond to document-specific tf-idf scores.

 >>> from sklearn.feature_extraction.text import TfidfTransformer
>>> embeddings = TfidfTransformer().fit_transform(corpus.get_term_doc_mat())
>>> embeddings.shape
(189, 2159)
>>> corpus.get_num_docs(), corpus.get_num_terms()
(189, 2159) 
>>> embeddings = embeddings.T
>>> embeddings.shape
(2159, 189)

Given these spare embeddings, we can apply sparse singular value decomposition to extract three factors. SVD outputs factorizes the term embeddings matrix into three matrices, U, Σ, and VT. Importantly, the matrix U provides the singular values for each term, and VT provides them for each document, and Σ is a vector of the singular values.

 >>> from scipy.sparse.linalg import svds
>>> U, S, VT = svds(embeddings, k = 3, maxiter=20000, which='LM')
>>> U.shape
(2159, 3)
>>> S.shape
(3,)
>>> VT.shape
(3, 189)

We'll look at the first two singular values, plotting each term such that the x-axis position is the first singular value, and the y-axis term is the second. To do this, we make a "projection" data frame, where the x and y columns store the first two singular values, and key the data frame on each term. This controls the term positions on the chart.

 >>> x_dim = 0; y_dim = 1;
>>> projection = pd.DataFrame({'term':corpus.get_terms(),
...                            'x':U.T[x_dim],
...                            'y':U.T[y_dim]}).set_index('term')

We'll use the produce_pca_explorer function to visualize these. Note we include the projection object, and specify which singular values were used for x and y ( x_dim and y_dim ) so we they can be labeled in the interactive visualization.

 html = st.produce_pca_explorer(corpus,
                               category='democrat',
                               category_name='Democratic',
                               not_category_name='Republican',
                               projection=projection,
                               metadata=convention_df['speaker'],
                               width_in_pixels=1000,
                               x_dim=x_dim,
                               y_dim=y_dim)

Click for an interactive visualization.

We can easily re-scale the plot in order to make more efficient use of space. For example, passing in scaler=scale_neg_1_to_1_with_zero_mean will make all four quadrants take equal area.

 html = st.produce_pca_explorer(corpus,
                               category='democrat',
                               category_name='Democratic',
                               not_category_name='Republican',
                               projection=projection,
                               metadata=convention_df['speaker'],
                               width_in_pixels=1000,
                               scaler=st.scale_neg_1_to_1_with_zero_mean,
                               x_dim=x_dim,
                               y_dim=y_dim)

Click for an interactive visualization.

Exporting plot to matplotlib

To export the content of a scattertext explorer object (ScattertextStructure) to matplotlib you can use produce_scattertext_pyplot . The function returns a matplotlib.figure.Figure object which can be visualized using plt.show or plt.savefig as in the example below.

Note that installation of textalloc==0.0.3 and matplotlib>=3.6.0 is required before running this.

 convention_df = st.SampleCorpora.ConventionData2012.get_data().assign(
	parse = lambda df: df.text.apply(st.whitespace_nlp_with_sentences)
)
corpus = st.CorpusFromParsedDocuments(convention_df, category_col='party', parsed_col='parse').build()
scattertext_structure = st.produce_scattertext_explorer(
	corpus,
	category='democrat',
	category_name='Democratic',
	not_category_name='Republican',
	minimum_term_frequency=5,
	pmi_threshold_coefficient=8,
	width_in_pixels=1000,
	return_scatterplot_structure=True,
)
fig = st.produce_scattertext_pyplot(scattertext_structure)
fig.savefig('pyplot_export.png', format='png')

[]

Beispiele

Please see the examples in the PyData 2017 Tutorial on Scattertext.

A note on chart layout

Cozy: The Collection Synthesizer (Loncaric 2016) was used to help determine which terms could be labeled without overlapping a circle or another label. It automatically built a data structure to efficiently store and query the locations of each circle and labeled term.

The script to build rectangle-holder.js was

 fields ax1 : long, ay1 : long, ax2 : long, ay2 : long
assume ax1 < ax2 and ay1 < ay2
query findMatchingRectangles(bx1 : long, by1 : long, bx2 : long, by2 : long)
    assume bx1 < bx2 and by1 < by2
    ax1 < bx2 and ax2 > bx1 and ay1 < by2 and ay2 > by1

And it was called using

 $ python2.7 src/main.py <script file name> --enable-volume-trees 
  --js-class RectangleHolder --enable-hamt --enable-arrays --js rectangle_holder.js

Was ist neu

0.0.2.64

Adding in code to ensure that term statistics will show up even if no documents are present in visualization.

0.0.2.60

Better axis labeling (see demo_axis_crossbars_and_labels.py).

0.0.2.59

Pytextrank compatibility

0.0.2.57-58

Ensuring Pandas 1.0 compatibility fixing Issue #51 and scikit-learn stopwords import issue in #49.

0.0.2.44:

Added the following classes to support rank-based feature-selection: AssociationCompactorByRank , TermCategoryRanker .

0.0.2.43:

Made the term pop-up box on the category pairplot only the category name
Fixed optimal projection search function
Merged PR from @millengustavo to fix when a FutureWarning is issued every time the get_background_frequency_df is called.

0.0.2.42:

Fixed clickablity of terms, coloring in certain plots
Added initial number of terms to show in pairplot, using the terms_to_show parameter

0.0.2.41:

Enabled changing protocol in pair plot
Fixed semiotic square creator
Added use_categories_as_metadata_and_replace_terms to TermDocMatrix .
Added get_metadata_doc_count_df and get_metadata_count_mat to TermDocMatrix

0.0.2.40:

Added categories to terms in pair plot halo, made them clickable

0.0.2.39:

Fixing failing test case
Adding halo to pair plot

0.0.2.38:

Fixed term preview/clickability in semiotic square plots
Fixed search box
Added preliminary produce_pairplot

0.0.2.37:

Javascript changes to support multiple plots on a single page.
Added ScatterChart.hide_terms(terms: iter[str]) which enables selected terms to be hidden from the chart.
Added ScatterChartData.score_transform to specify the function which can change an original score into a value between 0 and 1 used for term coloring.

0.0.2.36:

Added alternative_term_func to produce_scattertext_explorer which allows you to inject a function that activates when a term is clicked.
Fixed Cohen's d calculation, and added HedgesG , and unbiased version of Cohen's d which is a subclass of CohensD .
Added the frequency_transform parameter to produce_frequency_explorer . This defaults to a log transform, but allows you to use any way your heart desires to order terms along the x-axis.

0.0.2.35:

Added show_category_headings=True to produce_scattertext_explorer . Setting this to False suppresses the list of categories which will be displayed in the term context area.
Added div_name argument to produce_scattertext_explorer and name-spaced important divs and classes by div_name in HTML templates and Javascript.
Added show_cross_axes=True to produce_scattertext_explorer . Setting this to False prevents the cross axes from being displayed if show_axes is True .
Changed default scorer to RankDifference.
Made sure that term contexts were properly shown in all configurations.

0.0.2.34:

TermDocMatrix.get_metadata_freq_df now accepts the label_append argument which by default adds ' freq' to the end of each column.
TermDocMatrix.get_num_cateogires returns the number of categories in a term-document matrix.

0.0.2.33:

Added the following methods:

TermDocMatrixWithoutCategories.get_num_metadata
TermDocMatrix.use_metadata_as_categories
unified_context argument in produce_scattertext_explorer lists all contexts in a single column. This let's you see snippets organized by multiple categories in a single column. See demo_unified_context.py for an example.
helps category-free or multi-category analyses.

0.0.2.32

Added a series of objects to handle uncategorized corpora. Added section on Document-Based Scatterplots, and the add_doc_names_as_metadata function. CategoryColorAssigner was also added to assign colors to a qualitative categories.

0.0.28-31

A number of new term scoring approaches including RelativeEntropy (a direct implementation of Frankhauser et al. ( 2014)), and ZScores and implementation of the Z-Score model used in Frankhauser et al.

TermDocMatrix.get_metadata_freq_df() returns a metadata-doc corpus.

CorpusBasedTermScorer.set_ranker allows you to use a different term ranker when finding corpus-based scores. This not only lets these scorers with metadata, but also allows you to integrate once-per-document counts.

Fixed produce_projection_explorer such that it can work with a predefined set of term embeddings. This can allow, for example, the easy exploration of one hot-encoded term embeddings in addition to arbitrary lower-dimensional embeddings.

Added add_metadata to TermDocMatrix in order to inject meta data after a TermDocMatrix object has been created.

Made sure tooltip never started above the top of the web page.

0.0.2.28

Added DomainCompactor .

0.0.2.26-27.1

Fixed bug #31, enabling context to show when metadata value is clicked.

Enabled display of terms in topic models in explorer, along with the the display of customized topic models. Please see Visualizing topic models for an overview of the additions.

Removed pkg_resources from Phrasemachine, corrected demo_phrase_machine.py

Now compatible with Gensim 3.4.0.

Added characteristic explorer, produce_characteristic_explorer , to plot terms with their characteristic scores on the x-axis and their class-association scores on the y-axis. See Ordering Terms by Corpus Characteristicness for more details.

0.0.2.24-25

Added TermCategoryFrequencies in response to Issue 23. Please see Visualizing differences based on only term frequencies for more details.

Added x_axis_labels and y_axis_labels parameters to produce_scattertext_explorer . These let you include evenly-spaced string axis labels on the chart, as opposed to just "Low", "Medium" and "High". These rely on d3's ticks function, which can behave unpredictable. Caveat usor.

0.0.2.16-23.1

Semiotic Squares now look better, and have customizable labels.

Incorporated the General Inquirer lexicon. Nur für nichtkommerzielle Verwendung. The lexicon is downloaded from their homepage at the start of each use. See demo_general_inquierer.py .

Incorporated Phrasemachine from AbeHandler (Handler et al. 2016). For the license, please see PhraseMachineLicense.txt . For an example, please see demo_phrase_machine.py .

Added CompactTerms for removing redundant and infrequent terms from term document matrices. These occur if a word or phrase is always part of a larger phrase; the shorter phrase is considered redundant and removed from the corpus. See demo_phrase_machine.py for an example.

Added FourSquare , a pattern that allows for the creation of a semiotic square with separate categories for each corner. Please see demo_four_square.py for an early example.

Finally, added a way to easily perform T-SNE-style visualizations on a categorized corpus. This uses, by default, the umap-learn package. Please see demo_tsne_style.py.

Fixed to ScaledFScorePresets(one_to_neg_one=True) , added UnigramsFromSpacyDoc .

Now, when using CorpusFromPandas , a CorpusDF object is returned, instead of a Corpus object. This new type of object keeps a reference to the source data frame, and returns it via the CorpusDF.get_df() method.

The factory CorpusFromFeatureDict was added. It allows you to directly specify term counts and metadata item counts within the dataframe. Please see test_corpusFromFeatureDict.py for an example.

0.0.2.15-16

Added a very semiotic square creator.

The idea to build a semiotic square that contrasts two categories in a Term Document Matrix while using other categories as neutral categories.

See Creating semiotic squares for an overview on how to use this functionality and semiotic squares.

Added a parameter to disable the display of the top-terms sidebar, eg, produce_scattertext_explorer(..., show_top_terms=False, ...) .

An interface to part of the subjectivity/sentiment dataset from Bo Pang and Lillian Lee. ``A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts''. ACL. 2004. See SampleCorpora.RottenTomatoes .

Fixed bug that caused tooltip placement to be off after scrolling.

Made category_name and not_category_name optional in produce_scattertext_explorer etc.

Created the ability to customize tooltips via the get_tooltip_content argument to produce_scattertext_explorer etc., control axes labels via x_axis_values and y_axis_values . The color_func parameter is a Javascript function to control color of a point. Function takes a parameter which is a dictionary entry produced by ScatterChartExplorer.to_dict and returns a string.

0.0.2.14

Integration with Scikit-Learn's text-analysis pipeline led the creation of the CorpusFromScikit and TermDocMatrixFromScikit classes.

The AutoTermSelector class to automatically suggest terms to appear in the visualization.
This can make it easier to show large data sets, and remove fiddling with the various minimum term frequency parameters.

For an example of how to use CorpusFromScikit and AutoTermSelector , please see demo_sklearn.py

Also, I updated the library and examples to be compatible with spaCy 2.

Fixed bug when processing single-word documents, and set the default beta to 2.

0.0.2.11-13

Added produce_frequency_explorer function, and adding the PEP 369-compliant __version__ attribute as mentioned in #19. Fixed bug when creating visualizations with more than two possible categories. Now, by default, category names will not be title-cased in the visualization, but will retain their original case.
If you'd still like to do this this, use ScatterChart (or a descendant).to_dict(..., title_case_names=True) . Fixed DocsAndLabelsFromCorpus for Py 2 compatibility.

0.0.2.10

Fixed bugs in chinese_nlp when jieba has already been imported and in p-value computation when performing log-odds-ratio w/ prior scoring.

Added demo for performing a Monroe et. al (2008) style visualization of log-odds-ratio scores in demo_log_odds_ratio_prior.py .

0.0.2.9.*

Breaking change: pmi_filter_thresold has been replaced with pmi_threshold_coefficient .

Added Emoji and Tweet analysis. See Emoji analysis.

Characteristic terms falls back ot "Most frequent" if no terms used in the chart are present in the background corpus.

Fixed top-term calculation for custom scores.

Set scaled f-score's default beta to 0.5.

Added --spacy_language_model argument to the CLI.

Added the alternative_text_field option in produce_scattertext_explorer to show an alternative text field when showing contexts in the interactive HTML visualization.

Updated ParsedCorpus.get_unigram_corpus to allow for continued alternative_text_field functionality.

0.0.2.8.6

Added ability to for Scattertext to use noun chunks instead of unigrams and bigrams through the FeatsFromSpacyDocOnlyNounChunks class. In order to use it, run your favorite Corpus or TermDocMatrix factory, and pass in an instance of the class as a parameter:

 st.CorpusFromParsedDocuments(..., feats_from_spacy_doc=st.FeatsFromSpacyDocOnlyNounChunks())

Fixed a bug in corpus construction that occurs when the last document has no features.

0.0.2.8.5

Now you don't have to install tinysegmenter to use Scattertext. But you need to install it if you want to parse Japanese. This caused a problem when Scattertext was being installed on Windows.

0.0.2.8.1-4

Added TermDocMatrix.get_corner_score , giving an improved version of the Rudder Score. Exposing whitespace_nlp_with_sentences . It's a lightweight bad regex sentence splitter built a top a bad regex tokenizer that somewhat apes spaCy's API. Use it if you don't have spaCy and the English model downloaded or if you care more about memory footprint and speed than accuracy.

It's not compatible with word_similarity_explorer but is compatible with `word_similarity_explorer_gensim'.

Tweaked scaled f-score normalization.

Fixed Javascript bug when clicking on '$'.

0.0.2.8.0

Fixed bug in Scaled F-Score computations, and changed computation to better score words that are inversely correlated to category.

Added Word2VecFromParsedCorpus to automate training Gensim word vectors from a corpus, and
word_similarity_explorer_gensim to produce the visualization.

See demo_gensim_similarity.py for an example.

0.0.2.7.1

Added the d3_url and d3_scale_chromatic_url parameters to produce_scattertext_explorer . This provides a way to manually specify the paths to "d3.js" (ie, the file from "https://cdnjs.cloudflare.com/ajax/libs/d3/4.6.0/d3.min.js") and "d3-scale-chromatic.v1.js" (ie, the file from "https://d3js.org/d3-scale-chromatic.v1.min.js").

This is important if you're getting the error:

 Javascript error adding output!
TypeError: d3.scaleLinear is not a function
See your browser Javascript console for more details.

It also lets you use Scattertext if you're serving in an environment with no (or a restricted) external Internet connection.

For example, if "d3.min.js" and "d3-scale-chromatic.v1.min.js" were present in the current working directory, calling the following code would reference them locally instead of the remote Javascript files. See Visualizing term associations for code context.

 >>> html = st.produce_scattertext_explorer(corpus,
...          category='democrat',
...          category_name='Democratic',
...          not_category_name='Republican',
...          width_in_pixels=1000,
...          metadata=convention_df['speaker'],
...          d3_url='d3.min.js',
...          d3_scale_chromatic_url='d3-scale-chromatic.v1.min.js')

0.0.2.7.0

Fixed a bug in 0.0.2.6.0 that transposed default axis labels.

Added a Japanese mode to Scattertext. See demo_japanese.py for an example of how to use Japanese. Please run pip install tinysegmenter to parse Japanese.

Also, the chiense_mode boolean parameter in produce_scattertext_explorer has been renamed to asian_mode .

For example, the output of demo_japanese.py is:

0.0.2.6.0

Custom term positions and axis labels. Although not recommended, you can visualize different metrics on each axis in visualizations similar to Monroe et al. (2008). Please see Custom term positions for more info.

0.0.2.5.0

Enhanced the visualization of query-based categorical differences, aka the word_similarity_explorer function. When run, a plot is produced that contains category associated terms colored in either red or blue hues, and terms not associated with either class colored in greyscale and slightly smaller. The intensity of each color indicates association with the query term. Zum Beispiel:

0.0.2.4.6

Some minor bug fixes, and added a minimum_not_category_term_frequency parameter. This fixes a problem with visualizing imbalanced datasets. It sets a minimum number of times a word that does not appear in the target category must appear before it is displayed.

Added TermDocMatrix.remove_entity_tags method to remove entity type tags from the analysis.

0.0.2.4.5

Fixed matched snippet not displaying issue #9, and fixed a Python 2 issue in created a visualization using a ParsedCorpus prepared via CorpusFromParsedDocuments , mentioned in the latter part of the issue #8 discussion.

Again, Python 2 is supported in experimental mode only.

0.0.2.4.4

Corrected example links on this Readme.

Fixed a bug in Issue 8 where the HTML visualization produced by produce_scattertext_html would fail.

0.0.2.4.2

Fixed a couple issues that rendered Scattertext broken in Python 2. Chinese processing still does not work.

Note: Use Python 3.4+ if you can.

0.0.2.4.1

Fixed links in Readme, and made regex NLP available in CLI.

0.0.2.4

Added the command line tool, and fixed a bug related to Empath visualizations.

0.0.2.3

Ability to see how a particular term is discussed differently between categories through the word_similarity_explorer function.

Specialized mode to view sparse term scores.

Fixed a bug that was caused by repeated values in background unigram counts.

Added true alphabetical term sorting in visualizations.

Added an optional save-as-SVG button.

0.0.2.2

Addition option of showing characteristic terms (from the full set of documents) being considered. The option ( show_characteristic in produce_scattertext_explorer ) is on by default, but currently unavailable for Chinese. If you know of a good Chinese wordcount list, please let me know. The algorithm used to produce these is F-Score.
See this and the following slide for more details

0.0.2.1.5

Added document and word count statistics to main visualization.

0.0.2.1.4

Added preliminary support for visualizing Empath (Fast 2016) topics categories instead of emotions. See the tutorial for more information.

0.0.2.1.3

Improved term-labeling.

0.0.2.1.1

Addition of strip_final_period param to FeatsFromSpacyDoc to deal with spaCy tokenization of all-caps documents that can leave periods at the end of terms.

0.0.2.1.0

I've added support for Chinese, including the ChineseNLP class, which uses a RegExp-based sentence splitter and Jieba for word segmentation. To use it, see the demo_chinese.py file. Note that CorpusFromPandas currently does not support ChineseNLP.

In order for the visualization to work, set the asian_mode flag to True in produce_scattertext_explorer .

Quellen

2012 Convention Data: scraped from The New York Times.
count_1w: Peter Norvig assembled this file (downloaded from norvig.com). See http://norvig.com/ngrams/ for an explanation of how it was gathered from a very large corpus.
hamlet.txt: William Shakespeare. From shapespeare.mit.edu
Inspiration for text scatter plots: Rudder, Christian. Dataclysm: Who We Are (When We Think No One's Looking). Random House Incorporated, 2014.
Loncaric, Calvin. "Cozy: synthesizing collection data structures." Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 2016.
Fast, Ethan, Binbin Chen, and Michael S. Bernstein. "Empath: Understanding topic signals in large-scale text." Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. ACM, 2016.
Burt L. Monroe, Michael P. Colaresi, and Kevin M. Quinn. 2008. Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis.
Bo Pang and Lillian Lee. A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, Proceedings of the ACL, 2004.
Abram Handler, Matt Denny, Hanna Wallach, and Brendan O'Connor. Bag of what? Simple noun phrase extraction for corpus analysis. NLP+CSS Workshop at EMNLP 2016.
Peter Fankhauser, Jörg Knappen, Elke Teich. Exploring and visualizing variation in language resources. LREC 2014.
Shinichi Nakagawa and Innes C. Cuthill. Effect size, confidence interval and statistical significance: a practical guide for biologists. 2007. In Biological Reviews 82.
Cynthia M. Whissell. The dictionary of affect in language. 1993. In The Measurement of Emotions.
David Bamman, Jacob Eisenstein, and Tyler Schnoebelen. GENDER IDENTITY AND LEXICAL VARIATION IN SOCIAL MEDIA. 2014.
Rada Mihalcea, Paul Tarau. TextRank: Bringing Order into Text. EMNLP. 2004.
Frimer, JA, Boghrati, R., Haidt, J., Graham, J., & Dehgani, M. Moral Foundations Dictionary for Linguistic Analyses 2.0. Unpublished manuscript. 2019.
Jesse Graham, Jonathan Haidt, Sena Koleva, Matt Motyl, Ravi Iyer, Sean P Wojcik, and Peter H Ditto. 2013. Moral foundations theory: The pragmatic validity of moral pluralism. Advances in Experimental Social Psychology, 47, 55-130
Ryan J. Gallagher, Morgan R. Frank, Lewis Mitchell, Aaron J. Schwartz, Andrew J. Reagan, Christopher M. Danforth, and Peter Sheridan Dodds. Generalized Word Shift Graphs: A Method for Visualizing and Explaining Pairwise Comparisons Between Texts. 2020. Arxiv. https://arxiv.org/pdf/2008.02250.pdf
Kocoń, Jan; Zaśko-Zielińska, Monika and Miłkowski, Piotr, 2019, PolEmo 2.0 Sentiment Analysis Dataset for CoNLL, CLARIN-PL digital repository, http://hdl.handle.net/11321/710.
George Forman. 2008. BNS feature scaling: an improved representation over tf-idf for svm text classification. In Proceedings of the 17th ACM conference on Information and knowledge management (CIKM '08). Association for Computing Machinery, New York, NY, USA, 263–270. https://doi.org/10.1145/1458082.1458119
Anne-Kathrin Schumann. 2016. Brave new world: Uncovering topical dynamics in the ACL Anthology reference corpus using term life cycle information. In Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pages 1–11, Berlin, Germany. Association for Computational Linguistics.
Piao, SS, Bianchi, F., Dayrell, C., D'egidio, A., & Rayson, P. 2015. Development of the multilingual semantic annotation system. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 1268-1274).
Cliff, N. (1993). Dominance statistics: Ordinal analyses to answer ordinal questions. Psychological Bulletin, 114(3), 494–509. https://doi.org/10.1037/0033-2909.114.3.494
Altmann EG, Pierrehumbert JB, Motter AE (2011) Niche as a Determinant of Word Fate in Online Groups. PLoS ONE 6(5): e19009. https://doi.org/10.1371/journal.pone.0019009.

Expandieren