TextDescriptives
v2.8.4
A Python library for calculating a large variety of metrics from text, using spaCy v3 pipeline components and extensions.
pip install textdescriptives
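Components are called with `textdescriptives/{metric_name}`. There is a new `coherence` component for calculating the semantic coherence between sentences. See the documentation for tutorials and more information!

Use `extract_metrics` to quickly extract the metrics you want. To see the available methods, you can simply run: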
import textdescriptives as td
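td.get_valid_metrics()
# {'quality', 'readability', 'all', 'descriptive_stats', 'dependency_distance', 'pos_proportions', 'information_theory', 'coherence'}

Set the `spacy_model` parameter to specify which spaCy model to use; otherwise, TextDescriptives will automatically download an appropriate one based on `lang`. If `lang` is set, `spacy_model` is not needed, and vice versa.
Specify which metrics to extract in the `metrics` argument. `None` extracts all metrics.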
import textdescriptives as td
text = "The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it."
# will automatically download the relevant model (`en_core_web_lg`) and extract all metrics
df = td.extract_metrics(text=text, lang="en", metrics=None)
# specify spaCy model and which metrics to extract
df = td.extract_metrics(text=text, spacy_model="en_core_web_lg", metrics=["readability", "coherence"])

To integrate with other spaCy pipelines, import the library and add the component(s) to your pipeline using the standard spaCy syntax. The available components are `descriptive_stats`, `readability`, `dependency_distance`, `pos_proportions`, `coherence`, and `quality`, each prefixed with `textdescriptives/`.
If you want to add all components, you can use the shorthand `textdescriptives/all`.
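The naming convention described above can be sketched as a small helper that maps metric names to pipeline component names. Note that `component_names` and `AVAILABLE` are hypothetical names introduced here for illustration and are not part of TextDescriptives; the metric names are the ones listed above.

```python
# Hypothetical helper (not part of TextDescriptives): builds the
# "textdescriptives/{metric_name}" component names described above.
AVAILABLE = {
    "descriptive_stats",
    "readability",
    "dependency_distance",
    "pos_proportions",
    "coherence",
    "quality",
}


def component_names(metrics=None):
    """Return spaCy component names for the given metric names.

    Passing None returns the shorthand that adds every component.
    """
    if metrics is None:
        return ["textdescriptives/all"]
    unknown = sorted(set(metrics) - AVAILABLE)
    if unknown:
        raise ValueError(f"Unknown metrics: {unknown}")
    return [f"textdescriptives/{name}" for name in metrics]


print(component_names(["readability", "coherence"]))
```

Each returned name could then be passed to `nlp.add_pipe(...)` on a loaded spaCy pipeline, in the same way as the `textdescriptives/all` shorthand.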
import spacy
import textdescriptives as td
# load your favourite spacy model (remember to install it first using e.g. `python -m spacy download en_core_web_sm`)
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textdescriptives/all")
doc = nlp("The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.")
# access some of the values
doc._.readability
doc._.token_length

TextDescriptives includes convenience functions for extracting metrics from a `Doc` to a Pandas DataFrame or a dictionary.
td.extract_dict(doc)
td.extract_df(doc)

| | text | first_order_coherence | second_order_coherence | pos_prop_DET | pos_prop_NOUN | pos_prop_AUX | pos_prop_VERB | pos_prop_PUNCT | pos_prop_PRON | pos_prop_ADP | pos_prop_ADV | pos_prop_SCONJ | flesch_reading_ease | flesch_kincaid_grade | smog | gunning_fog | automated_readability_index | coleman_liau_index | lix | rix | n_stop_words | alpha_ratio | mean_word_length | doc_length | proportion_ellipsis | proportion_bullet_points | duplicate_line_chr_fraction | duplicate_paragraph_chr_fraction | duplicate_5-gram_chr_fraction | duplicate_6-gram_chr_fraction | duplicate_7-gram_chr_fraction | duplicate_8-gram_chr_fraction | duplicate_9-gram_chr_fraction | duplicate_10-gram_chr_fraction | top_2-gram_chr_fraction | top_3-gram_chr_fraction | top_4-gram_chr_fraction | symbol_to_word_ratio_# | contains_lorem ipsum | passed_quality_check | dependency_distance_mean | dependency_distance_std | prop_adjacent_dependency_relation_mean | prop_adjacent_dependency_relation_std | token_length_mean | token_length_median | token_length_std | sentence_length_mean | sentence_length_median | sentence_length_std | syllables_per_token_mean | syllables_per_token_median | syllables_per_token_std | n_tokens | n_unique_tokens | proportion_unique_tokens | n_characters | n_sentences |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The world is changed(...) | 0.633002 | 0.573323 | 0.097561 | 0.121951 | 0.0731707 | 0.170732 | 0.146341 | 0.195122 | 0.0731707 | 0.0731707 | 0.0487805 | 107.879 | -0.0485714 | 5.68392 | 3.94286 | -2.45429 | -0.708571 | 12.7143 | 0.4 | 24 | 0.853659 | 2.95122 | 41 | 0 | 0 | 0 | 0 | 0.232258 | 0.232258 | 0 | 0 | 0 | 0 | 0.0580645 | 0.174194 | 0 | 0 | False | False | 1.77524 | 0.553188 | 0.457143 | 0.0722806 | 3.28571 | 3 | 1.54127 | 7 | 6 | 3.09839 | 1.08571 | 1 | 0.368117 | 35 | 23 | 0.657143 | 121 | 5 |
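
Some columns in the row above appear to be related by simple arithmetic, which can help when reading the table. The sketch below recomputes `proportion_unique_tokens` from `n_tokens` and `n_unique_tokens` using the values shown; that the library derives the proportion as this plain ratio is an assumption based on the column names, not something stated here.

```python
# Values copied from the example row above.
n_tokens = 35
n_unique_tokens = 23
reported_proportion = 0.657143

# Assumption: the proportion is simply unique tokens divided by all tokens.
recomputed = n_unique_tokens / n_tokens

print(round(recomputed, 6))  # 0.657143
print(abs(recomputed - reported_proportion) < 1e-6)  # True
```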
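TextDescriptives has detailed documentation as well as a series of Jupyter notebook tutorials. All the tutorials are located in the `docs/tutorials` folder and can also be found on the documentation website.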
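| Documentation | |
|---|---|
| Getting started | Guides and instructions on how to use TextDescriptives and its features. |
| Demo | A live demo of TextDescriptives. |
| Tutorials | Detailed tutorials on how to make the most of TextDescriptives. |
| News and changelog | New additions, changes and version history. |
| API reference | The detailed reference for TextDescriptives' API, including function documentation. |
| Paper | The preprint of the TextDescriptives paper. |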