TextDescriptives
v2.8.4
A Python library for calculating a large variety of metrics from text using spaCy v3 pipeline components and extensions.
pip install textdescriptives
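Components are called with `textdescriptives/{metric_name}`. A new `coherence` component calculates the semantic coherence between sentences. See the documentation for tutorials and more information!

Use `extract_metrics` to quickly extract the metrics you need. To see the available metrics, simply run: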
import textdescriptives as td
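td.get_valid_metrics()
# {'quality', 'readability', 'all', 'descriptive_stats', 'dependency_distance', 'pos_proportions', 'information_theory', 'coherence'}

Set the `spacy_model` parameter to specify which spaCy model to use; otherwise, TextDescriptives automatically downloads an appropriate model based on `lang`. If `lang` is set, `spacy_model` is not needed, and vice versa.
Specify which metrics to extract in the `metrics` argument. `None` extracts all metrics.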
import textdescriptives as td
text = "The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it."
# will automatically download the relevant model (`en_core_web_lg`) and extract all metrics
df = td.extract_metrics(text=text, lang="en", metrics=None)
# specify spaCy model and which metrics to extract
df = td.extract_metrics(text=text, spacy_model="en_core_web_lg", metrics=["readability", "coherence"])

To integrate with other spaCy pipelines, import the library and add the component(s) to your pipeline using the standard spaCy syntax. Available components are `descriptive_stats`, `readability`, `dependency_distance`, `pos_proportions`, `coherence`, and `quality`, each prefixed with `textdescriptives/`.
If you want to add all components, you can use the shorthand `textdescriptives/all`.
import spacy
import textdescriptives as td
# load your favourite spacy model (remember to install it first using e.g. `python -m spacy download en_core_web_sm`)
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textdescriptives/all")
doc = nlp("The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.")
# access some of the values
doc._.readability
doc._.token_length

TextDescriptives includes convenience functions for extracting metrics from a `Doc` to a Pandas DataFrame or a dictionary.

td.extract_dict(doc)
td.extract_df(doc)

| | text | first_order_coherence | second_order_coherence | pos_prop_det | pos_prop_noun | pos_prop_aux | pos_prop_verb | pos_prop_punct | pos_prop_pron | pos_prop_adp | pos_prop_adv | pos_prop_sconj | flesch_reading_ease | flesch_kincaid_grade | smog | gunning_fog | automated_readability_index | coleman_liau_index | lix | rix | n_stop_words | alpha_ratio | mean_word_length | doc_length | proportion_ellipsis | proportion_bullet_points | duplicate_line_chr_fraction | duplicate_paragraph_chr_fraction | duplicate_5-gram_chr_fraction | duplicate_6-gram_chr_fraction | duplicate_7-gram_chr_fraction | duplicate_8-gram_chr_fraction | duplicate_9-gram_chr_fraction | duplicate_10-gram_chr_fraction | top_2-gram_chr_fraction | top_3-gram_chr_fraction | top_4-gram_chr_fraction | symbol_#_to_word_ratio | contains_lorem ipsum | passed_quality_check | dependency_distance_mean | dependency_distance_std | prop_adjacent_dependency_relation_mean | prop_adjacent_dependency_relation_std | token_length_mean | token_length_median | token_length_std | sentence_length_mean | sentence_length_median | sentence_length_std | syllables_per_token_mean | syllables_per_token_median | syllables_per_token_std | n_tokens | n_unique_tokens | proportion_unique_tokens | n_characters | n_sentences |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The world is changed(...) | 0.633002 | 0.573323 | 0.097561 | 0.121951 | 0.0731707 | 0.170732 | 0.146341 | 0.195122 | 0.0731707 | 0.0731707 | 0.0487805 | 107.879 | -0.0485714 | 5.68392 | 3.94286 | -2.45429 | -0.708571 | 12.7143 | 0.4 | 24 | 0.853659 | 2.95122 | 41 | 0 | 0 | 0 | 0 | 0.232258 | 0.232258 | 0 | 0 | 0 | 0 | 0.0580645 | 0.174194 | 0 | 0 | False | False | 1.77524 | 0.553188 | 0.457143 | 0.0722806 | 3.28571 | 3 | 1.54127 | 7 | 6 | 3.09839 | 1.08571 | 1 | 0.368117 | 35 | 23 | 0.657143 | 121 | 5 |
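
To make the readability columns in the row above easier to interpret, here is a minimal sketch that recomputes two of them from the other columns using the standard Flesch formulas. It assumes that `sentence_length_mean` is words per sentence and `syllables_per_token_mean` is syllables per word; the two helper functions are hypothetical and are not part of the TextDescriptives API, which may compute these values differently internally.

```python
# Values copied from the example row above (rounded as shown in the table).
sentence_length_mean = 7            # assumed: average words per sentence
syllables_per_token_mean = 1.08571  # assumed: average syllables per word


def flesch_reading_ease(words_per_sentence, syllables_per_word):
    """Standard Flesch Reading Ease formula (hypothetical helper)."""
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


def flesch_kincaid_grade(words_per_sentence, syllables_per_word):
    """Standard Flesch-Kincaid Grade Level formula (hypothetical helper)."""
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


fre = flesch_reading_ease(sentence_length_mean, syllables_per_token_mean)
fkg = flesch_kincaid_grade(sentence_length_mean, syllables_per_token_mean)

# Compare against the flesch_reading_ease and flesch_kincaid_grade columns.
print(round(fre, 3))
print(round(fkg, 3))
```

Up to the rounding of the table inputs, both results agree with the `flesch_reading_ease` (107.879) and `flesch_kincaid_grade` (-0.0485714) values reported for this text.
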
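TextDescriptives has detailed documentation as well as a series of Jupyter notebook tutorials. All tutorials are located in the `docs/tutorials` folder and can also be found on the documentation website.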
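| Documentation | |
|---|---|
| Getting started | Guides and instructions on how to use TextDescriptives and its features. |
| Demo | A live demo of TextDescriptives. |
| Tutorials | Detailed tutorials on how to make the most of TextDescriptives. |
| News and changelog | New additions, changes, and version history. |
| API reference | The detailed reference for TextDescriptives' API, including function documentation. |
| Paper | The preprint of the TextDescriptives paper. |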