
SummerTime helps users choose the appropriate summarization tool for their specific task or need. It includes models, evaluation metrics, and datasets.
The library architecture is as follows:
NOTE: SummerTime is under active development, and any helpful comments are highly encouraged. Please open an issue or reach out to any of the team members.
# install extra dependencies first
pip install pyrouge@git+https://github.com/bheinzerling/pyrouge.git
pip install en_core_web_sm@https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl
# install summertime from PyPI
pip install summertime
Alternatively, to enjoy the latest features, you can install from source:
git clone [email protected]:Yale-LILY/SummerTime
pip install -e .
To use ROUGE for evaluation, also set:
export ROUGE_HOME=/usr/local/lib/python3.7/dist-packages/summ_eval/ROUGE-1.5.5/
Import a model, initialize the default model, and summarize sample documents:
from summertime import model

sample_model = model.summarizer()
documents = [
""" PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions.
The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected
by the shutoffs which were expected to last through at least midday tomorrow."""
]
sample_model.summarize(documents)
# ["California's largest electricity provider has turned off power to hundreds of thousands of customers."]
Also see our Colab notebook for a more hands-on demo and more examples.
SummerTime supports different models (e.g., TextRank, BART, Longformer) as well as model wrappers for more complex summarization tasks (e.g., JointModel for multi-doc summarization, BM25 retrieval for query-based summarization). Several multilingual models (mT5 and mBART) are also supported.
| Model | Single-doc | Multi-doc | Dialogue-based | Query-based | Multilingual |
|---|---|---|---|---|---|
| BartModel | ✔️ | | | | |
| BM25SummModel | | | | ✔️ | |
| HMNetModel | | | ✔️ | | |
| LexRankModel | ✔️ | | | | |
| LongformerModel | ✔️ | | | | |
| MBartModel | ✔️ | | | | 50 languages (full list here) |
| MT5Model | ✔️ | | | | 101 languages (full list here) |
| TranslationPipelineModel | ✔️ | | | | ~70 languages |
| MultiDocJointModel | | ✔️ | | | |
| MultiDocSeparateModel | | ✔️ | | | |
| PegasusModel | ✔️ | | | | |
| TextRankModel | ✔️ | | | | |
| TFIDFSummModel | | | | ✔️ | |
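The wrapper types above change what summarize() expects as input. As a plain-data illustration (the mock function and variable names below are ours, not part of SummerTime): single-doc models take a flat list of document strings, multi-doc wrappers take a list of document clusters, and query-based wrappers additionally accept a list of queries.

```python
from typing import List, Union

# Single-doc input: one string per document.
single_doc_corpus: List[str] = ["document one ...", "document two ..."]
# Multi-doc input: each inner list is one cluster of related documents.
multi_doc_corpus: List[List[str]] = [
    ["cluster 1, doc A", "cluster 1, doc B"],
    ["cluster 2, doc A"],
]
# Query-based input: one query per document (or cluster).
queries: List[str] = ["What caused the blackouts?"]

def mock_summarize(corpus: Union[List[str], List[List[str]]],
                   queries: List[str] = None) -> List[str]:
    """Toy stand-in for a summarizer: one (fake) summary per document or cluster."""
    summaries = []
    for item in corpus:
        text = " ".join(item) if isinstance(item, list) else item
        summaries.append(text[:20])  # truncation stands in for real summarization
    return summaries

print(mock_summarize(multi_doc_corpus))  # one (fake) summary per cluster
```

The real summarize() signature, shown later in this README, follows the same `Union[List[str], List[List[str]]]` shape.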
To see all supported models, run:
from summertime.model import SUPPORTED_SUMM_MODELS
print(SUPPORTED_SUMM_MODELS)

from summertime import model

# To use a default model
default_model = model.summarizer()

# Or a specific model
bart_model = model.BartModel()
pegasus_model = model.PegasusModel()
lexrank_model = model.LexRankModel()
textrank_model = model.TextRankModel()

Users can easily access documentation to assist with model selection:

default_model.show_capability()
pegasus_model.show_capability()
textrank_model.show_capability()

To use a model for summarization, simply run:
documents = [
""" PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions.
The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected
by the shutoffs which were expected to last through at least midday tomorrow."""
]
default_model.summarize(documents)
# or
pegasus_model.summarize(documents)

All models can be initialized with the following optional parameters:
def __init__(self,
             trained_domain: str = None,
             max_input_length: int = None,
             max_output_length: int = None,
             ):

All models implement the following methods:

def summarize(self,
              corpus: Union[List[str], List[List[str]]],
              queries: List[str] = None) -> List[str]:

def show_capability(cls) -> None:

SummerTime supports summarization datasets across different domains, e.g., CNN/DM (news articles), SAMSum (dialogue corpus), QMSum (query-based dialogue corpus), Multi-News (multi-document corpus), MLSum (multilingual corpus), PubMedQA (medical domain), ArXiv (scientific papers domain), and others.
| Dataset | Domain | # Examples | Src. length | Tgt. length | Query | Multi-doc | Dialogue | Multilingual |
|---|---|---|---|---|---|---|---|---|
| ArxivDataset | Scientific articles | 215k | 4.9k | 220 | | | | |
| CnndmDataset (3.0.0) | News | 300k | 781 | 56 | | | | |
| MlsumDataset | Multilingual news | 1.5M+ | 632 | 34 | | ✔️ | | German, Spanish, French, Russian, Turkish |
| MultinewsDataset | News | 56k | 2.1k | 263.8 | | ✔️ | | |
| SamsumDataset | Open domain | 16k | 94 | 20 | | | ✔️ | |
| PubmedqaDataset | Medical | 272k | 244 | 32 | ✔️ | | | |
| QMsumDataset | Meetings | 1k | 9.0k | 69.6 | ✔️ | | ✔️ | |
| ScisummnetDataset | Scientific articles | 1k | 4.7k | 150 | | | | |
| SummscreenDataset | TV shows | 26.9k | 6.6k | 337.4 | | | ✔️ | |
| XsumDataset | News | 226k | 431 | 23.3 | | | | |
| XlsumDataset | News | 1.35m | ??? | ??? | | | | 45 languages (see documentation) |
| MassivesummDataset | News | 12m+ | ??? | ??? | | | | 78 languages (see the multilingual summarization section of this README) |
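The source-length statistics above matter when pairing a dataset with a model: models are initialized with a max_input_length, so long sources (e.g., QMSum's ~9.0k-token inputs) must be truncated or split to fit. A minimal whitespace-token truncation helper (our illustration, not a SummerTime API):

```python
def truncate_tokens(text: str, max_input_length: int) -> str:
    """Keep only the first max_input_length whitespace-delimited tokens."""
    tokens = text.split()
    return " ".join(tokens[:max_input_length])

doc = "PG&E stated it scheduled the blackouts in response to forecasts"
print(truncate_tokens(doc, 5))  # PG&E stated it scheduled the
```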
To see all supported datasets, run:
from summertime import dataset
print(dataset.list_all_dataset())

from summertime import dataset

cnn_dataset = dataset.CnndmDataset()
# or
xsum_dataset = dataset.XsumDataset()
# ...etc

All datasets are implementations of the SummDataset class. Their data splits can be accessed as follows:

dataset = dataset.CnndmDataset()

train_data = dataset.train_set
dev_data = dataset.dev_set
test_data = dataset.test_set

To see details of a dataset, run:

dataset = dataset.CnndmDataset()
dataset.show_description()

The data in all datasets is contained in SummInstance class objects, which have the following properties:

data_instance.source = source    # either `List[str]` or `str`, depending on the dataset itself; string joining may be needed to fit into specific models
data_instance.summary = summary  # a string summary that serves as ground truth
data_instance.query = query      # optional, present when the instance has a string query
print(data_instance)             # to print the data instance in its entirety

Data is loaded using a generator to save on space and time:
data_instance = next(cnn_dataset.train_set)
print(data_instance)

import itertools

# Get a slice from the train set generator - first 5 instances
train_set = itertools.islice(cnn_dataset.train_set, 5)

corpus = [instance.source for instance in train_set]
print(corpus)

You can use custom data with the CustomDataset class, which loads your data into the SummerTime dataset class:
from summertime.dataset import CustomDataset

''' The train_set, test_set and validation_set have the following format:
    List[Dict], a list of dictionaries that each contain one data instance.
    Each dictionary is of the form:
        {"source": "source_data", "summary": "summary_data", "query": "query_data"}
        * source_data is either of type List[str] or str
        * summary_data is of type str
        * query_data is of type str
    The list of dictionaries looks as follows:
        [dictionary_instance_1, dictionary_instance_2, ...]
'''

# Create sample data
train_set = [
    {
        "source": "source1",
        "summary": "summary1",
        "query": "query1",  # only included if a query is present
    }
]
validation_set = [
    {
        "source": "source2",
        "summary": "summary2",
        "query": "query2",
    }
]
test_set = [
    {
        "source": "source3",
        "summary": "summary3",
        "query": "query3",
    }
]

# Depending on the dataset properties, you can specify the type of dataset,
# i.e. multi_doc, query_based, dialogue_based. If not specified, they default to False.
custom_dataset = CustomDataset(
    train_set=train_set,
    validation_set=validation_set,
    test_set=test_set,
    query_based=True,
    multi_doc=True,
    dialogue_based=False)
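Before constructing a CustomDataset, it can help to sanity-check that each split matches the dictionary format described above. The validator below is our own illustration, not part of SummerTime:

```python
from typing import Dict, List

def validate_instances(split: List[Dict]) -> None:
    """Raise ValueError if any instance violates the CustomDataset dict format."""
    for i, instance in enumerate(split):
        if "source" not in instance or "summary" not in instance:
            raise ValueError(f"instance {i} must have 'source' and 'summary' keys")
        if not isinstance(instance["source"], (str, list)):
            raise ValueError(f"instance {i}: 'source' must be str or List[str]")
        if not isinstance(instance["summary"], str):
            raise ValueError(f"instance {i}: 'summary' must be a str")
        if "query" in instance and not isinstance(instance["query"], str):
            raise ValueError(f"instance {i}: 'query' must be a str")

# Passes silently for a well-formed instance.
validate_instances([{"source": "source1", "summary": "summary1", "query": "query1"}])
```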
import itertools
from summertime import dataset, model

cnn_dataset = dataset.CnndmDataset()

# Get a slice of the train set - first 5 instances
train_set = itertools.islice(cnn_dataset.train_set, 5)
corpus = [instance.source for instance in train_set]

# Example 1 - a traditional non-neural model
# LexRank model
lexrank = model.LexRankModel(corpus)
print(lexrank.show_capability())

lexrank_summary = lexrank.summarize(corpus)
print(lexrank_summary)

# Example 2 - a spaCy pipeline for TextRank (another non-neural extractive summarization model)
# TextRank model
textrank = model.TextRankModel()
print(textrank.show_capability())

textrank_summary = textrank.summarize(corpus)
print(textrank_summary)

# Example 3 - a neural model that handles long texts
# Longformer model
longformer = model.LongFormerModel()
longformer.show_capability()

longformer_summary = longformer.summarize(corpus)
print(longformer_summary)

The summarize() method of multilingual models automatically checks the language of the input documents.
Single-doc multilingual models can be initialized and used in the same way as monolingual models. They return an error if the input is in a language the model does not support.
import itertools
from summertime import dataset as datasets
from summertime import model as st_model

mbart_model = st_model.MBartModel()
mt5_model = st_model.MT5Model()

# load Spanish portion of MLSum dataset
mlsum = datasets.MlsumDataset(["es"])

corpus = itertools.islice(mlsum.train_set, 5)
corpus = [instance.source for instance in corpus]

# mt5 model will automatically detect Spanish as the language and indicate that this is supported!
mt5_model.summarize(corpus)

Our implementation of the MassiveSumm dataset currently supports languages including: Amharic, Arabic, Assamese, Aymara, Azerbaijani, Bambara, Bengali, Bosnian, Bulgarian, Catalan, Czech, Gujarati, Haitian, Hausa, Hebrew, Hindi, Croatian, Hungarian, Armenian, Igbo, Indonesian, Icelandic, Italian, Japanese, Kannada, Georgian, Khmer, Kinyarwanda, Kyrgyz, Korean, Kurdish, Lao, Latvian, Lingala, Lithuanian, Malayalam, Marathi, Macedonian, Malagasy, Mongolian, Burmese, South Ndebele, Nepali, Dutch, Oriya, Oromo, Punjabi, Polish, Portuguese, Dari, Tajik, Thai, Tigrinya, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Xhosa, Yoruba, Yue Chinese, Chinese, Bislama, and Gaelic.
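The detect-then-dispatch behavior described above can be sketched with plain data. This is only an illustration of the idea (the language sets and route function below are ours, not the SummerTime implementation): check the detected language against each model's supported set and fail loudly otherwise.

```python
# Stand-in supported-language subsets; the real lists are much longer.
SUPPORTED = {
    "MT5Model": {"en", "es", "fr", "de"},
    "TranslationPipelineModel": {"sw", "am"},
}

def route(detected_lang: str) -> str:
    """Return the name of a model that supports the detected language."""
    for model_name, langs in SUPPORTED.items():
        if detected_lang in langs:
            return model_name
    raise ValueError(f"language {detected_lang!r} is not supported by any model")

print(route("es"))  # MT5Model
```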
SummerTime supports different evaluation metrics, including: BERTScore, BLEU, METEOR, ROUGE, and ROUGE-WE.

To print all supported metrics:

from summertime.evaluation import SUPPORTED_EVALUATION_METRICS
print(SUPPORTED_EVALUATION_METRICS)

import summertime.evaluation as st_eval

bert_eval = st_eval.bertscore()
bleu_eval = st_eval.bleu_eval()
meteor_eval = st_eval.meteor()
rouge_eval = st_eval.rouge()
rougewe_eval = st_eval.rougewe()

All evaluation metrics can be initialized with the following optional arguments:

def __init__(self, metric_name):

All evaluation metric objects implement the following methods:

def evaluate(self, model, data):

def get_dict(self, keys):

Get sample summarization data:
import itertools
from summertime.evaluation.base_metric import SummMetric
from summertime.evaluation import Rouge, RougeWe, BertScore

# Evaluate the model on a subset of cnn_dailymail
# Get a slice of the train set - first 5 instances
train_set = itertools.islice(cnn_dataset.train_set, 5)
corpus = [instance for instance in train_set]
print(corpus)

articles = [instance.source for instance in corpus]

summaries = sample_model.summarize(articles)
targets = [instance.summary for instance in corpus]

Evaluate the data on different metrics:
from summertime.evaluation import BertScore, Rouge, RougeWe

# Calculate BertScore
bert_metric = BertScore()
bert_score = bert_metric.evaluate(summaries, targets)
print(bert_score)

# Calculate Rouge
rouge_metric = Rouge()
rouge_score = rouge_metric.evaluate(summaries, targets)
print(rouge_score)

# Calculate RougeWe
rougewe_metric = RougeWe()
rougewe_score = rougewe_metric.evaluate(summaries, targets)
print(rougewe_score)

Given a SummerTime dataset, you can use the pipelines.assemble_model_pipeline function to retrieve a list of initialized SummerTime models that are compatible with the provided dataset.
from summertime.pipeline import assemble_model_pipeline
from summertime.dataset import CnndmDataset, QMsumDataset

single_doc_models = assemble_model_pipeline(CnndmDataset)
# [
#   (<model.single_doc.bart_model.BartModel object at 0x7fcd43aa12e0>, 'BART'),
#   (<model.single_doc.lexrank_model.LexRankModel object at 0x7fcd43aa1460>, 'LexRank'),
#   (<model.single_doc.longformer_model.LongformerModel object at 0x7fcd43b17670>, 'Longformer'),
#   (<model.single_doc.pegasus_model.PegasusModel object at 0x7fccb84f2910>, 'Pegasus'),
#   (<model.single_doc.textrank_model.TextRankModel object at 0x7fccb84f2880>, 'TextRank')
# ]

query_based_multi_doc_models = assemble_model_pipeline(QMsumDataset)
# [
#   (<model.query_based.tf_idf_model.TFIDFSummModel object at 0x7fc9c9c81e20>, 'TF-IDF (HMNET)'),
#   (<model.query_based.bm25_model.BM25SummModel object at 0x7fc8b4fa8c10>, 'BM25 (HMNET)')
# ]
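Conceptually, assemble_model_pipeline matches the task flags of a dataset against the capabilities of each model, as in the model table earlier in this README. A toy sketch of that matching (the names, flags, and assemble function are illustrative, not the SummerTime implementation):

```python
# Toy capability flags; SummerTime derives these from the models and datasets themselves.
MODEL_CAPS = {
    "BART": {"single_doc": True},
    "TextRank": {"single_doc": True},
    "BM25 (HMNET)": {"query_based": True},
    "TF-IDF (HMNET)": {"query_based": True},
}

def assemble(dataset_flags: set) -> list:
    """Return models whose capabilities cover every flag the dataset requires."""
    return [name for name, caps in MODEL_CAPS.items()
            if all(caps.get(flag, False) for flag in dataset_flags)]

print(assemble({"query_based"}))  # ['BM25 (HMNET)', 'TF-IDF (HMNET)']
```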
SummerTime also provides a ModelSelector for comparing the performance of different models on a given dataset across a set of metrics:
# Get test data
import itertools
from summertime.dataset import XsumDataset
from summertime.model import BartModel, PegasusModel
from summertime.evaluation import SUPPORTED_EVALUATION_METRICS
from summertime.evaluation.model_selector import ModelSelector

# Get a slice of the train set - first 100 instances
sample_dataset = XsumDataset()
sample_data = itertools.islice(sample_dataset.train_set, 100)
generator1 = iter(sample_data)
generator2 = iter(sample_data)

bart_model = BartModel()
pegasus_model = PegasusModel()
models = [bart_model, pegasus_model]
metrics = [metric() for metric in SUPPORTED_EVALUATION_METRICS]

selector = ModelSelector(models, generator1, metrics)
table = selector.run()
print(table)

visualization = selector.visualize(table)

# Or evaluate with run_halving for a faster search
new_selector = ModelSelector(models, generator2, metrics)
smart_table = new_selector.run_halving(min_instances=2, factor=2)
print(smart_table)

visualization_smart = selector.visualize(smart_table)

# Visualize per-instance errors for selected metrics
from summertime.evaluation.error_viz import scatter

keys = ("bert_score_f1", "bleu", "rouge_1_f_score", "rouge_2_f_score", "rouge_l_f_score", "rouge_we_3_f", "meteor")
scatter(models, sample_data, metrics[1:3], keys=keys[1:3], max_instances=5)
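Judging by its name and parameters, run_halving(min_instances, factor) evaluates models in a successive-halving style: score all models on a small number of instances, drop the weaker half, and repeat with factor times more instances. A stdlib sketch of that idea with toy model names and a toy scoring function (our illustration, not the SummerTime implementation):

```python
def successive_halving(models, score_fn, min_instances=2, factor=2, rounds=3):
    """Keep the better half of the models each round, growing the instance budget."""
    budget = min_instances
    survivors = list(models)
    for _ in range(rounds):
        if len(survivors) == 1:
            break
        ranked = sorted(survivors, key=lambda m: score_fn(m, budget), reverse=True)
        survivors = ranked[: max(1, len(ranked) // 2)]  # keep the top half
        budget *= factor  # next round: evaluate survivors on more instances
    return survivors

# Toy example: "score" is just the name length, standing in for an averaged metric.
best = successive_halving(["bart", "pegasus", "t5", "lexrank"],
                          score_fn=lambda name, n: len(name))
print(best)  # ['pegasus']
```

Because weak candidates are eliminated on cheap, small budgets, far fewer expensive full-data evaluations are needed than in an exhaustive comparison.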
Create a pull request and name it [your_gh_username]/[your_branch_name]. If needed, resolve your own branch's merge conflicts with main. Do not push directly to main.

If you haven't already, install black and flake8:

pip install black
pip install flake8

Before pushing commits or merging branches, run the following commands from the project root. Note that black will write to the files, and you should add and commit the changes black makes before pushing:

black .
flake8 .

Or, if you would like to lint specific files:

black path/to/specific/file.py
flake8 path/to/specific/file.py

Ensure that black does not reformat any files and that flake8 does not print any errors. If you would like to override or ignore any of the preferences or practices enforced by black or flake8, please leave a comment in your PR for any lines of code that generate warning or error logs. Do not directly edit config files such as setup.cfg.

See the black documentation and the flake8 documentation for documentation on installation, ignoring files/lines, and advanced usage. In addition, the following may be useful:

black [file.py] --diff to preview changes as diffs instead of applying them directly
black [file.py] --check to preview changes with status codes instead of applying them directly
git diff -u | flake8 --diff to run flake8 only on the lines changed in your working branch

Note that our CI test suite will include invoking black --check . and flake8 --count . on all non-unittest and non-submodule Python files, and all tests require zero error-level output to pass.

Our continuous integration system is provided through GitHub Actions. When any pull request is created or updated, or whenever main is updated, the repository's unit tests will be run as build jobs on tangra for that pull request. Build jobs will either pass or fail within a few minutes, and build status and logs are visible under Actions. Please ensure that all checks pass (i.e. all steps in all jobs run to completion) before merging, or request a review. To skip the build on any particular commit, append [skip ci] to the commit message. Note that PRs with the substring /no-ci/ anywhere in the branch name will not be included in CI.
This repository is built by the LILY Lab at Yale University, led by Prof. Dragomir Radev. The main contributors are Ansong Ni, Zhangir Azerbayev, Troy Feng, Murori Mutuma, Hailey Schoelkopf, and Yusen Zhang (Penn State).

If you use SummerTime in your work, please consider citing:
@article{ni2021summertime,
title={SummerTime: Text Summarization Toolkit for Non-experts},
author={Ansong Ni and Zhangir Azerbayev and Mutethia Mutuma and Troy Feng and Yusen Zhang and Tao Yu and Ahmed Hassan Awadallah and Dragomir Radev},
journal={arXiv preprint arXiv:2108.12738},
year={2021}
}
For comments and questions, please open an issue.