A GPT2-based Chinese news title generation project with detailed code annotations.
Run the code:

`streamlit run app.py`

or, to specify a port:

`streamlit run app.py --server.port your_port`
The details are shown in the figure below:

*(figure: demo page screenshot)*
| Dataset | Original data / project link | Processed-file download link |
|---|---|---|
| Tsinghua news data | address | Baidu Netdisk, extraction code: vhol |
| Sogou news data | address | Baidu Netdisk, extraction code: ode6 |
| NLPCC 2017 summarization data | address | Baidu Netdisk, extraction code: e0zq |
| CSL summarization data | address | Baidu Netdisk, extraction code: 0qot |
| Education and training industry summarization data | address | Baidu Netdisk, extraction code: kjz3 |
| LCSTS summarization data | address | Baidu Netdisk, extraction code: bzov |
| Shence Cup 2018 summarization data | address | Baidu Netdisk, extraction code: 6f4f |
| Wanfang summarization data | address | Baidu Netdisk, extraction code: p69g |
| WeChat official account summarization data | address | Baidu Netdisk, extraction code: 5has |
| Weibo data | address | Baidu Netdisk, extraction code: 85t5 |
| news2016zh news data | address | Baidu Netdisk, extraction code: qsj1 |

Full collection of the datasets above: Baidu Netdisk, extraction code: 7am8
For the runtime environment, see the requirements.txt file.
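The dependencies can be installed in the usual way:

```
pip install -r requirements.txt
```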
The data comes from Sina Weibo; data link: https://www.jianshu.com/p/8f52352f0748?tdsourcetag=s_pcqq_aiomsg
| Data description | Download link |
|---|---|
| Raw data | Baidu Netdisk, extraction code: nqzi |
| Processed data | Baidu Netdisk, extraction code: duba |
The raw data is news data downloaded directly from the Internet; the processed data has been preprocessed with data_helper.py and can be used directly for training. An illustrative sketch of such preprocessing is shown below.
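The exact logic lives in data_helper.py; purely as an illustration (the raw-file format, field names, and split ratio here are assumptions), preprocessing along these lines would turn raw news records into content/title pairs and split them into the train and test JSON files used later:

```python
import json
import random

def clean_text(text):
    # Normalize whitespace; the real cleaning in data_helper.py may differ.
    return " ".join(text.split())

def build_datasets(raw_path, train_path, test_path, test_ratio=0.05, seed=2020):
    # Load raw news records; each record is assumed to have "content" and "title".
    with open(raw_path, "r", encoding="utf-8") as f:
        raw = json.load(f)

    samples = [{"content": clean_text(r["content"]),
                "title": clean_text(r["title"])} for r in raw]

    # Shuffle, then split into a training set and a held-out test set.
    random.seed(seed)
    random.shuffle(samples)
    n_test = int(len(samples) * test_ratio)

    with open(train_path, "w", encoding="utf-8") as f:
        json.dump(samples[n_test:], f, ensure_ascii=False, indent=2)
    with open(test_path, "w", encoding="utf-8") as f:
        json.dump(samples[:n_test], f, ensure_ascii=False, indent=2)

if __name__ == "__main__":
    build_datasets("raw_news.json", "data_dir/train_data.json", "data_dir/test_data.json")
```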
For the model parameters, see the config/config.json file:
| Parameter | Value |
|---|---|
| initializer_range | 0.02 |
| layer_norm_epsilon | 1e-05 |
| n_ctx | 512 |
| n_embd | 768 |
| n_head | 12 |
| n_layer | 6 |
| n_positions | 512 |
| vocab_size | 13317 |
Note: in addition to the token (word) embeddings, the model input also includes segment (paragraph) embeddings and position embeddings (see the sketch below).
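A minimal sketch of this input construction (not the project's actual code; it assumes two segment types, content vs. title, and that the three embeddings are summed, which is the standard transformer convention; dimensions follow the table above):

```python
import torch
import torch.nn as nn

vocab_size, n_positions, n_embd = 13317, 512, 768

token_emb = nn.Embedding(vocab_size, n_embd)      # one vector per token
segment_emb = nn.Embedding(2, n_embd)             # 0 = article content, 1 = title
position_emb = nn.Embedding(n_positions, n_embd)  # one vector per position

def embed(input_ids, segment_ids):
    # Model input = token embedding + segment embedding + position embedding.
    positions = torch.arange(input_ids.size(1)).unsqueeze(0)
    return token_emb(input_ids) + segment_emb(segment_ids) + position_emb(positions)

# Example: one sequence of length 6 (4 content tokens followed by 2 title tokens).
ids = torch.randint(0, vocab_size, (1, 6))
segs = torch.tensor([[0, 0, 0, 0, 1, 1]])
print(embed(ids, segs).shape)  # torch.Size([1, 6, 768])
```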
| Model | Download link |
|---|---|
| GPT2 model | Baidu Netdisk, extraction code: 165b |
`python3 train.py`

or

`python3 train.py --output_dir output_dir/` (custom path for saving the model)
Training parameters can be supplied on the command line; the available parameters are as follows:

| Parameter | Type | Default | Description |
|---|---|---|---|
| device | str | "0" | GPU device(s) to use for training or testing |
| config_path | str | "config/config.json" | Model configuration file |
| vocab_path | str | "vocab/vocab.txt" | Vocabulary file; a reduced vocabulary with some additional special tokens |
| train_file_path | str | "data_dir/train_data.json" | Training data for news title generation |
| test_file_path | str | "data_dir/test_data.json" | Test data for news title generation |
| pretrained_model_path | str | None | Path to a pre-trained GPT2 model |
| data_dir | str | "data_dir" | Directory where cached data is stored |
| num_train_epochs | int | 5 | Number of training epochs |
| train_batch_size | int | 16 | Batch size during training |
| test_batch_size | int | 8 | Batch size during testing |
| learning_rate | float | 1e-4 | Learning rate for model training |
| warmup_proportion | float | 0.1 | Warm-up proportion, i.e. the fraction of total training steps used for learning-rate warm-up |
| adam_epsilon | float | 1e-8 | Epsilon value of the Adam optimizer |
| logging_steps | int | 20 | Interval (in steps) at which training logs are written |
| eval_steps | int | 4000 | Interval (in steps) at which evaluation on the test set is run during training |
| gradient_accumulation_steps | int | 1 | Number of gradient accumulation steps |
| max_grad_norm | float | 1.0 | Maximum gradient norm for gradient clipping |
| output_dir | str | "output_dir/" | Model output path |
| seed | int | 2020 | Random seed |
| max_len | int | 512 | Maximum input length; must not exceed n_ctx in the config |
Alternatively, change the default values by editing the set_args function in train.py; a sketch of what such a function looks like is shown below.
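For reference, a set_args function along these lines would expose the parameters above (an illustrative subset, not the exact contents of train.py):

```python
import argparse

def set_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--device", type=str, default="0",
                        help="GPU used for training or testing")
    parser.add_argument("--config_path", type=str, default="config/config.json")
    parser.add_argument("--vocab_path", type=str, default="vocab/vocab.txt")
    parser.add_argument("--train_file_path", type=str, default="data_dir/train_data.json")
    parser.add_argument("--test_file_path", type=str, default="data_dir/test_data.json")
    parser.add_argument("--num_train_epochs", type=int, default=5)
    parser.add_argument("--train_batch_size", type=int, default=16)
    parser.add_argument("--learning_rate", type=float, default=1e-4)
    parser.add_argument("--output_dir", type=str, default="output_dir/")
    parser.add_argument("--seed", type=int, default=2020)
    return parser.parse_args()
```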
The model provided by this project was trained for 5 epochs; the training loss and test-set loss are shown below:

*(figure: training-loss and test-loss curves)*

The model is not fully converged; judging from the loss trend, training can be continued.
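For example, to keep training from an existing checkpoint, something along these lines should work (the checkpoint directory name is taken from the generation defaults below, and using pretrained_model_path to resume is an assumption; adjust paths to your setup):

```
python3 train.py --pretrained_model_path output_dir/checkpoint-139805
```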
`python3 generate_title.py`

or

`python3 generate_title.py --top_k 3 --top_p 0.9999 --generate_max_len 32`
Generation parameters can likewise be supplied on the command line:

| Parameter | Type | Default | Description |
|---|---|---|---|
| device | str | "0" | GPU device(s) to use for training or testing |
| model_path | str | "output_dir/checkpoint-139805" | Path to the model files |
| vocab_path | str | "vocab/vocab.txt" | Vocabulary file; a reduced vocabulary with some additional special tokens |
| batch_size | int | 3 | Number of titles to generate per input |
| generate_max_len | int | 32 | Maximum length of a generated title |
| repetition_penalty | float | 1.2 | Repetition penalty applied during generation |
| top_k | int | 5 | Number of highest-probability tokens kept at each decoding step (top-k sampling) |
| top_p | float | 0.95 | Cumulative-probability threshold for nucleus (top-p) sampling during decoding |
| max_len | int | 512 | Maximum input length; must not exceed n_ctx in the config |
The test results are as follows:
An article sampled from the test set:
content:
今日,中国三条重要高铁干线——兰新高铁、贵广铁路和南广铁路将开通运营。其中兰新高铁是中国首条高原高铁,全长1776公里,最高票价658元。贵广铁路最贵车票320元,南广铁路最贵车票206.5元,这两条线路大大缩短西南与各地的时空距离。出行更方便了!中国“高铁版图”再扩容 三条重要高铁今日开通
title:
Generated title 1: 中国“高铁版图”再扩容 三条重要高铁今日开通
Generated title 2: 贵广铁路最高铁版图
Generated title 3: 出行更方便了!中国“高铁版图”再扩容三条重要高铁今日开通

A news article picked at random from the web:
content:
值岁末,一年一度的中央经济工作会议牵动全球目光。今年的会议,背景特殊、节点关键、意义重大。12月16日至18日。北京,京西宾馆。站在“两个一百年”奋斗目标的历史交汇点上,2020年中央经济工作会议谋划着中国经济发展大计。习近平总书记在会上发表了重要讲话,深刻分析国内外经济形势,提出2021年经济工作总体要求和政策取向,部署重点任务,为开局“十四五”、开启全面建设社会主义现代化国家新征程定向领航。
title:
Generated title 1: 习近平总书记在京会上发表重大计划 提出2025年经济工作总体要求和政策
Generated title 2: 习近平总书记在会上发表重要讲话
Generated title 3: 习近平总书记在会上发表重要讲话,深刻分析国内外经济形势
Decoding uses the top-k and top-p (nucleus) sampling strategies, which introduce some randomness, so repeated runs can generate different titles. A sketch of this filtering step is shown below.
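For reference, a minimal sketch of combined top-k/top-p filtering of the next-token logits (a standard implementation of the technique, not necessarily the project's exact code; shown for a single unbatched logit vector):

```python
import torch
import torch.nn.functional as F

def top_k_top_p_filtering(logits, top_k=5, top_p=0.95):
    # Keep only the top_k highest-probability tokens.
    if top_k > 0:
        threshold = torch.topk(logits, top_k).values[-1]
        logits[logits < threshold] = float("-inf")
    # Nucleus sampling: drop tokens once cumulative probability exceeds top_p.
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        remove = cum_probs > top_p
        remove[1:] = remove[:-1].clone()  # shift right so the first token over
        remove[0] = False                 # the threshold is still kept
        logits[sorted_idx[remove]] = float("-inf")
    return logits

# Sample the next token id from the filtered distribution.
logits = torch.randn(13317)  # dummy next-token logits over the vocabulary
filtered = top_k_top_p_filtering(logits)
next_id = torch.multinomial(F.softmax(filtered, dim=-1), num_samples=1)
```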
`python3 http_server.py`

or

`python3 http_server.py --http_id "0.0.0.0" --port 5555`
For local testing, use "http://127.0.0.1:5555/news-title-generate". To give others access to the service, simply replace "127.0.0.1" with the machine's IP address.
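A client call might then look like this (the JSON POST interface and the "content" field name are assumptions; check http_server.py for the exact request format):

```python
import requests

resp = requests.post(
    "http://127.0.0.1:5555/news-title-generate",
    json={"content": "新闻正文放在这里"},  # article body to generate a title for
)
print(resp.json())
```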
The details are shown in the figure below:

*(figure: service demo screenshot)*
@misc{GPT2-NewsTitle,
  author = {Cong Liu},
  title = {Chinese NewsTitle Generation Project by GPT2},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  url = {https://github.com/liucongg/GPT2-NewsTitle},
}
e-mail: [email protected]
Zhihu: Liu Cong NLP
WeChat official account: NLP workstation
