GPT2 chitchat Download - GPT2 chitchat Source code download

GPT2 for Chinese trainhat

News

Official account [YeungNLP]

2023.04.05: Release Firefly: a Chinese dialogue-based large language model, an open source 1.1M Chinese multitasking instruction dataset, and model weights. See the article for details
2023.04.02: Release LLMPruner: a large language model cropping tool, sharing the cutting method and its cropped model weights. See the article for details.
2023.02.13: Release OFA-Chinese, the application of the Chinese multimodal unified pre-trained model OFA on Image Caption tasks. See the article for details.
2022.12.04: Release CLIP-Chinese, a Chinese CLIP pre-trained model. The data was pre-trained using 1.4 million Chinese pictures and texts, which showed good performance in the similarity of graphic, text similarity, and picture similarity tasks. See the article for details.
2022.03.30: Release ClipCap-Chinese, an Image Caption model based on the CLIP model. See the article for details.
2021.06.16: Publish CPM Chinese text generation project. It can be used for Chinese generation tasks such as composition, novels, news, and ancient poems. See the article for details.
2021.05.26: Added original data and preprocessed data for multiple rounds of dialogue of 50w and 100w.

Project Description

This project is a Chinese chat robot based on GPT2, and the model implements transformers based on HuggingFace. article:
This project was inspired by GPT2-Chinese and read the author's code carefully and benefited a lot.
During the generation stage, Temperature, Top-k Sampling and Nucleus Sampling are used. Please refer to the paper The Curious Case of Neural Text Degeneration
Many detailed Chinese comments are given in the code to facilitate everyone to better understand the code.
This project is referenced by Microsoft's DialoGPT project (in order to simplify the generation method and speed up the generation, the generation method of MMI was deleted)

Running environment

python3.6, transformers==4.2.0, pytorch==1.7.0

Project structure

data
- train.txt: The default original training set file, storing chat materials
- train.pkl: The file after tokenize the original training corpus, stores a list object. Each piece of data in the list represents a multiple round of dialogue, representing a training data.
model: store the model generated by the conversation
- epoch40: Model obtained after 40 rounds of training
  - config.json: Configuration file for model parameters
  - pytorch_model.bin:Model file
vocab
- vocab.txt:Dictionary file. The default dictionary size is 13317. If you need to use a custom dictionary, you need to set the vocab_size field in the confog.json file to the corresponding size.
sample: Store historical chat records generated by human-computer chat
train.py: training code
interact.py:Human-computer interaction code
preprocess.py: Data preprocessing code

Model Introduction

Model structure

avatar

Introduction to model parameters (see the model's config.json file for details)

initializer_range: 0.02
layer_norm_epsilon: 1e-05
n_ctx: 1024
n_embd: 768
n_head: 12
n_layer: 12
n_positions: 1024
vocab_size: 21128

Training ideas

Each training data is spliced and then input it into the model for training.

For the following multiple rounds of chat training data, when training the model, the training data is spliced as follows: "[CLS] Want to see your beautiful photos [SEP] Kiss me and show you [SEP] I kiss you [SEP] I hate people who use small fists to punch your chest [SEP]". Then use the above splicing results as input to the model and let the model undergo autoregression training.

想看你的美照
亲我一口就给你看
我亲两口
讨厌人家拿小拳拳捶你胸口

How to use

Quick Start

Download the model in model sharing, put the model folder model_epoch40_50w into the model directory, execute the following commands, and conduct dialogue

 python interact.py --no_cuda --model_path model_epoch40_50w (使用cpu生成，速度相对较慢)
或
python interact.py --model_path model_epoch40_50w --device 0 (指定0号GPU进行生成，速度相对较快)

Data preprocessing

Create a data folder in the project root directory, name the original training corpus train.txt, and store it in this directory. The format of train.txt is as follows, with one line between each chat, and the format is as follows:

真想找你一起去看电影
突然很想你
我也很想你

想看你的美照
亲我一口就给你看
我亲两口
讨厌人家拿小拳拳捶你胸口

美女约嘛
开好房等你了
我来啦

Run preprocess.py, tokenize the data/train.txt corpus, and serialize and save it to data/train.pkl. The type of the serialized object in train.pkl is List[List], which records the tokens contained in each conversation in the dialog list.

 python preprocess.py --train_path data/train.txt --save_path data/train.pkl

Training the model

Run train.py, use the preprocessed data to perform autoregression training on the model, and save the model in the model folder in the root directory.

During training, you can start early stop by specifying the patience parameter. When patience=n, if n consecutive epochs, the loss of the model on the verification set does not decrease, then early stop and training is stopped. When patience=0, no early stop is performed.

Early stop is turned off by default in the code, because in practice, the generation effect of the model obtained by early stop may not be better.

 python train.py --epochs 40 --batch_size 8 --device 0,1 --train_path data/train.pkl

For more introduction to training parameters, you can directly look at the parameter description in the set_args() function in train.py

Human-computer interaction

Run interact.py, use the trained model, perform human-computer interaction, and enter Ctrl+Z to end the conversation, the chat record will be saved to the sample.txt file in the sample directory.

 python interact.py --no_cuda --model_path path_to_your_model --max_history_len 3(由于闲聊对话生成的内容长度不是很长，因此生成部分在CPU上跑速度也挺快的)

When executing interact.py, you can try to adjust the generated effect by adjusting the parameters such as topk, topp, repetition_penalty, max_history_len, etc. For more parameters, you can directly look at the parameter description in the set_args() function of interact.py. If you want to use the GPU for generation, do not call the --no_cuda parameter, and use --device gpu_id to specify which GPU to use.

Chat corpus sharing

Chinese chat materials	Dataset address	Corpus description
Common Chinese chats	chinese_chatbot_corpus	Including Xiaohuang Chicken corpus, Douban corpus, TV series dialogue corpus, Tieba forum reply corpus, Weibo corpus, PTT gossip corpus, Qingyun corpus, etc.
50w Chinese chat materials	Baidu Netdisk [Extraction code: 4g5e] or Google Drive	Contains original corpus and preprocessed data with 50w multi-round dialogue
100w Chinese chat materials	Baidu Netdisk [Extraction code: s908] or Google Drive	Contains original corpus and pre-processed data with 100w multi-round dialogue

The contents of Chinese chat corpus are as follows:

谢谢你所做的一切
你开心就好
开心
嗯因为你的心里只有学习
某某某，还有你
这个某某某用的好

你们宿舍都是这么厉害的人吗
眼睛特别搞笑这土也不好捏但就是觉得挺可爱
特别可爱啊

今天好点了吗？
一天比一天严重
吃药不管用，去打一针。别拖着

Model sharing

Model	Shared address	Model description
model_epoch40_50w	Baidu Netdisk [Extraction code: ju6m] or Google Drive	40 epochs were trained using 50w multi-round dialogue corpus, and the loss was reduced to around 2.0.

Generate sample

Sample 1:

  user :你好啊
chatbot:你好，请问您是？
  user :我是你的主人啊 
chatbot:你是狗吗？
  user :你才是狗
chatbot:你不是狗我是你的小主人
  user :你这个大笨蛋
chatbot:你大笨蛋，你是想挨打了
  user :你全家都是大笨蛋
chatbot:我是你爸爸
  user :算了，不跟你吵了，你在干嘛呢
chatbot:我在上班啊，你还没下班呢啊你
  user :我周末要做作业啊
chatbot:不是才交了吗，早晚做完啊
  user :好多作业啊，还没做完呢
chatbot:我在抄，我妈让我周五交，还有事
  user :你就不怕被你妈妈骂吗
chatbot:不怕
  user :你周末有去哪里玩吗
chatbot:没时间啊，忙得很啊
  user :忙着干嘛呢
chatbot:不知道啊，我周末没有作业，我想睡觉，然后还得找人

Sample 2:

  user :你好啊
chatbot:你好我好大家好
  user :你在干嘛呢
chatbot:你不上课啊
  user :今天是周末啊，笨蛋
chatbot:我在学校呢
  user :我也是，周末还要做作业
chatbot:我还没写呢我们都还没写
  user :周末你有去哪里玩吗
chatbot:没有去南京
  user :南京有什么好玩的啊
chatbot:我们没去南京
  user :你刚刚又说你去了南京，你是不是耍我啊
chatbot:我刚刚才到南京
  user :你好笨啊
chatbot:我没有
  user :你是大笨蛋
chatbot:你才是大笨蛋
  user :你全家都是大笨蛋
chatbot:你个傻冒
  user :你个智障
chatbot:哈哈哈

Sample 3:

  user :在干嘛
chatbot:睡觉
  user :这么早就睡觉啦
chatbot:嗯啊你也早点睡啊
  user :我再玩一会儿手机
chatbot:早点休息晚安
  user :拜拜啦
chatbot:晚安晚安
  user :多喝热水
chatbot:哈哈哈
  user :多喝热水
chatbot:好的咯

Sample 4:

  user :在吗
chatbot:在
  user :我不开心
chatbot:怎么啦
  user :最近好累啊
chatbot:是挺累的
  user :作业好多啊
chatbot:还好还好
  user :我觉得我快要挂科了
chatbot:我也是啊，但是我觉得我快熬出头了
  user :不说了，我还有好多作业没写
chatbot:我也是啊好多啊
  user :那先这样，下次再聊，再见了
chatbot:好哒

Sample 5:

  user :你去过广州吗
chatbot:去过呀，但是没有找到工作呢
  user :那你去过北京吗
chatbot:没有啊
  user :北京有什么好玩的
chatbot:没有你想象的那么好
  user :是吗，我还没过北京
chatbot:那里我也不怎么熟，我去过北京
  user :北京天安门呢
chatbot:没去过，不过可以去玩玩

Sample 6:

  user :你好
chatbot:你好
  user :在干嘛呢
chatbot:在想你呢
  user :好害羞啊
chatbot:害羞什么啊
  user :你这样夸我，我觉得不好意思呀
chatbot:我夸你你还不好意思呢
  user :是啊，哈哈哈
chatbot:你是不是在考验我
  user :你怎么会这样想呢
chatbot:我说我在夸你，这么客气，
  user :你有睡午觉吗
chatbot:有啊
  user :我现在有点困呢
chatbot:我也是
  user :等下还要去上课，好烦啊
chatbot:我们不用上课的呀

TODO

The problem of load imbalance in parallel training of multiple cards

Reference

The Curious Case of Neural Text Degeneration
transformers
GPT2-Chinese
DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation

Expand