By fine-tuning the open-source ChatGLM-6B model with LoRA, the ChatGLM model can be applied to composite tasks. This project covers two tasks: intelligent classification of new-media-industry comments and information extraction.
The project is being updated continuously...
Text classification is the process of assigning one or more pieces of text to categories or tags according to their content or topic. It is widely used in practice, for example: news classification, resume classification, email classification, office-document classification, and regional classification. It also enables text filtering, i.e., quickly identifying and filtering out information that meets specific requirements from a large volume of text.
Information extraction is a technique that identifies factual descriptions of entities, relationships, events, etc. in unstructured or semi-structured natural-language text, and stores and uses them in a structured form. Taking

"Xiao Ming and Xiao Qin are good friends. They are both from Yunnan. Xiao Ming lives in Dali and Xiao Qin lives in Lijiang."

as an example, you can obtain triples such as <Xiao Ming, friend, Xiao Qin>, <Xiao Ming, lives in, Dali>, and <Xiao Qin, lives in, Lijiang>.
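As a hedged sketch (not code from this project), the model's extraction output, assuming it follows the `<head, relation, tail>` format shown above, could be parsed into structured triples like this:

```python
import re

# Parse strings of the assumed form "<head, relation, tail> ..." into tuples.
def parse_triples(text):
    return [tuple(part.strip() for part in match.split(","))
            for match in re.findall(r"<([^>]+)>", text)]

output = "<Xiao Ming, friend, Xiao Qin> <Xiao Ming, lives in, Dali>"
print(parse_triples(output))
# → [('Xiao Ming', 'friend', 'Xiao Qin'), ('Xiao Ming', 'lives in', 'Dali')]
```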
With the rapid development of Internet technology, the new-media industry has become one of the main platforms for information dissemination. In this era of information explosion, people obtain information through social media, news clients, blogs, and other channels. As the volume of information keeps growing, managing and using it efficiently has become a pressing problem. Using "new media industry" data as its background, this project applies classification and information extraction to text comments, helping the industry quickly and accurately obtain useful information from massive amounts of data and classify and manage it sensibly. This not only helps new-media platforms improve user experience, but also provides information producers with more accurate data analysis and decision-making support.
Based on the ChatGLM-6B model and the LoRA fine-tuning method, this project implements the joint task of text classification and information extraction.
| Model | GPU memory |
|---|---|
| ChatGLM-6B | 13 GB |
| Dependency package | Version Requirements |
|---|---|
| protobuf | >=3.19.5,<3.20.1 |
| transformers | >=4.27.1 |
| streamlit | ==1.17.0 |
| datasets | >=2.10.1 |
| accelerate | ==0.17.1 |
| packaging | >=20.0 |
LoRA freezes the weights of the pretrained model and injects trainable rank-decomposition matrices into each Transformer block, i.e., it adds a "side branch" of two matrices, A and B, next to the model's Linear layers. A projects the input down from d dimensions to r dimensions, where r is the LoRA rank, an important hyperparameter; B projects it back up from r dimensions to d dimensions, and B's parameters are initialized to zero. After training, the A and B parameters are merged back into the weights of the original large model.
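The mechanics above can be sketched in a few lines of NumPy. This is a minimal illustration, not the project's actual implementation: the frozen weight W stays fixed, only A (d → r) and B (r → d) would be trained, and zero-initializing B means training starts from exactly the pretrained behavior.

```python
import numpy as np

d, r = 8, 2                          # model dimension and LoRA rank
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01   # down-projection A: d -> r, random init
B = np.zeros((d, r))                 # up-projection B: r -> d, zero init

def lora_forward(x):
    # frozen path plus the trainable side branch B @ A
    return x @ W.T + x @ A.T @ B.T

x = rng.normal(size=(1, d))
# with B = 0 the side branch contributes nothing
assert np.allclose(lora_forward(x), x @ W.T)

# after training, the branch is merged back: W' = W + B @ A
W_merged = W + B @ A
assert np.allclose(lora_forward(x), x @ W_merged.T)
```

The merge step is why LoRA adds no inference cost: once W' = W + BA is computed, the side branch disappears from the forward pass.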
Data format: dictionary style. The "context" field holds the original input text (the prompt); the "target" field holds the target text. The mixed dataset contains both text-classification data and information-extraction data.
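To make the format concrete, here are two illustrative samples. The field names "context" and "target" follow the description above, but the prompt wording and labels are assumptions, not taken from the project's dataset:

```python
import json

# Hypothetical classification sample (label and prompt text are invented).
classification_sample = {
    "context": "Classify the following new-media comment: ...",
    "target": "entertainment",
}

# Hypothetical information-extraction sample.
extraction_sample = {
    "context": "Extract triples from the text: ...",
    "target": "<Xiao Ming, friend, Xiao Qin>",
}

# One JSON object per line is a common on-disk layout for such mixed datasets.
print(json.dumps(classification_sample, ensure_ascii=False))
```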
The training set contains 902 samples.
The validation set contains 122 samples.
Use of the ChatGLM-6B model weights in this project is subject to the model's license.