This project open-sources a ChatGLM-6B model that has been instruction fine-tuned (Instruct-tuning) on Chinese medical data. We built a Chinese medical instruction dataset from a medical knowledge graph and the GPT-3.5 API, and used it to instruction-tune ChatGLM-6B, improving ChatGLM's question-answering performance in the medical domain.
Using the same data, we also trained a medical version of the LLaMA model: Huatuo.
Update: Because ChatGLM is updated very frequently, we recommend using ChatGLM's official fine-tuning solution: https://github.com/THUDM/ChatGLM-6B/blob/main/ptuning/README.md
First install the dependencies. Python 3.9 or later is recommended.
pip install -r requirements.txt
The trained model weights can be downloaded from the links below:
| Model name | Size | Download link |
|---|---|---|
| ChatGLM-6B-Med | Approximately 13.4GB | [Baidu Netdisk] [GoogleDrive] |
Once the environment is set up, you can test the model interactively:
python infer.py
We use both public and self-built Chinese medical knowledge bases, drawing mainly on cMeKG. The knowledge base is organized around diseases, drugs, examination indicators, and similar entities, with fields that include complications, high-risk factors, histological examinations, clinical symptoms, drug treatment, and adjuvant treatment. An example entry:
{"中心词": "偏头痛", "相关疾病": ["妊娠合并偏头痛", "恶寒发热"], "相关症状": ["皮肤变硬", "头部及眼后部疼痛并能听到连续不断的隆隆声", "晨起头痛加重"], "所属科室": ["中西医结合科", "内科"], "发病部位": ["头部"]}
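An entry like the one above can be read with Python's standard json module. The following is a minimal sketch: it assumes entries are stored one JSON object per line, which this README does not state, and the helper name parse_entry is hypothetical.

```python
import json

# Sample entry copied from the example above. Storing one JSON object per line
# is an assumption; the released knowledge base may use a different layout.
SAMPLE_LINE = (
    '{"中心词": "偏头痛", "相关疾病": ["妊娠合并偏头痛", "恶寒发热"], '
    '"相关症状": ["皮肤变硬", "头部及眼后部疼痛并能听到连续不断的隆隆声", "晨起头痛加重"], '
    '"所属科室": ["中西医结合科", "内科"], "发病部位": ["头部"]}'
)


def parse_entry(line):
    """Split one entry into its head term ("中心词") and its list-valued fields."""
    record = json.loads(line)
    head = record["中心词"]
    fields = {key: value for key, value in record.items() if key != "中心词"}
    return head, fields


head, fields = parse_entry(SAMPLE_LINE)
print(head)
for name, values in fields.items():
    # Print each field name with the number of values it holds.
    print(name, len(values))
```

Separating the head term from the remaining fields makes it easy to iterate over the fields when generating questions about a single disease or drug.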
We used the GPT-3.5 API to generate question-answer data from the medical knowledge base, with a variety of prompt forms so that the knowledge is used as fully as possible.
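The idea of filling several prompt forms from one knowledge-base entry can be sketched as follows. This is illustrative only: the template wording, the TEMPLATES list, and the build_prompts helper are hypothetical and are not the prompts actually used in this project. The sketch only builds prompt strings; sending them to the GPT-3.5 API is left out.

```python
# Hypothetical templates: the project does not publish its real prompt wording.
TEMPLATES = [
    "Write one question a patient might ask about {head}, and an answer, "
    "using this fact: {field}: {values}.",
    "A doctor is explaining {head}. Write a question and an answer in Chinese "
    "that cover: {field}: {values}.",
]


def build_prompts(entry, head_key="中心词"):
    """Return one prompt per (field, template) pair for a knowledge-base entry."""
    head = entry[head_key]
    prompts = []
    for field, values in entry.items():
        if field == head_key:
            continue  # the head term is the subject, not a fact about it
        joined = "、".join(values)
        for template in TEMPLATES:
            prompts.append(template.format(head=head, field=field, values=joined))
    return prompts


# A shortened version of the example entry shown earlier.
entry = {
    "中心词": "偏头痛",
    "相关症状": ["皮肤变硬", "晨起头痛加重"],
    "发病部位": ["头部"],
}
prompts = build_prompts(entry)
print(len(prompts))
print(prompts[0])
```

Crossing every field with every template is one simple way to get several differently phrased questions from the same fact.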
An example from the instruction fine-tuning training set:
"问题:一位年轻男性长期使用可卡因,突然出现胸痛、呕吐、出汗等症状,经检查发现心电图反映心肌急性损伤,请问可能患的是什么疾病?治疗方式是什么?"
"回答: 可能患的是心肌梗塞,需要进行维拉帕米、依普利酮、硝酸甘油、ß阻滞剂、吗啡等药物治疗,并进行溶栓治疗、低分子量肝素、钙通道阻滞剂等辅助治疗。此外需要及时停用可卡因等药物,以防止病情加重。"
The instruction fine-tuning dataset is still of limited quality and will be improved in future iterations. The code for building the medical knowledge base and the dataset is still being organized and will be released once that work is complete.
Download the base model by following https://github.com/THUDM/ChatGLM-6B and place it in the model directory.
sh scripts/sft_medchat.sh
Adjust the batch size and other hyperparameters to suit your hardware; the defaults are the values we used.
To continue instruction-tuning ChatGLM on your own dataset, prepare files in the data directory in the same format as the training set, or refer to the ChatGLM-Instruct-Tuning repository.
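As a rough sketch of preparing your own data, the snippet below writes question-answer pairs to a file with one JSON object per line. The field names context and target, the way the 问题/回答 prefixes are placed, and the output file name are all assumptions made for illustration; check the files in the data directory for the actual format before training. The sample pair is made up from the knowledge-base example and is not taken from the released dataset.

```python
import json
import os
import tempfile


def write_training_file(pairs, path):
    """Write (question, answer) pairs as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for question, answer in pairs:
            record = {
                # "context" and "target" are assumed field names.
                "context": f"问题:{question}\n回答: ",
                "target": answer,
            }
            # ensure_ascii=False keeps Chinese text readable in the file.
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


# Illustrative pair only, not real training data.
pairs = [("偏头痛的发病部位是哪里?", "偏头痛的发病部位是头部。")]
out_path = os.path.join(tempfile.mkdtemp(), "my_train.json")  # hypothetical file name
write_training_file(pairs, out_path)

# Read the file back to confirm that every line parses as JSON.
with open(out_path, encoding="utf-8") as handle:
    records = [json.loads(line) for line in handle]
print(records[0]["target"])
```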
We trained on an A100-SXM-80GB GPU. In our experience, a GPU with at least 32 GB of memory is recommended for training.
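Before starting a run, you can compare your GPU against the recommendation above. The sketch below is an assumption-laden convenience check, not part of this repository: it reads the 32G figure as decimal gigabytes, looks only at GPU 0, and uses the hypothetical helper names meets_recommendation and first_gpu_memory. It relies on PyTorch's torch.cuda.get_device_properties and falls back gracefully when PyTorch or CUDA is unavailable.

```python
# Threshold from the text above, read as decimal gigabytes (an assumption).
RECOMMENDED_BYTES = 32 * 10**9


def meets_recommendation(total_bytes, required_bytes=RECOMMENDED_BYTES):
    """Return True if a GPU with total_bytes of memory meets the recommendation."""
    return total_bytes >= required_bytes


def first_gpu_memory():
    """Return the total memory of GPU 0 in bytes, or None if it cannot be queried."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(0).total_memory


total = first_gpu_memory()
if total is None:
    print("No CUDA GPU detected (or PyTorch is not installed).")
else:
    print(f"GPU 0: {total / 10**9:.1f} GB, sufficient: {meets_recommendation(total)}")
```

Total memory is only an upper bound; the memory actually needed still depends on batch size and the other hyperparameters.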
| Test input | ChatGLM-6B output | ChatGLM-6B-Med output |
|---|---|---|
| Will glutamyltranspeptidase levels increase due to drug use or alcohol consumption? | Drug use and alcohol consumption can have negative effects on your body's health, including the effects on the liver and heart. Alcohol drinking can cause liver damage and heart disease because alcohol can destroy the function of the liver and heart. Long-term drinking may also lead to health problems such as fatty liver, cirrhosis, heart disease, etc. Drug use can cause serious damage to the liver and heart, resulting in liver fibrosis, cirrhosis, cardiovascular diseases, etc. Drug use can also lead to mental disorders, social withdrawal, violence and crime. Therefore, whether it is alcohol or drug use, it will have a negative impact on health and overuse should be avoided. | There are many reasons for the elevated glutamyltranspeptidase, including infection, peptic ulcer, infection, drug side effects, etc. Unhealthy behaviors such as drug use or alcohol consumption may lead to the elevated glutamyltranspeptidase. |
| Xiao Li recently developed symptoms of tachycardia, accompanied by mild chest pain. Physical examination found a prolonged PR interval, accompanied by low T waves and ST-segment abnormalities | Very sorry, I don't understand your question. Please provide more background or context so that I can better answer your question. Thanks! | Xiao Li has symptoms of tachycardia accompanied by chest pain. It is necessary to consider whether there are cardiovascular diseases such as cardiomyopathy and coronary atherosclerosis. It is recommended to undergo electrocardiogram, blood test, cardiac ultrasound examination, etc. |
| ...... | ...... | ...... |
Q: Error ValueError: 130001 is not in list / ValueError: 150001 is not in list
A: The related dependencies change quickly, and version mismatches can cause these errors:
(1) If the error is 150001 is not in list, update the repository to the latest version.
(2) If the error is 130001 is not in list, roll the repository back to commit cb9d827: https://github.com/SCIR-HI/Med-ChatGLM/tree/cb9d82738021ec6f82b307d6031e8595a49dcb00
Q: The model's performance is limited
A: Because ChatGLM's own training strategy is not open source, the instruction fine-tuning method provided in this repository may degrade ChatGLM's general capabilities. For copyright and other reasons, our future model iterations will focus on open-source models.
This project was completed by Wang Haochun and Liu Chi of the Health Intelligence Group at the Center for Social Computing and Information Retrieval, Harbin Institute of Technology. Their advisors are Associate Professor Zhao Sendong, Professor Qin Bing, and Professor Liu Ting.
This project draws on the following open-source projects, and we thank those projects and their researchers and developers.
The resources of this project are for academic research only; commercial use is strictly prohibited. When using parts that involve third-party code, please strictly follow the corresponding open-source license. Content generated by the model is affected by factors such as model computation, randomness, and loss of precision from quantization, so this project cannot guarantee its accuracy. Most of the project's dataset is model-generated and, even where it agrees with medical facts, cannot be used as a basis for actual medical diagnosis. This project assumes no legal liability for any model output, nor for any losses that may arise from using the related resources and outputs.
If you use the data or code of this project, please cite it as follows:
@misc{ChatGLM-Med,
author = {Haochun Wang and Chi Liu and Sendong Zhao and Bing Qin and Ting Liu},
title = {ChatGLM-Med: 基于中文医学知识的ChatGLM模型微调},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/SCIR-HI/Med-ChatGLM}},
}