Scripts for LLM pretraining and fine-tuning (SFT)
LoRA & DeepSpeed supported
The repository is based on tatsu-lab/stanford_alpaca.
Before you start continual pre-training of an LLM, provide the model name (Hugging Face) or a local model path.
Prepare the training data: plain text in Markdown or txt format can be used for pretraining. The included example is A Guide to Writing the NeurIPS Impact Statement; you can add more text corpora to the data folder.
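As a rough illustration of how plain-text corpora are typically turned into pretraining samples (the repository's generate_pretrain_data.py may differ), the files are tokenized and packed into fixed-length blocks for causal-LM training; the model path and block size below are placeholders:

```python
# Illustrative sketch only; not necessarily the logic in generate_pretrain_data.py.
import glob
from transformers import AutoTokenizer

# Assumed model path; substitute your own Hugging Face name or local path.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
block_size = 2048  # assumed context length

# Concatenate every .md/.txt file in the data folder into one token stream.
ids = []
for path in sorted(glob.glob("data/*.md") + glob.glob("data/*.txt")):
    with open(path, encoding="utf-8") as f:
        ids.extend(tokenizer(f.read()).input_ids)

# Split the stream into non-overlapping blocks for causal-LM pretraining.
blocks = [ids[i:i + block_size] for i in range(0, len(ids) - block_size + 1, block_size)]
```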
Launch
pip install -r requirements.txt
cd llm_pretrain
./pretrain_llama.sh
Note that some parameter settings differ across these models (LLaMA, Baichuan2, Mistral).
Before you start fine-tuning an LLM, provide the model name (Hugging Face) or a local model path.
Prepare the training data: you can add your own task data, like the example in sft_examples.json, which follows a format similar to alpaca_data.json. The format is as follows (a minimal loading sketch appears after the example):
{
    "binary_selection": [
        {
            "instruction": "Does the following text violate the law?\nText: OH MY FUCKING GOD",
            "output": "No"
        },
        ...
    ],
    "another_task_name": [
        {
            "instruction": "How are you?",
            "output": "Not bad."
        },
        ...
    ],
    ...
}
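A minimal sketch of loading this task-grouped format and flattening it into a single list of instruction/output pairs (the repository's generate_sft_data.py may handle this differently):

```python
# Hypothetical sketch: flatten the task-grouped JSON above into one list of
# {"instruction", "output"} records, the same shape as alpaca_data.json entries.
import json

with open("data/sft_examples.json", encoding="utf-8") as f:
    tasks = json.load(f)

samples = [example for task_examples in tasks.values() for example in task_examples]
print(len(samples), samples[0]["instruction"])
```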
Note that if you put alpaca_data.json in the data folder, the script will use it as part of the training data.
LLaMA-2: Since LLaMA-2 has no pad_token, it is recommended to set 'tokenizer.pad_token = tokenizer.unk_token' after loading the tokenizer.
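For example, right after the tokenizer is created (the model path below is a placeholder; the exact place in the training script may vary):

```python
from transformers import AutoTokenizer

# Assumed model path; use your own Hugging Face name or local checkpoint.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
if tokenizer.pad_token is None:
    # LLaMA-2 ships without a pad token; reuse the unk token so padding
    # does not introduce a new embedding.
    tokenizer.pad_token = tokenizer.unk_token
```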
pip install -r requirements.txt
cd llm_sft
./train_llama.sh
pip install -r requirements.txt
cd llm_sft
./train_baichuan_LORA.sh
You can adjust the configuration in train_lora.py. In our experiments, Baichuan requires a transformers version >= 4.29.0 and < 4.34.0.
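For reference, a LoRA setup with peft typically looks like the sketch below; the rank, alpha, and target_modules values are illustrative assumptions, not necessarily what train_lora.py uses:

```python
# Illustrative LoRA configuration with peft; values and target_modules are
# assumptions for a Baichuan-style model, not the repository's defaults.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "baichuan-inc/Baichuan2-7B-Base", trust_remote_code=True  # assumed model path
)
lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor
    lora_dropout=0.05,
    target_modules=["W_pack"],  # Baichuan's fused attention projection (assumed)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```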
Note that some parameter settings differ across these models (LLaMA, Baichuan, Mistral).
If you want to use DeepSpeed, add the following argument to the training command:
--deepspeed "./configs/default_offload_opt_param.json"
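In stanford_alpaca-style training scripts, this flag maps to the deepspeed field of transformers.TrainingArguments, which the Trainer uses to initialize ZeRO/offloading from the JSON config; a minimal sketch (output_dir is a placeholder, other training arguments are omitted, and the deepspeed package plus the config file must be present):

```python
# Minimal sketch: the --deepspeed flag corresponds to TrainingArguments.deepspeed.
# Requires the deepspeed package and the JSON config file to exist.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output",  # placeholder
    deepspeed="./configs/default_offload_opt_param.json",
)
```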
.
├── LICENSE
├── README.md
├── llm_pretrain_clean
│ ├── data
│ │ └── A_Guide_to_Writing_the_NeurIPS_Impact_Statement.md
│ ├── evaluation
│ │ └── inference_single.py
│ ├── generate_pretrain_data.py
│ ├── pretrain.py
│ ├── pretrain_baichuan2.sh
│ ├── pretrain_llama.sh
│ ├── pretrain_mistral.sh
│ ├── requirements.txt
│ └── utils.py
└── sft_model_clean
├── README.md
├── configs
│ └── default_offload_opt_param.json
├── data
│ ├── alpaca_data.json
│ └── sft_examples.json
├── evaluation
│ └── inference_single.py
├── generate_sft_data.py
├── requirements.txt
├── train.py
├── train_baichuan.sh
├── train_baichuan_LORA.sh
├── train_llama.sh
├── train_lora.py
├── train_mistral.sh
└── utils.py