
From vocabulary to fine-tuning, this is everything you need
This repository shows how to build your own vocabulary from a corpus and use it for model pre-training and fine-tuning. The code examples are mainly based on the Hugging Face version of Llama2 and provide training examples for both GPU and TPU. Due to the author's limited time, the TPU script still has bugs and is for reference only.
| Type | Description |
|---|---|
| Online novels | High-quality long text data |
| Math23K | Chinese Mathematics Problems |
| LCCC | Chinese open source dialogue set |
How to use the LCCC dataset:
```python
from datasets import load_dataset

dataset = load_dataset("lccc", "base")  # or "large"
```
Data comparison chart for the vocabulary and the pre-training stage. The figure shows the number of files and the total file size of each data source, and compares the minimum, maximum, and average number of characters per file:

Meta official download link: https://huggingface.co/meta-llama
The Chinese pre-trained model, LoRA parameters, and chat model have all been uploaded to Hugging Face. Currently only 13B models are available.
| Category | Model name | Base model | Download link |
|---|---|---|---|
| Pre-training | taotie1/literary-alpaca2-13B | meta-llama/Llama-2-13b-hf | Model download |
| LoRA | taotie1/literary-alpaca2-13B-lora | taotie1/literary-alpaca2-13B | Model download |
| Category | Model name | Download link |
|---|---|---|
| Chat | taotie1/literary-alpaca2-13B-chat | Model download |
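For offline use, the weights can also be fetched ahead of time with huggingface_hub. A minimal sketch; the local_dir path is just an example:

```python
# Sketch: download the chat model weights from Hugging Face.
# The repo_id comes from the table above; local_dir is an example path.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="taotie1/literary-alpaca2-13B-chat",
    local_dir="./literary-alpaca2-13B-chat",
)
```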
Install the environment dependencies from requirements.txt, and choose the torch version that matches your device.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the chat model in 8-bit to reduce GPU memory usage
model = AutoModelForCausalLM.from_pretrained(
    'taotie1/literary-alpaca2-13B-chat',
    device_map='auto',
    torch_dtype=torch.float16,
    load_in_8bit=True,
)
model = model.eval()
tokenizer = AutoTokenizer.from_pretrained('taotie1/literary-alpaca2-13B-chat', use_fast=False)
tokenizer.pad_token = tokenizer.eos_token

# Build the prompt in the training format and move it to the GPU
input_ids = tokenizer(
    ['<s>Human: 什么是计算机\n</s><s>Assistant: '],
    return_tensors="pt",
    add_special_tokens=False,
).input_ids.to('cuda')

generate_input = {
    "input_ids": input_ids,
    "max_new_tokens": 512,
    "do_sample": True,
    "top_k": 50,
    "top_p": 0.95,
    "temperature": 0.3,
    "repetition_penalty": 1.3,
    "eos_token_id": tokenizer.eos_token_id,
    "bos_token_id": tokenizer.bos_token_id,
    "pad_token_id": tokenizer.pad_token_id,
}
generate_ids = model.generate(**generate_input)
text = tokenizer.decode(generate_ids[0])
print(text)
```
First, clean your training data [optional]
Choose whether to run the random-sample cleaning code or clean everything; you can customize your rules in ill_ocr_regex.txt.
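Conceptually, the cleaning step boils down to dropping lines that match the patterns in ill_ocr_regex.txt. Below is a minimal sketch under the assumption of one regular expression per line in that file; the file paths are examples, not the repository's actual script:

```python
# Sketch: drop lines that match any pattern from ill_ocr_regex.txt.
# Assumes one regular expression per line in the rules file; paths are examples.
import re

def load_rules(path="ill_ocr_regex.txt"):
    with open(path, encoding="utf-8") as f:
        return [re.compile(line.strip()) for line in f if line.strip()]

def clean_file(src, dst, rules):
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            if not any(r.search(line) for r in rules):
                fout.write(line)

rules = load_rules()
clean_file("raw/novel_0001.txt", "clean/novel_0001.txt", rules)
```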
Run full_sample_extraction.py to merge the data into a single file.
Refer to train-chinese-tokenizer.ipynb for vocabulary training; you can modify the code to suit your own needs. After training is complete, put your vocabulary into the my-tokenizer directory, then merge it with the original Llama2 tokenizer as follows.
Run in bash:
```bash
export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
python incorporation.py
```
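incorporation.py follows the usual vocabulary-merging recipe (as popularized by Chinese-LLaMA-Alpaca): load both SentencePiece models and append the Chinese pieces that the Llama2 tokenizer lacks. The sketch below only illustrates that idea; the paths and any details beyond this outline are assumptions, not the script itself:

```python
# Illustrative sketch of merging a Chinese SentencePiece vocabulary into the
# Llama2 tokenizer (not the actual incorporation.py; paths are examples).
# Requires PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python, as set above.
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model
from transformers import LlamaTokenizer

llama_tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")
chinese_sp = spm.SentencePieceProcessor(model_file="my-tokenizer/chinese.model")

llama_proto = sp_pb2_model.ModelProto()
llama_proto.ParseFromString(llama_tokenizer.sp_model.serialized_model_proto())
chinese_proto = sp_pb2_model.ModelProto()
chinese_proto.ParseFromString(chinese_sp.serialized_model_proto())

# Append every Chinese piece that is not already in the Llama2 vocabulary
existing = {p.piece for p in llama_proto.pieces}
for piece in chinese_proto.pieces:
    if piece.piece not in existing:
        new_piece = sp_pb2_model.ModelProto.SentencePiece()
        new_piece.piece = piece.piece
        new_piece.score = 0
        llama_proto.pieces.append(new_piece)

with open("merged_tokenizer.model", "wb") as f:
    f.write(llama_proto.SerializeToString())
```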
Run text.py to test the effect of the new vocabulary.
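As a rough check (an illustrative comparison, not necessarily what text.py does), you can compare how the original and the merged tokenizer split the same Chinese sentence; both paths below are examples:

```python
# Illustrative check: the merged tokenizer should need far fewer tokens
# for Chinese text than the original Llama2 tokenizer. Paths are examples.
from transformers import LlamaTokenizer

old_tok = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")
new_tok = LlamaTokenizer.from_pretrained("my-tokenizer/merged")

text = "白日依山尽，黄河入海流。"
print(len(old_tok.tokenize(text)), old_tok.tokenize(text))
print(len(new_tok.tokenize(text)), new_tok.tokenize(text))
```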
The training code in this repository uses DeepSpeed for acceleration.
For example, with 2 machines and 8 GPUs each: torchrun --nnodes 2 --nproc_per_node 8
```python
# Freeze all parameters except the embedding layer (model.embed_tokens)
for name, param in model.named_parameters():
    if "model.embed_tokens" not in name:
        param.requires_grad = False
    else:
        param.requires_grad = True
```
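A quick sanity check (illustrative, not part of the training script) that only the embedding weights remain trainable:

```python
# Count trainable vs. total parameters after freezing everything but embed_tokens
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
```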
--load_in_kbits sets the quantization bit width; quantization is disabled unless the value is 4 or 8.
--bf16 | --fp16: enabling bf16 requires GPU hardware support.
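For reference, a load_in_kbits value of 4 or 8 typically maps to bitsandbytes quantization at model-loading time. A minimal sketch of how that is usually wired up with transformers; the parameter choices here are illustrative, not the script's exact settings:

```python
# Illustrative: how 4-bit loading is commonly configured in transformers.
# The training script's actual settings may differ.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "taotie1/literary-alpaca2-13B",
    quantization_config=bnb_config,
    device_map="auto",
)
```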
If you run into OOM errors, add the following to the deepspeed_config_peft2.json configuration:
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
},
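A DeepSpeed config file like this is normally handed to the Hugging Face Trainer through the deepspeed field of TrainingArguments. The sketch below only illustrates that hookup; all other argument values are placeholders, not the repository's actual settings:

```python
# Sketch: wiring a DeepSpeed config into the HF Trainer.
# Only the `deepspeed` field is the point here; other values are placeholders.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
    deepspeed="deepspeed_config_peft2.json",  # ZeRO-3 + CPU offload config above
)
```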
- Using pre-training script 2, pretrain-peft2.sh, will generate LoRA parameters. You can run the merge_lora_low_mem.py script (modified from Chinese-LLaMA-Alpaca-2) to merge them:
```bash
python merge_lora_low_mem.py \
    --base_model /root/LiteraryAlpaca2 \
    --lora_model /root/autodl-tmp/LiteraryAlpaca2-lora \
    --output_type huggingface \
    --output_dir /root/autodl-tmp/LiteraryAlpaca2
```
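If memory is not a concern, the same merge can also be done in a few lines with peft's merge_and_unload. This is only an alternative sketch with example paths, not a replacement for merge_lora_low_mem.py:

```python
# Illustrative alternative: merge LoRA weights with peft (loads the full
# model into memory, unlike merge_lora_low_mem.py). Paths are examples.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("/root/LiteraryAlpaca2")
model = PeftModel.from_pretrained(base, "/root/autodl-tmp/LiteraryAlpaca2-lora")
merged = model.merge_and_unload()
merged.save_pretrained("/root/autodl-tmp/LiteraryAlpaca2")
```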
A demonstration of personalized fine-tuning on a simple dataset:

Use the conversion script in the sft directory to convert the dataset into the required training format. The original datasets used in this project are all in JSON format; please modify the conversion script as needed.
The generated JSON file format is:
```json
[
    {
        "instruction": ...,
        "input": ...,
        "output": ...
    },
    ...
]
```
For example:
```json
{
    "instruction": "下面是人类之间的对话与交流",
    "input": "火锅 我 在 重庆 成都 吃 了 七八 顿 火锅",
    "output": [
        "哈哈哈哈 ! 那 我 的 嘴巴 可能 要 烂掉 !",
        "不会 的 就是 好 油腻"
    ]
}
```
The converted CSV file contains a single "text" column. Each row is one training example, and each training example is organized into the model input in the following format:
"<s>Human: " + ... + "\n</s><s>Assistant: " + ...
For example:
```
<s>Human: 镇海雅乐学校二年级的小朋友到一条小路的一边植树.小朋友们每隔2米种一棵树(马路两头都种了树),最后发现一共种了11棵,这条小路长多少米.</s><s>Assistant: 根据方程式x=(11-1)*2解得:
20</s>
```
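A minimal sketch of such a conversion (the file names and the way instruction, input, and list-valued output are combined are assumptions; the actual scripts in the sft directory may differ):

```python
# Illustrative conversion: JSON instruction data -> CSV with a single "text"
# column in the <s>Human: ...\n</s><s>Assistant: ...</s> format.
# File names and field handling are assumptions, not the repository's script.
import json
import pandas as pd

with open("sft_data.json", encoding="utf-8") as f:
    records = json.load(f)

rows = []
for r in records:
    prompt = r["instruction"] + r["input"]
    output = r["output"]
    if isinstance(output, list):  # join multi-turn replies into one string
        output = "\n".join(output)
    rows.append(f"<s>Human: {prompt}\n</s><s>Assistant: {output}</s>")

pd.DataFrame({"text": rows}).to_csv("sft_data.csv", index=False)
```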
Run sft/sft-peft.sh in the sft directory to start training. For the specific implementation, see sft/sft-peft.py.