HiFT下载 - HiFT源代码下载

hift：分层完整参数微调策略

此存储库包含Python软件包HiFT的源代码，以及如何将其与Pytorch型号集成在一起的几个示例，例如拥抱面孔的模型。我们目前仅支持Pytorch。有关· HiFT的详细说明，请参见我们的论文。 HiFT支持在混合精度下为24G GPU内存设备的7B型号的FPFT，而无需使用任何存储器保存技术和各种优化器，包括AdamW ， AdaGrad ， SGD等。

hift：分层完整参数微调策略
Yongkang Liu，Yiqun Zhang，Qian Li，Tong Liu，Shi Feng，Daling Wang，Yifei Zhang，HinrichSchütze
论文：https：//arxiv.org/abs/2401.15207

消息

26/1/2024 ：发布第一个版本的HiFT手稿
25/2/2024 ：发布第二版的HiFT手稿和源代码
1/5/2024 ：更新了LoRA的HIFT支持
10/5/2024 ：调整BitSandBytes提供的优化器
13/5/2024*：Adapt Adalora ， LoRA ， IA3 ， P_tuning ， Prefix_tuning ， Prompt_tuning peft方法。

存储库概述

此存储库中有几个目录：

hift/包含软件包hift的源代码，需要安装它以运行我们提供的示例；
示例包含基于HiFT的NER ， QA ， classification ， text generation ， instruction fine-tuning和pre-training示例实现。
脚本包含用于我们提供的运行示例的脚本。
DSCONFIG包含混合精度所需的配置文件。
数据包含指导微调和预训练的示例。

默认问题

指令对A6000（48G）上的微调7B模型，实验结果表明，HIFT支持的最大序列长度为2800。在此限制之外，可能会出现OOM问题。

模型	最大SEQ长度	最大批处理大小
Llama2-7b（羊驼）	512	8
Llama2-7b（Vicuna）	2800	1

RTX3090（24G）上的指令微调7B型号。如果您在RTX 3090/4000上使用多个GPU进行分布式培训，请在运行之前添加以下命令： export NCCL_IB_DISABLE=1 ; export NCCL_P2P_DISABLE=1

模型	最大SEQ长度	最大批处理大小
Llama2-7b（羊驼）	512	3
Llama2-7b（Vicuna）	1400	1

要求

pytorch > = 2.1.1; transformers == 4.36.2
pip install -r requirements.txt
conda install mpi4py==3.1.4
pip install flash-attn==2.5.8

Quickstart

安装hift

pip install hift

导入hift软件包

 ### generation task  

from hift import HiFTSeq2SeqTrainer,GetCallBack,peft_function,Seq2SeqTrainer

### classification taks  

from hift import HiFTrainer,GetCallBack,PEFTrainer,peft_function


### QA task  

from hift import HiFTQuestionAnsweringTrainer,GetCallBack,QuestionAnsweringTrainer,peft_function

添加HiFT配置

 @dataclass
class HiFTArguments(ModelArguments):
    HiTaskType: str = field(
        default="SEQ_CLS",
        metadata={"help": ("HiTaskType should be consistent with PEFT TaskType" )},
    )
    peft_type: str = field(
        default=None,
        metadata={"help": ("peft_type should be in [lora,adalora,ia3,p_tuning,prefix_tuning,prompt_tuning]" )},
    )
    init_text:str = field(
        default="Predict if sentiment of this review is positive, negative or neutral",
        metadata={
            "help": (
                "the init prompt text for prompt tuning"
            )
        },
    )
    lora_rank: int = field(
        default=8,
        metadata={"help": ("rank for lora or adalora" )},
    )
    peft_path : Optional[str] = field(default=None)
    virtual_tokens:int = field(
        default=20,
        metadata={"help": ("the number of virtual tokens for p_tuning, prefix_tuning and prefix_tuning" )},
    )
    group_element: int = field(
        default=1,
        metadata={"help": ("number element for each group parameters" )},
    )
    optimizer_strategy: str = field(
        default="down2up",
        metadata={"help": ("optimizer strategy of ['down2up','down2up','random']" )},
    )
    hier_tuning: bool = field(
        default=False,
        metadata={
            "help": (
                "hierarchical optimization for LLMS"
            )
        },
    )
    freeze_layers: List[str] = field(
        default_factory=list,
        metadata={
            "help": (
                "Index of the frozen layer"
            )
        },
    )

hitaskType应与PEFT TaskType一致。

序列分类，多项选择任务： TaskType.SEQ_CLS
问题回答任务： TaskType.QUESTION_ANS
序列标签任务： TaskType.TOKEN_CLS
生成任务： TaskType.CAUSAL_LM

group_element ：块中包含的层数。默认值为1 。

Freeze_layers ：在微调过程中要冻结的层。您应该提供相应层的索引。嵌入层的索引为0 ，第一层的索引为1 ，...

使用HiFT培训师

HiFT继承了HuggingFace的培训师，因此您可以直接使用HIFT提供的教练来替换原始教练。

分类任务


if model_args.hier_tuning:#hier_tuning
        trainer = HiFTrainer(
            hiFThandler = GetCallBack(model_args.model_name_or_path),
            HiTaskType = model_args.HiTaskType,
            group_element = model_args.group_element,
            strategy = model_args.optimizer_strategy,
            hier_tuning= model_args.hier_tuning,
            peft_type = model_args.peft_type,
            freeze_layers = model_args.freeze_layers,
            args=training_args,
            train_dataset=train_dataset if training_args.do_train else None,
            eval_dataset=eval_dataset if training_args.do_eval else None,
            model=model,
            tokenizer=tokenizer,
            compute_metrics=compute_metrics,
            data_collator=data_collator
        )
  else:
        trainer = PEFTrainer(
            peft_type = model_args.peft_type,
            args=training_args,
            model=model,
            train_dataset=train_dataset if training_args.do_train else None,
            eval_dataset=eval_dataset if training_args.do_eval else None,
            compute_metrics=compute_metrics,
            tokenizer=tokenizer,
            data_collator=data_collator,
        )

质量检查任务

 if model_args.hier_tuning:
        trainer = HiFTQuestionAnsweringTrainer(
            hiFThandler = GetCallBack(model_args.model_name_or_path),
            HiTaskType = model_args.HiTaskType,
            group_element = model_args.group_element,
            strategy = model_args.optimizer_strategy,
            hier_tuning= model_args.hier_tuning,
            peft_type = model_args.peft_type,
            freeze_layers = model_args.freeze_layers,
            eval_examples=eval_examples if training_args.do_eval else None,
            post_process_function=post_processing_function,
            args=training_args,
            model=model,
            train_dataset=train_dataset if training_args.do_train else None,
            eval_dataset=eval_dataset if training_args.do_eval else None,
            tokenizer=tokenizer,
            data_collator=data_collator,
            compute_metrics=compute_metrics)
 else:
        trainer = QuestionAnsweringTrainer(
            peft_type = model_args.peft_type,
            eval_examples=eval_examples if training_args.do_eval else None,
            post_process_function=post_processing_function,
            args=training_args,
            model=model,
            train_dataset=train_dataset if training_args.do_train else None,
            eval_dataset=eval_dataset if training_args.do_eval else None,
            tokenizer=tokenizer,
            data_collator=data_collator,
            compute_metrics=compute_metrics)

生成任务

 if model_args.hier_tuning:#hier_tuning
        trainer = HiFTSeq2SeqTrainer(
            hiFThandler = GetCallBack(model_args.model_name_or_path),
            HiTaskType = model_args.HiTaskType,
            group_element = model_args.group_element,
            strategy = model_args.optimizer_strategy,
            hier_tuning= model_args.hier_tuning,
            peft_type = model_args.peft_type,
            freeze_layers = model_args.freeze_layers,
            args=training_args,
            model=model,
            train_dataset=train_dataset if training_args.do_train else None,
            eval_dataset=eval_dataset if training_args.do_eval else None,
            compute_metrics=compute_metrics if training_args.predict_with_generate else None,
            tokenizer=tokenizer,
            data_collator=data_collator
        )
 else:
        trainer = Seq2SeqTrainer(
            peft_type = model_args.peft_type,
            args=training_args,
            model=model,
            train_dataset=train_dataset if training_args.do_train else None,
            eval_dataset=eval_dataset if training_args.do_eval else None,
            tokenizer=tokenizer,
            data_collator=data_collator,
            compute_metrics=compute_metrics if training_args.predict_with_generate else None,
        )

将模型适应HIFT

HiFT支持任何模型。适应HiFT非常容易。

定义TaskTInterface中模型支持的任务类型。
为embedding layer和不同的任务header layers提供regular expressions 。正则表达式的目的是唯一识别相应层的层名。
在others_pattern界面中提供除嵌入层和标头层以外的正则表达式。

最简单的方法是提供others_pattern界面中所有图层的图层名称，而其他接口返回一个空列表[] 。下面是罗伯塔的例子。

 class RobertaCallBack(HiFTCallBack):
    def __init__(self,freeze_layers,strategy,taskType,peft_type=None):
        super().__init__(freeze_layers,strategy,taskType,peft_type)
        self.TaskTInterface = [TaskType.SEQ_CLS,TaskType.TOKEN_CLS,TaskType.QUESTION_ANS]
        self.check_task_type(taskType,"RoBERTa",self.TaskTInterface)
    @property
    def emb_pattern(self):
        if self.peft_type:
            return [rf'.embedding.']
        else:
            return [rf'.embeddings.']
    @property
    def seq_cls_head(self):
        if self.peft_type:
            return ["classifier"]
        else:
            return ["classifier"]
    @property
    def token_cls_head(self):
        if self.peft_type:
            return ["classifier"]
        else:
            return ["classifier"]
    @property
    def qa_cls_head(self):
        if self.peft_type:
            return ["qa_outputs"]
        else:
            return ["qa_outputs"]
    @property
    def others_pattern(self):
        if self.peft_type:
            return [rf'.d+.']
        else:
            return [rf'.d+.']

教学微调 - 维库纳

维库纳

 ### The parameters have not been fine-tuned, this is just a demo. Please adjust the parameters based on your data.

export num_gpus=2
export output_dir="outputs/output_vicuna"
port=$(shuf -i25000-30000 -n1)
#--fsdp "full_shard auto_wrap" 
CUDA_VISIBLE_DEVICES="0,2" torchrun --master_port "$port" --nproc_per_node=$num_gpus examples/vicuna_train.py 
    --model_type llama 
    --HiTaskType "CAUSAL_LM" 
    --optim "lion_32bit" 
    --deepspeed "dsconfig/zero0_config.json" 
    --model_name_or_path /mounts/work/lyk/hierFT/llama2-7b 
    --data_path data/dummy_conversation.json 
    --eval_data_path data/sharegpt_clean.json 
    --output_dir $output_dir/model 
    --num_train_epochs 3 
    --do_train 
    --per_device_train_batch_size 1 
    --per_device_eval_batch_size 8 
    --evaluation_strategy "steps" 
    --eval_steps 1500 
    --save_strategy "steps" 
    --save_steps 1500 
    --save_total_limit 8 
    --learning_rate 2e-5 
    --weight_decay 0. 
    --warmup_ratio 0 
    --lr_scheduler_type "linear" 
    --logging_steps 10 
    --model_max_length 2800 
    --lazy_preprocess True 
    --torch_dtype float16 
    --ddp_find_unused_parameters False 
    --load_best_model_at_end 
    --hier_tuning 
    --group_element $1 
    --optimizer_strategy $2

指导微调 - 羊驼

 ### The parameters have not been fine-tuned, this is just a demo. Please adjust the parameters based on your data.

export num_gpus=2
export output_dir="outputs/instruct_tuning"
port=$(shuf -i25000-30000 -n1)

CUDA_VISIBLE_DEVICES="0,2" torchrun --master_port "$port" --nproc_per_node=$num_gpus examples/instruct_tuning.py 
    --model_type opt 
    --HiTaskType "CAUSAL_LM" 
    --optim "adamw_torch" 
    --deepspeed "dsconfig/zero0_config.json" 
    --model_name_or_path opt-7b  
    --dataset_dir alpaca_data 
    --validation_split_percentage 0.01 
    --per_device_train_batch_size 12 
    --per_device_eval_batch_size 8 
    --do_train 
    --do_eval 
    --seed 12345 
    --fp16 
    --tf32 true 
    --num_train_epochs 1 
    --lr_scheduler_type "cosine" 
    --learning_rate 1e-5 
    --warmup_ratio 0.0 
    --weight_decay 0.0 
    --logging_strategy steps 
    --logging_steps 10 
    --save_strategy steps 
    --save_total_limit 3 
    --evaluation_strategy steps 
    --eval_steps 100 
    --save_steps 200 
    --preprocessing_num_workers 4 
    --max_seq_length 512 
    --output_dir $output_dir/model 
    --overwrite_output_dir 
    --logging_first_step True 
    --torch_dtype float16 
    --ddp_find_unused_parameters False 
    --load_best_model_at_end 
    --hier_tuning 
    --group_element $1 
    --optimizer_strategy $2

预训练

预认证

 ### This is just a demo. Please adjust the parameters based on your data.

export num_gpus=8
export output_dir="outputs/pretrain_tuning"
port=$(shuf -i25000-30000 -n1)

CUDA_VISIBLE_DEVICES=0 torchrun --master_port "$port" examples/pretrain_tuning.py 
    --model_type llama 
    --HiTaskType "CAUSAL_LM" 
    --deepspeed "dsconfig/zero0_config.json" 
    --model_name_or_path llama2-7b 
    --dataset_dir "data" 
    --data_cache_dir "data_cache_dir" 
    --validation_split_percentage 0.001 
    --per_device_train_batch_size 8 
    --per_device_eval_batch_size 8 
    --do_train 
    --seed 12345 
    --fp16 
    --max_steps 1000 
    --lr_scheduler_type cosine 
    --learning_rate 1e-5 
    --warmup_ratio 0.05 
    --weight_decay 0.01 
    --logging_strategy steps 
    --logging_steps 10 
    --save_strategy steps 
    --save_total_limit 3 
    --save_steps 500 
    --preprocessing_num_workers 8 
    --block_size 512 
    --output_dir $output_dir/model 
    --overwrite_output_dir 
    --logging_first_step True 
    --torch_dtype float16 
    --ddp_find_unused_parameters False 
    --hier_tuning 
    --group_element $1 
    --optimizer_strategy $2

peft-tuning


export num_gpus=8
export output_dir="outputs/e2e_opt"
port=$(shuf -i25000-30000 -n1)
# CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" python -m torch.distributed.launch --nproc_per_node=$num_gpus run_glue.py 
CUDA_VISIBLE_DEVICES=7 torchrun --master_port "$port" examples/run_generation.py 
--model_name_or_path llama2-7b 
--model_type llama 
--HiTaskType "CAUSAL_LM" 
--peft_type "lora" 
--dataset_name e2e_nlg 
--do_train 
--do_eval 
--padding_side "left" 
--group_by_length 
--per_device_train_batch_size 1 
--per_device_eval_batch_size 8 
--save_strategy epoch 
--evaluation_strategy epoch 
--predict_with_generate 
--learning_rate 5e-5 
--lr_scheduler_type "linear" 
--pad_to_max_length 
--max_eval_samples 2000 
--model_max_length 512 
--num_train_epochs 5 
--output_dir $output_dir/model 
--overwrite_output_dir 
--logging_steps 10 
--logging_dir $output_dir/log 
--warmup_ratio 0.0  
--num_beams 10 
--seed 0 
--fp16 
--weight_decay 0.0 
--load_best_model_at_end 
--weight_decay 0

hift + peft


export num_gpus=8
export output_dir="outputs/e2e_opt"
port=$(shuf -i25000-30000 -n1)

CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" torchrun --master_port "$port" --nproc_per_node=$num_gpus examples/run_generation.py 
--model_name_or_path /mounts/work/lyk/hierFT/llama2-7b 
--model_type llama 
--HiTaskType "CAUSAL_LM" 
--peft_type "lora" 
--dataset_name e2e_nlg 
--do_train 
--do_eval 
--deepspeed "dsconfig/zero0_config.json" 
--padding_side "left" 
--group_by_length 
--per_device_train_batch_size 8 
--per_device_eval_batch_size 8 
--save_strategy epoch 
--evaluation_strategy epoch 
--predict_with_generate 
--learning_rate 5e-5 
--lr_scheduler_type "linear" 
--pad_to_max_length 
--max_eval_samples 2000 
--model_max_length 512 
--num_train_epochs 5 
--output_dir $output_dir/model 
--overwrite_output_dir 
--logging_steps 10 
--logging_dir $output_dir/log 
--warmup_ratio 0.0  
--num_beams 10 
--seed 0 
--fp16 
--weight_decay 0.0 
--load_best_model_at_end 
--hier_tuning 
--weight_decay 0 
--group_element $1 
--optimizer_strategy $2

介绍

详细的培训过程以算法为单位。第一步是确定更新策略。然后冻结所有层。要更新的层，表示 $ e $ ，从队列中选择 $ Q $基于参数 $ m $ 。选定的层 $ e $从队列中移开 $ Q $并添加到尾巴 $ Q $等待下一个更新。选择参数 $ theta_s $需要从 $ m $基于 $ e $ ，设置参数 $ theta_s $到可计算的梯度状态并设置优化器的更新参数组 $ P $到 $ theta_s $ 。参数更新之前，状态参数优化器的参数 $ P $与 $ theta_s $可以移至GPU设备。重量更新完成后，相应的梯度是清理的，优化器状态参数将移至CPU。当所有层都更新一次后，请调整一次学习率。

HiFT迭代在每个训练步骤中更新参数的子集，并且将在多个步骤后修改完整参数。这大大降低了微调语言模型的GPU内存需求，可以在部署过程中进行有效的任务转换，而无需引入推理潜伏期。 HIFT还胜过其他几种适应方法，包括适配器，前缀调整和微调。

HiFT是一种独立于模型和优化器独立的全参数微调方法，可以与PEFT方法集成。

优化器：最新版本的HiFT适用于Adam ， AdamW ， SGD ， Adafactor和Adagrad优化器。

模型： HiFT的最新版本支持BERT ， RoBERTa ， GPT-2 ， GPTNeo ， GPT-NeoX ， OPT和LLaMA-based模型。

Opt-13b的实验（有1000个示例）。 ICL ：在文章中学习； LP ：线性探测； FPFT ：全面调节；前缀：前缀调整。所有实验都使用Mezo的提示。

OPT-13B

E2E数据集上微调骆驼（7b）的GPU内存使用情况。总计代表微调过程中使用的总内存。混合表示具有标准混合精度的微调和混合^ HI^表示适合HiFT的混合精度。 Para表示模型参数所占据的内存； GRA代表梯度占据的记忆； STA表示优化器状态占据的内存。 PGS表示参数，梯度和优化器状态所占据的内存之和。

骆驼记忆

混合精度

源代码

 class FP16_Optimizer(DeepSpeedOptimizer):
    def __init__(self,
       init_optimizer,
       deepspeed=None,
       static_loss_scale=1.0,
       dynamic_loss_scale=False,
       initial_dynamic_scale=2**32,
       dynamic_loss_args=None,
       verbose=True,
       mpu=None,
       clip_grad=0.0,
       fused_adam_legacy=False,
       has_moe_layers=False,
       timers=None):
                 
       ....
       self.fp16_groups = []
       self.fp16_groups_flat = []
       self.fp32_groups_flat = []
                 
       ...
                 
       for i, param_group in enumerate(self.optimizer.param_groups):
           ...
           self.fp32_groups_flat.append(self.fp16_groups_flat[i].clone().float().detach())
           ...

加载1b参数所需的内存为3.72GB （10^9 $ times $ 4/1024/1024/1024）。标准混合精度存储单精度和半精度模型参数。假设您使用的是7B模型的标准混合精度微调，与单精度微调相比，混合精度需要额外的约13G GPU存储器开销来存储半精度模型参数。只有当动态GPU存储器降低到达13GB时，混合精度才能证明其优势。这需要使用大批量尺寸。

我们重新实现混合精确算法以适应HiFT的微调算法，该算法确保单位模型参数不会引起其他GPU内存开销。

引用

 @article { liu2024hift ,
  title = { HiFT: A Hierarchical Full Parameter Fine-Tuning Strategy } ,
  author = { Liu, Yongkang and Zhang, Yiqun and Li, Qian and Feng, Shi and Wang, Daling and Zhang, Yifei and Sch{"u}tze, Hinrich } ,
  journal = { arXiv preprint arXiv:2401.15207 } ,
  year = { 2024 }
}