Tutel

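Tutel MoE: An optimized Mixture-of-Experts implementation, and the first parallel solution proposing "No-penalty Parallelism/Sparsity/Capacity/.. Switching" for modern training and inference with dynamic behaviors.

Supported Framework: Pytorch (recommended: >= 1.10)
Supported GPUs: CUDA (FP64/FP32/FP16/BFP16), ROCm (FP64/FP32/FP16)
Supported CPU: FP64/FP32

What's New:

Tutel v0.3.3: Add all-to-all benchmark: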

  >> Example:

    python3 -m torch.distributed.run --nproc_per_node=8 -m tutel.examples.bandwidth_test --size_mb=256

Tutel v0.3.2: Add tensorcore option for extra benchmarks / Extend the example for custom experts / Allow NCCL timeout settings:

  >> Example for using tensorcore:

    python3 -m tutel.examples.helloworld --dtype=float32
    python3 -m tutel.examples.helloworld --dtype=float32 --use_tensorcore

    python3 -m tutel.examples.helloworld --dtype=float16
    python3 -m tutel.examples.helloworld --dtype=float16 --use_tensorcore

  >> Example for custom experts:
    python3 -m tutel.examples.helloworld_custom_expert --batch_size=16

  >> Example for NCCL timeout settings:
    TUTEL_GLOBAL_TIMEOUT_SEC=60 python3 -m torch.distributed.run --nproc_per_node=8 -m tutel.examples.helloworld --use_tensorcore

Tutel v0.3.1: Add NCCL all_to_all_v and all_gather_v for arbitrary-length message transfers:

  >> Example:
    # All_to_All_v:
    python3 -m torch.distributed.run --nproc_per_node=2 --master_port=7340 -m tutel.examples.nccl_all_to_all_v
    # All_Gather_v:
    python3 -m torch.distributed.run --nproc_per_node=2 --master_port=7340 -m tutel.examples.nccl_all_gather_v

  >> How to:
    net.batch_all_to_all_v([t_x_cuda, t_y_cuda, ..], common_send_counts)
    net.batch_all_gather_v([t_x_cuda, t_y_cuda, ..])

Tutel v0.3: Add Megablocks solution to improve decoder inference on a single GPU with num_local_experts >= 2:

  >> Example (capacity_factor=0 for dropless-MoE):
    # Using BatchMatmul:
    python3 -m tutel.examples.helloworld --megablocks_size=0 --batch_size=1 --num_tokens=32 --top=1 --eval --num_local_experts=128 --capacity_factor=0
    # Using Megablocks with block_size = 1:
    python3 -m tutel.examples.helloworld --megablocks_size=1 --batch_size=1 --num_tokens=32 --top=1 --eval --num_local_experts=128 --capacity_factor=0
    # Using Megablocks with block_size = 2:
    python3 -m tutel.examples.helloworld --megablocks_size=2 --batch_size=1 --num_tokens=32 --top=1 --eval --num_local_experts=128 --capacity_factor=0

  >> How to:
    self._moe_layer.forward(x, .., megablocks_size=1)         # Control the switch of megablocks_size (0 for disabled)

Tutel v0.2: Allow most configurations to be switched dynamically at no cost:

  >> Example:
    python3 -m torch.distributed.run --nproc_per_node=8 -m tutel.examples.helloworld_switch --batch_size=16

  >> How to:
    self._moe_layer.forward(x, .., a2a_ffn_overlap_degree=2)  # Control the switch of overlap granularity (1 for no overlapping)
    self._moe_layer.forward(x, .., adaptive_r=1)              # Control the switch of parallelism (0 for DP, 1 for DP + EP, W / E for MP + EP, else for DP + MP + EP)
    self._moe_layer.forward(x, .., capacity_factor=1)         # Control the switch of capacity_volume (positive for padding, negative for no-padding, 0 for dropless)
    self._moe_layer.forward(x, .., top_k=1)                   # Control the switch of top_k sparsity
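
The `adaptive_r` comment above packs four cases into one line. A minimal sketch of that mapping follows, assuming `W` means the distributed world size and `E` the global number of experts (the comment does not define either symbol). The helper `describe_adaptive_r` is hypothetical and is not part of Tutel.

```python
def describe_adaptive_r(adaptive_r: int, world_size: int, num_global_experts: int) -> str:
    """Return the parallelism mix that the comment associates with adaptive_r.

    Assumption: W is the world size and E is the global expert count.
    """
    if adaptive_r == 0:
        return "DP"
    if adaptive_r == 1:
        return "DP + EP"
    # W / E is only a whole number when the expert count divides the world size.
    divides = num_global_experts > 0 and world_size % num_global_experts == 0
    if divides and adaptive_r == world_size // num_global_experts:
        return "MP + EP"
    return "DP + MP + EP"


if __name__ == "__main__":
    for r in (0, 1, 2, 4):
        print(r, describe_adaptive_r(r, world_size=8, num_global_experts=2))
```

With 8 GPUs and 2 global experts, `W / E` is 4, so only `adaptive_r=4` selects the pure MP + EP case under this reading.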

Tutel v0.1: Optimize the Einsum complexity of data dispatch encoding and decoding, and add the 2DH option to handle All-to-All at scale:

  >> Example (suggest enabling 2DH only at scale; note that the value of --nproc_per_node MUST equal the total physical GPU count per node, e.g. 8 for A100x8):
    python3 -m torch.distributed.run --nproc_per_node=8 -m tutel.examples.helloworld --batch_size=16 --use_2dh

How to set up Tutel MoE for Pytorch 2 and run examples, or enable fairseq with MoE:

* Prepare Recommended Pytorch >= 2.0.0 (minimal version == 1.8.0):
        #  Windows/Linux Pytorch for NVIDIA CUDA >= 11.7:
        python3 -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
        #  Linux Pytorch for AMD ROCm == 5.4.2:
        python3 -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.4.2
        #  Windows/Linux Pytorch for CPU:
        python3 -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

* Install Tutel Online:

        $ python3 -m pip uninstall tutel -y
        $ python3 -m pip install setuptools wheel
        $ python3 -m pip install -v -U --no-build-isolation git+https://github.com/microsoft/tutel@main

* Build Tutel from Source:

        $ git clone https://github.com/microsoft/tutel --branch main

        $ python3 -m pip uninstall tutel -y
        $ python3 ./tutel/setup.py install --user

* Quick Test on Single-GPU:

        $ python3 -m tutel.examples.helloworld --batch_size=16               # Test Tutel-optimized MoE + manual distribution
        $ python3 -m tutel.examples.helloworld_ddp --batch_size=16           # Test Tutel-optimized MoE + Pytorch DDP distribution (requires: Pytorch >= 1.8.0)
        $ python3 -m tutel.examples.helloworld_ddp_tutel --batch_size=16     # Test Tutel-optimized MoE + Tutel DDP distribution (ZeRO on optimizers)
        $ python3 -m tutel.examples.helloworld_amp --batch_size=16           # Test Tutel-optimized MoE with AMP data type + manual distribution
        $ python3 -m tutel.examples.helloworld_custom_expert --batch_size=16 # Test Tutel-optimized MoE + custom defined expert layer
        $ python3 -m tutel.examples.helloworld_from_scratch                  # Test Custom MoE implementation from scratch
        $ python3 -m tutel.examples.moe_mnist                                # Test MoE layer in end-to-end MNIST dataset
        $ python3 -m tutel.examples.moe_cifar10                              # Test MoE layer in end-to-end CIFAR10 dataset

        (If building from source, the following method also works:)
        $ python3 ./tutel/examples/helloworld.py --batch_size=16
        ..

* Run Tutel MoE in Distributed Mode:

        (Method A - Torch launcher for `Multi-Node x Multi-GPU`:)
        $ ssh <node-ip-0> python3 -m torch.distributed.run --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr=<node-ip-0> -m tutel.examples.helloworld --batch_size=16
        $ ssh <node-ip-1> python3 -m torch.distributed.run --nproc_per_node=8 --nnodes=2 --node_rank=1 --master_addr=<node-ip-0> -m tutel.examples.helloworld --batch_size=16

        (Method B - Tutel launcher for `Multi-Node x Multi-GPU`, requiring package `openmpi-bin`:)
        # << Single Node >>
        $ mpiexec -bind-to none -host localhost -x LOCAL_SIZE=8 python3 -m tutel.launcher.run -m tutel.examples.helloworld_ddp_tutel --batch_size=16
        $ mpiexec -bind-to none -host localhost -x LOCAL_SIZE=8 python3 -m tutel.launcher.run -m tutel.examples.moe_mnist
        $ mpiexec -bind-to none -host localhost -x LOCAL_SIZE=8 python3 -m tutel.launcher.run -m tutel.examples.moe_cifar10
        ...

        # << Cross Nodes >>
        $ mpiexec -bind-to none -host <node-ip-0>,<node-ip-1>,.. -x MASTER_ADDR=<node-ip-0> -x LOCAL_SIZE=8 python3 -m tutel.launcher.run -m tutel.examples.helloworld --batch_size=16

        # << For CPU-based Launch>>
        $ mpiexec -bind-to none -host localhost -x LOCAL_SIZE=1 -x OMP_NUM_THREADS=1024 python3 -m tutel.launcher.run -m tutel.examples.helloworld --batch_size=16 --device cpu
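
Method A repeats the same command on every node, changing only `--node_rank`. The sketch below generates those per-node command lines so they stay consistent. The helper `torch_launch_commands` and the IP addresses are hypothetical; only the launcher flags come from the commands above.

```python
import shlex


def torch_launch_commands(node_ips, nproc_per_node, module, module_args=()):
    """Build one `ssh <node> python3 -m torch.distributed.run ...` line per node.

    The first entry of node_ips is used as the master address, as in Method A.
    """
    master = node_ips[0]
    commands = []
    for rank, ip in enumerate(node_ips):
        argv = [
            "python3", "-m", "torch.distributed.run",
            f"--nproc_per_node={nproc_per_node}",
            f"--nnodes={len(node_ips)}",
            f"--node_rank={rank}",
            f"--master_addr={master}",
            "-m", module, *module_args,
        ]
        commands.append(f"ssh {shlex.quote(ip)} {shlex.join(argv)}")
    return commands


if __name__ == "__main__":
    # Placeholder addresses; substitute the real node IPs.
    for line in torch_launch_commands(
        ["10.0.0.1", "10.0.0.2"], 8, "tutel.examples.helloworld", ["--batch_size=16"]
    ):
        print(line)
```

The function only prints the commands; running them is left to the operator, since every node has to start its launcher at roughly the same time.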

How to convert checkpoint files to adapt to different distributed world sizes:

The documentation has been moved here.

How to import Tutel-optimized MoE in Pytorch:

# Input Example:
import torch
x = torch.ones([6, 1024], device='cuda:0')

# Create MoE:
from tutel import moe as tutel_moe
moe_layer = tutel_moe.moe_layer(
    gate_type={'type': 'top', 'k': 2},
    model_dim=x.shape[-1],
    experts={
        'count_per_node': 2,
        'type': 'ffn', 'hidden_size_per_expert': 2048, 'activation_fn': lambda x: torch.nn.functional.relu(x)
    },
    scan_expert_func = lambda name, param: setattr(param, 'skip_allreduce', True),
)

# Cast to GPU
moe_layer = moe_layer.to('cuda:0')

# In distributed model, you need further skip doing allreduce on global parameters that have `skip_allreduce` mask, 
# e.g.
#    for p in moe_layer.parameters():
#        if hasattr(p, 'skip_allreduce'):
#            continue
#        dist.all_reduce(p.grad)


# Forward MoE:
y = moe_layer(x)

print(y)
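
The commented-out loop in the example skips parameters tagged with `skip_allreduce` when synchronizing gradients. Below is a minimal, framework-free sketch of that filter. The helper `sync_shared_grads`, the parameter names, and the `SimpleNamespace` stand-ins are all hypothetical. In a real job one would presumably pass `moe_layer.named_parameters()` and a callback such as `lambda g: torch.distributed.all_reduce(g)`.

```python
from types import SimpleNamespace


def sync_shared_grads(named_params, all_reduce):
    """Call all_reduce on the gradient of every parameter not tagged skip_allreduce.

    Returns the names of the parameters that were reduced.
    """
    reduced = []
    for name, param in named_params:
        if hasattr(param, "skip_allreduce"):
            continue  # expert-local parameter: each rank keeps its own copy
        if param.grad is None:
            continue  # nothing to synchronize for this parameter
        all_reduce(param.grad)
        reduced.append(name)
    return reduced


# Stand-in parameters (illustrative names and values only).
gate = SimpleNamespace(grad=[0.1, 0.2])
expert = SimpleNamespace(grad=[0.3], skip_allreduce=True)

calls = []  # records which gradients the callback received
reduced = sync_shared_grads([("gate.wg", gate), ("experts.fc1", expert)], calls.append)
print(reduced)
```

Passing the reduction as a callback keeps the filter testable without a process group, and the `hasattr` check mirrors the one in the commented snippet.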

Usage of MOELayer:

* Usage of MOELayer Args:

        gate_type        : dict-type gate description, e.g. {'type': 'top', 'k': 2, 'capacity_factor': -1.5, ..},
                              or a list of dict-type gate descriptions, e.g. [{'type': 'top', 'k': 2}, {'type': 'top', 'k': 2}],
                              the value of k in top-gating can also be negative, like -2, which indicates one GPU will hold 1/(-k) parameters of an expert
                              capacity_factor X can be positive (factor = X), zero (factor = max(needed_volumes)) or negative (factor = min(-X, max(needed_volumes))).
        model_dim        : the number of channels for MOE's input tensor
        experts          : a dict-type config for builtin expert network
        scan_expert_func : allow users to specify a lambda function to iterate over each expert's params, e.g. `scan_expert_func = lambda name, param: setattr(param, 'expert', True)`
        result_func      : allow users to specify a lambda function to format the MoE output and aux_loss, e.g. `result_func = lambda output: (output, output.l_aux)`
        group            : specify the explicit communication group of all_to_all
        seeds            : a tuple of three ints specifying the manual seeds of (shared params, local params, other params after the MoE layer)
        a2a_ffn_overlap_degree : the value to control a2a overlap depth, 1 by default for no overlap, 2 for overlap a2a with half gemm, ..
        parallel_type    : the parallel method to compute MoE, valid types: 'auto', 'data', 'model'
        pad_samples      : whether to auto-pad newly arriving input data to the maximum data size seen so far
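
The three `capacity_factor` cases listed under `gate_type` can be written out directly. This is a minimal sketch of the stated rule only. The helper `effective_capacity_factor` is hypothetical, and treating `needed_volumes` as a list of per-expert values in the same units as the factor is an assumption.

```python
def effective_capacity_factor(x: float, needed_volumes) -> float:
    """Apply the documented rule for capacity_factor X.

    positive -> X
    zero     -> max(needed_volumes)
    negative -> min(-X, max(needed_volumes))
    """
    peak = max(needed_volumes)
    if x > 0:
        return x
    if x == 0:
        return peak
    return min(-x, peak)


if __name__ == "__main__":
    volumes = [0.5, 2.0]  # illustrative values only
    for x in (1.5, 0, -1.5, -3.0):
        print(x, effective_capacity_factor(x, volumes))
```

Read this way, a negative value acts as an upper bound: the factor follows the actual demand but never exceeds `-X`.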

* Usage of dict-type Experts Config:

        count_per_node   : the number of local experts per device (by default, the value is 1 if not specified)
        type             : available built-in experts implementation, e.g: ffn
        hidden_size_per_expert : the hidden size between two linear layers for each expert (used for type == 'ffn' only)
        activation_fn    : the custom-defined activation function between two linear layers (used for type == 'ffn' only)
        has_fc1_bias     : If set to False, the expert bias parameters `batched_fc1_bias` is disabled. Default: True
        has_fc2_bias     : If set to False, the expert bias parameters `batched_fc2_bias` is disabled. Default: True
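
To get a feel for how the experts config affects model size, the sketch below estimates the per-device parameter count of the built-in `ffn` experts. It assumes a plain two-layer FFN (`model_dim -> hidden_size_per_expert -> model_dim`) with one bias vector per layer. The exact tensor layout inside Tutel is not stated here, and `ffn_expert_param_count` is a hypothetical helper.

```python
def ffn_expert_param_count(model_dim: int, experts: dict) -> int:
    """Estimate local expert parameters for a dict-type 'ffn' experts config.

    Assumption: each expert is two linear layers, with optional biases of
    size hidden_size_per_expert (fc1) and model_dim (fc2).
    """
    count = experts.get("count_per_node", 1)  # documented default is 1
    hidden = experts["hidden_size_per_expert"]
    per_expert = 2 * model_dim * hidden  # fc1 and fc2 weight matrices
    if experts.get("has_fc1_bias", True):
        per_expert += hidden
    if experts.get("has_fc2_bias", True):
        per_expert += model_dim
    return count * per_expert


if __name__ == "__main__":
    config = {"count_per_node": 2, "type": "ffn", "hidden_size_per_expert": 2048}
    print(ffn_expert_param_count(1024, config))
```

With the values from the import example (`model_dim=1024`, two local experts, hidden size 2048), this estimate comes to 8,394,752 parameters per device.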

Reference

To learn more technical details about Tutel, refer to the paper below:

@article{tutel,
author = {Changho Hwang and Wei Cui and Yifan Xiong and Ziyue Yang and Ze Liu and Han Hu and Zilong Wang and Rafael Salas and Jithin Jose and Prabhat Ram and Joe Chau and Peng Cheng and Fan Yang and Mao Yang and Yongqiang Xiong},
title = {Tutel: Adaptive Mixture-of-Experts at Scale},
year = {2022},
month = jun,
journal = {CoRR},
volume= {abs/2206.03382},
url = {https://arxiv.org/pdf/2206.03382.pdf},
}

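Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information, see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.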
