Reparameterized Discrete Diffusion Models for Text Generation

This repository contains the official implementation of the paper A Reparameterized Discrete Diffusion Model for Text Generation.

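Dependencies

The codebase is implemented with FairSeq. To install the dependencies, run the following commands (a virtual environment is recommended):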

pip install -r requirements.txt

# install our package of discrete diffusion models
pip install -e discrete_diffusion

# install our fork of fairseq
cd fairseq
python3 setup.py build develop
cd ..

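Note that the environment was tested with Python 3.8.10, PyTorch 1.10.0/1.12.0, and CUDA 11.3. Also note that our fork of fairseq modifies several files in the original codebase; using a more recent version of fairseq may lead to unexpected dependency conflicts.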

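Basic Usage of the Discrete-Diffusion Library

We implement discrete diffusion models in a self-contained library, discrete_diffusion, for general use. The library provides implementations of various typical discrete diffusion models, including

(Vanilla/Reparameterized) multinomial diffusion: a diffusion process that injects uniform noise into the token sequence. The implementation of vanilla multinomial diffusion closely follows the codebase of the original paper.
(Vanilla/Reparameterized) absorbing diffusion: a diffusion process in which tokens in the sequence can be absorbed into the masking state, as described in the D3PM paper.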
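
To make the difference between the two corruption processes concrete, here is a minimal pure-Python sketch. It is not the library's code: the linear schedule in `keep_prob`, the `MASK_ID` value, and both function names are assumptions chosen for illustration, and token sequences are plain lists rather than tensors.

```python
import random

MASK_ID = 0  # hypothetical id reserved for the absorbing [MASK] state


def keep_prob(t, num_timesteps):
    """Assumed linear schedule: probability that a token is still clean at step t."""
    return 1.0 - t / num_timesteps


def absorbing_q_sample(x_0, t, num_timesteps, rng):
    """Replace each token by MASK_ID independently with probability 1 - keep_prob."""
    keep = keep_prob(t, num_timesteps)
    return [tok if rng.random() < keep else MASK_ID for tok in x_0]


def multinomial_q_sample(x_0, t, num_timesteps, vocab_size, rng):
    """Replace each token by a uniformly drawn vocabulary id with probability 1 - keep_prob."""
    keep = keep_prob(t, num_timesteps)
    return [tok if rng.random() < keep else rng.randrange(vocab_size) for tok in x_0]


rng = random.Random(0)
x_0 = [5, 17, 42, 8, 23]
print(absorbing_q_sample(x_0, t=25, num_timesteps=50, rng=rng))
print(multinomial_q_sample(x_0, t=25, num_timesteps=50, vocab_size=100, rng=rng))
```

In both cases a token survives with the same probability; the processes differ only in what a corrupted position becomes (a dedicated mask state versus a random vocabulary item).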

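These diffusion models share the same set of interfaces, which allows for external use. In particular, they are defined as subclasses of the DiscreteDiffusion class, which takes the following form: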

import torch.nn as nn


class DiscreteDiffusion(nn.Module):
    """
    The parent class for discrete denoising diffusion probabilistic models.

    It supports the following methods:
    - q_sample()
        Sample x_t ~ q(x_t | x_0) to construct noisy Transformer inputs.
    - compute_losses()
        Compute the loss L_t = KL(q||p) at t-th time step.
    - sample_step()
        Sample x_{t-1} ~ p(x_{t-1} | x_t, x_0) at t-th time step.
    """

    def __init__(self, num_timesteps):
        super().__init__()
        self.num_timesteps = num_timesteps

    def q_sample(self, x_0, t, **kwargs):
        """

        Sample from q(x_t | x_0), which is used as the model inputs.

        Args:
            x_0: token ids with shape [B, N]
            t: current time step, tensor with shape [B]

        Returns:
            return a dict of relevant outputs including x_t.
            
        """

    def compute_losses(self, inputs, **kwargs):
        """
        
        Compute the loss objective KL(q||p) to train our generative process.

        Args:
            inputs: a dict that contains input types specific to different diffusion processes, containing
                - x_t: token ids with shape [B, N]
                - t: scalar timesteps, with shape [B]

        Returns:
            possibly return a dict of relevant outputs, including the loss used for training.
            
        """

    def sample_step(self, decoder_out, denoising_fn, **kwargs):
        """
        Given a time step t, start from x_t and sample x_{t-k} from q(x_{t-k} | x_t).
        
        Args:
            decoder_out: a namedtuple that contains decoding info, including
                - x_t: token ids with shape [B, N]
                - t: scalar timesteps
                - max_steps: the maximum number of decoding steps
                - ...
            
            denoising_fn: a function that takes in x_t and t and returns model logits

            kwargs: other arguments that are used to control decoding.
        
        Returns:
            return a new decoder_out namedtuple.
        """

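A DiscreteDiffusion model can be instantiated by configuring the following:

Basic attributes, including
- --num-diffusion-timesteps <int> specifies the total number of diffusion time steps (default: 50)
- --diffusion-type <str> specifies the diffusion model type (choices: {absorbing, multinomial, reparam-absorbing, reparam-multinomial})
- --noise-scheduler-type <str> specifies the noise schedule, only in vanilla/reparam multinomial diffusion (typical choices: {linear, cosine}; default: cosine)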
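Important arguments specific to the forward sampling routine in q_sample(), including
- --q-sample-mode <str> specifies the sampling strategy (choices: {default, coupled, multi-step, multi-sample}; default: default). We provide several ways of sampling from $q(x_t | x_0)$ to prepare corrupted token sequences for denoising, including
  - default: a single sample is drawn as $x_t \sim q(x_t | x_0)$, identical to previous practice;
  - multi-step: sample two i.i.d. time steps $s, t$ and draw $x_s \sim q(x_s | x_0)$ and $x_t \sim q(x_t | x_0)$, respectively. We then optimize the average $\frac{1}{2}(\mathcal{L}_s + \mathcal{L}_t)$ for variance reduction;
  - multi-sample: draw two i.i.d. samples $x_t \sim q(x_t | x_0)$ and $x_t' \sim q(x_t | x_0)$ at the same time step, and compute the loss averaged over these two samples;
  - coupled: also known as conditioned training, which is detailed in Appendix F of the paper. This starts with sampling two i.i.d. time steps $s, t$ (assume $s < t$). We draw $x_t \sim q(x_t | x_0)$ as usual, but draw $x_s$ from a distribution conditioned on $x_t$, i.e., $x_s \sim q(x_s | x_t, x_0)$. We then compute the average $\frac{1}{2}(\mathcal{L}_s + \mathcal{L}_t)$ as the objective. This strategy simulates the backward transition process and helps stabilize training. In preliminary experiments, we found that the coupled sampling mode brings significant improvements for both vanilla multinomial and absorbing diffusion, but the gain is not consistent for the reparameterized variants.
- --not-diffusing-special-sym indicates whether to include special symbols during the diffusion process (default: false)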
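Important arguments specific to the loss objective calculation in compute_losses(), including
- --reweighting-type <str> specifies the reweighting scheme in our reparameterized family (choices: {linear, reciprocal, none}; default: linear)
- --label-smoothing <float> specifies the rate of label smoothing (default: 0.1)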
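Important arguments specific to the decoding routine in sample_step(), including
- --argmax-decoding indicates whether to use argmax decoding for the denoised Transformer output $\tilde{x}_0$ (default: false)
- --temperature <float> specifies the temperature $\tau$ for sampling $\tilde{x}_0 \sim \operatorname{Categorical}(f(x_t; \theta) / \tau)$ if the argmax decoding scheme is not used (default: 1.0)
- --decoding-strategy <str> specifies whether to use the vanilla (default) or reparameterized (reparam-<options>; see the details) decoding strategy (choices: {default, reparam-<options>}; default: default)
- --load-ema-weights indicates whether to load the EMA model weights for generation (default: false)
- --iter-decode-max-iter <int> specifies the maximum number of time steps for decoding (default: 10)
- --iter-decode-with-beam <int> specifies the beam size for decoding multiple sequences with different lengths in parallel (default: 1)
- --iter-decode-force-max-iter indicates that iterative decoding must run for the specified number of iterations and not exit early. Setting this flag to true is recommended.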
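
The coupled sampling mode described in the list above can be sketched for absorbing diffusion as follows. This is an illustration under stated assumptions rather than the library's q_sample(): the linear masking schedule, `MASK_ID`, and the function names are hypothetical, and sequences are plain Python lists. For absorbing diffusion, $q(x_s | x_t, x_0)$ keeps every clean token of $x_t$ clean and keeps a masked token masked with probability equal to the ratio of the masking probabilities at $s$ and $t$.

```python
import random

MASK_ID = 0  # hypothetical [MASK] id


def mask_prob(t, num_timesteps):
    """Assumed linear schedule: probability that a token is masked at step t."""
    return t / num_timesteps


def coupled_q_sample(x_0, num_timesteps, rng):
    """Draw x_t ~ q(x_t | x_0), then x_s ~ q(x_s | x_t, x_0) with s <= t."""
    a = rng.randint(1, num_timesteps)
    b = rng.randint(1, num_timesteps)
    s, t = min(a, b), max(a, b)

    # x_t ~ q(x_t | x_0): mask each token independently.
    p_t = mask_prob(t, num_timesteps)
    x_t = [MASK_ID if rng.random() < p_t else tok for tok in x_0]

    # x_s ~ q(x_s | x_t, x_0): clean tokens stay clean; a masked token
    # stays masked with probability mask_prob(s) / mask_prob(t).
    ratio = mask_prob(s, num_timesteps) / p_t
    x_s = [
        noisy if noisy != MASK_ID else (MASK_ID if rng.random() < ratio else clean)
        for clean, noisy in zip(x_0, x_t)
    ]
    return (s, x_s), (t, x_t)


rng = random.Random(0)
(s, x_s), (t, x_t) = coupled_q_sample([5, 17, 42, 8, 23], num_timesteps=50, rng=rng)
print(s, x_s)
print(t, x_t)
```

The training objective would then average the per-step losses on the two coupled samples, $\frac{1}{2}(\mathcal{L}_s + \mathcal{L}_t)$; if the two draws coincide ($s = t$), the sketch simply returns two identical sequences.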

Please refer to the repository for a more comprehensive list of arguments.
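
As an illustration of how --argmax-decoding and --temperature interact, the sketch below picks one token from a single position's logits: either the argmax, or a draw from $\operatorname{Categorical}(f(x_t; \theta) / \tau)$. The function name `sample_x0` and the list-based logits are assumptions for this example; the library operates on batched tensors.

```python
import math
import random


def sample_x0(logits, temperature=1.0, argmax_decoding=False, rng=None):
    """Pick a token id from one position's logits."""
    if argmax_decoding:
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [0.1, 2.0, -1.0]
print(sample_x0(logits, argmax_decoding=True))
print(sample_x0(logits, temperature=1.0, rng=random.Random(0)))
```

Lower temperatures concentrate the distribution on the highest-scoring token, so sampling approaches argmax decoding as $\tau \to 0$.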

Decoding Strategies

Vanilla Sampling Scheme

By passing --decoding-strategy default, the vanilla sampling scheme (specific to each discrete diffusion process) is used.

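Improved Sampling with Reparameterization

A more advanced decoding approach can be invoked by passing --decoding-strategy reparam-<conditioning-of-v>-<topk_mode>-<schedule>. This approach is based on the reparameterization proposed in our paper and allows for more effective decoding procedures. The options specify the decoding algorithm via

<conditioning-of-v>: uncond or cond (default uncond): whether to generate the routing variable $v_t$ in a conditional or unconditional manner;
<topk_mode>: stochastic<float> or deterministic (default deterministic): whether to use stochastic or deterministic top-$k$ selection. The float value in stochastic<float> specifies the degree of randomness in stochastic top-$k$ selection;
<schedule>: linear or cosine (default cosine): the schedule for $k$ during the denoising procedure, which controls the number of top-$k$ tokens to be denoised at the next decoding step.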
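
To show how a strategy string of this shape breaks down into its three options, here is a hypothetical parser. It is not the repository's parsing code: `ReparamOptions` and `parse_decoding_strategy` are names invented for this sketch, and it assumes the float is written immediately after `stochastic` with no separator, as the `stochastic<float>` notation suggests.

```python
import re
from typing import NamedTuple


class ReparamOptions(NamedTuple):
    conditioning: str  # "uncond" or "cond"
    topk_mode: str     # "deterministic" or "stochastic"
    randomness: float  # only meaningful in the stochastic mode
    schedule: str      # "linear" or "cosine"


_PATTERN = re.compile(
    r"^reparam-(uncond|cond)-(deterministic|stochastic(\d+(?:\.\d+)?))-(linear|cosine)$"
)


def parse_decoding_strategy(strategy):
    """Return None for the vanilla scheme, else the parsed reparam options."""
    if strategy == "default":
        return None
    match = _PATTERN.match(strategy)
    if match is None:
        raise ValueError(f"unrecognized decoding strategy: {strategy!r}")
    conditioning, mode, randomness, schedule = match.groups()
    if randomness is not None:
        return ReparamOptions(conditioning, "stochastic", float(randomness), schedule)
    return ReparamOptions(conditioning, mode, 0.0, schedule)


print(parse_decoding_strategy("reparam-uncond-deterministic-cosine"))
print(parse_decoding_strategy("reparam-cond-stochastic0.5-linear"))
```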

See the implementation for more details about the options.
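
As a rough picture of what the top-$k$ mode and the schedule control, the sketch below computes how many tokens are treated as denoised at a given step and then picks those positions by score. Everything here is an assumption made for illustration: the exact schedule formulas, the use of Gumbel noise for the stochastic mode, and the names `num_tokens_to_keep` and `select_topk` are not taken from the repository.

```python
import math
import random


def num_tokens_to_keep(step, max_steps, length, schedule="cosine"):
    """Assumed schedules for k: tokens treated as denoised after `step` of `max_steps`."""
    frac = step / max_steps
    if schedule == "linear":
        rate = frac
    elif schedule == "cosine":
        rate = 1.0 - math.cos(0.5 * math.pi * frac)
    else:
        raise ValueError(f"unknown schedule: {schedule!r}")
    return max(1, min(length, round(rate * length)))


def select_topk(scores, k, randomness=0.0, rng=None):
    """Return the sorted indices of the k highest (optionally Gumbel-perturbed) scores."""
    rng = rng or random.Random()
    perturbed = []
    for score in scores:
        if randomness > 0.0:
            # Clamp away from 0 and 1 so both logarithms are defined.
            u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
            score = score + randomness * -math.log(-math.log(u))
        perturbed.append(score)
    order = sorted(range(len(scores)), key=lambda i: perturbed[i], reverse=True)
    return sorted(order[:k])


scores = [0.1, 0.9, 0.5, 0.7]  # e.g. per-position model confidence
k = num_tokens_to_keep(step=5, max_steps=10, length=len(scores), schedule="linear")
print(k, select_topk(scores, k))
print(select_topk(scores, k, randomness=1.0, rng=random.Random(0)))
```

With `randomness=0.0` the selection is deterministic; larger values let lower-scoring positions be picked more often.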

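Machine Translation

Data Preprocessing

Please see the scripts below for details.

Note
All tasks considered in this work operate on the original data and do not adopt knowledge distillation (KD).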

IWSLT14 DE-EN

We follow the standard preprocessing in fairseq/examples to prepare the binarized data:

 # fetch and preprocess the data to BPE codes
cd examples/translation/
bash prepare-iwslt14.sh
cd ../..

# binarize the data
TEXT=examples/translation/iwslt14.tokenized.de-en
fairseq-preprocess --joined-dictionary --source-lang de --target-lang en \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir data-bin/iwslt14.tokenized.de-en \
    --workers 20

WMT14 EN-DE

We use the data released in fairseq/examples to prepare the dataset:

wget http://dl.fbaipublicfiles.com/nat/original_dataset.zip
unzip original_dataset.zip
TEXT=wmt14_ende
fairseq-preprocess --joined-dictionary \
    --source-lang en --target-lang de \
    --trainpref $TEXT/train.en-de --validpref $TEXT/valid.en-de --testpref $TEXT/test.en-de \
    --destdir data-bin/wmt14_ende --thresholdtgt 0 --thresholdsrc 0 \
    --workers 20

WMT16 EN-RO

For this dataset, we use the raw data wmt16.tar.gz as pre-processed in this repository.

tar xzvf wmt16.tar.gz

TEXT=wmt16/en-ro

# move train/ dev/ test/ bpe codes into the $TEXT folder
mv $TEXT/train/corpus.bpe.en $TEXT/train.bpe.en
mv $TEXT/train/corpus.bpe.ro $TEXT/train.bpe.ro
mv $TEXT/dev/dev.bpe.en $TEXT/dev.bpe.en
mv $TEXT/dev/dev.bpe.ro $TEXT/dev.bpe.ro
mv $TEXT/test/test.bpe.en $TEXT/test.bpe.en
mv $TEXT/test/test.bpe.ro $TEXT/test.bpe.ro

# binarize the data
fairseq-preprocess --joined-dictionary \
    --source-lang en --target-lang ro \
    --trainpref $TEXT/train.bpe --validpref $TEXT/dev.bpe --testpref $TEXT/test.bpe \
    --destdir data-bin/wmt16_enro --thresholdtgt 0 --thresholdsrc 0 \
    --workers 20

Training

We first cd into the fairseq folder and then run the following commands to train the models.

######## training scripts for IWSLT'14, WMT'14, and WMT'16
# first cd to fairseq
# we use 1 GPU for IWSLT'14, 4 GPUs for WMT'14 and 2 GPUs for WMT'16 datasets respectively.
CUDA_VISIBLE_DEVICES=0 bash experiments/mt_train.sh -m absorbing -d <iwslt/wmt14/wmt16> -s default -e True --store-ema --label-smoothing 0.1
CUDA_VISIBLE_DEVICES=1 bash experiments/mt_train.sh -m multinomial -d <iwslt/wmt14/wmt16> -s default -e True --not-diffusing-special-sym --store-ema --label-smoothing 0.0
CUDA_VISIBLE_DEVICES=2 bash experiments/mt_train.sh -m reparam-absorbing -d <iwslt/wmt14/wmt16> -s default -e True --q-sample-mode coupled --store-ema --label-smoothing 0.1 --reweighting-type linear
CUDA_VISIBLE_DEVICES=3 bash experiments/mt_train.sh -m reparam-multinomial -d <iwslt/wmt14/wmt16> -s default -e True --not-diffusing-special-sym --q-sample-mode coupled --store-ema --label-smoothing 0.1 --reweighting-type linear

Note
-s <str> is used to specify the name of the experiment.
Custom arguments specific to training can be passed by appending them after -e True.

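Generation & Evaluation

The evaluation pipeline is handled by experiments/mt_generate.sh. The script generates the translation results and evaluates the BLEU score.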

########### IWSLT'14, WMT'14, and WMT'16 datasets
# we recommend putting each checkpoint into a separate folder
# since the script will put the decoded results into a file under the same folder of each checkpoint.
CUDA_VISIBLE_DEVICES=0 bash experiments/mt_generate.sh -a false -c <checkpoint_path> -d <iwslt/wmt14/wmt16>

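Arguments:

-a: whether to average multiple checkpoints
-c: indicates the location of the checkpoint. If -a false (not averaging checkpoints), pass the checkpoint path; if -a true, pass the directory that stores multiple checkpoints at different training steps for averaging.
-d: the name of the dataset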
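
For readers unfamiliar with checkpoint averaging (the -a option), the idea is an element-wise mean of parameters saved at different training steps. The sketch below illustrates only that idea on plain dictionaries of float lists; `average_checkpoints` is a name chosen for this example, and how the script loads and averages real checkpoints is not shown here.

```python
def average_checkpoints(state_dicts):
    """Element-wise mean of parameters that share the same names across checkpoints."""
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    names = state_dicts[0].keys()
    for state in state_dicts[1:]:
        if state.keys() != names:
            raise ValueError("checkpoints do not share the same parameter names")
    count = len(state_dicts)
    return {
        name: [sum(values) / count for values in zip(*(state[name] for state in state_dicts))]
        for name in names
    }


# Two toy "checkpoints" with one weight vector and one bias each.
ckpt_a = {"w": [1.0, 3.0], "b": [0.0]}
ckpt_b = {"w": [3.0, 5.0], "b": [2.0]}
print(average_checkpoints([ckpt_a, ckpt_b]))
```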

Trained Model Checkpoints

We also provide the checkpoints of our trained models.

Dataset	Model	Checkpoint link
IWSLT'14	Multinomial	link
IWSLT'14	Absorbing	link
IWSLT'14	Reparam-multinomial	link
IWSLT'14	Reparam-absorbing	link
WMT'14	Multinomial	link
WMT'14	Absorbing	link
WMT'14	Reparam-multinomial	link
WMT'14	Reparam-absorbing	link
WMT'16	Multinomial	link
WMT'16	Absorbing	link
WMT'16	Reparam-multinomial	link
WMT'16	Reparam-absorbing	link

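Question Generation and Paraphrasing Tasks

We follow the experimental setup in DiffuSeq for the question generation and paraphrasing tasks.

Data Preprocessing

The raw data for these two tasks can be fetched from the original DiffuSeq repository. We then binarize the data with the provided script.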

 # put the raw data in the directory ``diffuseq_data/QG``
# Preprocess the question generation dataset
bash diffusion_mt/scripts/preprocess_diffuseq_datasets.sh QG

# put the raw data in the directory ``diffuseq_data/QQP``
# Preprocess the paraphrasing dataset
bash diffusion_mt/scripts/preprocess_diffuseq_datasets.sh QQP

Training

# QQP or QG datasets
# first cd to fairseq
CUDA_VISIBLE_DEVICES=0,1 bash experiments/diffuseq_train.sh -m absorbing -d <qqp/qg> -s default -e True --store-ema --label-smoothing 0.1
CUDA_VISIBLE_DEVICES=2,3 bash experiments/diffuseq_train.sh -m multinomial -d <qqp/qg> -s default -e True --not-diffusing-special-sym --store-ema --label-smoothing 0.0
CUDA_VISIBLE_DEVICES=0,1 bash experiments/diffuseq_train.sh -m reparam-multinomial -d <qqp/qg> -s default -e True --not-diffusing-special-sym --q-sample-mode coupled --store-ema --label-smoothing 0.1 --reweighting-type linear
CUDA_VISIBLE_DEVICES=2,3 bash experiments/diffuseq_train.sh -m reparam-absorbing -d <qqp/qg> -s default -e True --q-sample-mode coupled --store-ema --label-smoothing 0.1 --reweighting-type linear

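Generation & Evaluation

We closely follow the generation and evaluation protocols in DiffuSeq to ensure a head-to-head comparison. The whole pipeline is re-implemented in fairseq/diffusion_mt/scripts/decode_diffuseq.py and fairseq/diffusion_mt/scripts/eval_diffuseq.py to be compatible with fairseq. Run the following commands: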

# we recommend putting each checkpoint into a separate folder
# since the script will put the decoded results into a file under the same folder of each checkpoint.
CUDA_VISIBLE_DEVICES=0 bash experiments/diffuseq_generate.sh -a false -b true -c <checkpoint_path> -d <qqp/qg>

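Arguments:

-a: whether to average multiple checkpoints
-b: whether to use multiple samples for MBR decoding
-c: indicates the location of the checkpoint. If -a false (not averaging checkpoints), pass the checkpoint path; if -a true, pass the directory that stores multiple checkpoints at different training steps for averaging.
-d: the name of the dataset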
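
The -b option refers to minimum Bayes risk (MBR) decoding, which selects, among several sampled outputs, the one most similar on average to the others. The sketch below illustrates that selection rule only. It swaps in a simple unigram-overlap F1 as the similarity function, which is an assumption for brevity and not the metric used by the evaluation scripts; `overlap_f1` and `mbr_select` are hypothetical names.

```python
from collections import Counter


def overlap_f1(hyp, ref):
    """Unigram-overlap F1 between two token lists (a simple stand-in similarity)."""
    if not hyp or not ref:
        return 0.0
    common = sum((Counter(hyp) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision = common / len(hyp)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates):
    """Return the candidate with the highest mean similarity to all the others."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]

    def expected_utility(i):
        others = [cand for j, cand in enumerate(candidates) if j != i]
        return sum(overlap_f1(candidates[i], other) for other in others) / len(others)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]


samples = [
    "how are you".split(),
    "how are you today".split(),
    "how are you".split(),
    "what is this".split(),
]
print(" ".join(mbr_select(samples)))
```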

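Trained Model Checkpoints

We also provide the checkpoints of our trained models.

Dataset	Model	Checkpoint link
QG	Multinomial	link
QG	Absorbing	link
QG	Reparam-multinomial	link
QG	Reparam-absorbing	link
QQP	Multinomial	link
QQP	Absorbing	link
QQP	Reparam-multinomial	link
QQP	Reparam-absorbing	link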

Citation

@article{zheng2023rdm,
  title={A Reparameterized Discrete Diffusion Model for Text Generation},
  author={Zheng, Lin and Yuan, Jianbo and Yu, Lei and Kong, Lingpeng},
  journal={arXiv preprint arXiv:2302.05737},
  year={2023}
}