Reparameterized Discrete Diffusion Models for Text Generation

This repository contains the official implementation of the paper A Reparameterized Discrete Diffusion Model for Text Generation.

Dependencies

The codebase is implemented with Fairseq. To install the dependencies, run the following commands (a virtual environment is recommended):

pip install -r requirements.txt

# install our package of discrete diffusion models
pip install -e discrete_diffusion

# install our fork of fairseq
cd fairseq
python3 setup.py build develop
cd ..

Note: the environment has been tested with Python 3.8.10, PyTorch 1.10.0/1.12.0, and CUDA 11.3. Also note that our fork of Fairseq modifies several files in the original codebase; using a more recent version of Fairseq may lead to unexpected conflicts.

Basic Usage of the Discrete-Diffusion Library

We implement discrete diffusion models in a self-contained library, discrete_diffusion, for general use. The library provides implementations of several typical discrete diffusion models:

(Vanilla/Reparameterized) multinomial diffusion: diffusion processes that inject uniform noise into the token sequence. The implementation of vanilla multinomial diffusion closely follows the codebase of the original paper.
(Vanilla/Reparameterized) absorbing diffusion: diffusion processes in which tokens within the sequence can be absorbed into the masking state, as described in the D3PM paper.

Implementation details and arguments

These diffusion models share the same set of interfaces so that they can be used externally. In particular, they are defined as subclasses of the DiscreteDiffusion class, which takes the following form:

import torch.nn as nn


class DiscreteDiffusion(nn.Module):
    """
    The parent class for discrete denoising diffusion probabilistic models.

    It supports the following methods:
    - q_sample()
        Sample x_t ~ q(x_t | x_0) to construct noisy Transformer inputs.
    - compute_losses()
        Compute the loss L_t = KL(q||p) at t-th time step.
    - sample_step()
        Sample x_{t-1} ~ p(x_{t-1} | x_t, x_0) at t-th time step.
    """
    
    def __init__(self, num_timesteps):
        super().__init__()
        self.num_timesteps = num_timesteps

    def q_sample(self, x_0, t, **kwargs):
        """

        Sample from q(x_t | x_0), which is used as the model inputs.

        Args:
            x_0: token ids with shape [B, N]
            t: current time step, tensor with shape [B]

        Returns:
            return a dict of relevant outputs including x_t.
            
        """

    def compute_losses(self, inputs, **kwargs):
        """
        
        Compute the loss objective KL(q||p) to train our generative process.

        Args:
            inputs: a dict that contains input types specific to different diffusion processes, containing
                - x_t: token ids with shape [B, N]
                - t: scalar timesteps, with shape [B]

        Returns:
            possibly return a dict of relevant outputs, including the loss used for training.
            
        """

    def sample_step(self, decoder_out, denoising_fn, **kwargs):
        """
        Given a time step t, start from x_t and sample x_{t-k} from q(x_{t-k} | x_t).
        
        Args:
            decoder_out: a namedtuple that contains decoding info, including
                - x_t: token ids with shape [B, N]
                - t: scalar timesteps
                - max_steps: the maximum number of decoding steps
                - ...
            
            denoising_fn: a function that takes in x_t and t and returns model logits

            kwargs: other arguments that are used to control decoding.
        
        Returns:
            return a new decoder_out namedtuple.
        """
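
To make the role of q_sample() more concrete, here is a minimal pure-Python sketch of a forward step for an absorbing process. It is not the library's implementation: the MASK symbol, the function name absorbing_q_sample, and the linear masking rate t / T are all assumptions chosen for illustration. It only mirrors the documented contract of returning a dict that contains x_t.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol, used for illustration only


def absorbing_q_sample(x_0, t, num_timesteps, rng=random):
    """Toy forward process: mask each token independently with probability t / T.

    The linear masking rate t / T is an assumption made for this sketch,
    not necessarily the schedule used by the library.
    """
    mask_prob = t / num_timesteps
    x_t = [MASK if rng.random() < mask_prob else tok for tok in x_0]
    # Mirror the documented interface: return a dict that includes x_t.
    return {"x_t": x_t, "t": t}


if __name__ == "__main__":
    tokens = ["the", "cat", "sat", "on", "the", "mat"]
    print(absorbing_q_sample(tokens, t=25, num_timesteps=50))
```

At t = 0 no token is masked, and at t = num_timesteps every token is masked, which matches the intuition that the sequence is fully absorbed at the final step.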

A DiscreteDiffusion model can be instantiated by configuring the following:

Basic attributes, including
- --num-diffusion-timesteps <int> specifies the total number of diffusion time steps (default: 50)
- --diffusion-type <str> specifies the type of diffusion model (choices: {absorbing, multinomial, reparam-absorbing, reparam-multinomial})
- --noise-scheduler-type <str> specifies the noise schedule, used only in vanilla/reparam multinomial diffusion (typical choices: {linear, cosine}; default: cosine)
Important arguments specific to the forward sampling routine in q_sample(), including
- --q-sample-mode <str> specifies the sampling strategy (choices: {default, coupled, multi-step, multi-sample}; default: default). We provide several choices for sampling from $q(x_t | x_0)$ to prepare corrupted token sequences for denoising, including
  - default: a single sample is drawn as $x_t \sim q(x_t | x_0)$, identical to previous practice;
  - multi-step: sample two i.i.d. time steps $s, t$ and draw $x_s \sim q(x_s | x_0)$ and $x_t \sim q(x_t | x_0)$, respectively. We then optimize the average $\frac{1}{2}(\mathcal{L}_s + \mathcal{L}_t)$ for variance reduction;
  - multi-sample: draw two i.i.d. samples $x_t \sim q(x_t | x_0)$ and $x_t^{'} \sim q(x_t | x_0)$ at the same time step, and compute the loss averaged over these two samples;
  - coupled: also known as conditioned training, detailed in Appendix F of the paper. This starts with sampling two i.i.d. time steps $s, t$ (assume $s < t$). We draw $x_t \sim q(x_t | x_0)$ as usual, but draw $x_s$ from a distribution conditioned on $x_t$, i.e., $x_s \sim q(x_s | x_t, x_0)$. We then compute the average $\frac{1}{2}(\mathcal{L}_s + \mathcal{L}_t)$ as the objective. This strategy simulates the backward transition process and helps stabilize training. In preliminary experiments, we found that the coupled sampling mode brings significant improvements for vanilla multinomial/absorbing diffusion, but the gain is not consistently substantial in the reparameterized variants.
- --not-diffusing-special-sym indicates whether to include special symbols during the diffusion process (default: false)
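
The coupled mode can be sketched for a toy absorbing process as follows. This is an illustration under stated assumptions, not the repository's code: the MASK symbol, the function names, and the linear marginal masking rate t / T are hypothetical. Under that assumed rate, a token that is masked at step t is also masked at the earlier step s with probability s / t, and a token that is unmasked at step t is necessarily unmasked at step s.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol, used for illustration only


def coupled_absorbing_sample(x_0, s, t, num_timesteps, rng=random):
    """Draw x_t ~ q(x_t | x_0), then x_s ~ q(x_s | x_t, x_0) for s < t.

    Assumes a toy absorbing process whose marginal masking rate is t / T.
    """
    assert 0 <= s < t <= num_timesteps
    x_t = [MASK if rng.random() < t / num_timesteps else tok for tok in x_0]
    # A token masked at step t is still masked at step s with probability
    # (s / T) / (t / T) = s / t; tokens unmasked at step t stay unmasked.
    x_s = [
        MASK if (cur == MASK and rng.random() < s / t) else tok
        for cur, tok in zip(x_t, x_0)
    ]
    return x_s, x_t


def averaged_loss(loss_s, loss_t):
    """Objective of the multi-step and coupled modes: (L_s + L_t) / 2."""
    return 0.5 * (loss_s + loss_t)


if __name__ == "__main__":
    rng = random.Random(0)
    x_s, x_t = coupled_absorbing_sample(list("abcdefgh"), 10, 40, 50, rng)
    print(x_s)
    print(x_t)
```

By construction, every position that is masked in x_s is also masked in x_t, so x_s always looks like a partially denoised version of x_t.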
Important arguments specific to the loss computation in compute_losses(), including
- --reweighting-type <str> specifies the reweighting scheme in our reparameterized family (choices: {linear, reciprocal, none}; default: linear)
- --label-smoothing <float> specifies the label smoothing rate (default: 0.1)
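
As a reminder of what a label smoothing rate does, the sketch below computes a label-smoothed negative log-likelihood for a single token in pure Python. It uses one common convention (mixing the one-hot target with a uniform distribution over the vocabulary); the exact formula used by the repository's Fairseq fork may differ, and the function name is hypothetical.

```python
import math


def label_smoothed_nll(log_probs, target, epsilon=0.1):
    """Label-smoothed NLL for one token.

    log_probs: list of log-probabilities over the vocabulary.
    target: index of the gold token.
    epsilon: smoothing rate; 0.0 recovers the plain NLL.
    Convention assumed here: (1 - eps) * one_hot + eps * uniform.
    """
    vocab_size = len(log_probs)
    nll = -log_probs[target]
    uniform_term = -sum(log_probs) / vocab_size
    return (1.0 - epsilon) * nll + epsilon * uniform_term


if __name__ == "__main__":
    probs = [0.7, 0.2, 0.1]
    log_probs = [math.log(p) for p in probs]
    print(label_smoothed_nll(log_probs, target=0, epsilon=0.1))
```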
Important arguments specific to the decoding routine in sample_step(), including
- --argmax-decoding indicates whether to use argmax decoding for the denoised Transformer output $\tilde{x}_0$ (default: false)
- --temperature <float> specifies the temperature $\tau$ for sampling $\tilde{x}_0 \sim \operatorname{Categorical}(f(x_t; \theta)/\tau)$ if the argmax decoding scheme is not used (default: 1.0)
- --decoding-strategy <str> indicates whether to use the vanilla (default) or reparameterized (reparam-<options>; see the Decoding Strategies section for details) decoding strategy (choices: {default, reparam-<options>}; default: default)
- --load-ema-weights indicates whether to load the EMA model weights for generation (default: false)
- --iter-decode-max-iter <int> specifies the maximum number of time steps for decoding (default: 10)
- --iter-decode-with-beam <int> specifies the beam size for decoding multiple sequences with different lengths in parallel (default: 1)
- --iter-decode-force-max-iter indicates that iterative decoding must run for the specified number of iterations without exiting early. Setting this flag is recommended.
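
The difference between argmax decoding and temperature sampling can be illustrated for a single position with the following pure-Python sketch. The function decode_token is hypothetical and operates on a plain list of logits; it is meant to show the sampling rule $\tilde{x}_0 \sim \operatorname{Categorical}(f(x_t; \theta)/\tau)$, not the batched routine in the library.

```python
import math
import random


def decode_token(logits, temperature=1.0, argmax=False, rng=random):
    """Pick a token index from logits, by argmax or by temperature sampling."""
    if argmax:
        return max(range(len(logits)), key=lambda i: logits[i])
    # Scale logits by 1 / temperature, then apply a numerically stable softmax.
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    weights = [math.exp(value - peak) for value in scaled]
    # random.choices normalizes the weights internally.
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


if __name__ == "__main__":
    logits = [0.1, 2.0, -1.0]
    print(decode_token(logits, argmax=True))
    print(decode_token(logits, temperature=0.5, rng=random.Random(0)))
```

Lower temperatures concentrate the distribution on the highest-scoring token, so sampling approaches argmax decoding as the temperature goes to zero.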

A more comprehensive list of arguments is available in the repository.

Decoding Strategies

Vanilla Sampling Scheme

Passing --decoding-strategy default selects the vanilla sampling scheme (specific to each discrete diffusion process).

Improved Sampling with Reparameterization

A more advanced decoding approach can be invoked by passing --decoding-strategy reparam-<conditioning-of-v>-<topk_mode>-<schedule>. This approach is based on the reparameterization proposed in our paper and allows for more effective decoding procedures. The options specify the decoding algorithm via

<conditioning-of-v>: uncond or cond (default uncond): whether to generate the routing variable $v_t$ in a conditional or unconditional manner;
<topk_mode>: stochastic<float> or deterministic (default deterministic): whether to use stochastic or deterministic top-$k$ selection. The float value in stochastic<float> specifies the degree of randomness in the stochastic top-$k$ selection;
<schedule>: linear or cosine (default cosine): the schedule for $k$ during our denoising procedure, which controls the number of top-$k$ tokens to be denoised for the next decoding step.

See the implementation for more details about the options.
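
To show how the three fields of a reparam-<conditioning-of-v>-<topk_mode>-<schedule> string fit together, here is a small hypothetical parser. The function name and the returned dict layout are assumptions made for this sketch; the repository's own parsing may be organized differently.

```python
def parse_decoding_strategy(strategy):
    """Split a --decoding-strategy value into its documented components."""
    if strategy == "default":
        return {"reparam": False}
    parts = strategy.split("-")
    if len(parts) != 4 or parts[0] != "reparam":
        raise ValueError(f"unrecognized decoding strategy: {strategy!r}")
    _, conditioning, topk_mode, schedule = parts
    if conditioning not in ("uncond", "cond"):
        raise ValueError(f"unknown conditioning: {conditioning!r}")
    if schedule not in ("linear", "cosine"):
        raise ValueError(f"unknown schedule: {schedule!r}")
    if topk_mode == "deterministic":
        randomness = None
    elif topk_mode.startswith("stochastic"):
        # The float suffix gives the degree of randomness, e.g. stochastic1.0.
        randomness = float(topk_mode[len("stochastic"):])
        topk_mode = "stochastic"
    else:
        raise ValueError(f"unknown top-k mode: {topk_mode!r}")
    return {
        "reparam": True,
        "conditioning": conditioning,
        "topk_mode": topk_mode,
        "randomness": randomness,
        "schedule": schedule,
    }


if __name__ == "__main__":
    print(parse_decoding_strategy("reparam-uncond-deterministic-cosine"))
    print(parse_decoding_strategy("reparam-cond-stochastic1.0-linear"))
```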

Machine Translation

Data Preprocessing

Please see the scripts below for details.

Note
All tasks considered in this work operate on the original data and do not adopt knowledge distillation (KD).

IWSLT14 DE-EN

We follow the standard preprocessing in fairseq/examples to prepare the binarized data:

 # fetch and preprocess the data to BPE codes
cd examples/translation/
bash prepare-iwslt14.sh
cd ../..

# binarize the data
TEXT=examples/translation/iwslt14.tokenized.de-en
fairseq-preprocess --joined-dictionary --source-lang de --target-lang en \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir data-bin/iwslt14.tokenized.de-en \
    --workers 20

wmt14 en-de

We use the data released in fairseq/examples to prepare the dataset:

wget http://dl.fbaipublicfiles.com/nat/original_dataset.zip
unzip original_dataset.zip
TEXT=wmt14_ende
fairseq-preprocess --joined-dictionary \
    --source-lang en --target-lang de \
    --trainpref $TEXT/train.en-de --validpref $TEXT/valid.en-de --testpref $TEXT/test.en-de \
    --destdir data-bin/wmt14_ende --thresholdtgt 0 --thresholdsrc 0 \
    --workers 20

wmt16 en-ro

For this dataset, we use the raw data wmt16.tar.gz as preprocessed in this repository.

tar xzvf wmt16.tar.gz

TEXT=wmt16/en-ro

# move train/ dev/ test/ bpe codes into the $TEXT folder
mv $TEXT/train/corpus.bpe.en $TEXT/train.bpe.en
mv $TEXT/train/corpus.bpe.ro $TEXT/train.bpe.ro
mv $TEXT/dev/dev.bpe.en $TEXT/dev.bpe.en
mv $TEXT/dev/dev.bpe.ro $TEXT/dev.bpe.ro
mv $TEXT/test/test.bpe.en $TEXT/test.bpe.en
mv $TEXT/test/test.bpe.ro $TEXT/test.bpe.ro

# binarize the data
fairseq-preprocess --joined-dictionary \
    --source-lang en --target-lang ro \
    --trainpref $TEXT/train.bpe --validpref $TEXT/dev.bpe --testpref $TEXT/test.bpe \
    --destdir data-bin/wmt16_enro --thresholdtgt 0 --thresholdsrc 0 \
    --workers 20

Training

We first enter the fairseq folder and then run the following commands to train the models.

######## training scripts for IWSLT'14, WMT'14, and WMT'16
# first cd to fairseq
# we use 1 GPU for IWSLT'14, 4 GPUs for WMT'14 and 2 GPUs for WMT'16 datasets respectively.
CUDA_VISIBLE_DEVICES=0 bash experiments/mt_train.sh -m absorbing -d <iwslt/wmt14/wmt16> -s default -e True --store-ema --label-smoothing 0.1
CUDA_VISIBLE_DEVICES=1 bash experiments/mt_train.sh -m multinomial -d <iwslt/wmt14/wmt16> -s default -e True --not-diffusing-special-sym --store-ema --label-smoothing 0.0
CUDA_VISIBLE_DEVICES=2 bash experiments/mt_train.sh -m reparam-absorbing -d <iwslt/wmt14/wmt16> -s default -e True --q-sample-mode coupled --store-ema --label-smoothing 0.1 --reweighting-type linear
CUDA_VISIBLE_DEVICES=3 bash experiments/mt_train.sh -m reparam-multinomial -d <iwslt/wmt14/wmt16> -s default -e True --not-diffusing-special-sym --q-sample-mode coupled --store-ema --label-smoothing 0.1 --reweighting-type linear

Note
-s <str> is used to specify the name of the experiment.
Custom arguments that are specific to training can be passed by appending them after -e True.
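
When launching several runs, it can help to assemble the command programmatically. The sketch below only builds and prints the argument list; the helper build_train_command is hypothetical, and the dataset name "iwslt" is assumed from the <iwslt/wmt14/wmt16> placeholder above. Extra training arguments are appended after -e True, as described in the note.

```python
import shlex


def build_train_command(model, dataset, name="default", extra_args=()):
    """Assemble an experiments/mt_train.sh invocation as an argument list."""
    cmd = [
        "bash", "experiments/mt_train.sh",
        "-m", model,
        "-d", dataset,
        "-s", name,
        "-e", "True",
    ]
    # Custom training arguments go after "-e True".
    cmd.extend(extra_args)
    return cmd


if __name__ == "__main__":
    cmd = build_train_command(
        "reparam-absorbing",
        "iwslt",
        extra_args=["--q-sample-mode", "coupled", "--store-ema"],
    )
    print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run from inside the fairseq folder, with CUDA_VISIBLE_DEVICES set in the environment.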

Generation & Evaluation

The evaluation pipeline is handled by experiments/mt_generate.sh. The script generates the translation results and evaluates the BLEU score.

########### IWSLT'14, WMT'14, and WMT'16 datasets
# we recommend putting each checkpoint into a separate folder
# since the script will put the decoded results into a file under the same folder of each checkpoint.
CUDA_VISIBLE_DEVICES=0 bash experiments/mt_generate.sh -a false -c <checkpoint_path> -d <iwslt/wmt14/wmt16>

Arguments:

-a : whether to average multiple checkpoints
-c : the location of the checkpoint. If -a false (no checkpoint averaging), pass the checkpoint path; if -a true, pass the directory that stores multiple checkpoints from different training steps for averaging.
-d : the dataset name

Trained Model Checkpoints

We also provide the checkpoints of our trained models.

Dataset	Model	Checkpoint link
IWSLT'14	multinomial	link
IWSLT'14	absorbing	link
IWSLT'14	reparam-multinomial	link
IWSLT'14	reparam-absorbing	link
WMT'14	multinomial	link
WMT'14	absorbing	link
WMT'14	reparam-multinomial	link
WMT'14	reparam-absorbing	link
WMT'16	multinomial	link
WMT'16	absorbing	link
WMT'16	reparam-multinomial	link
WMT'16	reparam-absorbing	link

Question Generation and Paraphrasing Tasks

We follow the experimental setup in DiffuSeq for the question generation and paraphrasing tasks.

Data Preprocessing

The raw data of these two tasks can be fetched from the original DiffuSeq repository. We then preprocess the data via the provided script.

 # put the raw data in the directory ``diffuseq_data/QG``
# Preprocess the question generation dataset
bash diffusion_mt/scripts/preprocess_diffuseq_datasets.sh QG

# put the raw data in the directory ``diffuseq_data/QQP``
# Preprocess the paraphrasing dataset
bash diffusion_mt/scripts/preprocess_diffuseq_datasets.sh QQP

Training

# QQP or QG datasets
# first cd to fairseq
CUDA_VISIBLE_DEVICES=0,1 bash experiments/diffuseq_train.sh -m absorbing -d <qqp/qg> -s default -e True --store-ema --label-smoothing 0.1
CUDA_VISIBLE_DEVICES=2,3 bash experiments/diffuseq_train.sh -m multinomial -d <qqp/qg> -s default -e True --not-diffusing-special-sym --store-ema --label-smoothing 0.0
CUDA_VISIBLE_DEVICES=0,1 bash experiments/diffuseq_train.sh -m reparam-multinomial -d <qqp/qg> -s default -e True --not-diffusing-special-sym --q-sample-mode coupled --store-ema --label-smoothing 0.1 --reweighting-type linear
CUDA_VISIBLE_DEVICES=2,3 bash experiments/diffuseq_train.sh -m reparam-absorbing -d <qqp/qg> -s default -e True --q-sample-mode coupled --store-ema --label-smoothing 0.1 --reweighting-type linear

Generation & Evaluation

We closely follow the generation and evaluation protocols of DiffuSeq to ensure a head-to-head comparison. The whole pipeline is re-implemented in fairseq/diffusion_mt/scripts/decode_diffuseq.py and fairseq/diffusion_mt/scripts/eval_diffuseq.py, respectively, to be compatible with Fairseq. Run the following commands:

# we recommend putting each checkpoint into a separate folder
# since the script will put the decoded results into a file under the same folder of each checkpoint.
CUDA_VISIBLE_DEVICES=0 bash experiments/diffuseq_generate.sh -a false -b true -c <checkpoint_path> -d <qqp/qg>

Arguments:

-a : whether to average multiple checkpoints
-b : whether to use multiple samples for MBR decoding
-c : the location of the checkpoint. If -a false (no checkpoint averaging), pass the checkpoint path; if -a true, pass the directory that stores multiple checkpoints from different training steps for averaging.
-d : the dataset name
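
For readers unfamiliar with MBR (minimum Bayes risk) decoding, the idea is to generate several candidates and keep the one that agrees most with the others. The sketch below is a generic illustration rather than the repository's implementation: it uses a toy token-overlap similarity (an assumption chosen to stay dependency-free), whereas real pipelines typically use a metric such as BLEU, and both function names are hypothetical.

```python
def token_overlap(a, b):
    """Toy similarity: Jaccard overlap between the token sets of two strings."""
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def mbr_select(candidates, similarity=token_overlap):
    """Return the candidate with the highest average similarity to the rest."""

    def expected_utility(i):
        others = [c for j, c in enumerate(candidates) if j != i]
        if not others:
            return 0.0
        return sum(similarity(candidates[i], o) for o in others) / len(others)

    best_index = max(range(len(candidates)), key=expected_utility)
    return candidates[best_index]


if __name__ == "__main__":
    samples = [
        "how are you",
        "how are you today",
        "how are you",
        "what is this",
    ]
    print(mbr_select(samples))
```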

Trained Model Checkpoints

We also provide the checkpoints of our trained models.

Dataset	Model	Checkpoint link
QG	multinomial	link
QG	absorbing	link
QG	reparam-multinomial	link
QG	reparam-absorbing	link
QQP	multinomial	link
QQP	absorbing	link
QQP	reparam-multinomial	link
QQP	reparam-absorbing	link

Citation

@article{zheng2023rdm,
  title={A Reparameterized Discrete Diffusion Model for Text Generation},
  author={Zheng, Lin and Yuan, Jianbo and Yu, Lei and Kong, Lingpeng},
  journal={arXiv preprint arXiv:2302.05737},
  year={2023}
}