pip install -r requirements.txt

export KAGGLE_USERNAME="your_kaggle_username"
export KAGGLE_KEY="your_api_key"
export HF_TOKEN="your_hf_token"

sudo apt install unzip

kaggle datasets download -d lizhecheng/lmsys-datasets
unzip lmsys-datasets.zip
kaggle datasets download -d lizhecheng/lmsys-lora
unzip lmsys-lora.zip

cd src
cd team gemma   # or: cd team llama
python train_xxx.py

Click full-training-code.
Check our code at LMSYS GitHub.
We employ instruction tuning, making the input format crucial. After experimenting with various formats, we identified the optimal approach:
First, we define a maximum length. Then we concatenate multiple turns of prompt-response pairs within this limit; whenever appending the next prompt-response pair would push a row past the maximum length, that pair is placed in a separate row. For example, consider prompts [P1, P2, P3] with corresponding responses [A1, A2, A3] from model A and [B1, B2, B3] from model B. If appending (P2, A2, B2) to (P1, A1, B1) would exceed the maximum length, this method generates two rows: (P1, A1, B1) and (P2, A2, B2, P3, A3, B3). However, for training, we only use the row containing the last prompt-response turn for each ID.
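A minimal sketch of this packing step, assuming each conversation is a list of (prompt, response_a, response_b) turns; `pack_turns` and `count_tokens` are hypothetical names for illustration, not functions from our repository:

```python
def pack_turns(turns, max_len, count_tokens):
    """Greedily concatenate (prompt, resp_a, resp_b) turns into rows that stay
    within max_len; a turn that would overflow the current row starts a new row."""
    rows, current = [], []
    for turn in turns:
        candidate = current + [turn]
        text = " ".join(part for t in candidate for part in t)
        if current and count_tokens(text) > max_len:
            rows.append(current)   # close the full row
            current = [turn]       # the new pair goes to a separate row
        else:
            current = candidate
    if current:
        rows.append(current)
    return rows

# Example with prompts [P1, P2, P3] and responses from models A and B:
turns = [("P1", "A1", "B1"), ("P2", "A2", "B2"), ("P3", "A3", "B3")]
rows = pack_turns(turns, max_len=512, count_tokens=lambda s: len(s.split()))
# For training, only the row containing the last turn of each ID is kept.
```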
This approach offers two key advantages. The final input follows this template:
<start_of_turn>user
Here are two question-answering dialogues. Compare two models' performance on answering questions, determine which is better.
#Prompt1
xxxxx
#Response
##Model A
xxxxx
##Model B
xxxx
#Prompt2
xxxxx
#Response
............
###options
A. Model A
B. Model B
C. Tie
<end_of_turn>
<start_of_turn>model
A<eos>
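As an illustration only (the exact formatting code is in the repository), here is a hedged sketch that renders one packed row into the Gemma-style template above; the function name and argument layout are assumptions:

```python
def build_prompt(turns):
    """Render packed (prompt, resp_a, resp_b) turns into the
    <start_of_turn> template shown above (Gemma chat format)."""
    body = ["Here are two question-answering dialogues. Compare two models' "
            "performance on answering questions, determine which is better."]
    for i, (prompt, resp_a, resp_b) in enumerate(turns, start=1):
        body += [f"#Prompt{i}", prompt, "#Response",
                 "##Model A", resp_a, "##Model B", resp_b]
    body += ["###options", "A. Model A", "B. Model B", "C. Tie"]
    return ("<start_of_turn>user\n" + "\n".join(body) +
            "\n<end_of_turn>\n<start_of_turn>model\n")

# The target completion is then just the single option letter, e.g. "A<eos>".
```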
4-bit QLoRA on gemma-2-9b-it and Meta-Llama-3.1-8B-Instruct; LoRA parameters: r = 32, target modules = ["q_proj", "k_proj", "v_proj", "o_proj"] (a configuration sketch follows this list).
Instruction-tuning instead of classification.
No gradient_checkpointing_enable(), which shortens training time (at the cost of extra activation memory).
Used an additional 33k samples for fine-tuning and sampled 10k samples for TTA (test-time augmentation).
A careful CV split (80% / 20%) to avoid duplicates between train and validation.
GPU: multiple 80GB A100 GPUs + multiple A40 GPUs.
Set temperature = 1.03 for inference (see the inference sketch after this list).
Submission 1: gemma-2-9b-it + llama-3.1-8b-it + gemma-2-2b-it; Submission 2: gemma-2-9b-it + llama-3.1-8b-it + TTA.
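A hedged configuration sketch of the 4-bit QLoRA setup above, using Hugging Face transformers, peft, and bitsandbytes; lora_alpha and lora_dropout are placeholders, since the write-up only states r = 32 and the target modules:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit QLoRA base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it",                 # or meta-llama/Meta-Llama-3.1-8B-Instruct
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=32,                                    # rank stated in the write-up
    lora_alpha=64,                           # placeholder, not from the write-up
    lora_dropout=0.05,                       # placeholder, not from the write-up
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```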
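Because the target is a single option letter, inference can score the next-token logits of "A", "B", and "C" and apply the temperature of 1.03 there. The sketch below only illustrates that idea; it is not the original inference code, and the function name is an assumption:

```python
import torch

@torch.no_grad()
def option_probs(model, tokenizer, prompt, temperature=1.03):
    """Probabilities for 'A' (model A wins), 'B' (model B wins), 'C' (tie),
    read from the next-token logits after the prompt."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    logits = model(**inputs).logits[0, -1]          # logits for the next token
    option_ids = [tokenizer.convert_tokens_to_ids(t) for t in ["A", "B", "C"]]
    return torch.softmax(logits[option_ids] / temperature, dim=-1)
```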
Because of some malicious people, what was once a very stable and meaningful competition has turned into one of the worst in Kaggle's history.
Thanks to the entire team for their hard work. Let's keep moving forward!