This repo contains the code, data, and models for TMLR 2024 paper "TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks"
| Other Resources |
|---|
| TIGERScore Collections |
| Huggingface Demo |
To directly use the tigerscore pipeline, you first need to install it as a Python package.
```bash
pip install git+https://github.com/TIGER-AI-Lab/TIGERScore.git
```
Please check that `torch.cuda.is_available()` returns `True` on your local machine.
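For example, a quick sanity check (nothing here is TIGERScore-specific):
```python
# Verify that PyTorch can see your GPU before loading TIGERScore.
import torch

assert torch.cuda.is_available(), "CUDA is not available on this machine"
print(torch.cuda.get_device_name(0))  # name of the first visible GPU
```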
Besides, to use TIGERScore with vllm (detailed below), you need to manually install vllm following the vllm documentation:
```bash
# For CUDA 12.1
pip install vllm
pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu121
```
For CUDA 11.8:
```bash
# Replace `cp39` with your Python version (e.g., `cp38`, `cp39`, `cp311`).
pip install https://github.com/vllm-project/vllm/releases/download/v0.2.2/vllm-0.2.2+cu118-cp39-cp39-manylinux1_x86_64.whl
pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu118
```
If you want to use the training scripts, install the dependencies by running the following command:
```bash
pip install -r requirements.txt
```
After installation, you are ready to score text generations with the following example Python code (see tigerscore_example_usage.ipynb for more use cases):
```python
# gpu device setup
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
# example
instruction = "Write an apology letter."
input_context = "Reason: You canceled a plan at the last minute due to illness."
hypo_output = "Hey [Recipient],\n\nI'm really sorry for ditching our plan. I suddenly got an opportunity for a vacation so I took it. I know this might have messed up your plans and I regret that.\n\nDespite being under the weather, I would rather go for an adventure. I hope you can understand my perspective and I hope this incident doesn't change anything between us.\n\nWe can reschedule our plan for another time. Sorry again for the trouble.\n\nPeace out,\n[Your Name]\n\n---"
# Load and evaluate examples in all options in 3 lines of code
from tigerscore import TIGERScorer
scorer = TIGERScorer(model_name="TIGER-Lab/TIGERScore-7B") # on GPU
# scorer = TIGERScorer(model_name="TIGER-Lab/TIGERScore-7B", quantized=True) # 4 bit quantization on GPU
# scorer = TIGERScorer(model_name="TIGER-Lab/TIGERScore-7B", use_vllm=True) # VLLM on GPU
# scorer = TIGERScorer(model_name="TIGER-Lab/TIGERScore-7B-GGUF", use_llamacpp=True) # 4 bit quantization on CPU
results = scorer.score([instruction], [hypo_output], [input_context])
# print the results, which is a list of json outputs containing the automatically parsed results!
print(results)
```
The result is a list of dicts, each containing a structured error analysis:
```json
[
    {
        "num_errors": 3,
        "score": -12.0,
        "errors": {
            "error_0": {
                "location": "\"I'm really sorry for ditching our plan.\"",
                "aspect": "Inappropriate language or tone",
                "explanation": "The phrase \"ditching our plan\" is informal and disrespectful. It should be replaced with a more respectful and apologetic phrase like \"cancelling our plan\".",
                "severity": "Major",
                "score_reduction": "4.0"
            },
            "error_1": {
                "location": "\"I suddenly got an opportunity for a vacation so I took it.\"",
                "aspect": "Lack of apology or remorse",
                "explanation": "This sentence shows no remorse for cancelling the plan at the last minute. It should be replaced with a sentence that expresses regret for the inconvenience caused.",
                "severity": "Major",
                "score_reduction": "4.0"
            },
            "error_2": {
                "location": "\"I would rather go for an adventure.\"",
                "aspect": "Incorrect reason for cancellation",
                "explanation": "This sentence implies that the reason for cancelling the plan was to go on an adventure, which is incorrect. The correct reason was illness. This sentence should be replaced with a sentence that correctly states the reason for cancellation.",
                "severity": "Major",
                "score_reduction": "4.0"
            }
        },
        "raw_output": "..."
    }
]
```

```python
scorer = TIGERScorer(model_name="TIGER-Lab/TIGERScore-7B", use_vllm=True) # VLLM on GPU
```
TIGERScore supports vllm fast inference. On a single A6000 (48GB) GPU, it takes only 0.2s-0.3s for TIGERScore-13B to score each instance.
```python
scorer = TIGERScorer(model_name="TIGER-Lab/TIGERScore-7B", quantized=True) # 4-bit quantization on GPU
```
By setting the initialization parameter `quantized=True`, the model is loaded in 4-bit form via the Hugging Face `load_in_4bit=True` option.
Please note that although quantization decreases the memory requirement by a large margin (you can run TIGERScore on a GPU with roughly 20+ GB of memory), inference might be slower than with the original bfloat16 version. The trade-off is yours to make.
```python
scorer = TIGERScorer(model_name="TIGER-Lab/TIGERScore-7B-GGUF", use_llamacpp=True) # 4-bit quantization on CPU
```
We also provide llama.cpp (GGUF) versions of TIGERScore-7B/13B. With the GGUF version, you can run TIGERScore on CPU-only devices. It generally takes about 20s for TIGERScore-13B to score each instance.
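Whichever backend you pick, `scorer.score` returns the same structure shown above. As a minimal sketch for post-processing (assuming only the fields shown in the example output), you can summarize the results like this:
```python
# Summarize TIGERScore results: overall score plus a flat list of errors.
# Assumes each result dict has the "score", "num_errors", and "errors"
# fields shown in the example output above.
def summarize(results):
    for i, res in enumerate(results):
        print(f"instance {i}: score={res['score']}, num_errors={res['num_errors']}")
        for err in (res.get("errors") or {}).values():
            print(f"  [{err['severity']}] {err['aspect']} (-{err['score_reduction']})")

summarize(results)
```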
Dataset preprocessing scripts and intermediate results can be found here.
The folder xgptscore contains all the templates we used to query ChatGPT or GPT-4 to identify errors in the hypothesis output for the different tasks that TIGERScore covers. We call this API query method XGPTScore, an eXplainable Scoring method that queries GPT models.
The overall pipeline of XGPTScore is to prompt GPT models with the task context and hypothesis output and ask them to identify errors, focusing on the evaluation aspects we define for each task (see ./constants.py). Check xgptscore/README.md for more details, including how to use our query template with a single function, xgptscore().
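As a purely illustrative sketch of the idea (this is not the repo's actual template or the `xgptscore()` signature; the prompt wording, model name, and response schema below are all assumptions):
```python
# Conceptual sketch of an XGPTScore-style query: ask a GPT model to list
# errors in a hypothesis output. The prompt text and response schema are
# illustrative assumptions, not the repo's actual template.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def query_errors(instruction, source, reference, hypothesis):
    prompt = (
        f"Instruction: {instruction}\n"
        f"Source: {source}\n"
        f"Reference: {reference}\n"
        f"Hypothesis: {hypothesis}\n"
        "List each error in the hypothesis as JSON with fields: "
        "location, aspect, explanation, severity, score_reduction."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```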
MetricInstruct consists of data from two sampling channels: the real-world channel and the synthetic channel.
Real-world channel data is generated by generate_distill_data.sh, and synthetic channel data by generate_synthesis_distill_data.sh.
The overall purpose of the two-channel data collection is to cover as many error types as possible in the training data so that our model generalizes better. After collecting the data, we apply a series of heuristics to filter out bad data and to augment the data (check_data.sh); generate_inst_synthetic_data.sh serves as a supplement to the synthetic channel.
You can load our preprocessed data, used to finetune TIGERScore-V1, directly from Hugging Face:
```python
from datasets import load_dataset

dataset = load_dataset("TIGER-Lab/MetricInstruct")
```
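For a quick look at what you loaded (a minimal sketch; the split names and fields are whatever the Hugging Face dataset card defines, so we avoid hard-coding them):
```python
# Inspect the loaded dataset: available splits, sizes, and one raw record.
print(dataset)                       # splits with their columns and sizes
first_split = list(dataset.keys())[0]
print(dataset[first_split][0])       # first record of the first split
```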
We provide our training and testing scripts in the folder finetune, where we use:
- format_distill_data.sh to transform the data into the format used for finetuning, that is, a single instruction and input context paired with an output (sketched below).
- finetune_llama.sh to finetune the model.
- test_llama_vllm.sh to test the finetuned model and compute correlations as its performance.
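For illustration, a record in that finetuning format might look like the following (the field names are assumptions based on the description above, not the actual schema produced by format_distill_data.sh):
```python
# Illustrative finetuning record: a single instruction and input context
# paired with an output. Field names are assumptions, not the actual
# schema produced by format_distill_data.sh.
record = {
    "instruction": "Write an apology letter.",
    "input": "Reason: You canceled a plan at the last minute due to illness.",
    "output": "Dear [Recipient], ...",
}
```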
Please check these scripts for more details of our training and testing process, and see ./tigerscore/common/README.md for how to set up the environment.

Please cite our paper if you find our data, model, or code useful.
```bibtex
@article{Jiang2023TIGERScoreTB,
  title={TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks},
  author={Dongfu Jiang and Yishan Li and Ge Zhang and Wenhao Huang and Bill Yuchen Lin and Wenhu Chen},
  journal={ArXiv},
  year={2023},
  volume={abs/2310.00752},
  url={https://api.semanticscholar.org/CorpusID:263334281}
}
```