This repository contains the code, data, and models for the TMLR 2024 paper "TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks".
| Other Resources |
|---|
| 🤗 TIGERScore Collection |
| 🤗 Hugging Face Demo |
To use the TIGERScore pipeline directly, you first need to install it as a Python package:
```bash
pip install git+https://github.com/TIGER-AI-Lab/TIGERScore.git
```

Please check that `torch.cuda.is_available()` is `True` on your local machine.
In addition, to use TIGERScore with vLLM (detailed below), you need to manually install vLLM following the vLLM documentation:

```bash
pip install vllm
pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu121
```

For CUDA 11.8, install the prebuilt wheel instead:

```bash
pip install https://github.com/vllm-project/vllm/releases/download/v0.2.2/vllm-0.2.2+cu118-cp39-cp39-manylinux1_x86_64.whl # Replace `cp39` with your Python version (e.g., `cp38`, `cp310`, `cp311`).
pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu118
```

If you want to use the training scripts, install the dependencies by running:
```bash
pip install -r requirements.txt
```

After installation, you can use the following example Python code to score text generations (see `tigerscore_example_usage.ipynb` for more use cases):
```python
# GPU device setup
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Example
instruction = "Write an apology letter."
input_context = "Reason: You canceled a plan at the last minute due to illness."
hypo_output = "Hey [Recipient],\n\nI'm really sorry for ditching our plan. I suddenly got an opportunity for a vacation so I took it. I know this might have messed up your plans and I regret that.\n\nDespite being under the weather, I would rather go for an adventure. I hope you can understand my perspective and I hope this incident doesn't change anything between us.\n\nWe can reschedule our plan for another time. Sorry again for the trouble.\n\nPeace out,\n[Your Name]\n\n---"

# Load and evaluate examples with all options in 3 lines of code
from tigerscore import TIGERScorer
scorer = TIGERScorer(model_name="TIGER-Lab/TIGERScore-7B")  # on GPU
# scorer = TIGERScorer(model_name="TIGER-Lab/TIGERScore-7B", quantized=True)  # 4-bit quantization on GPU
# scorer = TIGERScorer(model_name="TIGER-Lab/TIGERScore-7B", use_vllm=True)  # VLLM on GPU
# scorer = TIGERScorer(model_name="TIGER-Lab/TIGERScore-7B-GGUF", use_llamacpp=True)  # 4-bit quantization on CPU
results = scorer.score([instruction], [hypo_output], [input_context])
# Print the results, a list of JSON outputs containing the automatically parsed results!
print(results)
```

The results are a list of dicts, each consisting of a structured error analysis:
```json
[
    {
        "num_errors": 3,
        "score": -12.0,
        "errors": {
            "error_0": {
                "location": "\"I'm really glad for ditching our plan.\"",
                "aspect": "Inappropriate language or tone",
                "explanation": "The phrase \"ditching our plan\" is informal and disrespectful. It should be replaced with a more respectful and apologetic phrase like \"cancelling our plan\".",
                "severity": "Major",
                "score_reduction": "4.0"
            },
            "error_1": {
                "location": "\"I suddenly got an opportunity for a vacation so I took it.\"",
                "aspect": "Lack of apology or remorse",
                "explanation": "This sentence shows no remorse for cancelling the plan at the last minute. It should be replaced with a sentence that expresses regret for the inconvenience caused.",
                "severity": "Major",
                "score_reduction": "4.0"
            },
            "error_2": {
                "location": "\"I would rather go for an adventure.\"",
                "aspect": "Incorrect reason for cancellation",
                "explanation": "This sentence implies that the reason for cancelling the plan was to go on an adventure, which is incorrect. The correct reason was illness. This sentence should be replaced with a sentence that correctly states the reason for cancellation.",
                "severity": "Major",
                "score_reduction": "4.0"
            }
        },
        "raw_output": "..."
    }
]
```

TIGERScore supports vLLM for fast inference:

```python
scorer = TIGERScorer(model_name="TIGER-Lab/TIGERScore-7B", use_vllm=True)  # VLLM on GPU
```

On a single A6000 (48GB) GPU, TIGERScore-13B takes only about 0.2s to 0.3s to score each instance.
```python
scorer = TIGERScorer(model_name="TIGER-Lab/TIGERScore-7B", quantized=True)  # 4-bit quantization on GPU
```

By setting the initialization parameter `quantized=True`, the model is loaded as its 4-bit version via the Hugging Face `load_in_4bit=True` option. Note that while quantization greatly reduces the memory requirement (you can run TIGERScore on a GPU with roughly 20+ GB of memory), inference may be slower than with the original bfloat16 version. It is a trade-off for you to decide.
```python
scorer = TIGERScorer(model_name="TIGER-Lab/TIGERScore-7B-GGUF", use_llamacpp=True)  # 4-bit quantization on CPU
```

We also provide llama.cpp versions of TIGERScore-7B/13B. Using the GGUF versions we provide, you can run TIGERScore on pure CPU devices; TIGERScore-13B typically takes about 20 seconds to score each instance.
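Whichever backend is used, each entry returned by `scorer.score` is a plain Python dict like the example output shown earlier, so the parsed fields are easy to post-process. A minimal sketch (operating on a stripped-down copy of that example; the field names `errors`, `severity`, `score_reduction`, and `score` all appear in the output above):

```python
# Tally TIGERScore's parsed errors by severity and recompute the total penalty.
# `result` mirrors one entry of the `results` list from the example output.
result = {
    "num_errors": 3,
    "score": -12.0,
    "errors": {
        "error_0": {"severity": "Major", "score_reduction": "4.0"},
        "error_1": {"severity": "Major", "score_reduction": "4.0"},
        "error_2": {"severity": "Major", "score_reduction": "4.0"},
    },
}

by_severity = {}
total_reduction = 0.0
for err in result["errors"].values():
    by_severity[err["severity"]] = by_severity.get(err["severity"], 0) + 1
    total_reduction += float(err["score_reduction"])  # reductions are strings

print(by_severity)       # {'Major': 3}
print(-total_reduction)  # -12.0, matches the reported "score"
```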
The dataset preprocessing scripts and intermediate results can be found here.
The folder `xgptscore` contains all the templates we used to query ChatGPT or GPT-4 to identify errors in the hypothesis outputs for the different tasks TIGERScore covers. We call these API querying methods XGPTScore, an eXplainable scoring method based on querying GPT models.
The overall pipeline of XGPTScore is described in `xgptscore/README.md`, with the evaluation aspects for each task defined in `./constants.py`. Check `xgptscore/README.md` for more details and for how to use our query templates through a single function, `xgptscore()`.
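To make the querying idea concrete, here is a hypothetical sketch of the prompt-building step; the template wording and the `build_query` helper are illustrative inventions, not the actual templates shipped in `xgptscore/`:

```python
# Hypothetical XGPTScore-style query: render a template that asks a GPT model
# to list the errors it finds in a hypothesis output for a given task.
TEMPLATE = (
    "Task instruction: {instruction}\n"
    "Source input: {source}\n"
    "Hypothesis output: {hypothesis}\n"
    "Identify the errors in the hypothesis output. For each error, give its "
    "location, aspect, explanation, severity, and score reduction."
)

def build_query(instruction: str, source: str, hypothesis: str) -> str:
    """Fill the template; the rendered prompt would be sent to ChatGPT or GPT-4."""
    return TEMPLATE.format(instruction=instruction, source=source, hypothesis=hypothesis)

prompt = build_query(
    "Write an apology letter.",
    "Reason: You canceled a plan at the last minute due to illness.",
    "Hey, sorry for ditching our plan...",
)
print(prompt.splitlines()[0])  # Task instruction: Write an apology letter.
```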
MetricInstruct consists of data from two sampling channels: the real-world channel and the synthetic channel.
The real-world channel data is generated by `generate_distill_data.sh`; the synthetic channel data is generated by `generate_synthesis_distill_data.sh`. The overall purpose of the two-channel data collection is to ensure that the error types in the training data are well covered, so that our model generalizes better. After obtaining the data, we applied a series of heuristics to filter out bad data and to augment the data:
We filter out bad items (`check_data.sh`), and `generate_inst_synthetic_data.sh` generates high-quality outputs with free-form error aspects as a supplement to the synthetic channel. You can load the preprocessed data used to fine-tune TIGERScore-V1 directly from 🤗 Hugging Face:
```python
from datasets import load_dataset
dataset = load_dataset("TIGER-Lab/MetricInstruct")
```

We provide our training and testing scripts in the `finetune` folder, where we use:
- `finetune_llama.sh` to fine-tune the model.
- `format_distill_data.sh` to transform the data into the training format, i.e., a single instruction together with an input context and an output.
- `test_llama_vllm.sh` to test the fine-tuned model and compute correlations as its performance.

Please check these scripts to learn more details about our training and testing process, and see `./tigerscore/common/README.md` for how to set up the environment. If you find our data, models, or code useful, please cite our paper:
@article{Jiang2023TIGERScoreTB,
title={TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks},
author={Dongfu Jiang and Yishan Li and Ge Zhang and Wenhao Huang and Bill Yuchen Lin and Wenhu Chen},
journal={ArXiv},
year={2023},
volume={abs/2310.00752},
url={https://api.semanticscholar.org/CorpusID:263334281}
}