alpaca_eval下載alpaca_eval源代碼下載

Alpacaeval：指導遵循語言模型的自動評估器

具有長度控制的勝利率（Paper）的Alpacaeval 2.0與Chatbot Arena的Spearman相關性為0.98 ，而OpenAI積分的成本不到10美元，在不到3分鐘的時間內運行和運行。我們的目標是為CHAT LLM提供一個基準：快速（<5分鐘），便宜（<$ 10），並且與人類高度相關（0.98）。這是與其他基準的比較：

LC Alpacaeval是與聊天競技場最高度相關的基準。

更新：

？默認情況下，長度控制的獲勝率已超出並使用！這將與聊天機器人競技場的相關性從0.93提高到0.98，同時大大降低了連續連續性。原始的獲勝率仍在網站和CLI上顯示。更多詳細信息。

？灰木瓦爾2.0默認情況下使用並使用！我們改進了自動通信器（更好，更便宜），並將GPT-4預覽用作基線。更多詳細信息。對於舊版本，設置您的環境變量IS_ALPACA_EVAL_2=False 。

概述
快速開始
排行榜以及如何解釋它們
- 型號
- 評估者
用例
- 評估模型
- 製作新的排行榜
- 做一個新的評估者
貢獻
- 貢獻一個模型
- 貢獻評估者
- 貢獻評估集
- 貢獻完成功能
限制
分析
- 長度控制的Alpacaeval
- 分析評估者
- 分析評估集
引用
附加信息
- 長度控制的獲勝率
- 羊駝毛2.0
- 數據發布
- 與Alpacafarm的差異
- 相關工作
- 解釋註釋
- 主要更新

概述

指導跟隨模型（例如，chatgpt）的評估通常需要人類互動。這是耗時的，昂貴的且難以復制的。基於LLM的自動評估中的羊駝羊角，該評估是快速，便宜，可複制的，並且可以針對20K人類註釋進行驗證。它對於模型開發特別有用。儘管我們在先前的自動評估管道上有所改善，但仍然存在基本限制，例如偏愛更長的產出。羊駝谷提供以下內容：

排行榜：Alpacaeval評估集上的常見模型的排行榜。謹慎：自動評估者（例如GPT-4）可能會偏向於在評估符基礎的模型上生成更長的輸出和/或對其進行微調的模型（例如GPT-4）。
自動評估器：與人類具有很高一致的自動評估器（在20K註釋中得到驗證）。我們通過測量功能強大的LLM（例如GPT-4）的次數來評估模型，而不是該模型的輸出而不是參考模型的輸出。我們的評估人員默認情況下可以啟用緩存和輸出隨機化。
用於構建自動評估器的工具包：用於構建高級自動評估器的簡單接口（例如，帶有緩存，批處理或多通道器）並分析它們（質量，價格，價格，速度，統計能力，偏見，方差等）。
人類評估數據：Alpacafarm評估集的給定和參考模型之間的20K人類偏好。其中2.5k是跨通道（4人註釋相同的650個例子）。
Alpacaeval數據集：簡化Alpacafarm的評估集，其中“指令”和“輸入”合併到一個字段中，參考輸出更長。詳細信息在這裡。

何時使用而不使用Alpacaeval？

何時使用羊駝毛？我們的自動評估器是一個快速，便宜的代理人，可用於評估簡單的指導跟踪任務。如果您必須在模型開發過程中快速運行許多評估，那麼這將很有用。

何時不使用山帕卡瓦爾？作為任何其他自動評估者，Alpacaeval不應在高風險決策中取代人類評估，例如決定模型釋放。特別是，阿爾帕卡瓦爾受到以下事實的限制：（1）評估集中的指示可能無法代表LLM的高級使用；（2）自動評估者可能有偏見，例如偏愛答案的樣式；（3）Alpacaeval不能衡量模型可能引起的風險。限制的詳細信息。

快速開始

要安裝穩定版本，請運行

pip install alpaca-eval

要安裝夜間版本，請運行

pip install git+https://github.com/tatsu-lab/alpaca_eval

然後，您可以按以下方式使用它：

 export OPENAI_API_KEY= < your_api_key > # for more complex configs, e.g. using Azure or switching clients see client_configs/README.md 
alpaca_eval --model_outputs ' example/outputs.json '

這將將排行榜打印到控制台，並將排行榜和註釋保存到與model_outputs文件相同的目錄中。重要參數如下：

model_outputs ：用於添加到排行榜的模型輸出的JSON文件的路徑。每個字典應包含鍵instruction和output 。
Annotators_config ：這是要使用的註釋者。我們建議使用weighted_alpaca_eval_gpt4_turbo （Alpacaeval 2.0的默認值），該量與我們的人類註釋數據，大的上下文大小具有很高的一致性率，並且非常便宜。有關所有註釋者的比較，請參見此處。
Reference_outputs ：參考模型的輸出。與model_outputs相同的格式。默認情況下，這是Alpacaeval 2.0的gpt4_turbo 。
output_path ：保存註釋和排行榜的路徑。

如果您沒有模型輸出，則可以使用evaluate_from_model並傳遞本地路徑或擁抱面模型的名稱，或者來自標準API（OpenAI，Anthropic，Anthropic，cohere，cohere，Google，...）的模型。其他命令：

>>> alpaca_eval -- --help

 SYNOPSIS
    alpaca_eval COMMAND

COMMANDS
    COMMAND is one of the following:

     evaluate
       Evaluate a model based on its outputs. This is the default entrypoint if no command is specified.

     evaluate_from_model
       Evaluate a model from HuggingFace or an API provider. This is a wrapper around `evaluate` which includes generating from a desired model.

     make_leaderboard
       Precompute and save an entire leaderboard for a given dataset / evaluator / set of models generations.

     analyze_evaluators
       Analyze an evaluator and populates the evaluators leaderboard (agreement with human, speed, price,...).

有關每個功能的更多信息，請使用alpaca_eval <command> -- --help 。

排行榜以及如何解釋它們

型號

我們的排行榜是在Alpacaeval數據集上計算的。我們使用不同的基線模型和自動通道者預先計算重要模型的排行榜。我們的兩個主要排行榜（“ Alpacaeval 2.0”和“ Alpacaeval”）可以在此頁面上找到。 “ Alpacaeval 2.0”使用weighted_alpaca_eval_gpt4_turbo作為註釋器，而gpt4_turbo用於基線。 “ Alpacaeval”使用alpaca_eval_gpt4用於註釋器，而text_davinci_003用於基線。對於所有預先計算的排行榜，請參見此處。稍後，我們還展示瞭如何將您的模型添加到排行榜中，以及如何為您的評估員/數據集製作新的排行榜。有關所有可用型號的配置，請參見此處。

羊角谷山羊角板最小排行榜：

	獲勝率	性病錯誤
GPT4	95.3	0.7
克勞德	88.4	1.1
chatgpt	86.1	1.2
瓜納科65B	71.8	1.6
Vicuna-13b	70.4	1.6
text_davinci_003	50.0	0.0
羊駝農業ppo-human	41.2	1.7
羊駝-7b	26.5	1.5
text_davinci_001	15.2	1.2

這些指標究竟如何計算？

獲勝率：贏率測量模型輸出的時間比參考的輸出（alpacaeval的test-davinci-003 ）和alpacaeval 2.0的gpt4_turbo所優於使用。更具體地說，要計算apacaeval數據集中每個指令中所需模型的輸出對的獲勝率。然後，我們將每個輸出與參考模型的輸出（例如text-davinci-003 ）的輸出配對。然後，我們詢問我們的自動評估器他們更喜歡哪種輸出。參見Alpacaeval的和Alpacaeval 2.0的提示和配置，特別是我們隨機對輸出順序進行隨機，以避免位置偏差。然後，我們平均比數據集中所有指令的偏好平均獲得基線的勝利率。如果兩個輸出完全相同，我們對這兩個模型都使用一半偏好。

標準錯誤：這是獲勝率的標準錯誤（按N-1歸一化），即，與不同指令相比的偏好平均。

有關我們的自動通信器的詳細信息： alpaca_eval_gpt4

我們的alpaca_eval_gpt4 （請參閱配置）註釋者平均比首選項，其中偏好的獲得如下：

它採用指令和一對輸出（來自所需模型和參考模型）
如果偏好是已經計算出的三倍，它將返回它（即使用緩存）
它隨機避免位置偏差
它格式化說明並輸出到以下零射擊提示中，該提示要求按優先順序訂購輸出
它使用temperature=0 GPT4完成提示
它解析了完成的偏好並將其退還

註釋者是Alpacafarm和鳥舍評估者（受高度影響）之間的混合。特別是，我們使用與Alpacafarm相同的代碼（緩存/隨機化/超參數），但使用類似於Aviary的排名提示。我們對鳥舍的提示進行更改，以減少更長輸出的偏差。相關工作的詳細信息。

對於Alpacaeval 2.0，我們使用weighted_alpaca_eval_gpt4_turbo ，它使用logprobs來計算連續的偏好，並將gpt4_turbo用作模型（請參閱配置）。

評估人員

我們通過與收集的2.5k人類註釋進行比較（每個人的4個人類註釋），評估了Alpacaeval設置的不同自動註釋。 Below we show metrics for our suggested evaluators ( weighted_alpaca_eval_gpt4_turbo , alpaca_eval_gpt4 ), for prior automatic evaluators ( alpaca_farm_greedy_gpt4 , aviary_gpt4 , lmsys_gpt4 ), for humans ( humans ), and for different base models with essentially the same prompt ( gpt4 , claude , text_davinci_003 ， chatgpt_fn ， guanaco_33b ， chatgpt ）。請參閱此處的所有評估器的配置，這些評估器及其相關指標可用。

	人類協議	價格[$/1000個示例]	時間[秒/1000個示例]	Spearman Corr。	皮爾遜·柯爾森（Pearson Corr）	偏見	方差	proba。喜歡更長的時間
羊Alpaca_eval_gpt4	69.2	13.6	1455	0.97	0.93	28.4	14.6	0.68
alpaca_eval_cot_gpt4_turbo_fn	68.6	6.3	1989	0.97	0.90	29.3	18.4	0.67
alpaca_eval_gpt4_turbo_fn	68.1	5.5	864	0.93	0.82	30.2	15.6	0.65
Alpaca_eval_llama3_70b_fn	67.5	0.4	209	0.90	0.86	32.3	8.2	0.79
GPT4	66.9	12.5	1037	0.88	0.87	31.5	14.6	0.65
羊Alpaca_farm_greedy_gpt4	66.4	15.3	878	0.85	0.75	30.2	19.3	0.60
alpaca_eval_cot_gpt4_turbo_fn	65.7	4.3	228	0.78	0.77	33.9	23.7	0.61
人類	65.7	300.0	36800	1.00	1.00	0.0	34.3	0.64
克勞德	65.3	3.3	173	0.93	0.90	32.4	18.5	0.66
lmsys_gpt4	65.3	13.9	17982	0.98	0.97	31.6	15.9	0.74
text_davinci_003	64.1	8.7	121	0.85	0.83	33.8	22.7	0.70
最長	62.2	0.0	0	0.27	0.56	37.8	0.0	1.00
chatgpt	57.3	0.8	285	0.72	0.71	39.4	34.1	0.59

這些指標究竟如何計算？

現在，我們用文字解釋如何計算上表中的指標。代碼在這裡。

人類的同意：這衡量了當前註釋者與人類的大多數偏好之間的一致性在我們的跨通道集中的約650個註釋中，該雜交包含4個人類註釋。為了估算單個人（上表中的humans行）與大多數人之間的一致性，我們採用了4個註釋之一，併計算預測其他3個註釋模式時它具有的準確性。然後，我們在所有4個註釋和650個指令中平均達到這種準確性，以獲得人類協議，即，我們計算了預期的（超過人類和样本）保留的協議。如果該模式不是唯一的，我們將隨機採用其中一種模式。我們為自動註釋者執行完全相同的計算，以便最終數字可比。

價格[$/1000示例] ：這是每1000個註釋的平均價格。對於人類而言，這是我們付給機械土耳其人收集這些註釋的價格（每小時21美元）。如果價格取決於用於計算註釋的計算機（例如鳥根），我們將其留空。

時間[秒/1000個示例] ：這是計算1000個註釋所需的平均時間。對於人類而言，每個機械Turker都以1000個例子為例，這是估計的中位時間。對於自動註釋，這是我們運行註釋時所花費的平均時間。請注意，這可能取決於不同用戶不同的API限制以及簇正在處理的請求數量。

Spearman Corr。 ：這可以衡量使用自動通信器的偏好計算的排行榜與以人類偏好計算的排行榜之間的相關性。與Human agreement一樣，我們使用alpacafarm的人類註釋，但現在我們考慮了方法級的協議，而不僅僅是與人類的樣本協議。請注意，我們只使用9個型號，因此相關性不是很可靠。

皮爾遜·柯爾森（Pearson Corr） ：與Spearman corr.但是與皮爾遜相關。

偏見：最有可能的人類標籤與最可能的自動標籤之間的一致。對於自動註釋，我們通過為每個示例對4個不同的註釋進行採樣來估算它。此處的隨機性來自提示中輸出的順序，從LLM進行採樣，如果適用，則在批處理中的指令順序以及在池中選擇註釋器。然後，我們採用4個註釋的模式，並在預測4個人註釋的模式時計算模式的準確性。請注意，如果我們擁有“無限”的盤通道數量，這可能是對我們將得到的真正偏見的高估。低偏見意味著註釋者的期望與人類相同。對於人類，偏見為零。請注意，這與標準的統計偏差無關，但無關，因為我們採用該模式而不是平均註釋，並且我們考慮0-1損失而不是平方損失。

差異：預期協議單個自動偏好，最有可能的偏好。我們以與估計人類的“人類協議”相同的方式估計它，即，在使用第四次註釋預測3個註釋的模式時，我們會在預期的一個出現錯誤中估算。較低的差異意味著註釋與其偏好相一致，即，如果您用不同的種子對其進行採樣，則會產生相同的結果。與偏差一樣，這並不是標準統計差異，因為我們採用模式而不是平均註釋，我們考慮0-1損失而不是平方損失。

請注意，“人類一致”與偏見和差異密切相關。特別是，由於我們僅使用單個註釋，而偏見旨在測量當前註釋者的不可約誤差，因此該方差衡量了誤差。

proba。偏愛更長的時間：這是當兩個輸出之一明顯長於另一個輸出時，註釋者更喜歡較長的輸出（超過30個字符的差異）。

在整個表中，我們還提供以下指標：

proba。優先列表：這是註釋者更喜歡輸出包含列表/項目符號點的概率，而一個輸出則不是另一個輸出。

proba。優先1 ：這是註釋者更喜歡第一對輸出的概率。我們所有提出的註釋者在提示中隨機對輸出進行隨機，因此應為0.5。先前的註釋者，例如lmsys和aviary ，都沒有。

＃解析：這是註釋者能夠解析的示例數。

請注意，如果差異和偏差為空，則意味著由於資源（時間和價格）約束，我們只為每個648示例執行一個單個註釋。這解釋了為什麼#Parsed為648，否則應為2592。

選擇評估者的提示

總體而言，我們建議使用annotators_config=weighted_alpaca_eval_gpt4_turbo如果您想要與人類達成的高度協議，並且如果預算緊張，則annotators_config=chatgpt_fn 。

選擇註釋者時，我們建議您考慮以下內容（前三個很明顯）：

"Human agreement [%]"
"Price [$/1000 examples]"
"Time [seconds/1000 examples]"
"* corr."大約> 0.7。重要的是，相關性不是太低，但是我們不建議將其用作主要指標，因為僅在9個模型上計算相關性。
"Proba. prefer longer" 。 <0.7。確實，我們發現，人類註釋者的大多數偏好對更長的答案具有強大的偏見（如"longest"評估者的高性能= 62.2，始終更喜歡最長的輸出）。這表明它可能與人類註釋者更有偏見。為了避免具有長度偏見的排行榜，我們建議使用低於0.7“ proba。更喜歡更長的自動註釋者”。
"Variance"大約<0.2。我們認為，良好的評估者應具有盡可能少的差異，以使結果大多可再現。請注意，在我們正在模擬人類的情況下，如Alpacafarm中所示，這是可取的。

我們過濾了不滿足上表中這些要求的註釋者（除了人類 / chatgpt / 003 / lmsys以外，出於參考目的）。有關所有結果，請參見此處。通常，我們發現weighted_alpaca_eval_gpt4_turbo是質量 /價格 /時間 /差異 /長度偏差之間的良好權衡。

上述指標是根據人群工人的註釋計算的。儘管有用，但這些註釋並不完美，例如，人群工作者通常偏愛風格而不是事實。因此，我們建議用戶根據自己的說明和人類註釋來驗證自動評估者。限制的詳細信息。

用例

評估模型

>>> alpaca_eval evaluate -- --help

 NAME
    alpaca_eval evaluate - Evaluate a model based on its outputs. This is the default entrypoint if no command is specified.

SYNOPSIS
    alpaca_eval evaluate <flags>

DESCRIPTION
    Evaluate a model based on its outputs. This is the default entrypoint if no command is specified.

FLAGS
    --model_outputs=MODEL_OUTPUTS
        Type: Optional[Union]
        Default: None
        The outputs of the model to add to the leaderboard. Accepts data (list of dictionary, pd.dataframe, datasets.Dataset) or a path to read those (json, csv, tsv) or a function to generate those. Each dictionary (or row of dataframe) should contain the keys that are formatted in the prompts. E.g. by default `instruction` and `output` with optional `input`. If None, we just print the leaderboard.
    -r, --reference_outputs=REFERENCE_OUTPUTS
        Type: Union
        Default: <func...
        The outputs of the reference model. Same format as `model_outputs`. If None, the reference outputs are a specific set of Davinci 003 outputs on the AlpacaEval set:
    --annotators_config=ANNOTATORS_CONFIG
        Type: Union
        Default: 'alpaca_eval_gpt4_turbo_fn'
        The path the (or list of dict of) the annotator's config file. For details see the docstring of `PairwiseAnnotator`.
    -n, --name=NAME
        Type: Optional[Optional]
        Default: None
        The name of the model to add to the leaderboard. If None we check if `generator is in model_outputs` if not we use "Current model".
    -o, --output_path=OUTPUT_PATH
        Type: Union
        Default: 'auto'
        Path to the directory where the new leaderboard and the annotations should be stored. If None we don't save. If `auto` we use `model_outputs` if it is a path, and otherwise use the directory from which we call the script.
    -p, --precomputed_leaderboard=PRECOMPUTED_LEADERBOARD
        Type: Union
        Default: 'auto'
        The precomputed leaderboard or a path to it (json, csv, or tsv). The leaderboard should contain at least the column `win_rate`. If `auto` we will try to use the corresponding leaderboard for the reference outputs (only if in CORRESPONDING_OUTPUTS_LEADERBOARDS). If `None` we won't add other models from the leaderboard.
    --is_overwrite_leaderboard=IS_OVERWRITE_LEADERBOARD
        Type: bool
        Default: False
        Whether to overwrite the leaderboard if the model is already in it.
    -l, --leaderboard_mode_to_print=LEADERBOARD_MODE_TO_PRINT
        Type: Optional
        Default: 'minimal'
        The mode of the leaderboard to use. Only used if the precomputed leaderboard has a column `mode`, in which case it will filter the leaderboard by this mode. If None keeps all.
    -c, --current_leaderboard_mode=CURRENT_LEADERBOARD_MODE
        Type: str
        Default: 'community'
        The mode of the leaderboard for the current method.
    --is_return_instead_of_print=IS_RETURN_INSTEAD_OF_PRINT
        Type: bool
        Default: False
        Whether to return the metrics instead of printing the results.
    -f, --fn_metric=FN_METRIC
        Type: Union
        Default: 'pairwise_to_winrate'
        The function or function name in `metrics.py` that will be used to convert preference to metrics. The function should take a sequence of preferences (0 for draw, 1 for base win, 2 when the model to compare wins) and return a dictionary of metrics and the key by which to sort the leaderboard.
    -s, --sort_by=SORT_BY
        Type: str
        Default: 'win_rate'
        The key by which to sort the leaderboard.
    --is_cache_leaderboard=IS_CACHE_LEADERBOARD
        Type: Optional[Optional]
        Default: None
        Whether to save the result leaderboard to `precomputed_leaderboard`. If None we save only if max_instances not None. A preferred way of adding models to the leaderboard is to set `precomputed_leaderboard` to the previously saved leaderboard at `<output_path>/leaderboard.csv`.
    --max_instances=MAX_INSTANCES
        Type: Optional[Optional]
        Default: None
        The maximum number of instances to annotate. Useful for testing.
    --annotation_kwargs=ANNOTATION_KWARGS
        Type: Optional[Optional]
        Default: None
        Additional arguments to pass to `PairwiseAnnotator.annotate_head2head`.
    -A, --Annotator=ANNOTATOR
        Default: <class 'alpaca_eval.annotators.pairwise_evaluator.PairwiseAn...
        The annotator class to use.
    Additional flags are accepted.
        Additional arguments to pass to `PairwiseAnnotator`.

>>> alpaca_eval evaluate_from_model -- --help

 NAME
    alpaca_eval evaluate_from_model - Evaluate a model from HuggingFace or an API provider. This is a wrapper around `evaluate` which includes generating from a desired model.

SYNOPSIS
    alpaca_eval evaluate_from_model MODEL_CONFIGS <flags>

DESCRIPTION
    Evaluate a model from HuggingFace or an API provider. This is a wrapper around `evaluate` which includes generating from a desired model.

POSITIONAL ARGUMENTS
    MODEL_CONFIGS
        Type: Union
        A dictionary or path (relative to `models_configs`) to a yaml file containing the configuration of the model to decode from. If a directory,we search for 'configs.yaml' in it. The keys in the first dictionary should be the generator's name, and the value should be a dictionary of the generator's configuration which should have the

FLAGS
    -r, --reference_model_configs=REFERENCE_MODEL_CONFIGS
        Type: Optional[Union]
        Default: None
        Same as in `model_configs` but for the reference model. If None, we use the default Davinci003 outputs.
    -e, --evaluation_dataset=EVALUATION_DATASET
        Type: Union
        Default: <func...
        Path to the evaluation dataset or a function that returns a dataframe. If None, we use the default evaluation
    -a, --annotators_config=ANNOTATORS_CONFIG
        Type: Union
        Default: 'alpaca_eval_gpt4_turbo_fn'
        Path to the annotators configuration or a dictionary. If None, we use the default annotators configuration.
    -o, --output_path=OUTPUT_PATH
        Type: Union
        Default: 'auto'
        Path to save the generations, annotations and leaderboard. If auto saves at `results/<model_name>`
    -m, --max_instances=MAX_INSTANCES
        Type: Optional[int]
        Default: None
        Maximum number of instances to generate and evaluate. If None, we evaluate all instances.
    --is_strip_output=IS_STRIP_OUTPUT
        Type: bool
        Default: True
        Whether to strip trailing and leading whitespaces from the outputs.
    --is_load_outputs=IS_LOAD_OUTPUTS
        Type: bool
        Default: True
        Whether to try to load outputs from the output path. If True and outputs exist we only generate outputs for instructions that don't have outputs yet.
    -c, --chunksize=CHUNKSIZE
        Type: int
        Default: 64
        Number of instances to generate before saving. If None, we save after all generations.
    Additional flags are accepted.
        Other kwargs to `evaluate`

NOTES
    You can also use flags syntax for POSITIONAL ARGUMENTS

要評估一個模型，您需要：

選擇一個評估集，併計算指定為model_outputs輸出。默認情況下，我們使用來自Alpacaeval的805個示例。計算山帕瓦爾的輸出：

 import datasets

eval_set = datasets . load_dataset ( "tatsu-lab/alpaca_eval" , "alpaca_eval" )[ "eval" ]
for example in eval_set :
    # generate here is a placeholder for your models generations
    example [ "output" ] = generate ( example [ "instruction" ])
    example [ "generator" ] = "my_model" # name of your model

如果您的型號是擁抱面模型或來自標準API提供商（OpenAI，Anthropic，Cohere）。然後，您可以直接使用alpaca_eval evaluate_from_model來照顧產生輸出。

計算參考輸出reference_outputs 。默認情況下，我們在Alpacaeval上使用gpt4_turbo的預算輸出。如果要使用其他模型或其他數據集，則遵循與（1.）相同的步驟。
選擇通過annotators_config指定的評估器。我們建議使用alpaca_eval_gpt4_turbo_fn 。有關其他選項和比較，請參見此表。根據評估器，您可能需要在環境中或int client_configs中設置適當的API_KEY。

一起運行：

alpaca_eval --model_outputs ' example/outputs.json ' 
  --annotators_config ' alpaca_eval_gpt4_turbo_fn '

如果您沒有解碼的輸出，則可以使用evaluate_from_model ，以照顧您的解碼（模型和參考）。這是一個例子：

 # need a GPU for local models
alpaca_eval evaluate_from_model 
  --model_configs ' oasst_pythia_12b ' 
  --annotators_config ' alpaca_eval_gpt4_turbo_fn '

在這裡， model_configs和reference_model_configs （可選）是指定提示的目錄的路徑，該路徑指定了提示，模型提供商（此處為HuggingFace）和解碼參數。請參閱此目錄以獲取示例。對於所有可用開箱即用的模型提供商，請參見此處。

有關註釋者的信息

緩存：默認情況下，所有註釋均在caching_path磁盤上緩存。因此，註釋永遠不會重新計算，這使註釋更快，更便宜並允許可重複性。即使評估不同的模型，這也有幫助，因為許多模型具有相同的輸出。
默認情況下，輸出隨機化，我們在輸出示例上隨機進行隨機化，因為我們發現註釋者傾向於更喜歡他們看到的第一個示例。
批處理我們為批處理註釋提供代碼和示例，如果提示很長，這會減少註釋的成本和時間。例如，請參見alpaca_farm_greedy_gpt4。
註釋者池我們提供代碼和示例以使用自動註釋庫進行評估，這有助於復制人類註釋的差異。例如，請參見羊Alpaca_farm。
根據指令，我們根據指令根據指令（輸出順序，批處理，批次，每個註釋者的示例）根據指令進行播種。

製作新的排行榜

>>> alpaca_eval make_leaderboard -- --help

 NAME
    alpaca_eval make_leaderboard - Precompute and save an entire leaderboard for a given dataset / evaluator / set of models generations.

SYNOPSIS
    alpaca_eval make_leaderboard <flags>

DESCRIPTION
    Precompute and save an entire leaderboard for a given dataset / evaluator / set of models generations.

FLAGS
    --leaderboard_path=LEADERBOARD_PATH
        Type: Optional[Union]
        Default: None
        The path to save the leaderboard to. The leaderboard will be saved as a csv file, if it already exists it will
    --annotators_config=ANNOTATORS_CONFIG
        Type: Union
        Default: 'alpaca_eval_gpt4_turbo_fn'
        The path the (or list of dict of) the annotator's config file.
    --all_model_outputs=ALL_MODEL_OUTPUTS
        Type: Union
        Default: <fu...
        The outputs of all models to add to the leaderboard. Accepts data (list of dictionary, pd.dataframe, datasets.Dataset) or a path to read those (json, csv, tsv potentially with globbing) or a function to generate those. If the path contains a globbing pattern, we will read all files matching the pattern and concatenate them. Each dictionary (or row of dataframe) should contain the keys that are formatted in the prompts. E.g. by default `instruction` and `output` with optional `input`. It should also contain a column `generator` with the name of the current model.
    -r, --reference_outputs=REFERENCE_OUTPUTS
        Type: Union
        Default: <func...
        The outputs of the reference model. Same format as `all_model_outputs` but without needing `generator`. By default, the reference outputs are the 003 outputs on AlpacaEval set.
    -f, --fn_add_to_leaderboard=FN_ADD_TO_LEADERBOARD
        Type: Callable
        Default: 'evaluate'
        The function to use to add a model to the leaderboard. If a string, it should be the name of a function in `main.py`. The function should take the arguments: `model_outputs`, `annotators_config`, `name`, `precomputed_leaderboard`, `is_return_instead_of_print`, `reference_outputs`.
    --leaderboard_mode=LEADERBOARD_MODE
        Type: str
        Default: 'verified'
        The mode of the leaderboard to save all new entries with.
    -i, --is_return_instead_of_print=IS_RETURN_INSTEAD_OF_PRINT
        Type: bool
        Default: False
        Whether to return the metrics instead of printing the results.
    Additional flags are accepted.
        Additional arguments to pass to `fn_add_to_leaderboard`.

如果要使用單個命令（而不是多個alpaca_eval調用）進行新的排行榜，則可以使用以下內容：

alpaca_eval make_leaderboard 
  --leaderboard_path < path_to_save_leaderboard > 
  --all_model_outputs < model_outputs_path > 
  --reference_outputs < reference_outputs_path > 
  --annotators_config < path_to_config.yaml >

在哪裡：

leaderboard_path ：將排行榜保存到的路徑。排行榜將作為CSV文件保存，如果它已經存在，則將附加。
all_model_outputs ：所有模型輸出的JSON路徑要添加到排行榜中（作為單個文件或通過包圍多個文件）。每個字典應包含在提示中格式化的鍵（ instruction和output ），並包含帶有當前模型名稱的列generator 。例如，請參見此文件。
reference_outputs參考模型輸出的路徑。每個字典應包含提示中格式的鍵（ instruction和output ）。默認情況下，參考輸出是Alpacaeval集合的003輸出。
annotators_config ：註釋器配置文件的路徑。默認為alpaca_eval_gpt4 。

做一個新的評估者

>>> alpaca_eval analyze_evaluators -- --help

 NAME
    alpaca_eval analyze_evaluators - Analyze an evaluator and populates the evaluators leaderboard (agreement with human, speed, price,...).

SYNOPSIS
    alpaca_eval analyze_evaluators <flags>

DESCRIPTION
    Analyze an evaluator and populates the evaluators leaderboard (agreement with human, speed, price,...).

FLAGS
    --annotators_config=ANNOTATORS_CONFIG
        Type: Union
        Default: 'alpaca_eval_gpt4_turbo_fn'
        The path the (or list of dict of) the annotator's config file.
    -A, --Annotator=ANNOTATOR
        Default: <class 'alpaca_eval.annotators.pairwise_evaluator.PairwiseAn...
        The annotator class to use.
    --analyzer_kwargs=ANALYZER_KWARGS
        Type: Optional[Optional]
        Default: None
        Additional arguments to pass to the analyzer.
    -p, --precomputed_leaderboard=PRECOMPUTED_LEADERBOARD
        Type: Union
        Default: PosixPath('/Users/yanndubois/Desktop/GitHub/alpaca_eval/src/...
        The precomputed (meta)leaderboard of annotators or a path to it (json, csv, or tsv).
    --is_save_leaderboard=IS_SAVE_LEADERBOARD
        Type: bool
        Default: False
        Whether to save the leaderboard (ie analyzed results).
    --is_return_instead_of_print=IS_RETURN_INSTEAD_OF_PRINT
        Type: bool
        Default: False
        Whether to return the leaderboard (ie analyzed results). If True, it will not print the results.
    --is_overwrite_leaderboard=IS_OVERWRITE_LEADERBOARD
        Type: bool
        Default: False
        Whether to overwrite the leaderboard if it already exists.
    -m, --max_instances=MAX_INSTANCES
        Type: Optional[Optional]
        Default: None
        The maximum number of instances to analyze.
    --is_single_annotator=IS_SINGLE_ANNOTATOR
        Type: bool
        Default: False
        Whether to analyze a single annotator. If True, will not be able to estimate the annotator's bias.
    -l, --leaderboard_mode_to_print=LEADERBOARD_MODE_TO_PRINT
        Type: str
        Default: 'minimal'
        The mode of the leaderboard to print.
    -c, --current_leaderboard_mode=CURRENT_LEADERBOARD_MODE
        Type: str
        Default: 'minimal'
        The mode of the leaderboard to save all new entries with.
    -o, --output_path=OUTPUT_PATH
        Type: Union
        Default: 'auto'
        Path to save the leaderboard and annotataions. If None, we don't save.
    Additional flags are accepted.
        Additional arguments to pass to `Annotator`.

Alpacaeval提供了一種製作新評估者的簡單方法。您需要的只是製作一個新的configs.yaml配置文件，然後將其作為--annotators_config <path_to_config.yaml>傳遞給alpaca_eval 。以下是您可以製作新評估者的一些方法：

更改提示符：在文本文件中寫一個新的提示符，並在配置文件的prompt_template中指定路徑。路徑相對於配置文件。
更改解碼參數：在配置文件中completions_kwargs中的所需參數。要查看所有可用參數，請參閱該文件中fn_completions在配置文件中指定的相應函數的DOCSTRINGS。
更改模型：在model_name中指定所需的模型，並在prompt_template中的相應提示。如果該模型來自另一個提供商，則必須更改fn_completions ，以將其映射到該文件中的相應函數。我們提供fn_completions功能，以使用OpenAI，人類，cohere或HuggingFace的模型。要安裝所有提供商所需的軟件包，請使用pip install alpaca_eval[all] 。

配置文件中的其他參數

最簡單的是檢查SinglePairwiseAnnotator的docstrings。這是一些重要的：

 Parameters
----------
prompt_template : path
    A prompt that will be given to `fn_prompter` or path to the prompts. Path is relative to
    `evaluators_configs/`

fn_completion_parser : callable or str
    Function in `completion_parsers.py` to use for parsing the completions into preferences. For each completion,
    the number of preferences should be equal to the batch_size if not we set all the preferences in that batch to
    NaN.

completion_parser_kwargs : dict
    Kwargs for fn_completion_parser.

fn_completions : callable or str
    Function in `decoders.py` to use for decoding the output.

completions_kwargs : dict
    kwargs for fn_completions. E.g. model_name, max_tokens, temperature, top_p, top_k, stop_seq.

is_randomize_output_order : bool
    Whether to randomize output_1, output_2 when formatting.

batch_size : int
    Number of examples that will be added in a single prompt.

製作評估者後，您還可以使用以下命令將其分析並將其添加到評估人員的排行榜中：

alpaca_eval analyze_evaluators --annotators_config ' <path_to_config.yaml> '

為了估計偏見和差異，這將評估4個種子的每個例子，即2.5k評估。如果您想要廉價的評估，則可以使用--is_single_annotator True使用單個種子，這將跳過偏差和差異的估計。

貢獻

除錯誤修復外，我們還接受新模型，評估人員和評估集的PR。我們將定期使用新的社區捐款更新排行榜網站。如果您遇到任何問題並希望向社區詢問幫助，我們還為Alpacaeval創建了一個支持分歧。

要開始，請首先分配存儲庫，然後從源pip install -e .

貢獻一個模型

首先，您需要在Models_configs文件夾中添加模型配置定義。例如，您可以查看Falcon-7b-Instruct YAML。請確保YAML匹配中的文件夾名稱和密鑰名稱。

然後，請按照評估模型的步驟來運行模型的推斷，以在評估集中產生輸出並根據評估者之一對模型進行評分。一個示例命令可能看起來像：

alpaca_eval evaluate_from_model 
  --model_configs ' falcon-7b-instruct '

運行此命令後，您應該在相應的排行榜文件中生成一個輸出JSON和新條目。請使用配置，輸出文件和更新的排行榜進行PR。

具體而言，您應該做類似的事情：

分叉GitHub中的存儲庫
克隆分叉的存儲庫git clone <URL>
在src/alpaca_eval/models_configs/<model_name>上製作模型配置，然後評估它evaluate_from_model --model_configs '<model_name>'
將模型配置，輸出和排行榜輸入添加到分叉存儲庫

git add src/alpaca_eval/models_configs/ < model_name > # add the model config
git add src/alpaca_eval/leaderboards/ # add the actual leaderboard entry
git add src/alpaca_eval/metrics/weights # add the weights for LC
git add -f results/ < model_name > /model_outputs.json # force add the outputs on the dataset
git add -f results/ < model_name > / * /annotations.json # force add the evaluations from the annotators
git commit -m " Add <model_name> to AlpacaEval "
git push

在Alpacaeval上創建拉動請求

注意：如果要在Alpacaeval之外生成輸出，則仍應添加模型配置，但使用fn_completions: null 。有關一個示例，請參見此配置。

驗證您的模型

Alpacaeval的驗證結果表明，核心維護者已解碼了模型的輸出並進行了評估。不幸的是，我們是山地肌維護者，缺乏驗證所有模型的資源，因此我們只能為排行榜前5名的模型做到這一點。對於給您帶來的任何不便，我們深表歉意，並感謝您的理解。要驗證您的模型，請按照以下步驟操作：

請聯繫@yann在Discord上，或者如果您有我們的電子郵件，請給我們發送電子郵件，提供一個簡短的理由，說明為什麼要驗證您的模型。
在進行之前等待我們的回應和批准。
準備一個不需要GPU的模型中解碼的腳本，通常與用於模型貢獻的腳本相同。它應該使用alpaca_eval evaluate_from_model --model_configs '<your_model_name>'運行，而無需本地GPU。
生成用於運行腳本的臨時API鍵並與我們共享。具體來說，我們需要解碼您的模型和評估的鍵（例如，OpenAI或人類密鑰）。
我們將執行alpaca_eval evaluate_from_model --model_configs '<your_model_name>' ，更新結果並通知您，以便您可以撤銷臨時密鑰。

請注意，我們不會重新評估相同的模型。由於採樣差異，結果可能與您的初始結果略有不同。我們將用經過驗證的結果替換您以前的社區結果。

貢獻評估者

請首先遵循製作新評估者的指示。創建註釋器配置後，我們要求您通過評估最小模型集為註釋者創建一個新的排行榜。這些模型的輸出可以通過下載alpaca_eval_all_outputs.json找到。

alpaca_eval make_leaderboard 
  --leaderboard_path src/alpaca_eval/leaderboards/data_AlpacaEval/ < evaluator > _leaderboard.csv 
  --all_model_outputs alpaca_eval_all_outputs.json 
  --annotators_config < evaluator_config >

然後，請使用註釋器配置和排行榜CSV創建PR。

貢獻評估集

為了貢獻新的評估集，您首先需要指定一組文本說明。然後，您需要指定一組參考輸出（根據此參考計算模型贏率）。為了易於使用，您可以使用默認的Text-Davinci-003參考配置。

將它們放在JSON中，每個條目都指定字段instruction ， output和generator 。您可以將alpaca_eval.json視為指南（不需要dataset集字段）。

最後，我們要求您在此新評估集中創建最小的排行榜。您可以使用以下操作：

alpaca_eval make_leaderboard 
  --leaderboard_path < src/alpaca_eval/leaderboards/data_AlpacaEval/your_leaderboard_name.csv > 
  --all_model_outputs alpaca_eval_all_outputs.json 
  --reference_outputs < path_to_json_file >

請與評估集JSON和相應的排行榜CSV一起提交PR。

貢獻完成功能

當前，我們允許不同的完成功能，例如， openai ， anthropic ， huggingface_local ， huggingface_hub_api ...如果您想貢獻一個新的完成函數 /可以執行推斷的新完成功能 / API，請按照以下步驟操作：

在解碼器文件夾中添加一個帶有函數<name>_completions(prompts : Sequence[str], model_name :str, ... ) 。此功能應作為參數提示 + kwargs並返回完成。請查看目錄中的其他完成功能的模板。例如，huggingface_local_completions或人類。
在INIT中添加<name>_completions和依賴項。再次，您可以按照huggingface_local_completions的示例
在setup.py中更新可選依賴項
添加您要在模型配置中評估的模型
使用alpaca_eval evaluate_from_model --model_configs '<model_configs>'評估模型
（可選）遵循這些步驟

隨意提早開始PR，我們將能夠在此過程中提供一些幫助！

限制

與其他當前評估者一樣，羊角谷評估管道具有重要的局限性，因此不應用作重要環境中人類評估的替代，例如決定是否準備部署模型。這些可以大致分為三類：

指令可能無法代表現實意見：alpacaeval集合包含來自各種數據集（自我指導，開放式輔助，Vicuna，vicuna，koala，hh-rlHF）的示例，這些示例可能不代表現實通用和諸如GPT4之類的更好模型的現實使用和高級應用。這可能會使最佳的封閉模型（GPT4 / Claude / Chatgpt / ...）看起來比開放型模型更相似。實際上，這些封閉的模型似乎在更多樣化的數據上審計/填充。例如，有關更複雜的說明的初步結果，請參見此博客。但是，請注意，在Alpacafarm中，我們表明我們的評估集中的獲勝率高度相關（0.97 R2），並在用戶與羊駝演示的用戶互動中進行了贏率。此外，與其他排行榜（例如LMSYS）相比，山帕瓦爾排行榜在開放型和OpenAI模型之間顯示出更大的差距。
自動註釋者的偏見：原始自動註釋者似乎具有隱式偏見。特別是，我們發現它們傾向於更喜歡包含列表的更長輸出和輸出（例如，羊alpaca_eval_gpt4的0.68 / 0.69， claude的0.62 / 0.58）。儘管我們發現人類具有相似的偏見（0.64 / 0.61），但我們認為，這可能更多地是人類註釋管道的限制，我們使用的而不是真正的人類偏見。更普遍的是，通過定性分析，我們發現自動註釋者比其內容（例如，事實）更重要。最後，我們發現自動評估者傾向於喜歡與claude和alpaca_eval_gpt4排行榜上的ChatGPT/GPT4之間的很大差異所暗示的相似模型（可能是在相同數據上進行訓練）的輸出。請注意，在我們的長度控制的勝利率中，長度偏差部分緩解。
缺乏安全性評估：重要的是，Alpacaeval僅評估模型的指令跟踪功能，而不是它們可能造成的危害（例如有毒行為或偏見）。結果，當前Chatgpt和最佳開源模型之間的微小差距不應被解釋，就好像後者準備好部署一樣。

除了對評估管道的限制之外，我們對評估者的驗證以及我們建議的選擇評估集的方法還有局限性。

有關我們驗證管道的限制

首先，我們基於人類交叉通道的評估者的驗證受到以下局限性：（1）我們定性地發現，我們的人群工人也傾向於偏愛諸如長度和列表的風格，而不是事實；（2）這沒有驗證針對參考模型的勝利率首先是良好的評估策略；（3）16名群眾的偏好並不代表所有人類的偏好。

其次，我們建議根據統計能力選擇評估集的建議方法受到以下局限性：（1）統計功率不能確保正確的方向，例如，您可以擁有一套不自然的指令，而羊駝“執行”比更好的模型更好；（2）這可以推動用戶選擇數據以支持他們想要驗證的假設。

附加分析和圖

長度控制的山帕（LCAE）

長度控制的羊石瓦爾可視化：

長度控制的谷泊瓦爾發展：

該筆記本顯示了我們考慮的不同選項，以減輕自動註釋者的長度偏差。

在這裡，我們簡要總結了主要結果。即：

對於Alpacaeval 2.0， LCAE將與聊天舞台的相關性從0.94增加到0.94 。如下圖所示，這使LCAE成為與聊天競技場最高度相關的基準。

LC Alpacaeval是與聊天競技場最高度相關的基準。

LCAE降低了長度的遊戲性， Alpacaeval的主要問題之一是，您可以通過增加輸出的長度來增加贏得率。例如，在Alpacaeval 2.0中，當提示“盡可能多地給出細節”時，基線的贏率（50％）增加到64％，並在提示“盡可能簡潔的同時仍然提供所有必要的信息來回答問題時，將減少到23％”。更概念上，alpacaeval的相對長度的配柴性約為21％，LCAE的相對長度降至約6％，因此迅速長度的遊戲降低了3倍。這在下面的圖中顯示。

LC Alpacaeval降低了基準的長度連接性。

We can predict performance for different baselines One other benefit of using a GLM for controlling for length bias. Is that we now have a model that can predict the win-rate of a model for different baselines. In particular, our GLM has many nice properties, for example win_rate(m,b) = 1 - win_rate(b,m) in [0,1] and win_rate(m,m) = 0.5 . This is shown in the plot below.

Predicted win rate for different baselines

Finally, note that we are only controlling for length bias. There are other known biases that we are not controlling for, such as the fact that auto-annotators prefer outputs similar to their model. Although we could control for that, in practice we have found that to be less of an issue than length bias. For two reasons (1) this mostly a single model in the leaderboard because fine-tuning on outputs from the auto-annotator doesn't seem to have doesn't seem to impact the win-rate as much, and (2) the bias is actually less strong that what one could think. For example we show below a subset of the leaderboards auto-annotated by three different models, and we see that the ranking of models is exactly the same. In particular, claude-3-opus prefers gpt4_preview , and mistral-large prefers the former two.

Leaderboard by different auto-annotators

Analyzing an evaluator

Caution : all the following results are about AlpacaEval 1.0 and have not been updated since

Analyzing evaluators:

As we saw in the evaluator's leaderboard, there are many metrics to consider when selecting an evaluator, eg the quality, price, and speed. To assist with selection of the evaluator we provide a few functions to plot those metrics. The following shows for example the price/time/agreement of the different evaluators.

Here we see that alpaca_eval_gpt4 performs very well and is better than humans on all the considered metrics.

Previously we only considered the agreement with human annotators overall. An additional validation that one could do is checking whether making a leaderboard using our automatic annotator gives similar results as a leaderboard from humans. To enable such analysis, we release human annotations of outputs from 22 methods from AlpacaFarm => 22*805 = ~18K annotations. As a result we can test the correlation between the win-rates of the 22 models as evaluated by the humans and our automatic annotator. Note that this is arguably a better way of selecting an automatic evaluator than using "human agreement [%]" but is expensive given that it requires 18K annotations. The plot below shows such correlation for the alpaca_eval_gpt4 evaluator.

Correlation between humans and alpaca_eval_gpt4

We see that the alpaca_eval_gpt4 leaderboard is highly correlated (0.94 Pearson correlation) to the leaderboard from humans, which further suggests that automatic evaluation is a good proxy for human evaluation. For the code and more analysis, see this notebook, or the colab notebook above.

Analyzing an eval set

Caution : all the following results are about AlpacaEval 1.0 and have not been updated since.

Making evaluation sets:

When creating an evaluation set there are two main factors to consider: how much data to use? and what data?

One way of answering those question is by considering a leaderboard of models that you believe are of different quality and checking what and how much data is needed to distinguish between them in a statistically significant way. We will do so below using a paired t-test to test if the difference in win-rates between every pair of models is statistically significant.

First, let us consider the question of how much data to use. Below we show the number of random samples needed from AlpacaEval for the paired t-test to give a p-value < 0.05 for each pair of models in the minimal alpaca_eval_gpt4 leaderboard. Grey cells correspond to pairs that are not significantly different on the 805 samples. y- and x-axis are ordered by the win-rate of the first and second model respectively.

Number of samples needed to distinguish pairs in the Claude leaderboard

We see that most models can already be distinguished with 50 samples, and that 150 samples allows distinguishing the majority of pairs (74 out of 78). This suggests that we can decrease the evaluation set size by a factor of 4 when testing two models that have similar performance gaps as those on the minimal alpaca_eval_gpt4 leaderboard.

The second question is what data to use. Again we can try to answer this question from a statistical power perspective: what data allows to best distinguish between models. Let's consider this for all the datasets that are part of AlpacaEval, but let us control for the size of the evaluation sets as we only care about the quality of the data. The following plot shows the p-values from the paired t-test of each pairs of models on 80 examples of each subset of AlpacaEval.

We see for example that the self-instruct dataset yields the least statistical power, which suggests that one could remove this dataset from the evaluation set. The exact reason should be analyzed in future work. For the code and more analysis see this notebook, or the colab notebook above.

引用

Please consider citing the following depending on what you are using and referring to:

Code, results, and general benchmark : alpaca_eval (this repo). Specify whether you are using AlpacaEval or AlpacaEval 2.0. For length-controlled win-rates see below.
Length-controlled (LC) win rates : alpaca_eval_length .
Human annotations : dubois2023alpacafarm (AlpacaFarm)
AlpacaEval evaluation set : alpaca_eval and self-instruct, open-assistant, vicuna, koala, hh-rlhf.

Here are the bibtex entries:

 @misc{alpaca_eval,
  author = {Xuechen Li and Tianyi Zhang and Yann Dubois and Rohan Taori and Ishaan Gulrajani and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto },
  title = {AlpacaEval: An Automatic Evaluator of Instruction-following Models},
  year = {2023},
  month = {5},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {url{https://github.com/tatsu-lab/alpaca_eval}}
}

 @article{dubois2024length,
  title={Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators},
  author={Dubois, Yann and Galambosi, Bal{'a}zs and Liang, Percy and Hashimoto, Tatsunori B},
  journal={arXiv preprint arXiv:2404.04475},
  year={2024}
}

 @misc{dubois2023alpacafarm,
  title={AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback}, 
  author={Yann Dubois and Xuechen Li and Rohan Taori and Tianyi Zhang and Ishaan Gulrajani and Jimmy Ba and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto},
  year={2023},
  eprint={2305.14387},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}

Length-Controlled Win Rates

Length controlled (LC) win-rates are a debiased version of the win-rates that control for the length of the outputs.

The main idea is that for each model we will fit a logistic regression to predict the preference of the autoannotator given: (1) the instruction, (2) the model, and (3) the difference of length between the baseline and model output. Given such a logistic regression we can then try to predict the counterfactual "what would the preference be if the model's output had the same length as the baseline" by setting the length difference to 0. By averaging over this length-controlled preference, we then obtain the length-controlled win-rate. The exact form of the logistic regression is taken such that the interpretation of LC win rates is similar to the raw win rates, for example for any model m1 and m2 we have win_rate(m1, m2) = 1 - win_rate(m2, m1) in [0,100] and win_rate(m1, m1) = 0.5 . Length controlled win-rates increase the correlation between AlpacaEval's leaderboard and Chat Arena from 0.93 to 0.98 Spearman correlation, while significantly decreasing the length gameability of the annotator . For more information and results about length controlled win-rates see this notebook.

This idea of estimating the controlled direct effect, by predicting the outcome while conditioning on the mediator (the length difference), is common in statistical inference.

To get LC win rates on previously annotated models, you can use the following command:

pip install -U alpaca_eval
alpaca_eval --model_outputs … --is_recompute_metrics_only True

AlpacaEval 2.0

AlpacaEval 2.0 is a new version of AlpacaEval. Here are the differences:

reference: gpt4_turbo : we upgraded the baseline from text-davinci-003 to gpt4_turbo to make the benchmark more challenging and have a metric that better reflects the current state of the art.
annotator: weighted_alpaca_eval_gpt4_turbo : we improved the annotator in quality and price. First, we use the gpt4_turbo model for annotating, which is approximately 2x cheaper than gpt4 . Second, we changed the prompt such that the model outputs a single token, which further reduced cost and speed. Finally, instead of using a binary preference, we used the logprobs to compute a continuous preference, which gives the final weighted win-rate. Note that the latter two changes had the surprising effect of decreasing the annotators' length biased.

By default, AlpacaEval 2.0 will be used from pip install alpaca_eval==0.5 . If you wish to use the old configs by default, you can set IS_ALPACA_EVAL_2=False in your environment.

Data Release

As part of AlpacaEval, we release the following data:

Human annotations (17701) in order to develop and understand automatic evaluators, we release all the human pairwise evaluation that we collected for AlpacaFarm. This contains comparisons between 22 models with the text-davinci-003 reference on the AlpacaFarm evaluation set. Annotations are from a pool of 16 crowd workers on Amazon Mechanical Turk. The different models are: 6 from OpenAI, 2 SFT models from AlpacaFarm, 13 RLHF methods from AlpacaFarm, and LLaMA 7B.
Human cross-annotations (2596) in order to further analyze automatic evaluators we selected (via stratified sampling across models and datasets) 650 examples from the AlpacaFarm evaluation set and collected 4 human annotations per example.
AlpacaEval set (805) we made slight modifications/simplification of the AlpacaFarm evaluation set. In particular, we first merged the instruction and input fields into a single instruction field. This affects 1/4 of the examples in the AlpacaFarm evaluation set, all of which are from the self-instruct evaluation set. Second we regenerated the text-davinci-003 reference outputs without limiting the length of its outputs.

For more details about the human annotations refer to the AlpacaFarm paper.

Differences with AlpacaFarm

AlpacaEval is an improvement and simplification of the automatic pairwise preference simulator from AlpacaFarm. Outside AlpacaFarm, you should be using AlpacaEval. Here are the main differences:

AlpacaEval merges instructions and inputs : The AlpacaEval evaluation is the same as the AlpacaFarm evaluation except that the instruction and input fields are merged as {instruction}nn{input} . This affects 1/4 of the examples in the AlpacaFarm evaluation set (the self-instruct subset). This simplification provides a more fair comparison for models that were not trained by distinguishing between the two fields.
AlpacaEval handles longer generations : Models in AlpacaFarm were limited to a maximum number of 300 tokens for generations. We change this number to 2000 for AlpacaEval. Note that this also affects the reference generations ( text-davinci-003 ), so the results on AlpacaEval are not comparable to those on AlpacaFarm even for examples that had no input field.
AlpacaEval removes intra- and inter-annotator variance : The AlpacaFarm simulator replicates human annotation in terms of both mode behavior and diversity. In particular, AlpacaFarm's simulator uses a pool of models and prompts and adds noise to replicate human intra- and inter-annotator variance. If the goal is to use an automatic annotator for evaluation or simply training better models, then this variance may not be desirable. The default annotators in AlpacaEval thus don't have this variance. We give the option to add it back by using --anotators_config 'alpaca_farm' and --p_label_flip 0.25 when creating an evaluator.

Related work

There have been several work that propose new automatic annotators for instruction-following models. Here we list the ones that we are aware of and discuss how they differ from ours. We evaluated all of those in our evaluator's leaderboard.

Vicuna/lmsys The lmsys annotator ( lmsys_gpt4 ) evaluates the pair by asking the annotator a score from 1-10 for each output, and then selecting the output with the highest score as preferred. They do not randomize over output order and they ask an explanation after the score. Overall, we found that this annotator has strong bias towards longer outputs (0.74) and relatively low correlation with human annotations (63.2).
AlpacaFarm The best AlpacaFarm annotator ( alpaca_farm_greedy_gpt4 ) evaluates the pair by directly asking the annotator which output it prefers. Furthermore, it batches 5 examples together to amortize the length of the prompt and randomizes the order of outputs. Overall, we found that this annotator has much less bias towards longer outputs (0.60) and is faster (878 seconds/1000 examples) than others. It has a slightly higher correlation with the majority of human annotations (66.4) than humans themselves (65.7). However, it is more expensive ($15.3/1000 examples) and doesn't work with very long outputs given the batching.
Aviary The Aviary annotator ( aviary_gpt4 ) asks the annotator to order the output by its preference, rather than simply selecting the preferred output. It does not randomize the order of outputs and uses high temperature for decoding (0.9). Overall, we found that this annotator has relatively strong bias towards longer outputs (0.70) and very high correlation with human annotations (69.1). By decreasing the temperature and randomizing the order of outputs, we further improved the correlation to 69.8 ( improved_aviary_gpt4 ) but this further increased the length bias to 0.73.

Our alpaca_eval_gpt4 is a mix between the AlpacaFarm and Aviary annotators. It asks the annotator to order the outputs by preference, but it uses temperature 0, randomizes over outputs, and made some modifications to the prompt to decrease length bias to 0.68.

Other related work include recent papers which analyze automatic evaluators.例如：

AlpacaFarm Appx C and Large Language Models are not Fair Evaluators both found that automatic annotators have a position bias.
AlpacaFarm Sec. 5.2。 and The False Promise of Imitating Proprietary LLMs both found that automatic annotators favor style (eg use of list, tone, word choice, length) over factuality.

Interpreting annotations

For all models you can find the auto-annotations under results/<model_name>/*/annotations.json . The annotations have the following columns:

instruction : the prompt
generator_1 : the baseline model
output_1 : the output of the baseline model
generator_2 : the model being evaluated
output_2 : the output of the model being evaluated
annotator : the auto-annotator
preference : the result of the auto-annotator. This is a float between 1 and 2. Closer to 1 means that the auto-annotator prefers output_1 , closer to 2 means that it prefers output_2 . For AlpacaEval 2.0, preference-1 corresponds to the probability of output_1 being preferred. For AlpacaEval 1.0, preference is 1 if output_1 is preferred, 2 if output_2 is preferred, and 1.5 if they are the same. The win rate is always (preference -1).mean() .
raw_completion : the raw output of the auto-annotator. This is field contains the completions before de-randomization of the order between output_1 and output_2 ! It is thus much harder to interpret, see below for more information.

Chain of thought

For some annotators, eg alpaca_eval_cot_gpt4_turbo_fn we use chain of thought reasoning to make the models preferences more interpretable. Those can then be found under concise_explanation . To interpret them, you should also look at referenced_models which translates the temporary model name (in the prompt) to the actual output. Below, we provide more explanation as to what is happening behind the scenes.

You can check the raw_annotations["concise_explanation] column in annotations.json (eg here) which contains the chain of thought reasoning of the auto annotator. Note that the raw_annotations is not modified by the randomization of the order of the outputs. In particular, "m" and "M" can sometime refer to the first model (the reference) and sometime to the second model (the model being evaluated). To understand which model is being referred to, you should use the column preference and ordered_models . To make it easier we add a column "referenced_models" mapping the model names to the corresponding outputs. For example in the following annotation we see that the preference is 1.0 (ie output_1 ) and corresponds to model M in concise_explanation (see ordered_models ).

{
  "instruction" : " How did US states get their names? " ,
  "output_1" : " The names of U.S. states are derived from a variety of languages and sources, including indigenous American languages, European languages (primarily English, Spanish, and French), as well as other historical influences. Here is a brief overview of how some states got their names: nn 1. **Indigenous American Languages**: Many state names come from Native American words or names of tribes, often from the languages of the indigenous peoples who lived in or near those areas. n   - **Alabama**: Possibly derived from the Choctaw language, meaning " thicket clearers. "n   - **Connecticut**: From a Mohegan-Pequot word meaning " long tidal river. "n   - **Massachusetts**: [...] " ,
  "generator_1" : " gpt4_1106_preview " ,
  "dataset" : " helpful_base " ,
  "output_2" : " The names of the 50 U.S. states come from a variety of sources, including Native American languages, European languages, and historical figures. Here's a brief overview of how some states got their names: nn 1. Native American origins: Many states have names derived from Native American languages. For example, Alabama comes from the Choctaw word " Albah amo, " meaning " plant gatherers " or " herb gatherers. " Similarly, the name Mississippi comes from the Ojibwe word " Misi-ziibi, " meaning " great river. "nn 2. European languages: [...]. " ,
  "generator_2" : " gpt4 " ,
  "annotator" : " alpaca_eval_cot_gpt4_turbo_fn " ,
  "preference" : 1.0 ,
  "raw_completion" : {
    "concise_explanation" : " Model M provided a more detailed and structured response, including bold headings for each category and a wider range of examples. It also included additional categories such as 'Other European Languages' and 'Combination of Languages and Influences', which added depth to the explanation. Model m's response was accurate but less comprehensive and lacked the clear structure found in Model M's output. " ,
    "ordered_models" : [
      {
        "model" : " M " ,
        "rank" : 1
      },
      {
        "model" : " m " ,
        "rank" : 2
      }
    ]
  },
  "referenced_models" : {
    "M" : " output_1 " ,
    "m" : " output_2 "
  }
}

Major updates

12th March 2024: updated to use length-controlled (LC) win rates. This is a debiased version of the win-rates that control for the length of the outputs.
3rd January 2024: updated to AlpacaEval 2.0, which uses GPT4-turbo as baseline and annotator.
2nd January 2024: added Azure API and more general way of setting client configs.請參閱此處
19th June 2023: add leaderboard chatgpt_fn that anyone can use (no waiting lists).
19th June 2023: update to use OpenAI's function calling. Example: chatgpt_fn or alpaca_eval_gpt4_fn .

展開