AlpacaEval 2.0 with length-controlled win rates (paper) has a Spearman correlation of 0.98 with Chatbot Arena while costing less than $10 of OpenAI credits and running in less than 3 minutes. Our goal is to have a benchmark for chat LLMs that is: fast (< 5 min), cheap (< $10), and highly correlated with humans (0.98). Here is a comparison with other benchmarks:

Updates:
Length-controlled win rates are out and used by default! This increases the correlation with Chatbot Arena from 0.93 to 0.98 while significantly decreasing length gameability. The raw win rates are still shown on the website and in the CLI. More details here.
AlpacaEval 2.0 is out and used by default! We improved the auto-annotator (better and cheaper) and use GPT-4 preview as the baseline. More details here. For the old version, set your environment variable IS_ALPACA_EVAL_2=False.
Evaluation of instruction-following models (e.g., ChatGPT) typically requires human interaction. This is time-consuming, expensive, and hard to replicate. AlpacaEval is an LLM-based automatic evaluation that is fast, cheap, replicable, and validated against 20K human annotations. It is particularly useful for model development. Although we improved over prior automatic evaluation pipelines, there are still fundamental limitations, such as the preference for longer outputs. AlpacaEval provides the following:
When to use AlpacaEval? Our automatic evaluator is a quick and cheap proxy for human evaluation of simple instruction-following tasks. It is useful if you have to run many evaluations quickly during model development.
When not to use AlpacaEval? As any other automatic evaluator, AlpacaEval should not replace human evaluation in high-stakes decision-making, e.g., deciding on a model release. In particular, AlpacaEval is limited by the facts that (1) the instructions in the eval set might not be representative of advanced usage of LLMs; (2) automatic evaluators may have biases, such as favoring the style of answers; and (3) AlpacaEval does not measure the risks that a model could cause. Details in the limitations section.
To install the stable release, run

pip install alpaca-eval

To install the nightly version, run

pip install git+https://github.com/tatsu-lab/alpaca_eval

Then you can use it as follows:
export OPENAI_API_KEY= < your_api_key > # for more complex configs, e.g. using Azure or switching clients see client_configs/README.md
alpaca_eval --model_outputs 'example/outputs.json'

This will print the leaderboard to the console and save both the leaderboard and the annotations to the same directory as the model_outputs file. The important parameters are the following:
- model_outputs: a path to a json file of the outputs of the model to add to the leaderboard. Each dictionary should contain the keys instruction and output.
- annotators_config: the annotator to use. We recommend weighted_alpaca_eval_gpt4_turbo (the default for AlpacaEval 2.0), which has a high agreement rate with our human annotation data, a large context size, and is very cheap. For a comparison of all annotators see here.
- reference_outputs: the outputs of the reference model, in the same format as model_outputs. By default, this is gpt4_turbo for AlpacaEval 2.0.

If you don't have the model outputs, you can use evaluate_from_model and pass a local path, the name of a HuggingFace model, or a model from a standard API (OpenAI, Anthropic, Cohere, Google, ...). Other commands:
>>> alpaca_eval -- --help
SYNOPSIS
alpaca_eval COMMAND
COMMANDS
COMMAND is one of the following:
evaluate
Evaluate a model based on its outputs. This is the default entrypoint if no command is specified.
evaluate_from_model
Evaluate a model from HuggingFace or an API provider. This is a wrapper around `evaluate` which includes generating from a desired model.
make_leaderboard
Precompute and save an entire leaderboard for a given dataset / evaluator / set of models generations.
analyze_evaluators
Analyze an evaluator and populates the evaluators leaderboard (agreement with human, speed, price,...).
For more information about each function, use alpaca_eval <command> -- --help.
Our leaderboards are computed on the AlpacaEval dataset. We precompute leaderboards for important models using different baseline models and auto-annotators. Our two main leaderboards ("AlpacaEval 2.0" and "AlpacaEval") can be found on this page. "AlpacaEval 2.0" uses weighted_alpaca_eval_gpt4_turbo as the annotator and gpt4_turbo as the baseline. "AlpacaEval" uses alpaca_eval_gpt4 as the annotator and text_davinci_003 as the baseline. For all precomputed leaderboards, see here. Later we also show how to add your model to the leaderboard and how to make a new leaderboard for your evaluator/dataset. See here for the configs of all models that are available out of the box.
AlpacaEval minimal leaderboard:
| | Win Rate | Std Error |
|---|---|---|
| GPT4 | 95.3 | 0.7 |
| Claude | 88.4 | 1.1 |
| chatgpt | 86.1 | 1.2 |
| Guanaco 65B | 71.8 | 1.6 |
| Vicuna-13b | 70.4 | 1.6 |
| text_davinci_003 | 50.0 | 0.0 |
| alpaca-farm-ppo-human | 41.2 | 1.7 |
| alpaca-7b | 26.5 | 1.5 |
| text_davinci_001 | 15.2 | 1.2 |
Win rate: the win rate measures the fraction of time the model's output is preferred over the reference's output (text-davinci-003 for AlpacaEval and gpt4_turbo for AlpacaEval 2.0). More specifically, to compute the win rate we collect the outputs of the desired model on every instruction of the AlpacaEval dataset. We then pair each output with the output of our reference model (e.g., text-davinci-003) on the same instruction and ask our automatic evaluator which output it prefers. See AlpacaEval's and AlpacaEval 2.0's prompts and configs; in particular, we randomize the order of the outputs to avoid position bias. We then average the preferences over all instructions in the dataset to get the win rate of the model over the baseline. If both outputs are exactly the same, we use a half preference for both models.
Standard error: this is the standard error (normalized by N-1) of the win rate, i.e., of the preferences averaged over the different instructions.
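As a concrete illustration, here is a minimal sketch (not the library's exact code) of how the win rate and its standard error can be computed from per-instruction preferences, using the convention of the released annotations (a float between 1 and 2, where values closer to 2 mean the model's output is preferred over the reference's and ties count as 1.5):

```python
import numpy as np

def winrate_and_stderr(preferences):
    """preferences: per-instruction preferences in [1, 2]
    (2 = model preferred, 1 = reference preferred, 1.5 = tie)."""
    p = np.asarray(preferences, dtype=float) - 1.0        # map to [0, 1]
    win_rate = 100 * p.mean()                             # average preference, in %
    std_error = 100 * p.std(ddof=1) / np.sqrt(len(p))     # normalized by N-1
    return win_rate, std_error

# toy example: the model wins twice, ties once, and loses once
print(winrate_and_stderr([2.0, 2.0, 1.5, 1.0]))  # -> (62.5, ~23.9)
```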
alpaca_eval_gpt4: our alpaca_eval_gpt4 annotator (see its config) averages over preferences, where the preferences are obtained by formatting the instruction and the pair of outputs into a prompt, querying GPT-4 with temperature=0, and parsing the preference from the completion. The prompt is a mix between the AlpacaFarm and Aviary evaluators (i.e., heavily influenced by both). In particular, we use the same code as AlpacaFarm (caching / randomization / hyperparameters) but a ranking prompt similar to that of Aviary. We modify Aviary's prompt to decrease the bias for longer outputs. Details in the related work section.
For AlpacaEval 2.0, we use weighted_alpaca_eval_gpt4_turbo, which uses the logprobs to compute a continuous preference and uses gpt4_turbo as the model (see its config).
We evaluate different automatic annotators on the AlpacaEval set by comparing them to the 2.5K human annotations we collected (~650 instructions, each with 4 human annotations). Below we show metrics for our suggested evaluators (weighted_alpaca_eval_gpt4_turbo, alpaca_eval_gpt4), for prior automatic evaluators (alpaca_farm_greedy_gpt4, aviary_gpt4, lmsys_gpt4), for humans (humans), and for different base models with essentially the same prompt (gpt4, claude, text_davinci_003, chatgpt_fn, guanaco_33b, chatgpt). See here for the configs of all evaluators that are available out of the box and their associated metrics.
| | Human agreement [%] | Price [$/1000 examples] | Time [seconds/1000 examples] | Spearman corr. | Pearson corr. | Bias | Variance | Proba. prefer longer |
|---|---|---|---|---|---|---|---|---|
| alpaca_eval_gpt4 | 69.2 | 13.6 | 1455 | 0.97 | 0.93 | 28.4 | 14.6 | 0.68 |
| alpaca_eval_cot_gpt4_turbo_fn | 68.6 | 6.3 | 1989 | 0.97 | 0.90 | 29.3 | 18.4 | 0.67 |
| alpaca_eval_gpt4_turbo_fn | 68.1 | 5.5 | 864 | 0.93 | 0.82 | 30.2 | 15.6 | 0.65 |
| alpaca_eval_llama3_70b_fn | 67.5 | 0.4 | 209 | 0.90 | 0.86 | 32.3 | 8.2 | 0.79 |
| gpt4 | 66.9 | 12.5 | 1037 | 0.88 | 0.87 | 31.5 | 14.6 | 0.65 |
| alpaca_farm_greedy_gpt4 | 66.4 | 15.3 | 878 | 0.85 | 0.75 | 30.2 | 19.3 | 0.60 |
| weighted_alpaca_eval_gpt4_turbo | 65.7 | 4.3 | 228 | 0.78 | 0.77 | 33.9 | 23.7 | 0.61 |
| humans | 65.7 | 300.0 | 36800 | 1.00 | 1.00 | 0.0 | 34.3 | 0.64 |
| claude | 65.3 | 3.3 | 173 | 0.93 | 0.90 | 32.4 | 18.5 | 0.66 |
| lmsys_gpt4 | 65.3 | 13.9 | 17982 | 0.98 | 0.97 | 31.6 | 15.9 | 0.74 |
| text_davinci_003 | 64.1 | 8.7 | 121 | 0.85 | 0.83 | 33.8 | 22.7 | 0.70 |
| longest | 62.2 | 0.0 | 0 | 0.27 | 0.56 | 37.8 | 0.0 | 1.00 |
| chatgpt | 57.3 | 0.8 | 285 | 0.72 | 0.71 | 39.4 | 34.1 | 0.59 |
We now explain in words how we computed the metrics in the table above. The code is here.
Human agreement: this measures the agreement between the current annotator and the majority preference of humans on our ~650 cross-annotations, each of which contains 4 human annotations. To estimate the agreement between a single human (the humans row in the table above) and the majority of humans, we take one of the 4 annotations and compute the accuracy it achieves when predicting the mode of the other 3 annotations. We then average this accuracy over the 4 annotations and the ~650 instructions to get the human agreement, i.e., we compute the expected (over humans and samples) leave-one-out agreement. If the mode is not unique, we take one of the modes at random. We perform exactly the same computation for the automatic annotators, so that the final numbers are comparable.
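To make this concrete, here is a small sketch of the leave-one-out computation (simplified; it is not the exact code linked above), assuming 4 preference labels per instruction:

```python
import random
from collections import Counter

def leave_one_out_agreement(annotations_per_example, seed=0):
    """annotations_per_example: list of lists, each containing the 4 preference
    labels collected for one instruction. Returns the expected agreement of a
    single annotation with the mode of the other 3 annotations."""
    rng = random.Random(seed)
    scores = []
    for labels in annotations_per_example:
        for i, held_out in enumerate(labels):
            rest = labels[:i] + labels[i + 1:]
            counts = Counter(rest)
            best = max(counts.values())
            modes = [label for label, count in counts.items() if count == best]
            mode = rng.choice(modes)  # if the mode is not unique, pick one at random
            scores.append(float(held_out == mode))
    return sum(scores) / len(scores)

# toy example with two instructions, 4 human preferences each (1 or 2)
print(leave_one_out_agreement([[1, 1, 1, 2], [1, 2, 1, 2]]))  # -> 0.375
```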
Price [$/1000 examples]: this is the average price per 1000 annotations. For humans, it is the price we paid Mechanical Turkers to collect those annotations ($21/hour). If the price depends on the machine used to compute the annotations (e.g., for locally run models), we leave it empty.
Time [seconds/1000 examples]: this is the average time it takes to compute 1000 annotations. For humans, it is the estimated median time each Mechanical Turker took to annotate 1000 examples. For automatic annotators, it is the average time it took us when running the annotations. Note that this can depend on API limits, which differ between users, and on the number of requests the cluster is processing.
Spearman corr.: this measures the correlation between the leaderboard computed with the auto-annotator's preferences and the leaderboard computed with human preferences. As with human agreement, we use the human annotations from AlpacaFarm, but we now consider method-level agreement rather than only sample-level agreement with humans. Note that we only use 9 models, so the correlation is not very reliable.
Pearson corr.: same as Spearman corr. but with Pearson correlation.
Bias: agreement between the most likely human label and the most likely automatic label. For automatic annotators, we estimate it by sampling 4 different annotations for each example. The randomness here comes from the order of the outputs in the prompt, sampling from the LLM, and, if applicable, the order of instructions in the batch and the choice of annotator in the pool. We then take the mode of the 4 annotations and compute its accuracy when predicting the mode of the 4 human annotations. Note that this is likely an overestimate of the true bias we would get with an "infinite" number of annotations. A low bias means that the annotator has, in expectation, the same preferences as humans. For humans, the bias is zero. Note that this is related to, but not the same as, the standard statistical bias, because we take the mode instead of averaging the annotations and we consider a 0-1 loss instead of the squared loss.
Variance: expected disagreement between a single automatic preference and the most likely one. We estimate it in the same way as we estimate the "human agreement" for humans, i.e., we compute the expected leave-one-out error when predicting the mode of 3 annotations using the 4th annotation. A low variance means that the annotator is consistent with its own preferences, i.e., you would get the same results if you sampled it with different seeds. As with the bias, this is not exactly the standard statistical variance, because we take the mode instead of averaging the annotations and we consider a 0-1 loss instead of the squared loss.
Note that "human agreement" is closely related to the bias and variance. In particular, the variance measures the error due to using only a single annotation, while the bias aims to measure the irreducible error of the current annotator.
Proba. prefer longer: this is the probability that the annotator prefers the longer output when one of the two outputs is significantly longer than the other (more than a 30 character difference).
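As a rough sketch of how this can be estimated from the saved annotations (field names follow the annotation format described later in this README; this is not the library's exact code):

```python
def proba_prefer_longer(annotations, min_char_diff=30):
    """Fraction of annotations in which the annotator prefers the longer output,
    restricted to pairs whose lengths differ by more than `min_char_diff` characters."""
    hits, total = 0, 0
    for row in annotations:
        len_1, len_2 = len(row["output_1"]), len(row["output_2"])
        if abs(len_1 - len_2) <= min_char_diff:
            continue  # skip pairs of similar length
        longer = 1 if len_1 > len_2 else 2
        # continuous preferences are rounded; exact ties count for output_2 here
        preferred = 1 if row["preference"] < 1.5 else 2
        hits += int(preferred == longer)
        total += 1
    return hits / total if total else float("nan")
```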
In the full table, we also provide the following metrics:
Proba. prefer lists: this is the probability that the annotator prefers the output that contains a list/bullet points when one output does and the other does not.
Proba. prefer 1: this is the probability that the annotator prefers the first output of the pair. All of our proposed annotators randomize the order of the outputs in the prompt, so this should be 0.5. Prior annotators, such as lmsys's and aviary's, do not.
# parsed: this is the number of examples that the annotator was able to parse.
Note that if the variance and bias are empty, it means that we only performed a single annotation for each of the 648 examples due to resource constraints (time and price). This explains why # parsed is 648, whereas it should otherwise be 2592.
Overall, we recommend using annotators_config=weighted_alpaca_eval_gpt4_turbo if you want high agreement with humans, and annotators_config=chatgpt_fn if you are on a tight budget.
When choosing an annotator, we recommend considering the following (the first three are obvious):
"Human agreement [%]""Price [$/1000 examples]""Time [seconds/1000 examples]""* corr."大約> 0.7。重要的是,相關性不是太低,但是我們不建議將其用作主要指標,因為僅在9個模型上計算相關性。"Proba. prefer longer" 。 <0.7。確實,我們發現,人類註釋者的大多數偏好對更長的答案具有強大的偏見(如"longest"評估者的高性能= 62.2,始終更喜歡最長的輸出)。這表明它可能與人類註釋者更有偏見。為了避免具有長度偏見的排行榜,我們建議使用低於0.7“ proba。更喜歡更長的自動註釋者”。"Variance"大約<0.2。我們認為,良好的評估者應具有盡可能少的差異,以使結果大多可再現。請注意,在我們正在模擬人類的情況下,如Alpacafarm中所示,這是可取的。我們過濾了不滿足上表中這些要求的註釋者(除了人類 / chatgpt / 003 / lmsys以外,出於參考目的)。有關所有結果,請參見此處。通常,我們發現weighted_alpaca_eval_gpt4_turbo是質量 /價格 /時間 /差異 /長度偏差之間的良好權衡。
The above metrics are computed with respect to annotations from crowd-workers. Although useful, these annotations are not perfect; e.g., crowd-workers often favor style over factuality. We therefore recommend that users validate automatic evaluators on their own instructions and human annotations. Details in the limitations section.
>>> alpaca_eval evaluate -- --help
NAME
alpaca_eval evaluate - Evaluate a model based on its outputs. This is the default entrypoint if no command is specified.
SYNOPSIS
alpaca_eval evaluate <flags>
DESCRIPTION
Evaluate a model based on its outputs. This is the default entrypoint if no command is specified.
FLAGS
--model_outputs=MODEL_OUTPUTS
Type: Optional[Union]
Default: None
The outputs of the model to add to the leaderboard. Accepts data (list of dictionary, pd.dataframe, datasets.Dataset) or a path to read those (json, csv, tsv) or a function to generate those. Each dictionary (or row of dataframe) should contain the keys that are formatted in the prompts. E.g. by default `instruction` and `output` with optional `input`. If None, we just print the leaderboard.
-r, --reference_outputs=REFERENCE_OUTPUTS
Type: Union
Default: <func...
The outputs of the reference model. Same format as `model_outputs`. If None, the reference outputs are a specific set of Davinci 003 outputs on the AlpacaEval set:
--annotators_config=ANNOTATORS_CONFIG
Type: Union
Default: 'alpaca_eval_gpt4_turbo_fn'
The path the (or list of dict of) the annotator's config file. For details see the docstring of `PairwiseAnnotator`.
-n, --name=NAME
Type: Optional[Optional]
Default: None
The name of the model to add to the leaderboard. If None we check if `generator is in model_outputs` if not we use "Current model".
-o, --output_path=OUTPUT_PATH
Type: Union
Default: 'auto'
Path to the directory where the new leaderboard and the annotations should be stored. If None we don't save. If `auto` we use `model_outputs` if it is a path, and otherwise use the directory from which we call the script.
-p, --precomputed_leaderboard=PRECOMPUTED_LEADERBOARD
Type: Union
Default: 'auto'
The precomputed leaderboard or a path to it (json, csv, or tsv). The leaderboard should contain at least the column `win_rate`. If `auto` we will try to use the corresponding leaderboard for the reference outputs (only if in CORRESPONDING_OUTPUTS_LEADERBOARDS). If `None` we won't add other models from the leaderboard.
--is_overwrite_leaderboard=IS_OVERWRITE_LEADERBOARD
Type: bool
Default: False
Whether to overwrite the leaderboard if the model is already in it.
-l, --leaderboard_mode_to_print=LEADERBOARD_MODE_TO_PRINT
Type: Optional
Default: 'minimal'
The mode of the leaderboard to use. Only used if the precomputed leaderboard has a column `mode`, in which case it will filter the leaderboard by this mode. If None keeps all.
-c, --current_leaderboard_mode=CURRENT_LEADERBOARD_MODE
Type: str
Default: 'community'
The mode of the leaderboard for the current method.
--is_return_instead_of_print=IS_RETURN_INSTEAD_OF_PRINT
Type: bool
Default: False
Whether to return the metrics instead of printing the results.
-f, --fn_metric=FN_METRIC
Type: Union
Default: 'pairwise_to_winrate'
The function or function name in `metrics.py` that will be used to convert preference to metrics. The function should take a sequence of preferences (0 for draw, 1 for base win, 2 when the model to compare wins) and return a dictionary of metrics and the key by which to sort the leaderboard.
-s, --sort_by=SORT_BY
Type: str
Default: 'win_rate'
The key by which to sort the leaderboard.
--is_cache_leaderboard=IS_CACHE_LEADERBOARD
Type: Optional[Optional]
Default: None
Whether to save the result leaderboard to `precomputed_leaderboard`. If None we save only if max_instances not None. A preferred way of adding models to the leaderboard is to set `precomputed_leaderboard` to the previously saved leaderboard at `<output_path>/leaderboard.csv`.
--max_instances=MAX_INSTANCES
Type: Optional[Optional]
Default: None
The maximum number of instances to annotate. Useful for testing.
--annotation_kwargs=ANNOTATION_KWARGS
Type: Optional[Optional]
Default: None
Additional arguments to pass to `PairwiseAnnotator.annotate_head2head`.
-A, --Annotator=ANNOTATOR
Default: <class 'alpaca_eval.annotators.pairwise_evaluator.PairwiseAn...
The annotator class to use.
Additional flags are accepted.
Additional arguments to pass to `PairwiseAnnotator`.
>>> alpaca_eval evaluate_from_model -- --help
NAME
alpaca_eval evaluate_from_model - Evaluate a model from HuggingFace or an API provider. This is a wrapper around `evaluate` which includes generating from a desired model.
SYNOPSIS
alpaca_eval evaluate_from_model MODEL_CONFIGS <flags>
DESCRIPTION
Evaluate a model from HuggingFace or an API provider. This is a wrapper around `evaluate` which includes generating from a desired model.
POSITIONAL ARGUMENTS
MODEL_CONFIGS
Type: Union
A dictionary or path (relative to `models_configs`) to a yaml file containing the configuration of the model to decode from. If a directory,we search for 'configs.yaml' in it. The keys in the first dictionary should be the generator's name, and the value should be a dictionary of the generator's configuration which should have the
FLAGS
-r, --reference_model_configs=REFERENCE_MODEL_CONFIGS
Type: Optional[Union]
Default: None
Same as in `model_configs` but for the reference model. If None, we use the default Davinci003 outputs.
-e, --evaluation_dataset=EVALUATION_DATASET
Type: Union
Default: <func...
Path to the evaluation dataset or a function that returns a dataframe. If None, we use the default evaluation
-a, --annotators_config=ANNOTATORS_CONFIG
Type: Union
Default: 'alpaca_eval_gpt4_turbo_fn'
Path to the annotators configuration or a dictionary. If None, we use the default annotators configuration.
-o, --output_path=OUTPUT_PATH
Type: Union
Default: 'auto'
Path to save the generations, annotations and leaderboard. If auto saves at `results/<model_name>`
-m, --max_instances=MAX_INSTANCES
Type: Optional[int]
Default: None
Maximum number of instances to generate and evaluate. If None, we evaluate all instances.
--is_strip_output=IS_STRIP_OUTPUT
Type: bool
Default: True
Whether to strip trailing and leading whitespaces from the outputs.
--is_load_outputs=IS_LOAD_OUTPUTS
Type: bool
Default: True
Whether to try to load outputs from the output path. If True and outputs exist we only generate outputs for instructions that don't have outputs yet.
-c, --chunksize=CHUNKSIZE
Type: int
Default: 64
Number of instances to generate before saving. If None, we save after all generations.
Additional flags are accepted.
Other kwargs to `evaluate`
NOTES
You can also use flags syntax for POSITIONAL ARGUMENTS
To evaluate a model, you need to:
1. Choose an evaluation set and compute the outputs specified as model_outputs. By default, we use the 805 examples from AlpacaEval. To compute outputs on AlpacaEval use:

import datasets

eval_set = datasets.load_dataset("tatsu-lab/alpaca_eval", "alpaca_eval")["eval"]
for example in eval_set:
    # generate here is a placeholder for your model's generations
    example["output"] = generate(example["instruction"])
    example["generator"] = "my_model"  # name of your model

If your model is a HuggingFace model or comes from a standard API provider (OpenAI, Anthropic, Cohere), you can instead use alpaca_eval evaluate_from_model directly, which also takes care of generating the outputs.
2. Compute the reference outputs reference_outputs. By default, we use the precomputed outputs of gpt4_turbo on AlpacaEval. If you want to use a different model or a different dataset, follow the same steps as in (1.).
3. Choose an evaluator specified via annotators_config. We recommend using alpaca_eval_gpt4_turbo_fn. For other options and comparisons, see this table. Depending on the evaluator, you might need to set the appropriate API_KEY in your environment or in the client_configs.

Running everything together:
alpaca_eval --model_outputs 'example/outputs.json' \
  --annotators_config 'alpaca_eval_gpt4_turbo_fn'

If you don't have decoded outputs, you can use evaluate_from_model, which takes care of decoding (both the model and the reference) for you. Here's an example:
# need a GPU for local models
alpaca_eval evaluate_from_model \
  --model_configs 'oasst_pythia_12b' \
  --annotators_config 'alpaca_eval_gpt4_turbo_fn'

Here, model_configs and reference_model_configs (optional) are paths to directories that specify the prompt, the model provider (here HuggingFace), and the decoding parameters. See this directory for examples. For all model providers that are available out of the box, see here.
The annotations are cached on disk at caching_path. Annotations are therefore never recomputed, which makes annotating faster and cheaper and allows for reproducibility. This helps even when evaluating different models, as many models share the same outputs.

>>> alpaca_eval make_leaderboard -- --help
NAME
alpaca_eval make_leaderboard - Precompute and save an entire leaderboard for a given dataset / evaluator / set of models generations.
SYNOPSIS
alpaca_eval make_leaderboard <flags>
DESCRIPTION
Precompute and save an entire leaderboard for a given dataset / evaluator / set of models generations.
FLAGS
--leaderboard_path=LEADERBOARD_PATH
Type: Optional[Union]
Default: None
The path to save the leaderboard to. The leaderboard will be saved as a csv file, if it already exists it will
--annotators_config=ANNOTATORS_CONFIG
Type: Union
Default: 'alpaca_eval_gpt4_turbo_fn'
The path the (or list of dict of) the annotator's config file.
--all_model_outputs=ALL_MODEL_OUTPUTS
Type: Union
Default: <fu...
The outputs of all models to add to the leaderboard. Accepts data (list of dictionary, pd.dataframe, datasets.Dataset) or a path to read those (json, csv, tsv potentially with globbing) or a function to generate those. If the path contains a globbing pattern, we will read all files matching the pattern and concatenate them. Each dictionary (or row of dataframe) should contain the keys that are formatted in the prompts. E.g. by default `instruction` and `output` with optional `input`. It should also contain a column `generator` with the name of the current model.
-r, --reference_outputs=REFERENCE_OUTPUTS
Type: Union
Default: <func...
The outputs of the reference model. Same format as `all_model_outputs` but without needing `generator`. By default, the reference outputs are the 003 outputs on AlpacaEval set.
-f, --fn_add_to_leaderboard=FN_ADD_TO_LEADERBOARD
Type: Callable
Default: 'evaluate'
The function to use to add a model to the leaderboard. If a string, it should be the name of a function in `main.py`. The function should take the arguments: `model_outputs`, `annotators_config`, `name`, `precomputed_leaderboard`, `is_return_instead_of_print`, `reference_outputs`.
--leaderboard_mode=LEADERBOARD_MODE
Type: str
Default: 'verified'
The mode of the leaderboard to save all new entries with.
-i, --is_return_instead_of_print=IS_RETURN_INSTEAD_OF_PRINT
Type: bool
Default: False
Whether to return the metrics instead of printing the results.
Additional flags are accepted.
Additional arguments to pass to `fn_add_to_leaderboard`.
If you want to make a new leaderboard for all of your models with a single command (rather than multiple alpaca_eval calls), you can use the following:
alpaca_eval make_leaderboard \
  --leaderboard_path <path_to_save_leaderboard> \
  --all_model_outputs <model_outputs_path> \
  --reference_outputs <reference_outputs_path> \
  --annotators_config <path_to_config.yaml>

where:
- leaderboard_path: the path to save the leaderboard to. The leaderboard will be saved as a csv file; if it already exists, it will be appended to.
- all_model_outputs: the json path(s) to the outputs of all models to add to the leaderboard (either a single file or multiple files via globbing). Each dictionary should contain the keys that are formatted in the prompts (instruction and output) as well as a column generator with the name of the current model. For an example, see this file.
- reference_outputs: the path to the outputs of the reference model. Each dictionary should contain the keys that are formatted in the prompts (instruction and output). By default, the reference outputs are the 003 outputs on the AlpacaEval set.
- annotators_config: the path to the annotator's config file. Defaults to alpaca_eval_gpt4.

>>> alpaca_eval analyze_evaluators -- --help
NAME
alpaca_eval analyze_evaluators - Analyze an evaluator and populates the evaluators leaderboard (agreement with human, speed, price,...).
SYNOPSIS
alpaca_eval analyze_evaluators <flags>
DESCRIPTION
Analyze an evaluator and populates the evaluators leaderboard (agreement with human, speed, price,...).
FLAGS
--annotators_config=ANNOTATORS_CONFIG
Type: Union
Default: 'alpaca_eval_gpt4_turbo_fn'
The path the (or list of dict of) the annotator's config file.
-A, --Annotator=ANNOTATOR
Default: <class 'alpaca_eval.annotators.pairwise_evaluator.PairwiseAn...
The annotator class to use.
--analyzer_kwargs=ANALYZER_KWARGS
Type: Optional[Optional]
Default: None
Additional arguments to pass to the analyzer.
-p, --precomputed_leaderboard=PRECOMPUTED_LEADERBOARD
Type: Union
Default: PosixPath('/Users/yanndubois/Desktop/GitHub/alpaca_eval/src/...
The precomputed (meta)leaderboard of annotators or a path to it (json, csv, or tsv).
--is_save_leaderboard=IS_SAVE_LEADERBOARD
Type: bool
Default: False
Whether to save the leaderboard (ie analyzed results).
--is_return_instead_of_print=IS_RETURN_INSTEAD_OF_PRINT
Type: bool
Default: False
Whether to return the leaderboard (ie analyzed results). If True, it will not print the results.
--is_overwrite_leaderboard=IS_OVERWRITE_LEADERBOARD
Type: bool
Default: False
Whether to overwrite the leaderboard if it already exists.
-m, --max_instances=MAX_INSTANCES
Type: Optional[Optional]
Default: None
The maximum number of instances to analyze.
--is_single_annotator=IS_SINGLE_ANNOTATOR
Type: bool
Default: False
Whether to analyze a single annotator. If True, will not be able to estimate the annotator's bias.
-l, --leaderboard_mode_to_print=LEADERBOARD_MODE_TO_PRINT
Type: str
Default: 'minimal'
The mode of the leaderboard to print.
-c, --current_leaderboard_mode=CURRENT_LEADERBOARD_MODE
Type: str
Default: 'minimal'
The mode of the leaderboard to save all new entries with.
-o, --output_path=OUTPUT_PATH
Type: Union
Default: 'auto'
Path to save the leaderboard and annotataions. If None, we don't save.
Additional flags are accepted.
Additional arguments to pass to `Annotator`.
AlpacaEval provides a simple way of making new evaluators. All you need is to make a new configs.yaml configuration file, which you then pass as --annotators_config <path_to_config.yaml> to alpaca_eval. Here are some ways you can make a new evaluator:
- Changing the prompt: write a new prompt in a text file and specify its path in prompt_template of the configuration file. Paths are relative to the configuration file.
- Changing the decoding parameters: specify the desired parameters in completions_kwargs of the configuration file. To see all available parameters, refer to the docstrings of the corresponding function in that file, specified by fn_completions in the configuration file.
- Changing the model: specify the desired model in model_name and the corresponding prompt in prompt_template. If the model comes from another provider, you also have to change fn_completions, which maps to the corresponding function in that file. We provide fn_completions functions for using models from OpenAI, Anthropic, Cohere, or HuggingFace. To install the packages needed for all providers, use pip install alpaca_eval[all].

The easiest way to see the options is to check the docstrings of SinglePairwiseAnnotator. Here are some important parameters:
Parameters
----------
prompt_template : path
A prompt that will be given to `fn_prompter` or path to the prompts. Path is relative to
`evaluators_configs/`
fn_completion_parser : callable or str
Function in `completion_parsers.py` to use for parsing the completions into preferences. For each completion,
the number of preferences should be equal to the batch_size if not we set all the preferences in that batch to
NaN.
completion_parser_kwargs : dict
Kwargs for fn_completion_parser.
fn_completions : callable or str
Function in `decoders.py` to use for decoding the output.
completions_kwargs : dict
kwargs for fn_completions. E.g. model_name, max_tokens, temperature, top_p, top_k, stop_seq.
is_randomize_output_order : bool
Whether to randomize output_1, output_2 when formatting.
batch_size : int
Number of examples that will be added in a single prompt.
Once you have made the evaluator, you can also analyze it and add it to the evaluators' leaderboard using the following command:

alpaca_eval analyze_evaluators --annotators_config '<path_to_config.yaml>'

To estimate the bias and variance, this evaluates every example with 4 seeds, i.e., 2.5K evaluations. If you want a cheaper evaluation, you can use a single seed with --is_single_annotator True, which will skip the estimation of bias and variance.
In addition to bug fixes, we accept PRs for new models, evaluators, and eval sets. We will regularly update the leaderboard website with new community contributions. We have also created a support Discord for AlpacaEval in case you run into any issues and wish to ask the community for help.
To get started, please first fork the repo and install the package from source: pip install -e .
First, you'll need to add a model config definition in the models_configs folder. As an example, you can look at the falcon-7b-instruct yaml. Please make sure the folder name and the key name in the yaml match.
Then, please follow the steps in Evaluating a model to run inference on the model, produce outputs on the eval set, and score the model with one of the evaluators. An example command may look like:
alpaca_eval evaluate_from_model \
  --model_configs 'falcon-7b-instruct'

After running this command, you should have generated an outputs json and a new entry in the corresponding leaderboard file. Please make a PR with the config, the outputs file, and the updated leaderboard.
Concretely, you should do something like:

- Fork the repository on GitHub and clone your fork: git clone <URL>
- Make a model config at src/alpaca_eval/models_configs/<model_name> and evaluate it: evaluate_from_model --model_configs '<model_name>'
- Add the config, outputs, annotations, and leaderboard entry, then open the PR:

git add src/alpaca_eval/models_configs/<model_name>  # add the model config
git add src/alpaca_eval/leaderboards/  # add the actual leaderboard entry
git add src/alpaca_eval/metrics/weights  # add the weights for LC
git add -f results/<model_name>/model_outputs.json  # force add the outputs on the dataset
git add -f results/<model_name>/*/annotations.json  # force add the evaluations from the annotators
git commit -m "Add <model_name> to AlpacaEval"
git push

Note: if you generated the outputs outside of AlpacaEval, you should still add a model config, but with fn_completions: null. See this config for an example.

Verified results on AlpacaEval indicate that a core maintainer has decoded the outputs from the model and run the evaluation. Unfortunately, we, the AlpacaEval maintainers, lack the resources to verify all models, so we only do this for models in the top 5 of the leaderboard. We apologize for any inconvenience and thank you for your understanding. To verify your model, please follow these steps:
- Contact @yann on Discord, or email us if you have our email, providing a brief rationale for why your model should be verified.
- Provide a way for us to run alpaca_eval evaluate_from_model --model_configs '<your_model_name>' without a local GPU (e.g., an API endpoint or a temporary key).
- We will run alpaca_eval evaluate_from_model --model_configs '<your_model_name>', update the results, and notify you so that you can revoke the temporary key. Note that we will not re-evaluate the same model; due to sampling variance, the results might differ slightly from your initial ones. We will replace your previous community results with the verified ones.
Please first follow the directions for making a new evaluator. Once you have created the annotator config, we ask you to create a new leaderboard for your annotator by evaluating a minimal set of models. The outputs of those models can be found by downloading alpaca_eval_all_outputs.json.
alpaca_eval make_leaderboard \
  --leaderboard_path src/alpaca_eval/leaderboards/data_AlpacaEval/<evaluator>_leaderboard.csv \
  --all_model_outputs alpaca_eval_all_outputs.json \
  --annotators_config <evaluator_config>

Then, please create a PR with the annotator config and the leaderboard csv.
To contribute a new eval set, you first need to specify a set of textual instructions. Then, you need to specify a set of reference outputs (model win rates are computed against this reference). For ease of use, you may use the default text-davinci-003 reference.
Put these together in a json file, where each entry specifies the fields instruction, output, and generator. You can look at alpaca_eval.json as a guide (the dataset field is not needed).
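For example, a small script along the following lines produces a json in the expected format (the instructions, outputs, and file name below are placeholders):

```python
import json

# placeholder instructions and reference outputs for the new eval set
instructions = ["What is the capital of France?", "Write a haiku about autumn."]
reference_outputs = ["The capital of France is Paris.", "Crisp leaves drift down..."]

eval_set = [
    {"instruction": instr, "output": out, "generator": "text_davinci_003"}
    for instr, out in zip(instructions, reference_outputs)
]

with open("my_eval_set.json", "w") as f:
    json.dump(eval_set, f, indent=2, ensure_ascii=False)
```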
Finally, we ask that you create a minimal leaderboard on this new evaluation set. You can do this with the following:
alpaca_eval make_leaderboard \
  --leaderboard_path <src/alpaca_eval/leaderboards/data_AlpacaEval/your_leaderboard_name.csv> \
  --all_model_outputs alpaca_eval_all_outputs.json \
  --reference_outputs <path_to_json_file>

Please submit a PR with the eval set json and the corresponding leaderboard csv.
We currently support different completion functions, e.g., openai, anthropic, huggingface_local, huggingface_hub_api, ... If you want to contribute a new completion function / API with which to perform inference, please follow these steps:
- Add a function <name>_completions(prompts: Sequence[str], model_name: str, ...) to the decoders. This function should take the prompts plus kwargs as arguments and return the completions. You can use the other completion functions in the directory as templates, e.g., huggingface_local_completions or anthropic.
- Register <name>_completions and add its dependencies. Again, you can follow the example of huggingface_local_completions.
- Test your completion function by evaluating a model with alpaca_eval evaluate_from_model --model_configs '<model_configs>'.

Feel free to start the PR early; we will be able to provide help along the way!
As any other current evaluator, the AlpacaEval pipeline has important limitations and should therefore not be used as a replacement for human evaluation in important settings, such as deciding whether a model is ready to be deployed. These limitations can broadly be clustered into three categories:
Instructions might not be representative of real usage: the AlpacaEval set contains examples from a variety of datasets (self-instruct, open-assistant, vicuna, koala, hh-rlhf), which might not be representative of real-world usage and of the advanced applications of better models such as GPT4. This likely makes the best closed models (GPT4 / Claude / ChatGPT / ...) appear more similar to the open models than they actually are. Indeed, those closed models seem to be pretrained/finetuned on much more diverse data. See for example this blog for preliminary results on more complex instructions. Note, however, that in AlpacaFarm we showed that win rates on our evaluation set are highly correlated (0.97 R2) with win rates on instructions from user interactions with the Alpaca demo. Furthermore, the AlpacaEval leaderboard shows a larger gap between open models and OpenAI models than other leaderboards (e.g., lmsys).
Biases of automatic annotators: the raw automatic annotators seem to have implicit biases. In particular, we found that they tend to prefer longer outputs and outputs that contain lists (e.g., 0.68 / 0.69 for alpaca_eval_gpt4 and 0.62 / 0.58 for claude). Although we found that humans have similar biases (0.64 / 0.61), we believe this may be more a limitation of the human annotation pipeline we used than a true human bias. More generally, through qualitative analysis we found that automatic annotators give more importance to the style of the output than to its content (e.g., factuality). Finally, we found that automatic evaluators tend to prefer outputs from models that are similar to them (likely trained on the same data), as suggested by the large difference for ChatGPT/GPT4 between the claude and alpaca_eval_gpt4 leaderboards. Note that the length bias is partially mitigated in our length-controlled win rates.
Lack of safety evaluation: importantly, AlpacaEval only evaluates the instruction-following capabilities of models, not the harm that they could cause (e.g., toxic behavior or bias). As a result, the small gap between ChatGPT and the current best open-source models should not be interpreted as if the latter were ready to be deployed.
Beyond the limitations of the evaluation pipeline, there are also limitations to our validation of the evaluators and to our suggested approach for choosing evaluation sets.
First, the validation of our evaluators based on human cross-annotations is limited by the following: (1) we qualitatively found that our crowd-workers also tend to prefer style (e.g., length and lists) over factuality; (2) it does not validate whether win rates against a reference model are a good evaluation strategy in the first place; and (3) the preferences of 16 crowd-workers are not representative of the preferences of all humans.
Second, our suggested approach for selecting an evaluation set based on statistical power is limited by the following: (1) statistical power does not ensure the right direction, e.g., you could have a set of unnatural instructions on which Alpaca "performs" better than better models; and (2) it may push users to select data that supports the hypotheses they want to validate.
Length-controlled AlpacaEval visualization:

Length-controlled AlpacaEval development:

This notebook shows the different options we considered to mitigate the length bias of automatic annotators.
Here we briefly summarize the main results. Namely:


In particular, LC win rates keep essentially the same interpretation as raw win rates: win_rate(m,b) = 1 - win_rate(b,m) in [0,1] and win_rate(m,m) = 0.5. This is shown in the plot below.
Finally, note that we are only controlling for length bias. There are other known biases that we are not controlling for, such as the fact that auto-annotators prefer outputs similar to those of their own model. Although we could control for that, in practice we have found it to be less of an issue than length bias, for two reasons: (1) it mostly affects a single model in the leaderboard, because fine-tuning on outputs from the auto-annotator doesn't seem to impact the win rate as much; and (2) the bias is actually less strong than what one could think. For example, we show below a subset of the leaderboards auto-annotated by three different models, and we see that the ranking of models is exactly the same. In particular, claude-3-opus prefers gpt4_preview, and mistral-large prefers the former two.

Caution : all the following results are about AlpacaEval 1.0 and have not been updated since
Analyzing evaluators:
As we saw in the evaluator's leaderboard, there are many metrics to consider when selecting an evaluator, eg the quality, price, and speed. To assist with selection of the evaluator we provide a few functions to plot those metrics. The following shows for example the price/time/agreement of the different evaluators.

Here we see that alpaca_eval_gpt4 performs very well and is better than humans on all the considered metrics.
Previously we only considered the agreement with human annotators overall. An additional validation that one could do is checking whether making a leaderboard using our automatic annotator gives similar results as a leaderboard from humans. To enable such analysis, we release human annotations of outputs from 22 methods from AlpacaFarm => 22*805 = ~18K annotations. As a result we can test the correlation between the win-rates of the 22 models as evaluated by the humans and our automatic annotator. Note that this is arguably a better way of selecting an automatic evaluator than using "human agreement [%]" but is expensive given that it requires 18K annotations. The plot below shows such correlation for the alpaca_eval_gpt4 evaluator.

We see that the alpaca_eval_gpt4 leaderboard is highly correlated (0.94 Pearson correlation) to the leaderboard from humans, which further suggests that automatic evaluation is a good proxy for human evaluation. For the code and more analysis, see this notebook, or the colab notebook above.
Caution : all the following results are about AlpacaEval 1.0 and have not been updated since.
Making evaluation sets:
When creating an evaluation set there are two main factors to consider: how much data to use? and what data?
One way of answering those questions is by considering a leaderboard of models that you believe are of different quality and checking what data, and how much of it, is needed to distinguish between them in a statistically significant way. We will do so below using a paired t-test to test whether the difference in win rates between every pair of models is statistically significant.
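Concretely, such a test can be run as in the following sketch, assuming per-instruction preferences (against the same baseline and on the same instructions) for the two models being compared:

```python
import numpy as np
from scipy import stats

def significantly_different(prefs_model_a, prefs_model_b, alpha=0.05):
    """Paired t-test over instructions: prefs_model_a[i] and prefs_model_b[i]
    are the preferences of the two models on the same i-th instruction."""
    a = np.asarray(prefs_model_a, dtype=float)
    b = np.asarray(prefs_model_b, dtype=float)
    t_stat, p_value = stats.ttest_rel(a, b)
    return p_value < alpha, p_value
```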
First, let us consider the question of how much data to use. Below we show the number of random samples needed from AlpacaEval for the paired t-test to give a p-value < 0.05 for each pair of models in the minimal alpaca_eval_gpt4 leaderboard. Grey cells correspond to pairs that are not significantly different on the 805 samples. y- and x-axis are ordered by the win-rate of the first and second model respectively.

We see that most models can already be distinguished with 50 samples, and that 150 samples allows distinguishing the majority of pairs (74 out of 78). This suggests that we can decrease the evaluation set size by a factor of 4 when testing two models that have similar performance gaps as those on the minimal alpaca_eval_gpt4 leaderboard.
The second question is what data to use. Again, we can try to answer this question from a statistical power perspective: what data best distinguishes between models? Let's consider this for all the datasets that are part of AlpacaEval, but control for the size of the evaluation sets, as we only care about the quality of the data. The following plot shows the p-values from the paired t-test of each pair of models on 80 examples of each subset of AlpacaEval.

We see for example that the self-instruct dataset yields the least statistical power, which suggests that one could remove this dataset from the evaluation set. The exact reason should be analyzed in future work. For the code and more analysis see this notebook, or the colab notebook above.
Please consider citing the following depending on what you are using and referring to:
- AlpacaEval code and results: cite alpaca_eval (this repo). Specify whether you are using AlpacaEval or AlpacaEval 2.0. For length-controlled win rates, cite alpaca_eval_length (below).
- Human annotations: cite dubois2023alpacafarm (AlpacaFarm).
- AlpacaEval evaluation set: cite alpaca_eval as well as self-instruct, open-assistant, vicuna, koala, hh-rlhf.

Here are the bibtex entries:
@misc{alpaca_eval,
author = {Xuechen Li and Tianyi Zhang and Yann Dubois and Rohan Taori and Ishaan Gulrajani and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto },
title = {AlpacaEval: An Automatic Evaluator of Instruction-following Models},
year = {2023},
month = {5},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/tatsu-lab/alpaca_eval}}
}
@article{dubois2024length,
title={Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators},
author={Dubois, Yann and Galambosi, Bal{\'a}zs and Liang, Percy and Hashimoto, Tatsunori B},
journal={arXiv preprint arXiv:2404.04475},
year={2024}
}
@misc{dubois2023alpacafarm,
title={AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback},
author={Yann Dubois and Xuechen Li and Rohan Taori and Tianyi Zhang and Ishaan Gulrajani and Jimmy Ba and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto},
year={2023},
eprint={2305.14387},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
Length controlled (LC) win-rates are a debiased version of the win-rates that control for the length of the outputs.
The main idea is that for each model we fit a logistic regression to predict the preference of the auto-annotator given: (1) the instruction, (2) the model, and (3) the difference of length between the baseline and the model output. Given such a logistic regression we can then predict the counterfactual "what would the preference be if the model's output had the same length as the baseline's" by setting the length difference to 0. By averaging over this length-controlled preference, we obtain the length-controlled win rate. The exact form of the logistic regression is chosen such that the interpretation of LC win rates is similar to that of raw win rates; for example, for any models m1 and m2 we have win_rate(m1, m2) = 1 - win_rate(m2, m1) in [0,100] and win_rate(m1, m1) = 50. Length-controlled win rates increase the correlation between AlpacaEval's leaderboard and Chat Arena from 0.93 to 0.98 Spearman correlation, while significantly decreasing the length gameability of the annotator. For more information and results about length-controlled win rates, see this notebook.
This idea of estimating the controlled direct effect, by predicting the outcome while conditioning on the mediator (the length difference), is common in statistical inference.
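The following is a simplified sketch of this idea: it conditions only on the (standardized) length difference and binarizes the preference, whereas the actual implementation also conditions on the instruction and the model. It is meant only to make the counterfactual at zero length difference concrete.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def length_controlled_win_rate(preferences, model_lengths, baseline_lengths):
    """preferences: per-instruction probabilities in [0, 1] that the model's output
    is preferred; model_lengths / baseline_lengths: output lengths in characters."""
    y = (np.asarray(preferences) > 0.5).astype(int)      # binarized preference
    delta = np.asarray(model_lengths) - np.asarray(baseline_lengths)
    x = (delta / (delta.std() + 1e-8)).reshape(-1, 1)    # standardized length difference
    reg = LogisticRegression().fit(x, y)
    # counterfactual preference if the model's outputs had the baseline's length
    lc_preferences = reg.predict_proba(np.zeros_like(x))[:, 1]
    return 100 * np.mean(preferences), 100 * lc_preferences.mean()
```

In this simplified form the counterfactual preference is the same for every instruction (it reduces to the regression's intercept); the instruction and model terms used in the actual regression are what make the correction finer-grained.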
To get LC win rates on previously annotated models, you can use the following command:
pip install -U alpaca_eval
alpaca_eval --model_outputs … --is_recompute_metrics_only TrueAlpacaEval 2.0 is a new version of AlpacaEval. Here are the differences:
gpt4_turbo : we upgraded the baseline from text-davinci-003 to gpt4_turbo to make the benchmark more challenging and have a metric that better reflects the current state of the art.weighted_alpaca_eval_gpt4_turbo : we improved the annotator in quality and price. First, we use the gpt4_turbo model for annotating, which is approximately 2x cheaper than gpt4 . Second, we changed the prompt such that the model outputs a single token, which further reduced cost and speed. Finally, instead of using a binary preference, we used the logprobs to compute a continuous preference, which gives the final weighted win-rate. Note that the latter two changes had the surprising effect of decreasing the annotators' length biased. By default, AlpacaEval 2.0 will be used from pip install alpaca_eval==0.5 . If you wish to use the old configs by default, you can set IS_ALPACA_EVAL_2=False in your environment.
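To illustrate the single-token, logprob-weighted preference described above, here is a sketch; the actual prompt and token labels used by weighted_alpaca_eval_gpt4_turbo are defined in its config, so the labels below are placeholders:

```python
import math

def weighted_preference(token_logprobs, label_1="m", label_2="M"):
    """token_logprobs: logprobs of the candidate single-token answers of the
    annotator. Returns a continuous preference in [1, 2]; preference - 1 is the
    probability that output_2 is preferred."""
    p1 = math.exp(token_logprobs.get(label_1, float("-inf")))
    p2 = math.exp(token_logprobs.get(label_2, float("-inf")))
    return 1.0 + p2 / (p1 + p2)  # renormalize over the two candidate labels

# toy example: ~74% of the (renormalized) mass on the label for output_2
print(weighted_preference({"m": math.log(0.25), "M": math.log(0.70)}))  # -> ~1.74
```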
As part of AlpacaEval, we release the following data:
- Human annotations of model outputs against the text-davinci-003 reference on the AlpacaFarm evaluation set. Annotations are from a pool of 16 crowd workers on Amazon Mechanical Turk. The different models are: 6 from OpenAI, 2 SFT models from AlpacaFarm, 13 RLHF methods from AlpacaFarm, and LLaMA 7B.

For more details about the human annotations, refer to the AlpacaFarm paper.
AlpacaEval is an improvement and simplification of the automatic pairwise preference simulator from AlpacaFarm. Outside AlpacaFarm, you should be using AlpacaEval. Here are the main differences:
- Merged instruction and input fields: the instruction and input fields are merged as {instruction}\n\n{input}. This affects 1/4 of the examples in the AlpacaFarm evaluation set (the self-instruct subset). This simplification provides a fairer comparison for models that were not trained to distinguish between the two fields.
- Recomputed reference outputs (text-davinci-003), so the results on AlpacaEval are not comparable to those on AlpacaFarm, even for examples that had no input field.
- Deterministic annotators: AlpacaEval only considers the mode behavior of the annotators. If you want to simulate the noise of human annotators, you can use --annotators_config 'alpaca_farm' and --p_label_flip 0.25 when creating an evaluator.

There have been several works that propose new automatic annotators for instruction-following models. Here we list the ones we are aware of and discuss how they differ from ours. We evaluated all of them in our evaluators' leaderboard.
- lmsys (lmsys_gpt4) evaluates the pair by asking the annotator a score from 1-10 for each output, and then selecting the output with the highest score as preferred. They do not randomize over output order and they ask for an explanation after the score. Overall, we found that this annotator has a strong bias towards longer outputs (0.74) and relatively low correlation with human annotations (63.2).
- AlpacaFarm (alpaca_farm_greedy_gpt4) evaluates the pair by directly asking the annotator which output it prefers. Furthermore, it batches 5 examples together to amortize the length of the prompt and randomizes the order of outputs. Overall, we found that this annotator has much less bias towards longer outputs (0.60) and is faster (878 seconds/1000 examples) than others. It has a slightly higher correlation with the majority of human annotations (66.4) than humans themselves (65.7). However, it is more expensive ($15.3/1000 examples) and doesn't work with very long outputs given the batching.
- Aviary (aviary_gpt4) asks the annotator to order the outputs by its preference, rather than simply selecting the preferred output. It does not randomize the order of outputs and uses a high temperature for decoding (0.9). Overall, we found that this annotator has a relatively strong bias towards longer outputs (0.70) and very high correlation with human annotations (69.1). By decreasing the temperature and randomizing the order of outputs, we further improved the correlation to 69.8 (improved_aviary_gpt4), but this further increased the length bias to 0.73.

Our alpaca_eval_gpt4 is a mix between the AlpacaFarm and Aviary annotators. It asks the annotator to order the outputs by preference, but it uses temperature 0, randomizes over outputs, and makes some modifications to the prompt to decrease the length bias to 0.68.
Other related work includes recent papers that analyze automatic evaluators. For example:
For all models you can find the auto-annotations under results/<model_name>/*/annotations.json . The annotations have the following columns:
- instruction: the prompt
- generator_1: the baseline model
- output_1: the output of the baseline model
- generator_2: the model being evaluated
- output_2: the output of the model being evaluated
- annotator: the auto-annotator
- preference: the result of the auto-annotator. This is a float between 1 and 2. Closer to 1 means that the auto-annotator prefers output_1; closer to 2 means that it prefers output_2. For AlpacaEval 2.0, preference - 1 corresponds to the probability that output_2 is preferred. For AlpacaEval 1.0, preference is 1 if output_1 is preferred, 2 if output_2 is preferred, and 1.5 if they are the same. The win rate is always (preference - 1).mean().
- raw_completion: the raw output of the auto-annotator. This field contains the completions before de-randomization of the order between output_1 and output_2! It is thus much harder to interpret; see below for more information.

Chain of thought
For some annotators, e.g., alpaca_eval_cot_gpt4_turbo_fn, we use chain-of-thought reasoning to make the model's preferences more interpretable. Those can then be found under concise_explanation. To interpret them, you should also look at referenced_models, which translates the temporary model name (in the prompt) to the actual output. Below, we provide more explanation of what happens behind the scenes.
You can check the raw_annotations["concise_explanation"] column in annotations.json (e.g., here), which contains the chain-of-thought reasoning of the auto-annotator. Note that the raw_annotations are not modified by the randomization of the order of the outputs. In particular, "m" and "M" can sometimes refer to the first model (the reference) and sometimes to the second model (the model being evaluated). To understand which model is being referred to, you should use the columns preference and ordered_models. To make this easier, we add a column "referenced_models" mapping the model names to the corresponding outputs. For example, in the following annotation we see that the preference is 1.0 (i.e., output_1) and corresponds to model M in concise_explanation (see ordered_models).
{
"instruction" : " How did US states get their names? " ,
"output_1" : " The names of U.S. states are derived from a variety of languages and sources, including indigenous American languages, European languages (primarily English, Spanish, and French), as well as other historical influences. Here is a brief overview of how some states got their names: nn 1. **Indigenous American Languages**: Many state names come from Native American words or names of tribes, often from the languages of the indigenous peoples who lived in or near those areas. n - **Alabama**: Possibly derived from the Choctaw language, meaning " thicket clearers. "n - **Connecticut**: From a Mohegan-Pequot word meaning " long tidal river. "n - **Massachusetts**: [...] " ,
"generator_1" : " gpt4_1106_preview " ,
"dataset" : " helpful_base " ,
"output_2" : " The names of the 50 U.S. states come from a variety of sources, including Native American languages, European languages, and historical figures. Here's a brief overview of how some states got their names: nn 1. Native American origins: Many states have names derived from Native American languages. For example, Alabama comes from the Choctaw word " Albah amo, " meaning " plant gatherers " or " herb gatherers. " Similarly, the name Mississippi comes from the Ojibwe word " Misi-ziibi, " meaning " great river. "nn 2. European languages: [...]. " ,
"generator_2" : " gpt4 " ,
"annotator" : " alpaca_eval_cot_gpt4_turbo_fn " ,
"preference" : 1.0 ,
"raw_completion" : {
"concise_explanation" : " Model M provided a more detailed and structured response, including bold headings for each category and a wider range of examples. It also included additional categories such as 'Other European Languages' and 'Combination of Languages and Influences', which added depth to the explanation. Model m's response was accurate but less comprehensive and lacked the clear structure found in Model M's output. " ,
"ordered_models" : [
{
"model" : " M " ,
"rank" : 1
},
{
"model" : " m " ,
"rank" : 2
}
]
},
"referenced_models" : {
"M" : " output_1 " ,
"m" : " output_2 "
}
}

- Added the chatgpt_fn annotator that anyone can use (no waiting lists).
- Updated the default annotators to use OpenAI's function calling: chatgpt_fn or alpaca_eval_gpt4_fn.