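Evaluate hosted OpenAI GPT / Google Vertex AI PaLM2 / Gemini or local Ollama models against a task.
Assign arbitrary tasks to local or hosted language models. In ModelForge, a task is broken down into an agent, an optional postprocessor, and an evaluator. A task has a top-level prompt: the actual work to be done. For example, you could use the following as a task prompt: "Implement a simple example of malloc in C with the following signature: void* malloc(size_t size)". Next, you can include a postprocessor request to a local model to extract only the program's source code from the agent's response. Finally, your evaluator is instructed to act as an expert in the task, ideally with CoT (Chain of Thought) based examples included.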
git clone https://github.com/Brandon7CC/MODELFORGE
cd MODELFORGE/
python -m venv forge-env
source forge-env/bin/activate
pip install -r requirements.txt
python src/main.py -h
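echo "Done! Next, you can try FizzBuzz with Ollama locally!\npython src/main.py task_configs/FizzBuzz.yaml"

FizzBuzz is a classic "can you code" problem. It is simple, but it can offer insight into how a developer thinks through a problem: for example, in Python, how they use control flow, lambdas, and so on. Here is the problem statement:
Write a program to display numbers from 1 to n. For multiples of three, print "Fizz" instead of the number, and for multiples of five, print "Buzz". For numbers which are multiples of both three and five, print "FizzBuzz".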
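For reference, one straightforward way to satisfy this problem statement in Python is sketched below. It only illustrates the kind of program the agent is being asked to produce; it is not output from ModelForge, and the function name `fizzbuzz` is our own choice.

```python
def fizzbuzz(n: int) -> list[str]:
    """Return the FizzBuzz sequence for 1..n as a list of strings."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    lines = []
    for i in range(1, n + 1):
        if i % 15 == 0:  # multiple of both three and five
            lines.append("FizzBuzz")
        elif i % 3 == 0:
            lines.append("Fizz")
        elif i % 5 == 0:
            lines.append("Buzz")
        else:
            lines.append(str(i))
    return lines


if __name__ == "__main__":
    print("\n".join(fizzbuzz(15)))
```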
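Next, we'll create our task configuration file (this is already done for you in task_configs/FizzBuzz.yaml), but we'll walk you through it. To do so, we'll define a top-level task named "FizzBuzz", give it a prompt, and set the number of times we want the model to solve the problem.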
tasks:
  - name: FizzBuzz
    # If a run count is not provided then the task will only run until evaluator success.
    run_count: 5
    prompt: |
      Write a program to display numbers from 1 to n. For multiples of three, print "Fizz"
      instead of the number, and for the multiples of five, print "Buzz". For numbers which
      are multiples of both three and five, print "FizzBuzz".
      Let's think step by step.

Now we'll define our "agent": the model that will act as an expert in completing our task. The model can be any supported hosted or local Ollama model (e.g. Google's Gemini, OpenAI's GPT-4, or Mistral AI's Mixtral 8x7B via Ollama).
tasks:
  - name: FizzBuzz
    run_count: 5
    prompt: |
      ...
    agent:
      # We'll generate a custom model for each base model
      base_model: mixtral:8x7b-instruct-v0.1-q4_1
      temperature: 0.98
      system_prompt: |
        You're an expert Python developer. Follow these requirements **exactly**:
        - The code you produce is at the principal level;
        - You follow modern object oriented programming patterns;
        - You list your requirements and design a simple test before implementing.
        Review the user's request and follow these requirements.

You can optionally create a "postprocessor". We only want the code produced by the agent to be evaluated, so here we'll have the postprocessor model extract the source code from the agent's response.
tasks:
  - name: FizzBuzz
    # If a run count is not provided then the task will only run until evaluator success.
    run_count: 5
    prompt: |
      ...
    agent:
      # We'll generate a custom model for each base model
      base_model: gpt-4-1106-preview
      temperature: 0.98
      system_prompt: |
        ...
    postprocessor:
      base_model: mistral
      temperature: 0.1
      system_prompt: |
        You have one job: return the source code provided in the user's message.
        **ONLY** return the exact source code. Your response is not read by a human.

Finally, you'll want an "evaluator" model, which will act as an expert in reviewing the output of the agent/postprocessor. The evaluator's job is to return true / false. Additionally, we can fail up to 10 times, re-querying the agent each time. Here is where a bit of the magic comes in: we include a brief summary of the failed attempt, a critique, in the next query to the agent. This lets the agent iterate on its own work far more effectively. Here we'll want our evaluator to review implementations of FizzBuzz.
tasks:
  - name: FizzBuzz
    # If a run count is not provided then the task will only run until evaluator success.
    run_count: 5
    prompt: |
      ...
    agent:
      # We'll generate a custom model for each base model
      base_model: codellama
      temperature: 0.98
      system_prompt: |
        ...
    postprocessor:
      base_model: gemini-pro
      temperature: 0.1
      system_prompt: |
        ...
    # Evaluators have defined system prompts to only return true / false for their domain.
    evaluator:
      base_model: gpt-4-1106-preview
      temperature: 0.1
      system_prompt: |
        Assess if a given sample program correctly implements Fizz Buzz.
        The program should display numbers from 1 to n. For multiples of three, it should
        print "Fizz" instead of the number, for the multiples of five, it should print "Buzz",
        and for numbers which are multiples of both three and five, it should print "FizzBuzz".
        Guidelines for Evaluation
        - Correctness: Verify that the program outputs "Fizz" for multiples of 3, "Buzz" for
          multiples of 5, and "FizzBuzz" for numbers that are multiples of both 3 and 5. For
          all other numbers, it should output the number itself.
        - Range Handling: Check if the program correctly handles the range from 1 to n, where
          n is the upper limit provided as input.
        - Error Handling: Assess if the program includes basic error handling, such as ensuring
          the input is a positive integer.

This work was inspired by Google DeepMind's FunSearch approach to finding a novel solution to the cap set problem. At the macro level, this was done by developing CoT (Chain of Thought) based examples, repeatedly prompting PaLM2 to generate large numbers of programs, and then evaluating those programs on several levels.
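The agent, postprocessor, and evaluator flow described in this walkthrough, including the critique fed back after a failed attempt, can be sketched roughly as follows. This is a minimal illustration and not ModelForge's actual implementation: `run_task`, `parse_verdict`, and the three callables are hypothetical names, the wording of the critique is invented, and the models are replaced by plain Python functions so the sketch runs without any API keys.

```python
from typing import Callable, Optional

MAX_ATTEMPTS = 10  # the walkthrough mentions failing up to 10 times


def parse_verdict(text: str) -> bool:
    """Interpret an evaluator reply as true / false (assumed reply format)."""
    return text.strip().lower().startswith("true")


def run_task(
    prompt: str,
    agent: Callable[[str], str],
    evaluator: Callable[[str], str],
    postprocessor: Optional[Callable[[str], str]] = None,
    max_attempts: int = MAX_ATTEMPTS,
) -> Optional[str]:
    """Query the agent until the evaluator accepts, or attempts run out."""
    query = prompt
    for attempt in range(1, max_attempts + 1):
        response = agent(query)
        # Optionally reduce the response to just the part we want judged.
        candidate = postprocessor(response) if postprocessor else response
        if parse_verdict(evaluator(candidate)):
            return candidate
        # Feed the failed attempt back as a critique in the next query.
        query = (
            f"{prompt}\n\nYour previous attempt (#{attempt}) was rejected:\n"
            f"{candidate}\nPlease fix it."
        )
    return None


if __name__ == "__main__":
    # Stand-in "models": this agent only succeeds once it has seen a critique.
    def fake_agent(query: str) -> str:
        return "print('FizzBuzz')" if "rejected" in query else "print('Fizz')"

    def fake_evaluator(candidate: str) -> str:
        return "true" if "FizzBuzz" in candidate else "false"

    print(run_task("Write FizzBuzz.", fake_agent, fake_evaluator))
```

Passing the models in as callables keeps the loop independent of any particular provider, so the same control flow would apply whether the agent is a hosted model or a local Ollama one.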