Evaluate hosted OpenAI GPT / Google Vertex AI PaLM 2 / Gemini or local Ollama models on a per-task basis.
Assign arbitrary tasks to local or hosted language models. In MODELFORGE, a task is broken down into an agent, an optional postprocessor, and an evaluator. A task carries a top-level prompt: the actual work to be done. For example, you could use the following as a task prompt: "Implement a simple example of malloc in C with the following signature: void* malloc(size_t size)". Next, you can include a postprocessor request to a local model to extract only the program's source code from the agent's response. Finally, your evaluator is instructed to act as an expert on the task at hand, ideally with CoT (chain-of-thought) based examples.
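To make the decomposition concrete, here's a minimal sketch of how the three roles fit together. This is not MODELFORGE's actual API; every name below (run_task and the callables passed into it) is a placeholder for illustration only.

from typing import Callable

# Hedged sketch of the agent -> postprocessor -> evaluator flow described above.
# The callables stand in for whichever hosted or local models you configure.
def run_task(
    prompt: str,
    agent: Callable[[str], str],          # does the actual work described by the prompt
    postprocessor: Callable[[str], str],  # e.g. keeps only the source code from the response
    evaluator: Callable[[str], bool],     # expert review: True on success, False otherwise
) -> tuple[str, bool]:
    response = agent(prompt)
    code_only = postprocessor(response)
    return code_only, evaluator(code_only)

Each role maps onto a base model plus a system prompt in the task configs shown later.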
git clone https://github.com/Brandon7CC/MODELFORGE
cd MODELFORGE/
python -m venv forge-env
source forge-env/bin/activate
pip install -r requirements.txt
python src/main.py -h
echo " Done! Next, you can try FizzBuzz with Ollama locally!npython src/main.py task_configs/FizzBuzz.yaml FizzBuzz是一个经典的“您可以编码”问题。这很简单,但可以提供有关开发人员通过问题如何思考的洞察力。例如,在Python中,使用控制流,Lambdas等。这是问题陈述:
Write a program to display numbers from 1 to n. For multiples of three, print "Fizz" instead of the number, and for multiples of five, print "Buzz". For numbers which are multiples of both three and five, print "FizzBuzz".
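For reference, a straightforward Python solution (roughly what we'd hope an agent produces) might look like the following. This snippet is ours, for illustration; it is not output from any model.

def fizz_buzz(n: int) -> None:
    """Print the numbers 1..n, substituting Fizz / Buzz / FizzBuzz per the rules above."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    for i in range(1, n + 1):
        if i % 15 == 0:
            print("FizzBuzz")
        elif i % 3 == 0:
            print("Fizz")
        elif i % 5 == 0:
            print("Buzz")
        else:
            print(i)

if __name__ == "__main__":
    fizz_buzz(15)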
Next, we'll build our task configuration file (this has already been done for you in task_configs/FizzBuzz.yaml), but we'll walk you through it. To start, we define a top-level task named "FizzBuzz" with a prompt and the number of times we'd like the model to attempt the problem.
tasks:
  - name: FizzBuzz
    # If a run count is not provided then the task will only run until evaluator success.
    run_count: 5
    prompt: |
      Write a program to display numbers from 1 to n. For multiples of three, print "Fizz"
      instead of the number, and for the multiples of five, print "Buzz". For numbers which
      are multiples of both three and five, print "FizzBuzz".
      Let's think step by step.

Now we'll define our "agent": the model that will act as the expert completing our task. The model can be any supported hosted / local Ollama model (e.g. Google's Gemini, OpenAI's GPT-4, or Mistral AI's Mixtral 8x7B through Ollama).
tasks:
  - name: FizzBuzz
    run_count: 5
    prompt: |
      ...
    agent:
      # We'll generate a custom model for each base model
      base_model: mixtral:8x7b-instruct-v0.1-q4_1
      temperature: 0.98
      system_prompt: |
        You're an expert Python developer. Follow these requirements **exactly**:
        - The code you produce is at the principal level;
        - You follow modern object oriented programming patterns;
        - You list your requirements and design a simple test before implementing.
        Review the user's request and follow these requirements.

Optionally, you can create a "postprocessor". We only want the agent's finished code to be evaluated, so here we'll have the postprocessor model extract just the source code from the agent's response.
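As a rough illustration of what we're asking the postprocessor model to do, here's a deterministic analogue. This snippet is ours (not part of MODELFORGE) and assumes the agent wraps its code in Markdown fences.

import re

def extract_code_blocks(agent_response: str) -> str:
    """Return only the fenced code from an agent's response, dropping the surrounding prose."""
    blocks = re.findall(r"```(?:\w+)?\n(.*?)```", agent_response, flags=re.DOTALL)
    return "\n\n".join(block.strip() for block in blocks)

In the config, the postprocessor role is handled by a small model (mistral here) rather than a regex, since real agent responses don't always follow a strict formatting convention: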
tasks:
  - name: FizzBuzz
    # If a run count is not provided then the task will only run until evaluator success.
    run_count: 5
    prompt: |
      ...
    agent:
      # We'll generate a custom model for each base model
      base_model: gpt-4-1106-preview
      temperature: 0.98
      system_prompt: |
        ...
    postprocessor:
      base_model: mistral
      temperature: 0.1
      system_prompt: |
        You have one job: return the source code provided in the user's message.
        **ONLY** return the exact source code. Your response is not read by a human.

Finally, you'll want an "evaluator" model, which acts as an expert reviewing the output of the agent / postprocessor. The evaluator's job is to return true / false. Additionally, we allow up to 10 failures, re-prompting the agent each time. Here's where a bit of magic comes in: we fold a brief critique of the failed attempt into the next query to the agent, which lets the agent iterate on its own output far more effectively. Here, we want our evaluator to review the implementation of FizzBuzz (the full task config, with the evaluator added, follows the sketch below).
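To make the retry-with-critique flow concrete, here's a minimal sketch under a few assumptions: it is not MODELFORGE's actual implementation, the callables are placeholders, and we assume the critique comes back alongside the evaluator's true / false verdict.

from typing import Callable, Optional

MAX_FAILURES = 10  # assumption: mirrors the "up to 10 failures" behaviour described above

def run_until_success(
    prompt: str,
    agent: Callable[[str], str],
    postprocess: Callable[[str], str],
    evaluate: Callable[[str], tuple[bool, str]],  # returns (passed, critique)
) -> Optional[str]:
    current_prompt = prompt
    for _ in range(MAX_FAILURES):
        code = postprocess(agent(current_prompt))
        passed, critique = evaluate(code)
        if passed:
            return code
        # Fold a short critique of the failed attempt into the next query so the
        # agent can iterate on its own output.
        current_prompt = f"{prompt}\n\nYour previous attempt failed this review: {critique}"
    return None

The actual mechanics in MODELFORGE may differ; the config below simply declares which model plays the evaluator role and what its review criteria are.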
tasks:
  - name: FizzBuzz
    # If a run count is not provided then the task will only run until evaluator success.
    run_count: 5
    prompt: |
      ...
    agent:
      # We'll generate a custom model for each base model
      base_model: codellama
      temperature: 0.98
      system_prompt: |
        ...
    postprocessor:
      base_model: gemini-pro
      temperature: 0.1
      system_prompt: |
        ...
    # Evaluators have defined system prompts to only return true / false for their domain.
    evaluator:
      base_model: gpt-4-1106-preview
      temperature: 0.1
      system_prompt: |
        Assess if a given sample program correctly implements Fizz Buzz.
        The program should display numbers from 1 to n. For multiples of three, it should
        print "Fizz" instead of the number, for the multiples of five, it should print "Buzz",
        and for numbers which are multiples of both three and five, it should print "FizzBuzz".
        Guidelines for Evaluation
        - Correctness: Verify that the program outputs "Fizz" for multiples of 3, "Buzz" for
          multiples of 5, and "FizzBuzz" for numbers that are multiples of both 3 and 5. For
          all other numbers, it should output the number itself.
        - Range Handling: Check if the program correctly handles the range from 1 to n, where
          n is the upper limit provided as input.
        - Error Handling: Assess if the program includes basic error handling, such as ensuring
          the input is a positive integer.

This work was inspired by Google DeepMind's FunSearch approach, which was able to find novel solutions to the cap set problem. At a macro level, FunSearch worked by developing CoT (chain-of-thought) based examples and repeatedly prompting PaLM 2 to generate large numbers of programs, which were then evaluated at several levels.