
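AI R&D Efficiency Research: Train Your Own LoRA

PS: For fine-tuning related to code completion and documentation generation, see: https://github.com/unit-mesh/unit-eval

Disclaimer: The datasets and LoRA binaries provided by this project were all either generated by OpenAI or taken from publicly available projects. We only provide tutorials on model training; users bear sole responsibility for any consequences of what they actually train.

As engineers, we can plainly see the impact of large language models such as ChatGPT. This prompted us to study how AI can improve R&D efficiency: we trained several LLaMA LoRAs and ChatGLM LoRAs to explore ways of improving it.

This project is the result of that research. It includes video introductions, trained models, training code, training data, and notes recorded during training.

The trained LoRAs are available under Release.

Training notebooks: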

LLaMA Alpaca LoRA
ChatGLM Tuning LoRA

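LLaMA series online videos:

"Code-Assisted Generation"
"Test Code Generation"
"Detailed Requirements Generation"
"Text to SQL"

ChatGLM series online videos:

"LoRA Showdown: ChatGLM vs LLaMA, Which Writes Better Requirements Documents?"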

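AI R&D Efficiency Research: Train Your Own LoRA
1. Introduction
  1. Roadmap
  2. Sponsors
Design summary: process standardization
1. R&D efficiency
2. Unit Mesh
Data preparation
1. Task splitting + user story generation
  1. Step 1. Generate user tasks
  2. Step 2. Break user tasks down into user stories
2. Code-assisted generation
  1. Step 1. Prepare data
  2. Step 2. Generate instructions
  3. Class information format
  4. Other: core code logic
3. Test code generation
  1. Step 1. Generate test code
  2. Step 2. Write implementation code with OpenAI Davinci (optional)
4. Text to code
5. Text to repository
  1. Data preparation
  2. Example output
6. Domain knowledge
Training and results
1. Training LoRA on Meta's LLaMA
  1. Training 1: test code generation
  2. Training 2: splitting user stories
  3. Training 3: code assistance
  4. Text to SQL
2. Training LoRA on Tsinghua University's ChatGLM
  1. Code generation
  2. Test generation
  3. User story generation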

Introduction

For the related data conversion, see: https://github.com/unit-mesh/minions-data-prepare

PS: If what you need is better code generation, we recommend: https://huggingface.co/Salesforce/codegen-16B-mono

Cloud GPU

We use cloud GPUs provided by OpenBayes: https://openbayes.com/console/signup?r=phodal_uVxU

Models available on OpenBayes:

llama-7b-hf: https://openbayes.com/console/open-tutorials/models/LHney50G1TB/1/overview
chatglm-6b: https://openbayes.com/console/open-tutorials/models/D24PPO2ItU4/1/overview

Roadmap

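Training: domain knowledge (Done)
Training: test code generation (Done)
Training: user story generation (Done)
Training: code-assisted generation (Done)
Training: SQL conversion (Done)
Training: text to code (Done)
Training: ...
Training: generating Unit Mesh code blocks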

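Design summary: process standardization

AI-assisted efficiency gains depend on standardizing the R&D process and splitting each step as finely as possible.

R&D efficiency

To make the training results more accurate, we split the software development process into detailed steps, so that the output of each step is accurate and in turn drives an accurate final result. Below is a small sample of the fine-grained steps from our early breakdown: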

split_user_story_tasks
create_agile_user_story
design_restful_api
design_plantuml_java_datastructure
implementation_mock_mvc_test
implementation_spring_controller
implementation_controller_test
implementation_spring_service
….

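We need to split the work into the smallest possible steps and feed data into each refined step. Only then does the AI produce the strongest "parrot" effect, that is, faithfully reproducing the patterns it was trained on.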

Unit Mesh

Todos

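Data preparation

We use very simple instructions, and provide as many of them as possible, so that they are easy to integrate into tools. They are as follows:

Domain knowledge. instruction: domain knowledge.
Task splitting. instruction: split user story tasks, input: the user story task to split
Requirements refinement. instruction: create Agile user story for following topic, input: basic information about the feature
Code generation. instruction: Implement the method xxx, input: basic information about the class
Test generation. instruction: Write test for follow code, input: basic information about the class
SQL generation. instruction: text to sql, input: the question
Text to Java code. instruction: text to java code, input: the question

What each function does:

Requirements refinement. AI helps turn a vague requirement into a detailed requirements design. For example, for a "registration" feature it generates "As a user xxx, I fill in my username, password, and so on", which a human then reviews and refines.
Code generation. AI helps translate the detailed requirements design into the target code, which a human then reviews and refines.
Test generation. AI helps generate test code for the generated code, which a human then reviews and refines.

The test results show that more data helps: for example, a model trained on 20,000 code examples is "smarter" than one trained on 10,000.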

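Task splitting + user story generation

Basic approach:

Pick common domains (such as online websites) and use OpenAI to generate user tasks (such as logging in, browsing lists, and so on).
Generate user stories from the user tasks.
Train Alpaca LoRA.

As shown below:

Step 1. Generate user tasks

Call OpenAI to create user tasks by category. The prompt is as follows: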

Design a User Story Mapping for ${domain} application based on your understanding. Here are the requirements:

1 . Your user story map should include only user tasks to demonstrate how users will interact with the application.
2 . Our user story map should be based on your understanding of the ${domain} application and its users, and should be
   designed to address their needs and pain points.
3 . You may use any tools or formats you choose to create your user story map, but it should be easily shareable and
   understandable by stakeholders.
4 . Your expression should be more concise and clear.
5 . Your return should be like as follows:

###

User Tasks:

1 . ...

###

Example output:

 User Tasks:
1. Browse and search for animations and comics
2. View details of animations and comics
3. Create an account
4. Log in to the account
5. Add animations and comics to favorites
6. Download animations and comics
7. Share animations and comics with friends
8. Rate and review animations and comics
9. Create and upload animations and comics
10. Participate in online forums and discussions
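
The `${domain}` placeholder in the prompt happens to match the syntax of Python's `string.Template`, so filling the prompt and reading back the numbered list could be sketched as follows. This is a minimal illustration, not the project's actual script: `render_prompt_head` and `parse_user_tasks` are hypothetical helper names, and only the first line of the prompt is reproduced.

```python
import re
from string import Template

# First line of the step 1 prompt; the remaining requirements are omitted here.
PROMPT_HEAD = Template(
    "Design a User Story Mapping for ${domain} application "
    "based on your understanding. Here are the requirements:"
)


def render_prompt_head(domain: str) -> str:
    """Fill the ${domain} placeholder for one application domain."""
    return PROMPT_HEAD.substitute(domain=domain)


def parse_user_tasks(response: str) -> list[str]:
    """Return the numbered items that follow the 'User Tasks:' header."""
    _, _, body = response.partition("User Tasks:")
    tasks = []
    for line in body.splitlines():
        match = re.match(r"\s*\d+\s*\.\s*(.+)", line)
        if match:
            tasks.append(match.group(1).strip())
    return tasks


sample_response = """User Tasks:
1. Browse and search for animations and comics
2. View details of animations and comics
3. Create an account
"""
tasks = parse_user_tasks(sample_response)
print(render_prompt_head("Animation and Comics"))
print(tasks)
```

Each parsed task can then be fed into the next step as the feature name.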

Step 2. Break user tasks down into user stories

Call OpenAI to create user stories from the user tasks. The prompt is as follows:

为下面的需求编写用户故事：${domain} 应用的 ${story_name} 功能。 要求如下：

1 . 必须要考虑尽可能考虑各种异常场景，添加更多的 AC。
2 . 你的返回模板如下所示：

###

用户故事：可以选择宝贝出行服务
作为 莉莉妈
我想 在滴滴打车的手机客户端里选择宝贝出行服务
以便于 我能够带宝宝打车出行的时候打到有儿童座椅的车

AC 1:  莉莉妈可以选择宝贝出行服务
假设 xxx
当 xxx
于是 xxx

###

Example output:

用户故事：可以创建和上传动画和漫画
作为一个 Animation and Comics 应用的用户
我想要创建和上传动画和漫画
以便于我可以分享我的作品给其他用户

AC 1: 用户可以创建和上传动画和漫画
假设 用户已经登录到 Animation and Comics 应用
当 用户点击创建和上传动画和漫画按钮
于是 用户可以创建和上传动画和漫画
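
Once a story like this comes back, it has to be paired with its topic to form a training sample. The sketch below shows one way that could look, assuming the instruction/input/output layout listed under data preparation; `to_training_record` is a hypothetical helper, not a function from the repository.

```python
import json


def to_training_record(story_name: str, story_text: str) -> dict:
    """Pair a feature name with its generated user story as one sample."""
    return {
        "instruction": "create Agile user story for following topic",
        "input": story_name,
        "output": story_text.strip(),
    }


story = """
用户故事：可以创建和上传动画和漫画
作为一个 Animation and Comics 应用的用户
我想要创建和上传动画和漫画
以便于我可以分享我的作品给其他用户
"""
record = to_training_record("创建和上传动画和漫画", story)
# ensure_ascii=False keeps the Chinese text readable in the output file.
print(json.dumps(record, ensure_ascii=False))
```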

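Code-assisted generation

Basic approach:

Copilot Element

Step 1. Prepare data

Download projects from GitHub (they must include test cases).
Traverse the Java files under src/main.
Generate the basic information for each class: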

 PostService(PostRepository, UserRepository, ImageService)
- fields: postRepository:PostRepository, userRepository:UserRepository, userPosts:Set<Post>, imageService:ImageService
- methods: findAll(): List<Post>, addNewPost(Post): Post, saveImageToPost(String, MultipartFile, Post): int

Step 2. Generate instructions

Expected JSON format:

instruction: Implement the method ${methodName}
input: ${classInformation}
output: ${code}

Example:

{"instruction":"Implement the method action","input":"com.thoughtworks.go.config.rules.AbstractDirective(DirectiveType, DirectiveType, String, String, String)\n- fields: action:String, type:String, resource:String, configErrors:ConfigErrors, directiveType:DirectiveType\n- methods: validate(ValidationContext): void, isInvalid(String, List<String>): boolean, matchesAction(String): boolean, matchesType(Class<? extends Validatable>): boolean, matchesResource(String): boolean, errors(): ConfigErrors, addError(String, String): void, equals(Object): boolean, action(): String, type(): String, resource(): String, hashCode(): int, hasErrors(): boolean, getDirectiveType(): DirectiveType","output":"public abstract class AbstractDirective implements Directive {\n\n    @ConfigAttribute(value = \"action\", optional = false)\n    protected String action;\n\n    @ConfigAttribute(value = \"type\", optional = false)\n    protected String type;\n\n    @ConfigValue\n    private String resource;\n\n    private final ConfigErrors configErrors = new ConfigErrors();\n\n    private DirectiveType directiveType;\n\n    public AbstractDirective(DirectiveType allow) {\n        this.directiveType = allow;\n    }\n\n    public AbstractDirective(DirectiveType allow, String action, String type, String resource) {\n        this.directiveType = allow;\n        this.action = action;\n        this.type = type;\n        this.resource = resource;\n    }\n\n    @Override\n    public String action() {\n        return this.action;\n    }\n}\n"}
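
Writing such a record by hand is error-prone because the code contains newlines and quotes that must be escaped. A JSON library handles this automatically, as the following sketch illustrates. The `CodegenRecord` class and `to_json_line` function are hypothetical names chosen for this example, and the class summary and method body are shortened.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class CodegenRecord:
    instruction: str
    input: str
    output: str


def to_json_line(method_name: str, class_info: str, code: str) -> str:
    """Serialize one code-generation sample as a single JSON line."""
    record = CodegenRecord(
        instruction=f"Implement the method {method_name}",
        input=class_info,
        output=code,
    )
    # json.dumps escapes newlines and quotes inside the code for us.
    return json.dumps(asdict(record), ensure_ascii=False)


line = to_json_line(
    "action",
    "AbstractDirective(DirectiveType)\n- fields: action:String",
    'public String action() {\n    return this.action;\n}\n',
)
print(line)
```

Because every newline is escaped, each record stays on one physical line, which is what the JSONL format requires.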

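Class information format

Format specification:

package.ClassName(constructor parameter types)
- fields: the set of member variables (name:Type)
- methods: the set of method signatures (methodName(parameterTypes): ReturnType)

Result: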

 com.thoughtworks.go.config.rules.AbstractDirective(DirectiveType, DirectiveType, String, String, String)
- fields: action:String, type:String, resource:String, configErrors:ConfigErrors, directiveType:DirectiveType
- methods: validate(ValidationContext): void, isInvalid(String, List<String>): boolean, matchesAction(String): boolean, matchesType(Class<? extends Validatable>): boolean, matchesResource(String): boolean, errors(): ConfigErrors, addError(String, String): void, equals(Object): boolean, action(): String, type(): String, resource(): String, hashCode(): int, hasErrors(): boolean, getDirectiveType(): DirectiveType
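
The summary format is simple enough to render from structured data. The sketch below assumes the class has already been parsed into plain tuples; `format_class_info` is a hypothetical helper (the project itself does this in Kotlin), and the PostService data is a shortened version of the earlier example.

```python
def format_class_info(qualified_name, constructor_types, fields, methods):
    """Render the three-line class summary.

    fields:  list of (name, type) pairs
    methods: list of (name, [parameter types], return type) triples
    """
    field_part = ", ".join(f"{name}:{type_}" for name, type_ in fields)
    method_part = ", ".join(
        f"{name}({', '.join(params)}): {returns}"
        for name, params, returns in methods
    )
    return (
        f"{qualified_name}({', '.join(constructor_types)})\n"
        f"- fields: {field_part}\n"
        f"- methods: {method_part}"
    )


summary = format_class_info(
    "PostService",
    ["PostRepository", "UserRepository", "ImageService"],
    [("postRepository", "PostRepository"), ("imageService", "ImageService")],
    [("findAll", [], "List<Post>"), ("addNewPost", ["Post"], "Post")],
)
print(summary)
```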

Other: core code logic

// Parse one Java source file and build its short class summary;
// skip this file when no summary can be produced.
val javaProcessor = JavaProcessor(file.readText())
val shortClass = javaProcessor.toShortClass() ?: return@forEach

// Strip the parts that only add noise to a training sample.
javaProcessor
    .removePackage()
    .removeAllImport()
    .removeLicenseInfoBeforeImport()

// Write one instruction record (as a JSON file) per method.
javaProcessor.splitMethods().forEach { (key, value) ->
    CodegenPrompt(
        instruction = "Implement the method $key",
        input = shortClass.toString(),
        output = value
    ).let { prompt ->
        val output = Json.encodeToString(prompt)
        File("$targetPath${key}.json").writeText(output)
    }
}

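Test code generation

Basic approach

Syntax analysis approach:

Option 1: when time is limited, complete the data with OpenAI. However, test cases written by OpenAI are not necessarily reliable, so we have it generate the business (implementation) code instead.
Option 2: when time permits, analyze the AST to merge steps 1 and 2. This is also the more reasonable approach, since the OpenAI API is expensive.

Step 1. Generate test code

Download projects from GitHub (they must include test cases).
Build a map of the Java files under each project's src/main; if a corresponding test file also exists, add the pair to the dataset.
Generate the basic information of the class under test for each test (to reduce OpenAI token usage):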

 org.unitmesh.processor.TestClass(String, Int)
- fields: field1:String, field2:Int
- methods: method1(String, Int): String, method2(): Int
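
Pairing source files with their tests could be sketched as below. Two assumptions are made that the text does not state: the projects use the standard Maven layout (`src/main/java` and `src/test/java`), and the test for `Foo.java` is named `FooTest.java`. The function name `pair_sources_with_tests` is hypothetical.

```python
import tempfile
from pathlib import Path


def pair_sources_with_tests(project_root: Path) -> dict:
    """Map each src/main Java file to its <Name>Test.java, if one exists."""
    main_root = project_root / "src" / "main" / "java"
    test_root = project_root / "src" / "test" / "java"
    pairs = {}
    for source in main_root.rglob("*.java"):
        relative = source.relative_to(main_root)
        candidate = test_root / relative.with_name(f"{source.stem}Test.java")
        if candidate.is_file():
            pairs[source] = candidate
    return pairs


# Demo on a throwaway project tree with one tested and one untested class.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in (
        "src/main/java/demo/Foo.java",
        "src/main/java/demo/Bar.java",
        "src/test/java/demo/FooTest.java",
    ):
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("// placeholder\n")
    demo_pairs = {
        src.name: test.name
        for src, test in pair_sources_with_tests(root).items()
    }
print(demo_pairs)
```

Classes without a matching test file, such as `Bar.java` in the demo, are simply left out of the dataset.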

Split each test file by test case (that is, by @Test method) into N pieces, so that test1 and test2 become two separate data items:

class TestProcessorTest {
    @Test
    void test1() {
    }

    @Test
    void test2() {
    }
}

Finally, the generated data looks like this:

{"classInfo": "com.thoughtworks.go.security.AESEncrypter(AESCipherProvider)\n- fields: ENCODER:Base64.Encoder, DECODER: Base64.Decoder, cipherProvider:AESCipherProvider, ivProvider:IVProvider\n- methods: createIVProviderInstance(): IVProvider, canDecrypt(String): boolean, encrypt(String): String, decrypt(String): String, createSecretKeySpec(): SecretKeySpec", "testMethod": "public class AESEncrypterTest {\n\n private AESEncrypter aesEncrypter;\n\n @Test\n public void shouldGenerateEncryptedText() throws CryptoException {\n String encrypt = aesEncrypter.encrypt(\"p@ssw0rd\");\n assertThat(encrypt).startsWith(\"AES\");\n assertThat(encrypt.split(\":\")).hasSize(3);\n }\n}\n", "id": "task_0"}
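
The per-test split described in this step could be approximated with plain brace counting, as in the sketch below. This is a rough stand-in for a real Java parser and not the project's implementation: it ignores braces inside string literals and comments, and `split_test_methods` is a hypothetical name.

```python
import re


def split_test_methods(java_source: str) -> list[str]:
    """Return one single-test class per @Test method.

    Braces inside strings or comments are not handled, so real
    projects would need a proper parser.
    """
    class_open = java_source.index("{") + 1
    header = java_source[:class_open]  # everything up to the class brace
    pieces = []
    for match in re.finditer(r"@Test\b", java_source):
        pos = java_source.index("{", match.end())  # method body start
        depth = 0
        while True:
            char = java_source[pos]
            if char == "{":
                depth += 1
            elif char == "}":
                depth -= 1
                if depth == 0:
                    break
            pos += 1
        method = java_source[match.start():pos + 1]
        pieces.append(f"{header}\n    {method}\n}}\n")
    return pieces


source = """class TestProcessorTest {
    @Test
    void test1() {
    }

    @Test
    void test2() {
    }
}
"""
pieces = split_test_methods(source)
print(pieces[0])
```

Each piece keeps the class header, so it still reads as a complete (single-test) class when used as a training sample.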

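Step 2. Write implementation code with OpenAI Davinci (optional)

For the detailed code, see: test-to-code.py

Convert the data above to JSONL and merge it into a prompt.
Have Davinci fill in the blanks.

An example of the final generated prompt: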

You are a programmer and implementation a method with TDD. Here are the requirements:

1 . According follows class information and tests code to write a method.
2 . Try you best to thinking corner case.
3 . You only return the code, no explain.

class information:

###  

io.github.robwin.swagger.test.AbstractContractValidator()

- methods: findExpectedPaths(Swagger, SwaggerAssertionConfig): Map<String,Path>, getPathsIncludingBasePath(Swagger):
  Map<String,Path>, getPathsWithPrefix(Swagger, String): Map<String,Path>, isBlankOrSlash(String): boolean

###

test code:

###  

/**
 * Tests AbstractContractValidator.
 */
@RunWith(Enclosed.class)
public class AbstractContractValidatorTest {

    /**
     * Tests getPathsIncludingBasePath().
     */
    public static class GetPathsIncludingBasePath {

        @Test
        public void shouldReturnPathsPrefixedIfBasePathSet() {
            // given
            Swagger swagger = buildSwaggerFrom("/swagger.json");
            // when
            Map<String, Path> paths = new DummyValidator().getPathsIncludingBasePath(swagger);
            // then
            paths.entrySet().forEach(e -> assertThat(e.getKey(), startsWith(swagger.getBasePath())));
        }
    }

    /**
     * Tests findExpectedPaths().
     */
    public static class FindExpectedPaths {
    }

    private static class DummyValidator extends AbstractContractValidator {
    }
}

###
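
A prompt of this shape could be assembled from one JSONL record roughly as follows. The sketch keeps the original wording of the instructions and assumes the `classInfo` and `testMethod` field names from step 1; `build_prompt` and the `Calculator` sample are hypothetical, and the actual logic lives in test-to-code.py.

```python
import json

# Instruction header copied from the prompt above (original wording kept).
HEADER = (
    "You are a programmer and implementation a method with TDD. "
    "Here are the requirements:\n\n"
    "1. According follows class information and tests code to write a method.\n"
    "2. Try you best to thinking corner case.\n"
    "3. You only return the code, no explain.\n"
)


def build_prompt(record: dict) -> str:
    """Combine one record's class summary and test code into a prompt."""
    return (
        f"{HEADER}\n"
        f"class information:\n\n###\n\n{record['classInfo']}\n\n###\n\n"
        f"test code:\n\n###\n\n{record['testMethod']}\n\n###\n"
    )


# Hypothetical record, shaped like the step 1 output.
jsonl_line = json.dumps({
    "classInfo": "Calculator()\n- methods: add(int, int): int",
    "testMethod": "@Test\nvoid adds() { assertEquals(3, new Calculator().add(1, 2)); }",
    "id": "task_0",
})
prompt = build_prompt(json.loads(jsonl_line))
print(prompt)
```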

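Text to code

We use existing datasets, including:

text-to-sql - a dataset for converting natural language into SQL statements
text-to-code - a dataset for converting natural language into code

Neither dataset is of high quality, but both are basically usable.

Text to repository

Data preparation

Parse the Kotlin project code and extract all classes and methods.
Build the mapping between Repository methods and types.
Generate the basic information for the Repository methods.
Call OpenAI to generate the data.

The format is as follows: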

 instruction:
我想查找特定月份（monthly_id）下在某个日期（date）之前的所有费用（expense），以方便了解特定月份内的开销情况。
input:
data class ExpenseEntity(....)

output:
@Query("SELECT * FROM expense WHERE monthly_id = :recurringExpenseId AND date < :beforeDate")
suspend fun getAllExpensesForRecurringExpenseBeforeDate(recurringExpenseId: Long, beforeDate: LocalDate): List<ExpenseEntity>

Example output:

evaluate("text to kotlin repository with class", "我想查询指定年龄的用户（User）的博客数量。\n ###data class User(var age: Int, val blogId: Int) data class Post(val title: String)###", 0.1, 0.75, 40, 4, 512)

@Query("SELECT COUNT(*) FROM User WHERE age = :age")
abstract fun getBlogCount(age: Int): Long

Domain knowledge

Training 1: PDF

Basic approach:

Convert the PDF file to text.
Split the text by heading into two parts, instruction and output; input is null.

Example:

 instruction: 介绍一下财通财通宝的基金管理人、基金托管人在履行各自职责的过程中，违反《基金法》?

（一）基金管理人、基金托管人在履行各自职责的过程中，违反《基金法》等法律法规的规定或者基金合同约定， 给基金财产或者基金份额持有人造成损害的，
应当分别对各自的行为依法承担赔偿责任；因共同行为给基金财产或者基金份额持有人造成损害的，应当承担连带赔偿责任，对 损失的赔偿，仅限于直接损失。
但是发生下列情况，当事人可以免责：  1.基金管理人及基金托管人按照中国证监会的规定或当时有效的法律法规的作为或不作为而造成的损失等； 
 2.基金管理人由于按照基金合同规定的投资原则而行使或不行使其投资权而造成的损失等；  3.不可抗力。
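
The heading-based split could be sketched as follows, starting from text that has already been extracted from the PDF. How headings were actually detected is not described, so the pattern below (numbered headings such as "1.1 Liability") is purely an assumption, as are the helper names and the English sample text.

```python
import re

# Assumed heading style: a section number, whitespace, then the title.
HEADING = re.compile(r"^\d+(?:\.\d+)*\s+(.+)$")


def make_record(title: str, body: list) -> dict:
    return {"instruction": title, "input": None, "output": " ".join(body)}


def split_by_headings(text: str) -> list:
    """Turn each heading and the text under it into one training record."""
    records, title, body = [], None, []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        match = HEADING.match(line)
        if match:
            if title and body:
                records.append(make_record(title, body))
            title, body = match.group(1), []
        elif title and line:
            body.append(line)
    if title and body:
        records.append(make_record(title, body))
    return records


sample_text = """1 Overview
The fund is open-ended.
1.1 Liability
Each party is liable
for its own acts.
"""
records = split_by_headings(sample_text)
print(records)
```

In the example above the instruction is phrased as a question rather than a bare heading, so the real pipeline presumably rewrites the title as well; that step is not shown here.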

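Training and results

Training LoRA on Meta's LLaMA

Training:

Option 1: use alpaca-lora.ipynb directly.
Option 2: use the OpenBayes container directly. (PS: if you register for OpenBayes through my referral link, both parties receive 60 minutes of RTX 3090 time, which accumulates and never expires: https://openbayes.com/console/signup?r=phodal_uVxU)

Training time:

3,400+ user stories: about 25 minutes
8,000+ test code samples: about 1 hour
10,000+ code-assist samples: about 1.5 hours
20,000+ code-assist samples: about 3 hours
40,000+ code-assist samples: about 6 hours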
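
The three code-assist figures scale linearly, which allows a rough estimate for other dataset sizes. The sketch below derives a throughput from those figures and extrapolates; it assumes the same hardware and settings, and the inputs are themselves approximate ("10,000+", "about 1.5 hours"), so the result is only a ballpark.

```python
# (samples, hours) pairs for code-assist training, taken from the list above.
OBSERVED = [(10_000, 1.5), (20_000, 3.0), (40_000, 6.0)]


def samples_per_hour(observations) -> float:
    """Average throughput across all observed runs."""
    total_samples = sum(samples for samples, _ in observations)
    total_hours = sum(hours for _, hours in observations)
    return total_samples / total_hours


def estimate_hours(samples: int, observations=OBSERVED) -> float:
    """Linear extrapolation of training time for a new dataset size."""
    return samples / samples_per_hour(observations)


rate = samples_per_hour(OBSERVED)
print(f"about {rate:.0f} samples per hour")
print(f"30,000 samples: about {estimate_hours(30_000):.1f} hours")
```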

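Background:

Base model: llama-7b-hf, from Meta's open-source LLaMA model series
Stanford Alpaca: https://github.com/tatsu-lab/stanford_alpaca
Alpaca LoRA: https://github.com/tloen/alpaca-lora

Our goal is not to tune the model or produce a general-purpose model, but to train LoRAs for specific purposes. We therefore did not include the Stanford Alpaca data, so training takes less time: 8,000+ test code samples need only 1 hour, and 3,400+ user stories need only 25 minutes.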

Training 1: test code generation

For the training result (LoRA), see: https://github.com/unit-mesh/unit-minions/releases/tag/v0.0.1

 print(evaluate("write test", """
public Set<String> listFilesUsingJavaIO(String dir) {
    return Stream.of(new File(dir).listFiles())
      .filter(file -> !file.isDirectory())
      .map(File::getName)
      .collect(Collectors.toSet());
}
""", 0.1, 0.75, 40, 4, 512))

## Output:

public class ListFilesUsingJavaIOTest {
    @Test
    public void shouldListFilesUsingJavaIO() {
        Set<String> files = ListFilesUsingJavaIO.listFilesUsingJavaIO(new File("dir"));
        assertThat(files, containsInAnyOrder("file1", "file2", "file3"));
    }
}
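
The five numbers passed to `evaluate` are not explained in the text. Assuming the notebook follows the conventions of tloen/alpaca-lora, they would correspond to temperature, top_p, top_k, num_beams, and max_new_tokens, and the instruction and input would be wrapped in the Stanford Alpaca prompt template. The sketch below only builds that prompt and names the settings; it does not load a model, and `GenerationSettings` and `build_alpaca_prompt` are names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class GenerationSettings:
    # Assumed meaning of evaluate(..., 0.1, 0.75, 40, 4, 512).
    temperature: float = 0.1
    top_p: float = 0.75
    top_k: int = 40
    num_beams: int = 4
    max_new_tokens: int = 512


def build_alpaca_prompt(instruction: str, input_text: str = "") -> str:
    """Wrap an instruction (and optional input) in the Alpaca template."""
    if input_text:
        return (
            "Below is an instruction that describes a task, paired with an "
            "input that provides further context. Write a response that "
            "appropriately completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input_text}\n\n"
            "### Response:\n"
        )
    return (
        "Below is an instruction that describes a task. Write a response "
        "that appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        "### Response:\n"
    )


settings = GenerationSettings()
prompt = build_alpaca_prompt(
    "write test", "public int add(int a, int b) { return a + b; }"
)
print(prompt)
print(settings)
```

A low temperature combined with beam search favors deterministic, repeatable output, which suits code generation.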

Training 2: splitting user stories

For the training result (LoRA), see: https://github.com/unit-mesh/unit-minions/releases/tag/v0.0.2

Note (important):

If the dataset is too small, set val_set_size explicitly, for example: --val_set_size 500

Example output:

 print(evaluate("create Agile user story for following topic: ", "购买电影票", 0.1, 0.75, 40, 4, 512))

用户故事：可以购买电影票
作为一个电影票购买者
我想在购买电影票应用中购买电影票
以便于我可以购买电影票
AC 1: 电影票购买者可以在购买电影票应用中购买电影票
假设 电影票购买者已经登录购买电影票应用
当 电影票购买者点击“购买电影票”按钮
于是 电影票购买者可以在购买电影票应用中购买电影票
AC 2: 电影票购买者可以在购买电影票应用中查看购买记录
假设 电影票购买者已经登录购买电影票应用
当 电影票购买者点击“查看购买记录”按钮
于是 电影票购买者可以在购买电影票应用中查看购买记录
AC 3: 电影票购买者可以在购买电影票应用中查看购买记录
假设 电影票购买者已经登录购买电影票应用
当 电影票购买者点击“查看购买记录”按钮

Training 3: code assistance

For the training result (LoRA), see: https://github.com/unit-mesh/unit-minions/releases/tag/v0.0.3

Example output:

 print(evaluate("Implement the method rmFiles", "FileUtils:", 0.1, 0.75, 40, 4, 512))

public class FileUtils {

    private FileUtils() {
    }

    public static void rmFiles(String path) {
        File dir = new File(path);
        File[] files = dir.listFiles();
        if (files != null) {
            for (File file : files) {
                if (file.isFile()) {
                    file.delete();
                } else if (file.isDirectory()) {
                    FileUtils.rmFiles(file.getAbsolutePath());
                }
            }
        }
    }
}

Text to SQL

Conversion script: code/text-to-sql.py

The training data is taken directly from: datasets/sql/sql-train.jsonl

Example output:

 evaluate("text to sql", "谁是最美丽的人", 0.1, 0.75, 40, 4, 512)
evaluate("text to sql", "小明今年几岁", 0.1, 0.75, 40, 4, 512)
evaluate("text to sql", "What hand guard system is used with a gas piston commando?", 0.1, 0.75, 40, 4, 512)
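
For orientation, converting question/SQL pairs into the `text to sql` instruction format could look like the sketch below. It is not the content of code/text-to-sql.py: the function names are hypothetical, and the sample pair is invented for illustration rather than taken from the dataset.

```python
import json


def to_instruction_record(question: str, sql: str) -> dict:
    """Wrap one question/SQL pair in the instruction format."""
    return {"instruction": "text to sql", "input": question, "output": sql}


def to_jsonl(pairs) -> str:
    """Serialize the pairs as JSONL, one record per line."""
    return "\n".join(
        json.dumps(to_instruction_record(question, sql), ensure_ascii=False)
        for question, sql in pairs
    )


# Hypothetical pairs, not real dataset rows.
pairs = [
    ("小明今年几岁", "SELECT age FROM person WHERE name = '小明'"),
    ("How many users are there?", "SELECT COUNT(*) FROM users"),
]
jsonl = to_jsonl(pairs)
print(jsonl)
```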

Training LoRA on Tsinghua University's ChatGLM

Training:

Use chatglm-tuning.ipynb directly.

Code generation

Test generation

User story generation

Example 1:

evaluate("Instruction: 创建用户故事\nInput: 电影网站: 查看订单详情 \nAnswer:")

Instruction: 创建用户故事
Input: 电影网站: 查看订单详情 
Answer:

用户故事:可以查看订单详情
作为一个电影网站的用户
我想查看订单详情
以便于我可以了解订单的详细信息,如订单号、订单状态、订单内容等。

AC 1: 用户可以查看订单详情
假设 用户已经登录
当 用户点击查看订单详情按钮
于是 用户可以查看订单详情,包括订单号、订单状态、订单内容等。

AC 2: 用户可以查看订单详情
假设 用户没有登录
当 用户点击查看订单详情按钮
于是 用户被要求先登录,然后才能查看订单详情,包括订单号、订单状态、订单内容等。

Example 2:

evaluate("Instruction: 创建用户故事\nInput: 团购网站:用户注册\nAnswer:")

Instruction: 创建用户故事
Input: 团购网站:用户注册
Answer:
团购网站的用户注册故事:

作为一个团购网站的用户

我想注册一个团购网站的用户账号

以便于我可以使用团购服务,如购买商品、服务等。

AC 1: 团购网站的用户注册账号
假设 用户输入了注册信息
当 用户提交注册信息
于是 用户可以成功注册一个团购网站的用户账号

AC 2: 团购网站的用户注册账号
假设 用户输入了错误的注册信息
当 用户提交注册信息
于是 用户可以被提示输入正确的注册信息,并重新提交注册信息
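
Unlike the LLaMA examples, the ChatGLM `evaluate` call takes a single string laid out as Instruction, Input, and Answer, and the output echoes that prompt before the generated story. A small sketch of building the prompt and stripping the echo is shown below; both helper names are hypothetical.

```python
def build_chatglm_prompt(instruction: str, input_text: str) -> str:
    """Lay out the prompt in the Instruction / Input / Answer form."""
    return f"Instruction: {instruction}\nInput: {input_text}\nAnswer:"


def extract_answer(generated: str) -> str:
    """Return the text after the first 'Answer:' marker, if present."""
    _, marker, answer = generated.partition("Answer:")
    return answer.strip() if marker else generated.strip()


prompt = build_chatglm_prompt("创建用户故事", "团购网站:用户注册")
echoed = prompt + "\n团购网站的用户注册故事:\n\n作为一个团购网站的用户"
print(prompt)
print(extract_answer(echoed))
```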
