
"AI R&D Efficiency Improvement Research: Do-it-yourself training LoRA"

PS: For fine-tuning related to code completion and document generation, see https://github.com/unit-mesh/unit-eval

Statement: The datasets and LoRA binaries provided by this project were either generated with OpenAI or published online. We only provide tutorials on model training; users are solely responsible for any consequences arising from what they actually train.

As engineers, we can clearly see the impact of large language models such as ChatGPT, so we studied how AI can improve R&D efficiency by training several LoRAs on LLaMA and ChatGLM for R&D tasks.

This project is our research result, including some video introductions, trained models, training code, training data, and some records during the training process.

See Releases for the trained LoRA weights.

Training Notebook:

LLaMA Alpaca LoRA
ChatGLM Tuning LoRA

LLaMA Series Online Video:

"Code-assisted Generation"
Test Code Generation
"Detailed Requirements Generation"
"Text to SQL"

ChatGLM Series Online Video:

"LoRA Competition: ChatGLM vs LLaMA, who is better at writing requirements documents?"

Table of contents:

"AI R&D Efficiency Improvement Research: Do-it-yourself training LoRA"
1. Introduction
  1. Roadmap
  2. Sponsors
Summary design: process standardization
1. R&D efficiency
2. Unit Mesh
Data preparation
1. Task splitting + user story generation
  1. Step 1. Generate user tasks
  2. Step 2. Decompose user tasks into user stories
2. Code-assisted generation
  1. Step 1. Prepare the data
  2. Step 2. Generate instructions
  3. Class information format
  4. Others: Core code logic
3. Test code generation
  1. Step 1. Generate test code
  2. Step 2. Write implementation code with OpenAI Davinci (optional)
4. Text to code
5. Text generation repository
  1. Data preparation
  2. Output example:
6. Domain Knowledge
Training and results
1. Training LoRA on Meta's LLaMA
  1. Training 1: Test code generation
  2. Training 2: Split user stories
  3. Training 3: Code Assist
  4. SQL to code
2. Training LoRA on Tsinghua University's ChatGLM
  1. Code generation
  2. Test generation
  3. User Story Generation

Introduction

Related data conversion is available at: https://github.com/unit-mesh/minions-data-prepare

PS: If you need better code generation, it is recommended to use: https://huggingface.co/Salesforce/codegen-16B-mono

Cloud GPU

We are using the cloud GPU provided by OpenBayes: https://openbayes.com/console/signup?r=phodal_uVxU

The following models on OpenBayes can be used:

llama-7b-hf: https://openbayes.com/console/open-tutorials/models/LHney50G1TB/1/overview
chatglm-6b: https://openbayes.com/console/open-tutorials/models/D24PPO2ItU4/1/overview

Roadmap

Roadmap:

Training: Domain Knowledge (Done)
Training: Test code generation (Done)
Training: Generate user stories (Done)
Training: Code-assisted generation (Done)
Training: SQL Transformation (Done)
Training: Text to code (Done)
train:……
Training: Generate code blocks of Unit Mesh

Summary design: process standardization

Efficiency gains from AI depend on standardizing the R&D process and splitting each step as finely as possible.

R&D efficiency

To make the training results more accurate, we split the software development process into detailed steps, so that each step produces accurate output and feeds accurate input to the next. Here is a small part of the detailed process from our early split:

split_user_story_tasks
create_agile_user_story
design_restful_api
design_plantuml_java_datastructure
implementation_mock_mvc_test
implementation_spring_controller
implementation_controller_test
implementation_spring_service
….

We need to make each step as small as possible and feed data into each refined step, so that the AI reproduces the learned patterns as faithfully as possible (the "repeater" effect).

Unit Mesh

Todos

Data preparation

We use very simple instructions and provide as many of them as possible so that they are easy to integrate into tools:

Domain knowledge. instruction: Domain knowledge.
Split the task. instruction: split user story tasks, input: split user story tasks
Demand refinement. instruction: create Agile user story for following topic, input: basic information about the function
Code generation. instruction: Implement the method xxx, input: basic information of the class
Test generation. instruction: Write test for follow code, input: basic information of the class
SQL generation. instruction: text to sql, input: problem
Text to Java code. instruction: text to java code, input: problem
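
As a rough illustration of how such instruction/input pairs could be stored, here is a minimal sketch that assumes Alpaca-style records with the keys instruction, input, and output, written as JSON Lines. The helper names make_record and to_jsonl are hypothetical, and the "..." outputs are placeholders rather than real training data.

```python
import json


def make_record(instruction: str, input_text: str = "", output: str = "") -> dict:
    """Build one instruction-tuning record with Alpaca-style keys."""
    return {"instruction": instruction, "input": input_text, "output": output}


def to_jsonl(records: list) -> str:
    """Serialize records as JSON Lines, keeping non-ASCII text readable."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


# Placeholder outputs ("...") stand in for the real answers.
records = [
    make_record("create Agile user story for following topic", "购买电影票", "..."),
    make_record("text to sql", "小明今年几岁", "..."),
]
jsonl = to_jsonl(records)
print(jsonl)
```

Keeping every task in the same three-field shape means one training script can consume all of the datasets.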

Corresponding functions introduction:

Demand refinement. AI assists in transforming vague requirements into a requirements design. For example, for a "registration" function it generates "As a user xxx, fill in the username, password, and other information", which humans then check and improve.
Code generation. AI assists in translating the detailed requirements design into target code, which humans then check and improve.
Test generation. AI assists in generating test code for the generated code, which humans then check and improve.

Judging from the test results, more data helps: a model trained on 20,000 code samples is "smarter" than one trained on 10,000.

Task splitting + user story generation

Basic ideas:

Pick common domains (such as online websites) and use OpenAI to generate user tasks (such as login, browsing lists, etc.).
Generate user stories based on the user tasks.
Train Alpaca LoRA.

As shown below:

Step 1. Generate user tasks

Call OpenAI to create user tasks by category. The prompt is as follows:

Design a User Story Mapping for ${domain} application based on your understanding. Here are the requirements:

1. Your user story map should include only user tasks to demonstrate how users will interact with the application.
2. Your user story map should be based on your understanding of the ${domain} application and its users, and should be
   designed to address their needs and pain points.
3. You may use any tools or formats you choose to create your user story map, but it should be easily shareable and
   understandable by stakeholders.
4. Your expression should be more concise and clear.
5. Your return should be like as follows:

###

User Tasks:

1. ...

###

Sample output:

 User Tasks:
1. Browse and search for animations and comics
2. View details of animations and comics
3. Create an account
4. Log in to the account
5. Add animations and comics to favorites
6. Download animations and comics
7. Share animations and comics with friends
8. Rate and review animations and comics
9. Create and upload animations and comics
10. Participate in online forums and discussions
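
The two mechanical parts of this step, filling ${domain} into the prompt and reading the numbered tasks back out of the response, can be sketched as follows. This is only an illustration: the template is shortened to its first sentence, the OpenAI call itself is omitted, and build_prompt and parse_user_tasks are hypothetical helper names.

```python
import re
from string import Template

# Shortened stand-in for the full prompt shown above.
PROMPT = Template(
    "Design a User Story Mapping for ${domain} application "
    "based on your understanding."
)


def build_prompt(domain: str) -> str:
    """Fill the ${domain} placeholder of the prompt template."""
    return PROMPT.substitute(domain=domain)


def parse_user_tasks(response: str) -> list:
    """Collect the text of every numbered line such as '3. Create an account'."""
    tasks = []
    for line in response.splitlines():
        match = re.match(r"^\s*\d+\s*\.\s*(.+?)\s*$", line)
        if match:
            tasks.append(match.group(1))
    return tasks


sample_response = """User Tasks:
1. Browse and search for animations and comics
2. View details of animations and comics
3. Create an account
"""
tasks = parse_user_tasks(sample_response)
print(build_prompt("Animation and Comics"))
print(tasks)
```

Each parsed task can then be passed to the next step as the story name.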

Step 2. Decompose user tasks into user stories

Call OpenAI to create user stories based on user tasks. The prompt is as follows:

为下面的需求编写用户故事：${domain} 应用的 ${story_name} 功能。 要求如下：

1. 必须要考虑尽可能考虑各种异常场景，添加更多的 AC。
2. 你的返回模板如下所示：

###

用户故事：可以选择宝贝出行服务
作为 莉莉妈
我想 在滴滴打车的手机客户端里选择宝贝出行服务
以便于 我能够带宝宝打车出行的时候打到有儿童座椅的车

AC 1:  莉莉妈可以选择宝贝出行服务
假设 xxx
当 xxx
于是 xxx

###

Sample output:

用户故事：可以创建和上传动画和漫画
作为一个 Animation and Comics 应用的用户
我想要创建和上传动画和漫画
以便于我可以分享我的作品给其他用户

AC 1: 用户可以创建和上传动画和漫画
假设 用户已经登录到 Animation and Comics 应用
当 用户点击创建和上传动画和漫画按钮
于是 用户可以创建和上传动画和漫画
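
Before a generated story goes into the training set, it can help to check that every AC actually contains the 假设 / 当 / 于是 (given / when / then) parts. The following sketch shows one way to do that, assuming the layout of the sample output above; parse_acceptance_criteria is a hypothetical helper, not part of the project.

```python
import re

# Keyword at the start of a line -> key in the parsed result.
KEYWORDS = (("假设", "given"), ("当", "when"), ("于是", "then"))


def parse_acceptance_criteria(story: str) -> list:
    """Split a generated user story into its AC blocks."""
    criteria = []
    current = None
    for raw_line in story.splitlines():
        line = raw_line.strip()
        header = re.match(r"^AC\s*(\d+)\s*[:：]\s*(.*)$", line)
        if header:
            current = {"id": int(header.group(1)), "title": header.group(2)}
            criteria.append(current)
        elif current is not None:
            for keyword, key in KEYWORDS:
                if line.startswith(keyword):
                    current[key] = line[len(keyword):].strip()
                    break
    return criteria


story = """用户故事：可以创建和上传动画和漫画
AC 1: 用户可以创建和上传动画和漫画
假设 用户已经登录到 Animation and Comics 应用
当 用户点击创建和上传动画和漫画按钮
于是 用户可以创建和上传动画和漫画
"""
criteria = parse_acceptance_criteria(story)
# An AC is complete when all three parts were found.
complete = all({"given", "when", "then"} <= set(ac) for ac in criteria)
print(criteria, complete)
```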

Code-assisted generation

Basic ideas:

Copilot Element

Step 1. Prepare the data

Download projects from GitHub (they need to include test cases).
Traverse the Java files under src/main.
Generate basic information for each class:

 PostService(PostRepository, UserRepository, ImageService)
- fields: postRepository:PostRepository, userRepository:UserRepository, userPosts:Set<Post>, imageService:ImageService
- methods: findAll(): List<Post>, addNewPost(Post): Post, saveImageToPost(String, MultipartFile, Post): int

Step 2. Generate instructions

Expected JSON format:

instruction: Implement the method ${methodName}
input: ${classInformation}
output: ${code}

Example:

{"instruction":"Implement the method action","input":"com.thoughtworks.go.config.rules.AbstractDirective(DirectiveType, DirectiveType, String, String, String)\n- fields: action:String, type:String, resource:String, configErrors:ConfigErrors, directiveType:DirectiveType\n- methods: validate(ValidationContext): void, isInvalid(String, List<String>): boolean, matchesAction(String): boolean, matchesType(Class<? extends Validatable>): boolean, matchesResource(String): boolean, errors(): ConfigErrors, addError(String, String): void, equals(Object): boolean, action(): String, type(): String, resource(): String, hashCode(): int, hasErrors(): boolean, getDirectiveType(): DirectiveType","output":"public abstract class AbstractDirective implements Directive {\n\n    @ConfigAttribute(value = \"action\", optional = false)\n    protected String action;\n\n    @ConfigAttribute(value = \"type\", optional = false)\n    protected String type;\n\n    @ConfigValue\n    private String resource;\n\n    private final ConfigErrors configErrors = new ConfigErrors();\n\n    private DirectiveType directiveType;\n\n    public AbstractDirective(DirectiveType allow) {\n        this.directiveType = allow;\n    }\n\n    public AbstractDirective(DirectiveType allow, String action, String type, String resource) {\n        this.directiveType = allow;\n        this.action = action;\n        this.type = type;\n        this.resource = resource;\n    }\n\n    @Override\n    public String action() {\n        return this.action;\n    }\n}\n"}

Class information format

Format specification:

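packageName.ClassName(constructor parameter types)
- fields: set of member variables (variableName:Type)
- methods: set of method signatures (methodName(parameterTypes): returnType)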

result:

 com.thoughtworks.go.config.rules.AbstractDirective(DirectiveType, DirectiveType, String, String, String)
- fields: action:String, type:String, resource:String, configErrors:ConfigErrors, directiveType:DirectiveType
- methods: validate(ValidationContext): void, isInvalid(String, List<String>): boolean, matchesAction(String): boolean, matchesType(Class<? extends Validatable>): boolean, matchesResource(String): boolean, errors(): ConfigErrors, addError(String, String): void, equals(Object): boolean, action(): String, type(): String, resource(): String, hashCode(): int, hasErrors(): boolean, getDirectiveType(): DirectiveType
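
The format above can be produced from already-extracted class metadata with a few string joins. The sketch below assumes the parsing has been done elsewhere and only shows the formatting; format_class_info and its argument shapes are hypothetical, and the example data is the PostService summary shown earlier.

```python
def format_class_info(qualified_name, constructor_types, fields, methods):
    """Render a class summary in the compact three-line format.

    fields:  list of (name, type) pairs
    methods: list of (name, [parameter types], return type) triples
    """
    constructor_part = ", ".join(constructor_types)
    field_part = ", ".join(f"{name}:{type_}" for name, type_ in fields)
    method_part = ", ".join(
        f"{name}({', '.join(params)}): {returns}" for name, params, returns in methods
    )
    return (
        f"{qualified_name}({constructor_part})\n"
        f"- fields: {field_part}\n"
        f"- methods: {method_part}"
    )


info = format_class_info(
    "PostService",
    ["PostRepository", "UserRepository", "ImageService"],
    [
        ("postRepository", "PostRepository"),
        ("userRepository", "UserRepository"),
        ("userPosts", "Set<Post>"),
        ("imageService", "ImageService"),
    ],
    [
        ("findAll", [], "List<Post>"),
        ("addNewPost", ["Post"], "Post"),
        ("saveImageToPost", ["String", "MultipartFile", "Post"], "int"),
    ],
)
print(info)
```

Because the summary drops method bodies, it stays short enough to serve as the input field of a training record.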

Others: Core code logic

val javaProcessor = JavaProcessor(file.readText())
val shotClass = javaProcessor.toShortClass() ?: return@forEach

javaProcessor
    .removePackage()
    .removeAllImport()
    .removeLicenseInfoBeforeImport()

javaProcessor.splitMethods().forEach { (key, value) ->
    CodegenPrompt(
        instruction = "Implement the method $key",
        input = shotClass.toString(),
        output = value
    ).let { prompt ->
        val output = Json.encodeToString(prompt)
        File("$targetPath${key}.json").writeText(output)
    }
}

Test code generation

Basic ideas

Two approaches:

Method 1 - When time is limited, build on OpenAI-generated data. Test cases written by OpenAI are not necessarily reliable, so we let it generate the business code from existing tests instead.
Method 2 - When there is plenty of time, analyzing the AST to merge the first and second steps is the more reasonable solution. After all, the OpenAI API is very expensive.

Step 1. Generate test code

Download projects from GitHub (they need to include test cases).
Build a map of the Java files under src/main for each project. If a corresponding test file exists, the pair is added to the dataset.
Generate basic information for the class corresponding to each test (to reduce OpenAI token usage):

 org.unitmesh.processor.TestClass(String, Int)
- fields: field1:String, field2:Int
- methods: method1(String, Int): String, method2(): Int

Split each test file by test case (i.e., by @Test method) into N samples (e.g., test1 and test2 become two separate samples):

class TestProcessorTest {
    @Test
    void test1() {
    }

    @Test
    void test2() {
    }
}

Finally, the generated data is as follows:

{"classInfo": "com.thoughtworks.go.security.AESEncrypter(AESCipherProvider)\n- fields: ENCODER:Base64.Encoder, DECODER: Base64.Decoder, cipherProvider:AESCipherProvider, ivProvider:IVProvider\n- methods: createIVProviderInstance(): IVProvider, canDecrypt(String): boolean, encrypt(String): String, decrypt(String): String, createSecretKeySpec(): SecretKeySpec", "testMethod": "public class AESEncryptterTest {\n\n private AESEncryptter aesEncryptter;\n\n @Test\n public void shouldGenerateEncryptedText() throws CryptoException {\n String encrypt = aesEncryptter.encrypt(\"p@ssw0rd\");\n assertThat(encrypt).startsWith(\"AES\");\n assertThat(encrypt.split(\":\")).hasSize(3);\n }\n}\n", "id": "task_0"}
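
The file-pairing part of this step (keep a source file only if a matching test file exists) can be sketched as below. The sketch assumes a Maven-style layout with src/main/java and src/test/java and a Test suffix on test classes; the source only states that files under src/main are mapped, so both the layout details and the helper name pair_sources_with_tests are assumptions.

```python
from pathlib import Path


def pair_sources_with_tests(project_root: Path) -> dict:
    """Map each Java source file to its test file, skipping untested sources.

    Assumes src/main/java/<pkg>/Foo.java is tested by
    src/test/java/<pkg>/FooTest.java.
    """
    main_root = project_root / "src" / "main" / "java"
    test_root = project_root / "src" / "test" / "java"
    pairs = {}
    for source in sorted(main_root.rglob("*.java")):
        relative = source.relative_to(main_root)
        candidate = test_root / relative.with_name(relative.stem + "Test.java")
        if candidate.is_file():
            pairs[source] = candidate
    return pairs


# Usage (hypothetical path):
# for source, test in pair_sources_with_tests(Path("path/to/project")).items():
#     print(source, "->", test)
```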

Step 2. Write implementation code with OpenAI Davinci (optional)

For detailed code, see: test-to-code.py

Convert the above data to JSONL and merge it into the prompt.
Let Davinci complete the fill-in-the-blank task.

The final generated prompt looks like this:

You are a programmer and implementation a method with TDD. Here are the requirements:

1. According follows class information and tests code to write a method.
2. Try you best to thinking corner case.
3. You only return the code, no explain.

class information:

###  

io.github.robwin.swagger.test.AbstractContractValidator()

- methods: findExpectedPaths(Swagger, SwaggerAssertionConfig): Map<String,Path>, getPathsIncludingBasePath(Swagger):
  Map<String,Path>, getPathsWithPrefix(Swagger, String): Map<String,Path>, isBlankOrSlash(String): boolean

###

test code:

###  

/**
 * Tests AbstractContractValidator.
 */
@RunWith(Enclosed.class)
public class AbstractContractValidatorTest {

    /**
     * Tests getPathsIncludingBasePath().
     */
    public static class GetPathsIncludingBasePath {

        @Test
        public void shouldReturnPathsPrefixedIfBasePathSet() {
            // given
            Swagger swagger = buildSwaggerFrom("/swagger.json");
            // when
            Map<String, Path> paths = new DummyValidator().getPathsIncludingBasePath(swagger);
            // then
            paths.entrySet().forEach(e -> assertThat(e.getKey(), startsWith(swagger.getBasePath())));
        }
    }

    /**
     * Tests findExpectedPaths().
     */
    public static class FindExpectedPaths {
    }

    private static class DummyValidator extends AbstractContractValidator {
    }
}

###
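
Merging a record into a prompt of this shape can be sketched as follows. This is not the project's test-to-code.py, only an illustration that assumes the classInfo, testMethod, and id fields from the Step 1 sample; the instruction header is left out for brevity, and build_prompt and to_prompt_jsonl are hypothetical names.

```python
import json


def build_prompt(record: dict) -> str:
    """Wrap class information and test code in ### delimiters."""
    return (
        "class information:\n\n###\n\n"
        f"{record['classInfo']}\n\n###\n\n"
        "test code:\n\n###\n\n"
        f"{record['testMethod']}\n\n###\n"
    )


def to_prompt_jsonl(records: list) -> str:
    """Emit one JSON object per line, each holding the record id and its prompt."""
    lines = []
    for record in records:
        item = {"id": record["id"], "prompt": build_prompt(record)}
        lines.append(json.dumps(item, ensure_ascii=False))
    return "\n".join(lines)


records = [
    {
        "id": "task_0",
        "classInfo": "org.unitmesh.processor.TestClass(String, Int)",
        "testMethod": "class TestProcessorTest {\n    @Test\n    void test1() {\n    }\n}",
    }
]
prompt_jsonl = to_prompt_jsonl(records)
print(prompt_jsonl)
```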

Text to code

Use existing Datasets, including:

text-to-sql - Dataset for converting natural language into SQL statements
text-to-code - Dataset for converting natural language into code

These two datasets are not of high quality, but they are basically usable.

Text generation repository

Data preparation

Parse the Kotlin project code and extract all classes and methods.
Establish the correspondence between Repository methods and types.
Generate basic information about the Repository methods.
Call OpenAI to generate the data.

The format is as follows:

 instruction:
我想查找特定月份（monthly_id）下在某个日期（date）之前的所有费用（expense），以方便了解特定月份内的开销情况。
input:
data class ExpenseEntity(....)

output:
@Query("SELECT * FROM expense WHERE monthly_id = :recurringExpenseId AND date < :beforeDate")
suspend fun getAllExpensesForRecurringExpenseBeforeDate(recurringExpenseId: Long, beforeDate: LocalDate): List<ExpenseEntity>

Output example:

evaluate("text to kotlin repository with class", "我想查询指定年龄的用户（User）的博客数量。\n ###data class User(var age: Int, val blogId: Int) data class Post(val title: String)###", 0.1, 0.75, 40, 4, 512)

@Query("SELECT COUNT(*) FROM User WHERE age = :age")
abstract fun getBlogCount(age: Int): Long

Domain Knowledge

Training 1: PDF

Basic ideas:

Convert PDF files to text
Split the text into two parts, the instruction and the output; the input is left empty.

Example:

 instruction: 介绍一下财通财通宝的基金管理人、基金托管人在履行各自职责的过程中，违反《基金法》?

（一）基金管理人、基金托管人在履行各自职责的过程中，违反《基金法》等法律法规的规定或者基金合同约定， 给基金财产或者基金份额持有人造成损害的，
应当分别对各自的行为依法承担赔偿责任；因共同行为给基金财产或者基金份额持有人造成损害的，应当承担连带赔偿责任，对 损失的赔偿，仅限于直接损失。
但是发生下列情况，当事人可以免责：  1.基金管理人及基金托管人按照中国证监会的规定或当时有效的法律法规的作为或不作为而造成的损失等； 
 2.基金管理人由于按照基金合同规定的投资原则而行使或不行使其投资权而造成的损失等；  3.不可抗力。

Training and results

Training LoRA on Meta's LLaMA

Training:

Method 1: Use alpaca-lora.ipynb directly
Method 2: Use the OpenBayes container directly (PS: if you register for OpenBayes with my invitation link, each party gets 60 minutes of RTX 3090 time, which accumulates and never expires: https://openbayes.com/console/signup?r=phodal_uVxU)

Training time:

3400+ user stories, about 25 minutes
8000+ test code, about 1 hour
10000+ code assisted generation, about 1.5 hours
20000+ code assisted generation, about 3 hours
40000+ code assisted generation, about 6 hours
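
The figures above work out to roughly 110 to 140 samples per minute, and the three code-assisted runs scale linearly with dataset size. A small sketch of that arithmetic follows. The sample counts are the rounded "N+" values from the list, so the results are approximate, and the 30,000-sample estimate is a hypothetical extrapolation rather than a measured run.

```python
# (dataset label, approximate sample count, approximate minutes) from the list above
RUNS = [
    ("user stories", 3400, 25),
    ("test code", 8000, 60),
    ("code-assisted generation", 10000, 90),
    ("code-assisted generation", 20000, 180),
    ("code-assisted generation", 40000, 360),
]


def samples_per_minute(samples: int, minutes: float) -> float:
    """Average training throughput of one run."""
    return samples / minutes


def estimate_minutes(samples: int, reference_samples: int, reference_minutes: float) -> float:
    """Linear extrapolation from a single reference run."""
    return samples * reference_minutes / reference_samples


for label, samples, minutes in RUNS:
    print(f"{label}: {samples_per_minute(samples, minutes):.0f} samples/min")

# Hypothetical dataset size, extrapolated from the 10000-sample run.
estimate = estimate_minutes(30000, 10000, 90)
print(f"estimated minutes for 30000 samples: {estimate:.0f}")
```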

Related background:

Basic model: Meta open source LLaMA series model: llama-7b-hf
Stanford Alpaca: https://github.com/tatsu-lab/stanford_alpaca
Alpaca Lora: https://github.com/tloen/alpaca-lora

Since our goal is not to tune a general-purpose model but to train a LoRA for a specific purpose, we did not mix in the Stanford Alpaca data, so training takes less time. For example, the 8000+ test code samples take only 1 hour, and the 3400+ user stories take only 25 minutes.

Training 1: Test code generation

Training results (LoRA) are available at: https://github.com/unit-mesh/unit-minions/releases/tag/v0.0.1

 print(evaluate("write test", """
public Set<String> listFilesUsingJavaIO(String dir) {
    return Stream.of(new File(dir).listFiles())
      .filter(file -> !file.isDirectory())
      .map(File::getName)
      .collect(Collectors.toSet());
}
""", 0.1, 0.75, 40, 4, 512))

## Output:

public class ListFilesUsingJavaIOTest {
    @Test
    public void shouldListFilesUsingJavaIO() {
        Set<String> files = ListFilesUsingJavaIO.listFilesUsingJavaIO(new File("dir"));
        assertThat(files, containsInAnyOrder("file1", "file2", "file3"));
    }
}
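
The evaluate helper is not defined in this document. Assuming it follows the alpaca-lora notebook, the five trailing positional values would be temperature, top_p, top_k, num_beams, and max_new_tokens, and the instruction and input would be wrapped in the Stanford Alpaca prompt template before generation. The sketch below illustrates that assumed mapping without loading any model; generate_prompt and name_generation_args are illustrative names.

```python
def generate_prompt(instruction: str, input_text: str = "") -> str:
    """Build an Alpaca-style prompt (assumed template, not confirmed by this document)."""
    if input_text:
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input_text}\n\n"
            "### Response:\n"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        "### Response:\n"
    )


# Assumed meaning of the positional arguments in evaluate(..., 0.1, 0.75, 40, 4, 512).
GENERATION_ARGS = ("temperature", "top_p", "top_k", "num_beams", "max_new_tokens")


def name_generation_args(*values) -> dict:
    """Pair positional generation values with their assumed parameter names."""
    return dict(zip(GENERATION_ARGS, values))


settings = name_generation_args(0.1, 0.75, 40, 4, 512)
prompt = generate_prompt("write test", "public int add(int a, int b) { return a + b; }")
print(settings)
print(prompt)
```

Under this reading, a low temperature of 0.1 combined with 4 beams favors near-deterministic output, which fits code and test generation.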

Training 2: Split user stories

Training results (LoRA) are available at: https://github.com/unit-mesh/unit-minions/releases/tag/v0.0.2

Notice:

If the dataset is too small, set val_set_size, for example: --val_set_size 500

Sample output :

 print(evaluate("create Agile user story for following topic: ", "购买电影票", 0.1, 0.75, 40, 4, 512))

用户故事：可以购买电影票
作为一个电影票购买者
我想在购买电影票应用中购买电影票
以便于我可以购买电影票
AC 1: 电影票购买者可以在购买电影票应用中购买电影票
假设 电影票购买者已经登录购买电影票应用
当 电影票购买者点击“购买电影票”按钮
于是 电影票购买者可以在购买电影票应用中购买电影票
AC 2: 电影票购买者可以在购买电影票应用中查看购买记录
假设 电影票购买者已经登录购买电影票应用
当 电影票购买者点击“查看购买记录”按钮
于是 电影票购买者可以在购买电影票应用中查看购买记录
AC 3: 电影票购买者可以在购买电影票应用中查看购买记录
假设 电影票购买者已经登录购买电影票应用
当 电影票购买者点击“查看购买记录”按钮

Training 3: Code Assist

Training results (LoRA) are available at: https://github.com/unit-mesh/unit-minions/releases/tag/v0.0.3

Sample output :

 print(evaluate("Implement the method rmFiles", "FileUtils:", 0.1, 0.75, 40, 4, 512))

public class FileUtils {

    private FileUtils() {
    }

    public static void rmFiles(String path) {
        File dir = new File(path);
        File[] files = dir.listFiles();
        if (files != null) {
            for (File file : files) {
                if (file.isFile()) {
                    file.delete();
                } else if (file.isDirectory()) {
                    FileUtils.rmFiles(file.getAbsolutePath());
                }
            }
        }
    }
}

SQL to code

Conversion script: code/text-to-sql.py

Directly from: datasets/sql/sql-train.jsonl

Sample output :

 evaluate("text to sql", "谁是最美丽的人", 0.1, 0.75, 40, 4, 512)
evaluate("text to sql", "小明今年几岁", 0.1, 0.75, 40, 4, 512)
evaluate("text to sql", "What hand guard system is used with a gas piston commando?", 0.1, 0.75, 40, 4, 512)

Training LoRA on Tsinghua University's ChatGLM

Training:

Directly use: chatglm-tuning.ipynb

Code generation

Test generation

User Story Generation

Example 1:

evaluate("Instruction: 创建用户故事\nInput: 电影网站: 查看订单详情 \nAnswer:")

Instruction: 创建用户故事
Input: 电影网站: 查看订单详情 
Answer:

用户故事:可以查看订单详情
作为一个电影网站的用户
我想查看订单详情
以便于我可以了解订单的详细信息,如订单号、订单状态、订单内容等。

AC 1: 用户可以查看订单详情
假设 用户已经登录
当 用户点击查看订单详情按钮
于是 用户可以查看订单详情,包括订单号、订单状态、订单内容等。

AC 2: 用户可以查看订单详情
假设 用户没有登录
当 用户点击查看订单详情按钮
于是 用户被要求先登录,然后才能查看订单详情,包括订单号、订单状态、订单内容等。

Example 2:

evaluate("Instruction: 创建用户故事\nInput: 团购网站:用户注册\nAnswer:")

Instruction: 创建用户故事
Input: 团购网站:用户注册
Answer:
团购网站的用户注册故事:

作为一个团购网站的用户

我想注册一个团购网站的用户账号

以便于我可以使用团购服务,如购买商品、服务等。

AC 1: 团购网站的用户注册账号
假设 用户输入了注册信息
当 用户提交注册信息
于是 用户可以成功注册一个团购网站的用户账号

AC 2: 团购网站的用户注册账号
假设 用户输入了错误的注册信息
当 用户提交注册信息
于是 用户可以被提示输入正确的注册信息,并重新提交注册信息
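
Both examples pass ChatGLM a single string laid out as Instruction / Input / Answer. A minimal sketch of building that string, assuming the layout visible in the calls above (build_chatglm_prompt is a hypothetical helper):

```python
def build_chatglm_prompt(instruction: str, input_text: str) -> str:
    """Join instruction and input into the single-string prompt used above."""
    return f"Instruction: {instruction}\nInput: {input_text}\nAnswer:"


prompt = build_chatglm_prompt("创建用户故事", "团购网站:用户注册")
print(prompt)
```

Ending the prompt with "Answer:" leaves the model to continue with the story itself.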
