| Section | Description |
|---|---|
| Open Source List | Open source projects released in this repository |
| Model Introduction | Brief introduction to the architecture and training process of the Huozi models |
| Model Download | Download links for the Huozi models |
| Model Inference | Inference examples for the Huozi models, including usage of frameworks such as vLLM, llama.cpp, and Ollama |
| Model Performance | Performance of the Huozi models on mainstream evaluation tasks |
| Generation Samples | Examples of actual generation results from the Huozi models |

Large language models (LLMs) have made significant progress in natural language processing and demonstrated strong potential across a wide range of application scenarios. The technology has attracted widespread attention from academia and become a hot topic in industry. Against this background, the Research Center for Social Computing and Information Retrieval at Harbin Institute of Technology (HIT-SCIR) has released its latest result, Huozi 3.5 (活字 3.5), aiming to provide more possibilities and choices for research and practical applications of natural language processing.
Huozi 3.5 is obtained by further enhancing Huozi 3.0 and Chinese-Mixtral-8x7B. It supports a 32K context length, inherits the strong general capabilities of Huozi 3.0, and improves performance in Chinese and English knowledge, mathematical reasoning, code generation, instruction following, content safety, and more.
Important
The Huozi series models may still generate misleading replies containing factual errors, or harmful content containing bias or discrimination. Please use the generated content with care, and do not spread harmful generated content on the Internet.
Please see here for the documentation of Huozi 1.0 and Huozi 2.0, and here for the documentation of Huozi 3.0 and Chinese MT-Bench.
Huozi 3.5 is a sparse mixture-of-experts (SMoE) model: each expert layer contains 8 FFN experts, and each forward pass sparsely activates the top-2 experts per token. Huozi 3.5 has 46.7B parameters in total, but thanks to this sparse activation only about 13B parameters are active during inference, which effectively improves computational efficiency and throughput.
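As a rough illustration of this sparse activation (a minimal sketch, not the actual Mixtral/Huozi implementation; the layer sizes below are arbitrary), a top-2 mixture-of-experts layer can be written as:

```python
# Minimal top-2 MoE routing sketch (illustrative only; sizes are arbitrary,
# this is not the actual Mixtral/Huozi implementation).
import torch
import torch.nn.functional as F
from torch import nn

class Top2MoELayer(nn.Module):
    def __init__(self, hidden=64, ffn=256, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, ffn), nn.SiLU(), nn.Linear(ffn, hidden))
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (num_tokens, hidden)
        weights, indices = self.gate(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)  # normalize over the selected experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, k] == e  # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out

print(Top2MoELayer()(torch.randn(4, 64)).shape)  # torch.Size([4, 64])
```

Because only 2 of the 8 expert FFNs run for each token, the number of parameters touched per forward pass is far smaller than the model's total parameter count.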

Huozi 3.5 was trained in multiple stages, as shown in the figure below:

The released model and its intermediate checkpoints can be downloaded from the links below:
| Model name | File size | Download link | Notes |
|---|---|---|---|
| huozi3.5 | 88GB | HuggingFace / ModelScope | Huozi 3.5 full model |
| huozi3.5-ckpt-1 | 88GB | HuggingFace / ModelScope | Huozi 3.5 intermediate checkpoint 1 |
| huozi3.5-ckpt-2 | 88GB | HuggingFace / ModelScope | Huozi 3.5 intermediate checkpoint 2 |
| huozi3.5-ckpt-3 | 88GB | HuggingFace / ModelScope | Huozi 3.5 intermediate checkpoint 3 |
If you want to fine-tune Huozi 3.5 or Chinese-Mixtral-8x7B, please refer to the training code here.
Huozi 3.5 uses a ChatML-format prompt template, structured as follows:
<|beginofutterance|>系统
{system prompt}<|endofutterance|>
<|beginofutterance|>用户
{input}<|endofutterance|>
<|beginofutterance|>助手
{output}<|endofutterance|>
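The role names 系统 (system), 用户 (user), and 助手 (assistant) are literal and must appear exactly as shown. Purely for illustration, a small helper that assembles this template from a list of messages could look like the following sketch (the function name and message structure are our own, not part of the released code):

```python
# Illustrative helper for assembling the Huozi 3.5 prompt template;
# only the special tokens and role names come from the documentation above.
def build_prompt(messages, add_generation_prompt=True):
    parts = []
    for m in messages:  # each m: {"role": "系统" / "用户" / "助手", "content": "..."}
        parts.append(f"<|beginofutterance|>{m['role']}\n{m['content']}<|endofutterance|>\n")
    if add_generation_prompt:
        parts.append("<|beginofutterance|>助手\n")  # leave the assistant turn open
    return "".join(parts)

print(build_prompt([
    {"role": "系统", "content": "你是一个智能助手"},
    {"role": "用户", "content": "请你用python写一段快速排序的代码"},
]))
```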
Example code for running inference with Huozi 3.5:
# quickstart.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HIT-SCIR/huozi3.5"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

text = """<|beginofutterance|>系统
你是一个智能助手<|endofutterance|>
<|beginofutterance|>用户
请你用python写一段快速排序的代码<|endofutterance|>
<|beginofutterance|>助手
"""

inputs = tokenizer(text, return_tensors="pt").to(0)
outputs = model.generate(
    **inputs,
    eos_token_id=57001,  # id of the <|endofutterance|> token
    temperature=0.8,
    top_p=0.9,
    max_new_tokens=2048,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))

Huozi 3.5 is compatible with the entire Mixtral model ecosystem, including Transformers, vLLM, llama.cpp, Ollama, Text generation web UI, and other frameworks.
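In the quickstart above, eos_token_id=57001 is hard-coded. The id can also be looked up from the tokenizer; a minimal sketch, assuming <|endofutterance|> is registered in the tokenizer's vocabulary:

```python
# Look up the end-of-utterance token id instead of hard-coding it
# (assumes <|endofutterance|> is a registered token in the Huozi 3.5 tokenizer).
eou_id = tokenizer.convert_tokens_to_ids("<|endofutterance|>")
print(eou_id)  # expected to be 57001 for the released tokenizer
```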
If you encounter network issues while downloading the model, you can use the checkpoints we provide on ModelScope.
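For example, the model files can be fetched from ModelScope first and then loaded from the local path. A minimal sketch, assuming the standard modelscope snapshot_download helper and that the ModelScope model id matches the HuggingFace one (the cache directory is illustrative):

```python
# Download the checkpoint from ModelScope, then load it from the local directory
# (model id and cache_dir are assumptions; adjust them to your environment).
from modelscope import snapshot_download

local_dir = snapshot_download("HIT-SCIR/huozi3.5", cache_dir="./models")
print(local_dir)  # pass this path to AutoModelForCausalLM.from_pretrained(...)
```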
Transformers supports attaching a chat template to the tokenizer and also supports streaming generation. Sample code:
# example/transformers-stream/stream.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_id = "HIT-SCIR/huozi3.5"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.chat_template = """{% for message in messages %}{{'<|beginofutterance|>' + message['role'] + '\n' + message['content']}}{% if (loop.last and add_generation_prompt) or not loop.last %}{{ '<|endofutterance|>' + '\n' }}{% endif %}{% endfor %}
{% if add_generation_prompt and messages[-1]['role'] != '助手' %}{{ '<|beginofutterance|>助手\n' }}{% endif %}"""

chat = [
    {"role": "系统", "content": "你是一个智能助手"},
    {"role": "用户", "content": "请你用python写一段快速排序的代码"},
]
inputs = tokenizer.apply_chat_template(
    chat,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(0)

stream_output = model.generate(
    inputs,
    streamer=TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True),
    eos_token_id=57001,  # id of the <|endofutterance|> token
    temperature=0.8,
    top_p=0.9,
    max_new_tokens=2048,
)

ModelScope's interface is very similar to that of Transformers; to use it, simply replace transformers with modelscope:
# example/modelscope-generate/generate.py
import torch
- from transformers import AutoModelForCausalLM, AutoTokenizer
+ from modelscope import AutoTokenizer, AutoModelForCausalLM
model_id = "HIT-SCIR/huozi3.5"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
attn_implementation="flash_attention_2",
torch_dtype=torch.bfloat16,
device_map="auto",
)
text = """<|beginofutterance|>系统
你是一个智能助手<|endofutterance|>
<|beginofutterance|>用户
请你用python写一段快速排序的代码<|endofutterance|>
<|beginofutterance|>助手
"""
inputs = tokenizer(text, return_tensors="pt").to(0)
outputs = model.generate(
**inputs,
eos_token_id=57001,
temperature=0.8,
top_p=0.9,
max_new_tokens=2048,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))

Huozi 3.5 supports inference acceleration with vLLM. Sample code:
# example/vllm-generate/generate.py
from vllm import LLM, SamplingParams

prompts = [
    """<|beginofutterance|>系统
你是一个智能助手<|endofutterance|>
<|beginofutterance|>用户
请你用python写一段快速排序的代码<|endofutterance|>
<|beginofutterance|>助手
""",
]

sampling_params = SamplingParams(
    temperature=0.8, top_p=0.95, stop_token_ids=[57001], max_tokens=2048
)
llm = LLM(
    model="HIT-SCIR/huozi3.5",
    tensor_parallel_size=4,
)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(generated_text)

Huozi 3.5 can also be deployed as a service that follows the OpenAI API protocol, which allows it to be called directly through the OpenAI API.
Environment preparation:
$ pip install vllm openai

Start the service:
$ python -m vllm.entrypoints.openai.api_server --model /path/to/huozi3.5/checkpoint --served-model-name huozi --chat-template template.jinja --tensor-parallel-size 8 --response-role 助手 --max-model-len 2048

Send requests with the OpenAI API:
# example/openai-api/openai-client.py
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="huozi",
    messages=[
        {"role": "系统", "content": "你是一个智能助手"},
        {"role": "用户", "content": "请你用python写一段快速排序的代码"},
    ],
    extra_body={"stop_token_ids": [57001]},
)
print("Chat response:", chat_response.choices[0].message.content)

Below is a sample that combines the OpenAI API with Gradio and streaming generation:
# example/openai-api/openai-client-gradio.py
from openai import OpenAI
import gradio as gr

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

def predict(message, history):
    history_openai_format = [
        {"role": "系统", "content": "你是一个智能助手"},
    ]
    for human, assistant in history:
        history_openai_format.append({"role": "用户", "content": human})
        history_openai_format.append({"role": "助手", "content": assistant})
    history_openai_format.append({"role": "用户", "content": message})

    models = client.models.list()
    stream = client.chat.completions.create(
        model=models.data[0].id,
        messages=history_openai_format,
        temperature=0.8,
        stream=True,
        extra_body={"repetition_penalty": 1, "stop_token_ids": [57001]},
    )

    partial_message = ""
    for chunk in stream:
        partial_message += chunk.choices[0].delta.content or ""
        yield partial_message

gr.ChatInterface(predict).queue().launch()

The GGUF format is designed for fast model loading and saving. It was introduced by the llama.cpp team and is used by frameworks such as llama.cpp and Ollama. You can manually convert Huozi 3.5 from the HuggingFace format to the GGUF format.
First, download the llama.cpp source code. We provide a llama.cpp submodule in this repository; this version of llama.cpp has been tested and runs inference successfully:
$ git clone --recurse-submodules https://github.com/HIT-SCIR/huozi
$ cd examples/llama.cpp

You can also download the latest llama.cpp source code:
$ git clone https://github.com/ggerganov/llama.cpp.git
$ cd llama.cpp

Next, compile llama.cpp. The compilation command differs slightly depending on your hardware platform:
$ make                   # CPU-only inference
$ make LLAMA_CUBLAS=1    # GPU inference
$ LLAMA_METAL=1 make     # Apple Silicon (not yet tested by us)

The following commands need to be run from the llama.cpp/ directory:
# Convert to GGUF format
$ python convert.py --outfile /path/to/huozi-gguf/huozi3.5.gguf /path/to/huozi3.5
# Apply q4_0 quantization to the GGUF model
$ quantize /path/to/huozi-gguf/huozi3.5.gguf /path/to/huozi-gguf/huozi3.5-q4_0.gguf q4_0

The following command also needs to be run from the llama.cpp/ directory:
$ main -m /path/to/huozi-gguf/huozi3.5-q4_0.gguf --color --interactive-first -c 2048 -t 6 --temp 0.2 --repeat_penalty 1.1 -ngl 999 --in-prefix "<|beginofutterance|>用户\n" --in-suffix "<|endofutterance|>\n<|beginofutterance|>助手" -r "<|endofutterance|>"

The -ngl parameter specifies how many layers are offloaded to the GPU. Reducing this value relieves GPU memory pressure. In our tests, the q2_k-quantized model with 16 layers offloaded uses as little as 9.6GB of GPU memory, which makes it possible to run the model on consumer GPUs:
$ main -m /path/to/huozi-gguf/huozi3.5-q2_k.gguf --color --interactive-first -c 2048 -t 6 --temp 0.2 --repeat_penalty 1.1 -ngl 16 --in-prefix "<|beginofutterance|>用户\n" --in-suffix "<|endofutterance|>\n<|beginofutterance|>助手" -r "<|endofutterance|>"

For more parameters of main, refer to the official llama.cpp documentation.
To run inference with the Ollama framework, refer to the instructions in Ollama's README.
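As a rough illustration only (not taken from this repository or from Ollama's documentation), an Ollama Modelfile wrapping the converted GGUF model might look like the sketch below; the file path, template, and parameter values are assumptions to be adapted to your setup:

```
# Modelfile (illustrative sketch; path, template, and parameters are assumptions)
FROM ./huozi3.5-q4_0.gguf

TEMPLATE """<|beginofutterance|>系统
{{ .System }}<|endofutterance|>
<|beginofutterance|>用户
{{ .Prompt }}<|endofutterance|>
<|beginofutterance|>助手
"""

PARAMETER stop "<|endofutterance|>"
```

The model could then be registered and run with `ollama create huozi3.5 -f Modelfile` followed by `ollama run huozi3.5`.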

For a comprehensive evaluation of the model's capabilities, we evaluated Huozi 3.5 on the following evaluation datasets:
Huozi 3.5 activates only 13B parameters during inference. The table below compares Huozi 3.5 with other Chinese models at the 13B scale and with earlier Huozi versions on each evaluation dataset:

We use 5-shot evaluation for C-Eval, CMMLU, and MMLU, 4-shot for GSM8K, and 0-shot for HellaSwag and HumanEval; HumanEval is reported with the pass@1 metric. All evaluations use greedy decoding.
We use OpenCompass as the evaluation framework, at commit hash 4c87e77. The evaluation code is available here.
In the performance evaluation of Huozi 3.0, we had incorrectly used the base-model evaluation method for HumanEval; the corrected results have been updated in the table above.
According to the results above, Huozi 3.5 achieves a fairly consistent performance improvement over Huozi 3.0, with stronger Chinese and English knowledge, mathematical reasoning, code generation, Chinese instruction following, and Chinese content safety.
The following are generation samples of Huozi 3.5 on the MT-Bench-zh evaluation set:






The use of the source code in this repository is governed by the Apache 2.0 open source license.
The Huozi models are available for commercial use. If you use a Huozi model or its derivatives for commercial purposes, please contact the licensor to register and apply for written authorization. Contact email: [email protected].
@misc{huozi,
  author = {Huozi-Team},
  title = {Huozi: Leveraging Large Language Models for Enhanced Open-Domain Chatting},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/HIT-SCIR/huozi}}
}