The goal of the uniem project is to create the best general-purpose text embedding model for Chinese.
This project mainly includes the model training, fine-tuning, and evaluation code. Models and datasets will be open-sourced to the HuggingFace community.
FineTuner also supports fine-tuning sentence_transformers, text2vec, and other models, as well as SGPT-style training of GPT-series models and Prefix Tuning. Note that the FineTuner initialization API has changed slightly and is not compatible with 0.2.0.

FineTuner natively supports model fine-tuning: a few lines of code, instant adaptation!

The M3E models outperform openai text-embedding-ada-002 on Chinese text classification and text retrieval; for details, please refer to the M3E models README. The M3E series models are fully compatible with sentence-transformers: you can seamlessly use M3E models in any project that supports sentence-transformers, such as chroma, guidance, or semantic-kernel, simply by replacing the model name.
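As an illustration of this drop-in compatibility, the sketch below wires M3E into chroma just by naming the model. The collection name and documents are made up for the example, and the exact chromadb API may differ across versions:

```python
import chromadb
from chromadb.utils import embedding_functions

# Use M3E as the embedding function by giving chroma the sentence-transformers model name.
m3e_embed = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="moka-ai/m3e-base")

client = chromadb.Client()
collection = client.create_collection(name="demo", embedding_function=m3e_embed)

# Add a couple of documents and run a semantic query against them.
collection.add(documents=["你好,世界!", "Hello World!"], ids=["1", "2"])
results = collection.query(query_texts=["世界你好"], n_results=1)
print(results)
```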
Install
pip install sentence-transformers

Usage
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("moka-ai/m3e-base")
embeddings = model.encode(['Hello World!', '你好,世界!'])
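The returned embeddings can then be used directly, for example to score sentence similarity with sentence-transformers' built-in cosine similarity helper (the two sentences below are just illustrative paraphrases):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("moka-ai/m3e-base")

# Two roughly equivalent Chinese sentences should receive a high cosine similarity.
embeddings = model.encode(["今天天气真好", "今天是个晴朗的日子"], convert_to_tensor=True)
print(util.cos_sim(embeddings[0], embeddings[1]))
```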
uniem provides a very easy-to-use fine-tuning interface: a few lines of code, instant adaptation!

from datasets import load_dataset
from uniem.finetuner import FineTuner

dataset = load_dataset('shibing624/nli_zh', 'STS-B')
# Specify m3e-small as the model to fine-tune
finetuner = FineTuner.from_pretrained('moka-ai/m3e-small', dataset=dataset)
finetuner.run(epochs=3)

For details on fine-tuning models, please refer to the uniem fine-tuning tutorial.
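As noted above, FineTuner also works with sentence_transformers and text2vec checkpoints. The sketch below assumes from_pretrained accepts such a checkpoint ID directly, using shibing624/text2vec-base-chinese purely as an illustrative example:

```python
from datasets import load_dataset
from uniem.finetuner import FineTuner

dataset = load_dataset('shibing624/nli_zh', 'STS-B')
# Illustrative: fine-tune a text2vec checkpoint instead of an M3E model.
finetuner = FineTuner.from_pretrained('shibing624/text2vec-base-chinese', dataset=dataset)
finetuner.run(epochs=3)
```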
If you want to run locally, you need to run the following commands to prepare the environment:

conda create -n uniem python=3.10
pip install uniem

Evaluation

Chinese embedding models lack a unified evaluation standard, so we referred to MTEB and constructed the Chinese evaluation standard MTEB-zh. At present, 8 models have been evaluated side by side on a range of datasets. For detailed evaluation methods and code, please refer to MTEB-zh.
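For reference, an MTEB-style evaluation run generally looks like the sketch below. The task import path and class name are assumptions for illustration; check the MTEB-zh code for the actual entry point and task list:

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Assumed import: the MTEB-zh task classes are defined in this repository;
# the module path and task name below are illustrative only.
from mteb_zh.tasks import TNews

model = SentenceTransformer("moka-ai/m3e-base")
evaluation = MTEB(tasks=[TNews()])
evaluation.run(model, output_folder="results/m3e-base")
```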
Text Classification

| Dataset | text2vec | m3e-small | m3e-base | m3e-large-0619 | openai | DMetaSoul | uer | erlangshen |
|---|---|---|---|---|---|---|---|---|
| TNews | 0.43 | 0.4443 | 0.4827 | 0.4866 | 0.4594 | 0.3084 | 0.3539 | 0.4361 |
| JDIphone | 0.8214 | 0.8293 | 0.8533 | 0.8692 | 0.746 | 0.7972 | 0.8283 | 0.8356 |
| GubaEastmony | 0.7472 | 0.712 | 0.7621 | 0.7663 | 0.7574 | 0.735 | 0.7534 | 0.7787 |
| TYQSentiment | 0.6099 | 0.6596 | 0.7188 | 0.7247 | 0.68 | 0.6437 | 0.6662 | 0.6444 |
| StockComSentiment | 0.4307 | 0.4291 | 0.4363 | 0.4475 | 0.4819 | 0.4309 | 0.4555 | 0.4482 |
| IFlyTek | 0.414 | 0.4263 | 0.4409 | 0.4445 | 0.4486 | 0.3969 | 0.3762 | 0.4241 |
| Average | 0.5755 | 0.5834 | 0.6157 | 0.6231 | 0.5956 | 0.552016667 | 0.57225 | 0.594516667 |
Text Retrieval

| Metric | text2vec | openai-ada-002 | m3e-small | m3e-base | m3e-large-0619 | DMetaSoul | uer | erlangshen |
|---|---|---|---|---|---|---|---|---|
| map@1 | 0.4684 | 0.6133 | 0.5574 | 0.626 | 0.6256 | 0.25203 | 0.08647 | 0.25394 |
| map@10 | 0.5877 | 0.7423 | 0.6878 | 0.7656 | 0.7627 | 0.33312 | 0.13008 | 0.34714 |
| mrr@1 | 0.5345 | 0.6931 | 0.6324 | 0.7047 | 0.7063 | 0.29258 | 0.10067 | 0.29447 |
| mrr@10 | 0.6217 | 0.7668 | 0.712 | 0.7841 | 0.7827 | 0.36287 | 0.14516 | 0.3751 |
| ndcg@1 | 0.5207 | 0.6764 | 0.6159 | 0.6881 | 0.6884 | 0.28358 | 0.09748 | 0.28578 |
| ndcg@10 | 0.6346 | 0.7786 | 0.7262 | 0.8004 | 0.7974 | 0.37468 | 0.15783 | 0.39329 |
If you want to add evaluation datasets or models to MTEB-zh, please feel free to open an issue or submit a PR. I will support you as soon as possible and look forward to your contribution!
uniem is licensed under the Apache-2.0 License. See the LICENSE file for more details.
Please cite this model using the following format:
@software{Moka Massive Mixed Embedding,
  author = {Wang Yuxin, Sun Qingxuan, He Sicheng},
  title = {M3E: Moka Massive Mixed Embedding Model},
  year = {2023}
}