The goal of the uniem project is to create the best general-purpose text embedding model for Chinese.
This project mainly includes the model training, fine-tuning, and evaluation code. Models and datasets will be open-sourced to the HuggingFace community.
FineTuner also supports fine-tuning sentence_transformers, text2vec, and other models, as well as SGPT-style training of GPT-series models and Prefix Tuning. Note that the FineTuner initialization API has changed slightly and is not compatible with 0.2.0.

FineTuner natively supports model fine-tuning: a few lines of code, instant adaptation!

The M3E models outperform openai text-embedding-ada-002 on Chinese text classification and text retrieval; for details, please refer to the M3E models README. The M3E series models are fully compatible with sentence-transformers: you can seamlessly use M3E models in any project that supports sentence-transformers, such as chroma, guidance, or semantic-kernel, simply by replacing the model name.
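As an illustration of this drop-in compatibility, the sketch below wires M3E into chroma just by naming the model. The collection name and documents are made up for the example, and the exact chromadb API may differ across versions:

```python
import chromadb
from chromadb.utils import embedding_functions

# Use M3E as the embedding function by giving chroma the sentence-transformers model name.
m3e_embed = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="moka-ai/m3e-base")

client = chromadb.Client()
collection = client.create_collection(name="demo", embedding_function=m3e_embed)

# Add a couple of documents and run a semantic query against them.
collection.add(documents=["你好,世界!", "Hello World!"], ids=["1", "2"])
results = collection.query(query_texts=["世界你好"], n_results=1)
print(results)
```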
Install
pip install sentence-transformers

Usage
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("moka-ai/m3e-base")
embeddings = model.encode(['Hello World!', '你好,世界!'])
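The returned embeddings can then be used directly, for example to score sentence similarity with sentence-transformers' built-in cosine similarity helper (the two sentences below are just illustrative paraphrases):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("moka-ai/m3e-base")

# Two roughly equivalent Chinese sentences should receive a high cosine similarity.
embeddings = model.encode(["今天天气真好", "今天是个晴朗的日子"], convert_to_tensor=True)
print(util.cos_sim(embeddings[0], embeddings[1]))
```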
uniem provides a very easy-to-use fine-tuning interface: a few lines of code, instant adaptation!

from datasets import load_dataset
from uniem.finetuner import FineTuner

dataset = load_dataset('shibing624/nli_zh', 'STS-B')
# Specify m3e-small as the model to fine-tune
finetuner = FineTuner.from_pretrained('moka-ai/m3e-small', dataset=dataset)
finetuner.run(epochs=3)

For details on fine-tuning models, please refer to the uniem fine-tuning tutorial.
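As noted above, FineTuner also works with sentence_transformers and text2vec checkpoints. The sketch below assumes from_pretrained accepts such a checkpoint ID directly, using shibing624/text2vec-base-chinese purely as an illustrative example:

```python
from datasets import load_dataset
from uniem.finetuner import FineTuner

dataset = load_dataset('shibing624/nli_zh', 'STS-B')
# Illustrative: fine-tune a text2vec checkpoint instead of an M3E model.
finetuner = FineTuner.from_pretrained('shibing624/text2vec-base-chinese', dataset=dataset)
finetuner.run(epochs=3)
```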
If you want to run locally, you need to run the following commands to prepare the environment:

conda create -n uniem python=3.10
pip install uniem

Evaluation

Chinese embedding models lack a unified evaluation standard, so we referred to MTEB and constructed the Chinese evaluation standard MTEB-zh. At present, 8 models have been evaluated side by side on a range of datasets. For detailed evaluation methods and code, please refer to MTEB-zh.
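For reference, an MTEB-style evaluation run generally looks like the sketch below. The task import path and class name are assumptions for illustration; check the MTEB-zh code for the actual entry point and task list:

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Assumed import: the MTEB-zh task classes are defined in this repository;
# the module path and task name below are illustrative only.
from mteb_zh.tasks import TNews

model = SentenceTransformer("moka-ai/m3e-base")
evaluation = MTEB(tasks=[TNews()])
evaluation.run(model, output_folder="results/m3e-base")
```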
Text Classification

| Dataset | text2vec | m3e-small | m3e-base | m3e-large-0619 | openai | DMetaSoul | uer | erlangshen |
|---|---|---|---|---|---|---|---|---|
| TNews | 0.43 | 0.4443 | 0.4827 | 0.4866 | 0.4594 | 0.3084 | 0.3539 | 0.4361 |
| JDIphone | 0.8214 | 0.8293 | 0.8533 | 0.8692 | 0.746 | 0.7972 | 0.8283 | 0.8356 |
| GubaEastmony | 0.7472 | 0.712 | 0.7621 | 0.7663 | 0.7574 | 0.735 | 0.7534 | 0.7787 |
| TYQSentiment | 0.6099 | 0.6596 | 0.7188 | 0.7247 | 0.68 | 0.6437 | 0.6662 | 0.6444 |
| StockComSentiment | 0.4307 | 0.4291 | 0.4363 | 0.4475 | 0.4819 | 0.4309 | 0.4555 | 0.4482 |
| IFlyTek | 0.414 | 0.4263 | 0.4409 | 0.4445 | 0.4486 | 0.3969 | 0.3762 | 0.4241 |
| Average | 0.5755 | 0.5834 | 0.6157 | 0.6231 | 0.5956 | 0.552016667 | 0.57225 | 0.594516667 |
Text Retrieval

| Metric | text2vec | openai-ada-002 | m3e-small | m3e-base | m3e-large-0619 | DMetaSoul | uer | erlangshen |
|---|---|---|---|---|---|---|---|---|
| map@1 | 0.4684 | 0.6133 | 0.5574 | 0.626 | 0.6256 | 0.25203 | 0.08647 | 0.25394 |
| map@10 | 0.5877 | 0.7423 | 0.6878 | 0.7656 | 0.7627 | 0.33312 | 0.13008 | 0.34714 |
| mrr@1 | 0.5345 | 0.6931 | 0.6324 | 0.7047 | 0.7063 | 0.29258 | 0.10067 | 0.29447 |
| mrr@10 | 0.6217 | 0.7668 | 0.712 | 0.7841 | 0.7827 | 0.36287 | 0.14516 | 0.3751 |
| ndcg@1 | 0.5207 | 0.6764 | 0.6159 | 0.6881 | 0.6884 | 0.28358 | 0.09748 | 0.28578 |
| ndcg@10 | 0.6346 | 0.7786 | 0.7262 | 0.8004 | 0.7974 | 0.37468 | 0.15783 | 0.39329 |
If you want to add evaluation datasets or models to MTEB-zh, please feel free to open an issue or submit a PR. I will support you as soon as possible and look forward to your contribution!
uniem is licensed under the Apache-2.0 License. See the LICENSE file for more details.
Please cite this model using the following format:
@software{Moka Massive Mixed Embedding,
  author = {Wang Yuxin, Sun Qingxuan, He Sicheng},
  title = {M3E: Moka Massive Mixed Embedding Model},
  year = {2023}
}