


Model2Vec is a technique to turn any Sentence Transformer into a really small static model, reducing model size by a factor of 15 and making the models up to 500x faster, with a small drop in performance. Our best model is the most performant static embedding model in the world. See our results, or dive in to see how it works.
Install the package with:
```
pip install model2vec
```

This will install the base inference package, which only depends on numpy and a few other minor dependencies. If you want to distill your own models, you can install the distill extra with:
```
pip install model2vec[distill]
```

The easiest way to get started with Model2Vec is to load one of our flagship models from the HuggingFace Hub. These models are pre-trained and ready to use. The following code snippet shows how to load a model and make embeddings:
```python
from model2vec import StaticModel

# Load a model from the HuggingFace hub (in this case the potion-base-8M model)
model = StaticModel.from_pretrained("minishlab/potion-base-8M")

# Make embeddings
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."])

# Make sequences of token embeddings
token_embeddings = model.encode_as_sequence(["It's dangerous to go alone!", "It's a secret to everybody."])
```

That's it. You can use the model to classify texts, to cluster, or to build a RAG system.
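For instance, the embeddings can be compared with cosine similarity for a simple semantic-search step. This is a hedged sketch: the `cosine_similarity` helper and the dummy vectors are ours, not part of the Model2Vec API; with a real model you would pass `model.encode(...)` outputs instead.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between each row of `a` and each row of `b`."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# With a real model: query, docs = model.encode([...]), model.encode([...])
# Here we use tiny dummy vectors so the sketch is self-contained.
query = np.array([[1.0, 0.0]])
docs = np.array([[1.0, 0.0], [0.0, 1.0]])

scores = cosine_similarity(query, docs)  # shape (1, 2): one row per query
best = int(scores.argmax(axis=1)[0])     # index of the most similar doc
```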
Instead of using one of our models, you can also distill your own Model2Vec model from a Sentence Transformer model. The following code snippet shows how to distill a model:
```python
from model2vec.distill import distill

# Distill a Sentence Transformer model, in this case the BAAI/bge-base-en-v1.5 model
m2v_model = distill(model_name="BAAI/bge-base-en-v1.5", pca_dims=256)

# Save the model
m2v_model.save_pretrained("m2v_model")
```

Distillation is really fast and only takes about 30 seconds on CPU. Best of all, distillation requires no training data.
For advanced usage, such as using Model2Vec in the Sentence Transformers library, please refer to the Usage sections.
Models can easily be shared and loaded with `from_pretrained` and `push_to_hub`. Our own models can be found here. Feel free to share your own. Model2Vec creates a small, fast, and powerful model that outperforms all other static embedding models we could find by a large margin on all tasks, while being much faster to create than traditional static embedding models such as GloVe. Like BPEmb, it can create subword embeddings, but with much better performance. Distillation doesn't require any data, just a vocabulary and a model.
The base Model2Vec technique works by passing a vocabulary through a Sentence Transformer model, then reducing the dimensionality of the resulting embeddings using PCA, and finally weighting the embeddings using Zipf weighting. During inference, we simply take the mean of all token embeddings occurring in a sentence.
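As a rough illustration of that pipeline, here is a minimal numpy sketch. Everything in it is a stand-in: the random matrix replaces the Sentence Transformer output, PCA is done via a plain SVD, and the `log(rank + 2)` weight is only a simplified Zipf-style proxy, not the library's exact formula.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the embeddings obtained by passing the vocabulary
# through a Sentence Transformer (vocab_size x hidden_dim).
vocab_size, hidden_dim, pca_dims = 1000, 64, 16
token_embeddings = rng.normal(size=(vocab_size, hidden_dim))

# 1. Reduce dimensionality with PCA (here via SVD of the centered matrix).
centered = token_embeddings - token_embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ vt[:pca_dims].T  # (vocab_size, pca_dims)

# 2. Down-weight frequent tokens with a Zipf-style weight
#    (simplified proxy; tokens are assumed sorted by rank frequency).
ranks = np.arange(vocab_size)
static_embeddings = reduced * np.log(ranks + 2)[:, None]

# 3. At inference, a sentence embedding is just the mean of the static
#    embeddings of the tokens that occur in the sentence.
token_ids = [3, 17, 254]  # hypothetical tokenizer output
sentence_embedding = static_embeddings[token_ids].mean(axis=0)
```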
Our potion models are pre-trained using Tokenlearn, a technique to pre-train Model2Vec distillation models. As part of this process, the embeddings are re-weighted with smooth inverse frequency (SIF) weighting: w = 1e-3 / (1e-3 + proba). Here, proba is the probability of the token in the corpus we used for training. For a more extensive deepdive, please refer to our Model2Vec blog post and our Tokenlearn blog post.
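The SIF formula can be sketched in a couple of lines of numpy (the probabilities here are made-up illustrative values, not corpus statistics):

```python
import numpy as np

# Token probabilities in the training corpus (illustrative values,
# from very frequent to very rare).
proba = np.array([0.1, 0.01, 0.001, 0.0001])

# Smooth inverse frequency (SIF) weights: rare tokens get weights
# close to 1, very frequent tokens are strongly down-weighted.
weights = 1e-3 / (1e-3 + proba)
```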
Inference works as follows. The example shows one of our own models, but you can also load a local one, or another one from the hub.
```python
from model2vec import StaticModel

# Load a model from the Hub. You can optionally pass a token when loading a private model
model = StaticModel.from_pretrained(model_name="minishlab/potion-base-8M", token=None)

# Make embeddings
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."])

# Make sequences of token embeddings
token_embeddings = model.encode_as_sequence(["It's dangerous to go alone!", "It's a secret to everybody."])
```

The following code snippet shows how to use a Model2Vec model in the Sentence Transformers library. This is useful if you want to use the model in a Sentence Transformers pipeline.
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding

# Initialize a StaticEmbedding module
static_embedding = StaticEmbedding.from_model2vec("minishlab/potion-base-8M")
model = SentenceTransformer(modules=[static_embedding])
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."])
```

The following code can be used to distill a model from a Sentence Transformer. As mentioned above, this leads to really small models that might be less performant.
```python
from model2vec.distill import distill

# Distill a Sentence Transformer model
m2v_model = distill(model_name="BAAI/bge-base-en-v1.5", pca_dims=256)

# Save the model
m2v_model.save_pretrained("m2v_model")
```

If you already have a model loaded, or need to load a model in some special way, we also offer an interface to distill models in memory.
```python
from transformers import AutoModel, AutoTokenizer

from model2vec.distill import distill_from_model

# Assuming a loaded model and tokenizer
model_name = "baai/bge-base-en-v1.5"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

m2v_model = distill_from_model(model=model, tokenizer=tokenizer, pca_dims=256)

m2v_model.save_pretrained("m2v_model")
```

The following code snippet shows how to distill a model using the Sentence Transformers library. This is useful if you want to use the model in a Sentence Transformers pipeline.
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding

static_embedding = StaticEmbedding.from_distillation("BAAI/bge-base-en-v1.5", device="cpu", pca_dims=256)
model = SentenceTransformer(modules=[static_embedding])
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."])
```

If you pass a vocabulary, you get a set of static word embeddings, along with a custom tokenizer for exactly that vocabulary. This is comparable to how you would use GloVe or traditional word2vec, but doesn't actually require a corpus or data.
```python
from model2vec.distill import distill

# Load a vocabulary as a list of strings
vocabulary = ["word1", "word2", "word3"]

# Distill a Sentence Transformer model with the custom vocabulary
m2v_model = distill(model_name="BAAI/bge-base-en-v1.5", vocabulary=vocabulary)

# Save the model
m2v_model.save_pretrained("m2v_model")

# Or push it to the hub
m2v_model.push_to_hub("my_organization/my_model", token="<it's a secret to everybody>")
```

By default, this will distill a model with a subword tokenizer, combining the model's (subword) vocabulary with the new vocabulary. If you want a word-level tokenizer instead (with only the passed vocabulary), the use_subword parameter can be set to False, e.g.:
```python
m2v_model = distill(model_name=model_name, vocabulary=vocabulary, use_subword=False)
```

Important note: we assume the passed vocabulary is sorted by rank frequency. That is, we don't care about the actual word frequencies, but we do assume that the most frequent word comes first and the least frequent word comes last. If you're not sure whether this is the case, set apply_zipf to False. This disables the weighting, but will also make performance a little worse.
Our models can be evaluated using our evaluation package. Install the evaluation package with:
```
pip install git+https://github.com/MinishLab/evaluation.git@main
```

The following code snippet shows how to evaluate a Model2Vec model:
```python
from model2vec import StaticModel

from evaluation import CustomMTEB, get_tasks, parse_mteb_results, make_leaderboard, summarize_results
from mteb import ModelMeta

# Get all available tasks
tasks = get_tasks()
# Define the CustomMTEB object with the specified tasks
evaluation = CustomMTEB(tasks=tasks)

# Load the model
model_name = "m2v_model"
model = StaticModel.from_pretrained(model_name)

# Optionally, add model metadata in MTEB format
model.mteb_model_meta = ModelMeta(
    name=model_name, revision="no_revision_available", release_date=None, languages=None
)

# Run the evaluation
results = evaluation.run(model, eval_splits=["test"], output_folder="results")

# Parse the results and summarize them
parsed_results = parse_mteb_results(mteb_results=results, model_name=model_name)
task_scores = summarize_results(parsed_results)

# Print the results in a leaderboard format
print(make_leaderboard(task_scores))
```

Model2Vec can be used directly in Sentence Transformers using the StaticEmbedding module.
The following code snippet shows how to load a Model2Vec model into a Sentence Transformers model:
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding

# Initialize a StaticEmbedding module
static_embedding = StaticEmbedding.from_model2vec("minishlab/potion-base-8M")
model = SentenceTransformer(modules=[static_embedding])
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."])
```

The following code snippet shows how to distill a model directly into a Sentence Transformers model:
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding

static_embedding = StaticEmbedding.from_distillation("BAAI/bge-base-en-v1.5", device="cpu", pca_dims=256)
model = SentenceTransformer(modules=[static_embedding])
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."])
```

For more documentation, please refer to the Sentence Transformers documentation.
Model2Vec can be used in txtai for text embeddings, nearest-neighbors search, and any of the other functionalities that txtai offers. The following code snippet shows how to use Model2Vec in txtai:
```python
from txtai import Embeddings

# Load a model2vec model
embeddings = Embeddings(path="minishlab/potion-base-8M", method="model2vec", backend="numpy")

# Create some example texts
texts = ["Enduring Stew", "Hearty Elixir", "Mighty Mushroom Risotto", "Spicy Meat Skewer", "Chilly Fruit Salad"]

# Create embeddings for downstream tasks
vectors = embeddings.batchtransform(texts)

# Or create a nearest-neighbors index and search it
embeddings.index(texts)
result = embeddings.search("Risotto", 1)
```

Model2Vec is the default model for semantic chunking in Chonkie. To use Model2Vec for semantic chunking in Chonkie, simply install Chonkie with `pip install chonkie[semantic]` and use one of the potion models in the SemanticChunker class. The following code snippet shows how to use Model2Vec in Chonkie:
```python
from chonkie import SDPMChunker

# Create some example text to chunk
text = "It's dangerous to go alone! Take this."

# Initialize the SemanticChunker with a potion model
chunker = SDPMChunker(
    embedding_model="minishlab/potion-base-8M",
    similarity_threshold=0.3
)

# Chunk the text
chunks = chunker.chunk(text)
```

To use a Model2Vec model in transformers.js, the following code snippet can be used as a starting point:
```javascript
import { AutoModel, AutoTokenizer, Tensor } from '@huggingface/transformers';

const modelName = 'minishlab/potion-base-8M';

const modelConfig = {
    config: { model_type: 'model2vec' },
    dtype: 'fp32',
    revision: 'refs/pr/1'
};
const tokenizerConfig = {
    revision: 'refs/pr/2'
};

const model = await AutoModel.from_pretrained(modelName, modelConfig);
const tokenizer = await AutoTokenizer.from_pretrained(modelName, tokenizerConfig);

const texts = ['hello', 'hello world'];
const { input_ids } = await tokenizer(texts, { add_special_tokens: false, return_tensor: false });

const cumsum = arr => arr.reduce((acc, num, i) => [...acc, num + (acc[i - 1] || 0)], []);
const offsets = [0, ...cumsum(input_ids.slice(0, -1).map(x => x.length))];

const flattened_input_ids = input_ids.flat();
const modelInputs = {
    input_ids: new Tensor('int64', flattened_input_ids, [flattened_input_ids.length]),
    offsets: new Tensor('int64', offsets, [offsets.length])
};

const { embeddings } = await model(modelInputs);
console.log(embeddings.tolist()); // output matches python version
```

Note that this requires the Model2Vec model to have a model.onnx file and several required tokenizer files. To generate these, the following snippet can be used:
```
python scripts/export_to_onnx.py --model_path <path-to-a-model2vec-model> --save_path "<path-to-save-the-onnx-model>"
```

We provide a number of models that can be used out of the box. These models are available on the HuggingFace hub and can be loaded using the from_pretrained method. The models are listed below.
| Model | Language | Vocabulary | Sentence Transformer | Tokenizer Type | Params | Tokenlearn |
|---|---|---|---|---|---|---|
| potion-base-8M | English | Output | bge-base-en-v1.5 | Subword | 7.5M | ✅ |
| potion-base-4M | English | Output | bge-base-en-v1.5 | Subword | 3.7M | ✅ |
| potion-base-2M | English | Output | bge-base-en-v1.5 | Subword | 1.8M | ✅ |
| M2V_multilingual_output | Multilingual | Output | LaBSE | Subword | 471M | ❌ |
We have performed extensive experiments to evaluate the performance of Model2Vec models. The results are documented in the results folder.
This project is licensed under the MIT License.
If you use Model2Vec in your research, please cite the following:
```bibtex
@software{minishlab2024model2vec,
  authors = {Stephan Tulkens, Thomas van Dongen},
  title = {Model2Vec: The Fastest State-of-the-Art Static Embeddings in the World},
  year = {2024},
  url = {https://github.com/MinishLab/model2vec},
}
```