VektorDB
1.0.0
A minimal vector database for educational purposes.
Tagline: keep it simple, they will learn...
Want to learn more? Check out the References section below.

In this example, we will use the Grade School Math 8K (GSM8K) dataset
from datasets import load_dataset

# Number of samples we want to process
N_SAMPLES = 100

# Load dataset
# https://huggingface.co/datasets/openai/gsm8k
ds = load_dataset("openai/gsm8k", "main", split="train")[:N_SAMPLES]
questions = ds['question']
answers = ds['answer']

which contains "high quality linguistically diverse grade school math word problems" in the question-answer format shown below
### Question
Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May.
How many clips did Natalia sell altogether in April and May?
### Answer
Natalia sold 48/2 = <<48/2=24>>24 clips in May.
Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May. #### 72
Our goal is to turn these question-answer pairs into embeddings, store them in VektorDB, and perform some operations on them.
An embedding is simply a numerical representation of a piece of information, usually in the form of a vector. You can turn any type of data into an embedding (e.g., text, images, audio), and it will preserve the meaning of the original data. If you want to learn more about embeddings, check out Mapping Embeddings: from meaning to vectors and back.
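As a quick toy illustration of that idea (the numbers below are made up for the example, not real model output), texts with similar meanings should map to nearby vectors:

import numpy as np

# Hypothetical 4-dimensional embeddings (real embedding models
# produce hundreds or thousands of dimensions)
cat = np.array([0.9, 0.1, 0.0, 0.2])
kitten = np.array([0.8, 0.2, 0.1, 0.3])
car = np.array([0.0, 0.9, 0.8, 0.1])

# Similar meaning -> small distance; different meaning -> large distance
print(np.linalg.norm(cat - kitten))  # ~0.2
print(np.linalg.norm(cat - car))     # ~1.45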
Let's define a helper function to call Cohere Embed models via Amazon Bedrock
import json
import boto3

# Initialize Bedrock client
bedrock = boto3.client("bedrock-runtime")

def embed(texts: list, model_id="cohere.embed-english-v3"):
    """Generates embeddings for an array of strings using Cohere Embed models."""
    model_provider = model_id.split('.')[0]
    assert model_provider == "cohere", \
        f"Invalid model provider (Got: {model_provider}, Expected: cohere)"
    # Prepare payload
    accept = "*/*"
    content_type = "application/json"
    body = json.dumps({
        'texts': texts,
        'input_type': "search_document"
    })
    # Call model
    response = bedrock.invoke_model(
        body=body,
        modelId=model_id,
        accept=accept,
        contentType=content_type
    )
    # Process response
    response_body = json.loads(response.get('body').read())
    return response_body.get('embeddings')

and use it to generate embeddings for a small subset of the data (for now, just the answers)
from tqdm import tqdm

# Text limit per call for Cohere Embed models via Amazon Bedrock
# https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-embed.html
MAX_TEXTS_PER_CALL = 96

embeddings = []
for idx in tqdm(range(0, len(answers), MAX_TEXTS_PER_CALL), "Generating embeddings"):
    embeddings += embed(answers[idx:idx + MAX_TEXTS_PER_CALL])

We are now ready to initialize VektorDB and start loading data
from vektordb import ANNVectorDatabase
from vektordb.types import Vector

# Initialize database
vector_db = ANNVectorDatabase()

# Load embeddings into the database
for idx in tqdm(range(len(embeddings)), "Loading embeddings"):
    vector_db.insert(idx, Vector(embeddings[idx], {'answer': answers[idx][:20]}))

As a sanity check, we can print a small sample of the database
vector_db.display(
    np_format={
        'edgeitems': 1,
        'precision': 5,
        'threshold': 3,
        'suppress': True
    },
    keys=range(10)
)

+-----+-------------------------+------------------------------------+
| Key | Data | Metadata |
+-----+-------------------------+------------------------------------+
| 0 | [-0.00618 ... -0.00047] | {'answer': 'Natalia sold 48/2 = '} |
| 1 | [-0.01997 ... -0.01791] | {'answer': 'Weng earns 12/60 = $'} |
| 2 | [-0.00623 ... -0.0061 ] | {'answer': 'In the beginning, Be'} |
| 3 | [-0.07849 ... 0.00721] | {'answer': 'Maila read 12 x 2 = '} |
| 4 | [-0.01669 ... 0.01263] | {'answer': 'He writes each frien'} |
| 5 | [0.02484 ... 0.05185] | {'answer': 'There are 80/100 * 1'} |
| 6 | [-0.01807 ... -0.01859] | {'answer': 'He eats 32 from the '} |
| 7 | [ 0.01265 ... -0.02016] | {'answer': 'To the initial 2 pou'} |
| 8 | [-0.00504 ... 0.0143 ] | {'answer': 'Let S be the amount '} |
| 9 | [-0.0239 ... -0.00905] | {'answer': 'She works 8 hours a '} |
+-----+-------------------------+------------------------------------+
Our VektorDB instance is backed by an implementation of Approximate Nearest Neighbors (ANN) search, which uses binary trees to represent different partitions/splits of the hyperspace.
These splits are generated by randomly picking vectors to define each partition, sending every point to the left or right subtree accordingly. The process is repeated until each node (partition) holds at most k items.
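As a rough sketch of that idea (a hypothetical illustration, not VektorDB's actual code), assuming each split picks two random vectors and assigns every point to whichever one it is closer to:

import random

def build_tree(points, k):
    """Recursively partition `points` (a list of vectors) until each leaf holds at most k items."""
    if len(points) <= k:
        return points  # leaf node: partition is small enough

    # Hypothetical split rule: pick two random vectors and send each
    # point to the side of whichever one it is closer to
    a, b = random.sample(points, 2)
    left, right = [], []
    for p in points:
        dist_a = sum((pi - ai) ** 2 for pi, ai in zip(p, a))
        dist_b = sum((pi - bi) ** 2 for pi, bi in zip(p, b))
        (left if dist_a < dist_b else right).append(p)

    if not left or not right:
        return points  # degenerate split (e.g., duplicate points); stop here

    return {'split': (a, b), 'left': build_tree(left, k), 'right': build_tree(right, k)}

Because the split vectors are chosen at random, every tree partitions the space differently, which is what makes searching several trees worthwhile.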
We can get better results by growing a *forest* of trees and searching all of them, so let's do that:
import random

# Set seed value for replication
random.seed(42)

# Plant a bunch of trees
vector_db.build(n_trees=3, k=3)
print(vector_db.trees[0], "\n")

Here is a representation of the first tree in our forest (nodes show the number of instances in each partition)
__________100______________
/
________________________________________________63______ __________37___________
/ /
_51__ _12_ 16____ ___21____
/ / / /
6 45____________________ _7 5 2 _14___ _10_ _11_____
/ / / / / / /
3 3 3 __________42_____________ 4 3 2 3 5 _9 _5 5 4 ___7
/ / / / / / / /
___18____ _____24____ 2 2 3 2 6 3 4 1 2 3 3 1 6_ 1
/ / / / /
_8_ _10_ ___11___ _13_ 3 3 1 3 2 4
/ / / / /
4 4 4 6_ 6_ _5 5 8_ 3 1
/ / / / / / / /
2 2 3 1 1 3 2 4 2 4 4 1 2 3 3 5
/ / / /
1 3 3 1 2 2 2 3
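At query time, a natural strategy (suggested by the forest setup above, though the actual implementation may differ) is to descend each tree to the leaf whose partition contains the query, pool the candidates from all trees, and rank them by distance. A hypothetical sketch continuing the build_tree example above, where `distance` is any distance function, such as the cosine distance sketched further below:

def search_forest(trees, query, n_results, distance):
    """Collect candidate points from every tree, then rank them by distance."""
    candidates = []
    for tree in trees:
        node = tree
        # Descend to the leaf partition the query falls in
        while isinstance(node, dict):
            a, b = node['split']
            dist_a = sum((qi - ai) ** 2 for qi, ai in zip(query, a))
            dist_b = sum((qi - bi) ** 2 for qi, bi in zip(query, b))
            node = node['left'] if dist_a < dist_b else node['right']
        candidates.extend(node)  # leaf is a list of points

    # Rank the pooled candidates (duplicates across trees are fine for a sketch)
    candidates.sort(key=lambda p: distance(p, query))
    return candidates[:n_results]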
Finally, we can run queries by simply searching the database for answers similar to a target question.
We use a distance function to quantify how similar two vectors are to each other.
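For illustration, here is a minimal sketch of cosine distance, a common choice for comparing embeddings (an assumption on our part; VektorDB's actual distance function may differ). A lower score means the vectors are more similar, which matches the output below, where the best match has the smallest score:

import math

def cosine_distance(u, v):
    """1 - cosine similarity: 0 for identical directions, up to 2 for opposite ones."""
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm_u = math.sqrt(sum(ui * ui for ui in u))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    return 1.0 - dot / (norm_u * norm_v)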
For example, if we ask the first question in the training dataset
from vektordb.utils import print_similarity_scores

# Extract first question
query = questions[0]
print("\nQuery:", query, "\n")

# Run search and display similarity scores
results = vector_db.search(embed([query])[0], 3)
print_similarity_scores(results)

we expect the answer with the same index (0) to be the top result:
Query: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May.
How many clips did Natalia sell altogether in April and May?
+-----+---------------------+
| Key | Score |
+-----+---------------------+
| 0 | 0.15148634752350043 |
| 15 | 0.6105711817572272 |
| 83 | 0.6823805943068366 |
+-----+---------------------+
References

COS 597A (Princeton): Long-Term Memory in AI - Vector Search and Databases
CMU 15-445/645 (Carnegie Mellon): Database Systems