VektorDB

1.0.0

A minimal vector database for educational purposes.

Tagline: keep it simple and they will learn...

Want to learn more? Check out the 'References' section below.

Example: Amazon Bedrock meets VektorDB

In this example, we'll use the Grade School Math 8K (GSM8K) dataset:

from datasets import load_dataset

# Number of samples we want to process
N_SAMPLES = 100

# Load dataset
# https://huggingface.co/datasets/openai/gsm8k
ds = load_dataset("openai/gsm8k", "main", split="train")[:N_SAMPLES]
questions = ds['question']
answers = ds['answer']

GSM8K contains "linguistically diverse grade school math word problems" in the form of question-answer pairs like the one shown below:

### Question

Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May.
How many clips did Natalia sell altogether in April and May?

### Answer

Natalia sold 48/2 = <<48/2=24>>24 clips in May.
Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.
#### 72

Our goal is to turn these question-answer pairs into embeddings, store them in VektorDB, and run a few operations on them.

Embeddings are just numerical representations of a piece of information, usually in the form of vectors. You can turn any kind of data into embeddings, and they will preserve the meaning of the original data. If you want to learn more about embeddings, check out Mapping Embeddings: from meaning to vectors and back.

Let's define a helper function that calls the embedding model via Amazon Bedrock:

import json
import boto3

# Initialize Bedrock client
bedrock = boto3.client("bedrock-runtime")

def embed(texts: list, model_id="cohere.embed-english-v3"):
    """Generates embeddings for an array of strings using Cohere Embed models."""
    # The provider is the prefix of the model ID, e.g. "cohere" in "cohere.embed-english-v3"
    model_provider = model_id.split('.')[0]
    assert model_provider == "cohere", \
        f"Invalid model provider (Got: {model_provider}, Expected: cohere)"

    # Prepare payload
    accept = "*/*"
    content_type = "application/json"
    body = json.dumps({
        'texts': texts,
        'input_type': "search_document"
    })

    # Call model
    response = bedrock.invoke_model(
        body=body,
        modelId=model_id,
        accept=accept,
        contentType=content_type
    )

    # Process response
    response_body = json.loads(response.get('body').read())
    return response_body.get('embeddings')

and use it to generate embeddings for a small subset of our data (answers only, for now):

from tqdm import tqdm

# Text call limit for Cohere Embed models via Amazon Bedrock
# https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-embed.html
MAX_TEXTS_PER_CALL = 96

embeddings = []
for idx in tqdm(range(0, len(answers), MAX_TEXTS_PER_CALL), "Generating embeddings"):
    embeddings += embed(answers[idx:idx + MAX_TEXTS_PER_CALL])
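
With N_SAMPLES = 100 and a limit of 96 texts per call, the loop above makes two requests: one with 96 answers and one with the remaining 4. The sketch below only illustrates that slicing arithmetic; `batch_ranges` is a hypothetical helper introduced here and is not part of VektorDB or boto3.

```python
def batch_ranges(n_items, batch_size):
    """Return the (start, stop) slice bounds needed to cover n_items in batches."""
    return [
        (start, min(start + batch_size, n_items))
        for start in range(0, n_items, batch_size)
    ]

# 100 answers with at most 96 texts per call
print(batch_ranges(100, 96))  # [(0, 96), (96, 100)]
```

Python slices clip at the end of the list, so `answers[96:192]` returns only the last 4 items and no explicit `min` is needed in the original loop.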

Now we're ready to initialize VektorDB and start loading data:

from vektordb import ANNVectorDatabase
from vektordb.types import Vector

# Initialize database
vector_db = ANNVectorDatabase()

# Load embeddings into the database
# (only the first 20 characters of each answer are kept as metadata)
for idx in tqdm(range(len(embeddings)), "Loading embeddings"):
    vector_db.insert(idx, Vector(embeddings[idx], {'answer': answers[idx][:20]}))

As a sanity check, we can print a small sample of our database:

vector_db.display(
    np_format={
        'edgeitems': 1,
        'precision': 5,
        'threshold': 3,
        'suppress': True
    },
    keys=range(10)
)

+-----+-------------------------+------------------------------------+
| Key |           Data          |              Metadata              |
+-----+-------------------------+------------------------------------+
|  0  | [-0.00618 ... -0.00047] | {'answer': 'Natalia sold 48/2 = '} |
|  1  | [-0.01997 ... -0.01791] | {'answer': 'Weng earns 12/60 = $'} |
|  2  | [-0.00623 ... -0.0061 ] | {'answer': 'In the beginning, Be'} |
|  3  | [-0.07849 ...  0.00721] | {'answer': 'Maila read 12 x 2 = '} |
|  4  | [-0.01669 ...  0.01263] | {'answer': 'He writes each frien'} |
|  5  |  [0.02484 ... 0.05185]  | {'answer': 'There are 80/100 * 1'} |
|  6  | [-0.01807 ... -0.01859] | {'answer': 'He eats 32 from the '} |
|  7  | [ 0.01265 ... -0.02016] | {'answer': 'To the initial 2 pou'} |
|  8  | [-0.00504 ...  0.0143 ] | {'answer': 'Let S be the amount '} |
|  9  | [-0.0239  ... -0.00905] | {'answer': 'She works 8 hours a '} |
+-----+-------------------------+------------------------------------+

Our VektorDB instance is backed by an approximate nearest neighbor (ANN) search implementation that uses binary trees to represent partitions (splits) of the hyperspace.

These partitions are created by picking two vectors at random, finding the hyperplane equidistant from both, and then splitting the remaining points into left and right depending on which side of the hyperplane they fall on.

This process is repeated until each node (partition) holds at most k items.
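
The splitting procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not VektorDB's actual implementation: the names `build_tree`, `leaf_sizes` and the dictionary-based node layout are hypothetical, and the toy data is random.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_tree(points, k, rng):
    """Recursively split `points` until every leaf holds at most k items."""
    if len(points) <= k:
        return {"leaf": points}
    # Pick two vectors at random
    p, q = rng.sample(points, 2)
    # The hyperplane equidistant from p and q has normal (p - q)
    # and passes through their midpoint
    normal = [a - b for a, b in zip(p, q)]
    midpoint = [(a + b) / 2 for a, b in zip(p, q)]
    offset = dot(normal, midpoint)
    # Points closer to q go left, points closer to p go right
    left = [x for x in points if dot(normal, x) < offset]
    right = [x for x in points if dot(normal, x) >= offset]
    if not left or not right:
        # Degenerate split (e.g. duplicate vectors): stop here (illustrative guard)
        return {"leaf": points}
    return {"left": build_tree(left, k, rng), "right": build_tree(right, k, rng)}

def leaf_sizes(node):
    """Collect the number of items stored in each leaf, left to right."""
    if "leaf" in node:
        return [len(node["leaf"])]
    return leaf_sizes(node["left"]) + leaf_sizes(node["right"])

rng = random.Random(42)
points = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(100)]
tree = build_tree(points, k=3, rng=rng)
sizes = leaf_sizes(tree)
print(len(sizes), "leaves; largest holds", max(sizes), "items")
```

Because the two sampled vectors always land on opposite sides of their own hyperplane, every split makes progress and the recursion terminates.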

We'll get better results by creating a forest of trees and searching all of them, so let's do just that:

import random

# Set seed value for replication
random.seed(42)

# Plant a bunch of trees
vector_db.build(n_trees=3, k=3)
print(vector_db.trees[0], "\n")

This is a representation of the first tree in our forest (each node shows the number of instances in its partition):

                                                       __________100______________
                                                      /                           
     ________________________________________________63______          __________37___________
    /                                                                /                       
  _51__                                                    _12_      16____               ___21____
 /                                                       /        /                   /         
 6    45____________________                             _7    5    2    _14___        _10_      _11_____
/   /                                                 /    /        /            /        /        
3 3  3           __________42_____________              4  3  2 3       5     _9     _5    5    4     ___7
                /                                     /              /    /     /    /   /    /    
            ___18____               _____24____        2 2             3 2   6  3   4  1  2 3  3 1   6_   1
           /                      /                                       /     /               /  
          _8_      _10_        ___11___      _13_                           3 3    1 3              2  4
         /       /          /            /                                                        / 
         4   4    4    6_     6_      _5    5    8_                                                   3 1
        /  /   /   /     /      /    /   /  
        2 2 3 1  1 3  2  4   2  4    4  1  2 3  3  5
                        /     /   /            / 
                        1 3    3 1  2 2           2 3

Finally, we can run queries by simply searching the database for answers that are similar to a target question.

We use a distance function to quantify how similar two vectors are to each other; lower scores mean closer matches.

For instance, suppose we ask the first question in our training dataset:
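
As an illustration of such a distance function, here is a minimal sketch of cosine distance in plain Python. The text does not say which metric VektorDB uses, so cosine distance is an assumption made for this example, and `cosine_distance` is a hypothetical helper rather than part of the library.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity: 0 for vectors pointing the same way, larger when less similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Toy 2-D vectors standing in for embeddings
query = [1.0, 0.0]
candidates = {
    "same": [2.0, 0.0],        # same direction as the query
    "close": [1.0, 1.0],       # 45 degrees away
    "orthogonal": [0.0, 3.0],  # 90 degrees away
}

# Rank candidates from most to least similar (smallest distance first)
ranked = sorted(candidates, key=lambda name: cosine_distance(query, candidates[name]))
print(ranked)  # ['same', 'close', 'orthogonal']
```

Cosine distance ignores vector length and compares direction only, which is why `[2.0, 0.0]` scores 0 against `[1.0, 0.0]`.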

from vektordb.utils import print_similarity_scores

# Extract first question
query = questions[0]
print("\nQuery:", query, "\n")

# Run search and display similarity scores for the top 3 results
results = vector_db.search(embed([query])[0], 3)
print_similarity_scores(results)

We expect the answer with the same index (0) to come back as the top result:

Query: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May.
How many clips did Natalia sell altogether in April and May?

+-----+---------------------+
| Key |        Score        |
+-----+---------------------+
|  0  | 0.15148634752350043 |
|  15 |  0.6105711817572272 |
|  83 |  0.6823805943068366 |
+-----+---------------------+

References

Articles and books

(Bernhardsson, 2015a) Nearest neighbor methods and vector models - part 1
(Bernhardsson, 2015b) Nearest neighbors and vector models - part 2 - algorithms and data structures
(Bruch, 2024) Foundations of Vector Retrieval
(Manning, Raghavan & Schütze, 2008) Introduction to Information Retrieval
(Pan, Wang & Li, 2023) Survey of Vector Database Management Systems
(Teofili, 2019) Deep Learning for Search