fvdb 1.0.0

FVDB - thin porcelain around FAISS

fvdb is a simple, minimal wrapper around the FAISS vector database. It uses an L2 index with normalised vectors.
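Why normalised vectors matter for an L2 index: for unit-length vectors, squared L2 distance and cosine similarity are tied by ‖a − b‖² = 2 − 2·cos(a, b), so ranking by smallest L2 distance gives the same order as ranking by highest cosine similarity. The sketch below checks this in plain Python with a brute-force search. It is illustrative only: it does not call FAISS and does not reflect fvdb's internals, and the helper names (`normalise`, `flat_l2_search`) are made up for this example.

```python
import math

def normalise(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    """Dot product; equals cosine similarity when both inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))

def flat_l2_search(vectors, query, k):
    """Brute-force nearest neighbours by squared L2 distance."""
    ranked = sorted(range(len(vectors)), key=lambda i: squared_l2(vectors[i], query))
    return ranked[:k]

docs = [normalise(v) for v in ([1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0])]
query = normalise([0.9, 0.1, 0.0])

by_l2 = flat_l2_search(docs, query, k=3)
by_cosine = sorted(range(len(docs)), key=lambda i: -cosine(docs[i], query))

for d in docs:
    # The identity holds for every unit-length pair.
    assert math.isclose(squared_l2(d, query), 2 - 2 * cosine(d, query), abs_tol=1e-12)

# Both rankings agree.
assert by_l2 == by_cosine
```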

fvdb uses the faiss-cpu package, and sentence-transformers for embeddings. If you need the GPU version of FAISS (you very probably don't), you can install faiss-gpu manually and use GpuIndexFlatL2 instead of IndexFlatL2 in fvdb/db.hy. You can still use a GPU text embedding model even while using faiss-cpu.

If summaries are enabled (not the default; see the Configuration section below), a summary of each extract is stored alongside the extract.

It pairs well with trag.

Features

similarity search with score
choice of sentence-transformer embeddings
useful formatting of results (JSON, tabulated ...)
CLI access
extract summaries

Any input other than plain text (markdown, asciidoc, rst, source code, etc.) is out of scope. For other formats, use one of the many packages that already exist for that purpose.
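Since only plain text is in scope, it may help to screen files before passing them to `ingest`. A minimal sketch, assuming "plain text" means "the first few kilobytes decode as UTF-8 and contain no NUL byte". `is_plain_text` and `plain_text_files` are hypothetical helpers, not part of fvdb, and fvdb may apply its own checks.

```python
import codecs
from pathlib import Path

def is_plain_text(path, sample_size=4096):
    """Heuristic: the leading bytes are valid UTF-8 and contain no NUL byte."""
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    if b"\x00" in sample:
        return False
    # An incremental decoder tolerates a multi-byte character cut off
    # at the end of the sample, but still rejects invalid bytes.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True

def plain_text_files(root):
    """Return the files under root that pass the heuristic, sorted by path."""
    return sorted(p for p in Path(root).rglob("*") if p.is_file() and is_plain_text(p))
```

The resulting paths could then be passed to `ingest` one at a time; PDFs, office documents and the like would first need converting to text with a separate tool.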

Usage

import hy # fvdb is written in Hy, but you can use it from python too
# info and nuke are assumed to be importable from fvdb like the other functions
from fvdb import faiss, info, ingest, nuke, similar, sources, write

# data ingestion
v = faiss()
ingest(v, "doc.md")
ingest(v, "docs-dir")
write(v, "/tmp/test.fvdb") # defaults to $XDG_DATA_HOME/fvdb (~/.local/share/fvdb/ on Linux)

# search
results = similar(v, "some query text")
# results = marginal(v, "some query text") # not yet implemented

# information, management
sources(v)
#   { ...
#     'docs-dir/Once More to the Lake.txt',
#     'docs-dir/Politics and the English Language.txt',
#     'docs-dir/Reflections on Gandhi.txt',
#     'docs-dir/Shooting an elephant.txt',
#     'docs-dir/The death of the moth.txt',
#     ... }

info(v)
#   { 'records': 42,
#     'embeddings': 42,
#     'embedding_dimension': 1024,
#     'is_trained': True,
#     'path': '/tmp/test-vdb',
#     'sources': 24,
#     'embedding_model': 'Alibaba-NLP/gte-large-en-v1.5' }

nuke(v)

These are also available from the command line.

$ # defaults to $XDG_DATA_HOME/fvdb (~/.local/share/fvdb/ on Linux)
# data ingestion (saves on exit)
$ fvdb ingest doc.md
    Adding 2 records

$ fvdb ingest docs-dir
    Adding 42 records

$ # search
$ fvdb similar -j "some query text" > results.json   # --json / -j gives json output

$ fvdb similar -r 2 "George Orwell's formative experience as a policeman in colonial Burma"
    # defaults to tabulated output (not all fields will be shown)
       score  source                             added                               page    length
    --------  ---------------------------------  --------------------------------  ------  --------
    0.579925  docs-dir/A hanging.txt             2024-11-05T11:37:26.232773+00:00       0      2582
    0.526988  docs-dir/Shooting an elephant.txt  2024-11-05T11:37:43.891659+00:00       0      3889

$ fvdb marginal "some query text"                       # not yet implemented

$ # information, management
$ fvdb sources
    ...
    docs-dir/Once More to the Lake.txt
    docs-dir/Politics and the English Language.txt
    docs-dir/Reflections on Gandhi.txt
    docs-dir/Shooting an elephant.txt
    docs-dir/The death of the moth.txt
    ...

$ fvdb info
    -------------------  -----------------------------
    records              44
    embeddings           44
    embedding_dimension  1024
    is_trained           True
    path                 /tmp/test
    sources              24
    embedding_model      Alibaba-NLP/gte-large-en-v1.5
    -------------------  -----------------------------

$ fvdb nuke
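
The JSON output can be post-processed with a few lines of Python. The sketch below assumes `results.json` holds a list of objects whose keys match the tabulated columns (`score`, `source`, ...). That layout is a guess based on the table shown earlier, not something this README states, so inspect the real output first. The sample data reuses the two rows from that table, and `top_sources` is a hypothetical helper.

```python
import json

def top_sources(results, min_score=0.0):
    """Return (source, score) pairs with score >= min_score, highest score first."""
    kept = [(r["source"], r["score"]) for r in results if r["score"] >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Stand-in for the contents of results.json (assumed structure).
sample = json.loads("""
[
  {"score": 0.526988, "source": "docs-dir/Shooting an elephant.txt"},
  {"score": 0.579925, "source": "docs-dir/A hanging.txt"}
]
""")

for source, score in top_sources(sample, min_score=0.55):
    print(f"{score:.3f}  {source}")
```

With a real file, `sample` would be replaced by `json.load(open("results.json"))`, adjusted to whatever structure the file actually has.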

Configuration

fvdb looks for $XDG_CONFIG_HOME/fvdb/conf.toml; if that file is not found, it uses defaults.

You cannot mix embedding models in a single fvdb.

Here is an example.

# Sets the default path to something other than $XDG_DATA_HOME/fvdb
path = "/tmp/test.fvdb"

# Summaries are useful if you use an embedding model with large maximum sequence length,
# for example, gte-large-en-v1.5 has maximum sequence length of 8192.
summary = true

# A conservative default model, maximum sequence length of 512,
# so no point using summaries.
embeddings.model = "all-mpnet-base-v2"

# # Some models need extra options
# embeddings.model = "Alibaba-NLP/gte-large-en-v1.5"
# embeddings.trust_remote_code = true
# # You can put some smaller models on a cpu, but larger models will be slow
# embeddings.device = "cpu"

Installation

First install PyTorch, which is used by sentence-transformers. You must decide whether you want the CPU or the CUDA (Nvidia GPU) version of PyTorch. If you only need text embeddings for fvdb, the CPU version is sufficient with the default model.

Then,