
fvdb - thin porcelain around FAISS

fvdb is a simple, minimal wrapper around the FAISS vector database. It uses an L2 index with normalised vectors.
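Why an L2 index over normalised vectors behaves like a cosine-similarity search can be checked numerically. The sketch below uses plain NumPy rather than FAISS or fvdb itself, with random vectors standing in for embeddings; the helper names `normalise` and `squared_l2` are illustrative only.

```python
import numpy as np


def normalise(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 norm."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / norms


def squared_l2(query: np.ndarray, vectors: np.ndarray) -> np.ndarray:
    """Squared Euclidean distance from one query vector to every row."""
    diff = vectors - query
    return np.sum(diff * diff, axis=1)


rng = np.random.default_rng(0)
docs = normalise(rng.normal(size=(5, 8)))      # stand-in for stored embeddings
query = normalise(rng.normal(size=(1, 8)))[0]  # stand-in for a query embedding

cosine = docs @ query               # dot product of unit vectors = cosine similarity
distance = squared_l2(query, docs)

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
assert np.allclose(distance, 2.0 - 2.0 * cosine)

# So nearest-by-L2 and most-similar-by-cosine produce the same ranking
l2_order = np.argsort(distance)
cosine_order = np.argsort(-cosine)
```

Because the two rankings agree, a flat L2 index is enough to get cosine-style results as long as every vector is normalised before it is added or queried.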

It uses the faiss-cpu package, and sentence-transformers for embeddings. If you need the GPU version of FAISS (very probably not), you can manually install faiss-gpu and use GpuIndexFlatL2 instead of IndexFlatL2 in fvdb/db.hy. You can still use a GPU text embedding model even while using faiss-cpu.

If summaries are enabled (not the default, see the configuration section below), a summary of each extract is stored alongside the extract.

It pairs well with trag.

Features

similarity search with score
choice of sentence-transformer embeddings
useful formatting of results (JSON, tabulated ...)
CLI access
extract summaries

Any input other than plain text (Markdown, AsciiDoc, RST, source code, etc.) is out of scope. You should use one of the many available packages (unstructured, trafilatura, Dorma, etc.) to convert to plain text in a separate step.
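The separate conversion step might look like the following sketch. It uses only the Python standard library as a stand-in for the packages named above, handles HTML only, and is far cruder than a dedicated extractor; `TextExtractor` and `html_to_text` are hypothetical helper names, not part of fvdb.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._chunks)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>A hanging</h1><p>It was in Burma.</p></body></html>"
)
plain = html_to_text(sample)
```

The resulting string would then be written to a .txt file, and that file passed to ingest.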

Usage

import hy  # fvdb is written in Hy, but you can use it from Python too
from fvdb import faiss, info, ingest, nuke, similar, sources, write

# data ingestion
v = faiss()
ingest(v, "doc.md")
ingest(v, "docs-dir")
write(v, "/tmp/test.fvdb")  # defaults to $XDG_DATA_HOME/fvdb (~/.local/share/fvdb/ on Linux)

# search
results = similar(v, "some query text")
# results = marginal(v, "some query text")  # not yet implemented

# information, management
sources(v)
#   { ...
#     'docs-dir/Once More to the Lake.txt',
#     'docs-dir/Politics and the English Language.txt',
#     'docs-dir/Reflections on Gandhi.txt',
#     'docs-dir/Shooting an elephant.txt',
#     'docs-dir/The death of the moth.txt',
#     ... }

info(v)
#   { 'records': 42,
#     'embeddings': 42,
#     'embedding_dimension': 1024,
#     'is_trained': True,
#     'path': '/tmp/test-vdb',
#     'sources': 24,
#     'embedding_model': 'Alibaba-NLP/gte-large-en-v1.5' }

nuke(v)

These are also available from the command line.

$ # defaults to $XDG_DATA_HOME/fvdb (~/.local/share/fvdb/ on Linux)
$ # data ingestion (saves on exit)
$ fvdb ingest doc.md
    Adding 2 records

$ fvdb ingest docs-dir
    Adding 42 records

$ # search
$ fvdb similar -j "some query text" > results.json   # --json / -j gives json output

$ fvdb similar -r 2 "George Orwell's formative experience as a policeman in colonial Burma"
    # defaults to tabulated output (not all fields will be shown)
       score  source                             added                               page    length
    --------  ---------------------------------  --------------------------------  ------  --------
    0.579925  docs-dir/A hanging.txt             2024-11-05T11:37:26.232773+00:00       0      2582
    0.526988  docs-dir/Shooting an elephant.txt  2024-11-05T11:37:43.891659+00:00       0      3889

$ fvdb marginal "some query text"                       # not yet implemented

$ # information, management
$ fvdb sources
    ...
    docs-dir/Once More to the Lake.txt
    docs-dir/Politics and the English Language.txt
    docs-dir/Reflections on Gandhi.txt
    docs-dir/Shooting an elephant.txt
    docs-dir/The death of the moth.txt
    ...

$ fvdb info
    -------------------  -----------------------------
    records              44
    embeddings           44
    embedding_dimension  1024
    is_trained           True
    path                 /tmp/test
    sources              24
    embedding_model      Alibaba-NLP/gte-large-en-v1.5
    -------------------  -----------------------------

$ fvdb nuke
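
The JSON output lends itself to post-processing in a script. The sketch below assumes that `--json` yields a list of objects whose keys match the tabulated column names (`score`, `source`, `page`, `length`) and that a higher score means a closer match; neither is confirmed here, so check a real results.json first. `best_sources` is a hypothetical helper, and the sample data is copied from the table above.

```python
import json

# Sample shaped like the tabulated output; the real JSON layout is an assumption.
sample_json = """
[
  {"score": 0.579925, "source": "docs-dir/A hanging.txt", "page": 0, "length": 2582},
  {"score": 0.526988, "source": "docs-dir/Shooting an elephant.txt", "page": 0, "length": 3889}
]
"""


def best_sources(results, min_score):
    """Return the source paths of results scoring at least min_score, best first."""
    kept = [r for r in results if r["score"] >= min_score]
    kept.sort(key=lambda r: r["score"], reverse=True)
    return [r["source"] for r in kept]


results = json.loads(sample_json)
top = best_sources(results, min_score=0.55)
```

In practice `sample_json` would be replaced by the contents of the file written by `fvdb similar -j`.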

Configuration

fvdb looks for $XDG_CONFIG_HOME/fvdb/conf.toml, and otherwise uses the defaults.

You cannot mix embedding models in a single fvdb.

Here is an example.

# Sets the default path to something other than $XDG_DATA_HOME/fvdb
path = "/tmp/test.fvdb"

# Summaries are useful if you use an embedding model with large maximum sequence length,
# for example, gte-large-en-v1.5 has maximum sequence length of 8192.
summary = true

# A conservative default model, maximum sequence length of 512,
# so no point using summaries.
embeddings.model = "all-mpnet-base-v2"

# # Some models need extra options
# embeddings.model = "Alibaba-NLP/gte-large-en-v1.5"
# embeddings.trust_remote_code = true
# # You can put some smaller models on a cpu, but larger models will be slow
# embeddings.device = "cpu"

Installation

First install PyTorch, which is used by sentence-transformers. You must decide whether you want the CPU or CUDA (NVIDIA GPU) version of PyTorch. If you only need text embeddings for fvdb, the CPU version is sufficient with the default model.

Then,