byaldi download - byaldi source code download

byaldi

Other source code

0.0.5


Welcome to Byaldi

Did you know? In the movie RAGatouille, the dish Remy makes is not actually a ratatouille, but a refined version of the dish called "Confit Byaldi".

The Byaldi logo is a cheerful rat using a magnifying glass to look at a complex document. It says "byaldi" in the middle of a circle around the rat.

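Warning: this is the pre-release version of Byaldi. Please report any issues you encounter; there are likely quite a few quirks to iron out!

Byaldi is RAGatouille's mini sister project. It is a simple wrapper around the ColPali repository that makes late-interaction multi-modal models such as ColPali easy to use through a familiar API.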
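
For readers new to the term, "late interaction" scoring can be sketched in a few lines. This is a toy illustration of the general MaxSim idea, not Byaldi's or ColPali's actual code: the function name `maxsim_score` and the 2-dimensional vectors are made up for the example.

```python
def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction (MaxSim) score: for each query vector, take the best
    dot product against any document vector, then sum those maxima."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy embeddings, invented purely for illustration.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0]]  # has a match for both query vectors
page_b = [[1.0, 0.0], [1.0, 0.0]]  # has a match for the first one only

print(maxsim_score(query, page_a))
print(maxsim_score(query, page_b))
```

Under this scheme a page is rewarded for covering every part of the query, which is why `page_a` outscores `page_b` here.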

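Getting started

First, a warning: this is a pre-release library that uses uncompressed indexes and lacks other kinds of refinements.

Currently, we support all models supported by the underlying ColPali engine, including the newer and better ColQwen2 checkpoints, such as vidore/colqwen2-v1.0. Broadly, the aim is for Byaldi to support all ColVLM models.

Additional backends will be supported in future updates. As Byaldi exists to facilitate the adoption of multi-modal retrievers, we also intend to add support for models such as VisRAG.

Eventually, we'll add an HNSW indexing mechanism, pooling, and, who knows, maybe 2-bit quantization?

It will be updated as the multi-modal ecosystem develops further!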

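Prerequisites

Poppler

To convert PDFs to images with a friendly license, we use the pdf2image library. This library requires poppler to be installed on your system. Poppler is very easy to install by following the instructions on its website. The TL;DR is:

macOS with Homebrew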

brew install poppler

Debian/Ubuntu

sudo apt-get install -y poppler-utils
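
To confirm the install worked before indexing anything, a quick check like the one below can help. It is a minimal sketch using only the standard library; the helper name `poppler_available` is made up, and it only tests whether the Poppler command-line tools that pdf2image relies on (`pdftoppm`, `pdftocairo`) are on your PATH.

```python
import shutil


def poppler_available():
    """Return True if a Poppler command-line tool is found on PATH."""
    # pdftoppm and pdftocairo are the Poppler binaries that pdf2image wraps.
    return any(shutil.which(tool) for tool in ("pdftoppm", "pdftocairo"))


if poppler_available():
    print("Poppler found.")
else:
    print("Poppler not found; install it before indexing PDFs.")
```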

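Flash-Attention

Gemma uses a recent version of flash attention. To make things run as smoothly as possible, we recommend installing it after installing the library: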

pip install --upgrade byaldi
pip install flash-attn

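Hardware

ColPali uses multi-billion-parameter models to encode documents. We recommend using a GPU for smooth operation, though weak or older GPUs are perfectly fine! Encoding your collection will suffer from poor performance on CPU or MPS.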
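
If you are unsure what hardware your environment exposes, the sketch below reports which of the three options is available. It assumes PyTorch is the underlying framework and falls back to "cpu" when it is not installed; the helper name `pick_device` is made up, and whether and how Byaldi accepts a device argument is not covered here.

```python
def pick_device():
    """Return "cuda", "mps", or "cpu" depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so there is nothing to probe
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # absent on older PyTorch builds
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(device)
```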

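Using `byaldi`

Byaldi is largely modeled after RAGatouille, meaning that everything is designed to take the fewest lines of code possible, so you can build on top of it quickly rather than spending time figuring out how to create a retrieval pipeline.

Loading a model

Loading a model with byaldi is extremely straightforward: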

from byaldi import RAGMultiModalModel
# Optionally, you can specify an `index_root`, which is where it'll save the index. It defaults to ".byaldi/".
RAG = RAGMultiModalModel.from_pretrained("vidore/colqwen2-v1.0")

If you already have an index and wish to load it along with the model needed to query it, you can do so just as easily:

from byaldi import RAGMultiModalModel
# Optionally, you can specify an `index_root`, which is where it'll look for the index. It defaults to ".byaldi/".
RAG = RAGMultiModalModel.from_index("your_index_name")

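Creating an index

Creating an index with byaldi is simple and flexible. You can index a single PDF file, a single image file, or a directory containing several of them. Here's how to create an index: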

from byaldi import RAGMultiModalModel
# Optionally, you can specify an `index_root`, which is where it'll save the index. It defaults to ".byaldi/".
RAG = RAGMultiModalModel.from_pretrained("vidore/colqwen2-v1.0")
index_name = "your_index_name"  # Placeholder name; pick your own.
RAG.index(
    input_path="docs/",  # The path to your documents
    index_name=index_name,  # The name you want to give to your index. It'll be saved at `index_root/index_name/`.
    store_collection_with_index=False,  # Whether the index should store the base64 encoded documents.
    doc_ids=[0, 1, 2],  # Optionally, you can specify a list of document IDs. They must be integers and match the number of documents you're passing. Otherwise, doc_ids will be automatically created.
    metadata=[{"author": "John Doe", "date": "2021-01-01"}],  # Optionally, you can specify a list of metadata for each document. It must be a list of dictionaries with the same length as the number of documents you're passing (only one entry is shown here).
    overwrite=True,  # Whether to overwrite an index if it already exists. If False, it'll return None and do nothing if `index_root/index_name` exists.
)

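That's it! The model will start spinning and create your index, exporting all the necessary information to disk when it's done. You can then use the RAGMultiModalModel.from_index("your_index_name") method presented above to load it whenever needed (you don't need to do this right after creating it: it's already loaded in memory and ready to go!).

The main decision you'll have to make here is whether to set store_collection_with_index to True. If set to True, it greatly simplifies your workflow: the base64-encoded version of each relevant document is returned as part of the query results, so you can immediately pipe it to your LLM. However, it adds considerable memory and storage requirements to your index, so if you're short on those resources you may want to set it to False (the default) and create the base64-encoded versions yourself whenever needed.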
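
If you leave store_collection_with_index set to False, producing a base64 string yourself takes only the standard library. The sketch below is a minimal illustration, not Byaldi's own encoding routine: the helper name `encode_file_base64` is made up, the sample file is a stand-in, and whether the output matches the exact image format Byaldi would have stored is an assumption you should verify for your use case.

```python
import base64
import tempfile
from pathlib import Path


def encode_file_base64(path):
    """Read a file from disk and return its contents as a base64 string."""
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")


# Demonstration with a throwaway file; replace it with a real page image.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "page.png"
    sample.write_bytes(b"not a real image")  # stand-in bytes for the demo
    encoded = encode_file_base64(sample)

print(encoded)
```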

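Searching

Once you've created or loaded an index, you can start searching for relevant documents. Again, it's a single, very straightforward command: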

results = RAG.search(query, k=3)

The results will be a list of Result objects, which you can also treat as normal dictionaries. Each result will be in this format:

[
    {
        "doc_id": 0,
        "page_num": 10,
        "score": 12.875,
        "metadata": {},
        "base64": None
    },
    ...
]

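page_num is 1-indexed, while doc_ids are 0-indexed. This makes it simpler to work with other PDF manipulation tools, where the first page is generally page 1. For images and single-page PDFs, page_num will always be 1; it is only useful for longer PDFs.

If you passed metadata, or encoded with the flag to store the base64 versions, these fields will be populated. Results are sorted by score, so item 0 in the list will always be the most relevant document, and so on.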
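
Because results can be treated as dictionaries, post-processing them is plain Python. The sketch below works on hand-written dictionaries in the format shown above rather than on real search output; only the first entry comes from the example, the other two are invented, and the helper names `page_index` and `best_page_per_document` are made up.

```python
# Stand-in results, sorted by descending score like real search output.
results = [
    {"doc_id": 0, "page_num": 10, "score": 12.875, "metadata": {}, "base64": None},
    {"doc_id": 2, "page_num": 1, "score": 11.5, "metadata": {}, "base64": None},
    {"doc_id": 0, "page_num": 3, "score": 9.25, "metadata": {}, "base64": None},
]


def page_index(result):
    """Convert the 1-indexed page_num into a 0-based list index."""
    return result["page_num"] - 1


def best_page_per_document(results):
    """Keep only the highest-scoring page for each doc_id."""
    best = {}
    for result in results:
        # The first hit seen for a doc_id is its best, given the sort order.
        best.setdefault(result["doc_id"], result)
    return best


best = best_page_per_document(results)
for doc_id, result in best.items():
    print(doc_id, result["page_num"], page_index(result), result["score"])
```

Relying on the sort order keeps the helper to a single pass; if you reorder the results yourself first, compare scores explicitly instead.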

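Adding documents to an existing index

Since indexes are in-memory, they're addition-friendly! If you need to ingest some new PDFs, just load your index with from_index, then call add_to_index with parameters similar to those of the original index() method: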

RAG.add_to_index(
    "path_to_new_docs",
    store_collection_with_index=False,
    # ... remaining arguments, similar to index()
)


Additional information

Version: 0.0.5
Type: Other source code
Updated: 2025-04-18
Size: 5.2MB
Source: GitHub
