
Re-rank your search results with SOTA pairwise or listwise rerankers before feeding them to LLMs.
An ultra-lite & super-fast Python library to add re-ranking to your existing search & retrieval pipelines. It is based on SOTA LLMs and cross-encoders, with gratitude to all the model owners.
Supports:
- Pairwise rerankers (cross-encoder based, max tokens = 512)
- Listwise rerankers (LLM based, max tokens = 8192)

⚡ Ultra-lite, ⏱️ super-fast, and cost-conscious.

Based on SOTA cross-encoders and other models:
| Model Name | Description | Size | Notes |
|---|---|---|---|
| ms-marco-TinyBERT-L-2-v2 | Default model | ~4MB | Model card |
| ms-marco-MiniLM-L-12-v2 | Best cross-encoder reranker | ~34MB | Model card |
| rank-T5-flan | Best non-cross-encoder reranker | ~110MB | Model card |
| ms-marco-MultiBERT-L-12 | Multilingual, supports 100+ languages | ~150MB | Supported languages |
| ce-esci-MiniLM-L12-v2 | Fine-tuned on the Amazon ESCI dataset | - | Model card |
| rank_zephyr_7b_v1_full | 4-bit quantised GGUF | ~4GB | Model card |
| miniReranker_arabic_v1 | Only dedicated Arabic reranker | - | Model card |
Installation: `pip install flashrank`. For listwise rerankers, install the optional extras: `pip install flashrank[listwise]`.

The max_length value should be large enough to fit your longest passage. In other words, if your longest passage (100 tokens) plus query (16 tokens) is estimated at 116 tokens, then setting max_length = 128 is good enough, leaving room for reserved tokens like [CLS] and [SEP]. Use the OpenAI tiktoken library to estimate token counts if per-token performance is critical for you. For smaller passage sizes, passing a much larger max_length (e.g. 512) will negatively affect response time.
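For example, here is a minimal sketch of such an estimate with tiktoken; the encoding name and the sample strings are illustrative assumptions, and the reranker's own tokenizer may count slightly differently:

```python
import tiktoken  # pip install tiktoken

# Rough token estimate for query + longest passage. "cl100k_base" is an assumed
# general-purpose encoding, so treat the result as an approximation.
enc = tiktoken.get_encoding("cl100k_base")

query = "How to speedup LLMs?"
longest_passage = "vLLM is a fast and easy-to-use library for LLM inference and serving."

n_tokens = len(enc.encode(query)) + len(enc.encode(longest_passage))
print(n_tokens)  # pick max_length a bit above this, leaving room for [CLS] / [SEP]
```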
from flashrank import Ranker, RerankRequest

# Nano (~4MB), blazing-fast model & competitive performance (ranking precision).
ranker = Ranker(max_length=128)

or

# Small (~34MB), slightly slower & best performance (ranking precision).
ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="/opt")

or

# Medium (~110MB), slower model with best zero-shot performance (ranking precision) on out-of-domain data.
ranker = Ranker(model_name="rank-T5-flan", cache_dir="/opt")

or

# Medium (~150MB), slower model with competitive performance (ranking precision) for 100+ languages (don't use for English).
ranker = Ranker(model_name="ms-marco-MultiBERT-L-12", cache_dir="/opt")

or

# Listwise LLM reranker (~4GB, 4-bit GGUF); adjust max_length based on your passage length.
ranker = Ranker(model_name="rank_zephyr_7b_v1_full", max_length=1024)

# Metadata is optional; id can be your DB ids from your retrieval stage or simple numeric indices.
query = "How to speedup LLMs?"
passages = [
{
"id" : 1 ,
"text" : "Introduce *lookahead decoding*: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step." ,
"meta" : { "additional" : "info1" }
},
{
"id" : 2 ,
"text" : "LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper" ,
"meta" : { "additional" : "info2" }
},
{
"id" : 3 ,
"text" : "There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call `model.to_bettertransformer()` on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run." ,
"meta" : { "additional" : "info3" }
},
{
"id" : 4 ,
"text" : "Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup." ,
"meta" : { "additional" : "info4" }
},
{
"id" : 5 ,
"text" : "vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels" ,
"meta" : { "additional" : "info5" }
}
]
rerankrequest = RerankRequest(query=query, passages=passages)
results = ranker.rerank(rerankrequest)
print(results)  # Reranked output from default reranker
[
{
"id" : 4 ,
"text" : "Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup." ,
"meta" :{
"additional" : "info4"
},
"score" : 0.016847236
},
{
"id" : 5 ,
"text" : "vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels" ,
"meta" :{
"additional" : "info5"
},
"score" : 0.011563735
},
{
"id" : 3 ,
"text" : "There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call `model.to_bettertransformer()` on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run." ,
"meta" :{
"additional" : "info3"
},
"score" : 0.00081340264
},
{
"id" : 1 ,
"text" : "Introduce *lookahead decoding*: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step." ,
"meta" :{
"additional" : "info1"
},
"score" : 0.00063596206
},
{
"id" : 2 ,
"text" : "LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper" ,
"meta" :{
"additional" : "info2"
},
"score" : 0.00024851
}
]
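Since the whole point is to pass only the best passages on to an LLM, here is a minimal, library-agnostic sketch (an assumption for illustration, not part of the flashrank API) of turning the reranked output above into a prompt:

```python
# `results` is the reranked list returned by ranker.rerank(rerankrequest) above;
# keep only the top-k passages and stitch them into a grounded prompt.
top_k = 3
context = "\n\n".join(
    f"[{rank}] {hit['text']}" for rank, hit in enumerate(results[:top_k], start=1)
)
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {query}\n"
)
# Send `prompt` to whichever LLM client you already use.
```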


In AWS or other serverless environments, the entire VM is read-only, so you may have to create your own custom directory. You can do that in your Dockerfile and use it for loading the models (and, eventually, as a cache between warm calls). You can do so during init with the cache_dir parameter.
ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="/opt")
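For instance, a minimal AWS Lambda-style sketch (the handler shape and the assumption that the model was baked into /opt via the Dockerfile are illustrative) that reuses the loaded model across warm calls:

```python
from flashrank import Ranker, RerankRequest

# Created at import time, so warm invocations reuse the already-loaded model;
# /opt is assumed to contain the model files prepared in the Dockerfile.
ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="/opt")

def handler(event, context):
    request = RerankRequest(query=event["query"], passages=event["passages"])
    return ranker.rerank(request)
```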

To cite this repository in your work, please click the "Cite this repository" link on the right side (below the repo description and tags).
- Cos-Mix: Cosine Similarity and Distance Fusion for Improved Information Retrieval
- Bryndza at ClimateActivism 2024: Stance, Target and Hate Event Detection via Retrieval-Augmented GPT-4 and LLaMA
- Stance and Hate Event Detection in Tweets Related to Climate Activism - Shared Task at CASE 2024
A clone library called SwiftRank was pointing to our model repository, and we used an interim workaround to prevent this kind of theft. Thanks for your patience and understanding.
Update: this issue has been resolved and the models are now hosted on HF. Please upgrade with `pip install -U flashrank` to continue. Thank you for your patience and understanding.