
Re-rank your search results with SoTA pairwise or listwise rerankers before feeding them into your LLMs.

An ultra-lite & super-fast Python library to add re-ranking to your existing search & retrieval pipelines. It is based on SoTA LLMs and cross-encoders, with gratitude to all the model owners.
Supports:
- Pairwise rerankers (cross-encoder based, max tokens = 512)
- Listwise rerankers (LLM based, max tokens = 8192)

⚡ Ultra-lite · ⏱️ Super-fast · 💸 $-conscious

Based on SoTA cross-encoders and other models:
| Model Name | Description | Size | Notes |
|---|---|---|---|
| ms-marco-TinyBERT-L-2-v2 | Default model | ~4MB | Model card |
| ms-marco-MiniLM-L-12-v2 | Best cross-encoder reranker | ~34MB | Model card |
| rank-T5-flan | Best non-cross-encoder reranker | ~110MB | Model card |
| ms-marco-MultiBERT-L-12 | Multilingual, supports 100+ languages | ~150MB | Supported languages |
| ce-esci-MiniLM-L12-v2 | Fine-tuned on the Amazon ESCI dataset | - | Model card |
| rank_zephyr_7b_v1_full | 4-bit quantized GGUF | ~4GB | Model card |
| miniReranker_arabic_v1 | Only dedicated Arabic reranker | - | Model card |
Installation:

```bash
pip install flashrank
pip install flashrank[listwise]   # optional: LLM-based listwise reranking support
```

The `max_length` value should be large enough to fit your longest passage. In other words, if your longest passage (100 tokens) plus query (16 tokens) is estimated at 116 tokens, then setting `max_length = 128` is good enough, leaving room for reserved tokens like [CLS] and [SEP]. Use the OpenAI tiktoken library to estimate token counts if per-token performance is critical for you. For smaller passage sizes, setting a much longer `max_length` (e.g., 512) will negatively affect response time.
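As a rough way to size `max_length` from your own data, here is a minimal sketch using tiktoken. The `cl100k_base` encoding is only an assumed proxy for the reranker's actual tokenizer, and the 16-token buffer plus power-of-two rounding are illustrative choices, not FlashRank requirements:

```python
# Minimal sketch: estimate a reasonable max_length from your own corpus.
# Assumption: tiktoken's "cl100k_base" encoding is a rough proxy for the
# reranker's tokenizer; the +16 buffer for reserved tokens is a guess.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

query = "How to speedup LLMs?"
passages_text = [
    "vLLM is a fast and easy-to-use library for LLM inference and serving.",
    "Medusa removes the draft model while getting a 2x speedup.",
]

longest_passage = max(len(enc.encode(p)) for p in passages_text)
query_tokens = len(enc.encode(query))

# Round up to the next power of two above passage + query + a small buffer.
needed = longest_passage + query_tokens + 16
max_length = 1 << (needed - 1).bit_length()
print(f"estimated tokens: {needed}, suggested max_length: {max_length}")
```

With a suitable `max_length` chosen, initialize a `Ranker` and rerank as in the example below.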
```python
from flashrank import Ranker, RerankRequest

# Nano (~4MB), blazing fast model & competitive performance (ranking precision).
ranker = Ranker(max_length=128)

# or

# Small (~34MB), slightly slower & best performance (ranking precision).
ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="/opt")

# or

# Medium (~110MB), slower model with best zero-shot performance (ranking precision) on out-of-domain data.
ranker = Ranker(model_name="rank-T5-flan", cache_dir="/opt")

# or

# Medium (~150MB), slower model with competitive performance (ranking precision) for 100+ languages (don't use for English).
ranker = Ranker(model_name="ms-marco-MultiBERT-L-12", cache_dir="/opt")

# or

# Listwise LLM reranker; adjust max_length based on your passage length.
ranker = Ranker(model_name="rank_zephyr_7b_v1_full", max_length=1024)

# Metadata is optional; id can be your DB ids from your retrieval stage or simple numeric indices.
query = "How to speedup LLMs?"
passages = [
{
"id" : 1 ,
"text" : "Introduce *lookahead decoding*: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step." ,
"meta" : { "additional" : "info1" }
},
{
"id" : 2 ,
"text" : "LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper" ,
"meta" : { "additional" : "info2" }
},
{
"id" : 3 ,
"text" : "There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call `model.to_bettertransformer()` on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run." ,
"meta" : { "additional" : "info3" }
},
{
"id" : 4 ,
"text" : "Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup." ,
"meta" : { "additional" : "info4" }
},
{
"id" : 5 ,
"text" : "vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels" ,
"meta" : { "additional" : "info5" }
}
]
rerankrequest = RerankRequest(query=query, passages=passages)
results = ranker.rerank(rerankrequest)
print(results)
```

Reranked output from the default reranker:

```json
[
{
"id" : 4 ,
"text" : "Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup." ,
"meta" :{
"additional" : "info4"
},
"score" : 0.016847236
},
{
"id" : 5 ,
"text" : "vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels" ,
"meta" :{
"additional" : "info5"
},
"score" : 0.011563735
},
{
"id" : 3 ,
"text" : "There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call `model.to_bettertransformer()` on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run." ,
"meta" :{
"additional" : "info3"
},
"score" : 0.00081340264
},
{
"id" : 1 ,
"text" : "Introduce *lookahead decoding*: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step." ,
"meta" :{
"additional" : "info1"
},
"score" : 0.00063596206
},
{
"id" : 2 ,
"text" : "LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper" ,
"meta" :{
"additional" : "info2"
},
"score" : 0.00024851
}
]
```
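If you then want to hand only the strongest evidence to your LLM (the "before feeding into your LLMs" step above), a minimal sketch continuing from the example might keep just the top-k reranked passages and stitch them into a prompt. The `top_k` cutoff and the prompt template are illustrative assumptions, not part of FlashRank:

```python
# Minimal sketch: build an LLM prompt from the reranked `results` above.
# top_k and the prompt template are arbitrary illustrative choices.
top_k = 3
top_passages = results[:top_k]  # results are already sorted best-first by score

context = "\n\n".join(
    f"[{rank}] {p['text']}" for rank, p in enumerate(top_passages, start=1)
)

prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {query}\n"
)
print(prompt)
```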


In AWS or other serverless environments the entire VM is read-only, so you may have to create your own custom directory. You can do this in your Dockerfile and use it for loading the models (it eventually acts as a cache between warm calls). You can set it during init with the cache_dir parameter.
```python
ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="/opt")
```
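For example, in a Lambda-style function you would typically construct the Ranker once at module level so that warm invocations reuse the model already downloaded into cache_dir. The handler name and the event shape (`"query"`, `"passages"`) below are assumptions for illustration, not a FlashRank API:

```python
# Minimal sketch of reusing one Ranker across warm serverless invocations.
# Assumption: the event carries "query" and "passages"; the handler name is arbitrary.
from flashrank import Ranker, RerankRequest

# Constructed once per container, outside the handler, so the model cached
# in cache_dir is reused on warm calls instead of being reloaded every time.
ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="/opt")

def handler(event, context):
    rerank_request = RerankRequest(query=event["query"], passages=event["passages"])
    return ranker.rerank(rerank_request)
```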

To cite this repository in your work, click the "Cite this repository" link on the right-hand side (below the repo description and tags).
Cos-Mix: Cosine Similarity and Distance Fusion for Improved Information Retrieval
Bryndza at ClimateActivism 2024: Stance, Target and Hate Event Detection via Retrieval-Augmented GPT-4 and LLaMA
Stance and Hate Event Detection in Tweets Related to Climate Activism - Shared Task of CASE 2024
A clone library called SwiftRank is pointing to our model repository; we are using a temporary workaround to prevent this theft. Thanks for your patience and understanding.
This issue has been resolved and the models are now hosted on HF. Please upgrade to continue: pip install -U flashrank. Thank you for your patience and understanding.