
Re-rank your search results with SOTA pairwise or listwise rerankers before feeding them to your LLMs.
An ultra-lite & super-fast Python library to add re-ranking to your existing search & retrieval pipelines. It is based on SOTA LLMs and cross-encoders, with gratitude to all the model owners.
Supports:
- Pairwise rerankers (max tokens = 512)
- Listwise rerankers (max tokens = 8192)
- ⚡ Ultra-lite: the default model is only ~4MB
- ⏱️ Super-fast: reranking speed depends on the number of tokens in your query and passages (see the max_length note below)

Models:
Based on SOTA cross-encoders and other models.
| Model Name | Description | Size | Notes |
|---|---|---|---|
| ms-marco-TinyBERT-L-2-v2 | Default model | ~4MB | Model card |
| ms-marco-MiniLM-L-12-v2 | Best cross-encoder reranker | ~34MB | Model card |
| rank-T5-flan | Best non-cross-encoder reranker | ~110MB | Model card |
| ms-marco-MultiBERT-L-12 | Multilingual, supports 100+ languages | ~150MB | Supported languages |
| ce-esci-MiniLM-L12-v2 | Fine-tuned on the Amazon ESCI dataset | - | Model card |
| rank_zephyr_7b_v1_full | 4-bit quantized GGUF | ~4GB | Model card |
| miniReranker_arabic_v1 | Only dedicated Arabic reranker | - | Model card |
Installation:

pip install flashrank
pip install flashrank[listwise]

Usage:

The max_length value should be large enough to accommodate your longest passage. In other words, if your longest passage (say 100 tokens) plus query (say 16 tokens) is your token estimate, then max_length = 128 leaves enough headroom for reserved tokens such as [CLS] and [SEP]. If per-token performance matters, use a library like OpenAI's tiktoken to estimate token density. Avoid nonchalantly setting a long max_length such as 512 when your passages are small, as it will adversely affect response time.
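For example, here is a minimal sketch of estimating token counts with tiktoken to pick a sensible max_length (the cl100k_base encoding and the headroom of 8 reserved tokens are illustrative assumptions, not FlashRank requirements):

import tiktoken  # pip install tiktoken

# cl100k_base is an assumption; FlashRank's models use their own tokenizers,
# so this only gives a rough token-density estimate.
enc = tiktoken.get_encoding("cl100k_base")

query = "How to speedup LLMs?"
passages_text = [
    "vLLM is a fast and easy-to-use library for LLM inference and serving.",
    "Introduce *lookahead decoding*: a parallel decoding algo to accelerate LLM inference.",
]

# Longest passage + query, plus headroom for reserved tokens like [CLS] and [SEP].
longest_passage = max(len(enc.encode(p)) for p in passages_text)
estimate = longest_passage + len(enc.encode(query)) + 8
print(f"Estimated tokens per pair: {estimate}")  # pick a max_length just above this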
from flashrank import Ranker, RerankRequest

# Nano (~4MB), blazing fast model & competitive performance (ranking precision).
ranker = Ranker(max_length=128)

or

# Small (~34MB), slightly slower & best performance (ranking precision).
ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="/opt")

or

# Medium (~110MB), slower model with best zeroshot performance (ranking precision) on out-of-domain data.
ranker = Ranker(model_name="rank-T5-flan", cache_dir="/opt")

or

# Medium (~150MB), slower model with competitive performance (ranking precision) for 100+ languages (don't use for English).
ranker = Ranker(model_name="ms-marco-MultiBERT-L-12", cache_dir="/opt")

or

# Adjust max_length based on your passage length.
ranker = Ranker(model_name="rank_zephyr_7b_v1_full", max_length=1024)

# Metadata is optional; id can be your DB ids from your retrieval stage or simple numeric indices.
query = "How to speedup LLMs?"
passages = [
{
"id" : 1 ,
"text" : "Introduce *lookahead decoding*: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step." ,
"meta" : { "additional" : "info1" }
},
{
"id" : 2 ,
"text" : "LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper" ,
"meta" : { "additional" : "info2" }
},
{
"id" : 3 ,
"text" : "There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call `model.to_bettertransformer()` on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run." ,
"meta" : { "additional" : "info3" }
},
{
"id" : 4 ,
"text" : "Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup." ,
"meta" : { "additional" : "info4" }
},
{
"id" : 5 ,
"text" : "vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels" ,
"meta" : { "additional" : "info5" }
}
]
rerankrequest = RerankRequest(query=query, passages=passages)
results = ranker.rerank(rerankrequest)
print(results)

Reranked output from the default reranker:
[
{
"id" : 4 ,
"text" : "Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup." ,
"meta" :{
"additional" : "info4"
},
"score" : 0.016847236
},
{
"id" : 5 ,
"text" : "vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels" ,
"meta" :{
"additional" : "info5"
},
"score" : 0.011563735
},
{
"id" : 3 ,
"text" : "There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call `model.to_bettertransformer()` on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run." ,
"meta" :{
"additional" : "info3"
},
"score" : 0.00081340264
},
{
"id" : 1 ,
"text" : "Introduce *lookahead decoding*: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step." ,
"meta" :{
"additional" : "info1"
},
"score" : 0.00063596206
},
{
"id" : 2 ,
"text" : "LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper" ,
"meta" :{
"additional" : "info2"
},
"score" : 0.00024851
}
]
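Continuing from the example above, here is a minimal sketch of packing the top reranked passages into an LLM prompt before generation (the top_k value and the prompt template are assumptions; adapt them to your pipeline):

# Keep only the top-k reranked passages and assemble a context block for the LLM.
top_k = 3
context = "\n\n".join(hit["text"] for hit in results[:top_k])

prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {query}\n"
)
# `prompt` can now be sent to whichever LLM your pipeline uses.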


Deployment note: In AWS or other serverless environments, the entire VM is read-only, so you need to create your own custom directory. You can do this in your Dockerfile and use it for loading the models (and eventually as a cache between warm calls). Pass it at init time via the cache_dir parameter.
ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="/opt")
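For instance, a minimal sketch of pre-populating that cache at image build time (the build_cache.py filename and the Dockerfile RUN step are assumptions):

# build_cache.py -- run once during the container image build, e.g. `RUN python build_cache.py`
from flashrank import Ranker

# Instantiating the ranker downloads the model into cache_dir, so the read-only
# serverless runtime only needs to read from the pre-populated directory at request time.
Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="/opt")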

To cite this repository in your work, please use the "Cite this repository" link on the right-hand side (below the repo description and tags).

Papers citing FlashRank:
- Cos-Mix: Cosine Similarity and Distance Fusion for Improved Information Retrieval
- Bryndza at ClimateActivism 2024: Stance, Target and Hate Event Detection via Retrieval-Augmented GPT-4 and LLaMA
- Stance and Hate Event Detection in Tweets Related to Climate Activism - Shared Task at CASE 2024
A clone library called SwiftRank is pointing to our model buckets. We are working on an interim solution to prevent this plagiarism. Thank you for your patience and understanding.
Update: This issue has been resolved and the models are now on HF. Please upgrade with pip install -U flashrank to continue. Thank you for your patience and understanding.