FlashRank

Re-rank your search results with SoTA pairwise or listwise rerankers before feeding them into your LLMs.

Ultra-lite & super-fast Python library to add re-ranking to your existing search & retrieval pipelines. It is based on SoTA LLMs and cross-encoders, with gratitude to all the model owners.

Supports:

Pairwise / pointwise rerankers (cross-encoder based, i.e. max tokens = 512).
Listwise LLM-based rerankers (LLM based, i.e. max tokens = 8192).
See below for the full list of supported models.

Table of contents

Features
Installation
Making ranking faster
Getting started
Deployment patterns
How to cite?
Papers citing FlashRank

Features

⚡ Ultra-lite:
- No Torch or Transformers needed. Runs on CPU.
- Boasts the tiniest reranking model in the world, ~4 MB.
⏱️ Super-fast:
- Rerank speed is a function of the number of tokens in the passages and query, plus model depth (layers).
- To give an idea, the README reports the time taken by the example (in code) with the default model [timing figure not reproduced here].
- Detailed benchmarking, TBD.
💸 $ conscious:
- Lowest $ cost.
- Smaller package size = shorter cold start times, quicker re-deployments for serverless.
Based on SoTA cross-encoders and other models:
- "How good are zero-shot rerankers?" - see the References section.

Model name	Description	Size	Notes
`ms-marco-TinyBERT-L-2-v2`	Default model	~4 MB	Model card
`ms-marco-MiniLM-L-12-v2`	Best cross-encoder reranker	~34 MB	Model card
`rank-T5-flan`	Best non-cross-encoder reranker	~110 MB	Model card
`ms-marco-MultiBERT-L-12`	Multilingual, supports 100+ languages	~150 MB	Supported languages
`ce-esci-MiniLM-L12-v2`	Fine-tuned on the Amazon ESCI dataset	-	Model card
`rank_zephyr_7b_v1_full`	4-bit-quantized GGUF	~4 GB	Model card
`miniReranker_arabic_v1`	Only dedicated Arabic reranker	-	Model card

Models in roadmap:
- InRanker
Why are sleeker models preferred? Reranking is the final leg of larger retrieval pipelines, so the idea is to avoid any extra overhead, especially in user-facing scenarios. To that end, the chosen models have a really small footprint, need no specialised hardware, and still offer competitive performance. Feel free to raise issues to add support for new models as you see fit.

Installation:

If you need lightweight pairwise rerankers [default]

pip install flashrank

If you need LLM-based listwise rerankers

pip install "flashrank[listwise]"

Making ranking faster:

The max_length value should be just large enough to accommodate your longest passage plus the query. In other words, if your longest passage (100 tokens) + query (16 tokens) comes to 116 by token estimate, then max_length = 128 is good enough, including room for reserved tokens like [CLS] and [SEP]. Use libraries like OpenAI tiktoken to estimate token density if performance per token is critical for you. Nonchalantly setting a longer max_length like 512 for smaller passage sizes will negatively affect response time.
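The arithmetic above can be sketched as a small helper. This is a minimal sketch, not part of FlashRank: `estimate_tokens` and `suggest_max_length` are hypothetical names, the 4-characters-per-token ratio is only a rough rule of thumb for English text, and the three reserved tokens assume a BERT-style input of the form [CLS] query [SEP] passage [SEP]. Swap in a real tokenizer such as tiktoken when per-token accuracy matters.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token count; an assumption, not a real tokenizer."""
    return math.ceil(len(text) / chars_per_token)


def suggest_max_length(longest_passage_tokens: int, query_tokens: int,
                       reserved: int = 3, multiple: int = 64,
                       cap: int = 512) -> int:
    """Smallest multiple of `multiple` that fits passage + query + reserved tokens."""
    needed = longest_passage_tokens + query_tokens + reserved
    rounded = math.ceil(needed / multiple) * multiple
    # Never exceed the cross-encoder limit (512 tokens by default).
    return min(rounded, cap)


# The example from the text: 100-token passage + 16-token query.
print(suggest_max_length(100, 16))  # → 128
```

Rounding to a multiple of 64 is an arbitrary choice here; any value that covers the longest input without large slack serves the same purpose.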

Getting started:

from flashrank import Ranker, RerankRequest

# Nano (~4MB), blazing fast model & competitive performance (ranking precision).
ranker = Ranker(max_length=128)

# or

# Small (~34MB), slightly slower & best performance (ranking precision).
ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="/opt")

# or

# Medium (~110MB), slower model with best zero-shot performance (ranking precision) on out-of-domain data.
ranker = Ranker(model_name="rank-T5-flan", cache_dir="/opt")

# or

# Medium (~150MB), slower model with competitive performance (ranking precision) for 100+ languages (don't use for English).
ranker = Ranker(model_name="ms-marco-MultiBERT-L-12", cache_dir="/opt")

# or

# Listwise LLM reranker (~4GB, 4-bit-quantized GGUF); adjust max_length based on your passage length.
ranker = Ranker(model_name="rank_zephyr_7b_v1_full", max_length=1024)

# Metadata is optional. Id can be your DB ids from your retrieval stage or simple numeric indices.
query = "How to speedup LLMs?"
passages = [
   {
      "id": 1,
      "text": "Introduce *lookahead decoding*: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step.",
      "meta": {"additional": "info1"}
   },
   {
      "id": 2,
      "text": "LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper",
      "meta": {"additional": "info2"}
   },
   {
      "id": 3,
      "text": "There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call `model.to_bettertransformer()` on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run.",
      "meta": {"additional": "info3"}
   },
   {
      "id": 4,
      "text": "Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup.",
      "meta": {"additional": "info4"}
   },
   {
      "id": 5,
      "text": "vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels",
      "meta": {"additional": "info5"}
   }
]

rerankrequest = RerankRequest(query=query, passages=passages)
results = ranker.rerank(rerankrequest)
print(results)

# Reranked output from default reranker
[
   {
      "id" : 4 ,
      "text" : "Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup." ,
      "meta" :{
         "additional" : "info4"
      },
      "score" : 0.016847236
   },
   {
      "id" : 5 ,
      "text" : "vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels" ,
      "meta" :{
         "additional" : "info5"
      },
      "score" : 0.011563735
   },
   {
      "id" : 3 ,
      "text" : "There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call `model.to_bettertransformer()` on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run." ,
      "meta" :{
         "additional" : "info3"
      },
      "score" : 0.00081340264
   },
   {
      "id" : 1 ,
      "text" : "Introduce *lookahead decoding*: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step." ,
      "meta" :{
         "additional" : "info1"
      },
      "score" : 0.00063596206
   },
   {
      "id" : 2 ,
      "text" : "LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper" ,
      "meta" :{
         "additional" : "info2"
      },
      "score" : 0.00024851
   }
]
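A common next step is to keep only the best few passages and join them into a prompt context. The following is a minimal sketch, assuming the results are a list of dicts with `id`, `text` and `score` keys as in the output above; `build_context`, `top_k` and `min_score` are hypothetical names, not part of FlashRank.

```python
def build_context(results, top_k=3, min_score=0.0):
    """Join the top_k highest-scoring passages into one context string."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    kept = [r for r in ranked[:top_k] if r["score"] >= min_score]
    return "\n\n".join(f"[{r['id']}] {r['text']}" for r in kept)


# Toy results shaped like the reranked output (texts are shortened stand-ins).
sample = [
    {"id": 4, "text": "Medusa removes the draft model.", "score": 0.0168},
    {"id": 5, "text": "vLLM is a fast library for LLM inference.", "score": 0.0116},
    {"id": 3, "text": "Ways to increase LLM inference throughput.", "score": 0.0008},
]
print(build_context(sample, top_k=2))
```

Prefixing each passage with its id keeps the context traceable back to the retrieval stage; the score threshold is optional and depends on the model, since score ranges differ between rerankers.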

You can use it with any search & retrieval pipeline:

Lexical search (regular DBs that support full-text search or an inverted index)

Semantic search / RAG use cases (vector DBs)

Hybrid search
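Whatever the retrieval stage, its hits need to be turned into the `passages` list that `RerankRequest` takes. Below is a minimal sketch for the hybrid case, assuming each retriever returns (id, text) pairs; `merge_hits` and the `source` metadata key are hypothetical and only illustrate de-duplicating candidates before reranking.

```python
def merge_hits(lexical_hits, vector_hits):
    """Merge (id, text) hits from two retrievers into FlashRank-style passages.

    Duplicate ids are kept once; `meta` records which retrievers found them.
    """
    merged = {}
    for source, hits in (("lexical", lexical_hits), ("vector", vector_hits)):
        for doc_id, text in hits:
            passage = merged.setdefault(
                doc_id, {"id": doc_id, "text": text, "meta": {"source": []}}
            )
            passage["meta"]["source"].append(source)
    return list(merged.values())


lexical = [(1, "lookahead decoding"), (5, "vLLM serving")]
vector = [(5, "vLLM serving"), (4, "Medusa framework")]
passages = merge_hits(lexical, vector)
print([p["id"] for p in passages])  # → [1, 5, 4]
```

The merged list can then be passed as `passages` to `RerankRequest`, leaving the final ordering to the reranker.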

Deployment patterns

How to use it in an AWS Lambda function?

In AWS or other serverless environments the entire VM is read-only, so you might have to create your own custom dir. You can do so in your Dockerfile and use it for loading the models (and eventually as a cache between warm calls). You can set it during init with the cache_dir parameter.

ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="/opt")
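One way to apply this in a Lambda handler is to build the ranker once, outside the handler, so that warm invocations reuse the loaded model. The sketch below is an illustration under stated assumptions, not FlashRank's recommended layout: `make_handler` and the `query`/`passages` event fields are hypothetical, and the ranker and request class are passed in so the wiring can be exercised without loading a model.

```python
import json


def make_handler(ranker, request_cls):
    """Return a Lambda-style handler that reuses one ranker across invocations."""

    def handler(event, context):
        # Hypothetical event shape: {"query": str, "passages": [{...}, ...]}
        request = request_cls(query=event["query"], passages=event["passages"])
        results = ranker.rerank(request)
        # default=float covers scores that are NumPy floats, which json
        # cannot encode on its own.
        return {"statusCode": 200, "body": json.dumps(results, default=float)}

    return handler


# In the deployed module (assumes flashrank is installed in the image):
# from flashrank import Ranker, RerankRequest
# handler = make_handler(
#     Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="/opt"),
#     RerankRequest,
# )
```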

References:

In-domain and zero-shot performance of cross-encoders fine-tuned on MS-MARCO

In-domain and zero-shot performance of RankT5 fine-tuned on MS-MARCO

How to cite?

To cite this repository in your work, click the "Cite this repository" link on the right side (below the repo description and tags).

Papers citing FlashRank

COS-Mix: Cosine Similarity and Distance Fusion for Improved Information Retrieval
Bryndza at ClimateActivism 2024: Stance, Target and Hate Event Detection via Retrieval-Augmented GPT-4 and LLaMA
Stance and Hate Event Detection in Tweets Related to Climate Activism - Shared Task at CASE 2024

[Important update]

~~A clone library called SwiftRank is pointing to our model buckets. We are working on an interim solution to avoid this stealing. Thanks for your patience and understanding.~~

This issue is resolved; the models are now on HF. Please upgrade to continue: pip install -U flashrank. Thank you for your patience and understanding.

Additional information

Version: Minor fixes
Type: Other source code
Updated: 2025-05-24
Size: 2.25 MB
Source: GitHub
