
Re-rank your search results with SOTA pairwise or listwise rerankers before feeding them to your LLMs
Ultra-lite & super-fast Python library to add re-ranking to your existing search & retrieval pipelines. It is based on SOTA LLMs and cross-encoders, with gratitude to all the model owners.
Supports:
- Pairwise / pointwise rerankers (max tokens = 512)
- Listwise rerankers (max tokens = 8192)
- ⚡ Ultra-lite
- ⏱️ Super-fast
- 💸 $-conscious

Based on SOTA cross-encoders and other models:
| Model Name | Description | Size | Notes |
|---|---|---|---|
| ms-marco-TinyBERT-L-2-v2 | Default model | ~4 MB | Model card |
| ms-marco-MiniLM-L-12-v2 | Best cross-encoder reranker | ~34 MB | Model card |
| rank-T5-flan | Best non-cross-encoder reranker | ~110 MB | Model card |
| ms-marco-MultiBERT-L-12 | Multilingual, supports 100+ languages | ~150 MB | Supported languages |
| ce-esci-MiniLM-L12-v2 | Fine-tuned on the Amazon ESCI dataset | - | Model card |
| rank_zephyr_7b_v1_full | 4-bit GGUF | ~4 GB | Model card |
| miniReranker_arabic_v1 | Only dedicated Arabic reranker | - | Model card |
pip install flashrank

For listwise rerankers:

pip install flashrank[listwise]

The max_length value should be large enough to accommodate your longest passage. In other words, if your longest passage (100 tokens) plus query (16 tokens) comes to an estimated 116 tokens, then setting max_length = 128 is good enough, including room for reserved tokens like [CLS] and [SEP]. Use libraries such as OpenAI's tiktoken to estimate token counts if per-token throughput is critical for you. Needlessly passing a longer max_length such as 512 for smaller passage sizes will negatively affect response time.
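To make that sizing rule concrete, here is a minimal sketch. It assumes the tiktoken package is installed; its cl100k_base encoding is only a rough proxy for each reranker's own tokenizer, and estimate_max_length is an illustrative helper, not part of FlashRank:

```python
# Illustrative sketch (not part of FlashRank): approximate the token budget for
# query + longest passage with tiktoken, then pick a max_length bucket.
# cl100k_base is only a rough proxy for the rerankers' own tokenizers.
import tiktoken

def estimate_max_length(query: str, passage_texts: list[str], reserved_tokens: int = 3) -> int:
    enc = tiktoken.get_encoding("cl100k_base")
    longest_passage = max(len(enc.encode(t)) for t in passage_texts)
    needed = len(enc.encode(query)) + longest_passage + reserved_tokens  # room for [CLS]/[SEP]
    for bucket in (128, 256, 512, 1024):  # smallest bucket that fits
        if needed <= bucket:
            return bucket
    return 1024

# Example: estimate_max_length("How to speedup LLMs?", [p["text"] for p in passages])
```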
from flashrank import Ranker , RerankRequest
# Nano (~4MB), blazing fast model & competitive performance (ranking precision).
ranker = Ranker ( max_length = 128 )
or
# Small (~34MB), slightly slower & best performance (ranking precision).
ranker = Ranker ( model_name = "ms-marco-MiniLM-L-12-v2" , cache_dir = "/opt" )
or
# Medium (~110MB), slower model with best zeroshot performance (ranking precision) on out of domain data.
ranker = Ranker ( model_name = "rank-T5-flan" , cache_dir = "/opt" )
or
# Medium (~150MB), slower model with competitive performance (ranking precision) for 100+ languages (don't use for english)
ranker = Ranker ( model_name = "ms-marco-MultiBERT-L-12" , cache_dir = "/opt" )
or
ranker = Ranker ( model_name = "rank_zephyr_7b_v1_full" , max_length = 1024 ) # adjust max_length based on your passage length # Metadata is optional, Id can be your DB ids from your retrieval stage or simple numeric indices.
query = "How to speedup LLMs?"
passages = [
{
"id" : 1 ,
"text" : "Introduce *lookahead decoding*: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step." ,
"meta" : { "additional" : "info1" }
},
{
"id" : 2 ,
"text" : "LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper" ,
"meta" : { "additional" : "info2" }
},
{
"id" : 3 ,
"text" : "There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call `model.to_bettertransformer()` on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run." ,
"meta" : { "additional" : "info3" }
},
{
"id" : 4 ,
"text" : "Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup." ,
"meta" : { "additional" : "info4" }
},
{
"id" : 5 ,
"text" : "vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels" ,
"meta" : { "additional" : "info5" }
}
]
rerankrequest = RerankRequest ( query = query , passages = passages )
results = ranker . rerank ( rerankrequest )
print ( results ) # Reranked output from default reranker
[
{
"id" : 4 ,
"text" : "Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup." ,
"meta" :{
"additional" : "info4"
},
"score" : 0.016847236
},
{
"id" : 5 ,
"text" : "vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels" ,
"meta" :{
"additional" : "info5"
},
"score" : 0.011563735
},
{
"id" : 3 ,
"text" : "There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call `model.to_bettertransformer()` on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run." ,
"meta" :{
"additional" : "info3"
},
"score" : 0.00081340264
},
{
"id" : 1 ,
"text" : "Introduce *lookahead decoding*: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step." ,
"meta" :{
"additional" : "info1"
},
"score" : 0.00063596206
},
{
"id" : 2 ,
"text" : "LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper" ,
"meta" :{
"additional" : "info2"
},
"score" : 0.00024851
}
]
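Once you have the reranked list, a common next step is to keep only the top few passages and build the context for your LLM. A minimal sketch, assuming the results structure shown above (the build_prompt helper and the top_k cutoff are illustrative, not part of FlashRank):

```python
# Illustrative sketch: trim the reranked results to the top-k passages and
# assemble an LLM prompt from them. `top_k` and `build_prompt` are
# hypothetical choices, not part of the FlashRank API.
def build_prompt(query: str, results: list[dict], top_k: int = 3) -> str:
    context = "\n\n".join(r["text"] for r in results[:top_k])  # results are already sorted by score
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

prompt = build_prompt(query, results)
# Send `prompt` to the LLM of your choice.
```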


On AWS or other serverless environments the entire VM is read-only, so you may need to create your own custom directory. You can do that in your Dockerfile and use it for loading the models (and eventually as a cache between warm calls). You can set it during init with the cache_dir parameter.
ranker = Ranker ( model_name = "ms-marco-MiniLM-L-12-v2" , cache_dir = "/opt" )

To cite this repository in your work, click the "Cite this repository" link on the right-hand side (below the repository description and tags).
COS-Mix: Cosine Similarity and Distance Fusion for Improved Information Retrieval
Bryndza at ClimateActivism 2024: Stance, Target and Hate Event Detection via Retrieval-Augmented GPT-4 and LLaMA
Stance and Hate Event Detection in Tweets Related to Climate Activism: Shared Task at CASE 2024
A clone library called SwiftRank is pointing to our model buckets; we are working on an interim solution to prevent this theft. Thanks for your patience and understanding.
This issue is resolved; the models are on HF now. Please upgrade to continue: pip install -U flashrank. Thanks for your patience and understanding.