FlashRank

Re-rank your search results with SOTA pairwise or listwise rerankers before feeding them into your LLMs.

Ultra-lite and super-fast Python library to add re-ranking to your existing search and retrieval pipelines. It is based on SOTA LLMs and cross-encoders, with gratitude to all the model owners.

Supports:

Pairwise / pointwise rerankers. (Cross-encoder based, i.e. max tokens = 512)
Listwise LLM-based rerankers. (LLM based, i.e. max tokens = 8192)
See below for the full list of supported models.

Features

⚡ Ultra-lite:
- No Torch or Transformers needed. Runs on CPU.
- Boasts the tiniest reranking model in the world, ~4 MB.
⏱ Super-fast:
- Rerank speed is a function of the number of tokens in the passages and query, plus model depth (layers).
- To give an idea, the time taken by the example (in code) using the default model is below.
- Detailed benchmarking: TBD.
$ conscious:
- Lowest $ per invocation: serverless deployments like Lambda are charged by memory and time per invocation*
- Smaller package size = shorter cold-start times and quicker re-deployments for serverless.
Based on SOTA cross-encoders and other models:
- "How good are zero-shot rerankers?" - look at the references section.

Model name	Description	Size	Notes
`ms-marco-TinyBERT-L-2-v2`	Default model	~4 MB	Model card
`ms-marco-MiniLM-L-12-v2`	Best cross-encoder reranker	~34 MB	Model card
`rank-T5-flan`	Best non cross-encoder reranker	~110 MB	Model card
`ms-marco-MultiBERT-L-12`	Multilingual, supports 100+ languages	~150 MB	Supported languages
`ce-esci-MiniLM-L12-v2`	Fine-tuned on the Amazon ESCI dataset	-	Model card
`rank_zephyr_7b_v1_full`	4-bit quantized GGUF	~4 GB	Model card
`miniReranker_arabic_v1`	Only dedicated Arabic reranker	-	Model card

Models on the roadmap:
- InRanker

Why are sleeker models preferred? Reranking is the final leg of larger retrieval pipelines, so the idea is to avoid any extra overhead, especially in user-facing scenarios. To that end, the models chosen have a really small footprint, need no specialised hardware, and still offer competitive performance. Feel free to raise issues to add support for new models as you see fit.

Installation:

If you need lightweight pairwise rerankers [default]

pip install flashrank

If you need LLM-based listwise rerankers

pip install flashrank[listwise]

Making ranking faster:

The max_length value should be just large enough to accommodate your longest passage. In other words, if your longest passage (100 tokens) plus query (16 tokens) comes to an estimated 116 tokens, then setting max_length = 128 is good enough, including room for reserved tokens like [CLS] and [SEP]. Use libraries like OpenAI tiktoken to estimate token density if performance per token is critical for you. Nonchalantly giving a longer max_length such as 512 for smaller passage sizes will negatively affect response time.
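
The sizing rule above can be sketched as a small helper. This is only an illustration: `suggest_max_length` and its parameters (`reserved`, `step`, `cap`) are hypothetical names, not part of FlashRank, and the token counts are assumed to come from whatever tokenizer or estimator you already use.

```python
def suggest_max_length(query_tokens, passage_token_counts, reserved=3, step=64, cap=512):
    """Smallest multiple of `step` that fits the longest query + passage pair.

    reserved=3 assumes a BERT-style pair layout: [CLS] query [SEP] passage [SEP].
    cap=512 mirrors the cross-encoder token limit mentioned earlier.
    """
    longest_pair = query_tokens + max(passage_token_counts) + reserved
    rounded = -(-longest_pair // step) * step  # ceiling to the next multiple of step
    return min(rounded, cap)


# The example from the text: a 100-token longest passage and a 16-token query.
print(suggest_max_length(16, [100, 42, 87]))  # 128
```

Rounding up to a fixed step is a design choice for this sketch; any value that covers the longest pair plus reserved tokens serves the same purpose.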

Getting started:

from flashrank import Ranker, RerankRequest

# Pick ONE of the rankers below.

# Nano (~4MB), blazing fast model & competitive performance (ranking precision).
ranker = Ranker(max_length=128)

# or

# Small (~34MB), slightly slower & best performance (ranking precision).
ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="/opt")

# or

# Medium (~110MB), slower model with best zeroshot performance (ranking precision) on out of domain data.
ranker = Ranker(model_name="rank-T5-flan", cache_dir="/opt")

# or

# Medium (~150MB), slower model with competitive performance (ranking precision) for 100+ languages (don't use for english)
ranker = Ranker(model_name="ms-marco-MultiBERT-L-12", cache_dir="/opt")

# or

# Listwise LLM-based reranker (~4GB, 4-bit GGUF); adjust max_length based on your passage length.
ranker = Ranker(model_name="rank_zephyr_7b_v1_full", max_length=1024)

# Metadata is optional. Id can be your DB ids from your retrieval stage or simple numeric indices.
query = "How to speedup LLMs?"
passages = [
   {
      "id" : 1 ,
      "text" : "Introduce *lookahead decoding*: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step." ,
      "meta" : { "additional" : "info1" }
   },
   {
      "id" : 2 ,
      "text" : "LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper" ,
      "meta" : { "additional" : "info2" }
   },
   {
      "id" : 3 ,
      "text" : "There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call `model.to_bettertransformer()` on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run." ,
      "meta" : { "additional" : "info3" }
   },
   {
      "id" : 4 ,
      "text" : "Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup." ,
      "meta" : { "additional" : "info4" }
   },
   {
      "id" : 5 ,
      "text" : "vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels" ,
      "meta" : { "additional" : "info5" }
   }
]

rerankrequest = RerankRequest(query=query, passages=passages)
results = ranker.rerank(rerankrequest)
print(results)

# Reranked output from default reranker
[
   {
      "id" : 4 ,
      "text" : "Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup." ,
      "meta" :{
         "additional" : "info4"
      },
      "score" : 0.016847236
   },
   {
      "id" : 5 ,
      "text" : "vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels" ,
      "meta" :{
         "additional" : "info5"
      },
      "score" : 0.011563735
   },
   {
      "id" : 3 ,
      "text" : "There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call `model.to_bettertransformer()` on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run." ,
      "meta" :{
         "additional" : "info3"
      },
      "score" : 0.00081340264
   },
   {
      "id" : 1 ,
      "text" : "Introduce *lookahead decoding*: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step." ,
      "meta" :{
         "additional" : "info1"
      },
      "score" : 0.00063596206
   },
   {
      "id" : 2 ,
      "text" : "LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper" ,
      "meta" :{
         "additional" : "info2"
      },
      "score" : 0.00024851
   }
]
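
Since the output is a list of dicts with a "score" key, a common next step is to keep only the best few passages before building the LLM prompt. A minimal sketch, assuming that output shape; `top_k` and `min_score` are hypothetical names for illustration and not part of FlashRank, and the sample data reuses the ids and scores shown above.

```python
def top_k(results, k=3, min_score=0.0):
    """Keep at most k results whose score is at least min_score, best first."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [r for r in ranked if r["score"] >= min_score][:k]


# Ids and scores copied from the reranked output above (text and meta omitted).
sample = [
    {"id": 4, "score": 0.016847236},
    {"id": 5, "score": 0.011563735},
    {"id": 3, "score": 0.00081340264},
    {"id": 1, "score": 0.00063596206},
    {"id": 2, "score": 0.00024851},
]
print([r["id"] for r in top_k(sample, k=3, min_score=0.001)])  # [4, 5]
```

The threshold value here is arbitrary; scores are model-specific, so any cutoff should be tuned on your own data.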

You can use it with any search and retrieval pipeline:

Lexical search (regular DBs that support full-text search or an inverted index)

Semantic search / RAG use cases (vector DBs)

Hybrid search
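
For the hybrid case, one way to prepare input for the reranker is to union the candidates from both retrievers and drop duplicates. The sketch below assumes each retriever returns dicts with "id" and "text" keys; `merge_candidates` and the sample hits are hypothetical and only illustrate the shape of the data.

```python
def merge_candidates(lexical_hits, vector_hits):
    """Union two candidate lists, keeping the first occurrence of each id."""
    seen = set()
    passages = []
    for hit in list(lexical_hits) + list(vector_hits):
        if hit["id"] in seen:
            continue  # already collected from the other retriever
        seen.add(hit["id"])
        passages.append({"id": hit["id"], "text": hit["text"]})
    return passages


# Made-up hits standing in for a full-text search and a vector search.
lexical_hits = [{"id": 1, "text": "lookahead decoding"}, {"id": 5, "text": "vLLM serving"}]
vector_hits = [{"id": 5, "text": "vLLM serving"}, {"id": 4, "text": "Medusa framework"}]
passages = merge_candidates(lexical_hits, vector_hits)
print([p["id"] for p in passages])  # [1, 5, 4]
```

The merged list has the same shape as the `passages` list in the getting-started example, so it can be passed to `RerankRequest` in the same way.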

Deployment patterns

How to use it in an AWS Lambda function?

In AWS Lambda or other serverless environments the entire VM is read-only, so you might have to create your own custom directory. You can do so in your Dockerfile and use it for loading the models (and eventually as a cache between warm calls). Pass that directory at init time with the cache_dir parameter.

ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="/opt")
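
To reuse the model between warm calls, the ranker can be created once at module level and shared by every invocation. The handler below is a rough sketch, not an official pattern: the event shape (`query`, `passages`), the `get_ranker` helper, and the response format are assumptions made for illustration. Only `Ranker`, `RerankRequest`, `rerank`, `model_name`, and `cache_dir` come from the examples above.

```python
import json

_ranker = None  # module-level, so a warm container reuses the loaded model


def get_ranker():
    """Create the Ranker once per container (same arguments as the line above)."""
    global _ranker
    if _ranker is None:
        from flashrank import Ranker
        _ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="/opt")
    return _ranker


def handler(event, context):
    # Assumed event shape: {"query": str, "passages": [{"id": ..., "text": ...}, ...]}
    from flashrank import RerankRequest

    request = RerankRequest(query=event["query"], passages=event["passages"])
    results = get_ranker().rerank(request)
    # Cast scores to plain floats so json.dumps can always serialize them.
    body = [{"id": r["id"], "score": float(r["score"])} for r in results]
    return {"statusCode": 200, "body": json.dumps(body)}
```

The first invocation in a new container pays for model loading; later invocations in the same container skip it.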

References:

In-domain and zero-shot performance of cross-encoders fine-tuned on MS-MARCO

In-domain and zero-shot

How to cite?

To cite this repository in your work, click the "Cite this repository" link on the right side (below the repo description and tags).

Papers citing FlashRank

COS-Mix: Cosine Similarity and Distance Fusion for Improved Information Retrieval
Bryndza at ClimateActivism 2024: Stance, Target and Hate Event Detection via Retrieval-Augmented GPT-4 and LLaMA
Stance and Hate Event Detection in Tweets Related to Climate Activism - Shared Task at CASE 2024

[Important update]

A clone library called SwiftRank is pointing to our model buckets; we are working on an interim solution to avoid this stealing. Thanks for your patience and understanding.

This issue is resolved, and the models are now on HF. Please upgrade to continue: pip install -U flashrank. Thank you for your patience and understanding.

Additional information

Version: Minor fixes
Type: Other source code
Updated: 2025-05-24
Size: 2.25 MB
Source: GitHub
