ดาวน์โหลด FlashRank - ดาวน์โหลดซอร์สโค้ด FlashRank

FlashRank

ซอร์สโค้ดอื่น ๆ

Minor fixes

ดาวน์โหลด

จัดอันดับผลการค้นหาของคุณอีกครั้งด้วย sota pairwise หรือ rearankers listwise ก่อนที่จะป้อนเข้าสู่ LLM ของคุณ

ไลบรารี Python Ultra-Lite & Super-Fast เพื่อเพิ่มการจัดอันดับใหม่ให้กับการค้นหาและการดึงข้อมูลที่มีอยู่ของคุณ มันขึ้นอยู่กับ SOTA LLMS และ cross-encoders โดยมีความกตัญญูต่อเจ้าของโมเดลทุกคน

สนับสนุน:

Rerankers คู่ / pointwise (อิง Cross Encoder เช่น Max tokens = 512 )
Rerankers ที่ใช้ LLM ตาม LLM (ตาม LLM เช่น Max tokens = 8192 )
ดูด้านล่างสำหรับรายการเต็มรูปแบบที่รองรับ

สารบัญ

คุณสมบัติ
การติดตั้ง
การจัดอันดับเร็วขึ้น
เริ่มต้น
รูปแบบการปรับใช้
วิธีการอ้างอิง?
เอกสารที่อ้างถึง Flashrank

คุณสมบัติ

⚡ Ultra-Lite :
- ไม่จำเป็นต้องใช้คบเพลิงหรือหม้อแปลง ทำงานบน CPU
- มี รูปแบบการเปลี่ยนเส้นทางที่น้อยที่สุดในโลก ~ 4MB
⏱ เร็วมาก :
- Rerank Speed เป็นฟังก์ชั่นของ # of Tokens ในข้อความ, Query + Model Depth (เลเยอร์)
- ในการให้แนวคิดเวลาที่ใช้โดยตัวอย่าง (ในรหัส) โดยใช้โมเดลเริ่มต้นอยู่ด้านล่าง
- การเปรียบเทียบอย่างละเอียด TBD
- $ concious :
- ต่ำสุด $ ต่อการเรียกร้อง: การปรับใช้แบบไม่มีเซิร์ฟเวอร์เช่นแลมบ์ดาถูกเรียกเก็บเงินโดยหน่วยความจำและเวลาต่อการเรียกใช้*
- ขนาดแพ็คเกจที่เล็กกว่า = เวลาเริ่มต้นเย็นที่สั้นลง, การปรับใช้ใหม่เร็วขึ้นสำหรับ Serverless
ขึ้นอยู่กับการเข้ารหัสข้าม SOTA และรุ่นอื่น ๆ :
- "Rerankers zero-shot ดีแค่ไหน?" - ดูที่ส่วนอ้างอิง

ชื่อนางแบบ	คำอธิบาย	ขนาด	หมายเหตุ
`ms-marco-TinyBERT-L-2-v2`	รุ่นเริ่มต้น	~ 4MB	การ์ดรุ่น
`ms-marco-MiniLM-L-12-v2`	`Best Cross-encoder reranker`	~ 34MB	การ์ดรุ่น
`rank-T5-flan`	Reranker ที่ไม่ได้เข้ารหัสที่ดีที่สุด	~ 110MB	การ์ดรุ่น
`ms-marco-MultiBERT-L-12`	หลายภาษารองรับ 100 ภาษา	~ 150MB	ภาษาที่รองรับ
`ce-esci-MiniLM-L12-v2`	ปรับแต่งในชุดข้อมูล Amazon Esci	-	การ์ดรุ่น
`rank_zephyr_7b_v1_full`	GGUF แบบ 4 บิต	~ 4GB	การ์ดรุ่น
`miniReranker_arabic_v1`	`Only dedicated Arabic Reranker`	-	การ์ดรุ่น

โมเดลในแผนงาน:
- คนรับรู้
ทำไมโมเดลที่เพรียวบางถึงเป็นที่ต้องการ? Reranking เป็นขาสุดท้ายของท่อดึงขนาดใหญ่ความคิดคือการหลีกเลี่ยงค่าใช้จ่ายเพิ่มเติมใด ๆ โดยเฉพาะอย่างยิ่งสำหรับสถานการณ์ที่ผู้ใช้หันหน้าเข้าหา ในตอนท้ายของรุ่นที่มีรอยเท้าขนาดเล็กที่ไม่ต้องการฮาร์ดแวร์พิเศษใด ๆ และยังมีการเลือกประสิทธิภาพการแข่งขัน อย่าลังเลที่จะเพิ่มปัญหาเพื่อเพิ่มการสนับสนุนสำหรับรุ่นใหม่ตามที่คุณเห็นว่าเหมาะสม

การติดตั้ง:

หากคุณต้องการ Rerankers คู่ที่มีน้ำหนักเบา [ค่าเริ่มต้น]

 pip install flashrank

หากคุณต้องการ Rerankers ตามรายการ LLM

 pip install flashrank [ listwise ]

การจัดอันดับเร็วขึ้น:

ค่า max_length ควรมีขนาดใหญ่สามารถรองรับทางเดินที่ยาวที่สุดของคุณได้ กล่าวอีกนัยหนึ่งถ้าข้อความที่ยาวที่สุดของคุณ (100 โทเค็น) + เคียวรี (16 โทเค็น) คู่โดยประมาณการโทเค็นคือ 116 จากนั้นบอกว่าการตั้งค่า max_length = 128 เป็นห้องที่ดีพอสำหรับโทเค็นที่สงวนไว้เช่น [CLS] และ [SEP] ใช้ openai tiktoken เช่นห้องสมุดเพื่อประเมินความหนาแน่นของโทเค็นหากประสิทธิภาพต่อโทเค็นเป็นสิ่งสำคัญสำหรับคุณ การไม่ให้ max_length ที่ยาวขึ้นเช่น 512 สำหรับขนาดที่เล็กกว่าจะส่งผลเสียต่อเวลาตอบสนอง

เริ่มต้นใช้งาน:

 from flashrank import Ranker , RerankRequest

# Nano (~4MB), blazing fast model & competitive performance (ranking precision).

ranker = Ranker ( max_length = 128 )

or 

# Small (~34MB), slightly slower & best performance (ranking precision).
ranker = Ranker ( model_name = "ms-marco-MiniLM-L-12-v2" , cache_dir = "/opt" )

or 

# Medium (~110MB), slower model with best zeroshot performance (ranking precision) on out of domain data.
ranker = Ranker ( model_name = "rank-T5-flan" , cache_dir = "/opt" )

or 

# Medium (~150MB), slower model with competitive performance (ranking precision) for 100+ languages  (don't use for english)
ranker = Ranker ( model_name = "ms-marco-MultiBERT-L-12" , cache_dir = "/opt" )

or 

ranker = Ranker ( model_name = "rank_zephyr_7b_v1_full" , max_length = 1024 ) # adjust max_length based on your passage length

 # Metadata is optional, Id can be your DB ids from your retrieval stage or simple numeric indices.
query = "How to speedup LLMs?"
passages = [
   {
      "id" : 1 ,
      "text" : "Introduce *lookahead decoding*: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step." ,
      "meta" : { "additional" : "info1" }
   },
   {
      "id" : 2 ,
      "text" : "LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper" ,
      "meta" : { "additional" : "info2" }
   },
   {
      "id" : 3 ,
      "text" : "There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call `model.to_bettertransformer()` on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run." ,
      "meta" : { "additional" : "info3" }

   },
   {
      "id" : 4 ,
      "text" : "Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup." ,
      "meta" : { "additional" : "info4" }
   },
   {
      "id" : 5 ,
      "text" : "vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels" ,
      "meta" : { "additional" : "info5" }
   }
]

rerankrequest = RerankRequest ( query = query , passages = passages )
results = ranker . rerank ( rerankrequest )
print ( results )

 # Reranked output from default reranker
[
   {
      "id" : 4 ,
      "text" : "Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup." ,
      "meta" :{
         "additional" : "info4"
      },
      "score" : 0.016847236
   },
   {
      "id" : 5 ,
      "text" : "vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels" ,
      "meta" :{
         "additional" : "info5"
      },
      "score" : 0.011563735
   },
   {
      "id" : 3 ,
      "text" : "There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call `model.to_bettertransformer()` on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run." ,
      "meta" :{
         "additional" : "info3"
      },
      "score" : 0.00081340264
   },
   {
      "id" : 1 ,
      "text" : "Introduce *lookahead decoding*: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step." ,
      "meta" :{
         "additional" : "info1"
      },
      "score" : 0.00063596206
   },
   {
      "id" : 2 ,
      "text" : "LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper" ,
      "meta" :{
         "additional" : "info2"
      },
      "score" : 0.00024851
   }
]

คุณสามารถใช้กับไปป์ไลน์การค้นหาและดึงข้อมูลใด ๆ :

การค้นหาคำศัพท์ (regulardbs ที่รองรับการค้นหาข้อความแบบเต็มหรือดัชนีกลับด้าน)

ความหมายการค้นหา / rag usecases (vectordbs)

การค้นหาแบบไฮบริด

รูปแบบการปรับใช้

จะใช้ในฟังก์ชั่น AWS Lambda ได้อย่างไร?

ใน AWS หรือสภาพแวดล้อมที่ไม่มีเซิร์ฟเวอร์อื่น ๆ VM ทั้งหมดเป็นแบบอ่านอย่างเดียวคุณอาจต้องสร้าง DIR ที่กำหนดเองของคุณเอง คุณสามารถทำได้ใน DockerFile ของคุณและใช้สำหรับการโหลดโมเดล (และในที่สุดก็เป็นแคชระหว่างการโทรที่อบอุ่น) คุณสามารถทำได้ในระหว่างการเริ่มต้นด้วยพารามิเตอร์ CACHE_DIR

 ranker = Ranker ( model_name = "ms-marco-MiniLM-L-12-v2" , cache_dir = "/opt" )

ข้อมูลอ้างอิง:

ประสิทธิภาพในโดเมนและ ZeroShot ของ Cross Encoders ปรับแต่งอย่างละเอียดบน MS-Marco

ประสิทธิภาพในโดเมนและ ZeroShot ของ RankT5 ปรับแต่งบน MS-Marco

วิธีการอ้างอิง?

หากต้องการอ้างถึงที่เก็บข้อมูลนี้ในงานของคุณโปรดคลิกลิงก์ "อ้างอิงที่เก็บนี้" ทางด้านขวา (คำอธิบายและแท็ก repo ของ Bewlow)

เอกสารที่อ้างถึง Flashrank

cos-mix: ความคล้ายคลึงกันของโคไซน์และระยะทางฟิวชั่นสำหรับการปรับปรุงข้อมูลที่ได้รับการปรับปรุง
Bryndza ที่ ClimateActivism 2024: ท่าทางการตรวจจับเหตุการณ์และความเกลียดชังผ่านการค้นพบ GPT-4 และ Llama
การตรวจจับเหตุการณ์ท่าทางและความเกลียดชังในทวีตที่เกี่ยวข้องกับการเคลื่อนไหวของสภาพภูมิอากาศ - งานที่ใช้ร่วมกันในกรณีปี 2024

[การอัปเดตที่สำคัญ]

ห้องสมุดโคลนที่เรียกว่า Swiftrank กำลังชี้ไปที่ถังโมเดลของเราเรากำลังทำงานกับโซลูชันชั่วคราวเพื่อหลีกเลี่ยงการขโมยนี้ ขอบคุณสำหรับความอดทนและความเข้าใจ

ปัญหานี้ได้รับการแก้ไขโมเดลอยู่ใน HF ในขณะนี้ กรุณาอัพเกรดเพื่อดำเนินการ ติดตั้ง PIP ต่อ -U Flashrank ขอบคุณสำหรับความอดทนและความเข้าใจ

ขยาย

ข้อมูลเพิ่มเติม

เวอร์ชัน Minor fixes
ประเภท ซอร์สโค้ดอื่น ๆ
เวลาอัปเดต 2025-05-24
ขนาด 2.25MB
มาจาก Github

แอปที่เกี่ยวข้อง

Google Dorks

2025-03-10
shepherd

2025-06-04
mongo express

2025-06-04
hidusbf

2025-02-14
Free Algorithms Books

2025-05-29
markdownpedia

2025-04-22

แนะนำสำหรับคุณ

chat.petals.dev

ซอร์สโค้ดอื่น ๆ

1.0.0
GPT Prompt Templates

ซอร์สโค้ดอื่น ๆ

1.0.0
GPTyped

ซอร์สโค้ดอื่น ๆ

GPTyped 1.0.5
Google Dorks

ซอร์สโค้ดอื่น ๆ

1.0
shepherd

ซอร์สโค้ดอื่น ๆ

v6.1.6-react-shepherd: Prepare Release (#3063)
mongo express

ซอร์สโค้ดอื่น ๆ

v1.1.0-rc-3
Google Dorks

ซอร์สโค้ดอื่น ๆ

1.0
shepherd

ซอร์สโค้ดอื่น ๆ

v6.1.6-react-shepherd: Prepare Release (#3063)
mongo express

ซอร์สโค้ดอื่น ๆ

v1.1.0-rc-3

ข้อมูลที่เกี่ยวข้อง ทั้งหมด