
LightRAG: Simple and Fast Retrieval-Augmented Generation

This repository hosts the code of LightRAG. The structure of this code is based on nano-graphrag.

Algorithm Flowchart

Figure 1: LightRAG indexing flowchart
Figure 2: LightRAG retrieval and querying flowchart

Install

Install from source (recommended)

cd LightRAG
pip install -e .

Install from PyPI

pip install lightrag-hku

Quick Start

A video demo of running LightRAG locally.
All the code can be found in the examples directory.
If you use OpenAI models, set your OpenAI API key in the environment: export OPENAI_API_KEY="sk-...".
Download the demo text "A Christmas Carol" by Charles Dickens:

curl https://raw.githubusercontent.com/gusye1234/nano-graphrag/main/tests/mock_data.txt > ./book.txt

Use the Python snippet below (in a script) to initialize LightRAG and perform queries:

import os
from lightrag import LightRAG, QueryParam
from lightrag.llm import gpt_4o_mini_complete, gpt_4o_complete

#########
# Uncomment the below two lines if running in a jupyter notebook to handle the async nature of rag.insert()
# import nest_asyncio
# nest_asyncio.apply()
#########

WORKING_DIR = "./dickens"


if not os.path.exists(WORKING_DIR):
    os.mkdir(WORKING_DIR)

rag = LightRAG(
    working_dir=WORKING_DIR,
    llm_model_func=gpt_4o_mini_complete  # Use gpt_4o_mini_complete LLM model
    # llm_model_func=gpt_4o_complete  # Optionally, use a stronger model
)

with open("./book.txt") as f:
    rag.insert(f.read())

# Perform naive search
print(rag.query("What are the top themes in this story?", param=QueryParam(mode="naive")))

# Perform local search
print(rag.query("What are the top themes in this story?", param=QueryParam(mode="local")))

# Perform global search
print(rag.query("What are the top themes in this story?", param=QueryParam(mode="global")))

# Perform hybrid search
print(rag.query("What are the top themes in this story?", param=QueryParam(mode="hybrid")))

Using OpenAI-like APIs

LightRAG also supports OpenAI-like chat/embedding APIs:

import os

import numpy as np
from lightrag import LightRAG
from lightrag.llm import openai_complete_if_cache, openai_embedding
from lightrag.utils import EmbeddingFunc

async def llm_model_func(
    prompt, system_prompt=None, history_messages=[], **kwargs
) -> str:
    return await openai_complete_if_cache(
        "solar-mini",
        prompt,
        system_prompt=system_prompt,
        history_messages=history_messages,
        api_key=os.getenv("UPSTAGE_API_KEY"),
        base_url="https://api.upstage.ai/v1/solar",
        **kwargs
    )

async def embedding_func(texts: list[str]) -> np.ndarray:
    return await openai_embedding(
        texts,
        model="solar-embedding-1-large-query",
        api_key=os.getenv("UPSTAGE_API_KEY"),
        base_url="https://api.upstage.ai/v1/solar"
    )

rag = LightRAG(
    working_dir=WORKING_DIR,
    llm_model_func=llm_model_func,
    embedding_func=EmbeddingFunc(
        embedding_dim=4096,
        max_token_size=8192,
        func=embedding_func
    )
)

Using Hugging Face Models

To use Hugging Face models, configure LightRAG as follows:

from lightrag.llm import hf_model_complete, hf_embedding
from transformers import AutoModel, AutoTokenizer
from lightrag.utils import EmbeddingFunc

# Initialize LightRAG with Hugging Face model
rag = LightRAG(
    working_dir=WORKING_DIR,
    llm_model_func=hf_model_complete,  # Use Hugging Face model for text generation
    llm_model_name='meta-llama/Llama-3.1-8B-Instruct',  # Model name from Hugging Face
    # Use Hugging Face embedding function
    embedding_func=EmbeddingFunc(
        embedding_dim=384,
        max_token_size=5000,
        func=lambda texts: hf_embedding(
            texts,
            tokenizer=AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2"),
            embed_model=AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
        )
    ),
)

Using Ollama Models

Overview

To use Ollama models, pull the model you plan to use for generation and an embedding model, for example nomic-embed-text.

Then configure LightRAG as follows:

from lightrag.llm import ollama_model_complete, ollama_embedding
from lightrag.utils import EmbeddingFunc

# Initialize LightRAG with Ollama model
rag = LightRAG(
    working_dir=WORKING_DIR,
    llm_model_func=ollama_model_complete,  # Use Ollama model for text generation
    llm_model_name='your_model_name',  # Your model name
    # Use Ollama embedding function
    embedding_func=EmbeddingFunc(
        embedding_dim=768,
        max_token_size=8192,
        func=lambda texts: ollama_embedding(
            texts,
            embed_model="nomic-embed-text"
        )
    ),
)

Using Neo4J for Storage

For production-level scenarios you will most likely want to use an enterprise solution for KG storage. Running Neo4J in Docker is recommended for seamless local testing.
See https://hub.docker.com/_/neo4j

export NEO4J_URI="neo4j://localhost:7687"
export NEO4J_USERNAME="neo4j"
export NEO4J_PASSWORD="password"

When you launch the project, be sure to override the default KG (NetworkX) by specifying kg="Neo4JStorage".

# Note: Default settings use NetworkX
# Initialize LightRAG with Neo4J implementation.
WORKING_DIR = "./local_neo4jWorkDir"

rag = LightRAG(
    working_dir=WORKING_DIR,
    llm_model_func=gpt_4o_mini_complete,  # Use gpt_4o_mini_complete LLM model
    kg="Neo4JStorage",  # <-----------override KG default
    log_level="DEBUG"  # <-----------override log_level default
)

For a working example, see test_neo4j.py.

Increasing context size

For LightRAG to work, the context should be at least 32k tokens. By default, Ollama models have a context size of 8k. You can increase it in one of two ways.

Increasing the `num_ctx` parameter in the Modelfile.

Pull the model:

ollama pull qwen2

Display the model file:

ollama show --modelfile qwen2 > Modelfile

Edit the Modelfile by adding the following line:

PARAMETER num_ctx 32768

Create the modified model:

ollama create -f Modelfile qwen2m

Setting `num_ctx` via the Ollama API.

You can use the llm_model_kwargs param to configure Ollama:

rag = LightRAG(
    working_dir=WORKING_DIR,
    llm_model_func=ollama_model_complete,  # Use Ollama model for text generation
    llm_model_name='your_model_name',  # Your model name
    llm_model_kwargs={"options": {"num_ctx": 32768}},
    # Use Ollama embedding function
    embedding_func=EmbeddingFunc(
        embedding_dim=768,
        max_token_size=8192,
        func=lambda texts: ollama_embedding(
            texts,
            embed_model="nomic-embed-text"
        )
    ),
)

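Fully functional example

There is a fully functional example, examples/lightrag_ollama_demo.py, that uses the gemma2:2b model, runs only 4 requests in parallel, and sets the context size to 32k.

Low RAM GPUs

To run this experiment on a low RAM GPU, select a small model and tune the context window (increasing the context increases memory consumption). For example, running this Ollama example on a repurposed mining GPU with 6 GB of RAM required setting the context size to 26k while using gemma2:2b. It was able to find 197 entities and 19 relations in book.txt.

Query Param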

class QueryParam:
    mode: Literal["local", "global", "hybrid", "naive"] = "global"
    only_need_context: bool = False
    response_type: str = "Multiple Paragraphs"
    # Number of top-k items to retrieve; corresponds to entities in "local" mode and relationships in "global" mode.
    top_k: int = 60
    # Number of tokens for the original chunks.
    max_token_for_text_unit: int = 4000
    # Number of tokens for the relationship descriptions
    max_token_for_global_context: int = 4000
    # Number of tokens for the entity descriptions
    max_token_for_local_context: int = 4000
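
The three max_token_* fields cap how much text of each kind is passed to the LLM. The sketch below illustrates one way such a budget could be enforced; it is not LightRAG's implementation. The helper names and the whitespace-based token count are assumptions made for illustration.

```python
from typing import Callable, List


def count_tokens_whitespace(text: str) -> int:
    """Crude token count: number of whitespace-separated words (an assumption)."""
    return len(text.split())


def truncate_by_token_budget(
    items: List[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = count_tokens_whitespace,
) -> List[str]:
    """Keep items in order until adding the next one would exceed max_tokens."""
    kept: List[str] = []
    used = 0
    for item in items:
        cost = count_tokens(item)
        if used + cost > max_tokens:
            break
        kept.append(item)
        used += cost
    return kept


# Hypothetical entity descriptions, ranked from most to least relevant.
descriptions = [
    "Scrooge is a miserly businessman",
    "Marley is his late partner",
    "Tiny Tim is the son of Bob Cratchit",
]
print(truncate_by_token_budget(descriptions, max_tokens=12))
```

Because the items are assumed to be ranked, truncation stops at the first item that does not fit rather than skipping it and trying smaller ones, which keeps the highest-ranked items contiguous.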

Batch Insert

# Batch Insert: Insert multiple texts at once
rag.insert(["TEXT1", "TEXT2", ...])
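
When the texts live in separate files, they can be collected into a list before the batch insert. The helper below is a minimal sketch; load_texts is a hypothetical name, not part of LightRAG, and it assumes the files are plain text in a single known encoding.

```python
from pathlib import Path
from typing import Iterable, List


def load_texts(paths: Iterable[str], encoding: str = "utf-8") -> List[str]:
    """Read each file and return the non-empty contents as a list of strings."""
    texts: List[str] = []
    for path in paths:
        content = Path(path).read_text(encoding=encoding).strip()
        if content:  # skip empty files so no blank document is inserted
            texts.append(content)
    return texts


# Usage (assumes `rag` is an initialized LightRAG instance and the files exist):
# rag.insert(load_texts(["./book.txt", "./newText.txt"]))
```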

Incremental Insert

# Incremental Insert: Insert new documents into an existing LightRAG instance
rag = LightRAG(
    working_dir=WORKING_DIR,
    llm_model_func=llm_model_func,
    embedding_func=EmbeddingFunc(
        embedding_dim=embedding_dimension,
        max_token_size=8192,
        func=embedding_func,
    ),
)

with open("./newText.txt") as f:
    rag.insert(f.read())

Insert Custom KG

rag = LightRAG(
    working_dir=WORKING_DIR,
    llm_model_func=llm_model_func,
    embedding_func=EmbeddingFunc(
        embedding_dim=embedding_dimension,
        max_token_size=8192,
        func=embedding_func,
    ),
)

custom_kg = {
    "entities": [
        {
            "entity_name": "CompanyA",
            "entity_type": "Organization",
            "description": "A major technology company",
            "source_id": "Source1"
        },
        {
            "entity_name": "ProductX",
            "entity_type": "Product",
            "description": "A popular product developed by CompanyA",
            "source_id": "Source1"
        }
    ],
    "relationships": [
        {
            "src_id": "CompanyA",
            "tgt_id": "ProductX",
            "description": "CompanyA develops ProductX",
            "keywords": "develop, produce",
            "weight": 1.0,
            "source_id": "Source1"
        }
    ]
}

rag.insert_custom_kg(custom_kg)
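
Before inserting a hand-built graph, it can help to check that it is internally consistent. The sketch below is not part of LightRAG; validate_custom_kg is a hypothetical helper, and the set of required keys is simply taken from the example above (whether LightRAG itself requires every key is an assumption).

```python
from typing import Any, Dict, List

ENTITY_KEYS = ("entity_name", "entity_type", "description", "source_id")
RELATION_KEYS = ("src_id", "tgt_id", "description", "keywords", "weight", "source_id")


def validate_custom_kg(custom_kg: Dict[str, Any]) -> List[str]:
    """Return a list of problems found; an empty list means the dict looks consistent."""
    problems: List[str] = []
    entity_names = set()

    for i, entity in enumerate(custom_kg.get("entities", [])):
        for key in ENTITY_KEYS:
            if key not in entity:
                problems.append(f"entities[{i}] is missing '{key}'")
        if "entity_name" in entity:
            entity_names.add(entity["entity_name"])

    for i, rel in enumerate(custom_kg.get("relationships", [])):
        for key in RELATION_KEYS:
            if key not in rel:
                problems.append(f"relationships[{i}] is missing '{key}'")
        # Every relationship endpoint should name a declared entity.
        for endpoint in ("src_id", "tgt_id"):
            name = rel.get(endpoint)
            if name is not None and name not in entity_names:
                problems.append(
                    f"relationships[{i}].{endpoint} refers to unknown entity '{name}'"
                )
    return problems


# A deliberately broken graph: ProductX is referenced but never declared.
example_kg = {
    "entities": [
        {"entity_name": "CompanyA", "entity_type": "Organization",
         "description": "A major technology company", "source_id": "Source1"},
    ],
    "relationships": [
        {"src_id": "CompanyA", "tgt_id": "ProductX",
         "description": "CompanyA develops ProductX",
         "keywords": "develop, produce", "weight": 1.0, "source_id": "Source1"},
    ],
}
print(validate_custom_kg(example_kg))
```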

Delete Entity

# Delete Entity: Deleting entities by their names
rag = LightRAG(
    working_dir=WORKING_DIR,
    llm_model_func=llm_model_func,
    embedding_func=EmbeddingFunc(
        embedding_dim=embedding_dimension,
        max_token_size=8192,
        func=embedding_func,
    ),
)

rag.delete_by_entity("Project Gutenberg")

Multi-file Type Support

textract supports reading file types such as TXT, DOCX, PPTX, CSV, and PDF.

import textract

file_path = 'TEXT.pdf'
text_content = textract.process(file_path)

rag.insert(text_content.decode('utf-8'))

Graph Visualization

Graph visualization with HTML

The following code can be found in examples/graph_visual_with_html.py

import networkx as nx
from pyvis.network import Network

# Load the GraphML file
G = nx.read_graphml('./dickens/graph_chunk_entity_relation.graphml')

# Create a Pyvis network
net = Network(notebook=True)

# Convert NetworkX graph to Pyvis network
net.from_nx(G)

# Save and display the network
net.show('knowledge_graph.html')

Graph visualization with Neo4j

The following code can be found in examples/graph_visual_with_neo4j.py

import os
import json
from lightrag.utils import xml_to_json
from neo4j import GraphDatabase

# Constants
WORKING_DIR = "./dickens"
BATCH_SIZE_NODES = 500
BATCH_SIZE_EDGES = 100

# Neo4j connection credentials
NEO4J_URI = "bolt://localhost:7687"
NEO4J_USERNAME = "neo4j"
NEO4J_PASSWORD = "your_password"

def convert_xml_to_json(xml_path, output_path):
    """Converts XML file to JSON and saves the output."""
    if not os.path.exists(xml_path):
        print(f"Error: File not found - {xml_path}")
        return None

    json_data = xml_to_json(xml_path)
    if json_data:
        with open(output_path, 'w', encoding='utf-8') as f:
            json.dump(json_data, f, ensure_ascii=False, indent=2)
        print(f"JSON file created: {output_path}")
        return json_data
    else:
        print("Failed to create JSON data")
        return None

def process_in_batches(tx, query, data, batch_size):
    """Process data in batches and execute the given query."""
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        tx.run(query, {"nodes": batch} if "nodes" in query else {"edges": batch})

def main():
    # Paths
    xml_file = os.path.join(WORKING_DIR, 'graph_chunk_entity_relation.graphml')
    json_file = os.path.join(WORKING_DIR, 'graph_data.json')

    # Convert XML to JSON
    json_data = convert_xml_to_json(xml_file, json_file)
    if json_data is None:
        return

    # Load nodes and edges
    nodes = json_data.get('nodes', [])
    edges = json_data.get('edges', [])

    # Neo4j queries
    create_nodes_query = """
    UNWIND $nodes AS node
    MERGE (e:Entity {id: node.id})
    SET e.entity_type = node.entity_type,
        e.description = node.description,
        e.source_id = node.source_id,
        e.displayName = node.id
    REMOVE e:Entity
    WITH e, node
    CALL apoc.create.addLabels(e, [node.entity_type]) YIELD node AS labeledNode
    RETURN count(*)
    """

    create_edges_query = """
    UNWIND $edges AS edge
    MATCH (source {id: edge.source})
    MATCH (target {id: edge.target})
    WITH source, target, edge,
         CASE
            WHEN edge.keywords CONTAINS 'lead' THEN 'lead'
            WHEN edge.keywords CONTAINS 'participate' THEN 'participate'
            WHEN edge.keywords CONTAINS 'uses' THEN 'uses'
            WHEN edge.keywords CONTAINS 'located' THEN 'located'
            WHEN edge.keywords CONTAINS 'occurs' THEN 'occurs'
           ELSE REPLACE(SPLIT(edge.keywords, ',')[0], '\"', '')
         END AS relType
    CALL apoc.create.relationship(source, relType, {
      weight: edge.weight,
      description: edge.description,
      keywords: edge.keywords,
      source_id: edge.source_id
    }, target) YIELD rel
    RETURN count(*)
    """

    set_displayname_and_labels_query = """
    MATCH (n)
    SET n.displayName = n.id
    WITH n
    CALL apoc.create.setLabels(n, [n.entity_type]) YIELD node
    RETURN count(*)
    """

    # Create a Neo4j driver
    driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))

    try:
        # Execute queries in batches
        with driver.session() as session:
            # Insert nodes in batches
            session.execute_write(process_in_batches, create_nodes_query, nodes, BATCH_SIZE_NODES)

            # Insert edges in batches
            session.execute_write(process_in_batches, create_edges_query, edges, BATCH_SIZE_EDGES)

            # Set displayName and labels
            session.run(set_displayname_and_labels_query)

    except Exception as e:
        print(f"Error occurred: {e}")

    finally:
        driver.close()

if __name__ == "__main__":
    main()

LightRAG init parameters

Parameter	Type	Explanation	Default
working_dir	`str`	Directory where the cache will be stored	`lightrag_cache+timestamp`
kv_storage	`str`	Storage type for documents and text chunks. Supported types: `JsonKVStorage`, `OracleKVStorage`	`JsonKVStorage`
vector_storage	`str`	Storage type for embedding vectors. Supported types: `NanoVectorDBStorage`, `OracleVectorDBStorage`	`NanoVectorDBStorage`
graph_storage	`str`	Storage type for graph edges and nodes. Supported types: `NetworkXStorage`, `Neo4JStorage`, `OracleGraphStorage`	`NetworkXStorage`
log_level		Log level for the application runtime	`logging.DEBUG`
chunk_token_size	`int`	Maximum token size per chunk when splitting documents	`1200`
chunk_overlap_token_size	`int`	Number of tokens shared by two adjacent chunks when splitting documents	`100`
tiktoken_model_name	`str`	Model name for the Tiktoken encoder used to count tokens	`gpt-4o-mini`
entity_extract_max_gleaning	`int`	Number of loops in the entity extraction process, appending history messages	`1`
entity_summary_to_max_tokens	`int`	Maximum token size for each entity summary	`500`
node_embedding_algorithm	`str`	Algorithm for node embedding (currently not used)	`node2vec`
node2vec_params	`dict`	Parameters for node embedding	`{"dimensions": 1536,"num_walks": 10,"walk_length": 40,"window_size": 2,"iterations": 3,"random_seed": 3,}`
embedding_func	`EmbeddingFunc`	Function that generates embedding vectors from text	`openai_embedding`
embedding_batch_num	`int`	Maximum batch size for embedding processes (multiple texts sent per batch)	`32`
embedding_func_max_async	`int`	Maximum number of concurrent asynchronous embedding processes	`16`
llm_model_func	`callable`	Function for LLM generation	`gpt_4o_mini_complete`
llm_model_name	`str`	LLM model name for generation	`meta-llama/Llama-3.2-1B-Instruct`
llm_model_max_token_size	`int`	Maximum token size for LLM generation (affects entity relation summaries)	`32768`
llm_model_max_async	`int`	Maximum number of concurrent asynchronous LLM processes	`16`
llm_model_kwargs	`dict`	Additional parameters for LLM generation
vector_db_storage_cls_kwargs	`dict`	Additional parameters for the vector database (currently not used)
enable_llm_cache	`bool`	If `True`, stores LLM results in a cache; repeated prompts return cached responses	`True`
addon_params	`dict`	Additional parameters, e.g., `{"example_number": 1, "language": "Simplified Chinese"}`: sets the example limit and output language	`example_number: all examples, language: English`
convert_response_to_json_func	`callable`	Not used	`convert_response_to_json`
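
To make the interaction between chunk_token_size and chunk_overlap_token_size concrete, the sketch below splits a token sequence into overlapping windows using the default values from the table. It illustrates the general sliding-window idea only; chunk_tokens is a hypothetical helper, and LightRAG's own chunking code may differ in detail.

```python
from typing import List, Sequence


def chunk_tokens(
    tokens: Sequence[int],
    chunk_token_size: int = 1200,
    chunk_overlap_token_size: int = 100,
) -> List[List[int]]:
    """Split tokens into windows of at most chunk_token_size tokens, where
    consecutive windows share chunk_overlap_token_size tokens."""
    if chunk_overlap_token_size >= chunk_token_size:
        raise ValueError("overlap must be smaller than the chunk size")
    # Each new window starts this many tokens after the previous one.
    step = chunk_token_size - chunk_overlap_token_size
    return [
        list(tokens[start:start + chunk_token_size])
        for start in range(0, len(tokens), step)
    ]


# Integers stand in for token IDs so the window boundaries are easy to read.
chunks = chunk_tokens(list(range(2500)))
print([(chunk[0], chunk[-1]) for chunk in chunks])
```

With the defaults, each window starts 1100 tokens after the previous one, so the last 100 tokens of one chunk reappear at the start of the next.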

API Server Implementation

LightRAG also provides a FastAPI-based server implementation for RESTful API access to RAG operations. This allows you to run LightRAG as a service and interact with it through HTTP requests.

Setting up the API Server

First, ensure you have the required dependencies:

pip install fastapi uvicorn pydantic

Set up your environment variables:

export RAG_DIR="your_index_directory"  # Optional: Defaults to "index_default"
export OPENAI_BASE_URL="Your OpenAI API base URL"  # Optional: Defaults to "https://api.openai.com/v1"
export OPENAI_API_KEY="Your OpenAI API key"  # Required
export LLM_MODEL="Your LLM model"  # Optional: Defaults to "gpt-4o-mini"
export EMBEDDING_MODEL="Your embedding model"  # Optional: Defaults to "text-embedding-3-large"

Run the API server:

python examples/lightrag_api_openai_compatible_demo.py

The server will start on http://0.0.0.0:8020.

API Endpoints

The API server provides the following endpoints:

1. Query Endpoint

URL: /query
Method: POST
Body:

{
    "query": "Your question here",
    "mode": "hybrid",  // Can be "naive", "local", "global", or "hybrid"
    "only_need_context": true  // Optional: Defaults to false. If true, only the referenced context is returned; otherwise the LLM answer is returned
}

Example:

curl -X POST "http://127.0.0.1:8020/query" \
     -H "Content-Type: application/json" \
     -d '{"query": "What are the main themes?", "mode": "hybrid"}'
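
The same query can be issued from Python. The following is a minimal sketch using only the standard library; it assumes the server is reachable at 127.0.0.1:8020 as in the curl example, and build_query_request is a hypothetical helper name. The request is built without being sent, so the snippet runs even when no server is up.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8020"  # assumption: server running locally on the port used above


def build_query_request(
    query: str, mode: str = "hybrid", only_need_context: bool = False
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /query endpoint."""
    payload = {"query": query, "mode": mode, "only_need_context": only_need_context}
    return urllib.request.Request(
        url=f"{BASE_URL}/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_query_request("What are the main themes?")

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read().decode("utf-8")))
```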

2. Insert Text Endpoint

URL: /insert
Method: POST
Body:

{
    "text": "Your text content here"
}

Example:

curl -X POST "http://127.0.0.1:8020/insert" \
     -H "Content-Type: application/json" \
     -d '{"text": "Content to be inserted into RAG"}'

3. Insert File Endpoint

URL: /insert_file
Method: POST
Body:

{
    "file_path": "path/to/your/file.txt"
}

Example:

curl -X POST "http://127.0.0.1:8020/insert_file" \
     -H "Content-Type: application/json" \
     -d '{"file_path": "./book.txt"}'

4. Health Check Endpoint

URL: /health
Method: GET
Example:

curl -X GET "http://127.0.0.1:8020/health"

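Configuration

The API server can be configured using environment variables:

RAG_DIR: Directory for storing the RAG index (default: "index_default")
API keys and base URLs should be configured in the code for your specific LLM and embedding model providers.

Error Handling

The API includes comprehensive error handling:

File not found errors (404)
Processing errors (500)
Support for multiple file encodings (UTF-8 and GBK)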
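
Supporting several encodings usually means trying each one in turn. The sketch below shows that pattern for UTF-8 followed by GBK; it is an illustration under that assumption, not the server's actual code, and decode_with_fallback is a hypothetical name.

```python
from typing import Sequence


def decode_with_fallback(raw: bytes, encodings: Sequence[str] = ("utf-8", "gbk")) -> str:
    """Try each encoding in order and return the first successful decoding."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # this encoding does not fit; try the next one
    raise ValueError(f"could not decode input with any of: {', '.join(encodings)}")


print(decode_with_fallback("中文".encode("gbk")))
```

UTF-8 is tried first because it rejects most byte sequences produced by other encodings, whereas GBK accepts many byte sequences and would otherwise mask UTF-8 input as garbled text.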

Evaluation

Dataset

The dataset used in LightRAG can be downloaded from TommyChien/UltraDomain.

Generate Query

LightRAG uses the following prompt to generate high-level queries; the corresponding code is in examples/generate_query.py.

Prompt

Given the following description of a dataset:

{description}

Please identify 5 potential users who would engage with this dataset. For each user, list 5 tasks they would perform with this dataset. Then, for each (user, task) combination, generate 5 questions that require a high-level understanding of the entire dataset.

Output the results in the following structure:
- User 1: [user description]
    - Task 1: [task description]
        - Question 1:
        - Question 2:
        - Question 3:
        - Question 4:
        - Question 5:
    - Task 2: [task description]
        ...
    - Task 5: [task description]
- User 2: [user description]
    ...
- User 5: [user description]
    ...

Batch Eval

To evaluate the performance of two RAG systems on high-level queries, LightRAG uses the following prompt; the corresponding code is in examples/batch_eval.py.

Prompt

---Role---
You are an expert tasked with evaluating two answers to the same question based on three criteria: **Comprehensiveness**, **Diversity**, and **Empowerment**.
---Goal---
You will evaluate two answers to the same question based on three criteria: **Comprehensiveness**, **Diversity**, and **Empowerment**.

- **Comprehensiveness**: How much detail does the answer provide to cover all aspects and details of the question?
- **Diversity**: How varied and rich is the answer in providing different perspectives and insights on the question?
- **Empowerment**: How well does the answer help the reader understand and make informed judgments about the topic?

For each criterion, choose the better answer (either Answer 1 or Answer 2) and explain why. Then, select an overall winner based on these three categories.

Here is the question:
{query}

Here are the two answers:

**Answer 1:**
{answer1}

**Answer 2:**
{answer2}

Evaluate both answers using the three criteria listed above and provide detailed explanations for each criterion.

Output your evaluation in the following JSON format:

{{
    "Comprehensiveness": {{
        "Winner": "[Answer 1 or Answer 2]",
        "Explanation": "[Provide explanation here]"
    }},
    "Empowerment": {{
        "Winner": "[Answer 1 or Answer 2]",
        "Explanation": "[Provide explanation here]"
    }},
    "Overall Winner": {{
        "Winner": "[Answer 1 or Answer 2]",
        "Explanation": "[Summarize why this answer is the overall winner based on the three criteria]"
    }}
}}
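
Verdicts in this JSON format can be aggregated into per-criterion win rates. The sketch below is one possible way to do that and is not taken from the repository; win_rates is a hypothetical helper, and it assumes every verdict is valid JSON whose "Winner" values are exactly "Answer 1" or "Answer 2". The sample verdicts are made up for illustration.

```python
import json
from collections import Counter, defaultdict
from typing import Dict, Iterable


def win_rates(verdicts: Iterable[str]) -> Dict[str, Dict[str, float]]:
    """Given JSON verdict strings, return the percentage of wins per answer for each criterion."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for raw in verdicts:
        verdict = json.loads(raw)
        for criterion, result in verdict.items():
            counts[criterion][result["Winner"]] += 1
    return {
        criterion: {
            answer: round(100.0 * wins / sum(tally.values()), 1)
            for answer, wins in tally.items()
        }
        for criterion, tally in counts.items()
    }


# Four made-up verdicts that only report the overall winner.
sample_verdicts = [
    json.dumps({"Overall Winner": {"Winner": winner, "Explanation": "..."}})
    for winner in ("Answer 2", "Answer 2", "Answer 1", "Answer 2")
]
print(win_rates(sample_verdicts))
```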

Overall Performance Table

	Agriculture		CS		Legal		Mix
	NaiveRAG	LightRAG	NaiveRAG	LightRAG	NaiveRAG	LightRAG	NaiveRAG	LightRAG
Comprehensiveness	32.4%	67.6%	38.4%	61.6%	16.4%	83.6%	38.8%	61.2%
Diversity	23.6%	76.4%	38.0%	62.0%	13.6%	86.4%	32.4%	67.6%
Empowerment	32.4%	67.6%	38.8%	61.2%	16.4%	83.6%	42.8%	57.2%
Overall	32.4%	67.6%	38.8%	61.2%	15.2%	84.8%	40.0%	60.0%
	RQ-RAG	LightRAG	RQ-RAG	LightRAG	RQ-RAG	LightRAG	RQ-RAG	LightRAG
Comprehensiveness	31.6%	68.4%	38.8%	61.2%	15.2%	84.8%	39.2%	60.8%
Diversity	29.2%	70.8%	39.2%	60.8%	11.6%	88.4%	30.8%	69.2%
Empowerment	31.6%	68.4%	36.4%	63.6%	15.2%	84.8%	42.4%	57.6%
Overall	32.4%	67.6%	38.0%	62.0%	14.4%	85.6%	40.0%	60.0%
	HyDE	LightRAG	HyDE	LightRAG	HyDE	LightRAG	HyDE	LightRAG
Comprehensiveness	26.0%	74.0%	41.6%	58.4%	26.8%	73.2%	40.4%	59.6%
Diversity	24.0%	76.0%	38.8%	61.2%	20.0%	80.0%	32.4%	67.6%
Empowerment	25.2%	74.8%	40.8%	59.2%	26.0%	74.0%	46.0%	54.0%
Overall	24.8%	75.2%	41.6%	58.4%	26.4%	73.6%	42.4%	57.6%
	GraphRAG	LightRAG	GraphRAG	LightRAG	GraphRAG	LightRAG	GraphRAG	LightRAG
Comprehensiveness	45.6%	54.4%	48.4%	51.6%	48.4%	51.6%	50.4%	49.6%
Diversity	22.8%	77.2%	40.8%	59.2%	26.4%	73.6%	36.0%	64.0%
Empowerment	41.2%	58.8%	45.2%	54.8%	43.6%	56.4%	50.8%	49.2%
Overall	45.2%	54.8%	48.0%	52.0%	47.2%	52.8%	50.4%	49.6%

Reproduce

All the code can be found in the ./reproduce directory.

Step-0 Extract Unique Contexts

First, we need to extract the unique contexts from the datasets.

Code

import glob
import json
import os

def extract_unique_contexts(input_directory, output_directory):

    os.makedirs(output_directory, exist_ok=True)

    jsonl_files = glob.glob(os.path.join(input_directory, '*.jsonl'))
    print(f"Found {len(jsonl_files)} JSONL files.")

    for file_path in jsonl_files:
        filename = os.path.basename(file_path)
        name, ext = os.path.splitext(filename)
        output_filename = f"{name}_unique_contexts.json"
        output_path = os.path.join(output_directory, output_filename)

        # A dict is used as an ordered set: keys keep first-seen order
        unique_contexts_dict = {}

        print(f"Processing file: {filename}")

        try:
            with open(file_path, 'r', encoding='utf-8') as infile:
                for line_number, line in enumerate(infile, start=1):
                    line = line.strip()
                    if not line:
                        continue
                    try:
                        json_obj = json.loads(line)
                        context = json_obj.get('context')
                        if context and context not in unique_contexts_dict:
                            unique_contexts_dict[context] = None
                    except json.JSONDecodeError as e:
                        print(f"JSON decoding error in file {filename} at line {line_number}: {e}")
        except FileNotFoundError:
            print(f"File not found: {filename}")
            continue
        except Exception as e:
            print(f"An error occurred while processing file {filename}: {e}")
            continue

        unique_contexts_list = list(unique_contexts_dict.keys())
        print(f"There are {len(unique_contexts_list)} unique `context` entries in the file {filename}.")

        try:
            with open(output_path, 'w', encoding='utf-8') as outfile:
                json.dump(unique_contexts_list, outfile, ensure_ascii=False, indent=4)
            print(f"Unique `context` entries have been saved to: {output_filename}")
        except Exception as e:
            print(f"An error occurred while saving to the file {output_filename}: {e}")

    print("All files have been processed.")

Step-1 Insert Contexts

We insert the extracted contexts into the LightRAG system.

Code

import json
import time

def insert_text(rag, file_path):
    with open(file_path, mode='r') as f:
        unique_contexts = json.load(f)

    # Retry the insert a few times, waiting 10 seconds between attempts
    retries = 0
    max_retries = 3
    while retries < max_retries:
        try:
            rag.insert(unique_contexts)
            break
        except Exception as e:
            retries += 1
            print(f"Insertion failed, retrying ({retries}/{max_retries}), error: {e}")
            time.sleep(10)
    if retries == max_retries:
        print("Insertion failed after exceeding the maximum number of retries")

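Step-2 Generate Queries

We extract tokens from the first half and the second half of each context in the dataset, then combine them into a dataset description used to generate queries.

Code

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

def get_summary(context, tot_tokens=2000):
    tokens = tokenizer.tokenize(context)
    half_tokens = tot_tokens // 2

    # Skip the first 1000 tokens, then take half of the budget from the front
    start_tokens = tokens[1000:1000 + half_tokens]
    # Take the other half from the back, skipping the last 1000 tokens
    # (the end index is negative so the slice is counted from the end)
    end_tokens = tokens[-(1000 + half_tokens):-1000]

    summary_tokens = start_tokens + end_tokens
    summary = tokenizer.convert_tokens_to_string(summary_tokens)

    return summary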

Step-3 Query

We extract the queries generated in Step-2 and use them to query LightRAG.

Code

import re

def extract_queries(file_path):
    with open(file_path, 'r') as f:
        data = f.read()

    # Strip Markdown bold markers before matching
    data = data.replace('**', '')

    # Capture the text after each "- Question N:" label
    queries = re.findall(r'- Question \d+: (.+)', data)

    return queries
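
As a quick check of the extraction pattern, the sketch below applies the same two steps to an in-memory string instead of a file. The sample text is made up for illustration and follows the output structure requested by the query-generation prompt.

```python
import re

# Made-up model output in the structure requested by the generation prompt.
sample_output = """
- User 1: A literature student
    - Task 1: Analyze themes
        - Question 1: **How does the story portray redemption?**
        - Question 2: What role does memory play across the narrative?
"""

# Same steps as extract_queries: drop bold markers, then capture each question.
cleaned = sample_output.replace("**", "")
queries = re.findall(r"- Question \d+: (.+)", cleaned)
print(queries)
```

The `\d+` in the pattern matches the question number, so only lines labeled "Question N" are captured; the user and task lines are ignored.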

Code Structure

.
├── examples
│   ├── batch_eval.py
│   ├── generate_query.py
│   ├── graph_visual_with_html.py
│   ├── graph_visual_with_neo4j.py
│   ├── lightrag_api_openai_compatible_demo.py
│   ├── lightrag_azure_openai_demo.py
│   ├── lightrag_bedrock_demo.py
│   ├── lightrag_hf_demo.py
│   ├── lightrag_lmdeploy_demo.py
│   ├── lightrag_ollama_demo.py
│   ├── lightrag_openai_compatible_demo.py
│   ├── lightrag_openai_demo.py
│   ├── lightrag_siliconcloud_demo.py
│   └── vram_management_demo.py
├── lightrag
│   ├── kg
│   │   ├── __init__.py
│   │   └── neo4j_impl.py
│   ├── __init__.py
│   ├── base.py
│   ├── lightrag.py
│   ├── llm.py
│   ├── operate.py
│   ├── prompt.py
│   ├── storage.py
│   └── utils.py
├── reproduce
│   ├── Step_0.py
│   ├── Step_1_openai_compatible.py
│   ├── Step_1.py
│   ├── Step_2.py
│   ├── Step_3_openai_compatible.py
│   └── Step_3.py
├── .gitignore
├── .pre-commit-config.yaml
├── Dockerfile
├── get_all_edges_nx.py
├── LICENSE
├── README.md
├── requirements.txt
├── setup.py
├── test_neo4j.py
└── test.py

Contribution

Thank you to all our contributors!

Citation

@article{guo2024lightrag,
title={LightRAG: Simple and Fast Retrieval-Augmented Generation},
author={Zirui Guo and Lianghao Xia and Yanhua Yu and Tu Ao and Chao Huang},
year={2024},
eprint={2410.05779},
archivePrefix={arXiv},
primaryClass={cs.IR}
}

Thank you for your interest in our work!
