relik download - download the relik source code

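Retrieve, Read and LinK: Fast and Accurate Entity Linking and Relation Extraction on an Academic Budget

A fast and lightweight Information Extraction model for Entity Linking and Relation Extraction.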

Installation

Installation from PyPI

pip install relik

Other installation options

Install with optional dependencies

Install with all the optional dependencies.

pip install relik[all]

Install with optional dependencies for training and evaluation.

pip install relik[train]

Install with optional dependencies for FAISS

The FAISS PyPI package is only available for CPU. For GPU, install it from source or use the conda package.

For CPU:

pip install relik[faiss]

For GPU:

conda create -n relik python=3.10
conda activate relik

# install pytorch
conda install -y pytorch=2.1.0 pytorch-cuda=12.1 -c pytorch -c nvidia

# GPU
conda install -y -c pytorch -c nvidia faiss-gpu=1.8.0
# or GPU with NVIDIA RAFT
conda install -y -c pytorch -c nvidia -c rapidsai -c conda-forge faiss-gpu-raft=1.8.0

pip install relik

Install with optional dependencies for serving the models with FastAPI and Ray.

pip install relik[serve]

Installation from source

git clone https://github.com/SapienzaNLP/relik.git
cd relik
pip install -e .[all]

Models

ReLiK Large for Relation Extraction (Large RE v2, Colab ✅): relik-ie/relik-relation-extraction-large
ReLiK Large for Closed Information Extraction (Large EL + RE, Colab ✅): https://huggingface.co/relik-ie/relik-cie-large
ReLiK Extra Large for Closed Information Extraction (our thicc boi for EL + RE): relik-ie/relik-cie-xl
ReLiK Small for Entity Linking (⚡ Tiny and Fast EL, Colab ✅): sapienzanlp/relik-entity-linking-small
ReLiK Small for Entity Linking (⚡ Small and Fast EL): sapienzanlp/relik-entity-linking-small
ReLiK Small for Closed Information Extraction (EL + RE): relik-ie/relik-cie-small
ReLiK Large for Entity Linking (EL for the wild): relik-ie/relik-entity-linking-large-robust
ReLiK Small for Relation Extraction (RE + NER): relik-ie/relik-relation-extraction-small-wikipedia-ner

Models from the paper:

ReLiK Large for Entity Linking (paper version): sapienzanlp/relik-entity-linking-large
ReLiK Base for Entity Linking (paper version): sapienzanlp/relik-entity-linking-base
ReLiK Large for Relation Extraction (paper version): sapienzanlp/relik-relation-extraction-nyt-large

A full list of models can be found on Hugging Face.

Other model sizes will be available in the future.

Quick Start

ReLiK is a lightweight and fast model for Entity Linking and Relation Extraction. It is composed of two main components: a retriever and a reader. The retriever is responsible for retrieving relevant documents from a large collection, while the reader is responsible for extracting entities and relations from the retrieved documents. ReLiK can be used with the from_pretrained method to load a pre-trained pipeline.

Here is an example of how to use ReLiK for Entity Linking:

from relik import Relik
from relik.inference.data.objects import RelikOutput

relik = Relik.from_pretrained("sapienzanlp/relik-entity-linking-large")
relik_out: RelikOutput = relik("Michael Jordan was one of the best players in the NBA.")

Output:

RelikOutput(
  text="Michael Jordan was one of the best players in the NBA.",
  tokens=['Michael', 'Jordan', 'was', 'one', 'of', 'the', 'best', 'players', 'in', 'the', 'NBA', '.'],
  id=0,
  spans=[
      Span(start=0, end=14, label="Michael Jordan", text="Michael Jordan"),
      Span(start=50, end=53, label="National Basketball Association", text="NBA"),
  ],
  triples=[],
  candidates=Candidates(
      span=[
          [
              [
                  {"text": "Michael Jordan", "id": 4484083},
                  {"text": "National Basketball Association", "id": 5209815},
                  {"text": "Walter Jordan", "id": 2340190},
                  {"text": "Jordan", "id": 3486773},
                  {"text": "50 Greatest Players in NBA History", "id": 1742909},
                  ...
              ]
          ]
      ]
  ),
)
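The `start` and `end` values in the spans above are character offsets into `text`. The sketch below illustrates that relationship under stated assumptions: it uses `SimpleNamespace` stand-ins that only mimic the `text`, `spans`, `start`, `end`, and `label` fields printed above, so it runs without ReLiK installed, and the helper name `span_rows` is hypothetical rather than part of the library.

```python
from types import SimpleNamespace


def span_rows(output):
    """Return (surface form, linked label) pairs, checking offsets against the text."""
    rows = []
    for span in output.spans:
        # Offsets are treated as character-level and end-exclusive, as in the output above.
        surface = output.text[span.start:span.end]
        if surface != span.text:
            raise ValueError(f"offset mismatch for span {span!r}")
        rows.append((surface, span.label))
    return rows


# Stand-in for a RelikOutput (assumption: only the printed fields are mimicked).
text = "Michael Jordan was one of the best players in the NBA."
fake_output = SimpleNamespace(
    text=text,
    spans=[
        SimpleNamespace(start=0, end=14, label="Michael Jordan", text="Michael Jordan"),
        SimpleNamespace(start=50, end=53, label="National Basketball Association", text="NBA"),
    ],
)

rows = span_rows(fake_output)
for surface, label in rows:
    print(f"{surface} -> {label}")
```

With a real pipeline, the same helper could presumably be applied to the object returned by `relik(...)`, provided it exposes the fields shown in the output above.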

and for Relation Extraction:

from relik import Relik
from relik.inference.data.objects import RelikOutput

relik = Relik.from_pretrained("sapienzanlp/relik-relation-extraction-nyt-large")
relik_out: RelikOutput = relik("Michael Jordan was one of the best players in the NBA.")

Output:

RelikOutput(
  text='Michael Jordan was one of the best players in the NBA.', 
  tokens=Michael Jordan was one of the best players in the NBA., 
  id=0, 
  spans=[
    Span(start=0, end=14, label='--NME--', text='Michael Jordan'), 
    Span(start=50, end=53, label='--NME--', text='NBA')
  ], 
  triplets=[
    Triplets(
      subject=Span(start=0, end=14, label='--NME--', text='Michael Jordan'), 
      label='company', 
      object=Span(start=50, end=53, label='--NME--', text='NBA'), 
      confidence=1.0
      )
  ], 
  candidates=Candidates(
    span=[], 
    triplet=[
              [
                [
                  {"text": "company", "id": 4, "metadata": {"definition": "company of this person"}}, 
                  {"text": "nationality", "id": 10, "metadata": {"definition": "nationality of this person or entity"}}, 
                  {"text": "child", "id": 17, "metadata": {"definition": "child of this person"}}, 
                  {"text": "founded by", "id": 0, "metadata": {"definition": "founder or co-founder of this organization, religion or place"}}, 
                  {"text": "residence", "id": 18, "metadata": {"definition": "place where this person has lived"}},
                  ...
              ]
          ]
      ]
  ),
)

Usage

Retrievers and readers can be used separately. With a retriever-only ReLiK, the output contains the candidates for the input text.

Retriever-only example:

from relik import Relik
from relik.inference.data.objects import RelikOutput

# If you want to use only the retriever
retriever = Relik.from_pretrained("sapienzanlp/relik-entity-linking-large", reader=None)
relik_out: RelikOutput = retriever("Michael Jordan was one of the best players in the NBA.")

Output:

RelikOutput(
  text="Michael Jordan was one of the best players in the NBA.",
  tokens=['Michael', 'Jordan', 'was', 'one', 'of', 'the', 'best', 'players', 'in', 'the', 'NBA', '.'],
  id=0,
  spans=[],
  triples=[],
  candidates=Candidates(
      span=[
              [
                  {"text": "Michael Jordan", "id": 4484083},
                  {"text": "National Basketball Association", "id": 5209815},
                  {"text": "Walter Jordan", "id": 2340190},
                  {"text": "Jordan", "id": 3486773},
                  {"text": "50 Greatest Players in NBA History", "id": 1742909},
                  ...
              ]
      ],
      triplet=[],
  ),
)

Reader-only example:

from relik import Relik
from relik.inference.data.objects import RelikOutput

# If you want to use only the reader
reader = Relik.from_pretrained("sapienzanlp/relik-entity-linking-large", retriever=None)
candidates = [
    "Michael Jordan",
    "National Basketball Association",
    "Walter Jordan",
    "Jordan",
    "50 Greatest Players in NBA History",
]
text = "Michael Jordan was one of the best players in the NBA."
relik_out: RelikOutput = reader(text, candidates=candidates)

Output:

RelikOutput(
  text="Michael Jordan was one of the best players in the NBA.",
  tokens=['Michael', 'Jordan', 'was', 'one', 'of', 'the', 'best', 'players', 'in', 'the', 'NBA', '.'],
  id=0,
  spans=[
      Span(start=0, end=14, label="Michael Jordan", text="Michael Jordan"),
      Span(start=50, end=53, label="National Basketball Association", text="NBA"),
  ],
  triples=[],
  candidates=Candidates(
      span=[
          [
              [
                  {
                      "text": "Michael Jordan",
                      "id": -731245042436891448,
                  },
                  {
                      "text": "National Basketball Association",
                      "id": 8135443493867772328,
                  },
                  {
                      "text": "Walter Jordan",
                      "id": -5873847607270755146,
                      "metadata": {},
                  },
                  {"text": "Jordan", "id": 6387058293887192208, "metadata": {}},
                  {
                      "text": "50 Greatest Players in NBA History",
                      "id": 2173802663468652889,
                  },
              ]
          ]
      ],
  ),
)

CLI

ReLiK provides a CLI to serve a FastAPI server for the model or to perform inference on a dataset.

`relik serve`

relik serve --help

Usage: relik serve [OPTIONS] RELIK_PRETRAINED [DEVICE] [RETRIEVER_DEVICE]                             
                    [DOCUMENT_INDEX_DEVICE] [READER_DEVICE] [PRECISION]                                
                    [RETRIEVER_PRECISION] [DOCUMENT_INDEX_PRECISION]                                   
                    [READER_PRECISION] [ANNOTATION_TYPE]                                               
                                                                                                       
╭─ Arguments ─────────────────────────────────────────────────────────────────────────────────────────╮
│ *    relik_pretrained              TEXT                        [default: None] [required]           │
│      device                        [DEVICE]                    The device to use for relik (e.g.,   │
│                                                                'cuda', 'cpu').                      │
│                                                                [default: None]                      │
│      retriever_device              [RETRIEVER_DEVICE]          The device to use for the retriever  │
│                                                                (e.g., 'cuda', 'cpu').               │
│                                                                [default: None]                      │
│      document_index_device         [DOCUMENT_INDEX_DEVICE]     The device to use for the index      │
│                                                                (e.g., 'cuda', 'cpu').               │
│                                                                [default: None]                      │
│      reader_device                 [READER_DEVICE]             The device to use for the reader     │
│                                                                (e.g., 'cuda', 'cpu').               │
│                                                                [default: None]                      │
│      precision                     [PRECISION]                 The precision to use for relik       │
│                                                                (e.g., '32', '16').                  │
│                                                                [default: 32]                        │
│      retriever_precision           [RETRIEVER_PRECISION]       The precision to use for the         │
│                                                                retriever (e.g., '32', '16').        │
│                                                                [default: None]                      │
│      document_index_precision      [DOCUMENT_INDEX_PRECISION]  The precision to use for the index   │
│                                                                (e.g., '32', '16').                  │
│                                                                [default: None]                      │
│      reader_precision              [READER_PRECISION]          The precision to use for the reader  │
│                                                                (e.g., '32', '16').                  │
│                                                                [default: None]                      │
│      annotation_type               [ANNOTATION_TYPE]           The type of annotation to use (e.g., │
│                                                                'CHAR', 'WORD').                     │
│                                                                [default: char]                      │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────╮
│ --host                         TEXT     [default: 0.0.0.0]                                          │
│ --port                         INTEGER  [default: 8000]                                             │
│ --frontend    --no-frontend             [default: no-frontend]                                      │
│ --help                                  Show this message and exit.                                 │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────╯

For example:

relik serve sapienzanlp/relik-entity-linking-large

`relik inference`

relik inference --help

  Usage: relik inference [OPTIONS] MODEL_NAME_OR_PATH INPUT_PATH OUTPUT_PATH

╭─ Arguments ─────────────────────────────────────────────────────────────────────────────────────────────╮
│ *    model_name_or_path      TEXT  [default: None] [required]                                           │
│ *    input_path              TEXT  [default: None] [required]                                           │
│ *    output_path             TEXT  [default: None] [required]                                           │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────╮
│ --batch-size                               INTEGER  [default: 8]                                        │
│ --num-workers                              INTEGER  [default: 4]                                        │
│ --device                                   TEXT     [default: cuda]                                     │
│ --precision                                TEXT     [default: fp16]                                     │
│ --top-k                                    INTEGER  [default: 100]                                      │
│ --window-size                              INTEGER  [default: None]                                     │
│ --window-stride                            INTEGER  [default: None]                                     │
│ --annotation-type                          TEXT     [default: char]                                     │
│ --progress-bar        --no-progress-bar             [default: progress-bar]                             │
│ --model-kwargs                             TEXT     [default: None]                                     │
│ --inference-kwargs                         TEXT     [default: None]                                     │
│ --help                                              Show this message and exit.                         │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────╯

For example:

relik inference sapienzanlp/relik-entity-linking-large data.txt output.jsonl

Docker Images

Docker images for ReLiK are available on Docker Hub. You can pull the latest image with:

docker pull sapienzanlp/relik:latest

and run the image with:

docker run -p 12345:8000 sapienzanlp/relik:latest -c relik-ie/relik-cie-small

The API will be available at http://localhost:12345. It exposes a single endpoint /relik with several parameters that can be passed to the model. Quick documentation of the API can be found at http://localhost:12345/docs. Here is a simple example of how to query the API:

curl -X 'GET' \
  'http://127.0.0.1:12345/api/relik?text=Michael%20Jordan%20was%20one%20of%20the%20best%20players%20in%20the%20NBA.&is_split_into_words=false&retriever_batch_size=32&reader_batch_size=32&return_windows=false&use_doc_topic=false&annotation_type=char&relation_threshold=0.5' \
  -H 'accept: application/json'
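The query string in the curl call is a percent-encoded list of parameters. As a hedged sketch, the same URL can be assembled in Python with the standard library. The helper name `build_relik_url` is hypothetical, the host, port, and path are copied from the curl example, and only a subset of the parameters is passed for brevity.

```python
from urllib.parse import quote, urlencode


def build_relik_url(text, host="127.0.0.1", port=12345, **params):
    """Assemble the GET URL used in the curl example (spaces encoded as %20)."""
    query = {"text": text, **params}
    # quote_via=quote encodes spaces as %20 instead of the default '+'.
    return f"http://{host}:{port}/api/relik?{urlencode(query, quote_via=quote)}"


url = build_relik_url(
    "Michael Jordan was one of the best players in the NBA.",
    is_split_into_words="false",
    annotation_type="char",
    relation_threshold=0.5,
)
print(url)
```

Once the container is running, the resulting URL could be fetched with `urllib.request.urlopen(url)` or any HTTP client. The response schema is not described here, so it is left unparsed.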

Here is the full list of parameters that can be passed to the Docker image:

docker run sapienzanlp/relik:latest -h

Usage: relik [-h --help] [-c --config] [-p --precision] [-d --device] [--retriever] [--retriever-device] 
[--retriever-precision] [--index-device] [--index-precision] [--reader] [--reader-device] [--reader-precision] 
[--annotation-type] [--frontend] [--workers] -- start the FastAPI server for the RElik model

where:
    -h --help               Show this help text
    -c --config             Pretrained ReLiK config name (from HuggingFace) or path
    -p --precision          Precision, default '32'.
    -d --device             Device to use, default 'cpu'.
    --retriever             Override retriever model name.
    --retriever-device      Override retriever device.
    --retriever-precision   Override retriever precision.
    --index-device          Override index device.
    --index-precision       Override index precision.
    --reader                Override reader model name.
    --reader-device         Override reader device.
    --reader-precision      Override reader precision.
    --annotation-type       Annotation type ('char', 'word'), default 'char'.
    --frontend              Whether to start the frontend server.
    --workers               Number of workers to use.

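Before You Start

In the following sections, we provide a step-by-step guide on how to prepare the data, train the retriever and reader, and evaluate the model.

Entity Linking

All your data should have the following structure: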

{
  "doc_id" : int,  # Unique identifier for the document
  "doc_text" : txt,  # Text of the document
  "doc_span_annotations" : # Char level annotations
    [
      [ start, end, label ],
      [ start, end, label ],
      ...
    ]
}
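As an illustration of the structure above, the following sketch builds one record and serializes it as a single JSONL line. It is not taken from the ReLiK codebase: the helper name `make_el_record` is hypothetical, and the end-exclusive character offsets are an assumption based on the spans shown in the Quick Start output.

```python
import json


def make_el_record(doc_id, doc_text, annotations):
    """Build one record; annotations are (start, end, label) character offsets.

    Assumption: `end` is exclusive, so doc_text[start:end] is the mention.
    """
    for start, end, _label in annotations:
        if not (0 <= start < end <= len(doc_text)):
            raise ValueError(f"span ({start}, {end}) is outside the text")
    return {
        "doc_id": doc_id,
        "doc_text": doc_text,
        "doc_span_annotations": [[s, e, label] for s, e, label in annotations],
    }


record = make_el_record(
    0,
    "Michael Jordan was one of the best players in the NBA.",
    [(0, 14, "Michael Jordan"), (50, 53, "National Basketball Association")],
)
line = json.dumps(record)  # one JSON object per line in a .jsonl file
print(line)
```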

We used the BLINK (Wu et al., 2019) and AIDA (Hoffart et al, 2011) datasets for training and evaluation. More specifically, we used the BLINK dataset to pre-train the retriever and the AIDA dataset to fine-tune the retriever and train the reader.

The BLINK dataset can be downloaded from the GENRE repo using this script. We used blink-train-kilt.jsonl and blink-dev-kilt.jsonl as training and validation datasets. Assuming the two files have been downloaded to the data/blink folder, we converted the BLINK dataset to the ReLiK format with the following script:

# Train
python scripts/data/blink/preprocess_genre_blink.py \
  data/blink/blink-train-kilt.jsonl \
  data/blink/processed/blink-train-kilt-relik.jsonl

# Dev
python scripts/data/blink/preprocess_genre_blink.py \
  data/blink/blink-dev-kilt.jsonl \
  data/blink/processed/blink-dev-kilt-relik.jsonl

The AIDA dataset is not publicly available, but we provide the file we used without the text field. You can find the file in ReLiK format in the data/aida/processed folder.

The Wikipedia index we used can be downloaded from here.

Relation Extraction

All your data should have the following structure:

{
  "doc_id" : int,  # Unique identifier for the document
  "doc_words" : list[txt],  # Tokenized text of the document
  "doc_span_annotations" : # Token level annotations of mentions (label is optional)
    [
      [ start, end, label ],
      [ start, end, label ],
      ...
    ],
  "doc_triplet_annotations" : # Triplet annotations
  [
    {
      "subject" : [ start, end, label ], # label is optional
      "relation" : name, # type is optional
      "object" : [ start, end, label ], # label is optional
    },
    {
      "subject" : [ start, end, label ], # label is optional
      "relation" : name, # type is optional
      "object" : [ start, end, label ], # label is optional
    },
  ]
}
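To make the token-level structure above concrete, here is a minimal sketch that builds one record from a tokenized sentence. Several details are assumptions and not confirmed by this README: the helper name `make_re_record` is hypothetical, `end` is treated as an exclusive token index, and the span labels `person` and `organization` are purely illustrative. The relation name `company` is taken from the Quick Start output.

```python
import json


def make_re_record(doc_id, doc_words, triplets):
    """Build one record; spans are (start, end, label) in token indices.

    Assumption: `end` is exclusive, so doc_words[start:end] is the mention.
    """
    mentions = []
    annotations = []
    for subject, relation, obj in triplets:
        for span in (subject, obj):
            start, end, _label = span
            if not (0 <= start < end <= len(doc_words)):
                raise ValueError(f"span {span} is outside the token list")
            if list(span) not in mentions:
                mentions.append(list(span))
        annotations.append(
            {"subject": list(subject), "relation": relation, "object": list(obj)}
        )
    return {
        "doc_id": doc_id,
        "doc_words": doc_words,
        "doc_span_annotations": mentions,
        "doc_triplet_annotations": annotations,
    }


words = ["Michael", "Jordan", "was", "one", "of", "the",
         "best", "players", "in", "the", "NBA", "."]
record = make_re_record(
    0,
    words,
    [((0, 2, "person"), "company", (10, 11, "organization"))],  # labels are illustrative
)
print(json.dumps(record))
```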

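For Relation Extraction, we provide an example of how to preprocess the NYT dataset from raw_nyt, taken from CopyRE. Download the dataset to data/raw_nyt and then run:

python scripts/data/nyt/preprocess_nyt.py data/raw_nyt data/nyt/processed/

Be aware that, for a fair comparison, we reproduced the preprocessing from previous work, which leads to duplicate triplets due to the wrong handling of repeated surface forms for entity spans. If you want to correctly parse the original data into the ReLiK format, you can set the flag --legacy-format False. Note that the provided RE NYT models were trained on the legacy format.

Retriever

We perform a two-step training process for the retriever. First, we "pre-train" the retriever using the BLINK (Wu et al., 2019) dataset, and then we "fine-tune" it using AIDA (Hoffart et al, 2011).

Data Preparation

The retriever requires a dataset in a format similar to DPR: a jsonl file where each line is a dictionary with the following keys: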

{
  "question": "....",
  "positive_ctxs": [{
    "title": "...",
    "text": "...."
  }],
  "negative_ctxs": [{
    "title": "...",
    "text": "...."
  }],
  "hard_negative_ctxs": [{
    "title": "...",
    "text": "...."
  }]
}
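The keys listed above can be filled in with a few lines of Python. This is only a sketch of the record layout, not the conversion that `relik data convert-to-dpr` performs: the helper name `make_dpr_record` is hypothetical, and the passage texts are placeholders that simply repeat the titles.

```python
import json


def make_dpr_record(question, positives, negatives=(), hard_negatives=()):
    """Build one DPR-style training record from (title, text) pairs."""

    def as_ctxs(pairs):
        return [{"title": title, "text": text} for title, text in pairs]

    return {
        "question": question,
        "positive_ctxs": as_ctxs(positives),
        "negative_ctxs": as_ctxs(negatives),
        "hard_negative_ctxs": as_ctxs(hard_negatives),
    }


record = make_dpr_record(
    "Michael Jordan was one of the best players in the NBA.",
    positives=[
        ("Michael Jordan", "Michael Jordan"),  # placeholder passage text
        ("National Basketball Association", "National Basketball Association"),
    ],
    hard_negatives=[("Walter Jordan", "Walter Jordan")],
)
line = json.dumps(record)  # one record per line in the .jsonl file
print(line)
```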

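The retriever also needs an index to search for the documents. The documents to index can be either a JSONL file or a TSV file similar to DPR:

jsonl: each line is a JSON object with the following keys: id, text, metadata
tsv: each line is a tab-separated string with the id and text columns, followed by any other columns, which will be stored in the metadata field

jsonl example:

{
  "id": "...",
  "text": "...",
  "metadata": ["{...}"]
},
...

tsv example:

id \t text \t any other column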
...
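The two document formats carry the same information, so one can be derived from the other. The sketch below converts TSV text into JSONL lines under two assumptions that this README does not confirm: the TSV has a header row naming the `id` and `text` columns, and the extra columns are stored as a dictionary under `metadata`. The function name `tsv_to_jsonl_lines` and the `definition` column are hypothetical.

```python
import csv
import io
import json


def tsv_to_jsonl_lines(tsv_text):
    """Yield one JSON line per TSV row, moving extra columns into `metadata`."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        doc = {"id": row.pop("id"), "text": row.pop("text"), "metadata": row}
        yield json.dumps(doc)


sample = (
    "id\ttext\tdefinition\n"
    "0\tMichael Jordan\tillustrative metadata value\n"
    "1\tNational Basketball Association\tanother illustrative value\n"
)
lines = list(tsv_to_jsonl_lines(sample))
for line in lines:
    print(line)
```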

Entity Linking

BLINK

Once you have the BLINK dataset in the ReLiK format, you can create the windows with the following script:

# train
relik data create-windows \
  data/blink/processed/blink-train-kilt-relik.jsonl \
  data/blink/processed/blink-train-kilt-relik-windowed.jsonl

# dev
relik data create-windows \
  data/blink/processed/blink-dev-kilt-relik.jsonl \
  data/blink/processed/blink-dev-kilt-relik-windowed.jsonl
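For readers unfamiliar with windowing, the sketch below shows the general idea of splitting a token sequence into overlapping windows. It is not ReLiK's implementation of `create-windows`: the function name `make_windows`, the output keys, and the window size and stride values are all illustrative assumptions.

```python
def make_windows(tokens, window_size=32, stride=16):
    """Split tokens into overlapping windows of at most `window_size` tokens."""
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    if not tokens:
        return []
    windows = []
    start = 0
    while True:
        end = min(start + window_size, len(tokens))
        windows.append({"offset": start, "tokens": tokens[start:end]})
        if end == len(tokens):
            break  # the last window reaches the end of the document
        start += stride
    return windows


words = ["Michael", "Jordan", "was", "one", "of", "the",
         "best", "players", "in", "the", "NBA", "."]
windows = make_windows(words, window_size=5, stride=3)
for window in windows:
    print(window["offset"], window["tokens"])
```

Choosing a stride smaller than the window size makes consecutive windows overlap, so a mention cut at one window boundary still appears whole in a neighbouring window.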

and then convert it to the DPR format:

# train
relik data convert-to-dpr \
  data/blink/processed/blink-train-kilt-relik-windowed.jsonl \
  data/blink/processed/blink-train-kilt-relik-windowed-dpr.jsonl \
  data/kb/wikipedia/documents.jsonl \
  --title-map data/kb/wikipedia/title_map.json

# dev
relik data convert-to-dpr \
  data/blink/processed/blink-dev-kilt-relik-windowed.jsonl \
  data/blink/processed/blink-dev-kilt-relik-windowed-dpr.jsonl \
  data/kb/wikipedia/documents.jsonl \
  --title-map data/kb/wikipedia/title_map.json

AIDA

Since the AIDA dataset is not publicly available, we can only provide the annotations for the AIDA dataset in the ReLiK format as an example. Assuming you have the full AIDA dataset in data/aida, you can convert it to the ReLiK format and then create the windows with the following script:

relik data create-windows \
  data/aida/processed/aida-train-relik.jsonl \
  data/aida/processed/aida-train-relik-windowed.jsonl

and then convert it to the DPR format:

relik data convert-to-dpr \
  data/aida/processed/aida-train-relik-windowed.jsonl \
  data/aida/processed/aida-train-relik-windowed-dpr.jsonl \
  data/kb/wikipedia/documents.jsonl \
  --title-map data/kb/wikipedia/title_map.json

Relation Extraction

NYT

relik data create-windows \
  data/data/processed/nyt/train.jsonl \
  data/data/processed/nyt/train-windowed.jsonl \
  --is-split-into-words \
  --window-size none

and then convert it to the DPR format:

relik data convert-to-dpr \
  data/data/processed/nyt/train-windowed.jsonl \
  data/data/processed/nyt/train-windowed-dpr.jsonl

Training the model

The relik retriever train command can be used to train the retriever. It requires the following arguments:

config_path: The path to the configuration file.
overrides: A list of overrides to the configuration file, in the format key=value.
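The training commands in this README pass overrides with dotted keys, for example model.language_model=intfloat/e5-base-v2. The sketch below shows one way such key=value strings map onto a nested configuration. It is an illustration of the format only, not how the relik CLI parses its arguments internally; the function name `parse_overrides` is hypothetical, and values are kept as plain strings.

```python
def parse_overrides(overrides):
    """Turn ["a.b=value", ...] into a nested dict {"a": {"b": "value"}}."""
    config = {}
    for item in overrides:
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {item!r}")
        *parents, leaf = key.split(".")
        node = config
        for part in parents:
            # Walk down the nested dictionaries, creating them as needed.
            node = node.setdefault(part, {})
        node[leaf] = value
    return config


parsed = parse_overrides(
    [
        "model.language_model=intfloat/e5-base-v2",
        "data.shared_params.documents_path=data/kb/wikipedia/documents.jsonl",
        "train.only_test=True",
    ]
)
print(parsed)
```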

Examples of configuration files can be found in the relik/retriever/conf folder.

Entity Linking

The configuration files in relik/retriever/conf are pretrain_iterable_in_batch.yaml and finetune_iterable_in_batch.yaml.

For instance, to train the retriever on the AIDA dataset, you can run the following command:

relik retriever train relik/retriever/conf/finetune_iterable_in_batch.yaml \
  model.language_model=intfloat/e5-base-v2 \
  data.train_dataset_path=data/aida/processed/aida-train-relik-windowed-dpr.jsonl \
  data.val_dataset_path=data/aida/processed/aida-dev-relik-windowed-dpr.jsonl \
  data.test_dataset_path=data/aida/processed/aida-test-relik-windowed-dpr.jsonl \
  data.shared_params.documents_path=data/kb/wikipedia/documents.jsonl

Relation Extraction

The configuration file in relik/retriever/conf is finetune_nyt_iterable_in_batch.yaml, which we used to fine-tune the retriever on the NYT dataset. For cIE, we repurpose the retriever pre-trained on BLINK in the previous step.

For instance, to train the retriever on the NYT dataset, you can run the following command:

relik retriever train relik/retriever/conf/finetune_nyt_iterable_in_batch.yaml \
  model.language_model=intfloat/e5-base-v2 \
  data.train_dataset_path=data/nyt/processed/nyt-train-relik-windowed-dpr.jsonl \
  data.val_dataset_path=data/nyt/processed/nyt-dev-relik-windowed-dpr.jsonl \
  data.test_dataset_path=data/nyt/processed/nyt-test-relik-windowed-dpr.jsonl

Inference

By passing train.only_test=True to the relik retriever train command, you can skip the training and only evaluate the model. It also needs the path to the PyTorch Lightning checkpoint and the dataset to evaluate on.

relik retriever train relik/retriever/conf/finetune_iterable_in_batch.yaml \
  train.only_test=True \
  test_dataset_path=data/aida/processed/aida-test-relik-windowed-dpr.jsonl \
  model.checkpoint_path=path/to/checkpoint

The retriever encoder can be saved from the checkpoint with the following code:

from relik.retriever.lightning_modules.pl_modules import GoldenRetrieverPLModule

checkpoint_path = "path/to/checkpoint"
retriever_folder = "path/to/retriever"

# If you want to push the model to the Hugging Face Hub set push_to_hub=True
push_to_hub = False
# If you want to push the model to the Hugging Face Hub set the repo_id
repo_id = "sapienzanlp/relik-retriever-e5-base-v2-aida-blink-encoder"

pl_module = GoldenRetrieverPLModule.load_from_checkpoint(checkpoint_path)
pl_module.model.save_pretrained(retriever_folder, push_to_hub=push_to_hub, repo_id=repo_id)

With push_to_hub=True the model will be pushed to the Hugging Face Hub, with repo_id as the id of the repository where the model will be pushed.

The retriever needs an index to search for the documents. The index can be created using the relik retriever create-index command:

relik retriever create-index --help 

 Usage: relik retriever build-index [OPTIONS] QUESTION_ENCODER_NAME_OR_PATH                                                                   
                                    DOCUMENT_PATH OUTPUT_FOLDER                                                                                                                                              
╭─ Arguments ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *    question_encoder_name_or_path      TEXT  [default: None] [required]                                                                   │
│ *    document_path                      TEXT  [default: None] [required]                                                                   │
│ *    output_folder                      TEXT  [default: None] [required]                                                                   │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --document-file-type                                  TEXT     [default: jsonl]                                                            │
│ --passage-encoder-name-or-path                        TEXT     [default: None]                                                             │
│ --indexer-class                                       TEXT     [default: relik.retriever.indexers.inmemory.InMemoryDocumentIndex]          │
│ --batch-size                                          INTEGER  [default: 512]                                                              │
│ --num-workers                                         INTEGER  [default: 4]                                                                │
│ --passage-max-length                                  INTEGER  [default: 64]                                                               │
│ --device                                              TEXT     [default: cuda]                                                             │
│ --index-device                                        TEXT     [default: cpu]                                                              │
│ --precision                                           TEXT     [default: fp32]                                                             │
│ --push-to-hub                     --no-push-to-hub             [default: no-push-to-hub]                                                   │
│ --repo-id                                             TEXT     [default: None]                                                             │
│ --help                                                         Show this message and exit.                                                 │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Once you have the encoder and the index, you can load the retriever from a repo id or a local path:

from relik.retriever import GoldenRetriever

encoder_name_or_path = "sapienzanlp/relik-retriever-e5-base-v2-aida-blink-encoder"
index_name_or_path = "sapienzanlp/relik-retriever-e5-base-v2-aida-blink-wikipedia-index"

retriever = GoldenRetriever(
  question_encoder=encoder_name_or_path,
  document_index=index_name_or_path,
  device="cuda",  # or "cpu"
  precision="16",  # or "32", "bf16"
  index_device="cuda",  # or "cpu"
  index_precision="16",  # or "32", "bf16"
)

and then use it to retrieve documents:

retriever.retrieve("Michael Jordan was one of the best players in the NBA.", top_k=100)

Reader

The reader is responsible for extracting entities and relations from documents, given a set of candidates (e.g., possible entities or relations). The reader can be trained for span extraction or triplet extraction. RelikReaderForSpanExtraction is used for span extraction, i.e. Entity Linking, while RelikReaderForTripletExtraction is used for triplet extraction, i.e. Relation Extraction.

Data Preparation

The reader requires the windowized dataset we created in the Before You Start section, augmented with the candidates from the retriever. The candidates can be added to the dataset using the relik retriever add-candidates command.

relik retriever add-candidates --help

 Usage: relik retriever add-candidates [OPTIONS] QUESTION_ENCODER_NAME_OR_PATH                                 
                                       DOCUMENT_NAME_OR_PATH INPUT_PATH                                        
                                       OUTPUT_PATH

╭─ Arguments ─────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *    question_encoder_name_or_path      TEXT  [default: None] [required]                                    │
│ *    document_name_or_path              TEXT  [default: None] [required]                                    │
│ *    input_path                         TEXT  [default: None] [required]                                    │
│ *    output_path                        TEXT  [default: None] [required]                                    │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --passage-encoder-name-or-path                           TEXT     [default: None]                           │
│ --relations                                              BOOLEAN  [default: False]                          │
│ --top-k                                                  INTEGER  [default: 100]                            │
│ --batch-size                                             INTEGER  [default: 128]                            │
│ --num-workers                                            INTEGER  [default: 4]                              │
│ --device                                                 TEXT     [default: cuda]                           │
│ --index-device                                           TEXT     [default: cpu]                            │
│ --precision                                              TEXT     [default: fp32]                           │
│ --use-doc-topics                  --no-use-doc-topics             [default: no-use-doc-topics]              │
│ --help                                                            Show this message and exit.               │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

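Entity Linking

We need to add candidates to each window that will be used by the reader, using our previously trained retriever. Here is an example using our retriever already trained on AIDA, for the train split: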

relik retriever add-candidates sapienzanlp/relik-retriever-e5-base-v2-aida-blink-encoder sapienzanlp/relik-retriever-e5-base-v2-aida-blink-wikipedia-index data/aida/processed/aida-train-relik-windowed.jsonl data/aida/processed/aida-train-relik-windowed-candidates.jsonl

Relation Extraction

The same applies to Relation Extraction. If you want to use our trained retriever:

relik retriever add-candidates sapienzanlp/relik-retriever-small-nyt-question-encoder sapienzanlp/relik-retriever-small-nyt-document-index data/nyt/processed/nyt-train-relik-windowed.jsonl data/nyt/processed/nyt-train-relik-windowed-candidates.jsonl

Training the model

Similar to the retriever, the relik reader train command can be used to train the reader. It requires the following arguments:

config_path: The path to the configuration file.
overrides: A list of overrides to the configuration file, in the format key=value.

Examples of configuration files can be found in the relik/reader/conf folder.

Entity Linking

The configuration files in relik/reader/conf are large.yaml and base.yaml, which we used to train the large and base reader, respectively. For instance, to train the large reader on the AIDA dataset, run:

relik reader train relik/reader/conf/large.yaml \
  train_dataset_path=data/aida/processed/aida-train-relik-windowed-candidates.jsonl \
  val_dataset_path=data/aida/processed/aida-dev-relik-windowed-candidates.jsonl \
  test_dataset_path=data/aida/processed/aida-dev-relik-windowed-candidates.jsonl

Relation Extraction

The configuration files in relik/reader/conf are large_nyt.yaml, base_nyt.yaml and small_nyt.yaml. For instance, to train the large reader on the NYT dataset, run:

relik reader train relik/reader/conf/large_nyt.yaml \
  train_dataset_path=data/nyt/processed/nyt-train-relik-windowed-candidates.jsonl \
  val_dataset_path=data/nyt/processed/nyt-dev-relik-windowed-candidates.jsonl \
  test_dataset_path=data/nyt/processed/nyt-test-relik-windowed-candidates.jsonl

Inference

The reader can be saved from the checkpoint with the following code:

from relik.reader.lightning_modules.relik_reader_pl_module import RelikReaderPLModule

checkpoint_path = "path/to/checkpoint"
reader_folder = "path/to/reader"

# Set push_to_hub=True to also upload the model to the Hugging Face Hub
push_to_hub = False
# Repository ID on the Hub that the model is uploaded to when push_to_hub=True
repo_id = "sapienzanlp/relik-reader-deberta-v3-large-aida"

# Load the Lightning module from the checkpoint, then save only the core reader model
pl_model = RelikReaderPLModule.load_from_checkpoint(checkpoint_path)
pl_model.relik_reader_core_model.save_pretrained(
    reader_folder, push_to_hub=push_to_hub, repo_id=repo_id
)

With push_to_hub=True the model is pushed to the Hugging Face Hub, with repo_id as the ID of the repository where the model will be uploaded.

The reader can be loaded from a repo ID or a local path:

from relik.reader import RelikReaderForSpanExtraction, RelikReaderForTripletExtraction

# the reader for span extraction
reader_span = RelikReaderForSpanExtraction(
    "sapienzanlp/relik-reader-deberta-v3-large-aida"
)
# the reader for triplet extraction
reader_triplets = RelikReaderForTripletExtraction(
    "sapienzanlp/relik-reader-deberta-v3-large-nyt"
)

It can then be used to extract entities and relations:

# an example of candidates for the reader
candidates = ["Michael Jordan", "NBA", "Chicago Bulls", "Basketball", "United States"]
reader_span.read("Michael Jordan was one of the best players in the NBA.", candidates=candidates)

Performance

Entity Linking

We evaluate the performance of ReLiK on Entity Linking using GERBIL. The following table shows the results (InKB Micro F1) of ReLiK Small, Base and Large:

Model	AIDA	MSNBC	Der	K50	R128	R500	O15	O16	Tot	OOD	AIT (m:s)
GENRE	83.7	73.7	54.1	60.7	46.7	40.3	56.1	50.0	58.2	54.5	38:00
EntQA	85.8	72.1	52.9	64.5	54.1	41.9	61.1	51.3	60.5	56.4	20:00
ReLiK Small	82.2	72.7	55.6	68.3	48.0	42.3	62.7	53.6	60.7	57.6	00:29
ReLiK Base	85.3	72.3	55.6	68.0	48.1	41.6	62.5	52.3	60.7	57.2	00:29
ReLiK Large	86.4	75.0	56.3	72.8	51.7	43.0	65.1	57.2	63.4	60.2	01:46

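Evaluation of the compared systems (InKB Micro F1) on the in-domain AIDA test set and the out-of-domain MSNBC (MSN), Derczynski (Der), KORE50 (K50), N3-Reuters-128 (R128), N3-RSS-500 (R500), OKE-15 (O15) and OKE-16 (O16) test sets. GENRE uses mention dictionaries. The AIT column shows the time in minutes and seconds (m:s) that each system needs to process the whole AIDA test set on an NVIDIA RTX 4090, except for EntQA, which does not fit in 24GB of RAM and for which an A100 is used.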
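For readers unfamiliar with the metric, the following is a minimal sketch of micro-averaged F1 over a set of documents: true positives, false positives and false negatives are summed across all documents before precision and recall are computed. It assumes exact matching on (start, end, entity) tuples. GERBIL's actual InKB evaluation, including how it filters entities outside the knowledge base, is more elaborate than this.

```python
from typing import Sequence, Set, Tuple

# (start offset, end offset, linked entity title)
Annotation = Tuple[int, int, str]


def micro_f1(
    gold_docs: Sequence[Set[Annotation]], pred_docs: Sequence[Set[Annotation]]
) -> float:
    """Micro F1: pool counts over all documents, then compute P, R and F1 once."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        tp += len(gold & pred)  # annotations predicted exactly right
        fp += len(pred - gold)  # predicted but not in the gold standard
        fn += len(gold - pred)  # in the gold standard but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


gold = [{(0, 14, "Michael Jordan"), (47, 50, "NBA")}, {(0, 4, "Rome")}]
pred = [
    {(0, 14, "Michael Jordan"), (47, 50, "National Basketball Association")},
    {(0, 4, "Rome")},
]
print(round(micro_f1(gold, pred), 3))
```

Because counts are pooled, documents with many mentions weigh more than short ones, which is what distinguishes micro from macro averaging.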

To evaluate ReLiK we use the following steps:

Download the GERBIL server.
Start the GERBIL server:

cd gerbil && ./start.sh

Start the following service:

cd gerbil-SpotWrapNifWS4Test && mvn clean -Dmaven.tomcat.port=1235 tomcat:run

Start the ReLiK server for GERBIL, providing the model name as an argument (e.g. sapienzanlp/relik-entity-linking-large):

python relik/reader/utils/gerbil.py --relik-model-name sapienzanlp/relik-entity-linking-large

Open the URL http://localhost:1234/gerbil and:
- Select A2KB as the experiment type
- Select "Ma - strong annotation match"
- In the Name field, write the name you want to give to the experiment
- In the URI field, write: http://localhost:1235/gerbil-spotWrapNifWS4Test/myalgorithm
- Select the datasets (we use AIDA-B, MSNBC, Der, K50, R128, R500, OKE15, OKE16)
- Finally, run the experiment

Relation Extraction

The following table shows the results (Micro F1) of ReLiK Large on the NYT dataset:

Model	NYT	NYT (Pretr)	AIT (m:s)
REBEL	93.1	93.4	01:45
UiE	93.5	-	-
USM	94.0	94.1	-
ReLiK Large	95.0	94.9	00:30

To evaluate Relation Extraction we can use the reader directly with the script relik/reader/trainer/predict_re.py, pointing it at a file with already retrieved candidates. To use our trained reader:

python relik/reader/trainer/predict_re.py --model_path sapienzanlp/relik-reader-deberta-v3-large-nyt --data_path /Users/perelluis/Documents/relik/data/debug/test.window.candidates.jsonl --is-eval

Note that we compute the threshold for predicting relations based on the development set. To compute it while evaluating, you can run the following:

python relik/reader/trainer/predict_re.py --model_path sapienzanlp/relik-reader-deberta-v3-large-nyt --data_path /Users/perelluis/Documents/relik/data/debug/dev.window.candidates.jsonl --is-eval --compute-threshold
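The idea of tuning a prediction threshold on the development set can be sketched as follows: given scored candidate relations from the dev set, try each observed score as a cut-off and keep the one with the best F1. This is an illustrative sketch only. The function names and the input format are assumptions, and the script's actual procedure behind --compute-threshold may differ.

```python
from typing import List, Tuple

# (model score, whether the predicted relation is in the gold standard)
ScoredPrediction = Tuple[float, bool]


def f1_at(threshold: float, scored: List[ScoredPrediction], num_gold: int) -> float:
    """F1 obtained when only predictions with score >= threshold are kept."""
    kept = [correct for score, correct in scored if score >= threshold]
    tp = sum(kept)
    if tp == 0:
        return 0.0
    precision = tp / len(kept)
    recall = tp / num_gold
    return 2 * precision * recall / (precision + recall)


def best_threshold(scored: List[ScoredPrediction], num_gold: int) -> float:
    """Return the observed score that maximises dev-set F1 (lowest one on ties)."""
    candidates = sorted({score for score, _ in scored})
    return max(candidates, key=lambda t: f1_at(t, scored, num_gold))


# Hypothetical dev-set predictions; 3 gold relations in total.
dev = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.2, False)]
print(best_threshold(dev, num_gold=3))
```

The threshold chosen this way is then held fixed when predicting on the test set, so the test data never influences the choice.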

Cite this work

If you use any part of this work, please consider citing the paper as follows:

@inproceedings{orlando-etal-2024-relik,
    title     = "Retrieve, Read and LinK: Fast and Accurate Entity Linking and Relation Extraction on an Academic Budget",
    author    = "Orlando, Riccardo and Huguet Cabot, Pere-Llu{\'i}s and Barba, Edoardo and Navigli, Roberto",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2024",
    month     = aug,
    year      = "2024",
    address   = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
}

License

The data and software are licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0.
