This repository contains the official implementation of the UPR (Unsupervised Passage Re-ranking) algorithm introduced in the paper "Improving Passage Retrieval with Zero-Shot Question Generation".


To use this repository, a standard PyTorch installation is required. We provide the dependencies in the requirements.txt file.
We recommend using one of the recent PyTorch containers from NGC. The Docker image can be pulled with the command `docker pull nvcr.io/nvidia/pytorch:22.01-py3`. To use this Docker image, the NVIDIA Container Toolkit also needs to be installed.
Inside the Docker container, please install the `transformers` and `sentencepiece` libraries with pip.
We follow the DPR convention and split Wikipedia articles into passages of 100 words each. The evidence file provided by DPR can be downloaded with the command
```bash
python utils/download_data.py --resource data.wikipedia-split.psgs_w100
```

The evidence file contains tab-separated fields for the passage id, passage text, and passage title.
```
id text title
1 " Aaron Aaron ( or ; " " Ahärôn " " ) is a prophet, high priest, and the brother of Moses in the Abrahamic religions. Knowledge of Aaron, along with his brother Moses, comes exclusively from religious texts, such as the Bible and Quran. The Hebrew Bible relates that, unlike Moses, who grew up in the Egyptian royal court, Aaron and his elder sister Miriam remained with their kinsmen in the eastern border-land of Egypt (Goshen). When Moses first confronted the Egyptian king about the Israelites, Aaron served as his brother's spokesman ( " " prophet " " ) to the Pharaoh. Part of the Law (Torah) that Moses received from " Aaron
2 " God at Sinai granted Aaron the priesthood for himself and his male descendants, and he became the first High Priest of the Israelites. Aaron died before the Israelites crossed the North Jordan river and he was buried on Mount Hor (Numbers 33:39; Deuteronomy 10:6 says he died and was buried at Moserah). Aaron is also mentioned in the New Testament of the Bible. According to the Book of Exodus, Aaron first functioned as Moses' assistant. Because Moses complained that he could not speak well, God appointed Aaron as Moses' " " prophet " " (Exodus 4:10-17; 7:1). At the command of Moses, he let " Aaron
... ... ...
```
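As a quick sanity check, the evidence file can be loaded with a few lines of Python in order to look up passages by id. This is only a minimal sketch, not part of the repository's code; the path below assumes the default download location.

```python
import csv

# Assumed download location of the DPR evidence file (adjust as needed).
evidence_path = "wikipedia-split/psgs_w100.tsv"

# The file is tab-separated with a header row: id, text, title.
# Note: loading the full file (~21M passages) into memory needs a lot of RAM;
# for large-scale use, stream over the rows instead of building a dict.
passages = {}
with open(evidence_path, newline="", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="\t")
    next(reader)  # skip the header row
    for passage_id, text, title in reader:
        passages[passage_id] = {"text": text, "title": title}

print(len(passages))
print(passages["1"]["title"])  # "Aaron" in the excerpt above
```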
The input data format is JSON. Each dictionary in the JSON file contains a question, a list of the top-K retrieved passages, and (optionally) a list of possible answers. For each top-K passage, we include the (evidence) id, has_answer, and (optional) retriever score attributes. The id attribute is the passage id from the Wikipedia evidence file, and has_answer indicates whether the passage text contains an answer span. Below is a template of the .json file:
```json
[
{
"question" : " .... " ,
"answers" : [ " ... " , " ... " , " ... " ],
"ctxs" : [
{
"id" : " .... " ,
"score" : " ... " ,
"has_answer" : " .... " ,
},
...
]
},
...
]
```

Below is an example where the queries are from the Natural Questions setting and the passages are retrieved with BM25:
```json
[
{
"question" : " who sings does he love me with reba " ,
"answers" : [ " Linda Davis " ],
"ctxs" : [
{
"id" : 11828871 ,
"score" : 18.3 ,
"has_answer" : false
},
{
"id" : 11828872 ,
"score" : 14.7 ,
"has_answer" : false ,
},
{
"id" : 11828866 ,
"score" : 14.4 ,
"has_answer" : true ,
},
...
]
},
...
]
```

We provide the top-1000 retrieved passages for the Natural Questions-Open (NQ), TriviaQA, SQuAD-Open, WebQuestions (WebQ), and EntityQuestions (EQ) datasets from the following retrievers: BM25, MSS, Contriever, DPR, and MSS-DPR. Please use the following command to download these datasets:
```bash
python utils/download_data.py \
    --resource {key from download_data.py's RESOURCES_MAP} \
    [optional --output_dir {your location}]
```

For example, to download all of the top-K data, use `--resource data`. To download the top-K data of a specific retriever such as BM25, use `--resource data.retriever-outputs.bm25`.
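Once downloaded, a retriever-output file can be inspected directly with standard Python, for instance to check the format and to compute an answer-recall style statistic from the has_answer flags. This is a minimal sketch under the assumption that the BM25 results for the NQ dev set were saved to `bm25/nq-dev.json`; it is not a script from the repository.

```python
import json

# Assumed path to a downloaded retriever-output file (BM25 results on NQ dev).
retriever_output_path = "bm25/nq-dev.json"

with open(retriever_output_path, encoding="utf-8") as f:
    examples = json.load(f)

first = examples[0]
print("number of questions:", len(examples))
print("question:", first["question"])
print("answers:", first.get("answers", []))
print("retrieved passages:", len(first["ctxs"]))

# Fraction of questions with at least one answer-bearing passage in the top-20,
# computed from the has_answer flags of the retrieved passages.
top_k = 20
hits = sum(any(ctx["has_answer"] for ctx in ex["ctxs"][:top_k]) for ex in examples)
print(f"top-{top_k} retrieval accuracy: {hits / len(examples):.3f}")
```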
To re-rank the retrieved passages with UPR, use the following command, in which the paths to the evidence file and the top-K retrieved passages file need to be provided.
```bash
DISTRIBUTED_ARGS="-m torch.distributed.launch --nproc_per_node 8 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 6000"

python ${DISTRIBUTED_ARGS} upr.py \
    --num-workers 2 \
    --log-interval 1 \
    --topk-passages 1000 \
    --hf-model-name "bigscience/T0_3B" \
    --use-gpu \
    --use-bf16 \
    --report-topk-accuracies 1 5 20 100 \
    --evidence-data-path "wikipedia-split/psgs_w100.tsv" \
    --retriever-topk-passages-path "bm25/nq-dev.json"
```

The `--use-bf16` option saves memory and improves speed on Ampere GPUs such as A100 or A6000. However, this argument should be removed when using V100 GPUs.
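If you are unsure whether your GPU supports bfloat16 before deciding on `--use-bf16`, a small check like the one below can help. It is not part of `upr.py`; it only queries PyTorch.

```python
import torch

# bfloat16 is supported on Ampere-class GPUs (e.g. A100, A6000) but not on V100.
if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    print("bf16 supported: keep --use-bf16")
else:
    print("bf16 not supported: drop --use-bf16 (e.g. on V100 GPUs)")
```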
We provide a sample script `upr-demo.sh` under the `examples` directory. To use this script, please modify the data and input/output file paths accordingly.
Below, we provide evaluation scores on the test sets of the datasets when using the T0-3B language model in UPR.
| Retriever (+ Re-ranker) | SQuAD-Open | TriviaQA | Natural Questions-Open | WebQuestions | EntityQuestions |
|---|---|---|---|---|---|
| MSS | 51.3 | 67.2 | 60.0 | 49.2 | 51.2 |
| MSS + UPR | 75.7 | 81.3 | 77.3 | 71.8 | 71.3 |
| BM25 | 71.1 | 76.4 | 62.9 | 62.4 | 71.2 |
| BM25 + UPR | 83.6 | 83.0 | 78.6 | 72.9 | 79.3 |
| Contriever | 63.4 | 73.9 | 67.9 | 65.7 | 63.0 |
| Contriever + UPR | 81.3 | 82.8 | 80.4 | 75.7 | 76.0 |
| Retriever (+ Re-ranker) | SQuAD-Open | TriviaQA | Natural Questions-Open | WebQuestions | EntityQuestions |
|---|---|---|---|---|---|
| DPR | 59.4 | 79.8 | 79.2 | 74.6 | 51.1 |
| DPR + UPR | 80.7 | 84.3 | 83.4 | 76.5 | 65.4 |
| MSS-DPR | 73.1 | 81.9 | 81.4 | 76.9 | 60.6 |
| MSS-DPR + UPR | 85.2 | 84.8 | 83.9 | 77.2 | 73.9 |
We take the union of the top-1000 passages retrieved by each of the BM25 and MSS retrievers for the Natural Questions dev set. This data file can be downloaded as:
```bash
python utils/download_data.py --resource data.retriever-outputs.mss-bm25-union.nq-dev
```

For these ablation experiments, we pass the argument `--topk-passages 2000`, since this file contains the union of two sets of top-1000 passages.
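To illustrate what the union file contains, the sketch below merges two per-question `ctxs` lists while de-duplicating by passage id. It is purely illustrative (the merged file above is already provided for download), and the file names `bm25/nq-dev.json` and `mss/nq-dev.json` are hypothetical local paths.

```python
import json

def merge_ctxs(ctxs_a, ctxs_b):
    """Union of two retrieved-passage lists, de-duplicated by passage id."""
    merged, seen = [], set()
    for ctx in ctxs_a + ctxs_b:
        if ctx["id"] not in seen:
            seen.add(ctx["id"])
            merged.append(ctx)
    return merged

# Hypothetical paths to the two per-retriever outputs on the NQ dev set.
with open("bm25/nq-dev.json", encoding="utf-8") as f:
    bm25 = json.load(f)
with open("mss/nq-dev.json", encoding="utf-8") as f:
    mss = json.load(f)

union = []
for ex_bm25, ex_mss in zip(bm25, mss):
    assert ex_bm25["question"] == ex_mss["question"]
    union.append({
        "question": ex_bm25["question"],
        "answers": ex_bm25.get("answers", []),
        "ctxs": merge_ctxs(ex_bm25["ctxs"], ex_mss["ctxs"]),
    })

print(len(union[0]["ctxs"]))  # up to 2000 passages per question
```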
| Language Model | Retriever | Top-1 | Top-5 | Top-20 | Top-100 |
|---|---|---|---|---|---|
| - | BM25 | 22.3 | 43.8 | 62.3 | 76.0 |
| - | MSS | 17.7 | 38.6 | 57.4 | 72.4 |
| T5 (3B) | BM25 + MSS | 22.0 | 50.5 | 71.4 | 84.0 |
| GPT-neo (2.7B) | BM25 + MSS | 27.2 | 55.0 | 73.9 | 84.2 |
| GPT-J (6B) | BM25 + MSS | 29.8 | 59.5 | 76.8 | 85.6 |
| T5-lm-adapt (250M) | BM25 + MSS | 23.9 | 51.4 | 70.7 | 83.1 |
| T5-lm-adapt (800M) | BM25 + MSS | 29.1 | 57.5 | 75.1 | 84.8 |
| T5-lm-adapt (3B) | BM25 + MSS | 29.7 | 59.9 | 76.9 | 85.6 |
| T5-lm-adapt (11B) | BM25 + MSS | 32.1 | 62.3 | 78.5 | 85.8 |
| T0-3B | BM25 + MSS | 36.7 | 64.9 | 79.1 | 86.1 |
| T0-11B | BM25 + MSS | 37.4 | 64.9 | 79.1 | 86.0 |
GPT models can be run in UPR using the script `gpt/upr_gpt.py`. This script has options similar to the `upr.py` script, except that we need to pass `--use-fp16` as an argument instead of `--use-bf16`. The argument to `--hf-model-name` can be either `EleutherAI/gpt-neo-2.7B` or `EleutherAI/gpt-j-6B`.
Please see the directory `Open-Domain-QA` for details on training and on running inference with pre-trained checkpoints.
For any errors or bugs in the codebase, please open a new issue or send an email to Devendra Singh Sachan ([email protected]).
If you find this code or data useful, please consider citing our paper:
```bibtex
@inproceedings{sachan2022improving,
    title = "Improving Passage Retrieval with Zero-Shot Question Generation",
    author = "Sachan, Devendra Singh and Lewis, Mike and Joshi, Mandar and Aghajanyan, Armen and Yih, Wen-tau and Pineau, Joelle and Zettlemoyer, Luke",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2204.07496",
    year = "2022"
}
```