Codebase for RetroMAE and beyond.
We have uploaded some checkpoints to the Hugging Face Hub.
| Model | Description | Link |
|---|---|---|
| RetroMAE | Pre-trained on Wikipedia and BookCorpus | Shitao/RetroMAE |
| RetroMAE_MSMARCO | Pre-trained on the MSMARCO passage data | Shitao/RetroMAE_MSMARCO |
| RetroMAE_MSMARCO_finetune | RetroMAE_MSMARCO fine-tuned on the MSMARCO passage data | Shitao/RetroMAE_MSMARCO_finetune |
| RetroMAE_MSMARCO_distill | RetroMAE_MSMARCO fine-tuned on the MSMARCO passage data by minimizing the KL-divergence with a cross-encoder | Shitao/RetroMAE_MSMARCO_distill |
| RetroMAE_BEIR | RetroMAE fine-tuned on the MSMARCO passage data for BEIR (using the official negatives provided by BEIR) | Shitao/RetroMAE_BEIR |
You can load them easily using their identifier strings. For example:

```python
from transformers import AutoModel
model = AutoModel.from_pretrained('Shitao/RetroMAE')
```

RetroMAE provides a strong initialization for dense retrievers: after fine-tuning with in-domain data, it yields high-quality supervised retrieval performance in the corresponding scenario. Besides, it substantially improves the pre-trained model's transferability, leading to superior zero-shot performance on out-of-domain datasets.
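A minimal usage sketch for retrieval-style scoring follows. It assumes (as in the paper) that the [CLS] hidden state serves as the text embedding; the `encode` helper is illustrative, not an API of this repo:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Shitao/RetroMAE')
model = AutoModel.from_pretrained('Shitao/RetroMAE').eval()

@torch.no_grad()
def encode(texts):
    # Tokenize, run the encoder, and take the [CLS] hidden state as the embedding.
    inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors='pt')
    return model(**inputs).last_hidden_state[:, 0]

query = encode(['what is masked auto-encoding?'])
passages = encode(['Masked auto-encoding reconstructs a corrupted input from a compressed representation.',
                   'The weather was pleasant over the weekend.'])
print(query @ passages.T)  # dot-product relevance scores, shape [1, 2]
```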
Supervised retrieval performance on the MSMARCO passage dataset:

| Model | MRR@10 | Recall@1000 |
|---|---|---|
| Bert | 0.346 | 0.964 |
| RetroMAE | 0.382 | 0.981 |

Comparison with other retrieval-oriented pre-training methods on the MSMARCO passage dataset:

| Model | MRR@10 | Recall@1000 |
|---|---|---|
| coCondenser | 0.382 | 0.984 |
| RetroMAE | 0.393 | 0.985 |
| RetroMAE(distillation) | 0.416 | 0.988 |

Zero-shot retrieval performance on the BEIR benchmark (average NDCG@10 over 18 datasets):

| Model | Avg NDCG@10 (18 datasets) |
|---|---|
| Bert | 0.371 |
| Condenser | 0.407 |
| RetroMAE | 0.452 |
| RetroMAE v2 | 0.491 |
```bash
git clone https://github.com/staoxiao/RetroMAE.git
cd RetroMAE
pip install .
```

For development, install as editable:

```bash
pip install -e .
```
This repo covers two stages: pre-training and fine-tuning. First, pre-train RetroMAE on a general dataset (or the downstream dataset) with the masked language modeling loss. Then fine-tune RetroMAE on the downstream dataset with a contrastive loss. To achieve better performance, you can also fine-tune RetroMAE by distilling the scores provided by a cross-encoder. For the detailed workflow, please refer to our examples.
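For intuition, here is a minimal, self-contained sketch of the two fine-tuning objectives described above: an in-batch contrastive loss and a KL-divergence distillation loss against cross-encoder scores. This is an illustration only, not the repo's implementation; the function names, tensor shapes, and temperature values are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q_emb, p_emb, temperature=0.05):
    """In-batch contrastive loss: the i-th query matches the i-th passage,
    and all other passages in the batch act as negatives."""
    scores = q_emb @ p_emb.T / temperature          # [batch, batch] similarity matrix
    labels = torch.arange(q_emb.size(0), device=q_emb.device)
    return F.cross_entropy(scores, labels)

def distill_loss(student_scores, teacher_scores, temperature=1.0):
    """KL-divergence between the bi-encoder's score distribution and the
    cross-encoder's score distribution over the same candidate passages."""
    student_log_probs = F.log_softmax(student_scores / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_scores / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction='batchmean')

# Toy usage with random embeddings/scores:
q = F.normalize(torch.randn(8, 768), dim=-1)   # 8 query embeddings
p = F.normalize(torch.randn(8, 768), dim=-1)   # 8 positive passage embeddings
print(contrastive_loss(q, p))
print(distill_loss(torch.randn(8, 16), torch.randn(8, 16)))  # 16 candidates per query
```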
Pre-training:

```bash
torchrun --nproc_per_node 8 \
  -m pretrain.run \
  --output_dir {path to save ckpt} \
  --data_dir {your data} \
  --do_train True \
  --model_name_or_path bert-base-uncased \
  --pretrain_method {retromae or dupmae}
```
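For intuition about the `retromae` pre-training method, the sketch below illustrates the masked auto-encoding idea in a heavily simplified form: the encoder compresses a moderately masked input into its [CLS] embedding, and a shallow decoder must reconstruct an aggressively masked copy of the input from that embedding. This is a conceptual sketch only; the mask ratios are illustrative, and it omits the enhanced decoding (two-stream attention with position-specific attention masks) described in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
encoder = BertModel.from_pretrained('bert-base-uncased')
hidden = encoder.config.hidden_size

# Shallow decoder and an MLM head for reconstructing the aggressively masked copy.
decoder = nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True)
mlm_head = nn.Linear(hidden, tokenizer.vocab_size)

def random_mask(input_ids, ratio):
    """Replace a random subset of ordinary tokens with [MASK]."""
    ids = input_ids.clone()
    special = (ids == tokenizer.cls_token_id) | (ids == tokenizer.sep_token_id) | (ids == tokenizer.pad_token_id)
    mask = (torch.rand(ids.shape) < ratio) & ~special
    ids[mask] = tokenizer.mask_token_id
    return ids, mask

batch = tokenizer(['RetroMAE pre-trains a retrieval-oriented encoder with masked auto-encoding.'],
                  return_tensors='pt')
enc_ids, _ = random_mask(batch['input_ids'], ratio=0.3)         # moderate masking for the encoder
dec_ids, dec_mask = random_mask(batch['input_ids'], ratio=0.5)  # aggressive masking for the decoder

# The encoder's [CLS] hidden state is the sentence embedding the decoder must rely on.
sentence_emb = encoder(input_ids=enc_ids, attention_mask=batch['attention_mask']).last_hidden_state[:, :1]

# Decoder input: sentence embedding in the [CLS] slot + embeddings of the masked copy.
dec_inputs = encoder.embeddings(input_ids=dec_ids)
dec_inputs = torch.cat([sentence_emb, dec_inputs[:, 1:]], dim=1)

logits = mlm_head(decoder(dec_inputs))
loss = F.cross_entropy(logits[dec_mask], batch['input_ids'][dec_mask])
print(loss)
```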
Fine-tuning the bi-encoder:

```bash
torchrun --nproc_per_node 8 \
  -m bi_encoder.run \
  --output_dir {path to save ckpt} \
  --model_name_or_path Shitao/RetroMAE \
  --do_train \
  --corpus_file ./data/BertTokenizer_data/corpus \
  --train_query_file ./data/BertTokenizer_data/train_query \
  --train_qrels ./data/BertTokenizer_data/train_qrels.txt \
  --neg_file ./data/train_negs.tsv
```
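After fine-tuning, the saved bi-encoder can be used to embed a corpus and run dense retrieval. Below is a minimal sketch with faiss; the index type, the `{path to save ckpt}` placeholder, and the use of the [CLS] vector as the embedding mirror the commands above and are illustrative rather than the repo's own evaluation pipeline (see our examples for that).

```python
import faiss
import torch
from transformers import AutoModel, AutoTokenizer

ckpt = '{path to save ckpt}'  # the --output_dir used for fine-tuning above
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt).eval()

@torch.no_grad()
def encode(texts):
    # Embed texts with the [CLS] hidden state of the bi-encoder.
    inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors='pt')
    return model(**inputs).last_hidden_state[:, 0].numpy().astype('float32')

passages = ['first passage ...', 'second passage ...']
index = faiss.IndexFlatIP(model.config.hidden_size)  # exact inner-product (dot) search
index.add(encode(passages))

scores, ids = index.search(encode(['example query']), 2)  # retrieve the top-2 passages
print(list(zip(ids[0].tolist(), scores[0].tolist())))
```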
If you find our work helpful, please consider citing us:
```
@inproceedings{RetroMAE,
  title={RetroMAE: Pre-Training Retrieval-oriented Language Models Via Masked Auto-Encoder},
  author={Shitao Xiao and Zheng Liu and Yingxia Shao and Zhao Cao},
  url={https://arxiv.org/abs/2205.12035},
  booktitle={EMNLP},
  year={2022},
}
```