SimCSE with CARDS

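Improving Contrastive Learning of Sentence Embeddings with Case-Augmented Positives and Retrieved Negatives

This repository implements the switch-case augmentation and hard negative retrieval from the paper "Improving Contrastive Learning of Sentence Embeddings with Case-Augmented Positives and Retrieved Negatives". Combining these two methods with SimCSE yields a model called Contrastive learning with Augmented and Retrieved Data for Sentence embedding (CARDS).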
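As a rough illustration of switch-case augmentation, the sketch below flips the case of the first letter of each word with a fixed probability. The function name `switch_case` and its signature are hypothetical, and the sketch does not reproduce the `v1`/`v2` variants selected by `switch_case_method`, which this README does not describe.

```python
import random
from typing import Optional


def switch_case(sentence: str, probability: float = 0.05,
                rng: Optional[random.Random] = None) -> str:
    """Flip the case of the first letter of each word with the given probability.

    Hypothetical helper for illustration only; not the repository's implementation.
    """
    rng = rng or random.Random()
    augmented = []
    for word in sentence.split(" "):
        # Only words that start with a letter can have their case switched.
        if word and word[0].isalpha() and rng.random() < probability:
            word = word[0].swapcase() + word[1:]
        augmented.append(word)
    return " ".join(augmented)


if __name__ == "__main__":
    print(switch_case("The story of the first book continues.", probability=0.5))
```

Because only letter case changes, the augmented sentence keeps the meaning of the original and can serve as a positive pair for it.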

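Table 1. Examples of case-switched and retrieved sample sentences.

Type	Sentence
Original	The story of the first book continues.
Case-switched	the story of the first book continues.
Retrieved	The story begins as a typical love story.
Random	This is as a temporary result.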
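A retrieved negative differs from a random one in that it is chosen from the nearest neighbours of the original sentence in embedding space. The sketch below is one plausible reading of the `dyn_knn`, `sample_k` and `knn_metric=cos` options used later in this README: rank all other sentences by cosine similarity, keep the top `knn`, and sample `sample_k` of them. The function names and the toy embeddings are assumptions made for illustration, not the repository's code.

```python
import math
import random
from typing import List, Optional, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_hard_negatives(query_idx: int,
                            embeddings: Sequence[Sequence[float]],
                            knn: int = 3,
                            sample_k: int = 1,
                            rng: Optional[random.Random] = None) -> List[int]:
    """Sample `sample_k` indices from the `knn` nearest neighbours of a sentence.

    Hypothetical helper: assumes hard negatives are drawn from the top-k
    cosine neighbours, excluding the query sentence itself.
    """
    rng = rng or random.Random()
    query = embeddings[query_idx]
    scored = [(cosine(query, emb), i)
              for i, emb in enumerate(embeddings) if i != query_idx]
    scored.sort(reverse=True)  # most similar first
    neighbours = [i for _, i in scored[:knn]]
    return rng.sample(neighbours, min(sample_k, len(neighbours)))


if __name__ == "__main__":
    toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
    print(retrieve_hard_negatives(0, toy_embeddings, knn=2, sample_k=1))
```

A brute-force scan like this is only practical for toy data; for a corpus the size of wiki-1m an approximate nearest-neighbour index would normally be used instead.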

Results and checkpoints

Table 2. Performance on sentence embedding tasks

Pretrained	Finetuned	STS12	STS13	STS14	STS15	STS16	STSB	SICK-R	Avg.
RoBERTa-base	SimCSE + CARDS	72.65	84.26	76.52	82.98	82.73	82.04	70.66	78.83
RoBERTa-large	SimCSE + CARDS	74.63	86.27	79.25	85.93	83.17	83.86	72.77	80.84

Download links: CARDS-RoBERTa-base (download, 440MB), CARDS-RoBERTa-large (download, 1.23GB).

Table 3. Performance on GLUE tasks

Pretrained	Finetuned	MNLI-m	QQP	QNLI	SST-2	CoLA	STS-B	MRPC	RTE	Avg.
DeBERTaV2-xxlarge	R-Drop + Switch-case	92.0	93.0	96.3	97.2	75.5	93.6	93.9	94.2	91.7

Usage

This repository is built on top of HuggingFace Transformers and SimCSE. See requirements for the package versions.

Data preparation

# 1. Download the wiki-1m dataset:
# - set target_folder for `wget -P target_folder` in data/datasets/download_wiki.sh, then run
bash data/datasets/download_wiki.sh
# - modify train_file in scripts/bert/run_simcse_pretraining_v2.sh

# 2. Preprocess the wiki-1m dataset for negative retrieval:
# - deduplicate the wiki-1m dataset and (optionally) remove sentences with fewer than three words
# - modify the paths in data/datasets/simcse_utils.py, then run it to get model representations for all sentences in the dataset
python data/datasets/simcse_utils.py

# 3. Download the SentEval evaluation data:
# - set target_folder for `wget -P target_folder` in data/datasets/download_senteval.sh, then run
bash data/datasets/download_senteval.sh
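The deduplication in step 2 is not spelled out here, so the following is only a minimal sketch of one way to do it: exact-match deduplication after stripping whitespace, combined with the optional filter on sentences shorter than three words. The function name `dedupe_sentences` and the exact-match criterion are assumptions, not the repository's procedure.

```python
from typing import Iterable, Iterator


def dedupe_sentences(lines: Iterable[str], min_words: int = 3) -> Iterator[str]:
    """Yield each distinct sentence once, skipping those with fewer than `min_words` words.

    Hypothetical helper: uses exact string matching after stripping whitespace.
    """
    seen = set()
    for line in lines:
        sentence = line.strip()
        if len(sentence.split()) < min_words:
            continue  # optional length filter from step 2
        if sentence in seen:
            continue  # exact duplicate of an earlier sentence
        seen.add(sentence)
        yield sentence


if __name__ == "__main__":
    sample = ["a b c\n", "a b c", "too short", "d e f g"]
    print(list(dedupe_sentences(sample)))
```

The generator keeps the first occurrence of each sentence and preserves the input order, so it can be applied line by line to a large text file without loading everything but the set of seen sentences into memory.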

Fine-tuning RoBERTa with CARDS

Before running the code, users may need to change the default model checkpoint and I/O paths, including:

scripts/bert/run_simcse_grid.sh: lines 42-50 (train_file, train_file_dedupl (optional), output_dir, tensorboard_dir, sent_rep_cache_file, senteval_data_dir)
scripts/bert/run_simcse_pretraining.sh: lines 17-20 (train_file, output_dir, tensorboard_dir, senteval_data_dir), line 45 (sent_rep_cache_files), lines 166-213 (model_name_or_path, config_name).

Fine-tuning + evaluation

# MUST cd to the folder which contains data/, examples/, models/, scripts/, training/ and utils/
cd YOUR_CARDS_WORKING_DIRECTORY

# roberta-base
new_train_file=path_to_wiki1m
sent_rep_cache_file=path_to_sentence_representation_file  # generated by data/datasets/simcse_utils.py

# run a model with a single set of hyper-parameters
# when running the model for the very first time, add overwrite_cache=True; this produces a processed training data cache
bash scripts/bert/run_simcse_grid.sh \
    model_type=roberta model_size=base \
    cuda=0,1,2,3 seed=42 learning_rate=4e-5 \
    new_train_file=${new_train_file} sent_rep_cache_file=${sent_rep_cache_file} \
    dyn_knn=65 sample_k=1 knn_metric=cos \
    switch_case_probability=0.05 switch_case_method=v2 \
    print_only=False

# grid-search on hyper-parameters
bash scripts/bert/run_simcse_grid.sh \
    model_type=roberta model_size=base \
    cuda=0,1,2,3 seed=42 learning_rate=1e-5,2e-5,4e-5 \
    new_train_file=${new_train_file} sent_rep_cache_file=${sent_rep_cache_file} \
    dyn_knn=0,9,65 sample_k=1 knn_metric=cos \
    switch_case_probability=0,0.05,0.1,0.15 switch_case_method=v2 \
    print_only=False

# roberta-large
bash scripts/bert/run_simcse_grid.sh \
    model_type=roberta model_size=large \
    cuda=0,1,2,3 seed=42 learning_rate=7.5e-6 \
    new_train_file=${new_train_file} sent_rep_cache_file=${sent_rep_cache_file} \
    dyn_knn=9 sample_k=1 knn_metric=cos \
    switch_case_probability=0.1 switch_case_method=v1 \
    print_only=False
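To get a feel for the size of the grid search above, the sketch below expands comma-separated values into individual runs. It assumes that the grid script takes the Cartesian product of the swept arguments and that `cuda=0,1,2,3` is a device list rather than a swept value; neither is confirmed by this README, and `expand_grid` is a hypothetical helper.

```python
from itertools import product
from typing import Dict, Iterator


def expand_grid(swept_args: Dict[str, str]) -> Iterator[Dict[str, str]]:
    """Yield one argument dict per combination of comma-separated values.

    Hypothetical helper: assumes a full Cartesian product over the swept keys.
    """
    keys = list(swept_args)
    value_lists = [swept_args[key].split(",") for key in keys]
    for combo in product(*value_lists):
        yield dict(zip(keys, combo))


if __name__ == "__main__":
    grid = {
        "learning_rate": "1e-5,2e-5,4e-5",
        "dyn_knn": "0,9,65",
        "switch_case_probability": "0,0.05,0.1,0.15",
    }
    runs = list(expand_grid(grid))
    print(len(runs))  # 3 * 3 * 4 combinations
    print(runs[0])
```

Under these assumptions the roberta-base grid search corresponds to 36 training runs, which is worth keeping in mind when budgeting GPU time.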

Evaluation only

# provide train_file, output_dir, tensorboard_dir if different from the default values
model_name=name_of_saved_model  # e.g., roberta_large_bs128x4_lr2e-5_switchcase0.1_v2
bash ./scripts/bert/run_simcse_pretraining.sh \
    model_name_or_path=${output_dir}/${model_name} model_name=${model_name} config_name=${output_dir}/${model_name}/config.json \
    train_file=${train_file} output_dir=${output_dir}/test_only tensorboard_dir=${tensorboard_dir} \
    model_type=roberta model_size=base do_train=False \
    cuda=0 ngpu=1

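Known issues

For unknown reasons, a good set of model hyper-parameters differs between HuggingFace Transformers v4.11.3 and v4.15.0. The hyper-parameters listed above were grid-searched on Transformers 4.11.3.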
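Since the hyper-parameters are tied to a specific Transformers version, it can help to check the installed version before training. The snippet below is a small sketch using only the Python standard library; the helper names are hypothetical and the version string is the one stated in this section.

```python
from importlib import metadata
from typing import Optional

TUNED_VERSION = "4.11.3"  # version on which the listed hyper-parameters were grid-searched


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_transformers(tuned: str = TUNED_VERSION) -> str:
    """Describe how the installed transformers version relates to the tuned one."""
    found = installed_version("transformers")
    if found is None:
        return "transformers is not installed"
    if found != tuned:
        return f"transformers {found} found; hyper-parameters were tuned on {tuned}"
    return f"transformers {found} matches the tuned version"


if __name__ == "__main__":
    print(check_transformers())
```

If the versions differ, re-running a small grid search is likely safer than reusing the values listed above.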

Citation

@inproceedings{cards,
    title = "Improving Contrastive Learning of Sentence Embeddings with Case-Augmented Positives and Retrieved Negatives",
    author = "Wei Wang and Liangzhu Ge and Jingqiao Zhang and Cheng Yang",
    booktitle = "The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR)",
    year = "2022"
}


