SimCSE with CARDS

Improving Contrastive Learning of Sentence Embeddings with Case-Augmented Positives and Retrieved Negatives

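This repository implements the switch-case augmentation and hard negative retrieval described in the paper. Combining the two approaches with SimCSE yields a model called Contrastive learning with Augmented and Retrieved Data for Sentence embedding (CARDS).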
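
As a rough illustration of switch-case augmentation, the sketch below flips the case of the first letter of each word with a given probability. This is an assumption about the general idea only, not the repository's `v1` or `v2` method; the function name `switch_case` and the word-level granularity are hypothetical.

```python
import random


def switch_case(sentence: str, probability: float, rng: random.Random) -> str:
    """Flip the case of each word's first letter with the given probability.

    Hypothetical helper for illustration; not the repository's implementation.
    """
    words = []
    for word in sentence.split(" "):
        # Only alphabetic first characters have a case to flip.
        if word and word[0].isalpha() and rng.random() < probability:
            word = word[0].swapcase() + word[1:]
        words.append(word)
    return " ".join(words)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so repeated runs give the same output
    print(switch_case("The story of the first book continues.", 0.05, rng))
```

With a small probability most sentences stay unchanged, so the augmented sentence remains a close positive for the original.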

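Table 1. Example sentences for switch-case and retrieval.

Type	Sentence
Original	The story of the first book continues.
Switch-case	the story of the first book continues.
Retrieved	The story begins as a typical love story.
Random	This is kept as a temporary result.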
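
Table 1 contrasts a retrieved negative with a random one. One way to retrieve such negatives is to rank all other sentences by cosine similarity to the query embedding and sample from the closest ones. The sketch below does this in plain Python over toy vectors; the function names, the parameters `knn` and `sample_k`, and the vectors are illustrative assumptions and may not match how the repository implements retrieval.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_negatives(query_idx, embeddings, knn, sample_k, rng):
    """Sample up to `sample_k` indices from the `knn` nearest neighbours of one sentence."""
    query = embeddings[query_idx]
    scored = [
        (cosine(query, emb), idx)
        for idx, emb in enumerate(embeddings)
        if idx != query_idx  # a sentence is never its own negative
    ]
    scored.sort(reverse=True)  # most similar first
    neighbours = [idx for _, idx in scored[:knn]]
    return rng.sample(neighbours, min(sample_k, len(neighbours)))


if __name__ == "__main__":
    toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
    print(retrieve_negatives(0, toy, knn=2, sample_k=1, rng=random.Random(0)))
```

Sampling from a neighbourhood rather than always taking the single nearest sentence is one way to reduce the chance of repeatedly picking a near-duplicate as a negative.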

Results and checkpoints

Table 2. Performance on sentence embedding tasks

Pretrained	Finetuned	STS12	STS13	STS14	STS15	STS16	STSB	SICK-R	Avg.
RoBERTa-base	SimCSE + CARDS	72.65	84.26	76.52	82.98	82.73	82.04	70.66	78.83
RoBERTa-large	SimCSE + CARDS	74.63	86.27	79.25	85.93	83.17	83.86	72.77	80.84

Download links: cards-roberta-base (download, 440MB), cards-roberta-large (download, 1.23GB).

Table 3. Performance on GLUE tasks

Pretrained	Finetuned	MNLI-m	QQP	QNLI	SST-2	CoLA	STS-B	MRPC	RTE	Avg.
debertav2-xxlarge	R-Drop + switch-case	92.0	93.0	96.3	97.2	75.5	93.6	93.9	94.2	91.7

Usage

This repo is built on Hugging Face Transformers and SimCSE. See requirements.txt for package versions.

Data preparation

# 1. Download wiki-1m dataset:
# - use wget -P target_folder in data/datasets/download_wiki.sh, and run
bash data/datasets/download_wiki.sh
# - modify train_file in scripts/bert/run_simcse_pretraining_v2.sh

# 2. preprocess wiki-1m dataset for negative retrieval
# - deduplicate the wiki-1m dataset, and (optionally) remove sentences with fewer than three words
# - modify paths in data/datasets/simcse_utils.py then run it to get model representations for all sentences in dataset
python data/datasets/simcse_utils.py

# 3. Download SentEval evaluation data:
# - use wget -P target_folder in data/datasets/download_senteval.sh, and run
bash data/datasets/download_senteval.sh
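
Step 2 asks for deduplication and an optional short-sentence filter before representations are computed. A minimal sketch of that preprocessing follows, assuming one sentence per line and whitespace-delimited words. The function name and the `min_words` parameter are illustrative, and the repository's own preprocessing may differ.

```python
def dedup_sentences(lines, min_words=3):
    """Keep the first occurrence of each sentence and drop those with fewer than `min_words` words."""
    seen = set()
    kept = []
    for line in lines:
        sentence = line.strip()
        if len(sentence.split()) < min_words:
            continue  # optional filter for very short sentences
        if sentence in seen:
            continue  # exact duplicate of an earlier sentence
        seen.add(sentence)
        kept.append(sentence)
    return kept


if __name__ == "__main__":
    sample = [
        "The story of the first book continues.\n",
        "The story of the first book continues.\n",
        "Too short\n",
        "The story begins as a typical love story.\n",
    ]
    for sentence in dedup_sentences(sample):
        print(sentence)
```

Deduplicating first matters for retrieval: an exact copy of a sentence would otherwise rank as its nearest neighbour and be treated as a negative.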

Fine-tune RoBERTa with CARDS

Before running the code, users may need to change the default model checkpoint and I/O paths.

scripts/bert/run_simcse_grid.sh: lines 42-50 (train_file, train_file_dedupl (optional), output_dir, tensorboard_dir, sent_rep_cache_file, senteval_data_dir).
scripts/bert/run_simcse_pretraining.sh: lines 17-20 (train_file, output_dir, tensorboard_dir, senteval_data_dir), line 45 (sent_rep_cache_files), lines 166-213 (model_name_or_path, config_name).

Fine-tuning + evaluation

# MUST cd to the folder which contains data/, examples/, models/, scripts/, training/ and utils/
cd YOUR_CARDS_WORKING_DIRECTORY

# roberta-base
new_train_file=path_to_wiki1m
sent_rep_cache_file=path_to_sentence_representation_file  # generated by data/datasets/simcse_utils.py 

# run a model with a single set of hyper-parameters
# when running the model for the very first time, add overwrite_cache=True; this produces a processed training data cache.
bash scripts/bert/run_simcse_grid.sh \
    model_type=roberta model_size=base \
    cuda=0,1,2,3 seed=42 learning_rate=4e-5 \
    new_train_file=${new_train_file} sent_rep_cache_file=${sent_rep_cache_file} \
    dyn_knn=65 sample_k=1 knn_metric=cos \
    switch_case_probability=0.05 switch_case_method=v2 \
    print_only=False

# grid-search on hyper-parameters
bash scripts/bert/run_simcse_grid.sh \
    model_type=roberta model_size=base \
    cuda=0,1,2,3 seed=42 learning_rate=1e-5,2e-5,4e-5 \
    new_train_file=${new_train_file} sent_rep_cache_file=${sent_rep_cache_file} \
    dyn_knn=0,9,65 sample_k=1 knn_metric=cos \
    switch_case_probability=0,0.05,0.1,0.15 switch_case_method=v2 \
    print_only=False

# roberta-large
bash scripts/bert/run_simcse_grid.sh \
    model_type=roberta model_size=large \
    cuda=0,1,2,3 seed=42 learning_rate=7.5e-6 \
    new_train_file=${new_train_file} sent_rep_cache_file=${sent_rep_cache_file} \
    dyn_knn=9 sample_k=1 knn_metric=cos \
    switch_case_probability=0.1 switch_case_method=v1 \
    print_only=False
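
The grid-search command above passes comma-separated values for learning_rate, dyn_knn and switch_case_probability. Assuming the script runs every combination of those values (an assumption; its expansion logic is not shown here, and cuda=0,1,2,3 is presumably a device list rather than a grid dimension), the number of runs can be estimated as a Cartesian product. The helper name `expand_grid` is hypothetical.

```python
from itertools import product


def expand_grid(options):
    """Expand {"name": "v1,v2"} style options into one dict per combination."""
    names = list(options)
    value_lists = [options[name].split(",") for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


if __name__ == "__main__":
    grid = expand_grid({
        "learning_rate": "1e-5,2e-5,4e-5",
        "dyn_knn": "0,9,65",
        "switch_case_probability": "0,0.05,0.1,0.15",
    })
    print(len(grid))  # 3 * 3 * 4 = 36 combinations
    print(grid[0])
```

Under that assumption the example grid amounts to 36 training runs, which is worth knowing before launching it on shared GPUs.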

Evaluation only

# provide train_file, output_dir, tensorboard_dir if different from the default values
model_name=name_of_saved_model  # e.g., roberta_large_bs128x4_lr2e-5_switchcase0.1_v2
bash ./scripts/bert/run_simcse_pretraining.sh \
    model_name_or_path=${output_dir}/${model_name} model_name=${model_name} config_name=${output_dir}/${model_name}/config.json \
    train_file=${train_file} output_dir=${output_dir}/test_only tensorboard_dir=${tensorboard_dir} \
    model_type=roberta model_size=base do_train=False \
    cuda=0 ngpu=1

Known issues

For unknown reasons, the best-performing set of model hyperparameters differed between Hugging Face Transformers v4.11.3 and v4.15.0. The hyperparameters above were grid-searched with Transformers 4.11.3.
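
Because the tuned hyperparameters are tied to a specific Transformers version, it may help to check the installed version before training. The sketch below uses the standard library's `importlib.metadata`; the helper names and message text are illustrative and not part of the repository.

```python
from importlib import metadata

EXPECTED_TRANSFORMERS = "4.11.3"  # version used for the grid search, per the note above


def check_transformers_version(installed, expected=EXPECTED_TRANSFORMERS):
    """Return a warning string if the installed version differs from the expected one, else None."""
    if installed is None:
        return "transformers is not installed"
    if installed != expected:
        return f"transformers {installed} found; hyperparameters were tuned with {expected}"
    return None


def installed_transformers_version():
    """Look up the installed transformers version, or None if the package is missing."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    message = check_transformers_version(installed_transformers_version())
    print(message or "transformers version matches")
```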

Citation

@inproceedings{cards,
    title = "Improving Contrastive Learning of Sentence Embeddings with Case-Augmented Positives and Retrieved Negatives",
    author = "Wei Wang and Liangzhu Ge and Jingqiao Zhang and Cheng Yang",
    booktitle = "The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR)",
    year = "2022"
}
