RepoSim
1.0.0
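A method for detecting semantically similar Python repositories using pre-trained language models.
This repository contains the methods and scripts for detecting semantically similar Python repositories using pre-trained language models.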
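Currently, our best-performing model is one fine-tuned on the code search task with the AdvTest dataset. For an evaluation of different language models on repository similarity comparison, see this Jupyter notebook: notebooks/BiEncoder/Embeddings_evaluation.ipynb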
For more details about the implementation and applications of our approach, see the scripts folder.
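As a rough illustration of how an embedding-based comparison between repositories can work, here is a minimal sketch. It assumes that each repository is represented by the average of its function-level embeddings and that two repositories are compared by cosine similarity; both choices are assumptions made for this example and are not confirmed by this README. The helper names (mean_pool, cosine_similarity, repo_similarity) and the toy vectors are hypothetical, and real embeddings would come from a language model rather than being written by hand.

```python
import math


def mean_pool(vectors):
    """Average a list of equal-length vectors into a single vector."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def repo_similarity(funcs_a, funcs_b):
    """Compare two repositories given their function-level embeddings.

    Assumption for this sketch: a repository embedding is the mean of
    its function embeddings.
    """
    return cosine_similarity(mean_pool(funcs_a), mean_pool(funcs_b))


# Toy, hand-written 3-dimensional "embeddings" (hypothetical data).
repo_a = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
repo_b = [[0.9, 0.1, 0.0]]
repo_c = [[0.0, 0.0, 1.0]]

sim_ab = repo_similarity(repo_a, repo_b)  # near-identical direction
sim_ac = repo_similarity(repo_a, repo_c)  # orthogonal direction
print(f"a vs b: {sim_ab:.3f}, a vs c: {sim_ac:.3f}")
```

With these toy vectors, repo_a and repo_b point in the same direction and score close to 1, while repo_a and repo_c are orthogonal and score 0.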
RepoSnipy is a neural search engine for discovering similar Python repositories on GitHub, powered by RepoSim. Feel free to give it a try!
RepoSim
├── LICENSE
├── README.md
├── data
│   ├── df2txt.py  # Convert the PoolC dataset for the clone detection fine-tuning script
│   ├── repo_topic.json  # Topic-Repos mapping
│   └── repo_topic.py  # Script to select repos from topics
├── notebooks
│   ├── BiEncoder
│   │   ├── Embeddings_evaluation.ipynb  # Evaluations for comparing different language models
│   │   ├── RepoSim.ipynb  # Our approach's implementation
│   │   └── UnixCoder_C4_Evaluation.ipynb
│   └── CrossEncoder
│       ├── Clone_Detection_C4_Evaluation.ipynb
│       ├── HungarianAlgorithm.ipynb  # Cross-encoder approaches for repo similarity comparison
│       └── keonalgorithms-TheAlgorithmsPython.csv  # Evaluation results by HungarianAlgorithm.ipynb
└── scripts
    ├── LICENSE
    ├── PlayGround.ipynb  # For experimenting with repo embeddings
    ├── README.md
    ├── pipeline.py  # Our approach's implementation as a HuggingFace pipeline
    ├── repo_sim.py
    └── requirements.txt

Distributed under the MIT License. See LICENSE for more information.