RepoSim
1.0.0
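A method for detecting semantically similar Python repositories using pre-trained language models.
This repository contains the methods and scripts for detecting semantically similar Python repositories using pre-trained language models.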
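Currently, our best-performing model is fine-tuned on the code search task with the AdvTest dataset. For an evaluation of different language models on repository similarity comparison, see this Jupyter notebook: notebooks/BiEncoder/Embeddings_evaluation.ipynb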
For more details on the implementation and applications of our approach, see the scripts folder.
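As a rough illustration of what comparing repositories by embedding can look like, here is a minimal sketch in plain Python. It is not the code in scripts or the notebooks: it assumes each repository is represented by the mean of its per-function embedding vectors and that repositories are ranked by cosine similarity. The function names (mean_pool, cosine_similarity, rank_similar), the repository names, and the three-dimensional vectors are all hypothetical toy values standing in for real language-model output.

```python
import math
from typing import Dict, List, Sequence, Tuple


def mean_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average a non-empty list of equal-length vectors component-wise."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_similar(
    query: Sequence[float], candidates: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Return (name, similarity) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, emb)) for name, emb in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy per-function embeddings (assumption: real ones would come from a
# pre-trained language model and have hundreds of dimensions).
function_embeddings = {
    "repo_a": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    "repo_b": [[1.0, 1.0, 0.0]],
    "repo_c": [[0.0, 0.0, 1.0]],
}

# One vector per repository, obtained by mean pooling its function vectors.
repo_embeddings = {name: mean_pool(vecs) for name, vecs in function_embeddings.items()}

# Rank the other repositories against repo_a.
ranking = rank_similar(
    repo_embeddings["repo_a"],
    {name: emb for name, emb in repo_embeddings.items() if name != "repo_a"},
)
```

With these toy vectors, repo_b points in the same direction as repo_a and ranks first, while repo_c is orthogonal to it and ranks last.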
RepoSnipy is a neural search engine for discovering similar Python repositories on GitHub, powered by RepoSim. Feel free to try it out!
RepoSim
├── LICENSE
├── README.md
├── data
│   ├── df2txt.py          # Convert the PoolC dataset for the clone-detection fine-tuning script
│   ├── repo_topic.json    # Topic-to-repos mapping
│   └── repo_topic.py      # Script to select repos from topics
├── notebooks
│   ├── BiEncoder
│   │   ├── Embeddings_evaluation.ipynb    # Evaluations comparing different language models
│   │   ├── RepoSim.ipynb                  # Our approach's implementation
│   │   └── UnixCoder_C4_Evaluation.ipynb
│   └── CrossEncoder
│       ├── Clone_Detection_C4_Evaluation.ipynb
│       ├── HungarianAlgorithm.ipynb       # Cross-encoder approaches for repo similarity comparison
│       └── keonalgorithms-TheAlgorithmsPython.csv    # Evaluation results produced by HungarianAlgorithm.ipynb
└── scripts
    ├── LICENSE
    ├── PlayGround.ipynb    # For experimenting with repo embeddings
    ├── README.md
    ├── pipeline.py         # Our approach's implementation as a HuggingFace pipeline
    ├── repo_sim.py
    └── requirements.txt

Distributed under the MIT License. See LICENSE for more information.