abstracts-search is a project about indexing 95 million academic publications into a single semantic search engine. The method behind it is to take the publicly-available abstracts in the OpenAlex dataset and generate embeddings using the all-MiniLM-L6-v2 model provided by sentence-transformers.
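Conceptually, searching over embeddings is a nearest-neighbor lookup: the query is embedded with the same model and compared against every stored vector. The following is a minimal sketch of that idea only, not the project's code. It assumes toy 3-dimensional vectors in place of the 384-dimensional outputs of all-MiniLM-L6-v2, a brute-force scan in place of the trained index, and hypothetical IDs; `cosine_top_k` is an illustrative name.

```python
from __future__ import annotations

import heapq
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def cosine_top_k(query, ids, vectors, k=2):
    """Return the k (id, score) pairs most similar to the query, best first."""
    q = normalize(query)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(v))), work_id)
        for work_id, v in zip(ids, vectors)
    )
    return [(work_id, score) for score, work_id in heapq.nlargest(k, scored)]


# Hypothetical IDs and toy vectors; real vectors would come from the model.
ids = ["W1", "W2", "W3"]
vectors = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
results = cosine_top_k([1.0, 0.0, 0.0], ids, vectors, k=2)
print(results)
```

A linear scan like this does not scale to 95 million vectors, which is presumably why a separate indexing step (train.py) exists.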
The project is split into three repositories:
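- abstracts-search: Hosts build.py and train.py, the embedding and indexing scripts respectively
- abstracts-embeddings: Hosts the raw embeddings (released under CC0) as a Hugging Face Dataset
- abstracts-index: Hosts the index and app.py, the search interface, as a Hugging Face Space (also released under CC0)

The project does not provide any of the data associated with the publications (titles, abstracts, authors, etc.). Instead, it contains only embeddings tagged with OpenAlex IDs, and those IDs are used to fetch that data from the OpenAlex API, so an internet connection is always required. Nevertheless, running the semantic search locally may still be desirable.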
If that is the case, the only repo that needs to be cloned is abstracts-index:
git lfs install
git clone https://huggingface.co/spaces/colonelwatch/abstracts-index
cd abstracts-index
pip3 install -r requirements.txt
python3 app.py
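Because the index stores only OpenAlex IDs, each search hit has to be resolved against the OpenAlex API before it can be displayed. The sketch below shows one way that lookup could work. It is not taken from app.py, and the helper names `work_url`, `rebuild_abstract`, and `fetch_work` are hypothetical. It relies on the OpenAlex works endpoint returning abstracts as an `abstract_inverted_index` (a mapping from each word to its positions), which has to be flattened back into text.

```python
from __future__ import annotations

import json
import urllib.request

OPENALEX_WORKS_URL = "https://api.openalex.org/works/"


def work_url(openalex_id: str) -> str:
    """Build the API URL for a work, accepting 'W123' or a full openalex.org URL."""
    short_id = openalex_id.rstrip("/").rsplit("/", 1)[-1]
    return OPENALEX_WORKS_URL + short_id


def rebuild_abstract(inverted_index: dict[str, list[int]] | None) -> str:
    """Turn a word -> positions mapping back into plain text."""
    if not inverted_index:
        return ""  # some works have no abstract available
    positioned = [
        (pos, word)
        for word, positions in inverted_index.items()
        for pos in positions
    ]
    return " ".join(word for _, word in sorted(positioned))


def fetch_work(openalex_id: str, timeout: float = 10.0) -> dict:
    """Fetch title and abstract for one work (needs an internet connection)."""
    with urllib.request.urlopen(work_url(openalex_id), timeout=timeout) as resp:
        work = json.load(resp)
    return {
        "title": work.get("title"),
        "abstract": rebuild_abstract(work.get("abstract_inverted_index")),
    }
```

Calling `fetch_work` with an ID returned by the search would yield a small dictionary with the title and the reconstructed abstract. The two helper functions work offline and can be checked on their own.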
All builds were done on a machine with 16 GB of RAM (plus 16 GB of swap), an RTX 2060 6GB, and 1 TB of scratch disk, so these are the minimum requirements for now.
There are two ways to build the index: from the abstracts-embeddings (recommended) or from the OpenAlex S3 bucket.
To build from abstracts-embeddings, make sure conda and gcc-12 are available:
git lfs install
git clone https://github.com/colonelwatch/abstracts-search
cd abstracts-search
env CC=gcc-12 conda env create -f environment.yml
conda activate abstracts-search
git submodule update --init abstracts-embeddings
cd abstracts-embeddings
cat embeddings_*.memmap > embeddings.memmap
cd ..
env GIT_LFS_SKIP_SMUDGE=1 git submodule update --init abstracts-index
python train.py
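The `cat embeddings_*.memmap > embeddings.memmap` step above works only if each shard is a headerless, raw array of fixed-width rows, so that joining the bytes is the same as stacking the rows. That layout is an assumption here, as are the float32 element type and the 4-dimensional toy rows (all-MiniLM-L6-v2 itself produces 384-dimensional vectors). Under those assumptions, the following sketch reproduces the concatenation in Python and checks the merged row count; `concat_shards` and `count_rows` are hypothetical helpers, not part of the project.

```python
import array
import glob
import os
import shutil
import tempfile


def concat_shards(pattern: str, out_path: str) -> list[str]:
    """Rough Python equivalent of `cat pattern > out_path`; returns the shard order."""
    shards = sorted(glob.glob(pattern))  # shells also expand globs in sorted order
    with open(out_path, "wb") as out:
        for shard in shards:
            with open(shard, "rb") as src:
                shutil.copyfileobj(src, out)
    return shards


def count_rows(path: str, dim: int, itemsize: int) -> int:
    """Number of rows in a raw array file, failing if the size is not a whole multiple."""
    size = os.path.getsize(path)
    row_bytes = dim * itemsize
    if size % row_bytes:
        raise ValueError(f"{path}: {size} bytes is not a multiple of {row_bytes}")
    return size // row_bytes


# Demo: two tiny shards of 4-dimensional float32 rows (2 rows, then 3 rows).
itemsize = array.array("f").itemsize
with tempfile.TemporaryDirectory() as tmp:
    for i, flat in enumerate([[1.0] * 4 * 2, [2.0] * 4 * 3]):
        with open(os.path.join(tmp, f"embeddings_{i}.memmap"), "wb") as f:
            array.array("f", flat).tofile(f)
    merged = os.path.join(tmp, "embeddings.memmap")
    concat_shards(os.path.join(tmp, "embeddings_*.memmap"), merged)
    n_rows = count_rows(merged, dim=4, itemsize=itemsize)
    values = array.array("f")
    with open(merged, "rb") as f:
        values.fromfile(f, n_rows * 4)
print(n_rows)
```

A size check like `count_rows` is a cheap way to catch a missing or partially downloaded shard before starting a long indexing run.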
If you would rather build from the OpenAlex S3 bucket, you may also be interested in preserving the Git history. Again, make sure conda and gcc-12 are available:
git lfs install
git clone https://github.com/colonelwatch/abstracts-search
cd abstracts-search
env CC=gcc-12 conda env create -f environment.yml
conda activate abstracts-search
env GIT_LFS_SKIP_SMUDGE=1 git submodule update --init abstracts-embeddings
rm abstracts-embeddings/embeddings_*.memmap
rm abstracts-embeddings/openalex_ids.txt
python build.py
env GIT_LFS_SKIP_SMUDGE=1 git submodule update --init abstracts-index
python train.py