This repository contains code and resources for running a semantic and full-text search engine for books. It uses text embeddings and supports harvesting book metadata from various sources in international standard formats such as MARC21 and ONIX 3.
The application uses Multilingual-E5-small to generate text embeddings and PostgreSQL with pgvector as the vector store, which provides multilingual semantic search capabilities.
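To illustrate how embedding-based ranking works, here is a minimal Python sketch; it is not the application's code. The three-dimensional vectors are made-up stand-ins for real Multilingual-E5-small embeddings (which have 384 dimensions), and all function names are hypothetical. Two facts it relies on: E5 models expect inputs prefixed with "query: " or "passage: ", and pgvector's <=> operator orders rows by cosine distance, which cosine_distance mimics here.

```python
import math


def e5_input(text, is_query):
    """Prefix text the way E5 models expect before it is embedded."""
    return ("query: " if is_query else "passage: ") + text


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity), the measure behind pgvector's <=>."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank(query_vec, books):
    """Return book titles ordered from most to least similar to the query."""
    ordered = sorted(books.items(), key=lambda item: cosine_distance(query_vec, item[1]))
    return [title for title, _ in ordered]


# Toy vectors standing in for real embeddings (assumption for illustration only).
BOOKS = {
    "A Norwegian crime novel": [0.9, 0.1, 0.0],
    "A cookbook": [0.0, 0.2, 0.9],
    "A detective story": [0.8, 0.3, 0.1],
}
QUERY_VEC = [1.0, 0.2, 0.0]

if __name__ == "__main__":
    print(e5_input("krim fra Oslo", is_query=True))
    print(rank(QUERY_VEC, BOOKS))
```

In the real application the ranking would happen inside PostgreSQL rather than in application code; the sketch only shows the distance measure involved.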
Follow these steps to set up and run the application:
Run the following command in the project directory:
    docker compose up

This will start the PostgreSQL database with pgvector enabled.
Select and configure the appropriate gateway and service-uri for harvesting metadata by editing application.yaml.
Available options:
The first run may take some time as it will download the necessary embedding models. Once the models are in place, the application will be ready for use.
    ./gradlew bootRun

Visit http://localhost:8080 in the browser and watch the results as the metadata harvesting progresses. For
semantic search, enter a search query or leave it blank for a random choice (the first search hit will be the random
choice and the rest will be semantically similar books). For full-text search, enter a search query.
The gateway abstracts away the details of the external services and transforms metadata from the external services into
a common model. The application supports three gateways: OAI-PMH (MARC21), Bokbasen (ONIX) and Bibbi. Custom mappers can
be implemented as needed and activated by configuring the appropriate values in application.yaml.
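The gateway idea can be sketched in a few lines of Python. This is an illustration of the pattern only, not the repository's implementation: the Book class, the mapper functions, the simplified dict-shaped records, and the registry keys "oai-pmh" and "bokbasen" are all hypothetical. The only borrowed facts are that MARC21 field 001 holds the control number and 245 $a the title, and that ONIX uses RecordReference and TitleText elements.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Book:
    """Hypothetical common model that every gateway maps into."""
    external_id: str
    title: str


def map_marc21(record: dict) -> Book:
    # MARC21: field 001 is the control number, 245 subfield a is the title.
    return Book(external_id=record["001"], title=record["245"]["a"])


def map_onix(record: dict) -> Book:
    # ONIX: RecordReference identifies the record, TitleText carries the title.
    return Book(external_id=record["RecordReference"], title=record["TitleText"])


# Registry keyed by a configuration value (the keys are assumptions).
MAPPERS: Dict[str, Callable[[dict], Book]] = {
    "oai-pmh": map_marc21,
    "bokbasen": map_onix,
}


def to_common_model(gateway: str, record: dict) -> Book:
    """Pick the mapper for the configured gateway and apply it."""
    try:
        mapper = MAPPERS[gateway]
    except KeyError:
        raise ValueError(f"Unknown gateway: {gateway}") from None
    return mapper(record)


if __name__ == "__main__":
    print(to_common_model("oai-pmh", {"001": "123", "245": {"a": "Sult"}}))
    print(to_common_model("bokbasen", {"RecordReference": "123", "TitleText": "Sult"}))
```

Adding a custom mapper in this sketch means writing one more function and registering it under a new key, which mirrors the idea of activating a mapper through configuration.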
The OAI-PMH gateway harvests metadata using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). It supports retrieving bibliographic data in MARC21 format.
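As a rough illustration of the protocol, the Python sketch below builds ListRecords request URLs and reads the resumption token that OAI-PMH uses for paging. The base URL is a placeholder, the metadataPrefix value "marc21" is an assumption (each repository advertises its own prefixes via ListMetadataFormats), and the function names are hypothetical. No network request is made; the sample response is hand-written.

```python
import xml.etree.ElementTree as ET
from typing import Optional
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url: str,
                     metadata_prefix: str = "marc21",
                     resumption_token: Optional[str] = None) -> str:
    """Build a ListRecords request URL for the first or a follow-up page."""
    if resumption_token:
        # resumptionToken is an exclusive argument: only verb may accompany it.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


def next_token(response_xml: str) -> Optional[str]:
    """Return the resumption token, or None when the list is complete."""
    root = ET.fromstring(response_xml)
    token = root.find(".//oai:resumptionToken", OAI_NS)
    return token.text if token is not None and token.text else None


# Hand-written, heavily trimmed response (assumption for illustration).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    base = "https://example.org/oai"  # placeholder, not a real service-uri
    print(list_records_url(base))
    print(list_records_url(base, resumption_token=next_token(SAMPLE_RESPONSE)))
```

A harvester would repeat the second request until the response carries an empty or missing resumptionToken.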
Additional documentation for OAI-PMH from Biblioteksentralen (https://www.bibsent.no/):
The Bokbasen gateway uses the ONIX format for metadata, commonly employed in the publishing industry. This is particularly useful for harvesting data from large-scale book vendors.
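To give a feel for the ONIX 3.0 structure, the following Python sketch extracts an ISBN-13 and a title from a hand-written, heavily trimmed ONIX message. The sample data and the parse_products function are made up for illustration and say nothing about what Bokbasen actually delivers; the element names and the ProductIDType code 15 (ISBN-13) come from the ONIX 3.0 reference tag set.

```python
import xml.etree.ElementTree as ET

ONIX_NS = {"onix": "http://ns.editeur.org/onix/3.0/reference"}

# Hand-written sample with placeholder values (assumption for illustration).
SAMPLE_ONIX = """<ONIXMessage release="3.0" xmlns="http://ns.editeur.org/onix/3.0/reference">
  <Product>
    <RecordReference>example-1</RecordReference>
    <ProductIdentifier>
      <ProductIDType>15</ProductIDType>
      <IDValue>9780000000002</IDValue>
    </ProductIdentifier>
    <DescriptiveDetail>
      <TitleDetail>
        <TitleType>01</TitleType>
        <TitleElement>
          <TitleElementLevel>01</TitleElementLevel>
          <TitleText>An Example Title</TitleText>
        </TitleElement>
      </TitleDetail>
    </DescriptiveDetail>
  </Product>
</ONIXMessage>"""


def parse_products(onix_xml):
    """Return a list of {'isbn', 'title'} dicts, one per Product element."""
    root = ET.fromstring(onix_xml)
    books = []
    for product in root.findall("onix:Product", ONIX_NS):
        isbn = None
        for identifier in product.findall("onix:ProductIdentifier", ONIX_NS):
            # ProductIDType 15 marks an ISBN-13 in the ONIX code lists.
            if identifier.findtext("onix:ProductIDType", namespaces=ONIX_NS) == "15":
                isbn = identifier.findtext("onix:IDValue", namespaces=ONIX_NS)
        title = product.findtext(".//onix:TitleElement/onix:TitleText", namespaces=ONIX_NS)
        books.append({"isbn": isbn, "title": title})
    return books


if __name__ == "__main__":
    print(parse_products(SAMPLE_ONIX))
```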
Additional documentation for ONIX from Bokbasen (https://www.bokbasen.no/):
The Bibbi gateway is used for integrating with the Bibbi metadata service. The gateway uses a format based on Schema.org.
Additional documentation for Bibbi from Biblioteksentralen (https://www.bibsent.no/):
Instructions for extracting a dataset for fine-tuning a BERT-based model for multi-label classification of book reviews: https://github.com/torleifg/book-reviews-genre-classification
    psql -h localhost -p 5433 -U username -d postgres

Extract an example dataset using genre and form as labels:
    -- Note: a server-side copy writes the file on the database server and requires an
    -- absolute path; psql's \copy meta-command (written on one line) writes on the client instead.
    copy (
        select
            concat(metadata->>'title', '. ', metadata->>'description') as text,
            metadata->>'genreAndForm' as labels
        from
            book
        where
            metadata->>'description' is not null
            and metadata->>'description' <> ''
            and length(metadata->>'description') > 200
            and metadata->>'genreAndForm' is not null
            and metadata->>'genreAndForm' <> '[]'
    ) to '~/dataset.csv' with csv header delimiter ';';
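Once exported, the file can be read back for fine-tuning. The Python sketch below is one possible way to do that, assuming the labels column holds a JSON array of strings (the query above only shows that it is compared against '[]', so the element type is an assumption). The load_dataset function and the sample row are hypothetical.

```python
import csv
import io
import json


def load_dataset(csv_file):
    """Yield (text, labels) pairs from the ';'-delimited export with a header row."""
    reader = csv.DictReader(csv_file, delimiter=";")
    for row in reader:
        # The labels column is parsed as JSON (assumed to be an array of strings).
        yield row["text"], json.loads(row["labels"])


# Made-up sample in the same shape as the export; fields containing the
# delimiter or quotes are quoted, with embedded quotes doubled.
SAMPLE = (
    "text;labels\n"
    '"A title. A long description; with a semicolon.";"[""Crime fiction"", ""Novels""]"\n'
)

if __name__ == "__main__":
    for text, labels in load_dataset(io.StringIO(SAMPLE)):
        print(labels, text)
```

For the real file, the same function can be given a handle opened with open("dataset.csv", newline="", encoding="utf-8").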