Explore the use of DSPy for extracting features from PDFs. This repository provides a simple example of how to use this framework to predict the sub-category of a Computer Science paper from arXiv.
The dataset is a selection of 150 arXiv papers (metadata + PDF) from the Computer Science category.
To build the database:
With arxiv.json in the dspy-arxiv directory, run data.ipynb from top to bottom. At the end, you should have two directories:
If you want to add RAG to the pipeline, it's handy to have the data in a vector database for fast retrieval. Check out database.py for an example script to set up chromadb and populate it with arXiv metadata.
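As a rough illustration of that setup, here is a minimal sketch of loading paper metadata into a persistent chromadb collection. It is not the contents of database.py. The field names (id, title, abstract, categories), the storage path chroma_db, the collection name arxiv_cs, and the helper names are all assumptions made for this example.

```python
def to_records(papers):
    """Split paper dicts into the parallel lists that collection.add() takes.

    Assumes each paper has "id", "title", "abstract" and "categories" keys
    (hypothetical field names for this sketch).
    """
    ids, documents, metadatas = [], [], []
    for paper in papers:
        ids.append(paper["id"])
        # The abstract is the text that gets embedded and searched.
        documents.append(paper["abstract"])
        # Chroma metadata values must be scalars, so categories stays a string.
        metadatas.append({"title": paper["title"], "categories": paper["categories"]})
    return ids, documents, metadatas


def populate(collection, papers):
    """Add all papers to the collection and return how many were added."""
    ids, documents, metadatas = to_records(papers)
    collection.add(ids=ids, documents=documents, metadatas=metadatas)
    return len(ids)


def open_collection(path="chroma_db", name="arxiv_cs"):
    """Open (or create) an on-disk collection; path and name are placeholders."""
    import chromadb  # imported lazily so the helpers above work without it

    client = chromadb.PersistentClient(path=path)
    return client.get_or_create_collection(name=name)
```

With chromadb installed, `populate(open_collection(), papers)` fills the collection, and `collection.query(query_texts=["graph neural networks"], n_results=5)` retrieves the closest abstracts for a RAG step.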
The notebook features.ipynb can be seen as a simple tutorial on how to use DSPy to programmatically prompt an LLM for feature extraction (in this case, predicting the sub-category of a Computer Science paper from arXiv).
You can also take a look at the slides generated from this notebook.
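To give a flavor of the approach without opening the notebook, the sketch below shows one way a sub-category predictor could be declared with DSPy. It is not taken from features.ipynb. It assumes a recent DSPy release with class-based signatures and `dspy.LM`; the model name, the signature fields, and the helper names are placeholders chosen for this example.

```python
import re


def normalize_category(raw):
    """Pull the first arXiv CS sub-category code (e.g. "cs.CL") out of free text.

    Returns None when the model output contains no such code.
    """
    match = re.search(r"\bcs\.[A-Z]{2}\b", raw)
    return match.group(0) if match else None


def build_classifier():
    """Declare the task as a DSPy signature and wrap it in a Predict module."""
    import dspy  # imported lazily so normalize_category works without DSPy

    class ClassifyPaper(dspy.Signature):
        """Predict the arXiv Computer Science sub-category of a paper."""

        title: str = dspy.InputField()
        abstract: str = dspy.InputField()
        category: str = dspy.OutputField(desc="one arXiv code such as cs.CL")

    return dspy.Predict(ClassifyPaper)


def demo(title, abstract, model="openai/gpt-4o-mini"):
    """Run one prediction; the model name is an assumption and needs an API key."""
    import dspy

    dspy.configure(lm=dspy.LM(model))
    prediction = build_classifier()(title=title, abstract=abstract)
    return normalize_category(prediction.category)
```

Calling `demo("Some title", "Some abstract")` with credentials configured returns a code such as `cs.CL`, or None if the answer cannot be parsed. Keeping the parsing step separate from the LLM call makes it easy to check predictions against the category labels in the metadata.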