How to construct knowledge graphs from unstructured data sources.
Caveat: this repo provides the source code and notebooks that accompany an instructional tutorial; it is not intended as a packaged library or product.
```
python3 -m venv venv
source venv/bin/activate
python3 -m pip install -U pip wheel
python3 -m pip install -r requirements.txt
```

The full demo app is in `demo.py`:
```
python3 demo.py
```

This demo scrapes text sources from articles about the linkage between
dementia and regularly eating processed red meat, then produces a graph
using NetworkX, a vector database of text chunk embeddings using
LanceDB, and an entity embedding model using gensim.Word2Vec,
where the results are:
- `data/kg.json` -- serialization of the NetworkX graph
- `data/lancedb` -- vector database tables
- `data/entity.w2v` -- entity embedding model
- `kg.html` -- interactive graph visualization in PyVis
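Once `demo.py` completes, these artifacts can be inspected directly. Here is a minimal sketch, assuming the graph was serialized in NetworkX's node-link JSON format and that "dementia" made it into the entity embedding vocabulary:

```python
import json

import networkx as nx
from gensim.models import Word2Vec

# Load the serialized KG; this assumes node-link JSON, a common
# NetworkX serialization format (adjust if demo.py uses another).
with open("data/kg.json", "r", encoding="utf-8") as fp:
    kg = nx.node_link_graph(json.load(fp))

print(f"{kg.number_of_nodes()} nodes, {kg.number_of_edges()} edges")

# Load the entity embedding model and look up nearest neighbors;
# "dementia" is a plausible entity given the demo's source articles.
w2v = Word2Vec.load("data/entity.w2v")
print(w2v.wv.most_similar("dementia", topn=5))
```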
A collection of Jupyter notebooks illustrates important steps within this workflow:

```
./venv/bin/jupyter-lab
```

- `construct.ipynb` -- detailed KG construction using a lexical graph
- `chunk.ipynb` -- simple example of how to scrape and chunk text
- `vector.ipynb` -- query the LanceDB table for text chunk embeddings (after running `demo.py`)
- `embed.ipynb` -- query the entity embedding model (after running `demo.py`)

Objective: Construct a knowledge graph (KG) using open source libraries, where deep learning models provide narrowly-focused point solutions to generate components for a graph: nodes, edges, properties.
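For example, GLiNER serves as one such point solution, generating candidate nodes through zero-shot named entity recognition. A brief sketch of its documented API; the checkpoint name and label set below are illustrative choices, not necessarily what this tutorial uses:

```python
from gliner import GLiNER

# Load a publicly available pre-trained GLiNER checkpoint;
# substitute whichever model the tutorial configures.
model = GLiNER.from_pretrained("urchade/gliner_small-v2.1")

text = "Regularly eating processed red meat was linked to a higher risk of dementia."
labels = ["food", "disease", "person", "organization"]

# Zero-shot NER: extract entities matching the given label set.
for ent in model.predict_entities(text, labels):
    print(ent["label"], "->", ent["text"])
```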
These steps define a generalized process, where this tutorial picks up at the lexical graph:
- Semantic overlay: organizing principles for the domain, e.g., a thesaurus, taxonomy, or ontology
- Data graph: a graph representation built from structured data sources
- Lexical graph: a graph constructed from unstructured data sources, e.g., parsed and chunked text
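To make the lexical graph level concrete, here is a toy sketch that links entities co-occurring within the same text chunk, using spaCy and NetworkX; the actual construction in `construct.ipynb` is considerably more detailed:

```python
import itertools

import networkx as nx
import spacy

# assumes the small English pipeline is installed:
# python3 -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def add_chunk(graph: nx.Graph, chunk: str) -> None:
    """Add one text chunk's entities to the lexical graph, linking co-occurrences."""
    doc = nlp(chunk)
    names = {ent.text.lower() for ent in doc.ents}
    for name in names:
        graph.add_node(name, kind="Entity")
    # naive heuristic: co-occurrence within a chunk suggests a candidate edge
    for src, dst in itertools.combinations(sorted(names), 2):
        weight = graph.get_edge_data(src, dst, default={"weight": 0})["weight"]
        graph.add_edge(src, dst, weight=weight + 1)

kg = nx.Graph()
add_chunk(kg, "A UK Biobank study linked processed red meat consumption to dementia risk.")
print(kg.edges(data=True))
```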
This approach stands in contrast to using a large language model (LLM) as a one-size-fits-all "black box" to generate the entire graph automagically. Black-box approaches don't work well for KG practices in regulated environments, where audits, explanations, evidence, data provenance, etc., are required.
Better yet, review the intermediate results after each inference step to
collect human feedback for curating the KG components, e.g., using
Argilla.
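A hedged sketch of what that feedback loop might look like with Argilla's 2.x SDK, pushing candidate triples for human review; the server URL, dataset name, field, and question schema here are all illustrative assumptions:

```python
import argilla as rg

# Connect to a running Argilla server (URL and key are placeholders).
client = rg.Argilla(api_url="http://localhost:6900", api_key="argilla.apikey")

# Define a simple review task: accept or reject each candidate triple.
settings = rg.Settings(
    fields=[rg.TextField(name="triple")],
    questions=[rg.LabelQuestion(name="keep", labels=["accept", "reject"])],
)

dataset = rg.Dataset(name="kg-triple-review", settings=settings)
dataset.create()

# Each record presents one candidate KG component for curation.
dataset.records.log([
    {"triple": "processed red meat -- associated_with -- dementia"},
])
```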
KGs used in mission-critical apps such as investigations generally rely on incremental updates, not a one-step construction process. By producing a KG through the steps above, updates can be handled much more effectively. Downstream apps, such as Graph RAG for grounding LLM results, also benefit from the improved data quality.
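For instance, a Graph RAG pipeline could retrieve supporting text chunks from the vector database built by `demo.py`. A sketch of the LanceDB side of that retrieval; the table name and vector dimension are assumptions, and the query vector must come from the same embedding model used to populate the table:

```python
import lancedb

# Connect to the vector DB produced by demo.py.
db = lancedb.connect("data/lancedb")
print(db.table_names())  # discover the actual table name

tbl = db.open_table("chunk")  # hypothetical name; use one listed above

# The query vector must be produced by the same embedding model that
# demo.py used when building the table; a zero vector is a placeholder.
query_vec = [0.0] * 384  # assumed dimension; check your embedding model

# Approximate nearest-neighbor search over the chunk embeddings.
results = tbl.search(query_vec).limit(3).to_pandas()
print(results)
```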
This tutorial uses the following open source libraries:

- spaCy: https://spacy.io/
- GLiNER: https://github.com/urchade/GLiNER
- GLiREL: https://github.com/jackboyla/GLiREL
- OpenNRE: https://github.com/thunlp/OpenNRE
- NetworkX: https://networkx.org/
- PyVis: https://github.com/WestHealth/pyvis
- LanceDB: https://github.com/lancedb/lancedb
- gensim: https://github.com/piskvorky/gensim
- pandas: https://pandas.pydata.org/
- Pydantic: https://github.com/pydantic/pydantic
- Pyinstrument: https://github.com/joerick/pyinstrument

Note: you must run the `nre.sh` script to load the OpenNRE pre-trained models before running the `opennre.ipynb` notebook.
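As a complement to NER, OpenNRE provides a point solution for candidate edges. A minimal sketch following OpenNRE's documented inference API; run `nre.sh` first so the pre-trained model is available, and note that the sentence and character spans below are illustrative:

```python
import opennre

# Load a pre-trained relation extraction model (fetched by nre.sh).
model = opennre.get_model("wiki80_cnn_softmax")

# `h` and `t` give character spans for the head and tail entities.
relation, score = model.infer({
    "text": "Processed red meat consumption was associated with dementia in the study.",
    "h": {"pos": (0, 18)},   # "Processed red meat"
    "t": {"pos": (51, 59)},  # "dementia"
})
print(relation, score)
```

The predicted relation label and its confidence score can then be reviewed before adding the corresponding edge to the KG.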