How to construct knowledge graphs from unstructured data sources.
Caveat: this repo provides the source code and notebooks that accompany an instructional tutorial; it is not intended as a packaged library or product.
```
python3 -m venv venv
source venv/bin/activate
python3 -m pip install -U pip wheel
python3 -m pip install -r requirements.txt
```

The full demo app is in `demo.py`:
```
python3 demo.py
```

This demo scrapes text sources from articles about the linkage between
dementia and regularly eating processed red meat, then produces a graph
using NetworkX, a vector database of text chunk embeddings using
LanceDB, and an entity embedding model using gensim.Word2Vec,
where the results are:
- `data/kg.json` -- serialization of the NetworkX graph
- `data/lancedb` -- vector database tables
- `data/entity.w2v` -- entity embedding model
- `kg.html` -- interactive graph visualization in PyVis
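Once `demo.py` completes, these artifacts can be inspected directly. Here is a minimal sketch, assuming the graph was serialized in NetworkX's node-link JSON format and that "dementia" made it into the entity embedding vocabulary:

```python
import json

import networkx as nx
from gensim.models import Word2Vec

# Load the serialized KG; this assumes node-link JSON, a common
# NetworkX serialization format (adjust if demo.py uses another).
with open("data/kg.json", "r", encoding="utf-8") as fp:
    kg = nx.node_link_graph(json.load(fp))

print(f"{kg.number_of_nodes()} nodes, {kg.number_of_edges()} edges")

# Load the entity embedding model and look up nearest neighbors;
# "dementia" is a plausible entity given the demo's source articles.
w2v = Word2Vec.load("data/entity.w2v")
print(w2v.wv.most_similar("dementia", topn=5))
```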
A collection of Jupyter notebooks illustrates important steps within this workflow:

```
./venv/bin/jupyter-lab
```

- `construct.ipynb` -- detailed KG construction using a lexical graph
- `chunk.ipynb` -- simple example of how to scrape and chunk text
- `vector.ipynb` -- query the LanceDB table for text chunk embeddings (after running `demo.py`)
- `embed.ipynb` -- query the entity embedding model (after running `demo.py`)

Objective: Construct a knowledge graph (KG) using open source libraries, where deep learning models provide narrowly-focused point solutions to generate components for a graph: nodes, edges, properties.
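For example, GLiNER serves as one such point solution, generating candidate nodes through zero-shot named entity recognition. A brief sketch of its documented API; the checkpoint name and label set below are illustrative choices, not necessarily what this tutorial uses:

```python
from gliner import GLiNER

# Load a publicly available pre-trained GLiNER checkpoint;
# substitute whichever model the tutorial configures.
model = GLiNER.from_pretrained("urchade/gliner_small-v2.1")

text = "Regularly eating processed red meat was linked to a higher risk of dementia."
labels = ["food", "disease", "person", "organization"]

# Zero-shot NER: extract entities matching the given label set.
for ent in model.predict_entities(text, labels):
    print(ent["label"], "->", ent["text"])
```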
These steps define a generalized process, where this tutorial picks up at the lexical graph:
- Semantic overlay: organizing principles for the domain, e.g., a thesaurus, taxonomy, or ontology
- Data graph: a graph representation built from structured data sources
- Lexical graph: a graph constructed from unstructured data sources, e.g., parsed and chunked text
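To make the lexical graph level concrete, here is a toy sketch that links entities co-occurring within the same text chunk, using spaCy and NetworkX; the actual construction in `construct.ipynb` is considerably more detailed:

```python
import itertools

import networkx as nx
import spacy

# assumes the small English pipeline is installed:
# python3 -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def add_chunk(graph: nx.Graph, chunk: str) -> None:
    """Add one text chunk's entities to the lexical graph, linking co-occurrences."""
    doc = nlp(chunk)
    names = {ent.text.lower() for ent in doc.ents}
    for name in names:
        graph.add_node(name, kind="Entity")
    # naive heuristic: co-occurrence within a chunk suggests a candidate edge
    for src, dst in itertools.combinations(sorted(names), 2):
        weight = graph.get_edge_data(src, dst, default={"weight": 0})["weight"]
        graph.add_edge(src, dst, weight=weight + 1)

kg = nx.Graph()
add_chunk(kg, "A UK Biobank study linked processed red meat consumption to dementia risk.")
print(kg.edges(data=True))
```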
This approach stands in contrast to using a large language model (LLM) as a one-size-fits-all "black box" to generate the entire graph automagically. Black-box approaches don't work well for KG practices in regulated environments, where audits, explanations, evidence, data provenance, etc., are required.
Better yet, review the intermediate results after each inference step to
collect human feedback for curating the KG components, e.g., using
Argilla.
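A hedged sketch of what that feedback loop might look like with Argilla's 2.x SDK, pushing candidate triples for human review; the server URL, dataset name, field, and question schema here are all illustrative assumptions:

```python
import argilla as rg

# Connect to a running Argilla server (URL and key are placeholders).
client = rg.Argilla(api_url="http://localhost:6900", api_key="argilla.apikey")

# Define a simple review task: accept or reject each candidate triple.
settings = rg.Settings(
    fields=[rg.TextField(name="triple")],
    questions=[rg.LabelQuestion(name="keep", labels=["accept", "reject"])],
)

dataset = rg.Dataset(name="kg-triple-review", settings=settings)
dataset.create()

# Each record presents one candidate KG component for curation.
dataset.records.log([
    {"triple": "processed red meat -- associated_with -- dementia"},
])
```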
KGs used in mission-critical apps such as investigations generally rely on incremental updates, not a one-step construction process. By producing a KG through the steps above, updates can be handled much more effectively. Downstream apps, such as Graph RAG for grounding LLM results, also benefit from the improved data quality.
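For instance, a Graph RAG pipeline could retrieve supporting text chunks from the vector database built by `demo.py`. A sketch of the LanceDB side of that retrieval; the table name and vector dimension are assumptions, and the query vector must come from the same embedding model used to populate the table:

```python
import lancedb

# Connect to the vector DB produced by demo.py.
db = lancedb.connect("data/lancedb")
print(db.table_names())  # discover the actual table name

tbl = db.open_table("chunk")  # hypothetical name; use one listed above

# The query vector must be produced by the same embedding model that
# demo.py used when building the table; a zero vector is a placeholder.
query_vec = [0.0] * 384  # assumed dimension; check your embedding model

# Approximate nearest-neighbor search over the chunk embeddings.
results = tbl.search(query_vec).limit(3).to_pandas()
print(results)
```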
This tutorial uses the following open source libraries:

- spaCy: https://spacy.io/
- GLiNER: https://github.com/urchade/GLiNER
- GLiREL: https://github.com/jackboyla/GLiREL
- OpenNRE: https://github.com/thunlp/OpenNRE
- NetworkX: https://networkx.org/
- PyVis: https://github.com/WestHealth/pyvis
- LanceDB: https://github.com/lancedb/lancedb
- gensim: https://github.com/piskvorky/gensim
- pandas: https://pandas.pydata.org/
- Pydantic: https://github.com/pydantic/pydantic
- Pyinstrument: https://github.com/joerick/pyinstrument

Note: you must run the `nre.sh` script to load the OpenNRE pre-trained models before running the `opennre.ipynb` notebook.
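As a complement to NER, OpenNRE provides a point solution for candidate edges. A minimal sketch following OpenNRE's documented inference API; run `nre.sh` first so the pre-trained model is available, and note that the sentence and character spans below are illustrative:

```python
import opennre

# Load a pre-trained relation extraction model (fetched by nre.sh).
model = opennre.get_model("wiki80_cnn_softmax")

# `h` and `t` give character spans for the head and tail entities.
relation, score = model.infer({
    "text": "Processed red meat consumption was associated with dementia in the study.",
    "h": {"pos": (0, 18)},   # "Processed red meat"
    "t": {"pos": (51, 59)},  # "dementia"
})
print(relation, score)
```

The predicted relation label and its confidence score can then be reviewed before adding the corresponding edge to the KG.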