This is the research project component developed under the guidance of Dr. Zachary Ives. The initial goal is to develop a graph layer on top of the Pennsieve database and enable machine learning through effective extraction of medical data from complex and varied file formats. This component enables natural language interaction with the database.
Note: All methods operate on an underlying Neo4j graph that is built by another repository, which will be linked once it is public. This project is ready to be used out of the box; however, without the underlying graph populated, you will not get any results.
__init__.py: Initializes the app package.
config.py: Handles configuration and environment variables.
database.py: Manages Neo4j database connection.
setup_neo4j_graph(): Returns the Langchain Neo4jGraph database wrapper configured with the URL, username, and password provided in the .env file. Important methods used: query() and refresh_schema(). See the Langchain Neo4jGraph Documentation. A short setup sketch follows the dataguide descriptions below.
main.py: Entry point of the application. Passes the user query and retrieves the result by calling run_query(user_query: str) from qa_chain.py. It abstracts away all the complexities and provides a simple interface to interact with the system.
dataguide.py: Extracts dataguide paths from the database and formats them into Cypher paths.
extract_dataguide_paths(graph: Neo4jGraph): Extracts dataguide paths from root to leaf using a Cypher query.
format_paths_for_llm(results: List[Dict[str, Any]]): Formats the results from extract_dataguide_paths into valid Cypher paths for MATCH queries.
test.py: Tests the connection to the Neo4j graph, the extraction of dataguide paths, and their formatting. Outputs the time taken for each part.
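For orientation, here is a minimal sketch of how the graph connection and dataguide extraction fit together. The environment variable names, node labels, and the Cypher query are illustrative assumptions, not the exact code in app/database.py and app/dataguide.py.

```python
# Sketch only: variable names, labels, and the Cypher query are assumptions.
import os
from dotenv import load_dotenv
from langchain_community.graphs import Neo4jGraph

def setup_neo4j_graph() -> Neo4jGraph:
    load_dotenv()  # read .env from the project root
    graph = Neo4jGraph(
        url=os.getenv("NEO4J_URL"),            # assumed key name
        username=os.getenv("NEO4J_USERNAME"),  # assumed key name
        password=os.getenv("NEO4J_PASSWORD"),  # assumed key name
    )
    graph.refresh_schema()  # keep the cached schema in sync with the database
    return graph

def extract_dataguide_paths(graph: Neo4jGraph) -> list:
    # Illustrative root-to-leaf traversal; the real query depends on how
    # the dataguide is modeled in the graph.
    cypher = """
    MATCH p = (root:Dataguide)-[:HAS_CHILD*]->(leaf:Dataguide)
    WHERE NOT (leaf)-[:HAS_CHILD]->()
    RETURN [n IN nodes(p) | n.name] AS path
    """
    return graph.query(cypher)
```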
prompt_generator.py: This module is responsible for creating and combining Langchain system and human prompts into langchain.prompts.ChatPromptTemplate. It is a crucial part of the project as it defines how the prompts are structured and used in the Langchain framework.
get_cypher_prompt_template(): Returns the ChatPromptTemplate instance created in this file. It combines system and human prompts into a single template that GraphCypherQAChain in qa_chain.py uses to generate Cypher queries. The key pieces are input_variables, which specify the variables to be included in the prompt; template, which defines the prompt's text; the system message's prompt, which defines the system message's text; the human message's prompt, which defines the human message's text; and from_messages(), which takes a list of message templates and combines them into a chat prompt.
qa_chain.py: Defines the run_query(user_query: str) function, which integrates all project components and runs a GraphCypherQAChain on the user query. Replace ChatOpenAI with AzureChatOpenAI if needed. A minimal prompt and chain sketch follows below.
__init__.py: Initializes the paths_vectorDB package.
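To make this concrete, here is a minimal sketch of the prompt construction and chain wiring. The prompt wording, model name, and chain options are placeholders and a simplification of what app/prompt_generator.py and app/qa_chain.py actually do.

```python
# Sketch only: prompt text and model name are placeholders, and the chain
# wiring is a simplified approximation of app/qa_chain.py.
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI  # swap in AzureChatOpenAI if needed
from langchain_community.chains.graph_qa.cypher import GraphCypherQAChain

def get_cypher_prompt_template() -> ChatPromptTemplate:
    system_prompt = (
        "You are an expert at writing Cypher queries for a Neo4j graph.\n"
        "Graph schema:\n{schema}"
    )
    human_prompt = "Question: {question}\nReturn only a Cypher query."
    # from_messages() combines the system and human templates into one chat prompt.
    return ChatPromptTemplate.from_messages(
        [("system", system_prompt), ("human", human_prompt)]
    )

def build_chain(graph) -> GraphCypherQAChain:
    llm = ChatOpenAI(model="gpt-4o", temperature=0)  # illustrative model name
    return GraphCypherQAChain.from_llm(
        llm,
        graph=graph,
        cypher_prompt=get_cypher_prompt_template(),
        verbose=True,
        allow_dangerous_requests=True,  # required by recent langchain-community releases
    )
```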
generate_descriptions.py: Defines the system prompt to generate descriptions from LLMs for Cypher paths.
generate_path_descriptions(all_paths: List[str]): Generates descriptions for the given paths using the LLM. Outputs a list of descriptions.
generate_embedding(path_description: str): Generates an embedding for the given path description using the OpenAI embeddings API.
random_path_generator.py: Provides methods to generate random paths from the database and format them into Cypher paths.
vectorDB_setup.py: Provides methods to start the Milvus container, connect to it, define the collection schema, create the collection, insert data, and run vector similarity searches. A schema sketch follows below.
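Here is a minimal sketch of what defining and creating the collection might look like with pymilvus. The collection name, field names, vector dimension, and index parameters are assumptions; the real schema is defined in paths_vectorDB/vectorDB_setup.py.

```python
# Sketch only: field names, dimension, and index parameters are assumptions,
# not the exact schema in paths_vectorDB/vectorDB_setup.py.
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

def create_paths_collection() -> Collection:
    connections.connect(host="localhost", port="19530")  # default Milvus port
    fields = [
        FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
        FieldSchema(name="path", dtype=DataType.VARCHAR, max_length=2048),
        FieldSchema(name="description", dtype=DataType.VARCHAR, max_length=4096),
        FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1536),
    ]
    schema = CollectionSchema(fields, description="Cypher paths with LLM descriptions")
    collection = Collection(name="cypher_paths", schema=schema)  # assumed name
    collection.create_index(
        field_name="embedding",
        index_params={"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 128}},
    )
    return collection
```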
main.py: Wrapper functions that combine all functionality from this directory. For example, get_similar_paths_from_milvus is used in app/qa_chain.py to run a vector similarity search with the user query; see the search sketch below.
test.py: Methods to test various functionalities. Currently commented out.
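And a companion sketch of the retrieval side: embedding the user query with the OpenAI embeddings API and searching Milvus. The embedding model, collection name, and field names are the same assumptions as above, not the exact implementation of get_similar_paths_from_milvus.

```python
# Sketch only: model, collection, and field names are assumptions.
from openai import OpenAI
from pymilvus import connections, Collection

def generate_embedding(path_description: str) -> list:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.embeddings.create(
        model="text-embedding-3-small",  # assumed model
        input=path_description,
    )
    return response.data[0].embedding

def get_similar_paths_from_milvus(user_query: str, top_k: int = 5) -> list:
    connections.connect(host="localhost", port="19530")  # default Milvus port
    collection = Collection("cypher_paths")               # assumed collection name
    collection.load()
    results = collection.search(
        data=[generate_embedding(user_query)],
        anns_field="embedding",                           # assumed vector field
        param={"metric_type": "L2", "params": {"nprobe": 10}},
        limit=top_k,
        output_fields=["path"],                           # assumed scalar field
    )
    return [hit.entity.get("path") for hit in results[0]]
```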
write_read_data.py: Simple write and read methods to store Cypher paths and descriptions generated from API calls.
fill_collection_with_random_paths in paths_vectorDB/main.py writes the paths and descriptions generated from API calls into data.txt.
env.sample: Make a copy of this in your project root directory, rename it to .env, and fill in the values.
.gitignore: Specifies files and directories to be ignored by Git.
README.md: Project documentation.
docker-compose.yml: Docker Compose file for the Milvus DB. If there is a new version, replace this file. Ensure it is named docker-compose.yml and placed in the root directory.
requirements.txt: Python dependencies and their compatible versions used for development. Note: The requirements.txt file was created through pipenv.
Getting started with this project is simple. You can follow the steps below:
Clone the repository:
git clone https://github.com/hussainzs/chat-with-pennsieve.git
cd chat-with-pennsieve
Note: Make sure you are in the project root directory before proceeding with the next steps.
Install dependencies:
pip install -r requirements.txt
Set up environment variables:
Make a copy of env.sample, rename the file to .env, and fill in the required values.
Set up Neo4j Desktop:
Fill in the .env file with the Neo4j connection details (URL, username, password). Default values have been filled in.
Run app/main.py:
Navigate to the app directory and run main.py. Make sure your desired user query is passed as an argument to the run_query(user_query) function (an invocation sketch appears at the end of this document).
Keep docker-compose.yml in the root directory. When you run app/main.py, the Milvus containers will start automatically by running terminal commands. Check out paths_vectorDB/vectorDB_setup.py for more information.
Running Milvus creates a folder named volumes. The folder contains 3 subfolders: milvus, minio, and etcd.
Note: For further clarification of the expected output when you run app/main.py, 2 PDFs of output generated from the system are attached in the folder called Expected Outputs.
first_output.pdf shows what's expected when the user runs app/main.py for the first time in a new session with default values. (When you run it for the first time ever, it may take a while to download everything.)
regular_output.pdf shows what's expected when the user runs app/main.py in a regular session with default values.
Refining the system prompts in app and paths_vectorDB can significantly improve LLM performance. I found that high quality examples in the system prompt increase the quality of description generation for paths, and the system prompt also significantly affects the final answer from the LLM.
You can also tune the search_similar_vectors method inside paths_vectorDB/vectorDB_setup.py for better results, then test your changes by running app/main.py.
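Finally, a minimal sketch of invoking the system the way app/main.py does; the question string is just a placeholder.

```python
# Sketch of the documented entry point: run_query(user_query) from app/qa_chain.py.
from app.qa_chain import run_query

if __name__ == "__main__":
    answer = run_query("Which datasets contain EEG recordings?")  # placeholder question
    print(answer)
```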