This project provides a set of scripts and tools for converting XML files into JSON format. It is designed to work with different XML data sources and is fully customizable, supporting multiple conversion modules. The project is divided into separate Python modules for handling different kinds of data, including documents, persons, and archives.
The solution includes:
convert/
archiveLinkConvert.py # Handles conversion of archive link XMLs
artworkConvert.py # Handles artwork XML data
commonConvert.py # Contains common conversion utilities
personConvert.py # Handles conversion of person-related XMLs
docs/
pictures/ # Picture documentation related to the project
Analyza_SP.md # Analysis related documentation
documentaria_rudolphina.md # Project-specific documentation
model/
ArchiveLink.py # Data model for archive links
Document.py # Data model for documents
Person.py # Data model for person records
scripts/
main_convert.py # Main script to execute conversion
.gitignore # Git ignore configuration
README.md # This documentation file
To use this tool, you'll need Python and pip installed.
Then, run the following command:
pip install -r requirements.txt
This will install the libraries needed to run the script. Then simply run the main_convert.py script with the appropriate options. Here are the main commands to run the program from the XMLtoJSON directory:
Display help information:
python3 scripts/main_convert.py --help
or
python3 scripts/main_convert.py --h
Convert all types of XML files:
python3 scripts/main_convert.py --type all --input_path "path_for_input_data" --output_path "path_for_output_data"
Convert name-related XML files:
python3 scripts/main_convert.py --type names --input_path "path_for_input_data" --output_path "path_for_output_data"
Convert register-related XML files:
python3 scripts/main_convert.py --type registers --input_path "path_for_input_data" --output_path "path_for_output_data"
Convert archive-related XML files:
python3 scripts/main_convert.py --type archive --input_path "path_for_input_data" --output_path "path_for_output_data"
The input data folder should be structured as follows:
input_data/
Archiv/ # Archive-related XML files
Regesten/ # Register-related XML files
Namen/ # Name-related XML files
Indicies/ # Index-related XML files
git clone https://github.com/VandlJ.git
cd XMLtoJSON
To begin the conversion, use the main conversion script. For example, to convert all XML files:
python3 scripts/main_convert.py --type all --input_path "../test_data" --output_path "../test_data/output"
This command processes the XML files in the specified --input_path directory and writes the results to the --output_path directory.
You can also list all available options, with detailed information, by running:
python3 scripts/main_convert.py --help
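To make the command-line options more concrete, here is a minimal sketch of how a script could accept --type, --input_path, and --output_path and pick the input subfolders to process. It is not the actual code of main_convert.py: the names TYPE_TO_FOLDERS, build_parser, and folders_to_convert are hypothetical, and the mapping from type to folder is an assumption inferred from the folder comments (names to Namen, registers to Regesten, archive to Archiv). How the Indicies folder is handled is not stated, so it is left out.

```python
import argparse
from pathlib import Path

# Assumed mapping from --type values to input subfolders. This is inferred
# from the folder descriptions in this README, not taken from the real script.
TYPE_TO_FOLDERS = {
    "names": ["Namen"],
    "registers": ["Regesten"],
    "archive": ["Archiv"],
}
# Assumption: "all" covers every folder listed above.
TYPE_TO_FOLDERS["all"] = sorted(
    {folder for folders in TYPE_TO_FOLDERS.values() for folder in folders}
)


def build_parser():
    """Build a parser with the three options documented in this README."""
    parser = argparse.ArgumentParser(description="Convert XML files to JSON.")
    parser.add_argument("--type", required=True, choices=sorted(TYPE_TO_FOLDERS),
                        help="which kind of XML files to convert")
    parser.add_argument("--input_path", required=True, type=Path,
                        help="directory containing the input subfolders")
    parser.add_argument("--output_path", required=True, type=Path,
                        help="directory that receives the JSON output")
    return parser


def folders_to_convert(args):
    """Return the input subfolders selected by --type."""
    return [args.input_path / name for name in TYPE_TO_FOLDERS[args.type]]


# Example run with an explicit argument list instead of sys.argv.
example_args = build_parser().parse_args(
    ["--type", "names", "--input_path", "input_data", "--output_path", "output"]
)
for folder in folders_to_convert(example_args):
    print(f"Would convert XML files in {folder} into {example_args.output_path}")
```

Restricting --type with choices means an unsupported value is rejected by the parser itself, before any conversion starts.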
This project was inherited from another team, and we made several significant improvements and fixes to enhance its functionality and reliability:
Error Handling: Spaces/Blank Characters for Indentation in Text - in Regesten Files
display: This field is used for displaying text on the frontend, ensuring it retains the original formatting for readability.
processable: This field contains a cleaner version of the text, optimized for computer processing and analysis.
Metadata Handling: Problem Metadata in Regesten
.p in the Regesten files. Some elements were missing or incorrectly captured. We conducted a thorough review and ensured that all metadata elements are now accurately captured and processed in our iteration of the program.
Enhanced Interactivity: Add Information onmouseover="highlightWords(event, '...')" in Regesten
onmouseover attribute was added to highlight words when hovered over. The processed data now includes:
"names": [
{
"Aichholz_Johann": "Johann Aichholz",
"alias": "Johann Aichholz Ehrzney doctor"
},
{
"Strauben_Franz": "Franz Strauben",
"alias": "Frannzen Strauben"
}
]
Name Processing: Splitting First Name and Last Name via External Tool - GettyULAN
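The "names" entries in the example output pair a key of the form Lastname_Firstname with a readable "Firstname Lastname" value and an alias. The sketch below only illustrates that relationship. It does not call GettyULAN or any other external tool, and the helpers display_name_from_key and name_entry are hypothetical names, not functions from this project.

```python
def display_name_from_key(key):
    """Turn a key such as 'Aichholz_Johann' into 'Johann Aichholz'.

    Assumes the part before the first underscore is the last name and the
    rest is the first name; keys without an underscore are returned as-is.
    """
    last, sep, first = key.partition("_")
    if not sep:
        return key
    return f"{first.replace('_', ' ')} {last}"


def name_entry(key, alias):
    """Build one entry in the shape shown in the "names" example."""
    return {key: display_name_from_key(key), "alias": alias}


entries = [
    name_entry("Aichholz_Johann", "Johann Aichholz Ehrzney doctor"),
    name_entry("Strauben_Franz", "Frannzen Strauben"),
]
print(entries)
```

Names that do not fit a simple last-name/first-name pattern would not split correctly this way, which is presumably why an external authority is used in the project.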
Unified Main Script for Conversion
--type, --input_path, and --output_path. This change simplifies the execution and enhances the flexibility of the conversion process.
Improved Documentation and Setup Instructions
Performance Enhancements and Bug Fixes
Fixes in Archiv Type JSON Output
hasSublink, linkTo, and next_link variables in the output JSON files for the Archiv type. This ensures that these variables are accurately represented and linked in the JSON output.
These improvements have significantly enhanced the functionality, usability, and reliability of the XML to JSON Converter project, making it more robust and user-friendly.
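As an illustration of the display and processable fields described under Error Handling, here is a minimal sketch that derives both from one piece of whitespace-indented text. The field names come from this README. The function name split_text_fields and the exact cleaning rule (collapsing every run of whitespace into a single space) are assumptions, not the project's confirmed behavior.

```python
import re


def split_text_fields(raw):
    """Return a display version and a processable version of the same text."""
    # display: keep indentation and line breaks, drop only surrounding newlines.
    display = raw.strip("\n")
    # processable: collapse all whitespace runs (spaces, tabs, newlines)
    # into single spaces. This rule is an assumption for illustration.
    processable = re.sub(r"\s+", " ", raw).strip()
    return {"display": display, "processable": processable}


sample = "  Item one\n      continued line\n"
fields = split_text_fields(sample)
print(repr(fields["display"]))
print(repr(fields["processable"]))
```

Keeping both versions lets the frontend show the original layout while search or analysis code works on the normalized string.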