autoPDFtagger is a Python tool designed for efficient home-office organization, focusing on digitizing and organizing both digital and paper-based documents. By automating the tagging of PDF files, including image-rich documents and scans of varying quality, it aims to streamline the organization of digital archives.
Many documents are now delivered digitally, yet important ones often still arrive on paper. Looking towards a digital future, consolidating all of these documents into a unified digital archive becomes increasingly valuable, and scanning with a smartphone camera has made this practical. However, the limited reliability of existing OCR technologies, and their inability to effectively index non-textual content such as drawings or photos, hampers the searchability of these documents. autoPDFtagger aims to bridge this gap by offering AI-assisted analysis and organization of PDF files, enhancing their searchability and organization with a level of precision comparable to human effort.
At the moment, autoPDFtagger exists as a functional prototype: a terminal program with an accompanying Python module. It demonstrates the concept and has already produced impressive results for me. For broader use, many detailed improvements are still necessary, especially in testing, prompt optimization, error handling, and documentation.
If you find this tool helpful and have ideas to improve it, feel free to contribute. I'm not a full-time programmer and don't consider myself a professional, so any suggestions or enhancements are welcome. Submit bug reports, feature requests, or any other feedback. Thanks for stopping by!
$ pip install git+https://github.com/Uli-Z/autoPDFtagger
Create a configuration file and save it to ~/.autoPDFtagger.conf:
; Configuration for autoPDFtagger
[DEFAULT]
language = {YOUR LANGUAGE}
[OPENAI-API]
API-Key = {INSERT YOUR API-KEY}
The program is fundamentally structured as follows:
1. File analysis (-f, --file-analysis)
2. AI text analysis (-t, --ai-text-analysis)
3. AI image analysis (-i, --ai-image-analysis)
4. AI tag analysis (-c, --ai-tag-analysis)
Note: In principle, (almost) all options can be combined. The order of the individual steps is fixed, however; they are always processed in the order listed above. To intervene between steps, use piping in the terminal, which is explicitly supported: the state of the database can be passed to another instance of the program. This makes it possible to check and modify the result of each step (e.g., first text analysis, then filtering by quality, followed by image analysis, then re-filtering, and finally exporting the PDF files). With JSON output, the results can be piped directly to another instance of the program.
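
The fixed step order, and the role of filtering between separate runs, can be modeled in a few lines of Python. This is only an illustrative sketch, not autoPDFtagger's actual code: the step names are borrowed from the long option names in the help text, while `plan_steps`, `keep_above`, `keep_below`, and the `confidence_index` dictionary key are hypothetical names chosen for this example. The default threshold of 7 for `keep_below` is also an assumption, since the help text states it only for --keep-above.

```python
# Illustrative model only; names and data layout are assumptions, not the tool's API.

# Fixed processing order of the analysis steps.
PIPELINE_ORDER = (
    "file-analysis",
    "ai-text-analysis",
    "ai-image-analysis",
    "ai-tag-analysis",
)


def plan_steps(requested):
    """Return the requested steps in pipeline order, ignoring the order given."""
    unknown = set(requested) - set(PIPELINE_ORDER)
    if unknown:
        raise ValueError(f"unknown step(s): {sorted(unknown)}")
    return [step for step in PIPELINE_ORDER if step in requested]


def keep_above(documents, threshold=7):
    """Keep documents with a confidence index >= threshold (cf. --keep-above)."""
    return [doc for doc in documents if doc["confidence_index"] >= threshold]


def keep_below(documents, threshold=7):
    """Keep documents with a confidence index < threshold (cf. --keep-below)."""
    return [doc for doc in documents if doc["confidence_index"] < threshold]


if __name__ == "__main__":
    # Steps requested in any order still run in the fixed order.
    print(plan_steps(["ai-tag-analysis", "file-analysis"]))

    # Hypothetical database state after a text-analysis run.
    docs = [
        {"file": "invoice.pdf", "confidence_index": 9},
        {"file": "scan_0042.pdf", "confidence_index": 3},
    ]
    # Only the low-confidence documents would be handed to a second,
    # image-analysis run; this is what piping between instances allows.
    print([doc["file"] for doc in keep_below(docs)])
```

In this model, a filter between two analysis steps cannot be expressed within a single plan, which is why the workflow is split across several runs that hand the database on to each other.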
$ autoPDFtagger --help
usage: autoPDFtagger [-h] [--config-file CONFIG_FILE] [-b [BASE_DIRECTORY]] [-j [JSON]] [-s [CSV]] [-d {0,1,2}] [-f] [-t] [-i] [-c] [-e [EXPORT]] [-l]
[--keep-above [KEEP_ABOVE]] [--keep-below [KEEP_BELOW]] [--calc-stats]
[input_items ...]
Smart PDF-analyzing Tool
positional arguments:
input_items List of input PDFs and folders, alternativly you can use a JSON- or CSV-file
options:
-h, --help show this help message and exit
--config-file CONFIG_FILE
Specify path to configuration file. Defaults to ~/.autoPDFtagger.conf
-b [BASE_DIRECTORY], --base-directory [BASE_DIRECTORY]
Set base directory
-j [JSON], --json [JSON]
Output JSON-Database to stdout. If filename provided, save it to file
-s [CSV], --csv [CSV]
Output CSV-Database to specified file
-d {0,1,2}, --debug {0,1,2}
Debug level (0: no debug, 1: basic debug, 2: detailed debug)
-f, --file-analysis Try to conventionally extract metadata from file, file name and folder structure
-t, --ai-text-analysis
Do an AI text analysis
-i, --ai-image-analysis
Do an AI image analysis
-c, --ai-tag-analysis
Do an AI tag analysis
-e [EXPORT], --export [EXPORT]
Copy Documents to a target folder
-l, --list List documents stored in database
--keep-above [KEEP_ABOVE]
Before applying actions, filter out and retain only the documents with a confidence index greater than or equal to a specific
value (default: 7).
--keep-below [KEEP_BELOW]
Analogous to --keep-above. Retain only document with an index less than specified.
--calc-stats Calculate statistics and (roughly!) estimate costs for different analyses
Read all PDF files from the folder pdf_archive, do a basic file analysis (-f), and store the information in a JSON database allfiles.json (-j [filename]):
$ autoPDFtagger ./pdf_archive --file-analysis --json allfiles.json
Read a previously created JSON database and do an AI text analysis, storing the results in a new JSON file:
$ autoPDFtagger allfiles.json --ai-text-analysis --json textanalysis.json
Do an AI image analysis for all files with estimated low-quality metadata:
$ autoPDFtagger textanalysis.json --keep-below --ai-image-analysis --json imageanalysis.json
Merge the results again, then analyze and organize the tags:
$ autoPDFtagger textanalysis.json imageanalysis.json --ai-tag-analysis --json final.json
Copy the files to a new folder new_archive, setting new metadata and assigning new filenames. The original folder structure remains unchanged.
$ autoPDFtagger final.json -e ./new_archive
Do everything at once:
$ autoPDFtagger pdf_archive -ftic -e new_archive
- main.py: The terminal interface for the application.
- autoPDFtagger.py: Manages the core functionalities of the tool.
- AIAgents.py: Base classes for AI agent management, including OpenAI API communication.
- AIAgents_OPENAI_pdf.py: Specific AI agents dedicated to text, image, and tag analysis.
- PDFDocument.py: Handles individual PDF documents, managing metadata reading and writing.
- PDFList.py: Oversees a database of PDF documents and their metadata, and provides export functions.
- config.py: Manages configuration files.
- autoPDFtagger_example_config.conf: An example configuration file outlining API key setup and other settings.
License: GPL-3