llmdocparser下載llmdocparser源代碼下載

llmdocparser

其他源碼

1.0.0

下載

llmdocparser

用於解析PDF的軟件包並使用LLMS分析其內容。

該軟件包是基於GPTPDF概念的改進。

方法

GPTPDF使用pymupdf來解析PDF，識別文本和非文本區域。然後，它根據某些規則合併或過濾文本區域，並將最終結果輸入到多模式模型中進行解析。此方法特別有效。

基於這個概念，我做出了一些小改進。

主要過程

使用佈局分析模型，對PDF的每個頁面進行解析，以識別每個區域的類型，其中包括文本，標題，圖，圖形說明，表格，表格，標題，標題，頁腳，頁腳，參考和方程。還獲得了每個區域的坐標。

佈局分析結果示例：

 [{'header': ((101, 66, 436, 102), 0)},
 {'header': ((1038, 81, 1088, 95), 1)},
 {'title': ((106, 215, 947, 284), 2)},
 {'text': ((101, 319, 835, 390), 3)},
 {'text': ((100, 565, 579, 933), 4)},
 {'text': ((100, 967, 573, 1025), 5)},
 {'text': ((121, 1055, 276, 1091), 6)},
 {'reference': ((101, 1124, 562, 1429), 7)},
 {'text': ((610, 565, 1089, 930), 8)},
 {'text': ((613, 976, 1006, 1045), 9)},
 {'title': ((612, 1114, 726, 1129), 10)},
 {'text': ((611, 1165, 1089, 1431), 11)},
 {'title': ((1011, 1471, 1084, 1492), 12)}]

該結果包括每個區域的類型，坐標和閱讀順序。通過使用此結果，可以將更精確的規則設置為解析PDF。

最後，將相應區域的圖像輸入到多模型模型（例如GPT-4O或QWEN-VL）中，以直接獲得對RAG解決方案友好的文本塊。

img_path	類型	page_no	文件名	內容	filepath
{absolute_path}/page_1_title.png	標題	1	注意就是您所需要的	[文本塊1]	{file_absolute_path}
{absolute_path}/page_1_text.png	文字	1	注意就是您所需要的	[文本塊2]	{file_absolute_path}
{absolute_path}/page_2_figure.png	數字	2	注意就是您所需要的	[文本塊3]	{file_absolute_path}
{absolute_path}/page_2_figure_caption.png	圖標題	2	注意就是您所需要的	[文本塊4]	{file_absolute_path}
{absolute_path}/page_3_table.png	桌子	3	注意就是您所需要的	[文本塊5]	{file_absolute_path}
{absolute_path}/page_3_table_caption.png	桌子標題	3	注意就是您所需要的	[文本塊6]	{file_absolute_path}
{absolute_path}/page_1_header.png	標題	1	注意就是您所需要的	[文本塊7]	{file_absolute_path}
{absolute_path}/page_2_footer.png	頁尾	2	注意就是您所需要的	[文本塊8]	{file_absolute_path}
{absolute_path}/page_3_reference.png	參考	3	注意就是您所需要的	[文本塊9]	{file_absolute_path}
{absolute_path}/page_1_equation.png	方程	1	注意就是您所需要的	[文本塊10]	{file_absolute_path}

請參閱llm_parser.py主函數中的更多內容。

安裝

 pip install llmdocparser

從源安裝

要從源安裝此項目，請執行以下步驟：

克隆存儲庫：
首先，將存儲庫克隆到您的本地計算機。打開終端並運行以下命令：
```
git clone https://github.com/lazyFrogLOL/llmdocparser.git
cd llmdocparser
```
安裝依賴項：
該項目使用詩歌進行依賴管理。確保安裝了詩歌。如果沒有，您可以按照詩歌安裝指南中的說明進行操作。
安裝詩歌后，在項目的根目錄中運行以下命令以安裝依賴項：
```
poetry install
```
這將讀取pyproject.toml文件，並安裝項目的所有必要依賴項。

用法

 from llmdocparser . llm_parser import get_image_content

content , cost = get_image_content (
    llm_type = "azure" ,
    pdf_path = "path/to/your/pdf" ,
    output_dir = "path/to/output/directory" ,
    max_concurrency = 5 ,
    azure_deployment = "azure-gpt-4o" ,
    azure_endpoint = "your_azure_endpoint" ,
    api_key = "your_api_key" ,
    api_version = "your_api_version"
)
print ( content )
print ( cost )

參數

llm_type：str
選項是Azure，OpenAi，Dashscope。
PDF_Path：Str
PDF文件的路徑。
output_dir：str
輸出目錄以存儲所有解析的圖像。
max_concurrency：int
GPT解析工人線程的數量。批次通話詳細信息：批次支持

如果使用Azure，則需要傳遞Azure_Deployment和Azure_endpoint參數；否則，只需要提供API密鑰。

base_url：str
OpenAI兼容服務器URL。詳細信息：與OpenAI兼容的服務器

成本

使用“注意力是您需要”的論文進行分析，選擇的模型為GPT-4O，成本如下：

 Total Tokens: 44063
Prompt Tokens: 33812
Completion Tokens: 10251
Total Cost (USD): $0.322825

平均每頁費用：$ 0.0215

星曆史

展開

附加信息

版本 1.0.0
類型其他源碼
更新時間 2025-04-19
大小 1.19MB
來自於 Github

相關應用

Google Dorks

2025-03-10
shepherd

2025-06-04
mongo express

2025-06-04
hidusbf

2025-02-14
Free Algorithms Books

2025-05-29
markdownpedia

2025-04-22

爲您推薦

chat.petals.dev

其他源碼

1.0.0
GPT Prompt Templates

其他源碼

1.0.0
GPTyped

其他源碼

GPTyped 1.0.5
Google Dorks

其他源碼

1.0
shepherd

其他源碼

v6.1.6-react-shepherd: Prepare Release (#3063)
mongo express

其他源碼

v1.1.0-rc-3
Google Dorks

其他源碼

1.0
shepherd

其他源碼

v6.1.6-react-shepherd: Prepare Release (#3063)
mongo express

其他源碼

v1.1.0-rc-3

相關資訊全部