v.0.6.0 - Microsoft OCR Support

Vision utilities for web interaction agents


Tarsier

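If you have tried using an LLM to automate web interactions, you have probably run into questions like:

How should you feed the webpage to an LLM? (e.g. HTML, accessibility tree, screenshot)
How do you map LLM responses back to web elements?
How can you inform a text-only LLM about the page's visual structure?

At Reworkd, we iterated on all of these problems across tens of thousands of real web tasks to build a powerful perception system for web agents... Tarsier! In the video below, we use Tarsier to provide webpage perception for a minimalistic GPT-4 LangChain web agent.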

tarsier.mp4

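How does it work?

Tarsier visually tags interactable elements on a page with brackets plus an ID, e.g. [23]. In doing so, it provides a mapping between elements and IDs that an LLM can act on (e.g. CLICK [23]). We define interactable elements as buttons, links, or input fields that are visible on the page; Tarsier can also tag all text elements if you pass tag_text_elements=True.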
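To make the ID-to-element mapping concrete, here is a minimal sketch of how an agent might turn an LLM reply such as CLICK [23] back into an XPath. The action format, the `resolve_action` helper, and the example XPaths are assumptions made for illustration; they are not part of Tarsier's API. The only thing taken from Tarsier is the idea of a mapping from integer tag IDs to XPaths.

```python
import re
from typing import Dict, Tuple

# Hypothetical action format: an upper-case verb followed by a bracketed tag,
# e.g. "CLICK [23]". An optional #, @ or $ prefix inside the brackets is accepted.
ACTION_PATTERN = re.compile(r"^\s*([A-Z]+)\s+\[[#@$]?(\d+)\]")


def resolve_action(llm_output: str, tag_to_xpath: Dict[int, str]) -> Tuple[str, str]:
    """Return (verb, xpath) for an LLM action such as 'CLICK [23]'."""
    match = ACTION_PATTERN.match(llm_output)
    if match is None:
        raise ValueError(f"Could not parse an action from: {llm_output!r}")
    verb, tag_id = match.group(1), int(match.group(2))
    if tag_id not in tag_to_xpath:
        raise KeyError(f"Tag [{tag_id}] is not on the current page")
    return verb, tag_to_xpath[tag_id]


# Example mapping with made-up XPaths.
example_mapping = {23: "//html/body/div[1]/a[2]", 24: "//html/body/form/input[1]"}
verb, xpath = resolve_action("CLICK [23]", example_mapping)
print(verb, xpath)
```

Rejecting unknown IDs explicitly is a deliberate choice in this sketch: an LLM can hallucinate a tag that is not on the page, and failing loudly is usually easier to recover from than acting on the wrong element.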

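Furthermore, we have developed an OCR algorithm that converts a page screenshot into a whitespace-structured string (almost like ASCII art) that even an LLM without vision can understand. This is critical because current vision-language models still lack the fine-grained representations needed for web interaction tasks. On our internal benchmarks, unimodal GPT-4 + Tarsier-Text beats GPT-4V + Tarsier-Screenshot by 10-20%!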
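The idea of a whitespace-structured string can be illustrated with a toy layout routine. This is a simplified sketch and not Tarsier's actual algorithm: it assumes OCR output is already available as (text, x, y) tuples in pixels, and the `layout_text` name, the character width, and the line height are all hypothetical choices.

```python
from typing import Dict, List, Tuple

# Each annotation is (text, x, y): the top-left corner of a word box in pixels.
Annotation = Tuple[str, float, float]


def layout_text(annotations: List[Annotation], char_width: float = 10.0,
                line_height: float = 20.0) -> str:
    """Place each word on a character grid according to its pixel position."""
    rows: Dict[int, List[Tuple[int, str]]] = {}
    for text, x, y in annotations:
        rows.setdefault(int(y // line_height), []).append((int(x // char_width), text))

    lines = []
    for row in range(max(rows) + 1 if rows else 0):
        line = ""
        for col, text in sorted(rows.get(row, [])):
            if line and len(line) >= col:
                # The target column is already occupied: separate with one space.
                line += " "
            else:
                # Pad with spaces up to the target column.
                line = line.ljust(col)
            line += text
        lines.append(line)
    return "\n".join(lines)


words = [("Home", 0, 0), ("Login", 300, 0), ("Search", 100, 40)]
print(layout_text(words))
```

Horizontal gaps become runs of spaces and vertical gaps become blank lines, so a text-only model can still infer that "Login" sits far to the right of "Home" and that "Search" is lower on the page.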

Tagged Screenshot	Tagged Text Representation

Installation

pip install tarsier

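Usage

Visit our cookbook for agent examples using Tarsier:

An autonomous LangChain web agent
An autonomous LlamaIndex web agent

We currently support 2 OCR engines: Google Vision and Microsoft Azure. To create service account credentials for Google, follow the instructions in this Stack Overflow answer: https://stackoverflow.com/a/46290808/1780891

The credentials for Microsoft Azure are stored as a simple JSON file consisting of an API key and an endpoint: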

{
  "key": "<enter_your_api_key>",
  "endpoint": "<enter_your_api_endpoint>"
}

These values can be found in the Keys and Endpoint section of the Computer Vision resource. See the instructions at https://learn.microsoft.com/en-us/answers/questions/854952/dont-find-find-your-key-and-your-and-your-endpoint
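Since a missing or padded value in this file tends to surface only later as an authentication error, it can help to check the credentials up front. The helper below is an illustrative sketch, not part of Tarsier: `parse_azure_credentials` is a hypothetical name, and the only assumption about the file is the two-field shape shown above.

```python
import json
from typing import Dict


def parse_azure_credentials(raw_json: str) -> Dict[str, str]:
    """Parse the credentials JSON and make sure both fields are present."""
    credentials = json.loads(raw_json)
    cleaned = {}
    for name in ("key", "endpoint"):
        # Strip stray whitespace that is easy to introduce when pasting values.
        value = str(credentials.get(name, "")).strip()
        if not value:
            raise ValueError(f"Missing '{name}' in Azure credentials")
        cleaned[name] = value
    return cleaned


# Placeholder values, not real credentials.
raw = '{"key": " my-api-key ", "endpoint": " https://example.invalid/ "}'
print(parse_azure_credentials(raw))
```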

Otherwise, basic Tarsier usage might look like the following:

import asyncio
import json

from playwright.async_api import async_playwright
from tarsier import Tarsier, GoogleVisionOCRService, MicrosoftAzureOCRService


def load_ocr_credentials(json_file_path):
    """Load OCR service credentials from a JSON file."""
    with open(json_file_path) as f:
        credentials = json.load(f)
    return credentials


async def main():
    # To create the service account key, follow the instructions on this SO answer https://stackoverflow.com/a/46290808/1780891

    google_cloud_credentials = load_ocr_credentials('./google_service_acc_key.json')
    # microsoft_azure_credentials = load_ocr_credentials('./microsoft_azure_credentials.json')

    ocr_service = GoogleVisionOCRService(google_cloud_credentials)
    # ocr_service = MicrosoftAzureOCRService(microsoft_azure_credentials)

    tarsier = Tarsier(ocr_service)

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto("https://news.ycombinator.com")

        page_text, tag_to_xpath = await tarsier.page_to_text(page)

        print(tag_to_xpath)  # Mapping of tags to XPaths
        print(page_text)  # Text representation of the page


if __name__ == '__main__':
    asyncio.run(main())

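Keep in mind that Tarsier tags different types of elements differently to help your LLM identify which actions can be performed on each element. Specifically:

[#ID]: text-insertable fields (e.g. textarea, input with a textual type)
[@ID]: hyperlinks (<a> tags)
[$ID]: other interactable elements (e.g. button, select)
[ID]: plain text (if you pass tag_text_elements=True)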
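The prefix conventions above are easy to recover programmatically. The following sketch scans a page-text string for tags and labels each one by kind. The `find_tags` helper and the sample string are made up for illustration; real Tarsier output will look different, and only the four tag formats come from the list above.

```python
import re
from typing import List, Tuple

# Matches [#12], [@3], [$5] and [7]; the prefix group is empty for plain text.
TAG_PATTERN = re.compile(r"\[([#@$]?)(\d+)\]")

KIND_BY_PREFIX = {
    "#": "text input",
    "@": "hyperlink",
    "$": "other interactable",
    "": "plain text",
}


def find_tags(page_text: str) -> List[Tuple[int, str]]:
    """Return (tag_id, kind) for every tag found in the page text, in order."""
    return [(int(tag_id), KIND_BY_PREFIX[prefix])
            for prefix, tag_id in TAG_PATTERN.findall(page_text)]


# Made-up sample in the spirit of a tagged page.
sample = "[@1] Hacker News   [#2] Search   [$3] Submit   [4] 30 points"
for tag_id, kind in find_tags(sample):
    print(tag_id, kind)
```

An agent could use the kind to restrict which actions it offers the LLM, for example only allowing typing into tags labelled as text inputs.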

Local Development

Setup

We provide a handy setup script to get you up and running with Tarsier development.

./script/setup.sh

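If you modify any TypeScript files used by Tarsier, you will need to run the following command. It compiles the TypeScript into JavaScript, which can then be used in the Python package.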

npm run build

Testing

We use pytest for testing. To run the tests, simply run:

poetry run pytest .

Linting

Before submitting a potential PR, please run the following to format your code:

./script/format.sh

Supported OCR Services

Google Cloud Vision
Microsoft Azure Computer Vision
Amazon Textract (Coming Soon)

Roadmap

Citation

@misc{reworkd2023tarsier,
  title        = {Tarsier},
  author       = {Rohan Pandey and Adam Watkins and Asim Shrestha and Srijan Subedi},
  year         = {2023},
  howpublished = {GitHub},
  url          = {https://github.com/reworkd/tarsier}
}
