Recently, Alibaba Tongyi Lab announced the open-sourcing of its latest research result, ViDoRAG, a retrieval-augmented generation (RAG) system designed specifically for visual document understanding. Evaluated with GPT-4o as the underlying model, ViDoRAG reached an impressive accuracy of 79.4%, more than 10 percentage points higher than traditional RAG systems. This breakthrough marks an important step in the field of visual document processing and opens new possibilities for applying artificial intelligence to complex document understanding.

ViDoRAG is not a traditional single model; it adopts an innovative multi-agent framework. The system combines dynamically iterative reasoning agents with a hybrid retrieval strategy based on a Gaussian Mixture Model (GMM). This design allows ViDoRAG to extract and reason over key information more accurately when processing visual documents that contain both images and text. Compared with traditional RAG systems, which are limited by relying solely on text retrieval, ViDoRAG significantly improves performance through multimodal fusion.
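To illustrate the general idea behind GMM-based hybrid retrieval, here is a minimal sketch, not the project's actual code: the function name, the fusion weight `alpha`, and the scoring scheme are all assumptions. It fuses text and visual similarity scores and fits a two-component Gaussian mixture over them so that the number of candidates kept can adapt to each query instead of using a fixed top-k.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_hybrid_retrieve(text_scores: np.ndarray, visual_scores: np.ndarray,
                        alpha: float = 0.5) -> np.ndarray:
    """Sketch: fuse text and visual similarity scores, then use a 1-D GMM
    to separate likely-relevant candidates from the background.

    Returns the indices of candidates assigned to the higher-mean GMM
    component, so retrieval depth varies per query.
    """
    # Weighted fusion of the two modalities (alpha is a hypothetical knob).
    fused = alpha * text_scores + (1.0 - alpha) * visual_scores

    # Fit a two-component Gaussian mixture over the fused score distribution.
    gmm = GaussianMixture(n_components=2, random_state=0)
    labels = gmm.fit_predict(fused.reshape(-1, 1))

    # Treat the component with the higher mean score as the "relevant" cluster.
    relevant_component = int(np.argmax(gmm.means_.ravel()))
    return np.where(labels == relevant_component)[0]

# Example with synthetic scores for 10 candidate document pages.
rng = np.random.default_rng(0)
print(gmm_hybrid_retrieve(rng.random(10), rng.random(10)))
```

The point of the mixture model here is that the score distribution itself decides how many pages are worth passing to the generation stage, rather than a hand-tuned cutoff.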
Tongyi Lab describes how ViDoRAG works in detail in its published paper and code repository. At its core, multiple agents collaborate to dynamically adjust the retrieval and generation process, reducing hallucination in complex scenarios (i.e., the model producing inaccurate or fabricated content) and improving the reliability and contextual relevance of answers.
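The following sketch shows one common way such a multi-agent loop can be organized; the role names and callables here are hypothetical placeholders for LLM-backed agents, not ViDoRAG's actual interfaces. A seeker gathers evidence, an inspector judges whether the evidence is sufficient and gives feedback, and the answer is only generated once the evidence passes inspection.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class AgentLoop:
    """Minimal sketch of a dynamic multi-agent RAG loop (assumed design):
    seek() retrieves evidence, inspect() verifies it and returns feedback,
    and answer() generates the final response from verified evidence."""
    seek: Callable[[str, str], List[str]]          # (query, feedback) -> evidence chunks
    inspect: Callable[[str, List[str]], Tuple[bool, str]]  # (query, evidence) -> (ok, feedback)
    answer: Callable[[str, List[str]], str]        # (query, evidence) -> final answer
    max_rounds: int = 3

    def run(self, query: str) -> str:
        evidence: List[str] = []
        feedback = ""
        for _ in range(self.max_rounds):
            # The seeker refines its retrieval using the inspector's feedback.
            evidence += self.seek(query, feedback)
            ok, feedback = self.inspect(query, evidence)
            if ok:
                break  # evidence judged sufficient; stop iterating
        # Grounding generation in inspected evidence is what curbs hallucination.
        return self.answer(query, evidence)
```

The iteration-with-feedback structure is what "dynamically adjusting the retrieval and generation process" refers to: instead of a single retrieve-then-generate pass, the loop keeps searching until the evidence actually supports an answer.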
The system's 79.4% accuracy with GPT-4o not only demonstrates strong performance but also highlights the gap with traditional RAG systems. While traditional RAG systems perform well on text generation tasks, they are often limited to single-modality retrieval when processing visual documents, and their accuracy usually hovers at a lower level. By deeply integrating visual and textual information, ViDoRAG raises accuracy by more than 10 percentage points. This advance matters for scenarios that demand high-precision document understanding, such as legal document analysis, medical report interpretation, and enterprise data processing.
Alibaba Tongyi Lab's decision to open-source ViDoRAG has also sparked heated discussion on Twitter. Users note that releasing the system not only reflects Alibaba's technical strength in AI but also provides a valuable resource for developers and researchers worldwide. Through the public paper and code (links have been shared in Twitter posts), ViDoRAG is expected to accelerate research on and application of visual document RAG technology and push multimodal AI systems further forward.
The release and open-sourcing of ViDoRAG open a new direction for RAG technology. As demand for visual document processing grows, ViDoRAG may be just the beginning, and more similar innovative systems are likely to emerge.
Project: https://github.com/Alibaba-NLP/ViDoRAG