In the field of computer science, processing complex documents and converting them into structured data has long been a challenging problem. Traditional methods often rely on complex model pipelines or large multimodal models which, while powerful, are prone to hallucination and computationally expensive.
Recently, IBM and Hugging Face collaborated to launch SmolDocling, an open-source vision-language model (VLM) with only 256M parameters, designed to solve multimodal document conversion end-to-end. What makes SmolDocling stand out is that it combines a compact size with strong capabilities, significantly reducing computational cost and resource requirements.
SmolDocling's architecture is based on Hugging Face's SmolVLM-256M and achieves a significant reduction in computational complexity through optimized tokenization and aggressive visual feature compression. Its core advantage lies in the innovative DocTags format, which cleanly separates document layout, text content, and visual elements such as tables, formulas, code snippets, and charts.
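To make the separation of layout and content concrete, the sketch below parses a DocTags-style string into structured elements. The tag names and `<loc_*>` position tokens are modeled on published DocTags examples but are illustrative here, and the regex parser is a simplification, not an official DocTags tool:

```python
import re

# Illustrative DocTags-style output for a page with a section header,
# a paragraph, and a code snippet. Tag names and <loc_*> bounding-box
# tokens are illustrative approximations of the DocTags format.
doctags = (
    "<doctag>"
    "<section_header_level_1><loc_12><loc_10><loc_240><loc_30>"
    "Introduction</section_header_level_1>"
    "<text><loc_12><loc_40><loc_240><loc_90>"
    "SmolDocling converts pages end-to-end.</text>"
    "<code><loc_12><loc_100><loc_240><loc_160>"
    "print('hello')</code>"
    "</doctag>"
)

# Each element carries its type, a bounding box, and its content,
# so layout and text stay cleanly separated for downstream use.
pattern = re.compile(
    r"<(?P<tag>\w+)>"                                      # element type
    r"<loc_(\d+)><loc_(\d+)><loc_(\d+)><loc_(\d+)>"        # bounding box
    r"(?P<content>.*?)"                                    # element content
    r"</(?P=tag)>"
)

elements = [
    {
        "type": m.group("tag"),
        "bbox": tuple(int(m.group(i)) for i in range(2, 6)),
        "content": m.group("content"),
    }
    for m in pattern.finditer(doctags)
]

for el in elements:
    print(el["type"], el["bbox"], repr(el["content"]))
```

Because every element is typed and carries coordinates, a consumer can reconstruct the page layout or discard it and keep only the text, depending on the downstream task.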
To train more efficiently, SmolDocling adopts a curriculum learning approach: the visual encoder is first frozen, and the model is then gradually fine-tuned on richer datasets to improve visual-semantic alignment across different document elements. Thanks to this efficiency, SmolDocling processes full document pages quickly, taking only 0.35 seconds per page on a consumer GPU while consuming less than 500MB of GPU memory.
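The two-stage curriculum described above (freeze the vision encoder first, then unfreeze and fine-tune end-to-end) can be sketched with standard PyTorch parameter freezing. This is a generic pattern under illustrative module names, not SmolDocling's actual training code:

```python
import torch.nn as nn

# Minimal stand-in model: a "vision encoder" and a "language decoder".
# Module names and sizes are illustrative, not SmolDocling's real layers.
model = nn.ModuleDict({
    "vision_encoder": nn.Linear(16, 16),
    "language_decoder": nn.Linear(16, 16),
})

def set_trainable(module: nn.Module, trainable: bool) -> None:
    # Toggling requires_grad excludes (or re-includes) the module's
    # parameters from gradient updates.
    for p in module.parameters():
        p.requires_grad = trainable

# Stage 1: freeze the vision encoder; only the decoder is trained,
# learning to emit structured output from fixed visual features.
set_trainable(model["vision_encoder"], False)
frozen = [n for n, p in model.named_parameters() if not p.requires_grad]
print(frozen)

# Stage 2: unfreeze everything and fine-tune end-to-end on richer data
# to tighten visual-semantic alignment.
set_trainable(model["vision_encoder"], True)
```

In practice the optimizer is usually rebuilt (or filtered to `requires_grad` parameters) between stages so frozen weights incur no optimizer state.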
In performance testing, SmolDocling significantly outperformed many larger competing models. For example, on the full-page document OCR task, SmolDocling achieved markedly higher accuracy than the 7-billion-parameter Qwen2.5-VL and the 350-million-parameter Nougat, with a lower edit distance (0.48) and a higher F1 score (0.80).
In formula transcription, SmolDocling reached an F1 score of 0.95, comparable to state-of-the-art models such as GOT. Even more notably, SmolDocling sets a new benchmark in code snippet recognition, with precision and recall of 0.94 and 0.91, respectively.
SmolDocling differs from other document OCR solutions in its ability to handle the full range of complex elements in a document, including code, charts, formulas, and diverse layouts. Its capabilities are not limited to scientific papers; it also reliably processes patents, forms, and business documents.
Because DocTags provides comprehensive structured metadata, SmolDocling removes the ambiguity inherent in formats such as HTML or Markdown, improving the downstream usability of converted documents. Its compact size also enables large-scale batch processing at extremely low resource cost, offering a cost-effective option for large-scale deployments.
In short, the release of SmolDocling represents a major step forward in document conversion technology. It demonstrates convincingly that compact models can not only compete with large foundation models but even surpass them on key tasks. The researchers showed that, through targeted training, innovative data augmentation, and new markup formats like DocTags, the limitations traditionally associated with model size and complexity can be overcome. By open-sourcing SmolDocling together with its datasets and its efficient, compact architecture, the team sets new standards of efficiency and versatility for OCR technology and provides a valuable resource for the community.