image to text models

15 models · ranked by Hugging Face downloads

GLM-OCR

GLM-OCR is a multilingual OCR and document understanding model from ZhipuAI, built on the GLM architecture and supporting text recognition across Chinese, English, French, Spanish, Russian, German, Japanese, and Korean. It treats OCR as a sequence generation task, enabling structured text extraction from document images and screenshots. MIT licensed.

3,080,576 ↓ · 1,894 ♡

blip-image-captioning-base

BLIP (Bootstrapping Language-Image Pre-training) base model for image captioning, using a vision encoder connected to a text decoder via cross-attention. It introduced a bootstrapping approach that filters noisy web-crawled image-text pairs during training.
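The filtering idea can be illustrated with a toy sketch. It assumes a scoring function that rates how well a caption matches an image; in BLIP that role is played by a learned filter model, whereas the `toy_match_score` stub, the 0.5 threshold, and the sample pairs below are hypothetical stand-ins chosen only for illustration.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (image_id, caption)


def filter_pairs(
    pairs: Iterable[Pair],
    match_score: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[Pair]:
    """Keep only the pairs whose caption scores at or above the threshold."""
    return [(img, cap) for img, cap in pairs if match_score(img, cap) >= threshold]


# Hypothetical scores standing in for a learned image-text matching model.
TOY_SCORES = {
    ("img1", "a dog on a beach"): 0.92,
    ("img1", "click here to buy now"): 0.08,
    ("img2", "two cats sleeping"): 0.81,
}


def toy_match_score(image_id: str, caption: str) -> float:
    """Look up the toy score; unknown pairs count as a non-match."""
    return TOY_SCORES.get((image_id, caption), 0.0)


kept = filter_pairs(TOY_SCORES.keys(), toy_match_score)
print(kept)
```

The spam-like caption falls below the threshold and is dropped, while the two descriptive captions are kept.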

1,826,334 ↓ · 864 ♡

blip-image-captioning-large

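blip-image-captioning-large is the larger BLIP captioning checkpoint, using a ViT-Large vision backbone. It generates natural-language captions for images, optionally continuing a text prompt. It is a captioning model rather than an OCR engine, so it describes scenes rather than transcribing text.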

743,803 ↓ · 1,477 ♡

PP-OCRv5_server_det

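PP-OCRv5_server_det is the server-grade text detection model of PaddleOCR's PP-OCRv5 pipeline. It locates text regions in an image and outputs their bounding polygons. It does not transcribe text itself, so it is paired with a recognition model downstream.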
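In a detect-then-recognize OCR pipeline, the detection stage yields text-region boxes that are usually sorted into reading order before recognition. A minimal sketch of that step follows, assuming axis-aligned boxes given as `(x_min, y_min, x_max, y_max)`. The box format, the `row_tolerance` parameter, and the sample coordinates are illustrative assumptions, not PaddleOCR's actual output schema.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), assumed format


def reading_order(boxes: List[Box], row_tolerance: int = 10) -> List[Box]:
    """Sort boxes top-to-bottom, then left-to-right within each text row.

    A box joins the current row when its top edge lies within
    `row_tolerance` pixels of the top edge of that row's first box.
    """
    rows: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: b[1]):
        if rows and abs(box[1] - rows[-1][0][1]) <= row_tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [box for row in rows for box in sorted(row, key=lambda b: b[0])]


detected = [(200, 12, 380, 40), (10, 10, 180, 40), (10, 60, 300, 90)]
ordered = reading_order(detected)
print(ordered)
```

Grouping by row first keeps two boxes on the same line together even when their top edges differ by a few pixels.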

569,922 ↓ · 73 ♡

NuExtract3

NuExtract3 is NuMind's document-understanding model fine-tuned from Qwen3.5-4B for structured information extraction. It converts documents, images, and PDFs to structured Markdown or JSON output, targeting RAG preprocessing and enterprise document pipelines.

520,207 ↓ · 272 ♡

UVDoc

UVDoc is Baidu's document image unwarping model using PaddleOCR infrastructure, designed to correct perspective distortions and page curling in scanned documents before OCR. It uses a PaddlePaddle backend rather than PyTorch or JAX. Supports Chinese and English documents. Apache-2.0 licensed.

510,475 ↓ · 11 ♡

PP-LCNet_x1_0_doc_ori

PP-LCNet_x1_0_doc_ori is a lightweight document orientation classifier from PaddleOCR that determines whether a scanned document page is upright or rotated by 90°, 180°, or 270°. It is a pre-processing component in PaddleOCR's document digitalisation pipeline, ensuring OCR models receive correctly oriented input. The x1.0 scale balances classification speed and accuracy for batch document processing.
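Once a classifier of this kind has predicted an orientation, the page can be rotated upright before OCR. A minimal sketch in plain Python follows, treating the image as a grid of pixel rows. It assumes the predicted label is the clockwise angle by which the page content is rotated; that label convention is an assumption for illustration, not documented PaddleOCR behavior.

```python
from typing import List

Grid = List[List[int]]  # image as rows of pixel values


def rotate_ccw(grid: Grid) -> Grid:
    """Rotate the grid 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*grid)][::-1]


def rotate_cw(grid: Grid) -> Grid:
    """Rotate the grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def make_upright(grid: Grid, predicted_angle: int) -> Grid:
    """Undo a clockwise rotation of 0, 90, 180 or 270 degrees."""
    if predicted_angle not in (0, 90, 180, 270):
        raise ValueError(f"unsupported angle: {predicted_angle}")
    for _ in range(predicted_angle // 90):
        grid = rotate_ccw(grid)
    return grid


page = [[1, 2, 3],
        [4, 5, 6]]
scanned = rotate_cw(page)             # simulate a page scanned rotated by 90 degrees
restored = make_upright(scanned, 90)  # apply the (assumed) classifier output
print(restored == page)
```

If the classifier's labels turn out to follow the opposite direction, swapping `rotate_ccw` for `rotate_cw` inside `make_upright` is the only change needed.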

444,436 ↓ · 16 ♡

trocr-small-handwritten

trocr-small-handwritten is Microsoft's small-scale TrOCR model fine-tuned specifically for handwritten text recognition, combining a Vision Transformer image encoder with a Transformer text decoder. It is described in arXiv:2109.10282 and uses a vision-encoder-decoder architecture that was pre-trained on large OCR datasets before handwriting-specific fine-tuning. The small variant targets deployments where memory is limited and inference latency matters.
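A minimal inference sketch using the Hugging Face `transformers` vision-encoder-decoder classes is shown here. It assumes the checkpoint is published under the Hub identifier `microsoft/trocr-small-handwritten`, that `transformers`, `torch`, and `Pillow` are installed (the tokenizer may additionally require `sentencepiece`), and that the input is a cropped image of a single text line. The function name `recognize_line` is hypothetical.

```python
MODEL_ID = "microsoft/trocr-small-handwritten"  # assumed Hub identifier


def recognize_line(image_path: str) -> str:
    """Return the text recognized in a single-line handwriting image."""
    # Imported lazily so this module loads even without the heavy dependencies.
    from PIL import Image
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    processor = TrOCRProcessor.from_pretrained(MODEL_ID)
    model = VisionEncoderDecoderModel.from_pretrained(MODEL_ID)

    # The processor resizes and normalizes the image into encoder inputs.
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    # The decoder generates token ids autoregressively; decode them to text.
    generated_ids = model.generate(pixel_values)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

Calling `recognize_line("line.png")` downloads the weights on first use. For repeated calls, loading the processor and model once outside the function avoids reloading them each time.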

443,600 ↓ · 63 ♡

trocr-base-printed

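trocr-base-printed is Microsoft's base-size TrOCR model fine-tuned for printed text recognition on the SROIE receipt dataset. Like the other TrOCR checkpoints, it pairs a Transformer image encoder with a Transformer text decoder and expects cropped single-line text images rather than full pages.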

421,235 ↓ · 209 ♡

manga-ocr-base

manga-ocr-base is a vision-encoder-decoder model fine-tuned on the Manga109s dataset for Japanese OCR specifically targeting manga panels and speech bubbles. It handles the challenges of vertical text, stylized fonts, and low-contrast artwork that defeat general-purpose OCR engines. The model is Japanese-only and is not designed for natural scene text or printed documents.

400,764 ↓ · 176 ♡

granite-vision-3.3-2b

IBM Granite Vision 3.3-2B is a compact multimodal model based on LLaVA-NeXT, supporting image understanding and visual question answering at 2B parameters. It targets edge and resource-constrained deployments, with training methodology described in arXiv:2502.09927.

371,052 ↓ · 85 ♡

en_PP-OCRv5_mobile_rec

PP-OCRv5 mobile recognition model from Baidu PaddlePaddle for English text recognition in OCR pipelines. Optimized for mobile deployment with a lightweight backbone while targeting competitive text recognition accuracy on printed and scene text.

346,891 ↓ · 2 ♡

nougat-base

Nougat is Meta's document understanding model that converts scientific PDFs (including LaTeX equations, tables, and figures) into structured Markdown text. It uses a vision encoder to process PDF page images and a text decoder to produce formatted output.

313,087 ↓ · 189 ♡

blip2-opt-2.7b-coco

blip2-opt-2.7b-coco is a BLIP-2 checkpoint that connects an image encoder to the OPT-2.7B language model through a Querying Transformer (Q-Former), fine-tuned on COCO for image captioning.

310,532 ↓ · 11 ♡

pix2text-mfr

pix2text-mfr is a mathematical formula recognition (MFR) model from the Pix2Text project that converts images of math formulas into LaTeX. It is MIT-licensed, which permits use in closed-source and commercial products. Evaluate it on your own data before relying on it in production.

297,733 ↓ · 54 ♡