Nanonets-OCR-s – OCR model that transforms documents into structured markdown

Nanonets-OCR-s is a powerful, state-of-the-art image-to-markdown OCR model that goes far beyond traditional text extraction. It transforms documents into structured markdown with intelligent content recognition and semantic tagging, making it ideal for downstream processing by Large Language Models (LLMs).

Nanonets-OCR-s is packed with features designed to handle complex documents with ease:

  • LaTeX Equation Recognition: Automatically converts mathematical equations and formulas into properly formatted LaTeX syntax. It distinguishes between inline ($...$) and display ($$...$$) equations.
  • Intelligent Image Description: Describes images within documents using structured <img> tags, making them digestible for LLM processing. It can describe various image types, including logos, charts, graphs and so on, detailing their content, style, and context.
  • Signature Detection & Isolation: Identifies and isolates signatures from other text, outputting them within a <signature> tag. This is crucial for processing legal and business documents.
  • Watermark Extraction: Detects and extracts watermark text from documents, placing it within a <watermark> tag.
  • Smart Checkbox Handling: Converts form checkboxes and radio buttons into standardized Unicode symbols (☐, ☑, ☒) for consistent and reliable processing.
  • Complex Table Extraction: Accurately extracts complex tables from documents and converts them into both markdown and HTML table formats.
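
To make the tagging concrete, here is a short, invented sketch of the kind of output these features aim for; the tag names come from the prompt in the usage examples below, and the document content is hypothetical:

<watermark>DRAFT</watermark>

# Purchase Agreement

The total is computed as $T = p \cdot q$.

☑ Approved ☐ Rejected

<img>Company logo: a blue hexagon next to the word ACME.</img>

<signature>John Doe</signature>

<page_number>1/2</page_number>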

📢 Read the full announcement | 🤗 Hugging Face Space Demo

Usage

Using transformers

from PIL import Image
from transformers import AutoTokenizer, AutoProcessor, AutoModelForImageTextToText

model_path = "nanonets/Nanonets-OCR-s"

# Load the model and its tokenizer/processor.
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",
)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = AutoProcessor.from_pretrained(model_path)


def ocr_page_with_nanonets_s(image_path, model, processor, max_new_tokens=4096):
    prompt = """Extract the text from the above document as if you were reading it naturally. Return the tables in html format. Return the equations in LaTeX representation. If there is an image in the document and image caption is not present, add a small description of the image inside the <img></img> tag; otherwise, add the image caption inside <img></img>. Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>. Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number> or <page_number>9/22</page_number>. Prefer using ☐ and ☑ for check boxes."""
    image = Image.open(image_path)
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "image", "image": f"file://{image_path}"},
            {"type": "text", "text": prompt},
        ]},
    ]
    # Render the chat template, then tokenize text and image together.
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt")
    inputs = inputs.to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Strip the prompt tokens so only the newly generated text is decoded.
    generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
    output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    return output_text[0]


image_path = "/path/to/your/document.jpg"
result = ocr_page_with_nanonets_s(image_path, model, processor, max_new_tokens=15000)
print(result)
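
The function above OCRs one page image at a time. For multi-page PDFs, one option (a sketch, not from the model card; it assumes the pdf2image package and its poppler dependency are installed) is to rasterize each page and reuse ocr_page_with_nanonets_s:

import tempfile

from pdf2image import convert_from_path  # assumption: pip install pdf2image


def ocr_pdf_with_nanonets_s(pdf_path, model, processor, max_new_tokens=4096, dpi=200):
    # Rasterize each PDF page to a temporary image file, then OCR the pages one at a time.
    markdown_pages = []
    for page in convert_from_path(pdf_path, dpi=dpi):
        with tempfile.NamedTemporaryFile(suffix=".png") as tmp:
            page.save(tmp.name, "PNG")
            markdown_pages.append(
                ocr_page_with_nanonets_s(tmp.name, model, processor, max_new_tokens)
            )
    return "\n\n".join(markdown_pages)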

Using vLLM

Start an OpenAI-compatible server (vLLM listens on port 8000 by default, matching the base_url below):

vllm serve nanonets/Nanonets-OCR-s
from openai import OpenAI
import base64

# Point the OpenAI client at the local vLLM server started above.
client = OpenAI(api_key="123", base_url="http://localhost:8000/v1")

model = "nanonets/Nanonets-OCR-s"


def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


def ocr_page_with_nanonets_s(img_base64):
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{img_base64}"},
                    },
                    {
                        "type": "text",
                        "text": "Extract the text from the above document as if you were reading it naturally. Return the tables in html format. Return the equations in LaTeX representation. If there is an image in the document and image caption is not present, add a small description of the image inside the <img></img> tag; otherwise, add the image caption inside <img></img>. Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>. Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number> or <page_number>9/22</page_number>. Prefer using ☐ and ☑ for check boxes.",
                    },
                ],
            }
        ],
        temperature=0.0,
        max_tokens=15000,
    )
    return response.choices[0].message.content


test_img_path = "/path/to/your/document.jpg"
img_base64 = encode_image(test_img_path)
print(ocr_page_with_nanonets_s(img_base64))
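
Because tables come back as HTML, a downstream step can lift them straight into dataframes. A minimal sketch of that post-processing, assuming pandas and lxml are installed (neither is part of the model card):

import io
import re

import pandas as pd  # assumption: pip install pandas lxml


def tables_from_ocr_output(md_text):
    # Find each <table>...</table> block emitted by the model and parse it with pandas.
    html_tables = re.findall(r"<table>.*?</table>", md_text, flags=re.DOTALL)
    return [pd.read_html(io.StringIO(t))[0] for t in html_tables]


md = ocr_page_with_nanonets_s(img_base64)
for df in tables_from_ocr_output(md):
    print(df.head())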

Using docext

pip install docext
python -m docext.app.app --model_name hosted_vllm/nanonets/Nanonets-OCR-s

Check out the GitHub repository for more details.

BibTeX

@misc{Nanonets-OCR-S,
  title={Nanonets-OCR-S: A model for transforming documents into structured markdown with intelligent content recognition and semantic tagging},
  author={Souvik Mandal and Ashish Talewar and Paras Ahuja and Prathamesh Juvatkar},
  year={2025},
}

Comments

  • By PixelPanda 2025-06-16 6:14 (7 replies)

    Full disclaimer: I work at Nanonets

    Excited to share Nanonets-OCR-s, a powerful and lightweight (3B) VLM model that converts documents into clean, structured Markdown. This model is trained to understand document structure and content context (like tables, equations, images, plots, watermarks, checkboxes, etc.). Key Features:

    LaTeX Equation Recognition Converts inline and block-level math into properly formatted LaTeX, distinguishing between $...$ and $$...$$.

    Image Descriptions for LLMs Describes embedded images using structured <img> tags. Handles logos, charts, plots, and so on.

    Signature Detection & Isolation Finds and tags signatures in scanned documents, outputting them in <signature> blocks.

    Watermark Extraction Extracts watermark text and stores it within <watermark> tag for traceability.

    Smart Checkbox & Radio Button Handling Converts checkboxes to Unicode symbols like ☐, ☑, and ☒ for reliable parsing in downstream apps.

    Complex Table Extraction Handles multi-row/column tables, preserving structure and outputting both Markdown and HTML formats.

    Huggingface / GitHub / Try it out: https://huggingface.co/nanonets/Nanonets-OCR-s

    Try it with Docext in Colab: https://github.com/NanoNets/docext/blob/main/PDF2MD_README.m...

    • By RicoElectrico 2025-06-16 19:11

      Could it be used (maybe with the help of a downstream LLM) to parse a photo/PDF of a restaurant menu into a JSON file conforming to a schema? Or would bigger, hosted multimodal LLMs work better in such a case?

    • By arkh 2025-06-17 9:17

      So it feels like this finally lets me do something I've wanted for some time: scan printed documents and generate structured PDFs (and not PDFs that are just picture containers).

    • By uselesswords 2025-06-17 14:15

      Have you found that accuracy improves or scales with larger models? Or are the improvements, if any, marginal compared to the 3B VLM?

    • By wisdomseaker 2025-06-17 7:14

      Would any of this be able to handle magazine layouts? I've yet to find anything that can follow their fairly random layouts, with text at varying angles, etc.

    • By gibsonf1 2025-06-16 19:58 (2 replies)

      Does it hallucinate with the LLM being used?

      • By michaelt 2025-06-16 22:47

        Sometimes. I just fed the huggingface demo an image containing some rather improbable details [1] and it OCRed "Page 1000000000000" with one extra trailing zero.

        Honestly I was expecting the opposite - a repetition penalty to kick in having repeated zero too many times, resulting in too few zeros - but apparently not. So you might want to steer clear of this model if your document has a trillion pages.

        Other than that, it did a solid job - I've certainly seen worse attempts to OCR a table.

        [1] https://imgur.com/a/8rJeHf8

      • By nattaylor 2025-06-16 20:34 (1 reply)

        The base model is Qwen2.5-VL-3B and the announcement says a limitation is "Model can suffer from hallucination"

        • By gibsonf1 2025-06-16 22:44 (1 reply)

          Seems a bit scary that the "source" text from the pdfs could actually be hallucinated.

          • By prats226 2025-06-16 23:55

            Given that the input is an image and not raw PDF, it's not completely unexpected.

    • By generalizations 2025-06-16 16:25 (1 reply)

      Does it have a way to extract the images themselves, or is that still a separate process later?

      • By j45 2025-06-16 20:09 (1 reply)

        If you're after extracting images from PDFs, there are plenty of tools that do that just fine without LLMs.

        • By generalizations 2025-06-16 20:15 (1 reply)

          I mean, ideally it would be in context, so the generated markdown references the correct image at the correct location in the doc. Unless that's what you're talking about? In which case I don't know about those tools.

  • By kordlessagain 2025-06-16 16:15 (2 replies)

    I created a PowerShell script to run this locally on any PDF: https://gist.github.com/kordless/652234bf0b32b02e39cef32c71e...

    It does work, but it is very slow on my older GPU (Nvidia 1080 8GB). I would say it's taking at least 5 minutes per page right now, but maybe more.

    Edit: If anyone is interested in trying a PDF to markdown conversion utility built on this that is hosted on Cloud Run (with GPU support), let me know. It should be done in about an hour or so and I will post a link up here when it's done.

    • By kordlessagain 2025-06-16 17:22

      Reporting back on this, here's some sample output from https://www.sidis.net/animate.pdf:

        THE ANIMATE
        AND THE INANIMATE
      
        WILLIAM JAMES SIDIS
      
        <img>A black-and-white illustration of a figure holding a book with the Latin phrase "ARTI et VERITATI" below it.</img>
      
        BOSTON
      
        RICHARD G. BADGER, PUBLISHER
      
        THE GORHAM PRESS
      
        Digitized by Google
      
      I haven't seen ANY errors in what it has done, which is quite impressive.

      Here, it's doing tables of contents (I used a slightly different copy of the PDF than I linked to):

        <table>
          <tr>
            <td>Chapter</td>
            <td>Page</td>
          </tr>
          <tr>
            <td>PREFACE</td>
            <td>3</td>
          </tr>
          <tr>
            <td>I. THE REVERSE UNIVERSE</td>
            <td>9</td>
          </tr>
          <tr>
            <td>II. REVERSIBLE LAWS</td>
            <td>14</td>
          </tr>
      
      Other than the fact it is ridiculously slow, this seems to be quite good at doing what it says it does.

    • By 2pointsomone 2025-06-16 19:11 (1 reply)

      Very very interested!

  • By el_don_almighty 2025-06-16 13:11 (2 replies)

    I have been looking for something that would ingest a decade of old Word and PowerPoint documents and convert them into a standardized format where the individual elements could be repurposed for other formats. This seems like a critical building block for a system that would accomplish this task.

    Now I need a catalog, archive, or historian function that archives and pulls the elements easily. Amazing work!

    • By pxc 2025-06-16 16:47

      Can't you just start with unoconv or pandoc, then maybe use an LLM to clean up after converting to plain text?

    • By toledocavani 2025-06-17 13:57

      Which decade? DOCX and PPTX are just zipped XML; that seems pretty standard to me.
