OpenDataLoader-PDF: An open source tool for structured PDF parsing

2025-09-23 13:58 · github.com


Safe, Open, High-Performance — PDF for AI

OpenDataLoader-PDF converts PDFs into JSON, Markdown, or HTML, ready to feed into modern AI stacks (LLMs, vector search, and RAG).

It reconstructs document layout (headings, lists, tables, and reading order) so the content is easier to chunk, index, and query. Powered by fast, heuristic, rule-based inference, it runs entirely on your local machine and delivers high-throughput processing for large document sets. AI-safety is enabled by default and automatically filters likely prompt-injection content embedded in PDFs to reduce downstream risk.


  • 🧾 Rich, Structured Output: JSON, Markdown, or HTML
  • 🧩 Layout Reconstruction: Headings, Lists, Tables, Images, Reading Order
  • Fast & Lightweight: Rule-Based Heuristic, High-Throughput, No GPU
  • 🔒 Local-First Privacy: Runs fully on your machine
  • 🛡️ AI-Safety: Automatically filters likely prompt-injection content
  • 🖍️ Annotated PDF Visualization: See detected structures overlaid on the original



  • 🖨️ OCR for scanned PDFs — Extract data from image-only pages
  • 🧠 Table AI option — Higher accuracy for tables with borderless or merged cells
  • Performance Benchmarks — Transparent evaluations with open datasets and metrics, reported regularly
  • 🛡️ AI Red Teaming — Transparent adversarial benchmarks with datasets and metrics, reported regularly

  • Java 11 or higher must be installed and available in your system's PATH.
  • Python 3.9+

pip install -U opendataloader-pdf
  • input_path can be either the path to a single document or the path to a folder.
  • If you don’t specify an output_folder, the output data will be saved in the same directory as the input document.
import opendataloader_pdf

opendataloader_pdf.run(
    input_path="path/to/document.pdf",
    output_folder="path/to/output",
    generate_markdown=True,
    generate_html=True,
    generate_annotated_pdf=True,
)

The main function to process PDFs.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| input_path | str | ✅ Yes | | Path to the input PDF file or folder. |
| output_folder | str | No | input folder | Path to the output folder. |
| password | str | No | None | Password for the PDF file. |
| replace_invalid_chars | str | No | " " | Character to replace invalid or unrecognized characters (e.g., �, \u0000). |
| content_safety_off | str | No | None | Disables one or more content safety filters. Accepts a comma-separated list of filter names. Accepted values: all, hidden-text, off-page, tiny, hidden-ocg. |
| generate_markdown | bool | No | False | If True, generates a Markdown output file. |
| generate_html | bool | No | False | If True, generates an HTML output file. |
| generate_annotated_pdf | bool | No | False | If True, generates an annotated PDF output file. |
| keep_line_breaks | bool | No | False | If True, keeps line breaks in the output. |
| html_in_markdown | bool | No | False | If True, uses HTML in the Markdown output. |
| add_image_to_markdown | bool | No | False | If True, adds images to the Markdown output. |
| debug | bool | No | False | If True, prints CLI messages to the console during execution. |
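
The parameters above can be assembled programmatically before calling `run`. The following is a minimal sketch, assuming `run` accepts the keyword arguments exactly as the table names them; `build_run_kwargs` and `convert` are hypothetical helper names, not part of the package.

```python
from typing import Iterable, Optional

# Filter names listed for content_safety_off in the parameter table.
KNOWN_FILTERS = {"all", "hidden-text", "off-page", "tiny", "hidden-ocg"}


def build_run_kwargs(
    input_path: str,
    output_folder: Optional[str] = None,
    disabled_filters: Iterable[str] = (),
    markdown: bool = False,
) -> dict:
    """Build keyword arguments for opendataloader_pdf.run (hypothetical helper)."""
    filters = list(disabled_filters)
    unknown = [name for name in filters if name not in KNOWN_FILTERS]
    if unknown:
        raise ValueError(f"unknown content safety filters: {unknown}")

    kwargs = {"input_path": input_path, "generate_markdown": markdown}
    if output_folder is not None:
        kwargs["output_folder"] = output_folder
    if filters:
        # The table documents a comma-separated string, not a list.
        kwargs["content_safety_off"] = ",".join(filters)
    return kwargs


def convert(**kwargs) -> None:
    """Call the wrapper; imported lazily so the helper above has no dependency."""
    import opendataloader_pdf  # needs Java 11+ on PATH, per the prerequisites

    opendataloader_pdf.run(**kwargs)
```

A call might then look like `convert(**build_run_kwargs("path/to/document.pdf", disabled_filters=["tiny"], markdown=True))`. Validating filter names up front is a design choice of this sketch; the source does not say how the wrapper itself reacts to an unknown name.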

Note: This package is a wrapper around a Java CLI and is intended for use in a Node.js backend environment. It cannot be used in a browser-based frontend.

  • Java 11 or higher must be installed and available in your system's PATH.
npm install @opendataloader/pdf
  • inputPath can be either the path to a single document or the path to a folder.
  • If you don’t specify an outputFolder, the output data will be saved in the same directory as the input document.
import { run } from '@opendataloader/pdf';

async function main() {
  try {
    const output = await run('path/to/document.pdf', {
      outputFolder: 'path/to/output',
      generateMarkdown: true,
      generateHtml: true,
      generateAnnotatedPdf: true,
      debug: true,
    });
    console.log('PDF processing complete.', output);
  } catch (error) {
    console.error('Error processing PDF:', error);
  }
}

main();

run(inputPath: string, options?: RunOptions): Promise<string>

The main function to process PDFs.

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| inputPath | string | ✅ Yes | Path to the input PDF file or folder. |
| options | RunOptions | No | Configuration options for the run. |

RunOptions

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| outputFolder | string | undefined | Path to the output folder. If not set, output is saved next to the input. |
| password | string | undefined | Password for the PDF file. |
| replaceInvalidChars | string | " " | Character to replace invalid or unrecognized characters (e.g., �, \u0000). |
| contentSafetyOff | string | undefined | Disables one or more content safety filters. Accepts a comma-separated list of filter names. Accepted values: all, hidden-text, off-page, tiny, hidden-ocg. |
| generateMarkdown | boolean | false | If true, generates a Markdown output file. |
| generateHtml | boolean | false | If true, generates an HTML output file. |
| generateAnnotatedPdf | boolean | false | If true, generates an annotated PDF output file. |
| keepLineBreaks | boolean | false | If true, keeps line breaks in the output. |
| htmlInMarkdown | boolean | false | If true, uses HTML in the Markdown output. |
| addImageToMarkdown | boolean | false | If true, adds images to the Markdown output. |
| debug | boolean | false | If true, prints CLI messages to the console during execution. |

For various example templates, including Gradle and Maven, please refer to https://github.com/opendataloader-project/opendataloader-pdf/tree/main/examples/java.

To include OpenDataLoader PDF in your Maven project, add the dependency below to your pom.xml file.

Check for the latest version on Maven Central.

<project>
    <!-- other configurations... -->
    <dependencies>
        <dependency>
            <groupId>org.opendataloader</groupId>
            <artifactId>opendataloader-pdf-core</artifactId>
            <version>1.0.0</version>
        </dependency>
    </dependencies>
    <repositories>
        <repository>
            <snapshots>
                <enabled>true</enabled>
            </snapshots>
            <id>vera-dev</id>
            <name>Vera development</name>
            <url>https://artifactory.openpreservation.org/artifactory/vera-dev</url>
        </repository>
    </repositories>
    <pluginRepositories>
        <pluginRepository>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
            <id>vera-dev</id>
            <name>Vera development</name>
            <url>https://artifactory.openpreservation.org/artifactory/vera-dev</url>
        </pluginRepository>
    </pluginRepositories>
    <!-- other configurations... -->
</project>

To integrate the layout recognition API into Java code, follow the sample code below.

import org.opendataloader.pdf.api.Config;
import org.opendataloader.pdf.api.OpenDataLoaderPDF;

import java.io.IOException;

public class Sample {

    public static void main(String[] args) {
        Config config = new Config();
        config.setOutputFolder("path/to/output");
        config.setGeneratePDF(true);
        config.setGenerateMarkdown(true);
        config.setGenerateHtml(true);

        try {
            OpenDataLoaderPDF.processFile("path/to/document.pdf", config);
        } catch (Exception exception) {
            //exception during processing
        }
    }
}

The full API documentation is available in the Javadoc.


Download sample PDF

curl -L -o 1901.03003.pdf https://arxiv.org/pdf/1901.03003

Run opendataloader-pdf in Docker container

docker run --rm -v "$PWD":/work ghcr.io/opendataloader-project/opendataloader-pdf-cli:latest /work/1901.03003.pdf --markdown --html --pdf

Build and install using Maven command:

mvn clean install -f java/pom.xml

If the build is successful, the resulting jar file will be created in the path below.

java/opendataloader-pdf-cli/target
java -jar opendataloader-pdf-cli-<VERSION>.jar [options] <INPUT FILE OR FOLDER>

This generates a JSON file with layout recognition results in the specified output folder. Additionally, an annotated PDF with recognized structures, Markdown, and HTML are generated if the options --pdf, --markdown, and --html are specified.

By default, all line breaks and hyphenation characters are removed, and the Markdown output neither includes images nor uses HTML.

The following options adjust this behavior:

  • --keep-line-breaks preserves the original line breaks of the text content in JSON and Markdown output.
  • --content-safety-off disables one or more content safety filters. It accepts a comma-separated list of filter names.
  • --markdown-with-html enables the use of HTML in Markdown, which may improve Markdown preview in processors that support HTML tags.
  • --markdown-with-images enables inclusion of image references in the output Markdown. The images are extracted from the PDF as individual files and stored in a subfolder next to the Markdown output.
  • --replace-invalid-chars replaces invalid or unrecognized characters (e.g., �, \u0000) with the specified character.

Options:
-o,--output-dir <arg>           Specifies the output directory for generated files
--keep-line-breaks              Preserves original line breaks in the extracted text
--content-safety-off <arg>      Disables one or more content safety filters. Accepts a comma-separated list of filter names. Arguments: all, hidden-text, off-page, tiny, hidden-ocg
--markdown-with-html            Sets the data extraction output format to Markdown with rendering complex elements like tables as HTML for better structure
--markdown-with-images          Sets the data extraction output format to Markdown with extracting images from the PDF and includes them as links
--markdown                      Sets the data extraction output format to Markdown
--html                          Sets the data extraction output format to HTML
-p,--password <arg>             Specifies the password for an encrypted PDF
--pdf                           Generates a new PDF file where the extracted layout data is visualized as annotations
--replace-invalid-chars <arg>   Replaces invalid or unrecognized characters (e.g., �, \u0000) with the specified character
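
When the CLI is driven from a script, the options above can be turned into an argument list. This is a minimal sketch that uses only flags from the options table and the `java -jar ... [options] <INPUT FILE OR FOLDER>` form shown earlier; `build_cli_command` is a hypothetical helper, and the jar path is whatever your build produced.

```python
from typing import List, Optional


def build_cli_command(
    jar_path: str,
    input_path: str,
    output_dir: Optional[str] = None,
    markdown: bool = False,
    html: bool = False,
    annotated_pdf: bool = False,
    keep_line_breaks: bool = False,
    password: Optional[str] = None,
) -> List[str]:
    """Assemble an argument list using only flags from the options table."""
    cmd = ["java", "-jar", jar_path]
    if output_dir is not None:
        cmd += ["--output-dir", output_dir]
    if markdown:
        cmd.append("--markdown")
    if html:
        cmd.append("--html")
    if annotated_pdf:
        cmd.append("--pdf")
    if keep_line_breaks:
        cmd.append("--keep-line-breaks")
    if password is not None:
        cmd += ["--password", password]
    # The input file or folder comes last, after all options.
    cmd.append(input_path)
    return cmd
```

The list can be passed to `subprocess.run(cmd, check=True)`. Passing a list rather than a shell string avoids quoting problems with paths that contain spaces.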

Root json node

| Field | Type | Optional | Description |
| --- | --- | --- | --- |
| file name | string | no | Name of processed pdf file |
| number of pages | integer | no | Number of pages in pdf file |
| author | string | no | Author of pdf file |
| title | string | no | Title of pdf file |
| creation date | string | no | Creation date of pdf file |
| modification date | string | no | Modification date of pdf file |
| kids | array | no | Array of detected content elements |

Common fields of content json nodes

| Field | Type | Optional | Description |
| --- | --- | --- | --- |
| id | integer | yes | Unique id of content element |
| level | string | yes | Level of content element |
| type | string | no | Type of content element. Possible types: footer, header, heading, line, table, table row, table cell, paragraph, list, list item, image, line art, caption, text block |
| page number | integer | no | Page number of content element |
| bounding box | array | no | Bounding box of content element |

Specific fields of text content json nodes (caption, heading, paragraph)

| Field | Type | Optional | Description |
| --- | --- | --- | --- |
| font | string | no | Font name of text |
| font size | double | no | Font size of text |
| text color | array | no | Color of text |
| content | string | no | Text value |

Specific fields of table json nodes

| Field | Type | Optional | Description |
| --- | --- | --- | --- |
| number of rows | integer | no | Number of table rows |
| number of columns | integer | no | Number of table columns |
| rows | array | no | Array of table rows |
| previous table id | integer | yes | Id of previous connected table |
| next table id | integer | yes | Id of next connected table |

Specific fields of table row json nodes

| Field | Type | Optional | Description |
| --- | --- | --- | --- |
| row number | integer | no | Number of table row |
| cells | array | no | Array of table cells |

Specific fields of table cell json nodes

| Field | Type | Optional | Description |
| --- | --- | --- | --- |
| row number | integer | no | Row number of table cell |
| column number | integer | no | Column number of table cell |
| row span | integer | no | Row span of table cell |
| column span | integer | no | Column span of table cell |
| kids | array | no | Array of table cell content elements |
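
The table, table row, and table cell fields above suggest a simple way to flatten a table node into rows of text. This is a minimal sketch, assuming the JSON keys are spelled exactly as in the field tables and that cell text lives in the `content` of nodes nested under `kids`; `node_text` and `table_to_grid` are hypothetical helpers, and the sample node is hand-written, not real tool output.

```python
from typing import Any, Dict, List


def node_text(node: Dict[str, Any]) -> str:
    """Join the 'content' of a node and of everything nested under 'kids'."""
    parts = []
    if isinstance(node.get("content"), str):
        parts.append(node["content"])
    for kid in node.get("kids", []):
        text = node_text(kid)
        if text:
            parts.append(text)
    return " ".join(parts)


def table_to_grid(table: Dict[str, Any]) -> List[List[str]]:
    """Turn a table node into rows of cell text, in document order."""
    return [
        [node_text(cell) for cell in row.get("cells", [])]
        for row in table.get("rows", [])
    ]


# Hand-written sample shaped after the field tables (not real output).
sample_table = {
    "type": "table",
    "number of rows": 2,
    "number of columns": 2,
    "rows": [
        {"type": "table row", "row number": 1, "cells": [
            {"type": "table cell", "kids": [{"type": "paragraph", "content": "Date"}]},
            {"type": "table cell", "kids": [{"type": "paragraph", "content": "Amount"}]},
        ]},
        {"type": "table row", "row number": 2, "cells": [
            {"type": "table cell", "kids": [{"type": "paragraph", "content": "2025-01-02"}]},
            {"type": "table cell", "kids": []},
        ]},
    ],
}

grid = table_to_grid(sample_table)
```

The sketch ignores `row span` and `column span`, so merged cells appear once rather than being repeated across the positions they cover.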

Specific fields of heading json nodes

| Field | Type | Optional | Description |
| --- | --- | --- | --- |
| heading level | integer | no | Heading level of heading |
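
Heading nodes make it straightforward to split a document into sections for chunking. This is a minimal sketch, assuming the root node's `kids` are in reading order and that headings carry `type`, `content`, and `heading level` as described in the field tables; `chunk_by_heading` is a hypothetical helper and the sample root is hand-written.

```python
from typing import Any, Dict, List


def chunk_by_heading(root: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Group top-level text nodes under the most recent heading."""
    chunks: List[Dict[str, Any]] = []
    current: Dict[str, Any] = {"heading": None, "level": None, "text": []}

    def flush() -> None:
        # Skip the initial empty chunk when the document starts with a heading.
        if current["heading"] is not None or current["text"]:
            chunks.append(current)

    for node in root.get("kids", []):
        if node.get("type") == "heading":
            flush()
            current = {
                "heading": node.get("content"),
                "level": node.get("heading level"),
                "text": [],
            }
        elif isinstance(node.get("content"), str):
            current["text"].append(node["content"])
    flush()
    return chunks


# Hand-written sample shaped after the field tables (not real output).
sample_root = {
    "file name": "sample.pdf",
    "kids": [
        {"type": "paragraph", "content": "Preface"},
        {"type": "heading", "heading level": 1, "content": "Intro"},
        {"type": "paragraph", "content": "A"},
        {"type": "paragraph", "content": "B"},
        {"type": "heading", "heading level": 1, "content": "Methods"},
    ],
}

chunks = chunk_by_heading(sample_root)
```

Only top-level nodes that have a `content` string are collected here; text nested inside lists, tables, or text blocks would need a recursive walk.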

Specific fields of list json nodes

| Field | Type | Optional | Description |
| --- | --- | --- | --- |
| number of list items | integer | no | Number of list items |
| numbering style | string | no | Numbering style of this list |
| previous list id | integer | yes | Id of previous connected list |
| next list id | integer | yes | Id of next connected list |
| list items | array | no | Array of list item content elements |

Specific fields of list item json nodes

| Field | Type | Optional | Description |
| --- | --- | --- | --- |
| kids | array | no | Array of list item content elements |

Specific fields of header and footer json nodes

| Field | Type | Optional | Description |
| --- | --- | --- | --- |
| kids | array | no | Array of header/footer content elements |

Specific fields of text block json nodes

| Field | Type | Optional | Description |
| --- | --- | --- | --- |
| kids | array | no | Array of text block content elements |
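
Taken together, the node types above form a tree whose children sit in four array fields: `kids`, `rows`, `cells`, and `list items`. The following is a minimal sketch of a generic traversal, assuming those keys appear in the JSON exactly as spelled in the tables; `iter_nodes` and `count_types` are hypothetical helpers and the sample root is hand-written, not real tool output.

```python
from collections import Counter
from typing import Any, Dict, Iterator

# Array fields that the tables describe as holding child nodes.
CHILD_FIELDS = ("kids", "rows", "cells", "list items")


def iter_nodes(node: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield a node and then all of its descendants, depth first."""
    yield node
    for field in CHILD_FIELDS:
        for child in node.get(field, []):
            yield from iter_nodes(child)


def count_types(root: Dict[str, Any]) -> Counter:
    """Count content elements by 'type'; the root itself has no type field."""
    return Counter(n["type"] for n in iter_nodes(root) if "type" in n)


# Hand-written sample shaped after the field tables (not real output).
sample_root = {
    "file name": "sample.pdf",
    "kids": [
        {"type": "heading", "heading level": 1, "content": "Title"},
        {"type": "list", "list items": [
            {"type": "list item", "kids": [{"type": "paragraph", "content": "a"}]},
        ]},
        {"type": "table", "rows": [
            {"type": "table row", "cells": [{"type": "table cell", "kids": []}]},
        ]},
    ],
}

counts = count_types(sample_root)
```

For a real document, the root would come from `json.load` on the generated JSON file. A type histogram like this is a quick sanity check on how a PDF was segmented before building anything on top of it.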

We believe that great software is built together.

Your contributions are vital to the success of this project.

Please read CONTRIBUTING.md for details on how to contribute.

Have questions or need a little help? We're here for you!🤗

We love our brand and want to protect it!

This project may contain trademarks, logos, or brand names for our products and services.

To ensure everyone is on the same page, please remember these simple rules:

  • Authorized Use: You're welcome to use our logos and trademarks, but you must follow our official brand guidelines.
  • No Confusion: When you use our trademarks in a modified version of this project, it should never cause confusion or imply that Hancom officially sponsors or endorses your version.
  • Third-Party Brands: Any use of trademarks or logos from other companies must follow that company’s specific policies.

This project is licensed under the Mozilla Public License 2.0.

For the full license text, see LICENSE.

For information on third-party libraries and components, see:



Comments

  • By emilburzo, 2025-09-23 17:20 (3 replies)

    I just tested it on one of my nemeses: PDF bank statements. They're surprisingly tough to work with if you want to get clean, structured transaction data out of them.

    The JSON extract actually looks pretty good and seems to produce something usable in one shot, which is very good compared to all the other tools I've tried so far, but I still need to check it more in-depth.

    Sharing here in case someone chimes in with "hey, doofus, $magic_project already solves this."

  • By hermitcrab, 2025-09-23 20:20 (1 reply)

    I got excited until I read that it was Java/Python based.

    I'm looking for a library that can extract data tables from PDF and can be called from a C++ program (for https://www.easydatatransform.com). If anyone can suggest something, I'm all ears.

    • By therealpygon, 2025-09-23 23:22 (1 reply)

      What makes Java/Python not able to be called from C++, or did you mean you have other requirements that make the project unsuitable?

      • By hermitcrab, 2025-09-24 18:47

        I can fire up a Java program in a separate process. But it is slow and passing data backwards and forwards is clunky. Much better to be able to do it all in one process.

  • By trevor-e, 2025-09-23 15:58 (3 replies)

    I've been thinking lately that maybe we need a new AI-friendly file format rather than continuing to hack on top of PDF's complicated spec. PDF was designed to have consistent and portable page display rendering, it was not a goal for it to be easily parseable afaik, which is why we have to go through these crazy hoops. If you've ever looked at how text is stored internally in PDF this becomes immediately obvious.

    I've been toying with an idea of a new format that stores text naturally and captures semantics (e.g. to help with table parsing), but also preserves formatting rules so you can still achieve fairly consistent rendering. This format could be easily converted to PDF, although the opposite conversion would have the regular challenges. The main challenge is distribution of course.

    • By s0rce, 2025-09-23 19:56 (1 reply)

      Doesn't Latex do this?

      • By trevor-e, 2025-09-24 16:33 (1 reply)

        Yea I think Latex is capable of much of this but it's also cursed

        • By s0rce, 2025-09-25 18:20

          Don't need to convince me. I typeset my wife's PhD thesis in LaTeX and it looks great but it was so frustrating that after I did mine in Word.

    • By Jaxan, 2025-09-23 16:58 (2 replies)

      Wouldn’t it be better to invest in a human-friendly format first (which also could be AI-friendly).

      • By dotancohen, 2025-09-23 18:50

        If you can convince your bank to make available your bank statement in Markdown, let us know.

        Your transactions are probably already available in CSV.

      • By trevor-e, 2025-09-23 17:17

        Not really sure what you mean by a "human-friendly" file format, can you elaborate? File formats are inherently not friendly to humans, they are a bag of bytes. But that doesn't mean they can't be better consumed by tools which is what I mean by "AI friendly".

    • By kykat, 2025-09-24 8:09

      Sounds like you want XML
