By Caber Team
16 Jul 2025
Every enterprise today deals with an enormous volume of PDF documents. From financial reports and contracts to regulatory filings and internal memos, PDFs have become ubiquitous thanks to their portability and visual consistency. Yet getting the data those PDFs contain into a format enterprise AI applications can use remains a roadblock 🚧.
Beneath their convenience lies a simple fact: PDFs have no easily parsable structure or metadata describing what each element is, such as which text is the title, a heading, or a page footer, and that makes recognizing these elements painfully difficult. AI-based parsers produce impressive results, but they do so at significant cost and still struggle to deliver reliable, consistent output across different types of documents 📑.
So what if we could get the information those PDFs contain into a usable format without parsing PDFs at all? The key to answering that is actually pretty simple once you consider enterprise workflows and where those PDFs came from in the first place: it's lineage.
As I write this, Hugging Face lists 4,628 AI models for Image-Text-to-Text. Turns out printing PDFs to images and using AI for Optical Character Recognition (OCR) works better than having AI parse the PDF directly.
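For illustration, here is a minimal sketch of the print-to-image step using PyMuPDF; the filename is hypothetical, and the downstream OCR model call is left as a placeholder:

```python
import fitz  # PyMuPDF: pip install pymupdf

doc = fitz.open("report.pdf")       # hypothetical input file
for i, page in enumerate(doc):
    pix = page.get_pixmap(dpi=150)  # rasterize the page to an image
    pix.save(f"page_{i}.png")       # feed these to your OCR / vision model
```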
There are dozens of free and paid online converters (see our list here), Python libraries, and GitHub projects dedicated to parsing PDFs. Adobe has one of its own, and so do all of the leading AI providers.
While these methods can produce impressive results, their limits quickly become evident at scale:
| Processor        | Sec/doc (avg) | Sec/page | Bytes/sec | Success Rate |
|------------------|---------------|----------|-----------|--------------|
| Mistral          | 14.71         | 0.28     | 11,493    | 100.0%       |
| PyMuPDF4LLM      | 15.00         | 0.27     | 11,916    | 100.0%       |
| Docling (no GPU) | 94.90         | 1.74     | 2,415     | 100.0%       |
| Docling (GPU)    | 78.20         | 1.45     | 2,938     | 100.0%       |
These figures highlight an inescapable truth: parsing PDFs with LLMs, however powerful, won't scale to the millions of documents that typical production enterprise use requires.
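For a sense of how numbers like these are gathered, here is a minimal sketch of a per-document timing harness, assuming pymupdf4llm is installed; the `filings` folder name is hypothetical, and the full benchmark code is linked at the end of this post:

```python
import time
from pathlib import Path

import pymupdf4llm  # pip install pymupdf4llm

pdfs = sorted(Path("filings").glob("*.pdf"))      # hypothetical corpus folder
start = time.perf_counter()
for pdf in pdfs:
    markdown = pymupdf4llm.to_markdown(str(pdf))  # parse one PDF to Markdown
elapsed = time.perf_counter() - start
print(f"{elapsed / len(pdfs):.2f} sec/doc over {len(pdfs)} documents")
```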
If you go to the web pages of Apple, Inc. and Tesla, Inc., the sources for the SEC financial filings we used to obtain the results above, you'll see that every 10-K and 10-Q statement is provided in both HTML and PDF formats. In fact, the EDGAR portal used to file these financial statements only accepts submissions in structured formats: either HTML or XBRL.
It's no surprise that the vast majority of enterprise PDFs were never original creations. They're outputs: exports from structured sources such as HTML web pages, Word documents, Excel spreadsheets, or database records. These original formats inherently encode structure, with rich metadata that describes tables, headings, and key semantic content in a clear, parse-friendly way that conversion to PDF discards. Converting the HTML versions of those same SEC filings to Markdown tells a very different story:
| Processor        | Sec/doc (avg) | KB/sec  | Elements/sec | Output Bytes | Success Rate |
|------------------|---------------|---------|--------------|--------------|--------------|
| Markdownify      | 1.06          | 2,627.9 | 33,288       | 405,516      | 100.0%       |
| html-to-markdown | 0.49          | 4,064.8 | 55,248       | 403,604      | 100.0%       |
| html2text        | 0.19          | 8,691.6 | 126,795      | 381,705      | 100.0%       |
| MarkItDown       | 1.12          | 2,418.0 | 30,704       | 385,882      | 100.0%       |
Compare the Bytes/sec and KB/sec columns in the two tables. Parsing the HTML sources directly completes in milliseconds, literally ⚡️1,000 times faster⚡️ than parsing the PDFs. Moreover, since all the structure and metadata already exists in the HTML, the conversion is 100% accurate because it's deterministic.
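Because the conversion is deterministic, reproducing it takes only a few lines. Here is a minimal sketch using html2text, the fastest converter in the table above, with a hypothetical filing saved locally:

```python
import html2text  # pip install html2text

with open("aapl-10k.html", encoding="utf-8") as f:  # hypothetical filing
    html = f.read()

converter = html2text.HTML2Text()
converter.body_width = 0           # keep lines unwrapped
markdown = converter.handle(html)  # same HTML in, same Markdown out, every time
print(markdown[:500])
```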
The profound speed and accuracy advantages of parsing structured documents over PDFs become starkly evident at scale, making lineage knowledge incredibly valuable 💰💵💸.
While it's easy to "just use AI" to solve hard problems, blind brute force is inherently inefficient and unpredictable. As the example above shows, an ounce of insight can unlock a thousand-fold increase in performance: a deep understanding of enterprise workflows, and of how enterprise data is created, shared, and used, reveals better ways to put that data to work with AI.
If you're thinking, "🐂💩! How am I going to find the HTML, XLS, DOCX, or whatever source the PDFs were created from?", we're with you...or at least we were, before that question drove us to a solution.
Enterprises inherently generate massive amounts of duplicate data. Both the PDF and the HTML it was created from are stored, as are the emails that carried one or the other as an attachment. Formatting differences rule out straight-up comparison of every PDF against every DOCX file, but when we embed files into Retrieval-Augmented Generation (RAG) stores, we already break them down into sentences, paragraphs, tables, and other chunks of data.
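Here is a minimal sketch of that chunk-matching idea: hash whitespace-normalized paragraph chunks, then measure overlap between two documents. The blank-line chunking and Jaccard score are simplifying assumptions, not a description of our production pipeline:

```python
import hashlib

def chunk_hashes(text: str) -> set[str]:
    """Hash whitespace-normalized paragraph chunks of a document."""
    chunks = (c.strip() for c in text.split("\n\n"))
    return {
        hashlib.sha256(" ".join(c.split()).encode()).hexdigest()
        for c in chunks
        if c
    }

def overlap(a: str, b: str) -> float:
    """Jaccard similarity between two documents' chunk sets."""
    ha, hb = chunk_hashes(a), chunk_hashes(b)
    return len(ha & hb) / max(len(ha | hb), 1)

# One of three distinct chunks is shared, so this prints ~0.33;
# a PDF and the HTML it was exported from should score near 1.0.
print(overlap("Same paragraph.\n\nOnly in A.", "Same paragraph.\n\nOnly in B."))
```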
These chunks of data form natural, deterministic connections between documents, database tables, applications, AI agents, users, and even the APIs that move the data. As in the neural networks fundamental to today's AI, these connections matter as much as the data itself: they tell us what the data is, what business context applies to it, and how best to use it.
Like graph-based RAG, or GraphRAG, knowledge graphs are key to putting these ideas into practice. But unlike GraphRAG, we need to look beyond the relationships inside documents and leverage the relationships between them.
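As a sketch of the difference, assuming networkx is installed and reusing the hypothetical chunk_hashes() helper above, documents and their chunks can live in one graph, so documents that share many chunk nodes are linked to their likely sources:

```python
import networkx as nx  # pip install networkx

# Hypothetical extracted text for two files suspected to share a source.
docs = {
    "report.pdf":  "Q1 revenue rose 12%.\n\nOperating margin was 30%.",
    "report.html": "Q1 revenue rose 12%.\n\nHeadcount grew by 5%.",
}

G = nx.Graph()
for name, text in docs.items():
    for h in chunk_hashes(text):  # helper from the sketch above
        G.add_edge(name, h)       # document <-> chunk edges capture lineage

shared = set(G["report.pdf"]) & set(G["report.html"])
print(f"{len(shared)} shared chunk(s) link these two documents")
```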
The limitations of brute-force PDF parsing with LLMs illustrate an essential point: true scalability and efficiency come from understanding and leveraging data relationships and lineage. The difficulty of parsing PDFs underscores a broader enterprise truth: effective data utilization isn't merely about computational power, but about intelligently understanding and managing the underlying data.
If you would like to dive into the results presented in this post, the data and code are publicly available on Caber's GitHub page at the link below: