By Caber Team
16 Jul 2025
Every enterprise today deals with an enormous volume of PDF documents. From financial reports and contracts to regulatory filings and internal memos, PDFs have become ubiquitous thanks to their portability and visual consistency. Yet getting the data those PDFs contain into a format enterprise AI applications can use remains a roadblock 🚧.
Beneath their convenience lies a simple fact: PDFs have no easily parsable structure or metadata describing what each element is, such as which text is the title, a heading, or a page footer, and that makes recognizing these elements painfully difficult. AI-based parsers produce impressive results, but they do so at significant cost and still struggle to deliver reliable, consistent output across different types of documents 📑.
So what if we could get the information those PDFs contain into a usable format without parsing PDFs at all? The key to answering that is actually pretty simple once you consider enterprise workflows and where those PDFs came from in the first place: it's lineage.
As I write this, Hugging Face lists 4,628 AI models for Image-Text-to-Text. Turns out printing PDFs to images and using AI for Optical Character Recognition (OCR) works better than having AI parse the PDF directly.
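For illustration, here is a minimal sketch of the print-to-image step using PyMuPDF; the filename is hypothetical, and the downstream OCR model call is left as a placeholder:

```python
import fitz  # PyMuPDF: pip install pymupdf

doc = fitz.open("report.pdf")       # hypothetical input file
for i, page in enumerate(doc):
    pix = page.get_pixmap(dpi=150)  # rasterize the page to an image
    pix.save(f"page_{i}.png")       # feed these to your OCR / vision model
```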
There are dozens of free and paid online converters (see our list here), Python libraries, and GitHub projects dedicated to parsing PDFs. Adobe has one of its own, and so do all of the leading AI providers.
While these methods can produce impressive results, their limits quickly become evident at scale:
| Processor        | Sec/doc (avg) | Sec/page | Bytes/sec | Success Rate |
|------------------|---------------|----------|-----------|--------------|
| Mistral          | 14.71         | 0.28     | 11,493    | 100.0%       |
| PyMuPDF4LLM      | 15.00         | 0.27     | 11,916    | 100.0%       |
| Docling (no GPU) | 94.90         | 1.74     | 2,415     | 100.0%       |
| Docling (GPU)    | 78.20         | 1.45     | 2,938     | 100.0%       |
These figures highlight an inescapable truth: parsing PDFs with LLMs, however powerful, won't scale to the millions of documents that typical production enterprise use requires.
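For a sense of how numbers like these are gathered, here is a minimal sketch of a per-document timing harness, assuming pymupdf4llm is installed; the `filings` folder name is hypothetical, and the full benchmark code is linked at the end of this post:

```python
import time
from pathlib import Path

import pymupdf4llm  # pip install pymupdf4llm

pdfs = sorted(Path("filings").glob("*.pdf"))      # hypothetical corpus folder
start = time.perf_counter()
for pdf in pdfs:
    markdown = pymupdf4llm.to_markdown(str(pdf))  # parse one PDF to Markdown
elapsed = time.perf_counter() - start
print(f"{elapsed / len(pdfs):.2f} sec/doc over {len(pdfs)} documents")
```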
If you go to the web pages of Apple, Inc. and Tesla, Inc., the sources for the SEC financial filings we used to obtain the results above, you'll see that every 10-K and 10-Q statement is provided in both HTML and PDF formats. In fact, the EDGAR portal used to file these financial statements only accepts submissions in structured formats: either HTML or XBRL.
It's no surprise that the vast majority of enterprise PDFs were never original creations. They're outputs: exports from structured sources such as HTML web pages, Word documents, Excel spreadsheets, or database records. These original formats inherently encode structure, with rich metadata that describes tables, headings, and key semantic content in a clear, parse-friendly way that conversion to PDF discards. Converting the HTML versions of those same SEC filings to Markdown tells a very different story:
| Processor        | Sec/doc (avg) | KB/sec  | Elements/sec | Output Bytes | Success Rate |
|------------------|---------------|---------|--------------|--------------|--------------|
| Markdownify      | 1.06          | 2,627.9 | 33,288       | 405,516      | 100.0%       |
| html-to-markdown | 0.49          | 4,064.8 | 55,248       | 403,604      | 100.0%       |
| html2text        | 0.19          | 8,691.6 | 126,795      | 381,705      | 100.0%       |
| MarkItDown       | 1.12          | 2,418.0 | 30,704       | 385,882      | 100.0%       |
Compare the Bytes/sec and KB/sec columns in the two tables. Parsing the HTML sources directly completes in milliseconds, literally ⚡️1,000 times faster⚡️ than parsing the PDFs. Moreover, since all the structure and metadata already exists in the HTML, the conversion is 100% accurate because it's deterministic.
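Because the conversion is deterministic, reproducing it takes only a few lines. Here is a minimal sketch using html2text, the fastest converter in the table above, with a hypothetical filing saved locally:

```python
import html2text  # pip install html2text

with open("aapl-10k.html", encoding="utf-8") as f:  # hypothetical filing
    html = f.read()

converter = html2text.HTML2Text()
converter.body_width = 0           # keep lines unwrapped
markdown = converter.handle(html)  # same HTML in, same Markdown out, every time
print(markdown[:500])
```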
The profound speed and accuracy advantages of parsing structured documents over PDFs become starkly evident at scale, making lineage knowledge incredibly valuable 💰💵💸.
While it's easy to "just use AI" to solve hard problems, blind brute force is inherently inefficient and unpredictable. As the example above shows, an ounce of insight can unlock a thousand-fold increase in performance: a deep understanding of enterprise workflows, and of how enterprise data is created, shared, and used, reveals better ways to put that data to work with AI.
If you're thinking, "🐂💩! How am I going to find the HTML, XLS, DOCX, or whatever source the PDFs were created from?", we're with you...or at least we were, before that question drove us to a solution.
Enterprises inherently generate massive amounts of duplicate data. Both the PDF and the HTML it was created from are stored, as are the emails that carried one or the other as an attachment. Formatting differences rule out straight-up comparison of every PDF against every DOCX file, but when we embed files into Retrieval-Augmented Generation (RAG) stores, we already break them down into sentences, paragraphs, tables, and other chunks of data.
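Here is a minimal sketch of that chunk-matching idea: hash whitespace-normalized paragraph chunks, then measure overlap between two documents. The blank-line chunking and Jaccard score are simplifying assumptions, not a description of our production pipeline:

```python
import hashlib

def chunk_hashes(text: str) -> set[str]:
    """Hash whitespace-normalized paragraph chunks of a document."""
    chunks = (c.strip() for c in text.split("\n\n"))
    return {
        hashlib.sha256(" ".join(c.split()).encode()).hexdigest()
        for c in chunks
        if c
    }

def overlap(a: str, b: str) -> float:
    """Jaccard similarity between two documents' chunk sets."""
    ha, hb = chunk_hashes(a), chunk_hashes(b)
    return len(ha & hb) / max(len(ha | hb), 1)

# One of three distinct chunks is shared, so this prints ~0.33;
# a PDF and the HTML it was exported from should score near 1.0.
print(overlap("Same paragraph.\n\nOnly in A.", "Same paragraph.\n\nOnly in B."))
```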
These chunks of data form natural, deterministic connections between documents, database tables, applications, AI agents, users, and even the APIs that move the data. As in the neural networks fundamental to today's AI, these connections matter as much as the data itself: they tell us what the data is, what business context applies to it, and how best to use it.
Like graph-based RAG, or GraphRAG, knowledge graphs are key to putting these ideas into practice. But unlike GraphRAG, we need to look beyond the relationships inside documents and leverage the relationships between them.
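As a sketch of the difference, assuming networkx is installed and reusing the hypothetical chunk_hashes() helper above, documents and their chunks can live in one graph, so documents that share many chunk nodes are linked to their likely sources:

```python
import networkx as nx  # pip install networkx

# Hypothetical extracted text for two files suspected to share a source.
docs = {
    "report.pdf":  "Q1 revenue rose 12%.\n\nOperating margin was 30%.",
    "report.html": "Q1 revenue rose 12%.\n\nHeadcount grew by 5%.",
}

G = nx.Graph()
for name, text in docs.items():
    for h in chunk_hashes(text):  # helper from the sketch above
        G.add_edge(name, h)       # document <-> chunk edges capture lineage

shared = set(G["report.pdf"]) & set(G["report.html"])
print(f"{len(shared)} shared chunk(s) link these two documents")
```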
The limitations of brute-force PDF parsing with LLMs illustrate an essential point: true scalability and efficiency come from understanding and leveraging data relationships and lineage. The difficulty of parsing PDFs underscores a broader enterprise truth: effective data utilization isn't merely about computational power, but about intelligently understanding and managing the underlying data.
If you would like to dive into the results presented in this post, the data and code are publicly available on Caber's GitHub page at the link below: