
Solving the Content Conundrum: Semantic DocPrep for GenAI

Introducing Vertesia's Semantic DocPrep API service, which eliminates LLM hallucinations and delivers accurate, relevant results.


In the generative AI (GenAI) world, everyone is obsessed with prompt engineering and the constant evolution and emergence of new models, but hardly anyone is addressing the real elephant in the room: how do we get large language models (LLMs) to properly understand complex documents?

Properly prepared content is the foundation for getting relevant, accurate results from GenAI. Whatever task or prompt you give an LLM, its response depends on the text (or images) it receives. Documents – most typically PDFs – need to be converted to text in order for an LLM to read and understand the content. And herein lies the real challenge.

The OCR obsession

Starting with Amazon Textract and, more recently, GenAI solutions like Mistral OCR, there has been a lot of focus on OCR over the last couple of years, with each successive model claiming superiority or a new breakthrough in document understanding.

However, OCR isn’t really the issue. Firstly, OCR stands for optical character recognition and is a technology that was first commercialized in the 1970s to convert typed or printed characters into machine-encoded text. Over the last 50 years or so, this technology has been proven across a variety of enterprise applications – this problem is largely solved. Yes, in certain circumstances, Vision Language Models (VLMs) can add something to the equation, but OCR is not the bottleneck.

We have also seen people do some really crazy things, like converting PDFs into images only to use various LLMs to OCR those images back into text. This is like printing out an email only to scan it. Why would you do this? There is also the issue of hallucinations: anytime you ask a model to generate output, there is a risk of hallucination. We’ve frequently observed models try to “correct” text, often erroneously – in one case, we observed an LLM rewrite “Bangladesh” as “Nangladesh.” Depending on your use case, this could be hugely problematic.

So, the challenge isn’t really OCR or converting PDFs to text. There really are lots of ways to do this – all of which will produce reasonably accurate results.

The Content Conundrum

The real challenge is that – when you are working with long-form, complex documents – there’s a lot more to process than just plain text. Many documents contain charts, graphs, and other graphics that provide vital detail and data. Many documents contain tables, some of which are quite long and complex. Some documents contain images with embedded text and may actually need to be OCR processed. All of this potentially vital information is readily apparent and intelligible to a human but, when converted to text, is either lost or becomes unintelligible to an LLM. For example, with charts and graphs, any correlation is lost between the image and data and, in many cases, LLMs fail to completely capture the data. Regardless, even if they do, all context is lost. Similarly, with tables, headers, columns, rows, totals, etc. – all are lost and the table is flattened. As a result, an LLM can no longer correlate columnar data with a header or identify rows of information.

Add to this the fact that some documents are just plain too long for a particular GenAI model’s context window and need to be chunked but, in the process, sometimes critical concepts or, for example, essential contractual terms get separated. So, while an LLM may get all of the text it needs to generate a response, the correlation between various chunks is lost and the model is no longer aware of how different concepts or terms may relate to each other.

To put it simply, there is a lot more to getting content ready for use by an LLM than just converting it to a simple text format.

The Markdown Compromise

Let’s also talk about why certain text formats are less than ideal for LLMs. Take Markdown for example. And don’t get us wrong, we love Markdown for what it is: a simple way to write lightly formatted documents and web content. But it was literally created in reaction to more complex markup languages – hence the name. And it is intentionally very limited.

So, when you take a document like a commercial invoice and represent it in Markdown, here is what you get:

# GLOBAL TECH SOLUTIONS

## (GTS INC)

### COMMERCIAL INVOICE

Global Tech Solutions Inc., Innovation Park, 1250 Silicon Avenue, Suite 300, San Francisco, CA 94107, UNITED STATES

| Purchase Order | Part Number/Description | Qty | UOM | Unit Price (USD) | Total Price (USD) |
| --- | --- | --- | --- | --- | --- |
| GT78923456 | A27X-9800 QUANTUM PROCESSOR MODULE SERIES X | 5.00 | EA | 3,299.9500 | 16,499.75 |
| GT78923457 | B19C-4500 NEURAL NETWORK ACCELERATOR | 8.00 | EA | 1,249.9900 | 9,999.92 |
| GT78923458 | C35T-2200 QUANTUM-RESISTANT ENCRYPTION MODULE | 12.00 | EA | 875.5000 | 10,506.00 |
| GT78923459 | D42R-7600 HOLOGRAPHIC STORAGE ARRAY 1PB | 2.00 | EA | 12,750.0000 | 25,500.00 |
| GT78923460 | E56P-3100 BIOMETRIC AUTHENTICATION SYSTEM | 6.00 | EA | 2,199.5000 | 13,197.00 |

Yes. It’s readable for humans and, if you squint, you can even sort of see the table structure. But there’s no page information, no element ids (for precise reference), and no metadata about the table structure. There are also no semantic relationships between the different textual elements.

It’s all lost in Markdown. And, while this may work for a simple invoice like this, it also may not. What happens if your invoice has 50 line items? Or 300? Just like a human, an LLM will have a hard time understanding this structure and interpreting the vertical bars correctly. And the longer the table, the more likely it is that the LLM will begin to return erroneous responses. In short, the LLM will get the content but lose the context. And that’s without even considering machine usability for integration with other applications.

Why Does All This Matter?

It all comes down to one thing: hallucinations. It’s not my favorite term, because the reality is that if an LLM doesn’t know the answer to something, it doesn’t really hallucinate – it just makes stuff up.

When it comes to working with enterprise businesses, we like to keep this kind of behavior to a minimum, and this begins with giving information – data – to an LLM in a form that it can precisely understand.

When you strip away all the noise and the hype, LLMs really need three things to truly understand a document:

  1. Structure - What are the various elements that make up the document: headings, paragraphs, tables, charts, images, lists, etc.?
  2. Context - How do the different elements relate to each other? What is the hierarchy?
  3. Referenceability - Can we easily point to specific elements: a specific line or cell in a table, a specific paragraph for citation, a specific table or chart, etc.?

That’s it. If you can nail these three things, you will have a GenAI solution that is capable of delivering the accuracy and relevancy your business demands.

So, What’s the Answer?

Talk about burying the lede. What I set out to do today was announce a new service, and we’re two-and-a-half pages deep already. But yes, there is an answer.

Today, I am pleased to announce the launch of our Semantic DocPrep service, which is available to developers via a set of high-performance APIs. This is the same functionality that is available to Vertesia Platform customers and has been proven in high-volume, complex enterprise applications. If you want to dive right in and learn more about this service and our APIs, the documentation is available here.

And, if you want to learn more about why this is a truly revolutionary approach to preparing complex documents for GenAI, keep reading.

Taking a Semantic Approach to Document Preparation

The first thing to know is that we call this Semantic Document Preparation (DocPrep) for a reason: everything begins with a semantic understanding of the document. We first use an LLM to determine the component parts of the document, breaking the content down page by page:

  • Does the page contain images? Are these images critical to the document, e.g., illustrations, charts, graphs? Or are they non-vital information, like logos?
  • If the page contains images, do they embed text that needs to be OCR’d?
  • Should the page be processed as text or just visually (for visual slides or maps, for example)?
  • If the page contains OCR text, how good is its quality? Does it need to be re-OCR’d?
  • Does the page contain tables or other columnar data?
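As a rough illustration of this classification step, here is a minimal sketch in Python. It is purely hypothetical – the `PageInfo` fields, thresholds, and processor names are our own illustrative assumptions, not Vertesia's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class PageInfo:
    """Hypothetical per-page analysis result (illustrative only)."""
    has_images: bool = False
    images_are_vital: bool = False   # charts/graphs vs. mere logos
    images_embed_text: bool = False  # embedded text that needs OCR
    is_visual_only: bool = False     # e.g., visual slides or maps
    ocr_quality: float = 1.0         # 0.0-1.0 confidence of existing OCR text
    has_tables: bool = False         # tables or other columnar data

def route_page(page: PageInfo) -> list[str]:
    """Decide which processors a page should be routed to."""
    processors = []
    if page.is_visual_only:
        return ["vision-llm"]            # process the whole page visually
    if page.has_images and page.images_are_vital:
        processors.append("vision-llm")  # rich image descriptions
        if page.images_embed_text:
            processors.append("ocr")     # OCR the embedded text
    if page.ocr_quality < 0.8:
        processors.append("re-ocr")      # existing OCR text too poor; redo it
    if page.has_tables:
        processors.append("table-extractor")
    processors.append("text")            # plain text is always extracted
    return processors

# Example: a page containing a vital chart and a table
page = PageInfo(has_images=True, images_are_vital=True, has_tables=True)
print(route_page(page))  # ['vision-llm', 'table-extractor', 'text']
```

The real decisions are made by an LLM rather than hard-coded rules, but the routing outcome is the same in spirit: each page is dispatched only to the processors it actually needs.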

We also create element ids and map the location of every element in the document. The reason for this will become obvious in a moment, but another key benefit is that, if you subsequently want to edit a document – for example, if you want to use GenAI to amend and mark up a contract – you need to know where to insert suggested edits and text.

Once we know what each page contains, we then know how to process it. Images get sent off to Vision-enabled LLMs for rich descriptions and to OCR any embedded text. Tables, charts, and graphs are sent off to different models for processing. Every data element is richly tagged to preserve rows, columns, headers, and to enable future queries.

Once all of the component parts have been appropriately processed, we then reassemble everything, using our element ids, back into a cohesive whole and output all of this as XML. I will write a blog post later this month on why we chose XML but, for now, suffice it to say that XML is richly descriptive, fully tagged, and readily understandable by LLMs. It has everything you need to effectively support enterprise use cases with GenAI. We call this XML output – a fully digital representation of your document – a “semantic layer,” and we can store this semantic layer in our platform, or in your storage environment, to support any number of future LLM runs and queries. For all intents and purposes, this semantic layer is your document, just in a form that is eminently intelligible to an LLM.
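To make this concrete, here is a hypothetical sketch of what a semantic layer for the invoice shown earlier might look like. The element names, attributes, and ids below are illustrative assumptions, not our actual schema:

```xml
<document>
  <page number="1">
    <heading id="e1" level="1">GLOBAL TECH SOLUTIONS</heading>
    <heading id="e2" level="3">COMMERCIAL INVOICE</heading>
    <table id="e5" rows="5" cols="6">
      <header>
        <cell col="1">Purchase Order</cell>
        <cell col="3">Qty</cell>
        <cell col="6">Total Price (USD)</cell>
      </header>
      <row id="e5.r1">
        <cell col="1">GT78923456</cell>
        <cell col="3">5.00</cell>
        <cell col="6">16,499.75</cell>
      </row>
      <!-- remaining rows and cells omitted for brevity -->
    </table>
  </page>
</document>
```

Because every element carries an id and a page location, an LLM can cite a specific row or cell, and an application can write an edit back to the exact spot in the document – precisely the structure, context, and referenceability that Markdown throws away.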


Oh, and the most important thing: no rewrites! One of the critical things we do is not allow any of the models we use for this service to rewrite text. This eliminates the opportunity for hallucinations and ensures that the XML you get back contains only the data that was originally in the document.
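To illustrate the idea, here is a minimal sketch in Python – purely hypothetical, not our actual guardrail – of how one could flag rewritten words by checking extracted output against the source text:

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercased word tokens, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def rewritten_words(source_text: str, extracted_text: str) -> set[str]:
    """Words in the extracted output that never occur in the source --
    a cheap signal that a model 'corrected' (i.e., rewrote) something."""
    return tokenize(extracted_text) - tokenize(source_text)

source = "Shipment origin: Bangladesh. Port of lading: Chittagong."
extracted = "Shipment origin: Nangladesh. Port of lading: Chittagong."
print(rewritten_words(source, extracted))  # {'nangladesh'}
```

A check like this would catch the “Nangladesh” rewrite mentioned above before it ever reached the output.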

So, there you have it. A fully agentic service, available via a set of easy-to-use APIs, that you can use to fully prep even the most complex documents for any GenAI use case. Revolutionary? We think so and we have also filed for five patents related to our approach.

Why Do All This?

If you’ve read this far, bear with me just a bit more. Why did we do this? Well, the answer to this is simple: our customers’ use cases demanded it.

Oddly enough, we still have customers that are struggling with things like invoice processing and bills of lading. Some of these documents are incredibly complex and span dozens of pages. They are virtually impossible to process with OCR or even Intelligent Document Processing (IDP) – I’m talking recognition rates of ~30%. So we started working with some of these customers on using GenAI to more accurately process these documents, beginning with AI models like Textract. We quickly realized two things:

  1. The models weren’t up to the task and couldn’t, by themselves, produce the results that our customers demanded
  2. Many of these models are prohibitively expensive – so expensive that any ROI from automation quickly disappears

In short, we thought there had to be a better way and, given our long background in working with content, we started to experiment with different document preparations to improve results. And, across hundreds of different use cases and dozens of organizations, we continued to harden and refine our approach.

Which brings us here. A recent survey we conducted shows that organizations building GenAI apps and agents typically spend about 50% of their time on data preparation. Literally half their time for each app or agent goes into getting the data right. So we saw the need for a new service, one that would greatly reduce the time and effort required to get GenAI projects to production.

Ready to Get Started?

All of this sounds great, right? But the proof is in the doing, as they say. So we are offering free trials to get you up and running and to demonstrate how good this service really is.

And don’t worry, we’ve also priced our service very aggressively and transparently – it’s right there on our website.

NEXT STEPS

Our Semantic DocPrep service is designed for developers looking to eliminate LLM hallucinations.

Sign up today and get started for free.
