DOCUMENT PREPARATION FOR GENAI

Semantic DocPrep

Vertesia's agentic API service converts complex documents to XML for Retrieval-Augmented Generation (RAG)

SEMANTIC DOCUMENT PREPARATION

Prevent LLM hallucinations with semantic document preparation

Large language models (LLMs) often struggle to understand PDFs and other complex documents that contain images, tables, charts, graphs, and other difficult-to-process elements.

This is why we developed a revolutionary approach to document preparation that converts complex documents into richly structured XML. It preserves the exact original text while adding semantic structure that makes document content truly machine-readable.

[Figure: original document (LVMH financial highlights) -> Vertesia semantic results]

# FINANCIAL HIGHLIGHTS 

## Revenue

(EUR millions)
![img-0.jpeg](img-0.jpeg)

2022
2023

## Profit from recurring operations

(EUR millions)
![img-1.jpeg](img-1.jpeg)

2022
2023

| Change in revenue by business group (EUR millions and percentages) | 2024 | 2023 | 2024/2023 change (published) | 2024/2023 change (organic) (a) | 2022 |
| :--: | :--: | :--: | :--: | :--: | :--: |
| Wines and Spirits | 5,862 | 6,602 | -11% | -8% | 7,099 |
| Fashion and Leather Goods | 41,060 | 42,169 | -3% | -1% | 38,648 |
| Perfumes and Cosmetics | 8,418 | 8,271 | 2% | 4% | 7,722 |
| Watches and Jewelry | 10,577 | 10,902 | -3% | -2% | 10,581 |
| Selective Retailing | 18,262 | 17,885 | 2% | 6% | 14,852 |
| Other activities and eliminations | 504 | 324 | - | - | 281 |
| Total | 84,683 | 86,153 | -2% | 1% | 79,184 |

(a) On a constant consolidation scope and currency basis. The net impact of exchange rate fluctuations on Group revenue was -2% and the net impact of changes in the scope of consolidation was -1%. The principles used to determine the net impact of exchange rate fluctuations on the revenue of entities reporting in foreign currencies and


AGENTIC PREPROCESSING FOR RAG

Document Preparation as a Service

Available as a scalable and secure cloud service, Vertesia’s patent-pending Semantic DocPrep lets users rapidly convert PDFs and other complex documents to a machine-readable XML format. The agentic service uses several different LLMs to process complex documents efficiently and accurately.

Intrinsic content referencing
Eliminate AI hallucinations by never rewriting or altering the original text
Table normalization
Normalize tables into consistent formats using a dedicated API
Structured content extraction
Perform targeted extraction of specific content types with full preservation of relationships
Hierarchy preservation
Create proper parent-child relationships between document elements
Explicit tagging
Assign explicit IDs to every tag to enable accurate downstream operations like insertions
Content filtering
Give users fine-grained control over what is passed to the LLM, enabling more efficient processing
Stateful reprocessing
Reprocess failed runs with stateful, automatic retries and dynamic failover to alternate models
XML output
Leverage eXtensible Markup Language (XML), a well-proven standard for transporting data that is readily understandable by LLMs

We do not allow models to rewrite or alter the original text of the document.

Our approach eliminates GenAI hallucinations in the converted text, which is particularly valuable in regulated industries, where unintended rewrites can have significant consequences.
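
To illustrate why explicit IDs and preserved hierarchy help downstream operations, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The XML snippet, its tag names (`document`, `section`, `heading`, `paragraph`, `table`), and the `id` values are hypothetical and do not represent the actual Semantic DocPrep schema. The helper functions are illustrative only.

```python
import xml.etree.ElementTree as ET

# Hypothetical output; tag names and ids are illustrative assumptions,
# not the actual Semantic DocPrep schema.
SAMPLE_XML = """
<document id="d1">
  <section id="s1">
    <heading id="h1">Revenue</heading>
    <paragraph id="p1">(EUR millions)</paragraph>
    <table id="t1"><row id="r1"><cell id="c1">Total</cell><cell id="c2">84,683</cell></row></table>
  </section>
</document>
"""

def find_by_id(root, element_id):
    """Return the first element whose id attribute matches, or None."""
    for element in root.iter():
        if element.get("id") == element_id:
            return element
    return None

def insert_after(root, anchor_id, new_element):
    """Insert new_element as the next sibling of the element with anchor_id."""
    for parent in root.iter():
        for index, child in enumerate(list(parent)):
            if child.get("id") == anchor_id:
                parent.insert(index + 1, new_element)
                return True
    return False

def filter_tags(root, keep):
    """Collect the text of elements whose tag is in keep, in document order."""
    return ["".join(e.itertext()).strip() for e in root.iter() if e.tag in keep]

root = ET.fromstring(SAMPLE_XML)

# Insertion: add a note right after paragraph p1 without touching existing text.
note = ET.Element("paragraph", {"id": "p2"})
note.text = "Reviewed by analyst"
insert_after(root, "p1", note)

# Hierarchy: the section still owns its children, in order.
section = find_by_id(root, "s1")
child_ids = [child.get("id") for child in section]

# Filtering: pass only headings and paragraphs to a downstream LLM call.
prose = filter_tags(root, {"heading", "paragraph"})
```

Because every element carries an ID, the insertion targets a precise location and the original text nodes are left unchanged. The same lookup makes it possible to trace any downstream answer back to the block it came from.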

SEMANTIC DOCPREP SERVICE

Extensive API endpoints

The API endpoints expose the service that transforms content into XML.

/analyze

Trigger content analysis for an object

/status

Get the status of a previously requested analysis

 

/results

Retrieve the result of an analysis once it is completed. The response will contain the XML conversion of the object.

/xml

Fetch the object's corresponding XML string once the analysis is completed

/tables

Fetch the object's table content once the analysis is completed

/images

Retrieve information about the images that are embedded in a PDF

 

/annotated

Get a rendition of the PDF annotated with block outlines and IDs

/adapt_tables

Transform tables within a PDF to the format of your choice. The service will identify relevant tables and map columns to the requested format.

/adapt_tables/:runId

Retrieve the adapted tables when processing is complete
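
The `/analyze`, `/status`, and `/results` endpoints suggest an asynchronous workflow: trigger an analysis, poll until it finishes, then fetch the XML. The sketch below shows that loop in Python with an injected transport function, so it runs without network access. The HTTP methods, the `status` and `xml` response fields, the `completed` and `failed` status values, and the way the object ID is passed are all assumptions made for illustration. Consult the 'Getting Started' documentation for the actual request and response formats.

```python
import time
from typing import Callable, Dict

# Injected transport: (http_method, endpoint, object_id) -> parsed JSON dict.
Transport = Callable[[str, str, str], Dict]

def convert_to_xml(send: Transport, object_id: str,
                   poll_seconds: float = 2.0, max_polls: int = 30) -> str:
    """Trigger analysis, poll until done, then return the XML string.

    Assumptions (not from the documentation): HTTP methods, the "status"
    and "xml" response fields, and the "completed"/"failed" status values.
    """
    send("POST", "/analyze", object_id)
    for _ in range(max_polls):
        state = send("GET", "/status", object_id).get("status")
        if state == "completed":
            return send("GET", "/results", object_id)["xml"]
        if state == "failed":
            raise RuntimeError(f"analysis failed for {object_id}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"analysis of {object_id} did not finish in time")

def make_fake_transport():
    """In-memory stand-in for the real service, used only to exercise the loop."""
    calls = []

    def send(method, endpoint, object_id):
        calls.append((method, endpoint))
        if endpoint == "/status":
            # Report "running" on the first status check, "completed" afterwards.
            status_checks = sum(1 for _, e in calls if e == "/status")
            return {"status": "running" if status_checks == 1 else "completed"}
        if endpoint == "/results":
            return {"xml": "<document/>"}
        return {}

    return send, calls

fake_send, recorded = make_fake_transport()
xml_text = convert_to_xml(fake_send, "doc-123", poll_seconds=0)
```

In a real client, `send` would wrap an HTTP library and add authentication. Keeping the transport separate from the polling logic makes the workflow easy to test.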

 

ENTERPRISE-READY

Flexible, enterprise-grade service

Deploy on AWS, Google Cloud, Azure, or any private cloud infrastructure
No hardware investment
Eliminate the need to purchase, maintain, or scale specialized hardware
No model management

Skip the complexity of running and tuning your own GenAI models

Automated failover

The system automatically switches between models for maximum reliability

Our production-ready service offers a curated set of pre-qualified GenAI models optimized for each element of document preparation.
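
As a rough illustration of automated failover with retries, the sketch below tries each model in order, retries a fixed number of times, and moves on to the next model when one keeps failing. It is not Vertesia's implementation: the model names, the callables, and the retry count are hypothetical placeholders.

```python
from typing import Callable, List, Sequence, Tuple

def run_with_failover(task: str,
                      models: Sequence[Tuple[str, Callable[[str], str]]],
                      retries_per_model: int = 2) -> Tuple[str, str, List[str]]:
    """Try each model in order, retrying before failing over to the next.

    Returns (model_name, result, attempt_log). Model names and callables
    are hypothetical placeholders, not Vertesia's actual model set.
    """
    attempt_log: List[str] = []
    for name, call in models:
        for attempt in range(1, retries_per_model + 1):
            try:
                result = call(task)
                attempt_log.append(f"{name}#{attempt}: ok")
                return name, result, attempt_log
            except Exception as error:  # a real service would catch narrower errors
                attempt_log.append(f"{name}#{attempt}: {error}")
    raise RuntimeError("all models failed: " + "; ".join(attempt_log))

def flaky_model(task: str) -> str:
    """Stand-in for a model endpoint that always times out."""
    raise TimeoutError("timeout")

def stable_model(task: str) -> str:
    """Stand-in for a model endpoint that succeeds."""
    return f"<block>{task}</block>"

used, output, log = run_with_failover(
    "page 3", [("primary", flaky_model), ("fallback", stable_model)]
)
```

The attempt log is the stateful part of this sketch: it records which model and attempt produced each outcome, so a failed run can be inspected or resumed.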
AFFORDABLE PRICING

Get started today

Our pricing is designed to grow with you as you scale your use of our cloud service. Pricing starts at around 65% of the price of AWS Textract, so you get cost savings along with greater performance and more accurate, more usable results.
Pricing example:

A 14-page PDF has 3 pages with multiple images and 2 pages with multiple tables:

$0.001 x 14 pages = $0.014
$0.002 x 3 pages with images = $0.006
$0.003 x 2 pages with tables = $0.006
Total price to process the complete PDF = $0.026
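
The same arithmetic can be written as a small Python function for estimating other documents. The per-page rates are inferred from the worked example above and should be treated as assumptions; check the current price list before relying on them. `Decimal` is used so that sums of small amounts stay exact.

```python
from decimal import Decimal

# Per-page rates inferred from the pricing example; treat them as assumptions.
BASE_RATE = Decimal("0.001")    # every page
IMAGE_RATE = Decimal("0.002")   # extra for each page with multiple images
TABLE_RATE = Decimal("0.003")   # extra for each page with multiple tables

def estimate_price(total_pages: int, image_pages: int, table_pages: int) -> Decimal:
    """Return the estimated price in USD for processing one PDF."""
    if min(total_pages, image_pages, table_pages) < 0:
        raise ValueError("page counts must be non-negative")
    if image_pages > total_pages or table_pages > total_pages:
        raise ValueError("image/table page counts cannot exceed total pages")
    return (BASE_RATE * total_pages
            + IMAGE_RATE * image_pages
            + TABLE_RATE * table_pages)

price = estimate_price(total_pages=14, image_pages=3, table_pages=2)
print(f"${price}")  # prints $0.026
```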

For more information, check out the 'Getting Started' documentation.