PREVENT LLM HALLUCINATIONS

Semantic DocPrep prepares your content for AI

Preserve the semantic context of your documents for more reliable and accurate results.

 

WHY SEMANTIC DOCPREP?

Prevent LLM hallucinations

LLMs struggle to understand PDFs that contain images, tables, charts, and graphs. Semantic DocPrep, a document preprocessing service for retrieval-augmented generation (RAG), turns complex PDFs into clear, structured, machine-readable XML for accurate retrieval in RAG workflows.

Original document: LVMH-Financial-Highlights

Semantic DocPrep output: Vertesia-Semantic-LVMH-Results

# FINANCIAL HIGHLIGHTS 

## Revenue

(EUR millions)
![img-0.jpeg](img-0.jpeg)

2022
2023

## Profit from recurring operations

(EUR millions)
![img-1.jpeg](img-1.jpeg)

2022
2023

| Change in revenue by business group (EUR millions and percentages) | 2024 | 2023 | 2024/2023 Change: Published | 2024/2023 Change: Organic (a) | 2022 |
| :--: | :--: | :--: | :--: | :--: | :--: |
| Wines and Spirits | 5,862 | 6,602 | -11% | -8% | 7,099 |
| Fashion and Leather Goods | 41,060 | 42,169 | -3% | -1% | 38,648 |
| Perfumes and Cosmetics | 8,418 | 8,271 | 2% | 4% | 7,722 |
| Watches and Jewelry | 10,577 | 10,902 | -3% | -2% | 10,581 |
| Selective Retailing | 18,262 | 17,885 | 2% | 6% | 14,852 |
| Other activities and eliminations | 504 | 324 | - | - | 281 |
| Total | 84,683 | 86,153 | -2% | 1% | 79,184 |

(a) On a constant consolidation scope and currency basis. The net impact of exchange rate fluctuations on Group revenue was -2% and the net impact of changes in the scope of consolidation was -1%. The principles used to determine the net impact of exchange rate fluctuations on the revenue of entities reporting in foreign currencies and


HOW IT WORKS

What is Semantic DocPrep?

Semantic DocPrep is a secure, scalable, cloud-based API service, available in the Vertesia platform. With a free trial and flexible pricing, it’s easy to convert even the most complex PDF into an LLM-friendly format.

Intrinsic content referencing
Ensure accuracy by never rewriting or altering original text.

Table normalization
Use our API to convert tables into consistent formats that LLMs can easily read.

Structured content extraction
Extract specific content types with full preservation of relationships and context.

Hierarchy preservation
Maintain document hierarchy to preserve original context.

Explicit tagging
Assign IDs to every tag for accurate downstream operations like insertions.

Content filtering
Control what is input into the LLM for faster, more efficient processing.

Stateful reprocessing
Reprocess failed runs with automatic retries or failover to other models.

Extensible Markup Language (XML) output
Make unstructured content easy for LLMs to understand using industry-standard XML.
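
To illustrate why hierarchy preservation and explicit tagging matter downstream, here is a minimal Python sketch that turns ID-tagged XML into retrieval chunks. The XML snippet, the tag names (`document`, `section`, `heading`, `paragraph`), and the `id` attribute are hypothetical stand-ins. The actual Semantic DocPrep schema is not described on this page and may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML for illustration only; NOT the actual Semantic DocPrep schema.
SAMPLE_XML = """
<document id="d1">
  <section id="s1">
    <heading id="h1">Financial highlights</heading>
    <section id="s2">
      <heading id="h2">Revenue</heading>
      <paragraph id="p1">Revenue is reported in EUR millions.</paragraph>
    </section>
  </section>
</document>
""".strip()


def build_chunks(xml_text):
    """Return one chunk per paragraph, carrying its ID and its heading path."""
    root = ET.fromstring(xml_text)
    chunks = []

    def walk(node, path):
        # Extend the heading path when this node has a direct heading child.
        heading = node.find("heading")
        if heading is not None and heading.text:
            path = path + [heading.text.strip()]
        for child in node:
            if child.tag == "paragraph":
                chunks.append({
                    "id": child.get("id"),
                    "path": " > ".join(path),
                    "text": (child.text or "").strip(),
                })
            elif child.tag == "section":
                walk(child, path)

    walk(root, [])
    return chunks


def find_by_id(xml_text, element_id):
    """Look up one element by its id attribute, e.g. to anchor an insertion."""
    root = ET.fromstring(xml_text)
    for node in root.iter():
        if node.get("id") == element_id:
            return node
    return None


chunks = build_chunks(SAMPLE_XML)
print(chunks)
```

Each chunk keeps the ID of its source element and the headings above it, so a retrieved passage can be traced back to its exact place in the document.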

Enterprise-grade service

Work with a curated set of pre-qualified GenAI models optimized for each element of document preparation. Deploy on AWS, Google Cloud, Azure, or any private cloud infrastructure.

No hardware investment
Eliminate the need to purchase, maintain, or scale specialized hardware.

No model management
Skip the complexity of running and tuning your own GenAI models.

Automated failover
Intelligently switch between models for maximum reliability.
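
The service handles retries and failover for you. As a rough illustration of the general pattern, and not of Vertesia's implementation, the idea can be sketched in Python as follows. The model names and the `run_model` callable are hypothetical.

```python
def run_with_failover(task, models, run_model, max_attempts_per_model=2):
    """Try each model in order, retrying a model before failing over to the next."""
    errors = []
    for model in models:
        for attempt in range(1, max_attempts_per_model + 1):
            try:
                return model, run_model(model, task)
            except Exception as exc:  # a real client would catch narrower errors
                errors.append((model, attempt, str(exc)))
    raise RuntimeError(f"all models failed: {errors}")


# Fake runner for demonstration: the first (hypothetical) model always fails.
def fake_run_model(model, task):
    if model == "model-a":
        raise TimeoutError("model-a timed out")
    return f"{model} processed {task}"


used_model, result = run_with_failover("page-1", ["model-a", "model-b"], fake_run_model)
print(used_model, result)
```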

API ENDPOINTS

Flexible endpoints for custom content structuring

Access our range of API endpoints to transform and organize your content exactly how you need it – from structured XML and Markdown to custom document formats.

/analyze

Trigger content analysis for an object.

/status

Get the status of a previously requested analysis.

/results

Retrieve the result of an analysis once it is completed.

/xml

Fetch the object's corresponding XML once the analysis is completed.

/tables

Fetch the object's table content once the analysis is completed.

/images

Retrieve information about the images that are embedded in a PDF.

/annotated

Get a rendition of the PDF annotated with block outlines and IDs.

/adapt_tables

Identify relevant tables and map columns to transform the format.

/adapt_tables/:runId

Retrieve the adapted tables when processing is complete.
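
The endpoints above suggest an asynchronous flow: request an analysis, poll its status, then fetch the result. The Python sketch below shows that flow under stated assumptions. The HTTP methods, the `/objects/{id}` path prefix, and the `status` field values are hypothetical, since this page lists only the endpoint names. The `http` argument is a caller-supplied function, replaced here by an in-memory fake so the example runs without network access.

```python
import time


def analyze_and_fetch(object_id, http, poll_interval=2.0, max_polls=30, sleep=time.sleep):
    """Trigger analysis, poll until it completes, then return the result payload.

    `http(method, path)` must return parsed JSON as a dict. The methods, path
    prefix, and status values below are assumptions for illustration only.
    """
    prefix = f"/objects/{object_id}"  # hypothetical path prefix
    http("POST", f"{prefix}/analyze")
    for _ in range(max_polls):
        status = http("GET", f"{prefix}/status").get("status")
        if status == "completed":
            return http("GET", f"{prefix}/results")
        if status == "failed":
            raise RuntimeError(f"analysis failed for {object_id}")
        sleep(poll_interval)
    raise TimeoutError(f"analysis of {object_id} did not finish in time")


# In-memory stand-in for the service, so the sketch is runnable offline.
def make_fake_http(polls_until_done=2):
    state = {"polls": 0}

    def fake_http(method, path):
        if path.endswith("/analyze"):
            return {"status": "accepted"}
        if path.endswith("/status"):
            state["polls"] += 1
            done = state["polls"] >= polls_until_done
            return {"status": "completed" if done else "processing"}
        if path.endswith("/results"):
            return {"blocks": 3}
        raise ValueError(f"unexpected path: {path}")

    return fake_http


result = analyze_and_fetch("doc-123", make_fake_http(), sleep=lambda seconds: None)
print(result)
```

Bounding the number of polls keeps a stuck analysis from blocking the caller indefinitely. Check the API documentation for the real request and response shapes.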

PRICING

Only pay for what you use

Starting at around 35% cheaper than AWS Textract, Semantic DocPrep lets you spend less while getting great performance and accurate results.


PDF Pages

Pricing is per page.

  • 1–1,000 = FREE
  • 1,001–1,000,000 = $0.001
  • 1,000,001+ = $0.0008

Embedded images and tables are priced separately.


Images

Pricing is per page with one or more images.

  • 1–1,000 = FREE
  • 1,001–1,000,000 = $0.002
  • 1,000,001+ = $0.0016

Pricing example:
One page has two images: $0.001 + $0.002 = $0.003


Tables

Pricing is per page with one or more tables.

  • 1–1,000 = FREE
  • 1,001–1,000,000 = $0.003
  • 1,000,001+ = $0.0024

Pricing example:
One page has three tables: $0.001 + $0.003 = $0.004

Pricing example:

A 14-page PDF has 3 pages with multiple images and 2 pages with multiple tables:

$0.001 x 14 = $0.014
$0.002 x 3 = $0.006
$0.003 x 2 = $0.006
Total price to process the complete PDF = $0.026
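
The worked example above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the 1,001 to 1,000,000 tier rates from the price list and ignoring the free tier, as the example itself does. The function name and parameters are illustrative and not part of any Vertesia API.

```python
from decimal import Decimal

# Per-page rates for the 1,001 to 1,000,000 tier, taken from the price list above.
PAGE_RATE = Decimal("0.001")
IMAGE_PAGE_RATE = Decimal("0.002")
TABLE_PAGE_RATE = Decimal("0.003")


def estimate_price(pages, pages_with_images=0, pages_with_tables=0):
    """Estimate the processing price in USD for one PDF.

    Every page is charged the base rate. Pages containing one or more images,
    and pages containing one or more tables, each add their own surcharge.
    """
    if pages_with_images > pages or pages_with_tables > pages:
        raise ValueError("a PDF cannot have more image or table pages than pages")
    return (
        pages * PAGE_RATE
        + pages_with_images * IMAGE_PAGE_RATE
        + pages_with_tables * TABLE_PAGE_RATE
    )


total = estimate_price(pages=14, pages_with_images=3, pages_with_tables=2)
print(total)  # 0.026
```

`Decimal` is used instead of floats so that sub-cent amounts add up exactly.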

For more information on converting PDFs to XML, check out the 'Getting Started' documentation.