PDF TO XML

Convert files into accurate, LLM-readable knowledge

Preserve the semantic context of your documents. Convert any PDF to XML for reliable RAG with our Semantic DocPrep service.
WHY SEMANTIC DOCPREP?

Prevent LLM hallucinations

LLMs struggle to understand PDFs that contain images, tables, charts, and graphs. Semantic DocPrep, a document preprocessing service for RAG, turns complex PDFs into clear, structured, machine-readable XML for accurate retrieval in RAG workflows.

Original document: LVMH financial highlights

Vertesia Semantic DocPrep output:

# FINANCIAL HIGHLIGHTS 

## Revenue

(EUR millions)
![img-0.jpeg](img-0.jpeg)

2022
2023

## Profit from recurring operations

(EUR millions)
![img-1.jpeg](img-1.jpeg)

2022
2023

| Change in revenue by business group (EUR millions and percentages) | 2024 | 2023 | 2024/2023 change, published | 2024/2023 change, organic (a) | 2022 |
| :--: | :--: | :--: | :--: | :--: | :--: |
| Wines and Spirits | 5,862 | 6,602 | -11% | -8% | 7,099 |
| Fashion and Leather Goods | 41,060 | 42,169 | -3% | -1% | 38,648 |
| Perfumes and Cosmetics | 8,418 | 8,271 | 2% | 4% | 7,722 |
| Watches and Jewelry | 10,577 | 10,902 | -3% | -2% | 10,581 |
| Selective Retailing | 18,262 | 17,885 | 2% | 6% | 14,852 |
| Other activities and eliminations | 504 | 324 | - | - | 281 |
| Total | 84,683 | 86,153 | -2% | 1% | 79,184 |

(a) On a constant consolidation scope and currency basis. The net impact of exchange rate fluctuations on Group revenue was -2% and the net impact of changes in the scope of consolidation was -1%. The principles used to determine the net impact of exchange rate fluctuations on the revenue of entities reporting in foreign currencies and


HOW IT WORKS

Using Semantic DocPrep 

Semantic DocPrep is secure, scalable, and cloud-based. With a free trial and flexible pricing, it’s easy to convert even the most complex PDF into an LLM-friendly format.
Intrinsic content referencing
Ensure accuracy by never rewriting or altering original text.
Table normalization
Use our API to convert tables into consistent formats that LLMs can easily read.
Structured content extraction
Extract specific content types with full preservation of relationships and context.
Hierarchy preservation
Maintain document hierarchy to preserve original context.
Explicit tagging
Assign IDs to every tag for accurate downstream operations like insertions.
Content filtering
Control what is input into the LLM for faster, more efficient processing.
Stateful reprocessing
Reprocess failed runs with automatic retries or failover to other models.
Extensible Markup Language (XML) output
Make unstructured content easy for LLMs to understand using industry-standard XML.
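
To illustrate why explicit IDs matter for downstream operations like insertions, here is a minimal Python sketch. The XML below is invented for illustration only: the element names (`document`, `section`, `heading`, `paragraph`, `note`) and the `id` attribute are assumptions, not the actual Semantic DocPrep schema. The helper `insert_after` is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML; element and attribute names are assumptions,
# not the real Semantic DocPrep output schema.
SAMPLE = """<document>
  <section id="s1">
    <heading id="h1">Revenue</heading>
    <paragraph id="p1">(EUR millions)</paragraph>
  </section>
</document>"""


def insert_after(root: ET.Element, target_id: str, new_element: ET.Element) -> bool:
    """Insert new_element as the next sibling of the element whose id matches.

    Returns True if the target was found, False otherwise. The root element
    itself has no parent, so it cannot be used as a target.
    """
    for parent in root.iter():
        for index, child in enumerate(list(parent)):
            if child.get("id") == target_id:
                parent.insert(index + 1, new_element)
                return True
    return False


root = ET.fromstring(SAMPLE)
note = ET.Element("note", {"id": "n1"})
note.text = "Reviewed"
insert_after(root, "h1", note)

# The note now sits between the heading and the paragraph.
print([child.get("id") for child in root.find("section")])  # ['h1', 'n1', 'p1']
```

Because every element carries a stable ID, the insertion targets one exact location instead of relying on text matching, which can break when the same phrase appears more than once in a document.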

Enterprise-grade service

Work with a curated set of pre-qualified GenAI models optimized for each element of document preparation. Deploy on AWS, Google Cloud, Azure, or any private cloud infrastructure.
No hardware investment
Eliminate the need to purchase, maintain, or scale specialized hardware.
No model management
Skip the complexity of running and tuning your own GenAI models.
Automated failover
Intelligently switch between models for maximum reliability.

Flexible endpoints for custom content structuring

Access our range of API endpoints to transform and organize your content exactly how you need it – from structured XML and Markdown to custom document formats.

/analyze

Trigger content analysis for an object.

/status

Get the status of a previously requested analysis.

/results

Retrieve the result of an analysis once it is completed.

/xml

Fetch the object's corresponding XML once the analysis is completed.

/tables

Fetch the object's table content once the analysis is completed.

/images

Retrieve information about the images that are embedded in a PDF.

/annotated

Get a rendition of the PDF annotated with block outlines and IDs.

/adapt_tables

Identify relevant tables and map columns to transform the format.

/adapt_tables/:runId

Retrieve the adapted tables when processing is complete.
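
The endpoints above suggest a trigger, poll, and fetch workflow. Here is a minimal Python sketch of that flow. Only the endpoint names `/analyze`, `/status`, and `/results` come from this page. The HTTP methods, the base URL, the bearer-token header, the JSON response format, and the `status` field with its `completed` and `failed` values are all assumptions made for illustration. Check the 'Getting Started' documentation for the real request and response shapes.

```python
import json
import time
import urllib.request
from typing import Callable

# A transport takes (HTTP method, endpoint path) and returns the decoded JSON body.
Transport = Callable[[str, str], dict]


def make_http_transport(base_url: str, api_key: str) -> Transport:
    """Build a transport on urllib.

    base_url is assumed to already identify the object being analyzed, and
    the bearer-token header is an assumption, not a documented requirement.
    """

    def call(method: str, path: str) -> dict:
        request = urllib.request.Request(
            base_url.rstrip("/") + path,
            method=method,
            headers={"Authorization": f"Bearer {api_key}"},
        )
        with urllib.request.urlopen(request) as response:
            return json.load(response)

    return call


def fetch_results_when_ready(
    call: Transport,
    poll_seconds: float = 2.0,
    max_polls: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Trigger an analysis, poll its status, then fetch the results."""
    call("POST", "/analyze")  # assumed method: POST
    for _ in range(max_polls):
        state = call("GET", "/status").get("status")  # assumed field name
        if state == "completed":  # assumed status value
            return call("GET", "/results")
        if state == "failed":  # assumed status value
            raise RuntimeError("analysis failed")
        sleep(poll_seconds)
    raise TimeoutError("analysis did not complete within the polling budget")
```

Passing the transport in as a function keeps the polling logic independent of the HTTP details, so it can be exercised with a stub before any real credentials or URLs are available.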

Pricing that scales with you

Only pay for what you use. With pricing that starts at around 35% less than AWS Textract, you’ll spend less while getting great performance and accurate results.
Pricing example:

A 14-page PDF has 3 pages with multiple images and 2 pages with multiple tables:

$0.001 x 14 pages = $0.014
$0.002 x 3 pages with images = $0.006
$0.003 x 2 pages with tables = $0.006
Total price to process the complete PDF = $0.026
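
The same arithmetic can be sketched in Python to estimate other documents. The three rates are taken from the example above. Reading them as a per-page base rate plus per-page surcharges for image and table pages is an interpretation of that example, not a published price list, and the function name `estimate_price` is hypothetical.

```python
from decimal import Decimal

# Rates taken from the pricing example; their interpretation is an assumption.
BASE_RATE = Decimal("0.001")   # applied to every page
IMAGE_RATE = Decimal("0.002")  # added for each page with multiple images
TABLE_RATE = Decimal("0.003")  # added for each page with multiple tables


def estimate_price(total_pages: int, image_pages: int, table_pages: int) -> Decimal:
    """Estimate the processing price in dollars for one PDF."""
    if min(total_pages, image_pages, table_pages) < 0:
        raise ValueError("page counts must be non-negative")
    if image_pages > total_pages or table_pages > total_pages:
        raise ValueError("image and table pages cannot exceed total pages")
    return (
        BASE_RATE * total_pages
        + IMAGE_RATE * image_pages
        + TABLE_RATE * table_pages
    )


print(estimate_price(14, 3, 2))  # 0.026
```

`Decimal` is used instead of floats so that fractions of a cent add up exactly.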

For more information on converting PDF to XML, check out the 'Getting Started' documentation.