PDF TO XML
Convert files into accurate, LLM-readable knowledge
Preserve the semantic context of your documents. Convert any PDF to XML for reliable RAG with our Semantic DocPrep service.
Prevent LLM hallucinations
LLMs struggle to understand PDFs that contain images, tables, charts, and graphs. Semantic DocPrep – document preprocessing for RAG – turns complex PDFs into clear, structured, machine-readable XML for accurate retrieval in RAG workflows.
# FINANCIAL HIGHLIGHTS
## Revenue
(EUR millions)

2022
2023
## Profit from recurring operations
(EUR millions)

2022
2023
| Change in revenue by business group (EUR millions and percentages) | 2024 | 2023 | 2024/2023 change, published | 2024/2023 change, organic (a) | 2022 |
| :--: | :--: | :--: | :--: | :--: | :--: |
| Wines and Spirits | 5,862 | 6,602 | -11% | -8% | 7,099 |
| Fashion and Leather Goods | 41,060 | 42,169 | -3% | -1% | 38,648 |
| Perfumes and Cosmetics | 8,418 | 8,271 | 2% | 4% | 7,722 |
| Watches and Jewelry | 10,577 | 10,902 | -3% | -2% | 10,581 |
| Selective Retailing | 18,262 | 17,885 | 2% | 6% | 14,852 |
| Other activities and eliminations | 504 | 324 | - | - | 281 |
| Total | 84,683 | 86,153 | -2% | 1% | 79,184 |

(a) On a constant consolidation scope and currency basis. The net impact of exchange rate fluctuations on Group revenue was -2% and the net impact of changes in the scope of consolidation was -1%. The principles used to determine the net impact of exchange rate fluctuations on the revenue of entities reporting in foreign currencies and
Using Semantic DocPrep
Semantic DocPrep is secure, scalable, and cloud-based. With a free trial and flexible pricing, it’s easy to convert even the most complex PDF into an LLM-friendly format.
Intrinsic content referencing
Table normalization
Structured content extraction
Hierarchy preservation
Explicit tagging
Content filtering
Stateful reprocessing
Extensible Markup Language (XML) output
Enterprise-grade service
Work with a curated set of pre-qualified GenAI models optimized for each element of document preparation. Deploy on AWS, Google Cloud, Azure, or any private cloud infrastructure.
No hardware investment
No model management
Skip the complexity of running and tuning your own GenAI models.
Automated failover
Intelligently switch between models for maximum reliability.
Flexible endpoints for custom content structuring
Access our range of API endpoints to transform and organize your content exactly how you need it – from structured XML and Markdown to custom document formats.
/analyze
Trigger content analysis for an object.
/status
Get the status of a previously requested analysis.
/results
Retrieve the result of an analysis once it is completed.
/xml
Fetch the object's corresponding XML once the analysis is completed.
/tables
Fetch the object's table content once the analysis is completed.
/images
Retrieve information about the images that are embedded in a PDF.
/annotated
Get a rendition of the PDF annotated with block outlines and IDs.
/adapt_tables
Identify relevant tables and map columns to transform the format.
/adapt_tables/:runId
Retrieve the adapted tables when processing is complete.
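The endpoints above suggest an asynchronous flow: request an analysis, poll for its status, then fetch the result. The following is a minimal sketch of that flow, not the service's documented client. Only the paths `/analyze`, `/status`, and `/results` come from this page; the base URL, the HTTP methods, the `objectId` parameter, the `state` field and its values, and the absence of authentication are all assumptions made for illustration. Check the 'Getting Started' documentation for the real request and response shapes.

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Callable, Optional

# Hypothetical placeholder; replace with the base URL from the official docs.
BASE_URL = "https://api.example.com"


def http_json(method: str, url: str, payload: Optional[dict] = None) -> dict:
    """Send a JSON request with the standard library and decode the JSON reply."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    request = urllib.request.Request(
        url,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def analyze_and_fetch(
    object_id: str,
    send: Callable[..., dict] = http_json,
    poll_seconds: float = 2.0,
    max_polls: int = 30,
) -> dict:
    """Trigger an analysis, poll until it completes, then return the results.

    The parameter name "objectId" and the "state" values are assumed,
    not taken from the service's documentation.
    """
    query = urllib.parse.urlencode({"objectId": object_id})

    # Step 1: trigger content analysis for the object.
    send("POST", f"{BASE_URL}/analyze", {"objectId": object_id})

    # Step 2: poll the status endpoint until the analysis finishes.
    for _ in range(max_polls):
        status = send("GET", f"{BASE_URL}/status?{query}")
        state = status.get("state")
        if state == "completed":
            # Step 3: retrieve the result of the completed analysis.
            return send("GET", f"{BASE_URL}/results?{query}")
        if state == "failed":
            raise RuntimeError(f"Analysis failed for {object_id}")
        time.sleep(poll_seconds)

    raise TimeoutError(f"Analysis of {object_id} did not complete in time")
```

The `send` parameter lets the transport be swapped out, for example to add authentication headers or to run the flow against a stub in tests. The same polling pattern would presumably apply to `/xml`, `/tables`, and `/adapt_tables/:runId`, which are also described as available once processing is complete.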
Pricing that scales with you
Only pay for what you use. With pricing that starts at around 35% below AWS Textract, you’ll spend less while getting great performance and accurate results.
Pricing example:
A 14-page PDF has 3 pages with multiple images and 2 pages with multiple tables:
$0.001 x 14 pages = $0.014
$0.002 x 3 pages with multiple images = $0.006
$0.003 x 2 pages with multiple tables = $0.006
Total price to process the complete PDF = $0.026
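The pricing example above can be reproduced with a few lines of code. This is a small sketch that assumes the three per-page rates shown in the example ($0.001 for every page, plus $0.002 for each page with multiple images and $0.003 for each page with multiple tables); the function name and the rate constants are illustrative, and actual rates may differ from this example.

```python
from decimal import Decimal

# Rates taken from the worked example on this page; treat them as illustrative.
BASE_RATE = Decimal("0.001")   # charged for every page
IMAGE_RATE = Decimal("0.002")  # extra charge per page with multiple images
TABLE_RATE = Decimal("0.003")  # extra charge per page with multiple tables


def estimate_price(total_pages: int, image_pages: int, table_pages: int) -> Decimal:
    """Estimate the processing price in USD for one PDF."""
    if min(total_pages, image_pages, table_pages) < 0:
        raise ValueError("Page counts must not be negative")
    if image_pages > total_pages or table_pages > total_pages:
        raise ValueError("Image and table page counts cannot exceed total pages")
    return (
        BASE_RATE * total_pages
        + IMAGE_RATE * image_pages
        + TABLE_RATE * table_pages
    )


print(estimate_price(14, 3, 2))  # prints 0.026
```

`Decimal` is used instead of `float` so that sub-cent amounts add up exactly.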
For more information on converting PDF to XML, check out the 'Getting Started' documentation.