There is a problem with text extraction when using the OCR tool

Question

2.00K views | 2025-01-07 | PDF24 Creator


gilbertovargas 6 2025-01-07 0 Comments

The PDF OCR tool generates a searchable PDF file from a PDF that does not yet have a text layer. When the process is finished, the searchable document and the text contained within the PDF are generated correctly.

The PDF OCR tool also lets us download the text contained in the PDF. However, I have a problem: when I later try to extract the text from the searchable file, the result is incorrect. PDF24's text extraction tools duplicate some of the characters in the text generated in the searchable file. Here is the result of a conversion using PDF24 OCR.

This is the original text as it can be read directly in the document:

Humans (Homo sapiens, meaning 'thinking man'' or 'wise man'') or modern humans (sometimes Homo sapiens) are the most common and widespread species of primate, and the last surviving species of the genus Homo and the broader australopithecine subtribe..

This is the text generated and extracted from the searchable file, using PDF24's text extraction tools:

Humaannss (HHomo sapiienss,, meaning 'tthhiinnkkinng maann'' or 'wwise maann'') or mooddeern humaannss (someettii- mees Homo sapiens sapienss) are the moosstt commmmon and wiidespread species of primaattee,, and the last survivinng species of the genus Homo and the broader australoopithhecine subttrribee..
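To make the differences easier to see, here is a minimal Python sketch that lists the characters present only in the extracted text. It uses the standard `difflib` module. The function name `extra_characters` and the short excerpts are my own choices for illustration, and the script only reports the duplication; it does not explain or fix it.

```python
import difflib


def extra_characters(expected: str, actual: str) -> list[tuple[int, str]]:
    """Return (index in actual, text) for every run that appears only in actual."""
    # autojunk=False keeps difflib from ignoring frequent characters in long inputs.
    matcher = difflib.SequenceMatcher(None, expected, actual, autojunk=False)
    extras = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":
            extras.append((j1, actual[j1:j2]))
    return extras


if __name__ == "__main__":
    # Short excerpts copied from the two versions quoted in this post.
    original = "and the last surviving species of the genus Homo"
    extracted = "and the last survivinng species of the genus Homo"
    for index, text in extra_characters(original, extracted):
        print(f"extra {text!r} at index {index} of the extracted text")
```

The reported index is where `difflib` aligned the extra character, which may be either copy of a doubled letter.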

As you can see, the searchable PDF document generated by PDF24 contains duplicated characters, both letters and punctuation. This becomes a problem when a programmer or user needs to extract the text later. What could be causing this, and how can it be fixed?

Thanks


0 Answers