I have a PDF of an academic paper which looks as though it's been scanned, and is not searchable. I wanted to convert it to a Word file wherein the pages are text rather than images. I did so by using PDF24's OCR and then its Convert function. However, although OCR'ing generates a searchable PDF, when I convert that to DOCX, I get a file whose pages are images. How can I get round this?
I'll finish by inserting screenshots of what I did. This is using a copy of PDF24 installed yesterday onto Windows 10.
- The original PDF. As I said, this looks like someone's scan. The first word on the page is "purple", but Find can't recognise it, showing that the file is not searchable.
- The OCR'd PDF. This shows an earlier page. The paper is about Roman textiles, and a search for "Roman" found 30 occurrences, of which the second is shown. This PDF clearly is searchable.
- The DOCX file to which the OCR'd PDF was converted. This shows it open in LibreOffice Writer (I don't have Word). You can see a box around the content. Dragging its top left vertex pulls the content along as if it were an image. Searching for the word "Roman" (and for other words I know are there, such as "wool") finds nothing.
OCR only adds a text layer to the PDF file; the page images themselves are left as they are. If the PDF contains only images, then conversion to Word creates a Word file with those images. OCR does not detect the content of the images and then create a new PDF based on that detected content.
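To confirm what the converter actually produced, the DOCX itself can be inspected, since a DOCX is a ZIP archive whose main body is stored in `word/document.xml`. The following is only a minimal sketch using the Python standard library, not anything PDF24 provides: the function name `summarize_docx` and the file name in the usage note are hypothetical, and the check only counts DrawingML pictures (`a:blip` elements), so images stored in older VML form would be missed.

```python
import sys
import zipfile
import xml.etree.ElementTree as ET

# Standard OOXML namespaces for WordprocessingML text and DrawingML pictures.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"


def summarize_docx(source):
    """Return (word_count, image_count) for a DOCX path or file-like object.

    Hypothetical helper: counts words in <w:t> text runs and the number of
    <a:blip> picture references in the main document part.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    # Join all text runs, then split on whitespace to count words.
    text = " ".join(node.text or "" for node in root.iter(f"{{{W_NS}}}t"))
    words = len(text.split())

    # Each <a:blip> refers to one embedded or linked picture.
    images = sum(1 for _ in root.iter(f"{{{A_NS}}}blip"))
    return words, images


if __name__ == "__main__" and len(sys.argv) > 1:
    word_count, image_count = summarize_docx(sys.argv[1])
    print(f"words: {word_count}, images: {image_count}")
```

Run as, say, `python check_docx.py converted.docx` (both names are placeholders). A result with roughly one image per page and no words would match the behaviour described in the question: the converter placed the page scans in the document and did not carry over the OCR text layer.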
It looks as though this problem isn't only with papers that start off as images. I've asked a new question at https://help.pdf24.org/en/questions/question/why-does-converting-a-text-pdf-to-word-leave-it-as-images/ . There, I tried converting an academic paper downloaded from Academia.edu. It doesn't look scanned, and it is searchable: I can copy text from it and paste that into Notepad. Yet when I convert it to Word, the result is images. I demonstrated this with a screenshot in the new question, and put up a temporary copy of the paper at http://www.filedropper.com/ddwild2002thetextileindustriesofromanbritain . Is it possible that I've accidentally turned on a switch that says "Convert all PDFs to images"?
This is Phil van Kleur, who wrote the question above. I spent over half an hour preparing screenshots to include in the question, only to find when I'd uploaded them all that the system refused to accept the question. It kept saying "There is an error in your image", but would not explain what the error was. By trial and error, I found I had to remove all the images before I could submit. I was expecting I could then go back and remove the text referring to them, but I can't find a control for editing submitted questions. Unlike StackExchange, do you not allow that, or am I missing something?
Anyway, I would really appreciate it if the site would explain why it won't accept my images, instead of just saying there's an error. I prepared them carefully to show exactly what each file looked like, and what were the settings I used.