0 Byte file after OCR

Question

1.16K views 2025-02-04 General

crazz 4 2024-06-25 2 Comments

I am working on a personal project to automate part of my work. I scan files to PDF, OCR them, then extract data and sort them. When I use the online OCR tool, or the downloaded OCR tool with its GUI, the file conversion works great. However, I want to automate the process using the OCR command line interface to minimize manual interaction.

I have followed the directions for downloading local copies of the OCR tessdata files and such. I am using Python for this project. I am currently testing with only 2 files, and after the OCR process via cmd I get 0 byte files. Am I doing something wrong in my command? Is there a file or setting that I am missing? If I do the OCR through the GUI, everything works as intended; only the CLI produces a 0 byte output file.

Here is my cmd call in Python:

os.system(f'pdf24-Ocr.exe -outputFile "{tempPath}" -language eng -dpi 300 -skipFilesWithText -skipPagesWithText -deskew -autoRotatePages "{oldPath}"')

Below is the cmd output:

Attached
Optimizing PDF
================
"C:\Program Files\PDF24\jre\bin\java.exe" -cp "C:\Program Files\PDF24\lib\jar\*" -Dwindows.acp=1252 -XX:MaxRAMPercentage=80 "org.pdf24.OcrPdfOptimizer" "-deskew" "G:\0.DLB\testing\test.pdf" "C:\Users\Zach\AppData\Local\Temp\PDF24\ocr_0_28864312_224013669_opt.pdf"
----------------
OPTIMIZE> PDF24 PDF OCR Optimizer
OPTIMIZE>
OPTIMIZE> 2024.06.24 22:11:49 main: Start
OPTIMIZE> 2024.06.24 22:11:49 main: Images collected: 1
OPTIMIZE> 2024.06.24 22:11:49 main: Processing image: 1/1
OPTIMIZE> 2024.06.24 22:11:49 main: Skip, suffix not supported: tiff
OPTIMIZE> 2024.06.24 22:11:49 main: Done
================
================
"C:\Program Files\PDF24\gs\bin\gswinc.exe" -dBATCH -dNOPAUSE -dSAFER -dALLOWPSTRANSPARENCY "-sFONTPATH=C:\WINDOWS\Fonts" -dNEWPDF=true -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r300 -dFirstPage=1 -dLastPage=1 -sDEVICE=png16m -dDownScaleFactor=1 "-sOutputFile=C:\Users\Zach\AppData\Local\Temp\PDF24\ocr_1_28864750_2964422123.png" "C:\Users\Zach\AppData\Local\Temp\PDF24\ocr_0_28864312_224013669_opt.pdf"
----------------
GPL Ghostscript 10.03.1 (2024-05-02)
Copyright (C) 2024 Artifex Software, Inc. All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Processing pages 1 through 1.
Page 1
================
================
"C:\Program Files\PDF24\tesseract\tesseract.exe" "--tessdata-dir" "C:\Program Files\PDF24\tesseract\tessdata" "C:\Users\Zach\AppData\Local\Temp\PDF24\ocr_1_28864750_2964422123.png" "C:\Users\Zach\AppData\Local\Temp\PDF24\ocr_2_28865328_3823876569" "-l" "eng" "-c" "textonly_pdf=1" "--dpi" "300" "--oem" "3" "--psm" "1" "pdf" "txt"
----------------
================
Auto Rotating PDF Pages
================
"C:\Program Files\PDF24\jre\bin\java.exe" -cp "C:\Program Files\PDF24\lib\jar\*" -Dwindows.acp=1252 -XX:MaxRAMPercentage=80 "org.pdf24.PdfPagesAutoRotator" "C:\Users\Zach\AppData\Local\Temp\PDF24\ocr_3_28866453_624368124_ocred.pdf" "C:\Users\Zach\AppData\Local\Temp\PDF24\ocr_4_28866453_4095855484_rot.pdf"
----------------
AUTOROT> PDF24 PDF Pages Auto Rotator
AUTOROT>
AUTOROT> Jun 24, 2024 10:11:52 PM org.apache.fontbox.ttf.PostScriptTable read
AUTOROT> WARNING: No PostScript name data is provided for the font null
================
Old: G:\0.DLB\testing\test.pdf | 72.506
New: G:\0.DLB\testing\test_ocred.pdf | 0.0
Attached
Optimizing PDF
================
"C:\Program Files\PDF24\jre\bin\java.exe" -cp "C:\Program Files\PDF24\lib\jar\*" -Dwindows.acp=1252 -XX:MaxRAMPercentage=80 "org.pdf24.OcrPdfOptimizer" "-deskew" "G:\0.DLB\testing\testing2.pdf" "C:\Users\Zach\AppData\Local\Temp\PDF24\ocr_0_28868437_3141584949_opt.pdf"
----------------
OPTIMIZE> PDF24 PDF OCR Optimizer
OPTIMIZE>
OPTIMIZE> 2024.06.24 22:11:53 main: Start
OPTIMIZE> 2024.06.24 22:11:54 main: Images collected: 1
OPTIMIZE> 2024.06.24 22:11:54 main: Processing image: 1/1
OPTIMIZE> 2024.06.24 22:11:54 main: Skip, suffix not supported: tiff
OPTIMIZE> 2024.06.24 22:11:54 main: Done
================
================
"C:\Program Files\PDF24\gs\bin\gswinc.exe" -dBATCH -dNOPAUSE -dSAFER -dALLOWPSTRANSPARENCY "-sFONTPATH=C:\WINDOWS\Fonts" -dNEWPDF=true -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r300 -dFirstPage=1 -dLastPage=1 -sDEVICE=png16m -dDownScaleFactor=1 "-sOutputFile=C:\Users\Zach\AppData\Local\Temp\PDF24\ocr_1_28868859_308087591.png" "C:\Users\Zach\AppData\Local\Temp\PDF24\ocr_0_28868437_3141584949_opt.pdf"
----------------
GPL Ghostscript 10.03.1 (2024-05-02)
Copyright (C) 2024 Artifex Software, Inc. All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Processing pages 1 through 1.
Page 1
================
================
"C:\Program Files\PDF24\tesseract\tesseract.exe" "--tessdata-dir" "C:\Program Files\PDF24\tesseract\tessdata" "C:\Users\----\AppData\Local\Temp\PDF24\ocr_1_28868859_308087591.png" "C:\Users\----\AppData\Local\Temp\PDF24\ocr_2_28869406_921088819" "-l" "eng" "-c" "textonly_pdf=1" "--dpi" "300" "--oem" "3" "--psm" "1" "pdf" "txt"
----------------
================
Auto Rotating PDF Pages
================
"C:\Program Files\PDF24\jre\bin\java.exe" -cp "C:\Program Files\PDF24\lib\jar\*" -Dwindows.acp=1252 -XX:MaxRAMPercentage=80 "org.pdf24.PdfPagesAutoRotator" "C:\Users\----\AppData\Local\Temp\PDF24\ocr_3_28872484_2749727237_ocred.pdf" "C:\Users\----\AppData\Local\Temp\PDF24\ocr_4_28872484_2568455164_rot.pdf"
----------------
AUTOROT> PDF24 PDF Pages Auto Rotator
AUTOROT>
AUTOROT> Jun 24, 2024 10:11:58 PM org.apache.fontbox.ttf.PostScriptTable read
AUTOROT> WARNING: No PostScript name data is provided for the font null
================
Old: G:\0.DLB\testing\testing2.pdf | 76.97
New: G:\0.DLB\testing\testing2_ocred.pdf | 0.0

despdf24 commented 2025-02-03

Same problem. Any suggestions?

crazz commented 2025-02-04

I found out that my issue was caused by the CLI not liking the fact that my files were stored on a different drive than the PDF24 install drive. I was able to fix it by adjusting my script to move all files into a temp folder on the same drive as the PDF24 install, OCR them there, then ship them where they needed to go. I am unsure whether you're experiencing the same issue, but I can try to help troubleshoot if you give some more information.
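
For anyone who wants to try the same workaround, here is a minimal sketch of the staging approach described above. It is not the exact script used. The helper names (`build_ocr_command`, `ocr_via_staging`), the `staging_root` parameter, and the assumption that `pdf24-Ocr.exe` can be found on PATH are all illustrative. The flags are the ones from the call in the question. The size check at the end is there only to catch the 0 byte symptom early.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# Assumption: pdf24-Ocr.exe is reachable via PATH; otherwise pass a full path.
OCR_EXE = "pdf24-Ocr.exe"


def build_ocr_command(exe, src, dst):
    """Build the argument list with the same flags as the call in the question."""
    return [
        str(exe),
        "-outputFile", str(dst),
        "-language", "eng",
        "-dpi", "300",
        "-skipFilesWithText",
        "-skipPagesWithText",
        "-deskew",
        "-autoRotatePages",
        str(src),
    ]


def ocr_via_staging(src, final_dst, staging_root, run=subprocess.run, exe=OCR_EXE):
    """Copy src into a temp folder under staging_root, OCR it there,
    then move the result to final_dst.

    staging_root is assumed to be on the same drive as the PDF24 install.
    """
    src, final_dst, staging_root = Path(src), Path(final_dst), Path(staging_root)
    staging_root.mkdir(parents=True, exist_ok=True)

    # The temporary folder (and the staged input copy) is removed on exit.
    with tempfile.TemporaryDirectory(dir=staging_root) as tmp:
        staged_in = Path(tmp) / src.name
        staged_out = Path(tmp) / (src.stem + "_ocred.pdf")
        shutil.copy2(src, staged_in)

        # check=True raises CalledProcessError on a non-zero exit code.
        run(build_ocr_command(exe, staged_in, staged_out), check=True)

        # Guard against the symptom from the question: a missing or 0 byte file.
        if not staged_out.exists() or staged_out.stat().st_size == 0:
            raise RuntimeError(f"OCR produced no usable output for {src}")

        final_dst.parent.mkdir(parents=True, exist_ok=True)
        # shutil.move falls back to copy-and-delete across drives.
        shutil.move(str(staged_out), str(final_dst))
    return final_dst
```

With PDF24 installed on C: and the scans on G: as in the log above, a call might look like `ocr_via_staging(r"G:\0.DLB\testing\test.pdf", r"G:\0.DLB\testing\test_ocred.pdf", r"C:\ocr_staging")`, where `C:\ocr_staging` is a made-up folder name. Passing the arguments as a list instead of one formatted string also avoids quoting problems with paths that contain spaces.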

0 Answers