PDF24 is working with Tesseract-ocr, through a simple script
2014-11-19 at 22:32 #792 fredreich (Participant)
PDF24 can be plugged into Tesseract-ocr to deliver a fully searchable PDF file, or to send the recognized text to MS Word in text-only format.
How is this achievable?
A simple solution.
Firstly: "setup.bat" (assuming that you have manually installed Tesseract-ocr in the path shown here):
setx path "%path%;%programfiles%\Tesseract-ocr" /m
setx TESSDATA_PREFIX "%programfiles%\Tesseract-ocr" /m
vcredist_2013x86.exe /s
Note: this last Microsoft package is required for "Tesseract 3.03" to run.
Secondly: "OCR.cmd" (or .bat):
@set arc=%1
@echo.
@echo.
@echo.
@color F0
@choice /c DP /m " TYPE: D = OPEN DOCUMENT IN WORD or P = OPEN AS SEARCHABLE PDF"
@IF errorlevel 2 goto PDF
@IF errorlevel 1 goto word
@:PDF
@COLOR 4F
@START /b /wait /REALTIME tesseract %arc% doc -l %2 PDF
@IF %errorlevel% EQU -1073741819 goto error
@IF %errorlevel% equ 255 goto error
@del /Q *.tiff
@start DOC.PDF
@exit
@:WORD
@COLOR 1F
@ECHO OFF
@MSG.VBS
@START /b /wait /REALTIME tesseract %arc% Document -l %2
@IF %errorlevel% EQU -1073741819 goto error
@IF %errorlevel% equ 255 goto error
@if exist Document.txt ping -n 04 localhost >nul
@start winword.exe /t Document.txt
@del /Q *.tiff
@exit
@:error
@color 4F
@cls
@prompt $S
@ERROR.VBS
@del /Q *.doc
@del /Q *.tiff
@del /Q *.txt
@EXIT
Where,
MSG.VBS:
Const wshYes = 6
Const wshNo = 7
Const wshYesNoDialog = 4
Const wshQuestionMark = 32
Set objShell = CreateObject("Wscript.Shell")
intReturn = objShell.Popup("WANT TO OPEN THE RECOGNIZED TEXT IN WORD?", _
0, "SENDING TO WORD", wshYesNoDialog + wshQuestionMark)
If intReturn = wshYes Then
Wscript.Echo "WAIT UNTIL WORD APPEARS ON THE SCREEN. CLICK 'OK' TO CONTINUE."
ElseIf intReturn = wshNo Then
Wscript.Echo "SENDING TO WORD CAN NOW BE CANCELLED. CLOSE THE WINDOW BY CLICKING THE 'X'."
Set wshShell = WScript.CreateObject("WScript.Shell")
wshShell.SendKeys "^C"
End If
ERROR.VBS:
MsgBox "THE DOCUMENT CONTAINS A COMPLEX LAYOUT. TRY AGAIN USING THE SCREEN CAPTURE OF PDF24 CREATOR, SELECTING AN 'AREA DEFINED BY USER' WHERE ONLY TEXT APPEARS."
Save all of these files (*.bat, *.cmd, *.vbs) in the Tesseract-ocr installation folder.
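For readers who want to check the logic of OCR.cmd outside a console window, here is a rough Python sketch of the same flow. It is not part of the original scripts: the function names are hypothetical, and it assumes `tesseract` is on the PATH and that the trailing `pdf` argument selects Tesseract's PDF output config, as in the batch file. It also assumes that an exit code reported as an unsigned 32-bit value should be folded back to the signed form the batch file compares against (-1073741819 is 0xC0000005).

```python
import subprocess

# Exit codes that OCR.cmd treats as failures. -1073741819 is the signed
# form of 0xC0000005; 255 is the other code the batch file checks.
FAILURE_CODES = {-1073741819, 255}


def build_command(image, out_base, lang, as_pdf):
    """Return the tesseract argument list that OCR.cmd would run."""
    cmd = ["tesseract", image, out_base, "-l", lang]
    if as_pdf:
        cmd.append("pdf")  # output config name, as in the :PDF branch
    return cmd


def is_failure(returncode):
    """True if the exit code matches one of the codes OCR.cmd checks."""
    if returncode >= 2**31:
        # Assumption: the caller saw the code as unsigned 32-bit.
        returncode -= 2**32
    return returncode in FAILURE_CODES


def run_ocr(image, lang="eng", as_pdf=True):
    """Run tesseract and return the name of the file it should have written."""
    out_base = "doc" if as_pdf else "Document"
    result = subprocess.run(build_command(image, out_base, lang, as_pdf))
    if is_failure(result.returncode):
        raise RuntimeError("tesseract failed with code %d" % result.returncode)
    return out_base + (".pdf" if as_pdf else ".txt")
```

Calling `run_ocr("scan.tiff", "eng")` would correspond to pressing P in the batch menu, and `run_ocr("scan.tiff", "eng", as_pdf=False)` to pressing D, minus the step that opens the result in Word.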
Thirdly:
In the PDF24 "save profiles" tool:
Save a new profile for the "TIFF" format with 300 dpi, color (or something else), and mark the multi-page option. Name it, for example, "OCR".
Configure "PDF Printer":
Settings > PDF Printer Tool:
> "Open created PDF files in creator" = ok
> "Open created PDF files in creator if application is opened" = ok
Settings > PDF Printer Tool > Auto Save:
> "Automatically save documents after printed" = ok
> "Output Directory" = %tmp%%Y
> "Filename" = %H, %M, %S, ${fileName}
> "Profile" = OCR
> "Show Progress while saving" = ok
> "Open folder after saving" = No
> "Execute the following command after saving" = ok, with the command: start ocr.cmd "${file}" eng
Do not use the "Assistant"; leave these options empty:
> "Close assistant after file saving" = empty
> "Close assistant after mailing the file" = empty
> "Open PDF file after file saving" = empty
Finally: "Apply" or "OK"
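The post-save command hands the script two arguments: the saved file (`%1` in OCR.cmd) and the Tesseract language code (`%2`, here `eng`). As a small illustration of that contract, this Python sketch parses the same two arguments. The script name `ocr.py` and the default of `eng` when no language is given are assumptions for the example, not something PDF24 or the original batch file provides.

```python
import sys


def parse_args(argv):
    """Map the post-save command's arguments to (image_path, language)."""
    if len(argv) < 2:
        raise SystemExit("usage: ocr.py <file> [lang]")
    image = argv[1].strip('"')  # drop stray quotes around the path, if any
    lang = argv[2] if len(argv) > 2 else "eng"  # assumed default
    return image, lang


if __name__ == "__main__":
    image, lang = parse_args(sys.argv)
    print("file:", image, "language:", lang)
```

To OCR in another language, only the last word of the post-save command changes, provided the matching traineddata file is installed in Tesseract's tessdata folder.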
Result: PDF24 Creator "plugged" with an "OCR" feature !
Implement, test and see ....
Good luck !
PDF24 is the most versatile software that I have used so far.
😉 !
2014-12-15 at 08:31 #1673 pdf24 (Member)
Thanks for your contribution.