PDF24 is working with Tesseract-ocr, through a simple script


  • #792
    fredreich
    Participant

    PDF24 can be combined with Tesseract-ocr to deliver a fully searchable PDF file, or to send the recognized text to MS Word as plain text.
    How is this achieved?
    With a simple solution.
    Firstly: "setup.bat" (assuming that you have manually installed Tesseract-ocr in the path shown here):

    setx path "%path%;%programfiles%\Tesseract-ocr" /m
    setx TESSDATA_PREFIX "%programfiles%\Tesseract-ocr" /m
    vcredist_2013x86.exe /s
    rem The Microsoft package above is required for "Tesseract 3.03" to run
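To check that the setup step took effect, something like the following could be run. It is only a sketch: the file name check.bat is illustrative, and it has to be started from a new console window, because setx changes the stored environment and not the environment of the console that is already open.

```batch
@echo off
rem check.bat (hypothetical name): run from a NEW console opened after setup.bat
rem "where" searches PATH and returns a non-zero ERRORLEVEL when nothing is found
where tesseract
if errorlevel 1 echo tesseract.exe was not found on PATH
rem Show the variable that Tesseract reads to locate its language data
echo TESSDATA_PREFIX=%TESSDATA_PREFIX%
rem Print the Tesseract version if the program starts correctly
tesseract --version
```

If the first command prints the full path to tesseract.exe and the last one prints a version, the script in the next step should be able to call tesseract by name.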

    Secondly: "OCR.cmd" (or .bat):

    @set arc=%1
    @echo.
    @echo.
    @echo.
    @color F0
    @choice /c DP /m " PRESS: D = OPEN DOCUMENT IN WORD or P = OPEN AS SEARCHABLE PDF"

    @IF errorlevel 2 goto PDF
    @IF errorlevel 1 goto word

    :PDF
    @COLOR 4F
    @START /b /wait /REALTIME tesseract %arc% doc -l %2 PDF

    @IF %errorlevel% EQU -1073741819 goto error
    @IF %errorlevel% equ 255 goto error

    @del /Q *.tiff

    @start DOC.PDF

    @exit

    :WORD
    @COLOR 1F
    @ECHO OFF
    @MSG.VBS

    @START /b /wait /REALTIME tesseract %arc% Document -l %2

    @IF %errorlevel% EQU -1073741819 goto error
    @IF %errorlevel% equ 255 goto error

    @if exist Document.txt ping -n 04 localhost >nul

    @start winword.exe /t Document.txt

    @del /Q *.tiff

    @exit

    :error
    @color 4F
    @cls
    @prompt $S
    @ERROR.VBS
    @del /Q *.doc
    @del /Q *.tiff
    @del /Q *.txt
    @EXIT
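The menu at the top of OCR.cmd depends on how CHOICE and "IF errorlevel" work together. The stand-alone sketch below shows only that part; the labels and echo texts are illustrative and are not taken from the script above.

```batch
@echo off
rem CHOICE sets ERRORLEVEL to the 1-based position of the key pressed:
rem D = 1, P = 2 for the key list "DP"
choice /c DP /m "D = Word, P = searchable PDF"
rem "if errorlevel N" is true when ERRORLEVEL is N or higher,
rem so the highest value has to be tested first
if errorlevel 2 goto pdf
if errorlevel 1 goto word
goto :eof

:pdf
echo PDF branch chosen
goto :eof

:word
echo Word branch chosen
goto :eof
```

Testing "errorlevel 1" before "errorlevel 2" would send both keys to the Word branch, which is why the script checks 2 first.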

    Where,

    MSG.VBS:
    Const wshYes = 6
    Const wshNo = 7
    Const wshYesNoDialog = 4
    Const wshQuestionMark = 32

    Set objShell = CreateObject("Wscript.Shell")

    intReturn = objShell.Popup("WANT TO OPEN THE RECOGNIZED TEXT IN WORD?", _
    0, "SENDING TO WORD", wshYesNoDialog + wshQuestionMark)

    If intReturn = wshYes Then
    Wscript.Echo "WAIT UNTIL WORD APPEARS ON THE SCREEN. CLICK 'OK' TO CONTINUE."
    ElseIf intReturn = wshNo Then
    Wscript.Echo "SENDING TO WORD CAN NOW BE CANCELLED. CLOSE THE WINDOW BY CLICKING THE 'X'."
    Set wshShell = WScript.CreateObject ("WSCript.shell")
    WshShell.SendKeys "^C"
    End If

    ERROR.VBS:

    msgbox "THE DOCUMENT CONTAINS A COMPLEX LAYOUT. TRY AGAIN USING THE SCREEN CAPTURE OF PDF24 CREATOR, SELECTING AN 'AREA DEFINED BY USER' WHERE ONLY TEXT APPEARS."

    Save all of these files (*.bat, *.vbs) in the same folder as the Tesseract-ocr installation.

    Thirdly:
    In the PDF24 "save profiles" tool:
    Save a new profile for the "TIFF" format with 300 dpi, color (or another setting), and the Multi page option checked. Name it, for example, "OCR".

    Configure "PDF Printer":
    Settings > PDF Printer Tool > "Open created PDF files in creator" = ok, "Open created PDF files in creator if application is opened" = ok

    > Auto Save >
    "Automatically save documents after printed" = ok
    "Output Directory" = %tmp%%Y
    "Filename" = %H, %M, %S, ${fileName}
    "Profile" = OCR
    "Show Progress while saving" = ok
    "Open folder after saving" = No
    "Execute the following command after saving" = ok, with the command: start ocr.cmd "${file}" eng

    Do not use the "Assistant"; leave these options unchecked:
    > "Close assistant after file saving" = empty
    > "Close assistant after mailing the file" = empty
    > "Open PDF file after file saving" = empty

    Finally: "Apply" or "OK"
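To see what PDF24 actually hands over to the script, a small stand-in for OCR.cmd can be used for a first test. This is a sketch under assumptions: the file name test-args.cmd is hypothetical, and it presumes the "Execute the following command after saving" field is temporarily set to start test-args.cmd "${file}" eng.

```batch
@echo off
rem test-args.cmd (hypothetical name): prints the arguments received from PDF24
rem %~1 is the first argument with its surrounding quotes removed (the saved TIFF)
echo File to recognize: %~1
rem %2 is the second argument, used by OCR.cmd as the Tesseract language (-l)
echo Tesseract language: %2
pause
```

Once the printed file path and language look right, the command field can be switched back to ocr.cmd.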

    Result: PDF24 Creator now has an "OCR" feature plugged in!

    Implement, test and see ....

    Good luck !

    PDF24 is the most versatile software I have used so far.
    😉 !

    #1673
    pdf24
    Member

    Thanks for your contribution.
