Creating Searchable Text in PDF


The DTWAIN_AcquireFile and DTWAIN_AcquireFileEx functions allow an application to acquire images from a TWAIN device and store those images as PDF files.  If an OCR engine is available, the PDF file can be created with searchable text.  The OCR engine scans each acquired image for text elements, and DTWAIN adds the recognized text to the PDF as searchable text.  The resulting PDF page contains the image plus any text that the OCR engine recognized.  The searchable text is "invisible": it is part of the PDF page and can be searched, but it is not displayed, and the image itself is not altered.
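
As a rough illustration of what "invisible" text means at the PDF level, the sketch below formats a content-stream fragment that uses text rendering mode 3 (the 3 Tr operator), which the PDF specification defines as text that is neither filled nor stroked.  Whether DTWAIN emits exactly this sequence is an assumption; the function name format_invisible_text and the font resource name /F1 are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Formats a PDF content-stream fragment that places `text` at (x, y) using
   text rendering mode 3 (invisible: neither filled nor stroked).
   The font resource name /F1 is a hypothetical placeholder.
   Returns the number of characters written, or -1 if the text contains
   characters that would need escaping or the buffer is too small. */
int format_invisible_text(char *out, size_t cap,
                          double x, double y, double font_size,
                          const char *text)
{
    int n;

    assert(out != NULL && text != NULL);

    /* Parentheses and backslashes must be escaped inside a PDF string;
       this sketch simply rejects them instead of escaping. */
    if (strpbrk(text, "()\\") != NULL)
        return -1;

    n = snprintf(out, cap, "BT /F1 %.2f Tf 3 Tr %.2f %.2f Td (%s) Tj ET",
                 font_size, x, y, text);
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}
```

A viewer that draws this fragment shows nothing on top of the page image, yet a text search or text extraction still sees the string at the given position.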


Searching for Text and Caveats

 

Starting with version 3.1.0.0 of DTWAIN, the searchable text is placed in the approximate area of the image where the text is found.  This means that when you do a text search in Adobe Acrobat (see the next paragraph for more information on this) or any other program that displays PDF files, the program will "highlight" the appropriate text.  This is by no means a simple feat.  Most of the time, the OCR engine that analyzed the image gives accurate information on the text height, placement, etc., so that DTWAIN has correct information on where to place the text in the PDF page.  However, there will be the odd piece of text that is not positioned correctly (but will still be found).  The accuracy of the OCR engine therefore largely determines how well DTWAIN puts together the searchable text in terms of the actual characters, positional information, font information, etc.


Important for Adobe Acrobat users:

 
It has come to our attention that Adobe Acrobat's search feature does not work correctly in many instances, and so fails to find text that is within the PDF document.  Adobe Acrobat is able to process the text (highlight, extract, etc.), but does not search correctly 100% of the time.  We are attempting to address this issue with Adobe.  By contrast, we have had no issues with the text search feature of other, non-Adobe PDF rendering programs.  One such PDF program is NitroPDF by Nitro PDF Software.  Also, any program that merely searches or extracts PDF text within a document (one that can recognize compressed streams within PDF) will extract the text correctly.  For example, the free GhostView program, which can view PostScript and PDF files, has a text extraction feature that gathers all the text in a PDF document and saves it to a text file.

 

We are sorry for this inconvenience, but from what we have discovered, this is an issue with Acrobat's erratic text search and not with DTWAIN.


Searchable Text Basics

 

By default, when DTWAIN creates a PDF file, the PDF file consists only of image data that cannot be searched for text.  One way to add text is to call DTWAIN_AddPDFText for each page that is generated.  Another way to add text is to let DTWAIN automatically generate the text for a page by using the OCR engine for the text generation.

 

To allow DTWAIN to automatically generate searchable text, two functions must be called.  The first function is DTWAIN_SetPDFOCRMode.  This function tells DTWAIN to generate PDF files with searchable text (if there is an available OCR engine) whenever DTWAIN_AcquireFile or DTWAIN_AcquireFileEx is called, and the file type to acquire is DTWAIN_PDF or DTWAIN_PDFMULTI.

 

The second function is DTWAIN_SetPDFOCRConversion.  This function tells DTWAIN, for each PDF page type (black/white or color), which OCR options to use when creating the searchable text.  For example, if you are acquiring color PDF files and the OCR engine can only process black and white images, DTWAIN will:

 

1. Create a temporary image file
2. Start the OCR engine and scan the temporary image file for text
3. Delete the temporary image
4. Store the text in the PDF page

 

Since DTWAIN always creates temporary image files when creating PDF files, regardless of whether OCR processing is being used, steps 1 and 3 above will be repeated for OCR engines that cannot process DTWAIN's original black/white and color image formats.  When acquiring to PDF files, DTWAIN always generates a temporary uncompressed TIFF (DTWAIN_TIFFNONE) file for black/white pages and a temporary JPEG (DTWAIN_JPEG) file for color pages.
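
The four steps above can be sketched in plain C as follows.  This is only a model of the sequence, not DTWAIN's implementation: OcrFn, fake_ocr, and ocr_page_via_temp_file are hypothetical names, and a character buffer stands in for the text layer of the PDF page.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical OCR callback: reads the image at image_path and writes the
   recognized text into text_out. Returns 1 on success, 0 on failure. */
typedef int (*OcrFn)(const char *image_path, char *text_out, size_t text_cap);

/* Placeholder engine used only to exercise the sketch; it ignores the image
   and always "recognizes" the word hello. */
int fake_ocr(const char *image_path, char *text_out, size_t text_cap)
{
    (void)image_path;
    if (text_cap < sizeof "hello")
        return 0;
    strcpy(text_out, "hello");
    return 1;
}

/* Runs the four steps for one page. page_text stands in for the PDF page's
   searchable text. Returns 1 on success, 0 on any failure. */
int ocr_page_via_temp_file(const char *temp_path,
                           const unsigned char *image, size_t image_len,
                           OcrFn run_ocr, char *page_text, size_t text_cap)
{
    char recognized[256] = "";
    size_t written;
    int ok;
    FILE *f;

    assert(temp_path != NULL && image != NULL);
    assert(run_ocr != NULL && page_text != NULL);

    /* Step 1: create the temporary image file. */
    f = fopen(temp_path, "wb");
    if (f == NULL)
        return 0;
    written = fwrite(image, 1, image_len, f);
    ok = (fclose(f) == 0) && (written == image_len);

    /* Step 2: let the OCR engine scan the temporary file for text. */
    if (ok)
        ok = run_ocr(temp_path, recognized, sizeof recognized);

    /* Step 3: delete the temporary image, whether or not OCR succeeded. */
    remove(temp_path);

    /* Step 4: store the recognized text with the page. */
    if (ok) {
        if (strlen(recognized) >= text_cap)
            return 0;
        strcpy(page_text, recognized);
    }
    return ok;
}
```

When the OCR engine cannot read the temporary file DTWAIN has already written for the page, steps 1 and 3 happen a second time for the converted image, which is the extra work described in the paragraph above.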

 

If your OCR engine supports processing uncompressed TIFF files for black/white images, then it is wise to choose DTWAIN_TIFFNONE as the mode your OCR engine should use when processing black/white PDF pages.  Similarly, if your OCR engine supports JPEG images, select this OCR image type when acquiring color PDF pages.  The reason is that DTWAIN then does not have to create another temporary image file to allow your OCR engine to scan the text correctly, and can just use the temporary file that it always generates when acquiring to PDF files.

 

Otherwise, DTWAIN will convert the black/white or color image to a temporary image type that your OCR engine supports, and then process the page.
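
One way an application might decide which image type to pass to DTWAIN_SetPDFOCRConversion is sketched below.  The enums are hypothetical stand-ins for the DTWAIN_* constants, and the fallback policy (take the first usable format the engine reports) is an assumption made for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; real code would use the constants from dtwain.h. */
typedef enum { PAGE_BW, PAGE_COLOR } PageType;
typedef enum {
    FMT_NONE = -1,
    FMT_TIFFNONE, FMT_TIFFG3, FMT_TIFFG4, FMT_TIFFLZW, FMT_BMP, FMT_JPEG
} ImageFormat;

/* Format of the temporary file DTWAIN already writes for each page type:
   uncompressed TIFF for black/white pages, JPEG for color pages. */
ImageFormat native_temp_format(PageType page)
{
    return page == PAGE_BW ? FMT_TIFFNONE : FMT_JPEG;
}

/* Picks the OCR input format for a page type from the formats the engine
   supports. *needs_extra_file is set to 1 when a second temporary image
   would have to be created, 0 otherwise. Returns FMT_NONE if the engine
   supports no usable format for this page type. */
ImageFormat choose_ocr_format(PageType page,
                              const ImageFormat *supported, size_t count,
                              int *needs_extra_file)
{
    ImageFormat native = native_temp_format(page);
    ImageFormat fallback = FMT_NONE;
    size_t i;

    assert(supported != NULL && needs_extra_file != NULL);

    for (i = 0; i < count; ++i) {
        if (supported[i] == native) {
            /* Best case: reuse the temporary file DTWAIN already has. */
            *needs_extra_file = 0;
            return native;
        }
        /* JPEG is usable for color pages only. */
        if (fallback == FMT_NONE &&
            !(supported[i] == FMT_JPEG && page == PAGE_BW))
            fallback = supported[i];
    }

    *needs_extra_file = (fallback != FMT_NONE);
    return fallback;
}
```

For an engine that reads only BMP files, this returns FMT_BMP with the extra-file flag set for both page types; for an engine that also reads uncompressed TIFF, black/white pages reuse the existing temporary file.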

 

Your OCR engine must support processing at least one of the following image types for proper processing of searchable PDF text:

 

DTWAIN_TIFFNONE    Uncompressed TIFF
DTWAIN_TIFFG3      CCITTFaxDecode Group 3 TIFF
DTWAIN_TIFFG4      CCITTFaxDecode Group 4 TIFF
DTWAIN_TIFFLZW     TIFF/LZW compression
DTWAIN_BMP         Windows Bitmap file
DTWAIN_JPEG        JPEG file -- color PDF pages only


Example:


#include "dtwain.h"

void GetOCRText( )
{
    DTWAIN_OCRENGINE SelectedEngine;
    DTWAIN_SOURCE SelectedSource;

    /* Initialize DTWAIN Library */
    DTWAIN_SysInitialize( );

    /* Select the default TWAIN source */
    SelectedSource = DTWAIN_SelectDefaultSource( );
    if ( SelectedSource == 0 )
    {
        /* No Source was selected; release the library before leaving */
        DTWAIN_SysDestroy( );
        return;
    }

    /* Initialize the OCR interface */
    DTWAIN_InitOCRInterface( );

    /* Select the default OCR engine */
    SelectedEngine = DTWAIN_SelectDefaultOCREngine( );
    if ( SelectedEngine != 0 )
    {
        /* Tell DTWAIN to create searchable text when acquiring to PDF files */
        DTWAIN_SetPDFOCRMode( SelectedSource, TRUE );

        /* Set the PDF OCR conversion parameters.
           We know that our OCR engine can process 1-bpp BMP images for text,
           so we ask DTWAIN to generate a temporary 1-bpp BMP image for OCR
           processing, for both black/white and color PDF pages */
        DTWAIN_SetPDFOCRConversion( SelectedEngine, DTWAIN_PDFPAGETYPE_BW, DTWAIN_BMP, DTWAIN_PT_BW, 1, DTWAIN_PDFOCR_CLEANTEXT1 );
        DTWAIN_SetPDFOCRConversion( SelectedEngine, DTWAIN_PDFPAGETYPE_COLOR, DTWAIN_BMP, DTWAIN_PT_BW, 1, DTWAIN_PDFOCR_CLEANTEXT1 );

        /* Start the acquisition of PDF pages from the selected TWAIN device */
        DTWAIN_AcquireFile( SelectedSource, "Searchable.pdf", DTWAIN_PDF, DTWAIN_USENAME, DTWAIN_PT_DEFAULT, 1, TRUE, TRUE, NULL );

        /* If no file errors occurred, a PDF file with searchable text
           called "Searchable.pdf" was created */
    }

    /* Destroy the TWAIN interface, whether or not an OCR engine was selected */
    DTWAIN_SysDestroy( );
}