DTWAIN_SetPDFOCRConversion


The following function is available only with DTWAIN / PDF.

 

The DTWAIN_SetPDFOCRConversion function sets the conversion parameters that the OCR engine uses when creating searchable PDF files.

 

DTWAIN_BOOL DTWAIN_SetPDFOCRConversion (
    DTWAIN_OCRENGINE  Engine,
    LONG              PageType,
    LONG              FileType,
    LONG              PixelType,
    LONG              BitDepth,
    LONG              Options );

 

Parameters

Engine

A selected OCR engine.

 

PageType

The PDF page type that the conversion parameters will be used on.

 

FileType

The image file type that the OCR engine supports for retrieving the text.

 

PixelType

Pixel type of the image file that the OCR engine supports.

 

BitDepth

Bit depth of the image file that the OCR engine supports.

 

Options

Other miscellaneous OCR options.

 

Return Values

The return value is TRUE if successful.  Otherwise FALSE is returned.

 

Comments

The DTWAIN_SetPDFOCRConversion function sets up the conversion parameters to use when creating searchable PDF files with DTWAIN_AcquireFile or DTWAIN_AcquireFileEx.  To allow creation of searchable PDF files, DTWAIN_SetPDFOCRMode must also be called with its second argument set to TRUE.

 

When DTWAIN acquires pages to a PDF file, DTWAIN will create temporary image files, and from these files the PDF pages are produced.  Since these temporary image files may contain "text", the OCR engine can be allowed to recognize the text in the image file, and then save this text to the PDF page as searchable (invisible) text.

 

To allow DTWAIN to use the OCR engine, the OCR engine must be set up to recognize the text in the image file correctly.  For this to work, DTWAIN must be told exactly how the selected OCR engine handles image files, for example, which types of image files the OCR engine supports.  This is exactly what DTWAIN_SetPDFOCRConversion does: it informs DTWAIN how your OCR engine handles image files, and how DTWAIN should set up the OCR engine to recognize the text that will be placed in the PDF file as searchable text.

 

The Engine parameter is the selected OCR engine.

 

The PageType parameter must be one of the following: DTWAIN_PDFPAGETYPE_BW for black and white PDF pages, or DTWAIN_PDFPAGETYPE_COLOR for color (RGB) PDF pages.  Since DTWAIN creates PDF pages as black and white or color (depending on the original acquired image), DTWAIN should be informed how the OCR engine handles each type.  If PageType is DTWAIN_PDFPAGETYPE_BW, the FileType, PixelType, and BitDepth parameters describe the file format, pixel type, and bit depth of the image file that the OCR engine can use for text recognition when the page is black and white.  If PageType is DTWAIN_PDFPAGETYPE_COLOR, the FileType, PixelType, and BitDepth parameters describe the file format, pixel type, and bit depth of the image file that the OCR engine can use to recognize text for color PDF pages.

 

Note that many OCR engines do not support recognizing text in color image files.  However, a color PDF file can still store searchable text even if the OCR engine only supports black/white images.  The simplest way to overcome this limitation is to call DTWAIN_SetPDFOCRConversion for DTWAIN_PDFPAGETYPE_COLOR with the same FileType, PixelType, and BitDepth values that are used for black/white images.  This works because DTWAIN converts the color image into a temporary black/white image, calls the OCR engine to recognize the text, and then destroys the temporary black/white image.

 

DTWAIN creates this temporary converted file only if it discovers that the PDF page type does not match the conversion type specified for the OCR engine.
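The black/white-only setup described above can be sketched as follows.  This is a minimal illustration, not code from the DTWAIN distribution.  The stand-in section (the typedefs, the placeholder constant values, the g_stub_calls counter, and the stub body of DTWAIN_SetPDFOCRConversion) exists only so that the sketch compiles without the DTWAIN SDK; in a real program that section would be removed and dtwain.h included instead.  The helper name ConfigureBWOnlyEngine is hypothetical, and the choice of Group 4 TIFF at 1 bit per pixel is an assumption about what the OCR engine accepts.

```c
#include <stddef.h>

/* ---- Stand-ins so the sketch compiles without the DTWAIN SDK. ----
 * In a real program, delete this section and #include "dtwain.h".
 * The numeric values are placeholders, NOT the real DTWAIN values. */
typedef int   DTWAIN_BOOL;
typedef long  LONG;
typedef void *DTWAIN_OCRENGINE;
enum { DTWAIN_PDFPAGETYPE_BW = 1, DTWAIN_PDFPAGETYPE_COLOR = 2 };
enum { DTWAIN_TIFFG4 = 10 };
enum { DTWAIN_PT_BW = 0 };
enum { DTWAIN_PDFOCR_CLEANTEXT1 = 1 };

static int g_stub_calls = 0;   /* number of successful calls to the stub */

/* Stub with the prototype documented on this page; it only counts calls. */
DTWAIN_BOOL DTWAIN_SetPDFOCRConversion(DTWAIN_OCRENGINE Engine, LONG PageType,
                                       LONG FileType, LONG PixelType,
                                       LONG BitDepth, LONG Options)
{
    (void)PageType; (void)FileType; (void)PixelType;
    (void)BitDepth; (void)Options;
    if (Engine == NULL)
        return 0;              /* stub behavior only: reject a null engine */
    ++g_stub_calls;
    return 1;
}
/* ---- End of stand-ins. ---- */

/* Hypothetical helper: registers the same black/white conversion settings
 * for both PDF page types, so that color pages are converted to a temporary
 * black/white image before text recognition.
 * Returns nonzero only if both calls succeed. */
DTWAIN_BOOL ConfigureBWOnlyEngine(DTWAIN_OCRENGINE engine)
{
    const LONG pageTypes[] = { DTWAIN_PDFPAGETYPE_BW, DTWAIN_PDFPAGETYPE_COLOR };
    size_t i;

    for (i = 0; i < sizeof pageTypes / sizeof pageTypes[0]; ++i) {
        if (!DTWAIN_SetPDFOCRConversion(engine,
                                        pageTypes[i],
                                        DTWAIN_TIFFG4,   /* assumed supported by the engine */
                                        DTWAIN_PT_BW,
                                        1,               /* 1 bit per pixel */
                                        DTWAIN_PDFOCR_CLEANTEXT1))
            return 0;
    }
    return 1;
}
```

In an application, the engine argument would be the handle obtained from one of the prerequisite OCR engine selection functions listed at the end of this page.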

 

The FileType parameter must be one of the following:

 

DTWAIN_TIFFNONE        Uncompressed TIFF
DTWAIN_TIFFG3          CCITTFaxDecode Group 3 TIFF
DTWAIN_TIFFG4          CCITTFaxDecode Group 4 TIFF
DTWAIN_TIFFLZW         TIFF/LZW compression
DTWAIN_BMP             Windows Bitmap file
DTWAIN_JPEG            JPEG file -- color PDF pages only

 

Your OCR engine must support at least one of the above image types for searchable PDF text to be processed properly.

 

 

The PixelType parameter describes the color type of the image file that the OCR engine supports.  This value is either DTWAIN_PT_BW (black/white) or DTWAIN_PT_RGB (color).

 

The BitDepth parameter is the bits-per-pixel of the image that is supported by the OCR engine.  Usually for black/white images, this value is 1.  For color images, this can be 4, 8, 16, 24, etc.

 

To determine the pixel types and bit depths that your OCR engine supports, you can call DTWAIN_GetOCRCapValues using the DTWAIN_OCRCV_PIXELTYPE and DTWAIN_OCRCV_BITDEPTH capability values.

 

 

The Options parameter is one or more of the following values, added together to obtain the final Options value.

 

DTWAIN_PDFOCR_CLEANTEXT1        1

 

 

The DTWAIN_PDFOCR_CLEANTEXT1 option informs DTWAIN to store only printable characters (characters in the range of 32 (space) to 255).  In other words, all control characters (characters with values less than 32) are stripped out of the text and replaced with spaces.
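The effect of this option on recognized text can be illustrated with a short sketch.  This is not DTWAIN's internal implementation; the function name CleanOcrText is hypothetical, and the code only mirrors the rule described above: every character with a value below 32 is replaced by a space, and characters from 32 through 255 are left unchanged.

```c
#include <stddef.h>

/* Illustrative only: replaces each control character (value below 32)
 * in the first `length` bytes of `text` with a space, in place.
 * Characters with values 32 through 255 are left as they are. */
void CleanOcrText(char *text, size_t length)
{
    size_t i;

    for (i = 0; i < length; ++i) {
        /* Compare as unsigned so values 128..255 are not seen as negative. */
        unsigned char ch = (unsigned char)text[i];
        if (ch < 32)
            text[i] = ' ';
    }
}
```

For instance, a carriage return, line feed, or tab in the recognized text would each become a single space, so the length of the text does not change.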

 

Example

 

Creating Searchable PDF files

 

Prerequisite Function Call(s)

DTWAIN_SysInitialize

DTWAIN_InitOCREngine

DTWAIN_SelectOCREngine, DTWAIN_SelectDefaultOCREngine, DTWAIN_SelectOCREngineByName