Gnostice Document Studio Java
Next-generation multi-format document-processing component suite for Java SE/EE developers
Compatibility
Java 1.6 and later

How to convert scanned images to searchable PDF using OCR in Java

Learn to use the new digitization feature of XtremeDocumentStudio (for Java).

In Version 2015 R3 of XtremeDocumentStudio (for Java), we introduced a document digitization feature. This is a feature that many of our customers have asked for in the past.

Digitization involves recognizing specific content elements in the input document and converting them to a format that supports those elements. For example, a JPEG image might contain text, but the JPEG raster content does not store the text as text. It just looks like text to our eyes, as opposed to a paragraph of text in a web page, which would be wrapped as text in a paragraph tag (<p>...</p>). When the image is embedded in a PDF or a web page, the text is not available as text. Wouldn't it be great if the text were selectable as text in a web page or a Word document?

Many companies need to keep image data (such as scans of receipts or documents) in their document storage systems. Their users also need to be able to easily select text from these images when required. This is where digitization comes in.

In this release, we added a new class called DigitizerSettings. The preferences property of the document converter component exposes a digitizer settings instance (typed as ConverterDigitizerSettings in the snippet below). Using this instance, you specify which elements you would like to recognize and how you would like them to be converted. Here is a simple code snippet that demonstrates how it can be done.

import com.gnostice.core.XDocException;
import com.gnostice.documents.ConverterDigitizerSettings;
import com.gnostice.documents.ConverterException;
import com.gnostice.documents.DocumentConverter;
import com.gnostice.documents.FormatNotSupportedException;
import com.gnostice.core.digitizationengine.*;

public class DigitizationDemo {
  static { XDocSetup.activate(); }
  
  public static void main(String[] args) {	  
    // Create a converter instance
    DocumentConverter dc = new DocumentConverter();
    
    // Change digitizer settings to recognize text from image data 
    ConverterDigitizerSettings cds = dc.getPreferences().getDigitizerSettings();
    cds.setDigitizationMode(DigitizationMode.ALL_IMAGES);
    cds.setRecognizeElementTypes(RecognizeElementTypes.TEXT);
    
    try {
      // Convert an image or a scanned PDF to PDF and digitize any text in it
      dc.convertToFile(
        "H:\\Screenshot-2.png", 
        "e:\\converted_image.pdf");
    } catch (FormatNotSupportedException e) {
      e.printStackTrace();
    } catch (ConverterException e) {
      e.printStackTrace();
    } catch (XDocException e) {
      e.printStackTrace();
    }

  }  
}
This screenshot of the input image and the output PDF shows that the text can now be selected.

For this experiment, a cropped screenshot image of the header of this product's page was used. When the image was converted to PDF, the text could be selected! Now, this was a small image and not all the text in the image was converted to text in the PDF. This is not a problem with a typical scanned image. Scanners generate images so wide that you need more than a couple of arms to wrap around them. Converting text from such images will not be a problem for XtremeDocumentStudio.
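If you have a whole folder of scans rather than a single image, the single-file call shown earlier could be driven by a small planning step that pairs each input image with an output PDF name. The following is only a sketch under assumptions: the class name BatchScanPlanner, its helper methods, and the list of file extensions are hypothetical and not part of XtremeDocumentStudio. It uses only the standard Java library and leaves the actual conversion call as a comment.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BatchScanPlanner {
  // Extensions treated as scanned input. This list is an assumption;
  // adjust it to the image formats you actually convert.
  private static final String[] IMAGE_EXTENSIONS = { ".png", ".jpg", ".jpeg" };

  // Returns true if the file name ends with one of the extensions above
  static boolean isScanCandidate(String fileName) {
    String lower = fileName.toLowerCase(Locale.ROOT);
    for (String ext : IMAGE_EXTENSIONS) {
      if (lower.endsWith(ext)) {
        return true;
      }
    }
    return false;
  }

  // Replaces the last extension (if any) with ".pdf"
  static String toPdfName(String fileName) {
    int dot = fileName.lastIndexOf('.');
    String base = (dot > 0) ? fileName.substring(0, dot) : fileName;
    return base + ".pdf";
  }

  // Builds {inputPath, outputPath} pairs for every candidate file in inputDir
  static List<String[]> plan(File inputDir, File outputDir) {
    List<String[]> jobs = new ArrayList<String[]>();
    File[] files = inputDir.listFiles();
    if (files == null) {
      return jobs; // not a directory, or it could not be read
    }
    for (File f : files) {
      if (f.isFile() && isScanCandidate(f.getName())) {
        File out = new File(outputDir, toPdfName(f.getName()));
        jobs.add(new String[] { f.getPath(), out.getPath() });
      }
    }
    return jobs;
  }

  public static void main(String[] args) {
    File in = new File(args.length > 0 ? args[0] : ".");
    File out = new File(args.length > 1 ? args[1] : ".");
    for (String[] job : plan(in, out)) {
      // In a real application, each pair would be handed to the converter
      // configured as in the demo above, e.g. dc.convertToFile(job[0], job[1]);
      System.out.println(job[0] + " -> " + job[1]);
    }
  }
}
```

Keeping the planning step separate from the conversion call means the file-name logic can be checked without the product JARs on the class path.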

The digitization function was made possible by the Tesseract OCR library. A Java wrapper to a Windows DLL has been used for this purpose. (To enable this feature, you will need to add the library JAR files that you find in the bin folder of the XtremeDocumentStudio download ZIP file.) As this solution is based on a "C" library, your derived applications will no longer be 100% Java. If you want a 100% Java solution, then please do not reference these library JARs. Of course, the digitization feature will then not be available in the solution.
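Because the OCR wrapper described above targets a Windows DLL, an application that may run on several platforms might want to check the operating system before turning digitization on. The following is a minimal sketch, assuming that digitization should simply be skipped on non-Windows systems; the class name OcrPlatformCheck and the method isWindows are hypothetical and not part of XtremeDocumentStudio.

```java
import java.util.Locale;

public class OcrPlatformCheck {
  // Returns true if the given os.name value looks like a Windows system
  static boolean isWindows(String osName) {
    return osName != null
        && osName.toLowerCase(Locale.ROOT).startsWith("windows");
  }

  public static void main(String[] args) {
    String osName = System.getProperty("os.name");
    if (isWindows(osName)) {
      // Here the digitizer settings would be configured as in the demo above
      System.out.println("Windows detected; digitization can be enabled.");
    } else {
      // Assumption: without the Windows DLL, convert without digitization
      System.out.println("Non-Windows platform (" + osName
          + "); skipping digitization.");
    }
  }
}
```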

The JARs that need to be in the "Build Path".

In the future, the digitization feature will be expanded to detect barcodes and generate corresponding barcode form fields. So, as they say in show business, watch this space.
