PDFOne (for Java)
Create, edit, view, print & enhance PDF documents and forms in Java SE/EE
Compatibility: J2SE, J2EE, Windows, Linux, Mac (OS X)

Parsing PDF Page Elements Using PDFOne (for Java) Version 4

Learn to access PDF page elements such as text, images, shapes, and Form XObjects.
By V. Subhash

In Version 4 of PDFOne, we introduced a new method getPageElements() in the PdfDocument class.

List getPageElements(int pageNum,
                     int elementTypes)

List getPageElements(String pageRange,
                     int elementTypes)

This method returns a list of PdfPageElement instances. PdfPageElement is the parent class of the individual element classes, namely PdfPageCompositeElement, PdfPageImageElement, PdfPagePathElement, and PdfPageTextElement, so you can cast each item in the returned list to the derived class that matches its element type.
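When a list may hold more than one kind of element, it is safer to check the runtime type of each item before casting it. The following is only a minimal sketch of that pattern. To keep it compilable without the library, it uses hypothetical stand-in classes (StubElement, StubTextElement, StubImageElement) and a hypothetical helper named describe(); none of these are part of PDFOne.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for the PDFOne element classes, so that the
// sketch compiles without the library. They are NOT the real API.
class StubElement { }

class StubTextElement extends StubElement {
  final String text;
  StubTextElement(String text) { this.text = text; }
}

class StubImageElement extends StubElement {
  final int width, height;
  StubImageElement(int width, int height) {
    this.width = width;
    this.height = height;
  }
}

class ElementDispatchSketch {

  // Checks the runtime type of each element before downcasting it
  static List<String> describe(List<? extends StubElement> elements) {
    List<String> lines = new ArrayList<String>();
    for (StubElement element : elements) {
      if (element instanceof StubTextElement) {
        StubTextElement t = (StubTextElement) element;
        lines.add("text: " + t.text);
      } else if (element instanceof StubImageElement) {
        StubImageElement img = (StubImageElement) element;
        lines.add("image: " + img.width + " x " + img.height);
      } else {
        // Parent-class instance or a type this sketch does not handle
        lines.add("other element");
      }
    }
    return lines;
  }

  public static void main(String[] args) {
    List<StubElement> mixed = Arrays.asList(
        new StubTextElement("Hello"),
        new StubImageElement(640, 480),
        new StubElement());
    for (String line : describe(mixed)) {
      System.out.println(line);
    }
  }
}
```

With the real library, the same instanceof checks would be applied to the derived classes named above. If you request a single element type per call, a direct cast is enough.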

The derived classes provide much more information about a retrieved page element. For example, a PdfPageTextElement instance gives you not only the text that the element represents but also its location, font rotation (if any), and other details. The following code snippet shows how this is done.

import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;

import javax.imageio.ImageIO;

import com.gnostice.pdfone.PDFOne;
import com.gnostice.pdfone.PdfDocument;
import com.gnostice.pdfone.PdfException;
import com.gnostice.pdfone.PdfPageElement;
import com.gnostice.pdfone.PdfPageImageElement;
import com.gnostice.pdfone.PdfPageTextElement;


public class Page_Element_Parsing_Demo {

  public static void main(String[] args) throws IOException, PdfException, Exception {
        
    int i, n;
    PdfPageTextElement PdfPageTextElement1;
    PdfPageImageElement PdfPageImageElement1;
    BufferedImage BufferedImage1;
    
    // Load a PDF document
    PdfDocument doc = new PdfDocument();
    doc.load("sample.pdf");
    
    // Retrieve image elements from page 1 of the document
    ArrayList lstImageElements = 
        (ArrayList) doc.getPageElements(1, PdfPageElement.ELEMENT_TYPE_IMAGE);
    // Retrieve text elements from page 1 of the document
    ArrayList lstTextElements = 
        (ArrayList) doc.getPageElements(1, PdfPageElement.ELEMENT_TYPE_TEXT);
    
    // Iterate through retrieved image elements 
    n = lstImageElements.size();        
    for (i = 0; i < n; i++) {
      // Save image content of the current image element to file
      PdfPageImageElement1 = (PdfPageImageElement) lstImageElements.get(i);
      BufferedImage1 = PdfPageImageElement1.getImage();            
      File File1 = new File("page1_image" + (i+1) + ".png");
      try {
        ImageIO.write(BufferedImage1, "png", File1);
      } catch (Exception e) {
        System.out.println("Sorry, there was an error: " + e.getMessage());
      }

      // Print details of the current image element
      System.out.println("Image Element #" + (i+1) + " saved to: " +
                         "page1_image" + (i+1) + ".png (" + 
                         PdfPageImageElement1.getImageWidth() +
                         " x " + 
                         PdfPageImageElement1.getImageHeight() + 
                         ")");
    }
    
    // Close the document. It needs to stay loaded only while images
    // are being extracted, because image data is read on demand.
    doc.close();           
    
    // Iterate through retrieved text elements
    n = lstTextElements.size();        
    for (i = 0; i < n; i++) {
      PdfPageTextElement1 = (PdfPageTextElement) lstTextElements.get(i);
      // Print details of the current text element
      System.out.println("Text Element #" + (i+1) + " \"" + 
                         PdfPageTextElement1.getText() + "\" uses font " + 
                         PdfPageTextElement1.getTextFontInfo().getFontName());
    }
  }    
}

This code snippet parses the text and image elements on page 1 of a document. Text elements are printed to the console, while image elements are saved to files. Here is the document that was used to test the program.
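One detail of the image-saving step is worth a closer look: ImageIO.write() returns false, without throwing an exception, when no suitable writer is found for the requested format. The sketch below uses only standard Java classes and shows one way to turn that case into an error. The class PageImageSaver and its methods fileNameFor() and savePng() are hypothetical helpers made up for this illustration, and the blank BufferedImage stands in for an image taken from a page element.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

class PageImageSaver {

  // Builds a name such as "page1_image3.png" (1-based image index)
  static String fileNameFor(int pageNum, int imageIndex) {
    return "page" + pageNum + "_image" + imageIndex + ".png";
  }

  // Writes the image as PNG and fails loudly if no writer handled it
  static void savePng(BufferedImage image, File target) throws IOException {
    // ImageIO.write returns false when no suitable writer is found
    if (!ImageIO.write(image, "png", target)) {
      throw new IOException("No PNG writer accepted " + target.getName());
    }
  }

  public static void main(String[] args) throws IOException {
    // A blank stand-in for an image taken from a page element
    BufferedImage image = new BufferedImage(4, 3, BufferedImage.TYPE_INT_RGB);
    File target = new File(fileNameFor(1, 1));
    savePng(image, target);
    System.out.println("Saved " + target.getName());
  }
}
```

Keeping the file name in one helper also means the name written to disk and the name printed to the console cannot drift apart.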

Sample document used to test the Java program

Here is the output of the program when run against the above document. The output lists the image and text elements that were found on page 1 of the document.

Output of the Java program

Here is the image element after it was saved to a file.

Extracted image
