PDFOne .NET
Powerful all-in-one PDF library for .NET
Compatibility
VS 2005/2008/2010/2012/2013

Parsing PDF Page Elements Using PDFOne .NET v4

Learn to access PDF page elements such as text, images, shapes, and Form XObjects.
By V. Subhash
In Version 4 of PDFOne, we introduced a new method, GetPageElements(), in the PDFDocument class. It has two overloads.

ArrayList
  GetPageElements(int pageNum,
                  PDFPageElementType elementTypes)

List
  GetPageElements(String pageRange,
                  PDFPageElementType elementTypes)

This method returns a list of PDFPageElement instances. PDFPageElement is the parent class of the individual element classes, namely PDFPageCompositeElement, PDFPageImageElement, PDFPagePathElement, and PDFPageTextElement, so you can cast each item in the returned list to the derived class that matches the element type you requested.
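The parent/derived relationship described above can be illustrated with a minimal, self-contained sketch. The classes PageElement, TextElement, and ImageElement, and the helper ElementDemo.Describe, are hypothetical stand-ins invented for this illustration; they are not PDFOne types and only mimic the idea of downcasting items taken from a non-generic list.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical stand-ins for the PDFOne element classes; they only mimic
// the parent/derived relationship and are NOT the library's real types.
class PageElement { }

class TextElement : PageElement
{
    public string Text;
    public TextElement(string text) { Text = text; }
}

class ImageElement : PageElement
{
    public int Width;
    public int Height;
    public ImageElement(int width, int height) { Width = width; Height = height; }
}

static class ElementDemo
{
    // Walks a non-generic list and downcasts each item with "as",
    // which yields null instead of throwing when the type does not match.
    public static List<string> Describe(ArrayList elements)
    {
        List<string> lines = new List<string>();
        foreach (object item in elements)
        {
            TextElement text = item as TextElement;
            ImageElement image = item as ImageElement;
            if (text != null)
                lines.Add("text: " + text.Text);
            else if (image != null)
                lines.Add("image: " + image.Width + " x " + image.Height);
            else
                lines.Add("other element");
        }
        return lines;
    }

    static void Main()
    {
        ArrayList elements = new ArrayList();
        elements.Add(new TextElement("Hello"));
        elements.Add(new ImageElement(640, 480));
        elements.Add(new PageElement());
        foreach (string line in Describe(elements))
            Console.WriteLine(line);
    }
}
```

A direct cast such as (TextElement) item is shorter and works when only one element type was requested. The "as" form is a defensive choice for lists that might hold more than one derived type, because a mismatched direct cast throws InvalidCastException.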

The derived classes provide much more information about the retrieved page element. For example, with a PDFPageTextElement instance, you can find not only the actual text represented by the text element but also its location, font, rotation (if any), and other details. The following code snippet shows how this is done.

using System;
using System.Collections;
using System.Drawing;
using System.Drawing.Imaging;
using Gnostice.PDFOne;  // PDFOne .NET namespace (assumed; adjust to match your installation)

class Program {
  static void Main(string[] args) {
    PDFDocument PDFDocument1 = new PDFDocument("your-license-key");

    // Load PDF document
    PDFDocument1.Load("sample_doc.pdf");

    // Extract all image elements on page 1
    ArrayList ArrayList1 = PDFDocument1.GetPageElements(1, PDFPageElementType.IMAGE);

    // Save each image element to a bitmap file
    int n = ArrayList1.Count;
    for (int i = 0; i < n; i++) {
      PDFPageImageElement PDFPageImageElement1 = (PDFPageImageElement) ArrayList1[i];
      // Use one name for saving and reporting so that the message matches the file
      string fileName = "page1_image" + (i + 1) + ".bmp";
      Bitmap Bitmap1 = PDFPageImageElement1.GetImage();
      Bitmap1.Save(fileName, ImageFormat.Bmp);
      Console.WriteLine("Image Element #" + (i + 1) + " (" +
                        PDFPageImageElement1.ImageHeight + " x " +
                        PDFPageImageElement1.ImageWidth + ")" + " saved to: " +
                        fileName);
    }

    // Extract all text elements on page 1
    ArrayList ArrayList2 = PDFDocument1.GetPageElements(1, PDFPageElementType.TEXT);

    // Print the text and font name of each text element
    n = ArrayList2.Count;
    for (int i = 0; i < n; i++) {
      PDFPageTextElement PDFPageTextElement1 = (PDFPageTextElement) ArrayList2[i];
      Console.WriteLine("Text Element #" + (i + 1) + " \"" +
                        PDFPageTextElement1.Text + "\" uses font " +
                        PDFPageTextElement1.TextFontInfo.FontName);
    }

    // Close the document
    PDFDocument1.Close();
  }
}

This code snippet parses the text and image elements on page 1 of a document. The text elements are displayed in the console, while each image element is saved to a file. Here is the original document that was used to test this code.

Sample Document

Here is the output of the program when run against the above document. The output lists the image and text elements that were found on page 1 of the document.

Parsed Text and Images

Here is the image element after it was saved to a file.

Extracted Image
