PDFOne .NET
Powerful all-in-one PDF library for .NET
Compatibility
VS 2005/2008/2010/2012/2013

PDF Text Search and Extraction Using PDFOne .NET

Learn to search and extract text from PDF documents.
By V. Subhash

Version 4 of PDFOne .NET introduced three overloads of the Search method for finding text in PDF documents.

public ArrayList Search(
   // search string
   string searchString,
   // page number
   int pageNum,
   // literal or regular expression
   PDFSearchMode searchMode,
   // generous-match, case-sensitive, whole-word
   PDFSearchOptions searchOptions
)

public ArrayList Search(
   // search begins from
   int startPageNum,
   string searchString,
   PDFSearchMode searchMode,
   PDFSearchOptions searchOptions
)

public void Search(
   string searchString,
   PDFSearchMode searchMode,
   PDFSearchOptions searchOptions,
   // event handler to be called when a match is found
   SearchElementHandler pdfSearchHandler,
   int startPageNum
)

The first two overloads return an ArrayList of search results, one PDFSearchElement per match. Each element exposes the matched string, the page number, and the line of text that contains the match. The third overload does not return anything. Instead, it calls the specified event handler whenever it finds a match, and the handler receives the search result through its parameters.

These methods enable you to perform simple text searches using literal strings and advanced text searches using regular expressions.
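The difference between the list-returning overloads and the handler-based overload can be illustrated without the library. The sketch below is not the PDFOne .NET API: MatchHandler, FindAll, and FindEach are hypothetical names, and the "pages" are plain strings. It only shows the two calling styles, collecting every match before returning versus being called back as each match is found.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a search event handler; NOT the PDFOne .NET delegate.
public delegate void MatchHandler(int pageNumber, string line);

public static class SearchStyles
{
    // Style 1: collect every matching line and return the whole list.
    public static List<string> FindAll(string[] pages, string word)
    {
        var results = new List<string>();
        FindEach(pages, word, (pageNumber, line) => results.Add(line));
        return results;
    }

    // Style 2: return nothing; invoke the handler once per matching line.
    public static void FindEach(string[] pages, string word, MatchHandler handler)
    {
        for (int p = 0; p < pages.Length; p++)
        {
            foreach (string line in pages[p].Split('\n'))
            {
                if (line.IndexOf(word, StringComparison.Ordinal) >= 0)
                {
                    handler(p + 1, line); // page numbers are 1-based in this sketch
                }
            }
        }
    }

    public static void Main()
    {
        // Illustrative data: two "pages", the first with two lines.
        string[] pages = { "a red bike\na blue car", "bike lanes" };
        FindEach(pages, "bike",
                 (page, line) => Console.WriteLine("page " + page + ": " + line));
    }
}
```

The callback style suits long documents, because each match can be handled (or the results displayed) as soon as it is found instead of after the whole search has finished.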

Simple Text Search

The following code snippet performs a simple search using a literal string.

PDFDocument PDFDocument1 = new PDFDocument("your-license-key");

// Load PDF document
PDFDocument1.Load("sample_doc.pdf");

// Obtain all instances of the word "bike" in page 1
ArrayList ArrayList1 =
      (ArrayList) PDFDocument1.Search("bike",
                                      1,
                                      PDFSearchMode.LITERAL,
                                      PDFSearchOptions.NONE);

// Iterate through all search results
PDFSearchElement PdfSearchElement1;
int n = ArrayList1.Count;
for (int i = 0; i < n; i++) {
  PdfSearchElement1 = (PDFSearchElement) ArrayList1[i];
  // Print search results to console output
  Console.WriteLine("Found \"" +
                         PdfSearchElement1.MatchString +
                         "\" in page #" +
                         PdfSearchElement1.PageNumber +
                         " text \"" +
                         PdfSearchElement1.LineContainingMatchString +
                         "\"" );
}

// Close the document
PDFDocument1.Close();
Console.ReadLine();

Here is the document we used for testing this code.

Sample Document

And, here is the output.

Text Search Output

Advanced PDF Text Search

Regular expressions are productivity multipliers. With a well-crafted regular expression, you can eliminate several lines from your code. All the Search() overloads support regular expressions. The following code snippet shows how to use them.

PDFDocument PDFDocument1 = new PDFDocument("your-license-key");

// Load PDF document
PDFDocument1.Load("sample_.pdf");


// Obtain all hyperlinks in page 2
ArrayList ArrayList1 =
      (ArrayList)PDFDocument1.Search(@"http://{1}",
                                      2,
                                      PDFSearchMode.REGEX,
                                      PDFSearchOptions.NONE);

// Iterate through all search results
PDFSearchElement PdfSearchElement1;
int n = ArrayList1.Count;
for (int i = 0; i < n; i++) {
  PdfSearchElement1 = (PDFSearchElement) ArrayList1[i];
  // Print search results to console output
  Console.WriteLine("Found \"" +
                         PdfSearchElement1.MatchString +
                         "\" in page #" +
                         PdfSearchElement1.PageNumber +
                         " text \"" +
                         PdfSearchElement1.LineContainingMatchString +
                         "\"" );
}

// Close the document
PDFDocument1.Close();
Console.ReadLine();

The above code snippet uses a simple regular expression to locate web page links by their http:// prefix. To test this code snippet, we used the following document.

Sample Document

And, here is the output. Note how all the hyperlinks have been neatly caught by the search.

Advanced Text Search Output
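A note on the pattern: in .NET regular-expression syntax, {1} repeats the preceding character exactly once, so http://{1} matches only the text "http://", and the rest of each link is read from the line that contains the match. The sketch below uses the standard System.Text.RegularExpressions.Regex class, not PDFOne .NET, to compare that pattern with a broader one. The broader pattern https?://\S+ and the sample line are assumptions made for illustration, and it is also an assumption that PDFOne .NET interprets patterns the same way Regex does.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkPatterns
{
    // Return every substring of 'text' matched by 'pattern'.
    public static List<string> Matches(string text, string pattern)
    {
        var found = new List<string>();
        foreach (Match m in Regex.Matches(text, pattern))
        {
            found.Add(m.Value);
        }
        return found;
    }

    public static void Main()
    {
        // Illustrative sample line; not taken from the article's test document.
        string line = "See http://www.example.com/docs and https://example.org today.";

        // Pattern from the snippet above: matches the scheme prefix only.
        foreach (string s in Matches(line, @"http://{1}"))
            Console.WriteLine("prefix: " + s);

        // Broader pattern (assumption): scheme plus all following non-space characters.
        foreach (string s in Matches(line, @"https?://\S+"))
            Console.WriteLine("link:   " + s);
    }
}
```

If the matched string itself should contain the whole link, a pattern along the lines of the second one may be a better fit, though it should be tried against real documents first.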

PDF Text Extraction

The search methods find text in the order in which it is stored in the document. This is not always the order in which a human reads the page, from top to bottom. If you need results in reading order, first extract all the text from the page and then search the extracted text. The following code snippet shows how to extract all text content from a PDF page.

// Create a PDF document object
PDFDocument PDFDocument1 = new PDFDocument("your-license-key");

// Load PDF document
PDFDocument1.Load("sample_doc.pdf");

// Extract text from page 1
ArrayList aExtractedText = PDFDocument1.ExtractText(1);

// Save extracted text to file
using (StreamWriter StreamWriter1 = File.CreateText("extracted_content.txt")) {
  foreach (string sLine in aExtractedText) {
    StreamWriter1.Write(sLine);
  }
}
// The using block has already closed the writer

// Close the document
PDFDocument1.Close();

We tested this code snippet on a PDF document containing the license agreement of one of our products. Here is that document and the extracted text.

Original Document and Extracted Text
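Once the page text is available as an ordered list of lines, searching it in that order needs nothing beyond the .NET base class library. The sketch below assumes the extracted lines are plain strings, as the foreach loop in the extraction snippet suggests. OrderedSearch, FindInLines, and the sample lines are hypothetical names and data, not part of PDFOne .NET.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class OrderedSearch
{
    // Return the 1-based positions of the lines that match 'pattern', in list order.
    public static List<int> FindInLines(ArrayList lines, string pattern)
    {
        var hits = new List<int>();
        var regex = new Regex(pattern, RegexOptions.IgnoreCase);
        for (int i = 0; i < lines.Count; i++)
        {
            string line = lines[i] as string; // skip anything that is not a string
            if (line != null && regex.IsMatch(line))
            {
                hits.Add(i + 1);
            }
        }
        return hits;
    }

    public static void Main()
    {
        // Stand-in for the list returned by text extraction.
        var extracted = new ArrayList { "License Agreement", "1. Grant of license", "2. Restrictions" };
        foreach (int n in FindInLines(extracted, "license"))
            Console.WriteLine("match on line " + n);
    }
}
```

Because the loop walks the list from first to last, matches are reported in the same order as the extracted lines.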

© 2002-2023 Gnostice Information Technologies Private Limited. All rights reserved.