One of the new features that we introduced in Version 4 of PDFOne (for Java™) was text extraction. It was included because there were numerous requests for that feature from existing customers and trial users.
One of the methods that you could use to extract text is the PdfDocument.search() method.

    List search(int startPageNum, String searchString, int searchMode, int searchOptions)
    List search(String searchString, int pageNum, int searchMode, int searchOptions)
    void search(String searchString, int searchMode, int searchOptions, PdfSearchHandler pdfSearchHandler, int startPageNum)

The search() method finds all instances of the search text and returns a list containing the results. The following code snippet demonstrates how to use this method.

    import java.io.IOException;
    import java.util.ArrayList;

    import com.gnostice.pdfone.PDFOne;
    import com.gnostice.pdfone.PdfDocument;
    import com.gnostice.pdfone.PdfException;
    import com.gnostice.pdfone.PdfSearchElement;
    import com.gnostice.pdfone.PdfSearchMode;
    import com.gnostice.pdfone.PdfSearchOptions;

    public class Text_Search_Demo {

        public static void main(String[] args)
                throws IOException, PdfException, Exception {
            int i, n;
            PdfSearchElement pseResult;

            // Load a PDF document
            PdfDocument doc = new PdfDocument();
            doc.load("Input_Docs\\input_doc.pdf");

            // Obtain all instances of the word "alcohol" in page 4
            ArrayList lstSearchResults1 =
                (ArrayList) doc.search("alcohol",
                                       4,
                                       PdfSearchMode.LITERAL,
                                       PdfSearchOptions.NONE);

            // Close the document
            doc.close();

            // Iterate through all search results
            n = lstSearchResults1.size();
            for (i = 0; i < n; i++) {
                pseResult = (PdfSearchElement) lstSearchResults1.get(i);
                // Print search results to console output
                System.out.println(
                    "Found \"" + pseResult.getMatchString()
                    + "\" in page #" + pseResult.getPageNum()
                    + " text \"" + pseResult.getLineContainingMatchString()
                    + "\"");
            }
        }
    }

For testing this code snippet, we used this document.
And, here is the output.
You can also perform advanced text search using regex strings. In the above code snippet, we can modify the search method. We can use a regex that finds all text elements that contain a hyperlink.

    // Obtain all website addresses in page 2
    ArrayList lstSearchResults =
        (ArrayList) doc.search("http://{1}",          // regular expression
                               2,                     // page number
                               PdfSearchMode.REGEX,
                               PdfSearchOptions.NONE);

Here is the output when we perform the text search using the regular expression.
Here is the document where the search was performed. Note that a text element containing multiple hyperlinks is printed once for each hyperlink it contains.
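To see why a line with several links shows up more than once, the following sketch reproduces the effect with the standard java.util.regex package instead of PDFOne. It assumes that the REGEX search mode reports one result per match and that the pattern syntax behaves like Java's own regex engine; neither is confirmed here. The class name, the helper methods, and the sample lines are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexMatchSketch {

    // Same pattern string as in the search call above. "/{1}" means
    // exactly one "/", so the expression is equivalent to "http://".
    static final Pattern LINK_PREFIX = Pattern.compile("http://{1}");

    // Returns each line once per match, mimicking a search that
    // reports one result for every occurrence of the pattern.
    static List<String> linesPerMatch(List<String> lines) {
        List<String> results = new ArrayList<>();
        for (String line : lines) {
            Matcher m = LINK_PREFIX.matcher(line);
            while (m.find()) {
                results.add(line);
            }
        }
        return results;
    }

    // Drops repeated lines while keeping the order of first appearance.
    static List<String> distinctLines(List<String> results) {
        return new ArrayList<>(new LinkedHashSet<>(results));
    }

    public static void main(String[] args) {
        // Hypothetical page text, not taken from the sample document
        List<String> page = Arrays.asList(
            "See http://example.com and http://example.org for details.",
            "This line has no link.",
            "Home page: http://example.net");

        List<String> hits = linesPerMatch(page);
        // Prints "3 results, 2 distinct lines"
        System.out.println(hits.size() + " results, "
            + distinctLines(hits).size() + " distinct lines");
    }
}
```

If repeated lines are not wanted, the same LinkedHashSet approach could be applied to the strings returned by getLineContainingMatchString() before printing them.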
You may have noticed that the list contains text elements in the order in which they were found in the PDF document. If you would like to preserve the order in which a human reader would encounter the text, use the saveAsText() method instead.

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStreamWriter;

    import com.gnostice.pdfone.PDFOne;
    import com.gnostice.pdfone.PdfDocument;
    import com.gnostice.pdfone.PdfException;

    public class Text_Export_Demo {

        public static void main(String[] args)
                throws IOException, PdfException, Exception {
            int i, n;

            // Create a file writer instance
            FileOutputStream fos =
                new FileOutputStream("Output_Docs\\extracted_text.txt");
            OutputStreamWriter osw = new OutputStreamWriter(fos, "utf-8");

            // Load a PDF document
            PdfDocument doc = new PdfDocument();
            doc.load("Input_Docs\\sample_doc.pdf");

            // Extract text from page 1 of the document
            // and save it to the file writer
            doc.saveAsText(1, osw);
            osw.close();

            // Close the PDF document
            doc.close();
        }
    }

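In the snippet above, the writer is closed only if saveAsText() returns normally. One possible alternative is to open the writer in a try-with-resources statement, so that it is closed even when an exception is thrown. The sketch below shows that pattern using only the standard library. The PDFOne call is replaced by a plain write() so that the example runs without the library, and the class name, method name, and temporary file are hypothetical.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class WriterSketch {

    // Writes text to the given file as UTF-8. The writer and its
    // underlying stream are closed automatically, even if write() throws.
    static void writeUtf8(Path target, String text) throws IOException {
        try (Writer osw = new OutputStreamWriter(
                new FileOutputStream(target.toFile()),
                StandardCharsets.UTF_8)) {
            // In the export program, doc.saveAsText(1, osw) would go here
            osw.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        // Temporary file used only for this illustration
        Path target = Files.createTempFile("extracted_text", ".txt");
        writeUtf8(target, "Sample text with a non-ASCII character: \u00e9");

        // Read the file back to confirm what was written
        String readBack = new String(
            Files.readAllBytes(target), StandardCharsets.UTF_8);
        System.out.println(readBack);

        Files.delete(target);
    }
}
```

Passing StandardCharsets.UTF_8 instead of the string "utf-8" also avoids the checked UnsupportedEncodingException that the string-based constructor declares.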