IvyPdf

Version 1.46

IvyPdf helps you extract valuable information from unstructured PDF documents quickly and easily. It can extract an unlimited number of individual values and tables, and it provides a powerful post-processing mechanism to further clean and format the data.

While PDFs are the main target of the library, it can also be used to parse Excel, Text, HTML and other file formats, so you can use a single tool for all your data processing needs.

 

How to use IvyPdf

IvyPdf can be used in a few different ways:

 

Getting Started

Download IvyPdf.zip and unzip it to a folder of your choice. If you have a license key, start IvyTemplateEditor.exe, go to Help/About and enter the key there. This registers Ivy on your machine. If you don't have a license key, the 30-day trial begins on the day of first use.

Check out Ivy examples (sample.tl and Disney.ipb) in the Samples folder.

Read the Tutorial and Quick Hints sections below, or use the API Reference to familiarize yourself with Ivy commands and syntax.

 

Ivy Template Editor

Even if you plan to use IvyPdf strictly from your program code, it's a good idea to use Ivy Template Editor to test the code and the extraction logic first.

  1. Start IvyTemplateEditor.exe
  2. Open a PDF document from the top menu, or drag and drop it onto the editor pane
  3. Highlight a template and create a new field
  4. Go to the Code window and try a few commands. For example:

Click Evaluate to test your code.

Use the Toolbox window to test different commands, for example Find, Right, Down. Try extracting tables with various parameters. The history of commands is shown there for your reference and can be copied to the code definition.

 


 

API Reference

PdfParser

The PdfParser class is used to parse the token collection extracted from PDF or IPB files.

 

Create parser

PdfParser() – default constructor.

PdfParser(IvyDocument ivyDocument) – create a parser from IvyDocument instance.

 

Open document

PdfParser(IvyDocumentReader.ReadPdf(string filename)) – open PDF file.

PdfParser(IvyDocumentReader.ReadIpb(string filename)) – open IPB file.

IPB files contain only the text information from PDFs and are therefore smaller and faster to use. Consider converting PDF files to IPB if you are doing multiple extractions, working on an extraction template and so on.

Documents can also be opened from memory:

PdfParser(IvyDocumentReader.ReadPdf(byte[] fileContent))

PdfParser(IvyDocumentReader.ReadIpb(byte[] fileContent))

 

Search text

Find(params string[] text) - search for next token containing text (partial match).

Find(Regex regex) - search for next token matching specified regular expression.

Find(Predicate<Token> predicate) - search for next token matching specified properties.

FindPattern(string pattern) - search for next token matching specified pattern. See Pattern Matching

 

You can search backwards:

FindPrev(params string[] text)

FindPrev(Regex regex)

FindPrev(Predicate<Token> predicate)

FindPrevPattern(string pattern)

 

Search by page number

FindPage(int pageNumber) - find first token on specified page.

 

Find a token relative to current

Left(float offsetY = 0, float deviation = 0) - find the first token to the left of the current token. To be found, the token should lie on an imaginary horizontal line that starts from the center of the current token. The optional offsetY parameter moves the line down (a negative value moves it up). A token is also captured if it lies within deviation points of the line.

Down(float offsetX = 0, float deviation = 0) - find the first token located directly below the current token. To be found, the token should lie on an imaginary vertical line that starts from the center of the current token. The optional offsetX parameter moves the line right (a negative value moves it left). A token is also captured if it lies within deviation points of the line.

Right(float offsetY = 0, float deviation = 0) - find first token on the right of the current token.

Up(float offsetX = 0, float deviation = 0) - find first token located right above the current token
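The "imaginary line" rule used by Left, Right, Up and Down can be pictured with a small stand-alone sketch. This is not IvyPdf's implementation: the Tok class and the RightSketch.Right helper are hypothetical, and the sketch assumes that X,Y is the top-left corner of a token, that Y grows downward, and that a token counts as hit when the line passes through its bounding box extended by deviation on each side.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a token; not the IvyPdf Token class.
public class Tok
{
    public string Text;
    public float X, Y, Width, Height;

    public Tok(string text, float x, float y, float width, float height)
    {
        Text = text; X = x; Y = y; Width = width; Height = height;
    }
}

public static class RightSketch
{
    // Returns the nearest token to the right of 'current' that the horizontal
    // search line touches, or null when there is none.
    public static Tok Right(Tok current, IEnumerable<Tok> tokens,
                            float offsetY = 0, float deviation = 0)
    {
        // The line starts at the vertical center of the current token;
        // a positive offsetY moves it down (assumed: Y grows downward).
        float lineY = current.Y + current.Height / 2 + offsetY;

        return tokens
            .Where(t => t != current && t.X > current.X)
            .Where(t => lineY >= t.Y - deviation &&
                        lineY <= t.Y + t.Height + deviation)
            .OrderBy(t => t.X)
            .FirstOrDefault();
    }

    public static void Main()
    {
        var total = new Tok("Total", 10, 100, 30, 10);
        var tokens = new List<Tok>
        {
            total,
            new Tok("Note", 60, 107, 20, 10),    // slightly below the search line
            new Tok("42.00", 100, 101, 30, 10),  // on the same text line
            new Tok("Tax", 100, 115, 20, 10)     // on the next text line
        };

        Console.WriteLine(Right(total, tokens).Text);         // no offset, no deviation
        Console.WriteLine(Right(total, tokens, 0, 3).Text);   // deviation of 3 points
        Console.WriteLine(Right(total, tokens, 15, 0).Text);  // line moved 15 points down
    }
}
```

With these sample coordinates the three calls pick "42.00", "Note" and "Tax" respectively, which shows how deviation widens the search band and how offsetY shifts it.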

 

Find first token in line above or below the current

Above() - find first (left-most) token located in the line above the current token.

Below() - find first (left-most) token located in the line below the current token.

 

Define search region

All search is done within a "Region". Initial region is the whole document.

Here are methods to FILTER a region:

FilterWindow(float x1, float y1, float x2, float y2) - set region to the rectangle defined by X,Y coordinates on the current token page. Only tokens that completely fit into the window are included.

FilterCWindow(float x1, float y1, float x2, float y2) - set region to the rectangle defined by X,Y coordinates on the current token page. Tokens that partially overlap are included too (cross-window tool).

FilterOffset(float left, float up, float right, float down) - set region to the window located relative to top-left corner of the current token (within the same page).

FilterCOffset(float left, float up, float right, float down) - set region to the window located relative to top-left corner of the current token (within the same page). Tokens that partially overlap are included too.
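The difference between the plain and the "C" (cross-window) variants comes down to containment versus overlap. The stand-alone sketch below illustrates the two tests; the Box struct and both helper methods are hypothetical and are not part of IvyPdf, and the sketch assumes x1 < x2 and y1 < y2.

```csharp
using System;

// Hypothetical stand-in for a token bounding box; not an IvyPdf type.
public struct Box
{
    public float X, Y, Width, Height;

    public Box(float x, float y, float width, float height)
    {
        X = x; Y = y; Width = width; Height = height;
    }
}

public static class WindowSketch
{
    // "Window" rule: the box must fit completely inside (x1,y1)-(x2,y2).
    public static bool FitsInside(Box b, float x1, float y1, float x2, float y2)
    {
        return b.X >= x1 && b.Y >= y1 &&
               b.X + b.Width <= x2 && b.Y + b.Height <= y2;
    }

    // "Cross-window" rule: any overlap with (x1,y1)-(x2,y2) is enough.
    public static bool Overlaps(Box b, float x1, float y1, float x2, float y2)
    {
        return b.X < x2 && b.X + b.Width > x1 &&
               b.Y < y2 && b.Y + b.Height > y1;
    }

    public static void Main()
    {
        var inside = new Box(12, 12, 20, 6);      // entirely within the window
        var straddling = new Box(40, 12, 20, 6);  // sticks out past x2 = 50

        Console.WriteLine(FitsInside(inside, 10, 10, 50, 20));
        Console.WriteLine(FitsInside(straddling, 10, 10, 50, 20));
        Console.WriteLine(Overlaps(straddling, 10, 10, 50, 20));
    }
}
```

Under these assumptions the straddling box fails the containment test but passes the overlap test, which is the case where FilterWindow and FilterCWindow would be expected to differ.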

FilterIndex(int fromIndex, int toIndex) - set region to range of tokens by their index.

FilterCurrentPage() - set region to the page of current token.

FilterPage(int fromPage, int toPage = -1) – set region to page range.

FilterText(params string[] text) – include only tokens that contain specified text.

FilterTextPattern(string pattern) - include only tokens matching specified pattern. See Pattern Matching

FilterRegex(Regex regex) – include only tokens that match provided regular expression.

Filter(Predicate<Token> predicate) – region will include only tokens matching the predicate conditions.

FilterClear() – remove filter. Region will be set to the whole document.

Reset() – remove filter and move to the first token in the document.

Reset(string sequenceName) - optionally set SequenceName that can be used by Exception handler to log errors.

 

Parser properties and methods

IvyDocument IvyDocument – reference to the underlying document used by Parser.

IvyDocument properties are also referenced by Parser object:

string SequenceName – current search Sequence, set by a call to Reset(string sequenceName)

List<string> ParsingHistory - history of commands since the last Reset

Token Token - current token (set by one of Find methods or Filter)

double GetPageWidth(int page) – returns page width

double GetPageHeight(int page) – returns page height

int GetPageCount() – returns number of pages in the document

string Version – returns version of the IvyPdf library

Clone() – returns an exact copy of the PdfParser object that can be used for sub-searches.

 

The following properties are referencing the current Token and can be used as shortcuts (for example, you can use Text instead of Token.Text):

string Text - token text

string Font - name of the font used to print this token in the PDF document

bool Bold - font bold flag

bool Italic - font italic flag

int Page - token page number

float Width - width of token bounding box

float Height - height of token bounding box (essentially font height)

float X - X coordinate of the top-left corner of the token bounding box

float Y - Y coordinate of the top-left corner of the token bounding box

int Index - index of the token in the Tokens collection

string DataType - data type of the token, guessed from text. Can be "string", "number" or "date".

double? ToNumber() – auto-convert current token text to Double

DateTime? ToDate() – auto-convert current token text to DateTime

 

Text extraction

string ExtractText(TextPosition textPosition, bool removeBlankLines, bool trimSpaces, double linePixelDeviation)

Parameters:

string ExtractText() - extract text using default parameters.

 

Bookmarks

Bookmarks can be used for repetitive tasks. For example find a token “X”, filter and search in the filtered region, then come back to the token “X” and search again.

SetBookmark(string name) – set a bookmark with specified name

GoBookmark(string name) – set current Token to the bookmark

DeleteBookmark(string name) – delete a bookmark

DeleteAllBookmarks() – delete all bookmarks

 

Conditions and Loops

You can use the If method to add conditional logic, checking either a specific condition or whether some code executes successfully. You can conditionally execute an action or return a value.

PdfParser If(condition, thenAction, [elseAction])

PdfParser If(action, thenAction, [elseAction])

dynamic If(condition, thenValue, elseValue)

dynamic If(action, thenValue, elseValue)

 

Examples:

Go up or down, depending on a condition:

p.If(myvariable == 42, x=>x.Down(), x=>x.Up());

Version without “else”:

p.If(myvariable == 42, x=>x.Down());

Check if a word “test” exists, then search something else:

p.If(x=>x.Find("test"), x=>x.Find("word1"), x=>x.Find("word2"));

If a word “test” exists return a string:

p.If(x=>x.Find("test"), "Found", "Not Found");

Return a string based on condition:

p.If(myVariable == true, "Yes", "No");

 

In a similar way, you can use the While loop to repeat an action while a condition holds, or to keep running code for as long as it succeeds:

While(condition, action)

While(testAction)

While(testAction, action)

Move down until a bold token is found:

p.While(x => !x.Bold, x => x.Down());

Find right-most token starting from your position:

p.While(x => x.Right());

Count number of occurrences of word “test”:

int counter=0; p.While(x=>x.Find("test"), ()=>counter++);

 

You can test for successful code execution using Try method:

bool Try(action)

Count number of occurrences of word “test”:

while(p.Try(x=>x.Find("test"))) counter++;

 

Table extraction

DataTable Table(PdfTableOptions tableOptions) - Use for tables with column headers. To start table extraction the current token should be a header token of one of the columns.

DataTable Grid(PdfTableOptions tableOptions) - Use for tables that don't have a header. Returns a table with generic columns (Field0, Field1, Field2...). Current token should be in the first row.

Parameters:

In Ivy Template Editor the settings can be defined on the template level, by adding this code to Template Settings:

protected override void Init() { p.Options.TableCellType = TableCellType.Token; }

Using Token objects allows you to get location information from PDFs, but makes coding more complicated. All Token properties will be included in JSON or XML output in Ivy Template Editor.

 

DataSetParser

To parse Excel, CSV and other structured formats you can use DataSetParser class.

 

Create parser

DataSetParser(DataSet dataSet) – create from existing DataSet

DataSetParser(DataTable dataTable) – create from existing DataTable

 

Open document

DataSetParser(IvyDataSetReader.ReadExcel(string filename)) – open xls, xlsx, xlsm or csv file.

 

Search text

Find(params string[] text) - search for next cell containing text (partial match)

Find(Regex regex) - search for next cell matching specified pattern.

FindPattern(string pattern) - search for next cell matching specified pattern. See Pattern Matching

 

You can search backwards:

FindPrev(params string[] text)

FindPrev(Regex regex)

FindPrevPattern(string pattern) - search for previous cell matching specified pattern. See Pattern Matching

 

Search sheet (tab) by number or name

FindSheet(int sheet) – move to first cell on the specified sheet

FindSheet(string sheetName) – find a sheet where name contains provided text

FindSheetPattern(string pattern) - find a sheet where name matches specified pattern. See Pattern Matching

 

Find a cell relative to current

Left(int steps = 0) – move current position to the left. If a non-zero number is specified, move exactly that number of cells. Otherwise, move until a non-empty cell is found.

Right(int steps = 0)

Up(int steps = 0)

Down(int steps = 0)

 

Find first non-empty cell in line above or below the current

Above() - finds left-most non-empty cell in the line above.

Below() - finds left-most non-empty cell in the line below.

 

Select table area

DataTable Table(int headerRows = 1) – auto-grow table from current position, in left, right and down directions until empty columns/rows encountered. The top headerRows rows will be used as a header (default = 1)

DataTable Table(bool left, bool up, bool right, bool down, int emptyColumnLimit, int emptyRowLimit, int headerRows = 1) – auto-grow in specific directions only, allow limited number of empty columns/rows on the way.

DataTable Table(int left, int top, int width, int height, int headerRows = 1) – select area relative to current position.

DataTable Table(int width, int height, int headerRows = 1) – select area starting from current position.

DataTable Grid() – auto-grow table from current position, in left, right and down directions until empty columns/rows encountered. The header will be Field1, Field2, …

DataTable Grid(bool left, bool up, bool right, bool down, int emptyColumnLimit, int emptyRowLimit) – auto-grow in specific directions only, allow limited number of empty columns/rows on the way.

DataTable Grid(int left, int top, int width, int height) – select area relative to current position.

DataTable Grid(int width, int height) – select area starting from current position.

 

DataSetParser properties

DataSet DataSet – reference to the underlying DataSet object

int Sheet – current sheet number (zero-based)

string SheetName – current sheet name

int X – current column

int Y – current row

string Text – text value of the current cell

 

IvyDocument

Ivy converts PDF files to a collection of Tokens and Lines. The PdfParser class can be used to search this collection and extract useful information. In addition, the collection can be stored as an IPB file - a special format that can be loaded much more quickly than PDF.

 

Properties

List<Token> Tokens - collection of tokens extracted from PDF. (Tokens are text objects with location and size information.)

List<LineF> Lines - collection of lines extracted from PDF.

double[] PageSizesX - page widths

double[] PageSizesY - page heights

 

Methods

LoadIpb(string filename) - read IPB file

LoadIpb(byte[] fileContents) - read IPB file from in-memory byte array

SaveIpb(string filename) - save to IPB file

PerformTokenLayout(TokenLayoutType tokenLayoutType) - after tokens are loaded from PDF file Ivy performs some additional steps to make data easier to use. The following logic is applied:

 

You have an option to ignore default Ivy layout logic and apply your own instead. You can write your own logic completely, or you can use the following pre-defined methods:

CombineTokens(Comparison<Token> predicate) - combine tokens based on predicate logic

CombineTokens(Comparison<Token> predicate, string character) - combine tokens, adding a character in between (e.g. space)

SplitTokensByCharSequence(char[] chars, int minLength) - split tokens on the provided characters, making sure the resulting tokens are not shorter than minLength

Append(IvyDocument anotherIvyDocument) - append another IvyDocument at the end of the current one. Can be used to combine multiple PDF files together.

 

IvyDocumentReader

This class is used to read PDF documents and convert them to token collection representation.

IvyDocument ReadPdf(string filename, TokenLayoutType tokenLayoutType) - read PDF document.

Optional parameter tokenLayoutType specifies layout logic applied to token collection. Default value is UseWhitespaceTokens

IvyDocument ReadPdf(byte[] fileContents, TokenLayoutType tokenLayoutType) - read PDF document from memory.

IvyDocument ReadIpb(string filename) - read IPB document

IvyDocument ReadIpb(byte[] fileContents) - read IPB document from memory

 

DataSetReader

This class is used to read Excel and CSV documents into a DataSet object.

DataSet ReadExcel(string filename) - read Excel or CSV document. Files should have extension xls, xlsx, xlsm, or csv

 

DataTable extensions

Ivy Library includes many extension methods that can be used to join and filter DataTable objects to get the data you need.

The methods below return DataTable object. We will skip the return data type for readability:

 

Filter columns

 

allowMissingColumns - optional parameter (default = false). If not set and the table doesn't contain the specified column, an exception is thrown.

addMissingColumnAsEmpty - optional parameter (default = false). If set and the table doesn't contain specified column, the column is added as empty. Otherwise this column is ignored.

 

Example:

 

Filter rows

 

 

Example:

 

Relational operations

 

Group by, rollup, transpose and reverse

 

Formatting and cleanup

 

Update values

 

Various functions

 

String extensions

 

Global options

IvyOptions class can be used to specify some global options. It has the following properties:

CultureInfo and ToDateFormatMonthFirst are set according to local machine settings by default.

In Ivy Template Editor the settings can be changed on the template level, by adding this code to Template Settings:

 

Reserved words

Ivy templates have the following pre-defined objects (please refer to definition of TemplateBase class, which can be found in Template Editor: Template Library \ Custom Code Modules collection):

p – PdfParser object that is loaded from PDF or IPB files.

d – DataSetParser object loaded from xls, xlsx, xlsm or csv files.

s – string object loaded from txt, htm, html, xml or json files.

In addition, there are the following public fields:

filename – contains the full path of the loaded file.

args[] – additional parameters that can be provided via command-line.

_Init() method initializes the objects above.

Init() method is meant to be overridden by code in templates, providing custom initialization when needed.

 

Command-line parameters

Ivy Template

-e Extract data and save into Excel, Json or XML:

IvyTemplate.exe -e InputFile OutputFile TemplateLibrary TemplateName|Auto [Parameters]

For the Auto-template selection the template should have a field called AutoTemplateSelectionCriteria, containing logic that returns “true” if the template should be selected.

OutputFile should have extension .xlsx, .json or .xml to determine output format

-v Validate extraction – run the template, but do not save the results:

IvyTemplate.exe -v InputFile TemplateLibrary TemplateName [Parameters]

-i Convert a PDF file to IPB:

IvyTemplate.exe -i InputFile OutputFile

 

Ivy Template Editor

-o Open for preview in GUI mode:

IvyTemplateEditor.exe -o InputFile TemplateLibrary

Returned %ERRORLEVEL% values:

0 - success

1 - error

2 - template validation failed
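When extraction is launched from another program, the exit code can be read back and interpreted. The sketch below is one hedged way to do that with the standard System.Diagnostics.Process class. The helper names (BuildExtractArguments, Describe, Run), the sample file names and the choice to quote every argument are assumptions of this sketch, and it assumes the exit codes listed above are also what IvyTemplate.exe returns.

```csharp
using System;
using System.Diagnostics;

public static class IvyTemplateCli
{
    // Builds the argument string for the "-e" (extract) mode.
    // Quoting every argument is a choice of this sketch, made so that
    // paths and template names containing spaces stay intact.
    public static string BuildExtractArguments(string inputFile, string outputFile,
                                               string templateLibrary, string templateName)
    {
        return string.Format("-e \"{0}\" \"{1}\" \"{2}\" \"{3}\"",
                             inputFile, outputFile, templateLibrary, templateName);
    }

    // Maps the documented exit codes to their meaning.
    public static string Describe(int exitCode)
    {
        switch (exitCode)
        {
            case 0: return "success";
            case 1: return "error";
            case 2: return "template validation failed";
            default: return "unknown exit code " + exitCode;
        }
    }

    // Starts the executable, waits for it to finish and returns its exit code.
    public static int Run(string exePath, string arguments)
    {
        var info = new ProcessStartInfo(exePath, arguments) { UseShellExecute = false };
        using (Process process = Process.Start(info))
        {
            process.WaitForExit();
            return process.ExitCode;
        }
    }

    public static void Main()
    {
        // Hypothetical file and template names, for illustration only.
        string args = BuildExtractArguments("invoice.pdf", "invoice.xlsx", "templates.tl", "Auto");
        Console.WriteLine(args);
        Console.WriteLine(Describe(2));
    }
}
```

Main only prints the argument string so the sketch runs without IvyPdf installed. A real call would look like Describe(Run(pathToIvyTemplateExe, args)), where pathToIvyTemplateExe is wherever IvyPdf.zip was unzipped.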

 

Examples

PdfParser

 

 

 

Don't forget to call p.Reset() between the calls (if needed).

 

DataSetParser

 

 

Pattern Matching

Functions that end with "Pattern" accept wildcards for text matching. The syntax is:

* - any sequence of characters

? - any character

| - divider between multiple patterns

All pattern searches are case insensitive.

Examples:
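The wildcard rules can be illustrated by translating a pattern into an ordinary .NET regular expression. This is a stand-alone sketch rather than IvyPdf's own matcher: the WildcardSketch class is hypothetical, and it assumes a pattern has to match the whole text (for a "contains" match, wrap the pattern in asterisks).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WildcardSketch
{
    // Translates a wildcard pattern into an anchored, case-insensitive Regex:
    //   *  -> any sequence of characters
    //   ?  -> any single character
    //   |  -> separates alternative patterns
    public static Regex ToRegex(string pattern)
    {
        var alternatives = pattern
            .Split('|')
            .Select(p => Regex.Escape(p).Replace(@"\*", ".*").Replace(@"\?", "."));

        return new Regex("^(?:" + string.Join("|", alternatives) + ")$",
                         RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }

    public static bool IsMatch(string text, string pattern)
    {
        return ToRegex(pattern).IsMatch(text);
    }

    public static void Main()
    {
        Console.WriteLine(IsMatch("Total Amount", "total*"));    // prefix match, case ignored
        Console.WriteLine(IsMatch("Page 7", "page ?"));          // ? stands for one character
        Console.WriteLine(IsMatch("Invoice", "total*|inv*"));    // second alternative matches
        Console.WriteLine(IsMatch("Subtotal", "total*"));        // no match: text starts with "Sub"
    }
}
```

Regex.Escape is applied first so that characters such as "." or "(" in the pattern are treated literally; only the escaped "*" and "?" are then turned back into regex wildcards.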

 

Using TemplateLib in .Net

Requires .Net Framework 4.5

You need to add references to the following DLLs:

 

Custom layout

In some cases default layout logic doesn't work and you may have to replace it with custom code. For your reference below is standard logic used to combine token collection.

 

Tutorial

PDF documents can be tricky. They range from simple and clean reports to extremely convoluted ones, with random artifacts and structural errors. The task of extracting specific values may be daunting, especially if you need to do it for a large number of documents on multiple occasions. A computer can help with repetitive tasks, but it's up to you to define the parameters and the kind of information you are looking for. Every document is different; however, there are some common scenarios, so let's try to break them down.

 

The data you need is always in the same exact location

This is pretty common for various receipts, financial statements and so on. And it's easy to filter out:

First get the page you need, e.g. it's on page 3.

p.FilterPage(3)

(In Ivy Template Editor p is a predefined PdfParser object)

Then filter for exact location:

.FilterWindow(10,10,50,20)

(Use Filter button in the template editor to create a window, then move it to the required location to get coordinates)

This will get you the first token in the selected "window". If you want all text that fits in there - just add .ExtractText()

 

The data you need follows a specific word

It may move to different places in the document, so exact position is not known. Let's say you want to get a number that follows a word "Total":

p.Find("Total").Right().Text

That was easy, wasn't it? Please be aware that Find() uses case-sensitive search and it will find any token that contains your string. You can also search for occurrence of multiple strings, like this:

Find("Total", "total")

You probably expect to find a number there, but it may have extra characters like currency signs, commas, percentage symbols. Simply use .ToNumber() function to clean it up. Another handy function is ToDate(). It can recognize dates in most formats, even surrounded by other text.

Hint: You don't necessarily need to type the search text yourself. In the Template Editor right-click the token you want to get and click "Suggest". You may need to clean up some code that is generated (the logic there is to search for a specific section, which would usually have larger font, then subsections, then the word next to the token you need. Some of these steps may be omitted in your case)

 

Extracting tables

Let's say there is a table in PDF and it's a tedious task to get it out manually, so let's see what we can do. Tables can quickly become tricky. There's no way to deal with every table out there, since there are way too many variations. Let's start with a simple one first.

 

Simple, rectangular table with a header

First you need to find a header. Right-click any header token and click "Suggest" - you should get a logic that brings you to that token. Let's say this:

p.Find("Field1")

Now just add .Table() and preview the results. In many cases this works right away and is really that simple.

By default IvyPdf does not use any graphical objects, such as the lines that surround table cells. Instead, the table is built using the positions of the text tokens relative to each other. Because of this, some cells may get shifted to the wrong column or row, and you may get some unwanted data. You can try to fix this afterwards, using table extension methods like Rollup, DeleteRows, DeleteColumns and so on.

 

Table without header

Just find any token in the top row. (For example it may be Below some specific text). Then use Grid() function.

 

Table with subtotals between rows

You have two options: you can get the whole table, then delete specific rows using DeleteRows method, or you can filter out unwanted tokens first. Let's assume the subtotals are in bold:

p.Filter(x=>!x.Bold).Find("header1").Table()

 

Table with sub-headings that you want to add as a data column

This one is tricky. You would need some sort of loop for this. First create an empty DataTable object. Then find a sub-header and store it into a variable. Then move Below and call Grid(). Union the results with the empty table. Add new column, providing the stored header as a default value. Repeat. It may need a lot of tweaking to make this work, but you have a few helpful tools at your disposal:
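The loop described above can be sketched with plain System.Data calls. The example does not use IvyPdf at all: MakeGrid builds small stand-in tables in place of real Grid() results, the standard DataTable.Merge method stands in for the union step, and the column name "Section" and all sample values are made up for illustration.

```csharp
using System;
using System.Data;

public static class SubHeadingSketch
{
    // Builds a header-less table with generic column names, standing in
    // for what a Grid() call might return for one sub-section.
    public static DataTable MakeGrid(params string[][] rows)
    {
        var table = new DataTable();
        table.Columns.Add("Field0", typeof(string));
        table.Columns.Add("Field1", typeof(string));
        foreach (string[] row in rows)
            table.Rows.Add(row[0], row[1]);
        return table;
    }

    // Tags every row of 'grid' with its sub-heading, then appends the rows to 'result'.
    public static void AddSection(DataTable result, string subHeading, DataTable grid)
    {
        DataTable part = grid.Copy();
        part.Columns.Add("Section", typeof(string));
        foreach (DataRow row in part.Rows)
            row["Section"] = subHeading;

        // Merge adds any missing columns to 'result' and appends the rows.
        result.Merge(part);
    }

    public static void Main()
    {
        var result = new DataTable();  // starts empty, as in the description

        AddSection(result, "Americas",
                   MakeGrid(new[] { "Widgets", "10" }, new[] { "Gadgets", "7" }));
        AddSection(result, "Europe",
                   MakeGrid(new[] { "Widgets", "4" }));

        foreach (DataRow row in result.Rows)
            Console.WriteLine(string.Join(" | ", row.ItemArray));
    }
}
```

In a real template the two AddSection calls would sit inside a loop that finds each sub-header, stores its text, moves Below and calls Grid().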

 

Various collections

Let's say you want to find all telephone numbers in the document. First, Filter by a regular expression, then use the resulting Tokens collection - loop over it or convert it to a DataTable. IvyPdf uses DataTables extensively. We prefer DataTables over other collections for their flexibility. We extended their functionality, so you can use Join, Union, and many other handy functions. However, you can create any collections you like, use Linq, add your own extensions and so on.
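As a rough illustration of the "match by regular expression, then collect into a DataTable" idea, here is a stand-alone sketch that works on plain strings instead of IvyPdf tokens. The PhoneSketch class, the column name and the simplified North American phone pattern are all assumptions of this sketch; a real template would pass a similar Regex to the parser and read the matching tokens from it.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Text.RegularExpressions;

public static class PhoneSketch
{
    // Simplified phone pattern (e.g. 555-123-4567 or (555) 987-6543);
    // adjust it to the formats that occur in your documents.
    static readonly Regex Phone = new Regex(@"\(?\d{3}\)?[-. ]?\d{3}[-. ]\d{4}");

    // Collects the first phone number found in each text into a one-column table.
    public static DataTable FindPhones(IEnumerable<string> tokenTexts)
    {
        var table = new DataTable();
        table.Columns.Add("Phone", typeof(string));

        foreach (string text in tokenTexts)
        {
            Match match = Phone.Match(text);
            if (match.Success)
                table.Rows.Add(match.Value);
        }
        return table;
    }

    public static void Main()
    {
        // Stand-ins for token texts taken from a document.
        var texts = new[] { "Call 555-123-4567", "Invoice 42", "(555) 987-6543" };

        foreach (DataRow row in FindPhones(texts).Rows)
            Console.WriteLine(row["Phone"]);
    }
}
```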

 

Connect to a database, read a text file, get data from a web service

In the "Template Library Settings" you can add as many "Modules" as you want. The modules are C# classes that you can call from your expressions. In addition you can reference any .Net assembly and use its methods.

 

Quick Hints

Here are some quick hints on Ivy Template Editor (in no particular order):

 

FAQ

General questions

 

PDF Parsing

 

Licensing

IvyPdf is licensed per installation. Depending on how you use Ivy different types of licenses are required.

 

License code can be set programmatically.

License.SetLicense("your license code") - sets the license and registers IvyPdf on the current machine.

License.SetSessionLicense("your license code") - sets the license for the currently running program, but does not register IvyPdf on the machine. This is the preferable way for OEM-distributed software.