API Reference

Create parser:

public PdfParser() – default constructor

public PdfParser(IvyDocument ivyDocument) – create a parser from an IvyDocument instance

 

Open document:

public PdfParser(IvyDocumentReader.ReadPdf(string filename)) – Open PDF file

public PdfParser(IvyDocumentReader.ReadIpb(string filename)) – Open Ipb file

 

Ipb files contain only text information from PDFs and are much faster to use. Consider converting PDF files to Ipb if you are doing multiple extractions, working on an extraction template, etc.

 

Documents can also be opened from memory:

public PdfParser(IvyDocumentReader.ReadPdf(byte[] fileContent))

public PdfParser(IvyDocumentReader.ReadIpb(byte[] fileContent))
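
The conversion advice above can be sketched as a small caching helper. This is only a sketch under stated assumptions: the class name IpbCacheSketch and the ".ipb" file extension are hypothetical, and IvyDocumentReader is assumed to be reachable through "using IvyPdf;" as in the .Net sample in this document. ReadPdf, ReadIpb and SaveIpb are the members listed in this reference.

```csharp
using System.IO;
using IvyPdf; // namespace taken from the .Net sample in this document (assumption for IvyDocumentReader)

static class IpbCacheSketch
{
    // Opens the Ipb copy if it exists; otherwise reads the PDF and saves an Ipb copy for next time.
    public static PdfParser Open(string pdfPath)
    {
        string ipbPath = Path.ChangeExtension(pdfPath, ".ipb"); // extension is an assumption

        if (File.Exists(ipbPath))
            return new PdfParser(IvyDocumentReader.ReadIpb(ipbPath));

        PdfParser p = new PdfParser(IvyDocumentReader.ReadPdf(pdfPath));
        p.IvyDocument.SaveIpb(ipbPath); // SaveIpb is listed among the IvyDocument members
        return p;
    }
}
```

The first call pays the cost of reading the PDF; later calls should load the faster Ipb file.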

 

Search token:

PdfParser Find(params string[] text) - search for next token containing text (partial match)

PdfParser Find(Regex regex) - search for next token matching specified pattern

PdfParser Find(Predicate<Token> predicate) - search for next token matching specified properties.

 

You can search backwards:

PdfParser FindPrev(params string[] text)

PdfParser FindPrev(Regex regex)

 

You can search by page number:

PdfParser FindPage(int pageNumber) - find first token on specified page
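
The search overloads above can be combined as in the following sketch. The search strings, the date pattern and the class name FindSketch are made up for illustration, and Token.Bold is assumed to exist because the parser's Bold shortcut is described as referencing the current Token. What happens when a search fails is not stated in this section; the Try method described later is the documented way to test for success.

```csharp
using System;
using System.Text.RegularExpressions;
using IvyPdf;

static class FindSketch
{
    public static void Run(PdfParser p)
    {
        // Partial match: next token containing either spelling.
        string label = p.Find("Invoice", "INVOICE").Text;

        // Pattern match: next token that looks like a date such as 2024-01-31.
        string date = p.Find(new Regex(@"\d{4}-\d{2}-\d{2}")).Text;

        // Property match: next bold token.
        string heading = p.Find(t => t.Bold).Text;

        // Backward search from the current position.
        string previousTotal = p.FindPrev("Total").Text;

        Console.WriteLine("{0} | {1} | {2} | {3}", label, date, heading, previousTotal);
    }
}
```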

 

Find a token relative to current:

PdfParser Left(float offsetY = 0, float deviation = 0)

PdfParser Right(float offsetY = 0, float deviation = 0)

PdfParser Up(float offsetX = 0, float deviation = 0)

PdfParser Down(float offsetX = 0, float deviation = 0)

 

 

Find first token in line above or below the current:

PdfParser Above()

PdfParser Below()

 

 

All searches are performed within a "region". The initial region is the whole document.

Here are methods to FILTER a region:

PdfParser FilterWindow(float x1, float y1, float x2, float y2) - set region to the rectangle defined by X,Y coordinates on the current token page. Only tokens that completely fit into the window are included.

 

 

PdfParser FilterCWindow(float x1, float y1, float x2, float y2) - cross-window - set region to the rectangle defined by X,Y coordinates on the current token page. Tokens that partially overlap are included too.

 

 

PdfParser FilterOffset(float left, float up, float right, float down) - set region to window relative to current token (within the same page).

 

PdfParser FilterIndex(int fromIndex, int toIndex) - set region to range of tokens by their index

PdfParser FilterCurrentPage() - set region to the page of current token

PdfParser FilterPage(int fromPage, int toPage = -1) – set region to page range

PdfParser FilterText(params string[] text) – include only tokens that contain specified text

PdfParser FilterRegex(Regex regex) – include only tokens that match provided regular expression

PdfParser Filter(Predicate<Token> predicate) – region will include only tokens matching the predicate conditions

PdfParser FilterClear() – remove filter. Region will be set to the whole document

PdfParser Reset() – remove filter and move to the first token in the document

PdfParser Reset(string sequenceName) - optionally set SequenceName that can be used by Exception handler to log errors
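
A minimal sketch of chaining filters, assuming that successive Filter calls narrow the current region (the tutorial's FilterPage(3).FilterWindow(...) chain suggests this, but it is not stated explicitly). The page range, the amount pattern and the class name FilterSketch are hypothetical.

```csharp
using System.Text.RegularExpressions;
using IvyPdf;

static class FilterSketch
{
    // Returns the text of money-like tokens on a page range, then restores the full region.
    public static string AmountsOnPages(PdfParser p)
    {
        string text = p.Reset("AmountsOnPages")            // sequence name, available to error logging
            .FilterPage(2, 3)                               // region = page range 2 to 3
            .FilterRegex(new Regex(@"^\$?[\d,]+\.\d{2}$"))  // keep tokens such as $1,234.56
            .ExtractText();

        p.FilterClear(); // region is the whole document again
        return text;
    }
}
```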

 

Parser has the following properties:

public IvyDocument IvyDocument – reference to the underlying document used by the parser. The following IvyDocument members are also exposed by the Parser object:

    List Tokens – collection of all tokens (including filtered)

    public List Lines – collection of line objects

    public double[] PageSizesX – contains width of each page

    public double[] PageSizesY – contains height of each page

    public void SaveIpb(string filename) – Save current document as Ipb file for quick loading later

    public IvyDocument LoadIpb(string filename) – Load from Ipb file

string SequenceName – current search Sequence, set by a call to Reset(string sequenceName)

List<string> ParsingHistory - history of commands since the last Reset

Token Token - current token (set by one of Find methods or Filter)

public double GetPageWidth(int page) – returns page width

public double GetPageHeight(int page) – returns page height

public int GetPageCount() – returns number of pages in the document

public string Version – returns version of the IvyPdf library

public PdfParser Clone() – returns an exact copy of the PdfParser object that can be used for sub-searches.

 

The following properties are referencing the current Token and can be used as shortcuts
(e.g. you can use Text instead of Token.Text):

public string Text

public string Font

public string DataType

public bool Bold

public bool Italic

public int Page

public float Width

public float Height

public float X

public float Y

public int Index

public double? ToNumber() – auto-convert current token text to a number

public DateTime? ToDate() – auto-convert current token text to a DateTime

 

Extract region text:

string ExtractText(TextPosition textPosition, bool removeBlankLines, bool trimSpaces, double linePixelDeviation)

Parameters:

Text Position:

GeometricCompact - uses token coordinates to combine them into text. Ignores space between tokens.

GeometricSpaced - uses token coordinates to combine them into text, adding spaces between tokens according to their position. (Default option)

TokenOrder - uses order of tokens to prepare text.

removeBlankLines - when set the text won't have any empty lines. Default value = true

trimSpaces - will trim spaces around text. Default value = true

linePixelDeviation - allowed vertical deviation of tokens to belong to the same line. Default value = 5.0

 

string ExtractText() - extract text using default parameters.
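
The sketch below calls ExtractText with explicit parameters instead of the defaults. It assumes that TextPosition is an enum reachable through "using IvyPdf;" whose members are the three options listed above; the class and method names are hypothetical.

```csharp
using IvyPdf;

static class ExtractTextSketch
{
    // Extracts one page in token order, keeping blank lines but trimming spaces.
    public static string PageInTokenOrder(PdfParser p, int page)
    {
        return p.Reset()
            .FindPage(page)
            .FilterCurrentPage()
            .ExtractText(TextPosition.TokenOrder, // use the order of tokens, not their coordinates
                         false,                   // removeBlankLines
                         true,                    // trimSpaces
                         5.0);                    // linePixelDeviation (the documented default)
    }
}
```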

 

Bookmarks:

Bookmarks can be used for repetitive tasks. E.g. find a token “X”, filter and search in the filtered region, then come back to the token “X” and search again.

public PdfParser SetBookmark(string name) – set a bookmark with specified name

public PdfParser GoBookmark(string name) – set current Token to the bookmark

public PdfParser DeleteBookmark(string name) – delete a bookmark

public PdfParser DeleteAllBookmarks() – delete all bookmarks
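
The bookmark workflow described above can be sketched as follows. The search strings, the bookmark name and the FilterOffset distances are placeholders (the units of the offsets are not stated in this section), and the class name BookmarkSketch is hypothetical.

```csharp
using IvyPdf;

static class BookmarkSketch
{
    public static string[] Run(PdfParser p)
    {
        // Find the anchor token and remember it.
        p.Reset().Find("Summary").SetBookmark("summary");

        // Narrow the region to a window around the anchor and read it.
        string block = p.FilterOffset(0, 0, 200, 50).ExtractText();

        // Remove the filter, return to the anchor, then search again from there.
        string next = p.FilterClear().GoBookmark("summary").Find("Total").Right().Text;

        p.DeleteBookmark("summary");
        return new[] { block, next };
    }
}
```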

 

Conditions and Loops:

You can use the If method to add conditional logic, checking either a specific condition or whether an action executes successfully. You can conditionally execute an action or return a value.

PdfParser If(condition, thenAction, [elseAction])

PdfParser If(action, thenAction, [elseAction])

dynamic If(condition, thenValue, elseValue)

dynamic If(action, thenValue, elseValue)

 

Go up or down, depending on a condition:

p.If(myvariable == 42, x=>x.Down(), x=>x.Up());

 

Version without “else”:

p.If(myvariable == 42, x=>x.Down());

 

Check if a word “test” exists, then search something else:

p.If(x=>x.Find("test"), x=>x.Find("word1"), x=>x.Find("word2"));

 

If a word “test” exists return a string:

p.If(x=>x.Find("test"), "Found", "Not Found");

 

Return a string based on condition:

p.If(myVariable == true, "Yes", "No");

 

In a similar way, you can use the While loop to repeat an action while a condition holds, or to repeat code for as long as it succeeds:

PdfParser While(condition, action)

PdfParser While(testAction)

PdfParser While(testAction, action)

 

Move down until a bold token is found:

p.While(x => !x.Bold, x => x.Down());

 

Find right-most token starting from your position:

p.While(x => x.Right());

 

Count number of occurrences of word “test”:

int counter=0; p.While(x=>x.Find("test"), ()=>counter++);

 

You can test for successful code execution using Try method:

bool Try(action)

 

Count number of occurrences of word “test”:

while(p.Try(x=>x.Find("test"))) counter++;

 

Auto-recognize table:

public DataTable Table(PdfTableOptions tableOptions) - Use for tables with column headers.
To start table recognition the current token should be a header token of one of the columns.

 

public DataTable Grid(PdfTableOptions tableOptions) - Use for tables that don't have a header.
Returns a table with generic columns (Field0, Field1, Field2...).
Current token should be in the first row.

 

Parameters:

  • WhiteSpaceLimit – amount of white space that is used to determine the end of table, as ratio of table row height.
    (Default value is 2.5)
  • MaxRowHeight – determine end of table using absolute distance between rows.
    (Default value 0)
  • MultiPage – attempt to find the table on the next page(s). Table should have header row on every page.
    (Default true)
  • IncludeUnmatchedCells – extra cells that do not match to header will be added into a new column.
    (Default true)
  • ColumnBorders – location of every column border, used to position tokens into columns. Starts with the left border and ends with the right border (should contain [number of columns] + 1 values).
    (Default null)
  • HeaderSeparatedByLine – use lines in the table header to determine column border locations.
    (Default false)
  • ColumnsSeparatedByLines – use lines in the table body to assign tokens to columns.
    (Default false)
  • TableCellType – defines whether returned DataTable contains Token objects or String objects. Possible values are TableCellType.Token, TableCellType.String or TableCellType.ParserDefault.

    ParserDefault is using global value in PdfParser.Options.TableCellType.

    In IvyTemplateEditor the settings can be defined on the template level, by adding this code to Template Settings:

    protected override void Init() { p.Options.TableCellType = TableCellType.Token; }

    Using Token objects allows you to get location information from PDFs, but may make coding more complicated. All Token properties will be included in JSON or XML output in IvyTemplateEditor.
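
A minimal sketch of passing options to Table. It assumes that PdfTableOptions has a public parameterless constructor and settable properties with the names listed above; the property types are not stated in this section, so the numeric value is written as an integer literal. The header text and the class name TableOptionsSketch are hypothetical.

```csharp
using System.Data;
using IvyPdf;

static class TableOptionsSketch
{
    // Reads a single-page table whose header contains headerText and returns plain strings.
    public static DataTable ReadSinglePageTable(PdfParser p, string headerText)
    {
        var options = new PdfTableOptions
        {
            WhiteSpaceLimit = 4,                  // end the table at a gap of 4 row heights
            MultiPage = false,                    // do not continue on the next page
            IncludeUnmatchedCells = false,        // no extra column for unmatched cells
            TableCellType = TableCellType.String  // String cells instead of Token objects
        };

        // The current token must be a header token when Table is called.
        return p.Reset().Find(headerText).Table(options);
    }
}
```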

 

DataTable extension methods:

Library includes many extension methods that can be used to join and filter DataTable objects to get the data you need.

DataTable SelectColumns(this DataTable dt, params int[] indexes)

DataTable SelectColumns(this DataTable dt, params string[] names)

DataTable SelectColumns(this DataTable dt, Predicate ColumnPredicate)

DataTable DeleteColumns(this DataTable dt, params int[] indexes)

DataTable DeleteColumns(this DataTable dt, params string[] names)

DataTable DeleteColumns(this DataTable dt, Predicate columnPredicate)

DataTable NameColumns(this DataTable dt, params string[] names)

DataTable AddColumn(this DataTable dt, string name, dynamic value = null)

DataTable SelectRows(this DataTable dt, string RowFilter)

DataTable SelectRows(this DataTable dt, Predicate RowPredicate)

DataTable DeleteRows(this DataTable dt, Predicate rowPredicate)

DataTable DeleteRowsRange(this DataTable dt, Predicate fromRowPredicate, Predicate toRowPredicate)

DataTable DeleteRowsContainingText(this DataTable dt, params string[] text)

DataTable Union(this DataTable First, DataTable Second)

DataTable Union(this DataTable First, DataTable Second, bool UseColumnNames)

DataTable Distinct(this DataTable Table, string Column)

DataTable Join(this DataTable First, DataTable Second, string FJC, string SJC)

DataTable LeftJoin(this DataTable First, DataTable Second, string FJC, string SJC)

DataTable FullOuterJoin(this DataTable First, DataTable Second, string FJC, string SJC)

DataTable Rollup(this DataTable dt, string nonEmptyColumn)

DataTable Rollup(this DataTable dt, bool fromTopToBottom, params int[] nonEmptyColumnIndexes)

DataTable Rollup(this DataTable dt, float minRolledUpLength, int ColumnToRollupIndex )

DataTable Reverse(this DataTable dt)

DataTable GroupBy(this DataTable Table, DataColumn[] Grouping, string[] AggregateExpressions, string[] ExpressionNames, Type[] Types)

DataTable Transpose(this DataTable inputTable)

DataTable TableFromArray(object[,] DataArray)

double GetSum(this DataTable dt, string colNameToSum, string whereClause)

bool HasText(this DataTable dt, string text)

DataTable ToNumber(this DataTable dt, params string[] names)

DataTable ToNumber(this DataTable dt, params int[] indexes)

DataTable ToDate(this DataTable dt, params string[] names)

DataTable ToDate(this DataTable dt, params int[] indexes)

DataTable Update(this DataTable dt, Action action)

DataTable Update(this DataTable dt, Action action, Predicate where)

bool HasEmptyRows(this DataTable dt)

bool HasEmptyColumns(this DataTable dt)

DataTable ReplaceText(this DataTable dt, int columnIndex, string find, string replaceTo)

DataTable ReplaceText(this DataTable dt, string find, string replaceTo)

bool HeaderMatches(this DataTable dt, params string[] names)

DataTable TokenToString(this DataTable dt)
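
The extension methods are typically chained. The sketch below uses only signatures from the list above; the column names and row texts are hypothetical, and the extension methods are assumed to be visible through "using IvyPdf;". The return values are used rather than the original table, because this section does not say whether the methods modify the table in place.

```csharp
using System.Data;
using IvyPdf; // assumed to expose the DataTable extension methods

static class TableCleanupSketch
{
    // Drops subtotal rows, keeps two columns and converts the amounts to numbers.
    public static DataTable Clean(DataTable raw)
    {
        return raw
            .DeleteRowsContainingText("Subtotal")
            .SelectColumns("Description", "Amount")
            .ToNumber("Amount");
    }
}
```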


 

Global options:

IvyOptions class can be used to specify some global options. It has the following properties:

TableCellType TableCellType – specifies default value for PdfParser.Options.TableCellType;

CultureInfo ToNumberCultureInfo – specifies culture settings used by ToNumber() function;

bool ToDateFormatMonthFirst – used by ToDate() function

ToNumberCultureInfo and ToDateFormatMonthFirst are set according to local machine settings by default.

In IvyTemplateEditor the settings can be changed on the template level, by adding this code to Template Settings:

protected override void Init()
{
    IvyOptions.TableCellType = TableCellType.String;
    IvyOptions.ToNumberCultureInfo = new CultureInfo("fr-FR");
    IvyOptions.ToDateFormatMonthFirst = true;
}

 

Examples:

 

Load PDF file:

PdfParser p = new PdfParser(IvyDocumentReader.ReadPdf("mydoc.pdf"));

 

Get text from section "Revenue", on the right of word "Total":

string text = p.Find("Revenue").Find("Total").Right().Text;

 

Extract full text from page 15:

string text = p.FindPage(15).FilterCurrentPage().ExtractText();

 

Extract full text from page containing word "summary":

string text = p.Find("summary").FilterCurrentPage().ExtractText();

 

Find page containing word "summary" and extract text in the upper-left corner:

string text = p.Find("summary").FilterWindow(0, 0, 100, 100).ExtractText();

 

Extract text between words "summary" and "total":

int fromToken = p.Find("summary").Index;

int toToken = p.Find("total").Index;

string text = p.FilterIndex(fromToken, toToken).ExtractText();

 

Extract full document text:

string text = p.ExtractText();

 

Don't forget to call p.Reset() between the calls (if needed).
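
A short sketch of why Reset matters when one parser is reused for several extractions: each search starts from the current token and region, so a second, unrelated extraction should start from a clean state. The file path and search strings are placeholders and the class name ResetSketch is hypothetical.

```csharp
using IvyPdf;

static class ResetSketch
{
    public static string[] Run(string pdfPath)
    {
        PdfParser p = new PdfParser(IvyDocumentReader.ReadPdf(pdfPath));

        // First extraction: moves the current token forward through the document.
        string revenueTotal = p.Find("Revenue").Find("Total").Right().Text;

        // Second extraction: clear any filter and go back to the first token before searching.
        p.Reset("Summary");
        string summaryPage = p.Find("summary").FilterCurrentPage().ExtractText();

        p.Reset(); // leave the parser without a page filter
        return new[] { revenueTotal, summaryPage };
    }
}
```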

 

IvyTemplate and IvyTemplateEditor command-line options:                

 

-e    Extract data and save into Excel, Json or XML:

IvyTemplate.exe -e InputFile OutputFile TemplateLibrary TemplateName|Auto [Parameters]

For the Auto-template selection the template should have a field called AutoTemplateSelectionCriteria, containing logic that returns “true” if the template should be selected.

OutputFile should have extension .xlsx, .json or .xml to determine output format

 

-v    Validate extraction – run the template, but do not save the results:

IvyTemplate.exe -v InputFile TemplateLibrary TemplateName [Parameters]

 

-i    Convert a Pdf file to Ipb:

IvyTemplate.exe -i InputFile OutputFile

 

-o    Open for preview in GUI mode:

IvyTemplateEditor.exe -o InputFile TemplateLibrary

 

Returned %ERRORLEVEL% values:

0 - success

1 - error

2 - template validation failed
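
If you call the command-line tool from .Net code, the exit codes above can be checked as in the following sketch. It uses only standard System.Diagnostics calls; the class name IvyTemplateRunner, the method names and the way the arguments are quoted are assumptions, and optional [Parameters] are left out.

```csharp
using System;
using System.Diagnostics;

static class IvyTemplateRunner
{
    // Maps the documented exit codes to readable messages.
    public static string Describe(int exitCode)
    {
        switch (exitCode)
        {
            case 0: return "success";
            case 1: return "error";
            case 2: return "template validation failed";
            default: return "unknown exit code " + exitCode;
        }
    }

    // Runs the -v (validate) command and returns the process exit code.
    public static int Validate(string exePath, string inputFile, string templateLibrary, string templateName)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = exePath, // path to IvyTemplate.exe
            Arguments = string.Format("-v \"{0}\" \"{1}\" \"{2}\"", inputFile, templateLibrary, templateName),
            UseShellExecute = false
        };

        using (Process process = Process.Start(startInfo))
        {
            process.WaitForExit();
            return process.ExitCode;
        }
    }
}
```

For example, Console.WriteLine(IvyTemplateRunner.Describe(IvyTemplateRunner.Validate(exe, pdf, library, "Template1"))) would print the outcome of a validation run.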

 

 

Example using TemplateLib from .NET code.

Requires .NET Framework 4.5.

You need to add references to the following DLLs:

IvyPdf.dll – to use Parser functionality

IvyDocumentReader.dll – to read PDF files

IvyTemplateLib.dll – to use templates created via IvyTemplateEditor.

using IvyPdf;
using IvyTemplateLib;

//Open template library file
TemplateLibrary tl = TemplateLibrary.LoadTemplateLibrary("sample_library.tl");
tl.StaticCodeMode = true; //To prevent recompile on every run


//Open PDF file
PdfParser p = new PdfParser(IvyDocumentReader.ReadPdf("sample_document.pdf"));

//Run a template and get the results
List results;
tl.RunTemplate("Template1", p, out results);

 

Tutorial

PDF documents can be tricky. They range from simple and clean reports to extremely convoluted ones, with random artifacts and structural errors.

The task of extracting specific values may be daunting, especially if you need to do it for a large number of documents on multiple occasions.

A computer can help with repetitive tasks, but it's up to you to define the parameters and the kind of information you are looking for.

Every document is different; however, there are some common scenarios, so let's try to break them down.

1. The data you need is always in the same exact location.

This is pretty common for various receipts, financial statements and so on. And it's easy to filter out:

First, get the page you need, e.g. page 3:

p.FilterPage(3)

Then filter for exact location:

.FilterWindow(10,10,50,20)

(Use Filter button in the template editor to create a window, then move it to the required location to get coordinates)

This will get you the first token in the selected "window". If you want all text that fits in there - just add .ExtractText()
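
Put together, the two variants described above might look like the sketch below. The page number and coordinates are the sample values from this scenario; the class and method names are hypothetical, and p is assumed to be the parser variable, as in the Template Settings example.

```csharp
using IvyPdf;

static class FixedLocationSketch
{
    // First token inside the window on page 3.
    public static string FirstToken(PdfParser p)
    {
        return p.Reset().FilterPage(3).FilterWindow(10, 10, 50, 20).Text;
    }

    // All text that fits inside the same window.
    public static string AllText(PdfParser p)
    {
        return p.Reset().FilterPage(3).FilterWindow(10, 10, 50, 20).ExtractText();
    }
}
```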

2. You need data that follows a specific word.

It may move to different places in the document, so exact position is not known. Let's say you want to get a number that follows a word "Total":

p.Find("Total").Right().Text

That was easy, wasn't it? Please be aware that Find uses case-sensitive search and it will find any token that contains your string. You can also search for occurrence of multiple strings, like this:

Find("Total", "total")

You probably expect to find a number there, but it may have extra characters like dollar signs, commas, percentage symbols. Simply use .ToNumber() function to clean it up. Another handy function is ToDate(). It can recognize dates in most formats, even surrounded by other text.
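
A small sketch of the two conversions. Both return nullable values according to the API reference, so the sketch checks HasValue before using the result; it assumes, without confirmation in this section, that an empty result means the text could not be converted. The search strings and the class name ConversionSketch are hypothetical.

```csharp
using System;
using IvyPdf;

static class ConversionSketch
{
    public static void Run(PdfParser p)
    {
        // Token to the right of "Total" (either spelling), cleaned up into a number.
        double? total = p.Reset().Find("Total", "total").Right().ToNumber();

        // Token to the right of "Date", parsed into a DateTime.
        DateTime? issued = p.Reset().Find("Date").Right().ToDate();

        Console.WriteLine(total.HasValue ? total.Value.ToString() : "(no number)");
        Console.WriteLine(issued.HasValue ? issued.Value.ToShortDateString() : "(no date)");
    }
}
```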

Hint: You don't necessarily need to type the search text yourself. In the TemplateEditor right-click the token you want to get and click "Suggest". You may need to clean up some code that is generated (the logic there is to search for a specific section, which would usually have larger font, then subsections, then the word next to the token you need. Some of these steps may be omitted in your case)

3. Tables

Finally, we got to an interesting part. There is a table in PDF and it's a tedious job to get it out manually, so let's see what we can do. Tables can quickly become tricky. There's no way to deal with every table out there, since there are way too many variations. Let's start with a simple one first.

A. Simple, rectangular table with a header.

First you need to find a header. Right-click any header token and click "Suggest" - you should get a logic that brings you to that token. Let's say this:

p.Find("Field1")

Now just add .Table() and preview the results. In many cases this works right away and is really that easy.

By default, IvyPdf does not use any graphical objects, such as the lines that surround table cells. Instead, the table is built using positions of the text tokens relative to each other. Because of this, some cells may get shifted to the wrong column or row, and you may get some unwanted data. You can try to fix this afterwards using table extension methods like Rollup, DeleteRows, DeleteColumns and so on.

B. Table without header.

Just find any token in the top row (for example, it may be Below some specific text), then use the Grid() function.

C. Table with subtotals between rows.

You have two options: you can get the whole table, then delete specific rows using DeleteRows method, or you can filter out unwanted tokens first. Let's assume the subtotals are in bold:

p.Filter(x=>!x.Bold).Find("header1").Table()

D. Table with sub-headings that you want to add as a data column.

This one is tricky. You would need some sort of loop for this. First create an empty DataTable object. Then find a sub-header and store it into a variable. Then move Below and call Grid(). Union the results with the empty table. Add a new column, providing the stored header as a default value. Repeat. It may need a lot of tweaking to make this work, but you have a few helpful tools at your disposal:

  • Bookmarks are handy when you need to go back and forth, especially if you applied a filter and need to return to a previous position.
  • Conditional If and While loop - you can write regular C# code instead, but these two provide a concise syntax that can make your code easier to read. (Just don't forget to add comments.)
  • Table and Grid functions have one very important option: WhiteSpaceLimit. You have already pointed to the table header, but how does the parser know where the table ends? Usually there is a gap between the end of the table and the following text that is larger than the spacing between rows. The default limit is 2.5, which means the gap should be 2.5 times bigger than the average row height. You can change that number according to your needs.
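
The loop described in scenario D can be sketched as follows. This is a starting point rather than a finished solution: it assumes that each sub-heading is a single bold token, that nothing else in the region is bold, that Grid() can be called without arguments (as the tutorial text suggests) and that Try leaves the parser on the found token when it succeeds (as the counting example in the API reference implies). The column name "Section" and the class name SubHeadingSketch are hypothetical.

```csharp
using System.Data;
using IvyPdf; // assumed to expose the DataTable extension methods

static class SubHeadingSketch
{
    public static DataTable Collect(PdfParser p)
    {
        DataTable result = new DataTable();       // start with an empty table
        p.Reset("SubHeadings");

        // Try reports whether the search succeeded, so the loop ends after the last sub-heading.
        while (p.Try(x => x.Find(t => t.Bold)))
        {
            string heading = p.Text;              // store the sub-heading text
            p.SetBookmark("heading");             // remember the sub-heading position

            DataTable block = p.Below().Grid()    // header-less rows under the sub-heading
                .AddColumn("Section", heading);   // stored heading as the default column value
            result = result.Union(block);

            p.GoBookmark("heading");              // continue the search from this sub-heading
        }

        return result;
    }
}
```

The bookmark is there because this section does not say where the current token ends up after Grid(); going back to the sub-heading keeps the next search predictable.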

4. Various collections.

Let's say you want to find all telephone numbers in the document. First, Filter by a regular expression, then use resulting Tokens collection - loop or convert to a DataTable. IvyPdf is using DataTables extensively. We prefer DataTables over other collections for their flexibility. We extended their functionality, so you can use Join, Union, and many other handy functions. However, you can create any collections you like, use Linq, add your own extensions and so on.
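
As an alternative to filtering and then reading the Tokens collection, the sketch below collects matches with the Try and Find(Regex) loop shown in the API reference. The phone pattern is a simplified North American format chosen for illustration, it only works if a phone number is contained in a single token, and the class name PhoneNumberSketch is hypothetical.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;
using IvyPdf;

static class PhoneNumberSketch
{
    // Simplified pattern, e.g. 555-123-4567 or (555) 123-4567.
    static readonly Regex PhonePattern = new Regex(@"\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}");

    public static List<string> Collect(PdfParser p)
    {
        var numbers = new List<string>();
        p.Reset("PhoneNumbers");

        // Each successful Try moves to the next matching token.
        while (p.Try(x => x.Find(PhonePattern)))
        {
            numbers.Add(p.Text);
        }

        return numbers;
    }
}
```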

5. Connect to a database, read a text file, get data from a webservice.

In the "Template Library Settings" you can add as many "Modules" as you want. The modules are simply C# classes that you can call from your expressions. In addition, you can reference any .Net assembly and use its methods.

Quick Hints on Ivy Template Editor (in no particular order)

Frequently Asked Questions

General

PDF Parsing