This extension enables your logic to extract data from PDF documents.
toPdf
Opens an input as a PDF document
input | toPdf (options)
- options: Object providing a set of options
  - textExtractionMethod: When not specified, the default is instructions
    - instructions: Extract text as the raw PDF instructions
    - words: Extract text by using the instructions and creating words
- input: Path as string, loaded bytes or Stream
- returns: a PDF object
Example:
pdf = file | toPdf ({textExtractionMethod: 'words'})
pdfIndexOf
Finds the list of page indices that match a pattern
pdf | pdfIndexOf (pattern, skipCount)
- pdf: PDF document object
- skipCount: number of pages to skip (default is 0)
- returns: a list of page indices that match the pattern, e.g. [17, 19, 34]
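As a rough illustration of the result shape, the Python sketch below mimics the lookup over a list of page texts. It is not the extension's implementation: the function name pdf_index_of, the use of regular-expression search, 0-based indexing, and the reading of skipCount as "ignore the first N pages" are all assumptions made for the example.

```python
import re


def pdf_index_of(page_texts, pattern, skip_count=0):
    """Return indices of pages whose text matches `pattern`.

    Assumptions (not confirmed by the extension's documentation):
    - indices are 0-based
    - `skip_count` ignores that many pages from the start of the document
    """
    regex = re.compile(pattern)
    return [
        index
        for index, text in enumerate(page_texts)
        if index >= skip_count and regex.search(text)
    ]


# Hypothetical page texts standing in for an opened PDF document.
pages = [
    "Cover",
    "TAX INVOICE Page 1 of 2",
    "Page 2 of 2",
    "TAX INVOICE Page 1 of 1",
]

print(pdf_index_of(pages, r"Page\s1\sof\s\d+"))     # [1, 3]
print(pdf_index_of(pages, r"Page\s1\sof\s\d+", 2))  # [3]
```

The returned list has the same shape as the pages argument that pdfExtract accepts.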
pdfExtract
Extracts one or more pages from a PDF document
pdf | pdfExtract (pages, outputFile)
- pages: array of page indices (0-based)
- outputFile: path of the file to extract to
- returns: full file name of the output file
pdfSplit
Splits a PDF document based on a matching pattern
pdf | pdfSplit (pattern, outputFilePattern, skipCount)
- pattern: a regular expression pattern
- outputFilePattern: Allows dynamically building a set of file paths for the split documents.
  - Single string value: In the path, {$GroupName} can be used to reference a matched value of the pattern
  - Array of strings: Must match the number of documents that are being split
- skipCount: number of pages to skip (default is 0)
- returns: full file paths of the split documents.
Example: Splitting a document into 3 files that are stored at the provided locations
list = pdf|pdfSplit('(ATTACHMENT\\sTO\\sTAX)(\\n*\\s*INVOICE)(\\n*\\s*Page\\s1\\sof\\s(\\d*))(\\n*\\s*.*)(\\n*\\s*.*)(\\n*\\s*.*)(\\n*\\s*Matter\\n*\\s*(?<matter>\\d*))', ['D:/1.pdf', 'D:/2.pdf', 'D:/3.pdf'])
Example: Splitting a document with dynamic file names based on the pattern
list = pdf|pdfSplit('(ATTACHMENT\\sTO\\sTAX)(\\n*\\s*INVOICE)(\\n*\\s*Page\\s1\\sof\\s(\\d*))(\\n*\\s*.*)(\\n*\\s*.*)(\\n*\\s*.*)(\\n*\\s*Matter\\n*\\s*(?<matter>\\d*))','Matter-{$matter}.pdf')
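To make the {$GroupName} substitution concrete, here is a minimal Python sketch of how a placeholder such as {$matter} could be resolved from a named group. The helper expand_output_name, the shortened pattern, and the sample page text are hypothetical and only illustrate the idea. The extension's actual resolution logic may differ.

```python
import re

# Matches placeholders of the form {$GroupName}.
PLACEHOLDER = re.compile(r"\{\$(\w+)\}")


def expand_output_name(template, match):
    """Replace each {$GroupName} in `template` with that named group's value."""
    return PLACEHOLDER.sub(lambda m: match.group(m.group(1)), template)


# Python spells named groups (?P<name>...), whereas the pattern in the
# examples above uses the (?<name>...) form.
pattern = re.compile(r"Matter\s*(?P<matter>\d+)")

# Hypothetical text of the first page of one split document.
page_text = "ATTACHMENT TO TAX INVOICE\nPage 1 of 3\nMatter 48213"

match = pattern.search(page_text)
print(expand_output_name("Matter-{$matter}.pdf", match))  # Matter-48213.pdf
```

With a single string value, each split document gets its own name as long as the named group captures a different value per match.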