A JavaScript/TypeScript library designed to extract text tables from PDF files efficiently.
The `PdfDocument` constructor accepts the following configuration options:

| Option | Type | Default | Description |
|---|---|---|---|
| `hasTitles` | `boolean` | `true` | Indicates whether tables have title rows. |
| `threshold` | `number` | `1.5` | Sensitivity for grouping rows by y-axis. |
| `maxStrLength` | `number` | `30` | Maximum string length for table cells. |
| `ignoreTexts` | `string[]` | `[]` | Array of texts to ignore during extraction. |
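
As a rough illustration of how these defaults might combine with user-supplied values, the sketch below types the four options from the table and merges a partial override onto the defaults. The names `PdfDocumentOptions`, `DEFAULT_OPTIONS`, and `resolveOptions` are hypothetical and are not part of the library; only the option names, types, and default values come from the table above.

```typescript
// Hypothetical type mirroring the documented constructor options.
interface PdfDocumentOptions {
  hasTitles: boolean;
  threshold: number;
  maxStrLength: number;
  ignoreTexts: string[];
}

// Default values as listed in the options table.
const DEFAULT_OPTIONS: PdfDocumentOptions = {
  hasTitles: true,
  threshold: 1.5,
  maxStrLength: 30,
  ignoreTexts: [],
};

// Hypothetical helper: fills in any option the caller leaves out.
function resolveOptions(
  overrides: Partial<PdfDocumentOptions> = {},
): PdfDocumentOptions {
  return {
    ...DEFAULT_OPTIONS,
    ...overrides,
    // Copy the array so callers never share (and mutate) the default list.
    ignoreTexts: [...(overrides.ignoreTexts ?? DEFAULT_OPTIONS.ignoreTexts)],
  };
}

// Override two options; threshold and maxStrLength keep their defaults.
const options = resolveOptions({ hasTitles: false, ignoreTexts: ["Page"] });
```

An object shaped like `options` could then be handed to the `PdfDocument` constructor, assuming it takes a single options object.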
`PdfDocument`

- `numPages`: Number of pages in the PDF document.
- `pages`: Array of parsed pages, each containing:
  - `pageNumber`: Page number in the PDF.
  - `tables`: Array of extracted tables.
- `load(source: string | Buffer): Promise<void>`: Loads and processes the PDF file.

`PdfTable`

- `tableNumber`: Identifier for the table.
- `numrows`: Number of rows in the table.
- `numcols`: Number of columns in the table.
- `data`: 2D array representing table data.
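
The shape described above can be sketched as TypeScript types, together with a small helper that turns one table into keyed records. The interface names `PdfTableLike` and `PdfPageLike` and the helpers `tableToRecords` and `countTables` are hypothetical; only the property names come from the list above. The sketch also assumes that `tableNumber` is a number, that cells are strings, and that the first row of `data` holds the column titles (the `hasTitles: true` case).

```typescript
// Hypothetical types mirroring the documented properties.
interface PdfTableLike {
  tableNumber: number; // assumed numeric identifier
  numrows: number;
  numcols: number;
  data: string[][]; // assumed: every cell is a string
}

interface PdfPageLike {
  pageNumber: number;
  tables: PdfTableLike[];
}

// Treats the first row as titles and maps each later row to an object.
function tableToRecords(table: PdfTableLike): Record<string, string>[] {
  const [titles, ...rows] = table.data;
  if (titles === undefined) return []; // empty table
  return rows.map((row) => {
    const record: Record<string, string> = {};
    titles.forEach((title, col) => {
      record[title] = row[col] ?? ""; // pad short rows with empty strings
    });
    return record;
  });
}

// Counts the tables found across all parsed pages.
function countTables(pages: PdfPageLike[]): number {
  return pages.reduce((total, page) => total + page.tables.length, 0);
}

// Hand-written sample standing in for the `pages` array of a loaded document.
const sampleTable: PdfTableLike = {
  tableNumber: 1,
  numrows: 3,
  numcols: 2,
  data: [
    ["Item", "Qty"],
    ["Apples", "3"],
    ["Pears", "5"],
  ],
};
const samplePages: PdfPageLike[] = [{ pageNumber: 1, tables: [sampleTable] }];

const records = tableToRecords(sampleTable);
const totalTables = countTables(samplePages);
```

In real use, the `pages` array of a `PdfDocument` would take the place of `samplePages` once `load` has resolved.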