Open PDF

Note: Users must procure and maintain the applicable open source tools to integrate this DE tool with the Istari Digital platform. Please contact your local IT administrator for assistance.

Supported Functions:

extract

Getting Started

The Open PDF integration allows users to extract data from .pdf files.

Upload: Yes

Files Supported

The Istari Digital Platform can extract data from the following file types: .pdf

Example Files

Download Example Document: example_document.pdf

Setup for Administrators

Ensure that the Istari Digital Agent and the appropriate Istari Digital software are installed on the machine.

Version Compatibility

This software is intended to run in a Windows environment. It was tested on a Windows 11 machine.

Function Coverage and Outputs

The Open PDF integration can produce several artifacts extracted from a PDF document. The table below lists each output artifact, its type, and whether it is covered.

Route | Coverage | Artifact Content Example
Extract all text - TXT | Yes |
Extract text sections - JSON | Yes |
Extract JSON sections - JSON | Yes |
Extract document metadata - JSON | Yes |
Extract separate pages - PNG | Yes |
Extract embedded images - PNG/JPEG | Yes |
Extract separate pages - PDF | Yes |
Extract document - HTML | Yes |
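
Because one extraction yields artifacts of several types (TXT, JSON, PNG/JPEG, PDF, HTML), it can help to group them by extension before processing. The following is a minimal sketch that uses only the Python standard library. The file names are hypothetical placeholders, not names the platform is known to produce, and `group_by_extension` is an illustrative helper, not part of the Istari Digital SDK.

```python
from collections import defaultdict
from pathlib import PurePath


def group_by_extension(names):
    """Group file names by lowercase extension (without the leading dot)."""
    groups = defaultdict(list)
    for name in names:
        ext = PurePath(name).suffix.lower().lstrip(".")
        groups[ext].append(name)
    return dict(groups)


# Hypothetical artifact names, for illustration only.
example_names = [
    "all_text.txt",
    "metadata.json",
    "page_1.png",
    "page_1.pdf",
    "document.html",
]
grouped = group_by_extension(example_names)
print(grouped)
```

With real results, the same helper could be fed the `artifact.name` values from the retrieval step.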

Detailed SDK Reference

Prerequisite: Install Istari Digital SDK and initialize Istari Digital Client per instructions here

Step 1: Upload and Extract the File(s)

Upload the file as a model

model = client.add_model(
    path="example_document.pdf",
    description="Open PDF example Model",
    display_name="Open PDF Model Name",
)
print(f"Uploaded base model with ID {model.id}")

Extract once you have the model ID

extraction_job = client.add_job(
    model_id=model.id,
    function="@istari:extract",
    tool_name="open_pdf",
    tool_version="1.0.0",
    operating_system="Windows Server 2019",
)
print(f"Extraction started for model ID {model.id}, job ID: {extraction_job.id}")

The call above is only an example. Choose the tool_name, tool_version, and operating_system values that match your installation of this software.
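
Since tool_version and operating_system vary by installation, one option is to keep them outside the script. The sketch below builds the keyword arguments for the job from environment variables and falls back to the example values shown above. The helper `build_extract_job_kwargs`, the variable names `OPEN_PDF_TOOL_VERSION` and `OPEN_PDF_OPERATING_SYSTEM`, and the model ID `"model-123"` are all hypothetical; only the argument names come from the example call.

```python
import os


def build_extract_job_kwargs(model_id, env=None):
    """Build keyword arguments for an extract job.

    The environment variable names are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    return {
        "model_id": model_id,
        "function": "@istari:extract",
        "tool_name": "open_pdf",
        "tool_version": env.get("OPEN_PDF_TOOL_VERSION", "1.0.0"),
        "operating_system": env.get(
            "OPEN_PDF_OPERATING_SYSTEM", "Windows Server 2019"
        ),
    }


# "model-123" is a placeholder; in practice this would be model.id.
job_kwargs = build_extract_job_kwargs(
    "model-123", env={"OPEN_PDF_OPERATING_SYSTEM": "Windows 11"}
)
print(job_kwargs)
# With an initialized client: extraction_job = client.add_job(**job_kwargs)
```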

Step 2: Check the Job Status

extraction_job.poll_job()

Step 3: Retrieve Results

Example

for artifact in model.artifacts:
    # The c:\extracts directory must already exist.
    output_file_path = f"c:\\extracts\\{artifact.name}"

    # Text-based artifacts are written as text; everything else as raw bytes.
    if artifact.extension in ["txt", "csv", "md", "json", "html"]:
        with open(output_file_path, "w") as f:
            f.write(artifact.read_text())
    else:
        with open(output_file_path, "wb") as f:
            f.write(artifact.read_bytes())
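
The same text-versus-binary decision can be wrapped in a reusable helper that also creates the output directory. This is a minimal sketch, assuming artifacts expose `name`, `extension`, `read_text()`, and `read_bytes()` as in the example above. `FakeArtifact` is a hypothetical stand-in so the snippet runs without the SDK, and `save_artifact` is an illustrative helper, not an SDK function.

```python
import tempfile
from pathlib import Path

TEXT_EXTENSIONS = {"txt", "csv", "md", "json", "html"}


class FakeArtifact:
    """Hypothetical stand-in for an SDK artifact; not part of the Istari SDK."""

    def __init__(self, name, data):
        self.name = name
        self.extension = Path(name).suffix.lstrip(".")
        self._data = data

    def read_bytes(self):
        return self._data

    def read_text(self):
        return self._data.decode("utf-8")


def save_artifact(artifact, out_dir):
    """Write one artifact to out_dir, as text or bytes depending on its extension."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / artifact.name
    if artifact.extension.lower() in TEXT_EXTENSIONS:
        target.write_text(artifact.read_text(), encoding="utf-8")
    else:
        target.write_bytes(artifact.read_bytes())
    return target


# Demonstration with stand-in artifacts in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    text_path = save_artifact(FakeArtifact("all_text.txt", b"hello"), tmp)
    image_path = save_artifact(FakeArtifact("page_1.png", b"\x89PNG"), tmp)
    saved_text = text_path.read_text(encoding="utf-8")
    saved_bytes = image_path.read_bytes()
```

With real results, the loop would become `for artifact in model.artifacts: save_artifact(artifact, "c:\\extracts")`, under the same assumptions.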

Troubleshooting

  1. For general Agent and Software Troubleshooting Click Here

FAQ