Open PDF
Note: Users must procure and maintain the applicable open source tools to integrate this DE tool with the Istari Digital platform. Please contact your local IT administrator for assistance.
Supported Functions:
Getting Started
The Open PDF integration allows users to extract data from .pdf files.
Methods to Link to Istari Digital Platform
Upload: Yes
Link: No
Files Supported
The Istari Digital Platform can extract data from the following file types:
.pdf
Example Files
Download Example Document: example_document.pdf
Setup for Administrators
Ensure that the Istari Digital Agent and the appropriate Istari Digital Software are installed on the machine.
Version Compatibility
This software is intended to run in a Windows environment. It was tested on a Windows 11 machine.
Function Coverage and Outputs
The Open PDF software can produce a number of artifacts extracted from a PDF document. The table below lists each extraction route, the type of artifact it produces, and whether it is covered.
Route | Coverage | Artifact Content Example |
---|---|---|
Extract all text - TXT | Yes | |
Extract text sections - JSON | Yes | |
Extract JSON sections - JSON | Yes | |
Extract document metadata - JSON | Yes | |
Extract separate pages - PNG | Yes | |
Extract embedded images - PNG/JPEG | Yes | |
Extract separate pages - PDF | Yes | |
Extract document - HTML | Yes | |
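Once artifacts have been saved to disk (see Step 3 under Detailed SDK Reference), it can be convenient to sort them by the output types listed in the table above. The following is a minimal sketch using only the Python standard library. The function name group_by_type and the sample file names are hypothetical and are not part of the Istari Digital SDK, and the "jpg" extension is an assumption, since the table only says "JPEG".

```python
from collections import defaultdict
from pathlib import PurePath

# Extensions taken from the routes table above.
# "jpg" is an assumed alternative spelling of the JPEG extension.
KNOWN_EXTENSIONS = {"txt", "json", "png", "jpeg", "jpg", "pdf", "html"}


def group_by_type(file_names):
    """Group file names by lowercase extension; unknown types go under 'other'."""
    groups = defaultdict(list)
    for name in file_names:
        ext = PurePath(name).suffix.lstrip(".").lower()
        key = ext if ext in KNOWN_EXTENSIONS else "other"
        groups[key].append(name)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    sample = ["all_text.txt", "metadata.json", "page_1.PNG", "notes.docx"]
    print(group_by_type(sample))
```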
Detailed SDK Reference
Prerequisite: Install Istari Digital SDK and initialize Istari Digital Client per instructions here
Step 1: Upload and Extract the File(s)
Upload the file as a model
model = client.add_model(
    path="example_document.pdf",
    description="Open PDF example Model",
    display_name="Open PDF Model Name",
)
print(f"Uploaded base model with ID {model.id}")
Extract once you have the model ID
extraction_job = client.add_job(
    model_id=model.id,
    function="@istari:extract",
    tool_name="open_pdf",
    tool_version="1.0.0",
    operating_system="Windows Server 2019",
)
print(f"Extraction started for model ID {model.id}, job ID: {extraction_job.id}")
The call above is only an example of how to invoke the function. Choose the tool_name, tool_version, and operating_system values that match your installation of this software.
Step 2: Check the Job Status
extraction_job.poll_job()
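This page does not describe whether poll_job() blocks or what it returns. If a script needs an upper bound on how long it waits, a generic timeout loop such as the sketch below may help. The names wait_for and check_done are hypothetical and not part of the Istari Digital SDK; how to decide that a job has finished depends on the SDK, so that check is left as a callable you supply.

```python
import time


def wait_for(check_done, timeout_s=300.0, interval_s=5.0,
             sleep=time.sleep, clock=time.monotonic):
    """Call check_done() until it returns True or timeout_s elapses.

    Returns True if check_done() reported completion, False on timeout.
    The sleep and clock parameters exist only to make the loop easy to test.
    """
    deadline = clock() + timeout_s
    while True:
        if check_done():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A caller would pass a zero-argument function that performs the status check appropriate to their SDK version and returns True once the job is complete.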
Step 3: Retrieve Results
Example
for artifact in model.artifacts:
    output_file_path = f"c:\\extracts\\{artifact.name}"
    if artifact.extension in ["txt", "csv", "md", "json", "html"]:
        with open(output_file_path, "w") as f:
            f.write(artifact.read_text())
    else:
        with open(output_file_path, "wb") as f:
            f.write(artifact.read_bytes())
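The example above assumes that the output folder already exists. The sketch below shows one way to wrap the same text-versus-binary decision in a helper that creates the folder first. StandInArtifact is a hypothetical stand-in that exposes only the attributes used in the example above (name, extension, read_text, read_bytes) so the snippet can run without the SDK. The helper name save_artifact and the choice of UTF-8 encoding are also assumptions.

```python
from dataclasses import dataclass
from pathlib import Path

# Same text-like extensions as in the example above.
TEXT_EXTENSIONS = {"txt", "csv", "md", "json", "html"}


@dataclass
class StandInArtifact:
    """Hypothetical stand-in for an SDK artifact, for illustration only."""
    name: str
    extension: str
    data: bytes

    def read_text(self) -> str:
        return self.data.decode("utf-8")

    def read_bytes(self) -> bytes:
        return self.data


def save_artifact(artifact, out_dir):
    """Write one artifact into out_dir and return the path written."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    target = out_path / artifact.name
    if artifact.extension.lower() in TEXT_EXTENSIONS:
        # UTF-8 is an assumption; adjust if your artifacts use another encoding.
        target.write_text(artifact.read_text(), encoding="utf-8")
    else:
        target.write_bytes(artifact.read_bytes())
    return target
```

With real artifacts, the loop would become `for artifact in model.artifacts: save_artifact(artifact, "c:\\extracts")`, provided the artifact objects behave as in the example above.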
Troubleshooting
- For general Agent and Software Troubleshooting Click Here