Microsoft Word

Note: Users must procure and maintain valid licenses to integrate this commercial DE tool with the Istari Digital platform. Please contact your local IT administrator for assistance.

Supported Functions:

extract

Getting Started

The Word integration supports Microsoft Word from Office 2019 and Office 2021 and lets users extract data from .docx files.

Upload: Yes

Files Supported

The Istari Digital platform can extract from the following file types: .docx

All other Word file types are not supported at this time. Please submit a feature request if an important file type is not supported.
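Because only .docx files are accepted, it can be worth checking the extension before uploading. The following is a minimal sketch; `is_supported_word_file` and `SUPPORTED_EXTENSIONS` are hypothetical names used for illustration and are not part of the Istari Digital SDK.

```python
from pathlib import Path

# Only .docx is listed as supported on this page.
SUPPORTED_EXTENSIONS = {".docx"}


def is_supported_word_file(path: str) -> bool:
    """Return True if the file name ends in a supported extension (case-insensitive)."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


print(is_supported_word_file("example_document.docx"))  # True
print(is_supported_word_file("report.docm"))            # False
```

The check looks only at the file name, so it does not confirm that the file is a valid Word document.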

Example Files

Download Example Documents: example_documents.docx

Setup for Administrators

Ensure that the Istari Digital Agent and the appropriate Istari Digital software are installed on the machine.

Version Compatibility

This software was tested with Microsoft Office 2019 and Office 2021, and is intended to run in a Windows environment due to reliance on the Word interop assembly.
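Since the integration is intended for Windows, an administrator may want a quick check on the machine that hosts the Agent. This is only a sketch using Python's standard `platform` module; `is_windows_host` is a hypothetical helper, not something the Agent provides.

```python
import platform


def is_windows_host(system_name=None):
    """Return True when the given (or current) OS name is Windows."""
    name = platform.system() if system_name is None else system_name
    return name == "Windows"


if not is_windows_host():
    print("Warning: the Word integration is intended to run on Windows.")
```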

Function Coverage and Outputs

The Word software can produce a number of artifacts extracted from the Word document. The table below describes each output artifact and its type.

Route                                        Coverage   Artifact Content Example
Extract all text - TXT                       Yes
Extract paragraphs - TXT                     Yes
Extract figures and tables - JPEG/PNG/PDF    Yes
Extract tables - HTML                        Yes

Detailed SDK Reference

Prerequisite: Install the Istari Digital SDK and initialize the Istari Digital Client per the instructions here.

Step 1: Upload and Extract the File(s)

Upload the file as a model

model = client.add_model(
    path="example_document.docx",
    description="Word example Model",
    display_name="Word Model Name",
)
print(f"Uploaded base model with ID {model.id}")

Extract once you have the model ID

extraction_job = client.add_job(
    model_id=model.id,
    function="@istari:extract",
    tool_name="microsoft_office_word",
    tool_version="2019",
    operating_system="Windows Server 2019",
)
print(f"Extraction started for model ID {model.id}, job ID: {extraction_job.id}")

The call above is only an example. Choose the tool_name, tool_version, and operating_system values that match your installation of this software.
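One way to keep these installation-specific values in one place is to build the keyword arguments separately and pass them to `client.add_job`. This is a sketch, not SDK functionality: `build_extract_job_args` is a hypothetical helper, and the `"2021"` version string in the usage line is an assumption. Use the values that apply to your installation.

```python
def build_extract_job_args(model_id, tool_version="2019",
                           operating_system="Windows Server 2019"):
    """Collect the keyword arguments used in the add_job example above."""
    return {
        "model_id": model_id,
        "function": "@istari:extract",
        "tool_name": "microsoft_office_word",
        "tool_version": tool_version,          # must match the installed Office version
        "operating_system": operating_system,  # must match the Agent machine
    }


# "2021" is an assumed value; confirm the exact string for your installation.
job_args = build_extract_job_args("your-model-id", tool_version="2021")
print(job_args["tool_name"])  # microsoft_office_word
# extraction_job = client.add_job(**job_args)
```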

Step 2: Check the Job Status

extraction_job.poll_job()

Step 3: Retrieve Results

Example

for artifact in model.artifacts:
    output_file_path = f"c:\\extracts\\{artifact.name}"

    if artifact.extension in ["txt", "csv", "md", "json", "html"]:
        with open(output_file_path, "w") as f:
            f.write(artifact.read_text())
    else:
        with open(output_file_path, "wb") as f:
            f.write(artifact.read_bytes())
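The example above assumes that the output folder already exists. A slightly more defensive variant, sketched below, creates the folder first and writes text artifacts as UTF-8. `save_artifact` and `TEXT_EXTENSIONS` are hypothetical names; the two reader arguments stand in for the `artifact.read_text` and `artifact.read_bytes` methods used above, and the UTF-8 encoding is an assumption.

```python
import tempfile
from pathlib import Path

# Extensions the example above treats as text; everything else is written as bytes.
TEXT_EXTENSIONS = {"txt", "csv", "md", "json", "html"}


def save_artifact(out_dir, name, extension, read_text, read_bytes):
    """Write one artifact into out_dir and return the path that was written.

    read_text and read_bytes are zero-argument callables, for example
    artifact.read_text and artifact.read_bytes.
    """
    out_path = Path(out_dir) / name
    out_path.parent.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    if extension.lower() in TEXT_EXTENSIONS:
        out_path.write_text(read_text(), encoding="utf-8")
    else:
        out_path.write_bytes(read_bytes())
    return out_path


# Self-contained demonstration with stand-in readers instead of a real artifact.
with tempfile.TemporaryDirectory() as tmp:
    path = save_artifact(tmp, "all_text.txt", "txt", lambda: "hello", lambda: b"")
    print(path.read_text(encoding="utf-8"))  # hello
```

With a real model, the loop body would become `save_artifact("c:\\extracts", artifact.name, artifact.extension, artifact.read_text, artifact.read_bytes)`.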

Troubleshooting

  1. For general Agent and software troubleshooting, click here.
  2. If you experience errors while extracting data from a .docx file, check that the file opens successfully in Microsoft Word 2019 or 2021.

FAQ

  • Are macro-enabled documents supported? No, macro-enabled documents are not supported.