Version: 2025.10

Open Text Extraction

Summary

The Open Text extraction integration provides secure text extraction from multiple document and data formats using direct format parsers. Extract text content from 10 file formats including documents, web files, and tabular data.

Supported File Types: .csv, .docx, .htm, .html, .json, .log, .odt, .psv, .tsv, .txt

Connection Methods: Upload

Example Files

Test the Open Text extraction with these sample files:

How and Where to Use

You can use the Open Text extraction through either the Istari Digital Platform UI or the Istari Digital SDK. Both methods allow you to extract text content from supported document formats with detailed metadata reporting and performance metrics.

What You Can Do

  • Extract text content from 10 different file formats
  • Parse and extract specific fields from JSON files using backslash notation for nested paths
  • Access detailed extraction metadata including processing time and character counts
  • Handle nested JSON structures with array support
  • Process tabular data from CSV, PSV, and TSV files
  • Extract content from Microsoft Office (DOCX) and OpenDocument (ODT) formats
  • Parse HTML and web content
  • Extract text from log files

Prerequisites

Before using this integration, ensure:

  • The Istari Digital Agent is installed and configured by your administrator
  • You have access to the Istari Digital Platform UI or the Istari Digital SDK
  • Your files are in one of the supported formats
  • For Linux systems, the agent has the required textract dependencies installed

API

Functions

Function | Description | Inputs | Outputs
@istari:extract | Extracts text content from supported document formats | Document file (.csv, .docx, .htm, .html, .json, .log, .odt, .psv, .tsv, .txt) | Extracted text (TXT), Metadata report (JSON)
@istari:parse_json_fields | Extracts specific fields from JSON files using backslash notation for nested structures | JSON file, Field paths (parameters) | Extracted fields (JSON), Metadata report (JSON)

Output Examples

Output Name | Type | Description
extracted_text | TXT file | Plain text content extracted from the input document
extracted_fields | JSON file | Extracted field values from JSON file (for parse_json_fields function)
metadata_report | JSON file | Detailed extraction metadata including file info, processing metrics, character counts, and performance data

Usage

Method 1: Upload

Upload your documents directly to the Istari Digital Platform and run extraction jobs using either the Platform UI or the SDK.

Using the Istari Digital Platform UI

Follow these steps to extract text using the web interface:

  1. Navigate to the Files page.
    Click the Files option in the left-hand sidebar.

  2. Upload your file.
    Drag and drop your document (CSV, DOCX, HTM, HTML, JSON, LOG, ODT, PSV, TSV, or TXT) into the Upload Files area, or click to browse and select your file.

  3. Open the model file.
    Once uploaded, click on the file in your files list to open its detail page.

  4. Navigate to the Artifacts tab.
    Click the Artifacts tab to view and manage artifacts associated with this file.

  5. Fill out the function execution form.
    In the Execute Function section, provide the following information:

    • Tool Name: textract
    • Version: 1.6.3
    • Operating System: Select the OS where your Istari Digital Agent is running (e.g., Windows 11, Ubuntu 22.04, RHEL 8)
    • Function: @istari:extract (for text extraction) or @istari:parse_json_fields (for JSON field parsing)
    • Agent: Select the appropriate agent from the dropdown
    • Parameters: (Only for @istari:parse_json_fields) Provide the field paths to extract (see JSON field extraction examples below)
  6. Run the function.
    Click the Run or Execute button to start the extraction job.

  7. Monitor job progress.
    The page displays the job status. Wait for the job to complete; most documents finish in seconds.

  8. View results.
    Once the job completes successfully, the extracted artifacts will appear in the Artifacts tab.

  9. Download or view artifacts.
    Click on extracted_text.txt or extracted_fields.json to view the extracted content in the browser, or download the metadata_report.json for detailed processing information.

Using the Istari Digital SDK

Prerequisite: Install the Istari Digital SDK and initialize the Istari Digital Client per the SDK setup instructions.

Step 1: Upload and Extract the File(s)

Upload the file as a model

# Upload a document for text extraction (supports multiple file types)
model = client.add_model(
    path="document.docx",  # Can also use: .csv, .htm, .html, .json, .log, .odt, .psv, .tsv, .txt
    description="Document for text extraction",
    display_name="My Document",
)
print(f"Uploaded model with ID {model.id}")

Extract text once you have the model ID

# Extract text from the document
extraction_job = client.add_job(
    model_id=model.id,
    function="@istari:extract",
    tool_name="textract",
    tool_version="1.6.3",
    operating_system="Windows 11",  # Or: Ubuntu 22.04, RHEL 8, Windows 10, etc.
)
print(f"Extraction started for model ID {model.id}, job ID: {extraction_job.id}")

Step 2: Check the Job Status

extraction_job.poll_job()

Step 3: Retrieve Results

Example

from pathlib import Path

# Retrieve the model with updated artifacts
model = client.get_model(model.id)

for artifact in model.artifacts:
    # Windows-style output path; adjust for your operating system
    output_file_path = f"c:\\text_extracts\\{artifact.name}"

    # Create directory if needed
    Path(output_file_path).parent.mkdir(parents=True, exist_ok=True)

    if artifact.extension in ["txt", "json"]:
        # Text artifacts are written as UTF-8
        with open(output_file_path, "w", encoding="utf-8") as f:
            f.write(artifact.read_text())
        print(f"Saved artifact: {output_file_path}")
    else:
        # Anything else is written as raw bytes
        with open(output_file_path, "wb") as f:
            f.write(artifact.read_bytes())
        print(f"Saved binary artifact: {output_file_path}")

Notes on File Types

  • Document Formats (.docx, .odt): Extracts all text content including paragraphs, headers, and footers. Complex formatting may not be preserved.
  • Web Formats (.htm, .html): Extracts visible text content, removing HTML tags and scripts.
  • Tabular Formats (.csv, .psv, .tsv): Extracts all rows and columns as structured text.
  • Plain Text (.txt, .log): Extracts content as-is, preserving line breaks and formatting.
  • JSON (.json): For the @istari:extract function, converts JSON to readable text. Use @istari:parse_json_fields for structured field extraction.
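
To make the Web Formats note concrete, the sketch below shows one way visible text can be pulled from HTML while dropping tags and script contents. It uses Python's standard html.parser and is only an illustration of the idea; the class and function names are hypothetical, and this is not the module's actual parser.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes and ignores the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def visible_text(html):
    """Return the visible text of an HTML string, one text node per line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><script>var x = 1;</script></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
)
print(visible_text(sample))
```

The module's real output may differ in whitespace and line breaks, so treat the extracted_text.txt artifact as the reference.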

Using the JSON Field Extraction Function

The @istari:parse_json_fields function allows you to extract specific fields from JSON files, including deeply nested values and array elements.

Function Overview

  • Function Name: @istari:parse_json_fields
  • Tool Name: textract
  • Supported Versions: 1.6.3
  • Supported OS: Windows 10/11, Windows Server 2019/2022, Ubuntu 22.04, RHEL 8

Example Usage with SDK

Step 1: Upload the JSON File

# Upload JSON file
model = client.add_model(
    path="data.json",
    description="JSON data file",
    display_name="User Data",
)
print(f"Uploaded JSON file with ID {model.id}")

Step 2: Define Fields to Extract

# Define the fields you want to extract using backslash notation for nested fields
# Note: Use double backslashes (\\) for nested field paths
fields_to_extract = [
    "name",  # Top-level field
    "age",  # Top-level field
    "user\\profile\\email",  # Nested field using backslashes
    "user\\profile\\contact\\phone",  # Deeply nested field
    "settings\\theme",  # Nested field
    "users\\name",  # Extract field from array of objects
]

Step 3: Run the Extraction Job

# Run the parse_json_fields function
job = client.add_job(
    model_id=model.id,
    function="@istari:parse_json_fields",
    tool_name="textract",
    tool_version="1.6.3",
    operating_system="Windows 11",  # Or: Ubuntu 22.04, RHEL 8, etc.
    parameters={"fields_to_extract": fields_to_extract},
)

print(f"JSON field extraction started, job ID: {job.id}")

Step 4: Check Job Status & Retrieve Results

import json

# Wait for completion
job.poll_job()

# Retrieve the extracted fields
model = client.get_model(model.id)

for artifact in model.artifacts:
    if artifact.name == "extracted_fields.json":
        # Read and display extracted fields
        fields_data = json.loads(artifact.read_text())
        print(f"Extracted fields: {json.dumps(fields_data, indent=2)}")
    elif artifact.name == "metadata_report.json":
        # Read metadata report
        metadata = json.loads(artifact.read_text())
        print(f"Extraction summary: {metadata['extraction_summary']}")

JSON Field Extraction Examples

Example 1: Simple Top-Level Fields

Input JSON:

{
  "name": "John Doe",
  "age": 30,
  "email": "john@example.com"
}

Fields to Extract:

{
  "parameters": {
    "type": "parameter",
    "value": ["name", "age"]
  }
}

Output:

{
  "name": "John Doe",
  "age": 30
}

Example 2: Nested Fields with Backslash Notation

Input JSON:

{
  "user": {
    "profile": {
      "name": "John",
      "contact": {
        "email": "john@example.com",
        "phone": "555-0123"
      }
    }
  }
}

Fields to Extract:

{
  "parameters": {
    "type": "parameter",
    "value": ["user\\profile\\name", "user\\profile\\contact\\email"]
  }
}

Output:

{
  "user\\profile\\name": "John",
  "user\\profile\\contact\\email": "john@example.com"
}

Example 3: Array Field Extraction

Input JSON:

{
  "users": [
    { "name": "John", "age": 30 },
    { "name": "Jane", "age": 25 }
  ]
}

Fields to Extract:

{
  "parameters": {
    "type": "parameter",
    "value": ["users\\name", "users\\age"]
  }
}

Output:

{
  "users\\name": ["John", "Jane"],
  "users\\age": [30, 25]
}
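
The three examples above can be reproduced locally with a small resolver, which is handy for checking field paths before submitting a job. This is a minimal sketch of the documented behavior (backslash-separated paths, fan-out over arrays), not the module's implementation; resolve_path and extract_fields are hypothetical helper names.

```python
def resolve_path(data, path, separator="\\"):
    """Follow a separator-delimited path, fanning out over any list met on the way."""
    current = data
    for key in path.split(separator):
        if isinstance(current, list):
            # Apply the key to every object in the array
            current = [
                item.get(key) if isinstance(item, dict) else None
                for item in current
            ]
        elif isinstance(current, dict):
            current = current.get(key)
        else:
            # The path goes deeper than the data does
            return None
    return current


def extract_fields(data, paths):
    """Map each requested path to its resolved value (None when not found)."""
    return {path: resolve_path(data, path) for path in paths}


nested = {
    "user": {
        "profile": {
            "name": "John",
            "contact": {"email": "john@example.com", "phone": "555-0123"},
        }
    }
}
array = {"users": [{"name": "John", "age": 30}, {"name": "Jane", "age": 25}]}

print(extract_fields(nested, ["user\\profile\\name", "user\\profile\\contact\\email"]))
print(extract_fields(array, ["users\\name", "users\\age"]))
```

In this sketch a path that cannot be resolved yields None, which mirrors the null values described under Troubleshooting. How the module itself reports such paths should be confirmed in the metadata_report.json.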

Installation

Prerequisites

  • Python 3.11 or higher (for development only)
  • Istari Digital Agent version 9.0.0 or higher
  • For Linux: System dependencies for textract library (installed automatically by agent)

Configuration

Module Version 1.0.0+: Zero Configuration Required! ✓

Starting with module version 1.0.0, the Open Text integration requires no manual configuration. The module includes all necessary dependencies and works out-of-the-box once installed by the Istari Digital Agent.

System Dependencies

The module automatically handles text extraction for all supported formats. On Linux systems, the agent installation script will install any required system libraries for the textract library.

Windows: No additional dependencies required.

Linux (Ubuntu/RHEL): Dependencies are automatically installed during agent setup.

License Configuration

No license is required for the Open Text integration. This module uses open-source libraries and is freely available as part of the Istari Digital Platform.

Versions

Current Module Version: 1.0.0

This is the initial release of the Open Text extraction module.

Compatibility Notes

  • Agent Version: Requires Istari Digital Agent version 9.0.0 or higher
  • Operating Systems: Windows 10, Windows 11, Windows Server 2019, Windows Server 2022, Ubuntu 22.04, RHEL 8
  • Python Version: 3.11+ (development dependency only; end users don't need Python)
  • Textract Version: 1.6.3

Changelog

Module Version 1.0.0

Release Date: December 2024

Initial Release Features:

  • Support for 10 file formats: CSV, DOCX, HTM, HTML, JSON, LOG, ODT, PSV, TSV, TXT
  • @istari:extract function for text extraction from documents
  • @istari:parse_json_fields function for structured JSON field extraction
  • Backslash notation support for nested JSON field access
  • Array field extraction from JSON files
  • Detailed metadata reporting with performance metrics
  • Security-focused implementation using direct format parsers
  • Zero-configuration setup
  • Comprehensive error handling and logging

Release Notes

Key Changes Between Versions

Version 1.0.0 (Initial Release):

  • First public release of the Open Text extraction module
  • Provides text extraction capabilities for 10 file formats
  • Includes specialized JSON field parsing functionality
  • Supports Windows 10/11, Windows Server 2019/2022, Ubuntu 22.04, and RHEL 8
  • Requires Istari Digital Agent 9.0.0 or higher

Troubleshooting

Common Issues

Issue: Extraction Failed or Empty Output

  • Symptom: Error messages indicating extraction failure or empty extracted_text.txt file
  • Cause: Unsupported file format, corrupted file, or file doesn't contain extractable text
  • Solution:
    1. Verify your file is in one of the supported formats: .csv, .docx, .htm, .html, .json, .log, .odt, .psv, .tsv, .txt
    2. Check that the file opens correctly in its native application
    3. Review the metadata_report.json for specific error details
    4. Try re-saving the file in its native application to repair potential corruption
    5. Ensure the file actually contains text content (not just images or embedded objects)

Issue: JSON Field Not Found

  • Symptom: extracted_fields.json missing expected fields or shows null values
  • Cause: Incorrect field path, field doesn't exist in JSON, or incorrect backslash notation syntax
  • Solution:
    1. Verify the JSON structure by opening the file in a JSON viewer
    2. Double-check field names are spelled correctly (JSON is case-sensitive)
    3. Ensure backslash notation path is correct for nested fields (e.g., user\\profile\\email)
    4. Review the metadata_report.json for the list of missing fields
    5. Test with a simpler field path first (top-level field) before trying nested paths
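
When many paths are requested, a quick comparison between the requested list and the parsed contents of extracted_fields.json shows which paths need attention. The sketch below assumes the output maps each requested path to its value, as in the examples in this guide; find_missing_fields is a hypothetical helper, and the sample dictionary stands in for the parsed artifact.

```python
def find_missing_fields(requested, extracted):
    """Return the requested paths that are absent from the output or map to null."""
    return [path for path in requested if extracted.get(path) is None]


# Stand-in for json.loads(artifact.read_text()) on extracted_fields.json
extracted = {"name": "John Doe", "user\\profile\\email": None}
requested = ["name", "user\\profile\\email", "settings\\theme"]

for path in find_missing_fields(requested, extracted):
    print(f"Check this path against the source JSON: {path}")
```

A field whose real value is null in the source JSON is also flagged by this check, so confirm each flagged path against the original file.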

Issue: Module Not Found or Won't Execute

  • Symptom: Error messages indicating the module cannot be found or executed
  • Cause: Module not installed correctly, wrong agent version, or missing dependencies
  • Solution:
    1. Verify Istari Digital Agent version is 9.0.0 or higher
    2. Check that the Open Text module is installed in the correct agent modules directory
    3. Restart the Istari Digital Agent service
    4. Review agent logs for specific error messages
    5. On Linux, verify textract dependencies are installed: python3 -m pip list | grep textract
    6. Contact your administrator to verify the module installation

Issue: Slow Extraction Performance

  • Symptom: Extraction takes longer than expected
  • Cause: Large file size, complex document structure, or system resource constraints
  • Solution:
    1. Check the file size - very large files (>100MB) may take longer to process
    2. Review the metadata_report.json for processing time metrics
    3. For DOCX files, complex formatting can slow extraction - try saving as plain text first
    4. Ensure the agent machine has adequate CPU and memory resources
    5. Consider splitting very large files into smaller chunks
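
For line-oriented formats such as .txt and .log, splitting can be done before upload with a few lines of Python. The following is a sketch under the assumption that the file is UTF-8 text and that chunk boundaries on line breaks are acceptable; split_text_file and its naming scheme for chunk files are illustrative choices, not part of the integration.

```python
from pathlib import Path


def split_text_file(source, lines_per_chunk, output_dir):
    """Write source into numbered chunk files of at most lines_per_chunk lines each."""
    source = Path(source)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    buffer = []

    def flush():
        # Chunk names look like server_part001.log, server_part002.log, ...
        name = f"{source.stem}_part{len(written) + 1:03d}{source.suffix}"
        chunk_path = output_dir / name
        chunk_path.write_text("".join(buffer), encoding="utf-8")
        written.append(chunk_path)
        buffer.clear()

    with source.open("r", encoding="utf-8") as handle:
        for line in handle:
            buffer.append(line)
            if len(buffer) >= lines_per_chunk:
                flush()
    if buffer:
        flush()  # write the final, shorter chunk
    return written
```

Tabular files (.csv, .psv, .tsv) would additionally need the header row repeated in each chunk, which this sketch does not do.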

Issue: Encoding or Special Character Problems

  • Symptom: Extracted text contains garbled characters or question marks
  • Cause: File uses non-UTF-8 encoding or contains special characters
  • Solution:
    1. For text files, try re-saving with UTF-8 encoding in your text editor
    2. Check the metadata_report.json for encoding information
    3. If processing international characters, verify the source file uses UTF-8 encoding
    4. Some special characters may not be preserved during extraction
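
Re-saving as UTF-8 can also be scripted when many files are affected. The sketch below tries a list of candidate encodings in order and rewrites the file as UTF-8; the candidate list is an assumption you should adapt to your data, and resave_as_utf8 is a hypothetical helper rather than anything shipped with the module.

```python
from pathlib import Path


def resave_as_utf8(source, destination, candidate_encodings=("utf-8", "cp1252", "latin-1")):
    """Decode with the first candidate that succeeds, write UTF-8, return the encoding used."""
    raw = Path(source).read_bytes()
    for encoding in candidate_encodings:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate
        Path(destination).write_bytes(text.encode("utf-8"))
        return encoding
    raise ValueError(f"None of {candidate_encodings} could decode {source}")
```

Note that latin-1 accepts any byte sequence, so it works as a last resort but can silently produce wrong characters. Spot-check the converted file before uploading it.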

Getting Help

If you continue to experience issues:

  1. Check the module log files for detailed error messages
  2. Review the metadata_report.json artifact for extraction details
  3. Review the Istari Digital Agent logs for additional context
  4. Consult the main troubleshooting guide for general agent issues
  5. Contact Istari Digital support with:
    • Module version (1.0.0)
    • Agent version
    • Operating system
    • File format and size
    • Error messages from logs and metadata report
    • Steps to reproduce the issue

Tips and Best Practices

Optimal Usage Patterns

  • File Format Selection: Use the simplest format that meets your needs (e.g., TXT for plain text rather than DOCX)
  • Batch Processing: Process multiple files by creating separate upload jobs for efficient workflow automation
  • JSON Field Planning: Review your JSON structure before defining field paths to ensure accuracy
  • Test Extractions: Run a test extraction on a sample file before processing large batches
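
The batch pattern can be sketched as a loop over the same SDK calls shown earlier in this guide (client.add_model, client.add_job, and poll_job). The helper name extract_batch and its default operating system are assumptions, and the sketch assumes poll_job waits until the job finishes, as the usage examples suggest.

```python
def extract_batch(client, paths, operating_system="Ubuntu 22.04"):
    """Upload each path as a model and start one @istari:extract job per model."""
    jobs = []
    for path in paths:
        model = client.add_model(
            path=str(path),
            description="Document for text extraction",
            display_name=str(path),
        )
        job = client.add_job(
            model_id=model.id,
            function="@istari:extract",
            tool_name="textract",
            tool_version="1.6.3",
            operating_system=operating_system,
        )
        jobs.append(job)

    # Poll only after every job is submitted so jobs are not serialized by waiting
    for job in jobs:
        job.poll_job()
    return jobs
```

Pass an already initialized client, for example extract_batch(client, ["report.docx", "events.log"], operating_system="Windows 11").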

Performance Considerations

  • File Size: Smaller files (< 10MB) extract fastest. Consider splitting large files for better performance
  • Format Complexity: Plain text and CSV files extract faster than complex DOCX or ODT files
  • Concurrent Jobs: The agent can handle multiple extraction jobs in parallel for improved throughput
  • Network Transfer: Upload time depends on file size and network speed; extraction itself is typically fast

Data Quality Best Practices

  • Source Quality: Higher quality source documents produce better text extraction results
  • Format Consistency: Use consistent file formats across your document sets for predictable results
  • Validation: Always review the metadata_report.json to confirm successful extraction
  • Character Counts: Use the character count in metadata to verify complete extraction

JSON Field Extraction Best Practices

  • Field Path Validation: Test field paths on a sample JSON first before processing large datasets
  • Error Handling: Review the metadata report for missing fields to identify extraction issues
  • Nested Structures: Use backslash notation (\\) for deeply nested structures - verify each level exists
  • Array Processing: When extracting from arrays, expect output as an array of values

Security Considerations

  • Direct Parsers: This module uses direct format parsers for enhanced security (no external service calls)
  • Local Processing: All extraction happens locally on the agent machine - no data leaves your network
  • File Validation: The module validates file formats before processing to prevent malicious file handling
  • Access Control: Use Istari Digital Platform access controls to limit who can extract sensitive documents

FAQ

  • What happens if my file format isn't in the supported list?

    • The extraction will fail with an error message. Convert your file to a supported format first (e.g., save as TXT or DOCX).
  • Does this preserve formatting from Word documents?

    • No, the module extracts plain text content only. Formatting, images, and layout are not preserved.
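  • Can I extract text from PDFs?

    • No. PDF is not among the formats this module supports; see Supported File Types at the top of this page.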
  • How do I know if extraction was successful?

    • Check the metadata_report.json artifact which contains a success field and detailed extraction summary including character counts and processing time.
  • Why is my JSON field extraction returning null?

    • The field path may be incorrect, the field doesn't exist in your JSON, or there's a typo in the field name. Remember to use backslashes (\\) for nested fields, not dots. JSON is case-sensitive - verify the exact field name in your source file.
  • Can I extract multiple fields at once from JSON?

    • Yes! You can provide an array of field paths in the parameters.value array. All fields will be extracted in a single operation.
  • What's the difference between @istari:extract and @istari:parse_json_fields for JSON files?

    • @istari:extract converts the entire JSON to readable text format. @istari:parse_json_fields extracts specific fields in structured JSON format, which is better for data processing workflows.
  • Does the module support CSV files with custom delimiters?

    • Yes! Use .csv for comma-separated, .tsv for tab-separated, and .psv for pipe-separated files. The module automatically detects the delimiter.
  • Are there file size limits?

    • There's no hard limit, but very large files (>100MB) may take longer to process. Processing time scales with file size and complexity.
  • Can I use this module offline?

    • Yes, the module runs entirely locally on the agent machine and requires no internet connection for extraction.
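
For the delimiter question above, the extension-to-delimiter convention (.csv comma, .tsv tab, .psv pipe) can be checked locally with Python's csv module before uploading a file. This is a sketch of the convention only; how the module itself selects the delimiter is not documented here, and DELIMITERS and read_rows are hypothetical names.

```python
import csv
import io
from pathlib import Path

# Convention described in the FAQ: the extension determines the separator
DELIMITERS = {".csv": ",", ".tsv": "\t", ".psv": "|"}


def read_rows(filename, content):
    """Parse delimited text using the separator implied by the file extension."""
    suffix = Path(filename).suffix.lower()
    if suffix not in DELIMITERS:
        raise ValueError(f"Unsupported tabular extension: {suffix!r}")
    reader = csv.reader(io.StringIO(content), delimiter=DELIMITERS[suffix])
    return list(reader)


print(read_rows("parts.psv", "id|name\n1|bracket\n2|hinge\n"))
```

If a row comes back as a single column, the file's separator probably does not match its extension.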