Open Text Extraction

Summary

The Open Text extraction integration provides secure text extraction from multiple document and data formats using direct format parsers. Extract text content from 10 file formats including documents, web files, and tabular data.

Supported File Types: .csv, .docx, .htm, .html, .json, .log, .odt, .psv, .tsv, .txt

Connection Methods: Upload

Example Files

Test the Open Text extraction with these sample files:

How and Where to Use

You can use the Open Text extraction through either the Istari Digital Platform UI or the Istari Digital SDK. Both methods allow you to extract text content from supported document formats with detailed metadata reporting and performance metrics.

What You Can Do

Extract text content from 10 different file formats
Parse and extract specific fields from JSON files using backslash notation
Access detailed extraction metadata including processing time and character counts
Handle nested JSON structures with array support
Process tabular data from CSV, PSV, and TSV files
Extract content from Microsoft Office (DOCX) and OpenDocument (ODT) formats
Parse HTML and web content
Extract text from log files

Prerequisites

Before using this integration, ensure:

The Istari Digital Agent is installed and configured by your administrator
You have access to the Istari Digital Platform UI or the Istari Digital SDK
Your files are in one of the supported formats
For Linux systems, the agent has the required textract dependencies installed
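The format prerequisite can be checked locally before uploading. The following is a minimal sketch, not part of the Istari Digital SDK: `is_supported` and `SUPPORTED_EXTENSIONS` are hypothetical names, and treating extensions as case-insensitive is an assumption made here for convenience.

```python
from pathlib import Path

# Extensions listed under "Supported File Types" in this document.
SUPPORTED_EXTENSIONS = {
    ".csv", ".docx", ".htm", ".html", ".json",
    ".log", ".odt", ".psv", ".tsv", ".txt",
}


def is_supported(path: str) -> bool:
    """Return True if the file extension is in the supported list.

    Lowercasing the extension is an assumption of this sketch.
    """
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


print(is_supported("report.DOCX"))  # True
print(is_supported("scan.pdf"))     # False
```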

API

Functions

| Function | Description | Inputs | Outputs |
| --- | --- | --- | --- |
| `@istari:extract` | Extracts text content from supported document formats | Document file (`.csv`, `.docx`, `.htm`, `.html`, `.json`, `.log`, `.odt`, `.psv`, `.tsv`, `.txt`) | Extracted text (TXT), Metadata report (JSON) |
| `@istari:parse_json_fields` | Extracts specific fields from JSON files using backslash notation for nested structures | JSON file, Field paths (parameters) | Extracted fields (JSON), Metadata report (JSON) |

Output Examples

| Output Name | Type | Description |
| --- | --- | --- |
| `extracted_text` | TXT file | Plain text content extracted from the input document |
| `extracted_fields` | JSON file | Extracted field values from JSON file (for `parse_json_fields` function) |
| `metadata_report` | JSON file | Detailed extraction metadata including file info, processing metrics, character counts, and performance data |

Usage

Method 1: Upload

Upload your documents directly to the Istari Digital Platform and run extraction jobs using either the Platform UI or the SDK.

Using the Istari Digital Platform UI

Follow these steps to extract text using the web interface:

1. Navigate to the Files page. Click the Files option in the left-hand sidebar.
2. Upload your file. Drag and drop your document (CSV, DOCX, HTM, HTML, JSON, LOG, ODT, PSV, TSV, or TXT) into the Upload Files area, or click to browse and select your file.
3. Open the model file. Once uploaded, click the file in your files list to open its detail page.
4. Navigate to the Artifacts tab. Click the Artifacts tab to view and manage artifacts associated with this file.
5. Fill out the function execution form. In the Execute Function section, provide the following information:
   - Tool Name: textract
   - Version: 1.6.3
   - Operating System: Select the OS where your Istari Digital Agent is running (e.g., Windows 11, Ubuntu 22.04, RHEL 8)
   - Function: @istari:extract (for text extraction) or @istari:parse_json_fields (for JSON field parsing)
   - Agent: Select the appropriate agent from the dropdown
   - Parameters: (Only for @istari:parse_json_fields) Provide the field paths to extract (see JSON field extraction examples below)
6. Run the function. Click the Run or Execute button to start the extraction job.
7. Monitor job progress. The page displays the job status. Wait for the job to complete; most documents finish within seconds.
8. View results. Once the job completes successfully, the extracted artifacts appear in the Artifacts tab.
9. Download or view artifacts. Click extracted_text.txt or extracted_fields.json to view the extracted content in the browser, or download metadata_report.json for detailed processing information.

Using the Istari Digital SDK

Prerequisite: Install Istari Digital SDK and initialize Istari Digital Client per instructions here

Step 1: Upload and Extract the File(s)

Upload the file as a model

# Upload a document for text extraction (supports multiple file types)
model = client.add_model(
    path="document.docx",  # Can also use: .csv, .html, .json, .log, .odt, .psv, .tsv, .txt
    description="Document for text extraction",
    display_name="My Document",
)
print(f"Uploaded model with ID {model.id}")

Extract text once you have the model ID

# Extract text from the document
extraction_job = client.add_job(
    model_id=model.id,
    function="@istari:extract",
    tool_name="textract",
    tool_version="1.6.3",
    operating_system="Windows 11",  # Or: Ubuntu 22.04, RHEL 8, Windows 10, etc.
)
print(f"Extraction started for model ID {model.id}, job ID: {extraction_job.id}")

Step 2: Check the Job Status

extraction_job.poll_job()

Step 3: Retrieve Results

Example

from pathlib import Path

# Retrieve the model with updated artifacts
model = client.get_model(model.id)

# Local folder for the downloaded artifacts (adjust for your system)
output_dir = Path("c:\\text_extracts")
output_dir.mkdir(parents=True, exist_ok=True)

for artifact in model.artifacts:
    output_file_path = output_dir / artifact.name

    if artifact.extension in ["txt", "json"]:
        # Text artifacts are written as UTF-8 text
        with open(output_file_path, "w", encoding="utf-8") as f:
            f.write(artifact.read_text())
        print(f"Saved artifact: {output_file_path}")
    else:
        # Anything else is written as raw bytes
        with open(output_file_path, "wb") as f:
            f.write(artifact.read_bytes())
        print(f"Saved binary artifact: {output_file_path}")

Notes on File Types

Document Formats (.docx, .odt): Extracts all text content including paragraphs, headers, and footers. Complex formatting may not be preserved.
Web Formats (.htm, .html): Extracts visible text content, removing HTML tags and scripts.
Tabular Formats (.csv, .psv, .tsv): Extracts all rows and columns as structured text.
Plain Text (.txt, .log): Extracts content as-is, preserving line breaks and formatting.
JSON (.json): For the @istari:extract function, converts JSON to readable text. Use @istari:parse_json_fields for structured field extraction.
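To preview a tabular file locally before uploading it, the standard `csv` module can read all three tabular formats. This is a rough sketch that assumes the usual delimiter for each extension (comma, tab, pipe) and UTF-8 input; `DELIMITERS` and `read_rows` are hypothetical names, and the output layout of the module's own extracted text is not shown here.

```python
import csv
from pathlib import Path

# Assumed delimiter per extension: comma, tab, and pipe.
DELIMITERS = {".csv": ",", ".tsv": "\t", ".psv": "|"}


def read_rows(path: str) -> list[list[str]]:
    """Read a .csv, .tsv, or .psv file into a list of rows (lists of strings)."""
    delimiter = DELIMITERS[Path(path).suffix.lower()]
    # newline="" lets the csv module handle line endings itself.
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f, delimiter=delimiter))
```

Calling `read_rows("data.psv")` returns one list per row, which makes it easy to confirm that the columns split as expected before running an extraction job.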

Using the JSON Field Extraction Function

The @istari:parse_json_fields function allows you to extract specific fields from JSON files, including deeply nested values and array elements.

Function Overview

Function Name: @istari:parse_json_fields
Tool Name: textract
Supported Versions: 1.6.3
Supported OS: Windows 10/11, Windows Server 2019/2022, Ubuntu 22.04, RHEL 8

Example Usage with SDK

Step 1: Upload the JSON File

# Upload JSON file
model = client.add_model(
    path="data.json",
    description="JSON data file",
    display_name="User Data",
)
print(f"Uploaded JSON file with ID {model.id}")

Step 2: Define Fields to Extract

# Define the fields you want to extract using backslash notation for nested fields
# Note: Use double backslashes (\\) for nested field paths
fields_to_extract = [
    "name",                          # Top-level field
    "age",                           # Top-level field
    "user\\profile\\email",          # Nested field using backslashes
    "user\\profile\\contact\\phone", # Deeply nested field
    "settings\\theme",               # Nested field
    "users\\name"                    # Extract field from array of objects
]
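A note on escaping in the field paths above: in Python source code, `\\` inside a normal string literal denotes one backslash character, so each path contains single backslashes at runtime. The short sketch below illustrates this with standard Python only; the variable names are illustrative.

```python
import json

# In a normal Python string literal, "\\" is ONE backslash character.
path = "user\\profile\\email"
raw_path = r"user\profile\email"  # raw string: same value, no escaping needed

print(path == raw_path)   # True
print(path.split("\\"))   # ['user', 'profile', 'email']
print(json.dumps(path))   # "user\\profile\\email"  (JSON escapes the backslash again)
```

The last line shows why the JSON examples in this document also display double backslashes: JSON uses the same escape for a single backslash.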

Step 3: Run the Extraction Job

# Run the parse_json_fields function
job = client.add_job(
    model_id=model.id,
    function="@istari:parse_json_fields",
    tool_name="textract",
    tool_version="1.6.3",
    operating_system="Windows 11",  # Or: Ubuntu 22.04, RHEL 8, etc.
    parameters={"fields_to_extract": fields_to_extract}
)

print(f"JSON field extraction started, job ID: {job.id}")

Step 4: Check Job Status & Retrieve Results

import json

# Wait for completion
job.poll_job()

# Retrieve the model with updated artifacts
model = client.get_model(model.id)

for artifact in model.artifacts:
    if artifact.name == "extracted_fields.json":
        # Read and display the extracted fields
        fields_data = json.loads(artifact.read_text())
        print(f"Extracted fields: {json.dumps(fields_data, indent=2)}")
    elif artifact.name == "metadata_report.json":
        # Read the metadata report and show its summary
        metadata = json.loads(artifact.read_text())
        print(f"Extraction summary: {metadata['extraction_summary']}")

JSON Field Extraction Examples

Example 1: Simple Top-Level Fields

Input JSON:

{
  "name": "John Doe",
  "age": 30,
  "email": "john@example.com"
}

Fields to Extract:

{
  "parameters": {
    "type": "parameter",
    "value": ["name", "age"]
  }
}

Output:

{
  "name": "John Doe",
  "age": 30
}

Example 2: Nested Fields with Backslash Notation

Input JSON:

{
  "user": {
    "profile": {
      "name": "John",
      "contact": {
        "email": "john@example.com",
        "phone": "555-0123"
      }
    }
  }
}

Fields to Extract:

{
  "parameters": {
    "type": "parameter",
    "value": ["user\\profile\\name", "user\\profile\\contact\\email"]
  }
}

Output:

{
  "user\\profile\\name": "John",
  "user\\profile\\contact\\email": "john@example.com"
}

Example 3: Array Field Extraction

Input JSON:

{
  "users": [
    { "name": "John", "age": 30 },
    { "name": "Jane", "age": 25 }
  ]
}

Fields to Extract:

{
  "parameters": {
    "type": "parameter",
    "value": ["users\\name", "users\\age"]
  }
}

Output:

{
  "users\\name": ["John", "Jane"],
  "users\\age": [30, 25]
}
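The three examples suggest how a path is resolved: segments are separated by a backslash, and an array is mapped element by element. The sketch below previews locally which values a set of paths would select under those assumed semantics. `resolve` and `preview_fields` are hypothetical helpers, not part of the module or the SDK, and the module's actual behavior may differ in edge cases.

```python
from typing import Any


def resolve(data: Any, segments: list[str]) -> Any:
    """Walk `segments` through nested dicts, fanning out over lists."""
    if not segments:
        return data
    if isinstance(data, list):
        # Apply the remaining path to every element of the array.
        return [resolve(item, segments) for item in data]
    if isinstance(data, dict) and segments[0] in data:
        return resolve(data[segments[0]], segments[1:])
    return None  # the path does not exist at this level


def preview_fields(data: Any, paths: list[str]) -> dict[str, Any]:
    """Map each backslash-separated path to the value it selects."""
    return {path: resolve(data, path.split("\\")) for path in paths}


users = {"users": [{"name": "John", "age": 30}, {"name": "Jane", "age": 25}]}
print(preview_fields(users, ["users\\name", "users\\age"]))
# {'users\\name': ['John', 'Jane'], 'users\\age': [30, 25]}
```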

Installation

Prerequisites

Python 3.11 or higher (for development only)
Istari Digital Agent version 9.0.0 or higher
For Linux: System dependencies for textract library (installed automatically by agent)

Configuration

Module Version 1.0.0+: Zero Configuration Required! ✓

Starting with module version 1.0.0, the Open Text integration requires no manual configuration. The module includes all necessary dependencies and works out-of-the-box once installed by the Istari Digital Agent.

System Dependencies

The module automatically handles text extraction for all supported formats. On Linux systems, the agent installation script will install any required system libraries for the textract library.

Windows: No additional dependencies required.

Linux (Ubuntu/RHEL): Dependencies are automatically installed during agent setup.

License Configuration

No license is required for the Open Text integration. This module uses open-source libraries and is freely available as part of the Istari Digital Platform.

Versions

Current Module Version: 1.0.0

This is the initial release of the Open Text extraction module.

Compatibility Notes

Agent Version: Requires Istari Digital Agent version 9.0.0 or higher
Operating Systems: Windows 10, Windows 11, Windows Server 2019, Windows Server 2022, Ubuntu 22.04, RHEL 8
Python Version: 3.11+ (development dependency only; end users don't need Python)
Textract Version: 1.6.3

Changelog

Module Version 1.0.0

Release Date: December 2024

Initial Release Features:

Support for 10 file formats: CSV, DOCX, HTM, HTML, JSON, LOG, ODT, PSV, TSV, TXT
@istari:extract function for text extraction from documents
@istari:parse_json_fields function for structured JSON field extraction
Backslash notation support for nested JSON field access
Array field extraction from JSON files
Detailed metadata reporting with performance metrics
Security-focused implementation using direct format parsers
Zero-configuration setup
Comprehensive error handling and logging

Release Notes

Key Changes Between Versions

Version 1.0.0 (Initial Release):

First public release of the Open Text extraction module
Provides text extraction capabilities for 10 file formats
Includes specialized JSON field parsing functionality
Supports Windows 10/11, Windows Server 2019/2022, Ubuntu 22.04, and RHEL 8
Requires Istari Digital Agent 9.0.0 or higher

Troubleshooting

Common Issues

Issue: Extraction Failed or Empty Output

Symptom: Error messages indicating extraction failure or empty extracted_text.txt file
Cause: Unsupported file format, corrupted file, or file doesn't contain extractable text
Solution:
1. Verify your file is in one of the supported formats: .csv, .docx, .htm, .html, .json, .log, .odt, .psv, .tsv, .txt
2. Check that the file opens correctly in its native application
3. Review the metadata_report.json for specific error details
4. Try re-saving the file in its native application to repair potential corruption
5. Ensure the file actually contains text content (not just images or embedded objects)

Issue: JSON Field Not Found

Symptom: extracted_fields.json missing expected fields or shows null values
Cause: Incorrect field path, field doesn't exist in JSON, or incorrect backslash notation syntax
Solution:
1. Verify the JSON structure by opening the file in a JSON viewer
2. Double-check field names are spelled correctly (JSON is case-sensitive)
3. Ensure backslash notation path is correct for nested fields (e.g., user\\profile\\email)
4. Review the metadata_report.json for the list of missing fields
5. Test with a simpler field path first (top-level field) before trying nested paths
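The checks above can be partly automated on a local copy of the JSON file. The sketch below walks a path level by level and reports the first segment that fails, with a hint when only the letter case differs. `diagnose_path` is a hypothetical helper, and inspecting only the first element of an array is a simplification made for this sketch.

```python
from typing import Any, Optional

SEP = "\\"  # a single backslash character


def diagnose_path(data: Any, path: str) -> Optional[str]:
    """Return None if `path` resolves, else a message naming the failing level."""
    current = data
    walked: list[str] = []
    for segment in path.split(SEP):
        if isinstance(current, list):
            # Simplification: use the first element to represent the array.
            current = current[0] if current else None
        where = SEP.join(walked) or "<root>"
        if not isinstance(current, dict):
            return f"cannot look up '{segment}': value at {where} is not an object"
        if segment not in current:
            # JSON keys are case-sensitive; suggest a key that differs only by case.
            near = [key for key in current if key.lower() == segment.lower()]
            hint = f"; did you mean '{near[0]}'?" if near else ""
            return f"'{segment}' not found under {where}{hint}"
        current = current[segment]
        walked.append(segment)
    return None
```

For a file parsed with `json.load`, `diagnose_path(data, "user\\profile\\email")` returns `None` when every level exists and a short message otherwise.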

Issue: Module Not Found or Won't Execute

Symptom: Error messages indicating the module cannot be found or executed
Cause: Module not installed correctly, wrong agent version, or missing dependencies
Solution:
1. Verify Istari Digital Agent version is 9.0.0 or higher
2. Check that the Open Text module is installed in the correct agent modules directory
3. Restart the Istari Digital Agent service
4. Review agent logs for specific error messages
5. On Linux, verify textract dependencies are installed: python3 -m pip list | grep textract
6. Contact your administrator to verify the module installation

Issue: Slow Extraction Performance

Symptom: Extraction takes longer than expected
Cause: Large file size, complex document structure, or system resource constraints
Solution:
1. Check the file size - very large files (>100MB) may take longer to process
2. Review the metadata_report.json for processing time metrics
3. For DOCX files, complex formatting can slow extraction - try saving as plain text first
4. Ensure the agent machine has adequate CPU and memory resources
5. Consider splitting very large files into smaller chunks
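For plain-text inputs such as .txt or .log files, splitting can be done with a few lines of standard Python. This is a sketch under the assumption that the file is UTF-8 and that splitting on line boundaries is acceptable; `split_by_lines` and the `_partNNN` naming are hypothetical. It is not suitable as-is for CSV files, where later chunks would lose the header row.

```python
from pathlib import Path


def split_by_lines(source: str, lines_per_chunk: int, out_dir: str) -> list[Path]:
    """Split a text file into numbered chunks of at most `lines_per_chunk` lines."""
    if lines_per_chunk < 1:
        raise ValueError("lines_per_chunk must be at least 1")
    src = Path(source)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    chunks: list[Path] = []
    buffer: list[str] = []

    def flush() -> None:
        # Write the buffered lines to the next numbered chunk file.
        part = out / f"{src.stem}_part{len(chunks) + 1:03d}{src.suffix}"
        with part.open("w", encoding="utf-8", newline="") as dst:
            dst.writelines(buffer)
        chunks.append(part)
        buffer.clear()

    # newline="" keeps the original line endings unchanged.
    with src.open(encoding="utf-8", newline="") as f:
        for line in f:
            buffer.append(line)
            if len(buffer) == lines_per_chunk:
                flush()
    if buffer:
        flush()
    return chunks
```

Each returned path can then be uploaded and extracted as its own job.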

Issue: Encoding or Special Character Problems

Symptom: Extracted text contains garbled characters or question marks
Cause: File uses non-UTF-8 encoding or contains special characters
Solution:
1. For text files, try re-saving with UTF-8 encoding in your text editor
2. Check the metadata_report.json for encoding information
3. If processing international characters, verify the source file uses UTF-8 encoding
4. Some special characters may not be preserved during extraction
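If a text editor is not at hand, re-saving as UTF-8 can be scripted. The sketch below assumes you already know the source file's encoding (for example `cp1252` for many Windows-authored files); `resave_as_utf8` is a hypothetical helper name.

```python
from pathlib import Path


def resave_as_utf8(source: str, target: str, source_encoding: str) -> None:
    """Decode `source` using `source_encoding` and write it to `target` as UTF-8."""
    text = Path(source).read_bytes().decode(source_encoding)
    Path(target).write_bytes(text.encode("utf-8"))
```

A call such as `resave_as_utf8("notes.txt", "notes_utf8.txt", "cp1252")` raises `UnicodeDecodeError` if the stated encoding cannot decode the file, which is itself a useful signal that the encoding guess is wrong.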

Getting Help

If you continue to experience issues:

Check the module log files for detailed error messages
Review the metadata_report.json artifact for extraction details
Review the Istari Digital Agent logs for additional context
Consult the main troubleshooting guide for general agent issues
Contact Istari Digital support with:
- Module version (1.0.0)
- Agent version
- Operating system
- File format and size
- Error messages from logs and metadata report
- Steps to reproduce the issue

Tips and Best Practices

Optimal Usage Patterns

File Format Selection: Use the simplest format that meets your needs (e.g., TXT for plain text rather than DOCX)
Batch Processing: Process multiple files by uploading each file and creating a separate extraction job for it, which keeps workflows easy to automate
JSON Field Planning: Review your JSON structure before defining field paths to ensure accuracy
Test Extractions: Run a test extraction on a sample file before processing large batches
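Batch processing can be sketched as a loop over the same two SDK calls shown in the SDK usage section (`client.add_model` and `client.add_job`, with the arguments used there). `submit_batch` is a hypothetical helper, not an SDK function, and it assumes `client` has already been initialized.

```python
from pathlib import Path


def submit_batch(client, paths, operating_system="Windows 11"):
    """Upload each file and start one @istari:extract job per file.

    Returns a list of (model, job) pairs so each job can be polled later.
    """
    submitted = []
    for path in paths:
        model = client.add_model(
            path=path,
            description="Document for text extraction",
            display_name=Path(path).name,  # use the file name as display name
        )
        job = client.add_job(
            model_id=model.id,
            function="@istari:extract",
            tool_name="textract",
            tool_version="1.6.3",
            operating_system=operating_system,
        )
        submitted.append((model, job))
    return submitted
```

After submission, each job in the returned list can be polled with `job.poll_job()` as in the single-file example.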

Performance Considerations

File Size: Smaller files (< 10MB) extract fastest. Consider splitting large files for better performance
Format Complexity: Plain text and CSV files extract faster than complex DOCX or ODT files
Concurrent Jobs: The agent can handle multiple extraction jobs in parallel for improved throughput
Network Transfer: Upload time depends on file size and network speed; extraction itself is typically fast

Data Quality Best Practices

Source Quality: Higher quality source documents produce better text extraction results
Format Consistency: Use consistent file formats across your document sets for predictable results
Validation: Always review the metadata_report.json to confirm successful extraction
Character Counts: Use the character count in metadata to verify complete extraction
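Validation can be scripted against the report text. The sketch below relies on the `success` field and the `extraction_summary` key mentioned elsewhere in this document; that both sit at the top level of the report is an assumption, and `check_report` is a hypothetical helper name.

```python
import json
from typing import Any


def check_report(report_text: str) -> Any:
    """Parse a metadata report and raise if it does not indicate success.

    Assumes top-level "success" and "extraction_summary" keys.
    """
    report = json.loads(report_text)
    if not report.get("success"):
        raise RuntimeError("metadata report does not indicate a successful extraction")
    return report.get("extraction_summary", {})
```

With the SDK examples above, this could be called as `check_report(artifact.read_text())` for the `metadata_report.json` artifact.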

JSON Field Extraction Best Practices

Field Path Validation: Test field paths on a sample JSON first before processing large datasets
Error Handling: Review the metadata report for missing fields to identify extraction issues
Nested Structures: Use backslash notation (\\) for deeply nested structures - verify each level exists
Array Processing: When extracting from arrays, expect output as an array of values

Security Considerations

Direct Parsers: This module uses direct format parsers for enhanced security (no external service calls)
Local Processing: All extraction happens locally on the agent machine - no data leaves your network
File Validation: The module validates file formats before processing to prevent malicious file handling
Access Control: Use Istari Digital Platform access controls to limit who can extract sensitive documents

FAQ

What happens if my file format isn't in the supported list?
- The extraction will fail with an error message. Convert your file to a supported format first (e.g., save as TXT or DOCX).
Does this preserve formatting from Word documents?
- No, the module extracts plain text content only. Formatting, images, and layout are not preserved.
Can I extract text from PDFs?
- PDF extraction is handled by a separate module. See the PDF extraction documentation for PDF support.
How do I know if extraction was successful?
- Check the metadata_report.json artifact which contains a success field and detailed extraction summary including character counts and processing time.
Why is my JSON field extraction returning null?
- The field path may be incorrect, the field doesn't exist in your JSON, or there's a typo in the field name. Remember to use backslashes (\\) for nested fields, not dots. JSON is case-sensitive - verify the exact field name in your source file.
Can I extract multiple fields at once from JSON?
- Yes! You can provide an array of field paths in the parameters.value array. All fields will be extracted in a single operation.
What's the difference between @istari:extract and @istari:parse_json_fields for JSON files?
- @istari:extract converts the entire JSON to readable text format. @istari:parse_json_fields extracts specific fields in structured JSON format, which is better for data processing workflows.
Does the module support CSV files with custom delimiters?
- Yes! Use .csv for comma-separated, .tsv for tab-separated, and .psv for pipe-separated files. The module automatically detects the delimiter.
Are there file size limits?
- There's no hard limit, but very large files (>100MB) may take longer to process. Processing time scales with file size and complexity.
Can I use this module offline?
- Yes, the module runs entirely locally on the agent machine and requires no internet connection for extraction.
