Open Text Extraction
Summary
The Open Text extraction integration provides secure text extraction from multiple document and data formats using direct format parsers. Extract text content from 10 file formats including documents, web files, and tabular data.
Supported File Types: .csv, .docx, .htm, .html, .json, .log, .odt, .psv, .tsv, .txt
Connection Methods: Upload
Example Files
Test the Open Text extraction with these sample files:
- Download Example CSV: sample.csv
- Download Example DOCX: sample.docx
- Download Example HTM: sample.htm
- Download Example HTML: sample.html
- Download Example JSON: sample.json
- Download Example LOG: sample.log
- Download Example ODT: sample.odt
- Download Example PSV: sample.psv
- Download Example TSV: sample.tsv
- Download Example TXT: sample.txt
How and Where to Use
You can use the Open Text extraction through either the Istari Digital Platform UI or the Istari Digital SDK. Both methods allow you to extract text content from supported document formats with detailed metadata reporting and performance metrics.
What You Can Do
- Extract text content from 10 different file formats
- Parse and extract specific fields from JSON files using backslash notation
- Access detailed extraction metadata including processing time and character counts
- Handle nested JSON structures with array support
- Process tabular data from CSV, PSV, and TSV files
- Extract content from Microsoft Office (DOCX) and OpenDocument (ODT) formats
- Parse HTML and web content
- Extract text from log files
Prerequisites
Before using this integration, ensure:
- The Istari Digital Agent is installed and configured by your administrator
- You have access to the Istari Digital Platform UI or the Istari Digital SDK
- Your files are in one of the supported formats
- For Linux systems, the agent has the required textract dependencies installed
API
Functions
| Function | Description | Inputs | Outputs |
|---|---|---|---|
| @istari:extract | Extracts text content from supported document formats | Document file (.csv, .docx, .htm, .html, .json, .log, .odt, .psv, .tsv, .txt) | Extracted text (TXT), Metadata report (JSON) |
| @istari:parse_json_fields | Extracts specific fields from JSON files using backslash notation for nested structures | JSON file, Field paths (parameters) | Extracted fields (JSON), Metadata report (JSON) |
Output Examples
| Output Name | Type | Description |
|---|---|---|
| extracted_text | TXT file | Plain text content extracted from the input document |
| extracted_fields | JSON file | Extracted field values from JSON file (for parse_json_fields function) |
| metadata_report | JSON file | Detailed extraction metadata including file info, processing metrics, character counts, and performance data |
Usage
Method 1: Upload
Upload your documents directly to the Istari Digital Platform and run extraction jobs using either the Platform UI or the SDK.
Using the Istari Digital Platform UI
Follow these steps to extract text using the web interface:
- Navigate to the Files page. Click the Files option in the left-hand sidebar.
- Upload your file. Drag and drop your document (CSV, DOCX, HTM, HTML, JSON, LOG, ODT, PSV, TSV, or TXT) into the Upload Files area, or click to browse and select your file.
- Open the model file. Once uploaded, click on the file in your files list to open its detail page.
- Navigate to the Artifacts tab. Click the Artifacts tab to view and manage artifacts associated with this file.
- Fill out the function execution form. In the Execute Function section, provide the following information:
  - Tool Name: textract
  - Version: 1.6.3
  - Operating System: Select the OS where your Istari Digital Agent is running (e.g., Windows 11, Ubuntu 22.04, RHEL 8)
  - Function: @istari:extract (for text extraction) or @istari:parse_json_fields (for JSON field parsing)
  - Agent: Select the appropriate agent from the dropdown
  - Parameters: (Only for @istari:parse_json_fields) Provide the field paths to extract (see JSON field extraction examples below)
- Run the function. Click the Run or Execute button to start the extraction job.
- Monitor job progress. The page will display the job status. Wait for it to complete; most documents finish in seconds.
- View results. Once the job completes successfully, the extracted artifacts will appear in the Artifacts tab.
- Download or view artifacts. Click on extracted_text.txt or extracted_fields.json to view the extracted content in the browser, or download the metadata_report.json for detailed processing information.
Using the Istari Digital SDK
Prerequisite: Install the Istari Digital SDK and initialize the Istari Digital Client per the SDK setup instructions.
Step 1: Upload and Extract the File(s)
Upload the file as a model
# Upload a document for text extraction (supports multiple file types)
model = client.add_model(
    path="document.docx",  # Can also use: .csv, .html, .json, .log, .odt, .psv, .tsv, .txt
    description="Document for text extraction",
    display_name="My Document",
)
print(f"Uploaded model with ID {model.id}")
Extract text once you have the model ID
# Extract text from the document
extraction_job = client.add_job(
    model_id=model.id,
    function="@istari:extract",
    tool_name="textract",
    tool_version="1.6.3",
    operating_system="Windows 11",  # Or: Ubuntu 22.04, RHEL 8, Windows 10, etc.
)
print(f"Extraction started for model ID {model.id}, job ID: {extraction_job.id}")
Step 2: Check the Job Status
extraction_job.poll_job()
Step 3: Retrieve Results
Example
from pathlib import Path

# Retrieve the model with updated artifacts
model = client.get_model(model.id)
for artifact in model.artifacts:
    output_file_path = f"c:\\text_extracts\\{artifact.name}"
    # Create the output directory if needed
    Path(output_file_path).parent.mkdir(parents=True, exist_ok=True)
    if artifact.extension in ["txt", "json"]:
        # Text artifacts are written as UTF-8
        with open(output_file_path, "w", encoding="utf-8") as f:
            f.write(artifact.read_text())
        print(f"Saved artifact: {output_file_path}")
    else:
        # Anything else is written as raw bytes
        with open(output_file_path, "wb") as f:
            f.write(artifact.read_bytes())
        print(f"Saved binary artifact: {output_file_path}")
Notes on File Types
- Document Formats (.docx, .odt): Extracts all text content including paragraphs, headers, and footers. Complex formatting may not be preserved.
- Web Formats (.htm, .html): Extracts visible text content, removing HTML tags and scripts.
- Tabular Formats (.csv, .psv, .tsv): Extracts all rows and columns as structured text.
- Plain Text (.txt, .log): Extracts content as-is, preserving line breaks and formatting.
- JSON (.json): For the @istari:extract function, converts JSON to readable text. Use @istari:parse_json_fields for structured field extraction.
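To make the Web Formats note concrete, here is a minimal sketch of what "visible text with tags and scripts removed" can look like, using only Python's standard-library html.parser. This is an illustration, not the module's actual parser: the class name VisibleTextParser and the helper visible_text are hypothetical, and skipping style blocks as well as script blocks is an assumption.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}  # assumption: style is skipped too

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def visible_text(html):
    """Return the visible text of an HTML string, one text node per line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Running a sample page through a helper like this before uploading gives a rough idea of how much text an HTML file actually contains; the exact whitespace and line breaks in the module's extracted_text.txt may differ.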
Using the JSON Field Extraction Function
The @istari:parse_json_fields function allows you to extract specific fields from JSON files, including deeply nested values and array elements.
Function Overview
- Function Name: @istari:parse_json_fields
- Tool Name: textract
- Supported Versions: 1.6.3
- Supported OS: Windows 10/11, Windows Server 2019/2022, Ubuntu 22.04, RHEL 8
Example Usage with SDK
Step 1: Upload the JSON File
# Upload JSON file
model = client.add_model(
path="data.json",
description="JSON data file",
display_name="User Data",
)
print(f"Uploaded JSON file with ID {model.id}")
Step 2: Define Fields to Extract
# Define the fields you want to extract using backslash notation for nested fields
# Note: Use double backslashes (\\) for nested field paths
fields_to_extract = [
    "name",                           # Top-level field
    "age",                            # Top-level field
    "user\\profile\\email",           # Nested field using backslashes
    "user\\profile\\contact\\phone",  # Deeply nested field
    "settings\\theme",                # Nested field
    "users\\name",                    # Extract field from array of objects
]
Step 3: Run the Extraction Job
# Run the parse_json_fields function
job = client.add_job(
    model_id=model.id,
    function="@istari:parse_json_fields",
    tool_name="textract",
    tool_version="1.6.3",
    operating_system="Windows 11",  # Or: Ubuntu 22.04, RHEL 8, etc.
    parameters={"fields_to_extract": fields_to_extract},
)
print(f"JSON field extraction started, job ID: {job.id}")
Step 4: Check Job Status & Retrieve Results
import json

# Wait for completion
job.poll_job()

# Retrieve the extracted fields
model = client.get_model(model.id)
for artifact in model.artifacts:
    if artifact.name == "extracted_fields.json":
        # Read and display extracted fields
        fields_data = json.loads(artifact.read_text())
        print(f"Extracted fields: {json.dumps(fields_data, indent=2)}")
    elif artifact.name == "metadata_report.json":
        # Read metadata report
        metadata = json.loads(artifact.read_text())
        print(f"Extraction summary: {metadata['extraction_summary']}")
JSON Field Extraction Examples
Example 1: Simple Top-Level Fields
Input JSON:
{
  "name": "John Doe",
  "age": 30,
  "email": "john@example.com"
}
Fields to Extract:
{
  "parameters": {
    "type": "parameter",
    "value": ["name", "age"]
  }
}
Output:
{
  "name": "John Doe",
  "age": 30
}
Example 2: Nested Fields with Backslash Notation
Input JSON:
{
  "user": {
    "profile": {
      "name": "John",
      "contact": {
        "email": "john@example.com",
        "phone": "555-0123"
      }
    }
  }
}
Fields to Extract:
{
  "parameters": {
    "type": "parameter",
    "value": ["user\\profile\\name", "user\\profile\\contact\\email"]
  }
}
Output:
{
  "user\\profile\\name": "John",
  "user\\profile\\contact\\email": "john@example.com"
}
Example 3: Array Field Extraction
Input JSON:
{
  "users": [
    { "name": "John", "age": 30 },
    { "name": "Jane", "age": 25 }
  ]
}
Fields to Extract:
{
  "parameters": {
    "type": "parameter",
    "value": ["users\\name", "users\\age"]
  }
}
Output:
{
  "users\\name": ["John", "Jane"],
  "users\\age": [30, 25]
}
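The three examples above suggest a simple resolution rule: split the path on backslashes, walk into objects key by key, and when an array is encountered, apply the remaining keys to every element. The following is a minimal local sketch of that rule, useful for predicting output before submitting a job. It is not the module's implementation; resolve_path and extract_fields are hypothetical names, and returning None for a missing key is an assumption based on the troubleshooting note about null values.

```python
def _step(node, key):
    """Move one level down: map over lists, index into dicts."""
    if isinstance(node, list):
        # Arrays are transparent: the key is applied to every element.
        return [_step(item, key) for item in node]
    if isinstance(node, dict):
        return node.get(key)  # assumption: a missing key yields None
    return None


def resolve_path(data, path, separator="\\"):
    """Resolve a backslash-separated field path against parsed JSON data."""
    current = data
    for key in path.split(separator):
        current = _step(current, key)
    return current


def extract_fields(data, paths):
    """Return a dict keyed by the requested path, as in the examples above."""
    return {path: resolve_path(data, path) for path in paths}
```

Note that "users\\name" in Python source is the string users\name with a single backslash; the doubled backslash is only the escape sequence, in Python and in JSON alike.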
Installation
Prerequisites
- Python 3.11 or higher (for development only)
- Istari Digital Agent version 9.0.0 or higher
- For Linux: System dependencies for textract library (installed automatically by agent)
Configuration
Module Version 1.0.0+: Zero Configuration Required! ✓
Starting with module version 1.0.0, the Open Text integration requires no manual configuration. The module includes all necessary dependencies and works out-of-the-box once installed by the Istari Digital Agent.
System Dependencies
The module automatically handles text extraction for all supported formats. On Linux systems, the agent installation script will install any required system libraries for the textract library.
Windows: No additional dependencies required.
Linux (Ubuntu/RHEL): Dependencies are automatically installed during agent setup.
License Configuration
No license is required for the Open Text integration. This module uses open-source libraries and is freely available as part of the Istari Digital Platform.
Versions
Current Module Version: 1.0.0
This is the initial release of the Open Text extraction module.
Compatibility Notes
- Agent Version: Requires Istari Digital Agent version 9.0.0 or higher
- Operating Systems: Windows 10, Windows 11, Windows Server 2019, Windows Server 2022, Ubuntu 22.04, RHEL 8
- Python Version: 3.11+ (development dependency only; end users don't need Python)
- Textract Version: 1.6.3
Changelog
Module Version 1.0.0
Release Date: December 2024
Initial Release Features:
- Support for 10 file formats: CSV, DOCX, HTM, HTML, JSON, LOG, ODT, PSV, TSV, TXT
- @istari:extract function for text extraction from documents
- @istari:parse_json_fields function for structured JSON field extraction
- Backslash notation support for nested JSON field access
- Array field extraction from JSON files
- Detailed metadata reporting with performance metrics
- Security-focused implementation using direct format parsers
- Zero-configuration setup
- Comprehensive error handling and logging
Release Notes
Key Changes Between Versions
Version 1.0.0 (Initial Release):
- First public release of the Open Text extraction module
- Provides text extraction capabilities for 10 file formats
- Includes specialized JSON field parsing functionality
- Supports Windows 10/11, Windows Server 2019/2022, Ubuntu 22.04, and RHEL 8
- Requires Istari Digital Agent 9.0.0 or higher
Troubleshooting
Common Issues
Issue: Extraction Failed or Empty Output
- Symptom: Error messages indicating extraction failure or an empty extracted_text.txt file
- Cause: Unsupported file format, corrupted file, or file doesn't contain extractable text
- Solution:
- Verify your file is in one of the supported formats: .csv, .docx, .htm, .html, .json, .log, .odt, .psv, .tsv, .txt
- Check that the file opens correctly in its native application
- Review the metadata_report.json for specific error details
- Try re-saving the file in its native application to repair potential corruption
- Ensure the file actually contains text content (not just images or embedded objects)
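The first check in the list above, confirming that the file extension is supported, can be done locally before uploading. The sketch below uses the extension list from this page; the helper names is_supported and unsupported_files are hypothetical, and treating extensions case-insensitively is an assumption, since this page does not say how the module handles upper-case extensions.

```python
from pathlib import Path

# Extensions listed under "Supported File Types" on this page.
SUPPORTED_EXTENSIONS = {
    ".csv", ".docx", ".htm", ".html", ".json",
    ".log", ".odt", ".psv", ".tsv", ".txt",
}


def is_supported(path):
    """Return True if the file extension is in the supported list."""
    # Assumption: extensions are compared case-insensitively.
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def unsupported_files(paths):
    """Return the subset of paths whose extension is not supported."""
    return [p for p in paths if not is_supported(p)]
```

A check like this only looks at the file name. A corrupted or mislabeled file can still fail extraction, so the metadata_report.json remains the place to look for the actual error.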
Issue: JSON Field Not Found
- Symptom: extracted_fields.json is missing expected fields or shows null values
- Cause: Incorrect field path, field doesn't exist in JSON, or incorrect backslash notation syntax
- Solution:
- Verify the JSON structure by opening the file in a JSON viewer
- Double-check field names are spelled correctly (JSON is case-sensitive)
- Ensure backslash notation path is correct for nested fields (e.g., user\\profile\\email)
- Review the metadata_report.json for the list of missing fields
- Test with a simpler field path first (top-level field) before trying nested paths
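When a path keeps coming back empty, it can help to list every path that actually exists in the file and compare. The following sketch enumerates backslash paths from parsed JSON, treating arrays as transparent in the same way the array example on this page does. The function name list_field_paths is hypothetical, and this is a local debugging aid rather than part of the module.

```python
def list_field_paths(node, prefix="", separator="\\"):
    """Return every backslash path reachable in a parsed JSON value, sorted."""
    paths = set()
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}{separator}{key}" if prefix else key
            paths.add(path)
            paths.update(list_field_paths(value, path, separator))
    elif isinstance(node, list):
        # Array elements share their parent's path (e.g. users\name).
        for item in node:
            paths.update(list_field_paths(item, prefix, separator))
    return sorted(paths)
```

Typical usage would be to load the file with json.load, call list_field_paths on the result, and check that each entry in fields_to_extract appears in the returned list with identical spelling and case.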
Issue: Module Not Found or Won't Execute
- Symptom: Error messages indicating the module cannot be found or executed
- Cause: Module not installed correctly, wrong agent version, or missing dependencies
- Solution:
- Verify Istari Digital Agent version is 9.0.0 or higher
- Check that the Open Text module is installed in the correct agent modules directory
- Restart the Istari Digital Agent service
- Review agent logs for specific error messages
- On Linux, verify textract dependencies are installed: python3 -m pip list | grep textract
- Contact your administrator to verify the module installation
Issue: Slow Extraction Performance
- Symptom: Extraction takes longer than expected
- Cause: Large file size, complex document structure, or system resource constraints
- Solution:
- Check the file size - very large files (>100MB) may take longer to process
- Review the metadata_report.json for processing time metrics
- For DOCX files, complex formatting can slow extraction - try saving as plain text first
- Ensure the agent machine has adequate CPU and memory resources
- Consider splitting very large files into smaller chunks
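For line-oriented files such as .log and .txt, splitting can be done with a few lines of Python. The sketch below writes numbered chunk files next to the original; the function name split_text_file, the chunk naming scheme, and the default of 50,000 lines per chunk are all illustrative assumptions, not recommendations from the module.

```python
from pathlib import Path


def split_text_file(path, lines_per_chunk=50_000, encoding="utf-8"):
    """Split a line-oriented text file into numbered chunk files beside the original."""
    source = Path(path)
    chunk_paths = []
    buffer = []

    def flush():
        # Chunk names look like big.part001.log, big.part002.log, ...
        index = len(chunk_paths) + 1
        chunk_path = source.with_name(f"{source.stem}.part{index:03d}{source.suffix}")
        # newline="" writes line endings exactly as they were read.
        with chunk_path.open("w", encoding=encoding, newline="") as out:
            out.writelines(buffer)
        chunk_paths.append(chunk_path)
        buffer.clear()

    # newline="" disables newline translation so original endings are preserved.
    with source.open("r", encoding=encoding, newline="") as handle:
        for line in handle:
            buffer.append(line)
            if len(buffer) >= lines_per_chunk:
                flush()
    if buffer:
        flush()
    return chunk_paths
```

This approach does not repeat a header row, so for CSV, PSV, or TSV files each chunk after the first would lack column names. It is also unsuitable for DOCX, ODT, or JSON, which are not line-oriented.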
Issue: Encoding or Special Character Problems
- Symptom: Extracted text contains garbled characters or question marks
- Cause: File uses non-UTF-8 encoding or contains special characters
- Solution:
- For text files, try re-saving with UTF-8 encoding in your text editor
- Check the metadata_report.json for encoding information
- If processing international characters, verify the source file uses UTF-8 encoding
- Some special characters may not be preserved during extraction
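Re-saving as UTF-8 can also be scripted. The sketch below tries a short list of candidate encodings and writes a UTF-8 copy using the first one that decodes cleanly. The function name convert_to_utf8 and the candidate list are assumptions: cp1252 is a common legacy encoding on Windows, but a file that decodes without error under it is not guaranteed to have been written in it, so the result should be spot-checked.

```python
from pathlib import Path


def convert_to_utf8(source, destination, candidate_encodings=("utf-8-sig", "cp1252")):
    """Write a UTF-8 copy of source; return the encoding that decoded it.

    utf-8-sig also accepts plain UTF-8 and strips a byte-order mark if present.
    """
    raw = Path(source).read_bytes()
    for encoding in candidate_encodings:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate
        Path(destination).write_bytes(text.encode("utf-8"))
        return encoding
    raise ValueError(f"None of {candidate_encodings} could decode {source}")
```

Writing to a separate destination keeps the original file intact in case the guessed encoding turns out to be wrong.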
Getting Help
If you continue to experience issues:
- Check the module log files for detailed error messages
- Review the metadata_report.json artifact for extraction details
- Review the Istari Digital Agent logs for additional context
- Consult the main troubleshooting guide for general agent issues
- Contact Istari Digital support with:
- Module version (1.0.0)
- Agent version
- Operating system
- File format and size
- Error messages from logs and metadata report
- Steps to reproduce the issue
Tips and Best Practices
Optimal Usage Patterns
- File Format Selection: Use the simplest format that meets your needs (e.g., TXT for plain text rather than DOCX)
- Batch Processing: Process multiple files by creating separate upload jobs for efficient workflow automation
- JSON Field Planning: Review your JSON structure before defining field paths to ensure accuracy
- Test Extractions: Run a test extraction on a sample file before processing large batches
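The batch-processing pattern can be sketched as a loop over the same client.add_model and client.add_job calls shown in the SDK section of this page. The wrapper names submit_extraction_batch and wait_for_all are hypothetical, and the sketch assumes those SDK calls accept exactly the keyword arguments used in the examples above; error handling is omitted.

```python
from pathlib import Path


def submit_extraction_batch(client, paths, operating_system, tool_version="1.6.3"):
    """Upload each file and start one @istari:extract job per file.

    Returns a list of (model_id, job) pairs in submission order.
    """
    jobs = []
    for path in paths:
        model = client.add_model(
            path=str(path),
            description="Document for text extraction",
            display_name=Path(path).name,
        )
        job = client.add_job(
            model_id=model.id,
            function="@istari:extract",
            tool_name="textract",
            tool_version=tool_version,
            operating_system=operating_system,
        )
        jobs.append((model.id, job))
    return jobs


def wait_for_all(jobs):
    """Poll each submitted job in turn, as in the single-file example."""
    for _model_id, job in jobs:
        job.poll_job()
```

Submitting all jobs before polling any of them lets the agent work on several files at once, rather than waiting for each file to finish before uploading the next.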
Performance Considerations
- File Size: Smaller files (< 10MB) extract fastest. Consider splitting large files for better performance
- Format Complexity: Plain text and CSV files extract faster than complex DOCX or ODT files
- Concurrent Jobs: The agent can handle multiple extraction jobs in parallel for improved throughput
- Network Transfer: Upload time depends on file size and network speed; extraction itself is typically fast
Data Quality Best Practices
- Source Quality: Higher quality source documents produce better text extraction results
- Format Consistency: Use consistent file formats across your document sets for predictable results
- Validation: Always review the metadata_report.json to confirm successful extraction
- Character Counts: Use the character count in metadata to verify complete extraction
JSON Field Extraction Best Practices
- Field Path Validation: Test field paths on a sample JSON first before processing large datasets
- Error Handling: Review the metadata report for missing fields to identify extraction issues
- Nested Structures: Use backslash notation (\\) for deeply nested structures - verify each level exists
- Array Processing: When extracting from arrays, expect output as an array of values
Security Considerations
- Direct Parsers: This module uses direct format parsers for enhanced security (no external service calls)
- Local Processing: All extraction happens locally on the agent machine - no data leaves your network
- File Validation: The module validates file formats before processing to prevent malicious file handling
- Access Control: Use Istari Digital Platform access controls to limit who can extract sensitive documents
FAQ
- What happens if my file format isn't in the supported list?
  - The extraction will fail with an error message. Convert your file to a supported format first (e.g., save as TXT or DOCX).
- Does this preserve formatting from Word documents?
  - No, the module extracts plain text content only. Formatting, images, and layout are not preserved.
- Can I extract text from PDFs?
  - PDF extraction is handled by a separate module. See the PDF extraction documentation for PDF support.
- How do I know if extraction was successful?
  - Check the metadata_report.json artifact, which contains a success field and a detailed extraction summary including character counts and processing time.
- Why is my JSON field extraction returning null?
  - The field path may be incorrect, the field doesn't exist in your JSON, or there's a typo in the field name. Remember to use backslashes (\\) for nested fields, not dots. JSON is case-sensitive - verify the exact field name in your source file.
- Can I extract multiple fields at once from JSON?
  - Yes! You can provide an array of field paths in the parameters.value array. All fields will be extracted in a single operation.
- What's the difference between @istari:extract and @istari:parse_json_fields for JSON files?
  - @istari:extract converts the entire JSON to readable text format. @istari:parse_json_fields extracts specific fields in structured JSON format, which is better for data processing workflows.
- Does the module support CSV files with custom delimiters?
  - Yes! Use .csv for comma-separated, .tsv for tab-separated, and .psv for pipe-separated files. The module automatically detects the delimiter.
- Are there file size limits?
  - There's no hard limit, but very large files (>100MB) may take longer to process. Processing time scales with file size and complexity.
- Can I use this module offline?
  - Yes, the module runs entirely locally on the agent machine and requires no internet connection for extraction.
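As a companion to the delimiter question above, the following sketch shows one way to preview a tabular file locally with Python's standard csv module, choosing the delimiter from the file extension. How the module itself detects delimiters is not described on this page, so the extension-based mapping here is an assumption, and DELIMITERS and read_rows are hypothetical names.

```python
import csv
from pathlib import Path

# Assumption: the delimiter follows the extension convention from the FAQ.
DELIMITERS = {".csv": ",", ".tsv": "\t", ".psv": "|"}


def read_rows(path, encoding="utf-8"):
    """Read a .csv, .tsv, or .psv file into a list of rows (lists of strings)."""
    suffix = Path(path).suffix.lower()
    if suffix not in DELIMITERS:
        raise ValueError(f"Unsupported tabular extension: {suffix!r}")
    # newline="" is what the csv module expects for correct quoted-newline handling.
    with open(path, newline="", encoding=encoding) as handle:
        return list(csv.reader(handle, delimiter=DELIMITERS[suffix]))
```

If the rows come back as single-element lists, the file's real delimiter probably does not match its extension, which is worth fixing before uploading.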