Apache Tika Python Client

3.1.0 · active · verified Tue Apr 14

tika-python is a Python client for the Apache Tika server. Apache Tika is a toolkit for parsing documents and extracting metadata from over a thousand file types. The client provides a simple API for talking to a running Tika server, so Python applications can use Tika's capabilities over HTTP. The current version is 3.1.0. Releases tend to follow major Apache Tika server releases, with occasional bugfix updates.

Warnings

Install

Imports

Quickstart

This example demonstrates how to parse a local file using the `tika-python` client. It needs a reachable Apache Tika server. If none is running, the client tries to download and start one automatically, which requires Java. Production deployments usually manage the server explicitly.

import os
from tika import parser

# IMPORTANT: This library requires a running Apache Tika server (a Java application).
# If no server is found, tika-python tries to download and start one automatically,
# which requires a working Java installation with `java` on the PATH.
# For robust usage, start the Tika server manually
# (e.g., 'java -jar /path/to/tika-server.jar') before running your Python code and
# point the client at it with the TIKA_SERVER_ENDPOINT environment variable.
# Setting TIKA_CLIENT_ONLY=True stops the client from starting its own server.
# Ensure the Tika server is accessible (default port is 9998).

# Create a dummy file for parsing
dummy_file_path = "dummy_document.txt"
with open(dummy_file_path, "w") as f:
    f.write("This is a test document for Apache Tika.")
    f.write("\nIt contains some sample text.")

try:
    # Parse the dummy file
    parsed = parser.from_file(dummy_file_path)

    if parsed and parsed.get("content"):
        print("Extracted content:")
        print(parsed["content"].strip())
        print("\nExtracted metadata (sample):")
        print(f"Content-Type: {parsed['metadata'].get('Content-Type')}")
        print(f"Content-Length: {parsed['metadata'].get('Content-Length')}")
    else:
        print("Failed to extract content or metadata. Check Tika server logs.")

except Exception as e:
    print(f"An error occurred during parsing: {e}")
    print("This often indicates the Tika server is not running or not accessible.")
    print("Please ensure you have Java installed and the Tika server is running.")

finally:
    if os.path.exists(dummy_file_path):
        os.remove(dummy_file_path)
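
When the server is managed explicitly, it can help to check that it is reachable before parsing and to tell the client not to start its own copy. The following is a minimal sketch, not the library's prescribed workflow. The helper names `server_is_reachable` and `parse_with_running_server` are hypothetical, the endpoint `http://localhost:9998` assumes a default local server, and the TCP check only confirms that something is listening on the port, not that it is Tika. The `TIKA_CLIENT_ONLY` and `TIKA_SERVER_ENDPOINT` environment variables and the `serverEndpoint` argument come from tika-python.

```python
import os
import socket
from urllib.parse import urlparse

# Assumption: a Tika server is already running here (9998 is Tika's default port).
TIKA_ENDPOINT = os.environ.get("TIKA_SERVER_ENDPOINT", "http://localhost:9998")


def server_is_reachable(endpoint: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the endpoint's host and port succeeds."""
    parsed = urlparse(endpoint)
    host = parsed.hostname or "localhost"
    # Fall back to the scheme's standard port when the URL does not name one.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def parse_with_running_server(path: str, endpoint: str = TIKA_ENDPOINT) -> dict:
    """Parse a file against an already running server, without auto-starting one."""
    # tika reads these variables when it is first imported, so set them beforehand.
    os.environ.setdefault("TIKA_CLIENT_ONLY", "True")
    os.environ.setdefault("TIKA_SERVER_ENDPOINT", endpoint)
    from tika import parser  # imported late on purpose, see comment above

    return parser.from_file(path, serverEndpoint=endpoint)


if __name__ == "__main__":
    if server_is_reachable(TIKA_ENDPOINT):
        # "dummy_document.txt" is a placeholder path for illustration.
        result = parse_with_running_server("dummy_document.txt")
        print((result.get("content") or "").strip())
    else:
        print(f"No Tika server reachable at {TIKA_ENDPOINT}")
```

Failing fast on an unreachable server gives a clearer error than waiting for the client's own startup attempt to time out. If `tika` has already been imported elsewhere in the process, the environment variables set here may have no effect, so set them as early as possible.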
