Apache Tika Python Client
tika-python is a client for the Apache Tika server, a powerful toolkit for document parsing and metadata extraction from over a thousand different file types. It provides a simple API to interact with a running Tika server, allowing Python applications to leverage Tika's capabilities. The current version is 3.1.0. Releases tend to follow major Apache Tika server releases, with occasional bugfix updates.
Warnings
- breaking `tika-python` is a client for the Apache Tika server (a Java application) and *requires* a running server. If no server is found, it tries to start one automatically, which fails when Java cannot be found (for example, `java` is not on the PATH or JAVA_HOME is misconfigured).
- gotcha Performance issues or timeouts can occur when parsing very large files or processing many files synchronously, as Tika operations are blocking HTTP calls. The Tika server itself also has memory and CPU requirements.
- breaking The behavior of passing `headers` arguments to `parser.from_file` and `parser.from_buffer` was fixed in version 3.1.0. If you were relying on previous (potentially incorrect) behavior, your parsing results or header handling might change.
- gotcha Compatibility issues can arise between the `tika-python` client version and the Apache Tika server version it connects to, especially across major server releases (e.g., Tika server 1.x vs. 2.x/3.x).
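One way to sidestep the auto-start and timeout problems listed above is to run the server yourself and treat tika-python as a plain REST client. The sketch below assumes the `TIKA_SERVER_ENDPOINT` and `TIKA_CLIENT_ONLY` environment variables and the `requestOptions` argument behave as described in the tika-python README; verify them against your installed version. The helper names `configure_tika_client` and `parse_with_timeout`, and the host `tika.internal`, are hypothetical.

```python
import os


def configure_tika_client(endpoint="http://localhost:9998", client_only=True, env=None):
    """Set tika-python's environment variables. Call this BEFORE importing tika,
    because the library reads them at import time."""
    env = os.environ if env is None else env
    env["TIKA_SERVER_ENDPOINT"] = endpoint
    if client_only:
        # Assumption: any non-empty value disables the automatic server start.
        env["TIKA_CLIENT_ONLY"] = "True"
    return env


def parse_with_timeout(path, timeout_seconds=120):
    """Parse one file with an explicit HTTP timeout (in seconds)."""
    # Imported lazily so the environment variables above are already in place.
    from tika import parser
    return parser.from_file(path, requestOptions={"timeout": timeout_seconds})


if __name__ == "__main__":
    # Hypothetical endpoint; replace with the address of your own Tika server.
    configure_tika_client("http://tika.internal:9998")
```

Keeping the server outside the Python process also makes it easier to pin a server version that matches the client.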
Install
pip install tika
Imports
- parser
from tika import parser
- config
from tika import config
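As a minimal sketch of how the `parser` import is typically used with in-memory data, the hypothetical helper `extract_text` below wraps `parser.from_buffer` and normalizes a missing `content` value to an empty string. The `parse` parameter is an assumption added here so the helper can be exercised without a server; it is not part of tika-python.

```python
def extract_text(data, parse=None):
    """Return the stripped text extracted from in-memory data, or '' if none."""
    if parse is None:
        # Imported lazily so that defining the helper does not require tika.
        from tika import parser
        parse = parser.from_buffer
    parsed = parse(data) or {}
    # 'content' can be None when Tika finds no text, so fall back to ''.
    return (parsed.get("content") or "").strip()


if __name__ == "__main__":
    # A stand-in parse function, used only to show the shape of the result.
    def fake_parse(data):
        return {"content": "\n hello \n", "metadata": {}, "status": 200}

    print(extract_text(b"raw bytes", parse=fake_parse))
```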
Quickstart
import os
from tika import parser
# IMPORTANT: This library requires a running Apache Tika server (a Java application).
# tika-python tries to start a Tika server automatically if one isn't found,
# but this requires Java to be installed and discoverable.
# For robust usage, start the Tika server manually before running your Python code
# (e.g., 'java -jar /path/to/tika-server.jar') and, if it is not at the default
# address, point the client at it with the TIKA_SERVER_ENDPOINT environment variable.
# Ensure the Tika server is accessible (default port is 9998).
# Create a dummy file for parsing
dummy_file_path = "dummy_document.txt"
with open(dummy_file_path, "w") as f:
    f.write("This is a test document for Apache Tika.")
    f.write("\nIt contains some sample text.")
try:
    # Parse the dummy file
    parsed = parser.from_file(dummy_file_path)
    if parsed and parsed.get("content"):
        print("Extracted content:")
        print(parsed["content"].strip())
        print("\nExtracted metadata (sample):")
        print(f"Content-Type: {parsed['metadata'].get('Content-Type')}")
        print(f"Content-Length: {parsed['metadata'].get('Content-Length')}")
    else:
        print("Failed to extract content or metadata. Check Tika server logs.")
except Exception as e:
    print(f"An error occurred during parsing: {e}")
    print("This often indicates the Tika server is not running or not accessible.")
    print("Please ensure you have Java installed and the Tika server is running.")
finally:
    # Clean up the dummy file whether or not parsing succeeded
    if os.path.exists(dummy_file_path):
        os.remove(dummy_file_path)
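The quickstart parses a single file with one blocking call. When many files are involved, one possible approach is to overlap the HTTP requests with a small thread pool. The sketch below assumes a Tika server is already running and can handle concurrent requests; `parse_many` and its `parse` parameter are hypothetical names introduced for illustration, not part of tika-python.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_many(paths, parse=None, max_workers=4):
    """Parse several files concurrently.

    Returns a dict mapping each path to its parsed result, or to the
    exception raised while parsing that path.
    """
    if parse is None:
        # Imported lazily so that defining the helper does not require tika.
        from tika import parser
        parse = parser.from_file
    paths = list(paths)

    def safe_parse(path):
        # Capture per-file failures so one bad file does not abort the batch.
        try:
            return parse(path)
        except Exception as exc:
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(safe_parse, paths))
    return dict(zip(paths, results))
```

Keeping `max_workers` small avoids overloading the server, which has its own memory and CPU limits.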