{"id":6264,"library":"tika","title":"Apache Tika Python Client","description":"tika-python is a client for the Apache Tika server, a toolkit for document parsing and metadata extraction from over a thousand different file types. It provides a simple API to interact with a running Tika server, allowing Python applications to leverage Tika's capabilities. The current version is 3.1.0. Releases tend to follow major Apache Tika server releases, with occasional bugfix updates.","status":"active","version":"3.1.0","language":"en","source_language":"en","source_url":"http://github.com/chrismattmann/tika-python","tags":["text-extraction","document-processing","nlp","metadata-extraction","pdf","java"],"install":[{"cmd":"pip install tika","lang":"bash","label":"Install `tika-python` client"}],"dependencies":[{"reason":"Used for HTTP communication with the Apache Tika server.","package":"requests"},{"reason":"Used for multipart/form-data encoding when sending files to the Tika server.","package":"requests-toolbelt"}],"imports":[{"symbol":"parser","correct":"from tika import parser"},{"symbol":"config","correct":"from tika import config"}],"quickstart":{"code":"import os\nfrom tika import parser, config\n\n# IMPORTANT: This library requires a running Apache Tika server (Java application).\n# tika-python attempts to start a Tika server automatically if one isn't found,\n# but this requires Java to be installed and the 'java' executable to be on PATH.\n# For robust usage, start the Tika server manually\n# (e.g., 'java -jar /path/to/tika-server.jar') before running your Python code\n# and point the client at it with the TIKA_SERVER_ENDPOINT environment variable.\n# Ensure the Tika server is accessible (default port is 9998).\n\n# Create a dummy file for parsing\ndummy_file_path = \"dummy_document.txt\"\nwith open(dummy_file_path, \"w\") as f:\n    f.write(\"This is a test document for Apache Tika.\")\n    f.write(\"\\nIt contains some sample 
text.\")\n\ntry:\n    # Parse the dummy file\n    parsed = parser.from_file(dummy_file_path)\n\n    if parsed and parsed.get(\"content\"):\n        print(\"Extracted content:\")\n        print(parsed[\"content\"].strip())\n        print(\"\\nExtracted metadata (sample):\")\n        print(f\"Content-Type: {parsed['metadata'].get('Content-Type')}\")\n        print(f\"Content-Length: {parsed['metadata'].get('Content-Length')}\")\n    else:\n        print(\"Failed to extract content or metadata. Check Tika server logs.\")\n\nexcept Exception as e:\n    print(f\"An error occurred during parsing: {e}\")\n    print(\"This often indicates the Tika server is not running or not accessible.\")\n    print(\"Please ensure you have Java installed and the Tika server is running.\")\n\nfinally:\n    if os.path.exists(dummy_file_path):\n        os.remove(dummy_file_path)\n","lang":"python","description":"This example parses a local file using the `tika-python` client. It requires an Apache Tika server to be running and accessible. The client attempts to start the server automatically, but explicit server management is recommended for production use."},"warnings":[{"fix":"Ensure Java is installed and the `java` executable is on PATH. For production, start the Tika server manually (e.g., `java -jar tika-server.jar`) and point `tika-python` at it with the `TIKA_SERVER_ENDPOINT` environment variable, rather than relying on auto-start.","message":"The `tika-python` library is a client to the Apache Tika server (a Java application) and *requires* the server to be running. 
It will attempt to start a server automatically if none is found, but this fails if Java is not installed or the `java` executable is not on PATH.","severity":"breaking","affected_versions":"All versions"},{"fix":"For large-scale processing, parse files concurrently (e.g., with a thread pool or a task queue such as Celery), raise the client read timeout (e.g., `requestOptions={'timeout': 120}` on `parser.from_file`), and give the Tika server sufficient memory and CPU.","message":"Performance issues or timeouts can occur when parsing very large files or processing many files synchronously, because each Tika operation is a blocking HTTP call. The Tika server itself also has memory and CPU requirements.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Review code that passes custom headers to `from_file` or `from_buffer` and test thoroughly after upgrading to 3.1.0 to ensure the intended behavior is preserved.","message":"The handling of the `headers` argument to `parser.from_file` and `parser.from_buffer` was fixed in version 3.1.0. If you were relying on the previous (potentially incorrect) behavior, your parsing results or header handling might change.","severity":"breaking","affected_versions":">=3.1.0"},{"fix":"Ideally, match your `tika-python` client version with the major version of the Tika server you are connecting to. Always test thoroughly when upgrading either the client or the server.","message":"Compatibility issues can arise between the `tika-python` client version and the Apache Tika server version it connects to, especially across major server releases (e.g., Tika server 1.x vs 2.x/3.x).","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-14T00:00:00.000Z","next_check":"2026-07-13T00:00:00.000Z"}