{"id":10025,"library":"pdftotext","title":"pdftotext","description":"pdftotext is a Python extension module for extracting text from PDF documents. It is built against the C++ interface of the Poppler PDF rendering library (`poppler-cpp`) and does not call the `pdftotext` command-line utility. The current version is 3.0.0; major releases are less frequent than minor bug-fix releases.","status":"active","version":"3.0.0","language":"en","source_language":"en","source_url":"https://github.com/jalan/pdftotext","tags":["PDF","text extraction","document processing","poppler"],"install":[{"cmd":"pip install pdftotext","lang":"bash","label":"Python package"},{"cmd":"sudo apt install build-essential libpoppler-cpp-dev pkg-config python3-dev # Debian/Ubuntu\nsudo dnf install gcc-c++ pkgconfig poppler-cpp-devel python3-devel # Fedora\nbrew install pkg-config poppler python # macOS (Homebrew)","lang":"bash","label":"System dependencies (Poppler C++ library and build tools; install before the Python package)"}],"dependencies":[],"imports":[{"symbol":"pdftotext","correct":"import pdftotext"}],"quickstart":{"code":"import pdftotext\nimport os\n\n# Create a dummy PDF file for demonstration\ndummy_pdf_content = b\"%PDF-1.4\\n1 0 obj<</Type/Catalog/Pages 2 0 R>>endobj 2 0 obj<</Type/Pages/Count 1/Kids[3 0 R]>>endobj 3 0 obj<</Type/Page/Parent 2 0 R/MediaBox[0 0 612 792]/Contents 4 0 R>>endobj 4 0 obj<</Length 44>>stream\\nBT /F1 24 Tf 100 700 Td (Hello, pdftotext!) 
Tj ET\\nendstream\\nendobj\\nxref\\n0 5\\n0000000000 65535 f\\n0000000009 00000 n\\n0000000055 00000 n\\n0000000109 00000 n\\n0000000216 00000 n\\ntrailer<</Size 5/Root 1 0 R>>startxref 303\\n%%EOF\"\nwith open(\"dummy.pdf\", \"wb\") as f:\n    f.write(dummy_pdf_content)\n\n# Load your PDF file\ntry:\n    with open(\"dummy.pdf\", \"rb\") as f:\n        pdf = pdftotext.PDF(f)\n\n    # Get all text from the document (each element is a page)\n    full_text = \"\\n\\n\".join(pdf)\n    print(\"--- Full PDF Text ---\")\n    print(full_text)\n\n    # Get text from a specific page (e.g., the first page)\n    if len(pdf) > 0:\n        first_page_text = pdf[0]\n        print(\"\\n--- First Page Text ---\")\n        print(first_page_text)\n    else:\n        print(\"\\nNo pages found in PDF.\")\nexcept pdftotext.Error as e:\n    # Raised when Poppler cannot load the document\n    print(f\"Error processing PDF: {e}\")\nfinally:\n    # Clean up the dummy file\n    if os.path.exists(\"dummy.pdf\"):\n        os.remove(\"dummy.pdf\")\n","lang":"python","description":"This quickstart demonstrates how to load a PDF, extract all text by joining its pages, and access text from individual pages using list-like indexing. It also catches `pdftotext.Error`, which is raised when Poppler cannot load the document, for example because the file is corrupt."},"warnings":[{"fix":"Replace `for page in pdf.pages:` with `for page in pdf:` and `pdf.pages[0]` with `pdf[0]`.","message":"The `pdf.pages` attribute was removed in version 3.0.0. The `pdftotext.PDF` object now behaves like a list of strings, where each string is the text of a page. Old code referencing `pdf.pages` will break.","severity":"breaking","affected_versions":">=3.0.0"},{"fix":"Install the Poppler C++ development package (`libpoppler-cpp-dev` on Debian/Ubuntu, `poppler-cpp-devel` on Fedora, `poppler` on macOS via Homebrew), or the equivalent package for your operating system. 
A C++ compiler, `pkg-config`, and the Python development headers must also be present so that pip can build the extension.","message":"This library is a compiled extension that links against Poppler's C++ library (`poppler-cpp`); it does not call the `pdftotext` command-line utility. `pip install pdftotext` typically builds the extension from source, so the Poppler C++ development files (e.g., `libpoppler-cpp-dev` on Debian/Ubuntu, `poppler` on macOS) must be installed first.","severity":"gotcha","affected_versions":"All versions"},{"fix":"For extremely large PDFs, iterate over the pages and process each page's text as you go instead of joining them into one string. If memory is still a problem, explore alternative libraries better suited to streaming or low-memory operation.","message":"Processing very large or complex PDF documents can be memory-intensive: the whole file is read into memory when `pdftotext.PDF` is constructed, and joining every page builds one more large string. This can lead to `MemoryError` or slow performance.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-17T00:00:00.000Z","next_check":"2026-07-16T00:00:00.000Z","problems":[{"fix":"Install the Poppler C++ development package and build tools (for example, `sudo apt install build-essential libpoppler-cpp-dev pkg-config python3-dev` on Debian/Ubuntu), then run `pip install pdftotext` again.","cause":"The Poppler C++ development headers are not installed, so the extension cannot be compiled during `pip install pdftotext`.","error":"fatal error: poppler/cpp/poppler-document.h: No such file or directory"},{"fix":"Remove the `.pages` access. The `pdftotext.PDF` object is itself iterable and indexable. For example, use `for page in pdf:` instead of `for page in pdf.pages:` and `pdf[0]` instead of `pdf.pages[0]`.","cause":"The code accesses `pdf.pages` on a `pdftotext.PDF` object with `pdftotext` library version 3.0.0 or higher, where the `pages` attribute no longer exists.","error":"AttributeError: 'pdftotext.PDF' object has no attribute 'pages'"},{"fix":"Explicitly specify UTF-8 encoding when writing to files: `with open('output.txt', 'w', encoding='utf-8') as f: f.write(text)`. 
For printing, ensure your terminal is configured for UTF-8.","cause":"The extracted text is a Python `str` and may contain characters that the platform's default encoding (for example, a Windows code page) cannot represent, so printing it or writing it to a file without an explicit encoding fails.","error":"UnicodeEncodeError: 'charmap' codec can't encode character..."}]}