PDFPlumber

0.11.9 verified Tue May 12 auth: no python install: verified

PDFPlumber is a Python library for extracting text, tables, and detailed layout information from PDF documents. Built on `pdfminer.six`, it exposes individual PDF elements such as characters, lines, rectangles, and curves, and includes visual debugging tools. The current version is 0.11.9; the project is actively developed, and releases regularly track updates to its core dependencies.

pip install pdfplumber
error ModuleNotFoundError: No module named 'pdfplumber'
cause The `pdfplumber` library has not been installed in the active Python environment or the Python interpreter being used does not have access to the installed package.
fix
Ensure you are in the correct Python environment and run: `pip install pdfplumber`
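To confirm which interpreter is running and whether it can see the package, a minimal standard-library sketch follows. The helper name `describe_install` is hypothetical and not part of pdfplumber.

```python
import importlib.util
import sys


def describe_install(module_name: str) -> dict:
    """Report the running interpreter and where (if anywhere) a module resolves."""
    spec = importlib.util.find_spec(module_name)
    return {
        "interpreter": sys.executable,
        "found": spec is not None,
        "location": getattr(spec, "origin", None),
    }


if __name__ == "__main__":
    # Prints the interpreter path and the resolved location of pdfplumber, if any
    print(describe_install("pdfplumber"))
```

If `found` is false, installing with the interpreter shown (`python -m pip install pdfplumber`) targets the same environment the script runs in.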
error AttributeError: module 'pdfplumber' has no attribute 'open'
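cause This usually occurs when a file named `pdfplumber.py` (or a folder named `pdfplumber`) in your project shadows the installed package, so `import pdfplumber` loads your own file instead of the library. Less commonly, the `pdfplumber` package was not installed correctly.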
fix
Rename any file named `pdfplumber.py` in your project directory to something else, then try re-importing. If the issue persists, reinstall pdfplumber using `pip uninstall pdfplumber` followed by `pip install pdfplumber`.
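A quick check for local files that could shadow the package can be sketched as below. `find_shadowing_files` is a hypothetical helper, and it only inspects one directory rather than the full import path.

```python
from pathlib import Path


def find_shadowing_files(project_dir: str, module_name: str = "pdfplumber") -> list:
    """Return local files or folders that would shadow an installed package."""
    root = Path(project_dir)
    # A same-named module file or package folder takes precedence on import
    candidates = [root / f"{module_name}.py", root / module_name]
    return [str(path) for path in candidates if path.exists()]


if __name__ == "__main__":
    print(find_shadowing_files("."))
```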
error ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. pdfplumber X requires pdfminer.six==YYYYMMDD, but you have pdfminer-six ZZZZ (which is incompatible).
cause `pdfplumber` has a strict dependency on a specific version of `pdfminer.six`, and another installed package (or a pre-existing installation) has a conflicting version of `pdfminer.six`.
fix
Reinstalling pdfplumber normally pulls in the `pdfminer.six` version it pins: `pip uninstall pdfplumber pdfminer.six && pip install pdfplumber`. If another package requires a different `pdfminer.six`, the two cannot cleanly share one environment; use separate virtual environments, or choose a pdfplumber release whose pin matches. Running `pip install pdfminer.six --upgrade`, or `pip install pdfplumber --upgrade --no-deps` to manage `pdfminer.six` separately, bypasses the pin and may break at runtime, so treat those as a last resort.
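To see which versions are actually installed and what pdfplumber declares as its requirement, the standard library's `importlib.metadata` can be queried. This is a minimal sketch; the helper names are hypothetical.

```python
from importlib import metadata


def installed_version(dist_name: str):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def declared_requirements(dist_name: str) -> list:
    """Return the requirement strings a distribution declares (empty if absent)."""
    try:
        return metadata.requires(dist_name) or []
    except metadata.PackageNotFoundError:
        return []


if __name__ == "__main__":
    print("pdfplumber:", installed_version("pdfplumber"))
    print("pdfminer.six:", installed_version("pdfminer.six"))
    pins = [r for r in declared_requirements("pdfplumber") if "pdfminer" in r.lower()]
    print("declared pin:", pins)
```

Comparing the declared pin with the installed `pdfminer.six` version shows whether the environment matches what pdfplumber expects.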
error TypeError: can only concatenate str (not "NoneType") to str
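cause This error typically arises when `page.extract_text()` returns `None` and the result is concatenated with a string. Older pdfplumber releases could return `None` for pages with no extractable text (e.g., an empty page, a page with only images, or a malformed page); recent releases appear to return an empty string instead, but guarding against `None` is still cheap.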
fix
Guard against a missing result before concatenating. For example: `final_text += page.extract_text() or ""`.
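The same guard can be wrapped in a small helper when text is collected from many pages. This is a sketch; `join_page_texts` and the file name `report.pdf` are hypothetical.

```python
from typing import Iterable, Optional


def join_page_texts(texts: Iterable[Optional[str]], separator: str = "\n") -> str:
    """Join per-page text, treating None or empty results as empty pages."""
    return separator.join(text or "" for text in texts)


# Typical use with pdfplumber (commented out because it needs a real PDF):
# with pdfplumber.open("report.pdf") as pdf:
#     full_text = join_page_texts(page.extract_text() for page in pdf.pages)
```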
error OSError: [Errno 2] No such file or directory: 'pdftotext'
cause The `pdftotext` executable, which is part of the Poppler utility suite, is not installed or not on your system's PATH. pdfplumber extracts text through `pdfminer.six` and is not known to call `pdftotext` itself, so this error most likely comes from another tool or wrapper in the same pipeline that shells out to Poppler.
fix
Install Poppler on your operating system. For Ubuntu/Debian: `sudo apt-get update && sudo apt-get install -y poppler-utils`. For macOS: `brew install poppler`. For Windows, you typically need to download pre-compiled binaries and add them to your system's PATH.
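Whether an external executable is reachable can be checked up front with `shutil.which`, which avoids a late `OSError`. This is a minimal sketch; `missing_executables` is a hypothetical helper.

```python
import shutil


def missing_executables(names) -> list:
    """Return the executables from `names` that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_executables(["pdftotext"])
    if missing:
        print("Not on PATH:", ", ".join(missing))
```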
breaking In `v0.11.7`, `stroking_pattern` and `non_stroking_pattern` object attributes were removed due to underlying changes in `pdfminer.six`. Code relying on these attributes will break.
fix Review code accessing `stroking_pattern` or `non_stroking_pattern` and adapt to new `pdfminer.six` object structures or alternative methods if needed. The `pdfminer.six` changelog may provide more context on the replacement functionality.
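Code that must run across versions can read the removed keys defensively. The sketch below assumes objects are the plain dictionaries pdfplumber exposes through properties such as `page.rects` or `page.chars`; `color_info` is a hypothetical helper.

```python
def color_info(obj: dict) -> dict:
    """Collect color-related keys, tolerating keys absent in newer releases."""
    keys = (
        "stroking_color",
        "non_stroking_color",
        "stroking_pattern",      # removed in v0.11.7 according to the note above
        "non_stroking_pattern",  # removed in v0.11.7 according to the note above
    )
    # dict.get returns None instead of raising KeyError for missing keys
    return {key: obj.get(key) for key in keys}
```

Using `.get()` rather than indexing keeps the same code working on both sides of the change, at the cost of having to handle `None` downstream.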
breaking Version `v0.11.0` introduced new `line_dir` and `char_dir` parameters for better control over text directionality (e.g., non-left-to-right, top-to-bottom text). While enhancing support for complex PDFs, these changes might subtly alter text extraction behavior for certain documents if relying on previous implicit direction handling.
fix Test existing extraction workflows with `v0.11.0` and later versions. If text extraction results differ, explicitly pass `line_dir` and `char_dir` to the text-extraction call (for example `extract_text()`) to match the desired reading order.
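Pinning the direction explicitly might look like the sketch below. The wrapper name is hypothetical, and the values `"ttb"` (top to bottom) and `"ltr"` (left to right) are assumptions to verify against the pdfplumber documentation for your version.

```python
def extract_text_with_direction(page, line_dir: str = "ttb", char_dir: str = "ltr") -> str:
    """Extract text with an explicit reading order instead of relying on defaults.

    `page` is expected to be a pdfplumber Page (v0.11.0 or later).
    """
    return page.extract_text(line_dir=line_dir, char_dir=char_dir) or ""


# Typical use (commented out because it needs a real PDF):
# with pdfplumber.open("document.pdf") as pdf:
#     text = extract_text_with_direction(pdf.pages[0], char_dir="rtl")
```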
gotcha `pdfplumber` is built on `pdfminer.six`, and updates to `pdfminer.six` can sometimes introduce breaking changes or altered behavior in `pdfplumber`. `pdfplumber` often pins `pdfminer.six` versions, but large jumps can still have impacts.
fix Always test `pdfplumber` upgrades thoroughly, especially if an underlying `pdfminer.six` version jump is noted in the release notes. Consult both `pdfplumber` and `pdfminer.six` changelogs for details on breaking changes.
gotcha When processing large PDF files or many documents, cached page and object properties can consume significant memory. Not properly closing PDF objects can lead to memory leaks.
fix Always use `pdfplumber.open()` with a `with` statement (context manager) to ensure `PDF` and `Page` objects are properly closed and resources are released. If not using a `with` statement, explicitly call `pdf.close()` when done.
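For large documents, pages can be processed one at a time and released as soon as they are done. The sketch below assumes `Page.close()` is available, as in recent pdfplumber releases; `iter_page_texts` is a hypothetical helper.

```python
def iter_page_texts(pdf_path: str):
    """Yield text page by page, releasing each page's cached objects afterwards."""
    import pdfplumber  # imported here so the helper can be defined without pdfplumber installed

    # The context manager closes the underlying file when iteration finishes
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            try:
                yield page.extract_text() or ""
            finally:
                page.close()  # drop this page's cached layout data


# for text in iter_page_texts("large.pdf"):
#     ...  # handle one page at a time instead of holding every page in memory
```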
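gotcha Image-based features like `Page.to_image()` (used for visual debugging) depend on a rendering backend. Since `v0.10.0`, `to_image()` renders through `pypdfium2`, which is installed as a Python dependency; earlier versions relied on Wand with ImageMagick and Ghostscript. Poppler is not known to be required by pdfplumber itself, though companion tools such as `pdf2image` do need it. If the backend is missing, these methods will raise an error.
fix If `to_image()` fails, first confirm the installed pdfplumber version and its Python dependencies. Install Poppler utilities for your operating system (e.g., `poppler-utils` on Debian/Ubuntu, `poppler` on macOS via Homebrew) only if another tool in your pipeline calls them. See the installation instructions.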
breaking The table extraction algorithm in `pdfplumber` underwent a radical redesign in `v0.5.0`. This introduced significant breaking changes to the table extraction API and configuration, meaning code written for versions prior to `v0.5.0` will likely not work with newer versions.
fix For users migrating from very old versions (<0.5.0), thoroughly review the table extraction documentation for `v0.5.0` and later. The `extract_tables()` method and its parameters were substantially changed.
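In the post-`v0.5.0` API, table detection is configured through a `table_settings` dictionary. The sketch below uses line-based strategies; the wrapper name and the particular settings chosen are illustrative, not a recommended configuration.

```python
# Illustrative settings: detect cell borders from drawn lines on the page
LINE_BASED_SETTINGS = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "snap_tolerance": 3,
}


def extract_tables_with_settings(page, settings=None):
    """Call Page.extract_tables() with explicit settings (a pdfplumber Page is expected)."""
    return page.extract_tables(table_settings=settings or LINE_BASED_SETTINGS)
```

PDFs without ruled lines usually need a text-based strategy instead, so the settings are worth tuning per document family.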
gotcha Every input PDF referenced in the code must be present and readable. If a path does not exist, opening it raises a `FileNotFoundError` (or a similar error indicating the file's absence) before any PDF processing happens.
fix Check file paths and permissions, and confirm the files exist at the specified locations. When running tests or containers, ensure the PDF test data is mounted or included in the runner's environment.
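Validating the path before opening gives a clearer message than a failure deep inside processing. This is a minimal sketch; `resolve_pdf_path` is a hypothetical helper.

```python
from pathlib import Path


def resolve_pdf_path(path: str) -> Path:
    """Return the absolute path to an existing file, or raise FileNotFoundError."""
    candidate = Path(path).expanduser().resolve()
    if not candidate.is_file():
        raise FileNotFoundError(f"PDF not found: {candidate}")
    return candidate
```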
sudo apt-get install poppler-utils  # Debian/Ubuntu
brew install poppler  # macOS
python  os / libc      wheel  install  import  disk
3.9     alpine (musl)  wheel  -        0.43s   71.6M
3.9     alpine (musl)  -      -        0.46s   70.5M
3.9     slim (glibc)   wheel  4.8s     0.40s   70M
3.9     slim (glibc)   -      -        0.37s   69M
3.10    alpine (musl)  wheel  -        0.63s   75.2M
3.10    alpine (musl)  -      -        0.58s   74.1M
3.10    slim (glibc)   wheel  4.0s     0.51s   74M
3.10    slim (glibc)   -      -        0.54s   73M
3.11    alpine (musl)  wheel  -        0.69s   78.1M
3.11    alpine (musl)  -      -        0.77s   77.0M
3.11    slim (glibc)   wheel  3.5s     0.61s   77M
3.11    slim (glibc)   -      -        0.57s   76M
3.12    alpine (musl)  wheel  -        0.72s   69.7M
3.12    alpine (musl)  -      -        0.70s   68.6M
3.12    slim (glibc)   wheel  3.1s     0.63s   68M
3.12    slim (glibc)   -      -        0.62s   67M
3.13    alpine (musl)  wheel  -        0.62s   69.5M
3.13    alpine (musl)  -      -        0.67s   68.3M
3.13    slim (glibc)   wheel  3.4s     0.60s   68M
3.13    slim (glibc)   -      -        0.62s   67M

This quickstart demonstrates how to open a PDF, extract text from its first page, and look for tables on that page. The optional visual-debugging lines use `to_image()`, which needs a working image-rendering backend (see the system dependency notes above). Set the `PDFPLUMBER_DEMO_PDF` environment variable, or replace 'dummy.pdf', so the script points at an actual PDF file.

import pdfplumber
import os

# Path to the PDF to inspect: set PDFPLUMBER_DEMO_PDF or replace 'dummy.pdf'.
# The file must already exist; this example does not create it.
pdf_path = os.environ.get('PDFPLUMBER_DEMO_PDF', 'dummy.pdf')

try:
    # The context manager closes the PDF when the block exits
    with pdfplumber.open(pdf_path) as pdf:
        print(f"Number of pages: {len(pdf.pages)}")
        first_page = pdf.pages[0]
        print(f"Text from first page:\n{first_page.extract_text()}")

        # Extract tables from the first page
        tables = first_page.extract_tables()
        if tables:
            print(f"\nTables found on first page (first table):\n{tables[0]}")
        else:
            print("\nNo tables found on the first page.")

        # Optional: visual debugging (rendering dependencies vary by pdfplumber version)
        # im = first_page.to_image()
        # im.draw_rects(first_page.chars)
        # im.save("first_page_debug.png")

except FileNotFoundError:
    print(f"Error: PDF file '{pdf_path}' not found. Please provide a valid PDF for the quickstart.")
except Exception as e:
    print(f"An error occurred: {e}")
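
The tables printed by the quickstart are lists of rows, where each row is a list of cell strings (or `None` for empty cells). A common next step is converting a table into records keyed by its header row. The sketch below assumes the first row is the header, which is not guaranteed for every PDF; `table_to_records` is a hypothetical helper.

```python
def table_to_records(table) -> list:
    """Convert an extracted table (list of rows) into a list of dicts.

    Assumes the first row holds column names; None cells become empty strings.
    """
    if not table:
        return []
    header = [(cell or "").strip() for cell in table[0]]
    return [
        {name: (cell or "").strip() for name, cell in zip(header, row)}
        for row in table[1:]
    ]


# records = table_to_records(tables[0])  # `tables` as returned by extract_tables()
```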