PyPDF2 (DEPRECATED: migrate to pypdf)
PyPDF2 is a pure-Python library for PDF manipulation, with support for splitting, merging, cropping, and transforming pages. The `pypdf2` package on PyPI is officially deprecated, and 3.0.1 is its final release. It is not a wrapper around `pypdf`: it ships its own `PyPDF2` module, and the same code base was carried forward under the name `pypdf`, starting with `pypdf` 3.1.0. All active development, new features, and security updates now happen in the `pypdf` project (currently at version 6.x.x), which is the recommended library for all new and ongoing PDF processing in Python.
Warnings
- breaking The `PyPDF2` project has been renamed to `pypdf` and is actively maintained under that name. The `pypdf2` PyPI package is deprecated and frozen at version 3.0.1, the last release before development continued as `pypdf` 3.1.0. Users are strongly advised to migrate to `pypdf` for ongoing support, new features, and critical security updates.
- breaking Before `PyPDF2` 3.0.0, code typically used `import PyPDF2` with class names such as `PyPDF2.PdfFileReader` and `PyPDF2.PdfFileWriter`; those names stopped working in 3.0.0. `PyPDF2 >= 3.0.0` uses `from PyPDF2 import PdfReader, PdfWriter`, and the modern `pypdf` API uses the same class names imported from its own module: `from pypdf import PdfReader, PdfWriter`.
- gotcha Versions of `PyPDF2` before 3.0.0 contain known performance issues and critical security vulnerabilities, including infinite-loop exploits and other denial-of-service vectors. Even `PyPDF2` 3.0.1 receives no further fixes and is significantly behind the latest `pypdf` (currently 6.x.x), which has received numerous security patches and performance improvements. Continuing to use `pypdf2` is not recommended, especially for security-sensitive applications.
- gotcha The history of Python PDF libraries is complex, with several forks and renames including `pyPdf`, `PyPDF2`, `PyPDF3`, and `PyPDF4`. This can lead to significant confusion regarding which library is current and actively maintained. `pypdf` (the successor to `PyPDF2`) is the currently recommended and actively developed library.
- deprecated The method `PageObject.replace_contents` was documented as potentially problematic and its usage on `PdfReader` objects was specifically advised against in `pypdf` 6.8.0. Incorrect usage can lead to unintended side effects or corrupted PDF files.
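As a rough aid for the rename described above, the following sketch scans source text for legacy names and reports their modern equivalents. It uses only the standard library. The `RENAMES` table and the `find_legacy_names` helper are hypothetical names chosen for illustration; the first two pairs come from the warning above, while `extractText` and `addPage` are additional renames added here and should be checked against the `pypdf` migration notes before relying on them.

```python
import re

# Legacy PyPDF2 (< 3.0.0) names mapped to their modern equivalents.
# Illustrative and deliberately incomplete; extend as needed.
RENAMES = {
    "PdfFileReader": "PdfReader",
    "PdfFileWriter": "PdfWriter",
    "extractText": "extract_text",
    "addPage": "add_page",
}

# Match any legacy name as a whole word.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, RENAMES)) + r")\b")


def find_legacy_names(source: str) -> list[tuple[int, str, str]]:
    """Return (line_number, old_name, new_name) for each legacy name found."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            old = match.group(1)
            hits.append((lineno, old, RENAMES[old]))
    return hits


legacy_snippet = """\
import PyPDF2
reader = PyPDF2.PdfFileReader("in.pdf")
writer = PyPDF2.PdfFileWriter()
"""

for lineno, old, new in find_legacy_names(legacy_snippet):
    print(f"line {lineno}: {old} -> {new}")
```

This only flags names; it does not rewrite code, and the `import PyPDF2` line itself still has to be changed to an import from `pypdf` by hand.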
Install
- pip install pypdf2
- pip install pypdf
- pip install pypdf[crypto]
Imports
- PdfReader, PdfWriter
from pypdf import PdfReader, PdfWriter
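During a migration, code may have to run in environments where only one of the two packages is installed. A minimal sketch of one way to handle that, assuming any installed `PyPDF2` is recent enough to expose `PdfReader` and `PdfWriter`; `detect_pdf_backend` is a hypothetical helper, not part of either library:

```python
import importlib
import importlib.util


def detect_pdf_backend(candidates=("pypdf", "PyPDF2")):
    """Return the first importable module name from candidates, or None."""
    for name in candidates:
        if importlib.util.find_spec(name) is not None:
            return name
    return None


backend = detect_pdf_backend()
if backend is None:
    print("Neither pypdf nor PyPDF2 is installed; run: pip install pypdf")
else:
    module = importlib.import_module(backend)
    # Assumes the module exposes the modern class names.
    PdfReader, PdfWriter = module.PdfReader, module.PdfWriter
    print(f"Using {backend}")
```

Listing `pypdf` first means the maintained library wins whenever both are present; the fallback is intended as a temporary bridge, not a long-term setup.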
Quickstart
from pypdf import PdfReader, PdfWriter
import os
# Create a dummy PDF for demonstration if it doesn't exist
dummy_pdf_path = "example.pdf"
if not os.path.exists(dummy_pdf_path):
    writer = PdfWriter()
    writer.add_blank_page(width=72, height=72)
    writer.add_blank_page(width=72, height=72)
    with open(dummy_pdf_path, "wb") as f:
        writer.write(f)
# --- Example: Read, extract text, and merge pages using pypdf (successor to PyPDF2) ---
# Create a PdfReader object
reader = PdfReader(dummy_pdf_path)
# Get number of pages
num_pages = len(reader.pages)
print(f"Number of pages: {num_pages}")
# Extract text from the first page
first_page = reader.pages[0]
text = first_page.extract_text()
print(f"Text from first page: '{text.strip() if text else 'No text'}'")
# Create a PdfWriter object for merging
writer = PdfWriter()
# Add all pages from the reader to the writer
for page in reader.pages:
    writer.add_page(page)
# Add a blank page
writer.add_blank_page(width=72, height=72)
# Write the output PDF to a file
output_pdf_path = "merged_output.pdf"
with open(output_pdf_path, "wb") as fp:
    writer.write(fp)
print(f"Successfully created {output_pdf_path} with {len(writer.pages)} pages.")
# Clean up dummy file
os.remove(dummy_pdf_path)
os.remove(output_pdf_path)