pdf2docx

0.5.12 · maintenance · verified Sun Apr 12

pdf2docx is an open-source Python library designed for converting PDF files into editable Microsoft Word DOCX documents. It leverages PyMuPDF for PDF data extraction, applies rule-based parsing for layout analysis, and utilizes python-docx for generating the final DOCX output. The library aims to extract text, images, and tables while preserving the original layout and formatting. The current version is 0.5.12, released on March 9, 2026.

Warnings

Install

Imports

Quickstart

This quickstart demonstrates how to convert a PDF file to a DOCX file using the `Converter` class. It shows how to initialize the converter with a PDF file, perform the conversion, and close the converter. The example includes creating a simple dummy PDF if no existing file is provided to ensure it's runnable.

import os
from pdf2docx import Converter

# Create a dummy PDF file for demonstration if it doesn't exist
dummy_pdf_content = b"%PDF-1.4\n1 0 obj <</Type /Page /Contents 2 0 R>> endobj\n2 0 obj <</Length 11>> stream\nBT /F1 12 Tf 72 712 Td (Hello World) Tj ET\nendstream endobj\nxref\n0 3\n0000000000 65535 f\n0000000009 00000 n\n0000000074 00000 n\ntrailer <</Size 3 /Root 1 0 R>> startxref 122\n%%EOF"

pdf_file_path = "sample.pdf"
docx_file_path = "output.docx"

if not os.path.exists(pdf_file_path):
    with open(pdf_file_path, "wb") as f:
        f.write(dummy_pdf_content)
    print(f"Created dummy PDF: {pdf_file_path}")

try:
    # Create a Converter object
    cv = Converter(pdf_file_path)

    # Convert the PDF to DOCX
    cv.convert(docx_file_path, start=0, end=None) # start and end are 0-based, None means to the end
    cv.close()
    print(f"Conversion successful: {pdf_file_path} -> {docx_file_path}")
except Exception as e:
    print(f"An error occurred during conversion: {e}")
finally:
    # Clean up dummy PDF if it was created
    if os.path.exists(pdf_file_path) and dummy_pdf_content:
        os.remove(pdf_file_path)
        print(f"Cleaned up dummy PDF: {pdf_file_path}")
    if os.path.exists(docx_file_path):
        # In a real scenario, you might want to keep the output, but for a quickstart, we clean up.
        # os.remove(docx_file_path)
        pass # Keep the output docx for user inspection

view raw JSON →