pdf2docx
pdf2docx is an open-source Python library for converting PDF files into editable Microsoft Word DOCX documents. It uses PyMuPDF to extract data from the PDF, rule-based parsing to analyze the layout, and python-docx to generate the DOCX output. The library aims to extract text, images, and tables while preserving the original layout and formatting. The current version is 0.5.12, released on March 9, 2026.
Warnings
- deprecated The `pdf2docx` library is no longer actively developed or officially maintained by Artifex. The repository remains open for community contributions.
- gotcha The library primarily processes text-based PDFs and does not perform Optical Character Recognition (OCR). Scanned PDF documents, which are essentially images, will not have their text content extracted or converted to editable DOCX text.
- gotcha Complex PDF layouts, especially those with intricate tables, multi-column designs, or unusual text flows, may not be perfectly replicated in the converted DOCX file due to the library's rule-based parsing method.
- gotcha The library is primarily designed for left-to-right languages and standard reading directions. Documents with right-to-left languages or significant text transformations/rotations might not convert accurately.
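Because the library does not run OCR, it can help to check whether a PDF has a text layer before converting it. The following is a minimal sketch and not part of pdf2docx: the helper names `page_texts` and `looks_scanned` and the `min_chars` threshold are assumptions chosen for illustration, and text extraction relies on PyMuPDF's `Page.get_text()`.

```python
from typing import List


def page_texts(pdf_path: str) -> List[str]:
    """Return the extractable text of each page (hypothetical helper)."""
    import fitz  # PyMuPDF; imported lazily so the check below has no dependencies

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]


def looks_scanned(texts: List[str], min_chars: int = 20) -> bool:
    """Heuristic: treat the PDF as scanned if no page has min_chars of text.

    min_chars is an arbitrary illustrative threshold, not a pdf2docx setting.
    """
    return all(len(t.strip()) < min_chars for t in texts)
```

A caller could evaluate `looks_scanned(page_texts("sample.pdf"))` and send the file through an OCR tool first when it returns `True`. The heuristic is deliberately crude: a PDF that mixes scanned and text-based pages passes the check, yet its scanned pages still convert without editable text.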
Install
pip install pdf2docx
Imports
- Converter
from pdf2docx import Converter
- parse
from pdf2docx import parse
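The two imports expose the same conversion at different levels: `Converter` is the object-oriented interface, while `parse` performs the conversion in a single call. Below is a minimal sketch of the one-call form, assuming a text-based PDF exists at the given path. The helper names `docx_path_for` and `convert_file` are hypothetical and not part of pdf2docx.

```python
from pathlib import Path
from typing import Optional


def docx_path_for(pdf_path: str) -> str:
    """Hypothetical helper: same location and name as the PDF, with a .docx suffix."""
    return str(Path(pdf_path).with_suffix(".docx"))


def convert_file(pdf_path: str, start: int = 0, end: Optional[int] = None) -> str:
    """Convert a page range of pdf_path and return the path of the DOCX file."""
    from pdf2docx import parse  # imported lazily; requires pdf2docx to be installed

    docx_path = docx_path_for(pdf_path)
    # start and end are 0-based page indices; end=None converts through the last page
    parse(pdf_path, docx_path, start=start, end=end)
    return docx_path
```

Calling `convert_file("sample.pdf")` would write `sample.docx` next to the input. `parse` suits one-off conversions; `Converter` is the better fit when the same open document is reused for several operations.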
Quickstart
import os

import fitz  # PyMuPDF, installed as a dependency of pdf2docx
from pdf2docx import Converter

pdf_file_path = "sample.pdf"
docx_file_path = "output.docx"

# Create a one-page sample PDF for demonstration if none exists
created_sample = False
if not os.path.exists(pdf_file_path):
    doc = fitz.open()  # new, empty PDF
    page = doc.new_page()
    page.insert_text((72, 72), "Hello World")
    doc.save(pdf_file_path)
    doc.close()
    created_sample = True
    print(f"Created sample PDF: {pdf_file_path}")

cv = None
try:
    # Create a Converter object
    cv = Converter(pdf_file_path)
    # Convert the PDF to DOCX; start and end are 0-based page indices,
    # and end=None means convert through the last page
    cv.convert(docx_file_path, start=0, end=None)
    print(f"Conversion successful: {pdf_file_path} -> {docx_file_path}")
except Exception as e:
    print(f"An error occurred during conversion: {e}")
finally:
    # Always release the underlying PDF handle
    if cv is not None:
        cv.close()
    # Remove the sample PDF only if this script created it;
    # the output DOCX is kept for inspection
    if created_sample and os.path.exists(pdf_file_path):
        os.remove(pdf_file_path)
        print(f"Cleaned up sample PDF: {pdf_file_path}")
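The quickstart converts every page. To convert only specific pages, `Converter.convert` also accepts a `pages` list of 0-based indices. The sketch below shows one way to use it with the 1-based page numbers displayed in a PDF viewer. The helper names `to_zero_based` and `convert_selected` are hypothetical and not part of pdf2docx.

```python
from typing import Iterable, List


def to_zero_based(page_numbers: Iterable[int]) -> List[int]:
    """Map 1-based page numbers (as shown in a PDF viewer) to 0-based indices."""
    indices = []
    for n in page_numbers:
        if n < 1:
            raise ValueError(f"page numbers start at 1, got {n}")
        indices.append(n - 1)
    return indices


def convert_selected(pdf_path: str, docx_path: str, page_numbers: Iterable[int]) -> None:
    """Convert only the listed pages (hypothetical wrapper around Converter)."""
    from pdf2docx import Converter  # imported lazily; requires pdf2docx

    cv = Converter(pdf_path)
    try:
        # pages takes explicit 0-based indices instead of a start/end range
        cv.convert(docx_path, pages=to_zero_based(page_numbers))
    finally:
        cv.close()
```

For example, `convert_selected("sample.pdf", "output.docx", [1, 3])` would convert the first and third pages. Validating the page numbers before opening the file keeps an off-by-one mistake from silently converting the wrong page.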