{"id":4681,"library":"pdf2docx","title":"pdf2docx","description":"pdf2docx is an open-source Python library for converting PDF files into editable Microsoft Word DOCX documents. It uses PyMuPDF for PDF data extraction, applies rule-based parsing for layout analysis, and uses python-docx to generate the final DOCX output. The library aims to extract text, images, and tables while preserving the original layout and formatting. The current version is 0.5.12, released on March 9, 2026.","status":"maintenance","version":"0.5.12","language":"en","source_language":"en","source_url":"https://github.com/ArtifexSoftware/pdf2docx","tags":["pdf","docx","conversion","document processing","office","pymupdf","python-docx"],"install":[{"cmd":"pip install pdf2docx","lang":"bash","label":"Install latest version"}],"dependencies":[{"reason":"Used for extracting data (text, images, drawings) from PDF files.","package":"PyMuPDF","optional":false},{"reason":"Used for generating the DOCX output file.","package":"python-docx","optional":false}],"imports":[{"symbol":"Converter","correct":"from pdf2docx import Converter"},{"note":"The `parse` function is directly available from the top-level package, not as a method of a class.","wrong":"pdf2docx.parse()","symbol":"parse","correct":"from pdf2docx import parse"}],"quickstart":{"code":"import os\n\nimport fitz  # PyMuPDF, already installed as a pdf2docx dependency\nfrom pdf2docx import Converter\n\npdf_file_path = \"sample.pdf\"\ndocx_file_path = \"output.docx\"\n\n# Create a one-page sample PDF for demonstration if none exists\ncreated_sample = not os.path.exists(pdf_file_path)\nif created_sample:\n    doc = fitz.open()  # new, empty PDF document\n    page = doc.new_page()\n    page.insert_text((72, 72), \"Hello World\")\n    doc.save(pdf_file_path)\n    doc.close()\n    print(f\"Created sample PDF: {pdf_file_path}\")\n\ncv = None\ntry:\n    # Create a Converter object\n    cv = Converter(pdf_file_path)\n\n    # Convert the PDF to DOCX\n    cv.convert(docx_file_path, start=0, end=None) # start and end are 0-based, None means to the end\n    print(f\"Conversion successful: {pdf_file_path} -> {docx_file_path}\")\nexcept Exception as e:\n    print(f\"An error occurred during conversion: {e}\")\nfinally:\n    # Always release the underlying PDF, even if conversion failed\n    if cv is not None:\n        cv.close()\n    # Remove the sample PDF only if this script created it\n    if created_sample and os.path.exists(pdf_file_path):\n        os.remove(pdf_file_path)\n        print(f\"Cleaned up sample PDF: {pdf_file_path}\")\n    # The output DOCX is kept for inspection","lang":"python","description":"This quickstart demonstrates how to convert a PDF file to a DOCX file using the `Converter` class. It shows how to initialize the converter with a PDF file, perform the conversion, and close the converter. If `sample.pdf` does not already exist, the example creates a one-page sample PDF with PyMuPDF so that it is runnable as-is, and it deletes that file afterwards only if it created it."},"warnings":[{"fix":"Be aware that new features or prompt bug fixes from the original maintainers are unlikely. Consider community forks or alternative libraries if active development and support are critical for your project.","message":"The `pdf2docx` library is no longer actively maintained by its original developer, Artifex. While the repository is open for community contributions, active development and official maintenance by Artifex have ceased.","severity":"deprecated","affected_versions":"0.5.x onwards"},{"fix":"For scanned PDFs, you must run an OCR tool on the PDF first to convert images of text into actual text before using `pdf2docx` for conversion.","message":"The library primarily processes text-based PDFs and does not perform Optical Character Recognition (OCR). 
Scanned PDF documents, which are essentially images, will not have their text content extracted or converted to editable DOCX text.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Test with representative PDF documents to assess conversion fidelity. For critical layout preservation, manual adjustments to the output DOCX might be necessary, or consider alternative conversion methods for highly complex documents.","message":"Complex PDF layouts, especially those with intricate tables, multi-column designs, or unusual text flows, may not be perfectly replicated in the converted DOCX file due to the library's rule-based parsing method.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Verify output for documents in right-to-left languages or with rotated or otherwise transformed text. `pdf2docx` offers no direct fix for these limitations.","message":"The library is primarily designed for left-to-right languages and standard reading directions. Documents with right-to-left languages or significant text transformations/rotations might not convert accurately.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-12T00:00:00.000Z","next_check":"2026-07-11T00:00:00.000Z"}