{"id":8962,"library":"docx2python","title":"docx2python","description":"docx2python is a Python library for extracting structured content from .docx files. It extracts headers, footers, formatted text, footnotes, endnotes, comments, document properties, and images into a Python object, and it preserves document structure such as numbered and bulleted lists and tables. It is currently at version 3.6.2 and receives active maintenance.","status":"active","version":"3.6.2","language":"en","source_language":"en","source_url":"https://github.com/ShayHill/docx2python","tags":["docx","word","document parsing","text extraction","image extraction","microsoft word"],"install":[{"cmd":"pip install docx2python","lang":"bash","label":"Install stable version"}],"dependencies":[{"reason":"Required for execution.","package":"python","version":">=3.10.0","optional":false}],"imports":[{"symbol":"docx2python","correct":"from docx2python import docx2python"},{"note":"Iterates over the paragraphs in the nested list output (for example, `docx_content.body`).","symbol":"iter_paragraphs","correct":"from docx2python.iterators import iter_paragraphs"}],"quickstart":{"code":"import os\nfrom docx2python import docx2python\n\n# This example reads an existing file; it does not create one.\n# Place an 'example.docx' in the working directory, for example one\n# containing 'Hello World' and a simple 2x2 table.\n\ndocx_file = 'example.docx'\n\nif not os.path.exists(docx_file):\n    print(f\"Please create a file named '{docx_file}' with some content for the quickstart.\")\nelse:\n    try:\n        with docx2python(docx_file) as docx_content:\n            print(\"--- Extracted Document Text ---\")\n            print(docx_content.text)\n            print(\"\\n--- Document Body Structure (nested list) ---\")\n            # The body is a 
nested list, with paragraphs at depth 4\n            print(docx_content.body[:1])  # Print first element for brevity\n\n            if docx_content.images:\n                print(\"\\n--- Extracted Images (names only) ---\")\n                for name in docx_content.images:\n                    print(name)\n            else:\n                print(\"\\nNo images found.\")\n\n    except Exception as e:\n        print(f\"An error occurred: {e}\")\n        print(f\"Ensure '{docx_file}' is a valid and readable .docx file.\")","lang":"python","description":"Demonstrates how to extract all text content from a .docx file as a single string, access the nested list representation of the document body, and list extracted image filenames. The example assumes that a file named 'example.docx' exists in the working directory."},"warnings":[{"fix":"Always pass `html` and `duplicate_merged_cells` as keyword arguments, e.g., `docx2python(file, html=True)`.","message":"In version 3.0, the `html` and `duplicate_merged_cells` arguments to the `docx2python` function became keyword-only. Passing either one positionally raises a TypeError.","severity":"breaking","affected_versions":">=3.0.0"},{"fix":"Update parsing logic for tables to expect the `nxm` nested list structure, accounting for duplicated merged cell content if `duplicate_merged_cells` is `True`.","message":"Version 3.0 changed table output: tables are now consistently `nxm` (rows x columns) nested lists. If `duplicate_merged_cells=True` (default), the content of merged cells is duplicated to fill the `nxm` structure. This makes tables easier to process uniformly, but the nested list shape differs from older versions.","severity":"breaking","affected_versions":">=3.0.0"},{"fix":"Familiarize yourself with the output structure documented in the library's README. 
Consider using helper functions from `docx2python.iterators` like `iter_paragraphs` or `iter_tables` for easier traversal of specific content types.","message":"The primary output (e.g., `docx_content.body`) is a deeply nested list structure, where paragraphs are consistently found at depth 4 (e.g., `docx_content.body[i][j][k][l]` is a paragraph string). This structure can be complex to navigate directly.","severity":"gotcha","affected_versions":"all"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Rename your Python script or any conflicting file from `docx.py` to something else (e.g., `extract_doc.py`) to avoid shadowing the library module.","cause":"This error, often seen with `python-docx`, can occur with `docx2python` if a user's script or another file on the import path is named `docx.py`. The name collision causes Python to import that file instead of the installed library.","error":"AttributeError: partially initialized module 'docx' has no attribute 'Document'"},{"fix":"Open the problematic `.docx` file in Microsoft Word (or a compatible word processor) and re-save it. This often repairs the underlying XML structure, making the file parsable by `docx2python`.","cause":"`.docx` files downloaded directly from services like Google Docs may not conform to the Open XML standard in the way that `docx2python` expects. They might lack certain metadata or structural elements.","error":"Failed to parse .docx file (often with a complex traceback including `KeyError` or XML parsing errors)"}]}