{"id":14803,"library":"pdfminer","title":"PDFMiner","description":"PDFMiner is a Python library for extracting and analyzing text data from PDF documents, focusing on precise text location and layout information. Version `20191125` is the last release of the original `euske/pdfminer` project. It supports Python 3.6 and above, but has not been actively maintained since 2020. For ongoing development and community support, the `pdfminer.six` fork is recommended.","status":"maintenance","version":"20191125","language":"en","source_language":"en","source_url":"http://github.com/euske/pdfminer","tags":["pdf","parsing","text-extraction","document-analysis"],"install":[{"cmd":"pip install pdfminer","lang":"bash","label":"Install stable version"}],"dependencies":[{"reason":"Required for handling encrypted PDF documents.","package":"pycryptodome","optional":false}],"imports":[{"symbol":"PDFResourceManager","correct":"from pdfminer.pdfinterp import PDFResourceManager"},{"symbol":"PDFPageInterpreter","correct":"from pdfminer.pdfinterp import PDFPageInterpreter"},{"symbol":"PDFPage","correct":"from pdfminer.pdfpage import PDFPage"},{"symbol":"PDFParser","correct":"from pdfminer.pdfparser import PDFParser"},{"symbol":"PDFDocument","correct":"from pdfminer.pdfdocument import PDFDocument"},{"symbol":"TextConverter","correct":"from pdfminer.converter import TextConverter"},{"symbol":"LAParams","correct":"from pdfminer.layout import LAParams"}],"quickstart":{"code":"import os\nfrom io import StringIO\n\nfrom pdfminer.converter import TextConverter\nfrom pdfminer.layout import LAParams\nfrom pdfminer.pdfdocument import PDFDocument\nfrom pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter\nfrom pdfminer.pdfpage import PDFPage\nfrom pdfminer.pdfparser import PDFParser\n\ndef extract_text_from_pdf(pdf_path):\n    # Ensure a dummy PDF exists for demonstration, or replace with a real path\n    if not os.path.exists(pdf_path):\n        print(f\"Error: PDF file not found at {pdf_path}. 
Creating a dummy PDF for demonstration.\")\n        # In a real scenario, you'd handle the missing file appropriately.\n        # For a runnable example, we'll create a simple dummy file.\n        try:\n            from reportlab.pdfgen import canvas\n            c = canvas.Canvas(pdf_path)\n            c.drawString(100, 750, \"Hello, PDFMiner!\")\n            c.drawString(100, 730, \"This is a dummy PDF for testing.\")\n            c.save()\n            print(f\"Dummy PDF created at {pdf_path}\")\n        except ImportError:\n            print(\"Please install reportlab (`pip install reportlab`) to create dummy PDF, or provide a real PDF.\")\n            return \"\"\n\n    rsrcmgr = PDFResourceManager()\n    retstr = StringIO()\n    laparams = LAParams()\n    device = TextConverter(rsrcmgr, retstr, laparams=laparams)\n    \n    with open(pdf_path, 'rb') as fp:\n        parser = PDFParser(fp)\n        document = PDFDocument(parser)\n        interpreter = PDFPageInterpreter(rsrcmgr, device)\n        for page in PDFPage.create_pages(document):\n            interpreter.process_page(page)\n        text = retstr.getvalue()\n    \n    device.close()\n    retstr.close()\n    return text\n\nif __name__ == '__main__':\n    pdf_file = 'dummy.pdf'\n    extracted_content = extract_text_from_pdf(pdf_file)\n    print(\"\\n--- Extracted Text ---\")\n    print(extracted_content)\n","lang":"python","description":"This quickstart demonstrates how to extract text from a PDF file using PDFMiner's core components. It initializes a resource manager, a text converter, and a page interpreter to process the PDF document page by page. A dummy `dummy.pdf` file is created if not found, allowing the code to be runnable for demonstration purposes. 
This reflects the more verbose API usage typical of the original PDFMiner, as opposed to the simplified `high_level` API found in `pdfminer.six`."},"warnings":[{"fix":"Consider migrating to `pdfminer.six` for an actively developed and supported version (`pip install pdfminer.six`).","message":"The original `pdfminer` project (euske/pdfminer) has not been actively maintained since 2020. While the latest version `20191125` supports Python 3, new features, bug fixes, and community support are primarily found in its actively maintained fork, `pdfminer.six`.","severity":"breaking","affected_versions":"<=20191125"},{"fix":"For complex layouts, extensive customization of `LAParams` or post-processing may be required. For scanned PDFs, integrate with OCR libraries like Tesseract or pre-process with tools that extract images for OCR.","message":"PDFMiner struggles with text extraction from PDFs with complex layouts (e.g., multi-column, nested tables) and cannot extract text from scanned PDFs (images) without external Optical Character Recognition (OCR) tools.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Verify whether the text can be copied and pasted correctly from a PDF viewer. If it can, try `pdfminer.six`, which may handle fonts and encodings better; `LAParams` only controls layout analysis and does not affect character mapping. If it cannot, the PDF itself likely lacks a usable Unicode mapping.","message":"Output may contain raw character IDs like `(cid:x)` instead of readable text, especially with non-standard fonts or encodings. This happens when the font is not properly mapped to Unicode.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Ensure you correctly parse the file first using `PDFParser` to create a `PDFDocument` instance, and then pass the `PDFDocument` object to `PDFPage.create_pages()`. 
Refer to the quickstart example for correct API flow.","cause":"This error typically occurs when a file-like object (like `io.BytesIO`) is passed directly to `PDFPage.create_pages()` or similar functions, but the API expects a `PDFDocument` object built from a `PDFParser`. This indicates incorrect API usage.","error":"AttributeError: '_io.BytesIO' object has no attribute 'catalog'"},{"fix":"If you intend to use `pdfminer.six` (recommended), install it with `pip install pdfminer.six`. The fork is installed under that name but imported as `pdfminer` (for example `from pdfminer.high_level import extract_text`), so `import pdfminer.six` fails with either package. If you're sticking to the original `pdfminer`, use its specific import paths and API patterns. The original `pdfminer` does not expose a `pdfminer.high_level` module.","cause":"This usually means the code imports `pdfminer.six` as if it were a module name. `pdfminer.six` is only the PyPI distribution name of the fork; its import package is `pdfminer`. A related failure occurs when the original `pdfminer` package is installed but the code uses `high_level` functions that exist only in the fork.","error":"ModuleNotFoundError: No module named 'pdfminer.six'"},{"fix":"Pass `encoding='utf-8'` to `open()` when creating output files if you expect Unicode characters. `io.StringIO` holds text in memory and takes no encoding argument, so the error arises only when that text is later written to a file or console. For `TextConverter`, ensure the `outfp` (output file pointer) is opened with `encoding='utf-8'` or handle character sets explicitly.","cause":"Encoding issues are common when handling diverse text content in PDFs, especially across different operating systems or locales, or when writing to files or consoles whose default encoding cannot represent the extracted characters.","error":"UnicodeEncodeError: 'charmap' codec can't encode character..."}],"ecosystem":"pypi"}