{"id":3918,"library":"camelot-py","title":"Camelot","description":"Camelot is a Python library for extracting tabular data from PDF files. It offers fine-grained control over extraction through two parsing methods: Lattice (for tables with clearly drawn lines between cells) and Stream (for tables whose cells are separated by whitespace). Extracted tables are exposed as pandas DataFrames, so they fit directly into data analysis workflows, and can be exported to CSV, JSON, Excel, HTML, Markdown, and SQLite. The library is actively maintained; the current version is 1.0.9, and patch releases are frequent.","status":"active","version":"1.0.9","language":"en","source_language":"en","source_url":"https://github.com/camelot-dev/camelot","tags":["pdf","table extraction","data extraction","automation","etl"],"install":[{"cmd":"pip install \"camelot-py[base]\"","lang":"bash","label":"Recommended (includes default image backend)"},{"cmd":"pip install \"camelot-py[cv]\"","lang":"bash","label":"With OpenCV (required for some image processing features)"},{"cmd":"conda install -c conda-forge camelot-py","lang":"bash","label":"Conda installation"}],"dependencies":[{"reason":"Default image conversion backend since v1.0.0, required for core functionality. Automatically installed with `camelot-py[base]`.","package":"pypdfium2","optional":false},{"reason":"Optional image conversion backend; required for the 'lattice' flavor in older Camelot versions (<1.0.0) or if explicitly chosen as the backend. 
Often requires manual system-level installation and PATH configuration.","package":"Ghostscript","optional":true},{"reason":"Required for the `[cv]` extra, which enables some image processing capabilities in table detection.","package":"opencv-python-headless","optional":true},{"reason":"Required for visual debugging features (e.g., `camelot.plot(table, kind='grid')`).","package":"matplotlib","optional":true}],"imports":[{"symbol":"camelot","correct":"import camelot"}],"quickstart":{"code":"import os\nimport sys\n\nimport camelot\n\n# NOTE: Replace 'foo.pdf' with the path to your actual PDF file.\n# It should be a text-based PDF with at least one table on page 1.\n\n# Stop early with a clear message if the PDF is missing\nif not os.path.exists('foo.pdf'):\n    sys.exit(\"Please create a 'foo.pdf' with at least one table for this example.\")\n\n# Read tables from the PDF (defaults to 'lattice' flavor and first page)\ntables = camelot.read_pdf('foo.pdf')\n\n# Print the number of tables found\nprint(f\"Found {tables.n} tables.\\n\")\n\nif tables.n > 0:\n    # Access the first extracted table\n    first_table = tables[0]\n\n    # Print parsing report for insights on accuracy and whitespace\n    print(\"Parsing Report for the first table:\")\n    print(first_table.parsing_report)\n\n    # Access the table as a pandas DataFrame\n    df = first_table.df\n    print(\"\\nExtracted DataFrame (first 5 rows):\\n\", df.head())\n\n    # Export the table to CSV\n    first_table.to_csv('foo_table.csv', index=False)\n    print(\"\\nTable exported to foo_table.csv\")\n\n    # Alternatively, export all tables as CSV files bundled into all_tables.zip\n    # (the path names the per-table files; compress=True produces the .zip)\n    tables.export('all_tables.csv', f='csv', compress=True)\n    print(\"All tables exported to all_tables.zip\")\nelse:\n    print(\"No tables found in 
'foo.pdf'. You may need to adjust parameters like 'flavor' or 'pages'.\")","lang":"python","description":"This quickstart demonstrates how to read a PDF file, extract tables using Camelot's default settings, inspect the parsing report, access an extracted table as a pandas DataFrame, and export it to a CSV file. It assumes a 'foo.pdf' file with at least one table exists in the execution directory."},"warnings":[{"fix":"Ensure your PDF is text-based. For image-based PDFs, run an OCR tool first to convert them to text-based documents before using Camelot.","message":"Camelot works only with text-based PDFs. It cannot reliably extract tables from scanned documents or image-based PDFs where text is not selectable. To check, try selecting text in a PDF viewer.","severity":"gotcha","affected_versions":"All versions"},{"fix":"For v1.0.0 and above, use `pip install \"camelot-py[base]\"` for easier installation. If Ghostscript is required, ensure it is correctly installed on your system and its `bin` directory is on your system's PATH. On Linux or macOS, check the installation with `from ctypes.util import find_library; find_library(\"gs\")` in Python; it returns `None` when the library cannot be found.","message":"Ghostscript installation issues: before v1.0.0, Ghostscript was a mandatory external dependency, and its system-level setup and PATH configuration often complicated installation, especially on Windows and macOS. v1.0.0 made the pip-installable pypdfium2 the default backend to mitigate this, but Ghostscript remains an optional backend, and the same problems can arise when it is explicitly chosen or needed in a specific environment.","severity":"breaking","affected_versions":"<1.0.0 (mandatory), >=1.0.0 (optional backend)"},{"fix":"Experiment with both `flavor='lattice'` and `flavor='stream'` when calling `camelot.read_pdf()`. 
If auto-detection fails, manually specify `table_areas` or `columns` (Stream only) using coordinates obtained via visual debugging.","message":"Choosing the correct parsing 'flavor' is crucial for accurate extraction. 'lattice' (default) is best for tables with clearly defined lines. 'stream' is better for tables where columns and rows are separated by whitespace, not explicit lines. Using the wrong flavor can lead to no tables being found or incorrect data extraction.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Specify `pages` to extract from particular pages (e.g., `pages='1,3-5'`). Use `table_areas` to define specific regions where tables are located. For tables spanning multiple pages, extract each page's part separately and then merge them using pandas. For multiple tables on one page, defining multiple `table_areas` can help.","message":"For PDFs with complex layouts, tables spanning multiple pages, or multiple tables on a single page, Camelot might fail to autodetect all tables or might merge unrelated data. The 'stream' flavor, in particular, may treat an entire page as a single table.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Adjust parameters like `row_tol` (row tolerance, Stream only), `split_text` (to split text that spans multiple cells), and `strip_text` (to remove unwanted characters like '\\n' or spaces). Visual debugging with `camelot.plot()` can help in identifying and fixing these issues.","message":"Complex tables with merged cells, multi-line text within cells, or inconsistent spacing can lead to data being incorrectly grouped into single rows or having unwanted newline characters.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-11T00:00:00.000Z","next_check":"2026-07-10T00:00:00.000Z"}