PDFQuery
0.4.3 verified Fri May 01 auth: no python maintenance
PDFQuery is a lightweight Python library for scraping data from PDFs using jQuery-like CSS selectors or XPath expressions. It wraps pdfminer.six, lxml, and pyquery to provide a concise API for extracting text, tables, and layout information. Version 0.4.3 is the latest release; the project has seen no active development since 2016.
pip install pdfquery
Common errors
error ModuleNotFoundError: No module named 'pdfminer'
cause pdfquery imports the pdfminer module, which is provided by the pdfminer.six distribution. The error means pdfminer.six is not installed in the active environment.
fix
Install pdfminer.six: pip install pdfminer.six
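A quick way to confirm the diagnosis before reinstalling is to ask Python whether the pdfminer module can be found at all. The sketch below uses only the standard library; the helper name has_module is hypothetical and not part of pdfquery.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


# pdfminer.six installs a top-level module named "pdfminer".
if not has_module("pdfminer"):
    print("pdfminer not found; run: pip install pdfminer.six")
```

Run it with the same interpreter that runs your script, since a mismatch between virtual environments is a common reason for the module to appear missing.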
error AttributeError: 'PDFQuery' object has no attribute 'pq'
cause The pq attribute (a pyquery object) is populated only when .load() is called on the PDFQuery object.
fix
Call pdf.load() before using pdf.pq().
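One way to avoid forgetting the load step is a small guard that loads the document on first use. This is a minimal sketch, not part of pdfquery: ensure_loaded is a hypothetical helper, it assumes a loaded object exposes a non-None tree attribute, and FakePDF is a stand-in so the example runs without a PDF file.

```python
def ensure_loaded(pdf):
    """Call pdf.load() once if the document tree has not been built yet."""
    # Assumption: a loaded PDFQuery object exposes a non-None `tree` attribute.
    if getattr(pdf, "tree", None) is None:
        pdf.load()
    return pdf


class FakePDF:
    """Stand-in used only to demonstrate the guard; not part of pdfquery."""

    def __init__(self):
        self.tree = None
        self.load_calls = 0

    def load(self):
        self.load_calls += 1
        self.tree = object()


# Calling the guard twice triggers only one load.
demo = ensure_loaded(ensure_loaded(FakePDF()))
print(demo.load_calls)
```

With the real library the call would look like ensure_loaded(PDFQuery('sample.pdf')).pq('LTTextLineHorizontal'), under the same assumption about the tree attribute.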
error ImportError: cannot import name 'PDFQuery'
cause A different package or an outdated version is probably being imported in place of pdfquery.
fix
Reinstall with pip install pdfquery, then confirm the installed version with pip show pdfquery.
Warnings
gotcha PDFQuery depends on pdfminer.six, not the older pdfminer. Both distributions install a top-level pdfminer package, so import conflicts may occur if both are installed.
fix Uninstall old pdfminer: pip uninstall pdfminer; ensure pdfminer.six is installed.
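To see which of the two distributions is present before uninstalling anything, the installed package metadata can be queried. A minimal sketch using importlib.metadata (Python 3.8+); the function name installed_pdfminer_dists is hypothetical.

```python
from importlib import metadata


def installed_pdfminer_dists(candidates=("pdfminer", "pdfminer.six")):
    """Map each installed candidate distribution to its version string."""
    found = {}
    for name in candidates:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass  # not installed in this environment
    return found


dists = installed_pdfminer_dists()
if len(dists) > 1:
    print("Conflict: both distributions are installed:", dists)
else:
    print("Installed:", dists)
```

If both names appear, uninstalling both and reinstalling pdfminer.six alone is the cautious route, because the two share the same package directory.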
deprecated pdfquery has been unmaintained since 2016. Compatibility with newer Python versions (3.10+) is not guaranteed. Consider alternatives such as pypdf or pdfplumber.
fix Test with your Python version; if issues arise, switch to pdfplumber or pypdf.
gotcha The library uses pyquery, which matches tag names case-sensitively. Common mistake: writing 'lttextlinehorizontal' instead of 'LTTextLineHorizontal'.
fix Use exact case: LTTextLineHorizontal, LTTextBox, etc.
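The effect of tag case can be reproduced without a PDF. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for the lxml tree that pdfquery builds, on a hand-written document; the XML content is hypothetical and only mimics pdfquery's element names.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for a parsed PDF tree (hypothetical content).
doc = ET.fromstring(
    "<pdfxml><LTPage>"
    "<LTTextLineHorizontal>Invoice 42</LTTextLineHorizontal>"
    "</LTPage></pdfxml>"
)

exact = doc.findall(".//LTTextLineHorizontal")  # exact case: matches
lower = doc.findall(".//lttextlinehorizontal")  # lowercased: no match

print(len(exact), len(lower))
```

A selector that silently returns nothing is therefore worth checking for case before suspecting the PDF itself.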
Imports
- PDFQuery
from pdfquery import PDFQuery
Quickstart
from pdfquery import PDFQuery
pdf = PDFQuery('sample.pdf')
pdf.load()
# Extract text using CSS selector
text = pdf.pq('LTTextLineHorizontal').text()
print(text)
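# Filter with the jQuery-style :contains() pseudo-class (a CSS selector, not XPath)
text2 = pdf.pq('LTTextLineHorizontal:contains("Invoice")').text()
print(text2)
# For XPath, query the underlying lxml tree directly
matches = pdf.tree.xpath('//LTTextLineHorizontal[contains(text(), "Invoice")]')
print([el.text for el in matches])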