PDFQuery

0.4.3 verified Fri May 01 auth: no python maintenance

PDFQuery is a lightweight Python library for scraping data from PDFs using jQuery-like CSS selectors or XPath expressions. It wraps pdfminer and lxml, with pyquery providing the selector syntax, and exposes the PDF's layout as an XML tree that can be queried for text and element positions. Version 0.4.3 is the latest release, and the project has seen no active development since 2016.

pip install pdfquery
error ModuleNotFoundError: No module named 'pdfminer'
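cause pdfquery depends on the pdfminer.six distribution, which is imported under the module name 'pdfminer'. The error means pdfminer.six is not installed in the active environment.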
fix
Install pdfminer.six: pip install pdfminer.six
error AttributeError: 'PDFQuery' object has no attribute 'pq'
cause The pq attribute (a pyquery object) is created by .load(), so it does not exist until the PDF has been loaded.
fix
Call pdf.load() before using pdf.pq().
error ImportError: cannot import name 'PDFQuery'
cause The wrong package or an outdated version may be installed.
fix
Reinstall with pip install pdfquery, then confirm the installed version with pip show pdfquery.
gotcha PDFQuery depends on pdfminer.six, not the older pdfminer. Both distributions install a top-level package named 'pdfminer', so having both installed can cause import conflicts.
fix Uninstall old pdfminer: pip uninstall pdfminer; ensure pdfminer.six is installed.
deprecated pdfquery has been unmaintained since 2016. Compatibility with newer Python versions (3.10+) is not guaranteed. Consider alternatives such as pypdf or pdfplumber.
fix Test with your Python version; if issues arise, switch to pdfplumber or pypdf.
gotcha Tag names in pyquery selectors are case-sensitive. A common mistake is writing 'lttextlinehorizontal' instead of 'LTTextLineHorizontal'.
fix Use exact case: LTTextLineHorizontal, LTTextBox, etc.
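
The pdfminer entries above depend on which distributions are installed in the active environment. The following is a minimal diagnostic sketch, assuming Python 3.8+ (for importlib.metadata). The helper names installed_version and diagnose_pdfminer are hypothetical and are not part of pdfquery; the hint strings are illustrative only.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def diagnose_pdfminer(versions):
    """Turn {'pdfminer': ver_or_None, 'pdfminer.six': ver_or_None} into a hint."""
    legacy = versions.get("pdfminer")
    six = versions.get("pdfminer.six")
    if legacy and six:
        return "conflict: uninstall pdfminer, keep pdfminer.six"
    if six:
        return "ok: pdfminer.six is installed"
    if legacy:
        return "legacy only: replace pdfminer with pdfminer.six"
    return "missing: pip install pdfminer.six"


if __name__ == "__main__":
    # Look up both distribution names in the current environment.
    found = {name: installed_version(name) for name in ("pdfminer", "pdfminer.six")}
    print(found, "->", diagnose_pdfminer(found))
```

Running this in the same interpreter that raises the import error should show whether pdfminer.six is missing or whether the older pdfminer is installed alongside it.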

Load a PDF and extract text lines using jQuery-like selectors via pyquery.

from pdfquery import PDFQuery

pdf = PDFQuery('sample.pdf')
pdf.load()

# Extract text using CSS selector
text = pdf.pq('LTTextLineHorizontal').text()
print(text)

# Filter lines by content with the :contains() pseudo-class (a CSS-style selector, not XPath)
text2 = pdf.pq('LTTextLineHorizontal:contains("Invoice")').text()
print(text2)
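
The queries above run against an XML tree built from the PDF's layout, so tag matching follows XML rules, including case sensitivity. The sketch below reproduces that behavior with the standard library only. It assumes a hand-written XML fragment as a stand-in for a real layout tree: the fragment, its text, and the helper lines_containing are made up for illustration, and real pdfquery output will contain more elements and attributes.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for a layout tree (hypothetical content, not real pdfquery output).
LAYOUT_XML = """
<pdfxml>
  <LTPage>
    <LTTextLineHorizontal>Invoice 1001</LTTextLineHorizontal>
    <LTTextLineHorizontal>Total due: 42.00</LTTextLineHorizontal>
  </LTPage>
</pdfxml>
"""

root = ET.fromstring(LAYOUT_XML)


def lines_containing(tree, needle):
    """Return the text of every LTTextLineHorizontal whose text contains needle."""
    return [
        el.text
        for el in tree.iter("LTTextLineHorizontal")
        if el.text and needle in el.text
    ]


# XML tag names are case-sensitive: only the exact-case query matches.
exact_case = root.findall(".//LTTextLineHorizontal")
lower_case = root.findall(".//lttextlinehorizontal")

print(len(exact_case), len(lower_case))    # 2 0
print(lines_containing(root, "Invoice"))   # ['Invoice 1001']
```

The lowercase query returns nothing rather than raising an error, which is why a mis-cased selector tends to show up as silently empty results.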