GROBID client Python

0.1.4 verified Fri May 01 auth: no python

Simple Python client for GROBID REST services. Current version: 0.1.4. Release cadence: irregular, with several fixes in 2024-2025.

pip install grobid-client-python
error ModuleNotFoundError: No module named 'grobid_client'
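cause The distribution is installed as 'grobid-client-python', but the importable module is 'grobid_client' (underscores, no '-python' suffix). The error appears when the package is not installed in the active environment, or when the distribution name is mistaken for the import name.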
fix
Install with 'pip install grobid-client-python', then import correctly: from grobid_client.grobid_client import GrobidClient
error AttributeError: module 'grobid_client' has no attribute 'GrobidClient'
cause Trying to import GrobidClient from the top-level grobid_client package without specifying the submodule.
fix
Use: from grobid_client.grobid_client import GrobidClient
error ConnectionError: HTTPConnectionPool(host='localhost', port=8070): Max retries exceeded
cause No GROBID server is reachable at the configured URL (default: http://localhost:8070).
fix
Start a GROBID server, or pass a grobid_server URL that points to a running instance.
error ValueError: The 'input' parameter must be a file path or a directory.
cause The input path does not exist, or it points to a file that is neither PDF nor XML.
fix
Ensure the input exists and is either a valid PDF or TEI XML file, or a directory containing such files.
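The four errors above can be caught before any processing starts. The following is a minimal preflight sketch that uses only the standard library, not part of grobid-client-python itself. The helper names (module_available, server_alive, find_pdfs) are hypothetical. The server check assumes the standard GROBID health endpoint /api/isalive, and the file check only looks for the .pdf extension.

```python
import http.client
import importlib.util
import urllib.error
import urllib.request
from pathlib import Path


def module_available(name: str) -> bool:
    """Return True if `name` can be imported in the current environment."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


def server_alive(base_url: str, timeout: float = 3.0) -> bool:
    """Return True if GET <base_url>/api/isalive answers with HTTP 200.

    Assumption: the server exposes GROBID's /api/isalive health endpoint.
    """
    url = base_url.rstrip("/") + "/api/isalive"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, malformed URL, or a broken response.
        return False


def find_pdfs(path: str) -> list[Path]:
    """Return PDF files at `path`: the file itself, or a recursive directory scan."""
    p = Path(path)
    if p.is_file():
        return [p] if p.suffix.lower() == ".pdf" else []
    if p.is_dir():
        return sorted(
            f for f in p.rglob("*") if f.is_file() and f.suffix.lower() == ".pdf"
        )
    return []  # the path does not exist


if __name__ == "__main__":
    print("grobid_client importable:", module_available("grobid_client"))
    print("server reachable:", server_alive("http://localhost:8070"))
    print("PDFs found:", len(find_pdfs("pdfs/")))
```

Running the checks first turns a late ConnectionError or ValueError into an early, readable report. Note that module_available("grobid-client-python") returns False even when the package is installed, because that string is the distribution name, not the module name.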
gotcha The client expects a running GROBID server at the specified URL. Without it, all API calls will raise ConnectionError.
fix Ensure GROBID server is running at the configured grobid_server URL (default: http://localhost:8070).
breaking Version 0.1.0 changed the default output format to JSON and Markdown. The process() method now returns dicts and writes files differently. Old code expecting raw TEI XML may break.
fix Use generateIDs=True and specify output format parameters like output_format='tei' if needed.
gotcha The batch size default changed to 10 in v0.0.17 to avoid unexpected behaviors. Large batches may cause server timeouts.
fix Adjust batch_size parameter in client initialization (e.g., GrobidClient(batch_size=100) for larger throughput, but test for stability).
gotcha The client uses synchronous requests. Processing many PDFs can block the calling thread for a long time.
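fix If the caller must stay responsive or several inputs should be processed concurrently, run the blocking call in a worker thread (for example with concurrent.futures) or wrap it for asyncio.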
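One way to keep a blocking call off the calling thread is a small thread-pool wrapper. The sketch below is an illustration, not the client's own API: process_many and process_one are hypothetical names, and process_one stands for any callable that processes a single input directory. Whether one GrobidClient instance can safely be shared across threads is not stated here, so the commented usage assumes each call builds its own client.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_many(process_one, input_dirs, max_workers=2):
    """Call process_one(input_dir) for every directory in worker threads.

    Returns a dict mapping each input directory to None on success,
    or to the exception raised while processing it.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_one, d): d for d in input_dirs}
        for future in as_completed(futures):
            input_dir = futures[future]
            try:
                future.result()
                results[input_dir] = None
            except Exception as exc:  # keep going if one directory fails
                results[input_dir] = exc
    return results


# Hypothetical usage with the client (requires a running GROBID server):
#
# from grobid_client.grobid_client import GrobidClient
#
# def process_one(input_dir):
#     client = GrobidClient(grobid_server="http://localhost:8070")
#     client.process("processFulltextDocument", input_dir, output="output/")
#
# outcome = process_many(process_one, ["pdfs/batch1", "pdfs/batch2"])
```

Keep max_workers small: every worker sends requests to the same server, so more threads can produce the same timeouts as an oversized batch.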

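Initialize the client with the default configuration and process a directory of PDFs.

from grobid_client.grobid_client import GrobidClient

# Requires a GROBID server running at this URL.
client = GrobidClient(config_path=None, grobid_server='http://localhost:8070')

# "pdfs/" is a directory containing the PDFs to process; results are written under "output/".
client.process("processFulltextDocument", "pdfs/", output="output/")
print("Done")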
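After a run, the output directory can be inspected with the standard library. The sketch below assumes the run produced TEI XML files with an .xml extension, which depends on the client version and the chosen output options. The function name list_titles is hypothetical. It reads the first title in each TEI header using the TEI namespace http://www.tei-c.org/ns/1.0.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Namespace used by TEI documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def list_titles(output_dir: str, pattern: str = "*.xml") -> dict[str, str]:
    """Map each parseable XML file under output_dir to its TEI header title.

    Files that are not well-formed XML are skipped. Files without a
    titleStmt/title element map to an empty string.
    """
    titles = {}
    for path in sorted(Path(output_dir).rglob(pattern)):
        try:
            root = ET.parse(str(path)).getroot()
        except ET.ParseError:
            continue
        node = root.find(".//tei:titleStmt/tei:title", TEI_NS)
        text = "".join(node.itertext()).strip() if node is not None else ""
        titles[path.name] = text
    return titles


if __name__ == "__main__":
    for name, title in list_titles("output/").items():
        print(f"{name}: {title or '(no title found)'}")
```

An empty result means no XML files were found, which is a quick signal that the run wrote a different format or wrote nothing at all.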