GEOparse
GEOparse is a Python library designed to access, parse, and handle data from the Gene Expression Omnibus (GEO) database. It simplifies the programmatic retrieval of GEO Series (GSE), GEO DataSets (GDS), and GEO Sample (GSM) entries, providing easy access to metadata and expression tables. The current version is 2.0.4, and releases are made on an as-needed basis, typically for bug fixes or feature enhancements.
Common errors
- AttributeError: 'tuple' object has no attribute 'name'
cause The value you are calling `.name` on is a tuple, not a GEOparse object. This usually points to a mismatch between code written for the GEOparse 1.x convention, in which `get_GEO` may have returned a tuple, and the 2.x convention, in which it returns a single object such as `GSE`.
fix Update your code to match the 2.x API. Instead of `gse, gsms = GEOparse.get_GEO(...)`, use `gse = GEOparse.get_GEO(...)`, then read `gse.name` and access individual samples via `gse.gsms`.
- requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
cause The connection to the NCBI GEO server was closed abruptly during the download, often because of network instability, a firewall blocking the connection, or the server dropping the connection.
fix Check your internet connection and proxy settings. Temporarily disable any firewall or VPN that might interfere, then run the download again. If the issue persists, the GEO server may be experiencing temporary problems; try again later.
- KeyError: 'some_metadata_field_name'
cause You are accessing a metadata field (e.g., `gsm.metadata['some_field']`) that does not exist for the current GEO Series or Sample.
fix Not every metadata field is present for every entry. Use the dictionary `.get()` method with a default value to avoid the `KeyError`: `gsm.metadata.get('some_metadata_field_name', 'N/A')`.
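The `.get()` advice for missing metadata fields can be wrapped in a small helper. This is a minimal sketch, not part of GEOparse: `first_value` is a hypothetical name, and the dictionary below is a stand-in for `gsm.metadata`, on the assumption that metadata values are lists of strings, as the quickstart's `[0]` indexing suggests.

```python
def first_value(metadata, key, default="N/A"):
    """Return the first entry stored under `key`, or `default` if absent or empty."""
    values = metadata.get(key) or [default]
    # Tolerate a bare string in case a value is not wrapped in a list.
    if isinstance(values, str):
        return values
    return values[0]


# Stand-in for gsm.metadata; real entries come from a parsed GEO record.
sample_metadata = {"title": ["Control replicate 1"], "type": ["RNA"]}

print(first_value(sample_metadata, "title"))         # field present
print(first_value(sample_metadata, "organism_ch1"))  # field missing, default used
```

With a parsed sample, the call would look like `first_value(gsm.metadata, 'title')`.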
Warnings
- breaking The return type of `GEOparse.get_GEO()` changed significantly from versions 1.x to 2.x. Previously, it might have returned a tuple (e.g., `(gse_object, gsm_list)`). In 2.x, it consistently returns a single GEOparse object (e.g., `GSE`, `GDS`, or `GSM`).
- gotcha Processing very large GEO datasets can consume significant amounts of RAM and disk space, potentially leading to out-of-memory errors or long download/parsing times.
- gotcha Downloads from the GEO database can occasionally fail due to network connectivity issues, server-side problems at NCBI, or incorrect GEO accession IDs.
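Transient download failures like those described above are often handled by retrying with a pause between attempts. The following is a rough sketch of that pattern rather than anything GEOparse provides: `with_retries` and its parameters are hypothetical, and it assumes the failure surfaces as an `OSError` subclass (which covers the built-in `ConnectionResetError` and the `requests` connection errors). Retrying will not help with an incorrect accession ID.

```python
import time


def with_retries(fetch, attempts=3, delay=2.0, retry_on=(OSError,)):
    """Call `fetch()` until it succeeds or `attempts` calls have failed."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(delay * attempt)  # linear backoff between tries


# Hypothetical usage (needs network access, so it is left commented out):
# import GEOparse
# gse = with_retries(lambda: GEOparse.get_GEO(geo="GSE1", destdir="./"))
```

Passing the download as a callable keeps the helper independent of GEOparse, so it can be exercised with a stub function before being pointed at a real download.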
Install
- pip install geoparse
Imports
- GEOparse
import GEOparse
- get_GEO
from GEOparse import get_GEO
Quickstart
import GEOparse
# Download a small GEO Series (GSE1 is very small for quick testing)
print("Downloading GSE1 data...")
gse = GEOparse.get_GEO(geo="GSE1", destdir="./")
print(f"\nSuccessfully parsed GEO Series: {gse.name}")
print(f"Title: {gse.metadata.get('title', ['N/A'])[0]}")
print(f"Number of samples (GSMs): {len(gse.gsms)}")
# Access and print information for the first sample
if gse.gsms:
    first_gsm_name = list(gse.gsms.keys())[0]
    first_gsm = gse.gsms[first_gsm_name]
    print(f"\nFirst Sample (GSM): {first_gsm.name}")
    print(f"Sample Title: {first_gsm.metadata.get('title', ['N/A'])[0]}")
    print(f"Sample Type: {first_gsm.metadata.get('type', ['N/A'])[0]}")
    print(f"Sample Table Head:\n{first_gsm.table.head(2)}")
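The quickstart inspects only the first sample; the same pattern extends to every entry in `gse.gsms`. The sketch below is illustrative only: `sample_titles` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for parsed samples so the example runs without a download.

```python
from types import SimpleNamespace


def sample_titles(gsms):
    """Map each sample name to its first 'title' entry, or 'N/A' if missing."""
    return {
        name: (gsm.metadata.get("title") or ["N/A"])[0]
        for name, gsm in gsms.items()
    }


# Stand-ins for parsed samples so the sketch runs without a download.
fake_gsms = {
    "GSM_A": SimpleNamespace(metadata={"title": ["Tumor 1"]}),
    "GSM_B": SimpleNamespace(metadata={}),
}
print(sample_titles(fake_gsms))
```

With a real series, the call would be `sample_titles(gse.gsms)`.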