gffutils

0.14 verified Mon Apr 27 auth: no python

gffutils is a Python package for working with GFF and GTF files in a flexible database framework. It stores annotations in a SQLite database for fast querying, manipulation, and export. Current version is 0.14 (released 2024). It supports Python >=3.8 and is released on a semi-regular cadence.

pip install gffutils
error sqlite3.OperationalError: no such table: features
cause The database file exists but is empty or is not a valid gffutils database (e.g., an earlier build did not complete and the file was never overwritten).
fix Delete the existing database file and recreate it with create_db, passing force=True.
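error AttributeError: module 'gffutils' has no attribute 'create_db'
cause Python imported something other than the installed package. create_db is a top-level function of gffutils, so this error usually means a local file or directory named gffutils (e.g., gffutils.py in the working directory) is shadowing it.
fix Rename the shadowing file, then use: import gffutils; db = gffutils.create_db(...). Printing gffutils.__file__ shows which module was actually imported.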
error ValueError: No valid ID found for feature type 'gene'. Please specify an id_spec.
cause The GFF/GTF file does not have an 'ID' attribute for genes (common in GTF format) or the attribute name differs.
fix Provide an id_spec dictionary mapping feature types to attribute names, e.g., id_spec={'gene': 'gene_id', 'transcript': 'transcript_id'}
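
Before choosing an id_spec, it can help to see which attribute keys each feature type actually carries. The sketch below uses only the standard library and an invented two-line GTF sample. The helper name attribute_keys_by_type is hypothetical (it is not part of gffutils), and the parsing assumes the simple `key "value";` attribute form with no semicolons inside quoted values.

```python
from collections import defaultdict


def attribute_keys_by_type(gtf_text):
    """Map each feature type to the set of attribute keys seen on its lines."""
    keys = defaultdict(set)
    for line in gtf_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a 9-column feature line
        featuretype, attributes = fields[2], fields[8]
        for item in attributes.split(";"):
            item = item.strip()
            if item:
                # GTF attributes look like: gene_id "G1"
                keys[featuretype].add(item.split(None, 1)[0])
    return dict(keys)


# Hypothetical sample data, for illustration only
sample_gtf = (
    'chr1\tsrc\tgene\t1\t1000\t.\t+\t.\tgene_id "G1";\n'
    'chr1\tsrc\ttranscript\t1\t1000\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
)

keys_by_type = attribute_keys_by_type(sample_gtf)
for featuretype, keys in sorted(keys_by_type.items()):
    print(featuretype, sorted(keys))
```

Any key that appears on every line of a feature type and is unique per feature is a candidate value for that type in id_spec.
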
gotcha The force=True parameter is required when overwriting an existing database file. If you omit it and the file exists, you'll get a 'Database already exists' error.
fix Always include force=True if you intend to recreate the database, or use a new file path.
gotcha ID conventions differ between GFF3 and GTF. GFF3 features carry an ID attribute, while GTF files identify genes and transcripts through attributes such as gene_id and transcript_id. If features are not keyed the way you expect, use the 'id_spec' parameter to say which attribute supplies the ID for each feature type.
fix Use gffutils.create_db('file.gtf', dbfn='file.db', id_spec={'gene': 'gene_id', 'transcript': 'transcript_id'})
deprecated The method `db.all_features()` is deprecated since version 0.11; use `db.features()` instead.
fix Replace db.all_features() with db.features()
breaking In version 0.12, the default for 'merge_strategy' changed from 'merge' to 'error'. This means that if you have overlapping features with the same ID, the database creation will fail unless you specify merge_strategy='merge'.
fix Set merge_strategy='merge' in create_db if you need to handle overlapping features with the same ID.
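
Whether merge_strategy matters for a given file depends on whether any ID occurs more than once. The following is a minimal standard-library sketch for checking this on GFF3 text before building the database. The helper duplicate_ids is hypothetical (it is not a gffutils function), the sample data is invented, and only the plain `ID=value` attribute form is handled.

```python
from collections import Counter


def duplicate_ids(gff_text):
    """Return a sorted list of ID values found on more than one GFF3 feature line."""
    counts = Counter()
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a 9-column feature line
        for item in fields[8].split(";"):
            key, sep, value = item.strip().partition("=")
            if sep and key == "ID":
                counts[value] += 1
    return sorted(id_ for id_, n in counts.items() if n > 1)


# Hypothetical sample: one CDS split across two lines that share an ID
sample_gff = (
    "##gff-version 3\n"
    "chr1\t.\tCDS\t1\t100\t.\t+\t0\tID=cds1;Parent=mrna1\n"
    "chr1\t.\tCDS\t200\t300\t.\t+\t0\tID=cds1;Parent=mrna1\n"
    "chr1\t.\texon\t1\t300\t.\t+\t.\tID=exon1;Parent=mrna1\n"
)

repeated = duplicate_ids(sample_gff)
print(repeated)
```

A non-empty result is a hint that create_db will likely need an explicit merge_strategy for this file.
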

Create an in-memory GFF database from a string and query features.

import gffutils

# A two-feature GFF3 document held in a string (columns are tab-separated)
gff_text = (
    "##gff-version 3\n"
    "chr1\t.\tgene\t1\t1000\t.\t+\t.\tID=gene1\n"
    "chr1\t.\texon\t100\t200\t.\t+\t.\tID=exon1;Parent=gene1\n"
)

# create_db takes the data first and the database path (dbfn) second;
# from_string=True marks the data as GFF text rather than a file name.
db = gffutils.create_db(gff_text, dbfn=':memory:', from_string=True, force=True, keep_order=True)

# Look up a single feature by its ID
gene = db['gene1']
print(gene.id, gene.seqid, gene.start, gene.end)

# Iterate over the exon children of the gene
for exon in db.children('gene1', featuretype='exon'):
    print(exon.id, exon.start, exon.end)
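
For file-backed databases, the 'no such table: features' error listed above can be ruled out before the file is opened with gffutils. This is a minimal sketch using only the standard-library sqlite3 module. The helper has_features_table is hypothetical, and the check relies only on the table name that appears in that error message; the two database files are throwaway stand-ins created for the demonstration.

```python
import os
import sqlite3
import tempfile


def has_features_table(db_path):
    """Return True if the SQLite file at db_path contains a table named 'features'."""
    if not os.path.exists(db_path):
        return False
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = 'features'"
        ).fetchone()
        return row is not None
    except sqlite3.DatabaseError:
        return False  # the file exists but is not a SQLite database
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for an empty or never-populated database file
    empty_db = os.path.join(tmp, "empty.db")
    sqlite3.connect(empty_db).close()

    # Stand-in for a populated database: only the table name matters here
    good_db = os.path.join(tmp, "good.db")
    conn = sqlite3.connect(good_db)
    conn.execute("CREATE TABLE features (id TEXT PRIMARY KEY)")
    conn.commit()
    conn.close()

    empty_ok = has_features_table(empty_db)
    good_ok = has_features_table(good_db)

print(empty_ok, good_ok)
```

If the check returns False for an existing path, rebuilding the database with force=True is the fix described in the error entry.
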