PyDESeq2
PyDESeq2 is a Python implementation of the DESeq2 method for differential expression analysis (DEA) with bulk RNA-seq data. It enables researchers to perform single-factor and multi-factor designs, Wald tests with multiple testing correction, and optional LFC shrinkage. The library is actively maintained, with version 0.5.4 being the latest stable release, and it is part of the scverse ecosystem, integrating with AnnData for data handling.
Warnings
- breaking Python 3.10 is no longer supported starting from PyDESeq2 v0.5.3.
- breaking The `design` argument of `DeseqDataSet` changed in v0.5.0 to accept `formulaic` string formulas (e.g., `'~condition'`) instead of pandas DataFrames for the design matrix directly. Python 3.9 also dropped support.
- gotcha In v0.5.2, 1D variables stored in `obsm` and `varm` attributes of the AnnData-like `DeseqDataSet` were moved to `obs` and `var` respectively for better consistency with AnnData standards.
- gotcha PyDESeq2 v0.5.4 includes fixes for pandas 3 data type and copy/write bugs. Users on older PyDESeq2 versions combined with pandas 3.x might encounter unexpected behavior.
- gotcha PyDESeq2 is a re-implementation of the R DESeq2 method. While it aims for similar results and features, there might be subtle differences in computed values or available functionalities compared to the original R package.
Install
-
pip install pydeseq2 -
conda install -c bioconda pydeseq2
Imports
- DeseqDataSet
from pydeseq2.dds import DeseqDataSet
- DeseqStats
from pydeseq2.ds import DeseqStats
- load_example_data
from pydeseq2.utils import load_example_data
Quickstart
import pandas as pd
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats
# 1. Create dummy count data (genes x samples) and metadata
# In a real scenario, load with pd.read_csv('counts.csv', index_col=0).T
# and pd.read_csv('metadata.csv', index_col=0)
counts_df = pd.DataFrame(
{
'sample1': [100, 50, 200, 10, 5, 150],
'sample2': [120, 60, 210, 12, 6, 160],
'sample3': [10, 5, 20, 100, 50, 15],
'sample4': [15, 7, 25, 110, 55, 18]
},
index=['geneA', 'geneB', 'geneC', 'geneD', 'geneE', 'geneF']
).T # Transpose to samples x genes
metadata = pd.DataFrame(
{
'condition': ['treated', 'treated', 'control', 'control'],
'batch': ['batch1', 'batch2', 'batch1', 'batch2']
},
index=['sample1', 'sample2', 'sample3', 'sample4']
)
# Ensure counts_df index (samples) matches metadata index (samples)
assert counts_df.index.equals(metadata.index)
# 2. Filter low-count genes (optional, but good practice)
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
counts_df = counts_df[genes_to_keep]
# 3. Initialize DeseqDataSet with a formulaic design
dds = DeseqDataSet(
counts=counts_df,
metadata=metadata,
design='~condition'
)
# 4. Run the DESeq2 pipeline (normalization, dispersion, LFC estimation)
dds.deseq2()
# 5. Perform statistical testing
deseq_stats = DeseqStats(dds, contrast=['condition', 'treated', 'control'])
deseq_stats.wald_test()
# 6. Apply LFC shrinkage (optional)
deseq_stats.lfc_shrink(coeff='condition_treated_vs_control')
# 7. Access results
results = deseq_stats.results_df
print(results.head())
# You can also access attributes directly from dds, e.g., normalized counts
# print(dds.layers['normed_counts'].head())