PyDESeq2

0.5.4 · active · verified Sat Apr 11

PyDESeq2 is a Python implementation of the DESeq2 method for differential expression analysis (DEA) with bulk RNA-seq data. It enables researchers to perform single-factor and multi-factor designs, Wald tests with multiple testing correction, and optional LFC shrinkage. The library is actively maintained, with version 0.5.4 being the latest stable release, and it is part of the scverse ecosystem, integrating with AnnData for data handling.

Warnings

Install

Imports

Quickstart

This quickstart demonstrates a typical PyDESeq2 workflow for differential expression analysis. It covers loading data, initializing a `DeseqDataSet` with a `formulaic` design string, running the DESeq2 pipeline, performing Wald tests, and applying LFC shrinkage. Results are accessible via the `DeseqStats` object's `results_df` attribute.

import pandas as pd
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats

# 1. Create dummy count data (genes x samples) and metadata
# In a real scenario, load with pd.read_csv('counts.csv', index_col=0).T
# and pd.read_csv('metadata.csv', index_col=0)
counts_df = pd.DataFrame(
    {
        'sample1': [100, 50, 200, 10, 5, 150],
        'sample2': [120, 60, 210, 12, 6, 160],
        'sample3': [10, 5, 20, 100, 50, 15],
        'sample4': [15, 7, 25, 110, 55, 18]
    },
    index=['geneA', 'geneB', 'geneC', 'geneD', 'geneE', 'geneF']
).T # Transpose to samples x genes

metadata = pd.DataFrame(
    {
        'condition': ['treated', 'treated', 'control', 'control'],
        'batch': ['batch1', 'batch2', 'batch1', 'batch2']
    },
    index=['sample1', 'sample2', 'sample3', 'sample4']
)

# Ensure counts_df index (samples) matches metadata index (samples)
assert counts_df.index.equals(metadata.index)

# 2. Filter low-count genes (optional, but good practice)
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
counts_df = counts_df[genes_to_keep]

# 3. Initialize DeseqDataSet with a formulaic design
dds = DeseqDataSet(
    counts=counts_df,
    metadata=metadata,
    design='~condition'
)

# 4. Run the DESeq2 pipeline (normalization, dispersion, LFC estimation)
dds.deseq2()

# 5. Perform statistical testing
deseq_stats = DeseqStats(dds, contrast=['condition', 'treated', 'control'])
deseq_stats.wald_test()

# 6. Apply LFC shrinkage (optional)
deseq_stats.lfc_shrink(coeff='condition_treated_vs_control')

# 7. Access results
results = deseq_stats.results_df
print(results.head())

# You can also access attributes directly from dds, e.g., normalized counts
# print(dds.layers['normed_counts'].head())

view raw JSON →