Negative binomial GLM differential expression#

bullkpy.tl.de_glm(adata, *, formula, contrast=None, layer_counts='counts', shrink_dispersion=True, prior_df=10.0, key_added='de_glm', add_intercept=True)[source]#

NB-GLM DE with design matrix from a formula and a contrast.

Example

bk.tl.de_glm(adata, formula=”~ Subtype + Batch”, contrast=(“Subtype”,”Basal”,”Luminal”))

Stores:: adata.uns[key_added][contrast_name] = dict(method/params/results)

Run negative binomial GLM differential expression (bulk-friendly) using a formula-defined design matrix and a two-level contrast. This is intended for raw (integer-like) counts and supports adding covariates (e.g., batch) in the model.

The implementation is “DESeq2-like” in spirit:

it computes size factors and uses a log(size factor) offset
estimates gene-wise dispersion from normalized counts
optionally shrinks dispersion toward a trend
fits a gene-wise NB-GLM and performs a Wald test for a specified contrast
reports log2FC, Wald z, p-values, and BH FDR q-values

When to use#

Use de_glm when you want:

two-group DE while controlling for additional covariates (e.g., batch, sex, purity)
a count model (NB) instead of a t-test on log-expression
a workflow similar to bulk RNA-seq DE tools (DESeq2/edgeR style).

You should provide raw counts in adata.layers[“counts”] (or another layer via layer_counts).

Parameters#

Core inputs#

adata
AnnData with samples in rows (n_obs) and genes in columns (n_vars).

formula
A Patsy-like formula string defining the design matrix using adata.obs columns. Examples:

“~ Subtype”
“~ Subtype + Batch”
“~ Subtype + Batch + Sex”. The design matrix is constructed via an internal helper (_design_matrix_from_formula).

contrast
Required in the current implementation.
A tuple (var, level_a, level_b) meaning level_a − level_b for a categorical variable.
Example:

(“Subtype”, “Basal”, “Luminal”) → Basal vs Luminal (Basal − Luminal) Internally a contrast vector c is built from the coefficient names (coef_names) using _contrast_vector.

layer_counts
Layer name containing the raw counts matrix (samples × genes).
Default: “counts”.

Dispersion options#

shrink_dispersion
If True, shrink gene-wise dispersion estimates toward a trend (stabilizes estimates when n is small).
Default: True.

prior_df
Prior degrees of freedom controlling the strength of dispersion shrinkage (higher → stronger shrinkage).
Default: 10.0.

Output storage#

key_added
Where results are stored in adata.uns.
Default: “de_glm”.

add_intercept
Whether to include an intercept term in the design matrix.
Default: True.

Output#

Results are stored as:

adata.uns[key_added][contrast_name] = {
  "method": ...,
  "formula": ...,
  "contrast": ...,
  "coef_names": ...,
  "layer_counts": ...,
  "shrink_dispersion": ...,
  "prior_df": ...,
  "results": <pd.DataFrame>
}

Where:

contrast_name is constructed as:
- f”{var}:{level_a}vs{level_b}”
- Example: “Subtype:Basal_vs_Luminal”

Result table columns.

The stored DataFrame (results) contains:

gene : gene name (from adata.var_names)
log2FC : estimated log2 fold-change for the contrast (level_a − level_b)
wald_z : Wald z-statistic for the contrast
pval : two-sided p-value from Wald z
qval : BH-adjusted p-value (FDR)
dispersion : final dispersion used per gene (after optional shrink)
mean_norm : mean of normalized counts per gene (used in dispersion estimation).

The table is sorted by qval ascending.

Statistical details (high level)#

Design matrix. Built from adata.obs using the provided formula (plus optional intercept).
Counts and offsets.

counts: Y = adata.layers[layer_counts]
size factors: sf = deseq2_size_factors(Y)
offset: log(sf) included in the GLM.

Dispersion.

estimates dispersion from normalized counts (Y / sf)
optional shrinkage to a dispersion trend.

Model fitting (per gene). Fits:

NB(mu, alpha) with log link
log(mu) = Xβ + offset.

Wald test for contrast Tests cᵀβ = 0 using:

z = (cᵀβ) / sqrt(cᵀCov(β)c)
p-value from standard normal approximation

Requirements / dependencies#

statsmodels is required:
- installed via pip install statsmodels
scipy is used for the normal CDF in Wald p-values.

If statsmodels is missing, the function raises an ImportError with install instructions.

Examples#

Basic DE with batch covariate

bk.tl.de_glm(
    adata,
    formula="~ Subtype + Batch",
    contrast=("Subtype", "Basal", "Luminal"),
    layer_counts="counts",
)

Retrieve results

res = adata.uns["de_glm"]["Subtype:Basal_vs_Luminal"]["results"]
res.head()

Turn off dispersion shrinkage

bk.tl.de_glm(
    adata,
    formula="~ Subtype + Batch",
    contrast=("Subtype", "Basal", "Luminal"),
    shrink_dispersion=False,
)

Notes / caveats#

The current implementation requires contrast (no “test all coefficients” mode yet).
This is a gene-wise GLM loop; it can be slower than vectorized approaches for very large gene counts.
Counts with total sum 0 for a gene are skipped (remain default p=1 / NaNs for some stats).
The contrast mechanism assumes a fairly standard dummy encoding for categorical variables (handled by the internal design/contrast helpers). If coefficient naming differs from expectations, _contrast_vector may fail or target the wrong coefficient—double-check coef_names stored in adata.uns.