Negative binomial GLM differential expression#
- bullkpy.tl.de_glm(adata, *, formula, contrast=None, layer_counts='counts', shrink_dispersion=True, prior_df=10.0, key_added='de_glm', add_intercept=True)[source]#
NB-GLM DE with design matrix from a formula and a contrast.
Example
bk.tl.de_glm(adata, formula=”~ Subtype + Batch”, contrast=(“Subtype”,”Basal”,”Luminal”))
- Stores:
adata.uns[key_added][contrast_name] = dict(method/params/results)
Run negative binomial GLM differential expression (bulk-friendly) using a formula-defined design matrix and a two-level contrast. This is intended for raw (integer-like) counts and supports adding covariates (e.g., batch) in the model.
The implementation is “DESeq2-like” in spirit:
it computes size factors and uses a log(size factor) offset
estimates gene-wise dispersion from normalized counts
optionally shrinks dispersion toward a trend
fits a gene-wise NB-GLM and performs a Wald test for a specified contrast
reports log2FC, Wald z, p-values, and BH FDR q-values
When to use#
Use de_glm when you want:
two-group DE while controlling for additional covariates (e.g., batch, sex, purity)
a count model (NB) instead of a t-test on log-expression
a workflow similar to bulk RNA-seq DE tools (DESeq2/edgeR style).
You should provide raw counts in adata.layers[“counts”] (or another layer via layer_counts).
Parameters#
Core inputs#
adata
AnnData with samples in rows (n_obs) and genes in columns (n_vars).
formula
A Patsy-like formula string defining the design matrix using adata.obs columns.
Examples:
“~ Subtype”
“~ Subtype + Batch”
“~ Subtype + Batch + Sex”. The design matrix is constructed via an internal helper (_design_matrix_from_formula).
contrast
Required in the current implementation.
A tuple (var, level_a, level_b) meaning level_a − level_b for a categorical variable.
Example:
(“Subtype”, “Basal”, “Luminal”) → Basal vs Luminal (Basal − Luminal) Internally a contrast vector c is built from the coefficient names (coef_names) using _contrast_vector.
layer_counts
Layer name containing the raw counts matrix (samples × genes).
Default: “counts”.
Dispersion options#
shrink_dispersion
If True, shrink gene-wise dispersion estimates toward a trend (stabilizes estimates when n is small).
Default: True.
prior_df
Prior degrees of freedom controlling the strength of dispersion shrinkage (higher → stronger shrinkage).
Default: 10.0.
Output storage#
key_added
Where results are stored in adata.uns.
Default: “de_glm”.
add_intercept
Whether to include an intercept term in the design matrix.
Default: True.
Output#
Results are stored as:
adata.uns[key_added][contrast_name] = {
"method": ...,
"formula": ...,
"contrast": ...,
"coef_names": ...,
"layer_counts": ...,
"shrink_dispersion": ...,
"prior_df": ...,
"results": <pd.DataFrame>
}
Where:
contrast_name is constructed as:
f”{var}:{level_a}vs{level_b}”
Example: “Subtype:Basal_vs_Luminal”
Result table columns.
The stored DataFrame (results) contains:
gene : gene name (from adata.var_names)
log2FC : estimated log2 fold-change for the contrast (level_a − level_b)
wald_z : Wald z-statistic for the contrast
pval : two-sided p-value from Wald z
qval : BH-adjusted p-value (FDR)
dispersion : final dispersion used per gene (after optional shrink)
mean_norm : mean of normalized counts per gene (used in dispersion estimation).
The table is sorted by qval ascending.
Statistical details (high level)#
Design matrix. Built from adata.obs using the provided formula (plus optional intercept).
Counts and offsets.
counts: Y = adata.layers[layer_counts]
size factors: sf = deseq2_size_factors(Y)
offset: log(sf) included in the GLM.
Dispersion.
estimates dispersion from normalized counts (Y / sf)
optional shrinkage to a dispersion trend.
Model fitting (per gene). Fits:
NB(mu, alpha) with log link
log(mu) = Xβ + offset.
Wald test for contrast Tests cᵀβ = 0 using:
z = (cᵀβ) / sqrt(cᵀCov(β)c)
p-value from standard normal approximation
Requirements / dependencies#
statsmodels is required:
installed via pip install statsmodels
scipy is used for the normal CDF in Wald p-values.
If statsmodels is missing, the function raises an ImportError with install instructions.
Examples#
Basic DE with batch covariate
bk.tl.de_glm(
adata,
formula="~ Subtype + Batch",
contrast=("Subtype", "Basal", "Luminal"),
layer_counts="counts",
)
Retrieve results
res = adata.uns["de_glm"]["Subtype:Basal_vs_Luminal"]["results"]
res.head()
Turn off dispersion shrinkage
bk.tl.de_glm(
adata,
formula="~ Subtype + Batch",
contrast=("Subtype", "Basal", "Luminal"),
shrink_dispersion=False,
)
Notes / caveats#
The current implementation requires contrast (no “test all coefficients” mode yet).
This is a gene-wise GLM loop; it can be slower than vectorized approaches for very large gene counts.
Counts with total sum 0 for a gene are skipped (remain default p=1 / NaNs for some stats).
The contrast mechanism assumes a fairly standard dummy encoding for categorical variables (handled by the internal design/contrast helpers). If coefficient naming differs from expectations, _contrast_vector may fail or target the wrong coefficient—double-check coef_names stored in adata.uns.