Batch correction with Combat#
- bullkpy.pp.batch_correct_combat(adata, *, batch_key, layer='log1p_cpm', covariates=None, key_added='combat', overwrite=False, inplace=True)[source]#
ComBat batch correction (Johnson et al.) for bulk expression.
Notes
ComBat is intended for approximately Gaussian data: use log-transformed normalized expression (e.g., log1p_cpm), not raw counts.
Writes corrected matrix to adata.layers[key_added] by default.
- Parameters:
batch_key – adata.obs column with batch labels (categorical recommended).
layer – Which matrix to correct. If None, uses adata.X.
covariates – Optional adata.obs columns to include in the design (biological covariates to preserve).
key_added – Layer name to store corrected values (if overwrite=False).
overwrite – If True, write corrected values back into the selected layer / X.
inplace – If True, store results in adata; if False, return corrected matrix (samples x genes).
ComBat batch correction (Johnson et al.) for bulk expression matrices.
batch_correct_combat removes unwanted batch-associated variation while optionally preserving biological covariates (e.g., subtype, Project_ID). It is intended for approximately Gaussian expression values, so use log-normalized layers (e.g. log1p_cpm) rather than raw counts.
What it does#
• Takes an expression matrix (default: adata.layers["log1p_cpm"]) and a batch label in adata.obs[batch_key].
• Fits a linear model with:
• intercept
• optional covariates (biological variables to preserve)
• batch indicators
• Applies empirical Bayes shrinkage to batch effect parameters and returns a corrected matrix.
• By default, stores the corrected values as a new layer: adata.layers[key_added] (default: "combat").
Inputs#
Required#
• adata: AnnData (samples in .obs, genes in .var_names)
• batch_key: column in adata.obs with batch labels (categorical recommended)
Optional#
• layer: which matrix to correct ("log1p_cpm" recommended). If None, uses adata.X.
• covariates: list of adata.obs columns to include in the design matrix (preserved effects).
• Numeric covariates are included as-is
• Categorical covariates are one-hot encoded
• key_added: layer name for corrected matrix if overwrite=False
• overwrite:
• False (default): write to adata.layers[key_added]
• True: overwrite the selected layer (or .X if layer=None)
• inplace:
• True (default): write into adata, return None
• False: return corrected matrix as a NumPy array (n_samples, n_genes)
Returns#
• If inplace=True: returns None and stores corrected matrix in .layers (or overwrites)
• If inplace=False: returns a NumPy array with corrected expression values (samples × genes)
Notes and recommendations#
• Do NOT run ComBat on raw counts. Use log-normalized expression (e.g. CPM/TPM + log1p).
• If your batch variable has < 2 batches, the function will skip correction (returns input matrix if inplace=False).
• Use covariates to avoid removing true biological effects correlated with batch.
Typical covariates: • tumor subtype, tissue type, sex, Project_ID (if biological) • avoid adding covariates that are actually batch proxies unless you explicitly want to preserve them
Examples#
1) Correct by sequencing center / batch and store to a new layer#
import bullkpy as bk
bk.pp.batch_correct_combat(
adata,
batch_key="Center", # e.g., sequencing center
layer="log1p_cpm",
key_added="combat",
)
# downstream: use corrected layer
bk.tl.pca(adata, layer="combat", use_highly_variable=True)
bk.pl.pca_scatter(adata, color="Center")
2) Preserve a biological covariate (e.g. subtype) while correcting batch#
bk.pp.batch_correct_combat(
adata,
batch_key="Center",
layer="log1p_cpm",
covariates=["Project_ID"], # preserve biological signal
key_added="combat_cov",
)
3) Overwrite the original layer#
bk.pp.batch_correct_combat(
adata,
batch_key="Center",
layer="log1p_cpm",
overwrite=True, # writes back to adata.layers["log1p_cpm"]
)
4) Get the corrected matrix without modifying adata#
Xc = bk.pp.batch_correct_combat(
adata,
batch_key="Center",
layer="log1p_cpm",
inplace=False,
)
print(Xc.shape) # (n_samples, n_genes)
See also#
• pp.qc_metrics, pp.filter_samples, pp.highly_variable_genes
• tl.pca, pl.pca_scatter, pl.umap