Top Gene-Obs correlations
bullkpy.tl.top_gene_obs_correlations(adata, *, gene, obs=None, obs_keys=None, layer='log1p_cpm', method='pearson', top_n=50, min_abs_r=None, use_abs=True, batch_key=None, batch_mode='none', covariates=None)
Correlate one or more genes with numeric sample-level variables in adata.obs.
Use obs to restrict the scan to specific obs columns, or obs_keys to give an explicit list of numeric obs columns.
This function scans numeric columns in adata.obs (e.g. QC metrics, clinical variables,
scores, signatures) and returns the strongest gene↔obs correlations.
Typical uses:
• correlate genes with QC metrics (library size, % mito, etc.)
• associate genes with continuous phenotypes (age, tumor purity, signature score)
• quickly identify top gene–trait relationships
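As a rough illustration of what "numeric columns in adata.obs" means in practice, the sketch below selects numeric columns from a pandas DataFrame standing in for adata.obs. The toy table and its column names are made up, and this is not necessarily how bullkpy selects columns internally.

```python
import pandas as pd

# Toy stand-in for adata.obs (hypothetical columns, not from a real dataset)
obs = pd.DataFrame(
    {
        "total_counts": [1.2e6, 9.5e5, 1.1e6, 8.7e5],
        "pct_counts_mt": [4.1, 6.3, 5.0, 7.8],
        "age": [54, 61, 47, 70],
        "Batch": ["A", "A", "B", "B"],  # string column: would not be scanned
    },
    index=["s1", "s2", "s3", "s4"],
)

# Keep only numeric columns; string/categorical columns are dropped
numeric_cols = obs.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```

Running a check like this on your own adata.obs before calling the function shows which columns are candidates for the scan.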
Parameters
adata
AnnData object with expression and sample metadata.
gene
A gene name (string) or list of gene names. All must exist in adata.var_names.
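Because every requested gene must exist in adata.var_names, it can help to check names up front. The helper below is a hypothetical sketch (check_genes is not part of bullkpy) that uses a pandas Index as a stand-in for adata.var_names.

```python
import pandas as pd

# Toy stand-in for adata.var_names (hypothetical gene list)
var_names = pd.Index(["TP53", "MYC", "MKI67", "EPCAM"])

def check_genes(gene, var_names):
    """Normalize gene to a list and report names missing from var_names."""
    genes = [gene] if isinstance(gene, str) else list(gene)
    missing = [g for g in genes if g not in var_names]
    return genes, missing

genes, missing = check_genes(["MYC", "CDKN1A"], var_names)
print(genes, missing)
```

Any names reported as missing can be corrected or dropped before the call.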
obs
Optional obs key (string) or list of obs keys to restrict the analysis to these columns.
Only numeric columns are allowed.
obs_keys
Alternative explicit list of numeric obs columns to consider.
If both obs and obs_keys are given, obs acts as a final filter.
layer
Expression layer to use (default: "log1p_cpm"). If None, uses adata.X.
method
Correlation method:
• "pearson" (linear correlation)
• "spearman" (rank correlation)
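To make the difference between the two methods concrete, here is a minimal sketch using NumPy and pandas on toy data. It computes Spearman correlation as the Pearson correlation of ranks, which is the standard definition; the helper names are illustrative, and bullkpy may compute these differently internally.

```python
import numpy as np
import pandas as pd

def pearson_r(x, y):
    """Pearson correlation: strength of the linear association."""
    return float(np.corrcoef(x, y)[0, 1])

def spearman_r(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx = pd.Series(x).rank().to_numpy()
    ry = pd.Series(y).rank().to_numpy()
    return pearson_r(rx, ry)

# Monotonic but strongly non-linear relationship (toy data)
x = np.arange(1, 11, dtype=float)
y = np.exp(x)

print(pearson_r(x, y))   # below 1: the relationship is not linear
print(spearman_r(x, y))  # 1 up to rounding: the ranks agree perfectly
```

Spearman is often the safer choice when a variable is skewed or has outliers, since it depends only on the ordering of the values.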
top_n
Maximum number of gene–obs pairs returned after ranking (default: 50).
min_abs_r
Optional minimum absolute correlation threshold. Pairs with |r| < min_abs_r
are skipped before ranking.
use_abs
If True (default), rank by |r| (strongest correlations regardless of sign).
If False, rank by signed r.
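The interplay of min_abs_r, use_abs, and top_n can be sketched with pandas on a toy result table. The rank_pairs helper is hypothetical, and it assumes that signed ranking puts the most positive r first; the source does not state the direction.

```python
import pandas as pd

# Toy correlation results (made-up values)
res = pd.DataFrame(
    {
        "gene": ["G1", "G2", "G3", "G4"],
        "obs": ["age", "age", "age", "age"],
        "r": [0.62, -0.81, 0.12, -0.35],
    }
)

def rank_pairs(df, top_n=50, min_abs_r=None, use_abs=True):
    """Drop pairs with |r| < min_abs_r, then rank by |r| or by signed r."""
    out = df.copy()
    if min_abs_r is not None:
        out = out[out["r"].abs() >= min_abs_r]
    key = out["r"].abs() if use_abs else out["r"]
    order = key.sort_values(ascending=False).index
    return out.loc[order].head(top_n).reset_index(drop=True)

by_abs = rank_pairs(res, min_abs_r=0.3, use_abs=True)
by_signed = rank_pairs(res, min_abs_r=0.3, use_abs=False)
print(by_abs["gene"].tolist())     # ['G2', 'G1', 'G4']
print(by_signed["gene"].tolist())  # ['G1', 'G4', 'G2']
```

With use_abs=True, strong negative correlations such as G2 rise to the top; with signed ranking under this assumption they sink to the bottom.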
batch_key, batch_mode, covariates
Optional batch-aware correlation controls (same behavior as other correlation utilities):
• batch_key: obs column defining batches
• batch_mode: "none", "within", or "residual"
• covariates: numeric obs columns to regress out before correlation
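One common way to implement a residual-style adjustment is to regress both variables on batch indicators (plus any covariates) and correlate the residuals. The sketch below illustrates that idea on simulated data; it is an assumption about what "residual" means here, not bullkpy's confirmed implementation, and residualize is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def residualize(y, design):
    """Return residuals of y after least-squares regression on design."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ beta

rng = np.random.default_rng(0)
n = 200
batch = np.repeat(["A", "B"], n // 2)
shift = np.where(batch == "B", 3.0, 0.0)  # batch effect shared by both variables

# Toy data: expr and trait are independent apart from the batch shift
expr = rng.normal(size=n) + shift
trait = rng.normal(size=n) + shift

# Design matrix: intercept + batch dummies (covariate columns could be appended)
dummies = pd.get_dummies(batch, drop_first=True).to_numpy(dtype=float)
design = np.column_stack([np.ones(n), dummies])

r_raw = np.corrcoef(expr, trait)[0, 1]
r_adj = np.corrcoef(residualize(expr, design), residualize(trait, design))[0, 1]
print(round(r_raw, 2), round(r_adj, 2))
```

In this simulation the raw correlation is driven entirely by the batch shift, so it largely disappears once both variables are residualized.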
Returned value
A DataFrame with one row per tested gene–obs pair, ranked by correlation strength.
Columns:
| column | description |
|---|---|
| gene | Gene name |
| obs | Numeric obs column |
| r | Correlation coefficient |
| pval | Raw p-value |
| qval | FDR-adjusted p-value (BH) |
| n | Number of samples used |
| method | Correlation method |
| batch_key | Batch column used (or None) |
| batch_mode | Batch handling strategy |
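For readers unfamiliar with the qval column, the following sketch shows the Benjamini-Hochberg (BH) adjustment in plain NumPy. It illustrates the standard procedure only; which set of p-values bullkpy adjusts over (for example all tested pairs versus only the returned ones) is not stated in the source, and bh_adjust is a hypothetical helper.

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # p_(i) * m / i for the i-th smallest p-value
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working from the largest p-value downwards
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.clip(scaled, 0.0, 1.0)
    return q

pvals = [0.001, 0.02, 0.03, 0.5]
print(bh_adjust(pvals))  # approximately [0.004, 0.04, 0.04, 0.5]
```

Filtering on qval rather than pval is usually the better choice when many gene–obs pairs are tested at once.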
Examples
Correlate one gene vs all numeric obs
import bullkpy as bk

df = bk.tl.top_gene_obs_correlations(
    adata,
    gene="TP53",
)
df.head()
Restrict to specific obs variables
df = bk.tl.top_gene_obs_correlations(
    adata,
    gene="MKI67",
    obs=["purity", "signature_IFNG", "pct_counts_mt"],
    method="spearman",
)
Multiple genes
df = bk.tl.top_gene_obs_correlations(
    adata,
    gene=["MYC", "CDKN1A", "MKI67"],
    top_n=30,
)
All genes: find the genes most correlated with one numeric obs column (e.g. "tumor_purity", "age", "score", …)
df = bk.tl.top_gene_obs_correlations(
    adata,
    gene=list(adata.var_names),  # scan all genes
    obs="tumor_purity",          # the obs column of interest (must be numeric)
    layer="log1p_cpm",           # or None to use adata.X
    method="spearman",           # "pearson" or "spearman"
    top_n=50,                    # keep the top 50 pairs
    min_abs_r=0.3,               # optional: skip pairs with |r| < 0.3 before ranking
    use_abs=True,                # rank by |r|
)
df.head(10)
With batch adjustment
df = bk.tl.top_gene_obs_correlations(
    adata,
    gene="EPCAM",
    batch_key="Batch",
    batch_mode="residual",
)
Interpretation notes
• Correlation does not imply causation; treat results as exploratory.
• Batch effects and confounders can inflate correlations; use batch_mode="residual" and/or covariates=[…] when appropriate.
• For visualization, follow up with a scatter plot of gene expression vs the selected obs.
See also
• tl.gene_gene_correlations
• tl.top_gene_gene_correlations
• tl.association
• pl.scatter