Top gene-gene correlations#

bullkpy.tl.top_gene_gene_correlations(adata, *, genes=None, layer='log1p_cpm', method='pearson', top_n=200, min_abs_r=None, use_abs=True, batch_key=None, batch_mode='none', covariates=None)[source]#

Find strongest gene–gene correlations among a gene set (or all genes if provided). Returns long table with (gene1, gene2, r, pval, qval, n).

Identify the strongest gene–gene correlations within a selected gene set.

This function computes pairwise correlations between genes and returns the top-ranked gene pairs according to correlation strength.
It is designed for targeted panels, pathways, or signatures, not genome-wide scans.

Purpose#

top_gene_gene_correlations helps answer questions such as: • Which genes are most tightly co-regulated? • Are genes within a pathway strongly correlated? • Do correlations persist after accounting for batch effects? • Which gene–gene relationships are strongest in my dataset?

Because the computation scales as O(G²), the function requires an explicit gene list to avoid accidental genome-wide scans.

Parameters#

adata
AnnData object containing expression data.

genes
List of gene names to test.
Required for safety (all-vs-all is intentionally disallowed).

layer
Expression layer to use (default: “log1p_cpm”).

method
Correlation method: • “pearson” • “spearman”

top_n
Number of strongest gene pairs to return (default: 200).

min_abs_r
Optional minimum absolute correlation threshold.
Pairs below this value are discarded early.

use_abs
If True (default), ranking is based on |r|.
If False, ranking uses signed correlation.

batch_key
Optional obs column specifying batch labels.

batch_mode
How to handle batch effects: • “none”: ignore batches • “within”: compute correlations within batches • “residual”: regress out batch effects before correlation

covariates
Optional obs columns to regress out before correlation (e.g. library size, QC metrics).

What is computed#

For each gene pair (gene1, gene2): • correlation coefficient r • p-value • Benjamini–Hochberg FDR (qval) • number of samples used (n)

Batch-aware correlation is applied if batch_key is provided.

Returned value#

A tidy DataFrame with one row per gene pair:

column

description

gene1

First gene

gene2

Second gene

r

Correlation coefficient

pval

Raw p-value

qval

FDR-adjusted p-value

n

Number of samples used

method

Correlation method

batch_key

Batch column used (if any)

batch_mode

Batch handling strategy

Rows are sorted by correlation strength (strongest first).

Examples#

Basic usage

bk.tl.top_gene_gene_correlations(
    adata,
    genes=["TP53", "MDM2", "CDKN1A", "BAX"],
)

With batch correction

bk.tl.top_gene_gene_correlations(
    adata,
    genes=hallmark_genes,
    batch_key="Batch",
    batch_mode="residual",
)

Enforce minimum correlation strength

bk.tl.top_gene_gene_correlations(
    adata,
    genes=genes_of_interest,
    min_abs_r=0.4,
    top_n=50,
)

Interpretation tips#

  • High |r| suggests co-regulation or shared biology, not causality

  • Strong correlations can arise from: • shared pathway activity • technical confounders • cell-type composition

  • Use batch_key and covariates to reduce confounding

  • Consider visualizing top hits with: • scatter plots • correlation heatmaps • network graphs

Performance notes#

•	Runtime scales as O(G²)

→ keep genes lists reasonably small (tens to a few hundred). • For genome-wide correlation analysis, use specialized methods instead.

See also#

•	tl.association
•	tl.gene_categorical_association
•	pl.scatter
•	pl.heatmap
•	pl.network