Rank genes groups#

bullkpy.tl.rank_genes_groups(adata, *, groupby, groups=None, reference='rest', layer='log1p_cpm', method='t-test', corr_method='benjamini-hochberg', use_abs=False, n_genes=100, key_added='rank_genes_groups')[source]#

Scanpy-like ranking of genes per group.

Statistics#

method=”t-test” -> Welch t-test per gene, score = t statistic method=”wilcoxon” -> Mann–Whitney U per gene, score = z-like from U (approx)

Output#

Stores results in adata.uns[key_added] with:

params
names[group], scores[group], logfoldchanges[group], pvals[group], pvals_adj[group]
mean_group[group], mean_ref[group] (extra, useful for bulk)

Scanpy-like per-group gene ranking for bulk-friendly AnnData objects.

rank_genes_groups tests, for each group in adata.obs[groupby], which genes are most different between that group and a reference (either "rest" or a specific group). Results are stored in adata.uns[key_added] in a Scanpy-compatible “dict-of-arrays per group” structure.

What it does#

For each selected group g:

Split samples into:

group: samples where adata.obs[groupby] == g
reference:
- “rest”: all other samples, or
- a specific group name.

Compute per-gene:

mean expression in group and reference
a test statistic (scores)
p-values and FDR-adjusted p-values
a fold-change-like quantity (logfoldchanges).

Rank genes by scores (or abs(scores) if use_abs=True) and store the top n_genes.

Statistics#

method=”t-test” (default)

Performs a Welch t-test per gene (unequal variances).
score = t statistic
p-value = two-sided t-test p-value using Welch–Satterthwaite df.

method=”wilcoxon”

Performs a Mann–Whitney U test per gene.
score = z-like approximation derived from U (no tie correction)
p-value = two-sided MWU p-value.

Notes:

wilcoxon is usually slower because it loops gene-by-gene.
For bulk-sized datasets (hundreds of samples, thousands of genes), this is typically still OK, but for very large matrices prefer “t-test”.

Parameters#

adata
AnnData with expression data in adata.X or adata.layers[layer].

groupby
Categorical column in adata.obs defining groups (e.g. “condition”).

groups
Subset of group names to analyze. If None, uses all categories in adata.obs[groupby].

reference
Reference for each group comparison:

“rest” (default): all samples not in the target group
“SomeGroup”: compare each group vs that explicit group

layer
Expression matrix to use:

if not None, uses adata.layers[layer]
else uses adata.X

method
“t-test” (Welch) or “wilcoxon” (MWU).

corr_method
Multiple testing correction. Currently only:

“benjamini-hochberg” (BH FDR)

use_abs
If True, ranks by abs(scores) (largest magnitude changes first).
If False, ranks by scores descending (positive effects first).

n_genes
Number of top genes stored per group.

key_added
Key in adata.uns where results are stored.

Output#

Results are stored in:

adata.uns[key_added]

with:

params: dict of the parameters used
names[group]: array of top gene names
scores[group]: array of test statistics (t or z-like)
logfoldchanges[group]: array of log2 fold-change-like values
pvals[group]: array of raw p-values
pvals_adj[group]: array of BH-adjusted p-values
mean_group[group]: mean expression in the target group (bulk-friendly extra)
mean_ref[group]: mean expression in the reference (bulk-friendly extra)

This layout is intentionally similar to Scanpy so downstream plotting/helpers can consume it easily.

Interpretation of logfoldchanges#

logfoldchanges is computed as:

(mean_group - mean_ref) / ln(2)

This is exact log2 fold change only if the input layer is on a natural-log scale. If you use layer=”log1p_cpm” (default), it behaves like a log-scale mean difference expressed in log2 units, which is often still useful for ranking/visualization.

If you need strict log2FC from counts, compute it from raw means on a linear scale (or use your DE routine that models counts).

Raises#

ImportError if scipy is not available (scipy.stats is required).
KeyError if groupby, requested groups, or reference are not found.
ValueError if an unsupported method or corr_method is provided.

Groups or references with 0 samples are skipped with a warning.

Examples#

Rank genes for all groups vs rest

bk.tl.rank_genes_groups(
    adata,
    groupby="condition",
    reference="rest",
    layer="log1p_cpm",
    method="t-test",
    n_genes=50,
)

Compare each group vs an explicit reference group

bk.tl.rank_genes_groups(
    adata,
    groupby="treatment",
    groups=["drugA", "drugB"],
    reference="vehicle",
    method="wilcoxon",
    n_genes=100,
)

Access results

rg = adata.uns["rank_genes_groups"]
rg["names"]["drugA"][:10]
rg["pvals_adj"]["drugA"][:10]

Rank genes groups#

Statistics#

Output#

What it does#

Statistics#

Parameters#

Output#

Interpretation of logfoldchanges#

Raises#

Examples#

See also#