Rank genes categorical#

Rank genes associated with a categorical sample annotation.

This function performs a bulk-friendly, Scanpy-like gene ranking for categorical variables stored in adata.obs (e.g. subtype, response group, mutation status).

It supports two-group and multi-group comparisons and reports both statistical significance and effect size.

What it does#

Given a categorical variable in adata.obs, this function:

Splits samples into:
  • a target group
  • a reference group (either another category or “rest”).
Tests each gene for association with group membership.
Computes:
  • p-value
  • FDR-corrected q-value
  • effect size
  • group and reference means
  • log2 fold-change.
Returns a ranked DataFrame, optionally storing results in adata.uns.
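
The splitting step can be pictured with a small, library-free sketch. The helper name `split_samples` is hypothetical and is not part of the package; it only illustrates how a target group and a “rest” or named reference could be derived from a list of labels.

```python
def split_samples(labels, group, reference="rest"):
    """Return (target_idx, reference_idx) as lists of sample positions.

    Hypothetical helper for illustration only; the real function's
    internals are not documented here.
    """
    target = [i for i, lab in enumerate(labels) if lab == group]
    if reference == "rest":
        # "rest" means every sample that is not in the target group.
        ref = [i for i, lab in enumerate(labels) if lab != group]
    else:
        # Otherwise only samples from the named reference category.
        ref = [i for i, lab in enumerate(labels) if lab == reference]
    return target, ref


labels = ["Basal", "LumA", "Basal", "LumB", "LumA"]
target_idx, rest_idx = split_samples(labels, "Basal")
_, luma_idx = split_samples(labels, "Basal", reference="LumA")
```

With a named reference, samples from any third category (here “LumB”) are simply left out of the comparison.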

The behavior closely mirrors scanpy.tl.rank_genes_groups, but is designed for bulk RNA-seq.

Supported comparisons#

Two-group comparison#

Used when:
• group is provided, or
• group is None and groupby has exactly 2 categories.

Supported methods:

Method	Test	Effect size
“mwu” (default)	Mann–Whitney U	Rank-biserial correlation
“ttest”	Welch’s t-test	Cohen’s d (approx.)
“kruskal”	Kruskal–Wallis	η² (rough)
“anova”	One-way ANOVA	η² (rough)
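
To make the two-group effect sizes concrete, here is a minimal standard-library sketch of the textbook definitions of the rank-biserial correlation and Cohen's d (pooled standard deviation). The function names are hypothetical, and the exact formulas and sign conventions used by the package are not stated on this page, so treat this as an illustration of the quantities rather than the implementation.

```python
import math
import statistics


def rank_biserial(x, y):
    """Rank-biserial correlation r = 2U / (n1 * n2) - 1.

    U counts pairs where a value in x exceeds a value in y (ties count 0.5).
    """
    u = 0.0
    for a in x:
        for b in y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return 2.0 * u / (len(x) * len(y)) - 1.0


def cohens_d(x, y):
    """Cohen's d using the pooled sample standard deviation."""
    n1, n2 = len(x), len(y)
    pooled_var = ((n1 - 1) * statistics.variance(x)
                  + (n2 - 1) * statistics.variance(y)) / (n1 + n2 - 2)
    return (statistics.mean(x) - statistics.mean(y)) / math.sqrt(pooled_var)


r = rank_biserial([5.1, 6.3, 7.0, 6.8], [4.0, 4.9, 5.5])   # 11 of 12 pairs favour x
d = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])             # mean difference 1, pooled SD 2
```

Under this convention, a positive value means the target group tends to have higher expression than the reference; swapping the arguments flips the sign.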

Multi-group comparison#

Used when groupby has more than 2 categories and a global test across all categories is requested:

Method	Test	Effect size
“kruskal”	Kruskal–Wallis	η² (rough)
“anova”	One-way ANOVA	η² (rough)
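
As a rough guide to what η² measures, the sketch below computes the classical ANOVA version: the share of total variance explained by group membership (between-group sum of squares divided by total sum of squares). The page describes the reported η² only as “rough” and does not give its formula, so this is an assumption-laden illustration, and `eta_squared` is a hypothetical name.

```python
def eta_squared(groups):
    """Classical eta squared: SS_between / SS_total for a list of groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    # A gene with no variation at all carries no group signal.
    return ss_between / ss_total if ss_total > 0 else 0.0


# Three well-separated groups: most of the variance lies between groups.
eta = eta_squared([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])
```

Values near 1 indicate that group membership accounts for most of the variation in a gene; values near 0 indicate that it accounts for almost none.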

Returned columns#

The returned DataFrame contains:

Column: Description

gene: Gene name.
pval: Raw p-value.
qval: Benjamini–Hochberg FDR.
effect_size: Method-dependent effect size.
mean_group: Mean expression in target group.
mean_ref: Mean expression in reference group.
log2FC: log2(mean_group / mean_ref).
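
A short worked example of the log2FC column, following the formula log2(mean_group / mean_ref) given above. The `pseudocount` argument is hypothetical: this page does not say whether the function guards against a zero reference mean, or whether it back-transforms log-scale layers before taking the ratio.

```python
import math


def log2_fold_change(mean_group, mean_ref, pseudocount=0.0):
    """log2(mean_group / mean_ref); pseudocount is an illustrative assumption."""
    return math.log2((mean_group + pseudocount) / (mean_ref + pseudocount))


up = log2_fold_change(8.0, 2.0)     # fourfold higher in the target group
down = log2_fold_change(1.0, 4.0)   # fourfold lower in the target group
```

Positive values mean higher mean expression in the target group, negative values mean higher mean expression in the reference.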

Results are sorted by qval, then pval.
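
The q-value and sort order can be illustrated with a standard-library sketch of the Benjamini–Hochberg step-up procedure followed by a (qval, pval) sort. This is the textbook procedure, not a copy of the package's code, and the gene names and p-values are made up for the example.

```python
def benjamini_hochberg(pvals):
    """Return BH-adjusted q-values in the same order as the input p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    qvals = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        qvals[i] = running_min
    return qvals


genes = ["A", "B", "C", "D"]          # made-up gene names
pvals = [0.01, 0.04, 0.03, 0.20]      # made-up p-values
qvals = benjamini_hochberg(pvals)

# Sort by q-value first, then by raw p-value to break ties.
ranked = sorted(zip(genes, pvals, qvals), key=lambda row: (row[2], row[1]))
```

Genes B and C end up with the same q-value here, which is why the secondary sort on pval matters: C (p = 0.03) is ranked ahead of B (p = 0.04).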

Parameters#

Group definition#

groupby
Column in adata.obs defining categories.

group
Target category.
If None and groupby has exactly 2 categories, the first category is used as the target.

reference
Reference group:
• “rest” (default): all other samples
• or a specific category name
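
The defaulting rules for group and reference can be summarised in a small sketch. `resolve_comparison` is a hypothetical helper. It takes the category order as given, because this page does not specify what “first” means (category order or order of appearance), and the validation errors and the `None` return for the multi-group case are assumptions.

```python
def resolve_comparison(categories, group=None, reference="rest"):
    """Return (group, reference), or None when no single target applies."""
    categories = list(categories)
    if group is None:
        if len(categories) != 2:
            # No target can be inferred: treat as a multi-group comparison.
            return None
        group = categories[0]  # assumption: "first" means list order
    if group not in categories:
        raise ValueError(f"unknown group: {group!r}")
    if reference != "rest" and reference not in categories:
        raise ValueError(f"unknown reference: {reference!r}")
    if reference == group:
        raise ValueError("group and reference must differ")
    return group, reference


binary = resolve_comparison(["Responder", "Non-responder"])
multi = resolve_comparison(["A", "B", "C"])
explicit = resolve_comparison(["A", "B", "C"], group="B", reference="C")
```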

Expression source#

layer
Expression layer to use (e.g. “log1p_cpm”).
If None, uses adata.X.

genes
Optional list of genes to test.
Default: all genes in adata.var_names.

Statistical method#

method
One of: “mwu”, “ttest”, “kruskal”, “anova”.

Storing results#

store_key
If provided, results are stored in:

adata.uns["assoc"][store_key]

along with metadata describing the test.
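
The storage layout can be pictured with plain dictionaries standing in for adata.uns. The "results" key follows the access pattern shown in the examples on this page; the "params" key and the `store_results` helper are hypothetical, since the exact metadata fields are not listed here.

```python
def store_results(uns, store_key, results, **params):
    """Store results and test metadata under uns["assoc"][store_key]."""
    entry = {
        "results": results,        # the ranked table
        "params": dict(params),    # hypothetical metadata container
    }
    # Create the "assoc" namespace on first use, then add the entry.
    uns.setdefault("assoc", {})[store_key] = entry
    return entry


uns = {}  # stand-in for adata.uns
store_results(
    uns,
    "assoc:Project_ID:ACC_vs_rest",
    [{"gene": "TP53", "pval": 0.01}],   # made-up result row
    groupby="Project_ID",
    group="ACC",
    reference="rest",
    method="mwu",
)
stored = uns["assoc"]["assoc:Project_ID:ACC_vs_rest"]
```

Keeping every comparison under a single "assoc" namespace lets several store_key entries coexist without overwriting each other.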

Examples#

Binary comparison (default MWU)#

res = bk.tl.rank_genes_categorical(
    adata,
    groupby="Subtype",
    group="Basal",
    reference="rest",
    layer="log1p_cpm",
)

Two-group comparison with explicit reference#

res = bk.tl.rank_genes_categorical(
    adata,
    groupby="Response",
    group="Responder",
    reference="Non-responder",
    method="ttest",
)

Restrict to selected genes#

res = bk.tl.rank_genes_categorical(
    adata,
    groupby="RB1_mut",
    group="1",
    genes=["TP53", "CDKN2A", "E2F1"],
)

Store results in AnnData#

bk.tl.rank_genes_categorical(
    adata,
    groupby="Project_ID",
    group="ACC",
    store_key="assoc:Project_ID:ACC_vs_rest",
)

Later access:

adata.uns["assoc"]["assoc:Project_ID:ACC_vs_rest"]["results"]

Notes#

• MWU is recommended for robust, non-parametric testing in bulk RNA-seq.
• Effect sizes are always reported and should be interpreted alongside p- and q-values.
• If a group has fewer than 2 samples (n < 2), the function raises an error.
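
Because small groups raise an error, it can help to check group sizes before running a comparison. The sketch below is a hypothetical pre-flight check using only the standard library; the exception type raised by the package itself is not documented here, so `ValueError` is an assumption.

```python
from collections import Counter


def check_group_sizes(labels, min_size=2):
    """Return per-category counts, raising if any category is too small."""
    counts = Counter(labels)
    too_small = {cat: n for cat, n in counts.items() if n < min_size}
    if too_small:
        raise ValueError(f"groups with fewer than {min_size} samples: {too_small}")
    return dict(counts)


sizes = check_group_sizes(["ACC", "ACC", "BRCA", "BRCA", "BRCA"])
```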