Rank genes categorical#
Rank genes associated with a categorical sample annotation.
This function performs a bulk-friendly, Scanpy-like gene ranking for categorical
variables stored in adata.obs (e.g. subtype, response group, mutation status).
It supports two-group and multi-group comparisons and reports both statistical significance and effect size.
What it does#
Given a categorical variable in adata.obs, this function:
Splits samples into:
• a target group • a reference group (either another category or “rest”).Tests each gene for association with group membership.
Computes:
• p-value • FDR-corrected q-value • effect size • group and reference means • log2 fold-change.Returns a ranked DataFrame, optionally storing results in adata.uns
The behavior closely mirrors scanpy.tl.rank_genes_groups, but is designed for bulk RNA-seq.
Supported comparisons#
Two-group comparison#
Used when: • group is provided, or • group is None and groupby has exactly 2 categories
Supported methods:
Method |
Test |
Effect size |
|---|---|---|
“mwu” (default) |
Mann–Whitney U |
Rank-biserial correlation |
“ttest” |
Welch’s t-test |
Cohen’s d (approx.) |
“kruskal” |
Kruskal–Wallis |
η² (rough) |
“anova” |
One-way ANOVA |
η² (rough) |
Multi-group comparison#
Used when groupby has >2 categories and a global test is requested:
Method |
Test |
Effect size |
|---|---|---|
“kruskal” |
Kruskal–Wallis |
η² (rough) |
“anova” |
One-way ANOVA |
η² (rough) |
Returned columns#
The returned DataFrame contains:
Column: Description
gene: Gene name.
pval: Raw p-value.
qval: Benjamini–Hochberg FDR.
effect_size: Method-dependent effect size.
mean_group: Mean expression in target group.
mean_ref: Mean expression in reference group.
log2FC: log2(mean_group / mean_ref).
Results are sorted by qval, then pval.
Parameters#
Group definition#
groupby
Column in adata.obs defining categories.
group
Target category.
If None and groupby has exactly 2 categories, the first is used.
reference
Reference group:
• “rest” (default): all other samples
• or a specific category name
####Expression source.
layer
Expression layer to use (e.g. “log1p_cpm”).
If None, uses adata.X.
genes
Optional list of genes to test.
Default: all genes in adata.var_names.
Statistical method#
method
One of: “mwu”, “ttest”, “kruskal”, “anova”.
Storing results#
store_key
If provided, results are stored in:
adata.uns["assoc"][store_key]
along with metadata describing the test.
Examples#
Binary comparison (default MWU)#
res = bk.tl.rank_genes_categorical(
adata,
groupby="Subtype",
group="Basal",
reference="rest",
layer="log1p_cpm",
)
Two-group comparison with explicit reference#
res = bk.tl.rank_genes_categorical(
adata,
groupby="Response",
group="Responder",
reference="Non-responder",
method="ttest",
)
Restrict to selected genes#
res = bk.tl.rank_genes_categorical(
adata,
groupby="RB1_mut",
group="1",
genes=["TP53", "CDKN2A", "E2F1"],
)
Store results in AnnData#
bk.tl.rank_genes_categorical(
adata,
groupby="Project_ID",
group="ACC",
store_key="assoc:Project_ID:ACC_vs_rest",
)
Later access:
adata.uns["assoc"]["assoc:Project_ID:ACC_vs_rest"]["results"]
Notes#
• MWU is recommended for robust, non-parametric testing in bulk RNA-seq.
• Effect sizes are always reported and should be used alongside p/q-values.
• For very small groups (n < 2), the function raises an error.
See also#
• tl.rank_genes_groups (Scanpy, single-cell)
• pl.volcano
• pl.rankplot