Score cell cycle genes#

bullkpy.tl.score_genes_cell_cycle(adata, *, s_genes, g2m_genes, layer=None, gene_pool=None, ctrl_size=50, n_bins=25, random_state=0, scale=False, s_score='S_score', g2m_score='G2M_score', phase='phase')

Score cell-cycle phases, similar to scanpy.tl.score_genes_cell_cycle.

Writes to:

adata.obs[s_score] (float)
adata.obs[g2m_score] (float)
adata.obs[phase] (category: “S”, “G2M”, “G1”)

Phase calling rule (Scanpy-like):

if S_score > G2M_score and S_score > 0 -> “S”
elif G2M_score > S_score and G2M_score > 0 -> “G2M”
else -> “G1”

Compute cell-cycle scores and phase assignment (G1 / S / G2M) using a Scanpy-compatible strategy, adapted to bulk RNA-seq (or pseudobulk) data.

This function is a thin, opinionated wrapper around score_genes that reproduces the behavior of scanpy.tl.score_genes_cell_cycle, while remaining robust for non-single-cell use cases.

What it does#

Computes two gene signature scores per sample, using matched control genes (expression-bin matched):
• S phase score
• G2/M phase score
Assigns a discrete cell-cycle phase per sample using Scanpy’s rule:

Condition	Phase
S_score > G2M_score and S_score > 0	S
G2M_score > S_score and G2M_score > 0	G2M
otherwise	G1

Stores results directly in adata.obs.
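The phase rule in the table above can be written out as a small function. This is a minimal sketch for illustration only: call_phase and the example scores are hypothetical names and values, not part of bullkpy.

```python
def call_phase(s_score: float, g2m_score: float) -> str:
    """Return "S", "G2M", or "G1" following the rule in the table above."""
    if s_score > g2m_score and s_score > 0:
        return "S"
    if g2m_score > s_score and g2m_score > 0:
        return "G2M"
    return "G1"  # both scores non-positive, or an exact tie


# Hypothetical (S_score, G2M_score) pairs for four samples.
scores = [(0.4, 0.1), (0.2, 0.7), (-0.3, -0.1), (0.5, 0.5)]
phases = [call_phase(s, g2m) for s, g2m in scores]
print(phases)  # ['S', 'G2M', 'G1', 'G1']
```

Note that under the rule as stated, a sample whose two scores are exactly equal falls through to “G1”, even when both scores are positive.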

Output#

The function writes three columns to adata.obs:

Column	Type	Description
s_score	float	S-phase signature score
g2m_score	float	G2/M-phase signature score
phase	categorical	“G1”, “S”, “G2M” (ordered)

Default column names:

S_score, G2M_score, phase

Parameters#

Required#

adata
AnnData object containing expression data.

s_genes
Gene symbols defining the S-phase signature.

g2m_genes
Gene symbols defining the G2/M-phase signature.

Optional (scoring behavior)#

layer (default: None)
Expression layer to use.
• None → adata.X
• recommended: “log1p_cpm”

gene_pool (default: all genes)
Background gene pool used to sample control genes.
Signature genes are automatically excluded.

ctrl_size (default: 50)
Number of control genes sampled per signature gene.

n_bins (default: 25)
Number of expression bins used to match control genes.

random_state (default: 0)
Random seed for reproducible control selection.
(Internally, the G2M score uses random_state + 1.)

scale (default: False)
If True, z-score genes before computing scores.
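
The roles of ctrl_size, n_bins, and random_state can be illustrated with a minimal sketch of Scanpy-style matched-control scoring. It is not bullkpy's actual implementation: signature_score and the toy gene names are hypothetical, bins are built from the rank of each gene's mean expression (an assumption), and plain Python lists stand in for an expression layer.

```python
import random
from statistics import mean


def signature_score(expr, genes, signature, ctrl_size=50, n_bins=25, seed=0):
    """Score each sample as mean(signature genes) - mean(matched controls).

    expr: list of samples, each a list of values aligned with `genes`.
    """
    rng = random.Random(seed)
    sig = set(signature)
    sig_idx = [i for i, g in enumerate(genes) if g in sig]
    if not sig_idx:
        raise ValueError("no signature genes found in `genes`")

    # Rank genes by mean expression and cut the ranking into n_bins bins.
    gene_means = [mean(sample[i] for sample in expr) for i in range(len(genes))]
    order = sorted(range(len(genes)), key=gene_means.__getitem__)
    bin_of = [0] * len(genes)
    for rank, i in enumerate(order):
        bin_of[i] = rank * n_bins // len(genes)

    # For each signature gene, draw up to ctrl_size non-signature genes
    # from the same expression bin.
    ctrl = set()
    for i in sig_idx:
        candidates = [
            j for j in range(len(genes))
            if bin_of[j] == bin_of[i] and genes[j] not in sig
        ]
        ctrl.update(rng.sample(candidates, min(ctrl_size, len(candidates))))
    if not ctrl:
        raise ValueError("no control genes available in the matching bins")
    ctrl_idx = sorted(ctrl)

    return [
        mean(s[i] for i in sig_idx) - mean(s[i] for i in ctrl_idx)
        for s in expr
    ]


# Toy data: two low-expressed S genes, two high-expressed G2M genes,
# and four background genes (all names hypothetical).
genes = ["S1", "S2", "M1", "M2", "B1", "B2", "B3", "B4"]
expr = [
    [4.0, 4.0, 6.0, 6.0, 2.0, 2.0, 6.0, 6.0],  # S signature elevated
    [2.0, 2.0, 8.0, 8.0, 2.0, 2.0, 6.0, 6.0],  # G2M signature elevated
    [2.0, 2.0, 6.0, 6.0, 2.0, 2.0, 6.0, 6.0],  # neither elevated
]
s_scores = signature_score(expr, genes, ["S1", "S2"], n_bins=2, seed=0)
# Mirrors the documented convention of seeding G2M with random_state + 1.
g2m_scores = signature_score(expr, genes, ["M1", "M2"], n_bins=2, seed=1)
print(s_scores, g2m_scores)
```

In this toy case each bin holds fewer candidates than ctrl_size, so every candidate is used and the seed has no effect. With a realistic gene pool the controls are a random subset, which is why the seed matters for reproducibility.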

Optional (output names)#

s_score (default: “S_score”)
Column name for S-phase score.

g2m_score (default: “G2M_score”)
Column name for G2/M score.

phase (default: “phase”)
Column name for phase assignment.

Returns#

None.

All results are written in-place to adata.obs.

Example#

Standard cell-cycle scoring

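import bullkpy as bk  # alias used in the examples below

# S_GENES and G2M_GENES are lists of gene symbols matching adata.var_names
bk.tl.score_genes_cell_cycle(
    adata,
    s_genes=S_GENES,
    g2m_genes=G2M_GENES,
    layer="log1p_cpm",
)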

Creates:

adata.obs["S_score"]
adata.obs["G2M_score"]
adata.obs["phase"]

Custom output column names

bk.tl.score_genes_cell_cycle(
    adata,
    s_genes=S_GENES,
    g2m_genes=G2M_GENES,
    s_score="cc_S",
    g2m_score="cc_G2M",
    phase="cell_cycle",
)

Z-scored signatures with restricted gene pool

bk.tl.score_genes_cell_cycle(
    adata,
    s_genes=S_GENES,
    g2m_genes=G2M_GENES,
    gene_pool=expressed_genes,
    scale=True,
)

Notes & best practices#

Normalization matters
Use normalized, log-transformed data (e.g. log1p CPM or TPM).
Raw counts are not recommended.
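
As an illustration of the recommended transform, the sketch below converts one sample's raw counts to log1p CPM. It assumes that “log1p_cpm” means the natural-log log1p of counts per million. The helper name log1p_cpm and the counts are hypothetical, and bullkpy may compute this layer differently.

```python
import math


def log1p_cpm(counts):
    """Convert one sample's raw counts to log1p(counts per million)."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has zero total counts")
    return [math.log1p(c / total * 1e6) for c in counts]


raw = [0, 250_000, 750_000]  # hypothetical raw counts for three genes
values = log1p_cpm(raw)
print(values)
```

Scaling by library size before the log makes samples with different sequencing depths comparable. This matters because the control genes are matched on expression level.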

Bulk vs single-cell
Although the scoring method was originally designed for single-cell data, this implementation also works for bulk and pseudobulk data, where phases should be interpreted as dominant cell-cycle programs rather than discrete cell states.

Mostly G1 warning
If >95% of samples are labeled “G1”, a warning is emitted.
Common causes:
• gene symbols not matching adata.var_names
• inappropriate layer (e.g. raw counts)
• very low proliferation signal
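
The first cause in the list above is easy to check before scoring. The sketch below shows one way to do it, together with a version of the >95% check. The helper names, the warning text, and the example symbols are hypothetical; only the 95% threshold comes from the text above.

```python
import warnings


def signature_overlap(signature, var_names):
    """Split signature genes into found / missing, plus case-only mismatches."""
    present = set(var_names)
    by_upper = {name.upper(): name for name in var_names}
    found = [g for g in signature if g in present]
    missing = [g for g in signature if g not in present]
    # Missing symbols that would match if letter case were ignored.
    case_only = {g: by_upper[g.upper()] for g in missing if g.upper() in by_upper}
    return found, missing, case_only


def mostly_g1(phases, threshold=0.95):
    """Warn and return True if more than `threshold` of samples are "G1"."""
    frac = sum(p == "G1" for p in phases) / len(phases)
    if frac > threshold:
        warnings.warn(f"{frac:.1%} of samples labeled G1; check symbols and layer")
        return True
    return False


# Hypothetical case: human-style symbols against mouse-style var_names.
var_names = ["Mcm2", "Pcna", "Actb"]
found, missing, case_only = signature_overlap(["MCM2", "PCNA", "TYMS"], var_names)
print(found, missing, case_only)

flagged = mostly_g1(["G1"] * 39 + ["S"])  # 97.5% G1
```

An empty found list with a non-empty case_only mapping points to a symbol-convention mismatch rather than a genuinely low proliferation signal.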

Interpretation
Cell-cycle phase here is a relative classification useful for QC, covariate adjustment, or exploratory analysis, not a definitive cell state.