Clustering#

bullkpy.tl.cluster(adata, *, method='leiden', key_added='clusters', resolution=1.0, n_clusters=8, use_rep='X_pca', n_pcs=20, random_state=0)[source]#

Cluster samples.

Leiden: uses adata.obsp[‘connectivities’] (requires igraph + leidenalg).
KMeans fallback: clusters in PCA space (no extra deps).

Cluster samples stored in an AnnData object using either a graph-based Leiden algorithm (recommended) or a K-means fallback in PCA space.

This function mirrors Scanpy’s clustering workflow while remaining explicit, bulk-friendly, and dependency-aware.

Overview#

cluster assigns each sample (adata.obs) to a discrete cluster and stores the result as a categorical column in adata.obs.

Two methods are supported:

Leiden (default, recommended)
- Graph-based clustering
- Operates on adata.obsp[“connectivities”]
- Produces resolution-controlled communities
K-means (fallback)
- Runs directly in PCA space
- Requires no extra dependencies
- Useful when graph-based clustering is unavailable

Requirements#

For method=”leiden”

You must have:

A neighbors graph:

bk.tl.neighbors(adata)

This creates adata.obsp[“connectivities”].

Optional but recommended:

PCA via bk.tl.pca(adata).

Python packages:

igraph
leidenalg.

If any of these are missing, an informative error is raised.

For method=”kmeans”

You must have:

A representation in adata.obsm[use_rep] (default: “X_pca”).

No additional dependencies are required.

Parameters#

Common parameters#

method
Clustering algorithm to use:

“leiden” (default)
“kmeans”.

key_added
Name of the column written to adata.obs.
The column is stored as a categorical variable.
Default: “clusters”.

random_state
Random seed for reproducibility.
Default: 0.

Leiden-specific parameters#

resolution.
Controls cluster granularity:

Higher → more clusters.
Lower → fewer clusters. Typical range: 0.2–2.0. Default: 1.0.

K-means–specific parameters#

n_clusters Number of clusters (k).
Default: 8.

use_rep
Representation in adata.obsm used for clustering.
Default: “X_pca”.

n_pcs
Number of dimensions to use from use_rep.
Default: 20.

Method details#

Leiden clustering

Converts the connectivities graph into an undirected weighted igraph object
Runs Leiden using RBConfigurationVertexPartition
Cluster labels are integers, stored as strings for categorical safety
Results are deterministic given random_state.

This is the preferred method for most analyses.

K-means clustering

Runs standard K-means in PCA space
Uses only the first n_pcs dimensions
Faster and dependency-free, but:
Ignores graph structure
Less robust for complex manifolds

Stored results#

In adata.obs.

adata.obs[key_added]
- Categorical cluster labels
- Labels are strings (“0”, “1”, …).

In adata.uns[“clusters”]

Each clustering run is recorded with its parameters:

Leiden example

adata.uns["clusters"]["clusters"] = {
    "method": "leiden",
    "resolution": 1.0,
    "random_state": 0,
}

K-means example

adata.uns["clusters"]["clusters"] = {
    "method": "kmeans",
    "n_clusters": 8,
    "random_state": 0,
    "use_rep": "X_pca",
    "n_pcs": 20,
}

Typical workflow#

# 1. Dimensionality reduction
bk.tl.pca(adata)

# 2. Build neighbor graph
bk.tl.neighbors(adata)

# 3. Graph-based clustering
bk.tl.cluster(
    adata,
    method="leiden",
    resolution=1.0,
    key_added="leiden",
)

Fallback without graph:

bk.tl.cluster(
    adata,
    method="kmeans",
    n_clusters=10,
    key_added="kmeans",
)

Notes and caveats#

Leiden clustering requires a neighbors graph.
Re-running cluster with the same key_added will overwrite the column.
Cluster labels are stored as categorical strings, not integers.
For resolution tuning, see:
- bk.tl.leiden_resolution_scan
- bk.pl.ari_resolution_heatmap