Sample distances#

bullkpy.pl.sample_distances(adata, *, layer='log1p_cpm', metric='euclidean', method='average', use='samples', col_colors=None, palette='tab20', z_score=False, figsize=None, show_labels=False, save=None, show=True)[source]#

Sample (or gene) distance clustergram.

Computes pairwise distances (pdist) on X (samples x genes).
Uses seaborn.clustermap with hierarchical clustering.
Optionally annotate samples with metadata columns via col_colors.

Notes

For sample QC, use metric=”correlation” (distance = 1-corr) often works well.
z_score=True will z-score genes across samples before distance computation.

Sample (or gene) distance clustergram using hierarchical clustering and a distance matrix computed from an expression matrix.

This is a QC-style visualization: it shows which samples (or genes) are most similar under a chosen distance metric, and clusters them with linkage-based hierarchical clustering. The plot is rendered with seaborn.clustermap, so you get dendrograms plus a heatmap of pairwise distances.

Example Sample distances plot

What it does.#

Fetches a matrix X via _get_matrix(adata, layer=layer, use=use):

use=”samples” → X is samples × genes (rows are samples/obs).
use=”genes” → X is genes × samples or otherwise arranged so that rows correspond to the chosen axis for distance computation (depends on _get_matrix implementation).

Optionally z-scores features across samples (z_score=True):

Z-scoring is applied column-wise: – X = (X - mean(feature)) / std(feature).
This makes distance more about patterns than absolute scale.

Computes pairwise distances using SciPy:

d = pdist(X, metric=metric) → condensed distance vector
D = squareform(d) → full N×N distance matrix.

Builds a labeled DataFrame dfD with:

labels = adata.obs_names if use=”samples” else adata.var_names
dfD is symmetric with zeros on the diagonal.

Performs hierarchical clustering on the condensed distances:
Z = linkage(d, method=method).
Plots with seaborn.clustermap:

Uses row_linkage=Z and col_linkage=Z (same clustering for both axes)
Heatmap colormap is fixed to “viridis” (distance scale)
Optional metadata annotations via col_colors when use=”samples”.

(Optional) Adds metadata legends to the right side of the heatmap when col_colors is provided.

Parameters#

Core data / distance#

adata (AnnData): Input object.

layer (str | None, default “log1p_cpm”): Which layer to use for distances.

Passed to _get_matrix.
If None, _get_matrix typically falls back to adata.X (implementation-dependent).

use (“samples” | “genes”, default “samples”):

“samples”: distance among samples (QC use-case).
“genes”: distance among genes (feature similarity / module exploration).

metric (str, default “euclidean”): Distance metric for scipy.spatial.distance.pdist.
Common QC choice: metric=”correlation” (distance = 1 − correlation).

method (str, default “average”): Linkage method for hierarchical clustering (scipy.cluster.hierarchy.linkage).
Common options: “average”, “complete”, “single”, “ward” (ward requires euclidean-like assumptions).

Metadata annotations (samples only)#

col_colors (Sequence[str] | None): List of adata.obs keys used to annotate columns/rows with colored strips.

Only applied when use=”samples”.
Uses _metadata_colors(adata, columns=col_colors, palette=palette) to map categories → colors.

palette (str, default “tab20”): Palette name used for categorical metadata mapping.

Scaling / display#

z_score (bool, default False): If True, z-score features across samples before computing distances.

**figsize ** ((w, h) | None): If None, auto-sized based on n items:
w = max(6.0, min(16.0, 0.18*n + 4.0)), h = w

show_labels (bool, default False): Show axis tick labels (sample names / gene names).
Recommended False for large n.

Output#

save (str | Path | None): If provided, saves using _savefig(cg.fig, save).

show (bool, default True): If True, calls plt.show().

Returns#

cg: seaborn.matrix.ClusterGrid

Access main heatmap axis via cg.ax_heatmap.
Figure via cg.fig.

Requirements / errors#

Requires seaborn (sns) or raises ImportError.
Requires SciPy components: pdist, squareform, and linkage, or raises ImportError.

Notes & best practices#

QC recommendation: try metric=”correlation” for expression-like matrices; it often clusters by expression profiles rather than magnitude.
When to use z_score=True:

Good when genes have very different scales and you care about relative patterns.
Less useful if the layer is already standardized or if absolute magnitude is meaningful.

Metadata annotations: col_colors=[“Subtype”, “Batch”] is a common QC setup to see whether clustering is driven by biology vs batch.

Examples#

Sample QC with correlation distance

bk.pl.sample_distances(
    adata,
    layer="log1p_cpm",
    metric="correlation",
    method="average",
    col_colors=["Subtype", "Batch"],
    show_labels=False,
)

Z-scored Euclidean distances (pattern-focused)

bk.pl.sample_distances(
    adata,
    layer="log1p_cpm",
    metric="euclidean",
    z_score=True,
    col_colors=["Patient"],
)

Gene-gene distance clustergram

bk.pl.sample_distances(
    adata,
    use="genes",
    metric="correlation",
    show_labels=True,
)