Principal Component Analysis#

bullkpy.tl.pca(adata, *, layer='log1p_cpm', n_comps=50, center=True, scale=False, use_highly_variable=False, key_added='pca')[source]#

Principal Component Analysis (PCA) for bulk-friendly AnnData objects.

This function performs a classical PCA using SVD on an expression matrix (adata.X or a selected layer), stores sample scores in adata.obsm, gene loadings in adata.varm, and PCA metadata in adata.uns.

It is intentionally simple, transparent, and Scanpy-compatible in terms of where results are stored, while remaining suitable for bulk RNA-seq data.

What it does#

Selects an expression matrix:

adata.layers[layer] if layer is not None
otherwise adata.X.

Optionally restricts genes to those marked as adata.var[“highly_variable”].
Optionally centers and/or scales genes.
Computes PCA using SVD:
[
X_c = U S V^\top
]
Stores:

Scores (samples × PCs) in adata.obsm[“X_pca”]
Loadings (genes × PCs) in adata.varm[“PCs”]
Explained variance and parameters in adata.uns[key_added]

Parameters#

adata
AnnData object containing expression data.

layer
Which matrix to use for PCA.

Default: “log1p_cpm”
If None, uses adata.X.

n_comps
Number of principal components to compute.
The effective number is capped at:

min(n_comps, n_obs - 1, n_genes_used)

center
If True (default), center each gene by subtracting its mean.

scale
If True, scale each gene to unit variance after centering (z-scoring per gene).

use_highly_variable
If True, restrict PCA to genes where adata.var[“highly_variable”] == True.

Notes:

If the column does not exist, all genes are used with a warning.
If it exists but has zero True values, an error is raised.

key_added
Key under adata.uns where PCA metadata is stored.
Default: “pca”.

Output#

Sample scores

Stored in:

adata.obsm["X_pca"]

Shape: (n_samples, n_comps)

These are the PCA coordinates of samples, suitable for:

QC plots
clustering
downstream embeddings (UMAP, t-SNE, etc.)

Gene loadings

Stored in:

adata.varm["PCs"]

Shape: (n_genes, n_comps)

Notes:

If use_highly_variable=True, genes not used in PCA receive NaN loadings.
This keeps alignment with the full gene set, as expected by AnnData.

PCA metadata

Stored in:

adata.uns[key_added]

Contains:

params
- layer
- n_comps
- center
- scale
- use_highly_variable
- n_vars_used.
variance: Eigenvalues (explained variance) for each PC.
variance_ratio: Fraction of total variance explained by each PC.
used_genes_mask: Boolean mask of genes actually used in PCA (length = n_vars).
mean: Per-gene mean used for centering (only for used genes).
std: Per-gene standard deviation used for scaling (or None if scale=False).

Notes and best practices#

Recommended input
Use normalized, approximately Gaussian data such as:

log1p_cpm
log1p_tpm
other log-transformed normalized expression

Scaling

scale=False (default) is often appropriate for bulk RNA-seq.
scale=True may be useful when genes have very different variances.

Highly variable genes

Optional but recommended for large gene sets.
Works well after running pp.highly_variable_genes.

Sparse matrices

Sparse inputs are densified internally.
For very large datasets, ensure memory availability.

Examples#

Basic PCA on log1p CPM

bk.tl.pca(adata)

PCA using only highly variable genes

bk.pp.highly_variable_genes(adata)
bk.tl.pca(adata, use_highly_variable=True)

PCA with scaling

bk.tl.pca(
    adata,
    layer="log1p_cpm",
    n_comps=30,
    center=True,
    scale=True,
)

Access results

adata.obsm["X_pca"]              # sample scores
adata.varm["PCs"]                # gene loadings
adata.uns["pca"]["variance_ratio"]

Principal Component Analysis#

What it does#

Parameters#

Output#

Notes and best practices#

Examples#

See also#