PCA loadings#
- bullkpy.tl.pca_loadings(adata, *, key='pca', loadings_key='PCs', pcs=None, n_top=50, use_abs=False, include_negative=True, dropna=True, gene_col='gene', store_key='pca_loadings', export=None, export_sep='\t', export_gmt=None, min_abs_loading=None)[source]#
Rank genes by PCA loadings per PC and (optionally) export for enrichment.
- Assumes PCA results like:
adata.varm[loadings_key] == array (n_vars x n_comps)
adata.uns[key] contains PCA params (optional)
- Parameters:
pcs – PCs to process, 1-based indexing (e.g. [1,2,3]). If None -> all available.
n_top – Number of top genes to return per PC (and per sign if include_negative=True and use_abs=False).
use_abs – If True: rank by absolute loading (single list per PC). If False: returns positive top list (and negative list if include_negative=True).
include_negative – Only used when use_abs=False. If True returns both pos/neg lists.
export – If provided, writes a long-format table (PC, sign, gene, loading, rank). Also writes a wide-format file alongside: “<stem>.wide<suffix>”.
export_gmt –
- If provided, writes GMT gene sets:
PC1_pos, PC1_neg, PC2_pos, …
If use_abs=True: PC1_abs, PC2_abs, …
min_abs_loading – Optional filter: only keep genes with abs(loading) >= threshold.
- Returns:
dict mapping keys like “PC1_pos”, “PC1_neg”, “PC1_abs” -> DataFrame.
Also stores results in adata.uns[store_key].
Rank genes by PCA loadings for one or more principal components (PCs), and optionally export the results for downstream enrichment analysis (e.g., GSEA/Enrichr).
This is a bulk-friendly utility that assumes PCA has already been computed with bk.tl.pca, producing:
adata.varm["PCs"](or your chosenloadings_key) with shape(n_genes, n_comps)(optionally)
adata.uns["pca"](or your chosenkey) containing PCA metadata
The function returns per-PC gene rankings (positive / negative or absolute), stores them in adata.uns, and can export them as TSV/CSV and/or GMT.
What it does#
For each requested PC:
Extract the loading vector (one value per gene).
Optionally drop NaNs and/or filter by min_abs_loading.
Rank genes by:
absolute loading (use_abs=True) → one list per PC, or
signed loading (use_abs=False) → top positive list, plus optional top negative list
Store results in adata.uns[store_key].
Optionally export:
a long table (PC, sign, rank, gene, loading)
a wide table (one column per PC/sign, rows are genes)
a GMT file (gene sets per PC/sign).
Parameters#
Inputs / keys#
adata
AnnData with PCA loadings stored in adata.varm[loadings_key].
key
PCA metadata key in adata.uns (not strictly required, mostly used for provenance strings).
Default: “pca”.
loadings_key
Where PCA loadings are stored in adata.varm.
Default: “PCs”.
PC selection and ranking behavior#
pcs
PCs to process using 1-based indexing (e.g. [1, 2, 3]).
If None, processes all PCs available in the loadings matrix.
n_top
Number of genes to return per PC.
If use_abs=True: returns n_top genes per PC.
If use_abs=False and include_negative=True: returns up to n_top positive + up to n_top negative per PC.
use_abs
If True, rank by abs(loading) and return a single list per PC.
Output keys look like: PC1_abs, PC2_abs, ….
include_negative
Only used when use_abs=False.
If True, returns both positive and negative gene lists per PC:
PC1_pos = highest positive loadings
PC1_neg = most negative loadings
Filtering / formatting#
dropna
If True, drop genes with NaN loadings before ranking.
gene_col
Column name used for the gene identifier in output tables.
Default: “gene”.
min_abs_loading
Optional threshold: keep only genes with abs(loading) >= min_abs_loading.
Useful to avoid exporting near-zero loadings.
Storage and export#
store_key
Where results are stored in adata.uns.
Default: “pca_loadings”.
export
If provided, writes:
a long-format table to this path, and
a wide-format table next to it named: “
.wide ”
export_sep
Delimiter for export. Either “\t” or “,”.
Default: “\t”.
export_gmt
If provided, writes a GMT file where each PC/sign is a gene set:
If use_abs=False: PC1_pos, PC1_neg, …
If use_abs=True: PC1_abs, PC2_abs, …
Returns#
A dictionary mapping set keys → DataFrames.
Keys are one of:
Signed mode (use_abs=False):
“PC1_pos”, “PC1_neg”, “PC2_pos”, …
Absolute mode (use_abs=True):
“PC1_abs”, “PC2_abs”, ….
Each DataFrame contains:
pc (e.g., “PC1”)
sign (“pos”, “neg”, or “abs”)
rank (1..N)
<gene_col> (gene name)
loading (raw loading value)
Stored in adata.uns#
Results are stored under:
adata.uns[store_key]
Structure:
params: run parameters (pcs, n_top, use_abs, include_negative, etc.)
results: dict of per-set tables (same as return value)
table: concatenated long-format table across all PCs/signs.
Exports#
Long table (export).
Columns: pc, sign, rank, gene, loading.
This is convenient for:
auditing loadings
making plots
joining with annotations
Wide table (
Each column is a PC/sign set and rows are ranked gene names:
PC1_pos, PC1_neg, …
Useful for quick inspection or manual copy/paste.
GMT (export_gmt).
Each line is a gene set:
set_name<TAB>description<TAB>gene1<TAB>gene2...
Description is set to “{key}:{loadings_key}” (e.g. “pca:PCs”).
Examples#
Get top positive/negative genes for first 3 PCs
out = bk.tl.pca_loadings(
adata,
pcs=[1, 2, 3],
n_top=50,
use_abs=False,
include_negative=True,
)
Rank by absolute loadings (single list per PC)
out = bk.tl.pca_loadings(
adata,
pcs=[1, 2, 3],
use_abs=True,
n_top=100,
)
Filter weak loadings and export TSV + GMT
bk.tl.pca_loadings(
adata,
pcs=[1, 2, 3],
n_top=100,
include_negative=True,
min_abs_loading=0.02,
export="results/pca_loadings.tsv",
export_gmt="results/pca_loadings.gmt",
)
Access the stored long table
adata.uns["pca_loadings"]["table"].head()
Notes / tips#
PC indexing is 1-based (pcs=[1,2] means PC1 and PC2).
If you ran PCA using only HVGs, non-used genes may have NaN loadings in adata.varm[“PCs”].
Keep dropna=True (default) to avoid empty/garbage ranks.
For enrichment, signed sets (PC1_pos vs PC1_neg) are often more interpretable than absolute sets.