Highly variable genes#
bullkpy.pp.highly_variable_genes(adata, *, layer='log1p_cpm', n_top_genes=2000, n_bins=20, min_mean=0.0, max_mean=inf, min_disp=-inf, key_added='highly_variable')
Select highly variable genes (bulk-friendly version of HVGs).
This computes mean and variance across samples on the chosen layer and identifies genes with high dispersion relative to genes of similar mean (mean-binned z-score of log-dispersion, Scanpy/Seurat spirit).
Stores results in adata.var:
- mean, variance, dispersion, dispersion_norm
- boolean flag adata.var[key_added] (default: 'highly_variable')
Identify highly variable genes (HVGs) in bulk RNA-seq data.
Description#
highly_variable_genes selects genes whose expression varies strongly across samples, relative to genes of similar mean expression.
This function follows the Scanpy / Seurat spirit, but is adapted for bulk RNA-seq: variation is computed across samples rather than cells, and the function works directly on normalized expression layers.
The method:
1. Computes per-gene mean and variance across samples
2. Calculates dispersion (variance / mean)
3. Log-transforms dispersion for numerical stability
4. Bins genes by mean expression
5. Computes a z-score of dispersion within each mean bin
6. Selects the top-ranked genes by normalized dispersion
Results are stored in adata.var.
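The six steps above can be sketched with NumPy and pandas. This is a minimal illustration, not bullkpy's actual implementation: the function name hvg_sketch is hypothetical, and the natural log, the equal-width bins from pandas.cut, and the sample standard deviation within each bin are all assumptions. It also assumes a dense samples-by-genes array of non-negative values.

```python
import numpy as np
import pandas as pd


def hvg_sketch(X, n_top_genes=2000, n_bins=20):
    """Hypothetical sketch of mean-binned dispersion ranking (not the bullkpy code)."""
    X = np.asarray(X, dtype=float)  # samples x genes, dense

    # Step 1: per-gene mean and population variance across samples.
    mean = X.mean(axis=0)
    var = X.var(axis=0, ddof=0)

    # Steps 2-3: dispersion = variance / mean, then log (assumed natural log).
    # Genes with zero mean or zero dispersion are left as NaN.
    disp = np.full(mean.shape, np.nan)
    np.divide(var, mean, out=disp, where=mean > 0)
    log_disp = np.full(mean.shape, np.nan)
    np.log(disp, out=log_disp, where=disp > 0)

    # Step 4: equal-width bins on log1p(mean) (binning scheme is an assumption).
    bins = pd.cut(np.log1p(mean), bins=n_bins, labels=False)

    # Step 5: z-score of log-dispersion within each mean bin.
    df = pd.DataFrame({"log_disp": log_disp, "bin": bins})
    grouped = df.groupby("bin")["log_disp"]
    mu = grouped.transform("mean")
    sd = grouped.transform("std").replace(0.0, np.nan)  # avoid division by zero
    z = ((df["log_disp"] - mu) / sd).to_numpy()

    # Step 6: flag the top-ranked genes; genes without a finite score are never flagged.
    finite = np.isfinite(z)
    order = np.argsort(-np.where(finite, z, -np.inf), kind="stable")
    flag = np.zeros(mean.shape, dtype=bool)
    flag[order[:n_top_genes]] = True
    flag &= finite

    return pd.DataFrame(
        {
            "means": mean,
            "variances": var,
            "dispersions": disp,
            "dispersions_norm": z,
            "highly_variable": flag,
        }
    )


# Toy data: 200 genes with means from 1 to 10; three genes get a much larger spread.
n_genes = 200
gene_means = np.linspace(1.0, 10.0, n_genes)
spread = np.full(n_genes, 0.1)
spread[[50, 100, 150]] = 2.0
signs = np.tile([1.0, -1.0], 5)  # 10 samples alternating above and below the mean
X_toy = gene_means[None, :] + signs[:, None] * spread[None, :]

result = hvg_sketch(X_toy, n_top_genes=3, n_bins=5)
selected = sorted(result.index[result["highly_variable"]])
```

Because each gene is scored only against genes in the same mean bin, a bin holding a single gene yields no finite z-score in this sketch, so that gene is never selected. Choosing n_bins small enough to keep every bin well populated avoids this.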
Parameters#
adata
AnnData object with samples in .obs and genes in .var.
layer
str | None. Layer used to compute variability (default: "log1p_cpm"). If None, uses adata.X.
n_top_genes
int. Number of highly variable genes to select (default: 2000).
n_bins
int. Number of mean-expression bins used for dispersion normalization (default: 20).
min_mean
float. Minimum mean expression for a gene to be considered (default: 0.0).
max_mean
float. Maximum mean expression for a gene to be considered (default: inf).
min_disp
float. Minimum dispersion threshold (default: -inf).
key_added
str. Column name in adata.var where the HVG boolean flag is stored (default: "highly_variable").
Returns#
The following columns are added to adata.var:
Column –> Description
means –> Mean expression across samples.
variances –> Expression variance across samples.
dispersions –> Variance / mean.
dispersions_norm –> Mean-binned z-score of log-dispersion.
<key_added> –> Boolean flag indicating highly variable genes.
Example usage#
import bullkpy as bk

# `adata` is an existing AnnData object that contains the requested layer.

# Basic HVG selection
bk.pp.highly_variable_genes(adata)

# Using a different layer
bk.pp.highly_variable_genes(
    adata,
    layer="log2_tpm",
    n_top_genes=3000,
)

# Applying expression filters
bk.pp.highly_variable_genes(
    adata,
    min_mean=0.1,
    max_mean=5.0,
    min_disp=0.5,
)

# Subset AnnData to HVGs
adata = adata[:, adata.var["highly_variable"]].copy()
Notes#
• Designed for bulk RNA-seq, not single-cell data
• Works with dense or sparse matrices
• Uses population variance (ddof=0)
• Mean binning is performed on log1p(mean) to improve stability
• Genes failing mean/dispersion thresholds are excluded from ranking
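To make the ddof=0 note concrete, the short NumPy example below compares population and sample variance for one gene measured in four samples, and computes the log1p(mean) coordinate used for binning. The values are made up for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])  # one gene across four samples (toy values)
n = x.size

pop_var = x.var(ddof=0)     # population variance: divides by n
sample_var = x.var(ddof=1)  # sample variance: divides by n - 1

# The two estimates differ by the factor (n - 1) / n, which matters most in small cohorts.
ratio = pop_var / sample_var

# Binning coordinate from the notes: log1p of the per-gene mean.
bin_coord = np.log1p(x.mean())
```

With only four samples the population variance is 25% smaller than the sample variance. The same factor applies to every gene, so it shifts all dispersions equally.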