Filter genes#

bullkpy.pp.filter_genes(adata, *, layer='counts', min_samples=3, min_expr=0.0, inplace=True)[source]#

Filter genes by detection across samples.

A gene is “detected” in a sample if expression > min_expr in layer. Keeps genes detected in at least min_samples samples.

Works for counts (min_expr=0) or log layers (use e.g. min_expr=0.1).

Filter genes by detection across samples.

bk.pp.filter_genes removes genes that are detected in too few samples. A gene is considered detected in a sample if its expression value exceeds a user-defined threshold (min_expr) in the selected expression layer.

This function is designed to work seamlessly with:

raw count matrices, and
normalized or log-transformed layers (e.g. log1p_cpm).

What it does#

For each gene, the function counts in how many samples its expression is above min_expr. Genes detected in fewer than min_samples samples are removed.

Detection rule:

A gene is detected in a sample if expression > min_expr in the chosen layer.

When to use it#

Typical use cases include:

Removing genes expressed in only a handful of samples
Reducing noise before PCA / clustering
Speeding up downstream differential expression
Making bulk RNA-seq analyses more robust

This is especially important for large cohorts (e.g. TCGA) where many genes are barely expressed.

Basic usage#

Filter genes detected in at least 3 samples (counts)#

bk.pp.filter_genes(
    adata,
    min_samples=3,
    layer="counts",
)

This is the most common use for raw count data.

Filter genes using a log-transformed layer#

bk.pp.filter_genes(
    adata,
    min_samples=10,
    min_expr=0.1,
    layer="log1p_cpm",
)

This is recommended when your data are already normalized or log-transformed.

Parameters#

layer

Expression matrix to use for detection:
• “counts” (default): raw counts • any key in adata.layers • None: uses adata.X

min_expr

Minimum expression value required to count a gene as detected in a sample.

Typical values: • counts: min_expr = 0 • log-normalized data: min_expr ≈ 0.1

min_samples

Minimum number of samples in which a gene must be detected to be kept.

In-place vs copy

# By default, filtering is done in-place:
bk.pp.filter_genes(adata, inplace=True)

# To keep the original object unchanged:
adata_filt = bk.pp.filter_genes(adata, inplace=False)

Stored metadata#

The function records filtering details for reproducibility:

# In adata.var

adata.var["n_samples_detected"]
# Number of samples in which each gene was detected.

# In adata.uns
adata.uns["pp"]["filter_genes"]

Contains:
• layer used
• detection threshold
• minimum samples
• number of genes before and after filtering

Interaction with other preprocessing steps#

A common preprocessing order is:

bk.pp.filter_genes(adata)
bk.pp.filter_samples(adata)
bk.pp.qc_metrics(adata)

Filtering genes early reduces noise and improves QC and dimensionality reduction.