Leading edge overlap matrix#

bullkpy.pl.leading_edge_overlap_matrix(pre_res, *, term_idx=None, terms=None, min_gene_freq=2, sort_genes_by='freq', row_cluster=True, col_cluster=False, cmap='Greys', figsize=None, show_gene_labels=True, gene_label_fontsize=8.0, show_term_labels=True, label_fontsize=9.0, dendrogram_ratio=(0.06, 0.12), title=None, show_title=True, grid_every=5, grid_color='0.90', save=None, show=True)

Pathway × gene binary matrix for leading-edge membership.

Enhancements:

Grid lines are drawn every N rows/columns (grid_every, grid_color) for readability.
label_fontsize also controls the pathway (term) label size.
dendrogram_ratio shrinks the left (row) dendrogram.
The returned df follows the clustered order (rows/columns).

Build and plot a pathway × gene binary matrix indicating leading-edge membership from a GSEApy prerank result. This is meant to mimic a “Leading Edge Viewer”: you can quickly see whether a small set of genes drives many enriched pathways.

Rows = pathways/terms
Columns = genes
Cell value = 1 if the gene is in the pathway’s leading-edge set, else 0
Optionally filters genes by how often they appear across leading edges (min_gene_freq), and clusters rows/columns via seaborn clustermap.

Figure: example leading-edge overlap matrix.

Parameters#

Required#

pre_res
A GSEApy prerank result object (the pre_res returned by gseapy.prerank(…)).
This function expects that pre_res contains enough information to recover each term’s leading-edge genes, via the internal helper _leading_edge_sets(pre_res, …).
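
As a rough illustration of what "enough information" means here, the sketch below recovers per-term leading-edge sets from a table shaped like the res2d attribute of a GSEApy 1.x prerank result. The column names Term and Lead_genes (semicolon-separated genes) and the function name leading_edge_sets_from_table are assumptions made for illustration; the internal _leading_edge_sets helper is not shown on this page and may work differently.

```python
import pandas as pd

# Toy stand-in for pre_res.res2d (assumed layout; real results carry more columns).
res2d = pd.DataFrame(
    {
        "Term": ["HALLMARK_E2F_TARGETS", "HALLMARK_G2M_CHECKPOINT"],
        "Lead_genes": ["MCM2;PCNA;TOP2A", "TOP2A;CDK1;PCNA"],
    }
)


def leading_edge_sets_from_table(table):
    """Hypothetical helper: map each term to its set of leading-edge genes."""
    le_sets = {}
    for term, genes in zip(table["Term"], table["Lead_genes"]):
        # Split the semicolon-separated string and drop empty fragments.
        le_sets[term] = {g.strip() for g in str(genes).split(";") if g.strip()}
    return le_sets


le_sets = leading_edge_sets_from_table(res2d)
```

If your GSEApy version names the column differently, adjust the Lead_genes key accordingly.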

Term selection

term_idx: None or index-like
Optional selection mechanism understood by _leading_edge_sets. Use this when you want to select terms by index/rank rather than by name.

terms: Sequence[str] | None
Explicit list of term names to include (subset of available terms in pre_res).

If both term_idx and terms are provided, the helper decides precedence (typically, explicit terms take priority, then term_idx; if neither is given, all terms are used).

Gene filtering / ordering#

min_gene_freq: int (default 2)
Keep only genes that appear in at least this many leading-edge sets across the selected terms.

1 keeps everything
larger values focus on shared drivers

sort_genes_by: “freq” or “alpha”

“freq” (default): sort columns by decreasing gene frequency across pathways
“alpha”: sort columns alphabetically.

Clustering / plotting#

row_cluster: bool (default True)
Cluster pathways (rows).

col_cluster: bool (default False)
Cluster genes (columns). Often disabled because there may be many genes.

cmap: colormap (default “Greys”)
Binary heatmap color scale.

figsize: (w, h) | None
If None, size is chosen automatically based on matrix dimensions.

show_gene_labels: bool (default True)
Show gene names on the x-axis.

gene_label_fontsize: float (default 8.0)
Font size for gene labels (use smaller values for large matrices).

show_term_labels: bool (default True)
Show pathway names on the y-axis.

save: str | Path | None
If provided, saves the figure (not just the matrix) using _savefig.

show: bool (default True)
If True, calls plt.show().

What it does#

Validates dependencies and arguments.

Requires seaborn (uses sns.clustermap).
Requires min_gene_freq >= 1.

Extracts leading-edge sets. Calls:

term_names, le_sets = _leading_edge_sets(pre_res, term_idx=term_idx, terms=terms)

le_sets is expected to be a dict: term -> set(genes).

Builds the binary matrix.

Creates union of all leading-edge genes across selected terms.
Constructs a matrix mat[i, j] = 1 if gene j is in term i’s leading-edge set.
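
The construction above can be sketched with plain NumPy and pandas. This is a minimal illustration on toy data, not the library's actual code; the term and gene names are placeholders, and sorting the gene union alphabetically is an assumption made only to get a deterministic column order.

```python
import numpy as np
import pandas as pd

# Toy leading-edge sets: term -> set of genes (placeholder names).
le_sets = {
    "TERM_A": {"PCNA", "TOP2A", "MCM2"},
    "TERM_B": {"TOP2A", "CDK1", "PCNA"},
    "TERM_C": {"MYC", "PCNA"},
}
term_names = list(le_sets)

# Union of all leading-edge genes across the selected terms.
genes = sorted(set().union(*le_sets.values()))

# mat[i, j] = 1 if gene j is in term i's leading-edge set, else 0.
mat = np.zeros((len(term_names), len(genes)), dtype=int)
for i, term in enumerate(term_names):
    for j, gene in enumerate(genes):
        if gene in le_sets[term]:
            mat[i, j] = 1

df = pd.DataFrame(mat, index=term_names, columns=genes)
```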

Filters genes by frequency.

Computes gene_freq = df.sum(axis=0).
Keeps only genes where gene_freq >= min_gene_freq.
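
A minimal sketch of this filtering step on a toy matrix follows. The data and the error message are illustrative assumptions; only the gene_freq computation and the >= comparison come from the description above.

```python
import pandas as pd

# Toy binary pathway x gene matrix (placeholder names).
df = pd.DataFrame(
    [[0, 1, 0, 1, 1], [1, 0, 0, 1, 1], [0, 0, 1, 1, 0]],
    index=["TERM_A", "TERM_B", "TERM_C"],
    columns=["CDK1", "MCM2", "MYC", "PCNA", "TOP2A"],
)
min_gene_freq = 2

# Number of leading-edge sets each gene appears in.
gene_freq = df.sum(axis=0)

# Keep only the genes that reach the frequency threshold.
keep = gene_freq[gene_freq >= min_gene_freq].index
df_filtered = df.loc[:, keep]

if df_filtered.shape[1] == 0:
    # Illustrative message; the library's wording may differ.
    raise ValueError("No genes left after filtering; lower min_gene_freq.")
```

Here only PCNA (3 terms) and TOP2A (2 terms) survive; with min_gene_freq=1 all five genes would be kept.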

Sorts columns.

Either by frequency (descending) or alphabetically.

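
The two sort modes might look like the sketch below. The function name sort_gene_columns is hypothetical, and how ties in frequency are broken inside the library is not documented here, so the stable sort is an assumption.

```python
import pandas as pd

# Toy binary pathway x gene matrix (placeholder names).
df = pd.DataFrame(
    [[0, 1, 0, 1, 1], [1, 0, 0, 1, 1], [0, 0, 1, 1, 0]],
    index=["TERM_A", "TERM_B", "TERM_C"],
    columns=["TOP2A", "MCM2", "MYC", "PCNA", "CDK1"],
)


def sort_gene_columns(df, sort_genes_by="freq"):
    """Hypothetical helper mirroring the sort_genes_by options."""
    if sort_genes_by == "freq":
        # Most frequently shared genes first; a stable sort is requested for ties.
        order = df.sum(axis=0).sort_values(ascending=False, kind="stable").index
    elif sort_genes_by == "alpha":
        order = sorted(df.columns)
    else:
        raise ValueError("sort_genes_by must be 'freq' or 'alpha'")
    return df.loc[:, order]


by_freq = sort_gene_columns(df, "freq")
by_alpha = sort_gene_columns(df, "alpha")
```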
Plots with seaborn clustermap.

Optional clustering for rows/cols.
Removes the colorbar (cbar_pos=None) because it’s binary.
Adds axis labels and title.
If gene labels are shown, rotates them via _rotate_gene_labels and increases bottom margin.
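
The plotting step can be approximated directly with seaborn. This is a sketch under stated assumptions, not the function's implementation: the toy data, the figure size, and the way grid lines are drawn with axhline/axvline are all illustrative. It also shows one way to read the clustered row order back from the ClusterGrid.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import pandas as pd
import seaborn as sns

# Toy binary pathway x gene matrix (placeholder names).
df = pd.DataFrame(
    [[1, 1, 0, 1], [1, 1, 1, 0], [0, 1, 0, 1]],
    index=["TERM_A", "TERM_B", "TERM_C"],
    columns=["PCNA", "TOP2A", "CDK1", "MCM2"],
)

g = sns.clustermap(
    df,
    row_cluster=True,               # cluster pathways
    col_cluster=False,              # keep the gene order as given
    cmap="Greys",
    vmin=0,
    vmax=1,
    cbar_pos=None,                  # no colorbar for a binary matrix
    dendrogram_ratio=(0.06, 0.12),  # (row, col) share of the figure
    figsize=(6, 4),
)

# Assumed grid style: a faint line every `grid_every` rows and columns.
grid_every = 2
for k in range(grid_every, df.shape[0], grid_every):
    g.ax_heatmap.axhline(k, color="0.90", linewidth=0.8)
for k in range(grid_every, df.shape[1], grid_every):
    g.ax_heatmap.axvline(k, color="0.90", linewidth=0.8)

# Reorder the matrix rows to match what the heatmap displays.
row_order = g.dendrogram_row.reordered_ind
df_clustered = df.iloc[row_order, :]
```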

Optionally saves and shows.

Returns#

df: pd.DataFrame
The final binary pathway × gene matrix after filtering and sorting, returned in the clustered row/column order when clustering is enabled.

g: seaborn ClusterGrid
The object returned by sns.clustermap (useful for fine-grained figure edits).
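
Because df is an ordinary binary DataFrame, it lends itself to follow-up summaries. The sketch below is one possible downstream use, not part of the function: it works on a toy matrix standing in for the returned df and derives a pathway × pathway count of shared leading-edge genes and a ranking of the most widely shared genes.

```python
import pandas as pd

# Toy stand-in for the returned df (placeholder names).
df = pd.DataFrame(
    [[1, 1, 0, 1], [1, 1, 1, 0], [0, 1, 0, 1]],
    index=["TERM_A", "TERM_B", "TERM_C"],
    columns=["PCNA", "TOP2A", "CDK1", "MCM2"],
)

# shared.loc[a, b] = number of leading-edge genes common to terms a and b;
# the diagonal holds each term's leading-edge size after filtering.
shared = df @ df.T

# Genes ranked by how many leading-edge sets they appear in.
top_genes = df.sum(axis=0).sort_values(ascending=False)
```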

Raises#

ImportError if seaborn is missing.
ValueError if min_gene_freq < 1, if no leading-edge genes are found for the selected terms, if filtering removes all genes (df.shape[1] == 0), or if sort_genes_by is invalid.
Potential errors from _leading_edge_sets if terms cannot be resolved in pre_res.

Notes / tips#

Start with min_gene_freq=2 to highlight “shared driver” genes across enriched pathways.
If you have many terms, consider selecting a smaller subset via terms=[…], disabling gene labels (show_gene_labels=False), or increasing min_gene_freq.
If you want clustering among genes, set col_cluster=True, but this can be slow for large matrices.

Examples#

Default: show shared leading-edge genes across all enriched terms

import bullkpy as bk  # pre_res is assumed to come from gseapy.prerank(…)
df_le, g = bk.pl.leading_edge_overlap_matrix(pre_res)

Focus on specific Hallmark pathways

terms = [
    "HALLMARK_E2F_TARGETS",
    "HALLMARK_G2M_CHECKPOINT",
    "HALLMARK_MYC_TARGETS_V1",
]
df_le, g = bk.pl.leading_edge_overlap_matrix(
    pre_res,
    terms=terms,
    min_gene_freq=1,
    col_cluster=True,
)

Emphasize only genes shared across many pathways

df_le, g = bk.pl.leading_edge_overlap_matrix(
    pre_res,
    min_gene_freq=4,
    show_gene_labels=False,
)