Leading edge overlap matrix
bullkpy.pl.leading_edge_overlap_matrix(pre_res, *, term_idx=None, terms=None, min_gene_freq=2, sort_genes_by='freq', row_cluster=True, col_cluster=False, cmap='Greys', figsize=None, show_gene_labels=True, gene_label_fontsize=8.0, show_term_labels=True, label_fontsize=9.0, dendrogram_ratio=(0.06, 0.12), title=None, show_title=True, grid_every=5, grid_color='0.90', save=None, show=True)
Pathway × gene binary matrix for leading-edge membership.
Enhancements:
- Grid lines are drawn every N rows/columns (grid_every) for readability.
- label_fontsize also controls the pathway labels.
- dendrogram_ratio shrinks the left dendrogram.
- The returned df is in the clustered order (rows/columns).
Build and plot a pathway × gene binary matrix indicating leading-edge membership from a GSEApy prerank result. This is meant to mimic a “Leading Edge Viewer”: you can quickly see whether a small set of genes drives many enriched pathways.
Rows = pathways/terms
Columns = genes
Cell value = 1 if the gene is in the pathway’s leading-edge set, else 0
Optionally filters genes by how often they appear across leading edges (min_gene_freq), and clusters rows/columns via seaborn clustermap.
[Figure: Example leading edge overlap matrix]
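To make the matrix layout concrete, here is a minimal sketch that builds a pathway × gene binary matrix from toy leading-edge sets. The term and gene names are made up for illustration, and this is not the function's internal code, only one straightforward way to get the same shape of result with pandas.

```python
import pandas as pd

# Toy leading-edge sets (hypothetical terms and genes, for illustration only).
le_sets = {
    "PATHWAY_A": {"GENE1", "GENE2", "GENE3"},
    "PATHWAY_B": {"GENE2", "GENE3"},
    "PATHWAY_C": {"GENE3", "GENE4"},
}

# Union of all leading-edge genes, sorted for a stable column order.
genes = sorted(set().union(*le_sets.values()))

# One row per pathway, one column per gene; 1 marks leading-edge membership.
df = pd.DataFrame(
    [[int(gene in members) for gene in genes] for members in le_sets.values()],
    index=list(le_sets),
    columns=genes,
)
print(df)
```

In this toy case GENE3 has a 1 in every row, which is the kind of shared driver the plot is meant to expose.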
Parameters
Required
pre_res
A GSEApy prerank result object (the pre_res returned by gseapy.prerank(…)).
This function expects that pre_res contains enough information to recover each term’s leading-edge genes, via the internal helper _leading_edge_sets(pre_res, …).
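As a rough illustration of what "enough information" can mean, the sketch below recovers leading-edge sets from a results table. It assumes a table shaped like the res2d attribute of recent GSEApy releases, with a Term column and a Lead_genes column holding semicolon-separated gene names. That layout is an assumption here, the toy table stands in for a real result, and the internal helper _leading_edge_sets may work differently.

```python
import pandas as pd

# Stand-in for pre_res.res2d (assumed layout: "Term" and "Lead_genes" columns,
# with leading-edge genes joined by semicolons). Values are made up.
res2d = pd.DataFrame(
    {
        "Term": ["PATHWAY_A", "PATHWAY_B"],
        "Lead_genes": ["GENE1;GENE2;GENE3", "GENE2;GENE3"],
    }
)

# Map each term to the set of its leading-edge genes.
le_sets = {
    term: set(lead_genes.split(";"))
    for term, lead_genes in zip(res2d["Term"], res2d["Lead_genes"])
}
term_names = list(le_sets)
```

If your GSEApy version names these columns differently, inspect pre_res.res2d.columns before relying on this pattern.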
Term selection
term_idx: None or index-like
Optional selection mechanism understood by _leading_edge_sets. Use this when you want to select terms by index/rank rather than by name.
terms: Sequence[str] | None
Explicit list of term names to include (subset of available terms in pre_res).
If both are provided, the helper decides precedence (typically: explicit terms wins; otherwise term_idx; otherwise all).
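The "typical" precedence described above can be sketched as follows. resolve_terms is a hypothetical function written for this illustration, not part of bullkpy, and the choice of ValueError for unknown term names is also an assumption.

```python
def resolve_terms(available, term_idx=None, terms=None):
    """Hypothetical sketch: explicit terms win, then term_idx, else all terms."""
    if terms is not None:
        missing = [t for t in terms if t not in available]
        if missing:
            # Error type is an assumption; the real helper may differ.
            raise ValueError(f"Terms not found: {missing}")
        return list(terms)
    if term_idx is not None:
        # Accept a single integer or any iterable of integer positions.
        idx = [term_idx] if isinstance(term_idx, int) else list(term_idx)
        return [available[i] for i in idx]
    return list(available)


available = ["PATHWAY_A", "PATHWAY_B", "PATHWAY_C"]
by_name = resolve_terms(available, term_idx=[0], terms=["PATHWAY_C"])
by_index = resolve_terms(available, term_idx=[0, 2])
everything = resolve_terms(available)
```

To avoid depending on precedence at all, pass only one of term_idx or terms.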
Gene filtering / ordering
min_gene_freq: int (default 2)
Keep only genes that appear in at least this many leading-edge sets across the selected terms.
1 keeps everything
larger values focus on shared drivers
sort_genes_by: “freq” or “alpha”
“freq” (default): sort columns by decreasing gene frequency across pathways
“alpha”: sort columns alphabetically.
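The effect of min_gene_freq and sort_genes_by can be reproduced on any binary pathway × gene DataFrame. The sketch below uses a hypothetical helper, filter_and_sort_genes, on toy data. It follows the behaviour described above, but how the library breaks ties between genes of equal frequency is not documented here, so the stable sort is an assumption.

```python
import pandas as pd


def filter_and_sort_genes(df, min_gene_freq=2, sort_genes_by="freq"):
    """Hypothetical sketch of the gene filtering and column ordering steps."""
    if min_gene_freq < 1:
        raise ValueError("min_gene_freq must be >= 1")
    # Number of leading-edge sets each gene appears in.
    gene_freq = df.sum(axis=0)
    kept = gene_freq[gene_freq >= min_gene_freq]
    if kept.empty:
        raise ValueError("No genes left after filtering")
    if sort_genes_by == "freq":
        # Most frequent genes first; ties keep their original order (assumption).
        order = list(kept.sort_values(ascending=False, kind="stable").index)
    elif sort_genes_by == "alpha":
        order = sorted(kept.index)
    else:
        raise ValueError("sort_genes_by must be 'freq' or 'alpha'")
    return df.loc[:, order]


# Toy matrix: gene frequencies are GENE1=1, GENE2=2, GENE3=3, GENE4=1.
df = pd.DataFrame(
    [[1, 1, 1, 0], [0, 1, 1, 0], [0, 0, 1, 1]],
    index=["PATHWAY_A", "PATHWAY_B", "PATHWAY_C"],
    columns=["GENE1", "GENE2", "GENE3", "GENE4"],
)
by_freq = filter_and_sort_genes(df, min_gene_freq=2, sort_genes_by="freq")
by_alpha = filter_and_sort_genes(df, min_gene_freq=2, sort_genes_by="alpha")
```

With min_gene_freq=2, GENE1 and GENE4 are dropped because each appears in only one leading-edge set.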
Clustering / plotting
row_cluster: bool (default True)
Cluster pathways (rows).
col_cluster: bool (default False)
Cluster genes (columns). Often disabled because there may be many genes.
cmap: colormap (default “Greys”)
Binary heatmap color scale.
figsize: (w, h) | None
If None, size is chosen automatically based on matrix dimensions.
show_gene_labels: bool (default True)
Show gene names on the x-axis.
gene_label_fontsize: float (default 8.0)
Font size for gene labels (use smaller values for large matrices).
show_term_labels: bool (default True)
Show pathway names on the y-axis.
label_fontsize: float (default 9.0)
Font size for labels; also controls the pathway labels.
dendrogram_ratio: (float, float) (default (0.06, 0.12))
Dendrogram size as a fraction of the figure; the small default shrinks the left dendrogram.
title (default None)
Title of the figure.
show_title: bool (default True)
Show the title.
grid_every: int (default 5)
Draw grid lines every N rows/columns for readability.
grid_color: color (default '0.90')
Color of the grid lines.
save: str | Path | None
If provided, saves the figure (not just the matrix) using _savefig.
show: bool (default True)
If True, calls plt.show().
What it does
Validates inputs and dependencies
Requires seaborn (uses sns.clustermap).
Requires min_gene_freq >= 1.
Extracts leading-edge sets. Calls:
term_names, le_sets = _leading_edge_sets(pre_res, term_idx=term_idx, terms=terms)
le_sets is expected to be a dict: term -> set(genes).
Builds the binary matrix.
Creates union of all leading-edge genes across selected terms.
Constructs a matrix mat[i, j] = 1 if gene j is in term i’s leading-edge set.
Filters genes by frequency.
Computes gene_freq = df.sum(axis=0).
Keeps only genes where gene_freq >= min_gene_freq.
Sorts columns.
Either by frequency (descending) or alphabetically.
Plots with seaborn clustermap.
Optional clustering for rows/cols.
Removes the colorbar (cbar_pos=None) because it’s binary.
Adds axis labels and title.
If gene labels are shown, rotates them via _rotate_gene_labels and increases bottom margin.
Optionally saves and shows.
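The plotting step can be approximated directly with seaborn. This is a minimal sketch on toy data, assuming seaborn, matplotlib and SciPy are installed. It uses the documented defaults (cmap, dendrogram_ratio, cbar_pos=None), but the axis label text and figure size are assumptions, and the grid lines, label rotation and saving done by the real function are omitted.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless

import pandas as pd
import seaborn as sns

# Toy binary pathway x gene matrix (made-up names).
df = pd.DataFrame(
    [[1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 0, 1], [1, 0, 1, 1]],
    index=["PATHWAY_A", "PATHWAY_B", "PATHWAY_C", "PATHWAY_D"],
    columns=["GENE1", "GENE2", "GENE3", "GENE4"],
)

g = sns.clustermap(
    df,
    row_cluster=True,               # cluster pathways
    col_cluster=False,              # keep the sorted gene order
    cmap="Greys",
    cbar_pos=None,                  # no colorbar: the data are binary
    dendrogram_ratio=(0.06, 0.12),
    figsize=(6, 4),
)
g.ax_heatmap.set_xlabel("Leading-edge genes")  # label text is an assumption
g.ax_heatmap.set_ylabel("Pathways")

# Reorder the matrix to match the clustered row order shown in the plot.
row_order = g.dendrogram_row.reordered_ind
df_clustered = df.iloc[row_order, :]
```

Reindexing by reordered_ind is one way to obtain a matrix whose row order matches the figure, which is what the returned df is described as providing.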
Returns
df: pd.DataFrame
The final binary pathway × gene matrix after filtering and sorting, in the clustered row/column order when clustering is enabled.
g: seaborn ClusterGrid
The object returned by sns.clustermap (useful for fine-grained figure edits).
Raises
ImportError if seaborn is missing.
ValueError if:
– min_gene_freq < 1
– no leading-edge genes are found for the selected terms
– filtering removes all genes (df.shape[1] == 0)
– sort_genes_by is invalid
Potential errors from _leading_edge_sets if terms cannot be resolved in pre_res.
Notes / tips
Start with min_gene_freq=2 to highlight “shared driver” genes across enriched pathways.
If you have many terms, consider:
– selecting a smaller subset via terms=[…]
– disabling gene labels (show_gene_labels=False)
– increasing min_gene_freq
If you want clustering among genes, set col_cluster=True, but this can be slow for large matrices.
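One way to pick min_gene_freq is to count how many genes would survive each threshold before plotting. The sketch below does this on a toy unfiltered matrix; with the real function, an unfiltered matrix corresponds to min_gene_freq=1, which keeps everything. The variable names are illustrative.

```python
import pandas as pd

# Toy unfiltered matrix: gene frequencies are GENE1=1, GENE2=2, GENE3=3, GENE4=1.
df = pd.DataFrame(
    [[1, 1, 1, 0], [0, 1, 1, 0], [0, 0, 1, 1]],
    index=["PATHWAY_A", "PATHWAY_B", "PATHWAY_C"],
    columns=["GENE1", "GENE2", "GENE3", "GENE4"],
)

gene_freq = df.sum(axis=0)

# Number of genes kept for each candidate threshold.
survivors = {
    k: int((gene_freq >= k).sum())
    for k in range(1, int(gene_freq.max()) + 1)
}
print(survivors)  # {1: 4, 2: 2, 3: 1}
```

A threshold that leaves a few dozen genes usually keeps the x-axis labels readable.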
Examples
Default: show shared leading-edge genes across all enriched terms
import bullkpy as bk
df_le, g = bk.pl.leading_edge_overlap_matrix(pre_res)
Focus on specific Hallmark pathways
terms = [
    "HALLMARK_E2F_TARGETS",
    "HALLMARK_G2M_CHECKPOINT",
    "HALLMARK_MYC_TARGETS_V1",
]
df_le, g = bk.pl.leading_edge_overlap_matrix(
    pre_res,
    terms=terms,
    min_gene_freq=1,
    col_cluster=True,
)
Emphasize only genes shared across many pathways
df_le, g = bk.pl.leading_edge_overlap_matrix(
    pre_res,
    min_gene_freq=4,
    show_gene_labels=False,
)
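The returned matrix is also useful on its own. As a sketch, multiplying a binary pathway × gene matrix by its transpose gives the number of leading-edge genes shared by each pair of pathways. The toy DataFrame below stands in for df_le; the names are made up.

```python
import pandas as pd

# Stand-in for the returned df_le (binary pathway x gene matrix).
df_le = pd.DataFrame(
    [[1, 1, 1, 0], [0, 1, 1, 0], [0, 0, 1, 1]],
    index=["PATHWAY_A", "PATHWAY_B", "PATHWAY_C"],
    columns=["GENE1", "GENE2", "GENE3", "GENE4"],
)

# Pairwise counts of shared leading-edge genes; the diagonal holds set sizes.
shared = df_le @ df_le.T

# Genes ranked by how many pathways they drive.
top_genes = df_le.sum(axis=0).sort_values(ascending=False)
```

Here shared.loc["PATHWAY_A", "PATHWAY_B"] counts the genes in both leading edges, and top_genes.index[0] is the most widely shared gene.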