Categorical association

bullkpy.tl.categorical_association(adata, *, key1, key2)

Association between two categorical obs columns using a chi-square contingency test.

Returns a dict with:

chi2, dof, pval
table (contingency DataFrame)
expected (numpy array)

Association analysis between two categorical annotations.

This function quantifies the relationship between two categorical columns in adata.obs using contingency tables and multiple association metrics. It is useful for comparing clusterings, annotations, batches, or any pair of categorical labels.

What it does

Given two categorical variables (key1, key2), the function:
1. Builds a contingency table
2. Computes one or more association metrics
3. Returns all results in a single dictionary

Supported metrics include:
• Chi-squared test
• Cramér’s V
• Adjusted Rand Index (ARI)
• Normalized Mutual Information (NMI)
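The first step above, building the contingency table, can be sketched in plain Python. This is only an illustration of the idea, not the library's implementation: the helper name contingency_table and the toy labels are hypothetical, and the real function returns a pandas DataFrame rather than nested lists.

```python
from collections import Counter

def contingency_table(labels1, labels2):
    """Count co-occurrences of two label sequences (rows: labels1, columns: labels2)."""
    if len(labels1) != len(labels2):
        raise ValueError("both label sequences must have the same length")
    # Cast to strings so that, for example, 1 and "1" count as the same category.
    pairs = Counter((str(a), str(b)) for a, b in zip(labels1, labels2))
    rows = sorted({a for a, _ in pairs})
    cols = sorted({b for _, b in pairs})
    counts = [[pairs.get((r, c), 0) for c in cols] for r in rows]
    return rows, cols, counts

# Hypothetical toy labels: six observations with a cluster and a cell type each.
clusters = ["0", "0", "1", "1", "1", "2"]
cell_types = ["T", "T", "B", "B", "T", "NK"]
rows, cols, counts = contingency_table(clusters, cell_types)
```

Each entry counts[i][j] is the number of observations that carry row label rows[i] and column label cols[j]; every metric listed above is derived from these counts or from the two label sequences.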

When to use

Use categorical_association when you want to:
• Compare cluster labels vs known annotations
• Quantify agreement between two clustering solutions
• Assess batch effects
• Explore relationships between categorical metadata fields

Examples:
• Leiden clusters vs cell type
• Batch vs condition
• Manual annotation vs automated labels

Parameters

adata
AnnData object containing the annotations in .obs

key1
First categorical column in adata.obs

key2
Second categorical column in adata.obs

metrics
Iterable of metrics to compute. Supported values:
• “chi2” – Chi-squared test of independence
• “cramers_v” – Effect size for categorical association
• “ari” – Adjusted Rand Index (requires scikit-learn)
• “nmi” – Normalized Mutual Information (requires scikit-learn)

dropna
If True, rows with missing values in either column are dropped before analysis
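As an illustration of what dropping incomplete rows means, the sketch below removes every observation where either label is missing. How the library detects missing values is not documented here, so the helper drop_missing_pairs and the assumption that missing values appear as None or NaN are hypothetical.

```python
import math

def drop_missing_pairs(labels1, labels2):
    """Keep only positions where both labels are present (assumes None/NaN mark missing)."""
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    kept = [
        (a, b)
        for a, b in zip(labels1, labels2)
        if not (is_missing(a) or is_missing(b))
    ]
    return [a for a, _ in kept], [b for _, b in kept]

# Hypothetical toy columns with one missing value each.
batch = ["A", "A", None, "B", "B"]
condition = ["ctrl", float("nan"), "ctrl", "treated", "treated"]
kept_batch, kept_condition = drop_missing_pairs(batch, condition)
```

An observation is removed as a whole, so the two filtered sequences stay aligned and have the same length.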

Returned value

Returns a dict with the following entries:

table

A pandas DataFrame representing the contingency table:

key2 categories → (columns)
key1 categories ↓ (rows)

Optional metric entries

Depending on metrics, the dictionary may also include:

chi2

{
  "statistic": float,
  "pval": float,
  "dof": int
}

cramers_v

float

ari

float

nmi

float

Metrics explained

Chi-squared test.
• Tests independence between categories
• Sensitive to sample size
• Returns statistic and p-value
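To make the sample-size sensitivity concrete, here is a minimal from-scratch sketch of the Pearson chi-squared statistic without continuity correction. The helper chi2_statistic and both tables are hypothetical; the p-value is omitted because it needs the chi-squared distribution, which the Python standard library does not provide.

```python
def chi2_statistic(counts):
    """Pearson chi-squared statistic, degrees of freedom and expected counts.

    Assumes every row and column total is positive; no Yates correction is applied.
    """
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / n.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(counts, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof, expected

# Same proportions, ten times as many observations.
small = [[12, 8], [8, 12]]
large = [[120, 80], [80, 120]]
stat_small, dof_small, _ = chi2_statistic(small)
stat_large, dof_large, _ = chi2_statistic(large)
```

Both tables have identical proportions, yet the statistic for the larger table is ten times that of the smaller one, which is why a small p-value alone says little about the strength of an association.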

Cramér’s V.
• Effect size for categorical association
• Range: 0–1
• Interpretable regardless of table size
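A common definition is V = sqrt(chi2 / (n * (min(r, c) - 1))), where n is the number of observations and r, c are the numbers of rows and columns. The sketch below implements that uncorrected form; whether the library applies a bias correction is not stated here, so treat the helper cramers_v and its tables as illustrative assumptions.

```python
import math

def cramers_v(counts):
    """Uncorrected Cramér's V for a table with positive row and column totals."""
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    n = sum(row_totals)
    k = min(len(row_totals), len(col_totals)) - 1
    if k < 1:
        raise ValueError("each variable needs at least two categories")
    # Pearson chi-squared statistic without continuity correction.
    chi2 = sum(
        (obs - r * c / n) ** 2 / (r * c / n)
        for row, r in zip(counts, row_totals)
        for obs, c in zip(row, col_totals)
    )
    return math.sqrt(chi2 / (n * k))

v_small = cramers_v([[12, 8], [8, 12]])
v_large = cramers_v([[120, 80], [80, 120]])    # same proportions, 10x the data
v_perfect = cramers_v([[10, 0], [0, 10]])      # labels determine each other
```

Unlike the raw chi-squared statistic, the value does not change when every count is multiplied by the same factor, and it reaches 1 when each category of one variable maps to exactly one category of the other.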

Adjusted Rand Index (ARI).
• Measures clustering similarity
• Adjusted for chance
• Range: –1 to 1
• Requires scikit-learn
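The pair-counting definition behind ARI can be sketched without scikit-learn. This is a from-scratch illustration of the standard formula, not the code path the library uses; the helper adjusted_rand_index and the toy labels are hypothetical.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels1, labels2):
    """ARI from pair counts: (index - expected) / (max index - expected)."""
    total_pairs = comb(len(labels1), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two observations: treat as trivially identical
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels1, labels2)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels1).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels2).values())
    expected = sum_rows * sum_cols / total_pairs
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate partitions, e.g. a single cluster in both
    return (sum_cells - expected) / (max_index - expected)

# Same grouping under different names versus a grouping that splits every cluster.
ari_same = adjusted_rand_index([0, 0, 1, 1], ["b", "b", "a", "a"])
ari_crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Label names do not matter, only which observations are grouped together, so the renamed partition scores 1, while the crossed partition scores below 0 because it agrees less than chance would.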

Normalized Mutual Information (NMI).
• Information-theoretic similarity
• Range: 0–1
• Requires scikit-learn
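NMI divides the mutual information of the two labelings by a normalizer built from their entropies. The sketch below uses the arithmetic mean of the two entropies, which is one common choice; which normalization the library ends up using is an assumption here, and the helper names are hypothetical.

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def normalized_mutual_info(labels1, labels2):
    """Mutual information divided by the arithmetic mean of the two entropies."""
    n = len(labels1)
    joint = Counter(zip(labels1, labels2))
    count1 = Counter(labels1)
    count2 = Counter(labels2)
    mutual_info = sum(
        (c / n) * log(c * n / (count1[a] * count2[b]))
        for (a, b), c in joint.items()
    )
    mean_entropy = (entropy(labels1) + entropy(labels2)) / 2
    if mean_entropy == 0:
        return 1.0  # both labelings have a single category
    return mutual_info / mean_entropy

nmi_same = normalized_mutual_info([0, 0, 1, 1], ["b", "b", "a", "a"])
nmi_independent = normalized_mutual_info([0, 0, 1, 1], [0, 1, 0, 1])
```

A relabeled copy of the same partition gives 1, and two labelings that carry no information about each other give 0.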

Examples

Compare clustering vs annotation

import bullkpy as bk

out = bk.tl.categorical_association(
    adata,
    key1="leiden_1.0",
    key2="CellType",
)

out["cramers_v"]

Full metric set

out = bk.tl.categorical_association(
    adata,
    key1="Batch",
    key2="Condition",
    metrics=("chi2", "cramers_v", "ari", "nmi"),
)

Inspect contingency table

out["table"]

Notes

• ARI and NMI require scikit-learn. If scikit-learn is unavailable, ARI/NMI are skipped with a warning.
• Chi-squared uses no Yates correction.
• All categories are cast to strings before comparison.
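The skip-with-a-warning behaviour for the optional scikit-learn dependency can be pictured with a small sketch. The helper available_metrics and its sklearn_module parameter are hypothetical and exist only so the missing-dependency case can be simulated; the library's own check may look different.

```python
import importlib.util
import warnings

def available_metrics(requested, sklearn_module="sklearn"):
    """Return the requested metrics, dropping ARI/NMI if scikit-learn is missing."""
    needs_sklearn = {"ari", "nmi"}
    if importlib.util.find_spec(sklearn_module) is not None:
        return list(requested)
    skipped = [m for m in requested if m in needs_sklearn]
    if skipped:
        warnings.warn(
            f"scikit-learn not available; skipping: {', '.join(skipped)}"
        )
    return [m for m in requested if m not in needs_sklearn]

# Simulate a missing dependency by pointing at a module name that does not exist.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    kept = available_metrics(
        ("chi2", "cramers_v", "ari", "nmi"),
        sklearn_module="module_that_does_not_exist",
    )
```

With this pattern the chi-squared and Cramér's V results remain available even when the optional dependency is not installed.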