Read counts#

bullkpy.io.read_counts(filename, *, sep='\t', orientation='genes_by_samples', dtype='int64')

Read a bulk RNA-seq count matrix and return an AnnData object.

Parameters:
  • filename – Path to the counts file (tsv/csv). Gene IDs must appear either as the row labels or as the column headers, as declared by orientation.

  • sep – Column separator (default: tab).

  • orientation – Layout of the input matrix (default: “genes_by_samples”):

    • “genes_by_samples”: genes in rows, samples in columns (common output of HTSeq, featureCounts)

    • “samples_by_genes”: samples in rows, genes in columns

  • dtype – Cast count matrix to dtype (default: int64). Set to None to disable casting.

Returns:

AnnData – AnnData object with samples in .obs and genes in .var.

bk.io.read_counts is the main entry point for loading raw count matrices generated by HTSeq, featureCounts, or similar RNA-seq quantification pipelines.

Purpose#

This function reads a gene–sample count matrix from disk and returns an AnnData object with:

  • samples stored in .obs

  • genes stored in .var

  • raw counts stored in .X

The function accepts either matrix orientation (set with orientation) and both tab- and comma-separated text files (set with sep).


Supported input formats#

  • TSV / CSV text files

  • Genes in rows or genes in columns

  • Integer counts, or other numeric values when casting is disabled with dtype=None

Typical inputs include:

  • featureCounts.txt

  • htseq-count.tsv

  • custom gene × sample matrices
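Raw featureCounts output usually carries a leading comment line and annotation columns (Chr, Start, End, Strand, Length) ahead of the per-sample counts. This page does not say whether read_counts handles those itself, so the following is a minimal pandas sketch, under that assumption, that reduces such a table to a plain gene × sample matrix beforehand. The toy data and the helper name strip_featurecounts_annotation are illustrative and not part of bullkpy.

```python
import io

import pandas as pd

# Hypothetical miniature featureCounts table (illustrative data only).
raw = (
    "# Program:featureCounts (header comment line)\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tsampleA.bam\tsampleB.bam\n"
    "gene1\tchr1\t100\t900\t+\t801\t12\t7\n"
    "gene2\tchr2\t50\t450\t-\t401\t0\t33\n"
)

# Annotation columns that precede the per-sample counts.
ANNOTATION_COLUMNS = ["Chr", "Start", "End", "Strand", "Length"]


def strip_featurecounts_annotation(handle):
    """Hypothetical helper: return a genes x samples count table."""
    # comment="#" skips the leading program/command line.
    table = pd.read_csv(handle, sep="\t", comment="#", index_col=0)
    # errors="ignore" keeps the helper usable on tables without these columns.
    return table.drop(columns=ANNOTATION_COLUMNS, errors="ignore")


counts = strip_featurecounts_annotation(io.StringIO(raw))
print(counts)
```

The cleaned table could then be written back out with counts.to_csv("counts.tsv", sep="\t") and loaded with orientation="genes_by_samples".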


Basic usage#

Genes in rows, samples in columns (most common)#

import bullkpy as bk

adata = bk.io.read_counts(
    "counts.tsv",
    orientation="genes_by_samples",
)

This corresponds to matrices where rows are genes and columns are samples (e.g. HTSeq, featureCounts default output).

Samples in rows, genes in columns#

adata = bk.io.read_counts(
    "counts.tsv",
    orientation="samples_by_genes",
)
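To make the two layouts concrete, here is a small pandas sketch, independent of bullkpy's actual implementation, showing that a genes_by_samples table only needs a transpose to reach the samples × genes layout, while a samples_by_genes table is already in that layout. The helper to_samples_by_genes and the toy data are assumptions made for illustration.

```python
import io

import pandas as pd

# Toy matrix with genes in rows and samples in columns (illustrative data).
genes_by_samples_text = (
    "gene_id\tS1\tS2\tS3\n"
    "geneA\t5\t0\t2\n"
    "geneB\t1\t8\t4\n"
)


def to_samples_by_genes(table, orientation):
    """Hypothetical helper: orient a count table as samples x genes."""
    if orientation == "genes_by_samples":
        return table.T  # rows become samples, columns become genes
    if orientation == "samples_by_genes":
        return table  # already in the target layout
    raise ValueError(f"unknown orientation: {orientation!r}")


table = pd.read_csv(io.StringIO(genes_by_samples_text), sep="\t", index_col=0)
oriented = to_samples_by_genes(table, "genes_by_samples")

print(oriented.shape)
print(list(oriented.index))
print(list(oriented.columns))
```

Passing the wrong orientation does not necessarily raise an error, so checking that the row labels look like sample IDs is a cheap sanity test.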

Key parameters#

filename
Path to the counts file (tsv or csv).

sep
Column separator (default: tab).

orientation
Layout of the input matrix:

  • “genes_by_samples”: genes in rows, samples in columns (default)

  • “samples_by_genes”: samples in rows, genes in columns

dtype
Data type the count matrix is cast to (default: int64). Set to None to disable casting.

File separators#

By default, tab-separated files are assumed:

adata = bk.io.read_counts("counts.tsv", sep="\t")

For comma-separated files:

adata = bk.io.read_counts("counts.csv", sep=",")
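When file names come from a pipeline, the separator can be derived from the extension instead of being hard-coded. The sketch below uses only the standard library; guess_sep is a hypothetical helper, not a bullkpy function, and it falls back to a tab to mirror the documented default.

```python
from pathlib import Path


def guess_sep(filename):
    """Hypothetical helper: choose a column separator from the file extension."""
    suffix = Path(filename).suffix.lower()
    # Only .csv is treated as comma-separated; everything else
    # (.tsv, .txt, no extension) falls back to tab.
    return "," if suffix == ".csv" else "\t"


for name in ["counts.csv", "counts.tsv", "featureCounts.txt"]:
    print(name, repr(guess_sep(name)))
```

The result can be passed straight through, for example sep=guess_sep(path).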

Data types#

By default, the count matrix is cast to int64:

adata = bk.io.read_counts("counts.tsv", dtype="int64")

If your input is already normalized or contains non-integer values, disable casting:

adata = bk.io.read_counts("matrix.tsv", dtype=None)
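In NumPy, casting floating-point values to int64 truncates the fractional part, which is why an integer cast is unsuitable for normalized data. Assuming read_counts casts in a comparable way (this page does not specify the mechanism), a quick check like the following sketch can tell you whether dtype=None is needed. The helper is_integer_valued and the toy arrays are illustrative only.

```python
import numpy as np


def is_integer_valued(values):
    """Hypothetical helper: True if every entry is a finite whole number."""
    arr = np.asarray(values, dtype=float)
    return bool(np.all(np.isfinite(arr)) and np.all(arr == np.round(arr)))


raw_counts = np.array([[12, 0], [7, 33]])
normalized = np.array([[1.5, 0.0], [2.25, 8.0]])

print(is_integer_valued(raw_counts))   # safe to cast to int64
print(is_integer_valued(normalized))   # casting would lose information

# What a plain NumPy cast does to non-integer values: fractions are dropped.
print(normalized.astype("int64"))
```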

Output#

The returned object is an AnnData instance with:

  • adata.X → count matrix (samples × genes)

  • adata.obs_names → sample IDs

  • adata.var_names → gene IDs

adata
# AnnData object with n_obs × n_vars = 120 × 18000

Notes#

  • This function does not perform normalization.

  • Gene identifiers are taken directly from the row labels or column headers of the input file, depending on orientation.

  • Downstream steps such as QC, normalization, PCA, clustering, etc. should be run separately.
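Because gene identifiers are taken as-is from the file, repeated IDs (for example after mapping to gene symbols) may carry over into the loaded object. How read_counts treats such duplicates is not documented here, so the following is only a sketch of a pre-flight check with pandas. duplicated_labels is a hypothetical helper and the IDs are made up.

```python
import pandas as pd


def duplicated_labels(labels):
    """Hypothetical helper: sorted list of labels that occur more than once."""
    index = pd.Index(labels)
    # duplicated() marks every occurrence after the first one.
    return sorted(set(index[index.duplicated()]))


gene_ids = ["geneA", "geneB", "geneA", "geneC", "geneB", "geneA"]
sample_ids = ["S1", "S2", "S3"]

print(duplicated_labels(gene_ids))
print(duplicated_labels(sample_ids))
```

On a loaded object the same check could be run as duplicated_labels(adata.var_names) or duplicated_labels(adata.obs_names).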