Read counts#
- bullkpy.io.read_counts(filename, *, sep='\t', orientation='genes_by_samples', dtype='int64')#
Read a bulk RNA-seq count matrix and return an AnnData object.
- Parameters:
filename – Path to the counts file (tsv/csv). Either the rows or the columns must be indexed by gene ID, as declared by orientation.
sep – Column separator (default: tab).
orientation – Layout of the input matrix:
“genes_by_samples”: genes in rows, samples in columns (common output of HTSeq, featureCounts)
“samples_by_genes”: samples in rows, genes in columns
dtype – Cast count matrix to dtype (default: int64). Set to None to disable casting.
- Returns:
AnnData – AnnData object with samples in .obs and genes in .var.
bk.io.read_counts is the main entry point for loading raw count matrices
generated by HTSeq, featureCounts, or similar RNA-seq quantification pipelines.
Purpose#
This function reads a gene–sample count matrix from disk and returns
an AnnData object with:
samples stored in .obs
genes stored in .var
raw counts stored in .X
Either matrix orientation is accepted, and both tab-separated and comma-separated text files can be read.
Supported input formats#
TSV / CSV text files
Genes in rows or genes in columns
Integer or numeric count values
Typical inputs include:
featureCounts.txt
htseq-count.tsv
custom gene × sample matrices
Basic usage#
Genes in rows, samples in columns (most common)#
import bullkpy as bk
adata = bk.io.read_counts(
"counts.tsv",
orientation="genes_by_samples",
)
This corresponds to matrices where rows are genes and columns are samples (e.g. HTSeq, featureCounts default output).
Samples in rows, genes in columns#
adata = bk.io.read_counts(
"counts.tsv",
orientation="samples_by_genes",
)
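The two orientations differ only in which axis carries the genes. The sketch below illustrates the idea with pandas on a made-up three-sample matrix. It is not bullkpy's implementation: `load_samples_by_genes` and the gene and sample names are hypothetical, and the only assumption taken from this page is that the result is laid out as samples × genes.

```python
import io

import pandas as pd

# Toy genes-by-samples matrix (tab-separated); all names are invented.
TEXT = "gene_id\tS1\tS2\tS3\nGeneA\t10\t0\t5\nGeneB\t3\t7\t1\n"


def load_samples_by_genes(buffer, sep="\t", orientation="genes_by_samples"):
    """Hypothetical helper (not part of bullkpy): return a samples x genes table."""
    df = pd.read_csv(buffer, sep=sep, index_col=0)
    if orientation == "genes_by_samples":
        # Genes are in rows, so transpose to put samples in rows.
        df = df.T
    elif orientation != "samples_by_genes":
        raise ValueError(f"unknown orientation: {orientation!r}")
    return df


table = load_samples_by_genes(io.StringIO(TEXT))
print(table.shape)          # rows are samples, columns are genes
print(list(table.index))    # sample IDs
print(list(table.columns))  # gene IDs
```

Passing the wrong orientation does not necessarily raise an error; it simply swaps the roles of samples and genes, so checking the dimensions of the result is a cheap safeguard.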
Key parameters#
filename
Path to the counts file (tsv or csv).
sep
Column separator for tsv/csv files (default: tab).
orientation
Layout of the input matrix:
“genes_by_samples”: genes in rows, samples in columns (default)
“samples_by_genes”: samples in rows, genes in columns
dtype
Data type the count matrix is cast to (default: int64). Set to None to disable casting.
File separators#
By default, tab-separated files are assumed:
adata = bk.io.read_counts("counts.tsv", sep="\t")
For comma-separated files:
adata = bk.io.read_counts("counts.csv", sep=",")
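When a pipeline produces a mix of file types, the separator can be chosen from the file extension before calling the reader. This is a minimal sketch; `guess_sep` is a hypothetical helper, not a bullkpy function, and it assumes that `.csv` files are comma-separated and everything else is tab-separated.

```python
from pathlib import Path


def guess_sep(filename):
    """Hypothetical helper: pick a column separator from the file extension."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".csv":
        return ","
    # Assumption: .tsv, .txt and other extensions hold tab-separated data.
    return "\t"


print(repr(guess_sep("counts.csv")))
print(repr(guess_sep("counts.tsv")))

# Intended use (commented out because it needs bullkpy and a real file):
# adata = bk.io.read_counts(path, sep=guess_sep(path))
```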
Data types#
By default, the count matrix is cast to int64:
adata = bk.io.read_counts("counts.tsv", dtype="int64")
If your input is already normalized or contains non-integer values, disable casting:
adata = bk.io.read_counts("matrix.tsv", dtype=None)
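One way to decide between the two settings is to check whether the values are whole, non-negative numbers before loading. The sketch below uses NumPy; `looks_like_counts` and the example values are hypothetical, and how bullkpy itself handles non-integer input during casting is not specified on this page.

```python
import numpy as np


def looks_like_counts(values):
    """Hypothetical check: True if all values are finite, non-negative integers."""
    arr = np.asarray(values, dtype=float)
    return bool(
        np.all(np.isfinite(arr))
        and np.all(arr >= 0)
        and np.all(arr == np.floor(arr))
    )


raw = [[10, 0], [3, 7]]                 # made-up raw counts
normalized = [[1.5, 0.0], [0.25, 2.75]]  # made-up normalized values

for name, matrix in [("raw", raw), ("normalized", normalized)]:
    dtype = "int64" if looks_like_counts(matrix) else None
    print(name, "-> dtype =", dtype)

# Intended use (commented out because it needs bullkpy and a real file):
# adata = bk.io.read_counts("matrix.tsv", dtype=dtype)
```

The check matters because a plain NumPy cast from float to int64 discards the fractional part, which would silently corrupt normalized values.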
Output#
The returned object is an AnnData instance with:
adata.X → count matrix (samples × genes)
adata.obs_names → sample IDs
adata.var_names → gene IDs
adata
# AnnData object with n_obs × n_vars = 120 × 18000
Notes#
This function does not perform normalization.
Gene identifiers are taken directly from the input file's index or column labels.
Downstream steps such as QC, normalization, PCA, clustering, etc. should be run separately.