API Reference
Complete reference for every public function and class in HarmonizePy.
from harmonizepy import harmonize, combat, remove_batch_effect, HarmonizeConfig
from harmonizepy import adjust_combat, adjust_limma

harmonize() is the main pipeline entry point. It handles structural missingness, sorts and blocks batches, groups features by affiliation, corrects each group, and reassembles the output.
harmonize(
data,
description,
*,
config=None,
algorithm="ComBat",
combat_mode=1,
needed_values=None,
sort=None,
block=None,
unique_removal=True,
output_file=None,
) -> pd.DataFrame

| Parameter | Type | Default | Description |
|---|---|---|---|
| data | DataFrame, str, or Path | required | Features x samples matrix. If a path, read from that TSV/CSV/Parquet file. Feature names must be in df.index, sample names in df.columns. |
| description | DataFrame, str, or Path | required | Batch description. Must have columns ID, sample, batch. If a path, read from that CSV file. |
| config | HarmonizeConfig or None | None | If provided, all algorithm settings come from this object. Individual keyword arguments below are ignored. |
| algorithm | "ComBat" or "limma" | "ComBat" | Batch correction algorithm. |
| combat_mode | int (1-4) | 1 | ComBat variant. Ignored when algorithm="limma". |
| needed_values | int or None | None | Minimum non-missing observations per batch for a feature to qualify. None auto-selects: 2 for modes 1, 3 and limma; 1 for modes 2, 4. |
| sort | "sparsity", "jaccard", "seriation", or None | None | Batch sorting strategy applied before blocking. |
| block | int >= 2 or None | None | Group this many consecutive sorted batches into one block. |
| unique_removal | bool | True | Rescue singleton-affiliation features by cropping to the nearest shared pattern. |
| output_file | str, Path, or None | None | Write the corrected matrix to this path. Format inferred from extension: .tsv, .csv, .parquet, .pq. |
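The auto-selection rule for needed_values can be written out explicitly. The sketch below is a minimal illustration of the rule as stated in the table; `default_needed_values` is a hypothetical helper name, not a HarmonizePy function, and the library's internal implementation may differ.

```python
def default_needed_values(algorithm="ComBat", combat_mode=1):
    """Hypothetical helper mirroring the documented auto-selection rule.

    Returns 2 for limma and ComBat modes 1 and 3, and 1 for ComBat modes 2 and 4.
    """
    if algorithm == "limma":
        # combat_mode is ignored when the algorithm is limma
        return 2
    if algorithm != "ComBat":
        raise ValueError(f"unknown algorithm: {algorithm!r}")
    if combat_mode in (1, 3):
        return 2
    if combat_mode in (2, 4):
        return 1
    raise ValueError(f"combat_mode must be 1-4, got {combat_mode!r}")


for mode in (1, 2, 3, 4):
    print(mode, default_needed_values("ComBat", mode))
```

Passing an explicit integer as needed_values to harmonize() bypasses this default.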
Returns a pd.DataFrame with the same shape, index, and column order as the input. NaN positions are preserved where data was insufficient for correction.
Raises ValueError on invalid arguments, mismatched sample IDs, or duplicate feature names.
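The "groups features by affiliation" step in the pipeline description can be pictured as follows. This is a sketch under an assumption: that a feature's affiliation is the set of batches in which it has at least needed_values non-missing observations. `group_by_affiliation` is a hypothetical name for illustration, written with plain Python containers rather than DataFrames, and it is not HarmonizePy's actual code.

```python
import math
from collections import defaultdict


def group_by_affiliation(rows, batch_labels, needed_values=2):
    """Group feature names by the batches in which they have enough values.

    rows: dict mapping feature name -> list of values (NaN marks a missing cell)
    batch_labels: one batch label per sample, same length as each row
    """
    batches = sorted(set(batch_labels))
    groups = defaultdict(list)
    for feature, values in rows.items():
        # Count non-missing observations of this feature in each batch
        counts = dict.fromkeys(batches, 0)
        for value, label in zip(values, batch_labels):
            if not math.isnan(value):
                counts[label] += 1
        # Assumed definition: the affiliation is the tuple of qualifying batches
        affiliation = tuple(b for b in batches if counts[b] >= needed_values)
        groups[affiliation].append(feature)
    return dict(groups)


nan = math.nan
rows = {
    "P1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],  # present in every batch
    "P2": [1.0, 2.0, nan, nan, 5.0, 6.0],  # missing entirely from batch 2
    "P3": [nan, 2.0, 3.0, 4.0, 5.0, 6.0],  # only one value in batch 1
}
batch_labels = [1, 1, 2, 2, 3, 3]
groups = group_by_affiliation(rows, batch_labels)
print(groups)
```

Under this assumed definition, each group can be corrected on its own sub-matrix, using only the batches listed in its affiliation.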
from harmonizepy import harmonize
# From file paths
result = harmonize("data.tsv", "batch.csv")
# From DataFrames
result = harmonize(data_df, desc_df, algorithm="limma")
# Sort and block for many-batch datasets
result = harmonize(data_df, desc_df, sort="sparsity", block=2)
# Non-parametric ComBat for small batches
result = harmonize(data_df, desc_df, combat_mode=3, needed_values=1)
# Write output directly
result = harmonize("data.tsv", "batch.csv", output_file="corrected.parquet")

HarmonizeConfig is a frozen dataclass for storing a reproducible run configuration. Pass it as config= to harmonize() to override all individual keyword arguments.
from harmonizepy import HarmonizeConfig
cfg = HarmonizeConfig(
algorithm="ComBat",
combat_mode=1,
sort_strategy="sparsity",
block_size=2,
unique_removal=True,
)

| Field | Type | Default | Description |
|---|---|---|---|
| algorithm | str | "ComBat" | "ComBat" or "limma" |
| combat_mode | int | 1 | ComBat mode 1-4 |
| needed_values | int or None | None | Minimum observations per batch (auto-selected if None) |
| sort_strategy | str or None | None | "sparsity", "jaccard", "seriation", or None |
| block_size | int or None | None | Block size >= 2, or None |
| unique_removal | bool | True | Rescue singleton features |
HarmonizeConfig is frozen and slots-based. Once created, fields cannot be changed. Create a new instance to use different settings.
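Because the configuration is immutable, the usual way to derive a variant of a frozen dataclass is `dataclasses.replace`, which builds a new instance with selected fields changed. The sketch below uses `DemoConfig`, a stand-in class that copies the documented field names so the example runs without HarmonizePy installed. It is assumed, not confirmed here, that `replace` behaves the same way on the real HarmonizeConfig.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class DemoConfig:
    """Stand-in with the documented field names; not the real HarmonizeConfig."""
    algorithm: str = "ComBat"
    combat_mode: int = 1
    needed_values: Optional[int] = None
    sort_strategy: Optional[str] = None
    block_size: Optional[int] = None
    unique_removal: bool = True


base = DemoConfig(sort_strategy="sparsity", block_size=2)

# Assigning to a field of a frozen dataclass raises FrozenInstanceError
try:
    base.combat_mode = 3
    mutated = True
except FrozenInstanceError:
    mutated = False

# replace() returns a new instance and leaves the original untouched
variant = replace(base, combat_mode=3)
print(mutated, base.combat_mode, variant.combat_mode)
```

Keeping the base configuration unchanged and deriving variants from it makes it easy to record exactly which settings produced each run.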
combat() is the low-level ComBat engine. It operates directly on NumPy arrays and does not handle structural missingness; for that, use harmonize().
combat(
data,
batch,
*,
par_prior=True,
mean_only=False,
ref_batch=None,
) -> ndarray

| Parameter | Type | Default | Description |
|---|---|---|---|
| data | ndarray, shape (n_features, n_samples) | required | Expression matrix. Per-cell NaN is allowed. |
| batch | array-like, shape (n_samples,) | required | Integer batch label per sample. |
| par_prior | bool | True | Use parametric prior (True = modes 1/2, False = modes 3/4). |
| mean_only | bool | False | Correct location only, skip scale adjustment (True = modes 2/4). |
| ref_batch | int or None | None | Reference batch to leave unadjusted. Other batches are adjusted toward this reference. |
Returns an ndarray with the same shape as data. NaN positions are preserved.
import numpy as np
from harmonizepy import combat
data = np.random.randn(1000, 30) # 1000 features, 30 samples
batch = np.array([1]*10 + [2]*10 + [3]*10)
corrected = combat(data, batch)
corrected_np = combat(data, batch, par_prior=False) # non-parametric
corrected_loc = combat(data, batch, mean_only=True) # location only

remove_batch_effect() is the low-level limma engine. It operates directly on NumPy arrays and does not handle structural missingness; for that, use harmonize().
remove_batch_effect(data, batch) -> ndarray

data is an ndarray of shape (n_features, n_samples). Per-cell NaN is allowed. batch is a 1-D array-like of integer batch labels, length n_samples.
Returns an ndarray with the same shape as data. NaN positions are preserved.
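import numpy as np
from harmonizepy import remove_batch_effect
data = np.random.randn(1000, 30) # 1000 features, 30 samples
batch = np.array([1]*10 + [2]*10 + [3]*10)
corrected = remove_batch_effect(data, batch)

adjust_combat() is a wrapper that maps HarmonizR-style integer modes (1-4) to the combat() parameters. It operates on DataFrames rather than raw ndarrays.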
adjust_combat(sub_df, batch_labels, mode=1, ref_batch=None) -> pd.DataFrame

sub_df is a features x samples DataFrame. Returns a corrected DataFrame with the same shape, index, and columns. Intended for custom pipeline use when you are managing sub-matrices yourself.
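The mode-to-parameter mapping follows from the combat() parameter table: par_prior is True for modes 1 and 2, and mean_only is True for modes 2 and 4. The lookup below is a sketch derived from that table; `MODE_TO_KWARGS` and `combat_kwargs_for_mode` are hypothetical names, and how adjust_combat() performs the translation internally is not shown in this reference.

```python
# Mapping inferred from the combat() parameter table:
#   par_prior=True  -> modes 1 and 2 (parametric prior)
#   mean_only=True  -> modes 2 and 4 (location-only correction)
MODE_TO_KWARGS = {
    1: {"par_prior": True, "mean_only": False},
    2: {"par_prior": True, "mean_only": True},
    3: {"par_prior": False, "mean_only": False},
    4: {"par_prior": False, "mean_only": True},
}


def combat_kwargs_for_mode(mode):
    """Hypothetical helper: translate an integer mode into combat() keywords."""
    try:
        # Return a copy so callers cannot mutate the shared table
        return dict(MODE_TO_KWARGS[mode])
    except KeyError:
        raise ValueError(f"mode must be 1, 2, 3 or 4, got {mode!r}") from None


for mode in sorted(MODE_TO_KWARGS):
    print(mode, combat_kwargs_for_mode(mode))
```

With the real library, the resulting dictionary would be passed along as `combat(data, batch, **kwargs)`.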
adjust_limma() is a wrapper that applies limma correction to a DataFrame.
adjust_limma(sub_df, batch_labels) -> pd.DataFrame

sub_df is a features x samples DataFrame. Returns a corrected DataFrame with the same shape, index, and columns.
The following helper functions are importable from their submodules for building custom pipelines. They are not re-exported from the top-level harmonizepy namespace.
from harmonizepy.sorting import sort_batches
from harmonizepy.blocking import build_block_list
from harmonizepy.affiliation import remove_unique_combinations

| Function | Module | Description |
|---|---|---|
| sort_batches(data, batch_labels, strategy) | harmonizepy.sorting | Returns a reordered copy of batch_labels according to the chosen strategy. |
| build_block_list(batch_labels, block_size) | harmonizepy.blocking | Returns a list of batch-group assignments for the given block size. |
| remove_unique_combinations(affiliation_list) | harmonizepy.affiliation | Rescues singleton-affiliation features by cropping to the nearest shared pattern. |
These interfaces are stable for the current version but are not covered by the same backward-compatibility guarantee as the top-level public API.
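To make the blocking idea concrete, the sketch below groups consecutive batches from an already sorted order into blocks of a given size. `block_consecutive` is a hypothetical function, not build_block_list(), whose exact return format is not specified here. The handling of a leftover batch (it forms its own smaller block) is an assumption.

```python
def block_consecutive(sorted_batches, block_size):
    """Hypothetical helper: map each batch to the index of its block.

    sorted_batches: batch labels in the order produced by the sorting step
    block_size: number of consecutive batches per block (must be >= 2)
    """
    if block_size < 2:
        raise ValueError("block_size must be >= 2")
    # Integer division assigns positions 0..k-1 to block 0, k..2k-1 to block 1, ...
    # Assumption: a trailing remainder simply becomes a smaller final block.
    return {batch: i // block_size for i, batch in enumerate(sorted_batches)}


blocks = block_consecutive(["B3", "B1", "B4", "B2", "B5"], block_size=2)
print(blocks)
```

Samples from batches that share a block index would then be treated as one batch when features are grouped and corrected.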