Normalization factors and DESeqDataSetFromTximport equivalent #412

Draft
maltekuehl wants to merge 4 commits into scverse:main from complextissue:main

@maltekuehl (Collaborator) commented Sep 13, 2025

Fixes #305 and closes #359. Adds support for normalization factors based on length provided by pytximport. Very much still a draft.

Open issues:

  • Is the example dataset okay? Others seem to be synthetic and much smaller. Do you have some simple or synthetic source data that could be used to create an AnnData object with pytximport? Once clear, we should probably also add tests for correctness, comparing against a reference to prevent accidental future drift.
  • I have yet to figure out how size factors should be implemented when normalization factors are also being calculated. Should they match the size factors of the standard case, or be some transform of the norm-matrix-adjusted counts? The same question applies to logmeans, as I do not fully understand where else this data is used throughout the code. Any help would be appreciated.
  • The statistical testing structure in PyDESeq2 differs quite a bit from DESeq2, and I may not have grasped all the subtleties. I would be thankful for help from maintainers to ensure that everything was adjusted.
  • The initial scaffold of this PR was LLM-generated, and while I provided ample context (including the issues and relevant code from DESeq2) and clear instructions and have checked and already adjusted the output quite a bit, it would be best to examine these changes critically.
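
For reference on the size-factor question above, here is a minimal NumPy sketch of how DESeq2's `estimateNormFactors` appears to combine a length matrix with median-of-ratios size factors. It is an illustration, not the code in this PR: the function names `median_of_ratios` and `estimate_norm_factors_sketch` are hypothetical, matrices are assumed to be samples × genes with strictly positive lengths, and the DESeq2 behaviour is reproduced from memory of its R source. To the best of my understanding, DESeq2 then uses the resulting matrix in place of size factors rather than alongside them.

```python
import numpy as np


def median_of_ratios(counts):
    """Median-of-ratios size factors for a samples x genes matrix."""
    with np.errstate(divide="ignore"):
        log_counts = np.log(counts)
    # Per-gene log geometric mean; -inf for any gene with a zero count.
    log_geo_means = log_counts.mean(axis=0)
    usable = np.isfinite(log_geo_means)
    log_ratios = log_counts[:, usable] - log_geo_means[usable]
    return np.exp(np.median(log_ratios, axis=1))


def estimate_norm_factors_sketch(counts, lengths):
    """Hypothetical helper: per-sample, per-gene normalization factors."""
    # Centre each gene's lengths on their geometric mean across samples.
    norm_matrix = lengths / np.exp(np.log(lengths).mean(axis=0))
    # Size factors are estimated on the length-adjusted counts.
    size_factors = median_of_ratios(counts / norm_matrix)
    norm_factors = norm_matrix * size_factors[:, None]
    # Rescale so every gene's factors have geometric mean 1.
    norm_factors = norm_factors / np.exp(np.log(norm_factors).mean(axis=0))
    return norm_factors, size_factors
```

When all lengths are equal, the result reduces to the ordinary size factors divided by their geometric mean, which is one way to sanity-check an implementation against the standard case.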

Aside: Tests pass locally but fail here due to an older AnnData version on Python 3.10. Would you be open to this PR also updating pre-commit and GitHub Actions, moving fully to uv/hatch/ruff like other scverse ecosystem packages, and targeting Python 3.11-3.13?

CC @BorisMuzellec

@BorisMuzellec (Collaborator)

Hi @maltekuehl, sorry I haven't had the time to review your PR yet.

To answer your last remark: I'm all for switching to uv for package management. I don't see much of an issue in dropping support for Python 3.10 starting from v0.5.3, as many packages (e.g., numpy) have also stopped supporting it in their latest releases.

Please go ahead if you wish to handle those changes! I'd just suggest you make them in a separate PR.

I'll try to have a look at this PR ASAP.

[pre-commit.ci] pre-commit autoupdate (scverse#415)

Copilot AI left a comment

Pull request overview

This PR adds support for normalization factors from pytximport, implementing functionality equivalent to DESeq2's DESeqDataSetFromTximport. It enables gene-length correction for transcript-level quantification data (e.g., from Salmon/Kallisto/RSEM).

Changes:

  • Adds from_pytximport parameter to DeseqDataSet with validation, normalization factor computation in fit_size_factors, and propagation through IRLS/Wald test/LFC shrinkage
  • Adds estimate_norm_factors function in preprocessing and updates irls_solver to accept per-gene normalization factors
  • Adds tests, example notebook, and documentation for the pytximport integration
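
To illustrate what "per-gene normalization factors" changes in the model, the sketch below computes the expected counts of a log-link GLM with either a per-sample size factor or a samples × genes factor matrix. The function `nb_mean` and its argument names are hypothetical and are not taken from PyDESeq2's `irls_solver`; the only point is the difference in broadcasting.

```python
import numpy as np


def nb_mean(design, beta, size_factors=None, normalization_factors=None):
    """Expected counts (samples x genes) under a log-link GLM.

    design: (n_samples, n_coefs), beta: (n_coefs, n_genes).
    Exactly one of size_factors (n_samples,) or
    normalization_factors (n_samples, n_genes) should be given.
    """
    linear_predictor = design @ beta
    if normalization_factors is not None:
        # One scaling factor per sample and gene.
        scale = np.asarray(normalization_factors, dtype=float)
    else:
        # One scaling factor per sample, broadcast across genes.
        scale = np.asarray(size_factors, dtype=float)[:, None]
    return scale * np.exp(linear_predictor)
```

Under this formulation, size factors are the special case in which every column of the factor matrix is identical.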

Reviewed changes

Copilot reviewed 12 out of 17 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| pydeseq2/dds.py | Adds from_pytximport flag, validation, normalization factor computation in fit_size_factors, and passes norm factors to IRLS |
| pydeseq2/ds.py | Updates Wald test and LFC shrinkage to use normalization factors when available |
| pydeseq2/preprocessing.py | Adds estimate_norm_factors function implementing DESeq2's estimateNormFactors |
| pydeseq2/utils.py | Updates irls_solver to accept and use per-gene normalization factors |
| pydeseq2/default_inference.py | Passes per-gene normalization factors through to irls_solver |
| pydeseq2/inference.py | Adds normalization_factors parameter to abstract irls method |
| tests/test_pytximport.py | Tests for detection, validation, normalization computation, and full pipeline |
| examples/plot_pytximport_example.py | Example notebook demonstrating pytximport integration |
| docs/source/index.rst | Documents pytximport support |
| docs/source/refs.bib | Adds pytximport citation |
| pyproject.toml | Updates ruff config to newer tool.ruff.lint format |
| docs/source/.DS_Store | Accidentally committed macOS metadata file |
| .gitignore | Adds sphinx auto_examples and ruff cache |

Comment on lines +747 to +754
# Store sample-wise geometric mean of norm factors as size_factors
# This maintains compatibility with existing code while incorporating
# the gene-length correction
# with np.errstate(divide="ignore"):
# log_norm_factors = np.log(norm_factors)
# log_geom_mean_per_sample = np.mean(log_norm_factors, axis=1)
# self.obs["size_factors"] = np.exp(log_geom_mean_per_sample)

# integer counts and that pytximport was used with counts_from_abundance=None
# (raw counts) to generate the AnnData object.

adata = ad.read_h5ad("../tests/data/pytximport/test_pytximport.h5ad")
# When pytximport data is used, PyDESeq2 computes normalization factors
# that account for both library size and gene length differences.

dds_explicit.fit_size_factors()
self.from_pytximport = from_pytximport

if self.from_pytximport:
    print("Detected pytximport data with length offsets.")

offset = np.log(self.dds.obs["size_factors"]).values

if "normalization_factors" in self.dds.obsm:
    offset = np.log(self.dds.obsm["normalization_factors"])
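
The fragments above touch two related ideas: choosing the GLM offset from `normalization_factors` when that matrix is present, and summarising the matrix as one value per sample (the commented-out geometric mean). The sketch below shows both with plain dictionaries standing in for `dds.obs` and `dds.obsm`. The helper names `resolve_log_offset` and `samplewise_geometric_mean` are hypothetical, and returning a 2-D offset in both branches is a design assumption made here so that callers can broadcast without checking shapes.

```python
import numpy as np


def resolve_log_offset(obs, obsm):
    """Return a log offset that broadcasts against a samples x genes matrix.

    obs and obsm are plain mappings used as stand-ins for AnnData fields.
    """
    if "normalization_factors" in obsm:
        # Per-sample, per-gene offset: shape (n_samples, n_genes).
        return np.log(np.asarray(obsm["normalization_factors"], dtype=float))
    # Per-sample offset, kept 2-D: shape (n_samples, 1).
    return np.log(np.asarray(obs["size_factors"], dtype=float))[:, None]


def samplewise_geometric_mean(norm_factors):
    """One summary value per sample: geometric mean across genes."""
    return np.exp(np.log(np.asarray(norm_factors, dtype=float)).mean(axis=1))
```

Note that the original fragment yields a 1-D offset in the size-factor case and a 2-D one in the normalization-factor case, so any downstream code that indexes the offset per sample would need to handle both shapes.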
Development

Successfully merging this pull request may close these issues.

  • Plans to add DESeqDataSetFromTximport?
  • Add support for sample-/gene-dependent normalization factors (e.g., length offsets from pytximport)