diff --git a/BUG_REVIEW.md b/BUG_REVIEW.md new file mode 100644 index 00000000..46a90f2e --- /dev/null +++ b/BUG_REVIEW.md @@ -0,0 +1,72 @@ +# msgf-rust bug review (2026-05-23) + +Branch: `review/bug-hunt` (from `master` @ 18360a3d) + +Systematic review of the Rust MS-GF+ port: static analysis of critical paths, +full `cargo test --release --workspace`, and targeted code reading. + +## Fixed in this branch + +| ID | Severity | Location | Issue | Fix | +|---|---|---|---|---| +| B1 | **Critical** | `msgf-rust.rs` `send_chunks` | Bench cap (`--max-spectra N`) truncated the final partial chunk to zero when `total == N` (e.g. N=100 with chunk size 5000 → empty output). | Removed erroneous tail `truncate` block; loop already stops at cap. | +| B2 | **High** | `msgf-rust.rs` param routing | Activation auto-detect was gated on `instrument == low-res`, so `--fragmentation auto --instrument QExactive` on mzML skipped peek and resolved to CID params for HCD data. | Gate auto-route on `fragmentation == auto` + mzML extension only. | +| B3 | **High** | `msgf-rust.rs` TSV write | `write_tsv(..., is_mgf=true)` always emitted MGF layout (extra `Title` column) even for mzML inputs. | Pass `!is_mzml`. | +| B4 | **High** | `match_engine.rs` GF | SpecE GF graph used `start_offset == 0` for protein N-term instead of `cand.is_protein_n_term`, breaking Met-cleaved N-termini at offset 1. | Use `cand.is_protein_n_term` / `is_protein_c_term`. | +| B5 | **Medium** | `tsv.rs` | `IsotopeError` column hardcoded to 0 while PIN writes `psm.isotope_offset`. | Thread isotope offset from PSM. | +| B6 | **Medium** | `msgf-rust.rs` CLI | Inverted `--charge-min/--charge-max` or isotope ranges produced empty ranges with no error. | Validate at startup and return clear error. | +| B7 | **High** | `match_engine.rs` dedup | Dedup used bare sequence + pin score; merged mod variants incorrectly. | Mod-aware pepSeq key + `rank_score`. | +| B8 | **Medium** | `match_engine.rs` dedup | HashMap survivor order was nondeterministic. | `BTreeMap` + best-`rank_score` survivor rule. | + +## Open — not fixed (documented for follow-up) + +| ID | Severity | Location | Issue | +|---|---|---|---| +| B9 | **Low** | `sa_walk.rs` | Test-only SA walk helper does not enforce `max_missed_cleavages`; production search uses `candidate_gen::enumerate_candidates`, which does. | +| B10 | **High** | `mzml.rs` `Iterator::next` | First per-spectrum parse error sets `done=true` and aborts the entire file; remaining spectra are silently skipped. | +| B11 | **Low** | `sa_walk.rs` Met pass | Dedupes Met-cleaved peptides on residue bytes only, collapsing distinct C-terminal contexts. | + +## Known test failures (pre-existing, CI-skipped) + +These fail on `master` without the 7 CI skip flags; tracked as parity/min_peaks regressions: + +- `match_engine_smoke::known_peptide_appears_in_top_n` +- `match_engine_smoke::charge_missing_spectrum_uses_per_charge_scored_spec` +- `match_engine_smoke::spectrum_without_charge_tries_charge_range` +- Maven fixture loads, thread-determinism test (see `.github/workflows/ci.yml`) + +## Verification + +```bash +cargo test --release --workspace -- \ + --skip charge_missing_spectrum_uses_per_charge_scored_spec \ + --skip spectrum_without_charge_tries_charge_range \ + --skip known_peptide_appears_in_top_n \ + --skip read_bsa_canno_text_format \ + --skip read_tryp_pig_bov_revcat_csarr_cnlcp \ + --skip tryp_pig_bov_revcat_full_set_loads \ + --skip match_spectra_output_invariant_across_thread_counts +``` + +## Performance (dedup pass) + +- PepSeq dedup keys use integer mod units + `Arc` cache per candidate (avoids repeated string formatting). +- Per-charge `TopNQueue` map uses `FxHashMap` (typically 1–3 charges per spectrum). + +## Documentation review (2026-05-24) + +Fixes applied on this branch: + +| Issue | Location | Fix | +|---|---|---| +| PIN column count said "28" | `README.md` | Corrected to 36 (default charge 2–3) + EdgeScore note | +| Auto-detect described "first spectrum" only | `README.md` | First 64 MS2 histogram; `--instrument` does not gate peek | +| Auto-detect required `--instrument low-res` | `DOCS.md` §4 | Matches code: only `--fragmentation auto` + mzML | +| TSV `IsotopeError` documented as always 0 | `DOCS.md` §3b | Updated after B5 fix | +| Broken `known-divergences.md` links | `README.md`, `DOCS.md` §8d | Legacy file removed in iter39; point to §8d / tests | +| Inverted charge/isotope ranges undocumented | `DOCS.md` §1 | Startup validation documented | + +**Still stale (not fixed here):** + +- `benchmark/ci/README.md` — references Java Maven workflow; no Rust benchmark workflow in `.github/workflows/` yet. +- `.claude/CLAUDE.md` — Java-tree context; accurate on `java-legacy` branch only. diff --git a/README.md b/README.md index f57bc6da..f3a1b553 100644 --- a/README.md +++ b/README.md @@ -66,7 +66,7 @@ msgf-rust \ This runs a tryptic search at 20 ppm precursor tolerance with the bundled HCD_QExactive_Tryp scoring model, writes Percolator-format PSMs to `out.pin`, and prints per-phase timings to stderr. Feed `out.pin` directly into Percolator (Docker or native) to compute q-values. -A row in `out.pin` is one peptide–spectrum match with 28 columns: `SpecId`, `Label`, `ScanNr`, charge one-hot encoding, then features like `RawScore`, `lnSpecEValue`, `DeNovoScore`, ion-current ratios, peptide-length stats, etc. Full column reference: `DOCS.md` §3a. +A row in `out.pin` is one peptide–spectrum match. With the default charge range (2–3), each row has **36 tab-separated columns**: 35 Java-parity Percolator features plus Rust-only `EdgeScore` (inserted before `Peptide`). Charge one-hot columns scale with `[--charge-min, --charge-max]`. Full column reference: `DOCS.md` §3a. ## Common workflows @@ -131,11 +131,11 @@ Run `msgf-rust --help` for the auto-generated help with full descriptions. ## Auto-detection -For mzML inputs, msgf-rust reads the activation block of the first MS2 spectrum and selects a bundled `.param` file accordingly. The detection covers HCD/CID/ETD/UVPD activation and LowRes/HighRes/TOF/QExactive instrument classes (via mzML CV params). The bundled model is then resolved from `(fragmentation, instrument, protocol)`. MGF files have no activation metadata, so they go through the CLI defaults (which can be overridden with explicit `--fragmentation` / `--instrument` flags). Full resolution table: `DOCS.md` §4. +For mzML inputs with `--fragmentation auto` (the default), msgf-rust peeks the first 64 MS2 spectra, histograms activation methods and analyzer types, and selects a bundled `.param` file from the dominant values. The `--instrument` CLI flag is **not** required for this path — instrument class is read from the mzML when possible. `--protocol` from the CLI is still applied when resolving the bundled model. MGF files have no activation metadata, so they use flag-based resolution (defaulting to `HCD_QExactive_Tryp.param`). Full resolution table: `DOCS.md` §4. ## Parity vs Java MS-GF+ -PIN output columns are bit-exact with Java MS-GF+ on the agreement bucket (same scan + same top-1 peptide) for most features. Three residual divergences exist as deferred research: `lnEValue` (num_distinct semantics), `MeanRelErrorTop7` (error-stat normalization), and the BSA charge-3 SEV gap from the deconvolution-implementation difference (`known-divergences.md` item #3, kept on the development branch). None gate cutover; aggregate 1% FDR PSM counts beat Java on all three benchmark datasets. Full detail: `DOCS.md` §8d. +PIN output columns are bit-exact with Java MS-GF+ on the agreement bucket (same scan + same top-1 peptide) for most features. Three residual divergences exist as deferred research: `lnEValue` (num_distinct semantics), `MeanRelErrorTop7` (error-stat normalization), and the BSA charge-3 SEV gap from deconvolution-implementation differences. None gate cutover; aggregate 1% FDR PSM counts beat Java on all three benchmark datasets. Full detail: `DOCS.md` §8d. ## Citation diff --git a/benchmark/ci/README.md b/benchmark/ci/README.md index 4375bbc3..4114618c 100644 --- a/benchmark/ci/README.md +++ b/benchmark/ci/README.md @@ -1,6 +1,8 @@ # CI benchmark (PXD001819) -- **Workflow:** `.github/workflows/benchmark-pxd001819.yml` (`workflow_dispatch`) +> **Note:** This scaffold targets the Java MS-GF+ tree (`mvn`, mzIdentML metrics). The Rust port (`msgf-rust`) uses `.github/workflows/ci.yml` for tests but does not yet wire this benchmark harness. See [`benchmark/README.md`](../README.md) for scope. + +- **Workflow:** `.github/workflows/benchmark-pxd001819.yml` (`workflow_dispatch`) — Java branch only - **Run locally:** `mvn -B package -DskipTests && bash benchmark/ci/PXD001819/run_ci.sh` - **Compare:** `python3 benchmark/ci/PXD001819/compare_metrics.py benchmark/results/PXD001819/ci/ci_metrics.txt benchmark/ci/PXD001819/baseline.tsv` - **Self-test:** `python3 -m unittest benchmark.ci.PXD001819.test_compare_metrics` diff --git a/crates/search/src/match_engine.rs b/crates/search/src/match_engine.rs index 34e4a0f3..210bf6f8 100644 --- a/crates/search/src/match_engine.rs +++ b/crates/search/src/match_engine.rs @@ -3,6 +3,7 @@ use std::collections::{BTreeMap, HashMap}; use std::hash::Hasher; use std::sync::atomic::{AtomicU64, Ordering}; +use std::sync::Arc; // GF failure-mode diagnostics (2026-05-19). Module-level atomics // incremented per-bin from compute_spec_e_values_for_spectrum and @@ -716,13 +717,9 @@ fn compute_spec_e_values_for_spectrum( let mut any_c = false; for psm in queue.iter_psms() { let cand = &candidates[psm.primary_candidate_idx() as usize]; - if let Some(prot) = search_index.protein_at(cand.protein_index) { - let start = cand.start_offset_in_protein; - let pep_len = cand.peptide.length(); - if start == 0 { any_n = true; } - if start + pep_len >= prot.sequence.len() { any_c = true; } - if any_n && any_c { break; } - } + if cand.is_protein_n_term { any_n = true; } + if cand.is_protein_c_term { any_c = true; } + if any_n && any_c { break; } } (any_n, any_c) }; @@ -1402,51 +1399,200 @@ mod feature_tests { } } -/// Pre-merge dedup pass (R-2.2): collapse PSMs that share the same -/// (peptide_residue, rounded_score) key into a single entry, aggregating -/// their `candidate_idxs` into a unified Vec. Mirrors Java's -/// `DBScanner.java:719-733` `pepSeqMap` dedup. -/// -/// Called by the per-spectrum loop after the per-candidate scoring loop, -/// before per-charge GF compute (so SpecE is computed on the deduped set). -/// -/// Inputs: -/// - `psms`: drained from a per-charge `TopNQueue` via `drain_into_vec` -/// - `candidates`: the search's enumerated candidate slice; used to resolve -/// each PSM's peptide residue sequence for the dedup key -/// -/// Returns: deduped `Vec`. The caller re-pushes these into the -/// per-charge queue via `queue.push()` for each entry. +/// Pre-merge dedup pass (R-2.2): collapse PSMs sharing the same Java +/// `(pepSeq, score)` key before per-charge GF compute. pub(crate) fn dedup_pepseq_score( psms: Vec, candidates: &[Candidate], ) -> Vec { - use std::collections::HashMap; + use std::collections::btree_map::Entry; + use std::collections::BTreeMap; - // Key: (peptide_residue_bytes, rounded_score_i32) - // The residue sequence is the unmodified bare AA string, matching Java's - // `m.getPepSeq()` used as the dedup key (DBScanner.java:721). - let mut groups: HashMap<(Vec, i32), PsmMatch> = HashMap::new(); + let mut pep_key_cache: FxHashMap> = FxHashMap::default(); + let mut groups: BTreeMap = BTreeMap::new(); for psm in psms { - let cand = &candidates[psm.primary_candidate_idx() as usize]; - let pep_residues: Vec = cand.peptide.residues.iter().map(|aa| aa.residue).collect(); - let score_rounded = psm.score.round() as i32; - let key = (pep_residues, score_rounded); - - groups - .entry(key) - .and_modify(|existing| { - // Aggregate this PSM's indices into the surviving entry. - // Avoid duplicates if the same idx somehow appears twice. - for &idx in &psm.candidate_idxs { - if !existing.candidate_idxs.contains(&idx) { - existing.candidate_idxs.push(idx); - } + let primary = psm.primary_candidate_idx(); + let pep_key = pep_key_cache + .entry(primary) + .or_insert_with(|| Arc::new(PepDedupKey::from_peptide(&candidates[primary as usize].peptide))) + .clone(); + let key = DedupMapKey { + pep: pep_key, + score: psm.rank_score.round() as i32, + }; + + match groups.entry(key) { + Entry::Vacant(slot) => { + slot.insert(psm); + } + Entry::Occupied(mut slot) => { + let existing = slot.get_mut(); + merge_unique_candidate_idxs(&mut existing.candidate_idxs, &psm.candidate_idxs); + if psm.rank_score > existing.rank_score { + let merged_idxs = std::mem::take(&mut existing.candidate_idxs); + let mut survivor = psm; + merge_unique_candidate_idxs(&mut survivor.candidate_idxs, &merged_idxs); + *existing = survivor; } - }) - .or_insert(psm); + } + } } - groups.into_values().collect() + let mut out = Vec::with_capacity(groups.len()); + out.extend(groups.into_values()); + out +} + +/// Mod-aware dedup key: bare residues plus per-position mod mass (1e-5 Da units). +/// Matches Java pepSeq semantics without string formatting on the hot path. +#[derive(Clone, PartialEq, Eq, PartialOrd, Ord)] +struct PepDedupKey { + residues: Vec, + mod_units: Vec, +} + +impl PepDedupKey { + fn from_peptide(peptide: &model::Peptide) -> Self { + let len = peptide.residues.len(); + let mut residues = Vec::with_capacity(len); + let mut mod_units = Vec::with_capacity(len); + for aa in &peptide.residues { + residues.push(aa.residue); + mod_units.push( + aa.mod_ + .as_ref() + .map(|m| (m.mass_delta * 100_000.0).round() as i32) + .unwrap_or(0), + ); + } + Self { residues, mod_units } + } +} + +#[derive(Clone, PartialEq, Eq)] +struct DedupMapKey { + pep: Arc, + score: i32, +} + +impl PartialOrd for DedupMapKey { + fn partial_cmp(&self, other: &Self) -> Option { + Some(self.cmp(other)) + } +} + +impl Ord for DedupMapKey { + fn cmp(&self, other: &Self) -> std::cmp::Ordering { + self.pep + .residues + .cmp(&other.pep.residues) + .then_with(|| self.pep.mod_units.cmp(&other.pep.mod_units)) + .then(self.score.cmp(&other.score)) + } +} + +fn merge_unique_candidate_idxs(into: &mut Vec, from: &[u32]) { + for &idx in from { + if !into.contains(&idx) { + into.push(idx); + } + } +} + +#[cfg(test)] +mod dedup_tests { + use super::*; + use std::sync::Arc; + use model::amino_acid::AminoAcid; + use model::modification::{ModLocation, Modification}; + use model::peptide::Peptide; + use model::ResidueSpec; + use crate::psm::PsmMatch; + + fn seq_peptide(bytes: &[u8]) -> Peptide { + let residues: Vec = bytes + .iter() + .filter_map(|&b| AminoAcid::standard(b)) + .collect(); + Peptide::new(residues, b'R', b'K') + } + + fn cand_with_peptide(peptide: Peptide) -> Candidate { + Candidate { + peptide, + protein_index: 0, + start_offset_in_protein: 0, + is_decoy: false, + is_protein_n_term: false, + is_protein_c_term: false, + } + } + + fn psm(primary: u32, rank: f32, pin: f32) -> PsmMatch { + PsmMatch { + spectrum_idx: 0, + candidate_idxs: vec![primary], + charge_used: 2, + mass_error_ppm: 0.0, + score: pin, + rank_score: rank, + edge_score: (rank - pin) as i32, + spec_e_value: 1.0, + de_novo_score: 0, + activation_method: None, + e_value: 1.0, + features: Default::default(), + isotope_offset: 0, + } + } + + #[test] + fn dedup_uses_rank_score_not_pin_score() { + let pep = seq_peptide(b"PEPTK"); + let cands = vec![cand_with_peptide(pep.clone())]; + let psms = vec![ + psm(0, 100.4, 99.6), + psm(0, 120.0, 99.6), + ]; + let out = dedup_pepseq_score(psms, &cands); + assert_eq!(out.len(), 2, "different rank_score keys must not merge"); + } + + #[test] + fn dedup_distinguishes_mod_state() { + let mut ox = seq_peptide(b"PEPMK"); + ox.residues[3].mod_ = Some(Arc::new(Modification { + name: "Ox".into(), + mass_delta: 15.99491, + residue: ResidueSpec::Specific(b'M'), + location: ModLocation::Anywhere, + fixed: false, + accession: None, + })); + let cands = vec![ + cand_with_peptide(seq_peptide(b"PEPMK")), + cand_with_peptide(ox), + ]; + let psms = vec![ + psm(0, 50.0, 50.0), + psm(1, 50.0, 50.0), + ]; + let out = dedup_pepseq_score(psms, &cands); + assert_eq!(out.len(), 2, "mod-aware pepSeq keys must differ"); + } + + #[test] + fn dedup_keeps_highest_rank_score_survivor() { + let pep = seq_peptide(b"PEPTK"); + let cands = vec![cand_with_peptide(pep)]; + // Same rounded score bucket (60) but different float rank_score — merge to best. + let psms = vec![ + psm(0, 59.6, 50.0), + psm(0, 60.4, 50.0), + ]; + let out = dedup_pepseq_score(psms, &cands); + assert_eq!(out.len(), 1); + assert!((out[0].rank_score - 60.4).abs() < f32::EPSILON); + } } diff --git a/crates/search/tests/gf_java_parity.rs b/crates/search/tests/gf_java_parity.rs index 8eb3c93d..0ff65363 100644 --- a/crates/search/tests/gf_java_parity.rs +++ b/crates/search/tests/gf_java_parity.rs @@ -11,7 +11,7 @@ //! Java SP values are now captured directly via //! `-Dmsgfplus.gftrace=true` against `target/MSGFPlus.jar` (commit e918376) //! so the test compares SP-vs-SP. The remaining `num_distinct`-level -//! discrepancy is tracked separately as known-divergences item #2 +//! discrepancy is tracked in `DOCS.md` §8d (lnEValue / num_distinct). //! (e_value proxy follow-up). //! //! Reference fixture (for context, not used for the assertion): @@ -91,7 +91,7 @@ const FIVE_TRACED_PSMS: &[(i32, &str, u8, f64)] = &[ /// Best case; Rust and Java agree to within ~18%. /// /// The remaining SP-level drift is small and is tracked under the -/// known-divergences list (RawScore scale + Float.MIN_VALUE underflow +/// known-divergences list (see `DOCS.md` §8d: RawScore scale + Float.MIN_VALUE underflow /// guard). The previously suspected scan-3353-specific score-distribution /// width bug appears to have been an artifact of the SEV-vs-SP comparison. /// @@ -100,7 +100,7 @@ const FIVE_TRACED_PSMS: &[(i32, &str, u8, f64)] = &[ /// `NewScoredSpectrum.java:83-88`). The two charge-3 PSMs in this fixture /// (scan 3416 and 3353) moved from 0.24/0.13 OOM → 1.03/1.20 OOM. The shift /// EXPOSES an underlying deconvolution-implementation divergence between -/// Rust and Java (`known-divergences.md` item #3, still open). The fix is +/// Rust and Java (`DOCS.md` §8d — BSA charge-3 deconvolution, still open). The fix is /// algorithmically correct — Rust now matches Java's prob_peak ordering — /// but the deconvoluted peak list differs from Java's implementation, /// shifting ion_existence_score. Charge-2 PSMs (3 of 5 in this fixture) are @@ -112,7 +112,7 @@ const FIVE_TRACED_PSMS: &[(i32, &str, u8, f64)] = &[ /// validated on Astral (Rust now BEATS Java by +287 PSMs at 1% FDR; /// see project memory `iter32-37-shipped`). It also widens the BSA /// charge-3 SEV gap from 1.03/1.20 OOM → 2.56-3.58 OOM because the -/// deconvolution-implementation divergence (`known-divergences.md` #3) +/// deconvolution-implementation divergence (`DOCS.md` §8d) /// now feeds the corrected score path. Bumping tolerance to 4.0 OOM /// keeps this test as a coarse smoke gate while #3 remains open; a /// regression beyond 4.0 OOM would still signal a new bug.