msgf-rust documentation

This is the full reference for the msgf-rust binary and its outputs. For a quick start and benchmark summary, see README.md. For porting Java MS-GF+ command lines and numeric legacy flags, see CLI_MIGRATION.md.

Run msgf-rust --help for auto-generated help derived from the same Cli struct documented below.

1. CLI reference
2. Mods.txt format
3. Output formats
4. Auto-detection
5. Building from source
6. Training new .param files
7. Isobaric labeling
8. Java MS-GF+ → msgf-rust migration
9. License and citation

1. CLI reference

All flags use kebab-case long options (--flag-name). Several flags also accept legacy Java MS-GF+ numeric values (see §8). The CLI is implemented in crates/msgf-rust/src/bin/msgf-rust.rs.

Required

Flag	Type	Default	Description	Legacy form
`--spectrum`	path	(required)	Input spectrum file. Extension `.mzML`/`.mzml` selects the mzML reader; any other extension (including `.mgf`) selects MGF.	Java `-s <FILE>`
`--database`	path	(required)	Target FASTA database. Decoys are generated automatically by reversing target sequences (see `--decoy-prefix`).	Java `-d <FILE>`
`--output-pin`	path	(required)	Output Percolator `.pin` file path. Always written unless the process exits with an error before the write phase.	Java `-o <FILE>` (when `-outputFormat pin`)

Search parameters

Flag	Type	Default	Description	Legacy form
`--precursor-tol-ppm`	f64	`20.0`	Symmetric precursor mass tolerance in parts per million.	Java `-t 20ppm`
`--charge-min`	u8	`2`	Minimum precursor charge to try when the spectrum record does not specify charge.	(no direct Java flag; set via param file in Java)
`--charge-max`	u8	`3`	Maximum precursor charge to try when charge is missing from the spectrum.	(same)
`--enzyme-specificity`	enum	`fully`	Enzymatic cleavage enforcement at peptide termini (Number of Tolerable Termini). `fully`: both termini must be cleavage sites (Java `-ntt 2`). `semi`: at least one terminus (Java `-ntt 1`). `non-specific`: neither required (Java `-ntt 0`).	`--ntt` alias; numeric `0`/`1`/`2`
`--max-missed-cleavages`	u32	`1`	Maximum missed enzymatic cleavages allowed per candidate peptide.	Java `-maxMissedCleavages 1`
`--min-length`	u32	`6`	Minimum peptide length in residues (excluding flanking context).	Java `-minLength 6`
`--max-length`	u32	`40`	Maximum peptide length in residues.	Java `-maxLength 40`
`--top-n`	u32	`10`	Maximum PSMs retained per spectrum (ranked by SpecEValue).	Java `-n 10`
`--isotope-error-min`	i8	`-1`	Minimum isotope error offset to evaluate during precursor matching.	Java `-ti -1,2` (first value)
`--isotope-error-max`	i8	`2`	Maximum isotope error offset to evaluate.	Java `-ti -1,2` (second value)
`--min-peaks`	u32	`10`	Minimum number of MS2 peaks required to score a spectrum; spectra below this threshold are skipped.	Java `-minNumPeaks 10`
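
To illustrate how `--precursor-tol-ppm` interacts with the isotope error range, here is a minimal sketch that turns one experimental neutral mass into a set of search windows. The function name `precursor_windows`, the constant `ISOTOPE_DA`, and the choice to subtract the offset before applying the tolerance are assumptions made for illustration; the actual matching code in msgf-rust may differ.

```python
# Assumed constant: mass difference between 13C and 12C, in Da.
ISOTOPE_DA = 1.00335483


def precursor_windows(exp_mass, tol_ppm=20.0, iso_min=-1, iso_max=2):
    """Return (offset, low, high) neutral-mass windows, one per isotope offset."""
    windows = []
    for offset in range(iso_min, iso_max + 1):
        # Undo the assumed isotope mis-assignment before applying the tolerance.
        corrected = exp_mass - offset * ISOTOPE_DA
        # Symmetric tolerance: ppm is relative to the corrected mass.
        half_width = corrected * tol_ppm * 1e-6
        windows.append((offset, corrected - half_width, corrected + half_width))
    return windows


for offset, low, high in precursor_windows(2000.0):
    print(f"offset {offset:+d}: {low:.4f} .. {high:.4f} Da")
```

With the defaults, each spectrum yields four windows (offsets -1, 0, 1, 2), so widening the isotope range increases the number of candidate peptides roughly in proportion.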

Modifications

Flag	Type	Default	Description	Legacy form
`--mods`	path	(off)	Path to a Java-format `mods.txt` file describing fixed and variable modifications. When omitted, built-in defaults apply: Carbamidomethyl on C (fixed) and Oxidation on M (variable, max 3 per peptide). Composition strings (e.g. `C2H3N1O1`) are not supported; use numeric Da masses.	Java `-mod <FILE>`; hidden alias `--mod` (singular)

Scoring

Flag	Type	Default	Description	Legacy form
`--fragmentation`	enum	`auto`	Fragmentation method for bundled `.param` resolution. Named: `auto`, `CID`, `ETD`, `HCD`, `UVPD`. `auto` on mzML triggers activation detection (§4); on MGF falls back to bundled defaults.	Java `-m`; numeric `0`=auto, `1`=CID, `2`=ETD, `3`=HCD, `4`=UVPD
`--instrument`	enum	`low-res`	Instrument class for bundled `.param` resolution. Named: `low-res`, `high-res`, `TOF`, `QExactive`.	Java `-inst`; numeric `0`=low-res, `1`=high-res, `2`=TOF, `3`=QExactive
`--protocol`	enum	`auto`	Search protocol suffix for bundled `.param` resolution. Named: `auto`, `phospho`, `iTRAQ`, `iTRAQ-phospho`, `TMT`, `standard`.	Java `-protocol`; numeric `0`=auto, `1`=phospho, `2`=iTRAQ, `3`=iTRAQ-phospho, `4`=TMT, `5`=standard
`--param-file`	path	(auto)	Explicit path to a `.param` scoring model file. When set, overrides all auto-detection and bundled resolution. Required when running a release binary outside the source tree if bundled resources are not present.	Java `-conf` / model path

Bundled default when all scoring flags are at their defaults (--fragmentation auto --instrument low-res --protocol auto): HCD_QExactive_Tryp.param. This preserves pre-auto-detect behaviour for MGF inputs and mzML files without activation metadata.

Resolution ladder (when --param-file is not set):

1. Try exact {Frag}_{Inst}_Tryp{ProtocolSuffix}.param.
2. If the protocol-specific file is missing, drop the protocol suffix → {Frag}_{Inst}_Tryp.param.
3. Final fallback: CID_TOF_Tryp.param (HCD + TOF/HighRes), ETD_LowRes_Tryp.param (ETD), or CID_LowRes_Tryp.param (everything else).

Normalisation rules (mirrors Java NewScorerFactory):

auto fragmentation → treated as CID for filename resolution (except mzML auto-detect path, §4).
HCD + low-res instrument → upgraded to QExactive.

Only tryptic enzyme models are bundled; other enzymes require --param-file.
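
The resolution ladder and normalisation rules above can be sketched as a small lookup function. This is an illustrative reading of the rules, not the code in msgf-rust: `resolve_param`, `BUNDLED` (a subset of the bundled file names), and the `SUFFIXES` map (inferred from the bundled file names) are assumptions, and "HCD + TOF/HighRes" is interpreted as HCD fragmentation on a TOF or high-res instrument.

```python
# Illustrative subset of the bundled file names listed in this document.
BUNDLED = {
    "CID_LowRes_Tryp.param", "CID_HighRes_Tryp.param", "CID_TOF_Tryp.param",
    "ETD_LowRes_Tryp.param", "ETD_HighRes_Tryp.param",
    "HCD_HighRes_Tryp.param", "HCD_QExactive_Tryp.param",
    "HCD_QExactive_Tryp_TMT.param", "HCD_QExactive_Tryp_Phosphorylation.param",
    "UVPD_QExactive_Tryp.param",
}
INSTRUMENTS = {"low-res": "LowRes", "high-res": "HighRes",
               "TOF": "TOF", "QExactive": "QExactive"}
# Assumed mapping from --protocol values to file-name suffixes.
SUFFIXES = {"auto": "", "standard": "", "phospho": "_Phosphorylation",
            "iTRAQ": "_iTRAQ", "iTRAQ-phospho": "_iTRAQPhospho", "TMT": "_TMT"}


def resolve_param(frag="auto", inst="low-res", protocol="auto", bundled=BUNDLED):
    """Return the bundled .param file name chosen for the given scoring flags."""
    if (frag, inst, protocol) == ("auto", "low-res", "auto"):
        return "HCD_QExactive_Tryp.param"  # documented all-defaults case
    # Normalisation rules.
    if frag == "auto":
        frag = "CID"
    if frag == "HCD" and inst == "low-res":
        inst = "QExactive"
    inst_name = INSTRUMENTS[inst]
    # Step 1: exact match including the protocol suffix.
    exact = f"{frag}_{inst_name}_Tryp{SUFFIXES[protocol]}.param"
    if exact in bundled:
        return exact
    # Step 2: drop the protocol suffix.
    plain = f"{frag}_{inst_name}_Tryp.param"
    if plain in bundled:
        return plain
    # Step 3: final fallback.
    if frag == "HCD" and inst_name in ("TOF", "HighRes"):
        return "CID_TOF_Tryp.param"
    if frag == "ETD":
        return "ETD_LowRes_Tryp.param"
    return "CID_LowRes_Tryp.param"


print(resolve_param("HCD", "low-res", "TMT"))
print(resolve_param("CID", "low-res", "TMT"))
```

The first call resolves at step 1 after the HCD + low-res upgrade; the second falls through to step 2 because no CID TMT model is bundled.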

Runtime

Flag	Type	Default	Description	Legacy form
`--threads`	usize	logical CPU count	Rayon worker threads for the search loop. Pool is initialised once per process.	Java `-thread N`
`--ms-level`	u8	`2`	MS level to search. Non-matching spectra are filtered at load time. Meaningful for mzML only; MGF has no MS-level metadata and is always treated as MS2 (a warning is printed if `--ms-level` ≠ 2 on MGF).	(no Java equivalent)
`--max-spectra`	usize	`0`	Bench mode: process only the first N MS2 spectra. `0` = full input. When > 0, TSV output is skipped (PIN is still written).	(no Java equivalent)
`--decoy-prefix`	string	`XXX_`	Prefix prepended to reversed decoy protein accessions during index construction.	Java decoy tag in `-tda` workflows

Output

Flag	Type	Default	Description	Legacy form
`--output-tsv`	path	(off)	Optional tab-separated PSM report (§3b). Skipped in bench mode (`--max-spectra > 0`).	Java `-outputFormat 1` with output path

Environment variable: set MSGFRUST_RSS_PROBE=1 on Linux to print VmRSS checkpoints to stderr during long runs (debugging memory use).

2. Mods.txt format

msgf-rust reads the same modification file format as Java MS-GF+. The parser lives in crates/model/src/modification.rs and crates/model/src/aa_set.rs.

Grammar

Each non-comment line is five comma-separated fields:

<mass>,<aa>,<fix|opt>,<location>,<name>

Field	Rule
`<mass>`	Numeric monoisotopic mass delta in Da. Composition strings (`C2H3N1O1`) are not supported in msgf-rust.
`<aa>`	Single uppercase ASCII letter, or `*` (wildcard). Multi-residue strings like `STY` are not supported; declare one line per residue.
`<fix|opt>`	`fix` = fixed (static) modification; `opt` = variable modification. Case-insensitive.
`<location>`	One of `any`, `N-term`, `C-term`, `Prot-N-term`, `Prot-C-term` (case-insensitive; hyphens optional).
`<name>`	Human-readable modification name (used in logs; not written to mzIdentML — that format is not supported).

Special directive: a line NumMods=N sets the maximum number of variable modifications per peptide. Parsed separately and applied to SearchParams.max_variable_mods_per_peptide. Default when absent: 3.

Comments: lines whose first non-whitespace character is # are ignored. Inline # ... comments are stripped from the end of a line (Java stripComment semantics). Blank lines are ignored.

Conflicts: a fixed and variable mod targeting the same (residue, location) slot is rejected at build time.
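
For checking a mods file before a run, the grammar above can be approximated with a short validator. This is a sketch under stated assumptions, not a port of the Rust parser: `parse_mods` and the dictionary layout are hypothetical, and edge cases (for example a `#` inside a modification name) are not handled.

```python
LOCATIONS = {"any", "nterm", "cterm", "protnterm", "protcterm"}


def parse_mods(text):
    """Parse mods.txt content into (max_variable_mods, list of mod dicts)."""
    max_mods, mods = 3, []  # 3 is the documented default when NumMods is absent
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop full-line and inline comments
        if not line:
            continue
        if line.startswith("NumMods="):
            max_mods = int(line.split("=", 1)[1])
            continue
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 5:
            raise ValueError(f"expected 5 comma-separated fields: {raw!r}")
        mass, aa, kind, location, name = fields
        if not (aa == "*" or (len(aa) == 1 and "A" <= aa <= "Z")):
            raise ValueError(f"residue must be one uppercase letter or '*': {raw!r}")
        kind = kind.lower()
        if kind not in ("fix", "opt"):
            raise ValueError(f"third field must be fix or opt: {raw!r}")
        loc = location.lower().replace("-", "")  # hyphens are optional
        if loc not in LOCATIONS:
            raise ValueError(f"unknown location: {raw!r}")
        mods.append({"mass": float(mass), "aa": aa, "fixed": kind == "fix",
                     "location": loc, "name": name})
    # Reject a fixed and a variable mod on the same (residue, location) slot.
    slots = {}
    for mod in mods:
        key = (mod["aa"], mod["location"])
        if slots.setdefault(key, mod["fixed"]) != mod["fixed"]:
            raise ValueError(f"fixed/variable conflict on {key}")
    return max_mods, mods


example = """NumMods=3
57.02146,C,fix,any,Carbamidomethyl
15.99491,M,opt,any,Oxidation  # variable
"""
print(parse_mods(example))
```

Because `float(mass)` is used for the first field, a composition string such as `C2H3N1O1` fails with a `ValueError`, which matches the documented restriction to numeric masses.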

Example (a) — Carbamidomethyl C + Oxidation M

NumMods=3
57.02146,C,fix,any,Carbamidomethyl
15.99491,M,opt,any,Oxidation

When --mods is omitted, msgf-rust uses these two modifications as built-in defaults.

Example (b) — TMT 10-plex on K and peptide N-term

NumMods=2
57.02146,C,fix,any,Carbamidomethyl
229.162932,K,fix,any,TMT10plex
229.162932,*,fix,N-term,TMT10plex

Pair with --protocol TMT --fragmentation HCD --instrument QExactive to select HCD_QExactive_Tryp_TMT.param (§4, §7).

Example (c) — Phosphorylation on S, T, Y

NumMods=3
57.02146,C,fix,any,Carbamidomethyl
79.966331,S,opt,any,Phospho
79.966331,T,opt,any,Phospho
79.966331,Y,opt,any,Phospho

Pair with --protocol phospho to prefer a _Phosphorylation protocol-suffixed .param file when bundled.

3. Output formats

msgf-rust writes Percolator .pin (always) and optionally .tsv. Implementation: crates/output/src/pin.rs, crates/output/src/tsv.rs.

3a. PIN columns

Tab-separated, one header row, one row per PSM. Rows are sorted best-first within each spectrum (lowest SpecEValue first). Charge one-hot columns are emitted for every integer charge in [--charge-min, --charge-max]; the table below uses the default range 2–3 (charge2, charge3).

Column	Type	Description
`SpecId`	string	`{specID}_{scan}_{rank}` — unique PSM identifier.
`Label`	int	`+1` target, `-1` decoy (by source protein, not peptide sequence).
`ScanNr`	int	MS2 scan number from the input file.
`ExpMass`	float	Experimental neutral precursor mass (Da): `precursor_mz × charge − charge × proton`.
`CalcMass`	float	Theoretical neutral peptide mass (includes H₂O).
`mass`	float	Duplicate of `ExpMass` (OpenMS PercolatorAdapter convention).
`RawScore`	int	Rounded MS-GF+ score (`MSGFScore`).
`DeNovoScore`	int	Best de novo graph score for the spectrum.
`lnSpecEValue`	float	`ln(SpecEValue)`; `-MAX` if non-positive.
`lnEValue`	float	`ln(EValue)` where EValue = SpecEValue × num_distinct peptides.
`isotope_error`	int	Winning isotope offset (−1…2 by default).
`peplen`	int	Peptide residue count + 2 (includes flanking pre/post residues).
`dm`	float	Precursor mass error (Da) after isotope correction.
`absdm`	float	Absolute value of `dm`.
`charge2` … `chargeK`	0/1	One-hot encoding of assigned precursor charge.
`enzN`	0/1	N-terminal boundary consistent with enzyme rules.
`enzC`	0/1	C-terminal boundary consistent with enzyme rules.
`enzInt`	int	Count of internal enzymatic cleavage positions in the peptide.
`NumMatchedMainIons`	int	Matched charge-1 b/y fragment positions.
`longest_b`	int	Longest contiguous matched b-ion run.
`longest_y`	int	Longest contiguous matched y-ion run.
`longest_y_pct`	float	`longest_y / peptide.length()` (6 decimal places).
`ExplainedIonCurrentRatio`	float	Matched b+y intensity / total MS2 intensity.
`NTermIonCurrentRatio`	float	Matched b-ion intensity / total MS2 intensity.
`CTermIonCurrentRatio`	float	Matched y-ion intensity / total MS2 intensity.
`MS2IonCurrent`	float	Sum of all MS2 peak intensities (not log-scaled).
`IsolationWindowEfficiency`	float	Always `0.0` (not available from parsed spectra).
`MeanErrorTop7`	float	Mean absolute Da error of top-7 most-intense matched ions.
`StdevErrorTop7`	float	Population stdev of absolute Da errors (top-7).
`MeanRelErrorTop7`	float	Mean signed ppm error of top-7 ions.
`StdevRelErrorTop7`	float	Population stdev of signed ppm errors (top-7).
`lnDeltaSpecEValue`	float	`ln(rank1 SpecEValue / rank2 SpecEValue)` for rank-1 PSMs; `0` otherwise.
`matchedIonRatio`	float	`NumMatchedMainIons / peptide.length()`.
`EdgeScore`	int	Per-bond DBScanScorer edge sum (IES + error score). Rust-only additive column — not present in Java MS-GF+ PIN output; placed before `Peptide` so legacy column-index parsers still find sequence at the tail.
`Peptide`	string	`pre.SEQUENCE.post` with `+mass` mod annotations.
`Proteins`	string	Protein accession(s); decoy accessions carry `--decoy-prefix`. Multiple accessions tab-separated when one peptide maps to several proteins.
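
Because `Proteins` can span several tab-separated cells, PIN rows may be longer than the header, and generic table readers can misalign them. The sketch below reads such a file with the standard library and also decodes the one-hot charge columns. `read_pin` and the added `charge` key are hypothetical helpers, and the demo header is abbreviated to a few of the columns listed above.

```python
import csv
import io


def read_pin(handle):
    """Yield one dict per PSM; 'Proteins' becomes a list of accessions."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    n_fixed = header.index("Proteins")  # Proteins is the last header column
    charge_cols = [c for c in header
                   if c.startswith("charge") and c[6:].isdigit()]
    for row in reader:
        if not row:
            continue
        psm = dict(zip(header[:n_fixed], row[:n_fixed]))
        psm["Proteins"] = row[n_fixed:]  # ragged tail: one or more accessions
        # Decode the one-hot charge columns back into a single integer.
        psm["charge"] = next(
            (int(c[6:]) for c in charge_cols if psm[c] == "1"), None)
        yield psm


# Abbreviated demo input; a real PIN file carries the full column set.
demo = (
    "SpecId\tLabel\tScanNr\tcharge2\tcharge3\tPeptide\tProteins\n"
    "run_17_1\t1\t17\t0\t1\tK.PEPTIDEK.A\tsp|P1|A\tsp|P2|B\n"
    "run_17_2\t-1\t17\t0\t1\tR.EDITPEPK.L\tXXX_sp|P3|C\n"
)
psms = list(read_pin(io.StringIO(demo)))
print(psms[0]["charge"], psms[0]["Proteins"])
```

All values are kept as strings except the decoded charge; convert numeric columns as needed for downstream analysis.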

3b. TSV columns

Tab-separated human-readable report. The Title column appears only for MGF inputs (Java parity).

MGF header (is_mgf = true):

Column	Type	Description
`#SpecFile`	string	Bare filename of the input spectrum file.
`SpecID`	string	Spectrum identifier (MGF title, or `scan=N`).
`ScanNum`	int	Scan number.
`Title`	string	MGF `TITLE=` field.
`FragMethod`	string	Activation method name (`HCD`, `CID`, …) or `UNKNOWN`.
`Precursor`	float	Precursor m/z (4 decimal places).
`IsotopeError`	int	Always `0` in current release (winning offset not threaded to TSV).
`PrecursorError(ppm)`	float	Mass error in ppm when tolerance is ppm mode; column named `PrecursorError(Da)` in Da mode.
`Charge`	int	Assigned precursor charge.
`Peptide`	string	Annotated peptide sequence with modifications.
`Protein`	string	Single protein accession (primary candidate).
`DeNovoScore`	int	De novo score.
`MSGFScore`	int	Rounded raw score.
`SpecEValue`	float	SpecEValue in `%.6e` notation.
`EValue`	float	Database E-value in `%.6e` notation.

mzML header — same as above without the Title column (14 columns total).

Decoy PSMs are included in TSV output; downstream tools label them via Percolator or manual filtering.
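
Since the TSV report does not mark decoys explicitly, a simple manual filter can key on the decoy prefix in the `Protein` column. The sketch below reads columns by name, so it works with or without the `Title` column. `label_decoys` and the added `IsDecoy` key are hypothetical, the demo uses an abbreviated header, and the prefix is assumed to match the `--decoy-prefix` used for the search.

```python
import csv
import io


def label_decoys(handle, decoy_prefix="XXX_"):
    """Read a TSV report and add a boolean 'IsDecoy' key to every row."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        row["IsDecoy"] = row["Protein"].startswith(decoy_prefix)
        rows.append(row)
    return rows


# Abbreviated demo input; a real report has 14 or 15 columns.
demo = (
    "#SpecFile\tSpecID\tScanNum\tPeptide\tProtein\tSpecEValue\n"
    "run.mzML\tscan=17\t17\tPEPTIDEK\tsp|P1|A\t1.250000e-12\n"
    "run.mzML\tscan=18\t18\tEDITPEPK\tXXX_sp|P3|C\t3.100000e-05\n"
)
rows = label_decoys(io.StringIO(demo))
targets = [r for r in rows if not r["IsDecoy"]]
print(len(rows), len(targets))
```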

3c. PIN vs TSV — which to use

Use PIN when the goal is FDR calibration or rescoring: Percolator, MS²Rescore, Mokapot, and quantms-style pipelines consume .pin directly and learn feature weights from the full Percolator feature set (including EdgeScore). Use TSV for spreadsheet inspection, custom reporting, or tools that expect Java MS-GF+'s flat PSM table. You can emit both in one run with --output-pin and --output-tsv. For production quantms workflows, PIN is the standard path; TSV is optional diagnostics.

4. Auto-detection

For mzML inputs, when --fragmentation auto, --instrument low-res, and --protocol auto are in effect (the CLI defaults), msgf-rust peeks at the input file before loading the full dataset:

Activation method — histogram of <activation> cvParams across the first 64 MS2 spectra; dominant method wins. Mixed methods trigger an stderr warning but the dominant method is still used file-wide.
Instrument class — scans <instrumentConfiguration> / analyzer cvParams via input::detect_instrument_type; dominant analyzer among MS2 spectra wins. None → low-res (Java LOW_RESOLUTION_LTQ default).

MGF files carry no activation or instrument metadata → auto-detect returns None → bundled default HCD_QExactive_Tryp.param unless explicit --fragmentation / --instrument flags override.

Explicit --fragmentation (non-auto) or non-default --instrument disables the activation peek and uses flag-based resolution directly (§1).

Activation CV mapping (mzML `<activation>` cvParam accession → method)

CV accession	Name (PSI-MS)	msgf-rust method	Notes
`MS:1000133`	collision-induced dissociation	CID
`MS:1000422`	beam-type collision-induced dissociation (HCD)	HCD
`MS:1000598`	electron transfer dissociation	ETD
`MS:1000599`	pulsed Q dissociation	CID	Java collapses PQD → CID (`NewScorerFactory`)
`MS:1000435`	photodissociation	UVPD	Java UVPD mapping
`MS:1000250`	electron capture dissociation	ETD	Mapped to ETD (no dedicated ECD variant)
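
The activation peek can be pictured as a histogram over the mapped methods. The sketch below applies the table above to per-spectrum lists of cvParam accessions and returns the dominant method plus a flag for mixed files. `dominant_activation` and its input shape are assumptions for illustration; the real implementation reads accessions from the mzML stream and may break ties differently.

```python
from collections import Counter
from itertools import islice

# Accession to method mapping, copied from the table above.
ACTIVATION_CV = {
    "MS:1000133": "CID",
    "MS:1000422": "HCD",
    "MS:1000598": "ETD",
    "MS:1000599": "CID",   # PQD collapses to CID
    "MS:1000435": "UVPD",
    "MS:1000250": "ETD",   # ECD is mapped to ETD
}


def dominant_activation(spectra, limit=64):
    """spectra: iterable of accession lists, one list per MS2 spectrum.

    Returns (method or None, mixed) based on the first `limit` spectra.
    """
    counts = Counter()
    for accessions in islice(spectra, limit):
        # Count each recognised method at most once per spectrum.
        counts.update({ACTIVATION_CV[a] for a in accessions if a in ACTIVATION_CV})
    if not counts:
        return None, False
    method, _ = counts.most_common(1)[0]
    return method, len(counts) > 1


# MS:1000045 (collision energy) is not a method term and is ignored.
spectra = [["MS:1000422", "MS:1000045"]] * 3 + [["MS:1000133"]]
print(dominant_activation(spectra))
```

A `None` result corresponds to the detection-failure case, where the bundled default or the flag-based ladder applies.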

Instrument detection (analyzer cvParam → class)

Analyzer family	Examples	Instrument class
Ion trap / linear ion trap	`MS:1000264`, Velos, LTQ	`low-res`
Orbitrap / Fusion	`MS:1000484`, Fusion Lumos	`QExactive`
FT-ICR	`MS:1000079`	`high-res`
TOF	`MS:1000084`	`TOF`

Bundled `.param` files (`resources/ionstat/`)

39 scoring models ship with the binary (Tryp-centric unless noted):

CID_HighRes_NoCleavage.param          CID_HighRes_Tryp.param
CID_LowRes_ArgC.param                 CID_LowRes_AspN.param
CID_LowRes_GluC.param                 CID_LowRes_LysC.param
CID_LowRes_LysN.param                 CID_LowRes_LysN_Phosphorylation.param
CID_LowRes_NoCleavage.param           CID_LowRes_Tryp.param
CID_LowRes_Tryp_Phosphorylation.param CID_LowRes_aLP.param
CID_TOF_Tryp.param                    CID_TOF_aLP.param
ETD_HighRes_NoCleavage.param         ETD_HighRes_Tryp.param
ETD_LowRes_ArgC.param                 ETD_LowRes_AspN.param
ETD_LowRes_GluC.param                 ETD_LowRes_LysC.param
ETD_LowRes_LysN.param                 ETD_LowRes_LysN_Phosphorylation.param
ETD_LowRes_Tryp.param                 ETD_LowRes_Tryp_Phosphorylation.param
ETD_LowRes_aLP.param
HCD_HighRes_NoCleavage.param          HCD_HighRes_Tryp.param
HCD_HighRes_Tryp_Phosphorylation.param HCD_HighRes_Tryp_TMT.param
HCD_HighRes_Tryp_iTRAQ.param          HCD_HighRes_Tryp_iTRAQPhospho.param
HCD_QExactive_Tryp.param              HCD_QExactive_Tryp_Phosphorylation.param
HCD_QExactive_Tryp_TMT.param          HCD_QExactive_Tryp_iTRAQ.param
HCD_QExactive_Tryp_iTRAQPhospho.param HCD_TOF_aLP.param
UVPD_QExactive_Tryp.param             UVPD_QExactive_Tryp_TMT.param

When auto-detection fails (missing activation block, unknown CV term, or running outside the source tree without bundled resources), msgf-rust falls back to HCD_QExactive_Tryp.param for default-flag runs, or to the resolution ladder in §1 for explicit flags. If no bundled file resolves, the process exits with an error instructing you to pass --param-file <PATH> explicitly.

5. Building from source

Requirements: Rust 1.85+ (workspace pins 1.87.0 in rust-toolchain.toml because transitive dependencies use edition = "2024").

git clone https://github.com/bigbio/msgf-rust
cd msgf-rust
cargo build --release
# Binary: target/release/msgf-rust

Run the full workspace test suite:

cargo test --release --workspace

CI-skipped tests: GitHub Actions (.github/workflows/ci.yml) skips seven tests that fail on a clean checkout or are tracked as follow-up work. The release binary is unaffected.

Skipped test	Reason
`charge_missing_spectrum_uses_per_charge_scored_spec`	`min_peaks` filter regression (pre-iter32 baseline)
`spectrum_without_charge_tries_charge_range`	same category
`known_peptide_appears_in_top_n`	same category
`read_bsa_canno_text_format`	Maven fixture under `target/test-classes/` not generated in CI
`read_tryp_pig_bov_revcat_csarr_cnlcp`	same
`tryp_pig_bov_revcat_full_set_loads`	same
`match_spectra_output_invariant_across_thread_counts`	Rayon tie-breaking nondeterminism when scores tie

Reproduce the CI test invocation:

cargo test --release --workspace -- \
  --skip charge_missing_spectrum_uses_per_charge_scored_spec \
  --skip spectrum_without_charge_tries_charge_range \
  --skip known_peptide_appears_in_top_n \
  --skip read_bsa_canno_text_format \
  --skip read_tryp_pig_bov_revcat_csarr_cnlcp \
  --skip tryp_pig_bov_revcat_full_set_loads \
  --skip match_spectra_output_invariant_across_thread_counts

Release archives bundle the binary, all 39 .param files, and unimod.obo under resources/ — see README.md §Install.

6. Training new `.param` files

msgf-rust loads Java MS-GF+ .param scoring models without conversion. The 39 bundled files in resources/ionstat/ were copied from the Java distribution unchanged; the on-disk binary format is identical.

Training new models (novel fragmentation chemistry, instrument class, or acquisition protocol) requires a scoring-parameter generator. Java MS-GF+'s ScoringParamGen is the canonical trainer.

Status in v0.1.0: search and scoring are fully ported and benchmark-validated; ScoringParamGen is not yet ported to Rust. Track progress on the GitHub issues page.

Interim workflows:

Use bundled models — covers HCD QExactive tryptic DDA, CID low-res ion trap, ETD, phosphorylation, TMT, and iTRAQ variants (§4 file list).
Train on the java-legacy branch — check out the preserved Java tree (git checkout java-legacy), run Java ScoringParamGen on representative training data, then point msgf-rust at the output: --param-file /path/to/MyModel.param.

The Rust scorer reads any valid Java .param file via Param::load_from_file.

7. Isobaric labeling

TMT and iTRAQ searches require both protocol-aware scoring models and correct fixed modifications in mods.txt. Set --protocol TMT or --protocol iTRAQ (or legacy --protocol 4 / --protocol 2) so the resolver prefers protocol-suffixed bundled files such as HCD_QExactive_Tryp_TMT.param or HCD_QExactive_Tryp_iTRAQ.param.

TMT (10-plex example)

Mod masses: TMT10plex = 229.162932 Da on lysine and peptide N-terminus (Unimod). Carbamidomethyl on C is standard.

mods.txt:

NumMods=2
57.02146,C,fix,any,Carbamidomethyl
229.162932,K,fix,any,TMT10plex
229.162932,*,fix,N-term,TMT10plex

Command:

msgf-rust \
  --spectrum tmt_spectra.mzML \
  --database hsapiens.fasta \
  --output-pin out.pin \
  --mods tmt_10plex_mods.txt \
  --protocol TMT \
  --fragmentation HCD \
  --instrument QExactive

iTRAQ (8-plex example)

Mod masses: iTRAQ8plex = 304.20536 Da on K and peptide N-terminus.

mods.txt:

NumMods=2
57.02146,C,fix,any,Carbamidomethyl
304.20536,K,fix,any,iTRAQ8plex
304.20536,*,fix,N-term,iTRAQ8plex

Command:

msgf-rust \
  --spectrum itraq_spectra.mzML \
  --database hsapiens.fasta \
  --output-pin out.pin \
  --mods itraq_8plex_mods.txt \
  --protocol iTRAQ \
  --fragmentation HCD \
  --instrument QExactive

For phospho-enriched isobaric data use --protocol iTRAQ-phospho (legacy --protocol 3) and include phospho variable mods in mods.txt (§2 example c).
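
When many label schemes are in play, the mods files above can be generated rather than hand-edited. The sketch below renders the same lines as the TMT and iTRAQ examples, optionally adding the phospho variable mods from §2 example (c). `isobaric_mods` and `LABEL_MASSES` are hypothetical helpers; the masses are the ones quoted in this section, and the `num_mods` value is left to the caller because the examples above do not show a phospho-enriched file.

```python
# Label masses as quoted in this section (kept as strings to avoid reformatting).
LABEL_MASSES = {"TMT10plex": "229.162932", "iTRAQ8plex": "304.20536"}


def isobaric_mods(label, phospho=False, num_mods=2):
    """Render mods.txt content for a fixed isobaric label on K and N-term."""
    mass = LABEL_MASSES[label]
    lines = [
        f"NumMods={num_mods}",
        "57.02146,C,fix,any,Carbamidomethyl",
        f"{mass},K,fix,any,{label}",
        f"{mass},*,fix,N-term,{label}",
    ]
    if phospho:
        # One line per residue: multi-residue strings such as STY are rejected.
        lines += [f"79.966331,{aa},opt,any,Phospho" for aa in "STY"]
    return "\n".join(lines) + "\n"


print(isobaric_mods("TMT10plex"), end="")
```

Write the returned string to a file and pass its path with `--mods`.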

8. Java MS-GF+ → msgf-rust migration

msgf-rust accepts both canonical kebab-case flags with named enum values and legacy Java short flags / numeric IDs. Existing quantms scripts using --fragmentation 3 --instrument 3 --protocol 4 continue to work.

8a. Flag rename table

Java MS-GF+	msgf-rust canonical	msgf-rust legacy alias
`-s <FILE>`	`--spectrum <FILE>`	—
`-d <FILE>`	`--database <FILE>`	—
`-o <FILE>`	`--output-pin <FILE>`	—
`-mod <FILE>`	`--mods <FILE>`	`--mod <FILE>` (hidden)
`-t 20ppm`	`--precursor-tol-ppm 20`	—
`-ti -1,2`	`--isotope-error-min -1 --isotope-error-max 2`	—
`-m 3` (HCD)	`--fragmentation HCD`	`--fragmentation 3`
`-inst 3` (QExactive)	`--instrument QExactive`	`--instrument 3`
`-protocol 4` (TMT)	`--protocol TMT`	`--protocol 4`
`-ntt 2` (fully specific)	`--enzyme-specificity fully`	`--ntt 2`
`-tda 1`	(omit — decoys auto-generated)	—
`-e 1` (Trypsin)	(omit — Trypsin only; other enzymes need `--param-file`)	—
`-outputFormat 1` (TSV)	`--output-tsv <FILE>`	—
`-thread N`	`--threads N`	—
`-minLength 6`	`--min-length 6`	—
`-maxLength 40`	`--max-length 40`	—
`-maxMissedCleavages 1`	`--max-missed-cleavages 1`	—
`-minNumPeaks 10`	`--min-peaks 10`	—
`-n 10`	`--top-n 10`	—
model path / `-conf`	`--param-file <FILE>`	—

8b. Numeric-legacy values

Full legacy 0…N → named-value tables for --fragmentation, --instrument, --protocol, and --enzyme-specificity (--ntt) live in CLI_MIGRATION.md. clap accepts named values case-insensitively (--fragmentation hcd ≡ HCD).
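
The rename table in §8a can be applied mechanically to an existing Java argument list. The sketch below is a hypothetical helper (`translate` is not part of msgf-rust) and makes several assumptions: every Java flag is followed by exactly one value, `-o` names a PIN output, `-t` is given in ppm, and any flag without a row in the table (including `-outputFormat`) is rejected rather than guessed.

```python
# One-to-one renames taken from the §8a table.
RENAMES = {
    "-s": "--spectrum", "-d": "--database", "-o": "--output-pin",
    "-mod": "--mods", "-m": "--fragmentation", "-inst": "--instrument",
    "-protocol": "--protocol", "-ntt": "--ntt", "-thread": "--threads",
    "-minLength": "--min-length", "-maxLength": "--max-length",
    "-maxMissedCleavages": "--max-missed-cleavages",
    "-minNumPeaks": "--min-peaks", "-n": "--top-n", "-conf": "--param-file",
}
DROPPED = {"-tda", "-e"}  # decoys are automatic; only trypsin is bundled


def translate(java_args):
    """Translate a flat [flag, value, flag, value, ...] Java argument list."""
    if len(java_args) % 2:
        raise ValueError("expected flag/value pairs")
    out = []
    for flag, value in zip(java_args[0::2], java_args[1::2]):
        if flag in DROPPED:
            continue
        if flag == "-t":
            if not value.endswith("ppm"):
                raise ValueError("only ppm tolerances have a documented mapping")
            out += ["--precursor-tol-ppm", value[:-3]]
        elif flag == "-ti":
            low, high = value.split(",")
            out += ["--isotope-error-min", low, "--isotope-error-max", high]
        elif flag in RENAMES:
            out += [RENAMES[flag], value]  # legacy numeric values pass through
        else:
            raise ValueError(f"no documented mapping for {flag}")
    return out


java = ["-s", "run.mzML", "-d", "db.fasta", "-t", "20ppm",
        "-ti", "-1,2", "-m", "3", "-tda", "1"]
print(" ".join(translate(java)))
```

Numeric enum values such as `-m 3` are passed through unchanged, relying on the legacy numeric forms described above.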

8c. Behavior differences

Area	Java MS-GF+	msgf-rust
Spectrum inputs	mzML, MGF, mzXML, MS2, PKL, `_dta.txt`, …	mzML and MGF only
Identification output	PIN, TSV, mzIdentML	PIN + optional TSV (no mzIdentML)
Decoys	Separate target/decoy FASTA or `-tda` modes	Always auto-generated reversed decoys from target FASTA (`--decoy-prefix`)
Enzymes	Many via param file / CLI	Trypsin only in bundled models; other enzymes via `--param-file`
Mods file	Composition strings supported	Numeric Da masses only
Help	Picocli	clap-derived `--help`
Memory model	Loads full spectrum list	Chunked streaming (5000 spectra/chunk) for large mzML files
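
To make the Decoys row concrete, the sketch below builds a target-plus-decoy list by reversing each target sequence and prefixing its accession. `add_reversed_decoys` is a hypothetical helper; the sketch assumes plain whole-sequence reversal, and the index construction in msgf-rust may treat details such as terminal residues differently.

```python
def add_reversed_decoys(targets, decoy_prefix="XXX_"):
    """targets: list of (accession, sequence). Returns targets followed by decoys."""
    decoys = [(decoy_prefix + accession, sequence[::-1])
              for accession, sequence in targets]
    return list(targets) + decoys


database = add_reversed_decoys([("sp|P1|A", "MKWVTFISLLK")])
for accession, sequence in database:
    print(accession, sequence)
```

The prefix is what the PIN `Label` column and any TSV post-filter rely on to tell targets from decoys, so it should not collide with real accessions in the FASTA file.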

8d. Known parity divergences

On PSMs where Java and Rust agree on scan + top-1 peptide (the "agreement bucket"), three PIN features still differ systematically. None block production use — aggregate 1% FDR PSM counts meet or beat Java on all three benchmark datasets (see README.md).

Feature	Divergence	Status
`lnEValue`	~4 orders of magnitude mean shift (Rust more confident)	Deferred: `num_distinct` semantics differ (`known-divergences.md` #2)
`MeanRelErrorTop7` / `MeanErrorTop7` / `StdevRelErrorTop7`	>1% relative difference on ~99% of agreement-bucket PSMs	Deferred: error-stat normalization differs
BSA charge-3 SpecEValue (BSA.fasta + test.mgf fixture)	Gap of 1–4 orders of magnitude, depending on deconvolution iteration	Known: deconvolution implementation divergence (`known-divergences.md` #3); kept as dev-branch smoke gate

Percolator's learned weights absorb these distribution shifts; rescored FDR results remain competitive or better than Java.

9. License and citation

msgf-rust is distributed under the UCSD Noncommercial License — the same terms as upstream MS-GF+. The license permits copying, modification, and distribution for educational, research, and non-profit purposes without fee, provided the copyright notice and liability paragraphs are retained. Commercial use requires written permission from the UCSD Technology Transfer Office (see LICENSE for contact details).

The software is provided "as is" without warranty. See LICENSE for the full upstream text and NOTICE for port attribution.

Citation

If you use msgf-rust in published work, cite the original MS-GF+ paper:

Kim, S. and Pevzner, P.A. (2014). MS-GF+ makes progress towards a universal database search tool for proteomics. Nature Communications, 5:5277.

And optionally this Rust port:

bigbio (2026). msgf-rust: a Rust port of MS-GF+ for the quantms pipeline. https://github.com/bigbio/msgf-rust

The original Java implementation is preserved on the java-legacy branch; upstream MS-GF+ lives at https://github.com/MSGFPlus/msgfplus.
