scGroupComparisonTutorial/FormattingProteomicsData.Rmd at main · PayneLab/scGroupComparisonTutorial · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
---
title: "Formatting Proteomics Data"
output: html_document
date: "2025-05-28"
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Specifications For Raw Proteomics Data

Running any Model 7 hypothesis is a two step process: 1) Process the data. 2) Run group comparison on processed data. Both steps are made easy by MSstats, which is what Model 7 is built upon. MSstats has particular requirements for raw proteomics data before it can be processed.

Specifically, these requirements are:
- The data is stored in a data frame.
- The data frame needs the following column names: ProteinName, PeptideSequence, PrecursorCharge, FragmentIon, ProductCharge, IsotopeLabelType, Condition, BioReplicate, Run, Intensity. The variable names should be fixed.
- If the information of one or more columns is not available for the original raw data, please retain the column variables and type in fixed value. For example, the original raw data does not contain the information of ProductCharge, we retain the column ProductCharge and type in NA for all transitions in RawData.
- The column BioReplicate should label with unique patient ID (i.e., same patients should label with the same ID).
- Variable Intensity is required to be original signal without any log transformation and can be specified as the peak of height or the peak of area under curve.

## Formatting The Conditions Column

The MSstats groupComparison() method uses the values in the Condition column to group data points. Every value in the Condition column needs to match one of the column values in the comparison matrix. For more information, see the CreatingComparisonMatrices document.