mineDisProt/README.Rmd at master · vanbibn/mineDisProt · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
---
output:
    md_document:
        variant: markdown_github
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = FALSE,
  comment = "#>",
  fig.path = "README-"
)
```

# mineDisProt

The package mineDisProt was developed to extract data from unstructured or semi-structured formats and compile into matrices or data frames more amenable to analysis of intrinsically disorder (ID).

In proteins, intrinsic disorder (ID) is a phenomenon that describes the lack of a stable, or ordered, tertiary structure while still maintaining physiologic functions.  Informatics tools, such as those provided by PONDR](http://www.pondr.com/) and [PONDR-FIT]( http://original.disprot.org/pondr-fit.php) make it easy to analyze a protein sequence for intrinsic disorder; however, this task can become tedious if one has tens or even hundreds of sequences to analyze.

My goal in developing `mineDisProt` was to simplify the data collection process so you can focus on the analysis of your data set.


## Installation

You can install the development version from [GitHub](https://github.com/) with:

``` r
# install.packages("devtools")
devtools::install_github("vanbibn/mineDisProt")
```

## Examples

First, we need to load the package.

```{r load}
library(mineDisProt)
```

### extract_pondr()

This function extracts numerical data from text porduced by the VLXT, VL3, and VSL2 disorder predictors from the Predictor of Natural Disordered Regions [PONDR](http://www.pondr.com/). For each protien sequence, the data from the Raw Output of all three predictors were pasted into a `.txt` file.

```{r example_1, warning=FALSE}
pondr_data <- extract_pondr("inst/extdata/pondr_text/")

# here we will view the data from the VLXT predictor
pondr_data[,1:6]
```

###### * You may get warnings about "incomplete final line found [in the text file]."


### extract_pondr.noVL3()

This is a version of `extract_pondr()`, except the raw data for the VL3 predictor is missing from the raw data text files.

```{r example_2, warning=FALSE}
extract_pondr.noVL3("inst/extdata/pondr_text_withoutVL3/")
```


### extract_pondrFIT()

This function extracts the relevant data from the temporaty URL produced by analyzing a protein sequence in the [PONDR-FIT](http://original.disprot.org/pondr-fit.php) protein disorder meta-predictor. It then calculates  average and percent disorder scores from the per-residue scores.  Before using this function, URLs should be collected and put in a `.csv` file with the UniProt ID in the first column and URL in the second.

```{r example_3, message=FALSE}
extract_pondrFIT("inst/extdata/pondrfit-url.csv")
```

###### * You may get messages about "Parsed with column specification:".


### extract_string()

Extract data on the quantity of protein interactions given by [STRING](https://string-db.org/) with the minimum number of interactions at medium (0.4), high (0.7), and highest (0.9) confidence.

```{r example_4, warning=FALSE}
string_data <- extract_string("inst/extdata/string/")

# here are the interaction data for each protein at the 0.4 confidence level
string_data[,1:6]
```

###### * You may get some warnings like "NAs introduced by coercion" if your protein is missing data at higher confidence levels.

***

Note: In future versions of this package, I hope to fully automate the data collection process.