Logistic Spectral Validation Instructional Manual

Kebing Yu




Table of Contents


1.    Installation

2.    GUI Mode

    2.1     Making a new model

    2.2     Applying an existing model

3.    Automation Mode

4.    FAQ

 




1. Installation

To install the R programming environment, first download the R installer for your operating system from http://www.r-project.org/. Then run the installer and install R using the default settings.

Two additional libraries (ROC and Biobase) must also be installed. Launch R and type the following commands in the R Console:

source("http://bioconductor.org/biocLite.R")
biocLite("ROC")
biocLite("Biobase")


The logistic spectral validation package can be installed to any folder by unzipping the source package. Unzipping creates two folders: ./gui and ./commandline

Download Perl from http://www.activestate.com/activeperl/ and install it to c:/perl. A Perl script embedded in the spectral validation package is used to parse SEQUEST out files. If your path to perl.exe is not c:/perl/bin/perl.exe, edit line 21 of ./gui/PredictGui.r so that it points to your path.


2. GUI Mode

GUI-related source files are located in the ./gui folder. Before running the logistic validation model in GUI mode, open the file runAll.r and set the variable codedir at line 3 to the folder where the code is located on your computer. (Note: use forward slashes [/] in the path, because a single backslash is an escape character in R strings, and do not forget the final /.)

In the R Console, type the following, replacing 'yourdir' with the folder that contains runAll.r:

setwd('yourdir')
source('runAll.r')

The next screen offers two options: 'Make new model' and 'Apply old model'. Select one and click 'OK'.


2.1 Making a new model

If 'Make new model' is selected, a dialog asks whether the data set has already been validated. Select 'yes'. You are then directed to a dialog for locating the target folder that contains both the dta and out files created by SEQUEST. These files are used to compute the variables for logistic regression.

The next screen asks for a validation file, which is required to train new models. In this file the user marks SEQUEST assignments as correct or incorrect, and the software builds a logistic model from these calls. Validation files are tab-delimited, one assignment per line, in the format "valid boolean"\t"1st scan number"\t"charge state"\n. The valid field is 1 for a correct assignment and 0 for an incorrect one. The scan number appears in the DTA/OUT file name as the first number after the base file name.

After the calculations are completed, four models (sequest, sequest plus, reduced spectral, full spectral) trained on the user-provided data are saved in the same folder as the dta/out files with the file extension .Rdata. These .Rdata files can then be used to apply the prebuilt models to newly acquired data.
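The validation file format described above can be produced with a short script. The following is a minimal Python sketch, not part of the package. It assumes that the DTA/OUT files follow the pattern base.firstScan.lastScan.charge.extension, that the quotation marks in the format description delimit field names rather than literal characters, and that the function names (scan_and_charge, validation_lines, write_validation_file) are hypothetical helpers introduced only for illustration.

```python
import re

# Assumed naming pattern: <base>.<first scan>.<last scan>.<charge>.<extension>
_NAME_RE = re.compile(
    r"^(?P<base>.+)\.(?P<first>\d+)\.(?P<last>\d+)\.(?P<charge>\d+)\.[A-Za-z]+$"
)


def scan_and_charge(filename):
    """Return (first scan number, charge state) parsed from a DTA/OUT file name."""
    match = _NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    return int(match.group("first")), int(match.group("charge"))


def validation_lines(calls):
    """Turn {file name: True/False} user calls into validation-file lines.

    Each line is: valid (1 or 0), first scan number, charge state,
    separated by tabs and terminated by a newline.
    """
    lines = []
    for filename, is_correct in calls.items():
        scan, charge = scan_and_charge(filename)
        lines.append(f"{1 if is_correct else 0}\t{scan}\t{charge}\n")
    return lines


def write_validation_file(path, calls):
    """Write the validation lines to `path` without newline translation."""
    with open(path, "w", newline="") as handle:
        handle.writelines(validation_lines(calls))
```

For example, write_validation_file("validation.txt", {"run1.2717.2743.3.dta": True}) would write a single line containing 1, 2717 and 3 separated by tabs; the file names here are placeholders.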


2.2 Applying an existing model

User-trained models can be applied to new proteomic data for prediction. Follow the on-screen instructions to locate the prebuilt model and the dta/out files. Once the calculations are finished, the result files are saved to the same folder as the dta/out files. The output data are written to a comma-delimited file named [****]validations.csv, where the content of [****] depends on which validation model you selected. An example of the output:

   valid     xcorr   scan  charge  reverse DB  filename
1  0.971268  3.6384  2717  3       0           ctrl_mcp5.2717.2743.3.txt
2  0.068586  3.0071  2514  3       0           mcp5_1min.2514.2536.3.txt
3  0.991618  3.5538  2544  3       0           mcp5_1min.2544.2548.3.txt
4  0.68889   2.917   2684  3       0           mcp5_1min.2684.2704.3.txt
5  0.982446  3.8396  2504  3       0           mcp5_2min.2504.2504.3.txt
6  0.647182  2.5267  2633  3       0           mcp5_2min.2633.2646.3.txt
7  0.981636  3.5918  2584  3       0           mcp5_30sec.2584.2604.3.txt
8  0.947145  3.1169  2599  3       0           mcp5_30sec.2599.2622.3.txt
9  0.968721  2.8528  2629  3       0           mcp5_30sec.2629.2644.3.txt

In the table above, the column labeled 'valid' is a quantitative measure of how well the sequence assignment matches the raw MS/MS spectrum. The score ranges from 0 to 1, where 1 means the assignment is most likely correct and 0 means it is least likely correct.
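Because the 'valid' score is a number between 0 and 1, the result file can be filtered downstream with a few lines of code. The following Python sketch is only an illustration. It assumes the CSV header uses exactly the column names shown in the table; a real file may also carry a leading row-index column, which the code ignores. The 0.9 cutoff is an arbitrary example, not a recommended value, and confident_assignments is a hypothetical helper name.

```python
import csv
import io

# Excerpt built from rows of the table above; the header layout is an assumption.
SAMPLE = """valid,xcorr,scan,charge,reverse DB,filename
0.971268,3.6384,2717,3,0,ctrl_mcp5.2717.2743.3.txt
0.068586,3.0071,2514,3,0,mcp5_1min.2514.2536.3.txt
0.68889,2.917,2684,3,0,mcp5_1min.2684.2704.3.txt
"""


def confident_assignments(handle, threshold=0.9):
    """Return the rows whose 'valid' score is at least `threshold`."""
    return [
        row for row in csv.DictReader(handle)
        if float(row["valid"]) >= threshold
    ]


rows = confident_assignments(io.StringIO(SAMPLE))
print([row["filename"] for row in rows])
```

To read an actual result file, pass an open file handle (opened with newline="") in place of the in-memory sample.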


3. Automation Mode

Logistic regression can also be run in command-line mode to provide automated, high-throughput statistical validation. All related files are located in the folder ./commandline, which shares most of its core statistical code with the GUI version and differs mainly in the user interface. Some modifications are necessary for your own implementation, since the current release is part of the HTAPP system and the code is specifically tailored to that system.

A basic workflow in automation mode is as follows:
1) Put the "dta" files in c:\temp\dta\. Each file name must have the extension ".txt" and contain only one dot (e.g. mysample_10_12_3.txt). You may change this restriction in combineDtaOut.r.
2) Put the pre-parsed "out" files in c:\temp\LogisticScoreInput\. Use the same file name as the corresponding dta file. A pre-parsed file contains all the data extracted from the SEQUEST OUT file that the later statistical analysis needs. You may use the Perl script parsepepdoc.pl provided in this package to generate the pre-parsed file from the original OUT file. A sample file is shown below. Its lines contain, from the first line down: scan number; charge state; observed MH+; differential modifications, separated by "|"; static modifications, separated by " "; and the top 1-10 hits, one per line (from left to right: theoretical MH+, delta CN, XCorr, Sp, ions matched, ions total, sequence, decoy hits).

1024
2
1691.77700
* +79.9663|# +10.0083|@ +8.0142|
C=160.0612
1691.76930    0.0000    3.7446    137.3    16    52    R.FLLPS*VGT*VVDQEK.G    R
1691.82350    0.0309    3.6287    183.9    11    28    K.DLENNLPYDGQGTKK.S    R
1691.64942    0.0400    3.5947    114.3    13    55    K.QPNIS*LY*CT*VEK.I    R
1691.86603    0.0550    3.5385    147.1    11    36    R.TK@VK@ELY*DVLMEK.M    F
1691.84041    0.0572    3.5303    124.0    11    36    -.EKITFLY*VRGEEK.K    R
1691.78332    0.0967    3.3826    130.4    13    44    K.R#WAKT*Y*LLVDEK.L    F
1691.67877    0.1097    3.3339    157.3    12    44    R.DNY*KNT*LYLEMK.S    F
1691.77930    0.1098    3.3335    99.0    12    48    R.KTY*VSAPR#IT*ETR.G    R
1691.75940    0.1115    3.3270    159.6    15    52    R.FPLYPPNS*GS*LLAR.Y    R
1691.69894    0.1557    3.1615    119.2    13    55    K.FEVLICT*T*LY*GK@.K    F


3) Run combDir.bat in C:\Temp\RLogisticScore\ to split the input files across four threads.
4) Run RunR1.bat, RunR2.bat, RunR3.bat and RunR.bat to start the calculations. The results are saved in c:\temp\validations. One file is created per out file; it contains the spectral validation score and the decoy database boolean (0: Forward; 1: Reversed).
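The pre-parsed file layout described in step 2 can be checked before starting a run. The following Python sketch reads one such file into a dictionary. It is not the package's own parser. It assumes that the hit columns are separated by whitespace, that each differential modification is a symbol followed by a mass shift, and that each static modification has the form residue=mass. The decoy column is stored as found (R or F in the sample) without interpretation. The names Hit and parse_preparsed are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    theoretical_mh: float
    delta_cn: float
    xcorr: float
    sp: float
    ions_matched: int
    ions_total: int
    sequence: str
    decoy_flag: str  # kept verbatim ("R" or "F" in the sample file)


def parse_preparsed(text):
    """Parse the text of one pre-parsed OUT file into a dictionary."""
    lines = text.strip().splitlines()
    if len(lines) < 5:
        raise ValueError("expected at least five header lines")

    # Line 4: differential modifications, e.g. "* +79.9663|# +10.0083|"
    diff_mods = {}
    for item in lines[3].split("|"):
        if item.strip():
            symbol, delta = item.split()
            diff_mods[symbol] = float(delta)

    # Line 5: static modifications, e.g. "C=160.0612"
    static_mods = {}
    for item in lines[4].split():
        residue, mass = item.split("=")
        static_mods[residue] = float(mass)

    # Remaining non-empty lines: one hit per line, eight columns each
    hits = []
    for line in lines[5:]:
        if not line.strip():
            continue
        f = line.split()
        hits.append(Hit(float(f[0]), float(f[1]), float(f[2]), float(f[3]),
                        int(f[4]), int(f[5]), f[6], f[7]))

    return {
        "scan": int(lines[0]),
        "charge": int(lines[1]),
        "observed_mh": float(lines[2]),
        "diff_mods": diff_mods,
        "static_mods": static_mods,
        "hits": hits,
    }


# First two hits of the sample file shown in step 2
SAMPLE = """1024
2
1691.77700
* +79.9663|# +10.0083|@ +8.0142|
C=160.0612
1691.76930    0.0000    3.7446    137.3    16    52    R.FLLPS*VGT*VVDQEK.G    R
1691.82350    0.0309    3.6287    183.9    11    28    K.DLENNLPYDGQGTKK.S    R
"""

record = parse_preparsed(SAMPLE)
print(record["scan"], record["charge"], len(record["hits"]))
```

A file that fails to parse here is likely to cause problems later in the batch run, so a check of this kind can save a restart.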

To change the default directories, update the locations in these files: combDir.bat, runComb.r, combineDtaOut.r, runR.bat, runR1.bat, runR2.bat, runR3.bat, runAll.r, runAll1.r, runAll2.r, runAll3.r, commandMain.r, commandMain1.r, commandMain2.r, commandMain3.r
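With this many files to edit, it can help to list every line that mentions the default directory first. The following Python sketch scans the .bat and .r files in a folder for a path prefix. It is a hypothetical helper, not part of the package. It assumes that the default locations all start with c:\temp, and that the scripts may write that prefix with forward slashes or doubled backslashes, which are normalized before matching.

```python
from pathlib import Path


def find_path_references(code_dir, needle="c:\\temp", suffixes=(".bat", ".r")):
    """Return (file name, line number, line) for each line mentioning `needle`.

    Matching is case-insensitive; "/" and doubled backslashes are treated
    like a single backslash so that c:/temp and c:\\\\temp are also found.
    """
    results = []
    for path in sorted(Path(code_dir).iterdir()):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        text = path.read_text(errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            normalized = line.lower().replace("\\\\", "\\").replace("/", "\\")
            if needle.lower() in normalized:
                results.append((path.name, number, line.strip()))
    return results
```

Calling find_path_references on the ./commandline folder would then give a checklist of lines to review by hand; the sketch only reports matches and does not modify any file.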


4. FAQ