Logistic Spectral Validation Instructional Manual
Kebing Yu
Table of Contents
1. Installation
2. GUI Mode
2.1 Making a new model
2.2 Applying an existing model
3. Automation Mode
4. FAQ
1. Installation
To install the R programming environment, first download the R installer
for your operating system from http://www.r-project.org/. Then run the
installer and accept the default settings.
Two additional libraries (ROC and Biobase) need to be installed.
Launch R and type the following commands in the R Console:
source("http://bioconductor.org/biocLite.R")
biocLite("ROC")
biocLite("Biobase")
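After installation, it can help to confirm that both libraries are actually available. The following is a minimal sketch and not part of the package; the names required, is_installed and missing_pkgs are illustrative. It also assumes, in a comment, that on recent Bioconductor releases the biocLite.R script has been superseded by the BiocManager package and that ROC and Biobase are still distributed there.

```r
# Packages the manual asks for (names taken from the install commands above).
required <- c("ROC", "Biobase")

# TRUE when the package can be found in the current library paths.
is_installed <- function(pkg) requireNamespace(pkg, quietly = TRUE)

missing_pkgs <- required[!vapply(required, is_installed, logical(1))]

if (length(missing_pkgs) > 0) {
  message("Still missing: ", paste(missing_pkgs, collapse = ", "))
  # Assumption: on a recent R release biocLite() is unavailable and the
  # equivalent install would be:
  #   install.packages("BiocManager")
  #   BiocManager::install(missing_pkgs)
} else {
  message("ROC and Biobase are installed.")
}
```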
The logistic spectral validation package can be installed in any folder
by unzipping the source package. Two folders are created: ./gui
and ./commandline.
Download Perl from
http://www.activestate.com/activeperl/ and install it to c:/perl. If your path
to perl.exe is not c:/perl/bin/perl.exe, edit line 21 of ./gui/PredictGui.r
so that it points to your path. A Perl script embedded in the spectral
validation package parses the SEQUEST out files.
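To check which Perl path applies on your machine before editing PredictGui.r, something like the following can be run in the R Console. It is only a sketch: find_perl and default_perl are hypothetical names, and the fallback to the system PATH is an assumption about how you might have installed Perl, not something the package does itself.

```r
# Default location from the manual; change it if Perl lives elsewhere.
default_perl <- "c:/perl/bin/perl.exe"

# Return the default path if it exists, otherwise whatever "perl" resolves to
# on the PATH, otherwise NA.
find_perl <- function(default = default_perl) {
  if (file.exists(default)) return(default)
  on_path <- unname(Sys.which("perl"))  # "" when perl is not on the PATH
  if (nzchar(on_path)) return(on_path)
  NA_character_
}

perl_path <- find_perl()
if (is.na(perl_path)) {
  message("perl not found; install it or edit the path in ./gui/PredictGui.r")
} else {
  message("perl found at: ", perl_path)
}
```

If the reported path differs from c:/perl/bin/perl.exe, that is the value to put in PredictGui.r.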
2. GUI Mode
GUI-related source files are located in the ./gui folder. Before
running the logistic validation model in GUI mode, open the file runAll.r and
change the variable codedir at line 3 to the location of the code on your
computer. (Note: write the path with forward slashes [/], because a single
backslash is an escape character in R strings, and don't forget the final /.)
Then, in the R Console, type:
setwd('yourdir')
source('runAll.r')
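If you copy the folder location from Windows Explorer, it will contain backslashes and no trailing slash. A small helper along these lines converts it to the form codedir expects. This is a sketch; normalize_codedir is a hypothetical name and the example path is made up.

```r
# Convert backslashes to forward slashes and make sure the path ends in "/".
normalize_codedir <- function(path) {
  path <- gsub("\\", "/", path, fixed = TRUE)  # "\\" is one literal backslash
  if (!grepl("/$", path)) path <- paste0(path, "/")
  path
}

# Example with a made-up location (backslashes must be doubled in R source).
codedir <- normalize_codedir("C:\\tools\\spectral\\gui")
print(codedir)
```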
On the next screen, choose one of two options, 'Make new model' or
'Apply old model', and click 'OK'.
2.1 Making a new model
If 'Make new model' is selected, a dialog on the next screen asks whether the
data set is already validated. Select 'yes'. You are then directed to a dialog
to locate the target folder that contains both the dta and out files created
by SEQUEST. These files are used to compute the variables for logistic
regression.
The following screen asks for a validation file, which is required to train
new models. The user provides calls for correct and incorrect SEQUEST
assignments, and the software builds a logistic model based on these calls.
Validation files are tab-delimited, with the format "valid boolean"\t"1st
scan number"\t"charge state"\n. Valid is a boolean, with 1 for correct and
0 for incorrect. The scan number is the first number after the base name in
the DTA/OUT file names.
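The format just described can be produced from R itself. The sketch below writes a three-line validation file and reads it back. The calls are made up for illustration, the file name validation.txt is hypothetical, and the absence of a header row and of quoting is an assumption based on the format string above.

```r
# Made-up calls for illustration only: 1 = correct, 0 = incorrect.
calls <- data.frame(
  valid  = c(1L, 0L, 1L),
  scan   = c(2717L, 2514L, 2544L),
  charge = c(3L, 3L, 3L)
)

validation_file <- file.path(tempdir(), "validation.txt")  # hypothetical name

# Tab-delimited, one assignment per line; no header row, no quotes (assumed).
write.table(calls, validation_file, sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = FALSE)

# Read the file back to check the layout.
check <- read.delim(validation_file, header = FALSE,
                    col.names = c("valid", "scan", "charge"))
stopifnot(all(check$valid %in% c(0, 1)))
print(check)
```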
After calculations are completed, four models (sequest, sequest plus,
reduced spectral, full spectral) trained on the user-provided data are saved
in the same folder as the dta/out files, with the file extension .Rdata. These
.Rdata files can later be used to apply the prebuilt models to newly acquired
data.
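To see what a saved .Rdata file contains without touching your workspace, you can load it into its own environment. The sketch below uses a stand-in file, because the names of the objects stored in the real model files are not documented here; example_model.Rdata and dummy_model are hypothetical.

```r
# Stand-in for a saved model file. A real one is written next to the dta/out
# files; the object name "dummy_model" is hypothetical.
model_file <- file.path(tempdir(), "example_model.Rdata")
dummy_model <- list(note = "placeholder, not a real fitted model")
save(dummy_model, file = model_file)

# Load into a separate environment so nothing in the workspace is overwritten.
model_env <- new.env()
loaded_names <- load(model_file, envir = model_env)

print(loaded_names)                # names of the objects stored in the file
str(model_env[[loaded_names[1]]])  # structure of the first stored object
```

With a real model file, replace model_file with its path and skip the save() step.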
2.2 Applying an existing model
User-trained models can be applied to new proteomic data for prediction.
Follow the on-screen instructions to locate the prebuilt model and the dta/out
files. Once calculations are finished, result files are saved to the same
folder as the dta/out files. The output data are in a comma-delimited file
named [****]validations.csv, where the content of [****] depends on which
validation model you selected.
     valid      xcorr    scan   charge   reverse DB   filename
1    0.971268   3.6384   2717   3        0            ctrl_mcp5.2717.2743.3.txt
2    0.068586   3.0071   2514   3        0            mcp5_1min.2514.2536.3.txt
3    0.991618   3.5538   2544   3        0            mcp5_1min.2544.2548.3.txt
4    0.68889    2.917    2684   3        0            mcp5_1min.2684.2704.3.txt
5    0.982446   3.8396   2504   3        0            mcp5_2min.2504.2504.3.txt
6    0.647182   2.5267   2633   3        0            mcp5_2min.2633.2646.3.txt
7    0.981636   3.5918   2584   3        0            mcp5_30sec.2584.2604.3.txt
8    0.947145   3.1169   2599   3        0            mcp5_30sec.2599.2622.3.txt
9    0.968721   2.8528   2629   3        0            mcp5_30sec.2629.2644.3.txt
In the table above, the column labeled 'valid' is a quantitative measure of
how well the sequence assignment matches the raw MS/MS spectrum. The score
ranges from 0 to 1, where 1 means most likely correct and 0 means least
likely.
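A typical next step is to keep only the assignments whose score exceeds a cutoff. The sketch below does this for the first four rows of the example table. The column names are assumptions, so check them against the header of your own file, and the 0.9 cutoff is arbitrary and not a recommendation from this manual.

```r
# First four rows of the example table. Column names are assumed.
results <- data.frame(
  valid    = c(0.971268, 0.068586, 0.991618, 0.68889),
  xcorr    = c(3.6384, 3.0071, 3.5538, 2.917),
  scan     = c(2717L, 2514L, 2544L, 2684L),
  charge   = c(3L, 3L, 3L, 3L),
  reverse  = c(0L, 0L, 0L, 0L),
  filename = c("ctrl_mcp5.2717.2743.3.txt", "mcp5_1min.2514.2536.3.txt",
               "mcp5_1min.2544.2548.3.txt", "mcp5_1min.2684.2704.3.txt"),
  stringsAsFactors = FALSE
)
# With a real result file you would instead use: results <- read.csv(csv_path)

threshold <- 0.9  # arbitrary cutoff for illustration only

# Keep forward-database hits whose score reaches the cutoff.
accepted <- results[results$valid >= threshold & results$reverse == 0, ]
print(accepted[, c("scan", "charge", "valid")])
```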
3. Automation Mode
Logistic regression can be run in command-line mode to provide automated,
high-throughput statistical validation. All related files are located in the
folder ./commandline, which shares most of its core statistical code with the
GUI version; only the user-interface code differs. Some modifications are
necessary for your own implementation, since the current release is part of
the HTAPP system and the code is specifically adapted to that system.
A basic workflow in automation mode is as follows:
1) Put "dta" files in c:\temp\dta\. Each file name must have the extension
".txt" and contain only one dot (e.g. mysample_10_12_3.txt). You may change
this restriction in combineDtaOut.r.
2) Put pre-parsed "out" files in c:\temp\LogisticScoreInput\. Use the same
file name as the corresponding dta file. A pre-parsed file contains all the
data extracted from the SEQUEST OUT file that the later statistical analysis
needs. You may use the Perl script parsepepdoc.pl provided in this package to
generate a pre-parsed file from the original OUT file. A sample file is shown
here. Its fields, in order from the first line, are: scan number; charge
state; observed MH+; differential modifications, separated by "|"; static
modifications, separated by " "; and the top 1-10 hits (for each hit, from
left to right: theoretical MH+, delta CN, XCorr, Sp, ions matched, ions
total, sequence, decoy hits):
1024 2 1691.77700 * +79.9663|# +10.0083|@ +8.0142| C=160.0612 1691.76930 0.0000 3.7446 137.3 16 52 R.FLLPS*VGT*VVDQEK.G R 1691.82350 0.0309 3.6287 183.9 11 28 K.DLENNLPYDGQGTKK.S R 1691.64942 0.0400 3.5947 114.3 13 55 K.QPNIS*LY*CT*VEK.I R 1691.86603 0.0550 3.5385 147.1 11 36 R.TK@VK@ELY*DVLMEK.M F 1691.84041 0.0572 3.5303 124.0 11 36 -.EKITFLY*VRGEEK.K R 1691.78332 0.0967 3.3826 130.4 13 44 K.R#WAKT*Y*LLVDEK.L F 1691.67877 0.1097 3.3339 157.3 12 44 R.DNY*KNT*LYLEMK.S F 1691.77930 0.1098 3.3335 99.0 12 48 R.KTY*VSAPR#IT*ETR.G R 1691.75940 0.1115 3.3270 159.6 15 52 R.FPLYPPNS*GS*LLAR.Y R 1691.69894 0.1557 3.1615 119.2 13 55 K.FEVLICT*T*LY*GK@.K F
3) Run combDir.bat in C:\Temp\RLogisticScore\ to split the input files into
four threads.
4) Run RunR1.bat, RunR2.bat, RunR3.bat, and RunR.bat to start the
calculations. Results are saved in c:\temp\validations. One file is created
per out file; it contains the spectral validation score and the decoy
database boolean (0: Forward; 1: Reversed).
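To inspect the hit fields listed in step 2, the following sketch splits the hit portion of a pre-parsed record into a table with one row per hit. It assumes the eight per-hit fields are separated by whitespace, as they appear in the sample shown here; the real files may lay the fields out differently. The function name parse_hits and the column names are illustrative and are not taken from the package.

```r
# Assumed names for the eight per-hit fields, in the order given in step 2.
hit_fields <- c("theoretical_mh", "delta_cn", "xcorr", "sp",
                "ions_matched", "ions_total", "sequence", "decoy")

# Split whitespace-separated hit text into a data frame, one row per hit.
parse_hits <- function(text) {
  tokens <- strsplit(trimws(text), "[[:space:]]+")[[1]]
  stopifnot(length(tokens) %% length(hit_fields) == 0)
  hits <- as.data.frame(matrix(tokens, ncol = length(hit_fields), byrow = TRUE),
                        stringsAsFactors = FALSE)
  names(hits) <- hit_fields
  numeric_cols <- setdiff(hit_fields, c("sequence", "decoy"))
  hits[numeric_cols] <- lapply(hits[numeric_cols], as.numeric)
  hits
}

# The first two hits from the sample record.
hits <- parse_hits(paste(
  "1691.76930 0.0000 3.7446 137.3 16 52 R.FLLPS*VGT*VVDQEK.G R",
  "1691.82350 0.0309 3.6287 183.9 11 28 K.DLENNLPYDGQGTKK.S R"
))
print(hits)
```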
To change the default directories, update the locations in these files:
comDir.bat, runComb.r, combineDtaOut.r, runR.bat, runR1.bat, runR2.bat,
runR3.bat, runAll.r, runAll1.r, runAll2.r, runAll3.r, commandMain.r,
commandMain1.r, commandMain2.r, commandMain3.r
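Because the locations are spread over many files, it can help to list every line that mentions the old directory before editing. The sketch below does this with a case-insensitive plain-text search. The function name find_path_refs is hypothetical, and the demo runs on a throwaway file with made-up contents rather than on the real scripts.

```r
# Return the lines in each file that mention a given directory.
find_path_refs <- function(files, old_dir) {
  refs <- lapply(files, function(f) {
    lines <- readLines(f, warn = FALSE)
    # Case-insensitive, literal (non-regex) match.
    hit <- which(grepl(tolower(old_dir), tolower(lines), fixed = TRUE))
    data.frame(file = rep(basename(f), length(hit)),
               line = hit,
               text = lines[hit],
               stringsAsFactors = FALSE)
  })
  do.call(rbind, refs)
}

# Demo on a throwaway file with made-up contents.
demo_file <- file.path(tempdir(), "runAll_demo.r")
writeLines(c("codedir <- 'c:/temp/RLogisticScore/'",
             "x <- 1",
             "outdir <- 'C:/Temp/validations/'"), demo_file)

refs <- find_path_refs(demo_file, "c:/temp")
print(refs)
```

On the real files, pass the full paths of the scripts listed above. You may need to search separately for the backslash spelling of the directory, since the batch files and the R scripts may write paths differently.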
4. FAQ