DDA Data Processing from Mass Spec

02/05/2017

Data processing steps after Mass Spectrometry

I carried out all of my data processing over ssh on Emu, a machine in the Roberts Lab. The following programs were installed there to run the DDA proteomics analyses: Comet, the Trans-Proteomic Pipeline (PeptideProphet and ProteinProphet), and Abacus. To document my progress in a Jupyter Notebook, I tunnelled into Emu using the following procedure:

Open a Git Bash terminal locally and enter the following code:

ssh -N -L localhost:8000:localhost:7000 srlab@128.95.149.195

Enter password when prompted
The terminal is active but it won’t return a command prompt; you have created a tunnel into Emu. To start ipython on Emu:
Open a new Git Bash terminal, but keep the first one open.
Enter ssh srlab@128.95.149.195
Enter password when prompted
Enter ipython notebook --no-browser --port=7000
Now you are remotely logged into Emu and have started ipython without the browser.
Open a browser locally and enter localhost:8000
This will open the Jupyter session running on Emu in your local browser. It will prompt you for a token, which you can find in the second Git Bash terminal you opened.

1) The MS creates .raw files (which I stored on Owl). To run peptide searches in Comet, these files must first be converted into .mzXML files, which I saved in a local directory on Emu. I converted all of my files by running the following code:

for file in ~/Documents/rhonda2016oyster/trochophora/C_gigas_proteome2016/20161205_Sample_*.raw
do
    no_path=${file##*/}        # strip the directory part of the path
    no_ext=${no_path%.raw}     # strip the .raw extension
    WINEPREFIX=~/.wine32 wine ReAdW.2016010.msfilereader.exe "$file" "$no_ext".mzXML
done

See Jupyter notebook
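The two parameter expansions in the conversion loop are easy to misread, so here is a minimal sketch of what they do in isolation. The function name `mzxml_name` and the example path are made up for illustration; nothing here touches real files or calls ReAdW.

```shell
# Derive the .mzXML output name from a .raw input path.
# mzxml_name is a hypothetical helper, not part of the original workflow.
mzxml_name() {
    local file=$1
    local no_path=${file##*/}      # drop everything up to the last "/"
    local no_ext=${no_path%.raw}   # drop a trailing ".raw"
    printf '%s.mzXML\n' "$no_ext"
}

# Example path is invented; prints 20161205_Sample_1.mzXML
mzxml_name "/data/run1/20161205_Sample_1.raw"
```

Because the directory is stripped, the .mzXML files are written to whatever directory the loop is run from, not next to the .raw files.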

2) Next you will run a program called Comet which will search your .mzXML files against a protein sequence database. For this step you will need three items:

A protein database for your target organism, in my case a Crassostrea gigas proteome (contigs.fasta.transdecoder.pep) with the addition of three contaminant files (contam.other, contam.human, and contam.bovin). I added the contaminant files using the following code:

!cat contam.bovin contam.human contam.other contigs.fasta.transdecoder.pep > contigs_contam.fasta.transdecoder.pep

Your .mzXML files in one directory (excluding blanks and QCs)
A comet.params file, which I downloaded from this website. I chose comet.params.high-low (high-res MS1 and low-res MS2, e.g. Velos-Orbitrap).

All these files need to be in the same directory. Comet will produce pep.xml files for all your samples.
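Before searching, one quick sanity check on the combined database is to confirm that it has as many FASTA entries as the input files put together. The sketch below uses tiny invented files in a temporary directory in place of the real contaminant and proteome files; `count_entries` is a hypothetical helper name.

```shell
# Count FASTA header lines (lines starting with ">") across the files given.
count_entries() {
    cat "$@" | grep -c '^>'
}

# Toy files stand in for the contaminant and proteome FASTA files.
tmpdir=$(mktemp -d)
printf '>c1\nMKV\n>c2\nAAL\n' > "$tmpdir/contam.fa"
printf '>p1\nMSS\n' > "$tmpdir/proteome.fa"
cat "$tmpdir/contam.fa" "$tmpdir/proteome.fa" > "$tmpdir/combined.fa"

parts=$(count_entries "$tmpdir/contam.fa" "$tmpdir/proteome.fa")
combined=$(count_entries "$tmpdir/combined.fa")
echo "inputs: $parts entries, combined: $combined entries"

rm -rf "$tmpdir"
```

If one of the input files does not end with a newline, cat will glue its last sequence line to the next file's first header, and the combined count will come out lower than the sum of the inputs.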

I used the following code to conduct my searches:

!/home/shared/comet/comet.2016012.linux.exe -Pcomet.params.high-low -Dcontigs_contam.fasta.transdecoder.pep 20161205_Sample_*.mzXML

See Jupyter notebook

3) Next I calculated statistics associated with the peptide-to-protein matches using the Trans-Proteomic Pipeline (TPP). I used a minimum probability cutoff of 0.9 (the -p0.9 option). I used the pep.xml files for this step and kept everything in the same directory. This step will create a set of interact- files for each of your pep.xml files. The following code was used:

!/usr/tpp_install/tpp/bin/xinteract -dDECOY_ -N20161205_sample_1 20161205_sample_1.pep.xml -p0.9 -OAp

See Jupyter notebooks 1 2 3 4
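The command above handles a single sample. As a rough sketch of how the same call could be generated for several samples, the loop below only prints the commands rather than running them. The flags are copied from the command above; the helper name `xinteract_cmd` and the second file name are assumptions for illustration.

```shell
# Print (do not run) the xinteract command for one .pep.xml file.
# xinteract_cmd is a hypothetical helper; flags match the command above.
xinteract_cmd() {
    local pepxml=$1
    local name=${pepxml%.pep.xml}   # e.g. 20161205_sample_1
    printf '%s -dDECOY_ -N%s %s -p0.9 -OAp\n' \
        /usr/tpp_install/tpp/bin/xinteract "$name" "$pepxml"
}

# Example file names; replace with the real list of .pep.xml files.
for f in 20161205_sample_1.pep.xml 20161205_sample_2.pep.xml; do
    xinteract_cmd "$f"
done
```

Printing first makes it easy to check the generated names before committing to a long run; piping the output to bash would then execute the commands.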

4) Abacus correlates protein inferences across my sample files so that a single protein can be associated with each peptide. The output is a single compiled .tsv file with the corresponding spectral counts and the normalized spectral abundance factor (NSAF), which is a proxy for protein abundance.

I first ran ProteinProphet on all of the interact files to produce a combined prot.xml file:

!/usr/tpp_install/tpp/bin/ProteinProphet interact*.pep.xml interact-COMBINED.prot.xml

I then made an Abacus parameter file and ran Abacus from a bash terminal:

java -Xmx16g -jar /home/shared/abacus/abacus.jar -p ~/Documents/rhonda/Abacus_parameters.txt

See Jupyter notebook
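For reference, NSAF is usually defined as a protein's spectral count divided by its length, normalized by the sum of that ratio over all proteins in the sample. The sketch below computes it with awk on a made-up three-column input (protein, spectral count, length). This layout is an assumption for illustration and is not the Abacus output format; Abacus reports NSAF itself.

```shell
# Compute NSAF from "protein<TAB>spectral_count<TAB>length" lines on stdin.
# Assumes every length is non-zero and at least one count is non-zero.
nsaf() {
    awk -F '\t' '
        { id[NR] = $1; saf[NR] = $2 / $3; total += saf[NR] }
        END { for (i = 1; i <= NR; i++) printf "%s\t%.4f\n", id[i], saf[i] / total }
    '
}

# Invented example: two proteins of equal length with counts 10 and 30.
printf 'protA\t10\t100\nprotB\t30\t100\n' | nsaf
```

With equal lengths the NSAF values reduce to each protein's share of the spectral counts, so the example prints 0.2500 for protA and 0.7500 for protB.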

5) The next step uses the Qspec spectral counter, a web-based program found here. Input files need to be in .txt format. This program finds differentially abundant proteins by giving you a log fold change and a z-statistic for each protein. A protein is generally considered significant if the absolute value of its log fold change is at least 0.5 and the absolute value of its z-statistic is at least 2.
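The significance rule above can be applied to a results table with a short awk filter. This is only a sketch: it assumes a tab-separated input with the protein name, log fold change, and z-statistic in columns 1 to 3, which is a hypothetical layout and may not match the actual Qspec output. The function name `significant` and the example rows are invented.

```shell
# Print proteins with |log fold change| >= 0.5 and |z-statistic| >= 2.
# Input: "protein<TAB>logFC<TAB>z" lines on stdin (hypothetical layout).
significant() {
    awk -F '\t' '
        function abs(x) { return x < 0 ? -x : x }
        abs($2) >= 0.5 && abs($3) >= 2 { print $1 }
    '
}

# Invented rows: p1 and p2 pass both cutoffs, p3 fails on fold change,
# p4 fails on the z-statistic.
printf 'p1\t0.8\t2.5\np2\t-1.2\t-3.0\np3\t0.3\t4.0\np4\t0.9\t1.1\n' | significant
```

Both conditions must hold, so a large fold change with a weak z-statistic (or the reverse) is filtered out.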

Written on February 5, 2017