LIMMA pipeline for processing proteomics (MS) data


Tags: R, pipeline

Statistical analysis on LC-MS data

In order to detect significant changes between two experimental groups, we performed statistical analysis on LC-MS data. We chose the LIMMA moderated t-test statistic proposed in Kammers et al. (2015) over the standard t-test, since it uses the full data set to shrink the observed sample variances toward a pooled estimate, resulting in far more stable and powerful inference than the ordinary t-test. Missing values are a common problem in proteomic data, and it is standard practice to impute missing information before applying statistical methods. We used the imputation algorithm proposed in Tyanova et al. (2016) because it allows us to randomly draw values from a distribution meant to simulate values below the detection limit of the MS instrument. We implemented the statistical analysis in R; the code is freely available in the following GitHub repository: https://github.com/wasimaftab/LIMMA-pipeline-proteomics/tree/master/Mitochondrial_Loop. The final volcano plots only feature proteins that are part of the high-confidence mitochondrial proteome (Morgenstern et al., 2017).
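As a minimal, self-contained illustration of the moderated t-test (separate from the pipeline script below), the following sketch runs limma's lmFit/eBayes on a small simulated two-group log2-intensity matrix; the matrix size, group sizes and intensity distribution are arbitrary assumptions chosen only for demonstration.

library(limma)

set.seed(1)
## simulated log2 intensities: 100 proteins, 3 treatment and 3 control replicates
mat <- matrix(rnorm(100 * 6, mean = 23, sd = 1), nrow = 100,
              dimnames = list(paste0("prot", 1:100),
                              c(paste0("treat", 1:3), paste0("ctrl", 1:3))))
design <- model.matrix(~ factor(c(2, 2, 2, 1, 1, 1)))
colnames(design) <- c("Intercept", "Diff")
fit <- eBayes(lmFit(mat, design))  # shrink per-protein variances toward a pooled estimate
topTable(fit, coef = "Diff")       # moderated t-statistics, p-values and log2 fold changes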

DATA AND CODE AVAILABILITY

The code generated during this study to perform statistical analysis on the BioID LC-MS data is freely available in the GitHub repository https://github.com/wasimaftab/LIMMA-pipeline-proteomics/tree/master/Mitochondrial_Loop. The data supporting this study have been uploaded to the public Mendeley Data repository at https://doi.org/10.17632/9symckg9wn.

#This is a pipeline to analyze proteomic data in a proteinGroups.txt (MaxQuant output) file for a two-group comparison
#install.packages("matlab")
library(dplyr)
library(stringr)
library(MASS)
library(matlab)
library(plotly)
library(limma)
library(qvalue)
library(htmlwidgets)
library(rstudioapi)

source("limma_helper_functions.R")

# ---- INPUT for Gunnar Project, directly read the sheet "Gene Prot (FOUND)" ----
## Load the proteingroups file
proteingroups_ <- read.csv("210618_MG_Carotisartery_consensus.csv", sep=",", header=TRUE)
proteingroups <- proteingroups_ %>% filter(Checked == "WAHR") # keep only checked rows ("WAHR" is TRUE in a German-locale export)

#Gene Symbol    Accession   CA1 CA2 CA3 CA10    CA5 CA6 CA8 CA14    CA12    CA13    CA11    CA4 Con1    Con2    Con3    Con4    Con5    Con6
#IGKV3-7    A0A075B6H7      195532822   276145022   75324344    179451608   93312672    188741058   244072994   540080672   133307714   102293349.5 133798065   166553494.5 138312451   45263876    361462136   185500736   357708246   156721253


### Kill the code if proteingroups does not contain crap columns
#temp <-
#  select(
#    proteingroups,
#    matches("(Reverse|Potential.contaminant|Only.identified.by.site)")
#  )
#if (!nrow(temp) * ncol(temp)) {
#  stop("File error, It does not contain crap...enter another file with crap")
#}

## Display data to facilitate the choice of treatment and control
temp <- select(proteingroups, matches("Abundance"))
print(names(temp))

### Remove "+" identifeid rows from proteingroups
#idx <- NULL
#temp <-
#  select(
#    proteingroups,
#    matches("(Only.identified.by.site|Reverse|Potential.contaminant)")
#  )
#for (i in 1:ncol(temp)) {
#  index <- which(unlist(!is.na(match(temp[, i], "+"))))
#  idx <- union(idx, index)
#}
#proteingroups <- proteingroups[-idx, ] # removing indexed rows

### Remove proteins with Sequence coverage [%] < 5 [%] and Unique peptides == 1
#temp <-
#  select(proteingroups,
#         matches("(^Sequence.coverage(.){4}$|^Unique.peptides$)"))
#proteingroups <-
#  proteingroups[-union(which(temp[, 1] == 1), which(temp[, 2] < 5)) , ] # removing indexed rows


## Extract Uniprot and gene symbols
#splits <- unlist(strsplit("CAD protein OS=Mus musculus OX=10090 GN=Cad PE=1 SV=1", '\\|'))
#splits <- unlist(str_match("CAD protein OS=Mus musculus OX=10090 GN=Cad PE=1 SV=1", "GN=(.*?) PE="))
Uniprot <- character(length = nrow(proteingroups))
Symbol <- character(length = nrow(proteingroups))
Descriptions <- character(length = nrow(proteingroups))
for (i in 1:nrow(proteingroups)) {
  temp <- as.character(proteingroups$Description[i])
  Descriptions[i] <- temp
  Uniprot[i] <- as.character(proteingroups$Accession[i])
  splits <- unlist(str_match(temp, "GN=(.*?) PE="))
  Symbol[i] <- splits[2]
}


# WAP_1, WAP_2, WAP_3, 
# D-ALPHA_1, D-ALPHA_2, D-ALPHA_3
# Q-ALPHA_1, Q-ALPHA_2, Q-ALPHA_3
# pTTSS+Q_1, pTTSS+Q_2, pTTSS+Q_3
# pTTSS+Q-ALPHA_1, pTTSS+Q-ALPHA_2, pTTSS+Q-ALPHA_3
##Extract required data for Limma
##NOTE: each of the three blocks below defines a different treatment vs control comparison and overwrites `data`, so run only the block for the comparison of interest (this one: F4-F6 vs F1-F3)
#treatment <- readline('Enter treatment name(case insensitive) as it appeared in the iBAQ/LFQ column= ')
#control <- readline('Enter control name(case insensitive) as it appeared in the iBAQ/LFQ column= ')
#ibaq <- readinteger_binary('Enter 1 for iBAQ or 0 for LFQ= ')
treatment <- c("Abundance..F4..Sample","Abundance..F5..Sample","Abundance..F6..Sample")
control <- c("Abundance..F1..Sample","Abundance..F2..Sample","Abundance..F3..Sample")
temp <- select(proteingroups, matches(paste('^.*', "Abundance", '.*$', sep = '')))  #LFQ
treatment_reps <- select(temp, matches(treatment))
#treatment_reps <- data_sanity_check(temp, 'treatment', treatment)
control_reps <- select(temp, matches(control))
#control_reps <- data_sanity_check(temp, 'control', control)
data <- cbind(treatment_reps,
        control_reps,
        #select(proteingroups, matches("^id$")),  #maybe problematic.
        Uniprot,
        Symbol)
##Extract required data for Limma (second comparison: F7-F9 vs F1-F3)
treatment <- c("Abundance..F7..Sample","Abundance..F8..Sample","Abundance..F9..Sample")
control <- c("Abundance..F1..Sample","Abundance..F2..Sample","Abundance..F3..Sample")
temp <- select(proteingroups, matches(paste('^.*', "Abundance", '.*$', sep = '')))
treatment_reps <- select(temp, matches(treatment))
control_reps <- select(temp, matches(control))
data <- cbind(treatment_reps,
        control_reps,
        Uniprot,
        Symbol)
##Extract required data for Limma (third comparison: F13-F15 vs F10-F12)
treatment <- c("Abundance..F13..Sample","Abundance..F14..Sample","Abundance..F15..Sample")
control <- c("Abundance..F10..Sample","Abundance..F11..Sample","Abundance..F12..Sample")
temp <- select(proteingroups, matches(paste('^.*', "Abundance", '.*$', sep = '')))
treatment_reps <- select(temp, matches(treatment))
control_reps <- select(temp, matches(control))
data <- cbind(treatment_reps,
        control_reps,
        Uniprot,
        Symbol)
#> head(data)
#  Abundance..F13..Sample Abundance..F14..Sample Abundance..F15..Sample
#1            74399257195             7824880491            2,67259E+11
#2             7002793339             2132521146            23458705782
#3             1060277580              620505532             5544368780
#4            442697991,6            613281893,8             1971535002
#5             1683646539            904648237,7             8359400640
#6            360135421,6             1339034150             1614352533
#  Abundance..F10..Sample Abundance..F11..Sample Abundance..F12..Sample Uniprot
#1             4487833586            4,47435E+11            2,23515E+11  B2RQC6
#2             2172021935            39427573076            28112131746  Q6ZQA0
#3            426743632,6             4947260656             6773100611  Q3UJB9
#4            926489052,4             1636804139             1250211974  Q8BX70
#5            797097872,6            10397417632             9384902608  Q9EP71
#6             2428231339             2254991154             6269016713  P26039
#  Symbol
#1    Cad
#2 Nbeal2
#3   Edc4
#4 Vps13c
#5  Rai14
#6   Tln1

# ---- INPUT for Susanne Project, directly read the sheet "Gene Prot (FOUND)" ----
data <- read.csv("210618_MG_Carotisartery_consensus.csv", dec=",")

#                          ,,1,1,1,1,1,1,1,1,1,1,                           2,2,         4,4,4,4,4,4
#Gene Symbol,Accession,    CA1,CA2,CA3,CA10,CA5,CA6,CA8,CA14,CA12,CA13,     CA11,CA4,    Con1,Con2,Con3,Con4,Con5,Con6
# Drop columns "CA11" and "CA4"
data <- data[, !(names(data) %in% c("CA11", "CA4"))]

## Impute data
#print(names(data))
#rep_treats <- readinteger("Enter the number of treatment replicates=")
#rep_conts <- readinteger("Enter the number of control replicates=")
#FC_Cutoff <- readfloat("Enter the log fold change cut off=")
rep_treats = 10
rep_conts = 6
FC_Cutoff = 2

## removing blank rows
#NO_BLANK_ROWS: if the data contain no blank (all-zero) rows, deactivate the following block.
#For Gunnar data: temp <- as.matrix(rowSums(apply(data[, 1:(rep_treats + rep_conts)], 2, as.numeric)))
temp <- as.matrix(rowSums(apply(data[, 3:(rep_treats + rep_conts + 2)], 2, as.numeric)))
idx <- which(temp == 0)
if (length(idx)) {
  data <- data[-idx,]
}
#NO INFINITE --> deactivate!  
#  data_limma <- log2(as.matrix( as.numeric(data[c(1:(rep_treats + rep_conts))]) ))
#data[,1] <- gsub(',', '.', data[,1])
#data[,2] <- gsub(',', '.', data[,2])
#data[,3] <- gsub(',', '.', data[,3])
#data[,4] <- gsub(',', '.', data[,4])
#data[,5] <- gsub(',', '.', data[,5])
#data[,6] <- gsub(',', '.', data[,6])

#For Gunnar data: data_limma <- log2(as.matrix( apply(data[, 1:(rep_treats + rep_conts)], 2, as.numeric) ))
data_limma <- log2(as.matrix( apply(data[, 3:(rep_treats + rep_conts + 2)], 2, as.numeric) ))
data_limma[is.infinite(data_limma)] <- NA
nan_idx <- which(is.na(data_limma))
#temp <- reshape(temp, nrow(data_limma)*ncol(data_limma), 1)
#hist(temp, na.rm = TRUE, xlab = "log2(intensity)", ylab = "Frequency", main =  "All data: before imputation")
# the following replaces the reshape/hist code commented out above
temp <- as.vector(data_limma)
temp <- temp[!is.na(temp)]
png("Freq_log2_before_imputation.png", width=800, height=800)
hist(temp, xlab = "log2(intensity)", ylab = "Frequency", main = "All data: before imputation")
dev.off()
fit <- fitdistr(c(na.exclude(data_limma)), "normal")
mu <- as.double(fit$estimate[1])
sigma <- as.double(fit$estimate[2])
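# The parameters below implement a Perseus-style downshifted-normal imputation (Tyanova et al., 2016):
# imputed values are drawn from a normal distribution whose width is 0.3 * sigma (new_width_cutoff)
# and whose mean is shifted down by 1.8 * sigma (downshift) relative to the observed distribution;
# these are the commonly used Perseus defaults for simulating intensities below the detection limit.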
sigma_cutoff <- 6
new_width_cutoff <- 0.3
downshift <- 1.8
width <- sigma_cutoff * sigma
new_width <- width * new_width_cutoff
new_sigma <- new_width / sigma_cutoff
new_mean <- mu - downshift * sigma
imputed_vals_my = rnorm(length(nan_idx), new_mean, new_sigma)
# scaling_factor <- readfloat_0_1("Enter a number > 0 and <=1 to scale imputed values = ")
scaling_factor = 1.0
data_limma[nan_idx] <- imputed_vals_my*scaling_factor
#data_limma[nan_idx] <- imputed_vals_my
#NOTE: [10857] values are imputed!!!

#> tail(data_limma)
#            CA1      CA2      CA3     CA10      CA5      CA6      CA8     CA14
#[1785,] 24.04443 23.32959 22.77167 23.65635 21.63752 24.01719 22.81896 23.32914
#[1786,]       NA       NA       NA       NA       NA 22.37408 22.64151       NA
#[1787,]       NA 22.91260 22.48334       NA       NA 22.02580 21.90541       NA
#[1788,] 24.19555 22.50627       NA 22.73289 23.85307 23.26974 22.76822       NA
#[1789,]       NA 23.84345 23.01388 22.02657       NA       NA 22.55753 21.60069
#[1790,] 24.50819       NA       NA 25.07117 22.63140 21.78845 22.61968 20.99800
#            CA12     CA13     Con1     Con2     Con3     Con4     Con5     Con6
#[1785,] 22.64312 23.35157 22.65619 22.88652 23.56822 22.20197 22.85973 21.98199
#[1786,]       NA       NA       NA       NA       NA       NA       NA       NA
#[1787,]       NA       NA       NA       NA       NA       NA       NA       NA
#[1788,]       NA       NA       NA       NA       NA 21.93513       NA 23.16174
#[1789,] 23.15220 21.70746 20.82720       NA       NA       NA       NA       NA
#[1790,]       NA 23.65803       NA 24.47523       NA 23.56952 22.89298       NA
#----impute---->
#[1785,] 22.64312 23.35157 22.65619 22.88652 23.56822 22.20197
#[1786,] 23.01317 23.12527 21.11763 20.60838 20.00608 20.90938
#[1787,] 22.20547 22.69346 23.68541 22.05089 21.50437 21.32925
#[1788,] 21.47150 22.78137 22.29056 20.30675 21.50243 21.93513
#[1789,] 23.15220 21.70746 20.82720 20.68821 22.90624 22.58886
#[1790,] 21.16174 23.65803 21.56143 24.47523 21.91832 23.56952
#            CA1      CA2      CA3     CA10      CA5      CA6      CA8     CA14
#[1785,] 24.04443 23.32959 22.77167 23.65635 21.63752 24.01719 22.81896 23.32914
#[1786,] 23.37888 21.17247 21.03109 21.87904 21.97014 22.37408 22.64151 22.80035
#[1787,] 21.57987 22.91260 22.48334 21.98497 22.34736 22.02580 21.90541 20.12150
#[1788,] 24.19555 22.50627 21.69512 22.73289 23.85307 23.26974 22.76822 21.56332
#[1789,] 20.48388 23.84345 23.01388 22.02657 23.42263 21.80419 22.55753 21.60069
#[1790,] 24.50819 22.03749 21.90290 25.07117 22.63140 21.78845 22.61968 20.99800
#            CA12     CA13     Con1     Con2     Con3     Con4
#[1785,] 22.64312 23.35157 22.65619 22.88652 23.56822 22.20197
#[1786,] 20.89429 23.68783 22.26621 21.77314 22.70896 22.14385
#[1787,] 22.28350 22.87021 21.09328 23.33020 21.88887 20.88960
#[1788,] 22.25577 21.17894 21.69301 22.28463 22.30470 21.93513
#[1789,] 23.15220 21.70746 20.82720 20.97298 19.93577 23.94133
#[1790,] 21.19956 23.65803 21.75923 24.47523 22.53592 23.56952
# temp <- reshape(temp, nrow(data_limma)*ncol(data_limma), 1)
temp <- as.vector(data_limma)
png("Freq_log2_after_imputation.png", width=800, height=800)
#na.rm = TRUE, 
hist(temp, xlab = "log2(intensity)", ylab = "Frequency", main =  "All data: after imputation")
dev.off()

## Median Normalization Module
#want_normalization <- as.integer(readline(
#  cat(
#    'Enter 1: if you want to normalize the protein intensities in each experiment by subtracting the median of the corresponding experiment\n',
#    '\bEnter 2: if you want to perform column wise median normalization of the data matrix\n',
#    '\b(for definition of "column wise median normalization" see README)= '
#  )
#))
#  if (want_normalization == 1) {
  # browser()
  # boxplot(data_limma[,1:rep_treats], main = paste(treatment, "replicates before normalization"))
  # boxplot(data_limma[,(rep_treats+1):(rep_treats+rep_conts)], main = paste(control, "replicates before normalization"))

  # Before normalization
  png("before_normalization.png", width=800, height=800)
  boxplot(data_limma[, 1:(rep_treats + rep_conts)], main = "data before median substract normalization")
  dev.off()

  # Subtracting the column medians
  # - Subtracting the median of each column (sample) sets every sample's median expression level
  #   to zero (or another reference level if not using the median). This is useful when comparing
  #   expression levels across samples and correcting for systematic per-sample bias, e.g. samples
  #   processed or measured at different times, by different technicians, or with slightly
  #   different protocols, leading to varying baseline levels.
  # - Subtracting the median of each row (gene/protein) instead normalizes the expression profile
  #   of each feature across samples. This is useful when comparing expression profiles of
  #   different genes and the interest lies in relative changes across samples, not absolute levels.
  # - In most expression studies, normalization by subtracting the median (or mean) of the columns
  #   (samples) is the typical approach to account for variation between samples, and that is what
  #   is done here.
  col_med <- matrixStats::colMedians(data_limma)
  med_mat <- matlab::repmat(col_med, nrow(data_limma), 1)
  data_limma <- data_limma - med_mat
  #(optional) using sweep instead of matlab::repmat
  #col_med <- apply(data_limma, 2, median, na.rm = TRUE)
  #data_limma <- sweep(data_limma, 2, col_med)

  # After normalization
  png("after_normalization.png", width=800, height=800)
  boxplot(data_limma[, 1:(rep_treats + rep_conts)], main = "data after median substract normalization")
  dev.off()
  # browser()
  # boxplot(data_limma[,1:rep_treats], main = paste(treatment, "replicates after normalization"))
  # boxplot(data_limma[,(rep_treats+1):(rep_treats+rep_conts)], main = paste(control, "replicates after normalization"))
#  } else if (want_normalization == 2) {
#    boxplot(data_limma[, 1:(rep_treats + rep_conts)], main = "data before column wise median normalization")
#    data_limma <- median_normalization(data_limma)
#    boxplot(data_limma[, 1:(rep_treats + rep_conts)], main = "data after column wise median normalization")
#    # browser()
#    # boxplot(data_limma[,1:rep_treats], main = paste(treatment, "replicates after normalization"))
#    # boxplot(data_limma[,(rep_treats+1):(rep_treats+rep_conts)], main = paste(control, "replicates after normalization"))
#  }

#For Gunnar's data
#Symbol <- data$Symbol
#Uniprot <- data$Uniprot
#For Susanne's data
Symbol <- data$Gene.Symbol
Uniprot <- data$Accession
##Limma main code
design <- model.matrix( ~ factor(c(rep(2, rep_treats), rep(1, rep_conts))))
colnames(design) <- c("Intercept", "Diff")
res.eb <- eb.fit(data_limma, design, Symbol)
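## For reference: eb.fit() is defined in limma_helper_functions.R (see the GitHub repository).
## A minimal sketch of such a helper, following Kammers et al. (2015), is shown commented out
## below; the exact column set and the `genes` argument are assumptions based on how res.eb is
## used in this script, not the repository's exact implementation.
# eb.fit <- function(dat, design, genes) {
#   fit <- lmFit(dat, design)
#   fit.eb <- eBayes(fit)
#   logFC <- fit.eb$coefficients[, 2]                             # log2 fold change (Diff coefficient)
#   t.ord <- logFC / (fit.eb$sigma * fit.eb$stdev.unscaled[, 2])  # ordinary t-statistic
#   t.mod <- fit.eb$t[, 2]                                        # moderated t-statistic
#   p.ord <- 2 * pt(-abs(t.ord), fit.eb$df.residual)              # ordinary p-value
#   p.mod <- fit.eb$p.value[, 2]                                  # moderated p-value
#   q.ord <- qvalue(p.ord)$qvalues
#   q.mod <- qvalue(p.mod)$qvalues
#   # row order is preserved so the result can be cbind-ed with data_limma further below
#   data.frame(gene = genes, logFC, t.ord, t.mod, p.ord, p.mod, q.ord, q.mod,
#              df.r = fit.eb$df.residual, df.0 = fit.eb$df.prior,
#              s2.0 = fit.eb$s2.prior, s2 = fit.eb$sigma^2, s2.post = fit.eb$s2.post)
# }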
Sig_FC_idx <- union(which(res.eb$logFC < (-FC_Cutoff)), which(res.eb$logFC > FC_Cutoff))
Sig_Pval_mod_idx <- which(res.eb$p.mod < 0.05)
Sig_Pval_ord_idx <- which(res.eb$p.ord < 0.05)
Sig_mod_idx <-  intersect(Sig_FC_idx, Sig_Pval_mod_idx)
Sig_ord_idx <-  intersect(Sig_FC_idx, Sig_Pval_ord_idx)
categ_Ord <- rep(c("Not Significant"), times = length(Symbol))
categ_Mod <- categ_Ord
categ_Mod[Sig_mod_idx] <- "Significant"
categ_Ord[Sig_ord_idx] <- "Significant"
dat <-
  cbind(
    res.eb,
    categ_Ord,
    categ_Mod,
    NegLogPvalMod = (-log10(res.eb$p.mod)),
    NegLogPvalOrd = (-log10(res.eb$p.ord))
  )

##Save the data file
setwd("/home/jhuang/DATA/Data_Susanne_MS/LIMMA-pipeline-proteomics")
final_data <-
  cbind(#DELETED select(data, matches("^id$")),
        Uniprot,
        Symbol,
        #Descriptions,
        data_limma,
        dat)
#DELETED final_data <- select(final_data, -matches("^gene$"))
filename_final_data <-
  paste0(format(Sys.time(), "%Y%m%d_%H%M%S"), '_final_data')
# readline('Enter a filename for final data= ')

##Create plotly object and save plot as html
filename_mod <-
  paste0(format(Sys.time(), "%Y%m%d_%H%M%S"), '_limma_plot')
# readline('Enter a filename for limma plot= ')
filename_ord <-
  paste0(format(Sys.time(), "%Y%m%d_%H%M%S"), '_ord_plot')
# readline('Enter a filename for ordinary t-test plot= ')

display_plotly_figs(final_data, FC_Cutoff, filename_mod, filename_ord)
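## For reference: display_plotly_figs() is defined in limma_helper_functions.R. A minimal
## volcano-plot sketch for the moderated t-test results is shown commented out below; the
## column names are taken from final_data above, but the styling is an assumption, not the
## repository's exact implementation.
# p_mod <- plot_ly(data = final_data, x = ~logFC, y = ~NegLogPvalMod,
#                  color = ~categ_Mod, colors = c("grey", "red"),
#                  text = ~Symbol, type = "scatter", mode = "markers") %>%
#   layout(title = "Volcano plot (moderated t-test)",
#          xaxis = list(title = "log2 fold change"),
#          yaxis = list(title = "-log10(p.mod)"))
# saveWidget(p_mod, paste0(filename_mod, ".html"))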

write.table(
  final_data,
  paste(filename_final_data, '.tsv', sep = ''),
  sep = '\t',
  row.names = FALSE,
  col.names = TRUE
)
#setwd(cur_dir)

#sorting, add description.
#delete t.ord .. s2.post
# also send the Methods part!
# check if Symbol==gene  -log10(p.mod)
#cut -f2-2 20220221_173330_final_data.tsv > Symbol.txt
#cut -f14-14 20220221_173330_final_data.tsv > gene.txt

#https://bioconductor.org/packages/devel/bioc/vignettes/GSVA/inst/doc/GSVA.html#6_Example_applications
#Should the input to GSVA be log-transformed and normalized, or left unnormalized?

Further reading:

- DEqMS: a method for accurate variance estimation in differential protein expression analysis: https://www.mcponline.org/article/S1535-9476(20)34997-5/fulltext
- Differential expression for proteomics (lecture slides): https://genome.duke.edu/sites/default/files/2013Lecture3-DifferentialExpressionProteomicsforweb.pdf
- Analyzing RNA-seq data with DESeq2: https://bioconductor.statistik.tu-dortmund.de/packages/3.5/bioc/vignettes/DESeq2/inst/doc/DESeq2.html
- Comparative analysis of statistical methods used for detecting differential expression in label-free mass spectrometry proteomics

On labeled mass spectrometry counts: the only type of MS-based quantitation that is amenable to DESeq2 (or similar count-based approaches) is spectral counting. Other approaches, whether labelled or unlabelled, yield continuous data; I would also recommend limma for their analysis. Label-free quantification is a method in mass spectrometry that aims to determine the relative amount of proteins in two or more biological samples; unlike other methods for protein quantification, it does not use a stable-isotope-containing compound to chemically bind to and thus label the protein.

Related links:

- https://www.researchgate.net/publication/221925844_Labeling_Methods_in_Mass_Spectrometry_Based_Quantitative_Proteomics
- https://de.wikipedia.org/wiki/Markierungsfreie_massenspektrometrische_Quantifizierung
- https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-6-179
- https://www.agilent.com/en/product/software-informatics/mass-spectrometry-software/data-analysis
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6412084/
- https://bioinformatics.stackexchange.com/questions/16227/help-with-dia-mass-spectrometry-data-analysis-with-several-conditions-limma
- https://www.biostars.org/p/95405/#google_vignette
- https://www.sciencedirect.com/science/article/pii/S1570963913001866
- https://github.com/wasimaftab/LIMMA-pipeline-proteomics/blob/master/limma_main.R

Statistical analysis on LC-MS data - To detect significant changes between two experimental groups we performed statistical analysis on LC-MS data.

Some comments on the analysis:

- The LIMMA moderated t-test statistic is used for the two-group comparison in a proteomics experiment. We implemented the statistical analysis in R; the code is freely available in the GitHub repository https://github.com/wasimaftab/LIMMA-pipeline-proteomics/tree/master/Mitochondrial_Loop. The final volcano plots only feature proteins that are part of the high-confidence mitochondrial proteome (Morgenstern et al., 2017).
- Two normalization options are available:
  1. Normalize by subtracting the median: this method normalizes the protein intensities in each experiment by subtracting the median of the corresponding experiment.
  2. Column-wise median normalization of the data matrix: this method calculates for each sample the median change (i.e. the difference between the observed value and the row average) and subtracts it from each row. Missing values are ignored in the procedure. The method is based on the assumption that a majority of the rows did not change. By default, all rows are used for normalization, but if we assume that the first 3 proteins are spike-ins, the call to the median_normalization function needs to be modified as median_normalization(data, spike_in_rows = 1:3); see the sketch below.
- The imputation algorithm proposed in Tyanova et al. [1] was used to impute missing values before applying the statistical methods.

[1] Tyanova, S., Temu, T., Sinitcyn, P., Carlson, A., Hein, M.Y., Geiger, T., Mann, M., and Cox, J. (2016). The Perseus computational platform for comprehensive analysis of (prote)omics data. Nat. Methods 13, 731-740.
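A minimal sketch of the column-wise median normalization described in point 2, assuming the helper in limma_helper_functions.R behaves as described above (the function name carries a _sketch suffix to make clear it is an illustration, not the repository's implementation):

median_normalization_sketch <- function(X, spike_in_rows = seq_len(nrow(X))) {
  # per-protein average across samples, computed on the rows used for normalization
  row_avg <- rowMeans(X[spike_in_rows, , drop = FALSE], na.rm = TRUE)
  for (j in seq_len(ncol(X))) {
    # median change of sample j relative to the row averages (missing values ignored)
    med_change <- median(X[spike_in_rows, j] - row_avg, na.rm = TRUE)
    X[, j] <- X[, j] - med_change
  }
  X
}

# usage on a log2-intensity matrix, treating the first 3 proteins as spike-ins:
# data_norm <- median_normalization_sketch(data_limma, spike_in_rows = 1:3)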
