How to correlate RNA-seq Data with Mass Spectrometry Proteomics Data?

gene_x 0 like s 25 view s

Tags: protein, pipeline, RNA-seq

Correlating RNA-seq data with mass spectrometry (MS)-based proteomics data is a powerful way to link transcript-level expression with protein-level abundance. Here’s a step-by-step outline of how to approach it:

  1. Preprocessing and Normalization

    For RNA-Seq data:

    • Obtain gene-level expression data, usually as raw counts or TPM (transcripts per million) / FPKM (fragments per kilobase million).
    • Normalize the data (e.g., using DESeq2's variance stabilizing transformation (VST) or edgeR's TMM normalization).

    For MS proteomics data:

    • Quantify protein abundances, often using spectral counts, iBAQ, LFQ intensities, or other measures.
    • Log-transform the data if needed to stabilize variance.
  2. Data Mapping and Integration

    • Gene/Protein Mapping: Use gene symbols, Ensembl IDs, or UniProt IDs to map transcript-level data (RNA-seq) to protein-level data (MS). Be cautious of differences in annotation – e.g., some genes might have multiple protein isoforms.

    • Common Identifiers:

      • Convert all IDs to a common identifier (e.g., gene symbols or Ensembl IDs).
      • Remove entries without matching pairs to ensure one-to-one correspondence.
  3. Data Filtering

    • Filter out lowly expressed genes/proteins or those not reliably detected in both datasets.
    • Optionally, keep only genes/proteins of interest or those with high coverage.
  4. Correlation Analysis

    • For each matched gene/protein pair, calculate correlation (usually Pearson or Spearman) across the samples.

    Steps:

    - Construct a table with rows as genes/proteins and columns as samples.
    - For each row, you’ll have two vectors:
        * RNA expression (e.g., normalized RNA counts)
        * Protein abundance (e.g., log-transformed LFQ intensity)
    - Calculate:
    
            from scipy.stats import pearsonr, spearmanr
    
            rna_vector = [...]
            protein_vector = [...]
    
            pearson_corr, _ = pearsonr(rna_vector, protein_vector)
            spearman_corr, _ = spearmanr(rna_vector, protein_vector)
    
  5. Visualize and Interpret

    • Plot scatter plots of RNA vs protein levels for:

      • All genes/proteins together (aggregate view)
      • Selected genes of interest
    • Plot correlation coefficients:

      • Histogram of all gene/protein correlations
      • Heatmap if you have sample-wise data
    • Assess overall agreement:

      • Typically, moderate correlation (~0.3–0.6) is observed in many studies.
  6. Consider Batch Effects and Biological Variability

    • If the datasets come from different experiments or platforms, consider batch correction methods (e.g., ComBat from the sva R package).

    • Be mindful that:

      • Post-transcriptional regulation affects how well mRNA levels correlate with protein levels.
      • Some genes/proteins might show no correlation due to translational regulation, stability, etc.
  7. Summary Workflow

    ✅ Preprocess & normalize both datasets ✅ Map genes/proteins to common IDs ✅ Filter to shared, high-quality data ✅ Calculate correlations ✅ Visualize and interpret

  8. Python script that walks through the key steps of correlating RNA-seq data with proteomics data:

    import pandas as pd
    import numpy as np
    from scipy.stats import pearsonr, spearmanr
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    # --- Step 1: Load your data ---
    
    # Example: CSVs with genes/proteins as rows, samples as columns
    rna_data = pd.read_csv('rna_seq_data.csv', index_col=0)  # genes x samples
    protein_data = pd.read_csv('proteomics_data.csv', index_col=0)  # proteins x samples
    
    # --- Step 2: Map genes to proteins (assuming same identifiers) ---
    
    # Filter to common genes/proteins
    common_genes = rna_data.index.intersection(protein_data.index)
    rna_data_filtered = rna_data.loc[common_genes]
    protein_data_filtered = protein_data.loc[common_genes]
    
    print(f"Number of common genes/proteins: {len(common_genes)}")
    
    # --- Step 3: Log transform if needed (optional) ---
    
    rna_data_log = np.log2(rna_data_filtered + 1)
    protein_data_log = np.log2(protein_data_filtered + 1)
    
    # --- Step 4: Calculate gene-wise correlations across samples ---
    
    pearson_corrs = []
    spearman_corrs = []
    
    for gene in common_genes:
        rna_vector = rna_data_log.loc[gene]
        protein_vector = protein_data_log.loc[gene]
    
        pearson_corr, _ = pearsonr(rna_vector, protein_vector)
        spearman_corr, _ = spearmanr(rna_vector, protein_vector)
    
        pearson_corrs.append(pearson_corr)
        spearman_corrs.append(spearman_corr)
    
    # Save results
    correlation_df = pd.DataFrame({
        'Gene': common_genes,
        'Pearson': pearson_corrs,
        'Spearman': spearman_corrs
    })
    correlation_df.to_csv('gene_protein_correlations.csv', index=False)
    print("Saved gene-wise correlation data to 'gene_protein_correlations.csv'")
    
    # --- Step 5: Visualize the correlation distributions ---
    
    sns.histplot(correlation_df['Pearson'], bins=30, kde=True, color='skyblue')
    plt.xlabel('Pearson Correlation')
    plt.title('Distribution of Pearson Correlations (RNA vs Protein)')
    plt.show()
    
    sns.histplot(correlation_df['Spearman'], bins=30, kde=True, color='salmon')
    plt.xlabel('Spearman Correlation')
    plt.title('Distribution of Spearman Correlations (RNA vs Protein)')
    plt.show()
    
    # --- Step 6: Scatter plot for a selected gene/protein ---
    
    example_gene = common_genes[0]  # change to your gene of interest
    plt.scatter(rna_data_log.loc[example_gene], protein_data_log.loc[example_gene])
    plt.xlabel('Log2 RNA Expression')
    plt.ylabel('Log2 Protein Abundance')
    plt.title(f'RNA vs Protein for {example_gene}')
    plt.grid(True)
    plt.show()
    
    # Key Notes:
    #✅ Replace the filenames (rna_seq_data.csv and proteomics_data.csv) with your actual files.
    #✅ The script expects rows to be genes/proteins and columns to be samples.
    #✅ Modify or add steps if you have different normalization needs (e.g., DESeq2 normalization).
    
  9. R script that covers the same steps as above:

    # --- Load libraries ---
    library(ggplot2)
    library(dplyr)
    
    # --- Step 1: Load your data ---
    # Example: CSVs with genes/proteins as rows, samples as columns
    rna_data <- read.csv("rna_seq_data.csv", row.names = 1)
    protein_data <- read.csv("proteomics_data.csv", row.names = 1)
    
    # --- Step 2: Find common genes/proteins ---
    common_genes <- intersect(rownames(rna_data), rownames(protein_data))
    rna_data_filtered <- rna_data[common_genes, ]
    protein_data_filtered <- protein_data[common_genes, ]
    
    cat("Number of common genes/proteins:", length(common_genes), "\n")
    
    # --- Step 3: Log-transform if needed (optional) ---
    rna_data_log <- log2(rna_data_filtered + 1)
    protein_data_log <- log2(protein_data_filtered + 1)
    
    # --- Step 4: Calculate gene-wise correlations across samples ---
    pearson_corrs <- numeric(length(common_genes))
    spearman_corrs <- numeric(length(common_genes))
    
    for (i in seq_along(common_genes)) {
    gene <- common_genes[i]
    rna_vector <- as.numeric(rna_data_log[gene, ])
    protein_vector <- as.numeric(protein_data_log[gene, ])
    
    pearson_corrs[i] <- cor(rna_vector, protein_vector, method = "pearson")
    spearman_corrs[i] <- cor(rna_vector, protein_vector, method = "spearman")
    }
    
    # Save the results
    correlation_df <- data.frame(
    Gene = common_genes,
    Pearson = pearson_corrs,
    Spearman = spearman_corrs
    )
    
    write.csv(correlation_df, "gene_protein_correlations.csv", row.names = FALSE)
    cat("Saved gene-wise correlation data to 'gene_protein_correlations.csv'\n")
    
    # --- Step 5: Visualize the correlation distributions ---
    ggplot(correlation_df, aes(x = Pearson)) +
    geom_histogram(bins = 30, fill = "skyblue", color = "black") +
    labs(title = "Distribution of Pearson Correlations (RNA vs Protein)",
        x = "Pearson Correlation", y = "Frequency") +
    theme_minimal()
    
    ggplot(correlation_df, aes(x = Spearman)) +
    geom_histogram(bins = 30, fill = "salmon", color = "black") +
    labs(title = "Distribution of Spearman Correlations (RNA vs Protein)",
        x = "Spearman Correlation", y = "Frequency") +
    theme_minimal()
    
    # --- Step 6: Scatter plot for a selected gene/protein ---
    example_gene <- common_genes[1]  # change this to your gene of interest
    df_example <- data.frame(
    RNA = as.numeric(rna_data_log[example_gene, ]),
    Protein = as.numeric(protein_data_log[example_gene, ])
    )
    
    ggplot(df_example, aes(x = RNA, y = Protein)) +
    geom_point() +
    labs(title = paste("RNA vs Protein for", example_gene),
        x = "Log2 RNA Expression", y = "Log2 Protein Abundance") +
    theme_minimal() +
    geom_smooth(method = "lm", se = FALSE, color = "red")
    
    # Key Notes:
    #✅ Replace "rna_seq_data.csv" and "proteomics_data.csv" with your real file names.
    #✅ Rows: genes/proteins, columns: samples.
    #✅ Change example_gene to any gene of interest for plotting.
    #Tweak this for the new dataset or extend it with batch correction or other normalizations?
    

like unlike

点赞本文的读者

还没有人对此文章表态


本文有评论

没有评论

看文章,发评论,不要沉默


© 2023 XGenes.com Impressum