KEGG and GO annotations in non-model organisms

gene_x 0 like s 42 view s

Tags: pipeline

https://www.biobam.com/functional-analysis/

Blast2GO_workflow

  1. Assign KEGG and GO Terms (see diagram above)

    Since your organism is non-model, standard R databases (org.Hs.eg.db, etc.) won’t work. You’ll need to manually retrieve KEGG and GO annotations.

    Option 1 (GO Terms): Using Blast/Diamond + Blast2GO_GUI based on sequence alignment + GO mapping

    • 'Load protein sequences' (Tags: NONE, generated columns: Nr, SeqName) -->
    • Buttons 'blast' (Tags: BLASTED, generated columns: Description, Length, #Hits, e-Value, sim mean),
    • Button 'mapping' (Tags: MAPPED, generated columns: #GO, GO IDs, GO Names),
    • Button 'annot' (Tags: Annotated?, generated columns: Enzyme Codes, Enzyme Names)

    or blast2go_cli_v1.5.1 (NOT_USED)

        #https://help.biobam.com/space/BCD/2250407989/Installation
    
        # blast2go_pipeline.sh
        #!/bin/bash
    
        # -------------------------------
        #  BLAST2GO Automation Pipeline
        # -------------------------------
        #  Steps:
        #   1. Run BLAST (NCBI BLAST+)
        #   2. Perform GO Mapping (Blast2GO CLI)
        #   3. Perform GO Annotation
        #   4. Export results to TSV format
        # -------------------------------
    
        # CONFIGURATION
        BLAST_DB="/path/to/blast/db"          # Path to BLAST database
        OUTPUT_DIR="/path/to/output"          # Directory for storing results
        THREADS=8                              # Number of CPU threads
    
        # Ensure output directory exists
        mkdir -p "$OUTPUT_DIR"
    
        # Iterate over all FASTA files in the input directory
        for FASTA in /path/to/input/*.fasta; do
            BASENAME=$(basename "$FASTA" .fasta)
    
            echo "Processing: $BASENAME"
    
            # Step 1: Run BLAST (Change 'blastp' to 'blastx' for nucleotide queries)
            echo "Running BLAST..."
            blastp -query "$FASTA" -db "$BLAST_DB" -out "$OUTPUT_DIR/${BASENAME}_blast.xml" \
                -evalue 1e-5 -outfmt 5 -num_threads "$THREADS"
    
            # Step 2: Perform GO Mapping
            echo "Running GO Mapping..."
            ./blast2go_cli -mapping -in "$OUTPUT_DIR/${BASENAME}_blast.xml" \
                        -out "$OUTPUT_DIR/${BASENAME}_mapping.b2g"
    
            # Step 3: Perform GO Annotation
            echo "Running GO Annotation..."
            ./blast2go_cli -annotation -in "$OUTPUT_DIR/${BASENAME}_mapping.b2g" \
                        -out "$OUTPUT_DIR/${BASENAME}_annotation.b2g" \
                        -evalue 1e-6 -annex -goslim
    
            # Step 4: Export to TSV
            echo "Exporting results to TSV..."
            ./blast2go_cli -export -in "$OUTPUT_DIR/${BASENAME}_annotation.b2g" \
                        -out "$OUTPUT_DIR/${BASENAME}_annotation.tsv" -format TSV
    
            echo "Processing of $BASENAME completed!"
        done
    
        echo "✅ All jobs completed successfully!"
    

    Option 2 (GO Terms): Interpro based protein families / domains --> Button interpro * Button 'interpro' (Tags: Interproscaned?, generated columns: InterPro IDs, InterPro GO IDs, InterPro GO Names Enzyme Codes)

    #TODO: merge InterPro GO IDs to GO IDs and generate final GO IDs, then perform GO enrichment!
    

    Option 3 (KEGG Terms): EggNog based on orthology and phylogenies (NOT_USED)

    EggNOG-mapper assigns both KEGG Orthology (KO) IDs and GO terms.
    
    Install EggNOG-mapper:
    
        mamba create -n eggnog_env python=3.8 eggnog-mapper -c conda-forge -c bioconda  #eggnog-mapper_2.1.12
        mamba activate eggnog_env
    
    Run annotation:
    
        #diamond makedb --in eggnog6.prots.faa -d eggnog_proteins.dmnd
        mkdir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
        download_eggnog_data.py --dbname eggnog.db -y --data_dir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
        #NOT_WORKING: emapper.py -i CP059040_gene.fasta -o eggnog_dmnd_out --cpu 60 -m diamond[hmmer,mmseqs] --dmnd_db /home/jhuang/REFs/eggnog_data/data/eggnog_proteins.dmnd
        python ~/Scripts/update_fasta_header.py CP059040_protein_.fasta CP059040_protein.fasta
        emapper.py -i CP059040_protein.fasta -o eggnog_out --cpu 60 --resume
        #----> result annotations.tsv: Contains KEGG, GO, and other functional annotations.
        #---->  470.IX87_14445:
            * 470 likely refers to the organism or strain (e.g., Acinetobacter baumannii ATCC 19606 or another related strain).
            * IX87_14445 would refer to a specific gene or protein within that genome.
    
    Extract KEGG KO IDs from annotations.emapper.annotations.
    

    Option 4 (NOT_USED): RFAM for non-colding RNA

    Option 5 (NOT_USED): PSORTb for subcellular localizations

    Option 6 (NOT_USED): KAAS (KEGG Automatic Annotation Server)

    • Go to KAAS
    • Upload your FASTA file.
    • Select an appropriate gene set.
    • Download the KO assignments.
  2. Find the Closest KEGG Organism Code (TODO: find the closely related organism)

    Since your species isn't directly in KEGG, use a closely related organism.

    • Check available KEGG organisms:
      library(clusterProfiler)
      library(KEGGREST)
      
      kegg_organisms <- keggList("organism")
      
      Pick the closest relative (e.g., zebrafish "dre" for fish, Arabidopsis "ath" for plants).
      
      # Search for Acinetobacter in the list
      grep("Acinetobacter", kegg_organisms, ignore.case = TRUE, value = TRUE)
      # Gammaproteobacteria
      #Extract KO IDs from the eggnog results for  "Acinetobacter baumannii strain ATCC 19606"
      
  3. Find the Closest KEGG Organism for a Non-Model Species

    If your organism is not in KEGG, search for the closest relative:

        grep("fish", kegg_organisms, ignore.case = TRUE, value = TRUE)  # Example search
    

    For KEGG pathway enrichment in non-model species, use "ko" instead of a species code:

        kegg_enrich <- enrichKEGG(gene = gene_list, organism = "ko")  # "ko" = KEGG Orthology
    
  4. Perform KEGG and GO Enrichment in R (TODO: ids in gene_list_kegg_up are not recognized!)

        # Load required libraries
        library(openxlsx)  # For Excel file handling
        library(dplyr)     # For data manipulation
        library(clusterProfiler)  # For KEGG and GO enrichment analysis
        #library(org.Hs.eg.db)  # Replace with appropriate organism database
    
        # Load the results from the AUM_vs_MHB-all.csv
        res <- read.csv("AUM_vs_MHB-all.csv")
    
        # Replace empty GeneName with modified GeneID
        res$GeneName <- ifelse(
        res$GeneName == "" | is.na(res$GeneName),
        gsub("gene-", "", res$GeneID),
        res$GeneName
        )
    
        # Remove duplicated genes by selecting the gene with the smallest padj
        duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
    
        res <- res %>%
        group_by(GeneName) %>%
        slice_min(padj, with_ties = FALSE) %>%
        ungroup()
    
        res <- as.data.frame(res)
        # Sort res first by padj (ascending) and then by log2FoldChange (descending)
        res <- res[order(res$padj, -res$log2FoldChange), ]
        # Read eggnog annotations
        eggnog_data <- read.delim("~/DATA/Data_Tam_RNAseq_2024/eggnog_out.emapper.annotations.txt", header = TRUE, sep = "\t")
        # Remove the "gene-" prefix from GeneID in res to match eggnog 'query' format
        res$GeneID <- gsub("gene-", "", res$GeneID)
        # Merge eggnog data with res based on GeneID (or GeneName)
        res <- res %>%
        left_join(eggnog_data, by = c("GeneID" = "query"))
        # Filter up-regulated genes: log2FoldChange > 2 and padj < 1e-2
        up_regulated <- res[res$log2FoldChange > 2 & res$padj < 1e-2, ]
        # Filter down-regulated genes: log2FoldChange < -2 and padj < 1e-2
        down_regulated <- res[res$log2FoldChange < -2 & res$padj < 1e-2, ]
        # Create a new workbook
        wb <- createWorkbook()
        # Add the complete dataset as the first sheet (with annotations)
        addWorksheet(wb, "Complete_Data")
        writeData(wb, "Complete_Data", res)
        # Add the up-regulated genes as the second sheet (with annotations)
        addWorksheet(wb, "Up_Regulated")
        writeData(wb, "Up_Regulated", up_regulated)
        # Add the down-regulated genes as the third sheet (with annotations)
        addWorksheet(wb, "Down_Regulated")
        writeData(wb, "Down_Regulated", down_regulated)
        # Save the workbook to a file
        saveWorkbook(wb, "Gene_Expression_with_Annotations_AUM_vs_MHB.xlsx", overwrite = TRUE)
    
        # Set the 'GeneName' column as row.names
        rownames(res) <- res$GeneName
        # Drop the 'GeneName' column since it's now the row names
        res$GeneName <- NULL
    
        # ---- Perform KEGG enrichment analysis ----
        gene_list_kegg_up <- up_regulated$KEGG_ko
        gene_list_kegg_up <- gsub("ko:", "", gene_list_kegg_up)
        kegg_enrichment_up <- enrichKEGG(gene = gene_list_kegg_up, organism = 'ko')
        # Perform KEGG enrichment analysis
        gene_list_kegg_down <- down_regulated$KEGG_ko
        gene_list_kegg_down <- gsub("ko:", "", gene_list_kegg_down)
        kegg_enrichment_down <- enrichKEGG(gene = gene_list_kegg_down, organism = 'ko')
        # Create a new workbook
        wb <- createWorkbook()
        # Save enrichment results to the workbook
        addWorksheet(wb, "KEGG_Enrichment_Up")
        writeData(wb, "KEGG_Enrichment_Up", as.data.frame(kegg_enrichment_up))
        # Save enrichment results to the workbook
        addWorksheet(wb, "KEGG_Enrichment_Down")
        writeData(wb, "KEGG_Enrichment_Down", as.data.frame(kegg_enrichment_down))
    
        # ---- Perform GO enrichment analysis (TODO: extract the merged GO IDs from 'Blast2GO 5 Basic' and adapt the code below!)----
        #http://yulab-smu.top/biomedical-knowledge-mining-book/clusterprofiler-go.html
        #If a user has GO annotation data (in a data.frame format with the first column as gene ID and the second column as GO ID), they can use the enricher() and GSEA() functions to perform an over-representation test and gene set enrichment analysis.
    
        # Perform GO enrichment analysis (Using enricher)
        gene_list_go_up <- up_regulated$GOs
        go_enrichment_up <- enrichGO(gene = gene_list_go_up, OrgDb, ont = "BP")  #keyType = "ENSEMBL", organism = 'abn',
    
        # Perform GO enrichment analysis (Using clusterProfiler)
        #gene_list_go_down <- down_regulated$GOs
        #go_enrichment_down <- enrichGO(gene = gene_list_go_down, organism = 'abn', ont = "BP")
    
        #addWorksheet(wb, "GO_Enrichment_Up")
        #writeData(wb, "GO_Enrichment_Up", as.data.frame(go_enrichment_up))
    
        #addWorksheet(wb, "GO_Enrichment_Down")
        #writeData(wb, "GO_Enrichment_Down", as.data.frame(go_enrichment_down))
    
        # Save the workbook with enrichment results
        saveWorkbook(wb, "KEGG_and_GO_Enrichments_AUM_vs_MHB.xlsx", overwrite = TRUE)
    
        #To adapt your code for GO enrichment analysis using the up-regulated genes (149 genes) as the test set and all genes in res (3646 genes) as the background, we need to:
    
            * Extract the GeneIDs from up_regulated and use them as gene_list (the test set).
            * Extract the GeneIDs from res and use them as background_genes (the universe).
            * Create a TERM2GENE dataframe by extracting GO terms from res.
            * Run the enricher() function using these updated inputs.
    
        # Define gene list (up-regulated genes)
        gene_list_go_up <- up_regulated$GeneID  # Extract the 149 up-regulated genes
    
        # Define background gene set (all genes in res)
        background_genes <- res$GeneID  # Extract the 3646 background genes
    
        # Prepare GO annotation data from res
        go_annotation <- res[, c("GeneID", "GOs")]  # Extract relevant columns
        go_annotation <- go_annotation %>%
        tidyr::separate_rows(GOs, sep = ",")  # Split multiple GO terms into separate rows
    
        # Perform GO enrichment analysis
        go_enrichment_up <- enricher(
            gene = gene_list_go_up,                # Up-regulated genes
            TERM2GENE = go_annotation,       # Custom GO annotation
            pvalueCutoff = 0.05,             # Significance threshold
            universe = background_genes      # Define the background gene set
        )
    
        # View results
        print(go_enrichment_up)
    
  5. GO enrichment using TERM2GENE for enricher in clusterProfiler (Example code, the running code for current perform see above)

    To perform GO enrichment analysis using the enricher() function with a custom GO annotation data (in the form of a data.frame), follow these steps. The user can supply a gene list and a custom GO annotation in the form of a data.frame, where the first column is the gene ID and the second column is the GO ID.
    
    Here’s how to structure the R code for both over-representation testing (using enricher()) and gene set enrichment analysis (using GSEA()):
    
        # Install and load the necessary libraries
        install.packages("BiocManager")
        BiocManager::install("clusterProfiler")
        BiocManager::install("org.Hs.eg.db")  # For human gene IDs (optional)
    
        library(clusterProfiler)
        library(org.Hs.eg.db)  # Only if using human gene IDs (e.g., Entrez)
    
        #Prepare Custom GO Annotation Data
        #Assume that you have a custom GO annotation dataset in the form of a data.frame, where:
        #The first column is gene IDs (could be Entrez, gene symbols, etc.).
        #The second column is GO terms (GO IDs).
    
        Example data.frame:
    
        # Example custom GO annotation data (replace with your own data)
        go_annotation <- data.frame(
            geneID = c("GeneA", "GeneB", "GeneC", "GeneD", "GeneE"),
            GOID = c("GO:0001234", "GO:0005678", "GO:0002345", "GO:0009876", "GO:0005432")
        )
    
    # View the custom GO annotation
    head(go_annotation)
    
    3. Perform GO Enrichment Using enricher()
    
    With your gene list (e.g., from differential expression analysis) and the custom GO annotation, you can now run the over-representation test using the enricher() function.
    
    # Example gene list (e.g., gene symbols or Entrez IDs)
    gene_list <- c("GeneA", "GeneB", "GeneC", "GeneX", "GeneY")  # Replace with your own gene IDs
    
    # Perform enrichment analysis with custom GO annotation
    go_enrichment_result <- enricher(
        gene = gene_list,                # The gene list
        TERM2GENE = go_annotation,       # The custom GO annotation data
        pvalueCutoff = 0.05              # Adjust p-value threshold
        universe = background_genes      # Define a custom background gene set
    )
    
    # View the summary of the results
    summary(go_enrichment_result)
    
    # Visualize the results using a dotplot
    dotplot(go_enrichment_result)
    
    4. Perform Gene Set Enrichment Analysis (GSEA) Using GSEA()
    
    In addition to the over-representation test, you can also perform GSEA to determine whether a predefined set of genes (e.g., in the context of a ranked list of genes) is significantly enriched in certain GO categories.
    
    # Example ranked gene list (e.g., from differential expression with fold change)
    ranked_gene_list <- c("GeneA" = 2.5, "GeneB" = 1.8, "GeneC" = 0.5, "GeneD" = -1.2, "GeneE" = -2.0)
    
    # Perform GSEA with custom GO annotation
    gsea_result <- GSEA(
    geneList = ranked_gene_list,     # Ranked gene list (e.g., logFC or other metric)
    TERM2GENE = go_annotation,       # The custom GO annotation data
    pvalueCutoff = 0.05              # Adjust p-value threshold
    )
    
    # View the summary of the GSEA results
    summary(gsea_result)
    
    # Visualize the GSEA results using a dotplot
    dotplot(gsea_result)
    
    5. Save Results to CSV (Optional)
    
    If you want to save the results to a file (e.g., CSV), you can use the write.csv() function.
    
    # Save the GO enrichment results to a CSV file
    write.csv(go_enrichment_result, "GO_enrichment_results.csv")
    
    # Save the GSEA results to a CSV file
    write.csv(gsea_result, "GSEA_results.csv")
    
    Explanation of Key Steps:
    
        Gene List: You need a list of genes that you want to test for enrichment. This can be a list of genes from differential expression analysis or a custom set of genes.
        GO Annotation (TERM2GENE): This is a data.frame containing the mapping of gene IDs to GO terms (or categories). You provide this as an argument to both enricher() and GSEA().
        enricher() Function: This function performs the over-representation analysis (ORA), testing whether specific GO terms are significantly enriched in your gene list compared to the background.
        GSEA() Function: This function performs Gene Set Enrichment Analysis, testing whether the predefined gene sets (in this case, GO terms) are enriched in a ranked list of genes (e.g., based on log-fold change).
    
    Additional Parameters:
    
        pvalueCutoff: The threshold for statistical significance in both enricher() and GSEA(). Genes with a p-value below this cutoff are considered significantly enriched.
        minGSSize and maxGSSize: Define the minimum and maximum size of the gene sets (GO terms) to be included in the analysis.
        qvalueCutoff: The threshold for adjusted p-values (FDR).
    
    Conclusion:
    
    This guide demonstrates how to perform GO enrichment analysis using a custom GO annotation data (data.frame) with the enricher() and GSEA() functions from the clusterProfiler package. Whether you're performing over-representation testing or gene set enrichment analysis, these functions allow you to explore which biological processes or pathways are enriched in your gene list.
    

like unlike

点赞本文的读者

还没有人对此文章表态


本文有评论

没有评论

看文章,发评论,不要沉默


© 2023 XGenes.com Impressum