Preparing a GTF file from GenBank for bacterial RNA-seq analysis, using the example of WA

gene_x 0 like s 1136 view s

Tags: pipeline, RNA-seq

In the example Yersinia enterocolitica strain WA

#[IMPORTANT FINAL STEPS]
[STEP1] genbank2fasta.py WA.gb
[STEP2] python3 ~/Scripts/genbank_to_gtf.py WA.gb WA.gtf  #(1col or 3cols)
[STEP3] python3 ~/Scripts/modify_gtf.py WA.gtf WA_m.gtf   #(1col or 3cols) "CDS" --> "exon"

In the example of 1585 wildtype genome WA_m.gtf

#Ziel all ncRNA, rRNA, tmRNA, tRNA as exon, the original types set as gene_biotype "protein_coding"; gene_biotype "tRNA"; ...
#cut -d$'\t' -f3 WA_m.gtf | sort -u
#exon
#gene
#rRNA
#tRNA
#Gene; CDS (renamed exon=4121); rRNA (22); tRNA (84); ncRNA
[STEP4] replace transcript_id " --> transcript_id "tx- under kate
[STEP5] python3 ~/Scripts/add_geneid_genename_genebiotype.py WA_m.gtf WA_m_.gtf  #(4cols+3cols) ergänzen all lines as gene_id; gene_name; transcript_id; and add all lines annotated with "gene" at the end with 'gene_biotype "protein_coding";'
    #NO ERROR in WA case.
    #DEBUG ERROR Stop at line : CP143516       biopython       exon    1       1356    .       +       .       gene_id "dnaA"; gene_name "dnaA"; transcript_id "tx-V0R30_00005"
    #    Error Message: Cannot find transcript_id!
[STEP6] python3 ~/Scripts/add_exonid_to_exon_tRNA_rRNA_tmRNA_ncRNA.py WA_m_.gtf WA_m__.gtf  # (4cols+4cols) Process features of types Gene, CDS, rRNA, tRNA, ncRNA, and tmRNA; For rRNA, tRNA, tmRNA, and ncRNA features, it will add an additional line with exon_id following the pattern described.
[STEP7] cp WA_m__.gtf WA_m___.gtf; MANUALLY modify the lines tRNA --> exon and 'gene_biotype "tRNA";' 
      #for ncRNA (3), tmRNA (1), rRNA (19), tRNA (69) in 1585_m___.gtf
      #for rRNA (22), tRNA (84) in WA_m___.gtf                            
      #grep "gene_biotype \"tRNA\"" WA_m___.gtf | wc -l  #CHECK they are correctly changed!

[RESULTS]
#CP143516        biopython       gene    1       1356    .       +       .       gene_id "dnaA"; gene_name "dnaA"; transcript_id "tx-V0R30_00005"; gene_biotype "protein_coding"
#CP143516        biopython       exon    1       1356    .       +       .       gene_id "dnaA"; gene_name "dnaA"; transcript_id "tx-V0R30_00005"; exon_id "exon-V0R30_00005";
#...
#[Intermediate RESULTS]
#/home/jhuang/DATA/Data_Marc_RNA-seq_Sepidermidis/results_1585/genome/rsem
#grep ">" genome.transcripts.fa  | wc -l
#2240 only exon level, but not tRNA, rRNA, tmRNA, ncRNA records!

[STEP8 (Optional, not used in the example)] To include also the 92 records in the results, we can change the annotations records of rRNA, tRNA, tmRNA, ncRNA to exon, so that we can also the records!
#                see WA_m___.gtf
#                grep "exon" WA_m___.gtf | wc -l
#                2332
#                grep ">" /home/jhuang/DATA/Data_Marc_RNA-seq_Sepidermidis/results_1585/genome/genome.transcripts.fa | wc -l
#                grep ">" /home/jhuang/DATA/Data_Marc_RNA-seq_Sepidermidis/results_1585/genome/rsem/genome.transcripts.fa | wc -l
#                2332
#                4227
#For example exon-V0R30_00090 is a tRNA.
#>tx-V0R30_00090
#GGCTCCTTGGTCAAGCGGTTAAGACACCGCCCTTTCACGGCGGTAACACGGGTTCGAGTCCCGTAGGAGTCACCA
#We can also filter the count number in the downstream analysis from the big matrix!
#tmRNA: V0R30_09330; ncRNA: V0R30_05655, V0R30_06520, V0R30_11060
>tx-V0R30_05655
AGTCATCTGTGGTGTTCGTAAGTTTGCTTTTTATTTGGGCCTAACACTCTTTGATCAGGGAGCCCAATAGGTTTTCTCGCAGCGCACACGCCTCTATAGGAGGACTTGCAAAACGAGAAACAGGGCACCCACCTGTATATAGCAGGCCGAATGATCAAGCTATTTATAACTACGGCATCAACGGACTCTAT
>tx-V0R30_06520
TATTTCGGGTAATCGCTATAGCAATATAGAGGAAAGTCCATGCTCACACAATCTGAGATGATTGTAGTGTTCGTGCTTGATGAAACAATAAATCAAGGCATTAATTTGACGGCAATGAAATAACCTAAGTCATTGGATATGGTTAGAATAGTTTGAAAGTGCCACAGTGACGTAGCTTTTATAGAAATATAAAAGGTGGAACGCGGTAAACCCCTCGAGTGAGCAATCCAAATTGGGTAGGAGCACTTGTTTAGCGGAATTCAACGTATAGACGAGACGATTTTTACGCGAAAGTAAAAATATGTAGACAGATGGTTACCACCGACGTACCAGTGTAACTAGTACACATTATGAGTACAACGGAACAGAACATGGCTTACAGAA
>tx-V0R30_11060
CCGTGCTAGGTGGGGAGGTAGCGGTTCCCTGTACTCGAAATCCGCTTTATGCGAGACTTAATTCCTTTGTTGAGGGCGTATTTTTGTGAAGTCTGCCCGAAGCACGTAGTGTTTGAAGATTTCGGTCCTATGCAATATGAACCCATGAACCATGTCAGGTCCTGACGGAAGCAGCATTAAGTGGATCCTCATATGTGCCGTAGGGTAGCCGAGATTTAGCTGTCGACTTCGGTAACGTTTATGATATGCGTTCGATACAAAGGTGCATGG

nextflow running

nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_WA --fasta WA.fasta --gtf WA_m___.gtf -profile test_full -resume --max_memory 126.GB --max_time 2400.h --save_reference --aligner star_salmon   --skip_rseqc --skip_dupradar --skip_preseq --skip_biotype_qc --skip_deseq2_qc --min_mapped_reads 0

nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_GRCh38 --genome GRCh38   -profile test_full -resume --max_memory 126.GB --max_time 2400.h --save_reference --aligner star_salmon  --skip_deseq2_qc

[TODO] generate 1585 Excel table, we need DNA-sequence, we can use the DNA seqeunce from results_WA/genome/genome.transcripts.fa in the gb_to_excel.py
```
python3 ~/Scripts/gb_to_excel.py 1585.gb 1585.xlsx
#insert C./P. (Chrom or Plasmid1, Plasmid2, Plasmid3) as the second chrom in the 1585.xlsx.
```

GO enrichment analysis of non-model bacterial RNA-seq_Sepidermidis

~/Scripts/1_filter_interproscan_out.py
~/Scripts/2_split_proteins_with_and_without_GO_terms3.py
~/Scripts/3_generate_fasta_not_annotated_by_interpro.py

1. Use Custom Annotations for Enrichment Analysis

For GO term enrichment analysis without standard database support, you can create a custom set of annotations if you have the necessary GO term associations for your genes. This involves manually mapping your genes to GO terms based on your experimental data or literature research.
2. GO Term Enrichment without biomaRt

If you already have GO term associations for your genes or can obtain them through other means, you can proceed with GO term enrichment analysis using clusterProfiler or similar tools by creating a custom OrgDb or using the enricher function directly with your custom GO term mappings.

Here’s a simplified example using clusterProfiler with custom GO term mappings:

library(clusterProfiler)

# Assuming you have a named vector of genes and corresponding GO terms
genes <- c("gene1", "gene2", "gene3", ...) # Your gene identifiers
go_terms <- c("GO:0008150", "GO:0006355", "GO:0004672", ...) # Corresponding GO terms for these genes

# Create a named list where names are GO terms and elements are vectors of gene IDs associated with each GO term
go_list <- split(genes, go_terms)

# Your list of interesting genes (e.g., those differentially expressed)
interesting_genes <- c("gene1", "gene2", ...)

# Perform enrichment analysis
enrichment_results <- enricher(interesting_genes,
                              TERM2GENE = go_list,
                              pAdjustMethod = "BH",
                              pvalueCutoff = 0.05,
                              qvalueCutoff = 0.2)

# View the enrichment results
print(enrichment_results)

#Install and Run InterProScan.
#https://interproscan-docs.readthedocs.io/en/latest/HowToRun.html
python3 setup.py -f interproscan.properties
./interproscan.sh -i 1585_CDS.fasta -f tsv -goterms -pa

#InterProScan's remote option also supports additional output formats and analyses, such as producing graphical overviews (-svg for SVG output) or specific types of analyses only (-appl TIGRFAM,Pfam to limit the search to certain databases). Check the InterProScan documentation for a full list of options and more detailed explanations.

https://interproscan-docs.readthedocs.io/en/latest/UserDocs.html#obtaining-a-copy-of-interproscan

-dp,--disable-precalc                                     Optional.  Disables use of the precalculated match lookup
                                                          service.  All match calculations will be run locally.

InterProScan test run

This distribution of InterProScan provides a set of protein test sequences, which you can use to check how InterProScan behaves on your system. First, if you have not yet run the initialisation script run the following command:

python3 setup.py -f interproscan.properties

This command will press and index the hmm models to prepare them into a format used by hmmscan. This command need only be run once.

You can then run the following two test case commands:

./interproscan.sh -i test_all_appl.fasta -f tsv -dp
./interproscan.sh -i test_all_appl.fasta -f tsv

./interproscan.sh -i 1585_CDS_test.fasta -f tsv -goterms -pa
#./interproscan.sh -i 1585_CDS_test.fasta -f TSV,XML -goterms

awk -F'\t' '{if(NR>1) print $1, $14}' 1585_CDS_test.fasta.tsv1 > filtered_GO_terms.txt

#4d_vs_15m-up.id
#4d_vs_15m-down.id

# Install and load required packages
#install.packages("rentrez")
#install.packages("KEGGREST")
#install.packages("dplyr") # for data manipulation
#install.packages("ggplot2") # for visualization

library(rentrez)
library(KEGGREST)
library(dplyr)
library(ggplot2)

# Read GenBank file
genbank_file <- "../../../1585.gb"
genbank_data <- readLines(genbank_file)

# Extract accession number from the GenBank file
accession <- grep("^ACCESSION", genbank_data, value = TRUE)
accession <- gsub("^ACCESSION\\s+", "", accession)
accession <- trimws(accession)

# Function to retrieve sequence information from GenBank file
get_sequence_from_genbank <- function(genbank_file) {
  # Read GenBank file
  genbank_data <- readLines(genbank_file)

  # Find sequence lines starting with "ORIGIN"
  origin_line <- grep("^ORIGIN", genbank_data)

  # Extract sequence lines
  sequence_lines <- genbank_data[(origin_line + 1):length(genbank_data)]

  # Remove non-sequence characters and concatenate into a single sequence
  sequence <- gsub("[0-9 ]", "", sequence_lines)
  sequence <- paste(sequence, collapse = "")

  return(sequence)
}

# GenBank file
genbank_file <- "1585.gb"

# Retrieve sequence from GenBank file
sequence <- get_sequence_from_genbank(genbank_file)

# Perform KEGG pathway enrichment analysis
kegg_pathways <- keggList("pathway")

# Toy example: Randomly select 100 genes and map to KEGG pathways
set.seed(123)
genes <- sample(1:1000, 100)

# Mapping genes to KEGG pathways
mapped_pathways <- sapply(genes, function(gene) {
  kegg_gene <- keggGet("link", gene)
  if (length(kegg_gene) > 0) {
    pathway_ids <- kegg_gene$ENTRY %>% unique()
    pathway_names <- kegg_pathways[kegg_pathways$ID %in% pathway_ids, "NAME"]
    if (length(pathway_names) > 0) {
      return(paste(pathway_names, collapse = "; "))
    } else {
      return(NA)
    }
  } else {
    return(NA)
  }
})

# Filter out NAs
mapped_pathways <- mapped_pathways[!is.na(mapped_pathways)]

# Count pathway occurrences
pathway_counts <- table(unlist(strsplit(mapped_pathways, "; ")))

# Sort pathways by occurrence
pathway_counts <- sort(pathway_counts, decreasing = TRUE)

# Top 10 enriched pathways
top_pathways <- head(pathway_counts, 10)

# Plot the top enriched pathways
barplot(top_pathways, main = "Top 10 Enriched KEGG Pathways",
        xlab = "Pathway", ylab = "Frequency")

like unlike

点赞本文的读者

还没有人对此文章表态

本文有评论

没有评论

Preparing a GTF file from GenBank for bacterial RNA-seq analysis, using the example of WA

本文有评论

看文章，发评论，不要沉默

最受欢迎文章

最新文章

最多评论文章

推荐相似文章