Preparing a GTF file from GenBank for bacterial RNA-seq analysis, using the example of WA

gene_x 0 like s 924 view s

Tags: pipeline, RNA-seq

Preparing a GTF file from GenBank for bacterial RNA-seq analysis, using the example of WA

  1. In the example Yersinia enterocolitica strain WA

    1. #[IMPORTANT FINAL STEPS]
    2. [STEP1] genbank2fasta.py WA.gb
    3. [STEP2] python3 ~/Scripts/genbank_to_gtf.py WA.gb WA.gtf #(1col or 3cols)
    4. [STEP3] python3 ~/Scripts/modify_gtf.py WA.gtf WA_m.gtf #(1col or 3cols) "CDS" --> "exon"
  2. In the example of 1585 wildtype genome WA_m.gtf

    1. #Ziel all ncRNA, rRNA, tmRNA, tRNA as exon, the original types set as gene_biotype "protein_coding"; gene_biotype "tRNA"; ...
    2. #cut -d$'\t' -f3 WA_m.gtf | sort -u
    3. #exon
    4. #gene
    5. #rRNA
    6. #tRNA
    7. #Gene; CDS (renamed exon=4121); rRNA (22); tRNA (84); ncRNA
    8. [STEP4] replace transcript_id " --> transcript_id "tx- under kate
    9. [STEP5] python3 ~/Scripts/add_geneid_genename_genebiotype.py WA_m.gtf WA_m_.gtf #(4cols+3cols) ergänzen all lines as gene_id; gene_name; transcript_id; and add all lines annotated with "gene" at the end with 'gene_biotype "protein_coding";'
    10. #NO ERROR in WA case.
    11. #DEBUG ERROR Stop at line : CP143516 biopython exon 1 1356 . + . gene_id "dnaA"; gene_name "dnaA"; transcript_id "tx-V0R30_00005"
    12. # Error Message: Cannot find transcript_id!
    13. [STEP6] python3 ~/Scripts/add_exonid_to_exon_tRNA_rRNA_tmRNA_ncRNA.py WA_m_.gtf WA_m__.gtf # (4cols+4cols) Process features of types Gene, CDS, rRNA, tRNA, ncRNA, and tmRNA; For rRNA, tRNA, tmRNA, and ncRNA features, it will add an additional line with exon_id following the pattern described.
    14. [STEP7] cp WA_m__.gtf WA_m___.gtf; MANUALLY modify the lines tRNA --> exon and 'gene_biotype "tRNA";'
    15. #for ncRNA (3), tmRNA (1), rRNA (19), tRNA (69) in 1585_m___.gtf
    16. #for rRNA (22), tRNA (84) in WA_m___.gtf
    17. #grep "gene_biotype \"tRNA\"" WA_m___.gtf | wc -l #CHECK they are correctly changed!
    18. [RESULTS]
    19. #CP143516 biopython gene 1 1356 . + . gene_id "dnaA"; gene_name "dnaA"; transcript_id "tx-V0R30_00005"; gene_biotype "protein_coding"
    20. #CP143516 biopython exon 1 1356 . + . gene_id "dnaA"; gene_name "dnaA"; transcript_id "tx-V0R30_00005"; exon_id "exon-V0R30_00005";
    21. #...
    22. #[Intermediate RESULTS]
    23. #/home/jhuang/DATA/Data_Marc_RNA-seq_Sepidermidis/results_1585/genome/rsem
    24. #grep ">" genome.transcripts.fa | wc -l
    25. #2240 only exon level, but not tRNA, rRNA, tmRNA, ncRNA records!
    26. [STEP8 (Optional, not used in the example)] To include also the 92 records in the results, we can change the annotations records of rRNA, tRNA, tmRNA, ncRNA to exon, so that we can also the records!
    27. # see WA_m___.gtf
    28. # grep "exon" WA_m___.gtf | wc -l
    29. # 2332
    30. # grep ">" /home/jhuang/DATA/Data_Marc_RNA-seq_Sepidermidis/results_1585/genome/genome.transcripts.fa | wc -l
    31. # grep ">" /home/jhuang/DATA/Data_Marc_RNA-seq_Sepidermidis/results_1585/genome/rsem/genome.transcripts.fa | wc -l
    32. # 2332
    33. # 4227
    34. #For example exon-V0R30_00090 is a tRNA.
    35. #>tx-V0R30_00090
    36. #GGCTCCTTGGTCAAGCGGTTAAGACACCGCCCTTTCACGGCGGTAACACGGGTTCGAGTCCCGTAGGAGTCACCA
    37. #We can also filter the count number in the downstream analysis from the big matrix!
    38. #tmRNA: V0R30_09330; ncRNA: V0R30_05655, V0R30_06520, V0R30_11060
    39. >tx-V0R30_05655
    40. AGTCATCTGTGGTGTTCGTAAGTTTGCTTTTTATTTGGGCCTAACACTCTTTGATCAGGGAGCCCAATAGGTTTTCTCGCAGCGCACACGCCTCTATAGGAGGACTTGCAAAACGAGAAACAGGGCACCCACCTGTATATAGCAGGCCGAATGATCAAGCTATTTATAACTACGGCATCAACGGACTCTAT
    41. >tx-V0R30_06520
    42. TATTTCGGGTAATCGCTATAGCAATATAGAGGAAAGTCCATGCTCACACAATCTGAGATGATTGTAGTGTTCGTGCTTGATGAAACAATAAATCAAGGCATTAATTTGACGGCAATGAAATAACCTAAGTCATTGGATATGGTTAGAATAGTTTGAAAGTGCCACAGTGACGTAGCTTTTATAGAAATATAAAAGGTGGAACGCGGTAAACCCCTCGAGTGAGCAATCCAAATTGGGTAGGAGCACTTGTTTAGCGGAATTCAACGTATAGACGAGACGATTTTTACGCGAAAGTAAAAATATGTAGACAGATGGTTACCACCGACGTACCAGTGTAACTAGTACACATTATGAGTACAACGGAACAGAACATGGCTTACAGAA
    43. >tx-V0R30_11060
    44. CCGTGCTAGGTGGGGAGGTAGCGGTTCCCTGTACTCGAAATCCGCTTTATGCGAGACTTAATTCCTTTGTTGAGGGCGTATTTTTGTGAAGTCTGCCCGAAGCACGTAGTGTTTGAAGATTTCGGTCCTATGCAATATGAACCCATGAACCATGTCAGGTCCTGACGGAAGCAGCATTAAGTGGATCCTCATATGTGCCGTAGGGTAGCCGAGATTTAGCTGTCGACTTCGGTAACGTTTATGATATGCGTTCGATACAAAGGTGCATGG
  3. nextflow running

    1. nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_WA --fasta WA.fasta --gtf WA_m___.gtf -profile test_full -resume --max_memory 126.GB --max_time 2400.h --save_reference --aligner star_salmon --skip_rseqc --skip_dupradar --skip_preseq --skip_biotype_qc --skip_deseq2_qc --min_mapped_reads 0
    2. nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_GRCh38 --genome GRCh38 -profile test_full -resume --max_memory 126.GB --max_time 2400.h --save_reference --aligner star_salmon --skip_deseq2_qc
  4. [TODO] generate 1585 Excel table, we need DNA-sequence, we can use the DNA seqeunce from results_WA/genome/genome.transcripts.fa in the gb_to_excel.py

    1. python3 ~/Scripts/gb_to_excel.py 1585.gb 1585.xlsx
    2. #insert C./P. (Chrom or Plasmid1, Plasmid2, Plasmid3) as the second chrom in the 1585.xlsx.
  5. GO enrichment analysis of non-model bacterial RNA-seq_Sepidermidis

    1. ~/Scripts/1_filter_interproscan_out.py
    2. ~/Scripts/2_split_proteins_with_and_without_GO_terms3.py
    3. ~/Scripts/3_generate_fasta_not_annotated_by_interpro.py
    4. 1. Use Custom Annotations for Enrichment Analysis
    5. For GO term enrichment analysis without standard database support, you can create a custom set of annotations if you have the necessary GO term associations for your genes. This involves manually mapping your genes to GO terms based on your experimental data or literature research.
    6. 2. GO Term Enrichment without biomaRt
    7. If you already have GO term associations for your genes or can obtain them through other means, you can proceed with GO term enrichment analysis using clusterProfiler or similar tools by creating a custom OrgDb or using the enricher function directly with your custom GO term mappings.
    8. Heres a simplified example using clusterProfiler with custom GO term mappings:
    9. library(clusterProfiler)
    10. # Assuming you have a named vector of genes and corresponding GO terms
    11. genes <- c("gene1", "gene2", "gene3", ...) # Your gene identifiers
    12. go_terms <- c("GO:0008150", "GO:0006355", "GO:0004672", ...) # Corresponding GO terms for these genes
    13. # Create a named list where names are GO terms and elements are vectors of gene IDs associated with each GO term
    14. go_list <- split(genes, go_terms)
    15. # Your list of interesting genes (e.g., those differentially expressed)
    16. interesting_genes <- c("gene1", "gene2", ...)
    17. # Perform enrichment analysis
    18. enrichment_results <- enricher(interesting_genes,
    19. TERM2GENE = go_list,
    20. pAdjustMethod = "BH",
    21. pvalueCutoff = 0.05,
    22. qvalueCutoff = 0.2)
    23. # View the enrichment results
    24. print(enrichment_results)
    25. #Install and Run InterProScan.
    26. #https://interproscan-docs.readthedocs.io/en/latest/HowToRun.html
    27. python3 setup.py -f interproscan.properties
    28. ./interproscan.sh -i 1585_CDS.fasta -f tsv -goterms -pa
    29. #InterProScan's remote option also supports additional output formats and analyses, such as producing graphical overviews (-svg for SVG output) or specific types of analyses only (-appl TIGRFAM,Pfam to limit the search to certain databases). Check the InterProScan documentation for a full list of options and more detailed explanations.
    30. https://interproscan-docs.readthedocs.io/en/latest/UserDocs.html#obtaining-a-copy-of-interproscan
    31. -dp,--disable-precalc Optional. Disables use of the precalculated match lookup
    32. service. All match calculations will be run locally.
    33. InterProScan test run
    34. This distribution of InterProScan provides a set of protein test sequences, which you can use to check how InterProScan behaves on your system. First, if you have not yet run the initialisation script run the following command:
    35. python3 setup.py -f interproscan.properties
    36. This command will press and index the hmm models to prepare them into a format used by hmmscan. This command need only be run once.
    37. You can then run the following two test case commands:
    38. ./interproscan.sh -i test_all_appl.fasta -f tsv -dp
    39. ./interproscan.sh -i test_all_appl.fasta -f tsv
    40. ./interproscan.sh -i 1585_CDS_test.fasta -f tsv -goterms -pa
    41. #./interproscan.sh -i 1585_CDS_test.fasta -f TSV,XML -goterms
    42. awk -F'\t' '{if(NR>1) print $1, $14}' 1585_CDS_test.fasta.tsv1 > filtered_GO_terms.txt
    43. #4d_vs_15m-up.id
    44. #4d_vs_15m-down.id
    45. # Install and load required packages
    46. #install.packages("rentrez")
    47. #install.packages("KEGGREST")
    48. #install.packages("dplyr") # for data manipulation
    49. #install.packages("ggplot2") # for visualization
    50. library(rentrez)
    51. library(KEGGREST)
    52. library(dplyr)
    53. library(ggplot2)
    54. # Read GenBank file
    55. genbank_file <- "../../../1585.gb"
    56. genbank_data <- readLines(genbank_file)
    57. # Extract accession number from the GenBank file
    58. accession <- grep("^ACCESSION", genbank_data, value = TRUE)
    59. accession <- gsub("^ACCESSION\\s+", "", accession)
    60. accession <- trimws(accession)
    61. # Function to retrieve sequence information from GenBank file
    62. get_sequence_from_genbank <- function(genbank_file) {
    63. # Read GenBank file
    64. genbank_data <- readLines(genbank_file)
    65. # Find sequence lines starting with "ORIGIN"
    66. origin_line <- grep("^ORIGIN", genbank_data)
    67. # Extract sequence lines
    68. sequence_lines <- genbank_data[(origin_line + 1):length(genbank_data)]
    69. # Remove non-sequence characters and concatenate into a single sequence
    70. sequence <- gsub("[0-9 ]", "", sequence_lines)
    71. sequence <- paste(sequence, collapse = "")
    72. return(sequence)
    73. }
    74. # GenBank file
    75. genbank_file <- "1585.gb"
    76. # Retrieve sequence from GenBank file
    77. sequence <- get_sequence_from_genbank(genbank_file)
    78. # Perform KEGG pathway enrichment analysis
    79. kegg_pathways <- keggList("pathway")
    80. # Toy example: Randomly select 100 genes and map to KEGG pathways
    81. set.seed(123)
    82. genes <- sample(1:1000, 100)
    83. # Mapping genes to KEGG pathways
    84. mapped_pathways <- sapply(genes, function(gene) {
    85. kegg_gene <- keggGet("link", gene)
    86. if (length(kegg_gene) > 0) {
    87. pathway_ids <- kegg_gene$ENTRY %>% unique()
    88. pathway_names <- kegg_pathways[kegg_pathways$ID %in% pathway_ids, "NAME"]
    89. if (length(pathway_names) > 0) {
    90. return(paste(pathway_names, collapse = "; "))
    91. } else {
    92. return(NA)
    93. }
    94. } else {
    95. return(NA)
    96. }
    97. })
    98. # Filter out NAs
    99. mapped_pathways <- mapped_pathways[!is.na(mapped_pathways)]
    100. # Count pathway occurrences
    101. pathway_counts <- table(unlist(strsplit(mapped_pathways, "; ")))
    102. # Sort pathways by occurrence
    103. pathway_counts <- sort(pathway_counts, decreasing = TRUE)
    104. # Top 10 enriched pathways
    105. top_pathways <- head(pathway_counts, 10)
    106. # Plot the top enriched pathways
    107. barplot(top_pathways, main = "Top 10 Enriched KEGG Pathways",
    108. xlab = "Pathway", ylab = "Frequency")

like unlike

点赞本文的读者

还没有人对此文章表态


本文有评论

没有评论

看文章,发评论,不要沉默


© 2023 XGenes.com Impressum