Insertion Sequence (IS) Element Detection

gene_x 0 like s 589 view s

Tags: pipeline

  1. Bioinformatics Methods

    Sequence Extraction and Comparative Analysis

    Genomic regions surrounding the multidrug efflux MFS transporter were extracted from five isolates of Acinetobacter baumannii: ATCC19606, ACICU, AYE, ATCC17978, and SDF. These regions were aligned to identify any unmapped segments. This alignment revealed a 45 kb sequence (positions 2,944,541 to 2,989,666 in ATCC17978) that did not align with any part of the other four genomes. This unmapped sequence was subsequently extracted for further analysis.

    Insertion Sequence (IS) Element Detection

    The tool ISEScan (Xie & Tang, 2017) was employed to detect insertion sequences (IS) within the extracted 45 kb fragment. The analysis identified four distinct IS elements, providing insights into potential genomic variations and mobile genetic elements within this segment (see IS_of_45kb.xlsx).

    Visualization and Annotation

    The alignment was visualized using clinker v0.0.21 (Gilchrist and Chooi, 2021). Genes are represented by arrows colored according to similarity groups, with grey arrows indicating genes not part of any similarity group. The sequence identity between genes in the same similarity group is indicated by shading, according to the identity scale bar. Detailed gene annotations are shown and color-coded, with the multidrug efflux MFS transporter loci indicated by a green dashed box (see clinker1.png).

    References

    • Gilchrist, C.L.M., & Chooi, Y.-H. (2021). clinker & clustermap.js: Automatic generation of gene cluster comparison figures. Bioinformatics. doi: https://doi.org/10.1093/bioinformatics/btab007
    • Xie, Z., & Tang, H. (2017). ISEScan: automated identification of insertion sequence elements in prokaryotic genomes. Bioinformatics, 33(21), 3340-3347. doi: https://doi.org/10.1093/bioinformatics/btx433
  2. Tools of Insertion Sequence (IS) Element Detection

    • ISEScan: Description: Although not a database, ISEScan is a software tool used to identify IS elements in bacterial genome sequences. It can be helpful for researchers looking to analyze newly sequenced genomes for the presence of IS elements. Website: Available on platforms like GitHub for download and integration into bioinformatics workflows.

    • TnCentral including ISFinder: Description: TnCentral is a more comprehensive resource that includes information about transposons, which are larger and more complex than simple IS elements but often contain IS sequences as part of their structure. This database provides detailed information about transposon structures, including associated genes and regulatory features.

    • ISsaga: ISsaga is a web-based tool for the identification and annotation of insertion sequences in prokaryotic genomes. It provides various features for IS element analysis, including detection, classification, and visualization. You can access ISsaga here: ISsaga #http://issaga.biotoul.fr/ISsaga/issaga_index.php

    • ISFinder: ISFinder is a curated database and analysis platform for insertion sequences in prokaryotic genomes. It provides a comprehensive collection of IS sequences and tools for sequence analysis, classification, and annotation. You can access ISFinder here: ISFinder For ISfinder please cite: Siguier P. et al. (2006) ISfinder: the reference centre for bacterial insertion sequences. Nucleic Acids Res. 34: D32-D36 (link pubmed) and the database URL (http://www-is.biotoul.fr).

    • For ISbrowser please cite: Kichenaradja P. et al. (2010) ISbrowser: an extension of ISfinder for visualizing insertion sequences in prokaryotic genomes. Nucleic Acids Res. 38: D62-D68 (link pubmed). For ISsaga please cite: Varani A. et al. (2011) ISsaga: an ensemble of web-based methods for high throughput identification and semi-automatic annotation of insertion sequences in prokaryotic genomes, Genome Biology 2011, 12:R30 (link pubmed).

    • ISMapper: ISMapper is a tool for mapping insertion sequences in bacterial genomes. It uses paired-end sequence data to identify IS element insertion sites and provides information about their genomic context. You can access ISMapper here: ISMapper ISMapper: identifying transposase insertion sites in bacterial genomes from short read sequence data https://pubmed.ncbi.nlm.nih.gov/26336060/

    • ISseeker: ISseeker is a software package for the identification and annotation of insertion sequences in bacterial genomes. It provides a user-friendly interface for IS element detection and characterization. You can access ISseeker here: ISseeker

  3. Identificatio of IS (Insertion Sequence) elements using ISEScan

    1. #extracted sequence segments from the two isolates, specifically:
    2. # ATCC19606: 930469 to 951674 — segment1
    3. # ATCC17978: 2,934,384 to 3,000,721 — segment2
    4. #Then, I compared the two segments and found that positions 1-11055 of segment1 mapped to 66338-55284 of segment2, and positions 11049-21206 of segment1 mapped to 10158-23 of segment2. This means the sequence from 10159-55283 of segment2 (about 45 kb nt) is not mapped. I then extracted the 45 kb sequence (see the attached fasta file). I attempted to detect IS elements using the tool ISEScan (https://academic.oup.com/bioinformatics/article/33/21/3340/3930124). Four ISs were detected (see 45kb.fasta.xlsx; for more detailed results, see 45kb.fasta.zip).
    5. #samtools faidx Acinetobacter_baumannii_ATCC19606.gbk_converted.fna CP059040.1:930469-951674 > ../ATCC19606_segment.fasta
    6. vim ./gbks/A.baumannii_ATCC17978.gbk
    7. #LOCUS CP000521 3976747 bp DNA circular BCT 31-JAN-2014
    8. #DEFINITION Acinetobacter baumannii ATCC 17978, complete genome.
    9. I used the following commands extracted a 45kb fasta. Then using a tools get IS elements.
    10. samtools faidx A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2934384-3000721 > ../ATCC17978_segment.fasta
    11. makeblastdb -in ATCC17978_segment.fasta -dbtype nucl
    12. blastn -db ATCC17978_segment.fasta -query ATCC19606_segment.fasta -num_threads 15 -outfmt 6 -strand both -evalue 0.1 > ATCC19606_segment_on_ATCC17978_segment.blastn
    13. samtools faidx ATCC17978_segment.fasta CP000521.1_2934384_3000721:10159-55283 > 45kb.fasta
    14. please update the following tables in which all positons referred to the 45kb sequence to the complete genome in ATCC17978.
    15. #seqID: sequence identifier
    16. #family: family name of IS element
    17. #cluster: Tpase cluster
    18. #isBegin and isEnd: genome coordinates of the predicted IS element
    19. #isLen: length of the predicted IS element
    20. #ncopy4is: number of predicted IS copies including full-length and partial IS copies
    21. #start1, end1, start2, end2: genome coordinates of the IRs
    22. #score: score of the IRs
    23. #irId: number of identical matches in pairwise alignment of left and righ hand invered repeats
    24. #irLen, length of inverted repeats
    25. #nGaps: number of gaps in IRs
    26. #orfBegin, orfEnd: genome coordinates of the predicted Tpase ORF
    27. #strand: strand where the Tpase is
    28. #orfLen: length of predicted Tpase ORF
    29. #E-value: the best E-value among all IS copies for the same IS element, the smaller the better
    30. #E-value4copy: the E-value of the reported IS copy, the smaller the better
    31. #type: type of IS element copy, 'c' for complete IS element and 'p' for partial IS element
    32. #ov: ov number returned by hmmer search
    33. #tir: terminal inverted repeat sequences
    34. seqID family cluster isBegin isEnd isLen ncopy4is start1 end1 start2 end2 score irId irLen nGaps orfBegin orfEnd strand orfLen E-value E-value4copy type ov tir
    35. CP000521.1_2934384_3000721:10159-55283 IS5 IS5_222 5818 8737 2920 1 5818 5842 8713 8737 18 17 25 0 5931 8822 + 2892 3.3E-74 3.3E-74 c 1 TGATTAAACTTTGCGGATTAAATTG:TGATTAAATCTAATGTGTTGAATTG
    36. CP000521.1_2934384_3000721:10159-55283 IS3 IS3_176 8745 9849 1105 1 8745 8761 9833 9849 26 15 17 0 8916 9775 - 860 9E-38 9E-38 p 1 ATTGATGATAGCCAAAA:ATTGATCCTAGCCAAAA
    37. CP000521.1_2934384_3000721:10159-55283 IS5 IS5_226 9983 10411 429 1 9983 9996 10398 10411 20 12 14 0 9850 10364 + 515 7.2E-28 7.2E-28 p 1 TATCATTCATTATA:TATCATTCAGCATA
    38. CP000521.1_2934384_3000721:10159-55283 IS5 IS5_302 23918 24796 879 1 23918 23953 24762 24796 54 33 36 1 23947 24699 - 753 3E-82 3E-82 c 1 AAAATCAAAATAATGCTTAGGGCGTGTCCTCATTTG:AAAATCAAAATGATGC-TAGGGCGTGTCTTCATTTG
  4. Postprocessing of the relative postions in the 45kb.fasta in the results IS_of_45kb.xlsx to the absolute positions of the genome

    1. #https://github.com/xiezhq/ISEScan?tab=readme-ov-file#Usage
    2. Columns in NC_012624.fna.csv (NC_012624.fna.tsv, NC_012624.fna.raw):
    3. #seqID: sequence identifier
    4. #family: family name of IS element
    5. #cluster: Tpase cluster
    6. #isBegin and isEnd: genome coordinates of the predicted IS element
    7. #isLen: length of the predicted IS element
    8. #ncopy4is: number of predicted IS copies including full-length and partial IS copies
    9. #start1, end1, start2, end2: genome coordinates of the IRs
    10. #score: score of the IRs
    11. #irId: number of identical matches in pairwise alignment of left and righ hand invered repeats
    12. #irLen, length of inverted repeats
    13. #nGaps: number of gaps in IRs
    14. #orfBegin, orfEnd: genome coordinates of the predicted Tpase ORF
    15. #strand: strand where the Tpase is
    16. #orfLen: length of predicted Tpase ORF
    17. #E-value: the best E-value among all IS copies for the same IS element, the smaller the better
    18. #E-value4copy: the E-value of the reported IS copy, the smaller the better
    19. #type: type of IS element copy, 'c' for complete IS element and 'p' for partial IS element
    20. #ov: ov number returned by hmmer search
    21. #tir: terminal inverted repeat sequences
    22. #Note: the E-value is the E-value returned by hmmer when searching profile HMMs against proteome translated from a genome sequence
    23. samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:5818-8737 > 1_1.fasta
    24. samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:5818-5842 > 1_2.fasta
    25. samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:8713-8737 > 1_3.fasta
    26. seqtk seq -r 1_3.fasta > 1_3_rc.fasta
    27. samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:5931-8822 > 1_4.fasta
    28. samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:8745-9849 > 2_1.fasta
    29. samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:8745-8761 > 2_2.fasta
    30. samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:9833-9849 > 2_3.fasta
    31. seqtk seq -r 2_3.fasta > 2_3_rc.fasta
    32. samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:8916-9775 > 2_4.fasta
    33. seqtk seq -r 2_4.fasta > 2_4_rc.fasta
    34. samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:9983-10411 > 3_1.fasta
    35. samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:9983-9996 > 3_2.fasta
    36. samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:10398-10411 > 3_3.fasta
    37. seqtk seq -r 3_3.fasta > 3_3_rc.fasta
    38. samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:9850-10364 > 3_4.fasta
    39. samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:23918-24796 > 4_1.fasta
    40. samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:23918-23953 > 4_2.fasta
    41. samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:24762-24796 > 4_3.fasta
    42. seqtk seq -r 4_3.fasta > 4_3_rc.fasta
    43. samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:23947-24699 > 4_4.fasta
    44. seqtk seq -r 4_4.fasta > 4_4_rc.fasta
    45. ATG (77 %), GTG (14 %), TTG
    46. Stopcodons in der DNA:
    47. TAG (Thymin - Adenin - Guanin)
    48. TAA (Thymin - Adenin - Adenin)
    49. TGA (Thymin - Guanin - Adenin)
    50. CP000521.1_2934384_3000721:10159-55283 IS5 IS5_222 5818 8737 2920 1 5818 5842 8713 8737 18 17 25 0 5931 8822
    51. CP000521.1_2934384_3000721:10159-55283 IS3 IS3_176 8745 9849 1105 1 8745 8761 9833 9849 26 15 17 0 8916 9775
    52. CP000521.1_2934384_3000721:10159-55283 IS5 IS5_226 9983 10411 429 1 9983 9996 10398 10411 20 12 14 0 9850 10364
    53. CP000521.1_2934384_3000721:10159-55283 IS5 IS5_302 23918 24796 879 1 23918 23953 24762 24796 54 33 36 1 23947 24699
    54. seqID family cluster isBegin isEnd isLen ncopy4is start1 end1 start2 end2 score irId irLen nGaps orfBegin orfEnd strand orfLen E-value E-value4copy type ov tir
    55. CP000521.1 IS5 IS5_222 2,950,360 2,953,279 2,920 1 2,950,360 2,950,388 2,953,239 2,953,267 18 17 25 0 2,950,473 2,953,364
    56. + 2,892 3.3E-74 3.3E-74 c 1 TGATTAAACTTTGCGGATTAAATTG
    57. CP000521.1 IS3 IS3_176 2,953,287 2,954,391 1,105 1 2,953,287 2,953,303 2,954,359 2,954,375 26 15 17 0 2,953,458 2,954,317
    58. - 860 9E-38 9E-38 p 1 ATTGATGATAGCCAAAA
    59. CP000521.1 IS5 IS5_226 2,954,509 2,954,937 429 1 2,954,509 2,954,522 2,954,924 2,954,937 20 12 14 0 2,954,376 2,954,890
    60. + 515 7.2E-28 7.2E-28 p 1 TATCATTCATTATA
    61. CP000521.1 IS5 IS5_302 2,969,444 2,970,322 879 1 2,969,444 2,969,479 2,970,248 2,970,282 54 33 36 1 2,969,473 2,970,225
    62. - 753 3E-82 3E-82 c 1 AAAATCAAAATAATGCTTAGGGCGTGTCCTCATTTG
    63. I checked the results of the following command, they should be TGATTAAACTTTGCGGATTAAATTG, However it is a little different.
    64. samtools faidx A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2950360-2950388
    65. GATTAAACTTTGCGGATTAAATTGACAGA
    66. 2950359-5818=+2944541
    67. 1..45125
    68. 2944541..2989666
    69. #confirm the position-changed sheet!
    70. CP000521.1 IS5 IS5_222 2950359 2953278 2920 1 2950359 2950383 2953254 2953278 18 17 25 0 2950472 2953363
    71. CP000521.1 IS3 IS3_176 2953286 2954390 1105 1 2953286 2953302 2954374 2954390 26 15 17 0 2953457 2954316
    72. CP000521.1 IS5 IS5_226 2954524 2954952 429 1 2954524 2954537 2954939 2954952 20 12 14 0 2954391 2954905
    73. CP000521.1 IS5 IS5_302 2968459 2969337 879 1 2968459 2968494 2969303 2969337 54 33 36 1 2968488 2969240
    74. samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2950359-2953278 > 1_1_changed.fasta
    75. samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2950359-2950383 > 1_2_changed.fasta
    76. samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2953254-2953278 > 1_3_changed.fasta
    77. seqtk seq -r 1_3_changed.fasta > 1_3_rc_changed.fasta
    78. samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2950472-2953363 > 1_4_changed.fasta
    79. samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2953286-2954390 > 2_1_changed.fasta
    80. samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2953286-2953302 > 2_2_changed.fasta
    81. samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2954374-2954390 > 2_3_changed.fasta
    82. seqtk seq -r 2_3_changed.fasta > 2_3_rc_changed.fasta
    83. samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2953457-2954316 > 2_4_changed.fasta
    84. seqtk seq -r 2_4_changed.fasta > 2_4_rc_changed.fasta
    85. samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2954524-2954952 > 3_1_changed.fasta
    86. samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2954524-2954537 > 3_2_changed.fasta
    87. samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2954939-2954952 > 3_3_changed.fasta
    88. seqtk seq -r 3_3_changed.fasta > 3_3_rc_changed.fasta
    89. samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2954391-2954905 > 3_4_changed.fasta
    90. samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2968459-2969337 > 4_1_changed.fasta
    91. samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2968459-2968494 > 4_2_changed.fasta
    92. samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2969303-2969337 > 4_3_changed.fasta
    93. seqtk seq -r 4_3_changed.fasta > 4_3_rc_changed.fasta
    94. samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2968488-2969240 > 4_4_changed.fasta
    95. seqtk seq -r 4_4_changed.fasta > 4_4_rc_changed.fasta
    96. diff 1_1.fasta 1_1_changed.fasta
    97. diff 1_2.fasta 1_2_changed.fasta
    98. diff 1_3.fasta 1_3_changed.fasta
    99. diff 1_4.fasta 1_4_changed.fasta
    100. diff 2_1.fasta 2_1_changed.fasta
    101. diff 2_2.fasta 2_2_changed.fasta
    102. diff 2_3.fasta 2_3_changed.fasta
    103. diff 2_4.fasta 2_4_changed.fasta
    104. diff 3_1.fasta 3_1_changed.fasta
    105. diff 3_2.fasta 3_2_changed.fasta
    106. diff 3_3.fasta 3_3_changed.fasta
    107. diff 3_4.fasta 3_4_changed.fasta
    108. diff 4_1.fasta 4_1_changed.fasta
    109. diff 4_2.fasta 4_2_changed.fasta
    110. diff 4_3.fasta 4_3_changed.fasta
    111. diff 4_4.fasta 4_4_changed.fasta
    112. It should be ATG.... (gene length)
    113. The positions are not correct. Note that I changed the coordinate twice in the following commands:
    114. samtools faidx A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2934384-3000721 > ../ATCC17978_segment.fasta
    115. makeblastdb -in ATCC17978_segment.fasta -dbtype nucl
    116. blastn -db ATCC17978_segment.fasta -query ATCC19606_segment.fasta -num_threads 15 -outfmt 6 -strand both -evalue 0.1 > ATCC19606_segment_on_ATCC17978_segment.blastn
    117. samtools faidx ATCC17978_segment.fasta CP000521.1_2934384_3000721:10159-55283 > 45kb.fasta
    118. The postions in the original tables are relative postions in the 45kb.fasta. Please correct it!
  5. 我想比对的是这27,694 Acinetobacter calcoaceticus/baumannii complex Genomes (https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=909768) 中,目的基因 (https://www.ncbi.nlm.nih.gov/nuccore/NC_010410.1?from=974325&to=975530) 是否是conserved的 (先不管SNP, InDel或其他突变现象)。就如这两篇paper的结果显示"AdeIJK is highly conserved across the Acinetobacter genus" (Paper1: Evolution of RND efflux pumps in the development of a successful pathogen; Paper2: RND pumps across the genus Acinetobacter: AdeIJK is the universal efflux pump). Analysis results: In our analysis of 38,638 records, we discovered that 65 of them do not contain the specific genes based on the submitted genomes from GenBank (refer to the details below). Attached, you will find a document that provides detailed information on the positions of the genes present in the remaining genomes. However, it's important to note that, based on my experience, GenBank does contain a number of erroneously assembled genomes. Therefore, it's conceivable that the absence of the gene in some isolates could be attributed not to its actual absence but rather to errors in genome assembly.

    1. ~/Tools/datasets download genome taxon 909768
    2. #ncbi_dataset_taxon909768
    3. ~/Tools/datasets download genome accession --inputfile sampled_accession_numbers.txt --include gff3,gbff,genome
    4. #./blastn_1st_batch.sh generating ../blastn_1st_res
    5. for sample in GCF_000757665.1_ASM75766v1_genomic.fna GCF_000746645.1_ASM74664v1_genomic.fna; do
    6. echo "makeblastdb -in ${sample} -dbtype nucl"
    7. echo "blastn -db ${sample} -query ../../ABAYE_RS05070.fasta -evalue 0.1 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 >> ../blastn_merged_res_1_1000"
    8. done
    9. for sample in GCF_000757665.1_ASM75766v1_genomic.fna GCF_000746645.1_ASM74664v1_genomic.fna; do
    10. echo "makeblastdb -in ${sample} -dbtype nucl"
    11. echo "blastn -db ${sample} -query ../../ABAYE_RS05070.fasta -evalue 0.1 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 >> blastn_merged_res_4001_8000"
    12. done
    13. ...
    14. TODO NEXT WEEK: senden a EXCEL storing the merged blastn results! look if all 38638 records contain gene, if not, mark yellow (explain it is also possible an error of assembly of the genome regarding the specific isolate!)
    15. #GenBank GCA_000810025.3
    16. 38638
    17. makeblastdb -in GCF_000757665.1_ASM75766v1_genomic.fna -dbtype nucl
    18. blastn -db GCF_000757665.1_ASM75766v1_genomic.fna -query ../../ABAYE_RS05070.fasta -evalue 0.1 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 >> blastn_merged_res
    19. makeblastdb -in GCF_000746645.1_ASM74664v1_genomic.fna -dbtype nucl
    20. blastn -db GCF_000746645.1_ASM74664v1_genomic.fna -query ../../ABAYE_RS05070.fasta -evalue 0.1 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 >> blastn_merged_res
    21. #find ./ncbi_dataset/sampled_100/ -name "*.fna" -print0 | xargs -0 mv -t /destination/directory/
    22. find . -mindepth 2 -maxdepth 2 -type f -name "*.fna" -print0 | xargs -0 mv -t .
    23. find . -mindepth 1 -maxdepth 1 -type f -name "*.fna" >../fna_list2
    24. # The genome was downloaded at Mär 25, 2024
    25. I have over 38638 genome with the taxon-id 909768, want to assembly 100 from them,
    26. the data strcture as the following:
    27. ./data/GCF_000757665.1/GCF_000757665.1_ASM75766v1_genomic.fna
    28. ./data/GCF_000746645.1/GCF_000746645.1_ASM74664v1_genomic.fna
    29. ./data/GCF_000746605.1/GCF_000746605.1_ASM74660v1_genomic.fna
    30. ./data/GCF_000738845.1/GCF_000738845.1_ASM73884v1_genomic.fna
    31. ./data/GCF_000734775.1/GCF_000734775.1_ASM73477v1_genomic.fna
    32. ./data/GCF_000731965.1/GCF_000731965.1_ASM73196v1_genomic.fna
    33. ./data/GCF_000708795.1/GCF_000708795.1_ABU310_genomic.fna
    34. ./data/GCF_000708775.1/GCF_000708775.1_ASM70877v1_genomic.fna
    35. ./data/GCF_000695855.3/GCF_000695855.3_ASM69585v3_genomic.fna
    36. ./data/GCF_000248195.1/GCF_000248195.1_ASM24819v2_genomic.fna
    37. ....
    38. I have a total of 38638 genomes in fasta format in local computer. I want to quick check if they are contain a gene ABAYE_RS05070 (/product="multidrug effflux MFS transporter") https://www.ncbi.nlm.nih.gov/nuccore/NC_010410.1?from=974325&to=975530)
    39. # What are the default values of -perc_identity and -qcov_hsp_perc in blastn?
    40. #-perc_identity 90 -qcov_hsp_perc 90
    41. By default:
    42. -perc_identity is set to 0, meaning that all alignments found will be reported regardless of percent identity.
    43. -qcov_hsp_perc is set to 0 as well, indicating that all alignments will be reported regardless of query coverage per HSP.
    44. makeblastdb -in ncbi_dataset/data/GCF_904885835.1/GCF_904885835.1_Aci00717-181015_contigs_filtered.fa.gz_genomic.fna -dbtype nucl
    45. blastn -db ncbi_dataset/data/GCF_904885835.1/GCF_904885835.1_Aci00717-181015_contigs_filtered.fa.gz_genomic.fna -query ABAYE_RS05070.fasta -evalue 0.1 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1
    46. makeblastdb -in 100.fasta -dbtype nucl
    47. # -max_target_seqs 1
    48. blastn -db 100.fasta -query ../../ABAYE_RS05070.fasta -evalue 0.1 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1
    49. with the following commands:
    50. for sample in GCF_000809205.3_ASM80920v3_genomic.fna GCF_000809215.3_ASM80921v3_genomic.fna GCF_000809225.3_ASM80922v3_genomic.fna _genomic.fna
    51. GCF_001907125.1_ASM190712v1_genomic.fna GCF_001907145.1_ASM190714v1_genomic.fna GCF_001907155.1_ASM190715v1_genomic.fna GCF_001908295.1_ASM190829v1_genomic.fna GCF_001909135.1_ASM190913v1_genomic.fna GCF_001910585.1_ASM191058v1_genomic.fna GCF_001910595.1_ASM191059v1_genomic.fna GCF_001910605.1_ASM191060v1_genomic.fna GCF_001910615.1_ASM191061v1_genomic.fna GCF_001910665.1_ASM191066v1_genomic.fna GCF_001910675.1_ASM191067v1_genomic.fna GCF_008632635.1_ASM863263v1_genomic.fna; do
    52. makeblastdb -in ${sample} -dbtype nucl
    53. blastn -db ${sample} -query ../ABAYE_RS05070.fasta -evalue 0.1 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 >> ../blastn_1st_res
    54. done
    55. get the following list, I want to add the two parts in the filenames also in the results as the second and third columns.
    56. For example for GCF_001907155.1_ASM190715v1_genomic.fna,
    57. the current results is
    58. "gi|169794206|ref|NC_010410.1|:974325-975530 NZ_JWTW03000099.1 97.595 1206 29 0 1 1206 211093 209888 0.0 2067"
    59. It should be
    60. "gi|169794206|ref|NC_010410.1|:974325-975530 GCF_001907155.1 ASM190715v1 NZ_JWTW03000099.1 97.595 1206 29 0 1 1206 211093 209888 0.0 2067"
    61. gi|169794206|ref|NC_010410.1|:974325-975530 NZ_JWYU03000064.1 97.595 1206 29 0 1 1206 73028 74233 0.0 2067
    62. gi|169794206|ref|NC_010410.1|:974325-975530 NZ_CP010397.1 97.181 1206 34 0 1 1206 983267 984472 0.0 2039
    63. gi|169794206|ref|NC_010410.1|:974325-975530 NZ_HG977522.1 97.595 1206 29 0 1 1206 935195 936400 0.0 2067
    64. gi|169794206|ref|NC_010410.1|:974325-975530 NZ_HG977526.1 97.595 1206 29 0 1 1206 934286 935491 0.0 2067
    65. gi|169794206|ref|NC_010410.1|:974325-975530 NZ_AP014649.1 97.595 1206 29 0 1 1206 960072 961277 0.0 2067
    66. gi|169794206|ref|NC_010410.1|:974325-975530 NZ_CP010781.1 100.000 1206 0 0 1 1206 2943042 2941837 0.0 2228
    67. gi|169794206|ref|NC_010410.1|:974325-975530 NZ_BBSU01000014.1 98.093 1206 23 0 1 1206 30546 29341 0.0 2100
    68. gi|169794206|ref|NC_010410.1|:974325-975530 NZ_BBSP01000001.1 97.430 1206 31 0 1 1206 511041 509836 0.0 2056
    69. #!/bin/bash
    70. > ../blastn_res
    71. for sample in GCF_000809205.3_ASM80920v3_genomic.fna GCF_000809215.3_ASM80921v3_genomic.fna GCF_000809225.3_ASM80922v3_genomic.fna GCF_001907125.1_ASM190712v1_genomic.fna GCF_001907145.1_ASM190714v1_genomic.fna GCF_001907155.1_ASM190715v1_genomic.fna GCF_001908295.1_ASM190829v1_genomic.fna GCF_001909135.1_ASM190913v1_genomic.fna GCF_001910585.1_ASM191058v1_genomic.fna GCF_001910595.1_ASM191059v1_genomic.fna GCF_001910605.1_ASM191060v1_genomic.fna GCF_001910615.1_ASM191061v1_genomic.fna GCF_001910665.1_ASM191066v1_genomic.fna GCF_001910675.1_ASM191067v1_genomic.fna GCF_008632635.1_ASM863263v1_genomic.fna; do
    72. # Extract the second and third parts from the filename
    73. filename_parts=$(echo "${sample}" | cut -d'_' -f1-3)
    74. # Create the blast database
    75. makeblastdb -in "${sample}" -dbtype nucl
    76. # Run blastn and append the results with filename parts
    77. blastn -db "${sample}" -query ../ABAYE_RS05070.fasta -evalue 0.1 -num_threads 15 -outfmt '6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore' -strand both -max_target_seqs 1 | awk -v filename_parts="${filename_parts}" '{print $0 "\t" filename_parts}' >> ../blastn_res
    78. done
    79. #a specific assembly version of a genome in the NCBI (National Center for Biotechnology Information) database
    80. #GCF_000248195.1_ASM24819v2 Acinetobacter baumannii strain NCTC 7422 contig_0002, whole genome shotgun sequence
    81. DELETE, since they are all the same: 1. qseqid [query or source (gene) sequence id]
    82. 2. sseqid GenBank ID [subject or target (reference genome) sequence id]
    83. 3. pident Percentage of identical positions
    84. 4. length Alignment length (sequence overlap)
    85. 5. mismatch Number of mismatches
    86. 6. gapopen Number of gap openings
    87. 7. qstart Start of alignment in query
    88. 8. qend End of alignment in query
    89. 9. sstart Start of alignment in subject
    90. 10. send End of alignment in subject
    91. 11. evalue Expect value
    92. 12. bitscore Bit score
    93. 13. Assembly accession number
    94. GenBank ID Percentage of identical positions Alignment length Number of mismatches Number of gap openings Start of alignment in query End of alignment in query Start of alignment in subject End of alignment in subject Expect value Bit score Assembly accession number
    95. query_seq_id sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    96. gi|169794206|ref|NC_010410.1|:974325-975530 NZ_AIED01000002.1 82.047 1192 210 4 1 1190 161177 159988 0.0 1013 GCF_000248195.1_ASM24819v2
    97. #batch 1: until ...802.2_genomic.fna GCA_032600445.2_PDT001925801.2_genomic.fna GCA_032600465.2_PDT001925800.2_genomic.fna GCA_032600485.2_PDT001925799.2_genomic.fna GCA_032600525.2_PDT001925797.2_genomic.fna GCA_032600545.2_PDT001925796.2_genomic.fna GCA_032600565.2_PDT001925795.2_genomic.fna GCA_032600585.2_PDT001925793.2_genomic.fna
    98. #batch 2: from GCA_032600605.2_PDT001925792.2_genomic.fna GCA_032600625.2_PDT001925794.2_genomic.fna GCA_032600645.2_PDT001925791.2_genomic.fna ...
    99. #to compare with:
    100. gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCE020000004.1 97.595 1206 29 0 1 1206 96245 95040 0.0 2067 GCA_032600385.2_PDT001925803.2
    101. gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCB020000004.1 97.595 1206 29 0 1 1206 96245 95040 0.0 2067 GCA_032600395.2_PDT001925804.2
    102. gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCD020000003.1 97.595 1206 29 0 1 1206 96245 95040 0.0 2067 GCA_032600425.2_PDT001925802.2
    103. gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCF020000001.1 97.595 1206 29 0 1 1206 210993 209788 0.0 2067 GCA_032600445.2_PDT001925801.2 <--
    104. gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCG020000001.1 97.595 1206 29 0 1 1206 210993 209788 0.0 2067 GCA_032600465.2_PDT001925800.2
    105. gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCH020000002.1 97.595 1206 29 0 1 1206 203882 202677 0.0 2067 GCA_032600485.2_PDT001925799.2
    106. gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCJ020000003.1 97.595 1206 29 0 1 1206 113990 112785 0.0 2067 GCA_032600525.2_PDT001925797.2
    107. gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCK020000002.1 97.595 1206 29 0 1 1206 210992 209787 0.0 2067 GCA_032600545.2_PDT001925796.2
    108. gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCM020000002.1 97.595 1206 29 0 1 1206 210993 209788 0.0 2067 GCA_032600565.2_PDT001925795.2
    109. gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCN020000002.1 97.595 1206 29 0 1 1206 210993 209788 0.0 2067 GCA_032600585.2_PDT001925793.2 <--
    110. Assembly accession number GenBank accession number
    111. Can I say NZ_AIED01000002.1 Genbank_ID, and GCF_000248195.1_ASM24819v2 Assembly_ID?
    112. Yes, you can interpret these identifiers within the given line as follows:
    113. * NZ_AIED01000002.1 can be referred to as a GenBank accession number. This format typically represents a sequence identifier within the GenBank database, part of the NCBI's collection of databases related to nucleotide sequences and their annotations.
    114. * GCF_000248195.1_ASM24819v2 refers to an assembly accession number from the GenBank assembly database. The "GCF" prefix indicates a RefSeq genome assembly accession, which is part of NCBI's Reference Sequence Database. The number following "GCF" is a unique identifier for the specific assembly of the genome, and the information after the underscore provides a version identifier and possibly an internal project code or name.
    115. In a broader context, the line you've provided looks like it's from a BLAST output or a similar comparative genomics tool output. Such lines typically contain information about sequence alignments, including the query and subject sequences (with their respective accession numbers), alignment scores, and statistical significance metrics. The presence of both a sequence accession number (GenBank ID) and an assembly accession number helps link individual sequences to the larger genomic context from which they were derived.
    116. 30662 + 7989 = 38651 > 38638
    117. gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCM020000002.1 97.595 1206 29 0 1 1206 210993 209788 0.0 2067 GCA_032600565.2_PDT001925795.2
    118. gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCN020000002.1 97.595 1206 29 0 1 1206 210993 209788 0.0 2067 GCA_032600585.2_PDT001925793.2
    119. gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCO020000003.1 97.595 1206 29 0 1 1206 54521 53316 0.0 2067 GCA_032600605.2_PDT001925792.2
    120. # In the file containing the following lines, I want to extract the first tokens separated by '_'.
    121. ./GCF_009013175.1_ASM901317v1_genomic.fna
    122. ./GCF_013344775.1_ASM1334477v1_genomic.fna
    123. ./GCF_013344845.1_ASM1334484v1_v1_genomic.fna
    124. ./GCF_013346345.1_ASM1334634v1_genomic.fna
    125. ./GCF_013346385.1_ASM1334638v1_genomic.fna
    126. 38651-38638=13
    127. diff fna_list_tokens_1_3 blastn_res_f13 > diff2
    128. diff fna_list_tokens_1_3 blastn_res_f13_uniq > diff3
    129. # Open the file named 'fna_list' and read its content
    130. with open('fna_list', 'r') as file:
    131. lines = file.readlines()
    132. # Extract the first token after the first '_' in each line
    133. first_tokens = [line.split('_')[1].strip() for line in lines]
    134. # Example output
    135. print(first_tokens)

like unlike

点赞本文的读者

还没有人对此文章表态


本文有评论

没有评论

看文章,发评论,不要沉默


© 2023 XGenes.com Impressum