Insertion Sequence (IS) Element Detection

gene_x 0 like s 342 view s

Tags: pipeline

  1. Bioinformatics Methods

    Sequence Extraction and Comparative Analysis

    Genomic regions surrounding the multidrug efflux MFS transporter were extracted from five isolates of Acinetobacter baumannii: ATCC19606, ACICU, AYE, ATCC17978, and SDF. These regions were aligned to identify any unmapped segments. This alignment revealed a 45 kb sequence (positions 2,944,541 to 2,989,666 in ATCC17978) that did not align with any part of the other four genomes. This unmapped sequence was subsequently extracted for further analysis.

    Insertion Sequence (IS) Element Detection

    The tool ISEScan (Xie & Tang, 2017) was employed to detect insertion sequences (IS) within the extracted 45 kb fragment. The analysis identified four distinct IS elements, providing insights into potential genomic variations and mobile genetic elements within this segment (see IS_of_45kb.xlsx).

    Visualization and Annotation

    The alignment was visualized using clinker v0.0.21 (Gilchrist and Chooi, 2021). Genes are represented by arrows colored according to similarity groups, with grey arrows indicating genes not part of any similarity group. The sequence identity between genes in the same similarity group is indicated by shading, according to the identity scale bar. Detailed gene annotations are shown and color-coded, with the multidrug efflux MFS transporter loci indicated by a green dashed box (see clinker1.png).

    References

    • Gilchrist, C.L.M., & Chooi, Y.-H. (2021). clinker & clustermap.js: Automatic generation of gene cluster comparison figures. Bioinformatics. doi: https://doi.org/10.1093/bioinformatics/btab007
    • Xie, Z., & Tang, H. (2017). ISEScan: automated identification of insertion sequence elements in prokaryotic genomes. Bioinformatics, 33(21), 3340-3347. doi: https://doi.org/10.1093/bioinformatics/btx433
  2. Tools of Insertion Sequence (IS) Element Detection

    • ISEScan: Description: Although not a database, ISEScan is a software tool used to identify IS elements in bacterial genome sequences. It can be helpful for researchers looking to analyze newly sequenced genomes for the presence of IS elements. Website: Available on platforms like GitHub for download and integration into bioinformatics workflows.

    • TnCentral including ISFinder: Description: TnCentral is a more comprehensive resource that includes information about transposons, which are larger and more complex than simple IS elements but often contain IS sequences as part of their structure. This database provides detailed information about transposon structures, including associated genes and regulatory features.

    • ISsaga: ISsaga is a web-based tool for the identification and annotation of insertion sequences in prokaryotic genomes. It provides various features for IS element analysis, including detection, classification, and visualization. You can access ISsaga here: ISsaga #http://issaga.biotoul.fr/ISsaga/issaga_index.php

    • ISFinder: ISFinder is a curated database and analysis platform for insertion sequences in prokaryotic genomes. It provides a comprehensive collection of IS sequences and tools for sequence analysis, classification, and annotation. You can access ISFinder here: ISFinder For ISfinder please cite: Siguier P. et al. (2006) ISfinder: the reference centre for bacterial insertion sequences. Nucleic Acids Res. 34: D32-D36 (link pubmed) and the database URL (http://www-is.biotoul.fr).

    • For ISbrowser please cite: Kichenaradja P. et al. (2010) ISbrowser: an extension of ISfinder for visualizing insertion sequences in prokaryotic genomes. Nucleic Acids Res. 38: D62-D68 (link pubmed). For ISsaga please cite: Varani A. et al. (2011) ISsaga: an ensemble of web-based methods for high throughput identification and semi-automatic annotation of insertion sequences in prokaryotic genomes, Genome Biology 2011, 12:R30 (link pubmed).

    • ISMapper: ISMapper is a tool for mapping insertion sequences in bacterial genomes. It uses paired-end sequence data to identify IS element insertion sites and provides information about their genomic context. You can access ISMapper here: ISMapper ISMapper: identifying transposase insertion sites in bacterial genomes from short read sequence data https://pubmed.ncbi.nlm.nih.gov/26336060/

    • ISseeker: ISseeker is a software package for the identification and annotation of insertion sequences in bacterial genomes. It provides a user-friendly interface for IS element detection and characterization. You can access ISseeker here: ISseeker

  3. Identificatio of IS (Insertion Sequence) elements using ISEScan

    #extracted sequence segments from the two isolates, specifically:
    #    ATCC19606: 930469 to 951674 — segment1
    #    ATCC17978: 2,934,384 to 3,000,721 — segment2
    #Then, I compared the two segments and found that positions 1-11055 of segment1 mapped to 66338-55284 of segment2, and positions 11049-21206 of segment1 mapped to 10158-23 of segment2. This means the sequence from 10159-55283 of segment2 (about 45 kb nt) is not mapped. I then extracted the 45 kb sequence (see the attached fasta file). I attempted to detect IS elements using the tool ISEScan (https://academic.oup.com/bioinformatics/article/33/21/3340/3930124). Four ISs were detected (see 45kb.fasta.xlsx; for more detailed results, see 45kb.fasta.zip).
    
    #samtools faidx Acinetobacter_baumannii_ATCC19606.gbk_converted.fna CP059040.1:930469-951674 > ../ATCC19606_segment.fasta
    vim ./gbks/A.baumannii_ATCC17978.gbk
    #LOCUS       CP000521             3976747 bp    DNA     circular BCT 31-JAN-2014
    #DEFINITION  Acinetobacter baumannii ATCC 17978, complete genome.
    
    I used the following commands extracted a 45kb fasta. Then using a tools get IS elements.
    samtools faidx A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2934384-3000721 > ../ATCC17978_segment.fasta
    makeblastdb -in ATCC17978_segment.fasta -dbtype nucl
    blastn -db ATCC17978_segment.fasta -query ATCC19606_segment.fasta -num_threads 15 -outfmt 6 -strand both -evalue 0.1 > ATCC19606_segment_on_ATCC17978_segment.blastn
    samtools faidx ATCC17978_segment.fasta CP000521.1_2934384_3000721:10159-55283 > 45kb.fasta
    
    please update the following tables in which all positons referred to the 45kb sequence to the complete genome in ATCC17978.
    
        #seqID: sequence identifier
        #family: family name of IS element
        #cluster: Tpase cluster
        #isBegin and isEnd: genome coordinates of the predicted IS element
        #isLen: length of the predicted IS element
        #ncopy4is: number of predicted IS copies including full-length and partial IS copies
        #start1, end1, start2, end2: genome coordinates of the IRs
        #score: score of the IRs
        #irId: number of identical matches in pairwise alignment of left and righ hand invered repeats
        #irLen, length of inverted repeats
        #nGaps: number of gaps in IRs
        #orfBegin, orfEnd: genome coordinates of the predicted Tpase ORF
        #strand: strand where the Tpase is
        #orfLen: length of predicted Tpase ORF
        #E-value: the best E-value among all IS copies for the same IS element, the smaller the better
        #E-value4copy: the E-value of the reported IS copy, the smaller the better
        #type: type of IS element copy, 'c' for complete IS element and 'p' for partial IS element
        #ov: ov number returned by hmmer search
        #tir: terminal inverted repeat sequences
        seqID   family  cluster isBegin isEnd   isLen   ncopy4is    start1  end1    start2  end2    score   irId    irLen   nGaps   orfBegin    orfEnd  strand  orfLen  E-value E-value4copy    type    ov  tir
        CP000521.1_2934384_3000721:10159-55283  IS5 IS5_222 5818    8737    2920    1   5818    5842    8713    8737    18  17  25  0   5931    8822    +   2892    3.3E-74 3.3E-74 c   1   TGATTAAACTTTGCGGATTAAATTG:TGATTAAATCTAATGTGTTGAATTG
        CP000521.1_2934384_3000721:10159-55283  IS3 IS3_176 8745    9849    1105    1   8745    8761    9833    9849    26  15  17  0   8916    9775    -   860 9E-38   9E-38   p   1   ATTGATGATAGCCAAAA:ATTGATCCTAGCCAAAA
        CP000521.1_2934384_3000721:10159-55283  IS5 IS5_226 9983    10411   429 1   9983    9996    10398   10411   20  12  14  0   9850    10364   +   515 7.2E-28 7.2E-28 p   1   TATCATTCATTATA:TATCATTCAGCATA
        CP000521.1_2934384_3000721:10159-55283  IS5 IS5_302 23918   24796   879 1   23918   23953   24762   24796   54  33  36  1   23947   24699   -   753 3E-82   3E-82   c   1   AAAATCAAAATAATGCTTAGGGCGTGTCCTCATTTG:AAAATCAAAATGATGC-TAGGGCGTGTCTTCATTTG
    
  4. Postprocessing of the relative postions in the 45kb.fasta in the results IS_of_45kb.xlsx to the absolute positions of the genome

    #https://github.com/xiezhq/ISEScan?tab=readme-ov-file#Usage
    
    Columns in NC_012624.fna.csv (NC_012624.fna.tsv, NC_012624.fna.raw):
    
    #seqID: sequence identifier
    #family: family name of IS element
    #cluster: Tpase cluster
    #isBegin and isEnd: genome coordinates of the predicted IS element
    #isLen: length of the predicted IS element
    #ncopy4is: number of predicted IS copies including full-length and partial IS copies
    #start1, end1, start2, end2: genome coordinates of the IRs
    #score: score of the IRs
    #irId: number of identical matches in pairwise alignment of left and righ hand invered repeats
    #irLen, length of inverted repeats
    #nGaps: number of gaps in IRs
    #orfBegin, orfEnd: genome coordinates of the predicted Tpase ORF
    #strand: strand where the Tpase is
    #orfLen: length of predicted Tpase ORF
    #E-value: the best E-value among all IS copies for the same IS element, the smaller the better
    #E-value4copy: the E-value of the reported IS copy, the smaller the better
    #type: type of IS element copy, 'c' for complete IS element and 'p' for partial IS element
    #ov: ov number returned by hmmer search
    #tir: terminal inverted repeat sequences
    
    #Note: the E-value is the E-value returned by hmmer when searching profile HMMs against proteome translated from a genome sequence
    
    samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:5818-8737 > 1_1.fasta
    samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:5818-5842 > 1_2.fasta
    samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:8713-8737 > 1_3.fasta
    seqtk seq -r 1_3.fasta > 1_3_rc.fasta
    samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:5931-8822 > 1_4.fasta
    
    samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:8745-9849 > 2_1.fasta
    samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:8745-8761 > 2_2.fasta
    samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:9833-9849 > 2_3.fasta
    seqtk seq -r 2_3.fasta > 2_3_rc.fasta
    samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:8916-9775 > 2_4.fasta
    seqtk seq -r 2_4.fasta > 2_4_rc.fasta
    
    samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:9983-10411 > 3_1.fasta
    samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:9983-9996 > 3_2.fasta
    samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:10398-10411 > 3_3.fasta
    seqtk seq -r 3_3.fasta > 3_3_rc.fasta
    samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:9850-10364 > 3_4.fasta
    
    samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:23918-24796 > 4_1.fasta
    samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:23918-23953 > 4_2.fasta
    samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:24762-24796 > 4_3.fasta
    seqtk seq -r 4_3.fasta > 4_3_rc.fasta
    samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:23947-24699 > 4_4.fasta
    seqtk seq -r 4_4.fasta > 4_4_rc.fasta
    
    ATG (77 %), GTG (14 %), TTG
    
    Stopcodons in der DNA:
    
        TAG (Thymin - Adenin - Guanin)
        TAA (Thymin - Adenin - Adenin)
        TGA (Thymin - Guanin - Adenin)
    
    CP000521.1_2934384_3000721:10159-55283  IS5 IS5_222 5818    8737    2920    1   5818    5842    8713    8737    18  17  25  0   5931    8822
    CP000521.1_2934384_3000721:10159-55283  IS3 IS3_176 8745    9849    1105    1   8745    8761    9833    9849    26  15  17  0   8916    9775
    CP000521.1_2934384_3000721:10159-55283  IS5 IS5_226 9983    10411   429 1   9983    9996    10398   10411   20  12  14  0   9850    10364
    CP000521.1_2934384_3000721:10159-55283  IS5 IS5_302 23918   24796   879 1   23918   23953   24762   24796   54  33  36  1   23947   24699
    
    seqID   family  cluster isBegin isEnd   isLen   ncopy4is    start1  end1    start2  end2    score   irId    irLen   nGaps   orfBegin    orfEnd  strand  orfLen  E-value E-value4copy    type    ov  tir
    CP000521.1  IS5 IS5_222 2,950,360   2,953,279   2,920   1   2,950,360   2,950,388   2,953,239   2,953,267   18  17  25  0   2,950,473   2,953,364
    +   2,892   3.3E-74 3.3E-74 c   1   TGATTAAACTTTGCGGATTAAATTG
    CP000521.1  IS3 IS3_176 2,953,287   2,954,391   1,105   1   2,953,287   2,953,303   2,954,359   2,954,375   26  15  17  0   2,953,458   2,954,317
    -   860 9E-38   9E-38   p   1   ATTGATGATAGCCAAAA
    CP000521.1  IS5 IS5_226 2,954,509   2,954,937   429 1   2,954,509   2,954,522   2,954,924   2,954,937   20  12  14  0   2,954,376   2,954,890
    +   515 7.2E-28 7.2E-28 p   1   TATCATTCATTATA
    CP000521.1  IS5 IS5_302 2,969,444   2,970,322   879 1   2,969,444   2,969,479   2,970,248   2,970,282   54  33  36  1   2,969,473   2,970,225
    -   753 3E-82   3E-82   c   1   AAAATCAAAATAATGCTTAGGGCGTGTCCTCATTTG
    
    I checked the results of the following command, they should be TGATTAAACTTTGCGGATTAAATTG, However it is a little different.
    samtools faidx A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2950360-2950388
    GATTAAACTTTGCGGATTAAATTGACAGA
    
    2950359-5818=+2944541
    
    1..45125
    2944541..2989666
    
    #confirm the position-changed sheet!
    CP000521.1  IS5 IS5_222 2950359 2953278 2920    1   2950359 2950383 2953254 2953278 18  17  25  0   2950472 2953363
    CP000521.1  IS3 IS3_176 2953286 2954390 1105    1   2953286 2953302 2954374 2954390 26  15  17  0   2953457 2954316
    CP000521.1  IS5 IS5_226 2954524 2954952 429 1   2954524 2954537 2954939 2954952 20  12  14  0   2954391 2954905
    CP000521.1  IS5 IS5_302 2968459 2969337 879 1   2968459 2968494 2969303 2969337 54  33  36  1   2968488 2969240
    
    samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2950359-2953278 > 1_1_changed.fasta
    samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2950359-2950383 > 1_2_changed.fasta
    samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2953254-2953278 > 1_3_changed.fasta
    seqtk seq -r 1_3_changed.fasta > 1_3_rc_changed.fasta
    samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2950472-2953363 > 1_4_changed.fasta
    
    samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2953286-2954390 > 2_1_changed.fasta
    samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2953286-2953302 > 2_2_changed.fasta
    samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2954374-2954390 > 2_3_changed.fasta
    seqtk seq -r 2_3_changed.fasta > 2_3_rc_changed.fasta
    samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2953457-2954316 > 2_4_changed.fasta
    seqtk seq -r 2_4_changed.fasta > 2_4_rc_changed.fasta
    
    samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2954524-2954952 > 3_1_changed.fasta
    samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2954524-2954537 > 3_2_changed.fasta
    samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2954939-2954952 > 3_3_changed.fasta
    seqtk seq -r 3_3_changed.fasta > 3_3_rc_changed.fasta
    samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2954391-2954905 > 3_4_changed.fasta
    
    samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2968459-2969337 > 4_1_changed.fasta
    samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2968459-2968494 > 4_2_changed.fasta
    samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2969303-2969337 > 4_3_changed.fasta
    seqtk seq -r 4_3_changed.fasta > 4_3_rc_changed.fasta
    samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2968488-2969240 > 4_4_changed.fasta
    seqtk seq -r 4_4_changed.fasta > 4_4_rc_changed.fasta
    
    diff 1_1.fasta 1_1_changed.fasta
    diff 1_2.fasta 1_2_changed.fasta
    diff 1_3.fasta 1_3_changed.fasta
    diff 1_4.fasta 1_4_changed.fasta
    
    diff 2_1.fasta 2_1_changed.fasta
    diff 2_2.fasta 2_2_changed.fasta
    diff 2_3.fasta 2_3_changed.fasta
    diff 2_4.fasta 2_4_changed.fasta
    
    diff 3_1.fasta 3_1_changed.fasta
    diff 3_2.fasta 3_2_changed.fasta
    diff 3_3.fasta 3_3_changed.fasta
    diff 3_4.fasta 3_4_changed.fasta
    
    diff 4_1.fasta 4_1_changed.fasta
    diff 4_2.fasta 4_2_changed.fasta
    diff 4_3.fasta 4_3_changed.fasta
    diff 4_4.fasta 4_4_changed.fasta
    
    It should be ATG.... (gene length)
    
    The positions are not correct. Note that I changed the coordinate twice in the following commands:
            samtools faidx A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2934384-3000721 > ../ATCC17978_segment.fasta
            makeblastdb -in ATCC17978_segment.fasta -dbtype nucl
            blastn -db ATCC17978_segment.fasta -query ATCC19606_segment.fasta -num_threads 15 -outfmt 6 -strand both -evalue 0.1 > ATCC19606_segment_on_ATCC17978_segment.blastn
            samtools faidx ATCC17978_segment.fasta CP000521.1_2934384_3000721:10159-55283 > 45kb.fasta
    
    The postions in the original tables are relative postions in the 45kb.fasta. Please correct it!
    
  5. 我想比对的是这27,694 Acinetobacter calcoaceticus/baumannii complex Genomes (https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=909768) 中,目的基因 (https://www.ncbi.nlm.nih.gov/nuccore/NC_010410.1?from=974325&to=975530) 是否是conserved的 (先不管SNP, InDel或其他突变现象)。就如这两篇paper的结果显示"AdeIJK is highly conserved across the Acinetobacter genus" (Paper1: Evolution of RND efflux pumps in the development of a successful pathogen; Paper2: RND pumps across the genus Acinetobacter: AdeIJK is the universal efflux pump). Analysis results: In our analysis of 38,638 records, we discovered that 65 of them do not contain the specific genes based on the submitted genomes from GenBank (refer to the details below). Attached, you will find a document that provides detailed information on the positions of the genes present in the remaining genomes. However, it's important to note that, based on my experience, GenBank does contain a number of erroneously assembled genomes. Therefore, it's conceivable that the absence of the gene in some isolates could be attributed not to its actual absence but rather to errors in genome assembly.

    ~/Tools/datasets download genome taxon 909768
    #ncbi_dataset_taxon909768
    
    ~/Tools/datasets download genome accession --inputfile sampled_accession_numbers.txt --include gff3,gbff,genome
    
    #./blastn_1st_batch.sh generating ../blastn_1st_res
    for sample in GCF_000757665.1_ASM75766v1_genomic.fna GCF_000746645.1_ASM74664v1_genomic.fna; do
      echo "makeblastdb -in ${sample} -dbtype nucl"
      echo "blastn -db ${sample} -query ../../ABAYE_RS05070.fasta -evalue 0.1 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 >> ../blastn_merged_res_1_1000"
    done
    
    for sample in GCF_000757665.1_ASM75766v1_genomic.fna GCF_000746645.1_ASM74664v1_genomic.fna; do
      echo "makeblastdb -in ${sample} -dbtype nucl"
      echo "blastn -db ${sample} -query ../../ABAYE_RS05070.fasta -evalue 0.1 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 >> blastn_merged_res_4001_8000"
    done
    
    ...
    
    TODO NEXT WEEK: senden a EXCEL storing the merged blastn results! look if all 38638 records contain gene, if not, mark yellow (explain it is also possible an error of assembly of the genome regarding the specific isolate!)
    #GenBank GCA_000810025.3
    
    38638
    
    makeblastdb -in GCF_000757665.1_ASM75766v1_genomic.fna -dbtype nucl
    blastn -db GCF_000757665.1_ASM75766v1_genomic.fna -query ../../ABAYE_RS05070.fasta -evalue 0.1 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 >> blastn_merged_res
    makeblastdb -in GCF_000746645.1_ASM74664v1_genomic.fna -dbtype nucl
    blastn -db GCF_000746645.1_ASM74664v1_genomic.fna -query ../../ABAYE_RS05070.fasta -evalue 0.1 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 >> blastn_merged_res
    
    #find ./ncbi_dataset/sampled_100/ -name "*.fna" -print0 | xargs -0 mv -t /destination/directory/
    find . -mindepth 2 -maxdepth 2 -type f -name "*.fna" -print0 | xargs -0 mv -t .
    find . -mindepth 1 -maxdepth 1 -type f -name "*.fna" >../fna_list2
    
    # The genome was downloaded at Mär 25, 2024
    I have over 38638 genome with the taxon-id 909768, want to assembly 100 from them,
    
    the data strcture as the following:
    
    ./data/GCF_000757665.1/GCF_000757665.1_ASM75766v1_genomic.fna
    ./data/GCF_000746645.1/GCF_000746645.1_ASM74664v1_genomic.fna
    ./data/GCF_000746605.1/GCF_000746605.1_ASM74660v1_genomic.fna
    ./data/GCF_000738845.1/GCF_000738845.1_ASM73884v1_genomic.fna
    ./data/GCF_000734775.1/GCF_000734775.1_ASM73477v1_genomic.fna
    ./data/GCF_000731965.1/GCF_000731965.1_ASM73196v1_genomic.fna
    ./data/GCF_000708795.1/GCF_000708795.1_ABU310_genomic.fna
    ./data/GCF_000708775.1/GCF_000708775.1_ASM70877v1_genomic.fna
    ./data/GCF_000695855.3/GCF_000695855.3_ASM69585v3_genomic.fna
    ./data/GCF_000248195.1/GCF_000248195.1_ASM24819v2_genomic.fna
    ....
    
    I have a total of 38638 genomes in fasta format in local computer. I want to quick check if they are contain a gene ABAYE_RS05070 (/product="multidrug effflux MFS transporter") https://www.ncbi.nlm.nih.gov/nuccore/NC_010410.1?from=974325&to=975530)
    
    # What are the default values of -perc_identity and  -qcov_hsp_perc in blastn?
    #-perc_identity 90 -qcov_hsp_perc 90
    By default:
    -perc_identity is set to 0, meaning that all alignments found will be reported regardless of percent identity.
    -qcov_hsp_perc is set to 0 as well, indicating that all alignments will be reported regardless of query coverage per HSP.
    
    makeblastdb -in ncbi_dataset/data/GCF_904885835.1/GCF_904885835.1_Aci00717-181015_contigs_filtered.fa.gz_genomic.fna -dbtype nucl
    blastn -db ncbi_dataset/data/GCF_904885835.1/GCF_904885835.1_Aci00717-181015_contigs_filtered.fa.gz_genomic.fna -query ABAYE_RS05070.fasta -evalue 0.1 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1
    
    makeblastdb -in 100.fasta -dbtype nucl
    # -max_target_seqs 1
    blastn -db 100.fasta -query ../../ABAYE_RS05070.fasta -evalue 0.1 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1
    
    with the following commands:
    
    for sample in GCF_000809205.3_ASM80920v3_genomic.fna GCF_000809215.3_ASM80921v3_genomic.fna GCF_000809225.3_ASM80922v3_genomic.fna _genomic.fna
    GCF_001907125.1_ASM190712v1_genomic.fna GCF_001907145.1_ASM190714v1_genomic.fna GCF_001907155.1_ASM190715v1_genomic.fna GCF_001908295.1_ASM190829v1_genomic.fna GCF_001909135.1_ASM190913v1_genomic.fna GCF_001910585.1_ASM191058v1_genomic.fna GCF_001910595.1_ASM191059v1_genomic.fna GCF_001910605.1_ASM191060v1_genomic.fna GCF_001910615.1_ASM191061v1_genomic.fna GCF_001910665.1_ASM191066v1_genomic.fna GCF_001910675.1_ASM191067v1_genomic.fna GCF_008632635.1_ASM863263v1_genomic.fna; do
      makeblastdb -in ${sample} -dbtype nucl
      blastn -db ${sample} -query ../ABAYE_RS05070.fasta -evalue 0.1 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 >> ../blastn_1st_res
    done
    
    get the following list, I want to add the two parts in the filenames also in the results as the second and third columns.
    
    For example for  GCF_001907155.1_ASM190715v1_genomic.fna,
    
    the current results is
    "gi|169794206|ref|NC_010410.1|:974325-975530     NZ_JWTW03000099.1       97.595  1206    29      0       1       1206    211093  209888  0.0     2067"
    It should be
    "gi|169794206|ref|NC_010410.1|:974325-975530     GCF_001907155.1  ASM190715v1  NZ_JWTW03000099.1       97.595  1206    29      0       1       1206    211093  209888  0.0     2067"
    
    gi|169794206|ref|NC_010410.1|:974325-975530     NZ_JWYU03000064.1       97.595  1206    29      0       1       1206    73028   74233   0.0     2067
    gi|169794206|ref|NC_010410.1|:974325-975530     NZ_CP010397.1   97.181  1206    34      0       1       1206    983267  984472  0.0     2039
    gi|169794206|ref|NC_010410.1|:974325-975530     NZ_HG977522.1   97.595  1206    29      0       1       1206    935195  936400  0.0     2067
    gi|169794206|ref|NC_010410.1|:974325-975530     NZ_HG977526.1   97.595  1206    29      0       1       1206    934286  935491  0.0     2067
    gi|169794206|ref|NC_010410.1|:974325-975530     NZ_AP014649.1   97.595  1206    29      0       1       1206    960072  961277  0.0     2067
    gi|169794206|ref|NC_010410.1|:974325-975530     NZ_CP010781.1   100.000 1206    0       0       1       1206    2943042 2941837 0.0     2228
    gi|169794206|ref|NC_010410.1|:974325-975530     NZ_BBSU01000014.1       98.093  1206    23      0       1       1206    30546   29341   0.0     2100
    gi|169794206|ref|NC_010410.1|:974325-975530     NZ_BBSP01000001.1       97.430  1206    31      0       1       1206    511041  509836  0.0     2056
    
    #!/bin/bash
    
    > ../blastn_res
    for sample in GCF_000809205.3_ASM80920v3_genomic.fna GCF_000809215.3_ASM80921v3_genomic.fna GCF_000809225.3_ASM80922v3_genomic.fna GCF_001907125.1_ASM190712v1_genomic.fna GCF_001907145.1_ASM190714v1_genomic.fna GCF_001907155.1_ASM190715v1_genomic.fna GCF_001908295.1_ASM190829v1_genomic.fna GCF_001909135.1_ASM190913v1_genomic.fna GCF_001910585.1_ASM191058v1_genomic.fna GCF_001910595.1_ASM191059v1_genomic.fna GCF_001910605.1_ASM191060v1_genomic.fna GCF_001910615.1_ASM191061v1_genomic.fna GCF_001910665.1_ASM191066v1_genomic.fna GCF_001910675.1_ASM191067v1_genomic.fna GCF_008632635.1_ASM863263v1_genomic.fna; do
      # Extract the second and third parts from the filename
      filename_parts=$(echo "${sample}" | cut -d'_' -f1-3)
    
      # Create the blast database
      makeblastdb -in "${sample}" -dbtype nucl
    
      # Run blastn and append the results with filename parts
      blastn -db "${sample}" -query ../ABAYE_RS05070.fasta -evalue 0.1 -num_threads 15 -outfmt '6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore' -strand both -max_target_seqs 1 | awk -v filename_parts="${filename_parts}" '{print $0 "\t" filename_parts}' >> ../blastn_res
    done
    
    #a specific assembly version of a genome in the NCBI (National Center for Biotechnology Information) database
    
    #GCF_000248195.1_ASM24819v2    Acinetobacter baumannii strain NCTC 7422 contig_0002, whole genome shotgun sequence
    
    DELETE, since they are all the same:   1.  qseqid      [query or source (gene) sequence id]
      2.  sseqid      GenBank ID   [subject or target (reference genome) sequence id]
      3.  pident      Percentage of identical positions
      4.  length      Alignment length (sequence overlap)
      5.  mismatch    Number of mismatches
      6.  gapopen     Number of gap openings
      7.  qstart      Start of alignment in query
      8.  qend        End of alignment in query
      9.  sstart      Start of alignment in subject
    10.  send        End of alignment in subject
    11.  evalue      Expect value
    12.  bitscore    Bit score
    13.              Assembly accession number
    
    GenBank ID  Percentage of identical positions   Alignment length    Number of mismatches    Number of gap openings  Start of alignment in query End of alignment in query   Start of alignment in subject   End of alignment in subject Expect value    Bit score   Assembly accession number
    query_seq_id sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    gi|169794206|ref|NC_010410.1|:974325-975530     NZ_AIED01000002.1       82.047  1192    210     4       1       1190    161177  159988  0.0     1013    GCF_000248195.1_ASM24819v2
    
    #batch 1: until ...802.2_genomic.fna GCA_032600445.2_PDT001925801.2_genomic.fna GCA_032600465.2_PDT001925800.2_genomic.fna GCA_032600485.2_PDT001925799.2_genomic.fna GCA_032600525.2_PDT001925797.2_genomic.fna GCA_032600545.2_PDT001925796.2_genomic.fna GCA_032600565.2_PDT001925795.2_genomic.fna GCA_032600585.2_PDT001925793.2_genomic.fna
    #batch 2: from GCA_032600605.2_PDT001925792.2_genomic.fna GCA_032600625.2_PDT001925794.2_genomic.fna GCA_032600645.2_PDT001925791.2_genomic.fna ...
    
    #to compare with:
    gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCE020000004.1   97.595  1206    29  0   1   1206    96245   95040   0.0 2067    GCA_032600385.2_PDT001925803.2
    gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCB020000004.1   97.595  1206    29  0   1   1206    96245   95040   0.0 2067    GCA_032600395.2_PDT001925804.2
    gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCD020000003.1   97.595  1206    29  0   1   1206    96245   95040   0.0 2067    GCA_032600425.2_PDT001925802.2
    gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCF020000001.1   97.595  1206    29  0   1   1206    210993  209788  0.0 2067    GCA_032600445.2_PDT001925801.2 <--
    gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCG020000001.1   97.595  1206    29  0   1   1206    210993  209788  0.0 2067    GCA_032600465.2_PDT001925800.2
    gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCH020000002.1   97.595  1206    29  0   1   1206    203882  202677  0.0 2067    GCA_032600485.2_PDT001925799.2
    gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCJ020000003.1   97.595  1206    29  0   1   1206    113990  112785  0.0 2067    GCA_032600525.2_PDT001925797.2
    gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCK020000002.1   97.595  1206    29  0   1   1206    210992  209787  0.0 2067    GCA_032600545.2_PDT001925796.2
    gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCM020000002.1   97.595  1206    29  0   1   1206    210993  209788  0.0 2067    GCA_032600565.2_PDT001925795.2
    gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCN020000002.1   97.595  1206    29  0   1   1206    210993  209788  0.0 2067    GCA_032600585.2_PDT001925793.2 <--
    
    Assembly accession number                       GenBank accession number
    Can I say NZ_AIED01000002.1 Genbank_ID, and GCF_000248195.1_ASM24819v2 Assembly_ID?
    
    Yes, you can interpret these identifiers within the given line as follows:
    
    * NZ_AIED01000002.1 can be referred to as a GenBank accession number. This format typically represents a sequence identifier within the GenBank database, part of the NCBI's collection of databases related to nucleotide sequences and their annotations.
    
    * GCF_000248195.1_ASM24819v2 refers to an assembly accession number from the GenBank assembly database. The "GCF" prefix indicates a RefSeq genome assembly accession, which is part of NCBI's Reference Sequence Database. The number following "GCF" is a unique identifier for the specific assembly of the genome, and the information after the underscore provides a version identifier and possibly an internal project code or name.
    
    In a broader context, the line you've provided looks like it's from a BLAST output or a similar comparative genomics tool output. Such lines typically contain information about sequence alignments, including the query and subject sequences (with their respective accession numbers), alignment scores, and statistical significance metrics. The presence of both a sequence accession number (GenBank ID) and an assembly accession number helps link individual sequences to the larger genomic context from which they were derived.
    
    30662 + 7989 = 38651 > 38638
    
    gi|169794206|ref|NC_010410.1|:974325-975530     ABNRCM020000002.1       97.595  1206    29      0       1       1206    210993  209788  0.0     2067    GCA_032600565.2_PDT001925795.2
    gi|169794206|ref|NC_010410.1|:974325-975530     ABNRCN020000002.1       97.595  1206    29      0       1       1206    210993  209788  0.0     2067    GCA_032600585.2_PDT001925793.2
    
    gi|169794206|ref|NC_010410.1|:974325-975530     ABNRCO020000003.1       97.595  1206    29      0       1       1206    54521   53316   0.0     2067    GCA_032600605.2_PDT001925792.2
    
    # In the file containing the following lines, I want to extract the first tokens separated by '_'.
    
    ./GCF_009013175.1_ASM901317v1_genomic.fna
    ./GCF_013344775.1_ASM1334477v1_genomic.fna
    ./GCF_013344845.1_ASM1334484v1_v1_genomic.fna
    ./GCF_013346345.1_ASM1334634v1_genomic.fna
    ./GCF_013346385.1_ASM1334638v1_genomic.fna
    
    38651-38638=13
    
    diff fna_list_tokens_1_3 blastn_res_f13 > diff2
    diff fna_list_tokens_1_3 blastn_res_f13_uniq > diff3
    
    # Open the file named 'fna_list' and read its content
    with open('fna_list', 'r') as file:
        lines = file.readlines()
    
    # Extract the first token after the first '_' in each line
    first_tokens = [line.split('_')[1].strip() for line in lines]
    
    # Example output
    print(first_tokens)
    

like unlike

点赞本文的读者

还没有人对此文章表态


本文有评论

没有评论

看文章,发评论,不要沉默


© 2023 XGenes.com Impressum