Structural Variant Calling for Nanopore Sequencing (edited)

gene_x 0 like s 49 view s

Tags: variation, pipeline

  1. install mambaforge https://conda-forge.org/miniforge/ (recommended)
    #download Mambaforge-24.9.2-0-Linux-x86_64.sh from website
    chmod +x Mambaforge-24.9.2-0-Linux-x86_64.sh
    ./Mambaforge-24.9.2-0-Linux-x86_64.sh
    
    To activate this environment, use:
        micromamba activate /home/jhuang/mambaforge
    Or to execute a single command in this environment, use:
        micromamba run -p /home/jhuang/mambaforge mycommand
    installation finished.
    
    Do you wish to update your shell profile to automatically initialize conda?
    This will activate conda on startup and change the command prompt when activated.
    If you'd prefer that conda's base environment not be activated on startup,
      run the following command when conda is activated:
    
    conda config --set auto_activate_base false
    
    You can undo this by running `conda init --reverse $SHELL`? [yes|no]
    [no] >>> yes
    no change     /home/jhuang/mambaforge/condabin/conda
    no change     /home/jhuang/mambaforge/bin/conda
    no change     /home/jhuang/mambaforge/bin/conda-env
    no change     /home/jhuang/mambaforge/bin/activate
    no change     /home/jhuang/mambaforge/bin/deactivate
    no change     /home/jhuang/mambaforge/etc/profile.d/conda.sh
    no change     /home/jhuang/mambaforge/etc/fish/conf.d/conda.fish
    no change     /home/jhuang/mambaforge/shell/condabin/Conda.psm1
    no change     /home/jhuang/mambaforge/shell/condabin/conda-hook.ps1
    no change     /home/jhuang/mambaforge/lib/python3.12/site-packages/xontrib/conda.xsh
    no change     /home/jhuang/mambaforge/etc/profile.d/conda.csh
    modified      /home/jhuang/.bashrc
    ==> For changes to take effect, close and re-open your current shell. <==
    no change     /home/jhuang/mambaforge/condabin/conda
    no change     /home/jhuang/mambaforge/bin/conda
    no change     /home/jhuang/mambaforge/bin/conda-env
    no change     /home/jhuang/mambaforge/bin/activate
    no change     /home/jhuang/mambaforge/bin/deactivate
    no change     /home/jhuang/mambaforge/etc/profile.d/conda.sh
    no change     /home/jhuang/mambaforge/etc/fish/conf.d/conda.fish
    no change     /home/jhuang/mambaforge/shell/condabin/Conda.psm1
    no change     /home/jhuang/mambaforge/shell/condabin/conda-hook.ps1
    no change     /home/jhuang/mambaforge/lib/python3.12/site-packages/xontrib/conda.xsh
    no change     /home/jhuang/mambaforge/etc/profile.d/conda.csh
    no change     /home/jhuang/.bashrc
    No action taken.
    WARNING conda.common.path.windows:_path_to(100): cygpath is not available, fallback to manual path conversion
    WARNING conda.common.path.windows:_path_to(100): cygpath is not available, fallback to manual path conversion
    Added mamba to /home/jhuang/.bashrc
    ==> For changes to take effect, close and re-open your current shell. <==
    Thank you for installing Mambaforge!
    
    Close your terminal window and open a new one, or run:
    #source ~/mambaforge/bin/activate
    conda --version
    mamba --version
    
    https://github.com/conda-forge/miniforge/releases
    Note
    
        * After installation, please make sure that you do not have the Anaconda default channels configured.
            conda config --show channels
            conda config --remove channels defaults
            conda config --add channels conda-forge
            conda config --show channels
            conda config --set channel_priority strict
            #conda clean --all
            conda config --remove channels biobakery
    
        * !!!!Do not install anything into the base environment as this might break your installation. See here for details.!!!!
    
    # --Deprecated method: mamba installing on conda--
    #conda install -n base --override-channels -c conda-forge mamba 'python_abi=*=*cp*'
    #    * Note that installing mamba into any other environment than base is not supported.
    #
    #conda activate base
    #conda install conda
    #conda uninstall mamba
    #conda install mamba
    

2: install required Tools on the mamba env

    * Sniffles2: Detect structural variants, including transposons, from long-read alignments.
    * RepeatModeler2: Identify and classify transposons de novo.
    * RepeatMasker: Annotate known transposable elements using transposon libraries.
    * SVIM: An alternative structural variant caller optimized for long-read sequencing, if needed.
    * SURVIVOR: Consolidate structural variants across samples for comparative analysis.

    mamba deactivate
    # Create a new conda environment
    mamba create -n transposon_long python=3.6 -y

    # Activate the environment
    mamba activate transposon_long

    mamba install -c bioconda sniffles
    mamba install -c bioconda repeatmodeler repeatmasker

    # configure repeatmasker database
    mamba info --envs
    cd /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker

    #mamba install python=3.6
    mamba install -c bioconda svim
    mamba install -c bioconda survivor
  1. Test the installed tools

    # Check versions
    sniffles --version
    RepeatModeler -h
    RepeatMasker -h
    svim --help
    SURVIVOR --help
    mamba install -c conda-forge perl r
    
  2. Data Preparation

    Raw Signal Data: Nanopore devices generate electrical signal data as DNA passes through the nanopore.
    Basecalling: Tools like Guppy or Dorado are used to convert raw signals into nucleotide sequences (FASTQ files).
    
  3. Preprocessing

    Quality Filtering: Remove low-quality reads using tools like Filtlong or NanoFilt.
    Adapter Trimming: Identify and remove sequencing adapters with tools like Porechop.
    
  4. (Optional) Variant Calling for SNP and Indel Detection:

    Tools like Medaka, Longshot, or Nanopolish analyze the aligned reads to identify SNPs and small indels.
    
  5. Alignment and Structural Variant Calling: Tools such as Sniffles or SVIM detect large insertions, deletions, and other structural variants. 使用长读长测序工具如 SVIM 或 Sniffles 检测结构变异(e.g. 散在性重复序列)。

      #NOTE that the ./batch1_depth25/trycycler_WT/reads.fastq and F24A430001437_BACctmoD/BGI_result/Separate/${sample}/1.Cleandata/${sample}.filtered_reads.fq.gz are the same!
      ./4/1.Cleandata/4.filtered_reads.fq.gz
      ./3/1.Cleandata/3.filtered_reads.fq.gz
      ./2/1.Cleandata/2.filtered_reads.fq.gz
      ./8/1.Cleandata/8.filtered_reads.fq.gz
      ./5/1.Cleandata/5.filtered_reads.fq.gz
      ./WT/1.Cleandata/WT.filtered_reads.fq.gz
      ./9/1.Cleandata/9.filtered_reads.fq.gz
      ./10/1.Cleandata/10.filtered_reads.fq.gz
      ./7/1.Cleandata/7.filtered_reads.fq.gz
      ./1/1.Cleandata/1.filtered_reads.fq.gz
    
      # -- Alignment and Detect structural variants in each sample using SVIM (failed due to the strange output from SVIM!)
      #mamba install -c bioconda ngmlr
      mamba install -c bioconda svim
      for sample in WT 1 2 3 4 5 7 8 9 10; do
          svim reads --aligner ngmlr --nanopore svim_reads_ngmlr_${sample} F24A430001437_BACctmoD/BGI_result/Separate/${sample}/1.Cleandata/${sample}.filtered_reads.fq.gz CP020463.fasta  --cores 10;
      done
      for sample in WT 1 2 3 4 5 7 8 9 10; do
      for sample in 1; do
          #INS,INV,DUP:TANDEM,DUP:INT,BND
          svim reads svim_reads_minimap2_${sample} F24A430001437_BACctmoD/BGI_result/Separate/${sample}/1.Cleandata/${sample}.filtered_reads.fq.gz CP020463.fasta --aligner minimap2 --nanopore --cores 20 --types INS --min_sv_size 100 --sequence_allele --insertion_sequences --read_names;
      done
      #svim alignment svim_alignment_minmap2_1_re 1.sorted.bam CP020463_.fasta --types INS --sequence_alleles --insertion_sequences --read_names
    
      # -- Results1: Detect structural variants using Minamap2+Sniffles2:
    
      Minimap2: A commonly used aligner for nanopore sequencing data.
          Align Long Reads to the WT Reference using Minimap2
    
      for sample in WT 1 2 3 4 5 7 8 9 10; do
          minimap2 --MD -t 60 -ax map-ont CP020463.fasta ./batch1_depth25/trycycler_${sample}/reads.fastq | samtools sort -o ${sample}.sorted.bam
          samtools index ${sample}.sorted.bam
      done
    
      #sniffles -m WT.sorted.bam -v WT.vcf -s 10 -l 50 -t 60
      #  -s 20: Requires at least 20 reads to support an SV for reporting. --> 10
      #  -l 50: Reports SVs that are at least 50 base pairs long.
      #  -t 4: Uses 4 threads for faster processing. --> 60
    
      for sample in WT 1 2 3 4 5 7 8 9 10; do
          sniffles -m ${sample}.sorted.bam -v ${sample}.vcf -s 10 -l 50 -t 60
          #QUAL < 20 ||
          bcftools filter -e "INFO/SVTYPE != 'INS'" ${sample}.vcf > ${sample}_filtered.vcf
      done
    
      # -- Results2: Detect structural variants using NGMLR+Sniffles2
    
      for sample in WT 1 2 3 4 5 7 8 9 10; do
          #ERROR: No MD string detected! Check bam file! Otherwise generate using e.g. samtools. --> No results!
          #sniffles -m svim_reads_minimap2_${sample}/${sample}.filtered_reads.fq.minimap2.coordsorted.bam -v #sniffles_minimap2_${sample}.vcf -s 10 -l 50 -t 60
          bcftools filter -e "INFO/SVTYPE != 'INS'" sniffles_minimap2_${sample}.vcf > sniffles_minimap2_${sample}_filtered.vcf
          #Using
          sniffles -m svim_reads_ngmlr_${sample}/${sample}.filtered_reads.fq.ngmlr.coordsorted.bam -v sniffles_ngmlr_${sample}.vcf -s 10 -l 50 -t 60
          bcftools filter -e "INFO/SVTYPE != 'INS'" sniffles_ngmlr_${sample}.vcf > sniffles_ngmlr_${sample}_filtered.vcf
      done
    
      # -- Compare the results1 and results2, and check them each position in IGV!
    
      #minimap2+sniffles2
      (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 WT_filtered.vcf | grep -v "##"
      POS
      1855752
      (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 1_filtered.vcf | grep -v "##"
      POS
      529416
      1855752
      2422820
      (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 2_filtered.vcf | grep -v "##"
      POS
      529416
      1855752
      2422820
      2424590
      (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 3_filtered.vcf | grep -v "##"
      POS
      529416
      529416
      529418
      1855752
      2422820
      (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 4_filtered.vcf | grep -v "##"
      POS
      55682
      529416
      1855752
      2422820
      2424590
      (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 5_filtered.vcf | grep -v "##"
      POS
      529416
      1855752
      2422820
      (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 7_filtered.vcf | grep -v "##"
      POS
      518217
      1855752
      2424590
      (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 8_filtered.vcf | grep -v "##"
      POS
      529416
      1855752
      2422820
      (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 9_filtered.vcf | grep -v "##"
      POS
      529416
      1855752
      2422820
      2424590
      (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 10_filtered.vcf | grep -v "##"
      POS
      529416
      1855752
      2422818
      2424590
    
      #ngmlr+sniffles2
      (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 sniffles_ngmlr_WT_filtered.vcf | grep -v "##"
      POS
      (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 sniffles_ngmlr_1_filtered.vcf | grep -v "##"
      POS
      529419
      2422819
      (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 sniffles_ngmlr_2_filtered.vcf | grep -v "##"
      POS
      529418
      2422820
      (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 sniffles_ngmlr_3_filtered.vcf | grep -v "##"
      POS
      529418
      2422820
      (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 sniffles_ngmlr_4_filtered.vcf | grep -v "##"
      POS
      529419
      2422820
      (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 sniffles_ngmlr_5_filtered.vcf | grep -v "##"
      POS
      529419
      2422820
      (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 sniffles_ngmlr_7_filtered.vcf | grep -v "##"
      POS
      518219
      2422820
      (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 sniffles_ngmlr_8_filtered.vcf | grep -v "##"
      POS
      529419
      2422820
      (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 sniffles_ngmlr_9_filtered.vcf | grep -v "##"
      POS
      529419
      2422820
      (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 sniffles_ngmlr_10_filtered.vcf | grep -v "##"
      POS
      529418
      2422820
    
      #~/Tools/csv2xls-0.4/csv_to_xls.py sniffles_ngmlr_WT_filtered.vcf sniffles_ngmlr_1_filtered.vcf sniffles_ngmlr_2_filtered.vcf sniffles_ngmlr_3_filtered.vcf sniffles_ngmlr_4_filtered.vcf sniffles_ngmlr_5_filtered.vcf sniffles_ngmlr_7_filtered.vcf sniffles_ngmlr_8_filtered.vcf sniffles_ngmlr_9_filtered.vcf sniffles_ngmlr_10_filtered.vcf -d$'\t' -o putative_transposons2.xls
    
      # -- Filtering low-complexity insertions using RepeatMasker (TODO: how to use RepeatModeler to generate own lib?)
    
      python vcf_to_fasta.py variants.vcf variants.fasta
      #python filter_low_complexity.py variants.fasta filtered_variants.fasta retained_variants.fasta
      #Using RepeatMasker to filter the low-complexity fasta, the used h5 lib is
      /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker/Libraries/Dfam.h5    #1.9G
      python /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker/famdb.py -i /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker/Libraries/Dfam.h5 names 'bacteria' | head
      Exact Matches
      =============
      2 bacteria (blast name), Bacteria <bacteria> (scientific name), eubacteria (genbank common name), Monera <bacteria> (in-part), Procaryotae <bacteria> (in-part), Prokaryota <bacteria> (in-part), Prokaryotae <bacteria> (in-part), prokaryote <bacteria> (in-part), prokaryotes <bacteria> (in-part)
      Non-exact Matches
      =================
      1783272 Terrabacteria group (scientific name)
      91061 Bacilli (scientific name), Bacilli Ludwig et al. 2010 (authority), Bacillus/Lactobacillus/Streptococcus group (synonym), Firmibacteria (synonym), Firmibacteria Murray 1988 (authority)
      1239 Bacillaeota (synonym), Bacillaeota Oren et al. 2015 (authority), Bacillota (synonym), Bacillus/Clostridium group (synonym), clostridial firmicutes (synonym), Clostridium group firmicutes (synonym), Firmacutes (synonym), firmicutes (blast name), Firmicutes (scientific name), Firmicutes corrig. Gibbons and Murray 1978 (authority), Low G+C firmicutes (synonym), low G+C Gram-positive bacteria (common name), low GC Gram+ (common name)
    
      Summary of Classes within Firmicutes:
        * Bacilli (includes many common pathogenic and non-pathogenic Gram-positive bacteria, taxid=91061)
            * Bacillus (e.g., Bacillus subtilis, Bacillus anthracis)
            * Staphylococcus (e.g., Staphylococcus aureus, Staphylococcus epidermidis)
            * Streptococcus (e.g., Streptococcus pneumoniae, Streptococcus pyogenes)
            * Listeria (e.g., Listeria monocytogenes)
        * Clostridia (includes many anaerobic species like Clostridium and Clostridioides)
        * Erysipelotrichia (intestinal bacteria, some pathogenic)
        * Tissierellia (less-studied, veterinary relevance)
        * Mollicutes (cell wall-less, includes Mycoplasma species)
        * Negativicutes (includes some Gram-negative, anaerobic species)
    
      RepeatMasker -species Bacilli -pa 4 -xsmall variants.fasta
      python extract_unmasked_seq.py variants.fasta.masked unmasked_variants.fasta
    
      #bcftools filter -i 'QUAL>30 && INFO/SVLEN>100' variants.vcf -o filtered.vcf
      #
      #bcftools view -i 'SVTYPE="INS"' variants.vcf | bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO\n' > insertions.txt
      #mamba install -c bioconda vcf2fasta
      #vcf2fasta variants.vcf -o insertions.fasta
      #grep "SEQS" variants.vcf | awk '{ print $1, $2, $4, $5, $8 }' > insertions.txt
      #python3 filtering_low_complexity.py
      #
      #vcftools --vcf input.vcf --recode --out filtered_output --minSVLEN 100
      #bcftools filter -e 'INFO/SEQS ~ "^(G+|C+|T+|A+){4,}"' variants.vcf -o filtered.vcf
    
      # -- calculate the percentage of reads
    
      To calculate the percentage of reads that contain the insertion from the VCF entry, use the INFO and FORMAT fields provided in the VCF record.
      Step 1: Extract Relevant Information
    
      In the provided VCF entry:
    
          RE (Reads Evidence): 733 – the total number of reads supporting the insertion.
          GT (Genotype): 1/1 – this indicates a homozygous insertion, meaning all reads covering this region are expected to have the insertion.
          AF (Allele Frequency): 1 – a 100% allele frequency, indicating that every read in this sample supports the insertion.
          DR (Depth Reference): 0 – the number of reads supporting the reference allele.
          DV (Depth Variant): 733 – the number of reads supporting the variant allele (insertion).
    
      Step 2: Calculate Percentage of Reads Supporting the Insertion
    
      Using the formula:
      Percentage of reads with insertion=(DVDR+DV)×100
      Percentage of reads with insertion=(DR+DVDV​)×100
    
      Substitute the values:
      Percentage=(7330+733)×100=100%
      Percentage=(0+733733​)×100=100%
      Conclusion
    
      Based on the VCF record, 100% of the reads support the insertion, indicating that the insertion is fully present in the sample (homozygous insertion). This is consistent with the AF=1 and GT=1/1 fields.
    
  6. (failed) using own scripts direct analyze the bam-file via cigarString (failed due to too many short insertions!)

    transposons.fasta is a file containing the transposon sequences in FASTA format.
    python your_script.py input.bam reference.fasta transposons.fasta
    #Transposon_Sequence    Insertion_Frequency
    #Tn5                    10
    #Tn10                   5
    #Unknown                3
    
    python putative_transposons_with_counts.py mapping_WT.sorted.bam CP020463.fasta
    
    rule trim_short_reads:
        input:
            "/data/short-reads.fq.gz"
        output:
            "/data/trimmed-short-reads.fasta"
        shell:
            "python3 trim_by_tag_length.py /data/short-reads.fq.gz 10 > /data/trimmed-short-reads.fasta"
    
    rule trim_long_reads:
        input:
            "/data/long-reads.fq.gz"
        output:
            "/data/trimmed-long-reads.fasta"
        shell:
            "python3 trim_by_tag_length.py /data/long-reads.fq.gz 92 > /data/trimmed-long-reads.fasta"
    
    rule install_bwa:
        output:
            "bwa-mem2-2.0pre2_x64-linux/bwa-mem2"
        shell:
            "curl -L https://github.com/bwa-mem2/bwa-mem2/releases/download/v2.0pre2/bwa-mem2-2.0pre2_x64-linux.tar.bz2 | tar jxf -"
    
    rule map_short_reads:
        input:
            "bwa-mem2-2.0pre2_x64-linux/bwa-mem2",
            "/data/reference.fasta",
            "/data/trimmed-short-reads.fasta"
        output:
            "/data/mapping.sam"
        shell:
            """
            bwa-mem2-2.0pre2_x64-linux/bwa-mem2 index /data/reference.fasta
            bwa-mem2-2.0pre2_x64-linux/bwa-mem2 mem /data/reference.fasta /data/trimmed-short-reads.fasta > /data/mapping.sam
            """
    
    rule map_long_reads:
        input:
            "/data/reference.fasta",
            "/data/trimmed-long-reads.fasta"
        output:
            "/data/mapping.bam"
        conda:
            "minimap2.yml"
        shell:
            """
            minimap2 -x map-ont -d reference /data/reference.fasta > /dev/null 2>&1
            minimap2 -c -a -o /data/mapping.nonunique.sam -N 1 -x map-ont reference /data/trimmed-long-reads.fasta
            samtools view -bq 1 /data/mapping.nonunique.sam > /data/mapping.bam
            """
    
    rule convert_sam_to_bam:
        input:
            "/data/mapping.sam"
        output:
            "/data/mapping.bam",
        conda:
            "samtools.yml"
        shell:
            "samtools view -S -b /data/mapping.sam > /data/mapping.bam"
    
    rule get_unmapped_reads:
        input:
            "/data/mapping.bam"
        output:
            "/data/mapping.sorted.bam"
        conda:
            "samtools.yml"
        shell:
            """
    #        samtools view -f 4 /data/mapping.bam > /data/unmapped.sam
    #        samtools view -S -b /data/unmapped.sam > /data/unmapped.bam
    #        samtools bam2fq /data/unmapped.bam | seqtk seq -A - > /data/unmapped.fa
            samtools sort  /data/mapping.bam -o /data/mapping.sorted.bam
            samtools index /data/mapping.sorted.bam
            """
    
    rule create_insertion_plot:
        input:
            "/data/mapping.sorted.bam"
        output:
            "/data/summary-stats.tsv"
        shell:
            """
            python3 ~/Scripts/sam_to_insert_plot.py /data/mapping.sorted.bam /data/reference.fasta > /data/summary-stats.tsv
            """
    
  7. Polishing of assembly: Use tools like Medaka to refine variant calls by leveraging consensus sequences derived from nanopore data.

      mamba install -c bioconda medaka
      medaka-consensus -i aligned_reads.bam -r reference.fasta -o polished_output -t 4
    
  8. Compare Insertions Across Samples

    Merge Variants Across Samples: Use SURVIVOR to merge and compare the detected insertions in all samples against the WT:
    
    SURVIVOR merge input_vcfs.txt 1000 1 1 1 0 30 merged.vcf
    
        Input: List of VCF files from Sniffles2.
        Output: A consolidated VCF file with shared and unique variants.
    
    Filter WT Insertions:
    
        Identify transposons present only in samples 1–9 by subtracting WT variants using bcftools:
    
            bcftools isec WT.vcf merged.vcf -p comparison_results
    
  9. Validate and Visualize

    Visualize with IGV: Use IGV to inspect insertion sites in the alignment and confirm quality.
    
    igv.sh
    
    Validate Findings:
        Perform PCR or additional sequencing for key transposon insertion sites to confirm results.
    
  10. Alternatives to TEPID for Long-Read Data

    If you’re looking for transposon-specific tools for long reads:
    
        REPET: A robust transposon annotation tool compatible with assembled genomes.
        EDTA (Extensive de novo TE Annotator):
            A pipeline to identify, classify, and annotate transposons.
            Works directly on your assembled genomes.
    
            perl EDTA.pl --genome WT.fasta --type all
    
  11. The WT.vcf file in the pipeline is generated by detecting structural variants (SVs) in the wild-type (WT) genome aligned against itself or using it as a baseline reference. Here’s how you can generate the WT.vcf:

    Steps to Generate WT.vcf
    1. Align WT Reads to the WT Reference Genome
    
    The goal here is to create an alignment of the WT sequencing data to the WT reference genome to detect any self-contained structural variations, such as native insertions, deletions, or duplications.
    
    Command using Minimap2:
    
    minimap2 -ax map-ont WT.fasta WT_reads.fastq | samtools sort -o WT.sorted.bam
    
    Index the BAM file:
    
    samtools index WT.sorted.bam
    
    2. Detect Structural Variants with Sniffles2
    
    Run Sniffles2 on the WT alignment to call structural variants:
    
    sniffles --input WT.sorted.bam --vcf WT.vcf
    
    This step identifies:
    
        Native transposons and insertions present in the WT genome.
        Other structural variants that are part of the reference genome or sequencing artifacts.
    
    Key parameters to consider:
    
        --min_support: Adjust based on your WT sequencing coverage.
        --max_distance: Define proximity for merging variants.
        --min_length: Set a minimum SV size (e.g., >50 bp for transposons).
    
  12. Clean and Filter the WT.vcf, Variant Filtering: Remove low-confidence variants based on read depth, quality scores, or allele frequency.

    To ensure the WT.vcf only includes relevant transposons or SVs:
    
        Use bcftools or similar tools to filter out low-confidence variants:
    
        bcftools filter -e "QUAL < 20 || INFO/SVTYPE != 'INS'" WT.vcf > WT_filtered.vcf
        bcftools filter -e "QUAL < 1 || INFO/SVTYPE != 'INS'" 1_.vcf > 1_filtered_.vcf
    
  13. NOTE that in this pipeline, the WT.fasta (reference genome) is typically a high-quality genome sequence from a database or a well-annotated version of your species' genome. It is not assembled from the WT.fastq sequencing reads in this context. Here's why:

    Why Use a Reference Genome (WT.fasta) from a Database?
    
        Higher Quality and Completeness:
            Database references (e.g., NCBI, Ensembl) are typically well-assembled, highly polished, and annotated. They serve as a reliable baseline for variant detection.
    
        Consistency:
            Using a standard reference ensures consistent comparisons across your WT and samples (1–9). Variants detected will be relative to this reference, not influenced by possible assembly errors.
    
        Saves Time:
            Assembling a reference genome from WT reads requires significant computational effort. Using an existing reference streamlines the analysis.
    
    Alternative: Assembling WT from FASTQ
    
    If you don’t have a high-quality reference genome (WT.fasta) and must rely on your WT FASTQ reads:
    
        Assemble the genome from your WT.fastq:
            Use long-read assemblers like Flye, Canu, or Shasta to create a draft genome.
    
        flye --nano-raw WT.fastq --out-dir WT_assembly --genome-size <size>
    
        Polish the assembly using tools like Racon (with the same reads) or Medaka for higher accuracy.
        Use the assembled and polished genome as your WT.fasta reference for further steps.
    
    Key Takeaways:
    
        If you have access to a reliable, high-quality reference genome, use it as the WT.fasta.
        Only assemble WT.fasta from raw reads (WT.fastq) if no database reference is available for your organism.
    
  14. Annotate Transposable Elements: Tools like ANNOVAR or SnpEff provide functional insights into the detected variants.

    # -- (successful!) MANUALLY Search for all found insertion sequences at https://tncentral.ncc.unesp.br/blast/ !
    # Or use the program available at https://github.com/danillo-alvarenga/tncomp_finder if there are numerous matches.
    #https://tncentral.ncc.unesp.br/report/te/Tn551-Y13600.1
    
    # -- (failed!) try TEPID for annotation
    mamba install tepid=0.10 -c bioconda
    #(tepid_env)
    for sample in WT 1 2 3 4 5 7 8 9 10; do
        tepid-map-se -x CP020463 -p 10 -n ${sample}_tepid -q  ../batch1_depth25/trycycler_${sample}/reads.fastq;
        tepid-discover -k -i -p 10 -n ${sample}_tepid -c ${sample}_tepid --se;
    done
    
    tepid-discover -k -i -p 10 -n 1_tepid -c 1.sorted.bam --se;
    
    tepid-refine [-h] [--version] [-k] [-i INSERTIONS] [-d DELETIONS]
                [-p PROC] -t TE -n NAME -c CONC -s SPLIT -a AL
    
    # -- (failed!) try EDTA for annotation
    https://github.com/oushujun/EDTA
    (transposon_long) mamba install -c conda-forge -c bioconda edta
    mamba install bioconda::rmblast  # cd RepeatMasker; ./configure
    EDTA.pl --genome CP020463.fasta --species others --threads 40
    
    For general-purpose TE annotation: EDTA, RepeatMasker, or RepeatScout are your best options.
    For de novo repeat identification: RepeatScout is highly effective.
    For LTR retrotransposons: Use LTR_retriever.
    For bacterial-specific annotations: Transposome, TEfinder, and ISfinder can be useful.
    
  15. Validation: Cross-validate with short-read sequencing data if available.

like unlike

点赞本文的读者

还没有人对此文章表态


本文有评论

没有评论

看文章,发评论,不要沉默


© 2023 XGenes.com Impressum