Transposon analyses for the nanopore sequencing

gene_x 0 like s 125 view s

Tags: pipeline

  1. install mambaforge https://conda-forge.org/miniforge/ (recommended)
    #download Mambaforge-24.9.2-0-Linux-x86_64.sh from website
    chmod +x Mambaforge-24.9.2-0-Linux-x86_64.sh
    ./Mambaforge-24.9.2-0-Linux-x86_64.sh
    
    To activate this environment, use:
        micromamba activate /home/jhuang/mambaforge
    Or to execute a single command in this environment, use:
        micromamba run -p /home/jhuang/mambaforge mycommand
    installation finished.
    
    Do you wish to update your shell profile to automatically initialize conda?
    This will activate conda on startup and change the command prompt when activated.
    If you'd prefer that conda's base environment not be activated on startup,
      run the following command when conda is activated:
    
    conda config --set auto_activate_base false
    
    You can undo this by running `conda init --reverse $SHELL`? [yes|no]
    [no] >>> yes
    no change     /home/jhuang/mambaforge/condabin/conda
    no change     /home/jhuang/mambaforge/bin/conda
    no change     /home/jhuang/mambaforge/bin/conda-env
    no change     /home/jhuang/mambaforge/bin/activate
    no change     /home/jhuang/mambaforge/bin/deactivate
    no change     /home/jhuang/mambaforge/etc/profile.d/conda.sh
    no change     /home/jhuang/mambaforge/etc/fish/conf.d/conda.fish
    no change     /home/jhuang/mambaforge/shell/condabin/Conda.psm1
    no change     /home/jhuang/mambaforge/shell/condabin/conda-hook.ps1
    no change     /home/jhuang/mambaforge/lib/python3.12/site-packages/xontrib/conda.xsh
    no change     /home/jhuang/mambaforge/etc/profile.d/conda.csh
    modified      /home/jhuang/.bashrc
    ==> For changes to take effect, close and re-open your current shell. <==
    no change     /home/jhuang/mambaforge/condabin/conda
    no change     /home/jhuang/mambaforge/bin/conda
    no change     /home/jhuang/mambaforge/bin/conda-env
    no change     /home/jhuang/mambaforge/bin/activate
    no change     /home/jhuang/mambaforge/bin/deactivate
    no change     /home/jhuang/mambaforge/etc/profile.d/conda.sh
    no change     /home/jhuang/mambaforge/etc/fish/conf.d/conda.fish
    no change     /home/jhuang/mambaforge/shell/condabin/Conda.psm1
    no change     /home/jhuang/mambaforge/shell/condabin/conda-hook.ps1
    no change     /home/jhuang/mambaforge/lib/python3.12/site-packages/xontrib/conda.xsh
    no change     /home/jhuang/mambaforge/etc/profile.d/conda.csh
    no change     /home/jhuang/.bashrc
    No action taken.
    WARNING conda.common.path.windows:_path_to(100): cygpath is not available, fallback to manual path conversion
    WARNING conda.common.path.windows:_path_to(100): cygpath is not available, fallback to manual path conversion
    Added mamba to /home/jhuang/.bashrc
    ==> For changes to take effect, close and re-open your current shell. <==
    Thank you for installing Mambaforge!
    
    Close your terminal window and open a new one, or run:
    #source ~/mambaforge/bin/activate
    conda --version
    mamba --version
    
    https://github.com/conda-forge/miniforge/releases
    Note
    
        * After installation, please make sure that you do not have the Anaconda default channels configured.
            conda config --show channels
            conda config --remove channels defaults
            conda config --add channels conda-forge
            conda config --show channels
            conda config --set channel_priority strict
            #conda clean --all
            conda config --remove channels biobakery
    
        * !!!!Do not install anything into the base environment as this might break your installation. See here for details.!!!!
    
    # --Deprecated method: mamba installing on conda--
    #conda install -n base --override-channels -c conda-forge mamba 'python_abi=*=*cp*'
    #    * Note that installing mamba into any other environment than base is not supported.
    #
    #conda activate base
    #conda install conda
    #conda uninstall mamba
    #conda install mamba
    

2: install required Tools on the mamba env

    * Sniffles2: Detect structural variants, including transposons, from long-read alignments.
    * RepeatModeler2: Identify and classify transposons de novo.
    * RepeatMasker: Annotate known transposable elements using transposon libraries.
    * SVIM: An alternative structural variant caller optimized for long-read sequencing, if needed.
    * SURVIVOR: Consolidate structural variants across samples for comparative analysis.

    mamba deactivate
    # Create a new conda environment
    mamba create -n transposon_long python=3.6 -y

    # Activate the environment
    mamba activate transposon_long

    mamba install -c bioconda sniffles
    mamba install -c bioconda repeatmodeler repeatmasker

    # configure repeatmasker database
    mamba info --envs
    cd /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker

    #mamba install python=3.6
    mamba install -c bioconda svim
    mamba install -c bioconda survivor
  1. Configuring RepeatMasker after installation involves setting up a transposable element (TE) database, such as Dfam or RepBase. Here's a detailed guide:

    1. Locate the RepeatMasker Directory
    
    After installing RepeatMasker (via Conda or manually), the main program will reside in its installation directory.
    
        conda info --envs
        #Locate the path for your environment (transposon_analysis), and then:
        cd /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker
    
    2. Install and choose a TE Database
    
        Dfam: A freely available TE database. Preferred for most users due to open access.
    
        cd /mnt/nvme0n1p1/ref
        #Download the Preprocessed Library:
        wget https://www.dfam.org/releases/Dfam_3.8/families/Dfam-RepeatMasker.lib.gz
        #Move the File to RepeatMasker Libraries:
        mv Dfam-RepeatMasker.lib.gz /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker/Libraries/
        #Configure RepeatMasker to Use Dfam: Re-run the RepeatMasker configuration script and specify this library.
    
        # Move Dfam data to the RepeatMasker directory
        mv Dfam.h5 /path/to/RepeatMasker/Libraries/
        mv Dfam.embl /path/to/RepeatMasker/Libraries/
    
        # The Dfam library has been already installed.
    
        RepBase: A more comprehensive TE database but requires a license for academic or commercial use.
            Download the library files (e.g., .tar.gz) provided by RepBase.
            Extract them to the directory of your choice.
            https://www.girinst.org/server/RepBase/index.php
            tar -zxvf repbase_library.tar.gz -C /path/to/repbase/
    
    4. Configure RepeatMasker
    
        #Run the configuration script in the RepeatMasker installation directory:
        cd /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker
        ./configure
    
        Enter Selection: 5
        Building FASTA version of RepeatMasker.lib .........
        Building RMBlast frozen libraries..
        The program is installed with a the following repeat libraries:
        File: /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker/Libraries/Dfam.h5
        Database: Dfam
        Version: 3.3
        Date: 2020-11-09
    
        Dfam - A database of transposable element (TE) sequence alignments and HMMs.
    
        Total consensus sequences: 6953
        Total HMMs: 6915
    
        conda activate transposon_long
        #When using HMMER with RepeatMasker, it automatically looks for the Dfam.h5 file in the Libraries/ directory, not a custom library name specified with -lib.
        #If you're using HMMER and the Dfam.h5 file, the -lib option should not be used. Simply run RepeatMasker like this:
        RepeatMasker -species "YourSpecies" -pa 4 CP020463.fasta
    
  2. Test the installed tools

    # Check versions
    sniffles --version
    RepeatModeler -h
    RepeatMasker -h
    svim --help
    SURVIVOR --help
    mamba install -c conda-forge perl r
    
  3. Align Long Reads to the WT Reference

    Use Minimap2 for aligning your reads:
    
      for sample in 1 2 3 4 5 7 8 9 10; do
      for sample in WT; do
          minimap2 --MD -t 60 -ax map-ont CP020463.fasta ./batch1_depth25/trycycler_${sample}/reads.fastq | samtools sort -o ${sample}.sorted.bam
          samtools index ${sample}.sorted.bam
      done
    
  4. Call Structural Variants with Sniffles2: A fast structural variant caller for long-read sequencing, Sniffles2 accurately detect SVs on germline, somatic and population-level for PacBio and Oxford Nanopore read data.

    Detect structural variants in each sample using Sniffles2:
    
        sniffles -m WT.sorted.bam -v WT.vcf -s 10 -l 50 -t 60
    
        -s 20: Requires at least 20 reads to support an SV for reporting.
        -l 50: Reports SVs that are at least 50 base pairs long.
        -t 4: Uses 4 threads for faster processing.
    
      for sample in WT 1 2 3 4 5 7 8 9 10; do
          minimap2 --MD -t 60 -ax map-ont CP020463.fasta ./batch1_depth25/trycycler_${sample}/reads.fastq | samtools sort -o ${sample}.sorted.bam
          samtools index ${sample}.sorted.bam
          sniffles -m ${sample}.sorted.bam -v ${sample}.vcf -s 10 -l 50 -t 60
      done
      for sample in WT 1 2 3 4 5 7 8 9 10; do
          bcftools filter -e "QUAL < 20 || INFO/SVTYPE != 'INS'" ${sample}.vcf > ${sample}_filtered.vcf
      done
      #!!!!WT has only one record as expected!!!!
    
  5. Annotate Transposable Elements

    Build a Custom Transposon Library (Optional but Recommended):
        Use RepeatModeler2 to identify and classify transposable elements in your WT genome.
    
        makeblastdb -in CP020463.fasta -dbtype nucl -out CP020463_db -parse_seqids
        blastdbcmd -db CP020463_db -info
        RepeatModeler -database CP020463_db -pa 8
    
        #esearch -db nucleotide -query "CP020463" | efetch -format gb | grep -A 1 "translation" > CP020463_proteins.fasta
        #awk '{if(NR%2==1){print ">acc"NR/2} else {print $0}}' CP020463.protein.faa > CP020463_with_accessions.faa
    
        #TODO: DEBUG_NEXT_MONDAY!
        makeblastdb -in CP020463.protein.faa -dbtype prot -out CP020463_protein_db
        blastdbcmd -db CP020463_protein_db -info
        RepeatModeler -database CP020463_protein_db -pa 8
    
    This creates a transposon library in FASTA format.
    
    Annotate Insertions with RepeatMasker:
    
        Use the transposon library (or a database like Dfam) to annotate the detected insertions:
    
    RepeatMasker -lib transposons.fasta sample1.vcf -dir output/
    
    This step determines if detected insertions match known or de novo transposons.
    
  6. Compare Insertions Across Samples

    Merge Variants Across Samples: Use SURVIVOR to merge and compare the detected insertions in all samples against the WT:
    
    SURVIVOR merge input_vcfs.txt 1000 1 1 1 0 30 merged.vcf
    
        Input: List of VCF files from Sniffles2.
        Output: A consolidated VCF file with shared and unique variants.
    
    Filter WT Insertions:
    
        Identify transposons present only in samples 1–9 by subtracting WT variants using bcftools:
    
            bcftools isec WT.vcf merged.vcf -p comparison_results
    
  7. Validate and Visualize

    Visualize with IGV: Use IGV to inspect insertion sites in the alignment and confirm quality.
    
    igv.sh
    
    Validate Findings:
        Perform PCR or additional sequencing for key transposon insertion sites to confirm results.
    
  8. Alternatives to TEPID for Long-Read Data

    If you’re looking for transposon-specific tools for long reads:
    
        REPET: A robust transposon annotation tool compatible with assembled genomes.
        EDTA (Extensive de novo TE Annotator):
            A pipeline to identify, classify, and annotate transposons.
            Works directly on your assembled genomes.
    
            perl EDTA.pl --genome WT.fasta --type all
    
  9. The WT.vcf file in the pipeline is generated by detecting structural variants (SVs) in the wild-type (WT) genome aligned against itself or using it as a baseline reference. Here’s how you can generate the WT.vcf:

    Steps to Generate WT.vcf
    1. Align WT Reads to the WT Reference Genome
    
    The goal here is to create an alignment of the WT sequencing data to the WT reference genome to detect any self-contained structural variations, such as native insertions, deletions, or duplications.
    
    Command using Minimap2:
    
    minimap2 -ax map-ont WT.fasta WT_reads.fastq | samtools sort -o WT.sorted.bam
    
    Index the BAM file:
    
    samtools index WT.sorted.bam
    
    2. Detect Structural Variants with Sniffles2
    
    Run Sniffles2 on the WT alignment to call structural variants:
    
    sniffles --input WT.sorted.bam --vcf WT.vcf
    
    This step identifies:
    
        Native transposons and insertions present in the WT genome.
        Other structural variants that are part of the reference genome or sequencing artifacts.
    
    Key parameters to consider:
    
        --min_support: Adjust based on your WT sequencing coverage.
        --max_distance: Define proximity for merging variants.
        --min_length: Set a minimum SV size (e.g., >50 bp for transposons).
    
  10. Clean and Filter the WT.vcf

    To ensure the WT.vcf only includes relevant transposons or SVs:
    
        Use bcftools or similar tools to filter out low-confidence variants:
    
        bcftools filter -e "QUAL < 20 || INFO/SVTYPE != 'INS'" WT.vcf > WT_filtered.vcf
    
            This removes low-quality calls and focuses on insertions (INS) relevant for transposon detection.
        Optionally, annotate the WT.vcf with known transposons using tools like RepeatMasker.
    
    The WT.vcf acts as a baseline for comparison:
    Variants detected in your samples 1–9 are compared to those in the WT to identify novel insertions (potential transposons).
    Shared insertions between the WT and samples are excluded as native to the WT genome.
    
  11. In this pipeline, the WT.fasta (reference genome) is typically a high-quality genome sequence from a database or a well-annotated version of your species' genome. It is not assembled from the WT.fastq sequencing reads in this context. Here's why:

    Why Use a Reference Genome (WT.fasta) from a Database?
    
        Higher Quality and Completeness:
            Database references (e.g., NCBI, Ensembl) are typically well-assembled, highly polished, and annotated. They serve as a reliable baseline for variant detection.
    
        Consistency:
            Using a standard reference ensures consistent comparisons across your WT and samples (1–9). Variants detected will be relative to this reference, not influenced by possible assembly errors.
    
        Saves Time:
            Assembling a reference genome from WT reads requires significant computational effort. Using an existing reference streamlines the analysis.
    
    Alternative: Assembling WT from FASTQ
    
    If you don’t have a high-quality reference genome (WT.fasta) and must rely on your WT FASTQ reads:
    
        Assemble the genome from your WT.fastq:
            Use long-read assemblers like Flye, Canu, or Shasta to create a draft genome.
    
        flye --nano-raw WT.fastq --out-dir WT_assembly --genome-size <size>
    
        Polish the assembly using tools like Racon (with the same reads) or Medaka for higher accuracy.
        Use the assembled and polished genome as your WT.fasta reference for further steps.
    
    Key Takeaways:
    
        If you have access to a reliable, high-quality reference genome, use it as the WT.fasta.
        Only assemble WT.fasta from raw reads (WT.fastq) if no database reference is available for your organism.
    

like unlike

点赞本文的读者

还没有人对此文章表态


本文有评论

没有评论

看文章,发评论,不要沉默


© 2023 XGenes.com Impressum