Prepare the databases for vrap

gene_x 0 like s 527 view s

Tags: metagenomics

  1. I used an strategy, at first annotate the contigs using the virus-speicific data and bacteria-speicific data, then using more general databases nt and nr. The results are as attached. For some samples, for examples S5, which we can detected several contigs as gammaherpesvius. For the bacteria, it is more conversed.

    # -- txid10239 (Virus) and Taxonomy ID: 2 (Bacteria) --
    # -- Virus --
    #TODO: from 1,100,000 --> 1,288,629 (up to 2020/07/01); bacteria we can use refseq (up to 2020/07/01)!
    #--virus bacteria-refseq-fasta, then virus sequences, virus protein as default database, then nt and nr!
    #TODO!: download bact_nt_db and use in '--virus bact_nt_db'!
    #  pip install ncbi-genome-download
    #  ncbi-genome-download -F fasta bacteria
    #  ncbi-genome-download -F fasta virus
    #  https://www.ncbi.nlm.nih.gov/genome/microbes/
    #  https://www.biostars.org/p/9503245/
    
  2. download bacteria refseq with datasets

    #https://www.ncbi.nlm.nih.gov/datasets/docs/v1/download-and-install/
    The NCBI Datasets datasets command line tools are datasets and dataformat .
    
    #datasets download genome bacteria --assembly-source refseq --dehydrated --filename bacteria_refseq.zip
    ~/Tools/datasets download genome bacteria --assembly-source refseq --dehydrated --exclude-protein --exclude-genomic-cds --exclude-rna --exclude-gff3 --filename bacteria_refseq_fasta.zip
    ~/Tools/datasets download genome taxon bacteria #2,231,190
    ~/Tools/datasets download genome taxon bacteria --assembly-source refseq --dehydrated --exclude-protein --exclude-genomic-cds --exclude-rna --exclude-gff3 --filename bacteria_refseq.zip  #325,471
    #~/Tools/datasets download genome taxon virus #97,281 records
    #~/Tools/datasets download genome taxon virus --assembly-source refseq --dehydrated --exclude-protein --exclude-genomic-cds --exclude-rna --exclude-gff3 --filename virus_refseq.zip #14,992
    
    #Unzip the file
    unzip bacteria_refseq.zip -d bacteria_refseq
    unzip virus_refseq.zip -d virus_refseq
    
    #Rehydrate the file: I'm recommending the dehydrated option because it's actually faster and more reliable, despite the additional steps. By default, the data package includes genomic, transcript, protein and cds sequences, in addition to gff3. If you only need the genomic fasta sequences, you can use this command instead:
    ~/Tools/datasets rehydrate --directory bacteria_refseq/
    ~/Tools/datasets rehydrate --directory virus_refseq/  #29984
    
  3. run vrap.py with --host genome.fa --virus bacteria_refseq [default viral_db up to 2020_07_01] -n nt -r nr

    # -- Virus --
    vrap/vrap.py -1 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20430/635290002_CMV_S4_R1_001.fastq.gz -2 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20430/635290002_CMV_S4_R2_001.fastq.gz -o CMV_S4_unbiased2 --host /home/jhuang/REFs/genome.fa -n /mnt/h1/jhuang/blast/nt -a /mnt/h1/jhuang/blast/nr -t 40 -l 200
    vrap/vrap.py -1 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20431/635850623_EBV_S5_R1_001.fastq.gz -2 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20431/635850623_EBV_S5_R2_001.fastq.gz -o EBV_S5_unbiased2 --host /home/jhuang/REFs/genome.fa -n /mnt/h1/jhuang/blast/nt -a /mnt/h1/jhuang/blast/nr -t 40 -l 200
    
    # -- Control --
    vrap/vrap.py -1 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20428/neg_control_S2_R1_001.fastq.gz -2 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20428/neg_control_S2_R2_001.fastq.gz -o neg_control_S2_unbiased2 --host /home/jhuang/REFs/genome.fa -n /mnt/h1/jhuang/blast/nt -a /mnt/h1/jhuang/blast/nr -t 40 -l 200
    
    # -- Bacteria --
    vrap/vrap.py -1 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20427/635031018_E_faecium_S1_R1_001.fastq.gz -2 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20427/635031018_E_faecium_S1_R2_001.fastq.gz -o E_faecium_S1_unbiased2 --host /home/jhuang/REFs/genome.fa -n /mnt/h1/jhuang/blast/nt -a /mnt/h1/jhuang/blast/nr -t 40 -l 200
    vrap/vrap.py -1 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20429/635724976_S_aureus_epidermidis_S3_R1_001.fastq.gz -2 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20429/635724976_S_aureus_epidermidis_S3_R2_001.fastq.gz -o S_aureus_epidermidis_S3_unbiased2 --host /home/jhuang/REFs/genome.fa -n /mnt/h1/jhuang/blast/nt -a /mnt/h1/jhuang/blast/nr -t 40 -l 200
    #END
    

like unlike

点赞本文的读者

还没有人对此文章表态


本文有评论

没有评论

看文章,发评论,不要沉默


© 2023 XGenes.com Impressum