gene_x
Tags: metagenomics
I used a two-stage strategy: first annotate the contigs using the virus-specific and bacteria-specific databases, then using the more general databases nt and nr. The results are attached. For some samples, for example S5, we detected several contigs as gammaherpesvirus. For the bacteria, the results are more conserved.
# -- txid10239 (Virus) and Taxonomy ID: 2 (Bacteria) --
# -- Virus --
#TODO: from 1,100,000 --> 1,288,629 (up to 2020/07/01); bacteria we can use refseq (up to 2020/07/01)!
# Plan: pass the bacterial RefSeq FASTA via --virus, keep the virus sequences and virus proteins as the default database, then search nt and nr.
#TODO!: download bact_nt_db and use in '--virus bact_nt_db'!
# pip install ncbi-genome-download
# ncbi-genome-download -F fasta bacteria
# ncbi-genome-download -F fasta virus
# https://www.ncbi.nlm.nih.gov/genome/microbes/
# https://www.biostars.org/p/9503245/
Download the bacterial RefSeq genomes with datasets
#https://www.ncbi.nlm.nih.gov/datasets/docs/v1/download-and-install/
The NCBI Datasets command-line tools are datasets and dataformat.
#datasets download genome taxon bacteria --assembly-source refseq --dehydrated --filename bacteria_refseq.zip
~/Tools/datasets download genome taxon bacteria --assembly-source refseq --dehydrated --exclude-protein --exclude-genomic-cds --exclude-rna --exclude-gff3 --filename bacteria_refseq_fasta.zip
~/Tools/datasets download genome taxon bacteria #2,231,190
~/Tools/datasets download genome taxon bacteria --assembly-source refseq --dehydrated --exclude-protein --exclude-genomic-cds --exclude-rna --exclude-gff3 --filename bacteria_refseq.zip #325,471
#~/Tools/datasets download genome taxon virus #97,281 records
#~/Tools/datasets download genome taxon virus --assembly-source refseq --dehydrated --exclude-protein --exclude-genomic-cds --exclude-rna --exclude-gff3 --filename virus_refseq.zip #14,992
#Unzip the file
unzip bacteria_refseq.zip -d bacteria_refseq
unzip virus_refseq.zip -d virus_refseq
#Rehydrate the files. The dehydrated option is recommended because it is faster and more reliable, despite the additional steps. By default, the data package includes genomic, transcript, protein and CDS sequences in addition to GFF3; the --exclude-* flags in the download commands above leave only the genomic FASTA sequences.
~/Tools/datasets rehydrate --directory bacteria_refseq/
~/Tools/datasets rehydrate --directory virus_refseq/ #29984
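Once rehydration finishes, it may be worth checking that the genomic FASTA files actually arrived, and merging them into a single multi-FASTA file that can then be converted into whatever database format the downstream tool expects. The sketch below assumes the usual unpacked layout (`<dir>/ncbi_dataset/data/<accession>/*.fna`); the function names `count_fna` and `merge_fna` and the output filename are hypothetical and not part of the original workflow.

```shell
# Sketch only: assumes genomes sit under
# <dir>/ncbi_dataset/data/<accession>/*.fna after rehydration.
# Check this against your own download before relying on it.

# Print the number of genomic FASTA files found in a rehydrated package.
# Usage: count_fna <package_dir>
count_fna() {
    find "$1/ncbi_dataset/data" -type f -name '*.fna' | wc -l | tr -d ' '
}

# Concatenate all genomic FASTA files of a package into one multi-FASTA file.
# find batches the file list itself, so very large packages do not hit
# the shell's argument-length limit.
# Usage: merge_fna <package_dir> <output.fasta>
merge_fna() {
    find "$1/ncbi_dataset/data" -type f -name '*.fna' -exec cat {} + > "$2"
}
```

For example, `count_fna bacteria_refseq` prints the number of `.fna` files found, and `merge_fna bacteria_refseq bacteria_refseq.fasta` writes the combined file (the name `bacteria_refseq.fasta` is only an illustration).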
Run vrap.py with --host genome.fa --virus bacteria_refseq [default viral_db up to 2020_07_01] -n nt -a nr
# -- Virus --
vrap/vrap.py -1 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20430/635290002_CMV_S4_R1_001.fastq.gz -2 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20430/635290002_CMV_S4_R2_001.fastq.gz -o CMV_S4_unbiased2 --host /home/jhuang/REFs/genome.fa -n /mnt/h1/jhuang/blast/nt -a /mnt/h1/jhuang/blast/nr -t 40 -l 200
vrap/vrap.py -1 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20431/635850623_EBV_S5_R1_001.fastq.gz -2 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20431/635850623_EBV_S5_R2_001.fastq.gz -o EBV_S5_unbiased2 --host /home/jhuang/REFs/genome.fa -n /mnt/h1/jhuang/blast/nt -a /mnt/h1/jhuang/blast/nr -t 40 -l 200
# -- Control --
vrap/vrap.py -1 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20428/neg_control_S2_R1_001.fastq.gz -2 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20428/neg_control_S2_R2_001.fastq.gz -o neg_control_S2_unbiased2 --host /home/jhuang/REFs/genome.fa -n /mnt/h1/jhuang/blast/nt -a /mnt/h1/jhuang/blast/nr -t 40 -l 200
# -- Bacteria --
vrap/vrap.py -1 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20427/635031018_E_faecium_S1_R1_001.fastq.gz -2 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20427/635031018_E_faecium_S1_R2_001.fastq.gz -o E_faecium_S1_unbiased2 --host /home/jhuang/REFs/genome.fa -n /mnt/h1/jhuang/blast/nt -a /mnt/h1/jhuang/blast/nr -t 40 -l 200
vrap/vrap.py -1 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20429/635724976_S_aureus_epidermidis_S3_R1_001.fastq.gz -2 ../231114_VH00358_62_AACYCYWM5_cfDNA/p20429/635724976_S_aureus_epidermidis_S3_R2_001.fastq.gz -o S_aureus_epidermidis_S3_unbiased2 --host /home/jhuang/REFs/genome.fa -n /mnt/h1/jhuang/blast/nt -a /mnt/h1/jhuang/blast/nr -t 40 -l 200
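The five commands above differ only in the project folder, the FASTQ prefix and the output name, so they could be generated from a small sample table. The following is a minimal sketch that only prints the command lines; all paths and options are copied from the commands above, while `build_vrap_cmd` and `print_all_cmds` are hypothetical helper names introduced for this example.

```shell
# Sketch only: prints the vrap.py command for each sample instead of running it.
RUN_DIR="../231114_VH00358_62_AACYCYWM5_cfDNA"

# Print the vrap.py command line for one sample.
# Usage: build_vrap_cmd <project_folder> <fastq_prefix> <output_name>
build_vrap_cmd() {
    printf '%s\n' "vrap/vrap.py -1 ${RUN_DIR}/${1}/${2}_R1_001.fastq.gz -2 ${RUN_DIR}/${1}/${2}_R2_001.fastq.gz -o ${3}_unbiased2 --host /home/jhuang/REFs/genome.fa -n /mnt/h1/jhuang/blast/nt -a /mnt/h1/jhuang/blast/nr -t 40 -l 200"
}

# Print one command per sample; each table row is
# "<project_folder> <fastq_prefix> <output_name>".
print_all_cmds() {
    while read -r folder prefix name; do
        build_vrap_cmd "$folder" "$prefix" "$name"
    done <<EOF
p20430 635290002_CMV_S4 CMV_S4
p20431 635850623_EBV_S5 EBV_S5
p20428 neg_control_S2 neg_control_S2
p20427 635031018_E_faecium_S1 E_faecium_S1
p20429 635724976_S_aureus_epidermidis_S3 S_aureus_epidermidis_S3
EOF
}
```

Calling `print_all_cmds` lists the commands for review; piping the output to a shell (`print_all_cmds | sh`) would run them one after another.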
#END
© 2023 XGenes.com Impressum