Data_Tam_DNAseq_2025 NACHREICHEN_1

gene_x 0 like s 169 view s

Tags: pipeline

For Clinical and HF

a) Please proceed with SNP and indel calling for both clinical samples.

  1. #-- DNAseq_2025_AYE: clinical vs CP149838 (Acinetobacter baumannii strain 2024CK-00130) --
  2. mkdir -p bacto_clinical_vs_CP149838/raw_data; cd bacto_clinical_vs_CP149838;
  3. ln -s ~/Tools/bacto/db/ .;
  4. ln -s ~/Tools/bacto/envs/ .;
  5. ln -s ~/Tools/bacto/local/ .;
  6. cp ~/Tools/bacto/Snakefile .;
  7. cp ~/Tools/bacto/bacto-0.1.json .;
  8. cp ~/Tools/bacto/cluster.json .;
  9. cd raw_data
  10. ln -s ../../X101SC25015922-Z01-J001/01.RawData/clinical/clinical_1.fq.gz clinical_R1.fastq.gz
  11. ln -s ../../X101SC25015922-Z01-J001/01.RawData/clinical/clinical_2.fq.gz clinical_R2.fastq.gz
  12. (bengal3_ac3) /home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
  13. # Annotate the snippy results using SNP_matrix.sh from spandx
  14. (spandx) cd bacto_clinical_vs_CP149838/snippy/clinical
  15. (spandx) cp ~/Tools/spandx/bin/SNP_matrix.sh ./
  16. (spandx) cp clinical.filt.vcf out.filtered.vcf
  17. (spandx) cp clinical.filt.vcf out.vcf
  18. #Adapt to "/home/jhuang/miniconda3/envs/bengal3_ac3/bin/snpEff eff -no-downstream -no-intergenic -ud 100 -formatEff -c reference/snpeff.config -dataDir . ref out.vcf > out.annotated.vcf" in SNP_matrix.sh
  19. (spandx) bash SNP_matrix.sh CP149838 .
  20. #(ANNOTATED_RESULTS_SNIPPY) ~/Tools/csv2xls-0.4/csv_to_xls.py All_SNPs_indels_annotated.txt -d$'\t' -o ../../../clinical_vs_CP149838_SNPs_indels_annotated.xls
  21. #-- DNAseq_2025_adeABadeIJ_adeIJK_CM1_CM2: HF vs CP095177.1 (Enterobacter hormaechei strain K528) --
  22. mkdir -p bacto_HF_vs_CP095177/raw_data; cd bacto_HF_vs_CP095177;
  23. ln -s ~/Tools/bacto/db/ .;
  24. ln -s ~/Tools/bacto/envs/ .;
  25. ln -s ~/Tools/bacto/local/ .;
  26. cp ~/Tools/bacto/Snakefile .;
  27. cp ~/Tools/bacto/bacto-0.1.json .;
  28. cp ~/Tools/bacto/cluster.json .;
  29. cd raw_data
  30. ln -s ../../X101SC24115801-Z01-J001/01.RawData/HF/HF_1.fq.gz HF_R1.fastq.gz
  31. ln -s ../../X101SC24115801-Z01-J001/01.RawData/HF/HF_2.fq.gz HF_R2.fastq.gz
  32. mamba activate /home/jhuang/miniconda3/envs/bengal3_ac3
  33. (bengal3_ac3) /home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
  34. # Annotate the Snippy results using SNP_matrix.sh from SPANDx, disregarding the runtime error.
  35. mamba activate /home/jhuang/miniconda3/envs/spandx
  36. (spandx) cd bacto_HF_vs_CP095177/snippy/HF
  37. (spandx) cp ~/Tools/spandx/bin/SNP_matrix.sh ./
  38. (spandx) cp HF.filt.vcf out.filtered.vcf
  39. (spandx) cp HF.filt.vcf out.vcf
  40. #Adapt to "/home/jhuang/miniconda3/envs/bengal3_ac3/bin/snpEff eff -no-downstream -no-intergenic -ud 100 -formatEff -c reference/snpeff.config -dataDir . ref out.vcf > out.annotated.vcf" in SNP_matrix.sh
  41. (spandx) bash SNP_matrix.sh CP095177 .
  42. #(ANNOTATED_RESULTS_SNIPPY) ~/Tools/csv2xls-0.4/csv_to_xls.py All_SNPs_indels_annotated.txt -d$'\t' -o ../../../HF_vs_CP095177_SNPs_indels_annotated.xls

b) Additionally, I would like a phylogenetic tree (evolutionary relationship between our samples and other reference genomes) and

  1. #using Dendogram tool

c) A sequence alignment figure (similar to this example: https://macvector.com/blog/2021/10/compare-a-pair-of-genomes/) comparing the reference genome with our samples.

  1. # TODO: Using BRIG: see http://xgenes.com/article/article-content/325/analysis-of-snps-indels-transposons-and-is-elements-in-5-a-baumannii-strains/
  2. * DNAseq_2025_AYE: others vs CU459141
  3. cd ~/Tools/BRIG-0.95-dist
  4. #IMPORTANT_CRITICAL_DEBUG: we should use jdk1.6, otherwise results in ERROR!
  5. ~/Tools/Java1.6/jdk1.6.0_45/bin/java -Xmx25000M -jar BRIG.jar
  6. #1. Select Reference Genbank file in first textfield.
  7. #2. Prepare assembled fasta-files and sepecific gene-sequences as "Query sequence" in the folder "~/DATA/Data_Tam_DNAseq_2025_AYE/shovill/ and ~/DATA/Data_Tam_DNAseq_2025_AYE/" by "cp AYE-Q/contigs.fa AYE-Q.fasta; ...; samtools faidx CP059040.fasta CP059040:1844319-1845509 > adeA.fasta; ..." in second textfield.
  8. #3. Saved bundle_sessions --> saved all files in a seperate directory --> open session brigWorkspace_AYE (Note that sometimes the fasta-files is too large (severeal GBs) and all of them have been copied to "~/brigWorkspace_adeABadeIJ_adeIJK_CM1_CM2/gen")
  9. * DNAseq_2025_adeABadeIJ_adeIJK_CM1_CM2: others to CP059040
  10. samtools faidx CP059040.fasta CP059040:1844319-1845509 > adeA.fasta
  11. samtools faidx CP059040.fasta CP059040:1845506-1848616 > adeB.fasta
  12. samtools faidx CP059040.fasta CP059040:740422-741672 > adeI_.fasta
  13. revcomp adeI_.fasta > adeI.fasta
  14. samtools faidx CP059040.fasta CP059040:737233-740409 > adeJ_.fasta
  15. revcomp adeJ_.fasta > adeJ.fasta
  16. samtools faidx CP059040.fasta CP059040:735779-737233 > adeK_.fasta
  17. revcomp adeK_.fasta > adeK.fasta
  18. gene 1844319..1845509
  19. /gene="adeA"
  20. /locus_tag="H0N29_08675"
  21. gene 1845506..1848616
  22. /gene="adeB"
  23. /locus_tag="H0N29_08680"
  24. gene complement(740422..741672)
  25. /gene="adeI"
  26. /locus_tag="H0N29_03550"
  27. gene complement(737233..740409)
  28. /gene="adeJ"
  29. /locus_tag="H0N29_03545"
  30. gene complement(735779..737233)
  31. /gene="adeK"
  32. /locus_tag="H0N29_03540"
  33. #jhuang@WS-2290C:~/brigWorkspace_adeABadeIJ_adeIJK_CM1_CM2/scratch$ vim adeAB.fastaVsCP059040.gb.fna.tab
  34. #jhuang@WS-2290C:~/brigWorkspace_adeABadeIJ_adeIJK_CM1_CM2/scratch$ vim adeIJK.fastaVsCP059040.gb.fna.tab
  35. #TODO: Optimize the cutoff so that vey short matches not used for the BRIG drawing!
  36. #/home/jhuang/miniconda3/bin/blastn -outfmt 6 -query /home/jhuang/DATA/Data_Tam_DNAseq_2025_AYE/shovill/AYE-S.fasta -db /home/jhuang/brigWorkspace/scratch/CU459141.gb.fna -out /home/jhuang/brigWorkspace/scratch/AYE-S.fastaVsCU459141.gb.fna.tab -task blastn
  37. "-evalue 1e-1 -max_target_seqs 1" #-strand both -num_threads 15 -outfmt 6
  38. cd ~/brigWorkspace_adeABadeIJ_adeIJK_CM1_CM2/scratch/
  39. cp /home/jhuang/Tools/bacto/db/CU459141.gb ~/brigWorkspace_AYE #DEBUG due to the empty genbank file
  40. #mv AYE-Q.fastaVsCU459141.gb.fna.tab AYE-Q.fastaVsCU459141.gb.fna.tab_
  41. #mv AYE-S.fastaVsCU459141.gb.fna.tab AYE-S.fastaVsCU459141.gb.fna.tab_
  42. #mv AYE-WT_on_Tig4.fastaVsCU459141.gb.fna.tab AYE-WT_on_Tig4.fastaVsCU459141.gb.fna.tab_
  43. #mv AYE-craA_on_Tig4.fastaVsCU459141.gb.fna.tab AYE-craA_on_Tig4.fastaVsCU459141.gb.fna.tab_
  44. #mv AYE-craA-1_on_Cm200.fastaVsCU459141.gb.fna.tab AYE-craA-1_on_Cm200.fastaVsCU459141.gb.fna.tab_
  45. #mv AYE-craA-2_on_Cm200.fastaVsCU459141.gb.fna.tab AYE-craA-2_on_Cm200.fastaVsCU459141.gb.fna.tab_
  46. python3 ~/Scripts/filter_blast.py AYE-Q.fastaVsCU459141.gb.fna.tab_ AYE-Q.fastaVsCU459141.gb.fna.tab --evalue 1e-10 --min_length 100 --min_identity 90
  47. python3 ~/Scripts/filter_blast.py AYE-S.fastaVsCU459141.gb.fna.tab_ AYE-S.fastaVsCU459141.gb.fna.tab --evalue 1e-10 --min_length 100 --min_identity 90
  48. python3 ~/Scripts/filter_blast.py AYE-WT_on_Tig4.fastaVsCU459141.gb.fna.tab_ AYE-WT_on_Tig4.fastaVsCU459141.gb.fna.tab --evalue 1e-10 --min_length 100 --min_identity 90
  49. python3 ~/Scripts/filter_blast.py AYE-craA_on_Tig4.fastaVsCU459141.gb.fna.tab_ AYE-craA_on_Tig4.fastaVsCU459141.gb.fna.tab --evalue 1e-10 --min_length 100 --min_identity 90
  50. python3 ~/Scripts/filter_blast.py AYE-craA-1_on_Cm200.fastaVsCU459141.gb.fna.tab_ AYE-craA-1_on_Cm200.fastaVsCU459141.gb.fna.tab --evalue 1e-10 --min_length 100 --min_identity 90
  51. python3 ~/Scripts/filter_blast.py AYE-craA-2_on_Cm200.fastaVsCU459141.gb.fna.tab_ AYE-craA-2_on_Cm200.fastaVsCU459141.gb.fna.tab --evalue 1e-10 --min_length 100 --min_identity 90
  52. cp /home/jhuang/Tools/bacto/db/CP059040.gb ~/brigWorkspace_adeABadeIJ_adeIJK_CM1_CM2 #DEBUG due to the empty genbank file
  53. #mv deltaAdeIJK.fastaVsCP059040.gb.fna.tab deltaAdeIJK.fastaVsCP059040.gb.fna.tab_
  54. #mv deltaAdeABIJ.fastaVsCP059040.gb.fna.tab deltaAdeABIJ.fastaVsCP059040.gb.fna.tab_
  55. #mv CM1.fastaVsCP059040.gb.fna.tab CM1.fastaVsCP059040.gb.fna.tab_
  56. #mv CM2.fastaVsCP059040.gb.fna.tab CM2.fastaVsCP059040.gb.fna.tab_
  57. python3 ~/Scripts/filter_blast.py deltaAdeIJK.fastaVsCP059040.gb.fna.tab_ deltaAdeIJK.fastaVsCP059040.gb.fna.tab --evalue 1e-10 --min_length 100 --min_identity 90
  58. python3 ~/Scripts/filter_blast.py deltaAdeABIJ.fastaVsCP059040.gb.fna.tab_ deltaAdeABIJ.fastaVsCP059040.gb.fna.tab --evalue 1e-10 --min_length 100 --min_identity 90
  59. python3 ~/Scripts/filter_blast.py CM1.fastaVsCP059040.gb.fna.tab_ CM1.fastaVsCP059040.gb.fna.tab --evalue 1e-10 --min_length 100 --min_identity 90
  60. python3 ~/Scripts/filter_blast.py CM2.fastaVsCP059040.gb.fna.tab_ CM2.fastaVsCP059040.gb.fna.tab --evalue 1e-10 --min_length 100 --min_identity 90
  61. * DNAseq_2025_AYE: clinical vs gi|2707120326|gb|CP149838.1| (Acinetobacter baumannii strain 2024CK-00130)
  62. mamba activate /home/jhuang/miniconda3/envs/bakta
  63. bakta --db /mnt/nvme1n1p1/bakta_db ~/DATA/Data_Tam_DNAseq_2025_AYE/vrap_clinical/vrap_contig.fasta --prefix vrap_clinical
  64. * DNAseq_2025_adeABadeIJ_adeIJK_CM1_CM2: HF vs gi|2239477175|gb|CP095177.1| (Enterobacter hormaechei strain K528)
  65. bakta --db /mnt/nvme1n1p1/bakta_db ~/DATA/Data_Tam_DNAseq_2025_adeABadeIJ_adeIJK_CM1_CM2/vrap_HF/vrap_contig.fasta --prefix vrap_HF
  66. # Using Artemis (ACT) to draw the comparisons (https://sanger-pathogens.github.io/Artemis/; https://github.com/sanger-pathogens/Artemis)
  67. tar zxf artemis-unix-release-{version}.tar.gz
  68. ~/Scripts/embl_to_gbk.py CP149838.embl CP149838.gbk
  69. ~/Scripts/embl_to_gbk.py vrap_clinical.embl clinical.gbk
  70. ~/Scripts/embl_to_gbk.py CP095177.embl CP095177.gbk
  71. ~/Scripts/embl_to_gbk.py vrap_HF.embl HF.gbk
  72. TODO_NEXT_WEEK: comparison_file_1 is defined as follows: Common formats: .crunch (BLAST), .delta (MUMmer), or .align (Exonerate).
  73. blastn -query ref_genome.fasta -subject sample_genome.fasta -outfmt 6 > blast.crunch
  74. # Step 1: Extract FASTA from GenBank (if needed for alignment)
  75. ~/Scripts/genbank2fasta.py CP149838.gbk
  76. ~/Scripts/genbank2fasta.py clinical.gbk
  77. # Step 2: Run BLAST/MUMmer to generate comparison file
  78. blastn -query CP149838.gbk_converted.fna -subject clinical.gbk_converted.fna -outfmt 6 > blast.crunch
  79. Input:
  80. Sequence file 1: reference.gb (annotated reference genome).
  81. Sequence file 2: sample.gb (annotated sample genome).
  82. Comparison file 1: nucmer.delta (generated with nucmer reference.fasta sample.fasta).
  83. Output:
  84. ACT will display both genomes with annotations, aligned based on the .delta file.
  85. The image appears to be a genome visualization, possibly from a tool like Artemis, ACT (Artemis Comparison Tool), or another genome browser. Heres whats likely happening in the plot:
  86. Genome Overview
  87. Three Rows for Forward Strand and Three Rows for Reverse Strand
  88. Each genome sequence has genes and annotations on both forward and reverse strands.
  89. The three rows in each direction represent different levels of gene annotations to avoid overlap and make visualization easier.
  90. Why Three Rows?
  91. Genes in a genome can be closely packed or even overlapping.
  92. To prevent clutter, genome visualization tools often distribute overlapping genes across multiple rows.
  93. This is purely for visual clarity and does not indicate different types of genes.
  94. Key Features in the Image
  95. Black vertical bars: Represent DNA sequences.
  96. Blue/Green Annotations: Likely represent genes or coding sequences (CDS).
  97. Gray Annotations: Might be regulatory elements or pseudogenes.
  98. Thicker Blue Section: Could represent regions of high similarity between sequences
  99. #Some genome browsers use frame lines to indicate the three possible reading frames on each strand.
  100. #Top three lines: Represent the three possible reading frames of the forward strand (5' → 3').
  101. #Bottom three lines: Represent the three possible reading frames of the reverse strand (3' → 5').
  102. #Light grey areas = ORFs (regions that can actually be translated into proteins).
  103. #If an area is not light grey, it likely means that no ORF exists in that reading frame, meaning there are stop codons breaking the sequence.
  104. #TODO_NEXT_MONDAY: how to zoom in the ACT browser?

d) Would it be possible to perform genome annotation for these samples?

  1. bakta --db /mnt/nvme1n1p1/bakta_db --genus Acinetobacter --species baumannii --strain AYE-Q ../shovill/AYE-Q.fasta
  2. bakta --db /mnt/nvme1n1p1/bakta_db --genus Acinetobacter --species baumannii --strain AYE-S ../shovill/AYE-S.fasta
  3. bakta --db /mnt/nvme1n1p1/bakta_db --genus Acinetobacter --species baumannii --strain AYE-WT_on_Tig4 ../shovill/AYE-WT_on_Tig4.fasta
  4. bakta --db /mnt/nvme1n1p1/bakta_db --genus Acinetobacter --species baumannii --strain AYE-craA_on_Tig4 ../shovill/AYE-craA_on_Tig4.fasta
  5. bakta --db /mnt/nvme1n1p1/bakta_db --genus Acinetobacter --species baumannii --strain AYE-craA-1_on_Cm200 ../shovill/AYE-craA-1_on_Cm200.fasta
  6. bakta --db /mnt/nvme1n1p1/bakta_db --genus Acinetobacter --species baumannii --strain AYE-craA-2_on_Cm200 ../shovill/AYE-craA-2_on_Cm200.fasta
  7. cd ~/DATA/Data_Tam_DNAseq_2025_adeABadeIJ_adeIJK_CM1_CM2/bakta
  8. bakta --db /mnt/nvme1n1p1/bakta_db --genus Acinetobacter --species baumannii --strain deltaAdeIJK ../shovill/deltaAdeIJK.fasta
  9. bakta --db /mnt/nvme1n1p1/bakta_db --genus Acinetobacter --species baumannii --strain deltaAdeABIJ ../shovill/deltaAdeABIJ.fasta
  10. bakta --db /mnt/nvme1n1p1/bakta_db --genus Acinetobacter --species baumannii --strain CM1 ../shovill/CM1.fasta
  11. bakta --db /mnt/nvme1n1p1/bakta_db --genus Acinetobacter --species baumannii --strain CM2 ../shovill/CM2.fasta
  12. #END

e) Could you please perform a search for HF sample against CP095179 and CP095178 as Enterobacter hormaechei strain K528 harbours two plasmids.

  1. In addition to my comments from my previous email, could you please perform a search for HF sample against CP095179 and CP095178 as Enterobacter hormaechei strain K528 harbours two plasmids.
  2. Solution: Using the bowtie of vrap to map the reads on ref_genome/reference.fasta (The reference refers to the closest related genome found from the list generated by vrap)
  3. mamba activate /home/jhuang/miniconda3/envs/vrap
  4. (vrap) vrap/vrap.py -1 HF_trimmed_P_1.fastq.gz -2 HF_trimmed_P_2.fastq.gz -o vrap_HF_on_CP095178_and_CP095179 --host /home/jhuang/DATA/Data_Tam_DNAseq_2025_adeABadeIJ_adeIJK_CM1_CM2/CP095178_CP095179.fasta -t 100 -l 200 -g
  5. cd bowtie
  6. mv mapped mapped.sam
  7. samtools view -S -b mapped.sam > mapped.bam
  8. samtools sort mapped.bam -o mapped_sorted.bam
  9. samtools index mapped_sorted.bam
  10. samtools view -H mapped_sorted.bam
  11. samtools flagstat mapped_sorted.bam
  12. #14634302 + 0 in total (QC-passed reads + QC-failed reads)
  13. #0 + 0 secondary
  14. #0 + 0 supplementary
  15. #0 + 0 duplicates
  16. #177416 + 0 mapped (1.21% : N/A)
  17. #11014010 + 0 paired in sequencing
  18. #5507005 + 0 read1
  19. #5507005 + 0 read2
  20. #118328 + 0 properly paired (1.07% : N/A)
  21. #123066 + 0 with itself and mate mapped
  22. #10632 + 0 singletons (0.10% : N/A)
  23. #0 + 0 with mate mapped to a different chr
  24. #0 + 0 with mate mapped to a different chr (mapQ>=5)
  25. samtools depth -m 0 -a mapped_sorted.bam > coverage.txt
  26. grep "CP095178" coverage.txt > CP095178_coverage.txt
  27. grep "CP095179" coverage.txt > CP095179_coverage.txt
  28. import pandas as pd
  29. import matplotlib.pyplot as plt
  30. # Load coverage data
  31. df = pd.read_csv("CP095178_coverage.txt", sep="\t", header=None, names=["chr", "pos", "coverage"])
  32. # Plot
  33. plt.figure(figsize=(10,4))
  34. plt.plot(df["pos"], df["coverage"], color="blue", linewidth=0.5)
  35. plt.xlabel("Genomic Position")
  36. plt.ylabel("Coverage Depth")
  37. plt.title("BAM Coverage Plot")
  38. plt.show()
  39. RESULTS: say CP095178 does not exist, but CP095179 exists (see attached coverage plots)!

Kindly perform a comparative analysis on this data (DNAseq_2025_adeABadeIJ_adeIJK_CM1_CM2 against CP059040):

  1. adeABadeIJ sample (knockout of adeA - H0N29_08675, adeB - H0N29_08680, adeI - H0N29_03550, and adeJ - H0N29_03545, please confirm whether these genes are successfully knocked out.)
  2. adeIJK sample (knockout of adeI - H0N29_03550, adeJ - H0N29_03545, and adeK - H0N29_03540,, please confirm whether these genes are successfully knocked out.)
  3. for sample in adeABadeIJ adeIJK CM1 CM2; do
  4. makeblastdb -in ~/DATA/Data_Tam_DNAseq_2025_adeABadeIJ_adeIJK_CM1_CM2/shovill/${sample}/contigs.fa -dbtype nucl
  5. done
  6. for id in adeA adeB adeI adeJ adeK; do
  7. echo "mkdir ${id}"
  8. echo "for sample in adeABadeIJ adeIJK CM1 CM2; do"
  9. echo "blastn -db ~/DATA/Data_Tam_DNAseq_2025_adeABadeIJ_adeIJK_CM1_CM2/shovill/\${sample}/contigs.fa -query ${id}.fasta -evalue 1e-1 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
  10. echo "done"
  11. done
  12. cp mlst/all_mlst_may_be_wrong.txt typing.csv
  13. python ~/Scripts/process_directory.py adeA typing.csv typing_until_adeA.csv
  14. python ~/Scripts/process_directory.py adeB typing_until_adeA.csv typing_until_adeB.csv
  15. python ~/Scripts/process_directory.py adeI typing_until_adeB.csv typing_until_adeI.csv
  16. python ~/Scripts/process_directory.py adeJ typing_until_adeI.csv typing_until_adeJ.csv
  17. python ~/Scripts/process_directory.py adeK typing_until_adeJ.csv typing_until_adeK.csv
  18. #Note that the length of adeK is 1455 nt long. In the sample adeIJK sample, we can find the subsequence from the position 14 to 1455 of the genes in the contig00007 (position 44108-45549).

like unlike

点赞本文的读者

还没有人对此文章表态


本文有评论

没有评论

看文章,发评论,不要沉默


© 2023 XGenes.com Impressum