Deciphering S. aureus with metatranscriptomics (Marc's project)

gene_x 0 like s 667 view s

Tags: human, bacterium, pipeline, RNA-seq

RNAseq analaysis using nextflow set reference as CP000255

prepare reference

mkdir raw_data; cd raw_data
ln -s ../F24A430002261_BACwhbaP/P50_LH1+2/P50_LH1+2_1.fq.gz P50_LH1and2_R1.fq.gz
ln -s ../F24A430002261_BACwhbaP/P50_LH1+2/P50_LH1+2_2.fq.gz P50_LH1and2_R2.fq.gz

Downloading CP000255.fasta and CP000255.gff from GenBank

mv ~/Downloads/sequence\(3\).fasta CP000255.fasta
mv ~/Downloads/sequence\(2\).gff3 CP000255.gff

MRSA: USA300
EMRSA (CC22): CC22 strain
MSSA: Newmann strain

#https://journals.asm.org/doi/10.1128/jb.01000-07
#Staphylococcus aureus subsp. aureus strain Newman (Taxonomy ID: 426430)
#https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=426430

#Staphylococcus aureus subsp. aureus USA300_FPR3757, complete genome
https://www.ncbi.nlm.nih.gov/nuccore/CP000255.1

#Methicillin-resistant Staphylococcus aureus CC22-MRSA-IV as an agent of dairy cow intramammary infections
https://pubmed.ncbi.nlm.nih.gov/30473348/

sed 's/\tCDS\t/\texon\t/g' CP000255.gff > CP000255_m.gff
grep -P "\texon\t" CP000255_m.gff | sort | wc -l  #2630
# -- DEBUG_2: combination of 'CP059040_m.gff' and 'exon' results in ERROR, using 'transcript' instead! (but there is no 'transcript' in the column 3, strange????) --
--gff "/home/jhuang/DATA/Data_Marc_RNAseq_2024/CP000255_m.gff" --featurecounts_feature_type 'transcript'

Preparing the directory trimmed

mkdir trimmed trimmed_unpaired;
for sample_id in P50_LH1and2; do \
    java -jar /home/jhuang/Tools/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 100 raw_data/${sample_id}_R1.fq.gz raw_data/${sample_id}_R2.fq.gz trimmed/${sample_id}_R1.fq.gz trimmed_unpaired/${sample_id}_R1.fq.gz trimmed/${sample_id}_R2.fq.gz trimmed_unpaired/${sample_id}_R2.fq.gz ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 AVGQUAL:20;
done

Preparing samplesheet.csv

sample,fastq_1,fastq_2,strandedness
P50_LH1and2,P50_LH1and2_R1.fq.gz,P50_LH1and2_R2.fq.gz,auto

nextflow run

mv trimmed/*.fq.gz .
# ---- SUCCESSFUL with directly downloaded gff3 and fasta from NCBI using docker ----
docker pull nfcore/rnaseq
ln -s /home/jhuang/Tools/nf-core-rnaseq-3.12.0/ rnaseq
(host_env) /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Marc_RNAseq_2024/CP000255.fasta" --gff "/home/jhuang/DATA/Data_Marc_RNAseq_2024/CP000255_m.gff"        -profile docker -resume  --max_cpus 55 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'

# -- DEBUG_2: combination of 'CP059040_m.gff' and 'exon' results in ERROR, using 'transcript' instead!
--gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_m.gff" --featurecounts_feature_type 'transcript'

# ---- SUCCESSFUL with directly downloaded gff3 and fasta from NCBI using docker after replacing 'CDS' with 'exon' ----
(host_env) /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.fasta" --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_m.gff"        -profile docker -resume  --max_cpus 55 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'

# -- in results/genome/CP059040_genes.gtf -->
#P059040.1      Genbank transcript      1       1398    .       +       .       transcript_id "gene-H0N29_00005"; gene_id "gene-H0N29_00005"; gene_name "dnaA"; Name "dnaA"; gbkey "Gene"; gene "dnaA"; gene_biotype "protein_coding"; locus_tag "H0N29_00005";
#CP059040.1      Genbank exon    1       1398    .       +       .       transcript_id "gene-H0N29_00005"; gene_id "gene-H0N29_00005"; gene_name "dnaA"; Dbxref "NCBI_GP:QNT85048.1"; Name "QNT85048.1"; gbkey "CDS"; gene "dnaA"; inference "COORDINATES: similar to AA sequence:RefSeq:YP_004996698.1"; locus_tag "H0N29_00005"; product "chromosomal replication initiator protein DnaA"; protein_id "QNT85048.1"; transl_table "11";
#CP059040.1      Genbank transcript      1496    2644    .       +       .       transcript_id "gene-H0N29_00010"; gene_id "gene-H0N29_00010"; Name "H0N29_00010"; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "H0N29_00010";
#CP059040.1      Genbank exon    1496    2644    .       +       .       transcript_id "gene-H0N29_00010"; gene_id "gene-H0N29_00010"; Dbxref "NCBI_GP:QNT85049.1"; Name "QNT85049.1"; gbkey "CDS"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_010116376.1"; locus_tag "H0N29_00010"; product "DNA polymerase III subunit beta"; protein_id "QNT85049.1"; transl_table "11";

# -- DEBUG_3: make sure the header of fasta is the same to the *_m.gff file
>CP000255.1 Staphylococcus aureus subsp. aureus USA300_FPR3757, complete genome

Calculate Host and Bacterial Compositions using vrap

download all S epidermidis genomes and identified all ST2 isolates from them!

Staphylococcus aureus  ID: 1280
Staphylococcus epidermidis Taxonomy ID: 1282

esearch -db nucleotide -query "txid3052345[Organism:exp]" | efetch -format fasta > virus_genome_3052345.fasta  # 23788

esearch -db nucleotide -query "txid1280[Organism:exp]" | \
efetch -format fasta -email j.huang@uke.de > genome_1280_ncbi.fasta
esearch -db nucleotide -query "txid1282[Organism:exp]" | \
efetch -format fasta -email j.huang@uke.de > genome_1282_ncbi.fasta
cd /mnt/md1/DATA_md1/Data_Marc_RNAseq_2024
python ~/Scripts/filter_fasta.py genome_1280_ncbi.fasta complete_genome_1280_ncbi.fasta  #3478, 9.2G
python ~/Scripts/filter_fasta.py genome_1282_ncbi.fasta complete_genome_1282_ncbi.fasta  #447, 1.1G

# ---- Download related genomes from ENA ----
https://www.ebi.ac.uk/ena/browser/view/1280
#Click "Sequence" and download "Counts" and "Taxon descendants count" if there is enough time! Downloading time points is 10.01.2025.
https://www.ebi.ac.uk/ena/browser/view/1282

#python ~/Scripts/filter_fasta.py  ena_1282_sequence_taxon_descendants_count.fasta complete_genome_1282_ena_taxon_descendants.fasta
#S.aureus, Staphylococcus epidermidis
python ~/Scripts/filter_fasta.py ena_1280_sequence.fasta complete_genome_1280_ena.fasta  #2187, 5.8G
python ~/Scripts/filter_fasta.py ena_1282_sequence.fasta complete_genome_1282_ena.fasta  #243, 596M

replace virus to S.aureus

run vrap

ln -s ~/Tools/vrap/ .
conda activate /home/jhuang/miniconda3/envs/vrap
vrap/vrap.py  -1 trimmed/P50_LH1and2_R1.fq.gz -2 trimmed/P50_LH1and2_R2.fq.gz -o P50_LH1and2 --bt2idx=/home/jhuang/REFs/genome  --host=/home/jhuang/REFs/genome.fa --virus=/home/jhuang/REFs/ncbi_dataset/data/genomic.fna --nt=/mnt/nvme0n1p1/blast/nt --nr=/mnt/nvme0n1p1/blast/nr  -t 100 -l 200  -g

vrap/vrap.py  -1 trimmed/P50_LH1and2_R1.fq.gz -2 trimmed/P50_LH1and2_R2.fq.gz -o P50_LH1and2_1280 --bt2idx=/home/jhuang/REFs/genome  --host=/home/jhuang/REFs/genome.fa --virus=/home/jhuang/REFs/genomes_by_taxid/complete_genome_1280_ncbi.fasta --nt=/mnt/nvme0n1p1/blast/nt --nr=/mnt/nvme0n1p1/blast/nr  -t 100 -l 200  -g

#TODO: change virus_user_db --> S.aureus_user_db
#ref|NZ_CP075562.1| --> Staphylococcus aureus strain Sta2021010 chromosome, complete genome

mv mapped mapped.sam
samtools view -S -b mapped.sam > mapped.bam
samtools sort mapped.bam -o mapped_sorted.bam
samtools index mapped_sorted.bam
samtools view -H mapped_sorted.bam
samtools flagstat mapped_sorted.bam

@PG     ID:bowtie2      PN:bowtie2      VN:2.4.4        CL:"/usr/bin/bowtie2-align-s --wrapper basic-0 -x /home/jhuang/REFs/genome -q --fast -p 100 --passthrough -1 /mnt/md1/DATA_md1/Data_Marc_RNAseq_2024/P50_LH1and2/flash/flash.notCombined_1.fastq -2 /mnt/md1/DATA_md1/Data_Marc_RNAseq_2024/P50_LH1and2/flash/flash.notCombined_2.fastq -U /mnt/md1/DATA_md1/Data_Marc_RNAseq_2024/P50_LH1and2/flash/flash.extendedFrags.fastq"
24061654 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
22492206 + 0 mapped (93.48% : N/A)
12342028 + 0 paired in sequencing
6171014 + 0 read1
6171014 + 0 read2
10326710 + 0 properly paired (83.67% : N/A)
11134274 + 0 with itself and mate mapped
56254 + 0 singletons (0.46% : N/A)
421598 + 0 with mate mapped to a different chr
15942 + 0 with mate mapped to a different chr (mapQ>=5)

Using the bowtie of vrap to map the reads on ref_genome/reference.fasta (The reference refers to the closest related genome found from the list generated by vrap)

vrap/vrap.py  -1 trimmed/P50_LH1and2_R1.fq.gz -2 trimmed/P50_LH1and2_R2.fq.gz -o P50_LH1and2_on_CP015646 --host CP015646.fasta -t 100 -l 200 -g
cd bowtie
mv mapped mapped.sam
samtools view -S -b mapped.sam > mapped.bam
samtools sort mapped.bam -o mapped_sorted.bam
samtools index mapped_sorted.bam
samtools view -H mapped_sorted.bam
samtools flagstat mapped_sorted.bam
#24061651 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
541762 + 0 mapped (2.25% : N/A)
12342022 + 0 paired in sequencing
6171011 + 0 read1
6171011 + 0 read2
382256 + 0 properly paired (3.10% : N/A)
388562 + 0 with itself and mate mapped
22349 + 0 singletons (0.18% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

bamCoverage -b mapped_sorted.bam -o ../../P50_LH1and2_on_CP015646_reads_coverage.bw

samtools depth -a mapped_sorted.bam | awk '{sum+=$3; count++} END {if (count > 0) print sum/count; else print 0}'
#21.4043

samtools depth -a mapped_sorted.bam | awk '{sum+=$3; count++} END {print sum/count}'
samtools depth -a mapped_sorted.bam | awk -v len=2742807 '{sum+=$3} END {print sum/len}'
#21.4043
samtools depth mapped_sorted.bam | awk '$3 > 0 {count++} END {print count}'
samtools depth mapped_sorted.bam | awk '$3 > 10 {count++} END {print count}'

python ~/Scripts/calculate_coverage.py mapped_sorted.bam
#Average coverage: 24.41

Report

I hope this email finds you well. Below is a summary of the analysis results based on the sequencing data:

    Host and Bacterial Composition:
        * The human genome accounted for 22,492,206 reads out of the total 24,061,654 reads (93.48%).

    Genome Assembly and Annotation:
        * After removing the human reads, we assembled the bacterial genome.
        * For annotation, we used the S. aureus user database (S.aureus_user_db), which includes 3,478 complete S. aureus genomes available in GenBank as of January 10, 2025.
        * The annotation summary is provided in the attached file assembly_summary.xlsx. The sequence CP015646.1 appears to be the closest genome for the data, as indicated in the attached table (highlighted in yellow).

    Mapping and Coverage:
        * Using CP015646.1 as the reference genome, 541,762 reads (2.25%) were mapped to this reference.
        * The average coverage is approximately 21x, with 951,154 positions out of 2,742,807 in the reference genome covered by at least 10 reads.
        * The attached figure xxx.png shows the coverage of reads mapped to CP015646.1.

Conclusion:
The analysis suggests that the majority of reads originate from the human genome, and the S. aureus reads are insufficient for a robust RNA-seq analysis.

Calculate the percentages of reads are human rRNA and human mRNA, see the attached review "Current concepts, advances, and challenges in deciphering the human microbiota with metatranscriptomics".

run nextflow set human as reference

(OPTION_1, NOT_USED) Use Annotation Tools: Annotation pipelines like FeatureCounts or HTSeq can classify reads based on genomic annotations (e.g., GTF/GFF files). Use a comprehensive human genome annotation to distinguish between mRNA and rRNA regions.

        diff /mnt/nvme0n1p1/ref/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa /home/jhuang/REFs/genome.fa

        featureCounts -a /mnt/nvme0n1p1/ref/Homo_sapiens/UCSC/hg38/gencode.v27.annotation.gtf -o counts.txt -t exon -g gene_type aligned.bam

Here, gene_type in the GTF file may distinguish mRNA and rRNA features.

/mnt/nvme0n1p1/ref/Homo_sapiens/UCSC/hg38/gencode.v27.annotation.gtf

(OPTION_2, USED, under host_env) /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/WholeGenomeFasta/genome.fa" --gtf "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Annotation/Genes/genes.gtf"   --star_index "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/STARIndex/"    -profile docker -resume  --max_cpus 55 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'
#--pseudo_aligner 'salmon'

'GRCh38' {
    fasta       = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/WholeGenomeFasta/genome.fa"
    bwa         = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/BWAIndex/version0.6.0/"
    bowtie2     = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/Bowtie2Index/"
    star        = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/STARIndex/"
    bismark     = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/BismarkIndex/"
    gtf         = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Annotation/Genes/genes.gtf"
    bed12       = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Annotation/Genes/genes.bed"
    mito_name   = "chrM"
    macs_gsize  = "2.7e9"
    blacklist   = "/mnt/nvme0n1p1/ref/Homo_sapiens/UCSC/hg38/blacklists/hg38-blacklist.bed"

Report

Indeed, until now I haven’t had direct experience with analyzing RNA-seq data from metatranscriptomic samples, as my expertise has primarily been with data derived from cultured samples. Ideally, sequencing from cultured samples would be the optimal approach.

However, if RNA-seq from metatranscriptomic samples becomes necessary, it would be crucial to deplete host (human) mRNA, host (human) rRNA, and microbial rRNA before sequencing (refer to the attached review PIIS0168952523001270.pdf, "Current concepts, advances, and challenges in deciphering the human microbiota with metatranscriptomics"). This step is important for enriching bacterial mRNA and could potentially enhance the quality of sequencing data.

This morning, I explored the origins of reads within the human genome. Despite over 20 million reads originating from humans, only a small fraction successfully mapped to unique positions (83,500 reads). The remaining reads failed due to reasons such as Unassigned_MultiMapping (the most common issue) and Unassigned_NoFeatures. Among the 83,500 mapped reads, only 0.01% originated from rRNA. For more detailed biotype analysis, please refer to featurecounts_biotype_plot-1.png.

Based on these findings, depleting human mRNA appears to be the most critical step for improvement.

like unlike

点赞本文的读者

还没有人对此文章表态

本文有评论

没有评论

Deciphering S. aureus with metatranscriptomics (Marc's project)

本文有评论

看文章，发评论，不要沉默

最受欢迎文章

最新文章

最多评论文章

推荐相似文章