RNA-seq Tam on CP059040.1 (Acinetobacter baumannii strain ATCC 19606)

gene_x 0 like s 992 view s

Tags: processing, pipeline

http://xgenes.com/article/article-content/209/rna-seq-skin-organoids-on-grch38-chrhsv1-final/ http://xgenes.com/article/article-content/157/prepare-virus-gtf-for-nextflow-run/

Methods

Data was processed using nf-core/rnaseq v3.12.0 (doi: https://doi.org/10.5281/zenodo.1400710) of the nf-core collection of workflows (Ewels et al., 2020).

The pipeline was executed with Nextflow v22.10.5 (Di Tommaso et al., 2017) with the following command:

nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results --fasta /home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.fasta --gff /home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.gff -profile docker -resume --max_cpus 55 --max_memory 512.GB --max_time 2400.h --save_align_intermeds --save_unaligned --save_reference --aligner star_salmon --gtf_group_features gene_id --gtf_extra_attributes gene_name --featurecounts_group_type gene_biotype --featurecounts_feature_type transcript

prepare reference

They are wildtype strains grown in different medium.
AUM - artificial urine medium
Urine - human urine
MHB - Mueller-Hinton broth

mkdir raw_data; cd raw_data
ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-1/AUM-1_1.fq.gz AUM_r1_R1.fq.gz
ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-1/AUM-1_2.fq.gz AUM_r1_R2.fq.gz
ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-2/AUM-2_1.fq.gz AUM_r2_R1.fq.gz
ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-2/AUM-2_2.fq.gz AUM_r2_R2.fq.gz
ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-3/AUM-3_1.fq.gz AUM_r3_R1.fq.gz
ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-3/AUM-3_2.fq.gz AUM_r3_R2.fq.gz

ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-1/MHB-1_1.fq.gz MHB_r1_R1.fq.gz
ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-1/MHB-1_2.fq.gz MHB_r1_R2.fq.gz
ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-2/MHB-2_1.fq.gz MHB_r2_R1.fq.gz
ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-2/MHB-2_2.fq.gz MHB_r2_R2.fq.gz
ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-3/MHB-3_1.fq.gz MHB_r3_R1.fq.gz
ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-3/MHB-3_2.fq.gz MHB_r3_R2.fq.gz

ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-1/Urine-1_1.fq.gz Urine_r1_R1.fq.gz
ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-1/Urine-1_2.fq.gz Urine_r1_R2.fq.gz
ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-2/Urine-2_1.fq.gz Urine_r2_R1.fq.gz
ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-2/Urine-2_2.fq.gz Urine_r2_R2.fq.gz
ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-3/Urine-3_1.fq.gz Urine_r3_R1.fq.gz
ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-3/Urine-3_2.fq.gz Urine_r3_R2.fq.gz

(Optional) using trinity to find the most closely reference

Trinity --seqType fq --max_memory 50G --left trimmed/wt_r1_R1.fastq.gz  --right trimmed/wt_r1_R2.fastq.gz --CPU 12

#https://www.genome.jp/kegg/tables/br08606.html#prok
acb     KGB     Acinetobacter baumannii ATCC 17978  2007    GenBank
abm     KGB     Acinetobacter baumannii SDF     2008    GenBank
aby     KGB     Acinetobacter baumannii AYE     2008    GenBank
abc     KGB     Acinetobacter baumannii ACICU   2008    GenBank
abn     KGB     Acinetobacter baumannii AB0057  2008    GenBank
abb     KGB     Acinetobacter baumannii AB307-0294  2008    GenBank
abx     KGB     Acinetobacter baumannii 1656-2  2012    GenBank
abz     KGB     Acinetobacter baumannii MDR-ZJ06    2012    GenBank
abr     KGB     Acinetobacter baumannii MDR-TJ  2012    GenBank
abd     KGB     Acinetobacter baumannii TCDC-AB0715     2012    GenBank
abh     KGB     Acinetobacter baumannii TYTH-1  2012    GenBank
abad    KGB     Acinetobacter baumannii D1279779    2013    GenBank
abj     KGB     Acinetobacter baumannii BJAB07104   2013    GenBank
abab    KGB     Acinetobacter baumannii BJAB0715    2013    GenBank
abaj    KGB     Acinetobacter baumannii BJAB0868    2013    GenBank
abaz    KGB     Acinetobacter baumannii ZW85-1  2013    GenBank
abk     KGB     Acinetobacter baumannii AbH12O-A2   2014    GenBank
abau    KGB     Acinetobacter baumannii AB030   2014    GenBank
abaa    KGB     Acinetobacter baumannii AB031   2014    GenBank
abw     KGB     Acinetobacter baumannii AC29    2014    GenBank
abal    KGB     Acinetobacter baumannii LAC-4   2015    GenBank

Downloading CP059040.fasta and CP059040.gff from GenBank

(Optional) Preparing CP059040.fasta, CP059040_gene.gff3 and CP059040.bed

#Reference genome: https://www.ncbi.nlm.nih.gov/nuccore/CP059040
cp /media/jhuang/Elements2/Data_Tam_RNASeq3/CP059040.fasta .     # Elements (Anna C.arnes)
cp /media/jhuang/Elements2/Data_Tam_RNASeq3/CP059040_gene.gff3 .
cp /media/jhuang/Elements2/Data_Tam_RNASeq3/CP059040_gene.gtf .
cp /media/jhuang/Elements2/Data_Tam_RNASeq3/CP059040.bed .
rsync -a -P CP059040.fasta jhuang@hamm:~/DATA/Data_Tam_RNAseq_2024/
rsync -a -P CP059040_gene.gff3 jhuang@hamm:~/DATA/Data_Tam_RNAseq_2024/
rsync -a -P CP059040.bed jhuang@hamm:~/DATA/Data_Tam_RNAseq_2024/
(base) jhuang@WS-2290C:/media/jhuang/Elements2/Data_Tam_RNASeq3$ find . -name "CP059040*"
./CP059040.fasta
./CP059040.bed
./CP059040.gb
./CP059040.gff3
./CP059040.gff3_backup
./CP059040_full.gb
./CP059040_gene.gff3
./CP059040_gene.gtf
./CP059040_gene_old.gff3
./CP059040_rRNA.gff3
./CP059040_rRNA_v.gff3

# ---- REF: Acinetobacter baumannii ATCC 17978 (DEBUG, gene_name failed) ----
#gffread -E -F -T GCA_000015425.1_ASM1542v1_genomic.gff -o GCA_000015425.1_ASM1542v1_genomic.gtf_
#grep "CDS" GCA_000015425.1_ASM1542v1_genomic.gtf_ > GCA_000015425.1_ASM1542v1_genomic.gtf
#sed -i -e "s/\tCDS\t/\texon\t/g" GCA_000015425.1_ASM1542v1_genomic.gtf
#gffread -E -F --bed GCA_000015425.1_ASM1542v1_genomic.gtf -o GCA_000015425.1_ASM1542v1_genomic.bed

grep "locus_tag" GCA_000015425.1_ASM1542v1_genomic.gtf_ > GCA_000015425.1_ASM1542v1_genomic.gtf
sed -i -e "s/\ttranscript\t/\texon\t/g" GCA_000015425.1_ASM1542v1_genomic.gtf # or using fc_count_type=transcript
sed -i -e "s/\tgene_name\t/\tName\t/g" GCA_000015425.1_ASM1542v1_genomic.gtf
gffread -E -F --bed GCA_000015425.1_ASM1542v1_genomic.gtf -o GCA_000015425.1_ASM1542v1_genomic.bed
#grep "gene_name" GCA_000015425.1_ASM1542v1_genomic.gtf | wc -l  #69=3887-3803

cp CP059040.gff3 CP059040_backup.gff3
sed -i -e "s/\tGenbank\tgene\t/\tGenbank_gene\t/g" CP059040.gff3
grep "Genbank_gene" CP059040.gff3 > CP059040_gene.gff3
sed -i -e "s/\tGenbank_gene\t/\tGenbank\tgene\t/g" CP059040_gene.gff3

#3796-3754=42--> they are pseudogene since grep "pseudogene" CP059040.gff3 | wc -l = 42
# --------------------------------------------------------------------------------------------------------------------------------------------------
# ---------- PREPARING gff3 file including gene_biotype=protein_coding+gene_biotype=tRNA = total(3754)) and gene_biotype=pseudogene(42) ------------
cp CP059040.gff3 CP059040_backup.gff3
sed -i -e "s/\tGenbank\tgene\t/\tGenbank_gene\t/g" CP059040.gff3
grep "Genbank_gene" CP059040.gff3 > CP059040_gene.gff3
sed -i -e "s/\tGenbank_gene\t/\tGenbank\tgene\t/g" CP059040_gene.gff3
grep "gene_biotype=pseudogene" CP059040.gff3_backup >> CP059040_gene.gff3    #-->3796

#The whole point of the GTF format was to standardise certain aspects that are left open in GFF. Hence, there are many different valid ways to encode the same information in a valid GFF format, and any parser or converter needs to be written specifically for the choices the author of the GFF file made. For example, a GTF file requires the gene ID attribute to be called "gene_id", while in GFF files, it may be "ID", "Gene", something different, or completely missing.
# from gff3 to gtf
sed -i -e "s/\tID=gene-/\tgene_id \"/g" CP059040_gene.gtf
sed -i -e "s/;/\"; /g" CP059040_gene.gtf
sed -i -e "s/=/=\"/g" CP059040_gene.gtf

#sed -i -e "s/\n/\"\n/g" CP059040_gene.gtf
#using editor instead!

#The following is GTF-format.
CP000521.1      Genbank exon    95      1492    .       +       .       transcript_id "gene0"; gene_id "gene0"; Name "A1S_0001"; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "A1S_0001";

#NZ_MJHA01000001.1       RefSeq  region  1       8663    .       +       .       ID=id0;Dbxref=taxon:575584;Name=unnamed1;collected-by=IG Schaub;collection-date=1948;country=USA: Vancouver;culture-collection=ATCC:19606;gbkey=Src;genome=plasmid;isolation-source=urine;lat-lon=37.53 N 75.4 W;map=unlocalized;mol_type=genomic DNA;nat-host=Homo sapiens;plasmid-name=unnamed1;strain=ATCC 19606;type-material=type strain of Acinetobacter baumannii
#NZ_MJHA01000001.1       RefSeq  gene    228     746     .       -       .       ID=gene0;Name=BIT33_RS00005;gbkey=Gene;gene_biotype=protein_coding;locus_tag=BIT33_RS00005;old_locus_tag=BIT33_18795
#NZ_MJHA01000001.1       Protein Homology        CDS     228     746     .       -       0       ID=cds0;Parent=gene0;Dbxref=Genbank:WP_000839337.1;Name=WP_000839337.1;gbkey=CDS;inference=COORDINATES: similar to AA sequence:RefSeq:WP_000839337.1;product=hypothetical protein;protein_id=WP_000839337.1;transl_table=11

##gff-version 3
##sequence-region CP059040.1 1 3980852
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=470

gffread -E -F --bed CP059040.gff3 -o CP059040.bed    #-->3796
##prepare the GTF-format (see above) --> ERROR! ----> using CP059040.gff3
##stringtie adeIJ.abx_r1.sorted.bam -o adeIJ.abx_r1.sorted_transcripts.gtf -v -G /media/jhuang/Elements/Data_Tam_RNASeq3/CP059040.gff3 -A adeIJ.abx_r1.sorted.gene_abund.txt -C adeIJ.abx_r1.sorted.bam.cov_refs.gtf -e -b adeIJ.abx_r1.sorted_ballgown
#[01/21 10:57:46] Loading reference annotation (guides)..
#GFF warning: merging adjacent/overlapping segments of gene-H0N29_00815 on CP059040.1 (179715-179786, 179788-180810)
#[01/21 10:57:46] 3796 reference transcripts loaded.
#Default stack size for threads: 8388608
#WARNING: no reference transcripts found for genomic sequence "gi|1906906720|gb|CP059040.1|"! (mismatched reference names?)
#WARNING: no reference transcripts were found for the genomic sequences where reads were mapped!
#Please make sure the -G annotation file uses the same naming convention for the genome sequences.
#[01/21 10:58:30] All threads finished.

#  ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF file.
#  The specified gene identifier attribute is 'Name'
#  An example of attributes included in your GTF annotation is 'ID=exon-H0N29_00075-1;Parent=rna-H0N29_00075;gbkey=rRNA;locus_tag=H0N29_00075;product=16S ribosomal RNA'
#  The program has to termin

#  ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF file.
#  The specified gene identifier attribute is 'gene_biotype'
#  An example of attributes included in your GTF annotation is 'ID=exon-H0N29_00075-1;Parent=rna-H0N29_00075;gbkey=rRNA;locus_tag=H0N29_00075;product=16S ribosomal RNA'
#  The program has to terminate.

#grep "ID=cds-" CP059040.gff3 | wc -l
#grep "ID=exon-" CP059040.gff3 | wc -l
#grep "ID=gene-" CP059040.gff3 | wc -l   #the same as H0N29_18980/5=3796
grep "gbkey=" CP059040.gff3 | wc -l  7695
grep "ID=id-" CP059040.gff3 | wc -l  5
grep "locus_tag=" CP059040.gff3 | wc -l    7689
#...
cds   3701                             locus_tag=xxxx, no gene_biotype
exon   96                              locus_tag=xxxx, no gene_biotype
gene   3796                            locus_tag=xxxx, gene_biotype=xxxx,
id  (riboswitch+direct_repeat,5)       both no --> ignoring them!!  # grep "ID=id-" CP059040.gff3
rna    96                              locus_tag=xxxx, no gene_biotype
------------------
    7694

cp CP059040.gff3_backup CP059040.gff3
grep "^##" CP059040.gff3 > CP059040_gene.gff3
grep "ID=gene" CP059040.gff3 >> CP059040_gene.gff3
#!!!!VERY_IMPORTANT!!!!: change type '\tCDS\t' to '\texon\t'!
sed -i -e "s/\tgene\t/\texon\t/g" CP059040_gene.gff3

Preparing the directory trimmed

mkdir trimmed trimmed_unpaired;
for sample_id in AUM_r1 AUM_r2 AUM_r3 Urine_r1 Urine_r2 Urine_r3 MHB_r1 MHB_r2 MHB_r3; do \
for sample_id in MHB_r1 MHB_r2 MHB_r3; do \
        java -jar /home/jhuang/Tools/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 100 raw_data/${sample_id}_R1.fq.gz raw_data/${sample_id}_R2.fq.gz trimmed/${sample_id}_R1.fq.gz trimmed_unpaired/${sample_id}_R1.fq.gz trimmed/${sample_id}_R2.fq.gz trimmed_unpaired/${sample_id}_R2.fq.gz ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 AVGQUAL:20; done 2> trimmomatic_pe.log;
done

Preparing samplesheet.csv

sample,fastq_1,fastq_2,strandedness
AUM_r1,AUM_r1_R1.fq.gz,AUM_r1_R2.fq.gz,auto
AUM_r2,AUM_r2_R1.fq.gz,AUM_r2_R2.fq.gz,auto
AUM_r3,AUM_r3_R1.fq.gz,AUM_r3_R2.fq.gz,auto
MHB_r1,MHB_r1_R1.fq.gz,MHB_r1_R2.fq.gz,auto
MHB_r2,MHB_r2_R1.fq.gz,MHB_r2_R2.fq.gz,auto
MHB_r3,MHB_r3_R1.fq.gz,MHB_r3_R2.fq.gz,auto
Urine_r1,Urine_r1_R1.fq.gz,Urine_r1_R2.fq.gz,auto
Urine_r2,Urine_r2_R1.fq.gz,Urine_r2_R2.fq.gz,auto
Urine_r3,Urine_r3_R1.fq.gz,Urine_r3_R2.fq.gz,auto

nextflow run

#Example1: http://xgenes.com/article/article-content/157/prepare-virus-gtf-for-nextflow-run/

# ---- SUCCESSFUL with directly downloaded gff3 and fasta from NCBI using docker ----
docker pull nfcore/rnaseq
ln -s /home/jhuang/Tools/nf-core-rnaseq-3.12.0/ rnaseq
(host_env) jhuang@WS-2290C:~/DATA/Data_Tam_RNAseq_2024$ /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.fasta" --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.gff"        -profile docker -resume  --max_cpus 55 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'
/usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.fasta" --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.gff"        -profile docker -resume  --max_cpus 55 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'

# -- DEBUG_1 --
#After checking the record (see below) in results/genome/CP059040.gtf, we have to change 'exon' to 'transcript', the default values are --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'exon'
#In ./results/genome/CP059040.gtf e.g. "CP059040.1      Genbank transcript      1       1398    .       +       .       transcript_id "gene-H0N29_00005"; gene_id "gene-H0N29_00005"; gene_name "dnaA"; Name "dnaA"; gbkey "Gene"; gene "dnaA"; gene_biotype "protein_coding"; locus_tag "H0N29_00005";"

like unlike

点赞本文的读者

还没有人对此文章表态

本文有评论

没有评论

RNA-seq Tam on CP059040.1 (Acinetobacter baumannii strain ATCC 19606)

本文有评论

看文章，发评论，不要沉默

最受欢迎文章

最新文章

最多评论文章

推荐相似文章