Prepare virus GTF for nextflow run

gene_x 0 like s 997 view s

Tags: pipeline, RNA-seq

Prepare GTF for non-model virus

The gffread command you're using is designed to convert GFF format files to GTF format, but it doesn't necessarily preserve all the attribute information. The -T option enforces creation of gene_id and transcript_id attributes, which are mandatory in GTF format, and gffread takes these from the ID and Parent fields of the input GFF file, respectively.
The GTF format is simpler than GFF and doesn't accommodate all the possible attributes of a GFF file. That's why you're seeing a reduction in information in your converted file.

If you need to retain all information from the GFF file, you may need to do some post-processing to add the extra attributes back into the GTF file. However, keep in mind that downstream tools which expect GTF format may not correctly handle extra attributes.

# -- Deprecated processing for virus gtf --
#NOT_USED, since it changed a lot!
#gffread X14112.1.gff -T -o X14112.1.gtf
cp X14112.1.gff3 X14112.1.gff3_backup
grep "^##" X14112.1.gff3 > X14112.1_gene.gff3
grep "ID=gene" X14112.1.gff3 >> X14112.1_gene.gff
#!!!!VERY_IMPORTANT!!!!: change type '\tgene\t' to '\texon\t'! 
#sed -i -e "s/\tgene\t/\texon\t/g" X14112.1_gene_.gff # since default is --featurecounts_feature_type 'exon'

# -- New processing for virus gtf --
gffread X14112.1_orig.gff -T -o X14112.1_v2.gtf

python3 add_gene_id.py  # X14112.1_v2.gtf --> X14112.1_v3.gtf
#------------------------------------
def add_missing_gene_id(input_gtf, output_gtf):
    with open(input_gtf, 'r') as in_gtf, open(output_gtf, 'w') as out_gtf:
        for line in in_gtf:
            if not line.startswith('#'):  # Skip header lines
                elements = line.strip().split('\t')
                attributes = elements[8]
                if 'gene_id' not in attributes:
                    # Extract transcript_id
                    transcript_id = ''
                    for attr in attributes.split(';'):
                        if 'transcript_id' in attr:
                            transcript_id = attr.strip()
                    # Prepend transcript_id as gene_id if not empty
                    if transcript_id != '':
                        attributes = f'{transcript_id.replace("transcript_id", "gene_id")}; {attributes}'
                elements[8] = attributes
                line = '\t'.join(elements)
            out_gtf.write(line + '\n')
# Use the function
input_gtf = 'X14112.1_v2.gtf'  # Path to your input GTF
output_gtf = 'X14112.1_v3.gtf'  # Path to the output GTF
add_missing_gene_id(input_gtf, output_gtf)

Human herpesvirus 1, also known as Herpes Simplex Virus type 1 (HSV-1), is a virus with a complex genome encoding around 70-80 genes. The number of genes can vary slightly depending on the specific strain of HSV-1, as well as the methodologies used to identify and annotate the genes.
IE175, also known as ICP4 (Infected Cell Polypeptide 4), is a protein encoded by the Human herpesvirus 1 (HSV-1). The gene for this protein is also referred to as the IE (immediate early) gene 3, and the protein it encodes is a major regulatory protein.
In the lifecycle of HSV-1, immediate early genes are the first set of genes to be transcribed following infection. The proteins produced from these genes then regulate the expression of early and late genes that are involved in viral DNA replication and the production of viral structural proteins.
ICP4, in particular, is essential for the onset of viral replication. It acts as a trans-activator, promoting transcription of other viral genes. It can also interact with host cell proteins and influence host gene expression. As a result of these functions, ICP4 plays a key role in the pathogenesis of HSV-1 infection.

Please note that the naming convention for viral genes and proteins can sometimes be inconsistent, with multiple names referring to the same gene or protein. IE175, ICP4, and IE gene 3 all refer to the same gene in HSV-1.

# Delete the records if they are intron or manually add gene_name to the records without gene_name.

cp X14112.1_v3.gtf X14112.1_v4.gtf
#Find all recoreds without "gene_name"
grep -v "gene_name" X14112.1_v4.gtf

#-->Delete intron records: grep "intron" X14112.1.gff3_orig
DEL X14112.1        EMBL    transcript      4953    6907    .       -       .       transcript_id "id-X14112.1:4953..6907"; gene_id "id-X14112.1:4953..6907"
DEL X14112.1        EMBL    exon    4953    6907    .       -       .       gene_id "id-X14112.1:4953..6907"; transcript_id "id-X14112.1:4953..6907";
DEL X14112.1        EMBL    transcript      132374  132539  .       +       .       transcript_id "id-X14112.1:132374..132539"; gene_id "id-X14112.1:132374..132539"
DEL X14112.1        EMBL    exon    132374  132539  .       +       .       gene_id "id-X14112.1:132374..132539"; transcript_id "id-X14112.1:132374..132539";
DEL X14112.1        EMBL    transcript      145649  145860  .       -       .       transcript_id "id-X14112.1:145649..145860"; gene_id "id-X14112.1:145649..145860"
DEL X14112.1        EMBL    exon    145649  145860  .       -       .       gene_id "id-X14112.1:145649..145860"; transcript_id "id-X14112.1:145649..145860";

# or update: grep "146805" X14112.1_orig.gff
UPDATE X14112.1        EMBL    transcript      146805  151063  .       +       .       transcript_id "rna-X14112.1:146805..151063"; gene_id "rna-X14112.1:146805..151063"
UPDATE X14112.1        EMBL    exon    146805  151063  .       +       .       gene_id "rna-X14112.1:146805..151063"; transcript_id "rna-X14112.1:146805..151063";
                                                                                  --> transcript_id "rna-IE175"; gene_id "gene-IE175"; gene_name "IE175";                                                  --> transcript_id "rna-IE175"; gene_id "gene-IE175"; gene_name "IE175";

# or update: grep "133941" X14112.1_orig.gff
UPDATE X14112.1        EMBL    transcript      133941  146107  .       -       .       transcript_id "rna-X14112.1:133941..146107"; gene_id "rna-X14112.1:133941..146107"
UPDATE X14112.1        EMBL    exon    133941  145648  .       -       .       gene_id "rna-X14112.1:133941..146107"; transcript_id "rna-X14112.1:133941..146107";
UPDATE X14112.1        EMBL    exon    145861  146107  .       -       .       gene_id "rna-X14112.1:133941..146107"; transcript_id "rna-X14112.1:133941..146107";
                                                                                  --> transcript_id "rna-IE68"; gene_id "rna-IE68"; gene_name "IE68";
                                                                                  --> gene_id "rna-IE68"; transcript_id "rna-IE68"; gene_name "IE68";
                                                                                  --> gene_id "rna-IE68"; transcript_id "rna-IE68"; gene_name "IE68";

(optional) consider to update all exon and CDS with different names! for example exon-RL2-1, exon-RL2-2, cds-RL2-1. Maybe it is not nessary, since the output contains only transcript-type!

Run nexflow for virus

docker pull nfcore/rnaseq
/usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_virus    --fasta "/home/jhuang/DATA/Data_Manja_RNAseq_Organoids_Virus/X14112.1.fasta" --gtf "/home/jhuang/DATA/Data_Manja_RNAseq_Organoids_Virus/X14112.1_v4.gtf"   --with_umi --umitools_extract_method "regex" --umitools_bc_pattern "^(?P<umi_1>.{12}).*" --umitools_dedup_stats    --skip_rseqc --skip_dupradar --skip_preseq    -profile docker -resume  --max_cpus 55 --max_memory 120.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'hisat2'    --gtf_extra_attributes 'gene_name' --gtf_group_features 'gene_id' --featurecounts_group_type 'gene_name' --featurecounts_feature_type 'exon'    --umitools_grouping_method 'unique'

Run nexflow for human using GRCh38 genome

docker pull nfcore/rnaseq
/usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_GRCh38 --genome GRCh38   --with_umi --umitools_extract_method "regex" --umitools_bc_pattern "^(?P<umi_1>.{12}).*" --umitools_dedup_stats    --skip_rseqc --skip_dupradar --skip_preseq    -profile docker -resume  --max_cpus 55 --max_memory 128.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon' --pseudo_aligner 'salmon'    --gtf_extra_attributes 'gene_name' --gtf_group_features 'gene_id' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'exon'    --umitools_grouping_method 'unique'

3.1. BUG_1 for running d8_r1 due to memory

  # in modules/nf-core/umitools/dedup/main.nf
  process UMITOOLS_DEDUP {
      tag "$meta.id"
      //REMOVED  label "process_medium"
      //ADDED
      label 'high_memory' // this needs to be defined in your config file
      cpus 55 // adjust as per your system's capabilities

  ERROR ~ Module compilation error
  - file : /mnt/h1/jhuang/DATA/Data_Manja_RNAseq_Organoids/rnaseq/./workflows/../subworkflows/nf-core/bam_dedup_stats_samtools_umitools/../../../modules/nf-core/umitools/dedup/main.nf
  - cause: Unexpected character: '#' @ line 3, column 5.
        #label "process_medium"
      ^

3.2. BUG_2 for running d8_r1 due to memory

  # in conf/test_full.config
  process {
    //ADDED
    withLabel: 'high_memory' {
      memory = '120 GB' // adjust as per your system's capabilities
    }
    withName: 'UMITOOLS_DEDUP' {
      time = '160.h' // Adjust the time limit to your needs
    }
  }

  ERROR ~ Error executing process > 'NFCORE_RNASEQ:RNASEQ:BAM_DEDUP_STATS_SAMTOOLS_UMITOOLS_TRANSCRIPTOME:UMITOOLS_DEDUP (control_r2)'
  Caused by:
    Process requirement exceeds available memory -- req: 128 GB; avail: 125.8 GB
  Command executed:
    PYTHONHASHSEED=0 umi_tools \
        dedup \
        -I control_r2.transcriptome.sorted.bam \
        -S control_r2.umi_dedup.transcriptome.sorted.bam \
        --output-stats control_r2.umi_dedup.transcriptome.sorted \
        --method='unique' --random-seed=100
    cat <<-END_VERSIONS > versions.yml
    "NFCORE_RNASEQ:RNASEQ:BAM_DEDUP_STATS_SAMTOOLS_UMITOOLS_TRANSCRIPTOME:UMITOOLS_DEDUP":
        umitools: $(umi_tools --version 2>&1 | sed 's/^.*UMI-tools version://; s/ *$//')
    END_VERSIONS
  Command exit status:

R-code for evaluation of nextflow outputs

# Import the required libraries
library("AnnotationDbi")
library("clusterProfiler")
library("ReactomePA")
library(gplots)

library(tximport)
library(DESeq2)

setwd("~/DATA/Data_Manja_RNAseq_Organoids/results_GRCh38_unique_9samples/star_salmon")

# Define paths to your Salmon output quantification files
files <- c("control_r1" = "./control_r1/quant.sf",
          "control_r2" = "./control_r2/quant.sf",
          "HSV.d2_r1" = "./HSV.d2_r1/quant.sf",
          "HSV.d2_r2" = "./HSV.d2_r2/quant.sf",
          "HSV.d4_r1" = "./HSV.d4_r1/quant.sf",
          "HSV.d4_r2" = "./HSV.d4_r2/quant.sf",
          "HSV.d6_r1" = "./HSV.d6_r1/quant.sf",
          "HSV.d6_r2" = "./HSV.d6_r2/quant.sf",
          "HSV.d8_r2" = "./HSV.d8_r2/quant.sf")

# Import the transcript abundance data with tximport
txi <- tximport(files, type = "salmon", txIn = TRUE, txOut = TRUE)

# Define the replicates and condition of the samples
replicate <- factor(c("r1", "r2", "r1", "r2", "r1", "r2", "r1", "r2", "r2"))
condition <- factor(c("control", "control", "HSV.d2", "HSV.d2", "HSV.d4", "HSV.d4", "HSV.d6", "HSV.d6", "HSV.d8"))

# Define the colData for DESeq2
colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files))

# Create DESeqDataSet object
dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition)

# In the context of your new code which is using tximport and DESeq2, you don't necessarily need this step. The reason is that DESeq2 performs its own filtering of low-count genes during the normalization and differential expression steps.
# Filter data to retain only genes with more than 2 counts > 3 across all samples
# dds <- dds[rowSums(counts(dds) > 3) > 2, ]

# Run DESeq2
dds <- DESeq(dds)

# Perform rlog transformation
rld <- rlogTransformation(dds)

# Output raw count data to a CSV file
write.csv(counts(dds), file="transcript_counts.csv")

# -- gene-level count data --
# Read in the tx2gene map from salmon_tx2gene.tsv
#tx2gene <- read.csv("salmon_tx2gene.tsv", sep="\t", header=FALSE)
tx2gene <- read.table("salmon_tx2gene.tsv", header=FALSE, stringsAsFactors=FALSE)

# Set the column names
colnames(tx2gene) <- c("transcript_id", "gene_id", "gene_name")

# Remove the gene_name column if not needed
tx2gene <- tx2gene[,1:2]

# Import and summarize the Salmon data with tximport
txi <- tximport(files, type = "salmon", tx2gene = tx2gene, txOut = FALSE)

# Continue with the DESeq2 workflow as before...
colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files))
dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition)
#dds <- dds[rowSums(counts(dds) > 3) > 2, ]    #60605-->26543
dds <- DESeq(dds)
rld <- rlogTransformation(dds)
write.csv(counts(dds, normalized=FALSE), file="gene_counts.csv")

#TODO: why a lot of reads were removed due to the too_short?
STAR --runThreadN 4 --genomeDir /path/to/GenomeDir --readFilesIn /path/to/read1.fastq /path/to/read2.fastq --outFilterMatchNmin 50 --outSAMtype BAM SortedByCoordinate --outFileNamePrefix /path/to/output

dim(counts(dds))
head(counts(dds), 10)

WHY: too much "Unmapped: too short" in STAR using nextflow, I read the STAR manual, the default values are 0, it means it will never discard reads due to too short. ASSUMPTION: the umi_tools dedup counts as "Unmapped: too short". Check the intermediate bam files.

Indeed, you're right that according to the STAR manual, the default settings for read length filtering parameters (--outFilterScoreMinOverLread, --outFilterMatchNminOverLread, and --outFilterMatchNmin) are set to 0, which means reads will not be discarded for being too short.

The "Unmapped: too short" statistic in the STAR output usually refers to reads that didn't map well enough to the reference genome. In other words, either the read didn't match sufficiently to any location in the genome, or the best matches were not significantly better than other matches to be unique.

If you're seeing a high proportion of "Unmapped: too short" reads, it could be due to several reasons:

The quality of the reads might be poor, leading to low mapping efficiency.
The reference genome might not be the correct one for your data.
If your data is single-end reads, it might contain a high proportion of reads shorter than the default minimum allowed length.

For troubleshooting, you could:

Check the quality of your reads using a tool like FastQC. * Ensure you're using the correct reference genome. * If you're working with single-end data, consider checking the length distribution of your reads. If many reads are shorter than the minimum length threshold, consider adjusting this threshold if appropriate. * If you're working with paired-end data, make sure both of the reads in a pair pass the quality controls. STAR is sensitive to inconsistencies between paired reads.

Finally, always make sure your read preprocessing steps (like trimming for quality and adapter sequences) are performed correctly as these can greatly affect the downstream analysis.

like unlike

点赞本文的读者

还没有人对此文章表态

本文有评论

没有评论

Prepare virus GTF for nextflow run

本文有评论

看文章，发评论，不要沉默

最受欢迎文章

最新文章

最多评论文章

推荐相似文章