RNA-Seq Analysis of the WaGa samples

## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows

1 Libraries

Let’s start by loading all of the libraries that will be used throughout this analysis.

## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.

2 Load RNA-Seq counts

2.1 Samples

RNA-Seq samples.
sample_id condition replicate batch
1107_WaGa_sT_DMSO_EVAligned.sortedByCoord.out.bam sT.DMSO.EV rep1 batch2
1605_WaGa_sT_DMSO_EVAligned.sortedByCoord.out.bam sT.DMSO.EV rep2 batch2
2706_WaGa_sT_DMSO_EVAligned.sortedByCoord.out.bam sT.DMSO.EV rep3 batch2
1107_WaGa_scr_DMSO_EVAligned.sortedByCoord.out.bam scr.DMSO.EV rep1 batch2
1605_WaGa_scr_DMSO_EVAligned.sortedByCoord.out.bam scr.DMSO.EV rep2 batch2
2706_WaGa_scr_DMSO_EVAligned.sortedByCoord.out.bam scr.DMSO.EV rep3 batch2
1107_WaGa_sT_Dox_EVAligned.sortedByCoord.out.bam sT.Dox.EV rep1 batch2
1605_WaGa_sT_Dox_EVAligned.sortedByCoord.out.bam sT.Dox.EV rep2 batch2
2706_WaGa_sT_Dox_EVAligned.sortedByCoord.out.bam sT.Dox.EV rep3 batch2
1107_WaGa_scr_Dox_EVAligned.sortedByCoord.out.bam scr.Dox.EV rep1 batch2
1605_WaGa_scr_Dox_EVAligned.sortedByCoord.out.bam scr.Dox.EV rep2 batch2
2706_WaGa_scr_Dox_EVAligned.sortedByCoord.out.bam scr.Dox.EV rep3 batch2
1107_WaGa_wt_EVAligned.sortedByCoord.out.bam wt.EV rep1 batch2
1605_WaGa_wt_EVAligned.sortedByCoord.out.bam wt.EV rep2 batch2
2706_WaGa_wt_EVAligned.sortedByCoord.out.bam wt.EV rep3 batch2
WaGa_RNAAligned.sortedByCoord.out.bam RNA rep1 batch1
WaGa_RNA_118Aligned.sortedByCoord.out.bam RNA rep2 batch1
WaGa_RNA_147Aligned.sortedByCoord.out.bam RNA rep3 batch1

2.2 Load sample information

To make things easier, all of the relevant experiment sample information including sample ID, condition, and batch have been stored in separate CSV files.

During the processing of the analysis configuration above, the relevant sample metadata was loaded. Here we create a few variables containing different subsets of this information that will be useful throughout the downstream analyses.

During the differential expression analysis, we will use a larger subset of the samples in order to improve our estimation of variance and batch effects.

2.3 Load count tables

Next, we will load in the count data for each sample. Throughout our analyses we will apply several different types of filters and data transformations. Since the exact types of normalization and filtering procedures we want to apply may differ, depending on the type of analysis we are interested in (e.g. differential expression vs. co-expression network analysis), for each type of analysis, we will maintain a list with the count tables at each of the major steps in processing. Whenever genes or samples are removed from one version of the count table for that analysis, the same genes and samples will be removed from all other versions of the count table. This way we can ensure that the dimensions of the count matrix is consistent across all levels of processing, within a given analysis.

The different stages of processing which will be maintained are:

  • raw: Unmodified count values as generated by featureCounts, or another quantification tool.
  • rld: rlog transformed.
  • normed: Counts with zero or more of the follow transformations: log2, CPM, Voom, quantile normalization
  • batch_adjusted: Same as above, but with an (optional) batch adjustment step. If “batch_adjust” is set to “none”, this will be the same as the normed version of the counts.
  • final: The final version of the counts after all filtering and transformation steps have been applied.
## Preparing count matrix

[1] “Preparing count matrix”

## Single combined count file
    if (length(count_files) == 1) {
        # Load combined count table
        # TODO: Generalize so that counts stored in arbitrary formats can be loaded
        count_table <- read.table(CONFIG$input_count_tables, header=TRUE,
                                  row.names=1, check.names=FALSE) 
    } else {
        # iterate over per-sample counts and append to count matrix
        count_table <- read.delim(count_files[1], header = FALSE, row.names = 1)
    
        COUNT_COLUMN_IDX <- 2
    
        for (infile in count_files[2:length(count_files)]) {
          sample_counts <- read.delim(infile, header = FALSE)
          count_table <- cbind(count_table, sample_counts[, COUNT_COLUMN_IDX])
        }
    
        # fix column names
        colnames(count_table) <- file_path_sans_ext(basename(count_files))
    
        count_table <- as.matrix(count_table)
    }

    # Make sure featureCounts meta-columns are removed (should be taken care of in
    # recent versions of HPGLtools)
    count_table <- count_table[!grepl('^X?__', rownames(count_table)),]
#}

# Remove ENSEMBL gene id version, if present
# e.g. "ENSG00000005421.8" -> "ENSG00000005421"
if (startsWith(rownames(count_table)[1], 'ENS')) {
    rownames(count_table) <- sub('\\.[0-9]*', '', rownames(count_table))
}

# Exclude any missing samples
#sample_ids <- sample_ids[sample_ids %in% colnames(count_table)]

# Filter out samples not needed for this analysis
count_table <- count_table[, sample_ids]

# Drop any rows with NA's (should only occur when comparing counts across
# experiments where different reference GFF's were used)
count_table <- count_table[complete.cases(count_table),]

# print initial size of raw count table
if (CONFIG$verbose) {
    cat(sprintf("**Count table dimensions:**\n"))
    cat(sprintf("- Rows: %d\n", nrow(count_table)))
    cat(sprintf("- Columns: %d\n", ncol(count_table)))
}

Count table dimensions: - Rows: 60664 - Columns: 18

## [1] 76153959

3 Load annotations

3.1 Load gene annotations

Next, we will use Bioconductor OrganismDb https://www.bioconductor.org/packages/devel/bioc/manuals/OrganismDbi/man/OrganismDbi.pdf packages citep(citation('OrganismDbi')) and the R interface to BioMart citep(c('10.1186/1471-2164-10-22', '10.1038/nprot.2009.97')) to retrieve ENSEMBL gene annotations for the host.

# Load gene and transcript annotations
library(CONFIG$organism_db, character.only=TRUE)
#OrganismDb Object
orgdb <- get(CONFIG$organism_db)

# BUG (Must be fixed!) Fix AnnotationDbi namespace mess
assign('select', dplyr::select, envir=.GlobalEnv)
assign('get',    base::get, envir=.GlobalEnv)

# main ensembl host is down fairly often, so alt hosts may be needed.
gene_info <- load_host_annotations(orgdb, rownames(count_table),
                                   keytype=CONFIG$orgdb_key,
                                   biomart_dataset=CONFIG$biomart_dataset,
                                   biomart_host='www.ensembl.org')
                                   #biomart_host='useast.ensembl.org')
                                   #biomart_host='aug2017.archive.ensembl.org')
                                   #biomart_host='archive.ensembl.org')

#https://bioconductor.org/packages/devel/bioc/vignettes/AnnotationDbi/inst/doc/IntroToAnnotationPackages.pdf
# Keep only the feature information remaining genes
gene_info <- gene_info[gene_info$gene_id %in% rownames(count_table),]

# For now, just grab the description for the first transcript
gene_info <- gene_info[!duplicated(gene_info$gene_id), ]

# Gene IDs
gene_ids <- rownames(count_table)

# Determine approximate transcript lengths for each gene.
# Note that because the annotations have not been filtered to remove ncRNAs,
# many of the genes associated with lncRNAs, etc. will not have a length
# associated with them.
gene_info$transcript_length <- getlength(gene_ids, CONFIG$organism_genome, "ensGene")
# -- DEBUG --
#Can't find hg38/ensGene length data in genLenDataBase...
#A TxDb annotation package exists for hg38. Consider installing it to get the gene lengths.
#Trying to download from UCSC. This might take a couple of minutes.
#if (!require("BiocManager", quietly = TRUE))
#    install.packages("BiocManager")
#BiocManager::install("BSgenome.Hsapiens.UCSC.hg38")
#BiocManager::install("TxDb.Hsapiens.UCSC.hg38.knownGene")
#library("BSgenome.Hsapiens.UCSC.hg38")
#library("TxDb.Hsapiens.UCSC.hg38.knownGene")
#?supportedGenomes()

# Note 2017/12/09
# ---------------
# hg19 still better to use for tx lengths than hg38 for the moment:
# > sum(is.na(getlength(gene_ids, 'hg38', "ensGene")))                                                                                                                                                        
#   [1] 36591
# > sum(is.na(getlength(gene_ids, 'hg19', "ensGene")))                                                                                                                                                           
#   [1] 24462     
# Note 2023/02/08
# ---------------
#Loading hg19 length data...
#[1] 25670


# Location of external annotation files
species_dir <- tolower(sub('. ', '', CONFIG$host))
input_dir <- file.path('..', '..', 'data', species_dir)
Preview of gene annotations.
gene_id chromosome description strand type transcript_length
ENSG00000284662 chr5 olfactory receptor family 4 subfamily F member 16 + protein_coding 629
ENSG00000186827 chr1 TNF receptor superfamily member 4 - protein_coding 1075
ENSG00000186891 chr1 TNF receptor superfamily member 18 - protein_coding 789
ENSG00000160072 chr1 ATPase family AAA domain containing 3B + protein_coding 2114
ENSG00000041988 chr1 THAP domain containing 3 + protein_coding 931
ENSG00000260179 NA NA NA lncRNA NA

3.2 Load GO annotations

Next, we will use the GO.db annotation package citep(citation('GO.db')) to load a table of Gene Ontology (GO) citep('10.1038/75556') terms associated with each parasite gene.

Preview of GO annotations.
GO ENSEMBL TERM ONTOLOGY
GO:0000002 ENSG00000114120 mitochondrial genome maintenance BP
GO:0000002 ENSG00000025708 mitochondrial genome maintenance BP
GO:0000002 ENSG00000068305 mitochondrial genome maintenance BP
GO:0000002 ENSG00000117020 mitochondrial genome maintenance BP
GO:0000002 ENSG00000171612 mitochondrial genome maintenance BP
GO:0000002 ENSG00000115204 mitochondrial genome maintenance BP

3.3 Load KEGG annotations

Next, we will load a table of KEGG citep('10.1093/nar/27.1.29') pathway annotations for each host gene.

Preview of KEGG annotations.
category name class description
path:hsa00232 Caffeine metabolism Metabolism; Biosynthesis of other secondary metabolites
path:hsa00983 Drug metabolism - other enzymes Metabolism; Xenobiotics biodegradation and metabolism
path:hsa01100 Metabolic pathways
path:hsa05204 Chemical carcinogenesis Human Diseases; Cancers It has been estimated that exposure to environmental chemical carcinogens may contribute significantly to the causation of a sizable fraction, perhaps a majority, of human cancers. Human carcinogens act through a variety of genotoxic and non-genotoxic mechanisms. Genotoxic carcinogens can attack biological macromolecules such as DNA and RNA either directly or indirectly through metabolism, resulting in the formation of adducts with these macromolecules. If DNA adducts escape cellular repair mechanisms and persist, they may lead to miscoding, resulting in permanent mutations. Non-genotoxic carcinogens act by the mechanisms such as induction of inflammation, immunosuppression, formation of reactive oxygen species, activation of receptors, and epigenetic silencing. Together, these genotoxic and non-genotoxic mechanisms can alter signal-transduction pathways that finally result in hypermutability, genomic instability, loss of proliferation control, and resistance to apoptosis - some of the characteristic features of cancer cells.
path:hsa00230 Purine metabolism Metabolism; Nucleotide metabolism
path:hsa05340 Primary immunodeficiency Human Diseases; Immune diseases Primary immunodeficiencies (PIs) are a heterogeneous group of disorders, which affect cellular and humoral immunity or non-specific host defense mechanisms mediated by complement proteins, and cells such as phagocytes and natural killer (NK) cells. These disorders of the immune system cause increased susceptibility to infection, autoimmune disease, and malignancy. Most of PIs are due to genetic defects that affect cell maturation or function at different levels during hematopoiesis. Disruption of the cellular immunity is observed in patients with defects in T cells or both T and B cells. These cellular immunodeficiencies comprise 20% of all PIs. Disorders of humoral immunity affect B-cell differentiation and antibody production. They account for 70% of all PIs.

4 Create ExpressionSet

Rather than keeping track of the count data and meta information all separately, let’s create a Bioconductor ExpressionSet instance containing all of the relevant information.

In order to create an ExpressionSet we first have to make sure each of the three main componenets (expression data, sample information, and feature information) have consistent column and row names and have been sorted similarly.

Some useful methods for working with ExpressionSets:

  • sampleNames(expr_set) // sample ids
  • featureNames(expr_set) // gene ids
  • exprs(expr_set) // counts
  • pData(expr_set) // sample metadata (id, condition, batch, etc)
  • fData(expr_set) // gene annotations
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 60664 features, 18 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: 1107_WaGa_sT_DMSO_EVAligned.sortedByCoord.out.bam
##     1605_WaGa_sT_DMSO_EVAligned.sortedByCoord.out.bam ...
##     WaGa_RNA_147Aligned.sortedByCoord.out.bam (18 total)
##   varLabels: sample condition batch
##   varMetadata: labelDescription
## featureData
##   featureNames: ENSG00000000003 ENSG00000000005 ... ENSG00000288725
##     (60664 total)
##   fvarLabels: gene_id chromosome ... transcript_length (6 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:

5 Preview raw counts

5.2 Highly expressed genes (before filtering)

Genes with the highest average expression before filtering.
gene_id description type tx_len num_reads num_reads_pct
ENSG00000198804 cytochrome c oxid… protein_coding NA 1304472 1.7129
ENSG00000198886 NADH dehydrogenas… protein_coding NA 1189403 1.5618
ENSG00000198712 cytochrome c oxid… protein_coding NA 789210 1.0363
ENSG00000198727 cytochrome b protein_coding NA 716157 0.9404
ENSG00000140259 microfibril assoc… protein_coding 1441 708146 0.9299
ENSG00000198786 NADH dehydrogenas… protein_coding NA 658007 0.8640
ENSG00000198938 cytochrome c oxid… protein_coding NA 630656 0.8281
ENSG00000198888 NADH dehydrogenas… protein_coding NA 486080 0.6383
ENSG00000251562 metastasis associ… lncRNA 2227 381826 0.5014
ENSG00000210082 NA Mt_rRNA NA 378582 0.4971
## [1] 76153959

6 Filter data

Next let’s filter the data to remove unwanted or uninformative genes.

Filtering targets:

  1. low-count genes
  2. noncoding RNAs

Later on (post-normalization), there may also be an addition step of variance-based filtering applied.

6.3 Highly expressed genes (after filtering)

Genes with the highest average expression after filtering.
gene_id description type tx_len num_reads num_reads_pct
ENSG00000198804 cytochrome c oxid… protein_coding NA 1304472 1.7134
ENSG00000198886 NADH dehydrogenas… protein_coding NA 1189403 1.5623
ENSG00000198712 cytochrome c oxid… protein_coding NA 789210 1.0366
ENSG00000198727 cytochrome b protein_coding NA 716157 0.9407
ENSG00000140259 microfibril assoc… protein_coding 1441 708146 0.9301
ENSG00000198786 NADH dehydrogenas… protein_coding NA 658007 0.8643
ENSG00000198938 cytochrome c oxid… protein_coding NA 630656 0.8284
ENSG00000198888 NADH dehydrogenas… protein_coding NA 486080 0.6385
ENSG00000251562 metastasis associ… lncRNA 2227 381826 0.5015
ENSG00000210082 NA Mt_rRNA NA 378582 0.4973
## Features  Samples 
##    53795       18
## [1] 76133340

7 Create DESeqDataSet

To analyze and graph the data using the DESeq2 package, we have to change the data format from an ExpressionSet to a DESeqDataSet object and identify the factor in the experimental design that we want to compare for our differential expression analysis. This requires a few steps. (https://colauttilab.github.io/RNA-Seq_Tutorial.html#deseq2_analysis)

First, extract raw count matrix from ExpressionSet

Second, we prepare a data frame containing metadata about each sample.

Third, construct a DESeq object from the prepared data.

8 Estimate size factors

## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:Biobase':
## 
##     combine
## The following object is masked from 'package:BiocGenerics':
## 
##     combine
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked _by_ '.GlobalEnv':
## 
##     select
## The following object is masked from 'package:OrganismDbi':
## 
##     select
## The following object is masked from 'package:clusterProfiler':
## 
##     select
## The following object is masked from 'package:genefilter':
## 
##     area
## The following object is masked from 'package:dplyr':
## 
##     select
## The following object is masked from 'package:AnnotationDbi':
## 
##     select
## The following object is masked from 'package:biomaRt':
## 
##     select
## Loading required package: lattice
## 
## Attaching package: 'lattice'
## The following object is masked from 'package:ReactomePA':
## 
##     dotplot
## The following object is masked from 'package:clusterProfiler':
## 
##     dotplot
## 
## Loaded mixOmics 6.20.0
## Thank you for using mixOmics!
## Tutorials: http://mixomics.org
## Bookdown vignette: https://mixomicsteam.github.io/Bookdown
## Questions, issues: Follow the prompts at http://mixomics.org/contact-us
## Cite us:  citation('mixOmics')
## Loading required package: grid
## Loading required package: futile.logger
## 
## Attaching package: 'futile.logger'
## The following object is masked from 'package:mgcv':
## 
##     scat
## Loading required package: usethis
## NULL
##  1107_WaGa_sT_DMSO_EVAligned.sortedByCoord.out.bam 
##                                             0.8411 
##  1605_WaGa_sT_DMSO_EVAligned.sortedByCoord.out.bam 
##                                             0.9430 
##  2706_WaGa_sT_DMSO_EVAligned.sortedByCoord.out.bam 
##                                             1.0928 
## 1107_WaGa_scr_DMSO_EVAligned.sortedByCoord.out.bam 
##                                             0.8531 
## 1605_WaGa_scr_DMSO_EVAligned.sortedByCoord.out.bam 
##                                             0.8503 
## 2706_WaGa_scr_DMSO_EVAligned.sortedByCoord.out.bam 
##                                             1.1101
# raw and normalized counts
raw_counts <- counts(dds)
normalized_counts <- counts(dds, normalized=TRUE)
write.table(raw_counts, file="tables/raw_counts.txt", sep="\t", quote=F, col.names=NA)
write.table(normalized_counts, file="tables/normalized_counts.txt", sep="\t", quote=F, col.names=NA)

# prepare the function estimSf
### Let's implement such a function
### cds is a countDataset
estimSf <- function (cds){
    # Get the count matrix
    cts <- counts(cds)
    
    # Compute the geometric mean
    geomMean <- function(x) prod(x)^(1/length(x))

    # Compute the geometric mean over the line
    gm.mean  <-  apply(cts, 1, geomMean)
    
    # Zero values are set to NA (avoid subsequentcdsdivision by 0)
    gm.mean[gm.mean == 0] <- NA
    
    # Divide each line by its corresponding geometric mean
    # sweep(x, MARGIN, STATS, FUN = "-", check.margin = TRUE, ...)
    # MARGIN: 1 or 2 (line or columns)
    # STATS: a vector of length nrow(x) or ncol(x), depending on MARGIN
    # FUN: the function to be applied
    cts <- sweep(cts, 1, gm.mean, FUN="/")
    
    # Compute the median over the columns
    med <- apply(cts, 2, median, na.rm=TRUE)
    
    # Return the scaling factor
    return(med)
}
#https://dputhier.github.io/ASG/practicals/rnaseq_diff_Snf2/rnaseq_diff_Snf2.html
#http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#data-transformations-and-visualization
#https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html
#https://hbctraining.github.io/DGE_workshop/lessons/04_DGE_DESeq2_analysis.html
#https://genviz.org/module-04-expression/0004/02/01/DifferentialExpression/
#DESeq2’s median of ratios [1]
#EdgeR’s trimmed mean of M values (TMM) [2]
#http://www.nathalievialaneix.eu/doc/html/TP1_normalization.html
test_normcount <- sweep(raw_counts, 2, sizeFactors(dds), "/")
sum(test_normcount != normalized_counts)
## [1] 0
##  1107_WaGa_sT_DMSO_EVAligned.sortedByCoord.out.bam 
##                                             0.8411 
##  1605_WaGa_sT_DMSO_EVAligned.sortedByCoord.out.bam 
##                                             0.9430 
##  2706_WaGa_sT_DMSO_EVAligned.sortedByCoord.out.bam 
##                                             1.0928 
## 1107_WaGa_scr_DMSO_EVAligned.sortedByCoord.out.bam 
##                                             0.8531 
## 1605_WaGa_scr_DMSO_EVAligned.sortedByCoord.out.bam 
##                                             0.8503 
## 2706_WaGa_scr_DMSO_EVAligned.sortedByCoord.out.bam 
##                                             1.1101
## [1] TRUE

9 Basic exploratory analysis

9.1 Raw counts and Pseudo counts

## [1] 53795    18

It is often useful, to visualize the count distribution, to compute “pseudo counts”, which are log-transformed counts:

1107_WaGa_sT_DMSO_EVAligned.sortedByCoord.out.bam 1605_WaGa_sT_DMSO_EVAligned.sortedByCoord.out.bam 2706_WaGa_sT_DMSO_EVAligned.sortedByCoord.out.bam 1107_WaGa_scr_DMSO_EVAligned.sortedByCoord.out.bam 1605_WaGa_scr_DMSO_EVAligned.sortedByCoord.out.bam 2706_WaGa_scr_DMSO_EVAligned.sortedByCoord.out.bam 1107_WaGa_sT_Dox_EVAligned.sortedByCoord.out.bam 1605_WaGa_sT_Dox_EVAligned.sortedByCoord.out.bam 2706_WaGa_sT_Dox_EVAligned.sortedByCoord.out.bam 1107_WaGa_scr_Dox_EVAligned.sortedByCoord.out.bam 1605_WaGa_scr_Dox_EVAligned.sortedByCoord.out.bam 2706_WaGa_scr_Dox_EVAligned.sortedByCoord.out.bam 1107_WaGa_wt_EVAligned.sortedByCoord.out.bam 1605_WaGa_wt_EVAligned.sortedByCoord.out.bam 2706_WaGa_wt_EVAligned.sortedByCoord.out.bam WaGa_RNAAligned.sortedByCoord.out.bam WaGa_RNA_118Aligned.sortedByCoord.out.bam WaGa_RNA_147Aligned.sortedByCoord.out.bam
ENSG00000000003 5.392 5.672 5.883 5.044 5.210 5.285 5.426 5.248 5.728 4.954 4.322 5.392 5.322 6.150 3.907 6.539 6.755 7.210
ENSG00000000005 3.170 3.585 3.907 2.322 3.000 2.807 3.322 2.322 3.459 3.585 2.322 2.807 2.322 3.000 3.000 0.000 1.000 0.000
ENSG00000000419 6.600 6.794 7.313 6.426 6.508 7.011 7.022 7.686 7.340 6.742 6.700 6.088 7.665 7.942 6.686 9.377 9.185 8.948
ENSG00000000457 6.044 6.109 6.190 6.443 5.615 6.658 6.539 5.170 5.728 5.755 5.672 5.672 6.285 5.615 5.248 7.593 7.267 6.954
ENSG00000000460 5.728 6.150 6.170 4.755 5.322 5.833 6.267 6.304 5.858 5.492 5.931 6.066 5.781 6.190 5.210 8.814 8.061 7.846
ENSG00000000938 6.109 6.066 6.340 5.615 6.248 6.539 5.858 6.304 5.954 5.392 6.267 6.742 6.170 5.555 5.954 2.000 1.000 2.000

The object df_raw will be used later to compare the effect of different normalization on the count distribution in the different samples.

9.2 Count distribution

First, we provide an overview of the distribution of the first sample (without the genes that have been filtered out) by ploting the histograms of raw counts and pseudo counts:

## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This figure shows that the count distribution is highly skewed with a large proportion of genes with a count equal to 0 and a few number of genes with an extreme count.

9.3 Relation between mean and variance

Another important feature of RNA-Seq data is the fact that they are overdispersed. This means that the variance of the count of a given gene over different biological samples within a given condition is larger than the average count for the same gene. This feature is illustrated by plotting the graphics of the mean vs variance for condition “wt” for all genes with an average count smaller than 5000 (otherwise the chart can not be read easily):

In this figure, the black line is the “y=x” diagonal. It is easy to see that, for most genes, the variance is much larger than the mean.

10 Normalization

This section performs different normalization approaches for RNA-Seq data. These normalization are performed using either DESeq (RLE) or edgeR (TC, RPKM, UQ, TMM). The final count distribution is compared to the raw count distribution in the last part of this section.

10.1 DESeq2

Normalized counts can be obtained using the function counts with the option normalized = TRUE. The calculation of the normalized counts is performed with the formula K~gj=Kgj/sj. The following command lines compare the result of the counts function with this calculation:

## [1] 0

Finally, pseudo counts are computed and stored in a data frame for further comparison.

10.2 edgeR

In this section, the package edgeR is perform to compare the other normalization approaches. To do so, a DGEList object has to be created from the count data:

The library sizes and the normalization factors for all samples are obtained by:

group lib.size norm.factors
1107_WaGa_sT_DMSO_EVAligned.sortedByCoord.out.bam 1 2514456 1
1605_WaGa_sT_DMSO_EVAligned.sortedByCoord.out.bam 1 2917888 1
2706_WaGa_sT_DMSO_EVAligned.sortedByCoord.out.bam 1 3829508 1
1107_WaGa_scr_DMSO_EVAligned.sortedByCoord.out.bam 1 2397527 1
1605_WaGa_scr_DMSO_EVAligned.sortedByCoord.out.bam 1 2660616 1
2706_WaGa_scr_DMSO_EVAligned.sortedByCoord.out.bam 1 3739227 1
1107_WaGa_sT_Dox_EVAligned.sortedByCoord.out.bam 1 2771568 1
1605_WaGa_sT_Dox_EVAligned.sortedByCoord.out.bam 1 4102294 1
2706_WaGa_sT_Dox_EVAligned.sortedByCoord.out.bam 1 3446136 1
1107_WaGa_scr_Dox_EVAligned.sortedByCoord.out.bam 1 2226308 1
1605_WaGa_scr_Dox_EVAligned.sortedByCoord.out.bam 1 2680865 1
2706_WaGa_scr_Dox_EVAligned.sortedByCoord.out.bam 1 3354354 1
1107_WaGa_wt_EVAligned.sortedByCoord.out.bam 1 5828110 1
1605_WaGa_wt_EVAligned.sortedByCoord.out.bam 1 5599417 1
2706_WaGa_wt_EVAligned.sortedByCoord.out.bam 1 2506633 1
WaGa_RNAAligned.sortedByCoord.out.bam 1 9700209 1
WaGa_RNA_118Aligned.sortedByCoord.out.bam 1 8042977 1
WaGa_RNA_147Aligned.sortedByCoord.out.bam 1 7815247 1

10.3 Total count (TC)

Normalization by Total Count (TC) is obtained directly by the function cpm. Pseudo-counts (log2 transformed counts) are computed and stored in a data frame for later use.

10.4 RPKM

RPKM needs information about gene lengths which have to be passed to the function rpkm. Again, pseudo-counts are computed and stored in a data frame for further use.

hgnc_symbol ensembl_gene_id start_position end_position size
TSPAN6 ENSG00000000003 100627108 100639991 12883
TNMD ENSG00000000005 100584936 100599885 14949
DPM1 ENSG00000000419 50934867 50959140 24273
SCYL3 ENSG00000000457 169849631 169894267 44636
C1orf112 ENSG00000000460 169662007 169854080 192073
FGR ENSG00000000938 27612064 27635185 23121

10.5 Upper quartile (UQ)

Upper quartile normalization is obtained with the function calcNormFactors. This function changes the variable norm.factors in the DGEList object and thus also the way the functions cpm and rpkm are computing counts: now, a normalized library size, equal to the original library size multiplied by the normalization factor, is used to compute normalized counts: K~gj=Kgj/(modified LS)∗10^6. The normalization factors are computed as follows. Note that size.factors = lib.size*norm.factors/10^6.

group lib.size norm.factors
1107_WaGa_sT_DMSO_EVAligned.sortedByCoord.out.bam 1 2514456 1.5370
1605_WaGa_sT_DMSO_EVAligned.sortedByCoord.out.bam 1 2917888 1.4654
2706_WaGa_sT_DMSO_EVAligned.sortedByCoord.out.bam 1 3829508 1.1810
1107_WaGa_scr_DMSO_EVAligned.sortedByCoord.out.bam 1 2397527 1.5776
1605_WaGa_scr_DMSO_EVAligned.sortedByCoord.out.bam 1 2660616 1.4216
2706_WaGa_scr_DMSO_EVAligned.sortedByCoord.out.bam 1 3739227 1.2974
1107_WaGa_sT_Dox_EVAligned.sortedByCoord.out.bam 1 2771568 1.4537
1605_WaGa_sT_Dox_EVAligned.sortedByCoord.out.bam 1 4102294 0.8819
2706_WaGa_sT_Dox_EVAligned.sortedByCoord.out.bam 1 3446136 1.2169
1107_WaGa_scr_Dox_EVAligned.sortedByCoord.out.bam 1 2226308 1.5697
1605_WaGa_scr_Dox_EVAligned.sortedByCoord.out.bam 1 2680865 1.5643
2706_WaGa_scr_Dox_EVAligned.sortedByCoord.out.bam 1 3354354 1.3727
1107_WaGa_wt_EVAligned.sortedByCoord.out.bam 1 5828110 0.9030
1605_WaGa_wt_EVAligned.sortedByCoord.out.bam 1 5599417 0.6755
2706_WaGa_wt_EVAligned.sortedByCoord.out.bam 1 2506633 1.4106
WaGa_RNAAligned.sortedByCoord.out.bam 1 9700209 0.3391
WaGa_RNA_118Aligned.sortedByCoord.out.bam 1 8042977 0.2965
WaGa_RNA_147Aligned.sortedByCoord.out.bam 1 7815247 0.2841

The previous formula is compared to the result of the function cpm in the following command lines:

## [1] -7.276e-12  7.276e-12

which shows no difference. Finally, pseudo counts are obtained and stored in a data frame:

10.6 TMM

TMM normalization works similarly as UQ and updates the value of the variable norm.factors:

group lib.size norm.factors
1107_WaGa_sT_DMSO_EVAligned.sortedByCoord.out.bam 1 2514456 1.3456
1605_WaGa_sT_DMSO_EVAligned.sortedByCoord.out.bam 1 2917888 1.2298
2706_WaGa_sT_DMSO_EVAligned.sortedByCoord.out.bam 1 3829508 1.0654
1107_WaGa_scr_DMSO_EVAligned.sortedByCoord.out.bam 1 2397527 1.3707
1605_WaGa_scr_DMSO_EVAligned.sortedByCoord.out.bam 1 2660616 1.1780
2706_WaGa_scr_DMSO_EVAligned.sortedByCoord.out.bam 1 3739227 1.0892
1107_WaGa_sT_Dox_EVAligned.sortedByCoord.out.bam 1 2771568 1.2840
1605_WaGa_sT_Dox_EVAligned.sortedByCoord.out.bam 1 4102294 0.8046
2706_WaGa_sT_Dox_EVAligned.sortedByCoord.out.bam 1 3446136 1.0757
1107_WaGa_scr_Dox_EVAligned.sortedByCoord.out.bam 1 2226308 1.4004
1605_WaGa_scr_Dox_EVAligned.sortedByCoord.out.bam 1 2680865 1.3521
2706_WaGa_scr_Dox_EVAligned.sortedByCoord.out.bam 1 3354354 1.1472
1107_WaGa_wt_EVAligned.sortedByCoord.out.bam 1 5828110 0.8482
1605_WaGa_wt_EVAligned.sortedByCoord.out.bam 1 5599417 0.6435
2706_WaGa_wt_EVAligned.sortedByCoord.out.bam 1 2506633 1.2032
WaGa_RNAAligned.sortedByCoord.out.bam 1 9700209 0.5494
WaGa_RNA_118Aligned.sortedByCoord.out.bam 1 8042977 0.5983
WaGa_RNA_147Aligned.sortedByCoord.out.bam 1 7815247 0.6189

Normalized pseudo counts are obtained with the function cpm and stored in a data frame:

11 PCA

Finally, a PCA is performed on the pseudo counts obtained after DESeq2 (RLE) normalization:

The first component explains more variance than the other components which almost all equally reproduce variance. The projection of the samples on the first two PCs is obtained with:

12 Differential Expression Analysis

Finally, we are ready to analyze the data by running the DESeq function on the DESeqDataSet object. The model below will take a few seconds to run. The details of this analysis are a bit complicated, but worth reading in detail if you want to do this for a project or publication (use ?DESeq and look up the references therein).

The analysis is complicated because sequencing is essentially a process of random sampling, so two genes may look like they are differentially expressed between two samples just because one sample has more overall RNA or because one sample has another gene ‘hogging’ all the sequence reads. So it’s not possible to compare one gene across samples without also considering all the other genes in those samples.

## using pre-existing size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing

Once the analysis is done, we can extract a table of results using the results() function. This is useful for other downstream analyses including visualization.

## [1] "Intercept"                           "condition_RNA_vs_scr.DMSO.EV"       
## [3] "condition_scr.Dox.EV_vs_scr.DMSO.EV" "condition_sT.DMSO.EV_vs_scr.DMSO.EV"
## [5] "condition_sT.Dox.EV_vs_scr.DMSO.EV"  "condition_wt.EV_vs_scr.DMSO.EV"
## log2 fold change (MLE): condition wt.EV vs scr.DMSO.EV 
## Wald test p-value: condition wt.EV vs scr.DMSO.EV 
## DataFrame with 6 rows and 6 columns
##                  baseMean log2FoldChange     lfcSE      stat    pvalue
##                 <numeric>      <numeric> <numeric> <numeric> <numeric>
## ENSG00000000003  47.26900      -0.125133  0.402798 -0.310660 0.7560593
## ENSG00000000005   6.60857      -0.133363  0.762828 -0.174827 0.8612156
## ENSG00000000419 176.46105       0.556557  0.271188  2.052294 0.0401411
## ENSG00000000457  70.91095      -0.831785  0.330742 -2.514909 0.0119063
## ENSG00000000460  83.37109       0.155099  0.305115  0.508328 0.6112234
## ENSG00000000938  58.98183      -0.502072  0.328987 -1.526116 0.1269809
##                      padj
##                 <numeric>
## ENSG00000000003 0.8570386
## ENSG00000000005        NA
## ENSG00000000419 0.1726800
## ENSG00000000457 0.0887135
## ENSG00000000460 0.7588803
## ENSG00000000938 0.3214849
  • baseMean – this is the average gene expression for the two groups.
  • log2FoldChange – raise this number to the exponent 2 to get the fractional difference. For example, -2.32 means a ratio of 2^(-2.32) or 0.2 (Male vs Female), so the gene would be expressed higher in the male than the female.
  • lfcSE – this is the standard error of the estimate of the log2FoldChange
  • pvalue – this is the probability of getting the observed log2FoldChange just by randomly sampling sequences. This depends on the magnitude of the difference relative to the total number of sequences.
  • padj – this is an adjusted p-value. The p-value must be adjusted to account for the fact that we are looking at many genes. For example, if we have 10,000 genes, so we expect to get P < 0.05 about 5% of the time, so we would get 500 genes with P < 0.05 just by chance (i.e. 0.05 * 10,000).