Should the inputs for GSVA be normalized or raw?

gene_x

Tags: R, software, processing

Gene Set Variation Analysis (GSVA) is a non-parametric, unsupervised method for estimating the variation of gene set enrichment across the samples of a gene expression matrix. Because of how it computes its scores, there are specific recommendations for preprocessing the input data.

  • Normalization: Gene expression data should generally be normalized before applying GSVA. Normalization puts samples on a comparable scale, for example by adjusting for differences in sequencing depth between runs. It does not by itself remove batch effects, which may need a separate correction step. The appropriate method depends on the type of gene expression data. For RNA-seq data, TMM (Trimmed Mean of M-values) and RLE (Relative Log Expression) are popular. For microarray data, quantile normalization is commonly used.

  • Log Transformation: It is generally recommended to log-transform gene expression data before using GSVA. Taking the logarithm compresses the range of expression values, so that highly and lowly expressed genes become more comparable in scale. For RNA-seq data, typical transformations are log2(CPM + 1), log2(FPKM + 1), or log2(TPM + 1). The "+1" is a pseudocount: it keeps zero values defined, since log2(0) is undefined, and maps them to 0.
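The log2(CPM + 1) transformation mentioned above can be sketched in a few lines. This is a minimal illustration in plain Python, assuming simple library-size scaling (total counts per sample); it is not TMM or RLE, and the function name `log2_cpm` and the toy gene names are hypothetical.

```python
import math

def log2_cpm(counts, pseudocount=1.0):
    """Convert raw counts to log2(CPM + pseudocount).

    counts: dict mapping gene id -> list of raw counts, one per sample.
    Uses plain library-size scaling (an assumption of this sketch).
    """
    n_samples = len(next(iter(counts.values())))
    # Library size = total counts in each sample (column sums).
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        gene: [
            math.log2(c / lib_sizes[j] * 1e6 + pseudocount)
            for j, c in enumerate(row)
        ]
        for gene, row in counts.items()
    }

# Toy data: three hypothetical genes, two samples.
# Sample 2 was sequenced twice as deeply but has the same composition.
raw = {
    "geneA": [10, 20],
    "geneB": [0, 0],
    "geneC": [90, 180],
}
logcpm = log2_cpm(raw)
```

In this toy case the two samples end up with identical values after scaling, and the gene with zero counts maps to 0 thanks to the pseudocount.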

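However, best practices vary with the dataset and the research question. For example, the kcdf argument of GSVA accepts "Gaussian" for continuous, log-scale values and "Poisson" for integer counts, so the right preprocessing depends on which kind of input you supply. Consult the GSVA documentation and the relevant literature, and consider some exploratory analysis on your dataset to decide on the preprocessing steps.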

Lastly, make sure that the gene identifiers in your expression dataset match those in the gene sets you are using for the enrichment analysis. This often requires an additional preprocessing step to map between different types of gene identifiers (e.g., gene symbols, Entrez IDs, Ensembl IDs).
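A quick sanity check for identifier mismatches is to measure how much of each gene set is actually present in the expression data. The sketch below is plain Python with made-up inputs; `gene_set_coverage` is a hypothetical helper, not part of GSVA.

```python
def gene_set_coverage(expression_ids, gene_sets):
    """Return, per gene set, the fraction of its genes found in the expression data."""
    present = set(expression_ids)
    coverage = {}
    for name, genes in gene_sets.items():
        unique = set(genes)  # ignore duplicated identifiers within a set
        if unique:
            coverage[name] = len(present & unique) / len(unique)
    return coverage

# Expression matrix row names use gene symbols (toy example).
expression_ids = ["TP53", "BRCA1", "MYC", "EGFR"]

# One gene set uses symbols, the other uses Ensembl-style IDs.
gene_sets = {
    "set_symbols": ["TP53", "MYC", "KRAS"],
    "set_ensembl": ["ENSG00000141510", "ENSG00000136997"],
}

coverage = gene_set_coverage(expression_ids, gene_sets)
```

A coverage of 0 for a gene set, as for `set_ensembl` here, usually means the identifier types differ and need to be mapped before running the analysis.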

https://bioconductor.org/packages/devel/bioc/vignettes/GSVA/inst/doc/GSVA.html

Input arguments of gsva(), as described in the GSVA vignette: There are four classes of parameter objects, one for each supported method (PLAGE, z-score, ssGSEA, and GSVA). They may have different additional parameters to tune, but all of them require at least the following two input arguments:

  • A normalized gene expression dataset, which can be provided in one of the following containers:
    • A matrix of expression values with genes corresponding to rows and samples corresponding to columns.
    • An ExpressionSet object; see package Biobase.
    • A SummarizedExperiment object; see package SummarizedExperiment.
  • A collection of gene sets, which can be provided in one of the following containers:
    • A list object where each element corresponds to a gene set defined by a vector of gene identifiers, and the element names correspond to the names of the gene sets.
    • A GeneSetCollection object; see package GSEABase.

In the context of gene expression data or other biological datasets, the term "non-log space" usually refers to the original, untransformed measurements or values. This is in contrast to "log space," where values have been transformed using a logarithm, typically the natural logarithm or the base-2 logarithm.

Here's a bit more on why and when these terms are used:

  • Log Transformation: For various types of data, including gene expression data, a logarithm transformation is often applied. One reason to do this is to stabilize variance or to make the data distribution more normal-like, especially when measurements can span several orders of magnitude. For example, gene expression data from microarray or RNA-seq experiments might have a long-tailed distribution, and taking the logarithm can help in compressing extreme values.

  • Non-log Space: When we refer to values in "non-log space," we're talking about the original measurements, before any logarithm transformation. For gene expression, these might be raw read counts, FPKM values (for RNA-seq data), or probe intensities (for microarrays).

  • Back Transformation: Sometimes, after performing computations or analyses in log space, one might want to transform the results back to the original scale. This "back transformation" takes the antilogarithm in the same base that was used for the log transformation (for example 2^y for log2 values) and then subtracts any pseudocount that was added beforehand.

For example, consider the average expression of a gene. If you take the average of log-transformed expression values and then back-transform this average, you won't get the same result as taking the average of the original, non-log-transformed values. The back-transformed value is the geometric mean, which is never larger than the arithmetic mean of the same values. This is because the logarithm is a non-linear transformation.
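A small worked example makes the difference concrete. The values below are made up, and the sketch uses log2 without a pseudocount to keep the arithmetic exact.

```python
import math

# Hypothetical expression values for one gene, in non-log space.
values = [1.0, 4.0, 16.0]

# Average computed directly on the original scale.
arithmetic_mean = sum(values) / len(values)

# Average computed in log2 space: (0 + 2 + 4) / 3.
mean_of_logs = sum(math.log2(v) for v in values) / len(values)

# Back-transforming the log-space average gives the geometric mean.
back_transformed = 2 ** mean_of_logs

print(arithmetic_mean, mean_of_logs, back_transformed)  # 7.0 2.0 4.0
```

The log-space average back-transforms to 4, while the average on the original scale is 7, so the two are not interchangeable.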

In summary, "non-log space" refers to data that hasn't been log-transformed, and it represents the original scale of the measurements.

© 2023 XGenes.com Impressum