https://github.com/cougarlj/COMPSRA/issues/18
https://github.com/HCGB-IGTP/XICRA
- This pipeline is designed to take paired-end reads in FASTQ format, trim adapters and low-quality base positions, and merge read pairs (R1 & R2) that overlap.
- A mapping step to the reference genome (user defined) assigns joined reads to all major RNA biotypes including miRNA and isomiRs, tRNA fragments (tRFs) and piwi associated RNAs (piRNAs).
- Then, XICRA produces a miRNA analysis at the isomiR level using joined reads, with several choices of software that can be selected by the user with standardized output.
- Results are generated for each sample, analyzed and summarized for all samples in a single expression matrix.
- This information can be processed at the miRNA or isomiR level (single sequence), or summarized by isomiR variant type.
- Statistical summaries can be easily accessed using the accompanying R package XICRA.stats (https://github.com/HCGB-IGTP/XICRA.stats).
- Although the pipeline is designed to take paired-end reads, it also accepts single-end reads.
- XICRA uses cutadapt [30] for the adapter trimming analysis.
- The default trimming settings keep all reads whether or not the adapter is found, allow a maximum adapter-matching error rate of 10% (mismatches, insertions and deletions), and require a minimum overlap of 3 bp.
- Users must provide the specific adapter sequences for the trimming analysis.
- An optional quality-check step can be run on each sample with FastQC [31] before the trimming analysis.
- Results are summarized for all samples using MultiQC software [32].
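
The trimming defaults described above can be written out as a cutadapt command line. The sketch below only builds the argument list and does not run anything. The function name, file names and adapter strings are hypothetical, and this is not necessarily how XICRA itself invokes cutadapt. It relies on the standard cutadapt options `-a`/`-A` (3' adapters for R1/R2), `-e` (maximum error rate), `-O` (minimum overlap) and `-o`/`-p` (paired outputs).

```python
def cutadapt_command(r1, r2, out1, out2, adapter_r1, adapter_r2,
                     error_rate=0.1, min_overlap=3):
    """Build a paired-end cutadapt call mirroring the defaults in the notes.

    Hypothetical helper for illustration only. Untrimmed reads are kept
    simply because no --discard-* option is passed.
    """
    return [
        "cutadapt",
        "-a", adapter_r1,        # 3' adapter expected in R1 (user supplied)
        "-A", adapter_r2,        # 3' adapter expected in R2 (user supplied)
        "-e", str(error_rate),   # 10% maximum adapter-matching error rate
        "-O", str(min_overlap),  # 3 bp minimum adapter/read overlap
        "-o", out1,              # trimmed R1 output
        "-p", out2,              # trimmed R2 output
        r1, r2,
    ]


if __name__ == "__main__":
    # Placeholder file names and adapter strings, not real sequences.
    cmd = cutadapt_command("sample_R1.fastq", "sample_R2.fastq",
                           "trim_R1.fastq", "trim_R2.fastq",
                           "ADAPTER_R1", "ADAPTER_R2")
    print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run` once real adapter sequences are substituted.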
- Once all reads are adapter-trimmed, the tool uses fastq-join from ea-utils [33] to join the two paired-end (PE) reads, if provided, at their overlapping ends.
- Apart from the joined reads, this tool also generates two files with the R1 and R2 reads that cannot be joined.
- By default, the minimum overlap is set to 6 bp and the maximum allowed difference between the reads to be joined is set to 0%, so only read pairs that match perfectly across the overlap are retained, ensuring high-quality sequence information.
- Parameters can be modified using the different options provided.
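
To make the joining criteria concrete, here is a simplified illustration of merging a read pair on its overlapping ends with a 6 bp minimum overlap and 0% allowed difference. It is a sketch under stated assumptions, not the fastq-join algorithm: it ignores base qualities, tries the longest overlap first, and assumes adapters have already been removed. All function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def join_pair(r1, r2, min_overlap=6, max_diff_pct=0.0):
    """Join R1 and R2 on their overlapping ends, or return None.

    R2 is reverse-complemented so both reads share an orientation, then
    the 3' tail of R1 is compared with the 5' head of R2 for every
    candidate overlap length, longest first.
    """
    r2_rc = reverse_complement(r2)
    longest = min(len(r1), len(r2_rc))
    for overlap in range(longest, min_overlap - 1, -1):
        tail = r1[-overlap:]
        head = r2_rc[:overlap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        # Accept if the mismatch percentage is within the allowed limit.
        if mismatches * 100 <= max_diff_pct * overlap:
            return r1 + r2_rc[overlap:]
    return None


if __name__ == "__main__":
    fragment = "TGAGGTAGTAGGTTGTATAGTT"          # 22 nt example insert
    r1 = fragment[:18]                          # forward read
    r2 = reverse_complement(fragment[4:])       # reverse read
    print(join_pair(r1, r2))
```

With `max_diff_pct=0.0` any mismatch in the overlap rejects that candidate, which mirrors the strict default described in the notes. Pairs that return `None` correspond to the unjoined R1 and R2 files.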
- The XICRA pipeline can continue to process either joined PE reads or SR reads.
- Two levels of mapping are implemented. The first level profiles RNA biotypes using STAR [34] to map reads against the reference genome and featureCounts [35] to extract and quantify numbers of reads by class. The second level focuses specifically on small RNA subclasses.
- Here we describe the miRNA analysis implemented within XICRA, but the modularity and versatility of the pipeline would make it straightforward to include detailed analyses of other RNA biotypes.
- For miRNA analysis at isomiR resolution, XICRA lets the user choose between miraligner [26], sRNAbench from sRNAtoolbox [27] and OPTIMIR [28].
- Each tool uses a different strategy and might produce different results [36].
- We have included them because they support the standardization procedures performed by miRTOP and adopt the miR.gff3 file format [37].
- Again, the modular implementation of the pipeline would allow adding further tools that converge on miRTOP and the miR.gff3 format.
- Default parameters are used for each of the tools mentioned above and included within the miRNA module in XICRA.
- Some of these parameters can be modified using the different options provided.
- This miRNA module produces an annotation that categorizes isomiRs into classes based on their sequence modifications (including iso_5p, iso_3p, iso_add, iso_snv, iso_snv_seed, iso_snv_central_offset, iso_snv_central, iso_snv_central_suppl), following the classification scheme suggested by miRTOP.
- A final conversion step from individual per sample miR.gff3 files into a single expression matrix is performed.
- This file serves as input for differential expression (DE) analysis.
- Information is provided for each unique sequence, and index names contain the miRNA, the variant type and the license plate (unique identifier, UID) provided by miRTOP.
- Duplicated entries at the sequence level, produced by different modifications of the same or different miRNAs, are discarded.
- An additional matrix is provided containing the sequence information for each encrypted UID.
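
The conversion from per-sample tables to a single expression matrix can be pictured with a small sketch. It assumes each sample has already been reduced to a mapping from row identifier to read count. Parsing of miR.gff3 files is deliberately left out, and the function name and toy identifiers are hypothetical.

```python
def build_matrix(per_sample):
    """Merge per-sample count tables into one matrix.

    per_sample maps sample name -> {row identifier: read count}.
    Returns (sample names, {row identifier: [count per sample]}), where
    identifiers missing from a sample are filled with 0.
    """
    samples = sorted(per_sample)
    identifiers = sorted({uid for table in per_sample.values() for uid in table})
    rows = {
        uid: [per_sample[sample].get(uid, 0) for sample in samples]
        for uid in identifiers
    }
    return samples, rows


if __name__ == "__main__":
    # Toy identifiers, not real miRTOP UIDs.
    samples, rows = build_matrix({
        "sample1": {"uidA": 3, "uidB": 1},
        "sample2": {"uidB": 4},
    })
    print(samples)
    for uid, counts in rows.items():
        print(uid, counts)
```

Filling absent rows with zero keeps every sample column the same length, which is what a count-based DE tool expects as input.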
- Per-sample read count matrices at the isomiR level are summarized into a single expression matrix that serves as input for DE analysis between the comparison groups of interest.
- We have generated an additional R package (XICRA.stats) that facilitates the retrieval of these matrices and parses the information included within each unique index name.
- The DE analysis can be done by aggregating data at the mature miRNA level (e.g. hsa-miR-501-3p), by isomiR class (e.g. hsa-miR-501-3p_iso_5p), by specific length variant cluster (e.g. hsa-miR-501-3p_iso_3p:-2), or with the sequence of the read itself as the counting unit.
- This is useful because different types of modification may coexist in a single sequence, and non-templated additions and internally edited sequences can differ, leading to isomiRs that fall into different categories or derive from different mature miRNAs.
- DE analysis is performed outside of the tool with the DESeq2 package in R [38].
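
The aggregation levels listed above can be sketched in a few lines. This is an illustration, not the XICRA.stats implementation (which is an R package). The index-name layout `miRNA&variant&UID` with `&` as separator is an assumption made for the example, and every function name is hypothetical. Output keys follow the naming used in the examples above.

```python
def parse_index(name, sep="&"):
    """Split an index name into (miRNA, variant, UID).

    The 'miRNA&variant&UID' layout is an assumption for this sketch.
    """
    mirna, variant, uid = name.split(sep)
    return mirna, variant, uid


def variant_class(variant):
    """Drop the ':N' size suffix, e.g. 'iso_3p:-2' -> 'iso_3p'."""
    return ",".join(part.split(":")[0] for part in variant.split(","))


def aggregate(counts, level):
    """Sum per-sequence count rows at the requested level.

    counts maps index name -> list of per-sample counts.
    level is 'mirna', 'class' or 'variant'.
    """
    totals = {}
    for name, row in counts.items():
        mirna, variant, _uid = parse_index(name)
        if level == "mirna":
            key = mirna
        elif level == "class":
            key = f"{mirna}_{variant_class(variant)}"
        elif level == "variant":
            key = f"{mirna}_{variant}"
        else:
            raise ValueError(f"unknown level: {level}")
        previous = totals.get(key, [0] * len(row))
        totals[key] = [a + b for a, b in zip(previous, row)]
    return totals


if __name__ == "__main__":
    # Toy counts for two samples; the UIDs are placeholders.
    toy = {
        "hsa-miR-501-3p&iso_3p:-2&uidA": [5, 1],
        "hsa-miR-501-3p&iso_3p:-1&uidB": [2, 0],
        "hsa-miR-501-3p&iso_5p:+1&uidC": [10, 4],
    }
    for level in ("mirna", "class", "variant"):
        print(level, aggregate(toy, level))
```

Each aggregated table could then be exported and used as the count matrix for DESeq2.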
Types of small RNAs
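- miRNA (microRNA): microRNAs are short RNA molecules about 20-24 nucleotides long that regulate gene expression at the post-transcriptional level. By binding to complementary sequences on mRNA molecules, a miRNA can repress translation or trigger degradation of the mRNA. miRNAs take part in many cellular processes, including development, differentiation and cell-cycle control.
- tRNA (transfer RNA): transfer RNAs are adaptor molecules about 70-90 nucleotides long that play a key role in translation during protein biosynthesis. They deliver the matching amino acid to the ribosome and recognize codons on the mRNA through their anticodon loop, which ensures that amino acids are added to the growing protein in the correct order.
- piRNA (Piwi-interacting RNA): Piwi-interacting RNAs are a class of small RNA molecules, typically 24 to 31 nucleotides long, found mainly in animal germ cells, where they contribute to transposon silencing and genome stability. They interact with Piwi proteins to help repress transcription of transposable elements and protect genome integrity.
- snRNA (small nuclear RNA): small nuclear RNAs are a group of small RNA molecules found in the nucleus of eukaryotic cells, about 100-200 nucleotides long. They are essential components of the spliceosome, the complex that removes introns from pre-mRNA molecules.
- snoRNA (small nucleolar RNA): small nucleolar RNAs are specialized RNA molecules found in the nucleolus, about 60-300 nucleotides long. They take part in rRNA modification and maturation, in particular chemical modifications of rRNA such as methylation and pseudouridylation.
- circRNA (circular RNA): circular RNAs have a closed circular structure and lack the usual 5' and 3' ends. They can arise from exons or introns and have several functions, including acting as molecular sponges for miRNAs, influencing transcription and regulating gene expression. circRNAs are involved in many biological processes and are associated with a variety of diseases.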
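In biological research, differential expression (DE) analysis is commonly used to compare how the expression levels of RNA molecules change across samples or conditions. For the RNA types above:
- miRNA (microRNA): DE analysis is meaningful. miRNAs play key roles in regulating gene expression and in many biological processes, so analyzing their expression differences across conditions helps clarify their function in disease onset, disease progression and other biological processes.
- tRNA (transfer RNA): although tRNAs are indispensable for protein synthesis, differences in their abundance across conditions are usually not the main focus of a study. Changes in tRNAs can sometimes reflect the metabolic state of the cell or its capacity to respond to stress, but this is not a common application of DE analysis.
- piRNA (Piwi-interacting RNA): DE analysis is likewise meaningful, especially in research on germ cells, stem cells and cancer. piRNAs are linked to transposon silencing and the maintenance of genome stability, so analyzing their differential expression helps reveal their role in these processes.
- snRNA (small nuclear RNA) and snoRNA (small nucleolar RNA): both mainly take part in RNA processing and modification, such as splicing and rRNA modification. DE analysis of them is generally uncommon because they are mostly involved in basic cellular processes. It can still be meaningful when the study targets changes in a specific RNA-processing pathway.
- circRNA (circular RNA): DE analysis is meaningful. circRNAs are linked to many physiological processes, including acting as miRNA sponges, influencing gene transcription and contributing to disease development, so analyzing their expression differences helps explain their function and role in different biological contexts.
Overall, DE analysis of miRNAs, piRNAs and circRNAs is meaningful in many research areas because of their roles in disease and biological processes. DE analysis of tRNAs, snRNAs and snoRNAs is less common, unless the study targets their specific changes in a particular biological process or condition.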
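By mapping cDNA sequencing reads to the human genome, can we determine whether those reads come from circRNA?
Yes, mapping cDNA reads to the human genome can help determine whether they originate from circRNA. The key feature to look for during mapping is the back-splicing event, that is, reads in which the normal linear splicing order is joined in reverse.
In a standard gene expression study, mRNA is reverse-transcribed into cDNA, and the resulting sequencing reads map to the genome in the same order as the exons, reflecting linear splicing. circRNAs, in contrast, are formed by back-splicing and therefore have a distinctive "head-to-tail" structure. In sequencing data, a read that spans this junction maps to the genome in two parts whose order is reversed: the end of a downstream exon is joined to the start of an upstream exon, which differs from the normal linear splicing order.
Detecting this atypical, out-of-order mapping pattern makes it possible to infer that a read comes from a circRNA. Dedicated bioinformatics tools and algorithms are needed to recognize these mapping patterns and identify circRNAs. Such tools detect reads that span a unique back-splice junction and help researchers decide which reads originate from circRNAs.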
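
The back-splice criterion described above can be expressed as a small check on a split read. The sketch below is a simplified illustration rather than the method of any particular circRNA caller. It assumes the aligner has already split the read into two segments, that the segments are listed in their order along the read as reported on the forward reference strand, and that coordinates are 0-based and half-open. The `Segment` class and `is_backsplice` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One aligned piece of a split read (0-based, half-open coordinates)."""
    chrom: str
    strand: str
    ref_start: int
    ref_end: int


def is_backsplice(first, second):
    """Return True if a two-segment read looks like a back-splice junction.

    `first` precedes `second` along the read. In linear splicing the first
    segment lies upstream of the second on the genome. In back-splicing the
    order is reversed: the first segment lies entirely downstream.
    """
    if first.chrom != second.chrom or first.strand != second.strand:
        return False  # chimeric or inconsistent alignment, not handled here
    return first.ref_start >= second.ref_end


if __name__ == "__main__":
    # Toy exons: an upstream exon at 1000-1100, a downstream exon at 2000-2100.
    linear = (Segment("chr1", "+", 1050, 1100), Segment("chr1", "+", 2000, 2050))
    circular = (Segment("chr1", "+", 2050, 2100), Segment("chr1", "+", 1000, 1050))
    print(is_backsplice(*linear))
    print(is_backsplice(*circular))
```

A real caller would also check that the junction coincides with annotated or canonical splice sites and would require several supporting reads before reporting a circRNA.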