- This pipeline takes paired-end reads in FASTQ format, trims adapters and low-quality bases, and merges overlapping read pairs (R1 and R2).
- Mapping the joined reads to a user-defined reference genome assigns them to the major RNA biotypes, including miRNAs and isomiRs, tRNA fragments (tRFs) and PIWI-interacting RNAs (piRNAs).
- XICRA then analyzes miRNAs at the isomiR level from the joined reads; the user can choose among several software tools, all of which produce standardized output.
- Results are generated for each sample and then summarized across all samples in a single expression matrix.
- This information can be processed at the miRNA or isomiR (single sequence) level, or summarized by isomiR variant type.
- Statistical summaries can be easily accessed using the accompanying R package XICRA.stats (
Although the pipeline is designed to take paired-end reads, it also accepts single-end reads.
XICRA uses cutadapt [30] for the adapter trimming analysis.
- By default, all reads are kept whether or not an adapter is found, the maximum error rate for adapter matching is 10% (mismatches, insertions and deletions), and the minimum overlap length is 3 bp.
- The user must provide the specific adapter sequences for the trimming analysis.
- An optional quality-control step can be run on each sample with FastQC [31] before the trimming analysis.
Results are summarized for all samples using MultiQC software [32].
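The default trimming settings described above can be sketched as a cutadapt command line. This is a minimal illustration, not XICRA's own code: the function name, file names and adapter placeholders are hypothetical, and the command is only assembled, never executed. Reads without an adapter are kept simply because `--discard-untrimmed` is not passed.

```python
import shlex

def build_cutadapt_cmd(r1, r2, adapter_r1, adapter_r2, out_r1, out_r2,
                       error_rate=0.1, min_overlap=3):
    """Assemble a paired-end cutadapt call (hypothetical helper, nothing is run)."""
    return [
        "cutadapt",
        "-a", adapter_r1,        # 3' adapter expected in R1 (user supplied)
        "-A", adapter_r2,        # 3' adapter expected in R2 (user supplied)
        "-e", str(error_rate),   # maximum error rate in the adapter match
        "-O", str(min_overlap),  # minimum read/adapter overlap, in bp
        "-o", out_r1,            # trimmed R1 output
        "-p", out_r2,            # trimmed R2 output
        r1, r2,
    ]

# Placeholder inputs: replace with real files and the adapters of your library kit.
cmd = build_cutadapt_cmd(
    "sample_R1.fastq", "sample_R2.fastq",
    "ADAPTER_R1_PLACEHOLDER", "ADAPTER_R2_PLACEHOLDER",
    "sample_trim_R1.fastq", "sample_trim_R2.fastq",
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` once the placeholders are replaced with real values.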
Once all reads are adapter trimmed, the tool uses fastq-join from ea-utils [33] to join the two reads of each pair, when PE data are provided, at their overlapping ends.
- Apart from the joined reads, this tool also generates two files with the R1 and R2 reads that cannot be joined.
- By default, the minimum overlap is set to 6 bp and the maximum allowed difference between the overlapping reads is set to 0%, so that only perfectly matching read pairs are retained, ensuring high-quality sequence information.
Parameters can be modified using the different options provided.
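The joining criteria above (minimum overlap of 6 bp, 0% allowed difference) can be illustrated with a simplified sketch. This is not the fastq-join algorithm itself, which also handles base qualities and scores candidate overlaps; the function names are hypothetical and the sketch simply accepts the longest overlap that satisfies both thresholds.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Reverse-complement a DNA sequence (upper-case A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]

def join_pair(r1, r2, min_overlap=6, max_diff_pct=0.0):
    """Merge R1 with the reverse complement of R2, or return None if not joinable.

    Simplified illustration: the longest overlap between the 3' end of R1 and
    the 5' end of reverse-complemented R2 that meets the thresholds is used.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    rc2 = reverse_complement(r2)
    for ov in range(min(len(r1), len(rc2)), min_overlap - 1, -1):
        mismatches = sum(a != b for a, b in zip(r1[-ov:], rc2[:ov]))
        if 100.0 * mismatches / ov <= max_diff_pct:
            return r1 + rc2[ov:]
    return None

# Toy example: a 22-nt insert fully covered by both reads after trimming.
insert = "TGAGGTAGTAGGTTGTATAGTT"
print(join_pair(insert, reverse_complement(insert)))
```

For small RNA libraries the insert is usually shorter than the read length, so after adapter trimming the two reads overlap along their whole length, as in the toy example.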
The XICRA pipeline can continue to process either joined PE reads or SR reads.
- Two levels of mapping are implemented. The first level profiles RNA biotypes using STAR [34] to map reads against the reference genome and featureCounts [35] to extract and quantify numbers of reads by class. The second level focuses specifically on small RNA subclasses.
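The first mapping level can be sketched as two command lines, one for STAR and one for featureCounts. This is a generic illustration and not necessarily the parameter set XICRA uses: the helper name, paths and thread count are hypothetical, and counting by the `gene_biotype` attribute assumes a GTF annotation that carries that attribute (as Ensembl annotations do). The commands are only assembled, not executed.

```python
def build_biotype_cmds(joined_fastq, genome_dir, gtf, prefix, threads=4):
    """Assemble STAR and featureCounts calls for biotype profiling (hypothetical helper)."""
    star = [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,            # pre-built STAR index of the reference
        "--readFilesIn", joined_fastq,        # joined PE reads or SR reads
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outFileNamePrefix", prefix,
    ]
    # STAR names the sorted BAM after the prefix it was given.
    bam = prefix + "Aligned.sortedByCoord.out.bam"
    feature_counts = [
        "featureCounts",
        "-T", str(threads),
        "-a", gtf,                  # annotation file (assumed to be GTF)
        "-t", "exon",               # feature type to count over
        "-g", "gene_biotype",       # group counts by biotype (assumed GTF attribute)
        "-o", prefix + "biotypes.txt",
        bam,
    ]
    return star, feature_counts

star_cmd, fc_cmd = build_biotype_cmds(
    "sample_joined.fastq", "star_index/", "annotation.gtf", "sample_")
print(" ".join(star_cmd))
print(" ".join(fc_cmd))
```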
Here we describe the miRNA analysis implemented within XICRA; the modularity and versatility of the pipeline make it straightforward to add detailed analyses of other RNA biotypes.
For miRNA analysis at isomiR resolution, XICRA allows the user to choose between miraligner [26], sRNAbench from sRNAtoolbox [27] and OPTIMIR [28].
- Each tool uses a different strategy and might produce different results [36].
- We included these tools because their output can be standardized with miRTOP and converted to the miR.gff3 file format [37].
- Again, the modular implementation of the pipeline would allow additional tools to be added, provided their output can be adapted to miRTOP and the miR.gff3 format.
- Each of the tools included in the XICRA miRNA module is run with its default parameters.
- Some of these parameters can be modified using the different options provided.
- The miRNA module produces an annotation that categorizes isomiRs into classes based on their sequence modifications (including iso_5p, iso_3p, iso_add, iso_snv, iso_snv_seed, iso_snv_central_offset, iso_snv_central, iso_snv_central_suppl), following the classification scheme suggested by miRTOP.
- A final conversion step from individual per sample miR.gff3 files into a single expression matrix is performed.
- This file serves as input for differential expression (DE) analysis.
- Information is provided for each unique sequence; each index name contains the miRNA, the variant type and the license plate (unique identifier, UID) provided by miRTOP.
- Duplicated entries at the sequence level, produced by different modifications of the same or of different miRNAs, are discarded.
An additional matrix is provided containing the sequence information for each encrypted UID.
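The conversion from per-sample records into a single expression matrix can be sketched as follows. This is an illustration under stated assumptions, not XICRA's code: the `&` separator in the index name, the record layout, the toy miRNA names, UIDs, sequences and counts are all hypothetical, and the duplicate rule is read here as "drop any sequence annotated with more than one miRNA or variant label".

```python
from collections import defaultdict

def build_matrix(per_sample):
    """Merge per-sample isomiR records into one count matrix.

    per_sample maps sample name -> list of (mirna, variant, uid, sequence, count).
    Returns (matrix, uid_to_seq); matrix maps index name -> {sample: count}.
    """
    # Find sequences carrying more than one (miRNA, variant) annotation.
    labels_by_seq = defaultdict(set)
    for records in per_sample.values():
        for mirna, variant, _uid, seq, _count in records:
            labels_by_seq[seq].add((mirna, variant))
    ambiguous = {s for s, labels in labels_by_seq.items() if len(labels) > 1}

    samples = sorted(per_sample)
    matrix, uid_to_seq = {}, {}
    for sample, records in per_sample.items():
        for mirna, variant, uid, seq, count in records:
            if seq in ambiguous:
                continue  # duplicated at the sequence level: discarded
            name = "&".join((mirna, variant, uid))  # assumed index layout
            row = matrix.setdefault(name, dict.fromkeys(samples, 0))
            row[sample] += count
            uid_to_seq[uid] = seq  # companion table: UID -> sequence
    return matrix, uid_to_seq

# Toy input (invented values, for illustration only).
toy = {
    "s1": [("hsa-miR-501-3p", "NA", "uid1", "AAAA", 10),
           ("hsa-miR-501-3p", "iso_5p:-1", "uid2", "CAAAA", 3),
           ("toy-miR-A", "iso_3p:+1", "uid3", "GGGG", 2)],
    "s2": [("hsa-miR-501-3p", "NA", "uid1", "AAAA", 7),
           ("toy-miR-B", "iso_snv", "uid3", "GGGG", 5)],
}
matrix, uid_to_seq = build_matrix(toy)
for name, row in matrix.items():
    print(name, row)
```

In the toy input, the sequence `GGGG` is annotated under two different miRNAs and is therefore left out of both the matrix and the UID table.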
Per-sample read count matrices at the isomiR level are summarized into a single expression matrix that serves as input for DE analysis between the comparison groups of interest.
- We have generated an additional R package (XICRA.stats) that facilitates the retrieval of these matrices and parses the information included within each unique index name provided.
- The DE analysis can be done by aggregating data at the mature miRNA level (e.g. hsa-miR-501-3p), by isomiR class (e.g. hsa-miR-501-3p_iso_5p), by specific length variant cluster (e.g. hsa-miR-501-3p_iso_3p:-2), or by using the read sequence itself as the counting unit.
- This flexibility is useful because different types of modification may coexist in a single sequence, and because non-templated additions and internal edits can place an isomiR in more than one category or make it derivable from more than one mature miRNA.
- DE analysis is performed outside of the tool with the DESeq2 package in R [38].
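The aggregation levels listed above can be illustrated with a short sketch. It is written in Python for illustration only and is not the XICRA.stats API (which is an R package); the function name, the level labels, the record layout, the placeholder sequences and the counts are hypothetical.

```python
from collections import Counter

def aggregate(rows, level):
    """Sum counts at one aggregation level.

    rows: iterable of (mirna, variant, sequence, count), where variant looks
    like "iso_3p:-2". level is one of "miRNA", "class", "variant", "sequence".
    """
    totals = Counter()
    for mirna, variant, seq, count in rows:
        if level == "miRNA":
            key = mirna                                # e.g. hsa-miR-501-3p
        elif level == "class":
            key = f"{mirna}_{variant.split(':')[0]}"   # e.g. hsa-miR-501-3p_iso_3p
        elif level == "variant":
            key = f"{mirna}_{variant}"                 # e.g. hsa-miR-501-3p_iso_3p:-2
        elif level == "sequence":
            key = seq                                  # the read sequence itself
        else:
            raise ValueError(f"unknown level: {level}")
        totals[key] += count
    return dict(totals)

# Toy rows (invented counts, placeholder sequences).
rows = [
    ("hsa-miR-501-3p", "iso_3p:-2", "SEQ_A", 4),
    ("hsa-miR-501-3p", "iso_3p:-1", "SEQ_B", 6),
    ("hsa-miR-501-3p", "iso_5p:+1", "SEQ_C", 1),
]
for level in ("miRNA", "class", "variant", "sequence"):
    print(level, aggregate(rows, level))
```

Each aggregated table could then be passed, one sample per column, to a DE tool such as DESeq2.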
