## Warning in sparseMatrix(i = indices[] + 1, p = indptr[], x = as.numeric(x =
## counts[]), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in sparseMatrix(i = indices[] + 1, p = indptr[], x = as.numeric(x =
## counts[]), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in sparseMatrix(i = indices[] + 1, p = indptr[], x = as.numeric(x =
## counts[]), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in sparseMatrix(i = indices[] + 1, p = indptr[], x = as.numeric(x =
## counts[]), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in sparseMatrix(i = indices[] + 1, p = indptr[], x = as.numeric(x =
## counts[]), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in sparseMatrix(i = indices[] + 1, p = indptr[], x = as.numeric(x =
## counts[]), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in sparseMatrix(i = indices[] + 1, p = indptr[], x = as.numeric(x =
## counts[]), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in sparseMatrix(i = indices[] + 1, p = indptr[], x = as.numeric(x =
## counts[]), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in sparseMatrix(i = indices[] + 1, p = indptr[], x = as.numeric(x =
## counts[]), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in sparseMatrix(i = indices[] + 1, p = indptr[], x = as.numeric(x =
## counts[]), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in sparseMatrix(i = indices[] + 1, p = indptr[], x = as.numeric(x =
## counts[]), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in sparseMatrix(i = indices[] + 1, p = indptr[], x = as.numeric(x =
## counts[]), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')

## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')

## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')

## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')

## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')

## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')

## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')

## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')

## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')

## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')

## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')

## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')
##         sdata.SScILD_S19584_Tripleminus_AAACCTGAGAAAGTGG-1
## MT-TF                                                    0
## MT-RNR1                                                  2
## MT-TV                                                    2
## MT-RNR2                                                  4
## MT-TL1                                                   1
## MT-ND1                                                   4
## MT-TI                                                    0
## MT-TQ                                                    0
## MT-TM                                                    0
## MT-ND2                                                   2
##         sdata.SScILD_S19584_Tripleminus_AAACCTGAGCTCCCAG-1
## MT-TF                                                    0
## MT-RNR1                                                  2
## MT-TV                                                    7
## MT-RNR2                                                  2
## MT-TL1                                                   0
## MT-ND1                                                   3
## MT-TI                                                    0
## MT-TQ                                                    0
## MT-TM                                                    1
## MT-ND2                                                   5
##                                                                   orig.ident
## sdata.SScILD_S19584_Tripleminus_AAACCTGAGAAAGTGG-1 SScILD_S19584_Tripleminus
## sdata.SScILD_S19584_Tripleminus_AAACCTGAGCTCCCAG-1 SScILD_S19584_Tripleminus
## sdata.SScILD_S19584_Tripleminus_AAACCTGAGGATGTAT-1 SScILD_S19584_Tripleminus
## sdata.SScILD_S19584_Tripleminus_AAACCTGAGTTAGGTA-1 SScILD_S19584_Tripleminus
## sdata.SScILD_S19584_Tripleminus_AAACCTGCAATGGAAT-1 SScILD_S19584_Tripleminus
## sdata.SScILD_S19584_Tripleminus_AAACCTGCAATGGTCT-1 SScILD_S19584_Tripleminus
## sdata.SScILD_S19584_Tripleminus_AAACCTGCACGGTGTC-1 SScILD_S19584_Tripleminus
## sdata.SScILD_S19584_Tripleminus_AAACCTGCACTCGACG-1 SScILD_S19584_Tripleminus
## sdata.SScILD_S19584_Tripleminus_AAACCTGCACTTGGAT-1 SScILD_S19584_Tripleminus
## sdata.SScILD_S19584_Tripleminus_AAACCTGCAGTGAGTG-1 SScILD_S19584_Tripleminus
##                                                    nCount_RNA nFeature_RNA
## sdata.SScILD_S19584_Tripleminus_AAACCTGAGAAAGTGG-1       1335          572
## sdata.SScILD_S19584_Tripleminus_AAACCTGAGCTCCCAG-1        759          479
## sdata.SScILD_S19584_Tripleminus_AAACCTGAGGATGTAT-1       3732         1464
## sdata.SScILD_S19584_Tripleminus_AAACCTGAGTTAGGTA-1        783          473
## sdata.SScILD_S19584_Tripleminus_AAACCTGCAATGGAAT-1       7493         2080
## sdata.SScILD_S19584_Tripleminus_AAACCTGCAATGGTCT-1       9782         2262
## sdata.SScILD_S19584_Tripleminus_AAACCTGCACGGTGTC-1       7133         2678
## sdata.SScILD_S19584_Tripleminus_AAACCTGCACTCGACG-1       4650         1549
## sdata.SScILD_S19584_Tripleminus_AAACCTGCACTTGGAT-1        582          241
## sdata.SScILD_S19584_Tripleminus_AAACCTGCAGTGAGTG-1      12683         2507
##                                                    disease patient    type
## sdata.SScILD_S19584_Tripleminus_AAACCTGAGAAAGTGG-1 SSc-ILD  S19584 Triple-
## sdata.SScILD_S19584_Tripleminus_AAACCTGAGCTCCCAG-1 SSc-ILD  S19584 Triple-
## sdata.SScILD_S19584_Tripleminus_AAACCTGAGGATGTAT-1 SSc-ILD  S19584 Triple-
## sdata.SScILD_S19584_Tripleminus_AAACCTGAGTTAGGTA-1 SSc-ILD  S19584 Triple-
## sdata.SScILD_S19584_Tripleminus_AAACCTGCAATGGAAT-1 SSc-ILD  S19584 Triple-
## sdata.SScILD_S19584_Tripleminus_AAACCTGCAATGGTCT-1 SSc-ILD  S19584 Triple-
## sdata.SScILD_S19584_Tripleminus_AAACCTGCACGGTGTC-1 SSc-ILD  S19584 Triple-
## sdata.SScILD_S19584_Tripleminus_AAACCTGCACTCGACG-1 SSc-ILD  S19584 Triple-
## sdata.SScILD_S19584_Tripleminus_AAACCTGCACTTGGAT-1 SSc-ILD  S19584 Triple-
## sdata.SScILD_S19584_Tripleminus_AAACCTGCAGTGAGTG-1 SSc-ILD  S19584 Triple-
##  [1] "MT-TF"   "MT-RNR1" "MT-TV"   "MT-RNR2" "MT-TL1"  "MT-ND1"  "MT-TI"  
##  [8] "MT-TQ"   "MT-TM"   "MT-ND2"
##  [1] "RPL23AP53" "RPL14P3"   "RPL23AP83" "RPL30P4"   "RPS2P28"   "RPL12P47" 
##  [7] "RPL26P30"  "RPL6P8"    "RPL5P2"    "RPLP0P6"

Quality control

Plot QC

calculate the percentage of mitocondrial and ribosomal genes per cell and plot some of the QC-features as violin plots.

There is quite some difference in quality for the datasets. As the ribosomal proteins are highly expressed they will make up a larger proportion of the transcriptional landscape when fewer of the lowly expressed genes are detected. And we can plot the different QC-measures as scatter plots.

Filtering

Detection-based filtering

A standard approach is to filter cells with low amount of reads as well as genes that are present in at least a certain amount of cells. Here we will only consider cells with at least 200 detected genes and genes need to be expressed in at least 3 cells.

selected_c <- WhichCells(alldata, expression = nFeature_RNA > 200)
selected_f <- rownames(alldata)[ Matrix::rowSums(alldata) > 3]

data.filt <- subset(alldata, features=selected_f, cells=selected_c)
dim(data.filt)
## [1] 28859 43983
## [1] 43983

Additionally, we can also see which genes contribute the most to such reads. We can for instance plot the percentage of counts per gene.

MALAT1 constitutes up to 10% of the UMIs from a single cell and the other top genes are mitochondrial and ribosomal genes. It is quite common that nuclear lincRNAs have correlation with quality and mitochondrial reads, so high detection of MALAT1 may be a technical issue. Let us assemble some information about such genes, which are important for quality control and downstream filtering.

Mito/Ribo filtering

We also have quite a lot of cells with high proportion of mitochondrial and low proportion ofribosomal reads. It could be wise to remove those cells, if we have enough cells left after filtering.

selected_mito <- WhichCells(data.filt, expression = percent_mito < 0.20)
selected_ribo <- WhichCells(data.filt, expression = percent_ribo > 0.05)

# and subset the object to only keep those cells
data.filt <- subset(data.filt, cells = selected_mito)
data.filt <- subset(data.filt, cells = selected_ribo)

dim(data.filt)

table(data.filt$orig.ident)
## [1] 28859 42291
## 
##    Control_N1865_CD31plus    Control_N1865_CD45plus   Control_N1865_EPCAMplus 
##                      1889                      1165                      2906 
## Control_N1865_Tripleminus       IPF_F18545_CD31plus       IPF_F18545_CD45plus 
##                      2019                      5552                      3480 
##      IPF_F18545_EPCAMplus    IPF_F18545_Tripleminus    SScILD_S19584_CD31plus 
##                      5342                      3343                      3904 
##    SScILD_S19584_CD45plus   SScILD_S19584_EPCAMplus SScILD_S19584_Tripleminus 
##                      4050                      3643                      4998

Plot filtered QC

Lets plot the same QC-stats another time.

Filter genes

As the level of expression of mitochondrial and MALAT1 genes are judged as mainly technical, it can be wise to remove them from the dataset bofore any further analysis.

dim(data.filt)

# Filter MALAT1
data.filt <- data.filt[ ! grepl("MALAT1", rownames(data.filt)), ]

# Filter Mitocondrial
data.filt <- data.filt[ ! grepl("^MT-", rownames(data.filt)), ]

# Filter Ribossomal gene (optional if that is a problem on your data)
# data.filt <- data.filt[ ! grepl("^RP[SL]", rownames(data.filt)), ]

# Filter Hemoglobin gene (optional if that is a problem on your data)
data.filt <- data.filt[ ! grepl("^HB[^(P)]", rownames(data.filt)), ]

dim(data.filt)
## [1] 28859 42291
## [1] 28813 42291

Sample sex

By looking at reads from chromosomeY (males) and XIST (X-inactive specific transcript) expression (mainly female) it is quite easy to determine per sample which sex it is.

Calculate cell-cycle scores

We can plot violin plots for the cell cycle scores.

## Warning: The following features are not present in the object: MLF1IP, not
## searching for symbol synonyms
## Warning: The following features are not present in the object: FAM64A, HN1, not
## searching for symbol synonyms

Predict doublets

## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## Found more than one class "dist" in cache; using the first, from namespace 'spam'
## Also defined by 'BiocGenerics'
## Found more than one class "dist" in cache; using the first, from namespace 'spam'
## Also defined by 'BiocGenerics'
## [1] "Creating 14097 artificial doublets..."
## [1] "Creating Seurat object..."
## [1] "Normalizing Seurat object..."
## [1] "Finding variable genes..."
## [1] "Scaling data..."
## [1] "Running PCA..."
## [1] "Calculating PC distance matrix..."
## [1] "Computing pANN..."
## [1] "Classifying doublets.."

We should expect that two cells have more detected genes than a single cell, lets check if our predicted doublets also have more detected genes in general.

Now, lets remove all predicted doublets from our data.

## [1] 28813 40599

Dimensionality reduction

Data preparation

Feature selection

Next, we first need to define which features/genes are important in our dataset to distinguish cell types. For this purpose, we need to find genes that are highly variable across cells, which in turn will also provide a good separation of the cell clusters.

## When using repel, set xnudge and ynudge to 0 for optimal results
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 22807 rows containing missing values (geom_point).
## Warning: ggrepel: 7 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

## Regressing out percent_mito, nFeature_RNA
## Centering and scaling data matrix

tSNE

We can now plot the tSNE colored per dataset. We can clearly see the effect of batches present in the dataset.

UMAP

We can now run UMAP for cell embeddings.

## 15:10:55 UMAP embedding parameters a = 0.9922 b = 1.112
## Found more than one class "dist" in cache; using the first, from namespace 'spam'
## Also defined by 'BiocGenerics'
## 15:10:55 Read 44091 rows and found 30 numeric columns
## 15:10:55 Using Annoy for neighbor search, n_neighbors = 30
## Found more than one class "dist" in cache; using the first, from namespace 'spam'
## Also defined by 'BiocGenerics'
## 15:10:55 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 15:11:02 Writing NN index file to temp file /tmp/Rtmpoov3M7/file5a49332a471a
## 15:11:02 Searching Annoy index using 1 thread, search_k = 3000
## 15:11:17 Annoy recall = 100%
## 15:11:19 Commencing smooth kNN distance calibration using 1 thread
## 15:11:21 Initializing from normalized Laplacian + noise
## 15:11:28 Commencing optimization for 200 epochs, with 1889484 positive edges
## 15:11:53 Optimization finished
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at -3.249
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 0.30103
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 3.1804e-16
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at -3.0075
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 0.30103
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 1.1038e-15
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at -2.9897
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 0.30103
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 1.1267e-15
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at -3.0916
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 0.50197
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 1.1401e-15
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 0.090619
## [1] "pca"  "tsne" "umap"
## Computing 2000 integration features
## Scaling features for provided objects
## Finding all pairwise anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 8973 anchors
## Filtering anchors
##  Retained 4273 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 9682 anchors
## Filtering anchors
##  Retained 3656 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 8026 anchors
## Filtering anchors
##  Retained 2410 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 7011 anchors
## Filtering anchors
##  Retained 2647 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 7441 anchors
## Filtering anchors
##  Retained 1571 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 8545 anchors
## Filtering anchors
##  Retained 3090 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 8238 anchors
## Filtering anchors
##  Retained 5156 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 7665 anchors
## Filtering anchors
##  Retained 1752 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 8686 anchors
## Filtering anchors
##  Retained 3361 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 6563 anchors
## Filtering anchors
##  Retained 3424 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 12238 anchors
## Filtering anchors
##  Retained 3447 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 11155 anchors
## Filtering anchors
##  Retained 5615 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 11201 anchors
## Filtering anchors
##  Retained 1647 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 11421 anchors
## Filtering anchors
##  Retained 1561 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 9416 anchors
## Filtering anchors
##  Retained 1503 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 12286 anchors
## Filtering anchors
##  Retained 2824 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 10790 anchors
## Filtering anchors
##  Retained 1870 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 11686 anchors
## Filtering anchors
##  Retained 4908 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 9970 anchors
## Filtering anchors
##  Retained 2365 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 10237 anchors
## Filtering anchors
##  Retained 3843 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 15030 anchors
## Filtering anchors
##  Retained 1990 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 9261 anchors
## Filtering anchors
##  Retained 2769 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 9383 anchors
## Filtering anchors
##  Retained 2114 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 9348 anchors
## Filtering anchors
##  Retained 2875 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 9007 anchors
## Filtering anchors
##  Retained 4621 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 7700 anchors
## Filtering anchors
##  Retained 2819 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 12329 anchors
## Filtering anchors
##  Retained 2687 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 9925 anchors
## Filtering anchors
##  Retained 3159 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 7661 anchors
## Filtering anchors
##  Retained 3537 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 7644 anchors
## Filtering anchors
##  Retained 1703 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 7974 anchors
## Filtering anchors
##  Retained 2973 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 7610 anchors
## Filtering anchors
##  Retained 2329 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 6548 anchors
## Filtering anchors
##  Retained 2948 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 8748 anchors
## Filtering anchors
##  Retained 2310 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 8753 anchors
## Filtering anchors
##  Retained 3252 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 7834 anchors
## Filtering anchors
##  Retained 2212 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 10135 anchors
## Filtering anchors
##  Retained 1869 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 8890 anchors
## Filtering anchors
##  Retained 2158 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 9173 anchors
## Filtering anchors
##  Retained 1609 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 9445 anchors
## Filtering anchors
##  Retained 1851 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 8061 anchors
## Filtering anchors
##  Retained 1692 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 10783 anchors
## Filtering anchors
##  Retained 2279 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 10784 anchors
## Filtering anchors
##  Retained 2292 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 9121 anchors
## Filtering anchors
##  Retained 2005 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 7496 anchors
## Filtering anchors
##  Retained 1298 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 6081 anchors
## Filtering anchors
##  Retained 2849 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 5982 anchors
## Filtering anchors
##  Retained 2204 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 6210 anchors
## Filtering anchors
##  Retained 4019 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 5951 anchors
## Filtering anchors
##  Retained 2419 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 5851 anchors
## Filtering anchors
##  Retained 3372 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 7287 anchors
## Filtering anchors
##  Retained 2238 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 6453 anchors
## Filtering anchors
##  Retained 4170 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 6154 anchors
## Filtering anchors
##  Retained 2298 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 5337 anchors
## Filtering anchors
##  Retained 2050 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 6110 anchors
## Filtering anchors
##  Retained 1999 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 4401 anchors
## Filtering anchors
##  Retained 1865 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 4713 anchors
## Filtering anchors
##  Retained 2000 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 4657 anchors
## Filtering anchors
##  Retained 2189 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 4234 anchors
## Filtering anchors
##  Retained 3288 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 4142 anchors
## Filtering anchors
##  Retained 2223 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 5184 anchors
## Filtering anchors
##  Retained 1941 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 4623 anchors
## Filtering anchors
##  Retained 2409 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 4263 anchors
## Filtering anchors
##  Retained 3413 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 4206 anchors
## Filtering anchors
##  Retained 1785 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 4529 anchors
## Filtering anchors
##  Retained 1704 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 3792 anchors
## Filtering anchors
##  Retained 2052 anchors
## Merging dataset 12 into 8
## Extracting anchors for merged samples
## Finding integration vectors
## Finding integration vector weights
## Integrating data
## Merging dataset 11 into 7
## Extracting anchors for merged samples
## Finding integration vectors
## Finding integration vector weights
## Integrating data
## Merging dataset 9 into 1
## Extracting anchors for merged samples
## Finding integration vectors
## Finding integration vector weights
## Integrating data
## Merging dataset 2 into 6
## Extracting anchors for merged samples
## Finding integration vectors
## Finding integration vector weights
## Integrating data
## Merging dataset 5 into 1 9
## Extracting anchors for merged samples
## Finding integration vectors
## Finding integration vector weights
## Integrating data
## Merging dataset 4 into 8 12
## Extracting anchors for merged samples
## Finding integration vectors
## Finding integration vector weights
## Integrating data
## Merging dataset 3 into 7 11
## Extracting anchors for merged samples
## Finding integration vectors
## Finding integration vector weights
## Integrating data
## Merging dataset 10 into 6 2
## Extracting anchors for merged samples
## Finding integration vectors
## Finding integration vector weights
## Integrating data
## Merging dataset 8 12 4 into 1 9 5
## Extracting anchors for merged samples
## Finding integration vectors
## Finding integration vector weights
## Integrating data
## Merging dataset 7 11 3 into 1 9 5 8 12 4
## Extracting anchors for merged samples
## Finding integration vectors
## Finding integration vector weights
## Integrating data
## Merging dataset 6 2 10 into 1 9 5 8 12 4 7 11 3
## Extracting anchors for merged samples
## Finding integration vectors
## Finding integration vector weights
## Integrating data
## [1] "RNA" "CCA"
## [1] "CCA"
## 15:41:59 UMAP embedding parameters a = 0.9922 b = 1.112
## Found more than one class "dist" in cache; using the first, from namespace 'spam'
## Also defined by 'BiocGenerics'
## 15:41:59 Read 44091 rows and found 30 numeric columns
## 15:41:59 Using Annoy for neighbor search, n_neighbors = 30
## Found more than one class "dist" in cache; using the first, from namespace 'spam'
## Also defined by 'BiocGenerics'
## 15:41:59 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 15:42:09 Writing NN index file to temp file /tmp/Rtmpoov3M7/file5a4937a74894
## 15:42:09 Searching Annoy index using 1 thread, search_k = 3000
## 15:42:29 Annoy recall = 100%
## 15:42:30 Commencing smooth kNN distance calibration using 1 thread
## 15:42:33 Initializing from normalized Laplacian + noise
## 15:42:35 Commencing optimization for 200 epochs, with 2074622 positive edges
## 15:43:02 Optimization finished
## Warning: Could not find CD4 in the default search locations, found in RNA assay
## instead
## Warning: Could not find CST3 in the default search locations, found in RNA assay
## instead

Clustering

## Computing nearest neighbor graph
## Computing SNN
## [1] "CCA"
## [1] "CCA_nn"  "CCA_snn"

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 2204550)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 2204550)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 2204550)
## Warning: did not converge in 10 iterations
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 2204550)
## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 2204550)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 2204550)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 2204550)
## Warning: did not converge in 10 iterations
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 2204550)
## Warning: did not converge in 10 iterations
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 2204550)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 2204550)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 2204550)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 2204550)
## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 2204550)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 2204550)
## Warning: did not converge in 10 iterations
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 2204550)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 2204550)
## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 2204550)
## Warning: did not converge in 10 iterations
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 2204550)
## Warning: did not converge in 10 iterations
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 2204550)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 2204550)
## Warning: did not converge in 10 iterations
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 2204550)
## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 2204550)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 2204550)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 2204550)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 2204550)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 2204550)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 2204550)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 2204550)
## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

Differential gene expression

In single cell, differential expresison can have multiple functionalities such as of identifying marker genes for cell populations, as well as differentially regulated genes across conditions (healthy vs control). First let’s define using the louvain clustering to identifying differentially expressed genes.

## 
##     0     1     2     3     4     5     6     7     8 
## 22397  9136  6784  2493  1196   723   692   530   140
Var1 Freq
0 22397
1 9136
2 6784
3 2493
4 1196
5 723
6 692
7 530
8 140

Cell marker genes

Let us first compute a ranking for the highly differential genes in each cluster. When looking for marker genes, we want genes that are positivelly expressed in a cell type and possibly not expressed in the others.

#Compute differentiall expression
markers_genes <- FindAllMarkers(alldata.int,
                               logfc.threshold = 0.2,
                               test.use = "wilcox",
                               min.pct = 0.1,
                               min.diff.pct = 0.2,
                               only.pos = TRUE,
                               max.cells.per.ident = 50,
                               assay = "RNA")
## Calculating cluster 0
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
## Calculating cluster 5
## Calculating cluster 6
## Calculating cluster 7
## Calculating cluster 8

We can now select the top 25 up regulated genes for plotting.

## Registered S3 method overwritten by 'cli':
##   method     from         
##   print.boxx spatstat.geom
## # A tibble: 245 × 7
## # Groups:   cluster [9]
##         p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene   
##         <dbl>      <dbl> <dbl> <dbl>     <dbl> <fct>   <chr>  
##  1 0.00000254      2.02  0.5   0.101     0.148 0       SPARCL1
##  2 0.00000809      2.14  0.566 0.228     0.471 0       NNMT   
##  3 0.0000119       1.99  0.39  0.089     0.694 0       TIMP3  
##  4 0.0000243       1.54  0.465 0.215     1     0       TFPI   
##  5 0.0000311       1.56  0.395 0.119     1     0       CAV1   
##  6 0.0000650       1.48  0.225 0.017     1     0       CALCRL 
##  7 0.0000931       1.95  0.515 0.196     1     0       IGFBP4 
##  8 0.000129        1.41  0.233 0.024     1     0       ECSCR  
##  9 0.000393        0.942 0.256 0.045     1     0       COL4A1 
## 10 0.000494        2.02  0.288 0.025     1     0       RAMP2  
## # … with 235 more rows

We can now select the top 25 up regulated genes for plotting.

We can visualize them as a heatmap. Here we are selecting the top 5.

## Centering and scaling data matrix

Another way is by representing the overal group expression and detection rates in a dot-plot.

We can also plot a violin plot for each gene.

Differential expression across conditions

The second way of computing differential expression is to answer which genes are differentially expressed within a cluster. For example, in our case we have libraries comming from patients and controls and we would like to know which genes are influenced the most in a particular cell type.

## Calculating cluster Triple-
## Calculating cluster EPCAM+
## Calculating cluster CD31+
## Calculating cluster CD45+
## Calculating cluster SSc-ILD
## Calculating cluster IPF
## Calculating cluster Control

## The default behaviour of split.by has changed.
## Separate violin plots are now plotted side-by-side.
## To restore the old behaviour of a single split violin,
## set split.plot = TRUE.
##       
## This message will be shown once per session.

Celltype prediction

## An object of class Seurat 
## 58302 features across 8062 samples within 1 assay 
## Active assay: RNA (58302 features, 2000 variable features)
##  3 dimensional reductions calculated: pca, tsne, umap

Reference data

We load the reference dataset with annotated labels. Also, run all steps of the normal analysis pipeline with normalizaiton, variable gene selection, scaling and dimensionality reduction.

reference <- scPred::pbmc_1
reference
## An object of class Seurat 
## 32838 features across 3500 samples within 1 assay 
## Active assay: RNA (32838 features, 0 variable features)
## Centering and scaling data matrix
## 16:02:18 UMAP embedding parameters a = 0.9922 b = 1.112
## Found more than one class "dist" in cache; using the first, from namespace 'spam'
## Also defined by 'BiocGenerics'
## 16:02:18 Read 3500 rows and found 30 numeric columns
## 16:02:18 Using Annoy for neighbor search, n_neighbors = 30
## Found more than one class "dist" in cache; using the first, from namespace 'spam'
## Also defined by 'BiocGenerics'
## 16:02:18 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 16:02:19 Writing NN index file to temp file /tmp/Rtmpoov3M7/file5a495dda9c7f
## 16:02:19 Searching Annoy index using 1 thread, search_k = 3000
## 16:02:20 Annoy recall = 100%
## 16:02:21 Commencing smooth kNN distance calibration using 1 thread
## 16:02:22 Initializing from normalized Laplacian + noise
## 16:02:22 Commencing optimization for 500 epochs, with 158004 positive edges
## 16:02:27 Optimization finished

Run all steps of the analysis for the ctrl sample as well. Use the clustering from the integration lab with resolution 0.1.

## Centering and scaling data matrix
## 16:02:35 UMAP embedding parameters a = 0.9922 b = 1.112
## Found more than one class "dist" in cache; using the first, from namespace 'spam'
## Also defined by 'BiocGenerics'
## 16:02:35 Read 8062 rows and found 30 numeric columns
## 16:02:35 Using Annoy for neighbor search, n_neighbors = 30
## Found more than one class "dist" in cache; using the first, from namespace 'spam'
## Also defined by 'BiocGenerics'
## 16:02:35 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 16:02:37 Writing NN index file to temp file /tmp/Rtmpoov3M7/file5a491581e514
## 16:02:37 Searching Annoy index using 1 thread, search_k = 3000
## 16:02:39 Annoy recall = 100%
## 16:02:40 Commencing smooth kNN distance calibration using 1 thread
## 16:02:41 Found 2 connected components, falling back to 'spca' initialization with init_sdev = 1
## 16:02:41 Initializing from PCA
## Found more than one class "dist" in cache; using the first, from namespace 'spam'
## Also defined by 'BiocGenerics'
## 16:02:41 PCA: 2 components explained 42.84% variance
## 16:02:41 Commencing optimization for 500 epochs, with 348660 positive edges
## 16:02:51 Optimization finished

Label transfer

## Performing PCA on the provided reference using 1829 features as input.
## Projecting cell embeddings
## Finding neighborhoods
## Finding anchors
##  Found 1503 anchors
## Filtering anchors
##  Retained 816 anchors
## Finding integration vectors
## Finding integration vector weights
## Predicting cell labels

Now plot how many cells of each celltypes can be found in each cluster.