Clustering cells

Clustering scRNA-seq data is dangerous because it can turn into a 'time sink'. PySCN has some functions that might help you to more efficiently identify stable and meaning clusters in your scRNA-seq data. This tutorial shows you how to use these functions:

cluster_alot: cluster adata across specified parameter values
clustering_quality_vs_nn_summary: computes practical metrics of clustering quality
cluster_subcluster: runs complete clustering procedure on specified sub-set of cells

In this tutorial, we shoul you how to use these functions to cluster the PBMC data below

Data¶

10k PBMCs from a Healthy Donor (v3 chemistry) Single Cell Gene Expression Dataset by Cell Ranger 3.0.0
From 10X Genomics click here to download the processed data in h5ad

Setting up¶

Import requisite packages

In [1]:

Copied!





import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
import os, sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scanpy as sc
import pySingleCellNet as cn
import marsilea as ma
import marsilea.plotter as mp
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
import os, sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scanpy as sc
import pySingleCellNet as cn
import marsilea as ma
import marsilea.plotter as mp

Load the data

In [2]:

Copied!

adRef = sc.read_h5ad("adPBMC_ref_040623.h5ad")
adRef.shape
adRef = sc.read_h5ad("adPBMC_ref_040623.h5ad")
adRef.shape

Out[2]:

(10309, 20104)

Filter genes

In [3]:

Copied!





min_cells = 50
adata = adRef.copy()
sc.pp.filter_genes(adata, min_cells = min_cells)
adata.shape
min_cells = 50
adata = adRef.copy()
sc.pp.filter_genes(adata, min_cells = min_cells)
adata.shape

Out[3]:

(10309, 13503)

Normalize counts

In [4]:

Copied!

adata.layers['counts'] = adata.X.copy()
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
adata.layers['counts'] = adata.X.copy()
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)

Identify highly variable genes and perform principle component analysis.

In [5]:

Copied!

sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor='cell_ranger')
sc.tl.pca(adata, mask_var='highly_variable')
sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor='cell_ranger')
sc.tl.pca(adata, mask_var='highly_variable')

Sir Cluster-Alot¶

The snippet below runs a typical clustering procedure. However, instead of running it just once, cluster_alot runs it once for each combination of parameters values. In this example, we just vary Leiden resolution, but you can specify other parameters, too. See the cluster_alot documentation to learn about other clustering parameters that can be tested.

In [6]:

Copied!





# Define clustering parameters
l_res = [0.01, 0.05, 0.10, 0.15, 0.20, 0.25]
knns = [10]
pcs = [20]

# Perform clustering
clusters = cn.tl.cluster_alot(adata, leiden_resolutions = l_res, pca_params = {"top_n_pcs": pcs}, knn_params = {'n_neighbors': knns})
# Define clustering parameters
l_res = [0.01, 0.05, 0.10, 0.15, 0.20, 0.25]
knns = [10]
pcs = [20]

# Perform clustering
clusters = cn.tl.cluster_alot(adata, leiden_resolutions = l_res, pca_params = {"top_n_pcs": pcs}, knn_params = {'n_neighbors': knns})

[leiden] res=0.01 | N=20 | pct=1.00 | s01 | k=10 -> autoc_pc20_pct1.00_s01_k10_res0.01
[leiden] res=0.05 | N=20 | pct=1.00 | s01 | k=10 -> autoc_pc20_pct1.00_s01_k10_res0.05
[leiden] res=0.1 | N=20 | pct=1.00 | s01 | k=10 -> autoc_pc20_pct1.00_s01_k10_res0.1
[leiden] res=0.15 | N=20 | pct=1.00 | s01 | k=10 -> autoc_pc20_pct1.00_s01_k10_res0.15
[leiden] res=0.2 | N=20 | pct=1.00 | s01 | k=10 -> autoc_pc20_pct1.00_s01_k10_res0.2
[leiden] res=0.25 | N=20 | pct=1.00 | s01 | k=10 -> autoc_pc20_pct1.00_s01_k10_res0.25

Visualize the results of the different clustering runs. To show these in a UMAP embedding, we need to run kNN (which cluster_alot does not save) and make the embedding first.

In [7]:

Copied!

sc.pp.neighbors(adata, n_neighbors = 10,  n_pcs = 20)
sc.tl.umap(adata)
sc.pp.neighbors(adata, n_neighbors = 10,  n_pcs = 20)
sc.tl.umap(adata)

Now plot the clusters

In [8]:

Copied!

cluster_run_names = list(clusters.obs_key)
sc.pl.umap(adata, color= cluster_run_names, alpha=.8, s=12, legend_loc='on data', frameon=False, ncols=2, legend_fontoutline=2,legend_fontsize=14, title = l_res)
cluster_run_names = list(clusters.obs_key)
sc.pl.umap(adata, color= cluster_run_names, alpha=.8, s=12, legend_loc='on data', frameon=False, ncols=2, legend_fontoutline=2,legend_fontsize=14, title = l_res)

No description has been provided for this image

Clustering metics¶

If a clustering is good, then we expect strong cluster-specific signatures. clustering_quality_vs_nn_summary computes this by counting the number of genes differentially expressed between each cluster and its nearest neighboring cluster. We have reasonable defaults to define differential expression based on p-value and fold change ('naive') or based on p-value and the percent of cells detected (>= min percent incluster & <= max out-cluster -- 'strict'). See documentation of clustering_quality_vs_nn_summary for details including other clustering metrics computed.

In [9]:

Copied!

evals = cn.tl.clustering_quality_vs_nn_summary(adata, label_cols = cluster_run_names, n_genes = 10, n_pcs_for_nn = 20)
print(evals)
evals = cn.tl.clustering_quality_vs_nn_summary(adata, label_cols = cluster_run_names, n_genes = 10, n_pcs_for_nn = 20)
print(evals)

                            label_col  n_clusters  n_pairs  tested_genes  \
0  autoc_pc20_pct1.00_s01_k10_res0.01           4        4         13503   
1  autoc_pc20_pct1.00_s01_k10_res0.05           7        7         13503   
2   autoc_pc20_pct1.00_s01_k10_res0.1           8        8         13503   
3  autoc_pc20_pct1.00_s01_k10_res0.15           8        8         13503   
4   autoc_pc20_pct1.00_s01_k10_res0.2           9        9         13503   
5  autoc_pc20_pct1.00_s01_k10_res0.25          11       11         13503   

   unique_naive_genes  unique_strict_genes  min_naive_per_pair  \
0                6406                 1194                   0   
1                7412                 1093                   0   
2                6866                 1057                   0   
3                6877                 1045                   0   
4                6059                  923                   0   
5                5001                  695                   0   

   min_strict_per_pair  max_naive_per_pair  max_strict_per_pair  ...  \
0                    0                2955                  967  ...   
1                    0                1980                  361  ...   
2                    0                1956                  349  ...   
3                    0                1956                  349  ...   
4                    0                1949                  362  ...   
5                    0                1267                  306  ...   

   median_naive_per_pair  median_strict_per_pair  gini_naive_per_pair  \
0                 1879.5                   162.0            -0.125331   
1                 1666.0                   148.0            -0.138721   
2                 1177.5                    88.0            -0.199735   
3                 1154.5                    85.0            -0.192674   
4                  954.0                    72.0            -0.367479   
5                  692.0                    31.0            -0.352869   

   gini_strict_per_pair  frac_pairs_with_at_least_n_naive  \
0             -0.515253                          1.000000   
1             -0.343273                          1.000000   
2             -0.417936                          1.000000   
3             -0.425214                          1.000000   
4             -0.568440                          0.777778   
5             -0.666413                          0.818182   

   frac_pairs_with_at_least_n_strict  min_naive_exclusive_per_pair  \
0                           1.000000                             0   
1                           1.000000                             0   
2                           1.000000                             0   
3                           1.000000                             0   
4                           0.777778                             0   
5                           0.636364                             0   

   min_strict_exclusive_per_pair  max_naive_exclusive_per_pair  \
0                              0                          1597   
1                              0                          1016   
2                              0                           864   
3                              0                           855   
4                              0                           936   
5                              0                           675   

   max_strict_exclusive_per_pair  
0                            837  
1                            285  
2                            285  
3                            287  
4                            305  
5                            290  

[6 rows x 22 columns]

A plot of this result is easier to digest.

In [10]:

Copied!

cn.pl.heatmap_clustering_eval(evals, cmap_eval = 'bwr')
plt.show()
cn.pl.heatmap_clustering_eval(evals, cmap_eval = 'bwr')
plt.show()

The best result in terms of total number of differentially expressed genes (both naive and strict combined) is resolution = 0.05 (7 clusters). Let's compare these clustering results to previously defined 'cell_type' labels for these cells.

In [11]:

Copied!

sc.pl.umap(adata, color=["autoc_pc20_pct1.00_s01_k10_res0.05", "cell_type"], alpha=.8, s=10, legend_loc='on data', frameon=False, ncols=3, legend_fontoutline=1.5,legend_fontsize=10, title = ["0.05", "cell_type"])
sc.pl.umap(adata, color=["autoc_pc20_pct1.00_s01_k10_res0.05", "cell_type"], alpha=.8, s=10, legend_loc='on data', frameon=False, ncols=3, legend_fontoutline=1.5,legend_fontsize=10, title = ["0.05", "cell_type"])

The results are highly similar. To make sure the results are sane, let's perform differential gene expression analysis and a post-hoc filtering.

In [12]:

Copied!





sc.tl.rank_genes_groups(adata, use_raw=False, groupby='autoc_pc20_pct1.00_s01_k10_res0.05', mask_var='highly_variable')
minin = 0.20
maxout = 0.05
foldchange = 0.25

sc.tl.filter_rank_genes_groups(adata, min_fold_change=foldchange, min_in_group_fraction=minin, max_out_group_fraction=maxout, use_raw=False)

sc.pl.rank_genes_groups_dotplot(adata, n_genes=6, groupby='autoc_pc20_pct1.00_s01_k10_res0.05', dendrogram=True, key='rank_genes_groups_filtered', swap_axes=True)
sc.tl.rank_genes_groups(adata, use_raw=False, groupby='autoc_pc20_pct1.00_s01_k10_res0.05', mask_var='highly_variable')
minin = 0.20
maxout = 0.05
foldchange = 0.25

sc.tl.filter_rank_genes_groups(adata, min_fold_change=foldchange, min_in_group_fraction=minin, max_out_group_fraction=maxout, use_raw=False)

sc.pl.rank_genes_groups_dotplot(adata, n_genes=6, groupby='autoc_pc20_pct1.00_s01_k10_res0.05', dendrogram=True, key='rank_genes_groups_filtered', swap_axes=True)

WARNING: dendrogram data not found (using key=dendrogram_autoc_pc20_pct1.00_s01_k10_res0.05). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.

Sub-clustering¶

Often sub-clustering a group of cells independently of other cells in the data can reveal subtler differences between groups of cells. In the PBMC data, for example, the T cell cluster (cluster 0) seems to harbor at least two different sub-groups. Let's sub-cluster cluster 0.

In [13]:

Copied!

cn.tl.cluster_subclusters(adata, cluster_column = 'autoc_pc20_pct1.00_s01_k10_res0.05', to_subcluster = ['0'], layer = 'counts', leiden_resolution=.05, n_pcs = 10)
sc.pl.umap(adata, color=['subcluster'], size=8, alpha=.8,frameon=False,legend_loc='on data')
cn.tl.cluster_subclusters(adata, cluster_column = 'autoc_pc20_pct1.00_s01_k10_res0.05', to_subcluster = ['0'], layer = 'counts', leiden_resolution=.05, n_pcs = 10)
sc.pl.umap(adata, color=['subcluster'], size=8, alpha=.8,frameon=False,legend_loc='on data')

What genes are preferentially expressed in the cluster 0 sub-groups?

In [14]:

Copied!

sc.tl.rank_genes_groups(adata, use_raw=False, groupby='subcluster', mask_var='highly_variable')

sc.tl.filter_rank_genes_groups(adata, min_fold_change=foldchange, min_in_group_fraction=minin, max_out_group_fraction=maxout, use_raw=False)

sc.pl.rank_genes_groups_dotplot(adata, n_genes=12, groupby='subcluster', dendrogram=True, key='rank_genes_groups_filtered', swap_axes=True,groups=["0_0", "0_1"])
sc.tl.rank_genes_groups(adata, use_raw=False, groupby='subcluster', mask_var='highly_variable')

sc.tl.filter_rank_genes_groups(adata, min_fold_change=foldchange, min_in_group_fraction=minin, max_out_group_fraction=maxout, use_raw=False)

sc.pl.rank_genes_groups_dotplot(adata, n_genes=12, groupby='subcluster', dendrogram=True, key='rank_genes_groups_filtered', swap_axes=True,groups=["0_0", "0_1"])

WARNING: dendrogram data not found (using key=dendrogram_subcluster). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different.
categories: 0_0, 0_1, 1, etc.
var_group_labels: 0_0, 0_1

In [ ]: