Analysis tools, including classification
Functions that create or require the classifier object (clf), plus related helpers.
build_gene_knn
build_gene_knn(adata, mask_var=None, mean_cluster=True, groupby='leiden', knn=5, use_knn=True, metric='euclidean', key='gene')
Compute a gene–gene kNN graph (hard or Gaussian‑weighted) and store sparse connectivities & distances in adata.uns.
Parameters
adata
AnnData object (cells × genes). Internally transposed to (genes × cells).
mask_var
If not None, must be a column name in adata.var of boolean values.
Only genes where adata.var[mask_var] == True are included. If None, use all genes.
mean_cluster
If True, aggregate cells by cluster defined in adata.obs[groupby].
The kNN graph is computed on the mean‑expression profiles of each cluster
(genes × n_clusters) rather than genes × n_cells.
groupby
Column in adata.obs holding cluster labels. Only used if mean_cluster=True.
knn
Integer: how many neighbors per gene to consider.
Passed as n_neighbors=knn to sc.pp.neighbors.
use_knn
Boolean: passed to sc.pp.neighbors as knn=use_knn.
- If True, builds a hard kNN graph (only k nearest neighbors).
- If False, uses a Gaussian kernel to weight up to the k-th neighbor.
metric
Distance metric for kNN computation (e.g. "euclidean", "manhattan", "correlation", etc.).
If metric=="correlation" and the gene‑expression matrix is sparse, it will be converted to dense.
key
Prefix under which to store results in adata.uns. The function sets:
- adata.uns[f"{key}_gene_index"]
- adata.uns[f"{key}_connectivities"]
- adata.uns[f"{key}_distances"]
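The idea behind mean_cluster=True can be sketched without Scanpy: average each gene within each cluster, then look for nearest neighbors among the resulting per-gene profiles. This is only an illustration under stated assumptions; the real function delegates to sc.pp.neighbors and stores sparse matrices, whereas the hypothetical helper gene_knn below uses plain lists and a brute-force Euclidean search.

```python
import math

def gene_knn(expr, clusters, genes, knn=2):
    """Hypothetical helper. expr: rows are cells, columns follow `genes`."""
    groups = sorted(set(clusters))
    # Mean expression of each gene within each cluster: genes x n_clusters.
    profiles = []
    for g in range(len(genes)):
        row = []
        for grp in groups:
            vals = [expr[c][g] for c in range(len(expr)) if clusters[c] == grp]
            row.append(sum(vals) / len(vals))
        profiles.append(row)
    # Brute-force hard kNN on the cluster-mean profiles (Euclidean metric).
    neighbors = {}
    for i, gene in enumerate(genes):
        dists = sorted(
            (math.dist(profiles[i], profiles[j]), genes[j])
            for j in range(len(genes)) if j != i
        )
        neighbors[gene] = [name for _, name in dists[:knn]]
    return neighbors

expr = [[1, 1.0, 5], [1, 1.2, 5], [4, 4.0, 0], [4, 4.2, 0]]
clusters = ["c1", "c1", "c2", "c2"]
nearest = gene_knn(expr, clusters, ["G1", "G2", "G3"], knn=1)
```

Genes G1 and G2 rise and fall together across the two clusters, so each is the other's nearest neighbor.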
Source code in src/pySingleCellNet/tools/gene.py
categorize_classification
categorize_classification(adata_c, thresholds, graph=None, k=3, columns_to_ignore=['rand'], inplace=True, class_obs_name='SCN_class_argmax')
Classify cells based on SCN scores and thresholds, then categorize multi-class cells as either 'Intermediate' or 'Hybrid'.
Classification rules
- If exactly one cell type exceeds threshold: "Singular"
- If zero cell types exceed threshold: "None"
- If more than one cell type exceeds threshold:
  - If all pairs of high-scoring cell types are within k edges in the provided graph: "Intermediate"
  - Otherwise: "Hybrid"
- If predicted cell type is 'rand': Set classification to "Rand"
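The rules above can be sketched in plain Python for a single cell. This is a minimal illustration, not the library's implementation: the helper names hops and categorize are hypothetical, the graph is a plain adjacency dict instead of an iGraph object, "exceeds" is taken to mean strictly greater than, and the 'rand' check is assumed to take precedence over the other rules.

```python
from collections import deque
from itertools import combinations

def hops(adjacency, start, goal):
    """Breadth-first search: edge count between two vertices, or None."""
    if start == goal:
        return 0
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # unreachable

def categorize(scores, thresholds, adjacency, k=3, argmax=None):
    """Apply the classification rules to one cell's SCN scores."""
    if argmax == "rand":  # assumed to override everything else
        return "Rand"
    high = [ct for ct, s in scores.items() if ct != "rand" and s > thresholds[ct]]
    if not high:
        return "None"
    if len(high) == 1:
        return "Singular"
    # Several high-scoring types: every pair must lie within k edges.
    for a, b in combinations(high, 2):
        d = hops(adjacency, a, b)
        if d is None or d > k:
            return "Hybrid"
    return "Intermediate"

# A linear lineage A - B - C - D - E
adjacency = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "E"], "E": ["D"]}
thresholds = {ct: 0.5 for ct in "ABCDE"}
```

With this toy lineage, a cell scoring high for A and B is "Intermediate", while one scoring high for A and E (four edges apart, with k=3) is "Hybrid".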
Parameters:
- adata_c (AnnData) – Annotated data matrix containing:
  - .obsm["SCN_score"]: DataFrame of SCN scores for each cell type.
  - .obs[class_obs_name]: Predicted cell type (argmax classification).
- thresholds (DataFrame) – Thresholds for each cell type. Expected to match the columns in SCN_score.
- graph (Graph, default: None) – An iGraph describing relationships between cell types. Must have vertex names matching the cell-type columns in SCN_score. Although the default is None, a graph must be supplied (see Raises).
- k (int, default: 3) – Maximum graph distance at which cell types are still considered "Intermediate".
- columns_to_ignore (list, default: ['rand']) – List of SCN score columns to ignore.
- inplace (bool, default: True) – If True, modify adata_c in place. Otherwise, return a new AnnData object.
- class_obs_name (str, default: 'SCN_class_argmax') – The name of the .obs column with argmax classification.
Raises:
- ValueError – If graph is None.
- ValueError – If "SCN_score" is missing in adata_c.obsm.
- ValueError – If class_obs_name is not found in adata_c.obs.
- ValueError – If the provided graph does not have vertex "name" attributes.
Returns:
- AnnData or None – The modified AnnData if inplace is False, otherwise None.
Source code in src/pySingleCellNet/tools/categorize.py
classify_anndata
classify_anndata(adata, rf_tsp, nrand=0)
Classifies cells in the adata object based on the given gene expression and cross-pair information, using a random forest classifier in rf_tsp trained with the provided xpairs genes.
Parameters:
adata: AnnData
An annotated data matrix containing the gene expression information for cells.
rf_tsp: List[float]
A list of random forest classifier parameters used for classification.
nrand: int
Number of random permutations for the null distribution. Default is 0.
Returns:
Updates adata with classification results
Source code in src/pySingleCellNet/tools/classifier.py
cluster_alot
cluster_alot(adata, leiden_resolutions, prefix='autoc', pca_params=None, knn_params=None, random_state=None, overwrite=True, verbose=True)
Grid-search Leiden clusterings over (n_pcs, n_neighbors, resolution).
Runs a parameter sweep that combines different numbers of principal components, k-nearest-neighbor sizes, and Leiden resolutions. Optionally performs random PC subsampling (within the first N PCs) when constructing the KNN graph, repeating each configuration multiple times for robustness. Cluster labels are written to adata.obs under keys derived from prefix and the parameter settings.
Assumptions
- adata.X is already log-transformed.
- PCA has been computed and adata.obsm['X_pca'] is present; this is used as the base embedding for PC selection/subsampling.
Parameters:
- adata – AnnData object containing the log-transformed expression matrix. Must include obsm['X_pca'] (shape (n_cells, n_pcs_total)).
- leiden_resolutions (Sequence[float]) – Leiden resolution values to evaluate (passed to sc.tl.leiden). Each resolution is combined with every KNN/PC configuration in the sweep.
- prefix (str, default: 'autoc') – String prefix used to construct output keys for cluster labels in adata.obs (e.g., "{prefix}_pc{N}_k{K}_res{R}").
- pca_params (Optional[Dict[str, Any]], default: None) – Configuration for PC selection and optional subsampling. Supported keys:
  - "top_n_pcs" (List[int], default [40]): Candidate values for the maximum PC index N (i.e., use the first N PCs).
  - "percent_of_pcs" (Optional[float], default None): If set with 0 < value <= 1, randomly select round(value * N) PCs from the first N for KNN construction. If None or 1, use the first N PCs without subsampling.
  - "n_random_samples" (Optional[int], default None): Number of random PC subsets to draw per (N, K) when percent_of_pcs is set in (0, 1). If None or less than 1, no repeated subsampling is performed.
- knn_params (Optional[Dict[str, Any]], default: None) – KNN graph parameters. Supported keys:
  - "n_neighbors" (List[int], default [10]): Candidate values for K used in sc.pp.neighbors.
- random_state (Optional[int], default: None) – Random seed for PC subset sampling (when percent_of_pcs is used). Pass None for non-deterministic sampling.
- overwrite (bool, default: True) – If True, overwrite existing adata.obs keys produced by previous runs that match the constructed names. If False, skip runs whose target keys already exist.
- verbose (bool, default: True) – If True, print progress messages for each run.
Returns:
- DataFrame – runs (pd.DataFrame): One row per clustering run with metadata columns such as:
  - obs_key: Name of the column in adata.obs that stores cluster labels.
  - neighbors_key: Name of the neighbors graph key used/created.
  - resolution: Leiden resolution value used for the run.
  - top_n_pcs: Number of leading PCs considered.
  - pct_pcs: Fraction of PCs used when subsampling (percent_of_pcs), or 1.0 if all were used.
  - sample_idx: Index of the PC subsampling repeat (0..n-1) or 0 if no subsampling.
  - n_neighbors: Number of neighbors (K) used in KNN construction.
  - n_clusters: Number of clusters returned by Leiden for that run.
  - pcs_used_count: Actual number of PCs used to build the KNN graph (round(pct_pcs * top_n_pcs), or top_n_pcs if no subsampling).
Raises:
- KeyError – If 'X_pca' is missing from adata.obsm.
- ValueError – If any provided parameter is out of range (e.g., percent_of_pcs not in (0, 1]; empty lists; non-positive n_neighbors).
- RuntimeError – If neighbor graph construction or Leiden clustering fails.
Notes
- This function modifies adata in place by adding cluster label columns to adata.obs (and potentially adding or reusing neighbor graphs in adata.obsp / adata.uns with a constructed neighbors_key).
- To ensure reproducibility when using PC subsampling, set random_state and keep other sources of randomness (e.g., parallel BLAS) controlled in your environment.
Examples:
>>> runs = cluster_alot(
... adata,
... leiden_resolutions=[0.1, 0.25, 0.5],
... pca_params={"top_n_pcs": [20, 40],
... "percent_of_pcs": 0.5,
... "n_random_samples": 3},
... knn_params={"n_neighbors": [10, 20]},
... random_state=42,
... )
>>> runs[["obs_key", "n_clusters"]].head()
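To see how large such a sweep becomes, the grid can be enumerated without running any clustering. The sketch below is an assumption-laden illustration: enumerate_runs is a hypothetical helper, the "_s{idx}" key suffix is invented here to keep keys unique across subsampling repeats, and the real function may order or name its runs differently.

```python
import itertools
import random

def enumerate_runs(resolutions, top_n_pcs=(40,), n_neighbors=(10,),
                   percent_of_pcs=None, n_random_samples=None,
                   prefix="autoc", random_state=None):
    """List the (N, K, sample, resolution) combinations of a sweep."""
    rng = random.Random(random_state)
    subsample = percent_of_pcs is not None and 0 < percent_of_pcs < 1
    # Repeats only apply when PCs are actually subsampled.
    repeats = n_random_samples if (subsample and n_random_samples and n_random_samples >= 1) else 1
    runs = []
    for n_pcs, k in itertools.product(top_n_pcs, n_neighbors):
        for sample_idx in range(repeats):
            if subsample:
                size = max(1, round(percent_of_pcs * n_pcs))
                pcs = sorted(rng.sample(range(n_pcs), size))
            else:
                pcs = list(range(n_pcs))
            for res in resolutions:
                runs.append({
                    # Hypothetical key layout; the "_s{idx}" suffix is an assumption.
                    "obs_key": f"{prefix}_pc{n_pcs}_k{k}_res{res}_s{sample_idx}",
                    "top_n_pcs": n_pcs,
                    "n_neighbors": k,
                    "resolution": res,
                    "sample_idx": sample_idx,
                    "pcs_used_count": len(pcs),
                })
    return runs

planned = enumerate_runs([0.1, 0.25, 0.5], top_n_pcs=[20, 40], n_neighbors=[10, 20],
                         percent_of_pcs=0.5, n_random_samples=3, random_state=42)
```

For the settings in the example above this yields 2 PC counts x 2 neighbor sizes x 3 repeats x 3 resolutions = 36 runs, which is worth knowing before launching the sweep on a large dataset.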
Source code in src/pySingleCellNet/tools/cluster.py
clustering_quality_vs_nn_summary
clustering_quality_vs_nn_summary(adata, label_cols, n_genes=5, naive={'p_val': 0.01, 'fold_change': 0.5}, strict={'minpercentin': 0.2, 'maxpercentout': 0.1, 'p_val': 0.01}, n_pcs_for_nn=30, has_log1p=True, gene_mask_col=None, layer=None, p_adjust_method='fdr_bh', deduplicate_partitions=True, return_pairs=False)
Summarize clustering quality across multiple label columns.
Computes clustering-quality metrics for each .obs label column in label_cols and returns a single summary table (one row per labeling). Optionally returns per-cluster-pair differential-expression tables for each labeling. A single PCA/neighbor graph (using n_pcs_for_nn PCs) is reused across runs, and identical partitions (up to relabeling) can be deduplicated for speed.
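Detecting "identical partitions up to relabeling" can be done by renaming clusters in order of first appearance, so that two labelings that differ only in cluster names map to the same tuple. The sketch below shows one way to do this; both helper names are hypothetical and the library may use a different canonical form.

```python
def canonical_partition(labels):
    """Relabel clusters by order of first appearance."""
    mapping = {}
    return tuple(mapping.setdefault(lab, len(mapping)) for lab in labels)

def group_duplicate_labelings(labelings):
    """labelings: {column_name: labels}. Returns {representative: [columns]}."""
    seen, groups = {}, {}
    for name, labels in labelings.items():
        key = canonical_partition(labels)
        rep = seen.setdefault(key, name)  # first column with this partition
        groups.setdefault(rep, []).append(name)
    return groups

labelings = {
    "leiden_0.2": ["0", "0", "1", "1"],
    "leiden_0.5": ["b", "b", "a", "a"],   # same partition, different names
    "leiden_1.0": ["0", "1", "2", "2"],
}
groups = group_duplicate_labelings(labelings)
```

Only one labeling per group needs to be evaluated; the result can then be copied to the other columns in that group.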
The method evaluates per-cluster marker genes under two regimes:
- Naive: rank by test statistic and select the top n_genes that meet the naive thresholds (e.g., unadjusted p_val and minimum fold_change).
- Strict: apply stricter filters on expression prevalence inside vs. outside the cluster (minpercentin / maxpercentout) and an adjusted p-value cutoff (p_val after p_adjust_method), then count genes.
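As a rough illustration of the strict regime, the sketch below applies a Benjamini-Hochberg adjustment (what "fdr_bh" conventionally denotes) and then the prevalence filters. The helper names are hypothetical, the inequalities are assumed to be inclusive, and the per-gene inputs are assumed to come from a differential-expression test run elsewhere.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running = min(running, pvals[i] * m / rank)
        adjusted[i] = running
    return adjusted

def strict_markers(genes, pvals, pct_in, pct_out,
                   minpercentin=0.2, maxpercentout=0.1, p_val=0.01):
    """Genes passing the adjusted p-value and prevalence filters."""
    padj = bh_adjust(pvals)
    return [
        g for g, q, pin, pout in zip(genes, padj, pct_in, pct_out)
        if q <= p_val and pin >= minpercentin and pout <= maxpercentout
    ]

genes = ["G1", "G2", "G3", "G4"]
pvals = [0.001, 0.004, 0.03, 0.5]
pct_in = [0.9, 0.1, 0.8, 0.7]     # fraction of in-cluster cells expressing
pct_out = [0.05, 0.0, 0.02, 0.01]  # fraction of out-of-cluster cells expressing
markers = strict_markers(genes, pvals, pct_in, pct_out)
```

Here G2 has a small adjusted p-value but is expressed in too few in-cluster cells, so only G1 survives.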
Parameters:
- adata – AnnData object containing count/expression data. Uses adata.X or the specified layer; cluster labels must be in adata.obs.
- label_cols (Sequence[str]) – Names of adata.obs columns whose clusterings will be evaluated (e.g., ["leiden_0.2", "leiden_0.5"]).
- n_genes (int, default: 5) – Number of top genes to consider per cluster in the naive regime (after applying naive thresholds).
- naive (dict, default: {'p_val': 0.01, 'fold_change': 0.5}) – Thresholds for the naive regime. Expected keys:
  - "p_val" (float): Maximum unadjusted p-value.
  - "fold_change" (float): Minimum log2 fold-change.
- strict (dict, default: {'minpercentin': 0.2, 'maxpercentout': 0.1, 'p_val': 0.01}) – Thresholds for the strict regime. Expected keys:
  - "minpercentin" (float): Minimum fraction of cells within the cluster expressing the gene.
  - "maxpercentout" (float): Maximum fraction of cells outside the cluster expressing the gene.
  - "p_val" (float): Maximum adjusted p-value (per p_adjust_method).
- n_pcs_for_nn (int, default: 30) – Number of principal components to use when building the neighbor graph used for nearest-neighbor detection.
- has_log1p (bool, default: True) – Whether the data are already log1p-transformed. If False, the implementation may log1p-transform counts before testing.
- gene_mask_col (Optional[str], default: None) – Optional name of a boolean column in adata.var used to mask genes prior to testing (e.g., to restrict to HVGs or exclude mitochondrial genes). If None, no mask is applied.
- layer (Optional[str], default: None) – Name of an adata.layers matrix to use instead of adata.X. For example, "log1p" or "counts".
- p_adjust_method (str, default: 'fdr_bh') – Method for multiple testing correction (e.g., "fdr_bh"). Passed to the underlying p-value adjustment routine.
- deduplicate_partitions (bool, default: True) – If True, detect and skip evaluations for labelings that produce the same partition (up to label renaming), reusing the computed result.
- return_pairs (bool, default: False) – If True, also return a dict of per-cluster-pair result tables keyed by the label column. Each value is a pd.DataFrame with pairwise statistics for that labeling.
Returns:
- Union[pd.DataFrame, Tuple[pd.DataFrame, Dict[str, pd.DataFrame]]]:
  - summary (pd.DataFrame): One row per labeling with columns such as:
    - label_col: The label column name.
    - n_clusters: Number of clusters in the labeling.
    - n_pairs: Number of cluster pairs evaluated.
    - tested_genes: Number of genes tested after masking.
    - unique_naive_genes / unique_strict_genes: Count of genes uniquely satisfying naive/strict criteria.
    - frac_pairs_with_at_least_n_strict: Fraction of cluster pairs with ≥ n strict marker genes (exact column name may reflect n).
    - Additional min/max/median summaries for naive/strict exclusivity per pair.
  - pairs_by_label (Dict[str, pd.DataFrame], optional): Returned only when return_pairs=True. For each labeling, a DataFrame of per-cluster-pair statistics and gene sets.
Raises:
- KeyError – If any entry in label_cols is not found in adata.obs, or if gene_mask_col is provided but not found in adata.var.
- ValueError – If required keys are missing from naive or strict, if n_genes < 1, or if p_adjust_method is unsupported.
- RuntimeError – If neighbor graph construction or differential testing fails.
Notes
- The function does not modify adata in place (beyond any cached neighbor graph/PCs if your implementation chooses to store them).
- For reproducibility, set any random seeds used by the nearest-neighbor or clustering components upstream.
Examples:
>>> summary = clustering_quality_vs_nn_summary(
... adata,
... label_cols=["leiden_0.2", "leiden_0.5"],
... n_genes=10,
... strict={"minpercentin": 0.25, "maxpercentout": 0.05, "p_val": 0.01},
... )
>>> summary[["label_col", "n_clusters", "unique_strict_genes"]].head()
>>> summary, pairs = clustering_quality_vs_nn_summary(
... adata,
... label_cols=["leiden_0.5"],
... return_pairs=True,
... )
>>> pairs["leiden_0.5"].head()
Source code in src/pySingleCellNet/tools/cluster_eval.py
collect_gsea_results_from_dict
collect_gsea_results_from_dict(gsea_dict2, fdr_thr=0.25, top_n=3)
Collect and filter GSEA results from a dictionary of GSEA objects.
For each cell type
- Sets NES=0 for any gene set with FDR > fdr_thr.
- Selects up to top_n sets with the largest positive NES and top_n with the most negative NES.
The final output is limited to the union of all such selected sets across all cell types, with zeroes preserved for cell types in which the pathway is not among the top_n or fails the FDR threshold.
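The filtering steps above can be sketched with plain dictionaries instead of GSEA result objects. This is an illustration under assumptions: collect_nes is a hypothetical helper, each result is reduced to (term, NES, FDR) tuples, and ties are broken by sort order.

```python
def collect_nes(results, fdr_thr=0.25, top_n=3):
    """results: {cell_type: [(term, nes, fdr), ...]} -> {term: {cell_type: nes}}"""
    kept = {}
    for cell_type, rows in results.items():
        # Step 1: zero out NES for gene sets failing the FDR threshold.
        filtered = {term: (nes if fdr <= fdr_thr else 0.0) for term, nes, fdr in rows}
        ranked = sorted(filtered.items(), key=lambda kv: kv[1])
        # Step 2: keep up to top_n most negative and top_n most positive sets.
        neg = [t for t, nes in ranked[:top_n] if nes < 0]
        pos = [t for t, nes in ranked[::-1][:top_n] if nes > 0]
        kept[cell_type] = {t: filtered[t] for t in neg + pos}
    # Step 3: union of selected terms; zero where a term was not selected.
    all_terms = sorted({t for sel in kept.values() for t in sel})
    return {t: {ct: kept[ct].get(t, 0.0) for ct in results} for t in all_terms}

results = {
    "T": [("P1", 2.0, 0.01), ("P2", -1.5, 0.05), ("P3", 3.0, 0.9)],
    "B": [("P1", -1.0, 0.1), ("P2", 0.5, 0.2), ("P3", 1.0, 0.5)],
}
table = collect_nes(results, fdr_thr=0.25, top_n=1)
```

P3 has the largest raw NES in both cell types but fails the FDR threshold in each, so it never enters the union and does not appear in the table.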
Parameters:
- gsea_dict2 (dict) – Dictionary mapping cell types to GSEA result objects. Each object has a .res2d DataFrame with columns ["Term", "NES", "FDR q-val"].
- fdr_thr (float, default: 0.25) – FDR threshold above which NES values are set to 0.
- top_n (int, default: 3) – Maximum number of positive and negative results (by NES) to keep per cell type.
Returns:
- pd.DataFrame – A DataFrame whose rows are the union of selected gene sets across all cell types, and whose columns are cell types. Entries are filtered NES values (0 where FDR fails, or if not in the top_n).
Source code in src/pySingleCellNet/tools/comparison.py
comp_ct_thresh
comp_ct_thresh(adata_c, qTile=0.05, obs_name='SCN_class_argmax')
Compute quantile thresholds for each cell type based on SCN scores.
For each cell type (excluding "rand"), this function calculates the qTile quantile of the SCN scores for cells predicted to belong to that type.
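A minimal sketch of this computation, using plain lists rather than AnnData, is shown below. The helper names are hypothetical, and linear interpolation between order statistics is an assumption (it is the default of pandas' quantile, but the library's exact method is not stated here).

```python
def quantile(values, q):
    """Linearly interpolated quantile of a non-empty list (q in [0, 1])."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])

def thresholds_by_type(scores, predicted, q=0.05, ignore=("rand",)):
    """scores: one {cell_type: score} dict per cell; predicted: argmax labels."""
    out = {}
    for ct in sorted(set(predicted) - set(ignore)):
        # Only cells predicted as `ct` contribute to that type's threshold.
        own = [row[ct] for row, p in zip(scores, predicted) if p == ct]
        out[ct] = quantile(own, q)
    return out

scores = [
    {"A": 0.9, "B": 0.1},
    {"A": 0.7, "B": 0.2},
    {"A": 0.2, "B": 0.8},
    {"A": 0.5, "B": 0.5},
]
predicted = ["A", "A", "B", "rand"]
thr = thresholds_by_type(scores, predicted, q=0.5)
```

A low qTile such as 0.05 yields a permissive threshold: 95% of the cells already assigned to a type score above it.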
Parameters:
- adata_c (AnnData) – Annotated data matrix with:
  - .obsm["SCN_score"]: DataFrame of SCN scores.
  - .obs: Observation metadata containing predictions.
- qTile (float, default: 0.05) – The quantile to compute (e.g., 0.05 for the 5th percentile).
- obs_name (str, default: 'SCN_class_argmax') – The column in .obs containing cell type predictions.
Returns:
- pd.DataFrame – A DataFrame where each row corresponds to a cell type (excluding 'rand') and contains the computed quantile threshold. Returns None if 'SCN_score' is not present in adata_c.obsm.
Source code in src/pySingleCellNet/tools/categorize.py
convert_diffExp_to_dict
convert_diffExp_to_dict(adata, uns_name='rank_genes_groups')
Convert differential expression results from AnnData into a dictionary of DataFrames.
This function extracts differential expression results stored in adata.uns[uns_name] using Scanpy's get.rank_genes_groups_df, cleans the data, and organizes it into a dictionary where each key corresponds to a group and each value is a DataFrame of differential expression results for that group.
Parameters:
- adata (AnnData) – Annotated data matrix containing differential expression results in adata.uns.
- uns_name (str, default: 'rank_genes_groups') – Key in adata.uns where rank_genes_groups results are stored.
Returns:
- dict – Dictionary mapping each group to a DataFrame of its differential expression results, with rows corresponding to genes and relevant statistics for each gene.
Source code in src/pySingleCellNet/tools/comparison.py
create_classifier_report
create_classifier_report(adata, ground_truth, prediction)
Generate a classification report as a pandas DataFrame from an AnnData object.
This function computes a classification report using ground truth and prediction columns in adata.obs. It supports both string and dictionary outputs from sklearn.metrics.classification_report and transforms them into a standardized DataFrame format.
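For readers unfamiliar with the metrics in the report, the sketch below computes them by hand for two label lists. It is a hypothetical stand-in (classifier_report is not a library function) that returns a list of row dicts rather than a DataFrame, and it does not reproduce the averaged rows that scikit-learn also reports.

```python
def classifier_report(truth, pred):
    """Per-class precision, recall, F1 and support from two label lists."""
    rows = []
    for label in sorted(set(truth) | set(pred)):
        tp = sum(t == label and p == label for t, p in zip(truth, pred))
        fp = sum(t != label and p == label for t, p in zip(truth, pred))
        fn = sum(t == label and p != label for t, p in zip(truth, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        rows.append({
            "Label": label,
            "Precision": precision,
            "Recall": recall,
            "F1-Score": f1,
            "Support": tp + fn,  # number of true instances of this class
        })
    return rows

truth = ["A", "A", "B", "B"]
pred = ["A", "B", "B", "B"]
report = classifier_report(truth, pred)
```

In this toy case one "A" cell is mislabeled as "B", which lowers recall for A and precision for B while leaving the other two values at 1.0.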
Parameters:
- adata (AnnData) – An annotated data matrix containing observations with categorical truth and prediction labels.
- ground_truth (str) – The column name in adata.obs containing the true class labels.
- prediction (str) – The column name in adata.obs containing the predicted class labels.
Returns:
- pd.DataFrame – A DataFrame with columns ["Label", "Precision", "Recall", "F1-Score", "Support"] summarizing classification metrics for each class.
Raises:
- ValueError – If the classification report is neither a string nor a dictionary.
Source code in src/pySingleCellNet/tools/classifier.py
deg
deg(adata, sample_obsvals=[], limitto_obsvals=[], cellgrp_obsname='comb_cellgrp', groupby_obsname='comb_sampname', ncells_per_sample=30, test_name='t-test', mask_var='highly_variable')
Perform differential expression analysis on an AnnData object across specified cell groups and samples.
This function iterates over specified or all cell groups within the adata object and performs differential expression analysis using the specified statistical test (e.g., t-test). It filters groups based on the minimum number of cells per sample and returns the results in a structured dictionary.
Parameters:
- adata (AnnData) – The annotated data matrix containing observations and variables.
- sample_obsvals (list, default: []) – List of sample observation values to include. Impacts the sign of the test statistic.
- limitto_obsvals (list, default: []) – List of cell group observation values to limit the analysis to. If empty, all cell groups in adata are tested.
- cellgrp_obsname (str, default: 'comb_cellgrp') – The .obs column name in adata that holds the cell sub-groups.
- groupby_obsname (str, default: 'comb_sampname') – The .obs column name in adata used to group observations for differential expression.
- ncells_per_sample (int, default: 30) – The minimum number of cells per sample required to perform the test. Groups with fewer cells are skipped.
- test_name (str, default: 't-test') – The name of the statistical test to use for differential expression.
- mask_var (str, default: 'highly_variable') – The name of the .var column indicating highly variable genes.
Returns:
- dict – A dictionary containing:
  - 'sample_names': List of sample names used in the analysis.
  - 'geneTab_dict': A dictionary where each key is a cell group name and each value is a DataFrame of differential expression results for that group.
Source code in src/pySingleCellNet/tools/comparison.py
discover_cell_cliques
discover_cell_cliques(adata, cluster_cols, k=None, mode='lenient', out_col='core_cluster', min_size=1, allow_missing=False, max_combinations=None, return_details=False)
Define 'core' clusters across multiple clustering runs.
Parameters
adata : AnnData
    The data object with clustering labels in .obs columns.
cluster_cols : list[str] or str
    One or more .obs columns, each containing a clustering.
k : int or None, default None
    Cells must be in the same cluster in at least k runs to be grouped. If None, uses all runs (k = n_runs), i.e., exact tuple agreement.
mode : {'lenient', 'strict'}, default 'lenient'
    'lenient' uses DSU (disjoint-set union) over all k-run combinations (fast, transitive).
    'strict' refines lenient components so every pair inside a final cluster agrees in >= k runs (complete-linkage on masked Hamming).
out_col : str, default 'core_cluster'
    Name of the output categorical column added to adata.obs.
min_size : int, default 1
    Minimum size to keep a core cluster; smaller groups get 'core_-1'.
allow_missing : bool, default False
    If False, raises if any clustering column has missing labels. If True, combinations containing missing labels for a cell are either skipped (lenient) or ignored in distance computations (strict).
max_combinations : int or None
    If set and the number of k-run combinations exceeds this, raises with guidance.
return_details : bool, default False
    If True, also returns a dict with bookkeeping info.
Returns
core_labels : pandas.Series (categorical)
details : dict (optional)
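The 'lenient' mode can be pictured as a union-find pass over every combination of k runs: two cells are merged whenever they share labels in all runs of some combination. The sketch below is a simplified illustration with plain lists; core_clusters_lenient is a hypothetical helper, the "core_{id}" naming is assumed from the 'core_-1' convention above, and min_size, missing labels, and the strict refinement are left out.

```python
from itertools import combinations

def core_clusters_lenient(labelings, k=None):
    """labelings: list of equal-length label lists, one per clustering run."""
    n_runs, n_cells = len(labelings), len(labelings[0])
    k = n_runs if k is None else k
    parent = list(range(n_cells))

    def find(i):
        # Path-halving find for the disjoint-set structure.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for runs in combinations(range(n_runs), k):
        first_seen = {}
        for cell in range(n_cells):
            key = tuple(labelings[r][cell] for r in runs)
            if key in first_seen:
                # Same labels in all k runs of this combination: merge.
                parent[find(cell)] = find(first_seen[key])
            else:
                first_seen[key] = cell

    roots = {}
    return [f"core_{roots.setdefault(find(c), len(roots))}" for c in range(n_cells)]

labelings = [
    ["a", "a", "b", "b"],
    ["x", "x", "x", "y"],
    ["p", "q", "q", "q"],
]
exact = core_clusters_lenient(labelings)        # k = 3, exact agreement
lenient = core_clusters_lenient(labelings, k=2)
```

With k=2 all four cells end up in one core cluster even though the first and last cell never share a label in any run. That chaining is what "transitive" means here, and it is what the 'strict' mode is described as correcting.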
Source code in src/pySingleCellNet/tools/cluster_cliques.py
find_gene_modules
find_gene_modules(adata, mean_cluster=True, groupby='leiden', mask_var=None, knn=5, leiden_resolution=0.5, prefix='gmod_', metric='euclidean', *, uns_key='knn_modules', layer=None, min_module_size=2, order_genes_by_within_module_connectivity=True, random_state=0)
Find gene modules by building a kNN graph over genes (or cluster-mean profiles) and clustering with Leiden.
Writes a dict {f"{prefix}{cluster_id}": [gene names]} to adata.uns[uns_key] and returns the same dict.
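The shape of the returned dict can be illustrated by grouping genes on their cluster labels and dropping small groups. This sketch covers only that final bookkeeping step; modules_from_labels is a hypothetical helper, and the labels are assumed to come from a Leiden run on the gene kNN graph.

```python
def modules_from_labels(genes, labels, prefix="gmod_", min_module_size=2):
    """Group genes by cluster label and drop modules below the size cutoff."""
    modules = {}
    for gene, lab in zip(genes, labels):
        modules.setdefault(f"{prefix}{lab}", []).append(gene)
    return {name: members for name, members in modules.items()
            if len(members) >= min_module_size}

genes = ["Sox2", "Pou5f1", "Nanog", "Gata4", "Gata6", "T"]
labels = ["0", "0", "0", "1", "1", "2"]
modules = modules_from_labels(genes, labels)
```

The singleton cluster "2" is removed because it falls below min_module_size=2.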
Parameters
mean_cluster
If True, aggregate cells by groupby before building the gene kNN graph.
groupby
Column in adata.obs used for aggregation when mean_cluster=True.
mask_var
Boolean column in adata.var used to select a subset of genes. If None, use all genes.
knn
Number of neighbors for the kNN graph on genes.
leiden_resolution
Resolution for Leiden clustering.
prefix
Prefix for module names.
metric
Distance metric for kNN (e.g. 'euclidean', 'manhattan', 'cosine', 'correlation').
NOTE: If metric=='correlation' and the data are sparse, the matrix is densified for stability.
uns_key
Top-level .uns key to store the resulting dict of modules (default 'knn_modules').
layer
If provided, use adata.layers[layer] as expression, otherwise adata.X. (Aggregation honors this choice.)
min_module_size
Remove modules smaller than this size after clustering.
order_genes_by_within_module_connectivity
If True, sort each module's genes by their within-module connectivity (descending).
random_state
Random seed passed to Leiden for reproducibility.
Source code in src/pySingleCellNet/tools/gene.py
graph_from_nodes_and_edges
graph_from_nodes_and_edges(edge_dataframe, node_dataframe, attribution_column_names, directed=True)
Create an iGraph graph from provided node and edge dataframes.
This function constructs an iGraph graph using nodes defined in node_dataframe and edges defined in edge_dataframe. Each vertex is assigned attributes based on specified columns, and edges are created according to 'from' and 'to' columns in the edge dataframe.
Parameters:
- edge_dataframe (DataFrame) – A DataFrame containing edge information with at least 'from' and 'to' columns indicating source and target node identifiers.
- node_dataframe (DataFrame) – A DataFrame containing node information. Must include an 'id' column for vertex identifiers and any other columns specified in attribution_column_names.
- attribution_column_names (list of str) – List of column names from node_dataframe whose values will be assigned as attributes to the corresponding vertices in the graph.
- directed (bool, default: True) – Whether the graph should be directed.
Returns:
- ig.Graph – An iGraph graph constructed from the given nodes and edges, with vertex attributes and labels set according to the provided data.
Source code in src/pySingleCellNet/tools/categorize.py
gsea_on_deg
gsea_on_deg(deg_res, genesets_name, genesets, permutation_num=100, threads=4, seed=3, min_size=10, max_size=500)
Perform Gene Set Enrichment Analysis (GSEA) on differential expression results.
Applies GSEA using gseapy.prerank for each group in the differential expression results dictionary against provided gene sets.
Parameters:
- deg_res (dict) – Dictionary mapping cell group names to DataFrames of differential expression results. Each DataFrame must contain columns 'names' (gene names) and 'scores' (ranking scores).
- genesets_name (str) – Name of the gene set collection (not actively used).
- genesets (dict) – Dictionary of gene sets where keys are gene set names and values are lists of genes.
- permutation_num (int, default: 100) – Number of permutations for GSEA.
- threads (int, default: 4) – Number of parallel threads to use.
- seed (int, default: 3) – Random seed for reproducibility.
- min_size (int, default: 10) – Minimum gene set size to consider.
- max_size (int, default: 500) – Maximum gene set size to consider.
Returns:
- dict – Dictionary where keys are cell group names and values are GSEA result objects returned by gseapy.prerank.
Example
>>> import pandas as pd
>>> deg_results = {
...     'Cluster1': pd.DataFrame({'names': ['GeneA', 'GeneB'], 'scores': [2.5, -1.3]}),
...     'Cluster2': pd.DataFrame({'names': ['GeneC', 'GeneD'], 'scores': [1.2, -2.1]}),
... }
>>> gene_sets = {'Pathway1': ['GeneA', 'GeneC'], 'Pathway2': ['GeneB', 'GeneD']}
>>> results = gsea_on_deg(deg_results, 'ExampleGeneSets', gene_sets)
Source code in src/pySingleCellNet/tools/comparison.py
paga_connectivities_to_igraph
paga_connectivities_to_igraph(adInput, n_neighbors=10, use_rep='X_pca', n_comps=30, threshold=0.05, paga_key='paga', connectivities_key='connectivities', group_key='auto_cluster')
Convert a PAGA adjacency matrix to an undirected iGraph object and add an 'ncells' attribute to each vertex, holding the number of cells in the corresponding cluster.
This function extracts the PAGA connectivity matrix from adInput.uns, thresholds the edges, constructs an undirected iGraph graph, and assigns vertex names and the number of cells in each cluster.
Parameters:
-
adInput
(AnnData
) –The AnnData object containing:
  - adInput.uns[paga_key][connectivities_key]: the PAGA adjacency matrix (CSR format).
  - adInput.obs[group_key].cat.categories: the node labels.
-
n_neighbors
(int
, default:10
) –Number of neighbors for computing nearest neighbors. Defaults to 10.
-
use_rep
(str
, default:'X_pca'
) –The representation to use. Defaults to 'X_pca'.
-
n_comps
(int
, default:30
) –Number of principal components. Defaults to 30.
-
threshold
(float
, default:0.05
) –Minimum edge weight to include. Defaults to 0.05.
-
paga_key
(str
, default:'paga'
) –Key in adInput.uns for PAGA results. Defaults to "paga".
-
connectivities_key
(str
, default:'connectivities'
) –Key for the connectivity matrix in adInput.uns[paga_key]. Defaults to "connectivities".
-
group_key
(str
, default:'auto_cluster'
) –The .obs column name with cluster labels. Defaults to "auto_cluster".
Returns:
-
ig.Graph
–An undirected graph with edges meeting the threshold, edge weights assigned, vertex names set to cluster categories when possible, and an 'ncells' attribute on each vertex.
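The thresholding and cell-counting steps described above can be sketched without any graph library. This is a minimal illustration under stated assumptions, not the function's actual code: threshold_edges is a hypothetical helper, the adjacency matrix is a small dense list of lists rather than a CSR matrix, and edges are kept when their weight is at least the threshold (the documented wording "minimum edge weight to include" suggests this, but the exact comparison is an assumption).

```python
from collections import Counter


def threshold_edges(adjacency, labels, threshold=0.05):
    """Return (label_i, label_j, weight) for each undirected edge that is kept."""
    edges = []
    for i in range(len(labels)):
        # Visit each unordered pair once so the result is undirected.
        for j in range(i + 1, len(labels)):
            weight = max(adjacency[i][j], adjacency[j][i])
            if weight >= threshold:
                edges.append((labels[i], labels[j], weight))
    return edges


labels = ["0", "1", "2"]            # cluster categories
adjacency = [                       # toy PAGA-like connectivities
    [0.0, 0.8, 0.01],
    [0.8, 0.0, 0.3],
    [0.01, 0.3, 0.0],
]
cell_clusters = ["0", "0", "1", "2", "2", "2"]  # one cluster label per cell

edges = threshold_edges(adjacency, labels)
ncells = Counter(cell_clusters)     # cells per cluster, as stored in 'ncells'
```

With python-igraph installed, an edge list of this shape could plausibly be turned into a graph with ig.Graph.TupleList(edges, directed=False, weights=True), after which the counts in ncells can be attached as a vertex attribute.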
Source code in src/pySingleCellNet/tools/categorize.py
score_gene_sets
score_gene_sets(adata, gene_sets, *, layer=None, log_transform=False, clip_percentiles=(1.0, 99.0), agg='mean', top_p=0.5, top_k=None, rank_method=None, rank_universe=None, auc_max_rank=0.05, batch_size=2048, use_average_ranks=False, min_genes_per_set=1, case_insensitive=False, obs_prefix=None, return_dataframe=True)
Compute per-cell gene-set scores with both value-based and rank-based (AUCell/UCell) modes.
Value-based pipeline (when rank_method is None):
1) Optional log1p.
2) Per-gene percentile clipping (clip_percentiles).
3) Per-gene min–max scaling to [0, 1].
4) Aggregate across genes in each set per cell with 'mean' | 'median' | 'sum' | 'nonzero_mean' | 'top_p_mean' | 'top_k_mean' | callable.
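The four value-based steps can be sketched with NumPy for a single gene set and the 'mean' aggregation. This is an illustrative reading of the description, not the package's code: value_based_score is a hypothetical name, the input is a dense cells × genes array restricted to the genes of one set, and genes with no spread after clipping are scaled to 0 (an assumption).

```python
import numpy as np


def value_based_score(X, clip_percentiles=(1.0, 99.0), log_transform=False):
    """Score one gene set per cell: log1p (optional), clip, min-max scale, mean."""
    X = np.asarray(X, dtype=float)
    if log_transform:
        X = np.log1p(X)                              # step 1
    lo, hi = np.percentile(X, clip_percentiles, axis=0)
    X = np.clip(X, lo, hi)                           # step 2: per-gene clipping
    span = hi - lo
    # Step 3: per-gene min-max scaling; constant genes map to 0 (assumption).
    scaled = np.divide(X - lo, span, out=np.zeros_like(X), where=span > 0)
    return scaled.mean(axis=1)                       # step 4: 'mean' aggregation


# Three cells, two genes; the second gene is constant across cells.
X = np.array([[0.0, 10.0],
              [5.0, 10.0],
              [10.0, 10.0]])
scores = value_based_score(X, clip_percentiles=(0.0, 100.0))
```

The example uses (0, 100) percentiles so that clipping is a no-op and the scaled values are easy to follow by hand.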
Rank-based pipeline (when rank_method in {'auc','ucell'}):
• For each cell, rank genes within a chosen universe (rank_universe).
• 'auc' : AUCell-style AUC in the top L ranks (L = auc_max_rank).
• 'ucell': normalized Mann–Whitney U statistic in [0,1].
• Ranks are computed in batches (batch_size) for memory efficiency.
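To make the rank-based idea concrete, the sketch below computes a textbook normalized Mann–Whitney U statistic for one cell. It is not claimed to match the 'ucell' mode exactly (UCell-style scoring typically also caps ranks at a maximum rank, which is omitted here); normalized_u_score is a hypothetical helper, and ties are broken by position rather than averaged.

```python
import numpy as np


def normalized_u_score(expr, in_set):
    """Normalized Mann-Whitney U of set genes versus the rest, for one cell.

    Returns 1.0 when every set gene outranks every other gene, 0.0 in the
    opposite case, and NaN when either group is empty.
    """
    expr = np.asarray(expr, dtype=float)
    in_set = np.asarray(in_set, dtype=bool)
    n_set = int(in_set.sum())
    n_rest = expr.size - n_set
    if n_set == 0 or n_rest == 0:
        return float("nan")
    # Ascending ranks starting at 1; higher expression gets a higher rank.
    ranks = np.empty(expr.size)
    ranks[np.argsort(expr, kind="stable")] = np.arange(1, expr.size + 1)
    u = ranks[in_set].sum() - n_set * (n_set + 1) / 2.0
    return float(u / (n_set * n_rest))


expr = [5.0, 1.0, 4.0, 0.0, 2.0]                                 # one cell, five genes
top = normalized_u_score(expr, [True, False, True, False, False])    # two highest genes
bottom = normalized_u_score(expr, [False, True, False, True, False]) # two lowest genes
mixed = normalized_u_score(expr, [True, False, False, True, False])  # highest and lowest
```

Because only ranks enter the statistic, any monotone transform of expr (such as log1p) leaves the score unchanged.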
Parameters:
-
adata
–AnnData object.
-
gene_sets
(GeneSetInput
) –Dict[name -> genes], list of gene lists (auto-named), or name of an adata.uns key.
-
layer
(Optional[str]
, default:None
) –Use adata.layers[layer] instead of .X.
-
log_transform
(bool
, default:False
) –Apply np.log1p before scoring (safe monotone transform).
-
clip_percentiles
(Tuple[float, float]
, default:(1.0, 99.0)
) –(low, high) clipping percentiles for value-based mode.
-
agg
(Union[str, Callable[[ndarray], ndarray]]
, default:'mean'
) –Aggregation for value-based mode or a callable: (cells×genes) -> (cells,).
-
top_p
(Optional[float]
, default:0.5
) –Fraction for 'top_p_mean' (0<p<=1).
-
top_k
(Optional[int]
, default:None
) –Count for 'top_k_mean' (>=1).
-
rank_method
(Optional[str]
, default:None
) –None | 'auc' | 'ucell' to switch to rank-based scoring.
-
rank_universe
(Optional[Union[str, Sequence[str]]]
, default:None
) –None=all genes; or a boolean var column name (e.g. 'highly_variable'); or an explicit list of gene names defining the ranking universe.
-
auc_max_rank
(Union[int, float]
, default:0.05
) –AUCell top window (int L) or fraction (0,1].
-
batch_size
(int
, default:2048
) –Row batch size for rank computation.
-
use_average_ranks
(bool
, default:False
) –If True, uses average-tie ranks (scipy.stats.rankdata); slower.
-
min_genes_per_set
(int
, default:1
) –Require at least this many present genes to score a set (else NaN).
-
case_insensitive
(bool
, default:False
) –Case-insensitive gene matching against var_names.
-
obs_prefix
(Optional[str]
, default:None
) –If provided, also writes scores to adata.obs[f"{obs_prefix}{name}"].
-
return_dataframe
(bool
, default:True
) –If True, return a DataFrame; else return ndarray.
Returns:
-
DataFrame
–DataFrame (cells × sets) of scores (and optionally writes to adata.obs).
Notes
• Rank-based scores ignore clipping/min–max (ranks are invariant to monotone transforms).
• AUCell output here is normalized to [0,1] within the top-L window.
• UCell output is the normalized U statistic in [0,1].
Source code in src/pySingleCellNet/tools/gene.py
whoare_genes_neighbors
whoare_genes_neighbors(adata, gene, n_neighbors=5, key='gene', use='connectivities')
Retrieve the top n_neighbors nearest genes to gene, using a precomputed gene–gene kNN graph stored in adata.uns (as produced by build_gene_knn).
This version handles both sparse CSR matrices and dense NumPy arrays in adata.uns.
Parameters
adata
AnnData that has the following keys in adata.uns:
- adata.uns[f"{key}_gene_index"] (np.ndarray of gene names, in order)
- adata.uns[f"{key}_connectivities"] (CSR sparse matrix or dense ndarray)
- adata.uns[f"{key}_distances"] (CSR sparse matrix or dense ndarray)
gene
Gene name (must appear in adata.uns[f"{key}_gene_index"]).
n_neighbors
Number of neighbors to return.
key
Prefix under which the kNN graph was stored. For example, if build_gene_knn(...) was called with key="gene", the function will look for:
- adata.uns["gene_gene_index"]
- adata.uns["gene_connectivities"]
- adata.uns["gene_distances"]
use
One of {"connectivities", "distances"}.
- If "connectivities", neighbors are ranked by descending connectivity weight.
- If "distances", neighbors are ranked by ascending distance (only among nonzero entries).
Returns
neighbors : List[str]
A list of gene names (length ≤ n_neighbors) that are closest to gene.
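The two ranking modes of the use parameter can be sketched on a single dense row of the stored matrix. This is a minimal illustration, not the function's code: top_neighbors is a hypothetical helper, the row is assumed to be already extracted for the query gene, and zero entries are treated as "not a neighbor" in both modes (the description states this only for distances).

```python
import numpy as np


def top_neighbors(row, gene_index, self_pos, n_neighbors=5, use="connectivities"):
    """Rank the neighbors found in one row of a gene-gene kNN matrix."""
    row = np.asarray(row, dtype=float)
    candidates = np.flatnonzero(row)                  # stored (nonzero) entries only
    candidates = candidates[candidates != self_pos]   # never return the query gene
    if use == "connectivities":
        order = np.argsort(-row[candidates], kind="stable")  # descending weight
    elif use == "distances":
        order = np.argsort(row[candidates], kind="stable")   # ascending distance
    else:
        raise ValueError("use must be 'connectivities' or 'distances'")
    return [str(gene_index[i]) for i in candidates[order][:n_neighbors]]


genes = ["A", "B", "C", "D"]
conn_row = [0.0, 0.9, 0.0, 0.4]   # connectivities of gene "A" to all genes
dist_row = [0.0, 2.0, 0.0, 0.5]   # distances of gene "A" to all genes

by_conn = top_neighbors(conn_row, genes, self_pos=0)
by_dist = top_neighbors(dist_row, genes, self_pos=0, use="distances")
closest = top_neighbors(conn_row, genes, self_pos=0, n_neighbors=1)
```

The example also shows why the two modes can disagree when the toy connectivities and distances are not derived from each other, and why the result may be shorter than n_neighbors when a gene has few stored neighbors.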
Source code in src/pySingleCellNet/tools/gene.py