Utilities
Miscellaneous functions
build_knn_graph
build_knn_graph(correlation_matrix, labels, k=5)
Build a k-nearest neighbors (kNN) graph from a correlation matrix.
Parameters:
- correlation_matrix (ndarray) – Square correlation matrix.
- labels (list) – Node labels corresponding to the rows/columns of the correlation matrix.
- k (int, default: 5) – Number of nearest neighbors to connect each node to.
Returns:
- igraph.Graph – kNN graph.
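The neighbor selection can be illustrated without igraph. The following is a minimal sketch, not the library's implementation: `knn_edges` is a hypothetical helper, and it assumes that "nearest" means "most highly correlated" and that a node is never its own neighbor.

```python
def knn_edges(correlation_matrix, labels, k=5):
    """Return sorted undirected edges linking each node to its k most correlated peers."""
    n = len(labels)
    edges = set()
    for i in range(n):
        # Rank every other node by its correlation with node i, highest first.
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: correlation_matrix[i][j], reverse=True)
        for j in others[:k]:
            # Store each undirected edge once, whichever endpoint proposed it.
            edges.add(tuple(sorted((labels[i], labels[j]))))
    return sorted(edges)


corr = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.3, 0.1],
    [0.1, 0.3, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
edges = knn_edges(corr, ["g1", "g2", "g3", "g4"], k=1)
print(edges)
```

An edge list of this kind could then be handed to a graph library to build the actual graph object.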
Source code in src/pySingleCellNet/utils/knn.py
call_outlier_cells
call_outlier_cells(adata, metric=['total_counts'], nmads=5)
Flags cells whose adata.obs[metric] value exceeds nmads median absolute deviations (MADs).
Parameters:
adata : AnnData
The input AnnData object containing single-cell data.
metric : str
The column name in adata.obs holding the cell metric.
nmads : int, optional (default=5)
The number of median absolute deviations used to define a cell as an outlier.
Returns:
None
The function adds a new column to adata.obs named "outlier_" + metric, but does not return anything.
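A MAD-based outlier rule can be sketched with the standard library alone. This is an illustration under assumptions, not the function's source: `mad_outliers` is a hypothetical name, and the sketch assumes a two-sided test centred on the median.

```python
from statistics import median


def mad_outliers(values, nmads=5):
    """Flag values lying more than nmads median absolute deviations from the median."""
    med = median(values)
    # Median absolute deviation: the median distance of each value from the median.
    mad = median(abs(v - med) for v in values)
    return [abs(v - med) > nmads * mad for v in values]


total_counts = [1000, 1100, 950, 1050, 980, 1020, 9000]
flags = mad_outliers(total_counts, nmads=5)
print(flags)
```

Here only the last cell, with 9000 counts, is flagged; the remaining values all fall within 5 MADs of the median.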
Source code in src/pySingleCellNet/utils/qc.py
create_gene_structure_dict_by_stage
create_gene_structure_dict_by_stage(file_path, stage)
Create a dictionary mapping structures to lists of genes expressed at a specific stage. Designed for parsing gene expression output from the Jackson Laboratory's Mouse Genome Informatics (MGI) database.
Parameters:
- file_path (str) – Path to the gene expression file.
- stage (str or int) – The Theiler Stage to filter the data.
Returns:
- dict – A dictionary where keys are structures and values are lists of genes expressed in those structures.
Source code in src/pySingleCellNet/utils/annotation.py
filter_adata_by_group_size
filter_adata_by_group_size(adata, groupby, ncells=20)
Filters an AnnData object to retain only cells from groups with at least 'ncells' cells.
Parameters:
adata : AnnData
The input AnnData object containing single-cell data.
groupby : str
The column name in adata.obs used to define groups (e.g., cluster labels).
ncells : int, optional (default=20)
The minimum number of cells a group must have to be retained.
Returns:
filtered_adata : AnnData
A new AnnData object containing only cells from groups with at least 'ncells' cells.
Raises:
ValueError:
- If groupby is not a column in adata.obs.
- If ncells is not a positive integer.
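The group-size rule can be shown on a plain list of labels. This is a minimal sketch, assuming the labels stand in for one adata.obs column; `keep_large_groups` is a hypothetical helper that returns positions rather than a new AnnData object.

```python
from collections import Counter


def keep_large_groups(group_labels, ncells=20):
    """Return positions of cells whose group has at least ncells members."""
    if not isinstance(ncells, int) or ncells <= 0:
        raise ValueError("ncells must be a positive integer")
    sizes = Counter(group_labels)  # group label -> number of cells
    return [i for i, group in enumerate(group_labels) if sizes[group] >= ncells]


clusters = ["a", "a", "a", "b", "c", "c"]
kept = keep_large_groups(clusters, ncells=2)
print(kept)
```

Group "b" has a single cell and is dropped; the positions that remain could be used to subset the original object.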
Source code in src/pySingleCellNet/utils/adataTools.py
filter_anndata_slots
filter_anndata_slots(adata, slots_to_keep, *, keep_dependencies=True)
Return a filtered COPY of adata that only keeps requested slots/keys.
Unspecified slots (or slots with value None) are cleared.
Parameters:
adata : AnnData
slots_to_keep : dict
Keys among {'obs','var','obsm','obsp','varm','varp','uns'}.
Values are lists of names to keep within that slot; if a slot is not
present in the dict or is None, all contents of that slot are removed.
Example:
{'obs': ['leiden','sample'],
'obsm': ['X_pca','X_umap'],
'uns': ['neighbors', 'pca', 'umap']}
keep_dependencies : bool, default True
If True, automatically keep cross-slot items that are commonly required:
- For each neighbors block in .uns[<key>] with 'connectivities_key' / 'distances_key', also keep those in .obsp.
- If an .obsp key ends with '_connectivities'/'_distances', also keep the matching .uns[<prefix>] if present.
- If keeping 'X_pca' in .obsm, also keep .uns['pca'] and .varm['PCs'] if present.
- If keeping 'X_umap' in .obsm, also keep .uns['umap'] if present.
Returns:
AnnData
A copy with filtered slots.
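Three of the dependency rules above can be sketched on plain dictionaries of key names. This is an illustrative sketch only: `expand_dependencies` and the `available` argument are hypothetical, and the rule that reads 'connectivities_key' / 'distances_key' from a neighbors block is left out because it needs the contents of .uns rather than just its key names.

```python
def expand_dependencies(slots_to_keep, available):
    """Both arguments map a slot name to a list of key names in that slot."""
    keep = {slot: set(names or []) for slot, names in slots_to_keep.items()}

    def add_if_present(slot, name):
        # Only pull in a dependency that actually exists in the data.
        if name in available.get(slot, []):
            keep.setdefault(slot, set()).add(name)

    if "X_pca" in keep.get("obsm", set()):
        add_if_present("uns", "pca")
        add_if_present("varm", "PCs")
    if "X_umap" in keep.get("obsm", set()):
        add_if_present("uns", "umap")
    for key in list(keep.get("obsp", set())):
        for suffix in ("_connectivities", "_distances"):
            if key.endswith(suffix):
                # e.g. 'knn_connectivities' -> matching .uns['knn']
                add_if_present("uns", key[: -len(suffix)])
    return {slot: sorted(names) for slot, names in keep.items()}


available = {
    "obsm": ["X_pca", "X_umap"],
    "uns": ["pca", "umap", "knn"],
    "varm": ["PCs"],
    "obsp": ["knn_connectivities"],
}
expanded = expand_dependencies({"obsm": ["X_pca"], "obsp": ["knn_connectivities"]}, available)
print(expanded)
```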
Source code in src/pySingleCellNet/utils/adataTools.py
filter_gene_list
filter_gene_list(genelist, min_genes, max_genes=1000000.0)
Filter the gene lists in the provided dictionary based on their lengths.
Parameters:
- genelist (dict) – Dictionary with keys as identifiers and values as lists of genes.
- min_genes (int) – Minimum number of genes a list should have.
- max_genes (int, default: 1000000.0) – Maximum number of genes a list should have.
Returns:
- dict – Filtered dictionary with lists that have a length between min_genes and max_genes (inclusive of min_genes and max_genes).
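The inclusive length filter described above amounts to a dictionary comprehension. This is a minimal sketch using a hypothetical stand-in, `filter_by_length`, rather than the library function itself.

```python
def filter_by_length(genelist, min_genes, max_genes=1_000_000):
    """Keep gene lists whose length lies in [min_genes, max_genes], bounds included."""
    return {
        name: genes
        for name, genes in genelist.items()
        if min_genes <= len(genes) <= max_genes
    }


gene_sets = {
    "small": ["Sox2"],
    "medium": ["Sox2", "Pou5f1", "Nanog"],
    "large": ["gene%d" % i for i in range(10)],
}
filtered = filter_by_length(gene_sets, min_genes=3, max_genes=5)
print(list(filtered))
```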
Source code in src/pySingleCellNet/utils/annotation.py
find_knee_point
find_knee_point(adata, total_counts_column='total_counts')
Identifies the knee point of the UMI count distribution in an AnnData object.
Parameters:
- adata (AnnData) – The input AnnData object.
- total_counts_column (str, default: 'total_counts') – Column in adata.obs containing total UMI counts.
Returns:
- float – The UMI count value at the knee point.
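The description does not say which knee-detection method is used. As one common heuristic, and not necessarily what this function does, the knee can be taken as the point on the log-log rank/count curve that lies farthest from the straight line joining the curve's two ends. The helper name below is hypothetical, and the sketch assumes at least one positive count.

```python
import math


def knee_by_max_chord_distance(total_counts):
    """Return the count at the rank farthest from the chord of the log-log rank curve."""
    ranked = sorted((c for c in total_counts if c > 0), reverse=True)
    xs = [math.log10(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log10(count) for count in ranked]
    x0, y0, x1, y1 = xs[0], ys[0], xs[-1], ys[-1]

    def gap(x, y):
        # Proportional to the perpendicular distance from (x, y) to the chord.
        return abs((y1 - y0) * (x - x0) - (x1 - x0) * (y - y0))

    best = max(range(len(ranked)), key=lambda i: gap(xs[i], ys[i]))
    return float(ranked[best])


# 50 cell-like barcodes followed by 50 background-like barcodes.
counts = [10000] * 50 + [10] * 50
knee = knee_by_max_chord_distance(counts)
print(knee)
```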
Source code in src/pySingleCellNet/utils/qc.py
generate_joint_graph
generate_joint_graph(adata, connectivity_keys, weights, output_key='jointNeighbors')
Create a joint graph by combining multiple connectivity graphs with specified weights.
This function computes the weighted sum of selected connectivity and distance matrices in an AnnData object and stores the result in .obsp.
Parameters:
- adata (AnnData) – The AnnData object containing connectivity matrices in .obsp.
- connectivity_keys (list of str) – A list of keys in adata.obsp corresponding to connectivity matrices to combine.
- weights (list of float) – A list of weights for each connectivity matrix. Must match the length of connectivity_keys.
- output_key (str, default: 'jointNeighbors') – The base key under which to store the combined graph in .obsp.
Raises:
- ValueError – If the number of connectivity_keys does not match the number of weights.
- KeyError – If any key in connectivity_keys or its corresponding distances key is not found in adata.obsp.
Returns:
- None – Updates the AnnData object in place by adding the combined connectivity and distance matrices to .obsp and metadata to .uns.
Example:
generate_joint_graph(adata, ['neighbors_connectivities', 'umap_connectivities'], [0.7, 0.3])
adata.obsp['jointNeighbors_connectivities']
adata.uns['jointNeighbors']
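The weighted sum itself can be sketched on small dense matrices. This is only an illustration: `weighted_sum` is a hypothetical helper, the plain `obsp` dictionary stands in for adata.obsp, and real connectivity matrices are typically sparse.

```python
def weighted_sum(matrices, weights):
    """Combine equally shaped matrices (lists of rows) into one weighted sum."""
    if len(matrices) != len(weights):
        raise ValueError("number of matrices must match number of weights")
    n_rows, n_cols = len(matrices[0]), len(matrices[0][0])
    joint = [[0.0] * n_cols for _ in range(n_rows)]
    for matrix, weight in zip(matrices, weights):
        for i in range(n_rows):
            for j in range(n_cols):
                joint[i][j] += weight * matrix[i][j]
    return joint


# Stand-in for adata.obsp with two 2x2 connectivity matrices.
obsp = {
    "neighbors_connectivities": [[0.0, 1.0], [1.0, 0.0]],
    "umap_connectivities": [[0.0, 0.5], [0.5, 0.0]],
}
keys = ["neighbors_connectivities", "umap_connectivities"]
joint = weighted_sum([obsp[key] for key in keys], [0.7, 0.3])
print(joint)
```

With weights 0.7 and 0.3, the off-diagonal entry is 0.7 * 1.0 + 0.3 * 0.5, roughly 0.85.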
Source code in src/pySingleCellNet/utils/knn.py
get_unique_colors
get_unique_colors(n_colors)
Generate a list of unique colors from the Tab20, Tab20b, and Tab20c colormaps.
Parameters:
- n_colors – The number of unique colors needed.
Returns:
- A list of unique colors.
Source code in src/pySingleCellNet/utils/colors.py
mito_rib
mito_rib(adQ, species='MM', log1p=True, clean=True)
Calculate mitochondrial and ribosomal QC metrics and add them to the .var attribute of the AnnData object.
Parameters:
adQ : AnnData
Annotated data matrix with observations (cells) and variables (features).
species : str, optional (default: "MM")
The species of the input data. Can be "MM" (Mus musculus) or "HS" (Homo sapiens).
clean : bool, optional (default: True)
Whether to remove mitochondrial and ribosomal genes from the data.
Returns:
AnnData
Annotated data matrix with QC metrics added to the .var attribute.
Source code in src/pySingleCellNet/utils/qc.py
read_gmt
read_gmt(file_path)
Read a Gene Matrix Transposed (GMT) file and return a dictionary of gene sets.
Parameters:
- file_path (str) – Path to the GMT file.
Returns:
- dict – A dictionary where keys are gene set names and values are lists of associated genes.
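In the GMT format each line holds one gene set: a name, a description, and then the member genes, all tab-separated. The sketch below parses such a file with the standard library. `parse_gmt` is a hypothetical stand-in, and skipping lines with fewer than three fields is an assumption rather than documented behavior.

```python
import os
import tempfile


def parse_gmt(file_path):
    """Map each gene set name to its list of genes."""
    gene_sets = {}
    with open(file_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # blank or malformed line
            name, _description, *genes = fields
            gene_sets[name] = [gene for gene in genes if gene]
    return gene_sets


# Write a two-line demo file, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.gmt")
    with open(path, "w") as handle:
        handle.write("PLURIPOTENCY\tdemo\tSox2\tPou5f1\tNanog\n")
        handle.write("CARDIAC\tdemo\tTnnt2\tMyh6\n")
    parsed = parse_gmt(path)
print(parsed)
```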
Source code in src/pySingleCellNet/utils/annotation.py
rename_cluster_labels
rename_cluster_labels(adata, old_col='cluster', new_col='short_cluster')
Renames cluster labels in the specified .obs column with multi-letter codes.
- All unique labels (including NaN) are mapped in order of appearance to a base-26 style ID: 'A', 'B', ..., 'Z', 'AA', 'AB', etc.
- The new labels are stored as a categorical column in adata.obs[new_col].
Parameters:
- adata (AnnData) – The AnnData object containing the cluster labels.
- old_col (str, default: 'cluster') – The name of the .obs column that has the original cluster labels.
- new_col (str, default: 'short_cluster') – The name of the new .obs column that will store the shortened labels.
Returns:
- None – The function adds a new column to adata.obs in place.
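The 'A', 'B', ..., 'Z', 'AA', 'AB' sequence follows spreadsheet-style column naming. Below is a minimal sketch of that mapping applied in order of first appearance. Both helper names are hypothetical, and the example uses string labels only, leaving aside how NaN is handled.

```python
def letter_code(index):
    """Map 0 -> 'A', 25 -> 'Z', 26 -> 'AA', 27 -> 'AB', and so on."""
    code = ""
    index += 1
    while index > 0:
        # Subtract 1 before dividing because there is no zero digit in this scheme.
        index, remainder = divmod(index - 1, 26)
        code = chr(ord("A") + remainder) + code
    return code


def short_labels(labels):
    """Assign codes to labels in order of first appearance."""
    mapping = {}
    for label in labels:
        if label not in mapping:
            mapping[label] = letter_code(len(mapping))
    return [mapping[label] for label in labels], mapping


codes, mapping = short_labels(["T cell", "B cell", "T cell", "NK cell"])
print(codes, mapping)
```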
Source code in src/pySingleCellNet/utils/adataTools.py
score_sex
score_sex(adata, y_genes=['Eif2s3y', 'Ddx3y', 'Uty'], x_inactivation_genes=['Xist', 'Tsix'])
Adds sex chromosome expression scores to an AnnData object.
This function calculates two scores for each cell in a scRNA-seq AnnData object:
- Y_score: the sum of expression values for a set of Y-chromosome specific genes.
- X_inact_score: the sum of expression values for genes involved in X-chromosome inactivation.
The scores are added to the AnnData object's .obs DataFrame with the keys 'Y_score' and 'X_inact_score'.
Parameters:
adata : AnnData
An AnnData object containing scRNA-seq data, with gene names in adata.var_names.
y_genes : list of str, optional
List of Y-chromosome specific marker genes (default is ['Eif2s3y', 'Ddx3y', 'Uty']).
x_inactivation_genes : list of str, optional
List of genes involved in X-chromosome inactivation (default is ['Xist', 'Tsix']).
Raises:
ValueError
If none of the Y-specific or X inactivation genes are found in adata.var_names.
Returns:
None
The function modifies the AnnData object in place by adding the score columns to adata.obs.
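The per-cell sums can be illustrated on a small dense matrix of cells by genes. This is a sketch under assumptions: `sum_gene_scores` is a hypothetical helper, genes absent from the data are silently skipped, and an error is raised only when none of the requested genes is present.

```python
def sum_gene_scores(expression, var_names, genes):
    """Sum each cell's expression over the requested genes that are present."""
    present = [gene for gene in genes if gene in var_names]
    if not present:
        raise ValueError("none of the requested genes are in var_names")
    columns = [var_names.index(gene) for gene in present]
    return [sum(row[col] for col in columns) for row in expression]


var_names = ["Xist", "Ddx3y", "Uty", "Actb"]
expression = [
    [5.0, 0.0, 0.0, 3.0],  # cell expressing Xist only
    [0.0, 2.0, 1.0, 4.0],  # cell expressing Y-linked genes
]
y_score = sum_gene_scores(expression, var_names, ["Eif2s3y", "Ddx3y", "Uty"])
x_inact_score = sum_gene_scores(expression, var_names, ["Xist", "Tsix"])
print(y_score, x_inact_score)
```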
Source code in src/pySingleCellNet/utils/qc.py
split_adata_indices
split_adata_indices(adata, n_cells=100, groupby='cell_ontology_class', cellid=None, strata_col=None)
Splits an AnnData object into training and validation indices based on stratification by cell type and optionally by another categorical variable.
Parameters:
- adata (AnnData) – The annotated data matrix to split.
- n_cells (int, default: 100) – The number of cells to sample per cell type.
- groupby (str, default: 'cell_ontology_class') – The column name in adata.obs that specifies the cell type.
- cellid (str, default: None) – The column in adata.obs to use as a unique identifier for cells. If None, it defaults to using the index.
- strata_col (str, default: None) – The column name in adata.obs used for secondary stratification, such as developmental stage, gender, or disease status.
Returns:
- tuple – A tuple containing two lists:
  - training_indices (list): List of indices for the training set.
  - validation_indices (list): List of indices for the validation set.
Raises:
- ValueError – If any specified column names do not exist in the DataFrame.
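A per-cell-type split can be sketched with the standard library. This is an illustration, not the function's source: `stratified_split` and its `seed` argument are hypothetical, secondary stratification is omitted, and capping the sample at the group size is an assumption about how small groups are treated.

```python
import random
from collections import defaultdict


def stratified_split(cell_ids, cell_types, n_cells=100, seed=0):
    """Sample up to n_cells ids per cell type for training; the rest go to validation."""
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for cell_id, cell_type in zip(cell_ids, cell_types):
        by_type[cell_type].append(cell_id)
    training, validation = [], []
    for ids in by_type.values():
        chosen = set(rng.sample(ids, min(n_cells, len(ids))))
        training.extend(i for i in ids if i in chosen)
        validation.extend(i for i in ids if i not in chosen)
    return training, validation


ids = ["c%d" % i for i in range(10)]
types = ["T"] * 6 + ["B"] * 4
training_indices, validation_indices = stratified_split(ids, types, n_cells=3)
print(training_indices, validation_indices)
```

Each type contributes three cells to the training list, so the two lists are disjoint and together cover every cell.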
Source code in src/pySingleCellNet/utils/adataTools.py
write_gmt
write_gmt(gene_list, filename, collection_name, prefix='')
Write a .gmt file from a gene list.
Parameters:
- gene_list (dict) – Dictionary of gene sets (keys are gene set names, values are lists of genes).
- filename (str) – The name of the file to write to.
- collection_name (str) – The name of the gene set collection.
- prefix (str, optional) – A prefix to add to each gene set name.
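The layout of the output can be sketched as a string builder. This is a hedged illustration: `format_gmt` is a hypothetical helper, and placing collection_name in the description column of each GMT line is an assumption rather than something the description above states.

```python
def format_gmt(gene_list, collection_name, prefix=""):
    """Render gene sets as GMT text: name, description, then genes, tab-separated."""
    lines = []
    for name, genes in gene_list.items():
        # Assumed layout: the collection name fills the description column.
        lines.append("\t".join([prefix + name, collection_name, *genes]))
    return "\n".join(lines) + "\n"


text = format_gmt({"setA": ["Sox2", "Nanog"], "setB": ["Tnnt2"]}, "demo", prefix="mm_")
print(text)
```

The resulting string could be written to a file with an ordinary `open(filename, "w")` call.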
Source code in src/pySingleCellNet/utils/annotation.py