XClone tutorials on TNBC1 scRNA-seq
This tutorial uses a triple-negative breast cancer (TNBC) sample that was assayed by droplet-based scRNA-seq (10x Genomics) and published in the CopyKAT study under the dataset ID TNBC1. The aligned reads in BAM format and the called cell list were downloaded directly from GEO (GSE148673). The original CopyKAT study identified three clusters of cells from the transcriptome: 300 normal cells and 797 tumour cells belonging to two distinct CNV clones.
Download the Jupyter Notebook by clicking on "Download TNBC1 scRNA-seq demo Notebook".
Introduction
This tutorial covers how to use XClone for scRNA-seq CNV analysis with the RDR, BAF and Combine modules, each run as a separate step.
Author: Rongting Huang
Date: 2023-04-03
The data used in this tutorial is available in the XClone package and can be downloaded from xclonedata.
The data contains:
Single-cell RNA-seq read count data for XClone RDR module and BAF module.
Requirements
To follow this tutorial, you will need:
Python 3.7.
The following Python packages: XClone and its dependent packages, e.g., anndata, numpy, scipy, pandas, scanpy.
If you have any questions or comments about this tutorial, please feel free to contact the author.
Load packages
[1]:
%load_ext autoreload
%autoreload 2
import xclone
import anndata as an
import pandas as pd
import numpy as np
import scipy
print("scipy", scipy.__version__)
xclone.pp.efficiency_preview()
(Running XClone 0.3.4)
2023-04-03 10:46:28
scipy 1.7.0
[XClone efficiency] multiprocessing cpu total count in your device 112
[2]:
dataset_name = "TNBC1_scRNA"
## output results dir
outdir = "/storage/xxxx/users/xxxx/xclone/tutorials/"
To specify the name of your dataset and the output directory where the results will be saved, set the dataset_name and outdir parameters before you start using any module in XClone. Replace "TNBC1_scRNA" with a name of your choice to identify your dataset, and "/storage/xxxx/users/xxxx/xclone/tutorials/" with the path to the directory where you want to save the output files.
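As a minimal sketch of this setup step, the snippet below creates the output directory before any module runs and keeps a trailing slash, because later cells in this tutorial build file paths as outdir+"data/...". The helper name prepare_outdir and the temporary base directory are assumptions made for illustration; they are not part of XClone.

```python
import tempfile
from pathlib import Path


def prepare_outdir(outdir):
    """Create the output directory (with parents) and return it with a trailing slash.

    Hypothetical helper for illustration; XClone does not provide it.
    """
    path = Path(outdir).expanduser()
    path.mkdir(parents=True, exist_ok=True)
    return str(path) + "/"


# Demo in a temporary location so the sketch runs anywhere.
dataset_name = "TNBC1_scRNA"
base = tempfile.mkdtemp()
outdir = prepare_outdir(Path(base) / "xclone" / "tutorials")
print(dataset_name, "->", outdir)
```

In a real run, replace the temporary base with your own storage path.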
Load dataset
For the TNBC1 dataset, users can load the raw read count matrices for the RDR module and the BAF module with xclone.data.tnbc1_rdr() and xclone.data.tnbc1_baf(), respectively.
The matrices are loaded as anndata objects in Python. An anndata object typically contains several matrices or tables of data, each with its own set of row and/or column annotations. The obs attribute contains annotation information for each cell in the dataset, and the var attribute contains annotation information for each gene in the dataset.
obs attributes used in XClone:
cluster: Cluster identifier for each cell, obtained by combining expression with BAF analysis. We annotated the cells as two subclones, Clone 1 (Clone A) and Clone 2 (Clone B), plus Normal cells.
cluster.pred: Normal and tumor cells, 'N' and 'T' here, derived from the annotation above.
copykat: Cluster identifier for each cell from the CopyKAT publication.
copykat.pred: Normal and tumor cells identified in the CopyKAT publication.
Additional cell-level annotations can be merged into obs by users.
var attributes used in XClone:
GeneName: Gene name for each gene in the dataset.
GeneID: Identifier or accession number of each gene in the dataset (if available).
chr: Chromosome location of each gene.
start: Start position of each gene on its chromosome.
stop: End position of each gene on its chromosome.
arm: Chromosome arm information for each gene.
chr_arm: Combined chromosome and arm information for each gene.
band: Chromosome band information for each gene (if available).
Additional gene-level annotations can be merged into var by users.
For the RDR module, the anndata initially contains a raw_expr layer; for the BAF module, it initially contains two layers, AD and DP.
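To make the role of the two BAF layers concrete, here is a small sketch that computes a per-cell B allele frequency from toy AD and DP matrices. It assumes that AD holds the read counts supporting one allele and DP the total depth at the same feature; this meaning is inferred from the layer names, and the numbers are invented for illustration.

```python
import numpy as np

# Toy counts for 3 cells x 4 genes (invented values).
# Assumption: AD = reads supporting one allele, DP = total depth.
AD = np.array([[2, 0, 5, 1],
               [1, 0, 3, 0],
               [4, 2, 0, 0]], dtype=float)
DP = np.array([[4, 0, 10, 2],
               [2, 0, 3, 1],
               [8, 4, 0, 0]], dtype=float)

# BAF is undefined where there is no coverage, so those entries stay NaN.
BAF = np.full(DP.shape, np.nan)
np.divide(AD, DP, out=BAF, where=DP > 0)

print(BAF)
```

Entries without coverage remain NaN rather than 0, which keeps "no data" distinct from "no reads for this allele".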
[3]:
RDR_adata = xclone.data.tnbc1_rdr()
BAF_adata = xclone.data.tnbc1_baf()
RDR module
By calling the config method xclone.XCloneConfig(), users can create an instance of the XCloneConfig class with module set to "RDR" to set data-specific configurations for the RDR module.
In the RDR module, set cell_anno_key to the cell annotation key used in your dataset, and ref_celltype to the reference cell type.
Set the output directory by specifying outdir.
By default, XClone detects and removes cell-type-specific marker genes. Users specify the cell annotation key used for grouping in marker_group_anno_key and the number N of top marker genes to remove from each cell group in top_n_marker (\(N=15\) by default). If marker_group_anno_key is not defined, cell_anno_key is used.
Users can choose whether to plot the main figures in each module by setting xclone_plot to True or False. If True, plot_cell_anno_key selects the cell annotation used for plotting cells by clusters.
XClone uses a default start probability start_prob of [0.1, 0.8, 0.1] (for copy loss, copy neutral and copy gain) for 10x scRNA-seq datasets.
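To illustrate how a start probability and a sticky transition matrix smooth noisy per-gene evidence, the sketch below runs a generic forward-backward pass on a toy chain of six genes. It is a textbook HMM routine and is not XClone's implementation. The transition matrix mirrors the trans_prob values printed by xconfig.display() in this tutorial (trans_t 1e-06), and the emission likelihoods are invented.

```python
import numpy as np


def forward_backward(emission, start_prob, trans_prob):
    """Posterior state probabilities along one chain (generic HMM smoothing).

    emission: array (n_positions, n_states), likelihood of the data per state.
    """
    n, k = emission.shape
    alpha = np.zeros((n, k))
    beta = np.ones((n, k))
    # Forward pass, normalised at each step for numerical stability.
    alpha[0] = start_prob * emission[0]
    alpha[0] /= alpha[0].sum()
    for i in range(1, n):
        alpha[i] = (alpha[i - 1] @ trans_prob) * emission[i]
        alpha[i] /= alpha[i].sum()
    # Backward pass, also normalised at each step.
    for i in range(n - 2, -1, -1):
        beta[i] = trans_prob @ (emission[i + 1] * beta[i + 1])
        beta[i] /= beta[i].sum()
    post = alpha * beta
    return post / post.sum(axis=1, keepdims=True)


# States in the order copy loss, copy neutral, copy gain.
start_prob = np.array([0.1, 0.8, 0.1])
t = 1e-6
trans_prob = np.full((3, 3), t)
np.fill_diagonal(trans_prob, 1 - 2 * t)

# Invented emissions: gene 2 looks like a gain when considered in isolation.
emission = np.array([[0.2, 0.7, 0.1],
                     [0.2, 0.7, 0.1],
                     [0.1, 0.3, 0.6],
                     [0.2, 0.7, 0.1],
                     [0.2, 0.7, 0.1],
                     [0.2, 0.7, 0.1]])

post = forward_backward(emission, start_prob, trans_prob)
print("per-gene argmax without smoothing:", emission.argmax(axis=1))
print("argmax after smoothing:           ", post.argmax(axis=1))
```

Because switching states is heavily penalised, the isolated gain-like gene is absorbed into the surrounding copy neutral segment.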
Before running the RDR module, users can call xconfig.display() to print all configurations used in the RDR module for previewing and logging.
Users can run the analysis on their input data by calling the run_RDR method with RDR_adata and config_file. If no customised config file is given, XClone loads the default config file.
The example below runs the RDR module on TNBC1.
[4]:
xconfig = xclone.XCloneConfig(dataset_name = dataset_name, module = "RDR")
xconfig.set_figure_params(xclone= True, fontsize = 18)
xconfig.outdir = outdir
xconfig.cell_anno_key = "cluster.pred"
xconfig.ref_celltype = "N"
xconfig.marker_group_anno_key = "cluster.pred"
xconfig.xclone_plot= True
xconfig.plot_cell_anno_key = "cluster"
xconfig.display()
RDR_Xdata = xclone.model.run_RDR(RDR_adata,
                                 config_file = xconfig)
RDR
Configurations:
HMM_brk chr_arm
KNN_neighbors 10
WMA_smooth_key chr_arm
WMA_window_size 40
_file_format_data h5ad
_file_format_figs pdf
_frameon True
_outdir /storage/yhhuang/users/rthuang/xclone/tutorials
_plot_suffix
_start 1680518803.4498727
_vector_friendly True
cell_anno_key cluster.pred
dataset_name TNBC1_scRNA
dispersion_celltype None
exclude_XY False
file_format_data h5ad
file_format_figs pdf
filter_ref_ave 0.5
fit_GLM_libratio False
gene_exp_group 1
gene_exp_ref_log True
guide_chr_anno_key chr_arm
guide_cnv_ratio None
guide_qt_lst [0.0001, 0.96, 0.99]
marker_group_anno_key cluster.pred
max_iter 2
min_iter 1
module RDR
outdir /storage/yhhuang/users/rthuang/xclone/tutorials
plot_cell_anno_key cluster
plot_suffix
rdr_plot_vmax 0.7
rdr_plot_vmin -0.7
ref_celltype N
remove_guide_XY False
remove_marker True
select_normal_chr_num 4
set_figtitle True
set_smartseq False
smart_transform False
start_prob [0.1 0.8 0.1]
top_n_marker 15
trans_prob [[9.99998e-01 1.00000e-06 1.00000e-06]
[1.00000e-06 9.99998e-01 1.00000e-06]
[1.00000e-06 1.00000e-06 9.99998e-01]]
trans_t 1e-06
warninig_ignore True
xclone_plot True
[XClone RDR module running]************************
[XClone data preprocessing] check RDR raw dataset value: success
Keep valid cells: Filter out 0 cells / 1097 total cells, remain 1097 valid cells with annotation
[XClone data preprocessing] check RDR cell annotation: success
[XClone-RDR preprocessing] Filter out 16315 genes / 33472 total genes, remain 17157 genes
[XClone data preprocessing] detect RDR genes: done
[XClone-RDR preprocessing] Filter out 10724 genes / 17157 total genes, remain 6433 genes
Trying to set attribute `.var` of view, copying.
[XClone] use marker genes provided by users:
['AC036214.3' 'AC093484.3' 'AL365205.1' 'ATP1A1' 'B2M' 'CD24' 'CRYAB'
'CST3' 'CTSB' 'CYBA' 'DPYD' 'EMP3' 'EPCAM' 'GSTO1' 'H3F3A' 'HLA-A' 'HNMT'
'HSP90AB1' 'KRT7' 'MRPL14' 'POLR1C' 'RAB31' 'RPL28' 'TM4SF1' 'TMSB4X'
'TOMM6' 'TPD52' 'UBB' 'YIPF3' 'ZEB2']
filter_genes_num: 30
used_genes_num: 6403
output anndata is not sparse matrix.
Trying to set attribute `.var` of view, copying.
Trying to set attribute `.var` of view, copying.
[XClone RDR gene dispersion fitting] Time used: 1413 seconds
[XClone RDR gene-specific dispersion]: checking
max_value: 3.1872219121269994e+33
min_value: 2.967852820468968e-52
qt_0.95_value: 4.299742553561531
qt_0.05_value: 0.045671615314120464
remove no GLM results genes num: 0
remove inf dispersion genes num: 0
[XClone RDR dispersion]: clipping
[XClone RDR gene-specific dispersion]: checking
max_value: 2.920260325742404
min_value: 0.014103096362288924
qt_0.95_value: 2.920260325742404
qt_0.05_value: 0.045671615314120464
[XClone hint] RDR_base_file and bulk_file saved in /storage/yhhuang/users/rthuang/xclone/tutorials/data/.
make WMA connectivities matrix, saved in varp[WMA_connect].
[XClone] RDR CNV states chrs guiding(copy loss, copy neutral, copy gain): ['19q', '3p', '8q']
CNV loss: 0.6554608169760507
CNV neutral: 1.2388721424569358
CNV gain: 2.2130883651915747
[XClone] RDR CNV states ratio guiding(copy loss, copy neutral, copy gain): [0.65546082 1.23887214 2.21308837]
expression_brk [-0.6898304 5.8300295]
[XClone] CNV_optimazation iteration: 1
Cell level: no filtering emm_prob
Gene level: filter nan emm_prob
[XClone HMM smoothing] Time used: 61 seconds
[XClone] CNV_optimazation iteration: 2
[XClone] fit CNV ratio
26 2.920260
27 2.920260
38 2.920260
39 0.223667
43 0.175076
...
33348 0.779464
33350 0.424299
33353 0.115477
33355 0.152878
33357 0.159337
Name: dispersion_capped, Length: 6402, dtype: float64
[XClone] GLM success:
[0. 0. 0. ... 5. 9. 3.] [1. 1. 1. ... 1. 1. 1.] [0.4345959 0.5970708 0.71850723 ... 2.0021474 2.9733129 2.491469 ] [1. 1. 0. ... 0. 0. 0.]
[XClone] GLM success:
[0. 0. 0. ... 5. 9. 3.] [1. 1. 1. ... 1. 1. 1.] [0.4345959 0.5970708 0.71850723 ... 2.0021474 2.9733129 2.491469 ] [0. 0. 1. ... 0. 1. 1.]
[XClone] GLM success:
[0. 0. 0. ... 5. 9. 3.] [1. 1. 1. ... 1. 1. 1.] [0.4345959 0.5970708 0.71850723 ... 2.0021474 2.9733129 2.491469 ] [0. 0. 0. ... 1. 0. 0.]
Time used 8 seconds
Cell level: no filtering emm_prob
Gene level: filter nan emm_prob
[XClone HMM smoothing] Time used: 63 seconds
[XClone] Warning: Lower bound decreases!
Step 1, loglik decrease from -7.61e+06 to -7.61e+06
iteration_end_round: 2
Logliklihood: [-7609360.65300615 -7613910.98611593]
CNV_ratio: {'0': array([[0.65546082, 1.23887214, 2.21308837]]), '1': array([[0.57053172, 1.17758586, 2.7366992 ]])}
Time used 213 seconds
[XClone hint] RDR_final_file saved in /storage/yhhuang/users/rthuang/xclone/tutorials/data/.
[XClone plotting]
[5]:
RDR_Xdata
[5]:
AnnData object with n_obs × n_vars = 1097 × 6402
obs: 'copykat.pred', 'cluster.pred', 'cluster', 'mit_clone_id', 'confident', 'tumor', 'copykat', 'counts_ratio'
var: 'GeneName', 'GeneID', 'chr', 'start', 'stop', 'arm', 'chr_arm', 'band', 'ref_avg', 'dispersion', 'gene_dispersion_bse', 'dispersion_capped'
uns: 'data_mode', 'data_notes', 'genome_mode', 'log', 'rank_marker_genes', 'fit_dispersion_removed_genes', 'dispersion_base_celltype', 'pca', 'neighbors', 'chr_dict', 'guide_CNV_chrs_use_layers', 'guide_CNV_chrs_use_anno_key', 'ref_log_expression_brk', 'group_genes', 'CNV_ratio', 'Logliklihood'
obsm: 'X_pca'
varm: 'PCs'
layers: 'raw_expr', 'raw_ratio', 'ref_normalized', 'expected', 'WMA_smoothed', 'RDR_smooth', 'emm_prob_log', 'emm_prob_log_noHMM', 'emm_prob_noHMM', 'posterior_mtx', 'posterior_mtx_log'
obsp: 'distances', 'connectivities'
[6]:
RDR_Xdata.layers["posterior_mtx"].shape
[6]:
(1097, 6402, 3)
[7]:
RDR_Xdata.layers["posterior_mtx"].argmax(axis=-1).shape
[7]:
(1097, 6402)
Users can load the step-by-step results (AnnData) stored in outdir/data.
For the final CNV detection results of the RDR module, load the AnnData object into memory:
import anndata
filename = outdir+"data/RDR_adata_KNN_HMM_post.h5ad"
adata = anndata.read_h5ad(filename)
Access the posterior_mtx layer:
cnv_prob = adata.layers["posterior_mtx"]
This extracts the final CNV calling results as a NumPy array.
The posterior_mtx layer contains the posterior probability of each copy number state for each cell at each genomic segment (gene scale by default). The first axis corresponds to cells, the second to genomic segments and the third to copy number states.
copy_states = cnv_prob.argmax(axis=-1)
This gives a 2D NumPy array of hard-assigned CNV states for each cell and each gene.
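The hard assignment can be tried on a toy array without loading any file. The sketch below assumes the three states along the last axis are ordered copy loss, copy neutral, copy gain, following the order given for start_prob in this tutorial; the probabilities are invented.

```python
import numpy as np

# Toy posterior: 2 cells x 3 genes x 3 states (invented values).
posterior = np.array([
    [[0.10, 0.80, 0.10], [0.70, 0.20, 0.10], [0.20, 0.30, 0.50]],
    [[0.05, 0.90, 0.05], [0.10, 0.80, 0.10], [0.10, 0.20, 0.70]],
])

# Assumed state order along the last axis.
state_names = np.array(["copy loss", "copy neutral", "copy gain"])

copy_states = posterior.argmax(axis=-1)   # hard state index per cell and gene
labels = state_names[copy_states]         # readable label per cell and gene
max_prob = posterior.max(axis=-1)         # probability of the chosen state

# Fraction of genes assigned to each state, per cell.
frac = np.stack([(copy_states == s).mean(axis=1) for s in range(3)], axis=1)

print(copy_states)
print(labels)
print(frac)
```

Keeping max_prob next to the hard calls makes it easy to mask low-confidence assignments later.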
To explore the results of XClone further, please refer to link.
BAF module
Similarly, by calling the config method xclone.XCloneConfig(), users can create an instance of the XCloneConfig class with module set to "BAF" to set data-specific configurations for the BAF module.
By default, baf_bias_mode=1 is set, which supports \(K=5\) BAF states: allele A bias (++), allele A bias (+), allele balance, allele B bias (+), and allele B bias (++). XClone also provides \(K=3\) BAF states for comparison (visualization). In this mode, CNV_N_components is 5. Alternatively, users can set baf_bias_mode=0, which supports \(K=3\) BAF states: allele A bias, allele balance and allele B bias. In this mode, CNV_N_components is 3.
In the BAF module, the start probability start_prob is set by default to \([0.3, 0.4, 0.3]\) and \([0.2, 0.15, 0.3, 0.15, 0.2]\) for \(K=3\) and \(K=5\), respectively. The transition probability trans_prob for HMM smoothing is set by default to \(t = 10^{-6}\) for each cross-state transition and \(1-(K-1)t\) for staying in the same state. Users who change CNV_N_components manually should also adjust the corresponding start probability and transition probability in the HMM settings.
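When CNV_N_components is changed, the start and transition probabilities must stay consistent with the new number of states. The sketch below builds a transition matrix from the rule stated above (t off the diagonal, 1-(K-1)t on it) and checks that each row and each start vector sums to one. The function name build_trans_prob is a hypothetical helper, not an XClone API.

```python
import numpy as np


def build_trans_prob(n_states, t=1e-6):
    """K x K matrix with t for cross-state moves and 1-(K-1)t on the diagonal.

    Hypothetical helper for illustration.
    """
    trans = np.full((n_states, n_states), t)
    np.fill_diagonal(trans, 1 - (n_states - 1) * t)
    return trans


# Default start probabilities quoted in this tutorial for K=3 and K=5.
default_start_prob = {
    3: np.array([0.3, 0.4, 0.3]),
    5: np.array([0.2, 0.15, 0.3, 0.15, 0.2]),
}

for k, start in default_start_prob.items():
    trans = build_trans_prob(k)
    # Both must be valid probability distributions.
    assert np.isclose(start.sum(), 1.0)
    assert np.allclose(trans.sum(axis=1), 1.0)
    print(k, "states, self-transition probability:", trans[0, 0])
```

For K=5 the diagonal value agrees with the 9.99996e-01 shown in the trans_prob entry of the configuration printout.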
Generally, users need to set cell_anno_key to the cell annotation key used in their dataset, and ref_celltype to the reference cell type.
Set the output directory by specifying outdir.
Generally, the data-specific theoretical B allele frequency is fitted from the reference cells. However, if the number of reference cells is limited, users can set theo_neutral_BAF=0.5 directly. Here we use the default setting for the TNBC1 dataset.
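The idea of a reference-derived neutral BAF with a fixed fallback can be sketched as follows. This is a simple pseudo-bulk proxy and not XClone's fitting procedure; the function name, the min_ref_cells threshold and the simulated counts are all assumptions made for illustration.

```python
import numpy as np


def neutral_baf_from_reference(AD, DP, is_ref, min_ref_cells=10, fallback=0.5):
    """Pseudo-bulk BAF per feature from reference cells.

    Falls back to a fixed value when too few reference cells are available.
    Hypothetical helper; min_ref_cells is an arbitrary illustrative threshold.
    """
    n_features = AD.shape[1]
    if is_ref.sum() < min_ref_cells:
        return np.full(n_features, fallback)
    ad = AD[is_ref].sum(axis=0).astype(float)
    dp = DP[is_ref].sum(axis=0).astype(float)
    baf = np.full(n_features, fallback)
    # Features without coverage in the reference keep the fallback value.
    np.divide(ad, dp, out=baf, where=dp > 0)
    return baf


# Simulated balanced data: 50 cells x 4 features, first 30 cells are reference.
rng = np.random.default_rng(0)
DP = rng.poisson(20, size=(50, 4))
AD = rng.binomial(DP, 0.5)
is_ref = np.arange(50) < 30

baf_ref = neutral_baf_from_reference(AD, DP, is_ref)
baf_few = neutral_baf_from_reference(AD, DP, np.arange(50) < 3)
print(baf_ref)
print(baf_few)
```

With enough reference cells the estimate stays close to 0.5 for balanced features; with only three reference cells the sketch returns the fixed value, which mirrors the role of setting theo_neutral_BAF=0.5.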
Users can choose whether to plot the main figures in each module by setting xclone_plot to True or False. If True, plot_cell_anno_key selects the cell annotation used for plotting cells by clusters. XClone performs a denoising step in the BAF module and plots the denoised BAF CNV profile by default.
Before running the BAF module, users can call xconfig.display() to print all configurations used in the BAF module for previewing and logging.
Users can run the analysis on their input data by calling the run_BAF method with BAF_adata and config_file. If no customised config file is given, XClone loads the default config file.
The example below runs the BAF module on TNBC1.
[8]:
xconfig = xclone.XCloneConfig(dataset_name = dataset_name, module = "BAF")
xconfig.set_figure_params(xclone= True, fontsize = 18)
xconfig.outdir = outdir
xconfig.cell_anno_key = "cluster.pred"
xconfig.ref_celltype = "N"
xconfig.xclone_plot= True
xconfig.plot_cell_anno_key = "cluster"
xconfig.display()
BAF_merge_Xdata = xclone.model.run_BAF(BAF_adata,
                                       config_file = xconfig)
BAF
Configurations:
BAF_add None
BAF_denoise True
BAF_denoise_GMM_comp 2
BAF_denoise_GMM_detection True
BAF_denoise_cellprop_cutoff 0.05
CNV_N_components 5
HMM_brk chr_arm
KNN_neighbors 10
RDR_file None
WMA_smooth_key chr_arm
WMA_window_size 101
_file_format_data h5ad
_file_format_figs pdf
_frameon True
_outdir /storage/yhhuang/users/rthuang/xclone/tutorials
_plot_suffix
_start 1680520658.023401
_vector_friendly True
baf_bias_mode 1
bin_nproc 20
cell_anno_key cluster.pred
concentration 100
concentration_lower 20
concentration_upper 100
dataset_name TNBC1_scRNA
exclude_XY False
extreme_count_cap False
feature_mode GENE
file_format_data h5ad
file_format_figs pdf
gene_specific_concentration False
guide_theo_CNV_states None
module BAF
outdir /storage/yhhuang/users/rthuang/xclone/tutorials
phasing_len 100
phasing_region_key chr
plot_cell_anno_key cluster
plot_suffix
ref_BAF_clip False
ref_celltype N
remove_guide_XY False
remove_marker_genes True
set_figtitle True
set_smartseq False
start_prob [0.2 0.15 0.3 0.15 0.2 ]
theo_neutral_BAF None
trans_prob [[9.99996e-01 1.00000e-06 1.00000e-06 1.00000e-06 1.00000e-06]
[1.00000e-06 9.99996e-01 1.00000e-06 1.00000e-06 1.00000e-06]
[1.00000e-06 1.00000e-06 9.99996e-01 1.00000e-06 1.00000e-06]
[1.00000e-06 1.00000e-06 1.00000e-06 9.99996e-01 1.00000e-06]
[1.00000e-06 1.00000e-06 1.00000e-06 1.00000e-06 9.99996e-01]]
trans_t 1e-06
warninig_ignore True
xclone_plot True
[XClone BAF module running]************************
[XClone data preprocessing] check BAF raw dataset value: success
Keep valid cells: Filter out 0 cells / 1097 total cells, remain 1097 valid cells with annotation
[XClone data preprocessing] check BAF cell annotation: success
[XClone data checking]: RDR and BAF in same cell order.
[XClone-data removing]:
Filter out 30 genes / 33472 total genes, remain 33442 regions
[XClone-Local_phasing] time_used: 76.63seconds
[XClone-Global_phasing] time_used: 3.32seconds
make WMA connectivities matrix, saved in varp[WMA_connect].
WMA_connect exists for direct use.
... storing 'bin_stop_arm' as categorical
... storing 'bin_stop_chr_arm' as categorical
... storing 'bin_stop_band' as categorical
[XClone hint] BAF_base_file and merged_file saved in /storage/yhhuang/users/rthuang/xclone/tutorials/data/.
[XClone get_CNV_states] time_used: 303.31seconds
states used: [0.24059262 0.3962636 0.50714286 0.63970433 0.79573162]
.....
[XClone] specific Center states used.
[XClone]: validated probability, all finite.
cal emm prob time 7 seconds
normalize the input emm_prob_log
normalized emm_prob_log
generate new layer key value: bin_phased_BAF_specific_center_emm_prob_log_KNN
[BAF smoothing] time_used: 1.93seconds
Cell level: no filtering emm_prob
Gene level: no filtering emm_prob
[XClone] multiprocessing for each brk item
nproc: 80
[XClone HMM smoothing] Time used: 11 seconds
[XClone get_CNV_states] time_used: 204.47seconds
states used: [0.28692468 0.50714286 0.72361922]
.....
[XClone] specific Center states used.
[XClone]: validated probability, all finite.
cal emm prob time 1 seconds
normalize the input emm_prob_log
normalized emm_prob_log
generate new layer key value: correct_emm_prob_log_KNN
[BAF smoothing] time_used: 1.04seconds
Cell level: no filtering emm_prob
Gene level: no filtering emm_prob
[XClone] multiprocessing for each brk item
nproc: 80
[XClone HMM smoothing] Time used: 14 seconds
[[0.14598567]
[0.66374004]]
[[0.70412148]
[0.24318348]]
[XClone hint] BAF_final_file saved in /storage/yhhuang/users/rthuang/xclone/tutorials/data/.
[9]:
BAF_merge_Xdata
[9]:
AnnData object with n_obs × n_vars = 1097 × 347
obs: 'copykat.pred', 'cluster.pred', 'cluster', 'mit_clone_id', 'confident', 'tumor', 'copykat'
var: 'chr', 'start', 'stop', 'arm', 'chr_arm', 'band', 'gene1_stop', 'bin_stop_arm', 'bin_stop_chr_arm', 'bin_stop_band', 'bin_idx', 'bin_idx_cum', 'GeneName_lst', 'GeneID_lst', 'bin_genes_cnt', 'ref_BAF_phased'
uns: 'local_phasing_key', 'local_phasing_len'
layers: 'ad_bin_softcnt', 'ad_bin', 'dp_bin', 'ad_bin_softcnt_phased', 'ad_bin_phased', 'BAF', 'BAF_phased', 'fill_BAF_phased', 'BAF_phased_KNN', 'BAF_phased_KNN_WMA', 'BAF_phased_WMA', 'bin_phased_BAF_specific_center_emm_prob_log', 'bin_phased_BAF_specific_center_emm_prob_log_KNN', 'emm_prob_log_noHMM', 'emm_prob_noHMM', 'posterior_mtx', 'posterior_mtx_log', 'add_posterior_mtx', 'denoised_add_posterior_mtx', 'denoised_posterior_mtx'
obsp: 'connectivities_expr'
varp: 'WMA_connect'
[10]:
BAF_merge_Xdata.layers["posterior_mtx"].shape
[10]:
(1097, 347, 5)
The final AnnData is stored in outdir/data/BAF_merge_Xdata_KNN_HMM_post.h5ad, and users can get the CNV identification from the posterior_mtx layer.
Before combining, we check the dimensions of the RDR module and BAF module results to make sure they are consistent.
[11]:
RDR_Xdata.var
[11]:
 | GeneName | GeneID | chr | start | stop | arm | chr_arm | band | ref_avg | dispersion | gene_dispersion_bse | dispersion_capped |
---|---|---|---|---|---|---|---|---|---|---|---|---|
26 | HES4 | ENSG00000188290 | 1 | 998962 | 1000172 | p | 1p | p36.33 | 0.611296 | 5.409898 | NaN | 2.920260 |
27 | ISG15 | ENSG00000187608 | 1 | 1001138 | 1014541 | p | 1p | p36.33 | 6.817276 | 3.253048 | NaN | 2.920260 |
38 | TNFRSF4 | ENSG00000186827 | 1 | 1211326 | 1214138 | p | 1p | p36.33 | 0.637874 | 5.979831 | NaN | 2.920260 |
39 | SDF4 | ENSG00000078808 | 1 | 1216908 | 1232031 | p | 1p | p36.33 | 1.408638 | 0.223667 | NaN | 0.223667 |
43 | UBE2J2 | ENSG00000160087 | 1 | 1253909 | 1273885 | p | 1p | p36.33 | 0.624585 | 0.175076 | NaN | 0.175076 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
33348 | MPP1 | ENSG00000130830 | X | 154778684 | 154821007 | q | Xq | q28 | 1.614618 | 0.779464 | NaN | 0.779464 |
33350 | F8 | ENSG00000185010 | X | 154835788 | 155026940 | q | Xq | q28 | 0.554817 | 0.424299 | NaN | 0.424299 |
33353 | FUNDC2 | ENSG00000165775 | X | 155025980 | 155060303 | q | Xq | q28 | 0.860465 | 0.115477 | NaN | 0.115477 |
33355 | MTCP1 | ENSG00000214827 | X | 155061622 | 155147937 | q | Xq | q28 | 0.873754 | 0.152878 | NaN | 0.152878 |
33357 | VBP1 | ENSG00000155959 | X | 155197007 | 155239817 | q | Xq | q28 | 0.750831 | 0.159337 | NaN | 0.159337 |
6402 rows × 12 columns
[12]:
BAF_merge_Xdata.var
[12]:
 | chr | start | stop | arm | chr_arm | band | gene1_stop | bin_stop_arm | bin_stop_chr_arm | bin_stop_band | bin_idx | bin_idx_cum | GeneName_lst | GeneID_lst | bin_genes_cnt | ref_BAF_phased |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 29554 | 2530245 | p | 1p | p36.33 | 31109 | p | 1p | p36.32 | 0 | 0 | MIR1302-2HG,FAM138A,OR4F5,AL627309.1,AL627309.... | ENSG00000243485,ENSG00000237613,ENSG0000018609... | 818 | 0.507143 |
100 | 1 | 2530064 | 8434838 | p | 1p | p36.32 | 2547460 | p | 1p | p36.23 | 1 | 1 | AL139246.5,TNFRSF14-AS1,TNFRSF14,AL139246.3,FA... | ENSG00000272449,ENSG00000238164,ENSG0000015787... | 870 | 0.517442 |
200 | 1 | 8805860 | 13179464 | p | 1p | p36.23 | 8807051 | p | 1p | p36.21 | 2 | 2 | AL357552.2,ENO1,ENO1-AS1,CA6,SLC2A7,SLC2A5,GPR... | ENSG00000228423,ENSG00000074800,ENSG0000023067... | 809 | 0.520401 |
300 | 1 | 13196330 | 18486126 | p | 1p | p36.21 | 13201409 | p | 1p | p36.13 | 3 | 3 | PRAMEF13,PRAMEF18,PRAMEF5,PRAMEF8,PRAMEF33,PRA... | ENSG00000279169,ENSG00000279804,ENSG0000027060... | 860 | 0.534310 |
400 | 1 | 18631006 | 23838620 | p | 1p | p36.13 | 18748866 | p | 1p | p36.11 | 4 | 4 | PAX7,TAS1R2,AL080251.1,ALDH4A1,IFFO2,UBR4,AL03... | ENSG00000009709,ENSG00000179002,ENSG0000025527... | 756 | 0.553076 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
32996 | X | 109623700 | 128052398 | q | Xq | q23 | 109625172 | q | Xq | q25 | 7 | 342 | KCNE5,ACSL4,TMEM164,AMMECR1,RTL9,CHRDL1,PAK3,C... | ENSG00000176076,ENSG00000068366,ENSG0000015760... | 731 | 0.501661 |
33096 | X | 128323620 | 140772679 | q | Xq | q25 | 128600468 | q | Xq | q27.1 | 8 | 343 | AL442647.1,SMARCA1,OCRL,APLN,XPNPEP2,SASH3,ZDH... | ENSG00000225689,ENSG00000102038,ENSG0000012212... | 743 | 0.483389 |
33196 | X | 140782405 | 153689010 | q | Xq | q27.1 | 140784871 | q | Xq | q28 | 9 | 344 | CDR1,AL078639.1,AL451048.1,SPANXB1,AC234778.2,... | ENSG00000184258,ENSG00000281508,ENSG0000022926... | 774 | 0.495017 |
33296 | X | 153688099 | 156016837 | q | Xq | q28 | 153696593 | q | Xq | q28 | 10 | 345 | SLC6A8,BCAP31,ABCD1,U52111.1,PLXNB3,SRPK3,IDH3... | ENSG00000130821,ENSG00000185825,ENSG0000010198... | 513 | 0.503322 |
33372 | Y | 2786855 | 25733388 | p | Yp | p11.2 | 2787699 | q | Yq | q11.23 | 0 | 346 | SRY,RPS4Y1,AC006157.1,ZFY,ZFY-AS1,LINC00278,TG... | ENSG00000184895,ENSG00000129824,ENSG0000027884... | 718 | 0.500000 |
347 rows × 16 columns
[13]:
flag = ~(BAF_merge_Xdata.var["chr"] == "Y")
BAF_merge_Xdata = BAF_merge_Xdata[:, flag]
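The same pre-combination checks can be shown on plain arrays. The sketch below verifies that two cell lists are in the same order and drops chromosome Y bins with a boolean mask, as the cell above does on the AnnData object. All arrays are toy stand-ins and their names are hypothetical.

```python
import numpy as np

# Toy stand-ins for the cell names of the RDR and BAF results.
rdr_cells = np.array(["cell1", "cell2", "cell3"])
baf_cells = np.array(["cell1", "cell2", "cell3"])

# Toy stand-ins for the BAF bin annotation and posterior (cells x bins x states).
baf_chr = np.array(["1", "1", "X", "Y"])
baf_post = np.zeros((3, 4, 5))

# Both modules must describe the same cells in the same order.
same_order = np.array_equal(rdr_cells, baf_cells)

# Keep every bin that is not on chromosome Y.
keep = ~(baf_chr == "Y")
baf_post_kept = baf_post[:, keep]

print("same cell order:", same_order)
print("bins kept:", int(keep.sum()), "of", keep.size)
```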
Combine module
Similarly, by calling the config method xclone.XCloneConfig(), users can create an instance of the XCloneConfig class with module set to "Combine" to set data-specific configurations for the Combine module.
Set the output directory by specifying outdir.
Users can choose whether to plot the main figures in each module by setting xclone_plot to True or False. If True, plot_cell_anno_key selects the cell annotation used for plotting cells by clusters. If plot_cell_anno_key is not specified, the default cell_anno_key = "cell_type" will be used.
For the plotting functions in the Combine module, users can set the boolean variables merge_loss and merge_loh to choose whether to merge the allele-specific copy loss and LOH states. For more plotting settings in XClone, refer to the plotting page. Here, for the TNBC1 dataset, we merge both the allele-specific copy loss and LOH states.
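Conceptually, merging allele-specific states amounts to summing their probabilities. The sketch below shows this on a toy probability array. The state names, their order and the merge_states helper are hypothetical and do not describe the layout of XClone's own layers.

```python
import numpy as np

# Hypothetical state layout along the last axis (not XClone's actual layout).
states = ["loss_A", "loss_B", "loh_A", "loh_B", "neutral", "gain"]


def merge_states(prob, states, groups):
    """Sum the probabilities of states that share a group.

    groups maps a merged name to the list of original state names it absorbs.
    """
    merged_names, merged_cols, used = [], [], set()
    for name in states:
        if name in used:
            continue
        group = next((g for g, members in groups.items() if name in members), None)
        if group is None:
            # State is not merged; copy it through unchanged.
            merged_names.append(name)
            merged_cols.append(prob[..., states.index(name)])
        else:
            idx = [states.index(m) for m in groups[group]]
            merged_names.append(group)
            merged_cols.append(prob[..., idx].sum(axis=-1))
            used.update(groups[group])
    return np.stack(merged_cols, axis=-1), merged_names


merge_loss, merge_loh = True, True
groups = {}
if merge_loss:
    groups["loss"] = ["loss_A", "loss_B"]
if merge_loh:
    groups["loh"] = ["loh_A", "loh_B"]

# One cell, one gene, six state probabilities (invented values).
prob = np.array([[[0.3, 0.2, 0.1, 0.1, 0.2, 0.1]]])
merged, names = merge_states(prob, states, groups)
print(names)
print(merged[0, 0])
```

Setting either flag to False leaves the corresponding allele-specific states as separate columns.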
Users can set BAF_denoise = True to combine the denoised results from the BAF module with the RDR module.
Before running the Combine module, users can call xconfig.display() to print all configurations used in the Combine module for previewing and logging.
Users can run the analysis on their input data by calling the run_combine method with RDR_Xdata, BAF_merge_Xdata and config_file. If no customised config file is given, XClone loads the default config file.
The example below runs the Combine module on TNBC1.
[14]:
xconfig = xclone.XCloneConfig(dataset_name = dataset_name, module = "Combine")
xconfig.set_figure_params(xclone= True, fontsize = 18)
xconfig.outdir = outdir
xconfig.cell_anno_key = "cluster.pred"
xconfig.ref_celltype = "N"
xconfig.xclone_plot= True
xconfig.plot_cell_anno_key = "cluster"
xconfig.BAF_denoise = True
xconfig.display()
xclone.model.run_combine(RDR_Xdata,
                         BAF_merge_Xdata,
                         verbose = True,
                         run_verbose = True,
                         config_file = xconfig)
Trying to set attribute `.var` of view, copying.
Combine
Configurations:
BAF_denoise True
KNN_neighbors 10
_file_format_data h5ad
_file_format_figs pdf
_frameon True
_outdir /storage/yhhuang/users/rthuang/xclone/tutorials
_plot_suffix
_start 1680521352.1022782
_vector_friendly True
cell_anno_key cluster.pred
copygain_correct False
copygain_correct_mode None
copyloss_correct True
copyloss_correct_mode 1
dataset_name TNBC1_scRNA
exclude_XY False
file_format_data h5ad
file_format_figs pdf
merge_loh True
merge_loss True
module Combine
outdir /storage/yhhuang/users/rthuang/xclone/tutorials
plot_cell_anno_key cluster
plot_suffix
ref_celltype N
remove_guide_XY False
set_figtitle True
set_smartseq False
warninig_ignore True
xclone_plot True
[XClone Combination module running]************************
[XClone] BAF extend bins to genes.
[XClone data checking]: RDR and BAF in same cell order.
No genes in this bin: 3103 3152 , skip this bin.
No genes in this bin: 21554 21654 , skip this bin.
No genes in this bin: 22754 22839 , skip this bin.
No genes in this bin: 30061 30063 , skip this bin.
[XClone hint] combine_corrected_file saved in /storage/yhhuang/users/rthuang/xclone/tutorials/data/.
[XClone hint] combine_final_file saved in /storage/yhhuang/users/rthuang/xclone/tutorials/data/.
[XClone plotting]
[14]:
AnnData object with n_obs × n_vars = 1097 × 6402
obs: 'copykat.pred', 'cluster.pred', 'cluster', 'mit_clone_id', 'confident', 'tumor', 'copykat', 'counts_ratio'
var: 'GeneName', 'GeneID', 'chr', 'start', 'stop', 'arm', 'chr_arm', 'band', 'ref_avg', 'dispersion', 'gene_dispersion_bse', 'dispersion_capped', 'gene_index'
uns: 'data_mode', 'data_notes', 'genome_mode', 'log', 'rank_marker_genes', 'fit_dispersion_removed_genes', 'dispersion_base_celltype', 'pca', 'neighbors', 'chr_dict', 'guide_CNV_chrs_use_layers', 'guide_CNV_chrs_use_anno_key', 'ref_log_expression_brk', 'group_genes', 'CNV_ratio', 'Logliklihood'
obsm: 'X_pca'
varm: 'PCs'
layers: 'raw_expr', 'raw_ratio', 'ref_normalized', 'expected', 'WMA_smoothed', 'RDR_smooth', 'emm_prob_log', 'emm_prob_log_noHMM', 'emm_prob_noHMM', 'posterior_mtx', 'posterior_mtx_log', 'BAF_extend_post_prob', 'combine_base_prob', 'corrected_prob', 'prob1_merge', 'plot_prob_merge1', 'plot_prob_merge2', 'plot_prob_merge3', 'plot_prob_merge4'
obsp: 'distances', 'connectivities'
The final AnnData is stored in outdir/data/combined_final.h5ad, and users can get the CNV identification from the corrected_prob and prob1_merge layers.