XClone preprocessing
XClone takes three cell-by-gene matrices as input: the allele-specific AD and DP matrices and the total read depth matrix. The XClone preprocessing pipeline generates these three matrices from SAM/BAM/CRAM files. We recommend xcltk as the preprocessing pipeline. Before that, you need to prepare the data.
Step1: Install xcltk and run the xcltk RDR module and BAF module independently.
Step2: Load the xcltk RDR module output.
Step3: Load the xcltk BAF module output.
Step4: Run the XClone analysis; see the Getting started page.
Tool installation
Preprocessing via xcltk
xcltk is a toolkit for XClone count generation. We recommend using xcltk to generate the read depth count matrix and the allelic count matrices. xcltk is available through PyPI. To install it, run the following command (the -U flag upgrades an existing installation):
pip install -U xcltk
Alternatively, you can install the latest (often development) version from the GitHub repository with the following command:
pip install -U git+https://github.com/hxj5/xcltk
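After installing, it can be useful to confirm which version is active in the current Python environment. The following is a minimal sketch using only the standard library. The helper name `installed_version` is illustrative and not part of xcltk or XClone, and the distribution name `xclone` is an assumption based on the import name used on this page.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the status of the two packages used on this page.
# "xclone" as a distribution name is an assumption (it matches the import name).
for name in ("xcltk", "xclone"):
    found = installed_version(name)
    print(f"{name}: {found if found is not None else 'not installed'}")
```

If either line reports "not installed", rerun the pip command above in the same environment that runs your analysis.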
Required data
XClone RDR module
For the RDR module, we need 3 files to create AnnData format data with the layer raw_expr, cell annotation in obs, and feature annotation in var.
RDR matrix
cell annotation
feature annotation
load RDR demo data
import xclone
RDR_adata = xclone.data.tnbc1_rdr()
## preview data details
RDR_adata
RDR_adata.obs
RDR_adata.var
How to build anndata from the files
Specify the paths of the files and use xclone.pp.xclonedata to get the AnnData format. The mtx_barcodes_file must at least include the cell barcode IDs. If more cell annotations are stored in another file, use xclone.pp.extra_anno to import them.
import xclone
data_dir = "xxx/xxx/xxx/"
RDR_file = data_dir + "xcltk.rdr.mtx"
mtx_barcodes_file = data_dir + "barcodes.tsv" # cell barcodes
regions_anno_file = data_dir + "features.tsv" # feature annotation
anno_file = data_dir + "cell_anno.csv" # cell annotation table (placeholder file name)
# build the AnnData object; the count matrix is stored in the raw_expr layer
RDR_adata = xclone.pp.xclonedata(RDR_file, 'RDR',
                                 mtx_barcodes_file,
                                 regions_anno_file,
                                 genome_mode = "hg38_genes",
                                 data_notes = None)
# add extra cell annotation columns, matched on the "cell" barcode column
RDR_adata = xclone.pp.extra_anno(RDR_adata, anno_file, barcodes_key = "cell",
                                 cell_anno_key = ["Clone_ID", "Type", "cell_type"], sep = ",")
# default sep = ",", also supports "\t"
If you use the xcltk tool to prepare the input matrix, it is easier to use the default feature annotation by specifying genome_mode (options include "hg19_genes", "hg38_genes", and the default 5M-length block annotation).
RDR_adata = xclone.pp.xclonedata(RDR_file, 'RDR', mtx_barcodes_file, genome_mode = "hg19_genes")
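Before calling xclone.pp.xclonedata, it can help to confirm that the matrix dimensions agree with the barcode and feature files. The sketch below uses only the standard library and makes several assumptions: the matrix is in Matrix Market coordinate format, and the barcode and feature files have one entry per line with no header. The helper names are hypothetical and not part of XClone or xcltk. Because this page does not state whether the matrix is stored as features x cells or cells x features, the check accepts either orientation. The demo files are synthetic.

```python
import os
import tempfile


def read_mtx_shape(path):
    """Return (n_rows, n_cols, n_entries) from a Matrix Market coordinate file."""
    with open(path) as handle:
        for line in handle:
            # Skip the banner and comment lines (they start with '%') and blanks.
            if line.startswith("%") or not line.strip():
                continue
            n_rows, n_cols, n_entries = (int(x) for x in line.split()[:3])
            return n_rows, n_cols, n_entries
    raise ValueError(f"no size line found in {path}")


def count_lines(path):
    """Count non-empty lines in a text file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.strip())


def shapes_agree(mtx_file, barcodes_file, features_file):
    """True if the matrix is features x cells or cells x features."""
    n_rows, n_cols, _ = read_mtx_shape(mtx_file)
    n_cells = count_lines(barcodes_file)
    n_features = count_lines(features_file)
    return (n_rows, n_cols) in {(n_features, n_cells), (n_cells, n_features)}


# Tiny synthetic example: 3 features, 2 cells, 2 non-zero entries.
demo_dir = tempfile.mkdtemp()
paths = {name: os.path.join(demo_dir, name)
         for name in ("demo.mtx", "barcodes.tsv", "features.tsv")}
with open(paths["demo.mtx"], "w") as f:
    f.write("%%MatrixMarket matrix coordinate integer general\n"
            "3 2 2\n1 1 5\n3 2 7\n")
with open(paths["barcodes.tsv"], "w") as f:
    f.write("AAACCTGCACCTTGTC-1\nAAACGGGAGTCCTCCT-1\n")
with open(paths["features.tsv"], "w") as f:
    f.write("1\t29554\t31109\tMIR1302-2HG\n"
            "1\t34554\t36081\tFAM138A\n"
            "1\t65419\t71585\tOR4F5\n")

print(shapes_agree(paths["demo.mtx"], paths["barcodes.tsv"], paths["features.tsv"]))
```

A mismatch usually means the barcode or feature file comes from a different run than the matrix, or that one of the files carries a header line.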
XClone BAF module
For the BAF module, we need 4 files to create AnnData format data with the layers AD and DP, cell annotation in obs, and feature annotation in var.
AD matrix
DP matrix
cell annotation
feature annotation
load BAF demo data
import xclone
BAF_adata = xclone.data.tnbc1_baf()
## preview data details
BAF_adata
BAF_adata.obs
BAF_adata.var
How to build anndata from the files
Specify the paths of the files and use xclone.pp.xclonedata to get the AnnData format, similar to the RDR module. Here the AD_file and DP_file are sparse matrices imported as the AD and DP layers.
import xclone
data_dir = "xxx/xxx/xxx/"
AD_file = data_dir + "AD.mtx"
DP_file = data_dir + "DP.mtx"
mtx_barcodes_file = data_dir + "barcodes.tsv" # cell barcodes
anno_file = data_dir + "cell_anno.csv" # cell annotation table (placeholder file name)
# use default gene annotation
BAF_adata = xclone.pp.xclonedata([AD_file, DP_file], 'BAF',
                                 mtx_barcodes_file,
                                 genome_mode = "hg19_genes")
# add extra cell annotation columns, matched on the "cell" barcode column
BAF_adata = xclone.pp.extra_anno(BAF_adata, anno_file, barcodes_key = "cell",
                                 cell_anno_key = ["Clone_ID", "Type", "cell_type"], sep = ",")
Preparing data
Detailed instructions on how to prepare the data for generating AnnData for the RDR module and the BAF module. Both parts need annotation data for cells and genome features. We recommend preparing the annotation data as follows.
Annotation data
Feature annotation
Feature annotation includes at least chr, start, stop, and arm information, in chr1-22,X,Y order for intuitive visualization and analysis. Here are two feature annotation examples in XClone that you can load as your annotation file. If you use the xcltk pipeline, default annotations are provided.
import xclone
hg38_genes = xclone.pp.load_anno(genome_mode = "hg38_genes")
hg38_blocks = xclone.pp.load_anno(genome_mode = "hg38_blocks")
| GeneName | GeneID | chr | start | stop | arm | chr_arm | band |
|---|---|---|---|---|---|---|---|
| MIR1302-2HG | ENSG00000243485 | 1 | 29554 | 31109 | p | 1p | p36.33 |
| FAM138A | ENSG00000237613 | 1 | 34554 | 36081 | p | 1p | p36.33 |
| OR4F5 | ENSG00000186092 | 1 | 65419 | 71585 | p | 1p | p36.33 |
| AL627309.1 | ENSG00000238009 | 1 | 89295 | 133723 | p | 1p | p36.33 |
| AL627309.3 | ENSG00000239945 | 1 | 89551 | 91105 | p | 1p | p36.33 |
| AL627309.2 | ENSG00000239906 | 1 | 139790 | 140339 | p | 1p | p36.33 |
| AL627309.4 | ENSG00000241599 | 1 | 160446 | 161525 | p | 1p | p36.33 |
| AL732372.1 | ENSG00000236601 | 1 | 358857 | 366052 | p | 1p | p36.33 |
| OR4F29 | ENSG00000284733 | 1 | 450703 | 451697 | p | 1p | p36.33 |
| AC114498.1 | ENSG00000235146 | 1 | 587629 | 594768 | p | 1p | p36.33 |

| chr | start | stop | arm |
|---|---|---|---|
| 1 | 1 | 50000 | p |
| 1 | 50001 | 100000 | p |
| 1 | 100001 | 150000 | p |
| 1 | 150001 | 200000 | p |
| 1 | 200001 | 250000 | p |
| 1 | 250001 | 300000 | p |
| 1 | 300001 | 350000 | p |
| 1 | 350001 | 400000 | p |
| 1 | 400001 | 450000 | p |
| 1 | 450001 | 500000 | p |
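If you assemble a custom feature annotation, the rows may need sorting into chr1-22,X,Y order, because plain string sorting places "10" before "2". The following is a minimal sketch using only the standard library. `chrom_sort_key` is an illustrative helper, not an XClone function, and the rows are (chr, start, stop, arm) tuples that mirror the block annotation layout.

```python
def chrom_sort_key(chrom):
    """Map a chromosome label to a sortable key: 1-22 first, then X, then Y."""
    label = str(chrom)
    if label.lower().startswith("chr"):
        label = label[3:]
    if label.isdigit():
        return (0, int(label))
    order = {"X": 23, "Y": 24}
    # Any other contig (for example MT) sorts after Y, alphabetically.
    return (0, order[label]) if label in order else (1, label)


# Unsorted example rows: (chr, start, stop, arm)
rows = [
    ("X", 1, 50000, "p"),
    ("10", 1, 50000, "p"),
    ("2", 50001, 100000, "p"),
    ("2", 1, 50000, "p"),
    ("Y", 1, 50000, "p"),
    ("1", 1, 50000, "p"),
]

# Sort by chromosome order first, then by start position.
rows_sorted = sorted(rows, key=lambda r: (chrom_sort_key(r[0]), r[1]))
print([r[0] for r in rows_sorted])  # ['1', '2', '2', '10', 'X', 'Y']
```

The key also accepts labels with a "chr" prefix, so "chr2" and "2" sort to the same position.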
Cell annotation
cell barcodes
The barcodes file (barcodes_file) contains the cell barcodes, one per line, without any header.
AAACCTGCACCTTGTC-1
AAACGGGAGTCCTCCT-1
AAACGGGTCCAGAGGA-1
AAAGATGCAGTTTACG-1
AAAGCAACAGGAATGC-1
AAAGCAATCGGAATCT-1
AAAGTAGAGTGTACTC-1
AAAGTAGCAGCCTATA-1
AAAGTAGGTACAGTTC-1
AAAGTAGTCGCATGAT-1
AAAGTAGTCTATCCCG-1
cell annotation
Cell annotation (anno_file) includes at least cell and cell_type information (Tumor or Normal, T/N), where cell is the key of the cell barcodes.
| cell | Sample | GenesExpressed | Type | Cellcycle | Clone_ID | cell_type |
|---|---|---|---|---|---|---|
| BT_869-P01-A02 | BCH869 | 4669 | Malignant | -0.508018255 | 1.0 | T |
| BT_869-P01-A03 | BCH869 | 5610 | Malignant | 0.773063553 | 2.0 | T |
| BT_869-P01-A04 | BCH869 | 4291 | Malignant | -0.627932323 | 2.0 | T |
| BT_869-P01-A05 | BCH869 | 7037 | Malignant | 1.879331871 | 2.0 | T |
| BT_869-P01-A06 | BCH869 | 6305 | Malignant | -0.433514129 | 2.0 | T |
| BT_869-P01-A07 | BCH869 | 7034 | Malignant | -0.664458445 | 2.0 | T |
| BT_869-P01-A08 | BCH869 | 8289 | Malignant | -1.106595637 | 2.0 | T |
| BT_869-P01-A10 | BCH869 | 4572 | Malignant | -0.768966081 | 1.0 | T |
| BT_869-P01-A11 | BCH869 | 6465 | Malignant | -1.03425176 | 2.0 | T |
| BT_869-P01-B01 | BCH869 | 4708 | Malignant | -0.326930934 | 1.0 | T |
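Because the annotation is matched to the matrix through the cell column, it can be worth checking that every barcode has an annotation row and that the required columns exist before importing. The sketch below uses only the standard library. The helper names are hypothetical and not part of XClone, and the barcode BT_869-P01-A09 is made up to illustrate a missing row.

```python
import csv
import io


def load_barcodes(handle):
    """Read one barcode per line; the file has no header."""
    return [line.strip() for line in handle if line.strip()]


def unannotated_barcodes(barcodes, anno_handle, barcodes_key="cell",
                         required_cols=("cell_type",), sep=","):
    """Return the barcodes that have no row in the annotation table.

    Raises ValueError if the key column or a required column is missing.
    """
    reader = csv.DictReader(anno_handle, delimiter=sep)
    columns = reader.fieldnames or []
    missing_cols = [c for c in (barcodes_key, *required_cols) if c not in columns]
    if missing_cols:
        raise ValueError(f"annotation file lacks columns: {missing_cols}")
    annotated = {row[barcodes_key] for row in reader}
    return [b for b in barcodes if b not in annotated]


# In-memory stand-ins for the barcodes file and the annotation file.
barcodes = load_barcodes(io.StringIO(
    "BT_869-P01-A02\nBT_869-P01-A03\nBT_869-P01-A09\n"))
anno_csv = io.StringIO(
    "cell,Type,Clone_ID,cell_type\n"
    "BT_869-P01-A02,Malignant,1.0,T\n"
    "BT_869-P01-A03,Malignant,2.0,T\n"
)

print(unannotated_barcodes(barcodes, anno_csv))  # ['BT_869-P01-A09']
```

With real files, pass open file handles instead of the `io.StringIO` objects, and set `sep="\t"` for tab-separated annotation.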
Prepare the allele-specific data (BAF) and expression data (RDR)
XClone takes two cell-by-feature (genes/blocks) integer allelic count matrices, AD and DP, as BAF input, and one cell-by-feature (genes/blocks) integer UMI/read count matrix as RDR input.
For BAF, we recommend using the xcltk tool to get the two allelic AD and DP matrices. For RDR, you may use xcltk, 10x CellRanger, or any other expression quantification tool to get the RDR UMI/read count matrix.
See xcltk_preprocess for details of how to prepare BAF and RDR data.
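To make the relationship between the two allelic matrices concrete, the sketch below computes a raw per-cell, per-feature B-allele frequency as AD / DP. It assumes the usual convention that AD counts reads supporting one allele and DP counts all allele-informative reads at the same feature, so 0 <= AD <= DP. This is an illustration of the input data only, not XClone's BAF model, and `baf_matrix` is a hypothetical helper.

```python
def baf_matrix(ad, dp):
    """Element-wise AD / DP; entries with DP == 0 are returned as None."""
    if len(ad) != len(dp) or any(len(a) != len(d) for a, d in zip(ad, dp)):
        raise ValueError("AD and DP must have the same shape")
    result = []
    for ad_row, dp_row in zip(ad, dp):
        row = []
        for a, d in zip(ad_row, dp_row):
            if a < 0 or a > d:
                raise ValueError("expected 0 <= AD <= DP for every entry")
            # No coverage means the allele frequency is undefined.
            row.append(a / d if d > 0 else None)
        result.append(row)
    return result


# Two cells (rows) by three features (columns), synthetic integer counts.
ad = [[2, 0, 5],
      [1, 0, 0]]
dp = [[4, 0, 5],
      [4, 3, 0]]

print(baf_matrix(ad, dp))  # [[0.5, None, 1.0], [0.25, 0.0, None]]
```

Entries with zero depth are common in single-cell data, which is one reason to keep AD and DP as separate count matrices rather than storing only their ratio.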