XClone preprocessing
XClone takes three cell-by-gene matrices as input: the allele-specific AD and DP matrices and the total read depth matrix. The XClone preprocessing pipeline generates these three matrices from SAM/BAM/CRAM files. We recommend xcltk as the preprocessing pipeline. Before running it, you need to prepare the data.
Step1: Install xcltk and run the xcltk RDR module and BAF module independently.
Step2: Load the xcltk RDR module output.
Step3: Load the xcltk BAF module output.
Step4: XClone analysis, see page Getting started.
Tool installation
Preprocessing via xcltk
xcltk is a toolkit for XClone count generation. We recommend using xcltk to generate the read depth count matrix and the allelic count matrices. xcltk is available through PyPI. To install it, run the following command (the -U flag upgrades an existing installation):
pip install -U xcltk
Alternatively, you can install the latest (often development) version from the GitHub repository with the following command:
pip install -U git+https://github.com/hxj5/xcltk
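After installing, it can be useful to confirm which version of xcltk the current Python environment actually sees. The following is a minimal sketch using only the standard library; the helper name installed_version is illustrative and not part of xcltk or XClone.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report whether xcltk is visible in this environment.
xcltk_version = installed_version("xcltk")
print("xcltk:", xcltk_version if xcltk_version is not None else "not installed")
```

If this prints "not installed", the pip command was most likely run against a different Python environment.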
Required data
XClone RDR module
For the RDR module, we need 3 files to create AnnData format data with the layer raw_expr,
cell annotation in obs and feature annotation in var.
RDR matrix
cell annotation
feature annotation
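Before loading, a quick sanity check can confirm that the three files agree in size. The sketch below reads only the size line of a Matrix Market file and compares it with the number of barcodes and features. It assumes both the barcodes file and the features file have one record per line and no header, and it does not assume a particular matrix orientation (features x cells or cells x features). The helper names are illustrative, not XClone API.

```python
import os
import tempfile


def read_mtx_shape(path):
    """Return (n_rows, n_cols, n_entries) from a Matrix Market coordinate file."""
    with open(path) as handle:
        for line in handle:
            if line.startswith("%") or not line.strip():
                continue  # skip the banner, comments and blank lines
            n_rows, n_cols, n_entries = (int(x) for x in line.split()[:3])
            return n_rows, n_cols, n_entries
    raise ValueError("no size line found in " + path)


def count_nonempty_lines(path):
    """Count records in a headerless, one-record-per-line text file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.strip())


def check_rdr_inputs(mtx_file, barcodes_file, features_file):
    """True if the matrix dimensions match the cell and feature counts."""
    n_rows, n_cols, _ = read_mtx_shape(mtx_file)
    n_cells = count_nonempty_lines(barcodes_file)
    n_features = count_nonempty_lines(features_file)
    return sorted((n_rows, n_cols)) == sorted((n_cells, n_features))


# Toy demo: 3 features x 2 cells, written to a temporary directory.
contents = {
    "matrix.mtx": "%%MatrixMarket matrix coordinate integer general\n"
                  "3 2 3\n1 1 5\n2 1 1\n3 2 7\n",
    "barcodes.tsv": "AAACCTGCACCTTGTC-1\nAAACGGGAGTCCTCCT-1\n",
    "features.tsv": "1\t1\t50000\n1\t50001\t100000\n1\t100001\t150000\n",
}
with tempfile.TemporaryDirectory() as tmp:
    paths = {}
    for name, text in contents.items():
        paths[name] = os.path.join(tmp, name)
        with open(paths[name], "w") as handle:
            handle.write(text)
    shape = read_mtx_shape(paths["matrix.mtx"])
    consistent = check_rdr_inputs(
        paths["matrix.mtx"], paths["barcodes.tsv"], paths["features.tsv"]
    )

print(shape, consistent)
```

A mismatch usually means the barcodes or features file belongs to a different run than the matrix.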
load RDR demo data
import xclone
RDR_adata = xclone.data.tnbc1_rdr()
## preview data details
RDR_adata
RDR_adata.obs
RDR_adata.var
How to build anndata from the files
Specify the paths of the files and use xclone.pp.xclonedata to build the AnnData object.
mtx_barcodes_file must at least include the cell barcode IDs. If more cell annotations
are stored in another file, use xclone.pp.extra_anno to import them.
import xclone
data_dir = "xxx/xxx/xxx/"
RDR_file = data_dir + "xcltk.rdr.mtx"
mtx_barcodes_file = data_dir + "barcodes.tsv" # cell barcodes
regions_anno_file = data_dir + "features.tsv" # feature annotation
# positional arguments must come before keyword arguments
RDR_adata = xclone.pp.xclonedata(RDR_file, 'RDR',
                                 mtx_barcodes_file,
                                 regions_anno_file,
                                 genome_mode = "hg38_genes",
                                 data_notes = None)
# anno_file: path to the file with extra cell annotations (see "Cell annotation")
RDR_adata = xclone.pp.extra_anno(RDR_adata, anno_file, barcodes_key = "cell",
                                 cell_anno_key = ["Clone_ID", "Type", "cell_type"], sep = ",")
# default sep = ",", also supports "\t"
If you use the xcltk tool to prepare the input matrix, it is easier to use the default
feature annotation by specifying genome_mode (options include "hg19_genes",
"hg38_genes", and the default 5M-length block annotation).
RDR_adata = xclone.pp.xclonedata(RDR_file, 'RDR', mtx_barcodes_file, genome_mode = "hg19_genes")
XClone BAF module
For the BAF module, we need 4 files to create AnnData format data with the layers AD and DP,
cell annotation in obs and feature annotation in var.
AD matrix
DP matrix
cell annotation
feature annotation
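Because DP is the total allelic depth and AD counts only one allele, the two matrices should have the same shape and AD should never exceed DP at any position. The following sketch checks this on Matrix Market coordinate text using only the standard library. It assumes integer counts and is not part of XClone; the function names are illustrative.

```python
def read_mtx_entries(lines):
    """Parse Matrix Market coordinate text into (shape, {(row, col): value})."""
    shape, entries = None, {}
    for line in lines:
        if line.startswith("%") or not line.strip():
            continue  # skip the banner, comments and blank lines
        fields = line.split()
        if shape is None:
            shape = (int(fields[0]), int(fields[1]))  # size line comes first
        else:
            entries[(int(fields[0]), int(fields[1]))] = int(fields[2])
    return shape, entries


def check_ad_dp(ad_lines, dp_lines):
    """Return a list of problems; an empty list means AD and DP are consistent."""
    ad_shape, ad = read_mtx_entries(ad_lines)
    dp_shape, dp = read_mtx_entries(dp_lines)
    if ad_shape != dp_shape:
        return ["shape mismatch: AD %s vs DP %s" % (ad_shape, dp_shape)]
    problems = []
    for key, ad_count in sorted(ad.items()):
        dp_count = dp.get(key, 0)  # a missing sparse entry means zero depth
        if ad_count > dp_count:
            problems.append("AD %d > DP %d at %s" % (ad_count, dp_count, key))
    return problems


header = "%%MatrixMarket matrix coordinate integer general"
dp_text = [header, "3 2 3", "1 1 5", "2 1 4", "3 2 7"]
good_ad = [header, "3 2 2", "1 1 2", "3 2 7"]
bad_ad = [header, "3 2 2", "1 1 6", "3 2 7"]

good_problems = check_ad_dp(good_ad, dp_text)
bad_problems = check_ad_dp(bad_ad, dp_text)
print(good_problems)
print(bad_problems)
```

To run the check on real files, pass open file handles instead of lists, for example check_ad_dp(open(AD_file), open(DP_file)).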
load BAF demo data
import xclone
BAF_adata = xclone.data.tnbc1_baf()
## preview data details
BAF_adata
BAF_adata.obs
BAF_adata.var
How to build anndata from the files
Specify the paths of the files and use xclone.pp.xclonedata to build the AnnData object, as for
the RDR module. Here AD_file and DP_file are sparse matrices imported as the AD and DP layers.
import xclone
data_dir = "xxx/xxx/xxx/"
AD_file = data_dir + "AD.mtx"
DP_file = data_dir + "DP.mtx"
mtx_barcodes_file = data_dir + "barcodes.tsv" # cell barcodes
# use default gene annotation
BAF_adata = xclone.pp.xclonedata([AD_file, DP_file], 'BAF',
                                 mtx_barcodes_file,
                                 genome_mode = "hg19_genes")
# anno_file: path to the file with extra cell annotations (see "Cell annotation")
BAF_adata = xclone.pp.extra_anno(BAF_adata, anno_file, barcodes_key = "cell",
                                 cell_anno_key = ["Clone_ID", "Type", "cell_type"], sep = ",")
Preparing data
Detailed instructions on how to prepare the data for generating AnnData for the RDR module and BAF module. Both parts need annotation data for cells and genome features. We recommend preparing the annotation data as follows.
Annotation data
Feature annotation
Feature annotation must at least include chr, start, stop, and arm information, sorted
in chr1-22,X,Y order for intuitive visualization and analysis. Here are two feature annotation
examples shipped with XClone that you can load as your annotation file. If you use the xcltk
pipeline, default annotations are provided.
import xclone
hg38_genes = xclone.pp.load_anno(genome_mode = "hg38_genes")
hg38_blocks = xclone.pp.load_anno(genome_mode = "hg38_blocks")
hg38_genes (first rows):

| GeneName | GeneID | chr | start | stop | arm | chr_arm | band |
|---|---|---|---|---|---|---|---|
| MIR1302-2HG | ENSG00000243485 | 1 | 29554 | 31109 | p | 1p | p36.33 |
| FAM138A | ENSG00000237613 | 1 | 34554 | 36081 | p | 1p | p36.33 |
| OR4F5 | ENSG00000186092 | 1 | 65419 | 71585 | p | 1p | p36.33 |
| AL627309.1 | ENSG00000238009 | 1 | 89295 | 133723 | p | 1p | p36.33 |
| AL627309.3 | ENSG00000239945 | 1 | 89551 | 91105 | p | 1p | p36.33 |
| AL627309.2 | ENSG00000239906 | 1 | 139790 | 140339 | p | 1p | p36.33 |
| AL627309.4 | ENSG00000241599 | 1 | 160446 | 161525 | p | 1p | p36.33 |
| AL732372.1 | ENSG00000236601 | 1 | 358857 | 366052 | p | 1p | p36.33 |
| OR4F29 | ENSG00000284733 | 1 | 450703 | 451697 | p | 1p | p36.33 |
| AC114498.1 | ENSG00000235146 | 1 | 587629 | 594768 | p | 1p | p36.33 |

hg38_blocks (first rows):

| chr | start | stop | arm |
|---|---|---|---|
| 1 | 1 | 50000 | p |
| 1 | 50001 | 100000 | p |
| 1 | 100001 | 150000 | p |
| 1 | 150001 | 200000 | p |
| 1 | 200001 | 250000 | p |
| 1 | 250001 | 300000 | p |
| 1 | 300001 | 350000 | p |
| 1 | 350001 | 400000 | p |
| 1 | 400001 | 450000 | p |
| 1 | 450001 | 500000 | p |
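
If you build your own feature annotation, the required columns and the chr1-22,X,Y order can be checked and enforced before loading. The sketch below works on a plain list of dictionaries and is not XClone API; the helper names are illustrative, and placing chromosomes outside 1-22, X, Y at the end is an assumption of this example.

```python
CHROM_ORDER = [str(i) for i in range(1, 23)] + ["X", "Y"]
REQUIRED_COLUMNS = ("chr", "start", "stop", "arm")


def normalise_chrom(chrom):
    """Strip an optional 'chr' prefix so 'chr1' and '1' compare equal."""
    chrom = str(chrom)
    return chrom[3:] if chrom.lower().startswith("chr") else chrom


def sort_features(features):
    """Validate required columns, then sort by chromosome order and start."""
    for feature in features:
        missing = [col for col in REQUIRED_COLUMNS if col not in feature]
        if missing:
            raise ValueError("feature %r lacks columns %s" % (feature, missing))
    rank = {name: i for i, name in enumerate(CHROM_ORDER)}
    unknown_rank = len(CHROM_ORDER)  # unknown contigs are placed last
    return sorted(
        features,
        key=lambda f: (rank.get(normalise_chrom(f["chr"]), unknown_rank),
                       int(f["start"])),
    )


features = [
    {"chr": "X", "start": 1, "stop": 50000, "arm": "p"},
    {"chr": "10", "start": 1, "stop": 50000, "arm": "p"},
    {"chr": "2", "start": 1, "stop": 50000, "arm": "p"},
    {"chr": "1", "start": 50001, "stop": 100000, "arm": "p"},
    {"chr": "1", "start": 1, "stop": 50000, "arm": "p"},
]
ordered = sort_features(features)
print([(f["chr"], f["start"]) for f in ordered])
```

A plain string sort would place chromosome 10 before chromosome 2, which is why an explicit rank table is used here.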
Cell annotation
cell barcodes
barcodes_file contains the cell barcodes, one per line, without any header.
AAACCTGCACCTTGTC-1
AAACGGGAGTCCTCCT-1
AAACGGGTCCAGAGGA-1
AAAGATGCAGTTTACG-1
AAAGCAACAGGAATGC-1
AAAGCAATCGGAATCT-1
AAAGTAGAGTGTACTC-1
AAAGTAGCAGCCTATA-1
AAAGTAGGTACAGTTC-1
AAAGTAGTCGCATGAT-1
AAAGTAGTCTATCCCG-1
cell annotation
The cell annotation file (anno_file) must at least include the columns cell and cell_type
(Tumor or Normal, T/N), where cell is the key of cell barcodes.
| cell | Sample | GenesExpressed | Type | Cellcycle | Clone_ID | cell_type |
|---|---|---|---|---|---|---|
| BT_869-P01-A02 | BCH869 | 4669 | Malignant | -0.508018255 | 1.0 | T |
| BT_869-P01-A03 | BCH869 | 5610 | Malignant | 0.773063553 | 2.0 | T |
| BT_869-P01-A04 | BCH869 | 4291 | Malignant | -0.627932323 | 2.0 | T |
| BT_869-P01-A05 | BCH869 | 7037 | Malignant | 1.879331871 | 2.0 | T |
| BT_869-P01-A06 | BCH869 | 6305 | Malignant | -0.433514129 | 2.0 | T |
| BT_869-P01-A07 | BCH869 | 7034 | Malignant | -0.664458445 | 2.0 | T |
| BT_869-P01-A08 | BCH869 | 8289 | Malignant | -1.106595637 | 2.0 | T |
| BT_869-P01-A10 | BCH869 | 4572 | Malignant | -0.768966081 | 1.0 | T |
| BT_869-P01-A11 | BCH869 | 6465 | Malignant | -1.03425176 | 2.0 | T |
| BT_869-P01-B01 | BCH869 | 4708 | Malignant | -0.326930934 | 1.0 | T |
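
Before passing anno_file to xclone.pp.extra_anno, it may help to confirm that the required columns exist and that every barcode in the matrix has an annotation row. The following is a minimal standard-library sketch, not XClone API; the function name is illustrative and the barcode BT_869-P01-Z99 is a made-up value used only to show a missing annotation.

```python
import csv
import io


def find_unannotated(anno_text, barcodes, barcodes_key="cell",
                     required=("cell_type",), sep=","):
    """Return barcodes that have no row in the annotation table."""
    reader = csv.DictReader(io.StringIO(anno_text), delimiter=sep)
    columns = reader.fieldnames or []
    missing_columns = [c for c in (barcodes_key,) + tuple(required)
                       if c not in columns]
    if missing_columns:
        raise ValueError("annotation lacks columns: %s" % missing_columns)
    annotated = {row[barcodes_key] for row in reader}
    return [b for b in barcodes if b not in annotated]


anno_text = (
    "cell,Sample,GenesExpressed,Type,Cellcycle,Clone_ID,cell_type\n"
    "BT_869-P01-A02,BCH869,4669,Malignant,-0.508018255,1.0,T\n"
    "BT_869-P01-A03,BCH869,5610,Malignant,0.773063553,2.0,T\n"
)
# The last barcode is hypothetical and deliberately absent from anno_text.
barcodes = ["BT_869-P01-A02", "BT_869-P01-A03", "BT_869-P01-Z99"]

unannotated = find_unannotated(anno_text, barcodes)
print(unannotated)
```

For a real file, read its content with open(anno_file).read() and pass sep="\t" if the file is tab-separated.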
Prepare the allele-specific data (BAF) and expression data (RDR)
XClone takes two cell-by-feature (genes/blocks) integer allelic count matrices, AD and DP, as BAF input, and one cell-by-feature (genes/blocks) integer UMI/read count matrix as RDR input.
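To make the relationship between the two allelic matrices concrete, the sketch below computes a per-entry B-allele frequency as AD / DP on toy data. This is the usual definition of BAF and is shown only for intuition; it is an assumption that it matches XClone's internal computation, and entries with zero depth are left undefined (None) here.

```python
def baf_matrix(ad, dp):
    """Element-wise AD / DP for two equally shaped cell-by-feature count matrices."""
    baf = []
    for ad_row, dp_row in zip(ad, dp):
        row = []
        for a, d in zip(ad_row, dp_row):
            # BAF is undefined where there is no allelic coverage.
            row.append(a / d if d > 0 else None)
        baf.append(row)
    return baf


# Toy data: 2 cells x 3 features, integer counts.
ad = [[2, 0, 3],
      [1, 0, 0]]
dp = [[4, 0, 3],
      [2, 5, 0]]

baf = baf_matrix(ad, dp)
print(baf)
```

Single-cell allelic counts are sparse, so most entries of a real matrix will have zero depth.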
For BAF, we recommend using the xcltk tool to get the two allelic AD and DP matrices. For RDR, you may use xcltk, 10x CellRanger, or any other expression quantification tool to get the RDR UMI/read count matrix.
See xcltk_preprocess for details on how to prepare BAF and RDR data.