XClone preprocessing

XClone takes three cell-by-gene matrices as input: the allele-specific AD and DP matrices and the total read depth matrix. The XClone preprocessing pipeline generates these three matrices from SAM/BAM/CRAM files. We recommend xcltk as the preprocessing pipeline. Before running it, you need to prepare the data.

Step1: Install xcltk and run the xcltk RDR module and BAF module independently.
Step2: Load the output of the xcltk RDR module.
Step3: Load the output of the xcltk BAF module.
Step4: Run the XClone analysis; see the Getting started page.

Tool installation

Preprocessing via xcltk

xcltk is a toolkit for XClone count generation. We recommend using xcltk to generate the read depth count matrix and the allelic count matrices. xcltk is available through PyPI. To install it, run the following command; the -U flag upgrades an existing installation.

pip install -U xcltk

Alternatively, you can install the latest (often development) version from the GitHub repository with the following command:

pip install -U git+https://github.com/hxj5/xcltk
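
After either install command, you may want to confirm which version Python actually sees. The following is a minimal sketch using only the standard library; the helper name installed_version is ours and is not part of xcltk.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string, or "not installed" if pip has not installed it.
version = installed_version("xcltk")
print("xcltk:", version if version is not None else "not installed")
```

If this prints "not installed" right after a successful pip run, pip and the Python interpreter you are using probably belong to different environments.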

Required data

XClone RDR module

For the RDR module, we need 3 files to create AnnData format data with the layer raw_expr, cell annotation in obs, and feature annotation in var.

RDR matrix
cell annotation
feature annotation

load RDR demo data

import xclone
RDR_adata = xclone.data.tnbc1_rdr()
## preview data details
RDR_adata
RDR_adata.obs
RDR_adata.var

How to build anndata from the files

Specify the paths of the files and use xclone.pp.xclonedata to build the AnnData object. mtx_barcodes_file must include at least the cell barcode IDs. If more cell annotations are stored in another file, use xclone.pp.extra_anno to import them.

import xclone
data_dir = "xxx/xxx/xxx/"
RDR_file = data_dir + "xcltk.rdr.mtx"
mtx_barcodes_file = data_dir + "barcodes.tsv" # cell barcodes
regions_anno_file = data_dir + "features.tsv" # feature annotation
# positional arguments must come before keyword arguments
RDR_adata = xclone.pp.xclonedata(RDR_file,
                                 'RDR',
                                 mtx_barcodes_file,
                                 regions_anno_file,
                                 genome_mode = "hg38_genes",
                                 data_notes = None)

# anno_file: path to the cell annotation file (see "Cell annotation")
RDR_adata = xclone.pp.extra_anno(RDR_adata, anno_file, barcodes_key = "cell",
            cell_anno_key = ["Clone_ID", "Type", "cell_type"], sep = ",")
# default sep = ",", also supports "\t"

If you use the xcltk tool to prepare the input matrix, it is easier to rely on the default feature annotation by specifying genome_mode (options include "hg19_genes", "hg38_genes", and the default 5M-length block annotations).

RDR_adata = xclone.pp.xclonedata(RDR_file, 'RDR', mtx_barcodes_file, genome_mode = "hg19_genes")
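
Before loading, it can help to confirm that the matrix file and the barcodes file agree in size. The sketch below reads only the size line of a Matrix Market (.mtx) file and compares it with the number of barcodes. It uses the standard library only; the helper names and the tiny demo files are ours and not real xcltk output. We do not assume which axis of the file holds the cells, so the check reports whichever axis matches.

```python
import os
import tempfile


def read_mtx_shape(path):
    """Return (n_rows, n_cols, n_entries) from a Matrix Market coordinate file."""
    with open(path) as handle:
        for line in handle:
            if line.startswith("%") or not line.strip():
                continue  # skip the banner, comments and blank lines
            n_rows, n_cols, n_entries = (int(x) for x in line.split()[:3])
            return n_rows, n_cols, n_entries
    raise ValueError("no size line found in " + path)


def count_nonempty_lines(path):
    """Count non-blank lines, e.g. barcodes in a header-less barcodes file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.strip())


def barcode_axes(mtx_path, barcodes_path):
    """Return the matrix axes whose length equals the number of barcodes."""
    n_rows, n_cols, _ = read_mtx_shape(mtx_path)
    n_barcodes = count_nonempty_lines(barcodes_path)
    axes = []
    if n_rows == n_barcodes:
        axes.append("rows")
    if n_cols == n_barcodes:
        axes.append("columns")
    if not axes:
        raise ValueError(
            "matrix is %d x %d but there are %d barcodes"
            % (n_rows, n_cols, n_barcodes)
        )
    return axes


# Tiny hand-made demo files (3 features x 2 cells), for illustration only.
demo_dir = tempfile.mkdtemp()
demo_mtx = os.path.join(demo_dir, "demo.mtx")
demo_barcodes = os.path.join(demo_dir, "barcodes.tsv")
with open(demo_mtx, "w") as handle:
    handle.write("%%MatrixMarket matrix coordinate integer general\n")
    handle.write("3 2 2\n1 1 5\n3 2 1\n")
with open(demo_barcodes, "w") as handle:
    handle.write("AAACCTGCACCTTGTC-1\nAAACGGGAGTCCTCCT-1\n")

print(read_mtx_shape(demo_mtx))
print(barcode_axes(demo_mtx, demo_barcodes))
```

A ValueError here usually means the barcodes file belongs to a different run or contains a header line.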

XClone BAF module

For the BAF module, we need 4 files to create AnnData format data with the layers AD and DP, cell annotation in obs, and feature annotation in var.

AD matrix
DP matrix
cell annotation
feature annotation

load BAF demo data

import xclone
BAF_adata = xclone.data.tnbc1_baf()
## preview data details
BAF_adata
BAF_adata.obs
BAF_adata.var

How to build anndata from the files

Specify the paths of the files and use xclone.pp.xclonedata to build the AnnData object, as for the RDR module. Here AD_file and DP_file are sparse matrices imported as the AD and DP layers.

import xclone
data_dir = "xxx/xxx/xxx/"
AD_file = data_dir + "AD.mtx"
DP_file = data_dir + "DP.mtx"
mtx_barcodes_file = data_dir + "barcodes.tsv" # cell barcodes
# use default gene annotation
BAF_adata = xclone.pp.xclonedata([AD_file, DP_file], 'BAF',
                                 mtx_barcodes_file,
                                 genome_mode = "hg19_genes")
# anno_file: path to the cell annotation file (see "Cell annotation")
BAF_adata = xclone.pp.extra_anno(BAF_adata, anno_file, barcodes_key = "cell",
            cell_anno_key = ["Clone_ID", "Type", "cell_type"], sep = ",")

Preparing data

This section gives detailed instructions on how to prepare the data for generating AnnData objects for the RDR module and the BAF module. Both parts need annotation data for cells and genome features. We recommend preparing the annotation data as follows.

Annotation data

Feature annotation

The feature annotation must include at least chr, start, stop, and arm information, with features in chr1-22,X,Y order for intuitive visualization and analysis. Below are two feature annotation examples shipped with XClone that you can load as your annotation file. If you use the xcltk pipeline, default annotations are provided.

import xclone
hg38_genes = xclone.pp.load_anno(genome_mode = "hg38_genes")
hg38_blocks = xclone.pp.load_anno(genome_mode = "hg38_blocks")

Feature (genes) annotation sample in hg38
GeneName	GeneID	chr	start	stop	arm	chr_arm	band
MIR1302-2HG	ENSG00000243485	1	29554	31109	p	1p	p36.33
FAM138A	ENSG00000237613	1	34554	36081	p	1p	p36.33
OR4F5	ENSG00000186092	1	65419	71585	p	1p	p36.33
AL627309.1	ENSG00000238009	1	89295	133723	p	1p	p36.33
AL627309.3	ENSG00000239945	1	89551	91105	p	1p	p36.33
AL627309.2	ENSG00000239906	1	139790	140339	p	1p	p36.33
AL627309.4	ENSG00000241599	1	160446	161525	p	1p	p36.33
AL732372.1	ENSG00000236601	1	358857	366052	p	1p	p36.33
OR4F29	ENSG00000284733	1	450703	451697	p	1p	p36.33
AC114498.1	ENSG00000235146	1	587629	594768	p	1p	p36.33

Feature (blocks) annotation sample in hg38
chr	start	stop	arm
1	1	50000	p
1	50001	100000	p
1	100001	150000	p
1	150001	200000	p
1	200001	250000	p
1	250001	300000	p
1	300001	350000	p
1	350001	400000	p
1	400001	450000	p
1	450001	500000	p
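
If you build your own feature annotation, the required columns and the chr1-22,X,Y order described above can be checked with a few lines of plain Python. This is a sketch under our own assumptions, not XClone code: the helper names are hypothetical, records are plain dictionaries, an optional "chr" prefix is stripped, and chromosomes outside 1-22,X,Y are placed last.

```python
REQUIRED_COLUMNS = ("chr", "start", "stop", "arm")
CHROM_ORDER = [str(i) for i in range(1, 23)] + ["X", "Y"]
CHROM_RANK = {name: rank for rank, name in enumerate(CHROM_ORDER)}


def normalise_chrom(value):
    """Strip an optional 'chr' prefix so 'chr1' and '1' compare equal."""
    value = str(value)
    return value[3:] if value.lower().startswith("chr") else value


def sort_features(rows):
    """Sort feature records by chromosome (1-22, X, Y), then by start."""
    for row in rows:
        missing = [col for col in REQUIRED_COLUMNS if col not in row]
        if missing:
            raise KeyError("missing columns: " + ", ".join(missing))

    def sort_key(row):
        # Unknown chromosomes get the highest rank and therefore sort last.
        rank = CHROM_RANK.get(normalise_chrom(row["chr"]), len(CHROM_ORDER))
        return (rank, int(row["start"]))

    return sorted(rows, key=sort_key)


# Made-up records for illustration; a plain string sort would put "10" before "2".
demo_rows = [
    {"chr": "10", "start": 1, "stop": 50000, "arm": "p"},
    {"chr": "X", "start": 1, "stop": 50000, "arm": "p"},
    {"chr": "2", "start": 50001, "stop": 100000, "arm": "p"},
    {"chr": "2", "start": 1, "stop": 50000, "arm": "p"},
]
sorted_rows = sort_features(demo_rows)
print([(row["chr"], row["start"]) for row in sorted_rows])
```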

Cell annotation

cell barcodes

barcodes_file contains the barcodes without any header.

barcodes_sample
AAACCTGCACCTTGTC-1
AAACGGGAGTCCTCCT-1
AAACGGGTCCAGAGGA-1
AAAGATGCAGTTTACG-1
AAAGCAACAGGAATGC-1
AAAGCAATCGGAATCT-1
AAAGTAGAGTGTACTC-1
AAAGTAGCAGCCTATA-1
AAAGTAGGTACAGTTC-1
AAAGTAGTCGCATGAT-1
AAAGTAGTCTATCCCG-1
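
A stray header line or a duplicated barcode is easy to miss in a long file. The sketch below flags both. The header test is only a heuristic based on a small word list of our own choosing, and the helper names are hypothetical rather than part of XClone.

```python
HEADER_WORDS = {"barcode", "barcodes", "cell", "cells", "cell_id"}  # heuristic guess


def read_barcodes(path):
    """Read one barcode per line, ignoring blank lines."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def check_barcodes(barcodes):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if barcodes and barcodes[0].lower() in HEADER_WORDS:
        problems.append("first line looks like a header: " + barcodes[0])
    seen, duplicates = set(), set()
    for barcode in barcodes:
        if barcode in seen:
            duplicates.add(barcode)
        seen.add(barcode)
    if duplicates:
        problems.append("duplicated barcodes: " + ", ".join(sorted(duplicates)))
    return problems


# The first two barcodes of the sample above pass both checks.
print(check_barcodes(["AAACCTGCACCTTGTC-1", "AAACGGGAGTCCTCCT-1"]))
```

For a real file, call check_barcodes(read_barcodes(mtx_barcodes_file)) with your own path.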

cell annotation

The cell annotation (anno_file) must include at least cell and cell_type information (Tumor or Normal, T/N), where cell is the key for the cell barcodes.

cell annotation sample
cell	Sample	GenesExpressed	Type	Cellcycle	Clone_ID	cell_type
BT_869-P01-A02	BCH869	4669	Malignant	-0.508018255	1.0	T
BT_869-P01-A03	BCH869	5610	Malignant	0.773063553	2.0	T
BT_869-P01-A04	BCH869	4291	Malignant	-0.627932323	2.0	T
BT_869-P01-A05	BCH869	7037	Malignant	1.879331871	2.0	T
BT_869-P01-A06	BCH869	6305	Malignant	-0.433514129	2.0	T
BT_869-P01-A07	BCH869	7034	Malignant	-0.664458445	2.0	T
BT_869-P01-A08	BCH869	8289	Malignant	-1.106595637	2.0	T
BT_869-P01-A10	BCH869	4572	Malignant	-0.768966081	1.0	T
BT_869-P01-A11	BCH869	6465	Malignant	-1.03425176	2.0	T
BT_869-P01-B01	BCH869	4708	Malignant	-0.326930934	1.0	T
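
Before passing anno_file to xclone.pp.extra_anno, you can verify that it has the cell and cell_type columns and that every barcode of interest is annotated. The following is a sketch using the standard csv module. The helper names are ours, the demo table keeps only a few columns of the sample above, and the barcode ending in Z99 is made up to show a miss.

```python
import csv
import io

REQUIRED_KEYS = ("cell", "cell_type")


def read_cell_annotation(handle, sep=","):
    """Read an annotation table into {barcode: row}, checking required columns."""
    reader = csv.DictReader(handle, delimiter=sep)
    columns = reader.fieldnames or []
    missing = [key for key in REQUIRED_KEYS if key not in columns]
    if missing:
        raise ValueError("annotation is missing columns: " + ", ".join(missing))
    return {row["cell"]: row for row in reader}


def unannotated(barcodes, annotation):
    """Return the barcodes that have no row in the annotation table."""
    return [barcode for barcode in barcodes if barcode not in annotation]


# Tab-separated demo table, so sep="\t" is passed explicitly.
demo_text = (
    "cell\tSample\tType\tClone_ID\tcell_type\n"
    "BT_869-P01-A02\tBCH869\tMalignant\t1.0\tT\n"
    "BT_869-P01-A03\tBCH869\tMalignant\t2.0\tT\n"
)
annotation = read_cell_annotation(io.StringIO(demo_text), sep="\t")
print(annotation["BT_869-P01-A02"]["cell_type"])
# "BT_869-P01-Z99" is a hypothetical barcode that is absent from the table.
print(unannotated(["BT_869-P01-A02", "BT_869-P01-Z99"], annotation))
```

With a file on disk, pass an open file handle instead of io.StringIO and choose sep to match the file.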

Prepare the allele-specific data (BAF) and expression data (RDR)

XClone takes two cell-by-feature (genes/blocks) integer allelic count matrices, AD and DP, as BAF input, and one cell-by-feature (genes/blocks) integer UMI/read count matrix as RDR input. For BAF, we recommend using the xcltk tool to get the two allelic AD and DP matrices. For RDR, you may use xcltk, 10x CellRanger, or any other expression quantification tool to get the RDR UMI/read count matrix.

See xcltk_preprocess for details of how to prepare BAF and RDR data.
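
To make the relation between the two allelic matrices concrete, the sketch below computes a per-entry B-allele frequency as AD / DP. It assumes that AD counts reads for one allele and DP counts reads for both alleles, so AD should never exceed DP. This is an illustration in plain Python with sparse counts stored in dictionaries, and it is not how XClone computes BAF internally; the function name is ours.

```python
def compute_baf(ad, dp):
    """Return {(cell, feature): AD / DP} for entries with DP > 0.

    ad and dp map (cell_index, feature_index) to integer counts; absent
    keys are treated as zero, as in a sparse matrix.
    """
    for key, allele_count in ad.items():
        if allele_count > dp.get(key, 0):
            raise ValueError("AD exceeds DP at %r" % (key,))
    # Entries with DP == 0 carry no allelic information and are skipped.
    return {key: ad.get(key, 0) / depth for key, depth in dp.items() if depth > 0}


# Made-up counts for one cell and three features.
demo_ad = {(0, 0): 2, (0, 1): 0}
demo_dp = {(0, 0): 4, (0, 1): 3, (0, 2): 0}
demo_baf = compute_baf(demo_ad, demo_dp)
print(demo_baf)
```

An error from this check on real data suggests that the AD and DP files were swapped or come from different runs.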