XClone preprocessing

XClone takes three cell-by-gene matrices as input: the allele-specific AD and DP matrices and the total read depth matrix. The XClone preprocessing pipeline generates these three matrices from SAM/BAM/CRAM files. We recommend xcltk as the preprocessing pipeline. Before running it, you need to prepare the data.

Tool installation

Preprocessing via xcltk

xcltk is a toolkit for XClone count generation. We recommend using xcltk to generate the read depth count matrix and the allelic count matrices. xcltk is available through PyPI. To install it, run the following command (the -U flag upgrades an existing installation):

pip install -U xcltk

Alternatively, you can install the latest (often development) version from the GitHub repository with the following command:

pip install -U git+https://github.com/hxj5/xcltk
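After either install command, it can be useful to confirm which version ended up in the active environment. The following is a minimal sketch using only the Python standard library; the helper name installed_version is hypothetical and is not part of xcltk or XClone.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "xcltk" is the distribution name used in the pip commands above.
    version = installed_version("xcltk")
    print("xcltk:", version if version else "not installed")
```

Run it with the same Python interpreter that you used for pip, otherwise the result may describe a different environment.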

Required data

XClone RDR module

For the RDR module, we need 3 files to create AnnData-format data with the layer raw_expr, cell annotation in obs, and feature annotation in var.

  • RDR matrix

  • cell annotation

  • feature annotation

Load RDR demo data

import xclone
RDR_adata = xclone.data.tnbc1_rdr()
## preview data details
RDR_adata
RDR_adata.obs
RDR_adata.var

How to build anndata from the files

Specify the paths of the files and use xclone.pp.xclonedata to build the AnnData object. mtx_barcodes_file must at least include the cell barcode IDs. If more cell annotations are stored in another file, use xclone.pp.extra_anno to import them.

import xclone

data_dir = "xxx/xxx/xxx/"
RDR_file = data_dir + "xcltk.rdr.mtx"
mtx_barcodes_file = data_dir + "barcodes.tsv"  # cell barcodes
regions_anno_file = data_dir + "features.tsv"  # feature annotation
# build the AnnData object; the data mode ('RDR') is the second positional argument
RDR_adata = xclone.pp.xclonedata(RDR_file,
                                 'RDR',
                                 mtx_barcodes_file,
                                 regions_anno_file,
                                 genome_mode = "hg38_genes",
                                 data_notes = None)

# optional: import extra cell annotations from a separate file
anno_file = data_dir + "cell_anno.csv"  # placeholder name; point this to your own annotation file
RDR_adata = xclone.pp.extra_anno(RDR_adata, anno_file, barcodes_key = "cell",
            cell_anno_key = ["Clone_ID", "Type", "cell_type"], sep = ",")
# default sep = ",", also support "\t"

If you use the xcltk tool to prepare the input matrix, it is easier to rely on the default feature annotation: specify genome_mode (options include "hg19_genes", "hg38_genes", and the default 5M-length block annotations) and leave out regions_anno_file.

RDR_adata = xclone.pp.xclonedata(RDR_file, 'RDR', mtx_barcodes_file, genome_mode = "hg19_genes")
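Before calling xclone.pp.xclonedata, it may help to confirm that the matrix dimensions agree with the barcode and feature files. The sketch below uses only the standard library and assumes an uncompressed Matrix Market coordinate file plus header-less barcode and feature files with one record per line; the function names are hypothetical helpers, not part of XClone or xcltk. It reports the orientation it finds rather than assuming one.

```python
def read_mtx_shape(path):
    """Return (n_rows, n_cols, n_entries) from a Matrix Market coordinate file."""
    with open(path) as handle:
        for line in handle:
            # Lines starting with '%' are the banner or comments.
            if line.startswith("%") or not line.strip():
                continue
            n_rows, n_cols, n_entries = (int(x) for x in line.split()[:3])
            return n_rows, n_cols, n_entries
    raise ValueError(f"no size line found in {path}")


def count_lines(path):
    """Count the non-empty lines of a text file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.strip())


def check_dimensions(mtx_file, barcodes_file, features_file):
    """Return the matrix orientation, or raise if the counts do not match."""
    n_rows, n_cols, _ = read_mtx_shape(mtx_file)
    n_cells = count_lines(barcodes_file)
    n_features = count_lines(features_file)
    if (n_rows, n_cols) == (n_features, n_cells):
        return "features x cells"
    if (n_rows, n_cols) == (n_cells, n_features):
        return "cells x features"
    raise ValueError(
        f"matrix is {n_rows} x {n_cols}, but found "
        f"{n_cells} barcodes and {n_features} features"
    )
```

For example, check_dimensions(RDR_file, mtx_barcodes_file, regions_anno_file) raises a ValueError when a barcode or feature file belongs to a different run.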

XClone BAF module

For the BAF module, we need 4 files to create AnnData-format data with the layers AD and DP, cell annotation in obs, and feature annotation in var.

  • AD matrix

  • DP matrix

  • cell annotation

  • feature annotation

Load BAF demo data

import xclone
BAF_adata = xclone.data.tnbc1_baf()
## preview data details
BAF_adata
BAF_adata.obs
BAF_adata.var

How to build anndata from the files

Specify the paths of the files and use xclone.pp.xclonedata to build the AnnData object, as for the RDR module. Here AD_file and DP_file are sparse matrices that are imported as the AD and DP layers.

import xclone

data_dir = "xxx/xxx/xxx/"
AD_file = data_dir + "AD.mtx"
DP_file = data_dir + "DP.mtx"
mtx_barcodes_file = data_dir + "barcodes.tsv"  # cell barcodes
# use default gene annotation
BAF_adata = xclone.pp.xclonedata([AD_file, DP_file], 'BAF',
                                 mtx_barcodes_file,
                                 genome_mode = "hg19_genes")

# optional: import extra cell annotations from a separate file
anno_file = data_dir + "cell_anno.csv"  # placeholder name; point this to your own annotation file
BAF_adata = xclone.pp.extra_anno(BAF_adata, anno_file, barcodes_key = "cell",
            cell_anno_key = ["Clone_ID", "Type", "cell_type"], sep = ",")
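A quick consistency check on the two allelic matrices can catch swapped or mismatched files before they are loaded. The sketch below assumes that AD counts reads for one allele and DP counts reads for both, so every AD entry should be no larger than the matching DP entry and AD / DP lies between 0 and 1. It parses uncompressed Matrix Market coordinate files with the standard library only; all function names are hypothetical helpers, not XClone API.

```python
def read_mtx_entries(path):
    """Read a Matrix Market coordinate file into {(row, col): value}."""
    entries = {}
    size_seen = False
    with open(path) as handle:
        for line in handle:
            if line.startswith("%") or not line.strip():
                continue  # banner, comment, or blank line
            if not size_seen:
                size_seen = True  # first data line is "n_rows n_cols n_entries"
                continue
            row, col, value = line.split()[:3]
            entries[(int(row), int(col))] = float(value)
    return entries


def find_ad_exceeding_dp(ad_entries, dp_entries):
    """Return the coordinates where AD is larger than DP (missing DP counts as 0)."""
    return sorted(
        key for key, ad in ad_entries.items() if ad > dp_entries.get(key, 0)
    )


def allele_frequency(ad_entries, dp_entries):
    """Compute AD / DP for every entry with DP > 0."""
    return {
        key: ad_entries.get(key, 0) / dp
        for key, dp in dp_entries.items()
        if dp > 0
    }
```

An empty list from find_ad_exceeding_dp(read_mtx_entries(AD_file), read_mtx_entries(DP_file)) is the expected result under the stated assumption. Reading large matrices into a dictionary is slow, so this is meant for spot checks rather than routine use.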

Preparing data

This section gives detailed instructions on how to prepare the data for generating AnnData objects for the RDR and BAF modules. Both modules need annotation data for cells and genome features. We recommend preparing the annotation data as follows.

Annotation data

Feature annotation

Feature annotation must at least include chr, start, stop, and arm information, with features in chr1-22, X, Y order for intuitive visualization and analysis. Here are two feature annotation examples shipped with XClone, which you can load and use as your annotation file. If you use the xcltk pipeline, default annotations are provided.

import xclone
hg38_genes = xclone.pp.load_anno(genome_mode = "hg38_genes")
hg38_blocks = xclone.pp.load_anno(genome_mode = "hg38_blocks")

Feature (genes) annotation sample in hg38

GeneName     GeneID           chr  start   stop    arm  chr_arm  band
MIR1302-2HG  ENSG00000243485  1    29554   31109   p    1p       p36.33
FAM138A      ENSG00000237613  1    34554   36081   p    1p       p36.33
OR4F5        ENSG00000186092  1    65419   71585   p    1p       p36.33
AL627309.1   ENSG00000238009  1    89295   133723  p    1p       p36.33
AL627309.3   ENSG00000239945  1    89551   91105   p    1p       p36.33
AL627309.2   ENSG00000239906  1    139790  140339  p    1p       p36.33
AL627309.4   ENSG00000241599  1    160446  161525  p    1p       p36.33
AL732372.1   ENSG00000236601  1    358857  366052  p    1p       p36.33
OR4F29       ENSG00000284733  1    450703  451697  p    1p       p36.33
AC114498.1   ENSG00000235146  1    587629  594768  p    1p       p36.33

Feature (blocks) annotation sample in hg38

chr  start   stop    arm
1    1       50000   p
1    50001   100000  p
1    100001  150000  p
1    150001  200000  p
1    200001  250000  p
1    250001  300000  p
1    300001  350000  p
1    350001  400000  p
1    400001  450000  p
1    450001  500000  p
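If you supply your own feature annotation, a small check of the required columns and the chromosome order can be run first. This is a minimal sketch, assuming the annotation has already been read into a list of dictionaries (for example with csv.DictReader) and that chromosome labels look like "1" or "chr1"; the constants and function names are hypothetical and not part of XClone.

```python
CHROM_ORDER = [str(i) for i in range(1, 23)] + ["X", "Y"]
REQUIRED_COLUMNS = ("chr", "start", "stop", "arm")


def chrom_rank(chrom):
    """Map a chromosome label ('1', 'chr1', 'X', ...) to its index in CHROM_ORDER."""
    label = str(chrom)
    if label.lower().startswith("chr"):
        label = label[3:]
    # Raises ValueError for contigs outside chr1-22, X, Y.
    return CHROM_ORDER.index(label.upper())


def missing_columns(columns):
    """Return the required annotation columns that are absent from `columns`."""
    return [name for name in REQUIRED_COLUMNS if name not in columns]


def sort_features(rows):
    """Sort annotation rows (dicts) by chromosome order, then by start position."""
    return sorted(rows, key=lambda row: (chrom_rank(row["chr"]), int(row["start"])))


def is_sorted(rows):
    """Check whether the rows are already in chr1-22, X, Y and start order."""
    return list(rows) == sort_features(rows)
```

Sorting by chrom_rank rather than by the raw label avoids the usual string-sort problem where "10" is placed before "2".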

Cell annotation

  • cell barcodes

barcodes_file includes the barcodes, one per line, without any header.

barcodes_sample

AAACCTGCACCTTGTC-1
AAACGGGAGTCCTCCT-1
AAACGGGTCCAGAGGA-1
AAAGATGCAGTTTACG-1
AAAGCAACAGGAATGC-1
AAAGCAATCGGAATCT-1
AAAGTAGAGTGTACTC-1
AAAGTAGCAGCCTATA-1
AAAGTAGGTACAGTTC-1
AAAGTAGTCGCATGAT-1
AAAGTAGTCTATCCCG-1

  • cell annotation

Cell annotation (anno_file) must at least include the cell and cell_type columns (cell_type is Tumor or Normal, T/N), where cell is the key holding the cell barcodes.

cell annotation sample

cell            Sample  GenesExpressed  Type       Cellcycle     Clone_ID  cell_type
BT_869-P01-A02  BCH869  4669            Malignant  -0.508018255  1.0       T
BT_869-P01-A03  BCH869  5610            Malignant  0.773063553   2.0       T
BT_869-P01-A04  BCH869  4291            Malignant  -0.627932323  2.0       T
BT_869-P01-A05  BCH869  7037            Malignant  1.879331871   2.0       T
BT_869-P01-A06  BCH869  6305            Malignant  -0.433514129  2.0       T
BT_869-P01-A07  BCH869  7034            Malignant  -0.664458445  2.0       T
BT_869-P01-A08  BCH869  8289            Malignant  -1.106595637  2.0       T
BT_869-P01-A10  BCH869  4572            Malignant  -0.768966081  1.0       T
BT_869-P01-A11  BCH869  6465            Malignant  -1.03425176   2.0       T
BT_869-P01-B01  BCH869  4708            Malignant  -0.326930934  1.0       T
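The barcode file and the cell annotation file have to agree on the cell identifiers, so it can be worth checking them against each other before importing the annotation. The sketch below uses the standard csv module and assumes a header-less barcode file plus a delimited annotation file whose header contains the cell and cell_type columns described above; the function names are hypothetical helpers, not XClone API.

```python
import csv


def read_barcodes(path):
    """Read one barcode per line from a header-less file."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def read_annotation(path, barcodes_key="cell", required=("cell_type",), sep=","):
    """Read the annotation table into {barcode: row}, keyed by `barcodes_key`."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter=sep)
        columns = reader.fieldnames or []
        missing = [c for c in (barcodes_key, *required) if c not in columns]
        if missing:
            raise ValueError(f"annotation file lacks columns: {missing}")
        return {row[barcodes_key]: row for row in reader}


def unannotated_barcodes(barcodes, annotation):
    """Return the barcodes that have no row in the annotation table."""
    return [barcode for barcode in barcodes if barcode not in annotation]
```

A non-empty result from unannotated_barcodes often points to a naming mismatch, such as a missing "-1" suffix or a sample prefix present in only one of the two files.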

Prepare the allele-specific data (BAF) and expression data (RDR)

XClone takes two cell-by-feature (genes/blocks) integer allelic count matrices, AD and DP, as BAF input, and one cell-by-feature (genes/blocks) integer UMI/read count matrix as RDR input. For BAF, we recommend using the xcltk tool to get the AD and DP matrices. For RDR, you may use xcltk, 10x Cell Ranger, or any other expression quantification tool to get the UMI/read count matrix.

See xcltk_preprocess for details of how to prepare BAF and RDR data.
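If the counts come from another quantification tool, they may need to be written as a sparse .mtx file like the ones used in the examples above. The following is a minimal sketch of writing a small dense integer count table in Matrix Market coordinate format with the standard library only. The function name write_mtx is hypothetical, and the sketch does not decide the row/column orientation: which axis holds cells and which holds features must match what your loading step expects.

```python
def write_mtx(counts, path):
    """Write a dense list of rows of non-negative integers as a Matrix Market file.

    Returns the number of non-zero entries written.
    """
    n_rows = len(counts)
    n_cols = len(counts[0]) if counts else 0
    entries = []
    for i, row in enumerate(counts, start=1):  # Matrix Market indices are 1-based
        if len(row) != n_cols:
            raise ValueError(f"row {i} has {len(row)} values, expected {n_cols}")
        for j, value in enumerate(row, start=1):
            if value != int(value) or value < 0:
                raise ValueError(f"entry ({i}, {j}) is not a non-negative integer")
            if value:
                entries.append((i, j, int(value)))  # zeros are left implicit
    with open(path, "w") as handle:
        handle.write("%%MatrixMarket matrix coordinate integer general\n")
        handle.write(f"{n_rows} {n_cols} {len(entries)}\n")
        for i, j, value in entries:
            handle.write(f"{i} {j} {value}\n")
    return len(entries)
```

The integer check reflects the requirement that the count matrices hold integer counts; normalized or log-transformed expression values would be rejected by this sketch.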