The ChangKang Project

Quality Check and Raw Data Conversion

Raw Sequencing Data were preprocessed by fastp v0.12.6(Chen, Shifu et al., 2018) for subsequent analysis:

1) adapter trimming

2) remove the reads in which the N base has reached a certain percentage (default length of 5 bp)

3) remove the reads which contain low quality bases (default quality threshold value <= 20) above a certain portion (default 40%)

4) sliding window trimming: the bases in the sliding window (default is 4bp) with mean quality below cutting quality (default is 20) will be cut

Alignment

The sequenced reads were aligned to the reference human genome (build hg19) using sentieon bwa-mem. Subsequent processing including sorting read and marking duplicates were performed using sentieon tools v201808 (https://www.sentieon.com/ ), which were improved upon the GATK Toolkit (https://gatk.broadinstitute.org/hc/en-us). Sequence depth and coverage were obtained using bamdst (https://github.com/shiquan/bamdst).

Mutation Calling

To identify all the variants, we used three somatic mutation callers for single nucleotide variants (SNVs) and indels:

1) VarScan2 (v2.3.8): -min-coverage 10 -min-var-freq 0.05 -min-freq-for-hom 0.75 -p-value 0.99 -somatic-p-value 0.05.(Koboldt et al., 2012)

2) Mutect2: gatk mutect2 -R -pon -germline-resource -af-of-alleles-not-in-resource 0.0000025 -L -disable-read-filter.(Cibulskis, Kristian et al., 2013)

3) TNscope: sentieon driver -t -algo TNscope -dbsnp -pon -cosmic

To improve specificity, a panel of normal (PON) samples filtration was used to removed background germline variations and artifacts. Mutect2 and TNscope are based on bam files which processed by quality score recalibration that performed using GATK4 (v 4.1.1.0) (McKenna et al., 2010).

Specifically, processSomatic and somaticFilter: -min-coverage 10, -min-reads2 2, -min-strands2 1, -min-avg-qual 20, -p-value 0.05, were used to extract high-confidence somatic variants from the raw VarScan2 results and to remove clusters of false positives and SNVs near indels. Additional filtration with fpfilter.pl script from Varscan2 was performed tumor read counts generated by bam-readcount(https://github.com/genome/bam-readcount) was performed to reduce false positive calls. We discarded SNVs in RepeatMasker repeat regions(http://www.repeatmasker.org).

Annotation and Filtering Strategies

Somatic mutations were then annotated using VEP(McLaren et al., 2016). To obtain reliable mutation calls, we used a two-step approach.

First, chose mutations that were identified in at least two out of the three callers (Mutect2, TNscope and VarScan2).

Second, additional filtering with three criteria was performed:

1) variant allele frequency (VAF) >=8%

2) sequencing depth in the region >=8

3) sequence reads in support of the variant call >= 2

CNV Calling and GISTIC Analysis

CNV analusis was done using EXCAVATOR2 v1.1.2(D'Aurizio, R et al., 2016). EXCAVATOR2 is a collection of bash, R and Fortran scripts and codes that analyses WES data to identify CNVs. It extends the Read Count approach to the whole genome sequence and exploits the Shifting Level Model (SLM) algorithm to segment the two combined profiles.

To study significantly recurrent regions of SCNA, we applied GISTIC2 v2.0.23(Mermel et al., 2011) applied to the copy number segments. GISTIC2 was run with the following parameters: -ta 0.3 -td 0.3 -armpeel 1 -cap 1.5 -conf 0.99 -genegistic 1 -gcm mean -js 4 -maxseg 2000 -qvt 0.05 -savegene 1 -brlen 0.98 -broad 0 -rx 0.

Signature Identification and Decomposition

Mutational signatures were identified using R package LassoSig from the CRC samples (only samples with at least 20 SNVs were included). The normalization method was set to ‘exome2genome’. This approach organized sample information in the form of the fraction of mutations in each of 96 trinucleotides and determined the weighted combination of the COSMIC signatures (https://cancer.sanger.ac.uk/cosmic/signatures) that most closely reconstructed the mutational profile.