PCaDB

77

Datasets

9,068

Samples

30

Signatures

Introduction

Prostate cancer

Prostate cancer (PCa) is the second most frequently diagnosed cancer in men worldwide, which accounts for 14.1% of the newly diagnosed cancer cases in 2020 [ GLOBOCAN ]. Localized PCa is a heterogeneous disease with highly variable clinical outcomes, which presents enormous challenges in the clinical management. The majority of patients with indolent PCa only require active surveillance, whereas patients with aggressive PCa require immediate local treatment. Comprehensive collection of transcriptomics data from large population cohorts is very critical for understanding the molecular biology of cancer, drug target identification and evaluation, and biomarker discovery for cancer diagnosis and prognosis, etc.

About PCaDB

PCaDB is a comprehensive and interactive database for transcriptomes from prostate cancer population cohorts. We collected 77 transcriptomics datasets with 9,068 patient samples and a single-cell RNA-sequencing (scRNAseq) dataset with ~ 30,000 cells for normal human prostates from public data repositories. We also included a comprehensive collection of 30 published gene expression-based prognostic signatures in PCaDB. A standard bioinformatics pipeline is developed to download and process the expression data and sample metadata of the transcriptomics datasets.

PCaDB provides a user-friendly interface and a suite of data analysis and visualization modules for the comprehensive analysis of individual genes, prognostic signatures, and the whole transcriptomes to elucidate the molecular heterogeneity in PCa, understand the mechanisms of tumor initiation and progression, as well as develop and validate prognostic signatures in large independent cohorts.

Citation

Please cite the following publication: Li, R., Zhu, J., Zhong, W.-D., and Jia, Z. (2021) PCaDB - a comprehensive and interactive database for transcriptomes from prostate cancer population cohorts. bioRxiv, 10.1101/2021.06.29.449134.

https://doi.org/10.1101/2021.06.29.449134

Search a gene

Gene Expression in Different Sample Types

Select a Dataset:

Kaplan Meier Survival Analysis (Forest Plot)

(Low- and high-expression groups were separated by median values)

Kaplan Meier Survival Curve

(Low- and high-expression groups were separated by median values)

tSNE Plot

UMAP Plot

Prognostic Signatures

Gene Expression Signatures for Prostate Cancer Prognosis

An inclusive collection of 30 published gene expression-based prognostic signatures for prostate cancer is included in PCaDB. The performances of the signatures have been investigated in the study [ Comprehensive evaluation of machine learning models and gene expression signatures for prostate cancer prognosis using large population cohorts ]. Comprehensive characterization of the signatures can be further performed in PCaDB.

Select a Signature:

List of genes in the selected prognostic signature

Select a Dataset:

Volcano Plot of Differentially Expressed Genes in the Signatures

(|Fold Change| > 2 & FDR < 0.01)

List of Differentially Expressed Genes in the Signatures

(|Fold Change| > 2 & FDR < 0.01)

Select a Dataset:

KM Survival Analysis of RFS for Common Genes in 3 or more Signatures

(Low- and high-expression groups were separated by median values)

KM Survival Analysis of RFS for All the Signature Genes

(Low- and high-expression groups were separated by median values)

Select the Gene List:

Gene Sets:

Functional Enrichment Analysis of Signature Genes

Bar Plot of the Top 30 Enriched Pathways

Bubble Plot of the Top 30 Enriched Pathways

Select a Signature:

Select a Training Dataset:

Select a Model:

Comparison of the performances of the prognostic signatures

C Index

Hazard Ratio (KM)

Area Under the ROC Curve

C Index

Hazard Ratio of KM analysis

Time-dependent AUC

Transcriptome Analysis

Select a Transcriptome Dataset

Summary

Platform:

GEO/ArrayExpress Accession:

SRA Accession:

cBioPortal:

Data Preprocessing:

Sample Type

PSA

Gleason Pattern

BCR Status

Time to BCR

2D Principal Component Analysis

3D Principal Component Analysis

Select the Case/Control Groups for Comparison:

Method:

Limma

Fold Change:

BH Adjusted P Value:

Volcano Plot of Differentially Expressed Genes

List of Differentially Expressed Genes

Univariate Kaplan Meier (KM) Survival Analysis

(Low- and high-expression groups were separated by median values)

Univariate Cox Proportional-Hazards (CoxPH) Survival Analysis

Multivariate Cox Proportional-Hazards (CoxPH) Survival Analysis

* Clinical factors that are used for multivariate CoxPH survival analysis in each dataset

(Y: included; N: not included)

Paste A Gene List (Ensembl ID):

Method:

CoxPH

CoxLasso

CoxRidge

Prognostic Model

Kaplan Meier Survival Analysis in the Training Dataset

(Low- and high-risk groups were separated by median values)

Kaplan Meier Survival Analysis in the Validation Datasets

PCaDB Pipeline

Details about PCaDB pipeline

Overview

SRA RNAseq data

TCGA RNAseq data

cBioPortal RNAseq data

GEO/ArrayExpress microarray data

Sample metadata model

Single-cell RNAseq data

Gene annotation

Overview

A suite of bioinformatics pipelines has been developed to download, process, and harmonize the transcriptomics data from multiple public data repositories, including Sequence Read Archive (SRA), Genomic Data Commons (GDC)/The Cancer Genome Atlas (TCGA), cBioPortal, Gene Expression Omnibus (GEO), and ArrayExpress.

In general, among the 77 datasets in PCaDB, (i) 17 datasets from SRA and 37 datasets from GEO/ArrayExpress with the raw FASTQ or CEL files, respectively, were reprocessed using the PCaDB pipeline; (ii) read count data for the TCGA-PRAD project was downloaded from GDC; (iii) normalized data such as RPKM/FPKM values for the 6 RNAseq datasets in cBioPortal were downloaded directly; and (iv) normalized intensities for the other 16 microarray datasets without raw data were downloaded from GEO. For RNAseq data from TCGA and SRA, the raw count data is normalized using the Trimmed Mean of Mvalues (TMM) method implemented in the R package

edgeR

[1], and lowly-expressed genes with logCPM < 0 in more than 50% of samples are filtered out prior to any downstream analysis. For the Affymetrix microarray data with raw CEL files in GEO/ArrayExpress, the Robust Multichip Average (RMA) method implemented in the R package

oligo

[2] was used for data normalization. For a small proportion of the datasets that only have normalized data provided in the public data repository, the methods for data processing described in the original papers were carefully inspected to make sure that appropriate processing methods had been used. All the gene expression values are in log2 scale. In PCaDB, the Ensembl gene identifiers are used for all the datasets. If multiple probes/genes matched to the same Ensembl ID, only the most informative one with the maximum interquartile range (IQR) for the gene expression is used for this Ensembl ID.

A comprehensive metadata model has been developed to harmonize the sample metadata (phenotype data) from the public repositories. The ExpressionSet object has been created for each dataset, allowing users to easily download both the gene expression data and harmonized metadata for advanced downstream analyses.

A single-cell RNAseq dataset from normal human prostae tissue [3] was downloaded from GUDMAP (https://www.gudmap.org/), allowing the investigation of a gene of interest at the single-cell level.

30 prognostic signatures for prostate cancer were obtained from a comprehensive study evaluating the machine learning models and gene expression signatures for prostate cancer prognosis using large population cohorts [4]. The 10 datasets used in that study were all included in the PCaDB database.

We also systematically organized the gene annotation information from HGNC, Ensembl, NCBI, and Gencode to facilitate the query of individual genes and the cross-dataset anlaysis.

Overview of the pipeline is shown in the figure below and details about the pipeline are described in the sub-sections.

SRA RNAseq data

For the 17 RNAseq datasets in SRA, the

fasterq-dump

tool in

SRA Toolkit (2.10.8)

[5] was used to download and convert SRA data into FASTQ format. The

FastQC (v0.11.5)

[6] tool was used for a comprehensive quality control of the sequencing data, including base sequence quality, GC content, overrepresented sequences, and adapter content, etc. Read trimming was not performed since a few benchmark studies have demonstrated that it's not necessary to remove adapter sequences or low-quality bases before alignment because the adapter sequences can be efficiently removed by read aligner via 'soft-cipping' and the low-quality bases can be rescued by the aligner [ref]. The most widely used aligner for RNAseq data -

STAR (2.7.9a)

[7] was adopted to align the sequencing reads to the reference genome

GRCh38 (primary assembly)

based on the

GENCODE (release 38)

gene annotation [8]. The 2-pass mapping was performed for each sample separately.

STAR

will perform the 1st pass mapping and then automatically extract junctions and insert them into the genome index, and, finally, re-align all reads in the 2nd mapping pass [ref]. The BAM files generated from the alignment step were sorted using

samtools (v1.9)

[9]. The

featureCounts

program [10] in the

Subread (2.0.3)

package was used for gene expression quantification to generate the count matrix. Only read pairs that were uniquely mapped to the exonic regions were quantified. The count data was finally normalized using the Trimmed Mean of Mvalues (TMM) method implemented in the R package

edgeR

, and lowly-expressed genes with logCPM < 0 in more than 50% of samples were filtered out prior to any downstream analysis. To generate a more comprehensive QC report, we used the

rnaseqc (v2.4.2)

tool [11] to characterize the quality of the RNA, sequencing data, alignment, and expression profile of each sample. Over 70 QC metrics can be computed by

rnaseqc

, including rRNA rate and exonic/intronic/intergenic rate, etc. Finally, an aggregated QC report can be generated using

MultiQC

[12] to display the QC metrics from multiple bioinformatics analyses, including

FastQC

STAR

featureCounts

, and

rnaseqc

All the scipts for SRA RNAseq data processing are publicly available in Github:

https://github.com/rli012/PCaDB

TCGA RNAseq data

The raw sequencing data in TCGA was processed using a similar pipeline (

TCGA mRNA Analysis Pipeline

) as what is being used by PCaDB for SRA RNAseq data processing (described above). The sequencing reads were aligned to the human reference genome

GRCh38

using the 2-pass mapping implemented in

STAR (2.4.2a)

based on

GENCODE (release22)

for gene annotation. The major difference between TCGA workflow and PCaDB workflow is that

HTSeq (0.6.1p1)

[13] was used by TCGA for gene expression quantification, whereas

featureCounts/Subread (2.0.3)

is adopted by PCaDB. The HTSeq-Counts data for The Cancer Genome Atlas Prostate Adenocarcinoma (TCGA-PRAD) project is publicly available in Genomic Data Commons (GDC), which can be downloaded and processed directly by a series of functions in the R package

GDCRNATools

[14] to generate the count matrix. The count data was also normalized using the Trimmed Mean of M values (TMM) method implemented in the R package

edgeR

. Silimarly, lowly-expressed genes with logCPM < 0 in more than 50% of samples were filtered out prior to any downstream analysis.

Below is the script for downloading, preprocessing, and organizing data from TCGA-PRAD.

cBioPortal RNAseq data

Although transcriptomics data for multiple PCa cohorts are included in cBioPortal, the cBioPortal data are usually not used in PCaDB, unless the raw data are not available in any of the other data repositories, because only the normalized RNAseq data are provided in cBioPortal, which may not be suitable for some of the downstream analyses. e.g., the FPKM/RPKM values are not recommended for differential expression analysis.

For the 6 cBioPortal RNAseq datasets included in PCaDB, normalized gene expression data (FPKM/RPKM values) were downloaded directly from cBioPortal. Log2 transformation may be performed if it hadn't been done and the gene symbols/entrez ids were all converted to Ensembl IDs. If multiple genes matched to the same Ensembl ID, only the most informative one with the maximum interquartile range (IQR) for the gene expression is used for this Ensembl ID.

A custom script below is used to curate cBioPortal data.

GEO/ArrayExpress microarray data

Among the 38 datasets generated using the Affymetrix arrays (e.g., Affymetrix Human Exon 1.0 ST Array, Affymetrix Human Gene 2.0 ST Array, Affymetrix Human Genome U133A Array, and Affymetrix Human Genome U133 Plus 2.0 Array, etc.), 37 of them with the raw CEL files were reprocessed by the PCaDB pipeline. The R package

GEOquery

[15] was used to download the CEL files and the Robust Multichip Average (RMA) method implemented in the R package

oligo

was used for background correction, quantile normalization, and log2 transformation. The annotation package downloaded from the

Brainarray database

(

Version 24.0.0; GENCODE release 32

) was used for probe/gene annotation [16].

The only Affymetrix microarray dataset that was not reprocessed by our PCaDB pipeline is the CPC-Gene dataset (GSE107299). The reason was because that this dataset was generated based on two different Affymetrix arrays: Affymetrix Human Gene 2.0 ST Array (Batch 1,3,4,5) and Affymetrix Human Transcriptome Array 2.0 (Batch 2) with a batch effect. Therefore, the batch-corrected normalized data was downloaded directly from GEO. Likewise, the normalized intensities for the 15 microarray datasets generated using other technologies (e.g., Illumina HumanHT-12 V3.0 Expression Beadchip and Agilent-012391 Whole Human Genome Oligo Microarray G4112A, etc.) without raw data in GEO were downloaded directly from the repository using

GEOquery

. The methods for data processing described in the original papers were carefully inspected to make sure that appropriate processing methods had been used. Log2 transformation may be performed on the normalized data if it hadn't been done and the probe/gene ids were all converted to Ensembl IDs. If multiple probes/genes matched to the same Ensembl ID, only the most informative one with the maximum interquartile range (IQR) for the gene expression is used for this Ensembl ID.

Sample metadata model

Harmonization of sample metadata is very critical for the management and downstream analysis of omics data. In PCaDB, a comprehensive metadata model is built to harmonize the most important sample and clinical information across the datasets. A total of 33 key fields/traits are included:

sample_id

patient_id

tissue

batch

platform

sample_type

age_at_diagnosis

ethnicity

race

clinical_stage

clinical_t_stage

clinical_n_stage

clinical_m_stage

pathological_stage

pathological_t_stage

pathological_n_stage

pathological_m_stage

preop_psa

gleason_primary_pattern

gleason_secondary_pattern

gleason_tertiary_pattern

gleason_group

gleason_score

time_to_death

os_status

time_to_bcr

bcr_status

time_to_metastasis

metastasis_status

risk_group

treatment

additional_treatment

, and

pcadb_group

The

pcadb_group

column in each dataset was manually created with the group information that are of the most interests to end users depending on the availability, such as normal, primary tumor, and metastatic castration-resistant prostate cancer (CRPC); pre-androgen-deprivation therapy (ADT) and post-ADT samples; samples from the peripheral zone and the transcription zone of the prostate; samples from the European Americans and African Americans, etc.

The phenotypic values were also standardized, especially for the sample type field. For example,

tumor, tumour, primary, localized,

etc. were all labeled as

Primary

, whereas

normal, adjacent, benign,

etc. were all labeled as

Normal

The pipeline for metadata harmonization is very similar for data from different repositories. The differences are just the ways to download and retrieve the data.

For sample metadata from the TCGA-PRAD project, the XML files with the clinical information of the patients were downloaded and organized using the R package

GDCRNATools

. Some clinical features, such as preoperative prostate-specific antigen (PSA) level, which were not available in the Genomic Data Commons (GDC) data portal were retrieved from Broad GDAC Firehose (https://gdac. broadinstitute.org/).

The phenotype data of the datasets in GEO was retrieved using our in-house web tool -

WebGEO

(under development) based on the R package

GEOquery

, which allows querying and downloading the sample metadata of a dataset in seconds. In case users are interested in downloading phenotype data programmatically using

GEOquery

, the script is also provided below

Most of the SRA datasets (14 out of 17) also have corresponding GEO accessions, thus the sample metadata for these datasets were also retrieved from GEO using

WebGEO

. For the 3 remaining SRA datasets that do not have corresponding GEO accessions, the metadata (SraRunTable) were downloaded directly from SRA.

Most of the SRA datasets (14 out of 17) also have corresponding GEO accessions, thus the sample metadata for these datasets were also retreived from GEO using

WebGEO

. For the 3 remaining SRA datasets that do not have corresponding GEO accessions, the metadata (SraRunTable) were downloaded directly from SRA.

Clinical data for the 6 cBioPortal RNAseq datasets were manually downloaded from the portal, and sample metadata (.sdrf.txt file) for the 4 ArrayExpress datasets were downloaded using

wget.

In addition, the original publications were carefully reviewed to further ensure the accuracy and comprehensiveness of the sample metadata for each dataset in PCaDB.

Single-cell RNAseq data

The Seurat object of the single-cell RNAseq data can be directly downloaded from

GUDMAP

. The normalized gene expression matrix, cell type annotation, as well as the TSNE and UMAP coordiantes were included in an R list object in the PCaDB database.

For details about the scRNAseq data analysis, such as data processing, data normalization, tSNE and UMAP analyses, etc., please refer to the original study [3].

Gene annotation

Gene annotation information were downloaded from the official websites/FTPs of HGNC [17], Ensembl [18], NCBI [19], and GENCODE [8], and a comprehensive pipeline was created to map the gene IDs from different resources to facilitate the query of individual genes and the cross-dataset anlaysis. The harmonized gene annotation data has been deposited in PCaDB and can be easily downloaded on the

Download

page (PCaDB_Gene_Annotation.RDS).

References

[1] Robinson, M.D., McCarthy, D.J. and Smyth, G.K. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139-140.

[2] Carvalho, B.S. and Irizarry, R.A. (2010). A framework for oligonucleotide microarray preprocessing. Bioinformatics, 23(14), 1846-1847.

[3] Henry, G.H., Malewska, A., Joseph, D.B., Malladi, V.S., Lee, J., Torrealba, J., Mauck, R.J., Gahan, J.C., Raj, G.V., Roehrborn, C.G., Hon, G.C., MacConmara, M.P., Reese, J.C., Hutchinson, R.C., Vezina, C.M. and Strand, D.W. (2018). A Cellular Anatomy of the Normal Adult Human Prostate and Prostatic Urethra. Cell reports, 25(12), 3530-3542.

[4] Li, R., Zhu, J., Zhong, W.-D., and Jia, Z. (2021) Comprehensive evaluation of machine learning models and gene expression signatures for prostate cancer prognosis using large population cohorts. bioRxiv. https://doi.org/10.1101/2021.07.02.450975

[5] SRA Toolkit Development Team. Available online at: https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software

[6] Andrews, S. (2010). FastQC: a quality control tool for high throughput sequence data. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc

[7] Dobin, A., Davis, C.A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut, P., Chaisson, M. and Gingeras, T.R., (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29(1), pp.15-21.

[8] Frankish, A., Diekhans, M., Jungreis, I., Lagarde, J., Loveland, J.E., Mudge, J.M., Sisu, C., Wright, J.C., Armstrong, J., Barnes, I. and Berry, A., (2021). GENCODE 2021. Nucleic acids research, 49(D1), pp.D916-D923.

[9] Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G. and Durbin, R., (2009). The sequence alignment/map format and SAMtools. Bioinformatics, 25(16), pp.2078-2079.

[10] Liao, Y., Smyth, G.K. and Shi, W., (2014). featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics, 30(7), pp.923-930.

[11] Graubert, A., Aguet, F., Ravi, A., Ardlie, K.G. and Getz, G., (2021). RNA-SeQC 2: efficient RNA-seq quality control and quantification for large cohorts. Bioinformatics.

[12] Ewels, P., Magnusson, M., Lundin, S. and Käller, M., (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 32(19), pp.3047-3048.

[13] Anders, S., Pyl, P.T. and Huber, W., (2015). HTSeq—a Python framework to work with high-throughput sequencing data. bioinformatics, 31(2), pp.166-169.

[14] Li, R., Qu, H., Wang, S., Wei, J., Zhang, L., Ma, R., Lu, J., Zhu, J., Zhong, W.-D. and Jia, Z. (2018). GDCRNATools: an R/Bioconductor package for integrative analysis of lncRNA, miRNA and mRNA data in GDC. Bioinformatics, 34(14), 2515-2517.

[15] Davis, S. and Meltzer, P.S. (2007). GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor. Bioinformatics, 23(14), 1846-1847.

[16] Dai, M., Wang, P., Boyd, A.D., Kostov, G., Athey, B., Jones, E.G., Bunney, W.E., Myers, R.M., Speed, T.P., Akil, H. and Watson, S.J., (2005). Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. Nucleic acids research, 33(20), pp.e175-e175.

[17] Tweedie, S., Braschi, B., Gray, K., Jones, T.E., Seal, R.L., Yates, B. and Bruford, E.A., (2021). Genenames. org: the HGNC and VGNC resources in 2021. Nucleic Acids Research, 49(D1), pp.D939-D946.

[18] Howe, K.L., Achuthan, P., Allen, J., Allen, J., Alvarez-Jarreta, J., Amode, M.R., Armean, I.M., Azov, A.G., Bennett, R., Bhai, J. and Billis, K., (2021). Ensembl 2021. Nucleic acids research, 49(D1), pp.D884-D891.

[19] Sayers, E.W., Beck, J., Bolton, E.E., Bourexis, D., Brister, J.R., Canese, K., Comeau, D.C., Funk, K., Kim, S., Klimke, W. and Marchler-Bauer, A., (2021). Database resources of the national center for biotechnology information. Nucleic acids research, 49(D1), p.D10.

Download

Transcriptome Datasets

Summary

PCaDB_Transcriptome_Datasets_Summary.xlsx

ExpressionSet

TCGA-PRAD

CPC-Gene/GSE107299

Taylor/GSE21034

DKFZ/EGAS00001002923

GSE54460/SRP036848

Cambridge/GSE70768

Stockholm/GSE70769

CancerMap/GSE94767

CIT/E-MTAB-6128

Belfast/GSE116918

GSE25136

GSE41408

GSE46691

GSE51066

GSE37199

GSE44353

GSE59745

GSE79021

GSE3933-GPL2695

GSE35988-GPL6480

GSE35988-GPL6848

SU2C-PCF-2019-Capture

SU2C-PCF-2019-PolyA

GSE62116

GSE62667

GSE79957

GSE79956

GSE79958

GSE72291

GSE79915

GSE3325

GSE32269

GSE6919-GPL8300

GSE6919-GPL92

GSE6919-GPL93

GSE6752

GSE17951

GSE8218

E-TABM-26-U133A

E-TABM-26-U133B

GSE5132-GPL3834

GSE29079

GSE97284

GSE85698

GSE2109

GSE62872

SMMU

GSE77930

Neuroendocrine

Broad-Cornell

GSE111177/SRP133573

GSE120741/SRP163173

ERP006077

SRP119917

GSE104131/SRP118614

GSE133626/SRP212704

GSE80609/SRP073789

GSE118435/SRP157215

SRP151104

GSE22260/SRP002628

GSE114740/SRP148500

E-MTAB-5021/ERP017433

GSE95369/SRP100706

GSE48403/SRP026387

GSE51005/SRP030027

GSE126078/SRP183532

GSE12378

GSE2443

GSE26910

GSE28680

GSE29650

GSE30521

GSE32448

GSE32571

GSE6956

GSE7055

GSE38241

Prognostic Signatures

Summary

PCaDB_Prognostic_Signatures_Summary.xlsx

Gene List

PCaDB_Prognostic_Signatures_Gene_List.xlsx

Gene Annotation

PCaDB_Gene_Annotation.RDS

Tutorial

Details about the tutorial

Query a gene of interest

Prognostic signature evaluation

Transcriptomics data analysis

Data download

Query a gene of interest

1.1. Overview

Users can query a gene of interest by typing the Ensembl ID, Entrez ID, or HGNC approved symbol and alias symbol in the 'Search a gene' field and selecting the gene from the dropdown list. The general information about the gene and some useful external links to the databases such as ENSEMBL, HGNC, and NCBI for more detailed description of the gene, Genotype-Tissue Expression (GTEx) and Human Protein Atlas (HPA) for the gene expression pattern in different human tissues, and Kyoto Encyclopedia of Genes and Genomes (KEGG) for the pathways that the gene involves in are provided (Figure 1-1).

A suite of advanced analyses and visualizations can be interactively performed for the selected gene, including:

(1) Gene expression in different sample types

(2) Survival analysis of relapse-free survival (RFS)

(3) Gene expression analysis at the single-cell level

Figure 1-1. Query a gene of interest

1.2. Gene expression in different sample types

Users can select a dataset from the dropdown list and a boxplot will be generated to visualize the gene expression levels in different samples, such as primary tumor, tumor-adjacent normal, or metastatic tumor samples, depending on the availability of the data in the selected dataset (Figure 1-2)

The pipelines used for gene expression data normalization have been described in the 'PCaDB Pipeline' tab. In general, (i) for RNAseq data from TCGA and SRA, the counts per million mapped reads (CPM) vaules are used; (ii) for microarray data from GEO and ArrayExpress, the normalized intensity values are used; and (iii) for the normalized RNAseq data downloaded directly from cBioPortal, the fragments/reads per kilobase of transcript per million mapped reads (FPKM/RPKM) values are used. All the gene expression values are in log2 scale. In PCaDB, the Ensembl gene identifiers are used for all the datasets. If multiple probes/genes matched to the same Ensembl ID, only the most informative one with the maximum interquartile range (IQR) for the gene expression is used for this Ensembl ID.

Figure 1-2. Gene expression in different sample types

1.3. Survival analysis of relapse-free survival (RFS)

Kaplan Meier (KM) survival analysis of relapse-free survival (RFS) for the selected gene can be performed in the 10 datasets with 1,558 primary tumor samples from PCa patients with the data of biochemical recurrence (BCR) status and follow up time after definitive treatment.

The patients in each dataset are dichotomized into low- and high-expression groups based on the median expression value of the selected gene. KM analysis is performed using the R package

survival

to estimate the hazard ratio (HR) and 95% confidence intervals (CIs). The log-rank test is used to compare the two survival curves and calculate the p value.

A forest plot with the result from the survival analysis, including the numbers of samples, hazard ratios (HRs), 95% confidence intervals (CIs), and p values across all the datasets, will be generated. The KM survival curve for each dataset will also be plotted (Figure 1-3).

Figure 1-3. Forest plot and KM survival curves

1.4. Gene expression analysis at the single-cell level

The expression pattern of the selected gene in different cell types from normal human prostate, including basal, luminal, neuroendocrine (NE), club, and hillock epithelia, endothelia, leukocyte, fibroblast, and smooth muscle, can be visualized using the uniform manifold approximation and projection (UMAP) plot and t-distributed stochastic neighbor embedding (t-SNE) plot, violin plot, and bubble plot with the average gene expression and percent of cells expressed for each cell type (Figure 1-4). The data used in this module was downloaded directly from GUDMAP (https://www.gudmap.org/chaise/record/#2/RNASeq:Study/RID=W-RAHW).

For details about the scRNAseq data analysis, such as data processing, data normalization, tSNE and UMAP analyses, etc., please refer to the orignial study:

Henry, Gervaise H., et al. A cellular anatomy of the normal adult human prostate and prostatic urethra. Cell reports 25.12 (2018): 3530-3542

Figure 1-4. tSNE plot, UMAP plot, bubble plot, and violin plot to visualize the gene expression pattern at the single-cell level

Prognostic signature evaluation

2.1. Overview

A comprehensive evaluation of the prognostic performances of 30 published signatures was performed in a previous study and we included all those signatures in the PCaDB database, allowing for a more detailed characterization of the signatures (Figure 2-1), including:

(1) List of signature genes

(2) Differential expression analysis of the signature genes

(3) KM Survival analysis of RFS for the signature genes

(4) Pathway analysis of the signature genes

(5) Evaluation of the performances of the prognostic signatures

Figure 2-1. Overview of modules for prognostic signature evaluation

2.2. List of signature genes

The list of genes in a prognositc signature can be viewed by selecting the signature from the dropdown list. A comprehensive pipeline was created to convert the gene IDs/symbols reported in the orignial publication to the Ensembl IDs and HGNC gene symbols. We also did a manual inspection very carefully to double check the ambiguous signature genes.

Figure 2-2. Th list of genes in the selected signature

2.3. Differential expression analysis of the signature genes

The DE analysis of all signature genes between primary tumor and tumor-adjacent normal samples can be performed using the R package

limma

in multiple datasets with certain sample types. The normalized gene expression data in log2 scale is used as input. i.e., for RNAseq data from TCGA and SRA, the raw count data is normalized using the Trimmed Mean of Mvalues (TMM) method implemented in the R package

edgeR

, followed by

voom

transformation using

limma

, which is more powerful especially when the library sizes are quite variable between samples. Lowly-expressed genes (logCPM < 0 in more than 50% of samples) are filtered out before any downstream analysis. A linear model will then be fitted to estimate the fold changes and standard errors for each gene, followed by empirical Bayes smoothing. The differential expression analysis for Affymetrix microarray data is very similar to that for RNAseq data, except that the Robust Multichip Average (RMA) normalized intensity in log2 scale is fitted to the linear model. For some other datasets, the normalized data provided by the original studies are used as input for differential expression analysis.

A volcano plot will be generated to visualize the DE analysis result. Only differentially expressed genes (DEGs) in >= 3 signatures are labeled in the volcano plot. A data table will also be created to list the DEGs. The log2(fold change), BH-adjusted p value (FDR), the number of signatures, and the names of the signatures are reported in the table (Figure 2-3).

Figure 2-3. Differential expression analysis of the signature genes

2.4. KM Survival analysis of RFS for the signature genes

Kaplan Meier (KM) survival analysis of relapse-free survival (RFS) for the signature genes can be performed in the 10 datasets with 1,558 primary tumor samples from PCa patients with the data of biochemical recurrence (BCR) status and follow up time after definitive treatment.

The patients in each dataset are dichotomized into low- and high-expression groups based on the median expression value of the signature gene. KM analysis is performed using the R package

survival

to estimate the hazard ratio (HR) and 95% confidence intervals (CIs). The log-rank test is used to compare the two survival curves and calculate the p value.

A forest plot for the common genes in 3 or more signatures and a data table with the survival analysis result for all the signature genes are generated. In the data table, the hazard ratios (HRs), 95% confidence intervals (CIs), nominal p values, and BH-adjusted p values (FDRs), as well as the number of signatures and names of the signatures are reported (Figure 2-4).

Figure 2-4. KM Survival analysis of RFS for the signature genes

2.5. Pathway analysis of the signature genes

Functional enrichment analysis of the signature genes can be performed using the R package

clusterProfiler

in PCaDB based on hypergeometric test. Users can select a gene list from the dropdown list for the enrichment analysis. A gene list could consist of genes in an individual signature, common genes in at least 2, 3, or 4 signatures, and all the 1,042 signature genes. Pathways or Gene sets with < 10 genes or > 500 genes are excluded. Benjamini-Hochberg (BH) is used for multiple testing correction.

PCaDB supports functional enrichment analysis with many pathway/ontology knowledgebases, including:

(1) KEGG: Kyoto Encyclopedia of Genes and Genomes

(2) REACTOME

(3) DO: Disease Ontology

(4) NCG: Network of Cancer Gene

(5) DisGeNET

(6) GO-BP: Gene Ontology (Biological Process)

(7) GO-CC: Gene Ontology (Cellular Component)

(8) GO-MF: Gene Ontology (Molecular Function)

(9) MSigDB-H: Molecular Signatures Database (Hallmark)

(10) MSigDB-C4: Molecular Signatures Database (CGN: Cancer Gene Neighborhoods)

(11) MSigDB-C4: Molecular Signatures Database (CM: Cancer Modules)

(12) MSigDB-C6: Molecular Signatures Database (C6: Oncogenic Signature Gene Sets)

A data table is produced to summarize the significantly enriched pathways/ontologies in descending order based on their significance levels, as well as the number and proportion of enriched genes and the gene symbols in each pathway/ontology term.

'Count'

is the number of genes shared between the input gene list and pathway/gene set;

'List Total'

is the total number of genes in the input gene list;

'Pop Hits'

is the number of genes in the pathway/gene set; and

'Pop Total'

is the total number of genes in the pathway/gene set database, e.g., KEGG, GO, and REACTOME, etc.

The top enriched pathways/ontologies (based on the BH-adjusted p value) are visualized using both bar plot and bubble plot.

Figure 2-5. Pathway analysis of the signature genes

2.6. Evaluation of the performances of the prognostic signatures

A comprehensive evaluation of the performances of the published prognostic signatures can be performed in PCaDB using different survival analysis algorithms and different training and test datasets.

If a given signature is selected, a prognostic model can be developed using the expression data of the signature genes in the selected training dataset and the selected survival analysis method. The risk score of each patient in the test datasets will be computed based on the model.

For the intra-dataset comparison, 10-fold CV was used for model evaluation, whereas for the inter-dataset comparison, one dataset was treated as the training set and the other datasets were used as the test datasets at a time.

Three metrics including the concordance index (C-index), time-dependent receiver operating characteristics (ROC) curve, and hazard ratio (HR) estimated by the KM survival analysis are used to assess the prognostic power for the signature based on the independent test cohorts. Forest plots are used to visualize the results, while data tables with more detailed results are also provided.

For details about the machine learning algorithms, prognostic model training and testing, as well as the calculation of C index, AUC, and HR, etc., please refer to the orignial study:

Li R, Zhu J, Zhong W-D, Jia Z. Comprehensive evaluation of machine learning models and gene expression signatures for prostate cancer prognosis using large population cohorts. bioRxiv. 2021 (https://doi.org/10.1101/2021.07.02.450975)

Figure 2-6. Evaluation of the prognostic performances of an individual signature

Users can also select ‘All Signatures’ from the dropdown list to perform a comprehensive analysis to compare and rank the prognostic abiligies of all the signatures in each individual test set based on the three metrics.

Figure 2-7. Evaluation of the prognostic performances of all the signatures

Transcriptomics data analysis

3.1. Overview

More advanced and comprehensive analyses can be performed at the whole-transcriptome level in PCaDB, allowing users to identify DEGs associated with tumor initiation and progression, identify biomarkers associated with clinical outcomes (i.e., BCR), as well as develop and validate gene expression-based signatures and models for PCa prognosis (Figure 3-1). The whole-transcriptome level analysis includes:

(1) Summary of the dataset

(2) Dimensionality reduction

(3) Differential gene expression analysis

(4) KM and CoxPH survival analyses of RFS (only for the first 10 datasets with the information of time to BCR)

(5) Development and validation of a new prognostic model (only for the first 10 datasets with the information of time to BCR)

Figure 3-1. Overview of whole-transcriptome data analysis

3.2. Summary of the dataset

When a transcriptomics dataset is selected, the summary of the dataset including platform, data processing pipeline, and the available metadata, such as sample type, preoperative PSA, Gleason score, BCR status, and time to BCR, will be displayed automatically (Figure 3-2).

Figure 3-2. Summary of the dataset

3.3. Dimensionality reduction

Principal component analysis (PCA) can be performed using the whole transcriptome profiling (lowly-expressed genes with logCPM < 0 in 50% of samples are filtered out for RNAseq data) in the selected dataset. The normalized gene expression data is rescaled by z-score standardization and the

prcomp

function in R is used for PCA analysis. The 2D or 3D interactive plot based on the first two or three principal components, respectively, will be generated for visualization (Figure 3-3).

Figure 3-3. Principal component analysis

3.4. Differential gene expression analysis

PCaDB has included transcriptomics data from a wide spectrum of sample types, which allows differential expression analysis between many different types of samples, such as the comparisons between prostate tissue from healthy donor, tumor-adjacent normal tissue, primary tumor and advanced tumor, non-metastatic castration-resistant prostate cancer (CRPC) vs. metastatic CRPC, CRPC at different sites (lymph node, bone, liver, lung, etc.), perineural invasion (PNI) tumor vs. non-PNI tumor, normal and tumor samples between African Americans vs. European Americans, tumor samples collected pre-androgen-deprivation therapy (ADT) vs. post-ADT, ADT responsive vs. castration resistant, etc.

The R package

limma

is used to identify DEGs in PCaDB. The normalized gene expression data in log2 scale is used as input. i.e., for RNAseq data from TCGA and SRA, the raw count data is normalized using the Trimmed Mean of Mvalues (TMM) method implemented in the R package

edgeR

, followed by

voom

transformation using

limma

, which is more powerful especially when the library sizes are quite variable between samples. Lowly-expressed genes (logCPM < 0 in more than 50% of samples) are filtered out before any downstream analysis. A linear model will then be fitted to estimate the fold changes and standard errors for each gene, followed by empirical Bayes smoothing. The differential expression analysis for Affymetrix microarray data is very similar to that for RNAseq data, except that the Robust Multichip Average (RMA) normalized intensity in log2 scale is fitted to the linear model. For some other datasets, the normalized data provided by the originial studies are used as input for differential expression analysis. Benjamini-Hochberg (BH) is used for multiple testing correction.

A volcano plot will be generated to visualize the DE analysis result. By default, BH-adjusted p value (FDR) < 0.01 and absolute fold change > 2 are used as the thresholds for determining DEGs. A data table will also be created to list the DE analysis result. The ensembl id, log2(fold change), average expression across all the samples, t statistics, nornimal p value, BH-adjusted p value (FDR), B statistics, gene expression regulation (Up: significantly up-regulated; Down: significantly down-regulated; NS: not significant), and gene symbol are reported in the table (Figure 3-4).

Figure 3-4. Identification of differentially expressed genes

3.5. CoxPH and KM survival analyses of RFS

Both the univariate Cox Proportional-Hazards (CoxPH) and Kaplan Meier (KM) survival analyses of relapse-free survival (RFS) can be performed at the whole-transcriptome level to identify biomarkers associated with clinical outcome of PCa in a selected dataset of interest (Figure 3-5). Note that the survival analysis can only be done in the first 10 datasets that have the BCR data.

For CoxPH regression analysis, the continuous gene expression data, i.e., normalized expression value (in log2 scale) and the censored time to BCR data are modeled to identify genes that are significantly asspciated with RFS, which may be potentially used as prognostic markers. CoxPH analysis is performed using the R package

survival

. The hazard ratio (HR), 95% confidence intervals (CIs), and p value from log-rank test for each gene are all reported in a data table (Figure 3-5).

For KM survival analysis, the patients in each dataset are dichotomized into low- and high-expression groups based on the median expression value of the signature gene. KM analysis is also performed using the R package

survival

to estimate the hazard ratio (HR) and 95% confidence intervals (CIs). The log-rank test is used to compare the two survival curves and calculate the p value. The KM survival analysis result is also reported in a data table (Figure 3-5).

Figure 3-5. Univariate CoxPH and KM survival analyses

Multivariate Cox Proportional-Hazards (CoxPH) survival analysis of relapse-free survival (RFS) can also be performed to identify biomarkers associated with BCR by adjusting for some of the most important clinical factors, including age at diagnosis, preoperative PSA, Gleason score, and clinical T stage. The clinical factors that are included in the multivariate CoxPH model for each dataset merely depend on the availablity of the data, which has been summarized in a table below (Figure 3-6). Please be noted that not all the samples in the selected dataset are used in the multivariate CoxPH survival analysis because of the existing of missing data for the four cliical factors.

Figure 3-6. Multivariate CoxPH survival analysis

3.6. Development and validation of a new prognostic model

In PCaDB, an advanced functional module is provided, allowing users to provide a list of genes (e.g., genes that are significantly associated with the endpoint based on univiarate and/or multivariate CoxPH survival analysis), and select a survival analysis method, including multivariate CoxPH, Cox model regularized with lasso penalty (Cox-Lasso), or ridge penalty (Cox-Ridge), to develop and independently validate a prognostic model using the 10 datasets with the time to biochemical recurrence (BCR) data.

The prognostic model is trained using the selected dataset based on the selected survival analysis method. The R package

survival

is used to build the CoxPH model, and the R package

glmnet

is adapted to build the Cox-Lasso and Cox-Ridge models with the penalties alpha (α) = 1 and alpha (α) = 0, respectively. Cox-Lasso uses L1 regularization technique, which adds “absolute value of magnitude” of coefficient as penalty term to the loss function, whereas Cox-Ridge uses L2 regularization technique that adds “quared magnitude” of coefficient as penalty term to the loss function. The major difference between Cox-Ridge and Cox-Lasso regression is that Cox-Lasso tends to shrink coefficients of the insignificant features to absolute zero, while Cox-Ridge never sets the coefficients to absolute zero. When training a Cox-Lasso or Cox-Ridge model, the tuning parameter lambda (λ), which controls the overall strength of the L1 or L2 penalty, is determined based on a built-in 10-fold cross-validation. A maximum of 1000 iterations is implemented over the data for all lambda (λ) values.

The risk score for a patient in the training set is calcualted as a linear combination of the gene expression valuesand coefficients of the signature genes, i.e., risk score = β1*E1 + β2*E2 + ... + βn*En, where β1 to βn are the coefficients of the signature genes and E1 to En are the normalized gene expression values in log2 scale. The median value of the risk scores for all patients is used as the threshold to dichotomize these patients into low- and high-risk groups. A KM plot is generated to show the prognostic performance of the signature in the training set. The R package

survival

is used to estimate the hazard ratio (HR) and 95% confidence intervals (CIs). The log-rank test is performed to compare the two survival curves and calculate the p value. Theoretically, any prognostic models should be significantly associated with the endpoint in the training dataset. Therefore, it's very criticial to perform external validations of the prognostic models in multiple independent cohorts. In PCaDB, the risk scores are calculated for patients in each of the nine remaining datasets with RFS data, to validate the prognostic model, and a forest plot is generated to show the HRs, 95% CIs, and log rank test p values from KM analyses in the nine validation datasets (Figure 3-7).

* Please be noted that only genes that are present in the training dataset are used for prognostic model development. If a gene in the final model has not been quantified in the validation dataset, the expression value would be set to zero.

Figure 3-7. Development and validation of the user-provided prognostic signature

Data Download

4.1. Download processed data

All the processed data deposited in PCaDB, including the

ExpresionSet

of a transcriptomics data in the .RDS format, prognostic signatures in the .xlsx format, and gene annotation in the .RDS format can be donwloaded directly by clicking the link to the data on the Download page (Figure 4-1).

Figure 4-1. Download processed gene expression data, sample metadata, and gene annotation data

For example, suppose the .RDS file for the TCGA-PRAD dataset has been downloaded, the gene expression data and sample metadata can be easily retrieved by running the following command in R (Figure 4-2).

Figure 4-2. Retrieve gene expression data and sample metadata from the eSet object

4.2. Export data analysis & visualization results

We provide two download buttons (

PDF

and

CSV

) under each figure generated by PCaDB, allowing the users to download the high-resolution publication-quality vector image in PDF format and the data that is used to generate the figure in CSV format.

Figure 4-3. Download the publication-quality vector image in PDF format

In this example, the expression data of the selected gene

KLK3

for the samples in the selected dataset GSE6919-GPL8300 will be saved to the CSV file

ENSG00000142515_KLK3_GSE6919-GPL8300_Expression.csv

by clicking the download button (Figure4-4).

Figure 4-4. Download the data that is used to generate the figure to a CSV file

All the data tables generated for the data analysis outputs are also exportable in

CSV

EXCEL

formats. The output can also be copied to the clipboard.

Figure 4-5. Download the data table for the data analysis output

Contact Us

Please feel free to contact us if you have any questions on the database or need any help to curate data for cancer studies.

Ruidong Li, Ph.D., Research Bioinformatics Scientist

Email: rli012@ucr.edu

Github Google Scholar