Reproducible Research

A Three-Gene Model to Robustly Identify Breast Cancer Molecular Subtypes

Benjamin Haibe-Kains, Christine Desmedt, Sherene Loi, Aedin C Culhane, Gianluca Bontempi, John Quackenbush, Christos Sotiriou

To ensure full reproducibility, this work complies with the guidelines proposed by Robert Gentleman [1] regarding the availability of the code and the reproducibility of results and figures.

The analysis is divided into two parts: concordance of subtype classifications and robustness of subtype classifiers. Both analyses are implemented as Sweave (more precisely pgfSweave) scripts that you can run from an R session using the function pgfSweave(...) or from a terminal using the shell script. Both scripts use the same compendium of microarray datasets.
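As a minimal sketch of the first option, the helper below runs one of the scripts from an R session. The name run_analysis is hypothetical (it is not part of the original scripts), and it assumes the pgfSweave package is installed and the .Rnw file is in the working directory; if either is missing it reports the problem and returns FALSE instead of failing.

```r
# Hypothetical helper (not part of the original scripts): run a pgfSweave
# script from an R session if both the package and the file are available.
run_analysis <- function(rnw) {
  if (!file.exists(rnw)) {
    message("Script not found: ", rnw)
    return(invisible(FALSE))
  }
  if (!requireNamespace("pgfSweave", quietly = TRUE)) {
    message("Package 'pgfSweave' is not installed.")
    return(invisible(FALSE))
  }
  # Weave the script; see the package documentation for optional arguments.
  pgfSweave::pgfSweave(rnw)
  invisible(TRUE)
}

# Run the two analyses in turn (file names as given in the sections below).
run_analysis("sbtidentification_sweave.Rnw")
run_analysis("sbtrobustness_sweave.Rnw")
```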

To run all the scripts, make sure you uncompress all the archives in the same working directory.


The 36 microarray datasets are provided in the form of R workspaces (*.RData files), each containing three objects:

  • data: Gene expression data matrix with tumors in rows and probes in columns. The datasets have been retrieved from public databases (GEO or ArrayExpress) or authors' websites and we extracted the normalized data (log2 intensity in single-channel platforms or log2 ratio in dual-channel platforms) as described in the original studies.
  • annot: Annotation information matrix of the microarray platform retrieved from public databases (GEO or ArrayExpress) or manufacturers' websites.
  • demo: Clinical information about the breast tumors:
    • samplename: tumor identifiers (string)
    • dataset: name of the dataset (string)
    • series: cohort of patients or batch, if any; name of the dataset otherwise (string)
    • id: identifier used in the original publication (string)
    • er: estrogen receptor status by IHC (binary value 0/1)
    • pgr: progesterone receptor status by IHC (binary value 0/1)
    • her2: human epidermal growth factor receptor 2 status by IHC/FISH (binary value 0/1)
    • size: tumor size in cm (continuous)
    • node: nodal status (binary value 0/1)
    • age: age at diagnosis (continuous)
    • grade: histological grade (ternary value 1/2/3)
    • e/t.dmfs: event and time for distant metastasis-free survival
    • e/t.rfs: event and time for relapse-free survival
    • e/t.os: event and time for overall survival
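The layout described above can be checked before running the scripts. The following is a minimal sketch, not part of the original code: check_workspace is a hypothetical helper, and it assumes that tumors index the rows of both data and demo and that probes index the columns of data and the rows of annot. A toy workspace with made-up values stands in for a real dataset.

```r
# Hypothetical helper: load one workspace into its own environment and
# check that it has the documented layout.
check_workspace <- function(path) {
  env <- new.env()
  load(path, envir = env)
  # The three documented objects must be present.
  stopifnot(all(c("data", "annot", "demo") %in% ls(env)))
  # Assumption: tumors are rows of 'data' and 'demo';
  # probes are columns of 'data' and rows of 'annot'.
  stopifnot(nrow(env$data) == nrow(env$demo),
            ncol(env$data) == nrow(env$annot))
  list(n_tumors = nrow(env$data), n_probes = ncol(env$data))
}

# Toy workspace standing in for a real dataset (made-up values).
data  <- matrix(rnorm(6), nrow = 2,
                dimnames = list(c("t1", "t2"), c("p1", "p2", "p3")))
annot <- matrix(c("p1", "p2", "p3"), ncol = 1,
                dimnames = list(NULL, "probe"))
demo  <- data.frame(samplename = c("t1", "t2"), er = c(1, 0))

toy_file <- tempfile(fileext = ".RData")
save(data, annot, demo, file = toy_file)
toy_check <- check_workspace(toy_file)
```

Loading each workspace into a separate environment avoids overwriting the data, annot, and demo objects of a previously loaded dataset.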

The workspace DDB.RData contains an object called ddb that is a matrix of information related to the datasets.

All the datasets are included in the archive. After uncompressing this archive, make sure that all the R workspaces are stored in a directory called data.

Concordance of subtype classifications

The pgfSweave script sbtidentification_sweave.Rnw applies the official gene signatures (intrinsic gene lists, gene modules or prognostic gene signatures) and algorithms (SSP2003, SSP2006, PAM50, SCMOD1, SCMOD2, SCMGENE, ONCOTYPE, MAMMAPRINT, GGI) to all the microarray datasets in order to identify the breast cancer molecular subtype and the risk prediction of each tumor. The concordance and the prognostic value of these subtype/risk classifications are then computed.
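To illustrate what a concordance computation between two subtype classifications can look like, here is a minimal base-R sketch using Cohen's kappa. This is an assumption chosen for illustration: the script may use a different statistic, and the function name cohen_kappa and the toy subtype calls below are made up.

```r
# Illustrative only: Cohen's kappa between two classifications of the same
# tumors. The statistic used in the actual script may differ.
cohen_kappa <- function(a, b) {
  stopifnot(length(a) == length(b))
  # Use a common set of levels so the contingency table is square.
  lev <- union(unique(a), unique(b))
  tab <- table(factor(a, levels = lev), factor(b, levels = lev))
  n  <- sum(tab)
  po <- sum(diag(tab)) / n                       # observed agreement
  pe <- sum(rowSums(tab) * colSums(tab)) / n^2   # agreement expected by chance
  (po - pe) / (1 - pe)
}

# Made-up subtype calls from two hypothetical classifiers.
calls_a <- c("Basal", "HER2", "LumA", "LumB", "LumA", "Basal")
calls_b <- c("Basal", "HER2", "LumA", "LumA", "LumA", "Basal")

kappa_ab <- cohen_kappa(calls_a, calls_b)
kappa_aa <- cohen_kappa(calls_a, calls_a)  # identical calls give kappa = 1
```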

When the script completes, the PDF sbtidentification_sweave.pdf and the directory sbtidentificationres contain all the analysis results (mainly PDF and CSV files); you can download those from here.

The colorbars representing the subtype classifications and the clinical parameters were designed with OmniGraffle from the PDFs generated with the pgfSweave script. The graffle templates are available from here. For instance, you can easily recreate the figure for SCMGENE by dragging and dropping the PDFs concolorbar_sbt_SCMGENE.pdf, concolorbar_comsigs_SCMGENE.pdf, and concolorbar_cinfo_SCMGENE.pdf into the file concordance_colorbars_sbt_clinic_scmgene.graffle.

NB: The gene signatures and algorithms are implemented in the genefu package; functions related to survival analysis are implemented in the survcomp package.

Robustness of subtype classifiers

The Sweave script sbtrobustness_sweave.Rnw assesses the robustness of the classifiers (SSP2003, SSP2006, PAM50, SCMOD1, SCMOD2, SCMGENE) in all the microarray datasets by estimating the prediction strength at the global, cluster, and individual levels.

There are a few parameters you can tune (see section 'parameters' in the Sweave file). None was modified for the paper.

When the script completes, the PDF sbtrobustness_sweave.pdf and the directory sbtrobustnessres contain all the analysis results (mainly PDF and CSV files); you can download those from here.

More details are available about the R project for statistical computing, LaTeX, Sweave and pgfSweave.


[1] Robert Gentleman (2005) Reproducible Research: A Bioinformatics Case Study, Statistical Applications in Genetics and Molecular Biology, 4(1):2.