A Three-Gene Model to Robustly Identify Breast Cancer Molecular Subtypes
Benjamin Haibe-Kains, Christine Desmedt, Sherene Loi, Aedin C Culhane, Gianluca Bontempi, John Quackenbush, Christos Sotiriou
To ensure full reproducibility, this work complies with the guidelines proposed by Robert Gentleman (2005) regarding the availability of the code and the reproducibility of results and figures.
The analysis is divided into two parts: concordance of subtype classifications and robustness of subtype classifiers. Both analyses are implemented as Sweave (more precisely pgfSweave) scripts that you can run from an R session using the function pgfSweave(...) or from a terminal using the shell script mypgfsweave_script.sh. Both scripts use the same compendium of microarray datasets.
To run the scripts, make sure you uncompress all the archives into the same working directory.
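As a minimal sketch of the steps above, the following R code checks that the two Sweave scripts and the data directory sit in the working directory before launching the analyses. The helper missing_inputs is hypothetical (it is not part of the distributed code), and the call pgfSweave::pgfSweave(s) assumes that the pgfSweave package is installed and that its first argument is the path to the .Rnw file.

```r
# Scripts named in this document; "data" is the directory holding the workspaces.
scripts <- c("sbtidentification_sweave.Rnw", "sbtrobustness_sweave.Rnw")

# Hypothetical helper: return the required files/directories that are absent.
missing_inputs <- function(workdir = ".", required = c(scripts, "data")) {
  paths <- file.path(workdir, required)
  required[!file.exists(paths)]
}

absent <- missing_inputs()
if (length(absent) > 0) {
  message("Missing from the working directory: ", paste(absent, collapse = ", "))
} else if (!requireNamespace("pgfSweave", quietly = TRUE)) {
  message("The pgfSweave package is not installed.")
} else {
  # Assumption: pgfSweave() takes the .Rnw file as its first argument.
  for (s in scripts) pgfSweave::pgfSweave(s)
}
```

Checking the inputs first avoids starting a long analysis that would fail partway through because an archive was unpacked elsewhere.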
The 36 microarray datasets are provided in the form of R workspaces (*.RData files) containing three objects:
- Gene expression data matrix with tumors in rows and probes in columns. The datasets were retrieved from public databases (GEO or ArrayExpress) or authors' websites, and we extracted the normalized data (log2 intensity for single-channel platforms or log2 ratio for dual-channel platforms) as described in the original studies.
- annot: Annotation information matrix of the microarray platform retrieved from public databases (GEO or ArrayExpress) or manufacturers' websites.
- demo: Clinical information about the breast tumors:
  - samplename: tumor identifiers (string)
  - dataset: name of the dataset (string)
  - series: cohort of patients or batch, if any; name of the dataset otherwise (string)
  - id: identifier used in the original publication (string)
  - er: estrogen receptor status by IHC (binary value 0/1)
  - pgr: progesterone receptor status by IHC (binary value 0/1)
  - her2: human epidermal growth factor receptor 2 status by IHC/FISH (binary value 0/1)
  - size: tumor size in cm (continuous)
  - node: nodal status (binary value 0/1)
  - age: age at diagnosis (continuous)
  - grade: histological grade (ternary value 1/2/3)
  - e/t.dmfs: event and time for distant metastasis-free survival
  - e/t.rfs: event and time for relapse-free survival
  - e/t.os: event and time for overall survival
The workspace DDB.RData contains an object called ddb that is a matrix of information related to the datasets.
All the datasets are included in the archive data.zip. After uncompressing this archive, make sure that all the R workspaces are stored in a directory called data.
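To illustrate how one of these workspaces might be inspected, here is a short sketch that loads a single .RData file into its own environment and checks that the three expected objects are present. The function load_dataset is hypothetical, and the object name "data" for the expression matrix is an assumption; only annot and demo are named in this document, so adjust the expected argument to whatever the workspace actually contains.

```r
# Hypothetical helper: load one workspace without polluting the global environment.
# Assumption: the expression matrix is stored under the name "data".
load_dataset <- function(path, expected = c("data", "annot", "demo")) {
  env <- new.env()
  loaded <- load(path, envir = env)      # names of the objects that were restored
  absent <- setdiff(expected, loaded)
  if (length(absent) > 0) {
    stop("objects not found in ", basename(path), ": ",
         paste(absent, collapse = ", "))
  }
  mget(expected, envir = env)            # named list of the requested objects
}

# Usage: list the dataset workspaces, skipping DDB.RData (it holds only ddb).
workspaces <- list.files("data", pattern = "\\.RData$", full.names = TRUE)
workspaces <- workspaces[basename(workspaces) != "DDB.RData"]
if (length(workspaces) > 0) {
  ds <- load_dataset(workspaces[1])
  print(dim(ds$data))                             # tumors x probes
  print(table(ds$demo[, "er"], useNA = "ifany"))  # ER status by IHC
}
```

Loading into a fresh environment keeps the objects of different datasets from overwriting one another, since every workspace uses the same object names.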
Concordance of subtype classifications
The pgfSweave script sbtidentification_sweave.Rnw applies the official gene signatures (intrinsic gene lists, gene modules or prognostic gene signatures) and algorithms (SSP2003, SSP2006, PAM50, SCMOD1, SCMOD2, SCMGENE, ONCOTYPE, MAMMAPRINT, GGI) to all the microarray datasets in order to identify the breast cancer molecular subtype and the risk prediction of each tumor. The concordance and the prognostic value of these subtype/risk classifications are then computed.
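As a generic illustration of what measuring concordance between two classifications involves, the sketch below cross-tabulates two label vectors for the same tumors and reports the observed agreement together with Cohen's kappa. This is not necessarily the statistic used in the script; the function agreement and the labels "A" and "B" are hypothetical and chosen only for the example.

```r
# Hypothetical helper: agreement between two classifications of the same tumors.
agreement <- function(a, b) {
  stopifnot(length(a) == length(b))
  # Use a common set of levels so the contingency table is square.
  lev <- sort(unique(c(as.character(a), as.character(b))))
  tab <- table(factor(a, levels = lev), factor(b, levels = lev))
  n <- sum(tab)
  po <- sum(diag(tab)) / n                      # observed agreement
  pe <- sum(rowSums(tab) * colSums(tab)) / n^2  # agreement expected by chance
  list(table = tab, observed = po, kappa = (po - pe) / (1 - pe))
}

# Toy labels standing in for the calls of two classifiers.
calls_1 <- c("A", "A", "B", "B")
calls_2 <- c("A", "B", "A", "B")
res <- agreement(calls_1, calls_2)
print(res$table)
print(res$kappa)
```

Kappa corrects the raw agreement for the agreement expected by chance, which matters when one subtype dominates a dataset.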
When the script completes, it has produced a PDF, sbtidentification_sweave.pdf, and a directory called sbtidentificationres that contains all the analysis results (mainly PDF and CSV files); you can download those from here.
The colorbars representing the subtype classifications and the clinical parameters were designed with OmniGraffle from the PDFs generated with the pgfSweave script. The graffle templates are available from here. For instance, you can easily recreate the figure for SCMGENE by dragging and dropping the PDFs concolorbar_sbt_SCMGENE.pdf, concolorbar_comsigs_SCMGENE.pdf, and concolorbar_cinfo_SCMGENE.pdf into the file concordance_colorbars_sbt_clinic_scmgene.graffle.
Robustness of subtype classifiers
The Sweave script sbtrobustness_sweave.Rnw assesses the robustness of the classifiers (SSP2003, SSP2006, PAM50, SCMOD1, SCMOD2, SCMGENE) in all the microarray datasets by estimating the prediction strength at the global, cluster, and individual levels.
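To make the three levels concrete, here is a minimal sketch that assumes the prediction strength definition of Tibshirani and Walther; the script itself may compute it differently. For each sample it measures the fraction of the other members of its cluster that receive the same label from a classifier trained on another dataset. The cluster value is the mean over its members, and the global value is the minimum over clusters. The function prediction_strength and its argument names are hypothetical.

```r
# test_clusters:    clusters obtained by fitting the model on the test data itself
# predicted_labels: labels given to the same samples by a model trained elsewhere
prediction_strength <- function(test_clusters, predicted_labels) {
  stopifnot(length(test_clusters) == length(predicted_labels))
  individual <- rep(NA_real_, length(test_clusters))
  for (idx in split(seq_along(test_clusters), test_clusters)) {
    n <- length(idx)
    if (n < 2) next  # undefined for singleton clusters
    # same[i, j] is TRUE when samples i and j receive the same predicted label
    same <- outer(predicted_labels[idx], predicted_labels[idx], "==")
    individual[idx] <- (rowSums(same) - 1) / (n - 1)  # exclude the sample itself
  }
  cluster <- tapply(individual, test_clusters, mean)
  list(individual = individual, cluster = cluster,
       global = min(cluster, na.rm = TRUE))
}

# Toy example: the first cluster is split by the externally trained classifier.
ps <- prediction_strength(c(1, 1, 1, 2, 2), c("a", "a", "b", "c", "c"))
print(ps$cluster)
print(ps$global)
```

Taking the minimum over clusters means that a single unstable subtype is enough to lower the global score of a classifier.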
There are a few parameters you can tune (see section 'parameters' in the Sweave file). None was modified for the paper.
When the script completes, it has produced a PDF, sbtrobustness_sweave.pdf, and a directory called sbtrobustnessres that contains all the analysis results (mainly PDF and CSV files); you can download those from here.
Robert Gentleman (2005). Reproducible Research: A Bioinformatics Case Study. Statistical Applications in Genetics and Molecular Biology, 4(1):2.