Multiple outputs and causal ranking strategies for gene selection

Gianluca Bontempi *, Benjamin Haibe-Kains *, Christine Desmedt, Christos Sotiriou, John Quackenbush

* The authors contributed equally to the work

This page describes the R code that implements the causal ranking approach and
reproduces the experiments described in Section 5.
The code consists of a script, causalrank.R, a set of functions in metafs.R, and an R workspace, data.RData.
To run the script causalrank.R, put all the files in the same working directory and type source("causalrank.R") in the R command window.
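
The steps above can be sketched as follows, assuming the three files have been copied into the current working directory. The helper missing_files is introduced here for illustration only and is not part of the distributed code.

```r
# Files named on this page; all are expected in the working directory.
required <- c("causalrank.R", "metafs.R", "data.RData")

# Hypothetical helper (not part of the distributed code): returns the
# names of the required files that are absent from directory 'dir'.
missing_files <- function(files, dir = getwd()) {
  files[!file.exists(file.path(dir, files))]
}

absent <- missing_files(required)
if (length(absent) > 0) {
  # Report what is missing instead of failing inside the script.
  message("Missing from ", getwd(), ": ", paste(absent, collapse = ", "))
} else {
  source("causalrank.R")  # runs the causal ranking experiments
}
```

Checking for the files first gives a clearer message than the error raised when the script tries to load a workspace that is not there.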

  • causalrank.R: this script performs the causal ranking for all the datasets contained in data.RData, for different values of the causalisation parameter lambda.
    It first loads the R workspace and then, for values of lambda ranging over [0,2], writes the associated rankings to the output file all.lambda.Rdata.

  • metafs.R: this file contains the set of functions needed to implement the causal ranking. In particular, the function mmultirank
    implements the causal ranking for a given value of the parameter lambda when a list of input/output datasets is provided.

  • data.RData: this R workspace contains a single object, 'datas.m', which is a list of the 6 datasets described in the paper (Table 1) and is composed of three items:
    1. datas: matrix of normalized (frma) gene expressions, with patients in rows and probes in columns.
    2. annots: data frame of annotations of the microarray platform (Affymetrix GeneChip HGU133A), with probes in rows and annotations (probe identifier, gene symbol, EntrezGene ID, ...) in columns.
    3. demos: data frame of clinical information, with patients in rows and clinical parameters in columns. The clinical parameters are the following:
      • 'er': Estrogen receptor status
      • 'pgr': Progesterone receptor status
      • 'her2': Human epidermal growth factor receptor 2 status
      • 'size': Tumor size (cm)
      • 'node': Nodal status
      • 'age': Age at diagnosis (years)
      • 'grade': Histological grade
      • 'surv.time': Time to event for disease-free survival (distant metastasis or recurrence-free survival)
      • 'surv.event': Event indicator for disease-free survival (distant metastasis or recurrence-free survival)
      • 'surv.bin': Binarized survival data distinguishing high-risk patients (those who relapsed within 5 years of diagnosis) from low-risk patients (those who remained free of relapse for at least 5 years)
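
To make the layout above concrete, the following sketch builds a small synthetic dataset with the same three items and derives a binarized outcome following the rule stated for 'surv.bin'. Everything in it is an assumption made for illustration: the values are made up, the helper binarize_survival is hypothetical, 'surv.time' is assumed to be in years, 'surv.event' is assumed to be coded 1 for relapse and 0 for censoring, high risk is coded 1 and low risk 0, and patients censored before five years are set to NA. The coding actually used in data.RData may differ.

```r
# Hypothetical helper: binarize survival at a cutoff (default 5 years).
binarize_survival <- function(time, event, cutoff = 5) {
  out <- rep(NA_integer_, length(time))
  out[event == 1 & time < cutoff] <- 1L  # relapsed before the cutoff: high risk
  out[time >= cutoff] <- 0L              # relapse-free up to the cutoff: low risk
  out                                    # censored before the cutoff stays NA
}

patients <- paste0("patient", 1:4)
probes <- c("probeA", "probeB")

# Synthetic dataset mimicking the three items (all values are made up).
toy <- list(
  # Expression matrix: patients in rows, probes in columns.
  datas = matrix(c(7.1, 6.8, 8.0, 7.5,
                   5.2, 5.9, 4.7, 5.0),
                 nrow = 4, dimnames = list(patients, probes)),
  # Annotations: probes in rows, annotation fields in columns.
  annots = data.frame(probe = probes,
                      symbol = c("GENE1", "GENE2"),
                      row.names = probes),
  # Clinical information: patients in rows, parameters in columns.
  demos = data.frame(surv.time = c(2.0, 7.5, 3.0, 6.0),
                     surv.event = c(1, 0, 0, 1),
                     row.names = patients)
)

toy$demos$surv.bin <- binarize_survival(toy$demos$surv.time,
                                        toy$demos$surv.event)
print(toy$demos)
```

Keeping the same patient identifiers as row names of datas and demos, and the same probe identifiers as column names of datas and row names of annots, is what lets the three items be matched against each other.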

The Gene Set Enrichment Analysis (GSEA) presented in Figure 4 was performed by running the R script "gsea_mimo.R" followed by "gsea_mimo2.R". The R code and the companion files are included, together with all the detailed results from the preranked GSEA analyses.