DNA microarrays have been used extensively to study the cancer transcriptome: to identify gene expression signatures correlated with various types and subtypes of tumors, to discover potential therapeutic targets, and to predict patient survival. Integrative analysis of this wealth of data, which are publicly available but often left disconnected after publication, will give us more insight into the mechanisms underlying tumorigenesis and progression.
GCOD is a collection of publicly available microarray gene expression data on human cancers, generated on Affymetrix GeneChip arrays. We assembled published datasets collected on Affymetrix GeneChips from public repositories such as ArrayExpress and GEO, representing most major types of cancer; wherever possible, raw data (.cel files) were collected.
We performed extensive manual curation to capture complete and accurate sample information about each published study, applied stringent quality control procedures to flag low-quality hybridization outliers, and used consistent data normalization and transformation protocols to facilitate data comparability across different studies on different array types.
With a centralized database at the backend, a series of web-based displays were developed to provide easy access to these highly curated and systematically analyzed cancer gene expression data. Study-centered views allow users to browse the individual studies, check the quality of hybridizations, download the processed data, and perform preliminary data analysis online. Gene-centered views allow users to query the expression profiles across multiple datasets for their genes of interest. With the integration of The DFCI Gene Index database to allow cross-comparison to various array types and to provide up-to-date annotation of the array probes, including genomic locations, promoter sequences, transcription factor binding sites, gene ontology terms and metabolic pathways, this large volume of cancer gene expression data can also serve as a foundation for complex meta-analysis.
From the pull-down list of cancer types, select the one you are interested in, or “All studies”, and then browse through the list of studies. For each study, the display shows the title of the publication, a summary of the samples and experimental factors involved, a link to PubMed, and the total number of hybridizations on the specified array type. Listed next to each study name are three operations you can perform: QC info, download, and t-test.
QC info: displays two scatter plots of quality control information derived from the MAS 5.0 algorithm. The plot on the left shows the percentage of ‘present’ calls vs. the scaling factor (target = 500); the plot on the right shows the 3’ to 5’ signal ratios of the control transcripts GAPDH vs. beta-ACTIN. A QC filter with arbitrary cutoffs (see definition in next section) flags questionable hybridizations, which appear as orange points in the graphs. The total number of arrays and the number of flagged arrays are listed in the table below the graphs.
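A filter of this kind can be sketched in a few lines of Python. This is only an illustration: the cutoff values below are hypothetical placeholders, not the cutoffs used by GCOD, and the names `QCMetrics`, `is_flagged`, and `summarize` are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class QCMetrics:
    """MAS 5.0 quality metrics for one hybridization."""
    percent_present: float  # percentage of 'present' calls
    scaling_factor: float   # scaling factor at target = 500
    gapdh_ratio: float      # 3'/5' signal ratio for GAPDH
    actin_ratio: float      # 3'/5' signal ratio for beta-actin


# Hypothetical cutoffs for illustration only; NOT the values used by GCOD.
EXAMPLE_CUTOFFS = {
    "min_percent_present": 30.0,
    "max_scaling_factor": 10.0,
    "max_gapdh_ratio": 3.0,
    "max_actin_ratio": 3.0,
}


def is_flagged(m, cutoffs=EXAMPLE_CUTOFFS):
    """Return True if any metric falls outside its cutoff."""
    return (
        m.percent_present < cutoffs["min_percent_present"]
        or m.scaling_factor > cutoffs["max_scaling_factor"]
        or m.gapdh_ratio > cutoffs["max_gapdh_ratio"]
        or m.actin_ratio > cutoffs["max_actin_ratio"]
    )


def summarize(arrays):
    """Return (total arrays, flagged arrays) for a dict of name -> QCMetrics."""
    flagged = [name for name, m in arrays.items() if is_flagged(m)]
    return len(arrays), len(flagged)
```

A hybridization is flagged as soon as one metric fails; whether GCOD combines the metrics in exactly this way is an assumption of the sketch.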
Download: you can choose either MAS 5.0 or RMA normalized data, whether to exclude data from questionable hybridizations, and whether to trim data rows with less than a specified percentage of MAS 5.0 ‘Present’ calls across samples (columns). All sample annotations are listed in the header section of the downloaded data table, and you have the option to order the columns by a given experimental factor. Because data extraction may take up to 20 minutes depending on the size of the dataset and the load on the database, an email address is required so that the download link can be sent to you once extraction is finished.
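The row-trimming option can be illustrated as follows. This is a minimal sketch, assuming detection calls are coded 'P', 'M', and 'A' and that flagged hybridizations are dropped before the percentage is computed; the function name `trim_rows` and the probeset and sample names are hypothetical.

```python
def trim_rows(calls, min_percent, flagged=()):
    """Return probeset ids whose 'P' call percentage is at least min_percent.

    calls maps probeset id -> {sample name: detection call ('P', 'M' or 'A')}.
    flagged lists samples to exclude before computing the percentage
    (assumed order of operations, not confirmed by the source).
    """
    excluded = set(flagged)
    kept = []
    for probeset, row in calls.items():
        used = [call for sample, call in row.items() if sample not in excluded]
        if not used:
            continue  # no samples left for this row
        percent_present = 100.0 * sum(call == "P" for call in used) / len(used)
        if percent_present >= min_percent:
            kept.append(probeset)
    return kept


# Hypothetical calls for two probesets across four samples.
example_calls = {
    "probe_1": {"s1": "P", "s2": "P", "s3": "A", "s4": "P"},
    "probe_2": {"s1": "A", "s2": "A", "s3": "P", "s4": "A"},
}
print(trim_rows(example_calls, min_percent=50.0))
```

With a 50% threshold, `probe_1` (75% present) is kept and `probe_2` (25% present) is trimmed.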
T-test: genes differentially expressed between two groups of samples can be identified using Student’s t-test. A default experimental factor is chosen to display the classes available for comparison between group A and group B. Alternative factors, when available, can be selected from the pull-down list; clicking on ‘Update table’ displays the corresponding classes. To run the t-test, select classes for group A and for group B, check the data-trimming filters, enter a desired p-value cutoff for the two-tailed t-test, and decide whether to apply Bonferroni correction for multiple testing.
Once the ‘Do t-test’ button is pressed, a waiting page appears asking you to wait for the process to finish; the page refreshes automatically every 5 seconds. If there are significantly differentially expressed genes between the two groups being compared, a result table appears listing the significant genes along with group means, standard deviations, differences between the means, degrees of freedom, and raw (and corrected) p-values. The results are sorted by p-value with the most significant genes on top, and each probeset name is linked to expression values for all the samples in our database. You can browse through all result pages where available, and you can also download the t-test results as a text file.
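The t-test workflow described above can be sketched with the Python standard library. Several points are assumptions of the sketch rather than documented behavior of the site: the pooled-variance (equal variance) form of Student’s t-test, a Bonferroni factor equal to the number of probesets tested, and all function names and example values. The p-value is obtained by numerically integrating the t density, which is adequate for illustration but loses accuracy for extremely small p-values.

```python
import math
from statistics import mean, stdev


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value via Simpson's rule on [0, |t|]; steps must be even."""
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3  # P(0 <= T <= |t|)
    return max(0.0, 1.0 - 2.0 * central)


def student_t_test(a, b):
    """Pooled-variance t-test; needs >= 2 values per group, nonzero variance."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    sa, sb = stdev(a), stdev(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * sa ** 2 + (nb - 1) * sb ** 2) / df
    t = (ma - mb) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return {"mean_a": ma, "mean_b": mb, "sd_a": sa, "sd_b": sb,
            "diff": ma - mb, "df": df, "t": t, "p": two_tailed_p(t, df)}


def differential_expression(data, alpha=0.05, bonferroni=True):
    """data maps probeset id -> (group A values, group B values)."""
    n_tests = len(data)  # assumed Bonferroni factor: number of probesets tested
    results = []
    for probeset, (a, b) in data.items():
        row = student_t_test(a, b)
        row["probeset"] = probeset
        row["p_corrected"] = min(1.0, row["p"] * n_tests)
        tested_p = row["p_corrected"] if bonferroni else row["p"]
        if tested_p <= alpha:
            results.append(row)
    # Most significant probesets first, as in the result table.
    return sorted(results, key=lambda r: r["p"])
```

Calling `differential_expression` with `bonferroni=False` applies the cutoff to the raw p-values instead, which corresponds to leaving the correction option unchecked.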
In the text field you can enter any of the following: a gene symbol, gene synonym, GenBank accession, UniGene id, Entrez id, RefSeq accession, probeset id, or free text from a gene description. The entry is automatically mapped to probeset ids for all the array types available in our database. Clicking on a probeset id in the result table leads you to a page with expression values across all the samples shown as box plots; box plots are color-coded by study, with each box representing expression values for samples belonging to the same class.
Below the box plot is a table with summaries for the relevant studies, and clicking on a study name will lead you to a box plot corresponding to the selected study. Here you can select alternative experimental factors to display corresponding box plots.
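The grouping behind these box plots can be sketched as follows: expression values for one probeset are grouped by study and class, and each group is reduced to the five numbers a box plot draws. The record layout, the function name `box_plot_summaries`, and the quartile method are assumptions made for the example; the site may compute its boxes differently.

```python
from collections import defaultdict
from statistics import quantiles


def box_plot_summaries(records):
    """Summarize (study, class_label, value) records per (study, class) group."""
    groups = defaultdict(list)
    for study, label, value in records:
        groups[(study, label)].append(value)

    summaries = {}
    for key, values in groups.items():
        if len(values) >= 2:
            q1, median, q3 = quantiles(values, n=4, method="inclusive")
        else:
            # A single sample: the box collapses to one value.
            q1 = median = q3 = values[0]
        summaries[key] = {"n": len(values), "min": min(values), "q1": q1,
                          "median": median, "q3": q3, "max": max(values)}
    return summaries
```

Selecting an alternative experimental factor amounts to re-running the same grouping with class labels taken from that factor.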