0.1 Introduction

The MetaIntegrator package comprises several analysis and plot functions to perform integrated multi-cohort analysis of gene expression data (meta-analysis). The advent of the gene expression microarray has allowed for a rapid increase in gene expression studies. Largely due to the MIAME standards for data sharing, many of these studies have been deposited into public repositories such as the NIH Gene Expression Omnibus (GEO) and ArrayExpress. There is now a wealth of publically available gene expression data available for re-analysis.

An obvious next step to increase statistical power in detecting changes in gene expression associated with some condition is to aggregate data from multiple studies. However, inter-study technical and biological differences prevent us from simply pooling results and summarizing our findings. A random-effects model of meta-analysis circumvents these issues by assuming that the results from each study is drawn from a single distribution, and that such inter-study differences are thus a ‘random effect’. Thus, the MetaIntegrator package will perform a DerSimonian & Laird random-effects meta-analysis for each gene (not probeset) between all target studies between cases and controls; it also performs a Fischers sum-of-logs method on the same data, and requires that a gene is significant by both methods. The resulting p-values are False discovery rate (FDR) corrected to q-values, and will evaluate the hypothesis of whether each gene is differentially expressed between cases and controls across all studies included in the analysis.

The resulting list of genes with significantly different expression between cases and controls can be used for multiple purposes, such as (1) a new diagnostic or prognostic test for the disease of interest, (2) a better understanding of the underlying biology, (3) identification of therapeutic targets, and multiple other applications. Our lab has already used these methods in a wide variety of diseases, including organ transplant reject, lung cancer, neurodegenerative disease, and sepsis (Khatri et al., J Exp Med 2013; Chen et al, Cancer Res 2014; Li et al., Acta Neur Comm 2014; Sweeney et al, Sci Trans Med 2015).

The MetaIntegrator Vignette will take the user through the basic steps of the package, including basic multi-cohort analysis, leave-one-out (LOO) analysis (whereby each of the included datasets is left out and multi-cohort analysis is run on the remaining datasets in a round-robin fashion), selection of significant genes, and then analysis of the gene set. The MetaIntegrator package assumes that the user (1) already has their data in hand, and (2) has already decided which datasets to include in the multi-cohort meta-analysis. Our group recommends that some datasets be left out of the analysis, if possible, for independent validation.

Winston A. Haynes




0.2 The Meta-Analysis Algorithm

0.2.1 Meta-analysis of gene expression data

The Metaintegrator package can be used to run a meta-analysis on microarray gene expression data as described in Khatri et al. J Exp Med. 2013. Briefly, it computes an Hedges’ g effect size for each gene in each dataset defined as:


where \(1\) and \(0\) represent the group of cases and controls for a given condition, respectively. For each gene, the summary effect size \(g_s\) is computed using a random effect model as:


where \(W_i\) is a weight equal to \(1/(V_i+T^2)\), where \(V_i\) is the variance of that gene within a given dataset \(i\), and \(T^{2}\) is the inter-dataset variation (for details see: Borenstein M et al Introduction to Meta-analysis, Wiley 2009). For each gene, the False discovery rate (FDR) is computed and a final set of genes is selected based on FDR thresholding.

0.2.2 Computation of a signature score

For a set of signature genes, a signature score can be computed as:


where \(pos\) and \(neg\) are the sets of positive and negative genes, respectively, and \(x_i(gene)\) is the expression of any particular gene in sample \(i\) (a positive score indicates an association with cases and a negative score with controls). This score \(S\) is then converted into a z-score \(Z_s\) as:


0.3 Overview Meta-Analysis workflow

1. Data collection, curation and annotation, select datasets for discovery and validation: Helper Functions
2. Meta-analysis on discovery datasets: Meta-Analysis, Filtering, Validation, Visualization, Search, Helper Functions
3. Validation on independent validation datasets: Visualization, Validation, Helper Functions


0.4 Performing a Meta-Analysis using the MetaIntegrator package

0.4.1 1. Create a metaObject as input for analysis Collect gene expression data

  • search gene expression experiments of interest at GEO or ArrayExpress
  • download the DataSet SOFT files for experiments Create a datasetObject for each gene expression GEO dataset

  • unzip and open the DataSet SOFT file
  • extract and reformat expression and phenotype information using e.g. a spreadsheet application such as MS Excel
  • populate expression (datasetObject$expr) and phenotype (datasetObject$pheno) information using the read.table() function in R
  • set the datasetObject$class vector using the phenotype information (0 is control, 1 is case)
  • provide mapping of array probe IDs to gene names for the microarray platform used in the experiment in the datasetObject$keys vector. Mappings are usually stored in GPL files (for format details see GEO Platform guidelines).
  • set name of the dataset in datasetObject$formattedName

The final datasetObject should have the structure:

datasetObject: named list
  $class: named vector. Names are sample names. Values are 0 if control, 1 if case.
  $expr: matrix. Row names are probe names. Column names are sample names. Values are expression values
  $keys: named vector. Names are probe names. Values are gene names.
  $pheno: data frame. Row names are the sample names. Column names are the annotation information (none required).
  $formattedName: string. A formatted name for this dataset which will be used in plots.

Example object structure for one datasetObject from tinyMetaObject:

dataObj1 <- tinyMetaObject$originalData$PBMC.Study.1
str(dataObj1, max.level = 1)
## List of 5
##  $ class        : Named num [1:115] 1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "names")= chr [1:115] "Sample 1" "Sample 2" "Sample 3" "Sample 4" ...
##  $ expr         : num [1:190, 1:115] 10.3 13.6 13.2 13.8 10.2 ...
##   ..- attr(*, "dimnames")=List of 2
##  $ keys         : Named chr [1:190] "Gene41" "Gene58" "Gene112" "Gene59" ...
##   ..- attr(*, "names")= chr [1:190] "1294_at" "200036_s_at" "200080_s_at" "200089_s_at" ...
##  $ pheno        :'data.frame':   115 obs. of  8 variables:
##  $ formattedName: chr "PBMC Study 1"

Note: Gene expression values in dataObj1$expr should be in \(log_2\) scale and the expression data might need normalization. Also, negative gene expression values are problematic for geometric mean calculation.

This can be checked by generating a boxplot of the dataset expression values:

boxplot(dataObj1$expr[,1:15]) # -> shows samples 1-15, to see all run: boxplot(dataObj1$expr) 

Here, normalization is not necessary because the median of the samples is similar, and the data is already in log scale because the expression values are between 0 and 15. (If negative expression values would be observed e.g. the lowest expression value is -1, we recommend to shift all expression values of the dataset above 0 by adding +1 to each gene expression measurement in all samples.) Check your datasetObject using checkDataObject()

The function checks for errors within the datasetObject. I returns TRUE if the object passed error checking, FALSE otherwise, and it prints warning messages explaining failed checks.

checkDataObject(dataObj1, "Dataset")
## [1] TRUE Create metaObject from dataset objects

Generate a named list of dataset objects that have been imported for analysis:

# use the additional 2 example datasets from tinyMetaObject
dataObj2 = tinyMetaObject$originalData$Whole.Blood.Study.1
dataObj3 = tinyMetaObject$originalData$Whole.Blood.Study.2
# and create the metaObject
discovery_datasets <- list(dataObj1, dataObj2, dataObj3)
names(discovery_datasets) = c(dataObj1$formattedName, dataObj2$formattedName, dataObj3$formattedName)
exampleMetaObj$originalData <- discovery_datasets

IMPORTANT: Keep at least one dataset out of the discovery datasets to use it for validation!

The final metaObject should have the structure:

metaObject: named list
  $originalData: named list [1]
      $datasetName: Dataset object. 'datasetName' will be the (unquoted) name of that dataset.[0,n] Optional: Check your metaObject before MetaAnalysis using checkDataObject()

The function checks for errors within the metaObject.

Example how to check your metaObject:

checkDataObject(exampleMetaObj, "Meta", "Pre-Analysis")
## [1] TRUE

0.4.2 2. Run Meta-Analysis on discovery cohorts Run Meta-Analysis with runMetaAnalysis()

Once the data is written to metaObject$originalData, the Meta-Analysis can be started by:


The Meta-Analysis results are written into metaObject$metaAnalysis and the results of the leave-one-out analysis into metaObject$leaveOneOutAnalysis For details see Meta-Analysis algorithm above


exampleMetaObj <- runMetaAnalysis(exampleMetaObj, maxCores=1)
## Found common probes in 3 
## Computing effect sizes...
## Computing summary effect sizes...
## Computing Fisher's output...