






From the Departments of Biostatistics, Breast Medical Oncology, and Pathology, University of Texas M.D. Anderson Cancer Center, Houston, Texas; Millennium Pharmaceuticals, Inc., Cambridge, Massachusetts; and Praecis Pharmaceuticals, Waltham, Massachusetts
| Abstract |
|---|
0.7 when UniGene was used to match probes. There was substantial variation in correlation between different Affymetrix probe sets matched to the same cDNA probe. When cDNA and Affymetrix probes were matched by Basic Local Alignment Search Tool (BLAST) sequence identity, the correlation increased substantially. We identified 182 genes in the Affymetrix and 45 in the cDNA data (including 17 common genes) that accurately separated 91% of cases in supervised hierarchical clustering in each data set. Cross-platform testing of these informative genes resulted in lower clustering accuracy of 45% and 79%, respectively. Several sets of accurate five-gene classifiers were developed on each platform using linear discriminant analysis. The best 100 classifiers showed an average misclassification error rate of 2% on the original data that rose to 19.5% when tested on data from the other platform. Random five-gene classifiers showed a misclassification error rate of 33%. We conclude that multigene predictors optimized for one platform lose accuracy when applied to data from another platform due to missing genes and sequence differences in probes that result in differing measurements for the same gene.
| Introduction |
|---|
The goal of the current research was to compare gene expression data generated by Affymetrix U133A (Santa Clara, CA) oligonucleotide arrays with those generated by cDNA microarrays from the same 33 RNA samples extracted from fine-needle aspirates of breast cancer. We examined three questions. 1) How closely do normalized gene expression measurements for the same gene correlate across platforms? 2) Do genes that discriminate between responders and nonresponders to chemotherapy in hierarchical cluster analysis on one platform retain discriminating value when tested on data generated by another platform? 3) How well do multigene predictors generated on one platform hold up on data generated by the other platform? Because the same RNA was profiled on both platforms, this design allowed us to evaluate the impact of the profiling method without the confounding effect of sampling variation. This analysis was not designed to compare the accuracy of the two platforms. This would have required a third, external, gold standard measurement of mRNA expression. Similarly, our goal was not to develop the best possible predictor of pathological CR from these data. This would have required a somewhat different statistical approach to optimize the predictor for validation on independent samples.
| Materials and Methods |
|---|
Microarray Hybridization
RNA was extracted from fine-needle aspirate (FNA) samples using the RNeasy kit (Qiagen, Valencia, CA). The amount and quality of RNA were assessed with a DU-640 U.V. Spectrophotometer (Beckman Coulter, Fullerton, CA) and by an Agilent 2100 Bioanalyzer RNA 6000 LabChip kit (Agilent Technologies, Palo Alto, CA). Two microarray platforms were tested in this study: the Affymetrix Human Genome U133 A and B gene chip sets and cDNA nylon membranes proprietary to and printed by Millennium Pharmaceuticals. The cDNA arrays contained 21,594 independent human sequence-verified clones. For the cDNA array hybridization, first-strand cDNA synthesis was performed with Superscript II (Invitrogen, Carlsbad, CA) in the presence of [33P]dCTP (100 mCi/ml; Amersham, Little Chalfont, UK) from 1 to 2 µg of total RNA. The isotope-labeled cDNA probes were hybridized without further amplification to high-density nylon arrays. Affymetrix profiling was also performed without second-round amplification following the standard protocol using 1 µg of total RNA from the same pool used for the cDNA array experiments. Briefly, double-stranded cDNA was synthesized, followed by an in vitro transcription reaction to generate biotinylated cRNA. Biotin-labeled and fragmented cRNA was hybridized to the Affymetrix U133A and B gene chips overnight at 42°C. Procedures followed standard operating practice outlined in the Affymetrix technical manual. The Affymetrix GeneChip system was used for hybridization and scanning of the probe arrays, and Microarray Analysis Suite (MAS) 5.0 was used for data acquisition.
Data Processing
Results of the cDNA membrane array experiments and data acquisition were previously reported.7
Gene expression values of the cDNA array experiments were normalized to the median expression value of all genes on each membrane array. The median values were set to 1 by the normalization process; the postnormalization mean expression values ranged from 2.535 to 4.787. For the Affymetrix data, several standard metrics were examined to assess the quality of each hybridization result. We assessed T7 amplification and labeling efficiency by calculating the 3' to 5' ratios of β-actin and glyceraldehyde-3-phosphate dehydrogenase. The median ratio for β-actin was 1.45 (range, 1.21 to 1.60); for glyceraldehyde-3-phosphate dehydrogenase, 1.13 (range, 0.86 to 1.67). To assess brightness, we used dCHIP V1.3 to calculate the percentages of array outliers and of single outliers for each chip.8
MAS 5.0 was used to produce P values for signal detection. Chips with more than 5% array or single outliers or with less than 15% detection P values <0.01 were flagged. All 33 Affymetrix profiles included in this analysis passed each QC step. Affymetrix data were quantified and normalized with dCHIP V1.3 (http://biosun1.harvard.edu/complab/dchip).8
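The median normalization and the chip-flagging rule described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function names, the toy intensities, and the reading of "array or single outliers" as two separate 5% limits are assumptions.

```python
from statistics import median


def median_normalize(intensities):
    """Scale one membrane's intensities so that their median becomes 1."""
    med = median(intensities)
    if med <= 0:
        raise ValueError("median intensity must be positive")
    return [value / med for value in intensities]


def is_flagged(array_outlier_pct, single_outlier_pct, detected_pct):
    """Apply the QC rule from the text to one Affymetrix chip.

    detected_pct is the percentage of probe sets with detection P < 0.01.
    Assumption: the 5% limit applies to each outlier type separately.
    """
    return (array_outlier_pct > 5.0
            or single_outlier_pct > 5.0
            or detected_pct < 15.0)


# Made-up raw intensities for a single membrane (illustration only).
membrane = [0.5, 2.0, 4.0, 8.0, 40.0]
normalized = median_normalize(membrane)
print(median(normalized))            # median after normalization
print(is_flagged(1.2, 0.8, 42.0))    # a chip that passes all three checks
```

Because a few very bright spots pull the mean upward while leaving the median untouched, a postnormalization mean well above 1 is what one would expect from right-skewed intensity data.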
Statistical Analysis
The primary goal of this research was cross-platform testing of informative genes and multigene outcome predictors in an experimental setting in which the only variable is the gene expression profiling technique. Normalized expression data from both platforms were transformed by computing the base-two logarithm before further analysis. Pearson correlation coefficients were calculated on the normalized, log-transformed data for each gene represented on both platforms, with matching based on UniGene build 160. Further analysis was performed only on the subset of genes shared on both the Affymetrix and cDNA platforms. When multiple cDNA probes or Affymetrix probe sets targeted the same gene, separate Pearson and Spearman correlation coefficients were calculated for each distinct probe set and matching cDNA probe. The distribution of all gene expression measurements on both platforms was near normal. To identify differentially expressed genes between the two response outcome groups, pathological complete response versus residual cancer, we explored two methods: the two-sample t-test and an unequal variance t-test on the ranks. The resulting P values were analyzed as a β-uniform mixture, and the relationship between P values and false discovery rate (FDR) was assessed.9, 10
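The per-gene correlation step described above (log2 transformation followed by Pearson and Spearman coefficients for one matched probe pair) can be illustrated with a short standard-library sketch. The function names and the toy expression values are hypothetical; the original analysis was run in S-Plus, not Python.

```python
import math


def log2_transform(values):
    """Base-two logarithm of normalized (strictly positive) expression values."""
    return [math.log2(v) for v in values]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy values for one gene measured in four samples on each platform.
cdna = log2_transform([1.0, 2.0, 4.0, 8.0])
affy = log2_transform([100.0, 200.0, 400.0, 800.0])
print(pearson(cdna, affy), spearman(cdna, affy))
```

In the toy example the two platforms differ only by a constant scale factor, which becomes a constant offset after the log transform, so both coefficients are essentially 1 even though the raw intensities are on very different scales.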
Hierarchical clustering was performed using the informative genes identified in the univariate analysis. The distance metric was computed on log-transformed gene expression values. Data analyses were performed using the statistical software package S-Plus 2000 (Insightful Corp., Seattle, WA).
Multigene classifiers were built by combining a genetic algorithm (GA) with linear discriminant analysis (LDA).11, 12 This classification strategy consists of two main components. The GA is used as a gene selector to identify the best combination of discriminating genes, and the class prediction is performed by LDA. The working and optimization principle of GA is analogous to an evolutionary process and does not rely on rank-based gene selection as a t-test does.11 This method has previously been applied successfully to high-dimensional gene expression and proteomic profile data.13, 14, 15, 16 A potential advantage of this approach is that the GA can identify combinations of genes that are predictive together even if individual genes that contribute to the classifier have limited predictive value and would be missed by two-sample t statistics. However, to reduce computation time, we performed GA/LDA on a subset of the data that was enriched in informative genes. The subsets in each platform were selected based on the FDR computed from the two-sample t-tests. By setting FDR < 0.6, we selected 1032 clones on the cDNA platform and 2010 probe sets on the Affymetrix platform. This preselection of genes was performed to reduce computation time by using genes with somewhat increased class-discriminating value. Twenty-seven cases were used as an independent validation set to test the predictive accuracy of the predictor that had the highest cross-validation accuracy on both platforms.
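To make the LDA half of the GA/LDA strategy concrete, here is a minimal two-class linear discriminant written with the standard library only. It is a sketch under stated assumptions, not the authors' implementation: it uses a pooled within-class covariance and equal class priors, it omits the genetic algorithm that chooses the gene subsets, and the two-gene toy data are invented.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length samples."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def pooled_covariance(class0, class1):
    """Pooled within-class covariance matrix of two groups of samples."""
    p = len(class0[0])
    scatter = [[0.0] * p for _ in range(p)]
    for rows in (class0, class1):
        center = mean_vector(rows)
        for row in rows:
            dev = [x - c for x, c in zip(row, center)]
            for i in range(p):
                for j in range(p):
                    scatter[i][j] += dev[i] * dev[j]
    dof = len(class0) + len(class1) - 2
    return [[value / dof for value in row] for row in scatter]


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_lda(class0, class1):
    """Return (weights, threshold) of a two-class linear discriminant."""
    m0, m1 = mean_vector(class0), mean_vector(class1)
    weights = solve(pooled_covariance(class0, class1),
                    [b - a for a, b in zip(m0, m1)])
    midpoint = [(a + b) / 2 for a, b in zip(m0, m1)]
    threshold = sum(w * m for w, m in zip(weights, midpoint))
    return weights, threshold


def predict(model, sample):
    """Return 1 if the sample falls on the class1 side, else 0."""
    weights, threshold = model
    return int(sum(w * x for w, x in zip(weights, sample)) > threshold)


# Invented log-expression values for two genes (rows are cases).
residual = [[1.0, 2.0], [2.0, 1.0], [1.5, 1.5], [2.0, 2.5]]   # class 0
complete = [[5.0, 6.0], [6.0, 5.0], [5.5, 5.5], [6.0, 6.5]]   # class 1
model = fit_lda(residual, complete)
errors = (sum(predict(model, s) != 0 for s in residual)
          + sum(predict(model, s) != 1 for s in complete))
print(errors)   # number of misclassified training cases
```

In a GA/LDA setting, a count such as `errors` (or its cross-validated counterpart) would serve as the fitness that the genetic algorithm tries to minimize over candidate five-gene subsets.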
| Results |
|---|
0.7, another 7590 measurements had r > 0.5, and 12,466 measurements had an r > 0 but < 0.5. There were 4044 measurements that had a negative correlation coefficient. The distribution of all Pearson coefficients indicated a modest overall correlation of individual gene expression measurements across the two platforms (Figure 2A).
25 nucleotides overlapping with the RefSeq sequences, then the pair was considered "RefSeq-matched" using software bl2seq. Three categories of probe pairs were created: 1) probe pairs that targeted the same gene but were misaligned and shared no overlap in RefSeq target sequence (n = 2108), 2) pairs that matched to the same RefSeq sequence as described above (n = 3169), and 3) pairs where cDNA and Affymetrix probe sequences directly overlapped (n = 4008). As expected, closer sequence matches between the cDNA and Affymetrix probes yielded greater correlation. Figure 2B
Sequence-Based Explanation for the Discrepant Expression Measurements on the Two Platforms
To illustrate the impact of sequence matching and location of Affymetrix probe target sequence in relation to the 3' end of the transcript, we examined the expression results for the estrogen receptor (ER) obtained with the two platforms. The expression of ER is routinely measured in breast cancer specimens by immunohistochemistry. ER protein expression correlates closely with mRNA expression of ER.7, 17
These routine clinical results, which were available for all cases, provided an external "gold standard" reference for ER expression. We observed very good correlation between ER expression determined by immunohistochemistry and by cDNA microarray (estrogen receptor 1, IMAGE clone 725321, Hs.1657; NM_000125) or by Affymetrix gene chip when the probe set "205225_at" was used for comparison. The expression values of ER measured by cDNA arrays and by the Affymetrix probe set "205225_at" also correlated with each other very highly (r = 0.976). However, there are multiple probe sets that target the human ER gene on the Affymetrix U133A chip. The correlation coefficients of these probe sets with the ER cDNA result (and with the gold standard immunohistochemistry (IHC) result) varied considerably, ranging from 0.146 to 0.976 (Figure 3). The normalized mean expression intensities of these ER probe sets were also highly variable, ranging from 62 to 1291 (arbitrary units).
gene in relation to the IMAGE clone ER sequence represented on the cDNA array and the various Affymetrix oligonucleotide ER probe sequences. The closer the region of an Affymetrix probe set was to the region of the IMAGE clone, the higher the correlation was between the two measurements (Figure 3). The lack of correlation of probe set "215551_at" with ER gene expression may also be explained by the peculiarities of that sequence. This sequence maps to an exon that is only reported in one ER-related expressed sequence tag sequence. This raises the possibility of an alternative 3' end that gives rise to a shorter ER transcript variant. The lack of correlation with the cDNA result and with the clinical ER status, coupled with the low signal intensity for this particular probe, suggests a low-abundance ER transcript variant that could be expressed at low levels uniformly in all samples. These observations suggest that the relatively poor overall correlation between the cDNA array and Affymetrix results is largely due to sequence differences between the probes present in the two platforms.
Do Informative Genes Identified on One Platform Retain Their Discriminating Function on Another Platform in Hierarchical Cluster Analysis?
We identified genes that were differentially expressed between the 10 cases with pathological complete response (pCR) and the remaining 23 cases that had residual cancer (RD) after completion of preoperative chemotherapy (Supplementary Table S1, A and B at http://jmd.amjpathol.org/). In the cDNA array data, we used the unequal variance t-test on the ranks to identify 45 genes as differentially expressed, each with P < 0.00236, which corresponded to FDR < 0.40. This high false discovery rate suggested that the most informative genes on the cDNA platform were removed when analysis was restricted to probes that had a matched pair on the U133A chip. However, a high FDR was acceptable for our purpose because we wanted to identify a large number of discriminating genes in this particular data set and to test how well their discriminating value holds up when tested on expression data generated from the same RNA with a different profiling platform. One approach to test the discriminating value of these genes is to perform supervised hierarchical clustering with these informative genes. Average linkage supervised hierarchical clustering of the cDNA array data using these 45 genes visually demonstrated that these genes had high discriminating value on the original data set, with 10 pCR/3 RD in one main cluster and 0 pCR/20 RD in the other, which represents a 91% overall clustering accuracy (Figure 4). When the same 45 genes, corresponding to 62 Affymetrix probe sets, were used to cluster the Affymetrix data, the genes lost a substantial amount of discriminating value. Six cases of pCR and three RD clustered together in one main arm of the dendrogram and 4 pCR/20 RD were in the other main cluster, representing a 79% overall clustering accuracy. This decrease in discriminating value was predicted based on the modest overall correlation of individual gene expression values on the two platforms.
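The clustering accuracies quoted above follow directly from the composition of the two main dendrogram arms. The sketch below reproduces that arithmetic; the helper name and the convention of counting a case as correct when it sits in the arm dominated by its own outcome class are assumptions chosen because they match the reported percentages.

```python
def clustering_accuracy(cluster_a, cluster_b):
    """Fraction of cases grouped with their own outcome class.

    Each argument is a (n_pcr, n_rd) tuple for one main cluster. One arm is
    treated as the pCR arm and the other as the RD arm, using whichever
    assignment yields more correctly grouped cases.
    """
    a_pcr, a_rd = cluster_a
    b_pcr, b_rd = cluster_b
    total = a_pcr + a_rd + b_pcr + b_rd
    correct = max(a_pcr + b_rd, a_rd + b_pcr)
    return correct / total


# Cluster compositions taken from the text above (45 cDNA-derived genes).
cdna_accuracy = clustering_accuracy((10, 3), (0, 20))   # 30 of 33 cases
affy_accuracy = clustering_accuracy((6, 3), (4, 20))    # 26 of 33 cases
print(round(100 * cdna_accuracy), round(100 * affy_accuracy))
```

With 33 cases in total, each case that moves to the wrong arm lowers the accuracy by about three percentage points, so the drop from 91% to 79% corresponds to four additional misplaced cases.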
0.40) to the Affymetrix data yielded 182 probe sets corresponding to 166 distinct genes, each with P < 0.00607. When these genes were used for supervised hierarchical clustering, they also performed well on the original data. Eight cases of pCR and 1 case of RD were clustered in one main cluster and 2 pCR and 22 RD formed the other main cluster (91% clustering accuracy). However, when the same 166 genes (corresponding to 249 cDNA clones) were used to cluster the cDNA data, a practically complete loss of discriminating value was seen (data not shown).
These results indicate that informative genes identified on one gene expression profiling platform lose some of their class-discriminating value when measured with a different profiling method. This is further illustrated by the observation that only 17 genes were common to the cDNA and Affymetrix differentially expressed gene lists generated from the same cases (Table 2). As expected, the correlation coefficients for these 17 genes across the platforms were high. When these 17 genes common to both lists were used for hierarchical clustering, almost all cases of pathological CR clustered together in both types of expression data. It was, therefore, possible to identify genes and probe sets that retained discriminating value across platforms by using only genes that show high correlation coefficients when measured by the two distinct platforms. Such genes may be identified by sequence matching of the probes. Hierarchical clustering is not a class prediction tool, and assignment of molecular class or expected outcome to new cases based on dendrogram results is not appropriate. There are several mathematical methods that are better suited to formulate outcome predictions based on the constellation of multiple genes.
The 100 sets of five-gene classifiers built from the cDNA data misclassified on average 0.69 cases, corresponding to a 2% misclassification error rate (MER), on the original platform. Forty-eight of the 100 sets yielded 0 misclassifications. Thirty-seven sets yielded 1 and 13 sets yielded 2 misclassifications (Figure 5). When these 100 predictors developed from the cDNA data were tested on the Affymetrix results, the classification accuracy dropped. The average number of misclassified cases was 6.42, corresponding to a 19.5% average MER. No set produced perfect classification when tested on the Affymetrix results, and only four sets yielded ≤2 misclassified cases (Figure 5). However, the cross-platform classification performance of these predictors (optimized for the cDNA results) was still better than that observed with random predictors. We tested the performance of LDA with 100 randomly selected five-gene sets from the cDNA data. The average number of misclassified cases was 10.88 (33% average MER) on the cDNA platform and 9.62 (29% average MER) on the Affymetrix data.
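The misclassification error rates above are the average number of misclassified cases divided by the 33 cases in the data set. A short check of that arithmetic is sketched here; the function and dictionary names are illustrative only.

```python
N_CASES = 33  # RNA samples profiled on both platforms


def mer_percent(mean_misclassified, n_cases=N_CASES):
    """Average misclassification error rate, in percent."""
    return 100.0 * mean_misclassified / n_cases


# Average misclassified cases reported in the text.
reported = {
    "cDNA classifiers on cDNA data": 0.69,
    "cDNA classifiers on Affymetrix data": 6.42,
    "random gene sets on cDNA data": 10.88,
    "random gene sets on Affymetrix data": 9.62,
}
for label, mean_errors in reported.items():
    print(f"{label}: {mer_percent(mean_errors):.1f}%")
```

The four values come out close to the 2%, 19.5%, 33%, and 29% quoted in the text, which confirms that all rates share the same 33-case denominator.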
When we compared the two sets of 100 five-gene classifiers identified from the cDNA and Affymetrix data, respectively, we could identify only three classifiers that were common and therefore performed well on both platforms. The classifier that performed the best on both platforms included the following five genes: PPFIBP2 (Hs.12953), PCNT1 (Hs.184352), HNRPA2B1 (Hs.232400), BBS4 (Hs.26471), and SEC24C (Hs.81964). This predictor classified all cases correctly on the Affymetrix platform and misclassified only one case on the cDNA platform. The second best cross-platform classifier included genes RNF111 (Hs.12504), PPFIBP2 (Hs.12953), SCD4 (Hs.247474), SNRPN (Hs.48375), and PRPSAP1 (Hs.77498) and misclassified one case on both platforms. The third classifier, including PPFIBP2 (Hs.12953), SAMHD1 (Hs.23889), C20orf27 (Hs.274422), ZNF75 (Hs.355015), and PRPSAP1 (Hs.77498), misclassified one case on the cDNA and two cases on the Affymetrix platform.
Assessment of the Predictive Accuracy of the Best Five-Gene Classifier on Independent Cases
A predictor that consists of a handful of genes that are reliably and comparably measured by both Affymetrix and cDNA platforms could be very useful, particularly if it predicts outcome accurately in new cases. To estimate the true predictive accuracy of the five-gene LDA predictor that performed the best on both platforms, we tested it on an independent set of 27 patients profiled on Affymetrix. The classifier showed a 70% overall prediction accuracy (95% CI = 50%, 86%). This suggests that it may be possible to develop multigene predictors that perform well across platforms and also show reasonably good predictive accuracy in independent cases.
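The width of the confidence interval quoted above reflects the small validation set. As an illustration, the sketch below computes an exact (Clopper-Pearson) binomial interval, assuming that 70% of 27 corresponds to 19 correct predictions and that an exact binomial interval is appropriate; the text does not state which interval method was used, so both points are assumptions.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def root_of_increasing(f, lo=0.0, hi=1.0, steps=60):
    """Bisection root finder for a function that increases on [lo, hi]."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(successes, n, alpha=0.05):
    """Exact two-sided binomial confidence interval for a proportion."""
    if successes == 0:
        lower = 0.0
    else:
        # P(X >= successes | p) grows with p; find where it equals alpha / 2.
        lower = root_of_increasing(
            lambda p: 1 - binom_cdf(successes - 1, n, p) - alpha / 2)
    if successes == n:
        upper = 1.0
    else:
        # P(X <= successes | p) shrinks with p; find where it equals alpha / 2.
        upper = root_of_increasing(
            lambda p: alpha / 2 - binom_cdf(successes, n, p))
    return lower, upper


# Assumption: 19 of the 27 independent cases were predicted correctly.
low, high = clopper_pearson(19, 27)
print(f"accuracy = {19 / 27:.0%}, 95% CI = ({low:.0%}, {high:.0%})")
```

Under these assumptions the interval should land close to the reported 50% to 86%, which shows how little a 27-case validation set constrains the true accuracy.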
It is important to consider that the five-gene LDA predictor examined in this analysis may not be the best of all possible predictors that can be developed from each particular data set. Many of the most informative genes (in univariate analysis) in the Affymetrix data were not represented on the cDNA arrays and vice versa. Also, there are a large number of supervised classification methods, including support vector machines, k-nearest neighbor, and various types of LDA algorithms, that can be used to generate predictors. The comparison of different classification methods and selection of the optimal predictor for independent validation was not the main goal of this study, and it is the subject of a separate analysis that includes a larger number of cases.
| Discussion |
|---|
We examined the concordance between gene expression data generated from the same RNA specimens by two different DNA microarray platforms. One platform, Millennium Pharmaceuticals cDNA arrays, contained 21,594 unique UniGene clusters; the Affymetrix HU133 A and B chips contained 30,868 clusters. Only 17,832 clusters were present on both platforms. We observed modest overall correlation between paired measurements of the same genes when probes were matched by UniGene. This was apparent regardless of the correlation metric used, including Pearson and Spearman correlation coefficients and concordance correlation of t scores. Our findings are consistent with several other reports that indicated modest overall correlation of gene expression measurements across platforms.18, 19, 20
Some Affymetrix probe sets displayed very high correlation with matching cDNA array results, but other probe sets targeting the same gene showed minimal or no correlation. We hypothesized that this variable correlation may be partly due to sequence differences between the cDNA probes and the various Affymetrix probe sets and the location of the target sequence within the gene to be measured. In the case of the ER gene, we could demonstrate that both arrays could measure the expression of ER very accurately compared with an external clinical gold standard and in a highly concordant manner. However, each ER probe set present on the U133A chip showed different levels of expression intensities and various degrees of correlation with ER gene expression defined by IHC or cDNA array. These differences in expression intensity could be explained by the location of the oligonucleotide target sequences within the ER cDNA. When we examined the correlation between cDNA and Affymetrix probes that had direct sequence overlap, the correlation was quite high. Most of the discrepant correlations between Affymetrix probe sets and cDNA measurements could be explained by the sequence differences of the probes. Important factors that contribute to the different signal intensities generated by distinct probes that target the same gene at different locations include differing GC-content, sequence length, intraplatform cross-match opportunities, and the location of the probe sequence in relation to the 3' end.18
Because many microarray studies draw their conclusions from hierarchical clustering analyses, cross-platform preservation of clustering results is important. We examined whether genes that are differentially expressed between tumors that had complete pathological response to preoperative chemotherapy and those with residual cancer retain their class-discriminating value when used in supervised hierarchical clustering across platforms. Dendrograms should be similar if intraplatform relationships between measurements are similar on the two different platforms. Because only 68% of the Affymetrix U133A chip and 44% of the cDNA array genes were common to both platforms, to avoid complexities due to missing informative genes, we restricted our analysis to select differentially expressed genes from the subset of genes represented on both platforms (n = 9402). There was only limited overlap between the lists of informative genes that distinguished cases with pathological CR from those with residual disease generated from the same samples with two different profiling platforms. Only 17 genes were common to the 45-gene long cDNA and 168-gene long Affymetrix lists. Not surprisingly, informative genes performed well in supervised hierarchical clustering on the original data but showed decreased discriminating value when applied to data generated on the other platform.
We also examined the cross-platform performance of 100 sets of five-gene GA/LDA response predictors. The average misclassification error rate of these five-gene classifiers developed from the cDNA data was 2% on the original data. When the same classifiers were tested on the Affymetrix data, the average misclassification error rate rose to 19.5%. For comparison, five-gene random classifiers produced a 33% average misclassification error rate. Essentially identical results of diminished classification accuracy were observed when classifiers developed from the Affymetrix data were applied to cDNA results.
It is important to recognize that we compared two very different profiling platforms, cDNA nylon arrays hybridized to radioactive-labeled samples versus oligonucleotide Affymetrix GeneChips hybridized to biotin-labeled samples. Platforms that are more similar, for example two different versions of Affymetrix GeneChips, may show greater concordance of results because of greater similarity of the probe sequences. We only examined the performance of supervised hierarchical clustering and class (response outcome) prediction based on linear discriminant analysis; therefore, it is possible that other class prediction methods may be more robust for cross-platform application. However, there is no reason to believe that any prediction algorithm would perform well across platforms if the concordance between the expression measures of the informative genes is low. Essentially all published results that attempted cross-platform testing of informative genes, regardless of class prediction methodology, reported diminished (but not completely lost) classification accuracy on data generated by platforms other than the original platform.21, 22
In summary, many genes with class-discriminating value on one profiling platform lose some of their discriminating value when measured with another profiling method. It is possible to select a subset of genes that retain much of their class-discriminating value across platforms based on a high degree of sequence overlap between the probes. However, such paired probes represent only a small minority of all probes present in any particular platform. Although it is reassuring that multigene predictors do hold up to some extent when applied across platforms, cross-platform application of multigene classifiers may have limited clinical value because of substantial and unpredictable loss of classification accuracy due to 1) missing informative genes and 2) often suboptimal measurement of informative genes between platforms. These observations underscore the importance of collaborative efforts to create uniform gene expression databases across various laboratories using standard operating procedures and a common platform to test the true diagnostic potential of this technology.
| Footnotes |
|---|
Supported in part by the Nellie B. Connally Breast Cancer Research Fund, by grants from Millennium Pharmaceuticals, The Dee Simmons Fund, and the University of Texas M.D. Anderson Cancer Center Aventis Drug Development Award and R01 CA106290-01 (to L.P.), and by grant LF2002-044HM from The Susan G. Komen Breast Cancer Foundation (to W.F.S.).
J.S. and J.W. contributed equally to the development of this manuscript.
Accepted for publication February 7, 2005.