## Identifying Putative Biomarkers

### Methods used in BMDK

This section describes the 10 methods currently employed to identify putative biomarkers, where a putative biomarker is defined as a feature whose intensity can distinguish some or all of the subjects in one State from those in another. In other words, the intensity ranges for the samples in each State should be different; the larger this difference the stronger the marker. Two examples of a marker are shown in the figure at the right.

In Feature-a, the Cases have intensities that span a total range of X, while the Controls have a range that is smaller by an amount Za. In Feature-b, both the Cases and Controls have a range of X, but the range for the Cases is shifted higher than the range for the Controls by an amount Zb. If there are equal numbers of Cases and Controls, the maximum range (X) is the same for both markers, and there is a uniform distribution of intensities in each State, the same number of samples will be distinguishable as long as Za=2Zb. For example, if X=100 and Za=20 (Zb=10), then 10% of the samples should be distinguishable. In Feature-a, 20% of the Cases should have an intensity that is larger than that of any of the Controls. In Feature-b, 10% of the Cases will have an intensity above all Controls and 10% of the Controls will have an intensity below all Cases. Since in this example only 10% of all samples should be correctly distinguishable, this would be considered a very weak marker. As Za and Zb increase, the marker becomes stronger since it is able to correctly distinguish more of the samples.
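The arithmetic in this example can be checked with a short sketch. It assumes, as in the text, equal numbers of Cases and Controls and uniformly distributed intensities; the function names are illustrative and are not part of BMDK.

```python
def distinguishable_fraction_a(x, za):
    """Feature-a: Cases span X, Controls span X - Za (equal group sizes assumed)."""
    cases_above_controls = za / x      # expected fraction of Cases above every Control
    return cases_above_controls / 2.0  # Cases make up half of all samples


def distinguishable_fraction_b(x, zb):
    """Feature-b: both groups span X, with the Cases shifted up by Zb."""
    cases_above = zb / x     # expected fraction of Cases above every Control
    controls_below = zb / x  # expected fraction of Controls below every Case
    return (cases_above + controls_below) / 2.0


if __name__ == "__main__":
    # X=100 with Za=20 (Zb=10), the example used in the text.
    print(distinguishable_fraction_a(100.0, 20.0))
    print(distinguishable_fraction_b(100.0, 10.0))
```

Both functions return the expected fraction of all samples that fall outside the overlapping range, which is why the two marker types are comparable when Za=2Zb.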

For each of the methods described below there is a table that examines the ability to detect weak markers as a function of Za (2Zb) and the number of Cases and Controls. For this examination, X is set to 100 and Za is varied from 10 to 40 (Zb from 5 to 20). The first step is to estimate the best score that a given method can achieve for a feature that contains no information (Za=Zb=0) for a given number of samples. This is done by examining 10,000 randomly generated features with intensities between 0.0 and 100.0. Then, for each value of Za or Zb, 10,000 new features are randomly generated, and the number of features with a score better than the best score obtained from the features with no information is counted. Part of this table for the *dtgini* procedure is shown below.

Each | Score | 10a | 10b | 15a | 15b |
---|---|---|---|---|---|
30 | 0.333 | 1 | 1 | 8 | 1 |
45 | 0.390 | 12 | 5 | 38 | 9 |
60 | 0.410 | 8 | 2 | 81 | 6 |
90 | 0.431 | 13 | 0 | 297 | 10 |
150 | 0.462 | 952 | 34 | 6778 | 318 |
300 | 0.483 | 9892 | 3746 | 10000 | 9634 |

The first column lists the number of Cases and the number of Controls in these artificial features. The second column lists the minimum score (*GINI _{split}*) obtained from 10,000 features with no information, and is an estimate of the minimum value one may expect. It should be noted that this value increases as the number of Cases and Controls increases, meaning that as the number of samples increases it becomes harder to obtain a good separation of the samples by chance using a one-node decision tree. The third column builds feature intensities as in Feature-a in the figure above with Za=10. Since this is a weak marker, only one of the 10,000 features produced a *GINI _{split}* that was less than the best result obtained from features with no information for 30 Cases and 30 Controls. If there are 300 Cases and 300 Controls, almost all of the features (9892 of 10,000) have a score that is better than was obtained from features with no information. Conversely, the fourth column shows that if 2Zb=10, the number of features that are better than random is significantly reduced; with 300 Cases and 300 Controls, only 3746 of the 10,000 features had *GINI _{split}* values that are less than random.
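The procedure behind this table can be sketched as follows. This is a minimal illustration and not the BMDK implementation: the placement of the intensity ranges (Controls starting at 0, Cases shifted up by Zb), the use of a strict "less than" comparison, and all function names are assumptions made for the example.

```python
import random


def gini_split(cases, controls):
    """Lowest weighted GINI index over all single cut points (one-node tree)."""
    labeled = sorted([(v, 1) for v in cases] + [(v, 0) for v in controls])
    n = len(labeled)
    total_cases = len(cases)
    p_all = total_cases / n
    best = 1.0 - p_all ** 2 - (1.0 - p_all) ** 2  # GINI with no split at all
    left_cases = 0
    for i in range(1, n):
        left_cases += labeled[i - 1][1]
        if labeled[i][0] == labeled[i - 1][0]:
            continue  # no cut point between two equal intensities
        p_left = left_cases / i
        p_right = (total_cases - left_cases) / (n - i)
        g_left = 1.0 - p_left ** 2 - (1.0 - p_left) ** 2
        g_right = 1.0 - p_right ** 2 - (1.0 - p_right) ** 2
        best = min(best, (i * g_left + (n - i) * g_right) / n)
    return best


def make_feature(n_each, rng, x=100.0, za=0.0, zb=0.0):
    """Uniform intensities; za shrinks the Control range, zb shifts the Cases up."""
    cases = [rng.uniform(zb, x + zb) for _ in range(n_each)]
    controls = [rng.uniform(0.0, x - za) for _ in range(n_each)]
    return cases, controls


def null_threshold(n_each, n_features, rng):
    """Best (lowest) score seen among features that carry no information."""
    return min(gini_split(*make_feature(n_each, rng)) for _ in range(n_features))


def count_better(n_each, n_features, threshold, rng, za=0.0, zb=0.0):
    """Number of generated marker features that score below the null threshold."""
    return sum(
        gini_split(*make_feature(n_each, rng, za=za, zb=zb)) < threshold
        for _ in range(n_features)
    )


if __name__ == "__main__":
    rng = random.Random(0)
    threshold = null_threshold(30, 1000, rng)  # the text uses 10,000 features
    print(threshold, count_better(30, 1000, threshold, rng, za=10.0))
```

The example uses 1,000 features per run to keep the pure-Python version quick; setting `n_features` to 10,000 mirrors the counts described in the text, although the exact numbers will depend on the random number generator.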

A terse description and links to the 10 methods that examine the features for putative biomarkers are as follows.

- *catboot* (formerly known as *fqual* [Hab-05]) This method performs a Bootstrap analysis and determines the centroids for the remaining samples in each State. A distance-dependent K-Nearest Centroid algorithm is used for the classification of the removed samples, where K is the number of States in the dataset.
- *student* (formerly known as *impf* [Hab-05]) This method performs a Student t-test to test for independent distributions of Cases and Controls.
- *dtgini* (formerly known as *fqual* [Hab-05]) This method examines each feature by using it in a single-node decision tree, using the GINI Index to determine the optimum cut point.
- *dtinfg* (formerly known as *infgl* [Hab-05]) This method examines each feature by using it in a single-node decision tree, using the Information Gain to determine the optimum cut point.
- *nnfeat* This method uses each feature to construct a Feed-Forward Back-Propagation Artificial Neural Network. Each network has a single input node (the feature's intensities), two processing nodes in the hidden layer and a single output node.
- *chisq* This method determines an approximate chi-square value by dividing the total intensity range into regions, requiring that there are at least five expected Cases and Controls in each region.
- *kruswal* This method performs a Kruskal-Wallis one-way analysis of variance using the ranks of the intensities for samples in each State.
- *kolsmir* This method performs a Kolmogorov-Smirnov test (K-S test) to measure the maximum difference in cumulative fraction plots for the Cases and Controls.
- *extreme* This method measures the maximum number of samples from a given State at either extreme of the intensity distribution.
- *vartest* This method is derived from the relevance index of Yip and coworkers [Yip-03]. It finds features with a minimum intra-State variance relative to the total variance.
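To make two of these descriptions concrete, the sketch below computes the quantity that *kolsmir* is described as measuring (the maximum difference between cumulative fraction curves) and one plausible reading of the *extreme* score (the longest run of samples from a single State at the low or high end). Both functions are illustrative assumptions based only on the one-line descriptions above; BMDK's actual scoring, tie handling, and normalization may differ.

```python
from bisect import bisect_right


def ks_statistic(cases, controls):
    """Maximum difference between the cumulative fraction curves of two groups."""
    sorted_cases = sorted(cases)
    sorted_controls = sorted(controls)
    largest = 0.0
    for v in sorted(set(cases) | set(controls)):
        frac_cases = bisect_right(sorted_cases, v) / len(sorted_cases)
        frac_controls = bisect_right(sorted_controls, v) / len(sorted_controls)
        largest = max(largest, abs(frac_cases - frac_controls))
    return largest


def extreme_count(cases, controls):
    """Longest run of one State at either end of the sorted intensities.

    Ties between States are broken by label order here, which is a
    simplification made for this sketch.
    """
    labeled = sorted(
        [(v, "case") for v in cases] + [(v, "control") for v in controls]
    )

    def run_length(sequence):
        first_label = sequence[0][1]
        length = 0
        for _, label in sequence:
            if label != first_label:
                break
            length += 1
        return length

    return max(run_length(labeled), run_length(labeled[::-1]))


if __name__ == "__main__":
    cases = [5.0, 6.0, 7.0, 8.0, 9.0]
    controls = [1.0, 2.0, 6.5]
    print(ks_statistic(cases, controls), extreme_count(cases, controls))
```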

### Comparison and Conclusions

As stated in Identifying Putative Biomarkers, a total of 10 methods are available in the BioMarker Development Kit (BMDK) to search a dataset of feature intensities for putative biomarkers. Included in the description of each method is an examination of the minimum strength a feature must have to have a 50% probability or better of obtaining a score that is better than a feature with no information. Two forms of a putative biomarker are examined, as shown in the figure to the right. For a feature of type Feature-a, the relative strength of the putative biomarker is determined by Za, which represents the extent to which the maximum intensity for samples in one category exceeds that of the other. For example, if Za=30, the range of intensities for one category is only 70% that of the other, so approximately 85% of all samples have a sample of the other category with approximately the same intensity.

The following table lists the minimum value of Za needed for a peak of type Feature-a to have a 50% probability or higher of obtaining a better score than a non-informative peak as a function of the number of Cases and Controls (Each).

Each | catboot | student | dtgini | dtinfg | nnfeat | chisq | kruswal | kolsmir | extreme | vartest |
---|---|---|---|---|---|---|---|---|---|---|
30 | 60 | 55 | 50 | 40 | 55 | 65 | 65 | 60 | 45 | 50 |
45 | 50 | 40 | 35 | 35 | 50 | 45 | 55 | 45 | 40 | 45 |
60 | 50 | 35 | 30 | 30 | 45 | 40 | 45 | 45 | 30 | 35 |
90 | 45 | 30 | 25 | 25 | 35 | 35 | 35 | 35 | 20 | 40 |
150 | 35 | 30 | 15 | 15 | 30 | 30 | 25 | 25 | 10 | 30 |
300 | 25 | 20 | 10 | 10 | 20 | 20 | 20 | 20 | 10 | 20 |

For the largest dataset (300 Cases and 300 Controls), dtgini, dtinfg and extreme only require Za to be 10. This means that one category has a maximum intensity that is 90% of the other, so that 95% of all samples have an intensity in the overlapping region. For the smallest dataset (30 Cases and 30 Controls), dtinfg requires that one category have a maximum intensity that is 60% of the other, while catboot and kolsmir require that it be 40% or less.

For putative biomarkers of type Feature-b, the following table lists the minimum value of 2Zb needed to find at least 50% of Feature-b peaks with scores better than non-informative peaks as a function of the number of Cases and Controls (Each).

Each | catboot | student | dtgini | dtinfg | nnfeat | chisq | kruswal | kolsmir | extreme | vartest |
---|---|---|---|---|---|---|---|---|---|---|
30 | 95 | 55 | 85 | 70 | 70 | 100 | 80 | 95 | 75 | 60 |
45 | 80 | 50 | 60 | 60 | 60 | 65 | 65 | 70 | 55 | 60 |
60 | 75 | 45 | 55 | 50 | 55 | 55 | 50 | 70 | 55 | 45 |
90 | 60 | 35 | 45 | 40 | 40 | 40 | 40 | 50 | 35 | 45 |
150 | 50 | 30 | 25 | 25 | 35 | 35 | 30 | 40 | 15 | 30 |
300 | 30 | 20 | 15 | 10 | 25 | 20 | 25 | 30 | 10 | 20 |

For the largest dataset, dtinfg and extreme have at least a 50% probability of identifying a putative biomarker if there is a 95% overlap in the ranges of intensity for the two categories, again with 95% of all samples having an intensity in the overlapping region. For the smallest dataset, student has at least a 50% probability of identifying the feature if there is at most a 72.5% overlap in the intensity ranges, while for chisq this overlap can be at most 50%.
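The percentages quoted for both tables follow directly from the tabulated Za and 2Zb values when X=100. A minimal sketch of the conversion, assuming equal group sizes and uniform intensities as in the text (the function names are illustrative only):

```python
def feature_a_limits(za, x=100.0):
    """Return (max-intensity ratio, fraction of all samples in the overlap)."""
    ratio = (x - za) / x                    # smaller range relative to the larger
    overlap_samples = 1.0 - za / (2.0 * x)  # only half the samples can lie outside
    return ratio, overlap_samples


def feature_b_limits(two_zb, x=100.0):
    """Return (range overlap, fraction of all samples in the overlap)."""
    zb = two_zb / 2.0
    range_overlap = (x - zb) / x    # shared part of each range of width X
    overlap_samples = 1.0 - zb / x  # Zb/X of each group lies outside the overlap
    return range_overlap, overlap_samples


if __name__ == "__main__":
    print(feature_a_limits(10.0))   # dtgini, dtinfg, extreme at 300 each
    print(feature_b_limits(55.0))   # student at 30 each
    print(feature_b_limits(100.0))  # chisq at 30 each
```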

The major goal of this exercise is to demonstrate that there is a limit to the level of detection of a putative biomarker using any of these methods. For the largest dataset examined, at least 5% of the samples (30 samples) must have an intensity value in a range not covered by samples in the other category. As the sample size decreases, the fraction of samples that must have intensity values in a range not covered by the other category increases. This means that if a single category is composed of multiple States, and at least one of the States contains a small fraction of the samples, identifying a marker for this State may only be likely if the total number of samples in this category, and therefore the number of samples in this State, is reasonably large.

It should be stressed that these limits of detection are only approximate. In some cases a given value of Za or 2Zb identifies slightly fewer than 50% of the features. Since the tests are performed only at increments of 5 (5% of X), the minimum level shown in these tables may then identify significantly more than 50% of the features. In addition, these artificial features are produced using a random number generator with a uniform distribution. If the distribution of intensity values is not uniform, it may be harder or easier to identify a putative biomarker. For example, if the intensity values have a Gaussian (normal) distribution about the mean, finding a putative biomarker may be considerably harder. Conversely, a putative biomarker would be easier to identify if it has an excess density of intensity values in the range not covered by samples in the other category.

(Last updated 4/30/07)