The setup is that you have a set of hypotheses or rules, C, that are used in order to classify data. For example, if we are trying to decide whether or not a patient is healthy, we may look at his resting pulse and systolic blood pressure. Then I could make a rule that says if his pulse is above 100 and his systolic blood pressure above 140, then he is not healthy. A rule like this helps us make a decision whether or not the patient is healthy. Now assume you do not have expert information (that high blood pressure is bad) but you have data points-- measurements of pulse and blood pressure-- and

*labels*for each data point which say "healthy" or "not healthy". From here we would like to

*learn*a rule that can classify other patients as healthy or not healthy, using the same measurements. From p23:

"How can we use data to choose a good rule from C? One obvious idea is to select a rule from C with the least error on the data. Then we use that rule in order to classify new data. This is basically the method of enumerative induction. Our question then, is: How good is this version of enumerative induction for choosing a rule from C?

"Clearly, it depends on what rules are in the set C from which the rule is to be chosen. If all possible rules are in that set, then there will be many rules with least error on the data, and these will give completely different advice about new cases. ...

"Of course, restricting the rules in C runs the risk of not including the best of all possible rules, the rule with the least expected error on new cases. ...

"There are other possible inductive methods for choosing rules-- methods that do not just choose the rule with the least error on the data. One such method balances data-coverage against something else, such as the simplicity of a given rule. ... but now let us concentrate on what is needed for the sort of enumerative induction that simply chooses the rule in C with the least error on the data.

"So now consider the question of how the rules in C might be restricted if enumerative induction in this sense is to be guaranteed to work, given enough evidence, no matter what the background statistical probability distribution.

"The answer to this question is one of the great discoveries of statistical learning theory-- the discovery of the importance of the

*Vapnik-Chervonenkis*dimension, or VC dimension, of a set of rules. The VC dimension is a measure of the "richness" of the set of rules and it is inversely related to the degree of falsifiability of the set.

... from p45: "What has to be true of the set C in order to guarantee that, with probability approaching 1, given more and more data, the expected error for the rules that enumerative induction endorses at each stage will approach the minimum value of expected error for rules in C?

"You might wonder whether this sort of convergence isn't guaranteed by the statistical law of large numbers. That principle implies that with probability approaching 1, the empirical error of any particular rule will approach the expected error of that rule, given more and more data. But this is not the same as what is wanted. The trouble is that, given infinitely many rules, as more and more data are taken into account, the rules endorsed by enumerative induction can change infinitely often. Even if the empirical error for each rule approaches a limit, that does not imply anything about the limit of the empirical error of the varying rule endorsed by enumerative induction at each stage. ...

"What is needed then is not just that the empirical error of each rule should converge to its expected error but also that the empirical error of the varying rules endorsed by enumerative induction should approach the value of the expected error of that rule in the limit. If c_n is a rule endorsed by enumerative induction after n data points, then what is needed is that the empirical error of the rule c_n after n data points should approach the expected error of c_n in the limit. ...

"This will happen if (with probability approaching 1) the empirical error of the rules in C converge

*uniformly*to their expected error. ...

"What has to be true of the set of rules C for such uniform convergence? Vapnik and Chervonenkis (1968) show (in effect) that this condition is met for classification rules if and only if the set of classification rules C is not too rich, where the richness of the set is measured by what has come to be called its "VC dimension." ...

"Suppose that some set of N points in the feature space is

*shattered*by rules in C in the sense that, for any possible labeling of those points, some rule in C perfectly fits the points so labeled. Then the VC dimension of the set of rules C is at least N. More specifically, the VC dimension of a set of rules C is the largest number N such that some set of N points in feature space is shattered by rules in C. ...

"Notice that the definition of VC dimension refers to

*some*set of N points being shattered, not to all sets of N points being shattered. Consider the set of all linear classifications of points in the plane where YESes and NOs are separated by a straight line. The VC dimension of this set of classification rules is 3, because some sets of three points in the plane can be shattered by this class of rules and no set of four points can be shattered. Three collinear points (i.e., three points on the same straight line) cannot be shattered by this class of rules, because there is no such rule that can classify the middle point as a YES and the outer points as NOs. But three points that are not collinear can be shattered... And no four points can be shattered by this class of rules, so the VC dimension of these linear rules is exactly 3. ...

"Some other examples: The VC dimension of the set of all linear separations in D-dimensional space is D+1. The VC dimension of the set of all inner rectangles in the plane is 4. The VC dimension of the set of all unions of rectangles in the plane is infinite.

"So, that is what VC dimension comes to. Vapnik and Chervonenkis (1968) show roughly that enumerative induction is guaranteed to work no matter what the background probability distribution if and only if the classification rules in C have a finite VC dimension."

## No comments:

Post a Comment