Feature selection
From RapidWiki
Feature selection is a data mining step usually performed in the preprocessing phase. It means selecting, from a given set of attributes (features), the subset that is relevant for a given analysis task. The general strategy is to choose candidate subsets, learn a model on each subset, and evaluate the performance of the learned model on that subset. The subset on which the highest performance was achieved is then used as input to subsequent data mining steps.
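The general strategy described above can be sketched in a few lines of Python. This is only an illustration, not RapidMiner code: the function select_best_subset, the attribute names, and the scoring function toy_performance are hypothetical. In practice the scoring function would train a model on the subset and return a measured performance value.

```python
from typing import Callable, FrozenSet, Iterable

Subset = FrozenSet[str]


def select_best_subset(candidates: Iterable[Subset],
                       evaluate: Callable[[Subset], float]) -> Subset:
    """Return the candidate subset on which `evaluate` reports the highest performance."""
    best_subset, best_score = None, float("-inf")
    for subset in candidates:
        score = evaluate(subset)
        if score > best_score:
            best_subset, best_score = subset, score
    if best_subset is None:
        raise ValueError("no candidate subsets given")
    return best_subset


# Hypothetical stand-in for "learn a model and measure its performance":
# relevant attributes raise the score, irrelevant ones lower it slightly.
RELEVANT = frozenset({"age", "income"})


def toy_performance(subset: Subset) -> float:
    return len(subset & RELEVANT) - 0.1 * len(subset - RELEVANT)


candidates = [
    frozenset({"age"}),
    frozenset({"age", "income"}),
    frozenset({"age", "income", "zip"}),
]
best = select_best_subset(candidates, toy_performance)
```

The selection strategies differ only in how the list of candidates is produced.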
Selection strategies
Although all feature selection algorithms generally work as described above, they differ in how they select attribute subsets for evaluation:
Brute force
The brute force strategy evaluates the performance of all possible attribute subsets. It is therefore complete in the sense that it is guaranteed to find the attribute subset on which the performance evaluation yields the maximum performance. However, this often results in a high runtime, since the number of possible attribute subsets grows exponentially with the number of attributes in the original example set. Brute force feature selection is provided by the operator BruteForce.
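A minimal sketch of the brute force idea, assuming a hypothetical scoring function toy_performance in place of real model training and evaluation. It is not the implementation of the BruteForce operator. The sketch skips the empty subset, so n attributes lead to 2^n - 1 evaluations.

```python
from itertools import combinations

# Hypothetical stand-in for "learn a model and measure its performance".
RELEVANT = frozenset({"age", "income"})


def toy_performance(subset):
    return len(subset & RELEVANT) - 0.1 * len(subset - RELEVANT)


def brute_force_selection(attributes, evaluate):
    """Evaluate every non-empty attribute subset and return the best one."""
    best_subset, best_score = None, float("-inf")
    evaluated = 0
    for size in range(1, len(attributes) + 1):
        for combo in combinations(attributes, size):
            subset = frozenset(combo)
            score = evaluate(subset)
            evaluated += 1
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score, evaluated


attributes = ["age", "income", "zip", "id"]
best_subset, best_score, evaluated = brute_force_selection(attributes, toy_performance)
# With 4 attributes, 2**4 - 1 = 15 subsets are evaluated.
```

Each additional attribute doubles the number of subsets, which is why this approach quickly becomes impractical for larger example sets.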
Forward selection
This strategy starts by evaluating only attribute subsets that contain exactly one attribute. Additional attributes are then added heuristically until adding another attribute yields no further performance gain. The FeatureSelection operator supports this strategy through the appropriate parameter value.
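One common way to realise forward selection is a greedy loop: in each round, every remaining attribute is tried as an addition, and the best one is kept only if it improves the score. The sketch below assumes this greedy heuristic and a hypothetical scoring function toy_performance. The heuristic used by the FeatureSelection operator may differ.

```python
# Hypothetical stand-in for "learn a model and measure its performance".
RELEVANT = frozenset({"age", "income"})


def toy_performance(subset):
    return len(subset & RELEVANT) - 0.1 * len(subset - RELEVANT)


def forward_selection(attributes, evaluate):
    """Greedily add the attribute that helps most until nothing improves the score."""
    selected = frozenset()
    best_score = float("-inf")
    remaining = set(attributes)
    while remaining:
        # The first round scores all single-attribute subsets.
        scored = [(evaluate(selected | {a}), a) for a in sorted(remaining)]
        round_score, round_attr = max(scored)
        if round_score <= best_score:
            break  # no performance gain from adding another attribute
        selected = selected | {round_attr}
        best_score = round_score
        remaining.remove(round_attr)
    return selected, best_score


selected, score = forward_selection(["age", "income", "zip", "id"], toy_performance)
```

Because only one attribute is added per round, at most n + (n - 1) + ... + 1 subsets are evaluated, but the result is not guaranteed to be the global optimum.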
Backward elimination
In contrast to the forward selection strategy, the backward elimination strategy starts with the complete attribute set as the initial subset and iteratively (and also heuristically) removes attributes from it until no performance gain can be achieved by removing another attribute. This strategy is also provided by the FeatureSelection operator.
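Backward elimination can be sketched as the mirror image of a greedy forward search. The code below assumes a greedy heuristic (drop the attribute whose removal helps most) and a hypothetical scoring function toy_performance. It is an illustration, not the FeatureSelection operator's implementation.

```python
# Hypothetical stand-in for "learn a model and measure its performance".
RELEVANT = frozenset({"age", "income"})


def toy_performance(subset):
    return len(subset & RELEVANT) - 0.1 * len(subset - RELEVANT)


def backward_elimination(attributes, evaluate):
    """Start with all attributes and greedily drop the one whose removal helps most."""
    selected = frozenset(attributes)
    best_score = evaluate(selected)
    while len(selected) > 1:
        # Score every subset obtained by removing exactly one attribute.
        scored = [(evaluate(selected - {a}), a) for a in sorted(selected)]
        round_score, round_attr = max(scored)
        if round_score <= best_score:
            break  # no performance gain from removing another attribute
        selected = selected - {round_attr}
        best_score = round_score
    return selected, best_score


selected, score = backward_elimination(["age", "income", "zip", "id"], toy_performance)
```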
Evolutionary strategy
In RapidMiner, evolutionary strategies can be applied both for parameter optimization and for attribute selection.
An optimal attribute subset might also be found by an evolutionary strategy. To this end, every attribute subset is considered an individual. An evolutionary algorithm works on a population of such individuals, which may be selected to mutate or to undergo a crossover. For feature selection, a mutation might switch features on and off, while a crossover might interchange features between individuals. An evolutionary feature selection strategy is implemented in the GeneticAlgorithm operator.
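The following sketch illustrates the idea with individuals encoded as lists of on/off flags, one per attribute. The tournament selection, uniform crossover, elitism, default parameter values, and the scoring function toy_performance are all assumptions made for this example. They are not taken from the GeneticAlgorithm operator.

```python
import random

# Hypothetical stand-in for "learn a model and measure its performance".
RELEVANT = frozenset({"age", "income"})


def toy_performance(subset):
    return len(subset & RELEVANT) - 0.1 * len(subset - RELEVANT)


def mutate(individual, rng, rate):
    """Switch each feature on or off independently with probability `rate`."""
    return [(not bit) if rng.random() < rate else bit for bit in individual]


def crossover(parent_a, parent_b, rng):
    """Uniform crossover: interchange flags between two parents at random positions."""
    child_a, child_b = list(parent_a), list(parent_b)
    for i in range(len(child_a)):
        if rng.random() < 0.5:
            child_a[i], child_b[i] = child_b[i], child_a[i]
    return child_a, child_b


def evolutionary_selection(attributes, evaluate, population_size=10,
                           generations=20, mutation_rate=0.1, seed=0):
    rng = random.Random(seed)

    def decode(individual):
        return frozenset(a for a, bit in zip(attributes, individual) if bit)

    def fitness(individual):
        return evaluate(decode(individual))

    population = [[rng.random() < 0.5 for _ in attributes]
                  for _ in range(population_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # Tournament selection: the fitter of two random individuals becomes a parent.
        parents = [max(rng.sample(population, 2), key=fitness)
                   for _ in range(population_size)]
        offspring = []
        for a, b in zip(parents[0::2], parents[1::2]):
            for child in crossover(a, b, rng):
                offspring.append(mutate(child, rng, mutation_rate))
        # Elitism: keep the best individual found so far in the population.
        population = offspring[:population_size - 1] + [best]
        best = max(population, key=fitness)
    return decode(best), fitness(best)


attributes = ["age", "income", "zip", "id"]
subset, score = evolutionary_selection(attributes, toy_performance)
```

Because the search is randomised, the result depends on the seed and is not guaranteed to be the optimal subset.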
Random selection
In contrast to a deterministic or heuristic strategy, an attribute subset might be chosen completely at random. This might be fast, but there is no guarantee that the selected attributes are relevant for the analysis task. In RapidMiner, the operator RandomSelection selects attributes at random.
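Random selection needs no performance evaluation at all, as the short sketch below shows. The function name random_selection and its parameters k (the number of attributes to keep) and seed are hypothetical. How the RandomSelection operator decides on the subset size is not covered here.

```python
import random


def random_selection(attributes, k, seed=None):
    """Pick k distinct attributes uniformly at random, without evaluating any model."""
    rng = random.Random(seed)
    return frozenset(rng.sample(list(attributes), k))


attributes = ["age", "income", "zip", "id"]
subset = random_selection(attributes, k=2, seed=42)
```

Nothing in this procedure looks at the data or at the analysis task, which is why no statement about the relevance of the chosen attributes can be made.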
