Feature Selection

December 30, 2014 | Data Science
Feature Selection is crucial to any model construction in data science. Focusing on the most important, relevant features will help any data scientist design a better model and accelerate outcomes.

So what exactly is a feature in data science analysis? Let’s start there. Frequently we refer to the set of values that describe a data point as features. In practice these values go by several names: attributes, values, or dimensions. They all refer to the same thing: measurements that describe an instance.

Here’s an example.
To describe a single day of the year, what kinds of measurements could we gather?

Some values that you might encounter:

  • Temperature
  • Air pressure
  • Hours of sunlight
  • Precipitation
  • Humidity
  • Number of pennies found on the ground that day
  • Number of jaywalkers in New York that day
  • Price of oil
  • Whether an economic jobs report was released that day
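
To make this concrete, here is a minimal sketch in Python of a single day represented as a feature vector; the names and values are made up for illustration.

```python
# A single day represented as a feature vector; names and values are
# hypothetical, matching the measurements listed above.
day = {
    "temperature_f": 38.2,         # degrees Fahrenheit
    "air_pressure_hpa": 1013.4,    # hectopascals
    "sunlight_hours": 9.5,
    "precipitation_in": 0.12,      # inches
    "humidity_pct": 71.0,
    "pennies_found": 3,
    "jaywalkers_nyc": 4120,
    "oil_price_usd": 54.11,
    "jobs_report_released": 0,     # 1 if a jobs report came out that day
}

# Most learning libraries expect each instance as a plain numeric vector.
feature_names = sorted(day)
x = [day[name] for name in feature_names]
```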

If we were interested in predicting the weather the next day, which of these measurements would be most valuable? Most likely, the temperature, air pressure, hours of sunlight, precipitation and humidity would yield the best prediction.

Data regarding pennies, jaywalkers, oil, and the jobs report would likely not yield any useful information about the weather. However, it is always possible that some relationship exists mathematically even though it cannot be explained. (In statistics this is referred to as a spurious relationship.)

If we dedicate resources in our learning algorithm to uninformative features like these, we complicate the algorithm and increase the likelihood of overfitting the model to our dataset, learning incorrect concepts along the way. My example was simplistic; now imagine 1,000 more features, such as the sock color of 1,000 randomly chosen people. That data won’t help predict the weather, but if it is in the dataset, the algorithm will waste enormous effort configuring and tuning parameters while trying to learn a relationship between socks and tomorrow’s weather. There is no value in that. With only a few well-chosen features, we could make much more efficient use of the algorithm and model, so discovering a small set of powerful features is the goal.
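
To see that effect, here is a quick sketch using scikit-learn on synthetic data; the dataset, model, and noise columns are assumptions chosen purely for illustration, with the noise playing the role of the sock colors.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# All 5 columns of X carry signal; the appended columns are pure noise.
X, y = make_classification(n_samples=200, n_features=5, n_informative=5,
                           n_redundant=0, random_state=0)
noise = np.random.RandomState(0).normal(size=(200, 1000))  # 1,000 "sock color" features
X_noisy = np.hstack([X, noise])

clf = LogisticRegression(max_iter=1000)
print("informative features only:", cross_val_score(clf, X, y, cv=5).mean())
print("with 1,000 noise features: ", cross_val_score(clf, X_noisy, y, cv=5).mean())
# The noisy run typically scores lower and takes noticeably longer to fit.
```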

There is another problem case, where one feature carries the same information as another. In my example, the presence of precipitation (a simple “yes/no”) and the amount of rainfall that day form a redundant pair: the amount of precipitation is more detailed than the yes/no flag, so a model trained on both will spend extra effort learning values it already has information about.
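
One quick way to spot such a pair is to check how strongly the two columns track each other. A small sketch with made-up rainfall numbers:

```python
import numpy as np

# Hypothetical daily rainfall amounts (inches) and the derived yes/no flag.
rainfall_in = np.array([0.0, 0.0, 0.3, 1.2, 0.0, 0.8, 0.05, 0.0])
rained = (rainfall_in > 0).astype(float)   # redundant: fully determined by rainfall_in

# A strong correlation flags the pair as largely redundant; dropping one of
# the two usually costs the model very little.
print(np.corrcoef(rainfall_in, rained)[0, 1])
```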

Ideally we would like a set of features that are relevant to the model being constructed and that each carry information not duplicated by the others.

Feature selection is especially critical in situations where the number of features greatly outnumbers the number of samples. This is referred to as the curse of dimensionality, a rich topic for another time. For a quick overview:
http://en.wikipedia.org/wiki/Curse_of_dimensionality

The basics are that you need an increasingly larger number of samples to cover the instance space at the same density. A helpful interactive demo of the concept can be found here: https://prpatil.shinyapps.io/cod_app/
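
As a back-of-the-envelope sketch: if each of d features is split into 10 bins, covering every combination of bins at the same density requires roughly 10^d samples.

```python
# Samples needed to keep the same density as the number of features grows,
# assuming each feature is split into 10 bins.
for d in (1, 2, 3, 5, 10):
    print(f"{d} feature(s) -> ~{10 ** d:,} samples")
```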

Now that we understand what a feature is and why too many can be a bad thing, what can we do?

Two common approaches to feature selection are filter methods and wrapper methods. In both, features are evaluated to assess the quality of a model that could be constructed from them. A filter method looks at features independently, scoring the relevance of each feature on its own, without regard to how it performs in the model of interest. A simple filter is a t-test: go through the features one by one, test for a significant difference between the class distributions on that feature alone, and keep only the features that are significantly different.
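
Here is a minimal sketch of that t-test filter, using scipy on a synthetic scikit-learn dataset; both the data and the 0.05 cutoff are assumptions for illustration.

```python
from scipy.stats import ttest_ind
from sklearn.datasets import make_classification

# Synthetic two-class data: 20 features, only 5 of which are informative.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

keep = []
for j in range(X.shape[1]):
    _, p_value = ttest_ind(X[y == 0, j], X[y == 1, j])
    if p_value < 0.05:        # class means differ significantly for this feature
        keep.append(j)

X_filtered = X[:, keep]
print("kept feature indices:", keep)
```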

A wrapper method evaluates features by how they perform together on the model. A candidate set of features is used to construct the model, and the performance of that model is the score for the set; sets that yield better models are taken to be better feature sets. You can think of a feature set as a team that is evaluated for its overall quality against another “team” of features. For the weather example, if we selected temperature, air pressure, and precipitation and constructed a predictive model using only these features, how well would it predict tomorrow’s weather? That result could be compared to a model constructed from jaywalkers and the price of oil. The set containing weather measurements would very likely produce the more accurate model.
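
A minimal wrapper-style sketch along the same lines, again on synthetic data, with the column subsets standing in for the “weather” and “jaywalkers plus oil” feature sets:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# shuffle=False keeps the informative columns first, so the subsets below are
# meaningful: columns 0-3 carry signal, columns 6-9 are noise.
X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           n_redundant=2, random_state=0, shuffle=False)

feature_sets = {
    "weather-like set (informative)": [0, 1, 2, 3],
    "jaywalkers-and-oil set (noise)": [6, 7],
}

clf = LogisticRegression(max_iter=1000)
for name, cols in feature_sets.items():
    score = cross_val_score(clf, X[:, cols], y, cv=5).mean()
    print(f"{name}: cross-validated accuracy {score:.3f}")
```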

What makes a good feature set? It depends entirely on what you want to accomplish with your model. Ask different questions of your data and the same feature can take on a different importance.

The next time your dataset arrives with a large set of features, consider whether keeping all of them will actually be beneficial. If not, try filtering or wrapping.

Let’s continue the conversation @paulyacci

—Written by Paul Yacci