Feature Engineering

Use domain knowledge and transformations to extract and optimize features from raw data

Feature engineering is the process of turning raw data into features that machine learning models can use. It is difficult because extracting features from signals and images requires deep domain knowledge, and finding the best features remains an iterative process even when you apply automated methods.

Feature engineering encompasses one or more of the following steps:

  1. Feature extraction to generate candidate features
  2. Feature transformation, which maps features to make them more suitable for downstream modeling
  3. Feature selection, which identifies the subset of features that provides the best predictive power while reducing model size and simplifying prediction

For example, sports statistics include numeric data like games played, average time per game, and points scored, all broken down by player. Feature extraction in this context includes compressing these statistics into derived numbers, like points per game or average time to score. Then feature selection becomes a question of whether you build a model using just these ratios, or whether the original statistics still help the model make more accurate predictions.
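As a minimal sketch of the extraction step in the sports example, the Python snippet below derives points per game and average time to score from raw per-player statistics. The field names, the player records, and the helper functions are hypothetical and exist only for illustration.

```python
# Hypothetical raw per-player statistics (made-up numbers for illustration).
players = [
    {"name": "A", "games_played": 10, "avg_minutes_per_game": 30.0, "points_scored": 150},
    {"name": "B", "games_played": 8, "avg_minutes_per_game": 20.0, "points_scored": 40},
    {"name": "C", "games_played": 0, "avg_minutes_per_game": 0.0, "points_scored": 0},
]


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def extract_ratio_features(player):
    """Derive candidate ratio features from one player's raw statistics."""
    total_minutes = player["avg_minutes_per_game"] * player["games_played"]
    return {
        "points_per_game": safe_ratio(player["points_scored"], player["games_played"]),
        "minutes_per_point": safe_ratio(total_minutes, player["points_scored"]),
    }


derived = {p["name"]: extract_ratio_features(p) for p in players}
```

Returning None for players with no games or no points keeps undefined ratios out of the feature table. How to handle such missing values downstream is a separate modeling decision.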

Manual feature extraction for signal and image data requires signal and image processing knowledge, though automated techniques such as wavelet transforms have proven very effective. These techniques are useful even if you apply deep learning to signal data, since deep neural nets can have trouble uncovering structure in raw signal data. The traditional approach for extracting features from text data is to model text as a bag of words. Modern approaches apply deep learning to encode the context of words, as in the popular word embedding technique word2vec.
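The bag-of-words model mentioned above can be sketched in a few lines of Python. The tokenizer (lowercase, then split on anything other than letters, digits, or apostrophes) and the sample sentences are assumptions made for this illustration and are not taken from any particular toolbox.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def bag_of_words(documents):
    """Return a sorted vocabulary and one word-count vector per document."""
    token_lists = [tokenize(doc) for doc in documents]
    vocabulary = sorted({token for tokens in token_lists for token in tokens})
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)  # missing words count as 0
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


# Hypothetical documents, for illustration only.
docs = ["The pump is noisy", "The pump failed, the motor is fine"]
vocabulary, vectors = bag_of_words(docs)
```

Each document becomes a fixed-length count vector over the shared vocabulary. Word order is discarded, which is the limitation that context-aware embeddings aim to address.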

Feature transformation includes popular data preparation techniques, such as normalization to address large differences in the scale of features, but also aggregation to summarize data, filtering to remove noise, and dimensionality reduction techniques such as PCA and factor analysis.
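To illustrate normalization, the sketch below standardizes each feature column to zero mean and unit standard deviation (z-scoring). Using the population standard deviation and mapping constant columns to zero are choices made for this example, and the sample data is made up.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each column of a row-major table to zero mean, unit std."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]  # population standard deviation
    return [
        # A constant column has zero spread, so map it to 0.0 instead of dividing by zero.
        [(value - m) / s if s > 0 else 0.0 for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Two hypothetical features whose scales differ by a factor of 1000.
raw = [[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]]
scaled = zscore_columns(raw)
```

After scaling, both columns hold the same values, so neither feature dominates a distance-based or gradient-based model purely because of its units.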

MATLAB® supports many methods for feature selection. Some are based on ranking features by importance, which could be as basic as correlation with the response. Some machine learning models estimate feature importance during the learning algorithm (“embedded” feature selection), while so-called filter-based methods infer a separate model of feature importance. Wrapper selection methods iteratively add and remove candidate features using a selection criterion. The figure below provides an overview of the various aspects of feature engineering to guide practitioners in finding effective features for their machine learning models.

Basic feature engineering workflow.
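The simplest ranking approach described above, correlation with the response, might look like the following Python sketch. It scores each feature by the absolute value of its Pearson correlation with the response and sorts in descending order. The feature names and data are hypothetical, and this is not how any MATLAB function is implemented.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_by_correlation(features, response):
    """Return (name, score) pairs sorted by absolute correlation, highest first."""
    scores = {name: abs(pearson(values, response)) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical candidate features and response.
features = {
    "linear": [1.0, 2.0, 3.0, 4.0, 5.0],
    "noisy": [2.0, 1.0, 4.0, 3.0, 5.0],
    "constant": [7.0, 7.0, 7.0, 7.0, 7.0],
}
response = [2.0, 4.0, 6.0, 8.0, 10.0]
ranking = rank_by_correlation(features, response)
```

A ranking like this scores each feature in isolation, so it cannot detect redundancy between features or nonlinear relationships with the response.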

Deep learning has become known for taking raw image and signal data as input, thus eliminating the feature engineering step. While that works well for large image and video data sets, feature engineering is still critical for good performance when applying deep learning to smaller data sets and signal-based problems.

Key Points

  • Feature engineering is essential for applying machine learning, and is also relevant for applications of deep learning to signals.
  • Wavelet scattering delivers good features from signal and image data without manual feature extraction.
  • Additional steps such as feature transformation and selection can yield more accurate yet smaller sets of features suitable for deployment to hardware-constrained environments.

Example

Ranking features with the minimum redundancy maximum relevance (MRMR) algorithm, implemented in MATLAB as the fscmrmr function, yields good features for classification without long runtimes, as demonstrated in this example. A large drop in importance score between consecutive ranked features means you can confidently set the cutoff for which features to use in your model at that point. Small drops indicate that you may have to include many additional features to avoid a significant loss in accuracy for the resulting model.
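One way to act on the drop in importance scores is to place the cutoff at the largest gap between consecutive ranked scores. The Python sketch below does this for a list of scores. The function name and the score values are hypothetical, and this heuristic is only one possible reading of the guidance above, not part of fscmrmr.

```python
def cutoff_at_largest_drop(scores):
    """Return (number of top features to keep, size of the largest drop)."""
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 2:
        return len(ranked), 0.0
    # Drop between each ranked score and the next one.
    drops = [ranked[i] - ranked[i + 1] for i in range(len(ranked) - 1)]
    largest = max(range(len(drops)), key=drops.__getitem__)
    # Keep every feature ranked above the largest drop.
    return largest + 1, drops[largest]


# Hypothetical importance scores, already ranked from most to least important.
scores = [0.90, 0.85, 0.80, 0.30, 0.25, 0.20]
keep, drop = cutoff_at_largest_drop(scores)
```

When the largest drop is not much bigger than the others, the cutoff is weakly supported. In that case, comparing model accuracy for several feature counts is a safer guide than the scores alone.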

The fscmrmr function applies to classification problems only. For regression, neighborhood component analysis is a good option, available in MATLAB as fsrnca.

See also: feature extraction, feature selection, cluster analysis, Wavelet Toolbox