Automated Feature Engineering for Classification
The gencfeatures function enables you to automate the feature engineering process in the context of a machine learning workflow. Before passing tabular training data to a classifier, you can create new features from the predictors in the data by using gencfeatures. Then use the returned data to train the classifier.
Generate new features based on your machine learning workflow.

To generate features for an interpretable binary classifier, use the default TargetLearner value of "linear" in the call to gencfeatures. You can then use the returned data to train a binary linear classifier. For an example, see Interpret Linear Model with Generated Features.

To generate features that can lead to better model accuracy, specify TargetLearner="bag" or TargetLearner="gaussian-svm" in the call to gencfeatures. You can then use the returned data to train a bagged ensemble classifier or a binary support vector machine (SVM) classifier with a Gaussian kernel, respectively. For an example, see Generate New Features to Improve Bagged Ensemble Accuracy.
To better understand the generated features, use the describe function of the FeatureTransformer object. To apply the same training set feature transformations to a test or validation set, use the transform function of the FeatureTransformer object.
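The overall pattern can be sketched as follows. This is an illustrative outline rather than code from the examples below: trainTbl, testTbl, the response name "Y", and the feature count 20 are placeholder choices.

```matlab
% Generate 20 new features aimed at a bagged ensemble
% (table and variable names here are placeholders)
[T,newTrainTbl] = gencfeatures(trainTbl,"Y",20,TargetLearner="bag");

describe(T)                          % inspect the generated transformations
newTestTbl = transform(T,testTbl);   % apply the same transformations to new data
```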
Interpret Linear Model with Generated Features
Use automated feature engineering to generate new features. Train a linear classifier using the generated features. Interpret the relationship between the generated features and the trained model.
Load the patients data set. Create a table from a subset of the variables. Display the first few rows of the table.
load patients
Tbl = table(Age,Diastolic,Gender,Height,SelfAssessedHealthStatus, ...
    Systolic,Weight,Smoker);
head(Tbl)
    Age    Diastolic      Gender      Height    SelfAssessedHealthStatus    Systolic    Weight    Smoker
    ___    _________    __________    ______    ________________________    ________    ______    ______
     38       93        {'Male'  }      71          {'Excellent'}             124        176      true
     43       77        {'Male'  }      69          {'Fair'     }             109        163      false
     38       83        {'Female'}      64          {'Good'     }             125        131      false
     40       75        {'Female'}      67          {'Fair'     }             117        133      false
     49       80        {'Female'}      64          {'Good'     }             122        119      false
     46       70        {'Female'}      68          {'Good'     }             121        142      false
     33       88        {'Female'}      64          {'Good'     }             130        142      true
     40       82        {'Male'  }      68          {'Good'     }             115        180      false
Generate 10 new features from the variables in Tbl. Specify the Smoker variable as the response. By default, gencfeatures assumes that the new features will be used to train a binary linear classifier.
rng("default") % For reproducibility
[T,NewTbl] = gencfeatures(Tbl,"Smoker",10)
T = 
  FeatureTransformer with properties:

                     Type: 'classification'
            TargetLearner: 'linear'
    NumEngineeredFeatures: 10
      NumOriginalFeatures: 0
         TotalNumFeatures: 10
NewTbl=100×11 table
zsc(Systolic.^2) eb8(Diastolic) q8(Systolic) eb8(Systolic) q8(Diastolic) zsc(kmd9) zsc(sin(Age)) zsc(sin(Weight)) zsc(Height-Systolic) zsc(kmc1) Smoker
________________ ______________ ____________ _____________ _____________ _________ _____________ ________________ ____________________ _________ ______
0.15379 8 6 4 8 -1.7207 0.50027 0.19202 0.40418 0.76177 true
-1.9421 2 1 1 2 -0.22056 -1.1319 -0.4009 2.3431 1.1617 false
0.30311 4 6 5 5 0.57695 0.50027 -1.037 -0.78898 -1.4456 false
-0.85785 2 2 2 2 0.83391 1.1495 1.3039 0.85162 -0.010294 false
-0.14125 3 5 4 4 1.779 -1.3083 -0.42387 -0.34154 0.99368 false
-0.28697 1 4 3 1 0.67326 1.3761 -0.72529 0.40418 1.3755 false
1.0677 6 8 6 6 -0.42521 1.5181 -0.72529 -1.5347 -1.4456 true
-1.1361 4 2 2 5 -0.79995 1.1495 -1.0225 1.2991 1.1617 false
-1.1361 3 2 2 3 -0.80136 0.46343 1.0806 1.2991 -1.208 false
-0.71693 5 3 3 6 0.37961 -0.51304 0.16741 0.55333 -1.4456 false
-1.2734 2 1 1 2 1.2572 1.3025 1.0978 1.4482 -0.010294 false
-1.1361 1 2 2 1 1.001 -1.2545 -1.2194 1.0008 -0.010294 false
0.60534 1 6 5 1 -0.98493 -0.11998 -1.211 -0.043252 -1.208 false
1.0677 8 8 6 8 -0.27307 1.4659 1.2168 -0.34154 0.24706 true
-1.2734 3 1 1 4 0.93395 -1.3633 -0.17603 1.0008 -0.010294 false
1.0677 7 8 6 8 -0.91396 -1.04 -1.2109 -0.49069 0.24706 true
⋮
T is a FeatureTransformer object that can be used to transform new data, and NewTbl contains the new features generated from the Tbl data.
To better understand the generated features, use the describe object function of the FeatureTransformer object. For example, inspect the first two generated features.
describe(T,1:2)
                        Type        IsOriginal    InputVariables                          Transformations
                    ___________    __________    ______________    _______________________________________________________________

    zsc(Systolic.^2)    Numeric        false        Systolic       power( ,2)
                                                                   Standardization with z-score (mean = 15119.54, std = 1667.5858)
    eb8(Diastolic)      Categorical    false        Diastolic      Equal-width binning (number of bins = 8)
The first feature in NewTbl is a numeric variable, created by first squaring the values of the Systolic variable and then converting the results to z-scores. The second feature in NewTbl is a categorical variable, created by binning the values of the Diastolic variable into 8 bins of equal width.
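As a rough illustration, the first transformation can be reproduced by hand. This is a sketch based on the describe output above, not part of the original example.

```matlab
% Reconstruct the first generated feature manually:
% square Systolic, then standardize with a z-score
x = Tbl.Systolic.^2;
feat1 = (x - mean(x))./std(x);   % corresponds to zsc(Systolic.^2)
```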
Use the generated features to fit a linear classifier without any regularization.
Mdl = fitclinear(NewTbl,"Smoker",Lambda=0);
Plot the coefficients of the predictors used to train Mdl. Note that fitclinear expands categorical predictors before fitting a model.
p = length(Mdl.Beta);
[sortedCoefs,expandedIndex] = sort(Mdl.Beta,ComparisonMethod="abs");
sortedExpandedPreds = Mdl.ExpandedPredictorNames(expandedIndex);
bar(sortedCoefs,Horizontal="on")
yticks(1:2:p)
yticklabels(sortedExpandedPreds(1:2:end))
xlabel("Coefficient")
ylabel("Expanded Predictors")
title("Coefficients for Expanded Predictors")
Identify the predictors whose coefficients have larger absolute values.
bigCoefs = abs(sortedCoefs) >= 4;
flip(sortedExpandedPreds(bigCoefs))
ans = 1x7 cell
{'zsc(Systolic.^2)'} {'eb8(Systolic) >= 5'} {'eb8(Diastolic) >= 3'} {'q8(Diastolic) >= 3'} {'q8(Systolic) >= 6'} {'q8(Diastolic) >= 6'} {'zsc(Height-Systolic)'}
You can use partial dependence plots to analyze the categorical features whose levels have large coefficients in terms of absolute value. For example, inspect the partial dependence plot for the q8(Diastolic) variable, whose levels q8(Diastolic) >= 3 and q8(Diastolic) >= 6 have coefficients with large absolute values. These two levels correspond to noticeable changes in the predicted scores.
plotPartialDependence(Mdl,"q8(Diastolic)",Mdl.ClassNames,NewTbl);
Generate New Features to Improve Bagged Ensemble Accuracy
Use gencfeatures to engineer new features before training a bagged ensemble classifier. Before making predictions on new data, apply the same feature transformations to the new data set. Compare the test set performance of the ensemble that uses the engineered features to the test set performance of the ensemble that uses the original features.
Read the sample file CreditRating_Historical.dat into a table. The predictor data consists of financial ratios and industry sector information for a list of corporate customers. The response variable consists of credit ratings assigned by a rating agency. Preview the first few rows of the data set.
creditrating = readtable("CreditRating_Historical.dat");
head(creditrating)
      ID       WC_TA     RE_TA     EBIT_TA    MVE_BVTD    S_TA     Industry    Rating 
    _____    _______    _______    _______    ________    _____    ________    _______
    62394      0.013      0.104     0.036       0.447     0.142       3        {'BB' }
    48608      0.232      0.335     0.062       1.969     0.281       8        {'A'  }
    42444      0.311      0.367     0.074       1.935     0.366       1        {'A'  }
    48631      0.194      0.263     0.062       1.017     0.228       4        {'BBB'}
    43768      0.121      0.413     0.057       3.647     0.466      12        {'AAA'}
    39255     -0.117     -0.799     0.01        0.179     0.082       4        {'CCC'}
    62236      0.087      0.158     0.049       0.816     0.324       2        {'BBB'}
    39354      0.005      0.181     0.034       2.597     0.388       7        {'AA' }
Because each value in the ID variable is a unique customer ID (that is, length(unique(creditrating.ID)) is equal to the number of observations in creditrating), the ID variable is a poor predictor. Remove the ID variable from the table, and convert the Industry variable to a categorical variable.
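You can verify this claim directly before removing the variable. The check below is a quick sketch based on the statement above.

```matlab
% Confirm that every ID value is unique, so ID carries
% no generalizable signal for classification
length(unique(creditrating.ID)) == height(creditrating)
```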
creditrating = removevars(creditrating,"ID");
creditrating.Industry = categorical(creditrating.Industry);
Convert the Rating response variable to a categorical variable.
creditrating.Rating = categorical(creditrating.Rating, ...
    ["AAA","AA","A","BBB","BB","B","CCC"]);
Partition the data into training and test sets. Use approximately 75% of the observations as training data, and 25% of the observations as test data. Partition the data using cvpartition.
rng("default") % For reproducibility of the partition
c = cvpartition(creditrating.Rating,Holdout=0.25);
trainingIndices = training(c); % Indices for the training set
testIndices = test(c);         % Indices for the test set
creditTrain = creditrating(trainingIndices,:);
creditTest = creditrating(testIndices,:);
Use the training data to generate 40 new features to fit a bagged ensemble. By default, the 40 features include original features that can be used as predictors by a bagged ensemble.
[T,newCreditTrain] = gencfeatures(creditTrain,"Rating",40, ...
    TargetLearner="bag");
T
T = 
  FeatureTransformer with properties:

                     Type: 'classification'
            TargetLearner: 'bag'
    NumEngineeredFeatures: 34
      NumOriginalFeatures: 6
         TotalNumFeatures: 40
Create newCreditTest by applying the transformations stored in the object T to the test data.
newCreditTest = transform(T,creditTest);
Compare the test set performances of a bagged ensemble trained on the original features and a bagged ensemble trained on the new features.
Train a bagged ensemble using the original training set creditTrain. Compute the accuracy of the model on the original test set creditTest. Visualize the results using a confusion matrix.
originalMdl = fitcensemble(creditTrain,"Rating",Method="Bag");
originalTestAccuracy = 1 - loss(originalMdl,creditTest, ...
    "Rating",LossFun="classiferror")
originalTestAccuracy = 0.7542
predictedTestLabels = predict(originalMdl,creditTest);
confusionchart(creditTest.Rating,predictedTestLabels);
Train a bagged ensemble using the transformed training set newCreditTrain. Compute the accuracy of the model on the transformed test set newCreditTest. Visualize the results using a confusion matrix.
newMdl = fitcensemble(newCreditTrain,"Rating",Method="Bag");
newTestAccuracy = 1 - loss(newMdl,newCreditTest, ...
    "Rating",LossFun="classiferror")
newTestAccuracy = 0.7461
newPredictedTestLabels = predict(newMdl,newCreditTest);
confusionchart(newCreditTest.Rating,newPredictedTestLabels)
In this case, the bagged ensemble trained on the transformed data performs comparably to the bagged ensemble trained on the original data: the test set accuracies are 0.7461 and 0.7542, respectively, so the engineered features do not improve the accuracy for this partition.
See Also
gencfeatures | FeatureTransformer | describe | transform | fitclinear | fitcensemble | fitcsvm | plotPartialDependence | genrfeatures