synthesizeTabularData
Syntax
Description
syntheticX = synthesizeTabularData(X,n) generates n observations of synthetic data using the existing data
X. The function returns the synthetic data
syntheticX. By default, synthesizeTabularData
uses a binning technique for generating synthetic data.
syntheticX = synthesizeTabularData(X,Yname,n) generates n observations of synthetic data using the existing data in
the table X and the class labels variable Yname in
X. (since R2026a)
[syntheticX,syntheticY] = synthesizeTabularData(X,Y,n) generates n observations of synthetic data using the existing data
X and the class labels Y. The function returns
the synthetic data syntheticX and the synthetic class labels
syntheticY. (since R2026a)
___ = synthesizeTabularData(___,Name=Value)
specifies options using one or more name-value arguments in addition to any of the input
argument combinations in the previous syntaxes. For example, you can specify the synthetic
data generation method, the variables to use to generate synthetic data, and the options for
computing in parallel.
Examples
Generate synthetic data using an existing data set in a table. Visually compare the distributions of the existing and synthetic data sets.
Load the sample file fisheriris.csv, which contains iris data including sepal length, sepal width, petal length, petal width, and species type. Read the file into a table, and then convert the Species variable into a categorical variable. Display the first eight observations in the table.
fisheriris = readtable("fisheriris.csv");
fisheriris.Species = categorical(fisheriris.Species);
head(fisheriris)
SepalLength SepalWidth PetalLength PetalWidth Species
___________ __________ ___________ __________ _______
5.1 3.5 1.4 0.2 setosa
4.9 3 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5 3.4 1.5 0.2 setosa
Create 1000 new observations from the data in fisheriris by using the synthesizeTabularData function. By default, the function uses a binning technique to learn the distribution of the variables in fisheriris before synthesizing data.
rng("default")
syntheticData = synthesizeTabularData(fisheriris,1000);
For each numeric variable, use box plots to visually compare the distribution of the values in fisheriris to the distribution of the values in syntheticData.
numericVariables = ["SepalLength","SepalWidth", ...
    "PetalLength","PetalWidth"];
boxchart(fisheriris{:,numericVariables})
hold on
boxchart(syntheticData{:,numericVariables})
hold off
legend(["Real data","Synthetic data"])
xticklabels(numericVariables)

Blue box plots show the distributions of real data, and red box plots show the distributions of synthetic data. For each of the four numeric variables, the real and synthetic data values have similar distributions.
Use histograms to compare the distribution of flower species in fisheriris and syntheticData.
histogram(fisheriris.Species, ...
    Normalization="probability")
hold on
histogram(syntheticData.Species, ...
    Normalization="probability")
hold off
legend(["Real data","Synthetic data"])

Overall, the distribution of flower species is similar across the two data sets. For example, 32% of the flowers in the synthetic data set are setosa irises, compared to 33% in the real data set.
Synthesize new data from existing training data by using a binning technique. Train a model using the existing training data, and then train the same type of model using the synthetic data. Compare the performance of the two models using test data.
Load the carbig data set, which contains measurements of cars made in the 1970s and early 1980s. Create a table containing the predictor variables Acceleration, Displacement, and so on, as well as the response variable MPG.
load carbig
tbl = table(Acceleration,Cylinders,Displacement,Horsepower, ...
    Model_Year,Origin,MPG,Weight);
Remove rows of tbl where the table has missing values.
tbl = rmmissing(tbl);
Partition the data into training and test sets. Use approximately 60% of the observations for model training and synthesizing new data, and 40% of the observations for model testing. Use cvpartition to partition the data.
rng("default")
cv = cvpartition(size(tbl,1),"Holdout",0.4);
trainTbl = tbl(training(cv),:);
testTbl = tbl(test(cv),:);
Synthesize new data from the trainTbl data set by using a binning technique. Specify to generate 1000 observations using 20 equal-width bins for each variable. Specify the Cylinders and Model_Year variables as discrete numeric variables.
syntheticTbl = synthesizeTabularData(trainTbl,1000, ...
    BinMethod="equal-width",NumBins=20, ...
    DiscreteNumericVariables=["Cylinders","Model_Year"]);
To visualize the difference between the existing data and synthetic data, you can use the detectdrift function. The function uses permutation testing to detect drift between trainTbl and syntheticTbl.
dd = detectdrift(trainTbl,syntheticTbl);
dd is a DriftDiagnostics object with plotEmpiricalCDF and plotHistogram object functions for visualization.
For continuous variables, use the plotEmpiricalCDF function to see the difference between the empirical cumulative distribution function (ecdf) of the values in trainTbl and the ecdf of the values in syntheticTbl.
continuousVariable = "Acceleration";
plotEmpiricalCDF(dd,Variable=continuousVariable)
legend(["Real data","Synthetic data"])

For the Acceleration predictor, the ecdf plot for the existing values (in blue) matches the ecdf plot for the synthetic values (in red) fairly well.
For discrete variables, use the plotHistogram function to see the difference between the histogram of the values in trainTbl and the histogram of the values in syntheticTbl.
discreteVariable = "Cylinders";
plotHistogram(dd,Variable=discreteVariable)
legend(["Real data","Synthetic data"])

For the Cylinders predictor, the histogram for the existing values (in blue) matches the histogram for the synthetic values (in red) fairly well.
Train a bagged ensemble of trees using the original training data trainTbl. Specify MPG as the response variable. Then, train the same kind of regression model using the synthetic data syntheticTbl.
originalMdl = fitrensemble(trainTbl,"MPG",Method="Bag");
newMdl = fitrensemble(syntheticTbl,"MPG",Method="Bag");
Evaluate the performance of the two models on the test set by computing the test mean squared error (MSE). Smaller MSE values indicate better performance.
originalMSE = loss(originalMdl,testTbl)
originalMSE = 7.0784
newMSE = loss(newMdl,testTbl)
newMSE = 6.1031
The model trained on the synthetic data performs slightly better on the test data.
Since R2026a
Synthesize new data from existing training data by using SMOTE (synthetic minority oversampling technique). Train a model using the existing training data, and then train the same type of model using both the existing training data and the synthetic data. Compare the performance of the two models using test data.
Load the carbig data set, which contains measurements of cars made in the 1970s and early 1980s. Categorize the cars based on whether they were made in Europe.
load carbig
Origin = categorical(cellstr(Origin));
Origin = mergecats(Origin,["France","Germany", ...
    "Sweden","Italy","England"],"Europe");
Origin = mergecats(Origin,["USA","Japan"],"NotEurope");
tabulate(Origin)
Value Count Percent
Europe 73 17.98%
NotEurope 333 82.02%
The data is imbalanced, with only about 18% of cars originating in Europe.
Create a table containing the variables Acceleration, Displacement, and so on, as well as the response variable Origin. Remove rows of cars where the table has missing values.
cars = table(Acceleration,Displacement,Horsepower, ...
MPG,Weight,Origin);
cars = rmmissing(cars);
Partition the data into training and test sets. Use approximately 50% of the observations for model training and synthesizing new data, and 50% of the observations for model testing. Use stratified partitioning so that approximately the same ratio of European to non-European cars exists in both the training and test sets.
rng("default")
cv = cvpartition(cars.Origin,Holdout=0.5);
trainCars = cars(training(cv),:);
testCars = cars(test(cv),:);
Synthesize new data from the trainCars data set by using SMOTE. Specify Origin as the class labels variable. Specify the ClassNames name-value argument to generate 40 synthetic observations belonging to the class of European cars only.
syntheticCars = synthesizeTabularData(trainCars, ...
    "Origin",40,Method="smote",ClassNames="Europe");
tabulate(syntheticCars.Origin)
Value Count Percent
Europe 40 100.00%
NotEurope 0 0.00%
To visualize the difference between the existing European car data and the synthetic European car data, you can use the detectdrift function. Filter the trainCars data to include European car data only. The detectdrift function uses permutation testing to detect drift between europeanCars and syntheticCars.
europeanCars = trainCars(trainCars.Origin=="Europe",:);
dd = detectdrift(europeanCars,syntheticCars);
dd is a DriftDiagnostics object with a plotEmpiricalCDF object function for visualization.
For continuous variables, use the plotEmpiricalCDF function to see the difference between the empirical cumulative distribution function (ecdf) of the values in europeanCars and the ecdf of the values in syntheticCars.
continuousVariable = "Horsepower";
plotEmpiricalCDF(dd,Variable=continuousVariable)
legend(["Real data","Synthetic data"])

For the Horsepower predictor, the ecdf plot for the existing values (in blue) matches the ecdf plot for the synthetic values (in red) fairly well.
Train an SVM classifier using the original training data trainCars. Specify Origin as the response variable, and standardize the predictors before training. Then, train the same kind of classifier using both the original data and the synthetic data (syntheticCars).
originalMdl = fitcsvm(trainCars,"Origin",Standardize=true);
newMdl = fitcsvm([trainCars;syntheticCars],"Origin",Standardize=true);
Evaluate the performance of the two models on the test set using confusion matrices.
originalPredictions = predict(originalMdl,testCars);
newPredictions = predict(newMdl,testCars);
tiledlayout(1,2)
nexttile
confusionchart(testCars.Origin,originalPredictions)
title("Original Model")
nexttile
confusionchart(testCars.Origin,newPredictions)
title("New Model")

The model trained on the original data classifies all test observations as non-European cars. The model trained on the original and synthetic data has greater accuracy than the other model and correctly classifies the majority of European cars in the test set.
Evaluate data synthesized from an existing data set. Compare the existing and synthetic data sets to determine distribution similarity.
Load the carsmall data set. The file contains measurements of cars from 1970, 1976, and 1982. Create a table containing the data and display the first eight observations.
load carsmall
carData = table(Acceleration,Cylinders,Displacement,Horsepower, ...
    Mfg,Model,Model_Year,MPG,Origin,Weight);
head(carData)
Acceleration Cylinders Displacement Horsepower Mfg Model Model_Year MPG Origin Weight
____________ _________ ____________ __________ _____________ _________________________________ __________ ___ _______ ______
12 8 307 130 chevrolet chevrolet chevelle malibu 70 18 USA 3504
11.5 8 350 165 buick buick skylark 320 70 15 USA 3693
11 8 318 150 plymouth plymouth satellite 70 18 USA 3436
12 8 304 150 amc amc rebel sst 70 16 USA 3433
10.5 8 302 140 ford ford torino 70 17 USA 3449
10 8 429 198 ford ford galaxie 500 70 15 USA 4341
9 8 454 220 chevrolet chevrolet impala 70 14 USA 4354
8.5 8 440 215 plymouth plymouth fury iii 70 14 USA 4312
Generate 100 new observations using the synthesizeTabularData function. Specify the Cylinders and Model_Year variables as discrete numeric variables. Display the first eight observations.
rng("default")
syntheticData = synthesizeTabularData(carData,100, ...
    DiscreteNumericVariables=["Cylinders","Model_Year"]);
head(syntheticData)
Acceleration Cylinders Displacement Horsepower Mfg Model Model_Year MPG Origin Weight
____________ _________ ____________ __________ _____________ _________________________________ __________ ______ _______ ______
11.215 8 309.73 137.28 dodge dodge coronet brougham 76 17.3 USA 4038
10.198 8 416.68 215.51 plymouth plymouth fury iii 70 9.5497 USA 4507.2
17.161 6 258.38 77.099 amc amc pacer d/l 76 18.325 USA 3199.8
9.4623 8 426.19 197.3 plymouth plymouth fury iii 70 11.747 USA 4372.1
13.992 4 106.63 91.396 datsun datsun pl510 70 30.56 Japan 1950.7
17.965 6 266.24 78.719 oldsmobile oldsmobile cutlass ciera (diesel) 82 36.416 USA 2832.4
17.028 4 139.02 100.24 chevrolet chevrolet cavalier 2-door 82 36.058 USA 2744.5
15.343 4 118.93 100.22 toyota toyota celica gt 82 26.696 Japan 2600.5
Visualize the synthetic and existing data sets. Create a DriftDiagnostics object using the detectdrift function. The object has the plotEmpiricalCDF and plotHistogram object functions you can use to visualize continuous and discrete variables.
dd = detectdrift(carData,syntheticData);
Use plotEmpiricalCDF to visualize the empirical cumulative distribution function (ECDF) of the values in carData and syntheticData.
continuousVariable = "Acceleration";
plotEmpiricalCDF(dd,Variable=continuousVariable)
legend(["Real data","Synthetic data"])

For the variable Acceleration, the ECDF of the existing data (in blue) and the ECDF of the synthetic data (in red) appear to be similar.
Use plotHistogram to visualize the distribution of values for discrete variables in carData and syntheticData.
discreteVariable = "Cylinders";
plotHistogram(dd,Variable=discreteVariable)
legend(["Real data","Synthetic data"])

For the variable Cylinders, the distribution of values across the bins appears similar for the existing data (in blue) and the synthetic data (in red).
Compare the synthetic and existing data sets using the mmdtest function. The function performs a two-sample hypothesis test for the null hypothesis that the samples come from the same distribution.
[mmd,p,h] = mmdtest(carData,syntheticData)
mmd = 0.0078
p = 0.8860
h = 0
The returned value of h = 0 indicates that mmdtest fails to reject the null hypothesis that the samples come from the same distribution at the 5% significance level. As with other hypothesis tests, this result does not guarantee that the null hypothesis is true. That is, the samples do not necessarily come from the same distribution, but the low MMD value and high p-value indicate that the distributions of the real and synthetic data sets are similar.
Input Arguments
Existing data set, specified as a numeric matrix or a table. Rows of
X correspond to observations, and columns of
X correspond to variables. Multicolumn variables and cell arrays
other than cell arrays of character vectors are not supported.
Data Types: single | double | table
Since R2026a
Name of the class labels variable in X, specified as a
character vector or string scalar. X must be a table, and
Yname must specify a column in X that is a
numeric, categorical, or logical vector; a character or string array; or a cell array of
character vectors. The Yname variable must contain class labels for
one or two classes only.
Data Types: char | string
Since R2026a
Class labels for one or two classes, specified as a numeric, categorical, or logical
vector; a character or string array; or a cell array of character vectors. Rows of
Y correspond to observations.
Data Types: single | double | logical | char | string | cell | categorical
Number of synthetic data observations to generate, specified as a positive integer scalar or a two-element positive integer vector.
To generate synthetic data for two classes
using Yname or Y, you can specify
n as a two-element vector. Each element indicates the number of
observations to generate for the corresponding class in ClassNames.
If you specify n as a scalar, then the software generates synthetic
observations for the two classes with approximately the same proportion found in
Yname or Y. (since R2026a)
Example: 100
Data Types: single | double
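The following is a minimal sketch of the two-element form of n, assuming the R2026a syntax with Y described on this page. The fisheriris variables meas and species, the two-class subset, and the counts 20 and 40 are illustrative choices, not part of the original examples.

```matlab
% Sketch only: requires the R2026a syntaxes described on this page.
load fisheriris                          % provides meas and species
keep = ~strcmp(species,"virginica");     % keep two classes only (illustrative)
X = meas(keep,:);
Y = categorical(species(keep));

rng("default")                           % for reproducibility
% Request 20 observations for the first class in ClassNames and 40 for the second.
[syntheticX,syntheticY] = synthesizeTabularData(X,Y,[20 40], ...
    ClassNames=["setosa","versicolor"]);
```

Because each element of n corresponds to a class in ClassNames, listing the classes explicitly makes the mapping between counts and classes unambiguous.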
Name-Value Arguments
Specify optional pairs of arguments as
Name1=Value1,...,NameN=ValueN, where Name is
the argument name and Value is the corresponding value.
Name-value arguments must appear after other arguments, but the order of the
pairs does not matter.
Example: synthesizeTabularData(X,100,BinMethod="equiprobable",NumBins=10)
specifies to use 10 equiprobable bins for each variable in X to generate
100 synthetic observations.
Binning Options
Binning algorithm, specified as one of the values in this table. Note the following:
Xi is the existing data set for class i when you specify class labels for two classes. Otherwise, it is the existing data set X.
mi is the number of observations in the existing data for class i when you specify class labels for two classes. Otherwise, it is the number of observations in the existing data set X.
| Value | Description |
|---|---|
| "auto" | |
| "equal-width" | Equal-width binning, where you must specify the number of bins using the NumBins name-value argument |
| "equiprobable" | Equiprobable binning, where you must specify the number of bins using the NumBins name-value argument |
| "dagostino-stephens" or "ds" | Equiprobable binning with ceil(2*mi^(2/5)) bins |
| "freedman-diaconis" or "fd" | Equal-width binning, where each bin for variable k has a width of ceil(2*iqr(Xi(:,k))*mi^(-1/3)) |
| "scott" | Equal-width binning, where each bin for variable k has a width of ceil(3.5*std(Xi(:,k))*mi^(-1/3)) |
| "scott-multivariate" | Equal-width binning, where each bin for variable k has a width of 3.5*std(Xi(:,k))*mi^(-1/(2+d)) |
| "terrell-iqr" | Equal-width binning, where each bin for variable k has a width of 2.603*iqr(Xi(:,k))*mi^(-1/3) |
| "terrell-scott" or "ts" | Equal-width binning with ceil((2*mi)^(1/3)) bins |
| "terrell-std" | Equal-width binning, where each bin for variable k has a width of 3.729*std(Xi(:,k))*mi^(-1/3) |
Example: BinMethod="scott"
Data Types: char | string
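To make the formulas in the table concrete, the following sketch evaluates three of them in base MATLAB. The vector x is a hypothetical stand-in for one continuous variable Xi(:,k); the code only reproduces the arithmetic in the table and is not the function's internal implementation.

```matlab
% Illustrative data for one continuous variable (hypothetical values).
x = (1:150)';
m = numel(x);                             % number of observations (mi in the table)

% "dagostino-stephens": number of equiprobable bins.
numBinsDS = ceil(2*m^(2/5));

% "terrell-scott": number of equal-width bins.
numBinsTS = ceil((2*m)^(1/3));

% "scott": width of each equal-width bin.
widthScott = ceil(3.5*std(x)*m^(-1/3));

fprintf("DS bins: %d, TS bins: %d, Scott width: %g\n", ...
    numBinsDS,numBinsTS,widthScott)
```

For these 150 values, the rules give 15 equiprobable bins, 7 equal-width bins, and a Scott bin width of 29.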
Number of bins to use for continuous variables, specified as a positive integer scalar or vector.
If NumBins is a scalar, then the function uses the same number of bins for each continuous variable.
If NumBins is a vector, then the function uses NumBins(k) number of bins for continuous variable k.
Specify this value only when BinMethod is
"equal-width" or "equiprobable".
Example: NumBins=[10 25 10 15]
Data Types: single | double
SMOTE Options
Since R2026a
Number of nearest neighbors to use when generating synthetic data, specified as a positive integer scalar.
Example: NumNeighbors=10
Data Types: single | double
Distance metric for finding nearest neighbors, specified as a character vector or string scalar.
If all the variables are continuous (numeric), then you can specify one of the following distance metrics.
| Value | Description |
|---|---|
| "euclidean" | Euclidean distance |
| "fasteuclidean" | Euclidean distance computed by using an alternative algorithm that saves time when the number of variables is at least 10. In some cases, this faster algorithm can reduce accuracy. |
| "seuclidean" | Standardized Euclidean distance. Each coordinate difference between observations is scaled by dividing by the corresponding variable standard deviation. |
| "fastseuclidean" | Standardized Euclidean distance computed by using an alternative algorithm that saves time when the number of variables is at least 10. In some cases, this faster algorithm can reduce accuracy. |
Note
If you specify one of these distance metrics and the data includes categorical variables, then the software treats each categorical variable as a numeric variable for the distance computation, with each category represented by a positive integer.
If all the variables are categorical, then you can specify the following distance metric.
| Value | Description |
|---|---|
| "hamming" | Hamming distance, which is the percentage of coordinates that differ |
Note
If you specify this distance metric and the data includes continuous (numeric) variables, then the software treats each continuous variable as a categorical variable for the distance computation.
If the variables are a mix of continuous (numeric) and categorical variables, then you can specify the following distance metric.
| Value | Description |
|---|---|
| "goodall3" | Modified Goodall distance |
The default value is "seuclidean" if all the variables are
continuous, "hamming" if all the variables are categorical, and
"goodall3" if the variables are a mix of continuous and
categorical variables.
Example: Distance="euclidean"
Data Types: char | string
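As a rough illustration of what the "euclidean", "seuclidean", and "hamming" metrics measure, the following base MATLAB sketch computes each one between two observations. The matrix X is hypothetical, and the code is not how synthesizeTabularData computes distances internally. The modified Goodall distance is omitted because this page does not give its formula.

```matlab
% Illustrative data: three observations, three variables (hypothetical values).
X = [1 2 3; 1 0 3; 4 2 1];
a = X(1,:);
b = X(2,:);

% Euclidean distance between observations a and b.
dEuclid = sqrt(sum((a - b).^2));

% Standardized Euclidean distance: divide each coordinate difference by the
% standard deviation of the corresponding variable.
s = std(X,0,1);
dSEuclid = sqrt(sum(((a - b)./s).^2));

% Hamming distance: fraction of coordinates that differ.
dHamming = mean(a ~= b);
```

Here a and b differ in one of three coordinates, so the Hamming distance is 1/3, while the Euclidean distance is 2.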
Size in megabytes of the cache allocated for the distance computation, specified as
"maximal" or a positive scalar. If the cache size is
"maximal", the software tries to allocate enough memory for an
intermediate matrix.
The CacheSize name-value argument is valid only when the
Distance value is "fasteuclidean",
"fastseuclidean", or "goodall3".
For the fast distance metrics ("fasteuclidean" and "fastseuclidean"), the intermediate matrix corresponds to the Gram matrix.
For the modified Goodall distance metric, the intermediate matrix corresponds to the distance matrix.
Example: CacheSize="maximal"
Data Types: single | double | char | string
Additional Options
Since R2026a
Method used to synthesize data, specified as "binning" or
"smote". For more information on the binning technique used when
Method="binning", see Estimate Multivariate Data Distribution by Binning and Generate Synthetic Data Using Binning. For more information on SMOTE (synthetic minority oversampling technique), which
is used when Method="smote", see Generate Synthetic Data Using SMOTE.
Example: Method="smote"
Data Types: char | string
Variable names, excluding Yname, specified as a string array
or a cell array of character vectors. You can specify
VariableNames to choose which variables to use in table
X. That is, synthesizeTabularData uses only the
variables in VariableNames to generate synthetic data.
X must be a table, and VariableNames must be a subset of X.Properties.VariableNames.
By default, VariableNames contains the names of all variables, excluding the class labels variable Yname.
Example: VariableNames=["SepalLength","SepalWidth","PetalLength","PetalWidth"]
Data Types: string | cell
List of the categorical variables, excluding the class labels variable
Yname, specified as one of the values in this table.
| Value | Description |
|---|---|
| Positive integer vector | Each entry in the vector is an index value indicating that the corresponding variable is categorical. The index values are between 1 and v, where v is the number of variables listed in |
| Logical vector | A |
| String array or cell array of character vectors | Each element in the array is the name of a categorical variable. The names must match the entries in VariableNames. |
| "all" | All variables are categorical. |
By default, if the variables are in a numeric matrix, the software assumes all the variables
are continuous. If the variables are in a table, the software assumes they are
categorical if they are logical vectors, categorical vectors, character
arrays, string arrays, or cell arrays of character vectors. To identify any other
variables as categorical, specify them by using the
CategoricalVariables name-value argument.
Do not specify discrete numeric variables as categorical variables. Use the
DiscreteNumericVariables name-value argument instead.
Example: CategoricalVariables="all"
Data Types: single | double | logical | string | cell
List of the discrete numeric variables, specified as one of the values in this table.
| Value | Description |
|---|---|
| Positive integer vector | Each entry in the vector is an index value indicating that the corresponding variable is a discrete numeric variable. The index values are between 1 and v, where v is the number of variables listed in |
| Logical vector | A |
| String array or cell array of character vectors | Each element in the array is the name of a discrete numeric variable. The names must match the entries in VariableNames. |
| "all" | All variables are discrete numeric variables. |
You cannot specify categorical variables as discrete numeric variables.
Example: DiscreteNumericVariables=[2 5]
Data Types: single | double | logical | string | cell
Since R2026a
Names of the classes in Yname or Y for
which to generate synthetic data, specified as a numeric, categorical, or logical
vector; a character or string array; or a cell array of character vectors.
You can use ClassNames to:
Specify the order of the classes.
Select a class for generating synthetic data. For example, suppose that the set of distinct class labels is ["b","g"]. To generate synthetic observations from class "g" only, specify ClassNames="g".
The default value for ClassNames is the ordered set of
distinct class labels in Yname or Y.
Example: ClassNames=["g","b"]
Data Types: single | double | logical | char | string | cell | categorical
Options for computing in parallel and setting random streams, specified as a
structure. Create the Options structure using statset. This table lists the option fields and their
values.
| Field Name | Value | Default |
|---|---|---|
| UseParallel | Set this value to true to run computations in parallel. | false |
| UseSubstreams | Set this value to To compute reproducibly, set | false |
| Streams | Specify this value as a RandStream object or cell array of such objects. Use a single object except when the UseParallel value is true and the UseSubstreams value is false. In that case, use a cell array that has the same size as the parallel pool. | If you do not specify Streams, then synthesizeTabularData uses the default stream or streams. |
Note
You need Parallel Computing Toolbox™ to run computations in parallel.
Example: Options=statset(UseParallel=true,UseSubstreams=true,Streams=RandStream("mlfg6331_64"))
Data Types: struct
Output Arguments
Since R2026a
Synthetic class labels, returned as a numeric, categorical, or logical vector; a
character or string array; or a cell array of character vectors.
syntheticY and Y have the same data
type.
Tips
Use SMOTE-based data generation when you have an imbalanced data set with mostly numeric predictors. If your data set contains only categorical predictors, consider using a different technique. For an example that shows different methods for handling imbalanced data, see Handle Class Imbalance in Binary Classification.
Algorithms
When you use a binning technique, the synthesizeTabularData function estimates the
distribution of the multivariate data set X by performing these steps:
1. Bin each continuous variable using equiprobable or equal-width binning, as specified by the BinMethod and NumBins name-value arguments.
2. Encode the continuous variables using the bin indices.
3. One-hot encode all binned and discrete variables.
4. Compute the probability of each unique row in the encoded data set.
If you specify class labels for two classes (using
Yname or Y), the function estimates the
distribution for each data set X1 and X2, where X1 contains the observations in the first class and X2 contains the observations in the second class. (since R2026a)
The synthesizeTabularData function uses the computed probabilities to
generate synthetic data.
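The estimation steps can be sketched in base MATLAB as follows. This is a simplified illustration under stated assumptions, not the function's implementation: the variables x and g are hypothetical, the binning is equal-width with three bins, and the one-hot encoding is skipped because rows of bin and category indices identify the same unique combinations.

```matlab
% Hypothetical data: one continuous variable x and one discrete variable g.
x = [1.0; 1.2; 3.9; 4.0; 2.1; 1.1];
g = [1; 1; 2; 2; 1; 1];

% Bin the continuous variable into 3 equal-width bins and encode each value
% by its bin index.
edges = linspace(min(x),max(x),4);
binIdx = discretize(x,edges);

% Combine the bin indices with the discrete variable. Index rows stand in
% for one-hot rows here (simplifying assumption).
encoded = [binIdx g];

% Compute the probability of each unique encoded row.
[uniqueRows,~,rowIdx] = unique(encoded,"rows");
prob = accumarray(rowIdx,1)/numel(x);

disp(table(uniqueRows,prob))
```

In this toy data set, three unique rows occur with probabilities 1/2, 1/6, and 1/3, and these probabilities are what a sampler would draw from.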
When you use a binning technique, the process for estimating the multivariate data
distribution includes computing the probability of each unique row in the one-hot encoded
data set (after binning continuous variables). The
synthesizeTabularData function uses this estimated multivariate
data distribution to generate synthetic observations. The function performs these steps:
1. Use the previously computed probabilities to sample with replacement n rows from the unique rows in the encoded data set.
2. Decode the sampled data to obtain the bin indices (for continuous variables) and categories (for discrete variables).
3. For the binned variables, uniformly sample from within the bin edges to obtain continuous values. If you use equiprobable binning (BinMethod) and the extreme bin widths are greater than 1.5 times the median of the nonextreme bin widths, then the function samples from the cumulative distribution function (cdf) in the extreme bins.
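The generation steps can be sketched in base MATLAB as follows. The bin edges, encoded rows, and probabilities are hypothetical inputs chosen for illustration, and the sketch uses plain uniform sampling in every bin, so it does not reproduce the special handling of wide extreme bins under equiprobable binning.

```matlab
rng(0)                             % fix the seed for the illustration
edges = [1; 2; 3; 4];              % hypothetical bin edges for one variable
uniqueRows = [1 1; 2 1; 3 2];      % hypothetical [binIndex category] rows
prob = [1/2; 1/6; 1/3];            % hypothetical probability of each row
n = 1000;                          % number of synthetic observations

% Sample n rows with replacement according to prob (inverse-CDF sampling).
cdfEdges = [0; cumsum(prob)];
cdfEdges(end) = 1;                 % guard against round-off in cumsum
picked = discretize(rand(n,1),cdfEdges);
sampled = uniqueRows(picked,:);

% Decode: draw uniformly between the edges of each sampled bin, and read
% the category directly from the sampled row.
lowerEdge = edges(sampled(:,1));
upperEdge = edges(sampled(:,1) + 1);
xSynthetic = lowerEdge + (upperEdge - lowerEdge).*rand(n,1);
gSynthetic = sampled(:,2);
```

Because whole rows are sampled, combinations that never occur in the encoded data (for example, the third bin with category 1) cannot appear in the synthetic data.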
SMOTE (synthetic minority oversampling technique) is a technique for generating synthetic data when you have an imbalanced data set, that is, when the number of observations is not uniform across the response classes.
Assume the data set X has two classes, where class kbig has many more observations than class ksmall. If the data set X has only one class, then assume
all observations belong to ksmall. When you use SMOTE, the synthesizeTabularData
function generates each new ksmall observation in the following way:
1. Randomly select an observation x in ksmall.
2. In the data set, find the NumNeighbors-nearest neighbors of x that also belong to ksmall.
3. Randomly select one of the nearest neighbors $\hat{x}$.
4. For each continuous predictor p, set the predictor value of the new observation to $x_p + r(\hat{x}_p - x_p)$, where $x_p$ is the predictor value of the original observation x, $\hat{x}_p$ is the predictor value of the nearest neighbor $\hat{x}$, and r is a random value in (0,1) as selected by the rand function.
5. For the categorical predictors q1, …, qm, set the predictor values of the new observation to the mode of the vector of predictor values among the NumNeighbors-nearest neighbors of x. In the case of a tie, choose a vector at random.
The function follows the same process to generate observations in class kbig. The number of synthetic observations generated depends on
n (the input argument of
synthesizeTabularData).
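For continuous predictors, generating one SMOTE observation can be sketched in base MATLAB as follows. This is a minimal illustration, assuming the standard SMOTE interpolation from Chawla et al., in which the new point lies on the segment between x and a randomly chosen neighbor. The data in Xsmall, the plain Euclidean distance, and the variable names are hypothetical, and the sketch does not handle categorical predictors.

```matlab
rng(0)                                 % fix the seed for the illustration
Xsmall = [1 1; 2 1; 1 2; 5 5; 6 5];    % hypothetical minority-class observations
numNeighbors = 2;

% Randomly select an observation x in the minority class.
i = randi(size(Xsmall,1));
x = Xsmall(i,:);

% Find the nearest neighbors of x within the same class.
d = sqrt(sum((Xsmall - x).^2,2));      % Euclidean distance to every observation
d(i) = Inf;                            % exclude x itself
[~,order] = sort(d);
neighbors = order(1:numNeighbors);

% Randomly select one neighbor and interpolate with a random r in (0,1).
xHat = Xsmall(neighbors(randi(numNeighbors)),:);
r = rand;
xNew = x + r*(xHat - x);
```

Each coordinate of xNew falls between the corresponding coordinates of x and xHat, which is why SMOTE stays inside the region already occupied by the minority class.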
Alternative Functionality
Instead of calling the synthesizeTabularData function to generate
synthetic data directly, you can first create a binningTabularSynthesizer or smoteTabularSynthesizer object using an existing data set, and then call the
synthesizeTabularData object function to synthesize data using the object. By
creating an object, you can easily generate synthetic data multiple times without having to
relearn characteristics of the existing data set.
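A minimal sketch of this object-based workflow follows, assuming that binningTabularSynthesizer accepts the existing data as its first input and that the object function takes the synthesizer and the number of observations. The fisheriris data and the batch sizes are illustrative choices.

```matlab
% Sketch only: the calling forms below are assumptions based on the
% function names given on this page.
load fisheriris                                  % provides meas (150-by-4)
rng("default")                                   % for reproducibility

% Learn the binned distribution of the existing data once.
synthesizer = binningTabularSynthesizer(meas);

% Reuse the object to generate synthetic data multiple times.
batch1 = synthesizeTabularData(synthesizer,50);
batch2 = synthesizeTabularData(synthesizer,200);
```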
References
[1] Chawla, Nitesh V., Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. "SMOTE: Synthetic Minority Over-sampling Technique." Journal of Artificial Intelligence Research 16 (2002): 321-357.
Extended Capabilities
To run in parallel, specify the Options name-value argument in the call to
this function and set the UseParallel field of the
options structure to true using
statset:
Options=statset(UseParallel=true)
For more information about parallel computing, see Run MATLAB Functions with Automatic Parallel Support (Parallel Computing Toolbox).
Version History
Introduced in R2024b
You can use the synthetic minority oversampling technique (SMOTE) algorithm to generate synthetic data for binary classification. Using SMOTE can be helpful when you have imbalanced data, that is, when one class contains many more observations than the other.
Use the synthesizeTabularData function with
Method="smote". You can adjust parameters of the SMOTE technique by
using the NumNeighbors, Distance, and
CacheSize name-value arguments.
Regardless of the method used (SMOTE or binning), the
synthesizeTabularData function can return synthetic observations for
one or two classes. Use the Yname or Y input
argument and the ClassNames name-value argument.