splitlabels

Find indices to split labels according to specified proportions

Since R2021a

Syntax

idxs = splitlabels(lblsrc,p)

idxs = splitlabels(lblsrc,p,'randomized')

idxs = splitlabels(___,Name,Value)

Description

Use this function when you are working on a machine learning or deep learning classification problem and you want to split a dataset into training, testing, and validation sets that hold the same proportion of label values.

idxs = splitlabels(lblsrc,p) finds logical indices that split the labels in lblsrc based on the proportions or number of labels specified in p.

idxs = splitlabels(lblsrc,p,'randomized') randomly assigns the specified proportion of label values to each index set in idxs.

idxs = splitlabels(___,Name,Value) specifies additional input arguments using name-value pairs. For example, 'UnderlyingDatastoreIndex',3 splits the labels only in the third underlying datastore of a combined datastore.

Examples

Split Vowels

Read William Shakespeare's sonnets with the fileread function. Extract all the vowels from the text and convert them to lowercase.

sonnets = fileread("sonnets.txt");
vowels = lower(sonnets(regexp(sonnets,"[AEIOUaeiou]")))';

Count the number of instances of each vowel.

cnts = countlabels(vowels)

cnts=5×3 table
    Label    Count    Percent
    _____    _____    _______

      a      4940     18.368 
      e      9028     33.569 
      i      4895     18.201 
      o      5710     21.232 
      u      2321     8.6302

Split the vowels into a training set containing 500 instances of each vowel, a validation set containing 300, and a testing set with the rest. All vowels are represented with equal weights in the first two sets but not in the third.

spltn = splitlabels(vowels,[500 300]);

for kj = 1:length(spltn)
    cntsn{kj} = countlabels(vowels(spltn{kj}));
end
cntsn{:}

ans=5×3 table
    Label    Count    Percent
    _____    _____    _______

      a       500       20   
      e       500       20   
      i       500       20   
      o       500       20   
      u       500       20

ans=5×3 table
    Label    Count    Percent
    _____    _____    _______

      a       300       20   
      e       300       20   
      i       300       20   
      o       300       20   
      u       300       20

ans=5×3 table
    Label    Count    Percent
    _____    _____    _______

      a      4140     18.083 
      e      8228      35.94 
      i      4095     17.887 
      o      4910     21.447 
      u      1521     6.6437

Split the vowels into a training set containing 50% of the instances, a validation set containing another 30%, and a testing set with the rest. All vowels are represented with the same weight across all three sets.

spltp = splitlabels(vowels,[0.5 0.3]);

for kj = 1:length(spltp)
    cntsp{kj} = countlabels(vowels(spltp{kj}));
end
cntsp{:}

ans=5×3 table
    Label    Count    Percent
    _____    _____    _______

      a      2470     18.367 
      e      4514     33.566 
      i      2448     18.203 
      o      2855      21.23 
      u      1161     8.6333

ans=5×3 table
    Label    Count    Percent
    _____    _____    _______

      a      1482     18.371 
      e      2708     33.569 
      i      1468     18.198 
      o      1713     21.235 
      u       696     8.6277

ans=5×3 table
    Label    Count    Percent
    _____    _____    _______

      a       988     18.368 
      e      1806     33.575 
      i       979       18.2 
      o      1142     21.231 
      u       464     8.6261

Split Vowels and Consonants

Read William Shakespeare's sonnets with the fileread function. Remove all nonalphabetic characters from the text and convert to lowercase.

sonnets = fileread("sonnets.txt");
letters = lower(sonnets(regexp(sonnets,"[A-Za-z]")))';

Classify the letters as consonants or vowels and create a table with the results. Show the first few rows of the table.

type = repmat("consonant",size(letters));
type(regexp(letters',"[aeiou]")) = "vowel";

T = table(letters,type,'VariableNames',["Letter" "Type"]);
head(T)

    Letter       Type    
    ______    ___________

      t       "consonant"
      h       "consonant"
      e       "vowel"    
      s       "consonant"
      o       "vowel"    
      n       "consonant"
      n       "consonant"
      e       "vowel"

Display the number of instances of each category.

cnt = countlabels(T,'TableVariable',"Type")

cnt=2×3 table
      Type       Count    Percent
    _________    _____    _______

    consonant    46516    63.365 
    vowel        26894    36.635

Split the table into two sets, one containing 60% of the consonants and vowels and the other containing 40%. Display the number of instances of each category.

splt = splitlabels(T,0.6,'TableVariable',"Type");

sixty = countlabels(T(splt{1},:),'TableVariable',"Type")

sixty=2×3 table
      Type       Count    Percent
    _________    _____    _______

    consonant    27910    63.366 
    vowel        16136    36.634

forty = countlabels(T(splt{2},:),'TableVariable',"Type")

forty=2×3 table
      Type       Count    Percent
    _________    _____    _______

    consonant    18606    63.363 
    vowel        10758    36.637

Split the table into two sets, one containing 60% of each particular letter and the other containing 40%. Because TableVariable is not specified, splitlabels uses the first table variable, Letter. Exclude the letter y, which sometimes acts as a consonant and sometimes as a vowel. Display the number of consonants and vowels in each set.

splt = splitlabels(T,0.6,'Exclude',"y");

sixti = countlabels(T(splt{1},:),'TableVariable',"Type")

sixti=2×3 table
      Type       Count    Percent
    _________    _____    _______

    consonant    26719    62.346 
    vowel        16137    37.654

forti = countlabels(T(splt{2},:),'TableVariable',"Type")

forti=2×3 table
      Type       Count    Percent
    _________    _____    _______

    consonant    17813    62.349 
    vowel        10757    37.651

Split the table into two randomized sets of the same size that include only the letters e and s. Display the number of instances of each letter in the first set.

halves = splitlabels(T,0.5,'randomized','Include',["e" "s"]);

cnt = countlabels(T(halves{1},:))

cnt=2×3 table
    Letter    Count    Percent
    ______    _____    _______

      e       4514     64.385 
      s       2497     35.615
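
A randomized split generally differs from run to run. One way to make it repeatable is to seed the random number generator with rng before each call. The sketch below is minimal and rests on an assumption this page does not state: that the 'randomized' option draws from the MATLAB global random number stream. The label vector lbls is hypothetical.

```matlab
% Hypothetical label vector: 10 labels "e" and 6 labels "s"
lbls = [repmat("e",10,1); repmat("s",6,1)];

rng(0)   % seed the global generator (assumed to drive the shuffle)
first = splitlabels(lbls,0.5,'randomized');

rng(0)   % reset to the same seed
second = splitlabels(lbls,0.5,'randomized');

% True if both calls produced identical index sets
sameSplit = isequal(first,second)
```

If sameSplit is false on your release, the assumption about the global stream does not hold, and you would need to save the index sets instead of regenerating them.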

Split Data in Datastore

Create a dataset that consists of 100 Gaussian random numbers. Label 40 of the numbers as A, 30 as B, and 30 as C. Store the data in a combined datastore containing two datastores. The first datastore has the data and the second datastore contains the labels.

dsData = arrayDatastore(randn(100,1));
dsLabels = arrayDatastore([repmat("A",40,1); ...
            repmat("B",30,1); repmat("C",30,1)]);
dsDataset = combine(dsData,dsLabels);
cnt = countlabels(dsDataset,UnderlyingDatastoreIndex=2)

cnt=3×3 table
    Label    Count    Percent
    _____    _____    _______

      A       40        40   
      B       30        30   
      C       30        30

Split the dataset into two sets, one containing 60% of the numbers and the other containing the rest. The labels come from the second underlying datastore.

splitIndices = splitlabels(dsDataset,0.6,UnderlyingDatastoreIndex=2);

dsDataset1 = subset(dsDataset,splitIndices{1});
cnt1 = countlabels(dsDataset1,UnderlyingDatastoreIndex=2)

cnt1=3×3 table
    Label    Count    Percent
    _____    _____    _______

      A       24        40   
      B       18        30   
      C       18        30

dsDataset2 = subset(dsDataset,splitIndices{2});
cnt2 = countlabels(dsDataset2,UnderlyingDatastoreIndex=2)

cnt2=3×3 table
    Label    Count    Percent
    _____    _____    _______

      A       16        40   
      B       12        30   
      C       12        30

Input Arguments

`lblsrc` — Input label source
categorical vector | string vector | logical vector | numeric vector | cell array | table | datastore | `CombinedDatastore` object

Input label source, specified as one of these:

A categorical vector.
A string vector or a cell array of character vectors.
A numeric vector or a cell array of numeric scalars.
A logical vector or a cell array of logical scalars.
A table with variables containing any of the previous data types.
A datastore whose readall function returns any of the previous data types.
A CombinedDatastore object containing an underlying datastore whose readall function returns any of the previous data types. In this case, you must specify the index of the underlying datastore that has the label values.

lblsrc must contain labels that can be converted to a vector with a discrete set of categories.

Example: lblsrc = categorical(["B" "C" "A" "E" "B" "A" "A" "B" "C" "A"],["A" "B" "C" "D"]) creates the label source as a ten-sample categorical vector with four categories: A, B, C, and D.

Example: lblsrc = [0 7 2 5 11 17 15 7 7 11] creates the label source as a ten-sample numeric vector.

`p` — Proportions or numbers of labels
integer scalar | scalar in (0, 1) | vector of integers | vector of fractions

Proportions or numbers of labels, specified as an integer scalar, a scalar in the range (0, 1), a vector of integers, or a vector of fractions.

If p is a scalar, splitlabels finds two splitting index sets and returns a two-element cell array in idxs.
- If p is an integer, the first element of idxs contains a vector of indices pointing to the first p values of each label category. The second element of idxs contains indices pointing to the remaining values of each label category.
- If p is a value in the range (0, 1) and lblsrc has K_i elements in the ith category, the first element of idxs contains a vector of indices pointing to the first p × K_i values of each label category. The second element of idxs contains the indices of the remaining values of each label category.
If p is a vector with N elements of the form p₁, p₂, …, p_N, splitlabels finds N + 1 splitting index sets and returns an (N + 1)-element cell array in idxs.
- If p is a vector of integers, the first element of idxs is a vector of indices pointing to the first p₁ values of each label category, the next element of idxs contains the next p₂ values of each label category, and so on. The last element in idxs contains the remaining indices of each label category.
- If p is a vector of fractions and lblsrc has K_i elements of the ith category, the first element of idxs is a vector of indices concatenating the first p₁ × K_i values of each category, the next element of idxs contains the next p₂ × K_i values of each label category, and so on. The last element in idxs contains the remaining indices of each label category.

Note

If p contains fractions, then the sum of its elements must not be greater than one.
If p contains numbers of label values, then the sum of its elements must not be greater than the smallest number of labels available for any of the label categories.
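
The rules above can be checked on a small label vector. The following is a minimal sketch, assuming a hypothetical string vector lbls with 4 labels in category a and 8 in category b. The variable names are illustrative and not part of splitlabels.

```matlab
% Hypothetical label vector: 4 labels in category "a", 8 in category "b"
lbls = [repmat("a",4,1); repmat("b",8,1)];

% Two fractions produce three index sets: 50%, 25%, and the remaining 25%
idxsFrac = splitlabels(lbls,[0.5 0.25]);
numSets = numel(idxsFrac);

% For integer p, the elements can sum to at most the smallest category count
cnts = countlabels(lbls);
maxTotal = min(cnts.Count);

% Take 2 labels per category for the first set and 1 per category for the
% second; the third set receives whatever is left in each category
idxsInt = splitlabels(lbls,[2 1]);
perSet = cellfun(@(k) numel(lbls(k)),idxsInt);
```

Here numSets is one more than the number of elements in p, and maxTotal gives the largest value that the elements of an integer-valued p can sum to for this label source.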

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: 'TableVariable',"AreaCode",'Exclude',["617" "508"] specifies that the function split labels based on telephone area code and exclude numbers from Boston and Natick.

`Include` — Labels to include in index sets
vector of label categories | cell array of label categories

Labels to include in the index sets, specified as a vector or cell array of label categories. The categories specified with this argument must be of the same type as the labels in lblsrc. Each category in the vector or cell array must match one of the label categories in lblsrc.

`Exclude` — Labels to exclude from index sets
vector of label categories | cell array of label categories

Labels to exclude from the index sets, specified as a vector or cell array of label categories. The categories specified with this argument must be of the same type as the labels in lblsrc. Each category in the vector or cell array must match one of the label categories in lblsrc.

`TableVariable` — Table variable to read
first table variable (default) | character vector | string scalar

Table variable to read, specified as a character vector or string scalar. If this argument is not specified, then splitlabels uses the first table variable.

`UnderlyingDatastoreIndex` — Underlying datastore index
integer scalar

Underlying datastore index, specified as an integer scalar. This argument applies when lblsrc is a CombinedDatastore object. splitlabels reads the labels from the datastore at this index in the UnderlyingDatastores property of lblsrc.

Output Arguments

`idxs` — Splitting indices
cell array

Splitting indices, returned as a cell array. Each element of idxs contains the indices of one set, and idxs has one more element than p.
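
As an illustration of how the returned cell array is typically used, the sketch below applies each element of idxs to both a hypothetical label vector and a hypothetical data vector of the same length, so that data and labels stay aligned. The names labels, data, and the train/test variables are assumptions made for this example.

```matlab
% Hypothetical data with one label per observation
labels = categorical([repmat("A",6,1); repmat("B",4,1)]);
data = (1:10)';

idxs = splitlabels(labels,0.5);   % two sets: 50% of each category, and the rest

% Index data and labels with the same element of idxs
trainData = data(idxs{1});
trainLabels = labels(idxs{1});
testData = data(idxs{2});
testLabels = labels(idxs{2});

% The two sets together should account for every observation
covered = numel(trainData) + numel(testData) == numel(data);
```

When the label source is a datastore, the same index sets can be passed to subset, as in the datastore example on this page.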

Version History

Introduced in R2021a
