splitlabels

Find indices to split labels according to specified proportions

Since R2021a

Description

Use this function when you are working on a machine learning or deep learning classification problem and you want to split a dataset into training, testing, and validation sets that hold the same proportion of label values.

idxs = splitlabels(lblsrc,p) finds logical indices that split the labels in lblsrc based on the proportions or number of labels specified in p.

idxs = splitlabels(lblsrc,p,'randomized') randomly assigns the specified proportion of label values to each index set in idxs.

idxs = splitlabels(___,Name,Value) specifies additional input arguments using name-value pairs. For example, 'UnderlyingDatastoreIndex',3 splits the labels only in the third underlying datastore of a combined datastore.
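
As a minimal sketch of the three syntaxes, assume a hypothetical ten-sample string vector lbls with six "A" labels and four "B" labels. The variable names here are illustrative only and do not come from the examples on this page.

```matlab
% Hypothetical label vector: six "A" labels and four "B" labels.
lbls = ["A" "A" "B" "A" "B" "A" "B" "A" "B" "A"]';

% Ordered split: half of each category goes to the first index set.
idxs = splitlabels(lbls,0.5);

% Randomized split with the same proportion.
idxsRand = splitlabels(lbls,0.5,'randomized');

% Name-value syntax: leave the "B" labels out of both index sets.
idxsNoB = splitlabels(lbls,0.5,'Exclude',"B");

% Use an index set to pull out the matching labels and count them.
firstSet = lbls(idxs{1});
disp(countlabels(firstSet))
```

Because p is a scalar, each call returns a two-element cell array. The first set of the ordered split should hold three "A" labels and two "B" labels, which is half of each category.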

Examples

Read William Shakespeare's sonnets with the fileread function. Extract all the vowels from the text and convert them to lowercase.

sonnets = fileread("sonnets.txt");
vowels = lower(sonnets(regexp(sonnets,"[AEIOUaeiou]")))';

Count the number of instances of each vowel.

cnts = countlabels(vowels)
cnts=5×3 table
    Label    Count    Percent
    _____    _____    _______

      a      4940     18.368 
      e      9028     33.569 
      i      4895     18.201 
      o      5710     21.232 
      u      2321     8.6302 

Split the vowels into a training set containing 500 instances of each vowel, a validation set containing 300, and a testing set with the rest. All vowels are represented with equal weights in the first two sets but not in the third.

spltn = splitlabels(vowels,[500 300]);

for kj = 1:length(spltn)
    cntsn{kj} = countlabels(vowels(spltn{kj}));
end
cntsn{:}
ans=5×3 table
    Label    Count    Percent
    _____    _____    _______

      a       500       20   
      e       500       20   
      i       500       20   
      o       500       20   
      u       500       20   

ans=5×3 table
    Label    Count    Percent
    _____    _____    _______

      a       300       20   
      e       300       20   
      i       300       20   
      o       300       20   
      u       300       20   

ans=5×3 table
    Label    Count    Percent
    _____    _____    _______

      a      4140     18.083 
      e      8228      35.94 
      i      4095     17.887 
      o      4910     21.447 
      u      1521     6.6437 

Split the vowels into a training set containing 50% of the instances, a validation set containing another 30%, and a testing set with the rest. Each vowel has approximately the same percentage in all three sets as it has in the full text.

spltp = splitlabels(vowels,[0.5 0.3]);

for kj = 1:length(spltp)
    cntsp{kj} = countlabels(vowels(spltp{kj}));
end
cntsp{:}
ans=5×3 table
    Label    Count    Percent
    _____    _____    _______

      a      2470     18.367 
      e      4514     33.566 
      i      2448     18.203 
      o      2855      21.23 
      u      1161     8.6333 

ans=5×3 table
    Label    Count    Percent
    _____    _____    _______

      a      1482     18.371 
      e      2708     33.569 
      i      1468     18.198 
      o      1713     21.235 
      u       696     8.6277 

ans=5×3 table
    Label    Count    Percent
    _____    _____    _______

      a       988     18.368 
      e      1806     33.575 
      i       979       18.2 
      o      1142     21.231 
      u       464     8.6261 
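
One way to sanity-check the proportional split is to confirm that the three sets together account for every vowel. The sketch below copies the counts from the tables above into plain vectors (the variable names are hypothetical) and uses only base MATLAB arithmetic.

```matlab
% Counts copied from the tables above (rows: a, e, i, o, u).
total    = [4940; 9028; 4895; 5710; 2321];   % all vowels
training = [2470; 4514; 2448; 2855; 1161];   % 50% set
validate = [1482; 2708; 1468; 1713;  696];   % 30% set
testing  = [ 988; 1806;  979; 1142;  464];   % remaining set

% Every label lands in exactly one set, so the per-vowel counts add up.
assert(isequal(training + validate + testing,total))

% Share of each vowel that went to each set, in percent.
shareTraining = 100*training./total;
shareValidate = 100*validate./total;
disp([shareTraining shareValidate])
```

The shares stay close to 50% and 30% for every vowel. They are not exact because the number of labels in each category must be a whole number.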

Read William Shakespeare's sonnets with the fileread function. Remove all nonalphabetic characters from the text and convert to lowercase.

sonnets = fileread("sonnets.txt");
letters = lower(sonnets(regexp(sonnets,"[A-Za-z]")))';

Classify the letters as consonants or vowels and create a table with the results. Show the first few rows of the table.

type = repmat("consonant",size(letters));
type(regexp(letters',"[aeiou]")) = "vowel";

T = table(letters,type,'VariableNames',["Letter" "Type"]);
head(T)
    Letter       Type    
    ______    ___________

      t       "consonant"
      h       "consonant"
      e       "vowel"    
      s       "consonant"
      o       "vowel"    
      n       "consonant"
      n       "consonant"
      e       "vowel"    

Display the number of instances of each category.

cnt = countlabels(T,'TableVariable',"Type")
cnt=2×3 table
      Type       Count    Percent
    _________    _____    _______

    consonant    46516    63.365 
    vowel        26894    36.635 

Split the table into two sets, one containing 60% of the consonants and vowels and the other containing 40%. Display the number of instances of each category.

splt = splitlabels(T,0.6,'TableVariable',"Type");

sixty = countlabels(T(splt{1},:),'TableVariable',"Type")
sixty=2×3 table
      Type       Count    Percent
    _________    _____    _______

    consonant    27910    63.366 
    vowel        16136    36.634 

forty = countlabels(T(splt{2},:),'TableVariable',"Type")
forty=2×3 table
      Type       Count    Percent
    _________    _____    _______

    consonant    18606    63.363 
    vowel        10758    36.637 

Split the table into two sets, one containing 60% of each particular letter and the other containing 40%. Exclude the letter y, which sometimes acts as a consonant and sometimes as a vowel. Display the number of instances of each category.

splt = splitlabels(T,0.6,'Exclude',"y");

sixti = countlabels(T(splt{1},:),'TableVariable',"Type")
sixti=2×3 table
      Type       Count    Percent
    _________    _____    _______

    consonant    26719    62.346 
    vowel        16137    37.654 

forti = countlabels(T(splt{2},:),'TableVariable',"Type")
forti=2×3 table
      Type       Count    Percent
    _________    _____    _______

    consonant    17813    62.349 
    vowel        10757    37.651 

Split the table into two sets of the same size. Include only the letters e and s. Randomize the sets.

halves = splitlabels(T,0.5,'randomized','Include',["e" "s"]);

cnt = countlabels(T(halves{1},:))
cnt=2×3 table
    Letter    Count    Percent
    ______    _____    _______

      e       4514     64.385 
      s       2497     35.615 

Create a dataset that consists of 100 Gaussian random numbers. Label 40 of the numbers as A, 30 as B, and 30 as C. Store the data in a combined datastore containing two datastores. The first datastore has the data and the second datastore contains the labels.

dsData = arrayDatastore(randn(100,1));
dsLabels = arrayDatastore([repmat("A",40,1); repmat("B",30,1); repmat("C",30,1)]);
dsDataset = combine(dsData,dsLabels);
cnt = countlabels(dsDataset,'UnderlyingDatastoreIndex',2)
cnt=3×3 table
    Label    Count    Percent
    _____    _____    _______

      A       40        40   
      B       30        30   
      C       30        30   

Split the dataset into two sets, one containing 60% of the numbers and the other with the rest.

splitIndices = splitlabels(dsDataset,0.6,'UnderlyingDatastoreIndex',2);

dsDataset1 = subset(dsDataset,splitIndices{1});
cnt1 = countlabels(dsDataset1,'UnderlyingDatastoreIndex',2)
cnt1=3×3 table
    Label    Count    Percent
    _____    _____    _______

      A       24        40   
      B       18        30   
      C       18        30   

dsDataset2 = subset(dsDataset,splitIndices{2});
cnt2 = countlabels(dsDataset2,'UnderlyingDatastoreIndex',2)
cnt2=3×3 table
    Label    Count    Percent
    _____    _____    _______

      A       16        40   
      B       12        30   
      C       12        30   

Input Arguments

lblsrc: Input label source, specified as one of these:

  • A categorical vector.

  • A string vector or a cell array of character vectors.

  • A numeric vector or a cell array of numeric scalars.

  • A logical vector or a cell array of logical scalars.

  • A table with variables containing any of the previous data types.

  • A datastore whose readall function returns any of the previous data types.

  • A CombinedDatastore object containing an underlying datastore whose readall function returns any of the previous data types. In this case, you must specify the index of the underlying datastore that has the label values.

lblsrc must contain labels that can be converted to a vector with a discrete set of categories.

Example: lblsrc = categorical(["B" "C" "A" "E" "B" "A" "A" "B" "C" "A"],["A" "B" "C" "D"]) creates the label source as a ten-sample categorical vector with four categories: A, B, C, and D.
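
In the categorical example above, "E" is not in the list of categories, so that sample becomes an undefined element. The following base MATLAB sketch inspects the result. It makes no claim about how splitlabels treats undefined elements.

```matlab
% Label source from the example above: ten samples, four declared categories.
lblsrc = categorical(["B" "C" "A" "E" "B" "A" "A" "B" "C" "A"],["A" "B" "C" "D"]);

% The declared categories are A, B, C, and D.
cats = categories(lblsrc);

% Number of samples in each category; D has no samples.
counts = countcats(lblsrc);

% "E" is not a declared category, so one sample is undefined.
numUndefined = nnz(isundefined(lblsrc));

disp(counts)
disp(numUndefined)
```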

Example: lblsrc = [0 7 2 5 11 17 15 7 7 11] creates the label source as a ten-sample numeric vector.

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64 | logical | char | string | table | cell | categorical

p: Proportions or numbers of labels, specified as an integer scalar, a scalar in the range (0, 1), a vector of integers, or a vector of fractions.

  • If p is a scalar, splitlabels finds two splitting index sets and returns a two-element cell array in idxs.

    • If p is an integer, the first element of idxs contains a vector of indices pointing to the first p values of each label category. The second element of idxs contains indices pointing to the remaining values of each label category.

    • If p is a value in the range (0, 1) and lblsrc has Ki elements in the ith category, the first element of idxs contains a vector of indices pointing to the first p × Ki values of each label category. The second element of idxs contains the indices of the remaining values of each label category.

  • If p is a vector with N elements of the form p1, p2, …, pN, splitlabels finds N + 1 splitting index sets and returns an (N + 1)-element cell array in idxs.

    • If p is a vector of integers, the first element of idxs is a vector of indices pointing to the first p1 values of each label category, the next element of idxs contains the next p2 values of each label category, and so on. The last element in idxs contains the remaining indices of each label category.

    • If p is a vector of fractions and lblsrc has Ki elements of the ith category, the first element of idxs is a vector of indices concatenating the first p1 × Ki values of each category, the next element of idxs contains the next p2 × Ki values of each label category, and so on. The last element in idxs contains the remaining indices of each label category.

Note

  • If p contains fractions, then the sum of its elements must not be greater than one.

  • If p contains numbers of label values, then the sum of its elements must not be greater than the smallest number of labels available for any of the label categories.
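
The ordered split described above for an integer scalar p can be imitated in base MATLAB. This is a sketch of the documented behavior under stated assumptions, not the implementation of splitlabels: the label vector and all variable names are hypothetical, and the order of the indices that splitlabels returns may differ.

```matlab
% Hypothetical ten-sample label vector with three categories.
lbls = ["x" "y" "x" "z" "y" "x" "z" "y" "x" "z"]';
p = 2;                                  % labels per category in the first set

% Constraint from the note above: p cannot exceed the smallest category count.
counts = countcats(categorical(lbls));  % counts for x, y, and z
assert(sum(p) <= min(counts),'p exceeds the smallest category count.')

cats = unique(lbls);
firstSet = [];
restSet = [];
for k = 1:numel(cats)
    ki = find(lbls == cats(k));         % positions of the kth category, in order
    firstSet = [firstSet; ki(1:p)];     %#ok<AGROW> first p values
    restSet = [restSet; ki(p+1:end)];   %#ok<AGROW> remaining values
end
idxsSketch = {firstSet; restSet};       % two index sets for a scalar p
```

Here the first set takes positions 1 and 3 for x, 2 and 5 for y, and 4 and 7 for z. The second set receives the four remaining positions.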

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: 'TableVariable',"AreaCode",'Exclude',["617" "508"] specifies that the function splits labels based on telephone area code and excludes numbers from Boston and Natick.

Include: Labels to include in the index sets, specified as a vector or cell array of label categories. The categories specified with this argument must be of the same type as the labels in lblsrc. Each category in the vector or cell array must match one of the label categories in lblsrc.

Exclude: Labels to exclude from the index sets, specified as a vector or cell array of label categories. The categories specified with this argument must be of the same type as the labels in lblsrc. Each category in the vector or cell array must match one of the label categories in lblsrc.

TableVariable: Table variable to read, specified as a character vector or string scalar. If this argument is not specified, then splitlabels uses the first table variable.

UnderlyingDatastoreIndex: Underlying datastore index, specified as an integer scalar. This argument applies when lblsrc is a CombinedDatastore object. splitlabels splits the labels in the datastore obtained using the UnderlyingDatastores property of lblsrc.

Output Arguments

idxs: Splitting indices, returned as a cell array.

Version History

Introduced in R2021a