matlab.io.datastore.Partitionable Class

Namespace: matlab.io.datastore

Add parallelization support to datastore

Description

matlab.io.datastore.Partitionable is an abstract mixin class that adds parallelization support to your custom datastore for use with Parallel Computing Toolbox™ and MATLAB® Parallel Server™.

To use this mixin class, you must inherit from the matlab.io.datastore.Partitionable class in addition to inheriting from the matlab.io.Datastore base class. Use the following syntax as the first line of your class definition file:

classdef MyDatastore < matlab.io.Datastore & ...
                       matlab.io.datastore.Partitionable
    ...
end

To add support for parallel processing to your custom datastore, you must:

Inherit from the additional class matlab.io.datastore.Partitionable.
Define these additional methods: maxpartitions and partition.

For more details and steps to create your custom datastore with parallel processing support, see Develop Custom Datastore.

Methods

`maxpartitions`	Maximum number of partitions possible
`numpartitions`	Default number of partitions
`partition`	Partition a datastore
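
As a rough illustration of how these methods are typically called from outside a datastore, the following sketch uses arrayDatastore as a stand-in for a custom datastore. This is an assumption made for the sake of a self-contained example: arrayDatastore is a built-in partitionable datastore available in more recent releases, and the data and the partition count of 3 are arbitrary. The sketch checks that the partitions together cover every row exactly once.

```matlab
% Stand-in partitionable datastore (assumption: arrayDatastore is available).
% 'OutputType','same' makes read/readall return numeric data instead of cells.
ds = arrayDatastore((1:10)','OutputType','same');

% Default number of partitions suggested for this datastore
nDefault = numpartitions(ds);

% Split into an arbitrary number of partitions and read each one separately
n = 3;
rowsSeen = [];
for ii = 1:n
    subds = partition(ds,n,ii);             % ii-th of n partitions
    rowsSeen = [rowsSeen; readall(subds)];  %#ok<AGROW>
end

% Every row of the original data appears in exactly one partition
coversAllRows = isequal(sort(rowsSeen),(1:10)');
```

Each call to partition returns an independent datastore, so reading one partition does not change the read position of ds or of the other partitions.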

Attributes

Sealed false

For information on class attributes, see Class Attributes.

Examples

Build Datastore with Parallel Processing Support

Build a datastore with parallel processing support and use it to bring your custom or proprietary data into MATLAB®. Then, process the data in a parallel pool.

Create a .m class definition file that contains the code implementing your custom datastore. You must save this file in your working folder or in a folder that is on the MATLAB® path. The name of the .m file must be the same as the name of your object constructor function. For example, if you want your constructor function to have the name MyDatastorePar, then the name of the .m file must be MyDatastorePar.m. The .m class definition file must implement the following steps:

Step 1: Inherit from the datastore classes.
Step 2: Define the constructor and the required methods.
Step 3: Define your custom file reading function.

In addition to these steps, define any other properties or methods that you need to process and analyze your data.

%% STEP 1: INHERIT FROM DATASTORE CLASSES
classdef MyDatastorePar < matlab.io.Datastore & ...
        matlab.io.datastore.Partitionable
   
    properties(Access = private)
        CurrentFileIndex double
        FileSet matlab.io.datastore.DsFileSet
    end
    
    % Property to support saving, loading, and processing of
    % datastore on different file system machines or clusters.
    % In addition, define the methods get.AlternateFileSystemRoots()
    % and set.AlternateFileSystemRoots() in the methods section. 
    properties(Dependent)
        AlternateFileSystemRoots
    end
    
%% STEP 2: DEFINE THE CONSTRUCTOR AND THE REQUIRED METHODS
    methods
        % Define your datastore constructor
        function myds = MyDatastorePar(location,altRoots)
            myds.FileSet = matlab.io.datastore.DsFileSet(location,...
                'FileExtensions','.bin', ...
                'FileSplitSize',8*1024);
            myds.CurrentFileIndex = 1;
             
            if nargin == 2
                 myds.AlternateFileSystemRoots = altRoots;
            end
            
            reset(myds);
        end
        
        % Define the hasdata method
        function tf = hasdata(myds)
            % Return true if more data is available
            tf = hasfile(myds.FileSet);
        end
        
        % Define the read method
        function [data,info] = read(myds)
            % Read data and information about the extracted data
            % See also: MyFileReader()
            if ~hasdata(myds)
                msgII = ['Use the reset method to reset the datastore ',... 
                         'to the start of the data.']; 
                msgIII = ['Before calling the read method, ',...
                          'check if data is available to read ',...
                          'by using the hasdata method.'];
                error('No more data to read.\n%s\n%s',msgII,msgIII);
            end
            
            fileInfoTbl = nextfile(myds.FileSet);
            data = MyFileReader(fileInfoTbl);
            info.Size = size(data);
            info.FileName = fileInfoTbl.FileName;
            info.Offset = fileInfoTbl.Offset;
            
            % Update CurrentFileIndex for tracking progress
            if fileInfoTbl.Offset + fileInfoTbl.SplitSize >= ...
                    fileInfoTbl.FileSize
                myds.CurrentFileIndex = myds.CurrentFileIndex + 1 ;
            end
        end
        
        % Define the reset method
        function reset(myds)
            % Reset to the start of the data
            reset(myds.FileSet);
            myds.CurrentFileIndex = 1;
        end

        % Define the partition method
        function subds = partition(myds,n,ii)
            subds = copy(myds);
            subds.FileSet = partition(myds.FileSet,n,ii);
            reset(subds);
        end
        
        % Getter for AlternateFileSystemRoots property
        function altRoots = get.AlternateFileSystemRoots(myds)
            altRoots = myds.FileSet.AlternateFileSystemRoots;
        end

        % Setter for AlternateFileSystemRoots property
        function set.AlternateFileSystemRoots(myds,altRoots)
            try
              % The DsFileSet object manages AlternateFileSystemRoots
              % for your datastore
              myds.FileSet.AlternateFileSystemRoots = altRoots;

              % Reset the datastore
              reset(myds);  
            catch ME
              throw(ME);
            end
        end
      
    end
    
    methods (Hidden = true)          
        % Define the progress method
        function frac = progress(myds)
            % Determine the fraction (0 to 1) of data read from the datastore
            if hasdata(myds) 
               frac = (myds.CurrentFileIndex-1)/...
                             myds.FileSet.NumFiles; 
            else 
               frac = 1;  
            end 
        end
    end
    
    methods(Access = protected)
        % If you use the FileSet property in the datastore,
        % then you must define the copyElement method. The
        % copyElement method allows methods such as readall
        % and preview to remain stateless.
        function dscopy = copyElement(ds)
            dscopy = copyElement@matlab.mixin.Copyable(ds);
            dscopy.FileSet = copy(ds.FileSet);
        end
        
        % Define the maxpartitions method
        function n = maxpartitions(myds)
            n = maxpartitions(myds.FileSet);
        end
    end
end

%% STEP 3: IMPLEMENT YOUR CUSTOM FILE READING FUNCTION
function data = MyFileReader(fileInfoTbl)
% create a reader object using FileName
reader = matlab.io.datastore.DsFileReader(fileInfoTbl.FileName);

% seek to the offset
seek(reader,fileInfoTbl.Offset,'Origin','start-of-file');

% read fileInfoTbl.SplitSize amount of data
data = read(reader,fileInfoTbl.SplitSize);

end

Your custom datastore is now ready. Use your custom datastore to read and process the data in a parallel pool.

Read Data Using Custom Datastore and Process in Parallel Pool

Use a custom datastore to preview and read your proprietary data into MATLAB for parallel processing.

This example uses a simple data set to illustrate a workflow using your custom datastore. The data set is a collection of 15 binary (.bin) files where each file contains a column (1 variable) and 10000 rows (records) of unsigned integers.

dir('*.bin')

binary_data01.bin  binary_data02.bin  binary_data03.bin  binary_data04.bin  binary_data05.bin  binary_data06.bin  binary_data07.bin  binary_data08.bin  binary_data09.bin  binary_data10.bin  binary_data11.bin  binary_data12.bin  binary_data13.bin  binary_data14.bin  binary_data15.bin

Create a datastore object using the MyDatastorePar function. For implementation details of MyDatastorePar, see the example Build Datastore with Parallel Processing Support.

folder = fullfile('*.bin'); 
ds = MyDatastorePar(folder);

Preview the data from the datastore.

preview(ds)

ans = 8x1 uint8 column vector

   113
   180
   251
    91
    29
    66
   254
   214

Identify the number of partitions for your datastore. If you have Parallel Computing Toolbox, then you can use n = numpartitions(ds,myPool), where myPool is a parallel pool object such as the one returned by gcp or parpool.

n = numpartitions(ds);
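
A minimal sketch of choosing between the two numpartitions syntaxes at run time is shown here. It assumes a release that provides canUseParallelPool and arrayDatastore; arrayDatastore is used only as a stand-in so that the snippet does not depend on the MyDatastorePar class or on the .bin files.

```matlab
% Stand-in partitionable datastore (assumption: arrayDatastore is available)
ds = arrayDatastore((1:1000)','OutputType','same');

if canUseParallelPool
    % A pool exists or can be started: ask for a partition count
    % suited to that pool
    myPool = gcp;
    n = numpartitions(ds,myPool);
else
    % No pool available: fall back to the default number of partitions
    n = numpartitions(ds);
end
```

With this guard, the same script runs with or without Parallel Computing Toolbox; only the value of n changes.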

Partition the datastore into n parts and process each part on a worker in a parallel pool.

parfor ii = 1:n
    subds = partition(ds,n,ii);
    while hasdata(subds)
        data = read(subds);
        % do something
    end
end
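
One possible way to fill in the processing step is to reduce each partition to a partial result and combine the partial results after the loop. The sketch below sums all values this way. It uses arrayDatastore with made-up data as a stand-in for the custom datastore (an assumption for self-containment), and the variable names partialSums and total are illustrative.

```matlab
% Stand-in partitionable datastore holding the values 1..100
ds = arrayDatastore((1:100)','OutputType','same');
n = numpartitions(ds);

partialSums = zeros(n,1);           % one slot per partition (sliced output)
parfor ii = 1:n
    subds = partition(ds,n,ii);     % each iteration reads only its own part
    s = 0;
    while hasdata(subds)
        data = read(subds);
        s = s + sum(double(data));  % accumulate within the partition
    end
    partialSums(ii) = s;
end

% Combine the per-partition results on the client
total = sum(partialSums);
```

If no parallel pool is available, parfor runs the iterations serially, so the same code can be tried on a single machine first.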

Process Datastore on Different Platforms

To process your datastore with parallel and distributed computing on cloud or cluster machines that run a different platform, you must predefine the 'AlternateFileSystemRoots' parameter. For instance, create a datastore on your local machine, and analyze a small portion of the data. Then, scale up your analysis to the entire dataset using Parallel Computing Toolbox and MATLAB Parallel Server.

Create a datastore using MyDatastorePar and assign a value to the 'AlternateFileSystemRoots' property. For implementation details of MyDatastorePar, see the example Build Datastore with Parallel Processing Support.

To set the value for the 'AlternateFileSystemRoots' property, identify the root paths for your data on the different platforms. The root paths differ based on the machine or file system. For instance, if you access your data using these root paths:

"Z:\DataSet" from the Windows® machine.
"/nfs-bldg001/DataSet" from the MATLAB Parallel Server Linux® cluster.

Then, associate these root paths using the AlternateFileSystemRoots property.

altRoots = ["Z:\DataSet","/nfs-bldg001/DataSet"];
ds = MyDatastorePar('Z:\DataSet',altRoots);
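
To make the role of the root pair concrete, the following sketch shows the kind of prefix substitution that associating the two roots implies. It is only an illustration written with string functions, not a description of how DsFileSet resolves paths internally, and the file location under Z:\DataSet is hypothetical.

```matlab
% Equivalent roots for the same data on two platforms
altRoots = ["Z:\DataSet","/nfs-bldg001/DataSet"];

% Hypothetical file as seen from the Windows machine
localFile = "Z:\DataSet\binary_data01.bin";

if startsWith(localFile,altRoots(1))
    % Keep the part of the path below the Windows root
    relPart = extractAfter(localFile,strlength(altRoots(1)));
    % Switch to forward slashes and attach the Linux root
    clusterFile = altRoots(2) + replace(relPart,"\","/");
else
    % Path is not under the known root, so leave it unchanged
    clusterFile = localFile;
end
```

Here clusterFile holds the path that a Linux worker would need in order to reach the same file.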

Analyze a small portion of the data on your local machine. For instance, get a partitioned subset of the data and clean the data by removing any missing entries. Then, examine a plot of the variables.

tt = tall(partition(ds,100,1)); 
summary(tt); 
% analyze your data                        
tt = rmmissing(tt);               
plot(tt.MyVar1,tt.MyVar2)

Scale up your analysis to the entire dataset by using a MATLAB Parallel Server cluster (Linux cluster). For instance, start a worker pool using the cluster profile, and then perform analysis on the entire dataset by using parallel and distributed computing capabilities.

parpool('MyMjsProfile') 
tt = tall(ds);          
summary(tt);
% analyze your data
tt = rmmissing(tt);               
plot(tt.MyVar1,tt.MyVar2)

Tips

For your custom datastore implementation, best practice is not to implement the numpartitions method.

Version History

Introduced in R2017b