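matlab.io.datastore.Partitionable class

Namespace: matlab.io.datastore

Add parallelization support to datastore

Description

matlab.io.datastore.Partitionable is an abstract mixin class that adds parallelization support to your custom datastore for use with Parallel Computing Toolbox™ and MATLAB® Parallel Server™.

To use this mixin class, you must inherit from the matlab.io.datastore.Partitionable class in addition to inheriting from the matlab.io.Datastore base class. Type the following syntax as the first line of your class definition file: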

classdef MyDatastore < matlab.io.Datastore & ...
                       matlab.io.datastore.Partitionable
    ...
end

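To add support for parallel processing to your custom datastore, you must:

Inherit from the additional class matlab.io.datastore.Partitionable.
Define the additional methods maxpartitions and partition.

For more information and the steps to create a custom datastore with parallel processing support, see Develop Custom Datastore.

Methods

`maxpartitions`	Maximum number of partitions possible
`numpartitions`	Default number of partitions
`partition`	Partition a datastore

Attributes

Sealed false

For information on class attributes, see Class Attributes.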

Examples

Build Datastore with Parallel Processing Support

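Build a datastore with parallel processing support and use it to bring your custom or proprietary data into MATLAB®. Then process the data in a parallel pool.

Create a .m class definition file that contains the code implementing your custom datastore. Save this file in your working folder or in a folder on the MATLAB® path. The name of the .m file must match the name of the object constructor function. For example, if you want the constructor function to be named MyDatastorePar, then the .m file must be named MyDatastorePar.m. The .m class definition file must contain the following steps:

Step 1: Inherit from the datastore classes.
Step 2: Define the constructor and the required methods.
Step 3: Define your custom file reading function.

In addition to these steps, define any other properties or methods that you need to process and analyze your data.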

%% STEP 1: INHERIT FROM DATASTORE CLASSES
classdef MyDatastorePar < matlab.io.Datastore & ...
        matlab.io.datastore.Partitionable
   
    properties(Access = private)
        CurrentFileIndex double
        FileSet matlab.io.datastore.DsFileSet
    end
    
    % Property to support saving, loading, and processing of
    % datastore on different file system machines or clusters.
    % In addition, define the methods get.AlternateFileSystemRoots()
    % and set.AlternateFileSystemRoots() in the methods section. 
    properties(Dependent)
        AlternateFileSystemRoots
    end
    
%% STEP 2: DEFINE THE CONSTRUCTOR AND THE REQUIRED METHODS
    methods
        % Define your datastore constructor
        function myds = MyDatastorePar(location,altRoots)
            myds.FileSet = matlab.io.datastore.DsFileSet(location,...
                'FileExtensions','.bin', ...
                'FileSplitSize',8*1024);
            myds.CurrentFileIndex = 1;
             
            if nargin == 2
                 myds.AlternateFileSystemRoots = altRoots;
            end
            
            reset(myds);
        end
        
        % Define the hasdata method
        function tf = hasdata(myds)
            % Return true if more data is available
            tf = hasfile(myds.FileSet);
        end
        
        % Define the read method
        function [data,info] = read(myds)
            % Read data and information about the extracted data
            % See also: MyFileReader()
            if ~hasdata(myds)
                msgII = ['Use the reset method to reset the datastore ',... 
                         'to the start of the data.']; 
                msgIII = ['Before calling the read method, ',...
                          'check if data is available to read ',...
                          'by using the hasdata method.'];
                error('No more data to read.\n%s\n%s',msgII,msgIII);
            end
            
            fileInfoTbl = nextfile(myds.FileSet);
            data = MyFileReader(fileInfoTbl);
            info.Size = size(data);
            info.FileName = fileInfoTbl.FileName;
            info.Offset = fileInfoTbl.Offset;
            
            % Update CurrentFileIndex for tracking progress
            if fileInfoTbl.Offset + fileInfoTbl.SplitSize >= ...
                    fileInfoTbl.FileSize
                myds.CurrentFileIndex = myds.CurrentFileIndex + 1 ;
            end
        end
        
        % Define the reset method
        function reset(myds)
            % Reset to the start of the data
            reset(myds.FileSet);
            myds.CurrentFileIndex = 1;
        end

        % Define the partition method
        function subds = partition(myds,n,ii)
            subds = copy(myds);
            subds.FileSet = partition(myds.FileSet,n,ii);
            reset(subds);
        end
        
        % Getter for AlternateFileSystemRoots property
        function altRoots = get.AlternateFileSystemRoots(myds)
            altRoots = myds.FileSet.AlternateFileSystemRoots;
        end

        % Setter for AlternateFileSystemRoots property
        function set.AlternateFileSystemRoots(myds,altRoots)
            try
              % The DsFileSet object manages AlternateFileSystemRoots
              % for your datastore
              myds.FileSet.AlternateFileSystemRoots = altRoots;

              % Reset the datastore
              reset(myds);  
            catch ME
              throw(ME);
            end
        end
      
    end
    
    methods (Hidden = true)          
        % Define the progress method
        function frac = progress(myds)
            % Determine fraction of data read from datastore (0 to 1)
            if hasdata(myds) 
               frac = (myds.CurrentFileIndex-1)/...
                             myds.FileSet.NumFiles; 
            else 
               frac = 1;  
            end 
        end
    end
    
    methods(Access = protected)
        % If you use the FileSet property in the datastore,
        % then you must define the copyElement method. The
        % copyElement method allows methods such as readall
        % and preview to remain stateless 
        function dscopy = copyElement(ds)
            dscopy = copyElement@matlab.mixin.Copyable(ds);
            dscopy.FileSet = copy(ds.FileSet);
        end
        
        % Define the maxpartitions method
        function n = maxpartitions(myds)
            n = maxpartitions(myds.FileSet);
        end
    end
end

%% STEP 3: IMPLEMENT YOUR CUSTOM FILE READING FUNCTION
function data = MyFileReader(fileInfoTbl)
% create a reader object using FileName
reader = matlab.io.datastore.DsFileReader(fileInfoTbl.FileName);

% seek to the offset
seek(reader,fileInfoTbl.Offset,'Origin','start-of-file');

% read fileInfoTbl.SplitSize amount of data
data = read(reader,fileInfoTbl.SplitSize);

end
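
To see what MyFileReader does with one split, the following minimal sketch writes a small temporary file and reads a single chunk from it with the same seek and read calls. The file contents, the offset of 5 bytes, and the split size of 10 bytes are assumptions chosen for illustration; they are not values that DsFileSet would produce for the example data.

```matlab
% Write 20 known bytes to a temporary file (hypothetical test data).
fname = [tempname '.bin'];
fid = fopen(fname,'w');
fwrite(fid,uint8(1:20),'uint8');
fclose(fid);

% Assumed split description; in MyDatastorePar these values come from
% the table returned by nextfile(myds.FileSet).
offset    = 5;    % bytes to skip from the start of the file
splitSize = 10;   % number of bytes in this split

% Same sequence of calls as MyFileReader.
reader = matlab.io.datastore.DsFileReader(fname);
seek(reader,offset,'Origin','start-of-file');
data = read(reader,splitSize);   % uint8 by default

disp(numel(data))
delete(fname);                   % clean up the temporary file
```

Because each split is described only by a file name, an offset, and a size, any worker can read its own chunk without coordinating with the others.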

Your custom datastore is now ready. Use it to read and process data in a parallel pool.

Read Data Using Custom Datastore and Process in Parallel Pool

Use a custom datastore to preview your proprietary data and read it into MATLAB for parallel processing.

This example uses a simple data set to illustrate a workflow with a custom datastore. The data set is a collection of 15 binary (.bin) files, where each file contains one column (1 variable) and 10000 rows (records) of unsigned integers.

dir('*.bin')

binary_data01.bin  binary_data02.bin  binary_data03.bin  binary_data04.bin  binary_data05.bin  binary_data06.bin  binary_data07.bin  binary_data08.bin  binary_data09.bin  binary_data10.bin  binary_data11.bin  binary_data12.bin  binary_data13.bin  binary_data14.bin  binary_data15.bin
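
The sample files are not included on this page. As a minimal sketch, a comparable data set can be generated as follows, assuming random uint8 values and a temporary folder (both are assumptions, so the values returned by preview will differ from the output shown in this example).

```matlab
% Hypothetical helper script: create 15 binary files, each holding
% 10000 uint8 values, in a new temporary folder.
dataFolder = tempname;      % assumed location; any writable folder works
mkdir(dataFolder);
rng(0);                     % fixed seed so reruns produce the same bytes

for k = 1:15
    fname = fullfile(dataFolder,sprintf('binary_data%02d.bin',k));
    fid = fopen(fname,'w');
    fwrite(fid,randi([0 255],10000,1,'uint8'),'uint8');
    fclose(fid);
end

% List the generated files to confirm the layout.
listing = dir(fullfile(dataFolder,'*.bin'));
disp({listing.name}')
```

Change to dataFolder before running the remaining steps, because they locate the files with the pattern '*.bin' relative to the current folder.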

Create a datastore object using the MyDatastorePar function. For implementation details of MyDatastorePar, see the example Build Datastore with Parallel Processing Support.

folder = fullfile('*.bin'); 
ds = MyDatastorePar(folder);

Preview the data from the datastore.

preview(ds)

ans = 8x1 uint8 column vector

   113
   180
   251
    91
    29
    66
   254
   214

Identify the number of partitions for your datastore. If you have Parallel Computing Toolbox, you can use n = numpartitions(ds,myPool), where myPool is a parallel pool object such as the one returned by gcp or parpool.

n = numpartitions(ds);

Partition the datastore into n parts and process each part on a worker in the parallel pool.

parfor ii = 1:n
    subds = partition(ds,n,ii);
    while hasdata(subds)
        data = read(subds);
        % do something
    end
end
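
Before moving to a parallel pool, it can help to confirm that the partitions together cover the whole data set. The following sketch does this serially with a plain for loop. It assumes that MyDatastorePar.m from the previous example is on the path and that the .bin files are in the current folder; the variable names total and expected are introduced here for illustration.

```matlab
% Assumes MyDatastorePar.m is on the MATLAB path and the .bin files
% are in the current folder.
ds = MyDatastorePar(fullfile('*.bin'));
n = numpartitions(ds);

total = 0;
for ii = 1:n                      % serial loop; no parallel pool needed
    subds = partition(ds,n,ii);
    while hasdata(subds)
        data = read(subds);
        total = total + numel(data);   % count elements in this chunk
    end
end

% Compare against the number of elements read from the full datastore.
expected = numel(readall(ds));
fprintf('Read %d of %d elements across %d partitions.\n', ...
    total,expected,n);
```

If the two counts differ, the partition or maxpartitions implementation is the first place to look.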

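Process Datastore on Different Platforms

To process your datastore with parallel and distributed computing that involves cloud or cluster machines on different platforms, you must predefine the 'AlternateFileSystemRoots' parameter. For example, create a datastore on your local machine and analyze a small portion of the data. Then scale up the analysis to the entire data set using Parallel Computing Toolbox and MATLAB Parallel Server.

Create a datastore using MyDatastorePar and assign a value to the 'AlternateFileSystemRoots' property. For implementation details of MyDatastorePar, see the example Build Datastore with Parallel Processing Support.

To set the value of the 'AlternateFileSystemRoots' property, identify the root paths for your data on the different platforms. The root paths differ based on the machine or file system. For example, suppose that you access the data using these root paths:

"Z:\DataSet" on your Windows® machine.
"/nfs-bldg001/DataSet" on your MATLAB Parallel Server Linux® cluster.

Then associate these root paths by using the AlternateFileSystemRoots property.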

altRoots = ["Z:\DataSet","/nfs-bldg001/DataSet"];
ds = MyDatastorePar('Z:\DataSet',altRoots);

Analyze a small portion of the data on your local machine. For instance, take a partitioned subset of the data and clean it by removing any missing entries. Then examine a plot of the variables.

tt = tall(partition(ds,100,1)); 
summary(tt); 
% analyze your data                        
tt = rmmissing(tt);               
plot(tt.MyVar1,tt.MyVar2)

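Scale up your analysis to the entire data set by using a MATLAB Parallel Server cluster (Linux cluster). For instance, start a worker pool using the cluster profile, and then analyze the entire data set by using parallel and distributed computing capabilities.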

parpool('MyMjsProfile') 
tt = tall(ds);          
summary(tt);
% analyze your data
tt = rmmissing(tt);               
plot(tt.MyVar1,tt.MyVar2)

Tips

In your custom datastore implementation, it is a best practice not to implement the numpartitions method. The Partitionable class already provides it.

Version History

Introduced in R2017b

See Also

mapreduce | datastore | matlab.io.datastore.HadoopLocationBased | matlab.io.Datastore

Topics

Develop Custom Datastore
Tall Arrays for Out-of-Memory Data
Partition a Datastore in Parallel (Parallel Computing Toolbox)