matlab.io.datastore.BlockedFileSet

Blocked file-set for collection of blocks within file

Description

The matlab.io.datastore.BlockedFileSet object helps you process a large collection of blocks within files when moving through the files iteratively. Use the BlockedFileSet object together with the DsFileReader object to manage and read files from your datastore.

Creation

Description

bs = matlab.io.datastore.BlockedFileSet(location) creates a BlockedFileSet object for a collection of blocks within files based on the specified location.


bs = matlab.io.datastore.BlockedFileSet(location,Name,Value) specifies file extensions, whether to include subfolders, or object properties using one or more name-value pair arguments. Enclose each name in quotes.

Input Arguments


location

Files or folders to include in the BlockedFileSet object, specified as a character vector, cell array of character vectors, string array, or a structure. If the files are not in the current folder, then location must be a full or relative path. Files within subfolders of the specified folder are not automatically included in the BlockedFileSet object.

Typically for a Hadoop® workflow, when you specify location as a structure, it must contain the fields FileName, Offset, and Size. This requirement enables you to use the location argument directly with the initializeDatastore method of the matlab.io.datastore.HadoopLocationBased class. For an example, see Add Support for Hadoop.

You can use the wildcard character (*) when specifying location. Specifying this character includes all matching files or all files in the matching folders in the file-set object.

If the files are not available locally, then the full path of the files or folders must be a uniform resource locator (URL), such as hdfs://hostname:portnumber/path_to_file.

Data Types: char | cell | string | struct

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: bs = matlab.io.datastore.BlockedFileSet(location,'IncludeSubfolders',true)

'FileExtensions'

File extensions, specified as a character vector, cell array of character vectors, or string array. You can use the empty quotes '' to represent files without extensions.

If 'FileExtensions' is not specified, then BlockedFileSet automatically includes all file extensions.

Example: 'FileExtensions','.jpg'

Example: 'FileExtensions',{'.txt','.csv'}

'IncludeSubfolders'

Subfolder inclusion flag, specified as a numeric or logical 1 (true) or 0 (false). Specify true to include all files and subfolders within each folder or false to include only the files within each folder.

Example: 'IncludeSubfolders',true
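
The name-value pairs described above can be combined in one call. The following is a minimal sketch, not a definitive recipe: it assumes that the MATLAB demos folder under matlabroot exists on your installation and contains MAT-files (it is used here only as a convenient sample location), and the variable names are illustrative.

```matlab
% Assumption: this sample folder exists and holds .mat files on your system.
location = fullfile(matlabroot,'toolbox','matlab','demos');

% Include only MAT-files in the top-level folder and split them into
% blocks of 2000 bytes.
bs = matlab.io.datastore.BlockedFileSet(location, ...
    'FileExtensions','.mat', ...
    'IncludeSubfolders',false, ...
    'BlockSize',2000);

% Display the total number of blocks found.
disp(bs.NumBlocks)
```

Setting 'IncludeSubfolders' to true instead would also pick up matching files in nested folders.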

Properties


BlockSize

Block size in bytes to be used to split file information, specified as one of these values:

  • 'file' — Use size of next file in the collection.

  • numeric scalar — Use specified value in bytes.

Example: 'BlockSize',2000
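
To see how a numeric block size maps onto a single file, the arithmetic can be sketched in plain MATLAB. This sketch assumes that each file is split at fixed multiples of the block size and that the final block holds the remainder, which is consistent with the offsets shown in the Examples section of this page. The variable names are hypothetical and are not part of the BlockedFileSet API.

```matlab
% Hypothetical values for illustration (not BlockedFileSet API calls).
fileSize  = 7343;   % bytes, the size reported for accidents.mat on this page
blockSize = 2000;   % bytes, the value passed as 'BlockSize'

numBlocks = ceil(fileSize/blockSize);            % blocks needed for this file
offsets   = (0:numBlocks-1) * blockSize;         % starting byte of each block
sizes     = min(blockSize, fileSize - offsets);  % the last block may be shorter

% Show one row per block.
disp(table(offsets.', sizes.', 'VariableNames', {'Offset','BlockSize'}))
```

Under the same assumption, a 32522-byte file gives 17 blocks, the last starting at offset 32000 with 522 bytes, which matches the last block reported for earth.mat in the Examples section.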

AlternateFileSystemRoots

Alternate file system root paths, specified as a string array or a cell array. Use 'AlternateFileSystemRoots' when you create a datastore on a local machine but need to access and process the data on another machine (possibly running a different operating system). You must also use 'AlternateFileSystemRoots' to associate the root paths when you process data using Parallel Computing Toolbox™ and MATLAB® Parallel Server™ and the data is stored on your local machines with a copy available on cloud or cluster machines that run a different platform.

  • To associate a set of root paths that are equivalent to one another, specify 'AlternateFileSystemRoots' as a string array. For example,

    ["Z:\datasets","/mynetwork/datasets"]

  • To associate multiple sets of root paths that are equivalent for the datastore, specify 'AlternateFileSystemRoots' as a cell array containing multiple rows where each row represents a set of equivalent root paths. Specify each row in the cell array as either a string array or a cell array of character vectors. For example:

    • Specify 'AlternateFileSystemRoots' as a cell array of string arrays.

      {["Z:\datasets", "/mynetwork/datasets"];...
       ["Y:\datasets", "/mynetwork2/datasets","S:\datasets"]}

    • Alternatively, specify 'AlternateFileSystemRoots' as a cell array of cell array of character vectors.

      {{'Z:\datasets','/mynetwork/datasets'};...
       {'Y:\datasets', '/mynetwork2/datasets','S:\datasets'}}

The value of 'AlternateFileSystemRoots' must satisfy these conditions:

  • Contains one or more rows, where each row specifies a set of equivalent root paths.

  • Each row specifies multiple root paths and each root path must contain at least two characters.

  • Root paths are unique and are not subfolders of one another.

  • Contains at least one root path entry that points to the location of the files.

For more information, see Set Up Datastore for Processing on Different Machines or Clusters.

Example: ["Z:\datasets","/mynetwork/datasets"]

Data Types: string | cell

NumBlocks

This property is read-only.

Number of blocks in the blocked file-set object, returned as a numeric scalar.

Example: bs.NumBlocks

Data Types: double

NumBlocksRead

This property is read-only.

Number of blocks read from the BlockedFileSet object, returned as a numeric scalar.

Example: bs.NumBlocksRead

Data Types: double

BlockInfo

This property is read-only.

Information about blocks in the BlockedFileSet object, returned as a BlockInfo object with these properties:

  • Filename — Name of the file in the BlockedFileSet object. The name contains the full path of the file.

  • FileSize — Size of the file in number of bytes.

  • Offset — Starting offset within the file to be read.

  • BlockSize — Size of the block in number of bytes.

For information about a specific block, specify the block index. For example, bs.BlockInfo(2) returns information for the second block. If you call bs.BlockInfo without specifying an index, it returns information for all of the blocks.

Example: bs.BlockInfo(2)

Object Functions

hasPreviousBlock    Determine if blocked file-set has previous block
previousblock       Information on previous block in blocked file-set
hasNextBlock        Determine if blocked file-set has another block
nextblock           Information on next block in blocked file-set
progress            Determine how many blocks or files have been read
maxpartitions       Maximum number of partitions
partition           Partition file-set object
subset              Create subset of datastore or file-set
reset               Reset the file-set object
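
A typical way to combine these functions is a loop that visits every block once. The following is a minimal sketch, assuming the sample MAT-files accidents.mat and census.mat are on the MATLAB path as in the Examples section. The counter variable is illustrative.

```matlab
% Build a small blocked file-set from two sample files on the MATLAB path.
bs = matlab.io.datastore.BlockedFileSet({'accidents.mat','census.mat'}, ...
    'BlockSize',2000);

count = 0;
while hasNextBlock(bs)
    blk = nextblock(bs);    % BlockInfo for the next unread block
    count = count + 1;
    fprintf('%s: offset %d, size %d bytes\n', ...
        blk.Filename, blk.Offset, blk.BlockSize);
end

% Return to the start so that the blocks can be traversed again.
reset(bs);
```

After the loop finishes, the number of iterations should equal bs.NumBlocks, and calling reset makes the first block available to nextblock again.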

Examples


Create a blocked file-set and query information for specific blocks in the blocked file-set.

Create a blocked file-set bs for a collection of files and specify the block size.

folder = {'accidents.mat','airlineResults.mat','census.mat','earth.mat'}
folder = 1x4 cell
  Columns 1 through 3

    {'accidents.mat'}    {'airlineResults...'}    {'census.mat'}

  Column 4

    {'earth.mat'}

bs = matlab.io.datastore.BlockedFileSet(folder,'BlockSize',2000)
bs = 
  BlockedFileSet with properties:

                   NumBlocks: 98
               NumBlocksRead: 0
                   BlockSize: 2000
                   BlockInfo: Show BlockInfo for all 98 blocks
    AlternateFileSystemRoots: {}

Obtain information for specific blocks either by using the nextblock function or by querying the BlockInfo property with an index. Use nextblock to obtain information for consecutive blocks. For example, obtain information for the first two blocks in the set.

blk1 = nextblock(bs)
blk1 = 
  1x1 BlockInfo
                                       Filename                                       FileSize    Offset    BlockSize
    ______________________________________________________________________________    ________    ______    _________

    "/mathworks/devel/bat/Bdoc20b/build/matlab/toolbox/matlab/demos/accidents.mat"      7343        0         2000   

blk2 = nextblock(bs)
blk2 = 
  1x1 BlockInfo
                                       Filename                                       FileSize    Offset    BlockSize
    ______________________________________________________________________________    ________    ______    _________

    "/mathworks/devel/bat/Bdoc20b/build/matlab/toolbox/matlab/demos/accidents.mat"      7343       2000       2000   

Query the BlockInfo property to get information about the last block in the set.

lastblk = bs.BlockInfo(98)
lastblk = 
  1x1 BlockInfo
                                     Filename                                     FileSize    Offset    BlockSize
    __________________________________________________________________________    ________    ______    _________

    "/mathworks/devel/bat/Bdoc20b/build/matlab/toolbox/matlab/demos/earth.mat"     32522      32000        522   

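The block information can then be passed to a DsFileReader object to read the bytes of one block. The following is a sketch under stated assumptions rather than a definitive implementation: it assumes accidents.mat is on the MATLAB path, that seek accepts the 'Origin' option with the value 'start-of-file', and that read returns raw bytes by default. The variable names are illustrative.

```matlab
% Create a blocked file-set for one sample file and get its first block.
bs  = matlab.io.datastore.BlockedFileSet('accidents.mat','BlockSize',2000);
blk = nextblock(bs);

% Open the file that contains the block and move to the block's offset.
fr = matlab.io.datastore.DsFileReader(blk.Filename);
seek(fr, blk.Offset, 'Origin','start-of-file');

% Read exactly one block of data from that position.
bytes = read(fr, blk.BlockSize);
```

Reading by offset and block size in this way lets each block be processed independently of the rest of the file.
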
Introduced in R2020a