parallel.cluster.Spark

Spark cluster for mapreducer, mapreduce and tall arrays

Since R2022b

    Description

    A parallel.cluster.Spark object represents and provides access to a Spark™ cluster. Use the parallel.cluster.Spark object as input to the mapreduce and mapreducer functions to specify the Spark cluster as the parallel execution environment for tall arrays and mapreduce.

    Creation

    Use parallel.cluster.Spark to create a Spark cluster object.

    Description

    sparkCluster = parallel.cluster.Spark creates a parallel.cluster.Spark object representing the Spark cluster.


    sparkCluster = parallel.cluster.Spark(Name,Value) sets the optional ClusterMatlabRoot and SparkInstallFolder properties of the parallel.cluster.Spark object using one or more name-value arguments. For example, to change the Spark install folder, use 'SparkInstallFolder','/share/spark/spark-3.3.0'.
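    If you know the installation locations on your cluster, you can set both optional properties when you create the object. A minimal sketch; both paths are hypothetical placeholders for your own installation:

    sparkCluster = parallel.cluster.Spark( ...
        'ClusterMatlabRoot','/usr/local/MATLAB/R2022b', ...   % hypothetical MATLAB Parallel Server location
        'SparkInstallFolder','/share/spark/spark-3.3.0');     % hypothetical Spark installation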

    Properties


    AdditionalPaths

    Folders to add to the MATLAB search path of the workers, specified as a character vector, string scalar, string array, or cell array of character vectors.

    AttachedFiles

    Files and folders sent to the workers during a mapreduce call, specified as a character vector, string scalar, string array, or cell array of character vectors.

    AutoAttachFiles

    Specify whether to automatically detect and attach files on the client.

    When you offload computations to workers, any files that the client needs for the computations must also be available on the workers. By default, the client attempts to detect and attach these files. To turn off automatic detection, set the AutoAttachFiles property to false. If the software cannot find all the files, or if sending files from client to worker is slow, use one of these options, shown in the sketch after this list.

    • If the files are in a folder that is not accessible on the workers, set the AttachedFiles property. The cluster copies each file you specify from the client to the workers.

    • If the files are in a folder that is accessible on the workers, set the AdditionalPaths property instead. Use the AdditionalPaths property to add paths to the MATLAB search path for each worker and avoid copying files unnecessarily from the client to the workers.

    Data Types: logical
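    A minimal sketch of both options; the file and folder paths are hypothetical placeholders:

    c = parallel.cluster.Spark;
    c.AutoAttachFiles = false;                          % turn off automatic detection
    c.AttachedFiles = {'/home/user/code/myHelper.m'};   % hypothetical file, copied from client to workers
    c.AdditionalPaths = {'/shared/matlab/utils'};       % hypothetical folder, already visible to workers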

    ClusterMatlabRoot

    Path to MATLAB for the workers, specified as a character vector. This path points to the installation of MATLAB Parallel Server™ for the workers, whether local to each machine or on a network share.

    Data Types: string

    LicenseNumber

    License number to use with online licensing.

    RequiresOnlineLicensing

    Specify whether the Spark cluster uses online licensing.

    Data Types: logical

    SparkInstallFolder

    Path to the Spark installation on the worker machines, specified as a character vector. If this property is not set, the default is the value of the SPARK_PREFIX environment variable or, if that is not set, SPARK_HOME.

    Data Types: string
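    As a sketch of this fallback behavior, you can leave SparkInstallFolder unset and rely on an environment variable instead; the install path here is a hypothetical placeholder:

    setenv('SPARK_PREFIX','/opt/spark')      % hypothetical Spark install location
    sparkCluster = parallel.cluster.Spark;   % SparkInstallFolder defaults to SPARK_PREFIX, then SPARK_HOME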

    SparkProperties

    Map of Spark name-value property pairs to be given to the Spark cluster.

    SparkProperties allows you to override configuration properties for Spark. See the list of available properties in the Spark documentation.
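    A minimal sketch of overriding one Spark configuration property; the property value is illustrative only:

    cluster = parallel.cluster.Spark;
    cluster.SparkProperties('spark.executor.cores') = '4';   % override a Spark configuration property
    keys(cluster.SparkProperties)                            % list the property names currently set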


    Object Functions

    mapreduce - Programming technique for analyzing data sets that do not fit in memory
    mapreducer - Define parallel execution environment for mapreduce and tall arrays

    Examples


    This example shows how to create and use a parallel.cluster.Spark object to set a Spark cluster as the mapreduce parallel execution environment.

    % Create a Spark cluster object, specifying the Spark installation folder
    sparkCluster = parallel.cluster.Spark('SparkInstallFolder','/host/spark-install');
    % Set the cluster as the execution environment for mapreduce and tall arrays
    mr = mapreducer(sparkCluster)
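    After mapreducer points at the Spark cluster, tall array computations run there as well. A brief sketch; the HDFS location is a hypothetical placeholder:

    ds = datastore('hdfs:///data/records.csv');   % hypothetical dataset stored on HDFS
    t = tall(ds);                                 % tall table backed by the Spark cluster
    n = gather(height(t))                         % gather a small result back to the client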

    Tips

    Spark clusters place limits on how much memory is available. You must adjust the size of the data you gather to fit within these limits.

    The amount of data gathered to the client is limited by the Spark properties:

    • spark.driver.memory

    • spark.executor.memory

    The amount of data you gather from a single Spark task must fit within the memory these properties allow. A single Spark task processes one block of data from HDFS, which is 128 MB by default. If you gather a tall array that contains most of the original data, make sure these properties are set large enough to hold it.

    If these properties are set too small, you see an error similar to the following:

    Error using tall/gather (line 50)
    Out of memory; unable to gather a partition of size 300m from Spark.
    Adjust the values of the Spark properties spark.driver.memory and 
    spark.executor.memory to fit this partition.

    The error message also specifies the property settings you need.

    Adjust the properties either in the default settings of the cluster or directly in MATLAB. To adjust the properties in MATLAB, add name-value pairs to the SparkProperties property of the cluster. For example:

    cluster = parallel.cluster.Spark;
    % Raise the memory limits so that larger partitions can be gathered
    cluster.SparkProperties('spark.driver.memory') = '2048m';
    cluster.SparkProperties('spark.executor.memory') = '2048m';
    mapreducer(cluster);

    Version History

    Introduced in R2022b