parallel.cluster.Spark

Spark cluster for mapreducer, mapreduce and tall arrays

Since R2022b

    Description

    A parallel.cluster.Spark object represents and provides access to a Spark™ cluster. Use the parallel.cluster.Spark object as input to the mapreduce and mapreducer functions to specify the Spark cluster as the parallel execution environment for tall arrays and mapreduce.

    Creation

    Use parallel.cluster.Spark to create a Spark cluster object.

    Description

    sparkCluster = parallel.cluster.Spark creates a parallel.cluster.Spark object representing the Spark cluster.


    sparkCluster = parallel.cluster.Spark(Name,Value) sets the optional ClusterMatlabRoot and SparkInstallFolder properties of the parallel.cluster.Spark object using one or more name-value arguments. For example, to change the Spark install folder, use 'SparkInstallFolder','/share/spark/spark-3.3.0'.
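    If you know the installation locations on your cluster, you can set both optional properties when you create the object. A minimal sketch; both paths are hypothetical placeholders for your own installation:

    sparkCluster = parallel.cluster.Spark( ...
        'ClusterMatlabRoot','/usr/local/MATLAB/R2022b', ...   % hypothetical MATLAB Parallel Server location
        'SparkInstallFolder','/share/spark/spark-3.3.0');     % hypothetical Spark installation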

    Properties


    AdditionalPaths

    Folders to add to the MATLAB search path of the workers, specified as a character vector, string scalar, string array, or cell array of character vectors.

    AttachedFiles

    Files and folders sent to the workers during a mapreduce call, specified as a character vector, string scalar, string array, or cell array of character vectors.

    AutoAttachFiles

    Specify whether to automatically detect and attach files on the client.

    When you offload computations to workers, any files that the client needs for the computations must also be available on the workers. By default, the client attempts to detect and attach these files. To turn off automatic detection, set the AutoAttachFiles property to false. If the software cannot find all the files, or if sending files from client to worker is slow, use one of these options, shown in the sketch after this list.

    • If the files are in a folder that is not accessible on the workers, set the AttachedFiles property. The cluster copies each file you specify from the client to the workers.

    • If the files are in a folder that is accessible on the workers, set the AdditionalPaths property instead. Use the AdditionalPaths property to add paths to the MATLAB search path for each worker and avoid copying files unnecessarily from the client to the workers.

    Data Types: logical
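    A minimal sketch of both options; the file and folder paths are hypothetical placeholders:

    c = parallel.cluster.Spark;
    c.AutoAttachFiles = false;                          % turn off automatic detection
    c.AttachedFiles = {'/home/user/code/myHelper.m'};   % hypothetical file, copied from client to workers
    c.AdditionalPaths = {'/shared/matlab/utils'};       % hypothetical folder, already visible to workers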

    ClusterMatlabRoot

    Path to MATLAB for the workers, specified as a character vector. This path points to the installation of MATLAB Parallel Server™ for the workers, whether local to each machine or on a network share.

    Data Types: string

    LicenseNumber

    License number to use with online licensing.

    RequiresOnlineLicensing

    Specify whether the Spark cluster uses online licensing.

    Data Types: logical

    SparkInstallFolder

    Path to the Spark installation on the worker machines, specified as a character vector. If this property is not set, the default is the value of the SPARK_PREFIX environment variable or, if that is not set, SPARK_HOME.

    Data Types: string
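    As a sketch of this fallback behavior, you can leave SparkInstallFolder unset and rely on an environment variable instead; the install path here is a hypothetical placeholder:

    setenv('SPARK_PREFIX','/opt/spark')      % hypothetical Spark install location
    sparkCluster = parallel.cluster.Spark;   % SparkInstallFolder defaults to SPARK_PREFIX, then SPARK_HOME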

    SparkProperties

    Map of Spark name-value property pairs to be given to the Spark cluster.

    SparkProperties allows you to override configuration properties for Spark. See the list of available properties in the Spark documentation.
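    A minimal sketch of overriding one Spark configuration property; the property value is illustrative only:

    cluster = parallel.cluster.Spark;
    cluster.SparkProperties('spark.executor.cores') = '4';   % override a Spark configuration property
    keys(cluster.SparkProperties)                            % list the property names currently set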


    Object Functions

    mapreduce - Programming technique for analyzing data sets that do not fit in memory
    mapreducer - Define parallel execution environment for mapreduce and tall arrays

    Examples


    This example shows how to create and use a parallel.cluster.Spark object to set a Spark cluster as the mapreduce parallel execution environment.

    % Create a Spark cluster object, specifying the Spark installation folder
    sparkCluster = parallel.cluster.Spark('SparkInstallFolder','/host/spark-install');
    % Set the cluster as the execution environment for mapreduce and tall arrays
    mr = mapreducer(sparkCluster)
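    After mapreducer points at the Spark cluster, tall array computations run there as well. A brief sketch; the HDFS location is a hypothetical placeholder:

    ds = datastore('hdfs:///data/records.csv');   % hypothetical dataset stored on HDFS
    t = tall(ds);                                 % tall table backed by the Spark cluster
    n = gather(height(t))                         % gather a small result back to the client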

    Tips

    Spark clusters place limits on how much memory is available. You must adjust the size of the data you gather to fit within these limits.

    The amount of data gathered to the client is limited by the Spark properties:

    • spark.driver.memory

    • spark.executor.memory

    The amount of data you gather from a single Spark task must fit within the memory these properties allow. A single Spark task processes one block of data from HDFS, which is 128 MB by default. If you gather a tall array that contains most of the original data, make sure these properties are set large enough to hold it.

    If these properties are set too small, you see an error similar to the following:

    Error using tall/gather (line 50)
    Out of memory; unable to gather a partition of size 300m from Spark.
    Adjust the values of the Spark properties spark.driver.memory and 
    spark.executor.memory to fit this partition.

    The error message also specifies the property settings you need.

    Adjust the properties either in the default settings of the cluster or directly in MATLAB. To adjust the properties in MATLAB, add name-value pairs to the SparkProperties property of the cluster. For example:

    cluster = parallel.cluster.Spark;
    % Raise the memory limits so that larger partitions can be gathered
    cluster.SparkProperties('spark.driver.memory') = '2048m';
    cluster.SparkProperties('spark.executor.memory') = '2048m';
    mapreducer(cluster);

    Version History

    Introduced in R2022b