Configure a Spark Cluster
Submit parallel MATLAB® code that contains tall (MATLAB) arrays and mapreduce functions to a Spark™ cluster from suitably configured MATLAB clients.
To configure the client to run MATLAB code on the cluster, you must already be able to submit to the cluster from the intended client machine. The client machine must have a Spark installation that can access the cluster outside of MATLAB.
Many Spark distributions do not support direct access to Linux® based clusters from Windows® clients. Users of Windows clients typically need to set up a Linux gateway node that they can reach from the Windows client via SSH or VNC, and then access the cluster from that gateway node.
Integrate MATLAB Parallel Server™ with your cluster infrastructure. For instructions, see Install and Configure MATLAB Parallel Server for Third-Party Schedulers.
MATLAB Parallel Server supports Spark clusters running in different environments such as Spark Standalone or Databricks™.
If your cluster requires Kerberos authentication, ensure your MATLAB Parallel Server installation has been configured correctly. For instructions, see Kerberos Authentication.
Ensure your client can access the Spark cluster outside MATLAB.
Ensure your client MATLAB installation has been configured for Kerberos authentication if your cluster requires it. For instructions, see Kerberos Authentication.
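As a rough first check of the client-side Spark installation, the sketch below shells out from MATLAB to the spark-submit launcher and prints its version banner. The install path is a placeholder, and the check only confirms that the launcher runs on this machine. It does not prove that the cluster itself is reachable, so a real test job submitted outside MATLAB is still the stronger check.

```matlab
% Placeholder path: replace with the Spark installation folder on this client.
sparkInstall = '/path/to/spark/install';
sparkSubmit = fullfile(sparkInstall, 'bin', 'spark-submit');

% Only try to run the launcher if it exists at the assumed location.
found = isfile(sparkSubmit);
if found
    % "--version" asks spark-submit to report its version and exit.
    [status, output] = system(sprintf('"%s" --version', sparkSubmit));
    fprintf('spark-submit exited with status %d:\n%s\n', status, output);
else
    fprintf('spark-submit not found at %s\n', sparkSubmit);
end
```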
To access the cluster from within MATLAB, set up a parallel.cluster.Spark (Parallel Computing Toolbox) object using the following command:
cluster = parallel.cluster.Spark('/path/to/spark/install');
Then use mapreducer (MATLAB) to specify that mapreduce runs on the Spark cluster object.
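The following is a minimal sketch of how these two calls might fit together with a tall array. It assumes a working cluster; the install path is the placeholder from the text, and the HDFS file location and the ArrDelay variable name are hypothetical stand-ins for your own data.

```matlab
% Placeholder path: replace with the Spark installation folder on this client.
cluster = parallel.cluster.Spark('/path/to/spark/install');

% Make the Spark cluster the execution environment for mapreduce and tall arrays.
mr = mapreducer(cluster);

% Hypothetical data location and variable name, used for illustration only.
ds = datastore('hdfs:///data/airlinesmall.csv', ...
    'TreatAsMissing', 'NA', ...
    'SelectedVariableNames', {'ArrDelay'});

% Tall array operations are deferred until gather triggers evaluation.
tt = tall(ds);
meanDelay = gather(mean(tt.ArrDelay, 'omitnan'));
```

Because tall array evaluation is deferred, nothing is sent to the cluster until gather is called, so connection or permission problems typically surface at that line rather than when the cluster object is created.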
For examples of how to run parallel MATLAB code on your Spark cluster, see Use Tall Arrays on a Spark Cluster (Parallel Computing Toolbox).
If the cluster uses Kerberos authentication that requires the Oracle® Java® Cryptography Extension, you must configure all installations of MATLAB and MATLAB Parallel Server. If you are using Spark on Hadoop®, for example Cloudera® distributions, it is likely that you need to complete these configuration steps.
The configuration instructions are the same for client and worker MATLAB installations.
Starting in R2018b, configure your MATLAB installation by enabling the appropriate security policy in the Java installation.
In the MATLAB Editor, open the file
Change the lineto
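One way to check whether the unlimited-strength policy is active is to ask the JVM inside MATLAB for the maximum AES key length it allows. This is a sketch that relies on the standard Java method javax.crypto.Cipher.getMaxAllowedKeyLength and assumes MATLAB was started with its JVM enabled. You may need to restart MATLAB after changing the policy before the result reflects it.

```matlab
% Ask the JVM embedded in MATLAB for the largest AES key length it permits.
maxKeyLength = javax.crypto.Cipher.getMaxAllowedKeyLength('AES');

% A limit of 128 bits indicates the restricted policy; a larger value
% indicates that stronger keys are allowed.
if maxKeyLength > 128
    fprintf('Unlimited-strength policy appears active (max AES key length: %d bits).\n', maxKeyLength);
else
    fprintf('Limited policy appears active (max AES key length: %d bits).\n', maxKeyLength);
end
```

Run the same check on worker installations as well, since the configuration applies to both client and worker MATLAB installations.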
Spark Version Support
MATLAB mapreduce, tall arrays, and parallel usage of datastores are supported on Spark 2.2 or later. You can use tall arrays on Spark clusters from clients on all architectures, while the cluster itself must run on Linux or Mac architectures. This includes cross-platform support.
parallel.cluster.Spark (Parallel Computing Toolbox)
- Install and Configure MATLAB Parallel Server for Third-Party Schedulers
- Use Tall Arrays on a Spark Cluster (Parallel Computing Toolbox)