Deploy Tall Arrays to a CLOUDERA Spark Enabled Hadoop Cluster
This example shows how to deploy a MATLAB® application containing tall arrays to a CLOUDERA® Spark™ enabled Hadoop® cluster.
Deploying MATLAB applications against a CLOUDERA distribution of Spark requires a special wrapper type that you generate using the mcc command. This wrapper type generates a jar file as well as a shell script which calls spark-submit. The spark-submit script in the Spark bin directory is used to start applications on a cluster. It supports both yarn-client mode and yarn-cluster mode.
The inputs to the application are:
master: URL to the Spark cluster
inputFile: the file containing the input data
outputFile: the file containing the results of the computation
Note
The complete code for this example is in the file meanArrivalDemo.m.
Prerequisites
1. Install the MATLAB Runtime in the default location on the desktop. This example uses /usr/local/MATLAB/MATLAB_Runtime/R2025a as the default location for the MATLAB Runtime. If you don’t have MATLAB Runtime, see Download and Install MATLAB Runtime for installation instructions.
2. Install the MATLAB Runtime on every worker node.
3. Copy the airlinesmall.csv file from the toolbox/matlab/demos folder of your MATLAB install area into the Hadoop Distributed File System (HDFS™) folder /datasets/airlinemod.
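The copy into HDFS can be done with the standard hdfs dfs client. The following is a minimal sketch, not part of the original example: the MATLAB_ROOT variable and its default path are hypothetical and must be replaced with your actual MATLAB install area. The script does nothing except print a message if no Hadoop client is available on the machine.

```shell
#!/bin/sh
# Assumption: MATLAB_ROOT is your MATLAB install area (hypothetical default below).
MATLAB_ROOT="${MATLAB_ROOT:-/usr/local/MATLAB/R2025a}"
SRC="$MATLAB_ROOT/toolbox/matlab/demos/airlinesmall.csv"
DEST="/datasets/airlinemod"

if command -v hdfs >/dev/null 2>&1; then
    # Create the target folder if needed, then upload and list the file.
    hdfs dfs -mkdir -p "$DEST"
    hdfs dfs -put "$SRC" "$DEST/"
    hdfs dfs -ls "$DEST"
else
    echo "hdfs command not found; run this on a machine with a Hadoop client" >&2
fi
```

Note that hdfs dfs -put reports an error if the file already exists in the destination folder.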
Deploy Tall Arrays
1. At the MATLAB command prompt, use the mcc command to generate a jar file and shell script for the MATLAB application meanArrivalDemo.m.
>> mcc -vCW 'Spark:meanArrivalDemoApp' meanArrivalDemo.m
This action creates a jar file named meanArrivalDemoApp.jar and a shell script named run_meanArrivalDemoApp.sh.
Note
To use the shell script, set up the environment variables HADOOP_PREFIX, HADOOP_CONF_DIR, and SPARK_HOME.
2. Execute the shell script in either yarn-client mode or yarn-cluster mode. In yarn-client mode, the driver runs on the desktop. In yarn-cluster mode, the driver runs in the Application Master process in the cluster.
The general syntax to execute the shell script is:
./run_meanArrivalDemoApp.sh <runtime install root> [Spark arguments] [Application arguments]
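The generated script relies on the environment variables HADOOP_PREFIX, HADOOP_CONF_DIR, and SPARK_HOME. As a minimal sketch, they could be set and checked in the terminal session before running the script. The default paths shown here are hypothetical placeholders, not values from this example; substitute the locations used by your Hadoop and Spark installation.

```shell
#!/bin/sh
# Hypothetical install locations: replace with the paths on your system.
export HADOOP_PREFIX="${HADOOP_PREFIX:-/opt/hadoop}"
export HADOOP_CONF_DIR="${HADOOP_CONF_DIR:-$HADOOP_PREFIX/etc/hadoop}"
export SPARK_HOME="${SPARK_HOME:-/opt/spark}"

# Report each variable and warn when it does not point at an existing directory.
for name in HADOOP_PREFIX HADOOP_CONF_DIR SPARK_HOME; do
    eval "value=\$$name"
    if [ -d "$value" ]; then
        echo "$name=$value"
    else
        echo "warning: $name=$value is not a directory" >&2
    fi
done
```

Values already present in the environment are kept, so the sketch only fills in variables that are unset or empty.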
yarn-client mode
Run the following command from a Linux® terminal:
$ ./run_meanArrivalDemoApp.sh \
    /usr/local/MATLAB/MATLAB_Runtime/R2025a \
    yarn-client \
    hdfs://hadoop01glnxa64:54310/datasets/airlinemod/airlinesmall.csv \
    hdfs://hadoop01glnxa64:54310/user/someuser/meanArrivalResult
To examine the result, enter the following from the MATLAB command prompt:
>> ds = datastore('hdfs:///user/someuser/meanArrivalResult/*');
>> readall(ds)
yarn-cluster mode
Run the following command from a Linux terminal:
$ ./run_meanArrivalDemoApp.sh \
    /usr/local/MATLAB/MATLAB_Runtime/R2025a \
    --deploy-mode cluster --master yarn yarn-cluster \
    hdfs://hadoop01glnxa64:54310/datasets/airlinemod/airlinesmall.csv \
    hdfs://hadoop01glnxa64:54310/user/someuser/meanArrivalResult
In yarn-cluster mode, because the driver runs on a worker node in the cluster, standard output from the MATLAB function is not displayed on your desktop. In addition, files can be saved anywhere. To prevent such behavior, this example uses the write function to explicitly save the results to a particular location in HDFS.
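Because nothing is printed on the desktop in yarn-cluster mode, one way to confirm that the job finished is to look for the result folder in HDFS. The following is a minimal sketch using the standard hdfs dfs client and the output location from the commands above; it is an illustration, not part of the original example.

```shell
#!/bin/sh
# Result location passed to the application in the commands above.
RESULT_DIR="hdfs://hadoop01glnxa64:54310/user/someuser/meanArrivalResult"

if ! command -v hdfs >/dev/null 2>&1; then
    echo "hdfs command not found; run this on a machine with a Hadoop client" >&2
elif hdfs dfs -test -d "$RESULT_DIR"; then
    # The folder exists: list the files saved by the application.
    hdfs dfs -ls "$RESULT_DIR"
else
    echo "no results found at $RESULT_DIR" >&2
fi
```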
