In R2016b, MATLAB Compiler supports running MATLAB applications as standalone executables against a Spark enabled cluster. Deploying MATLAB applications against a Cloudera Spark distribution requires an alternate workflow that is not covered in the release documentation.
Deploying against a Cloudera distribution of Spark requires a new wrapper type that can be generated using the mcc command. This wrapper type generates a jar file and a shell script that calls spark-submit. The spark-submit script in Spark's bin directory is used to launch applications on a cluster. It supports both yarn-client mode and yarn-cluster mode.
MATLAB applications that use tall arrays or the MATLAB API for Spark can be deployed using this workflow.
Example 1:
Deploy Tall Arrays to a Cloudera Spark Enabled Hadoop Cluster
This example shows you how to deploy a MATLAB application that uses tall arrays to a Cloudera Spark enabled Hadoop cluster. The application meanArrivalDemo.m computes the mean arrival delay from airline data. The inputs to the application are:
master—URL to the Spark cluster.
inputFile—the file containing the input data.
outputFile—the file containing the results of the computation.
Prerequisites:
- Install the MATLAB Runtime in the default location on the desktop. The commands in this example use /usr/local/MATLAB/MATLAB_Runtime/v91 as the MATLAB Runtime location.
- Install the MATLAB Runtime on every worker node.
- Copy airlinesmall.csv from your MATLAB install area into the Hadoop Distributed File System (HDFS™) folder /datasets/airlinemod.
If you don't have the MATLAB Runtime, you can download it from the website at: https://www.mathworks.com/products/compiler/matlab-runtime.html.
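The data-staging prerequisite above can be sketched with the standard hadoop fs commands. This is a minimal sketch, not part of the original procedure: the function name stage_airline_data, the HADOOP_BIN override, and the local path /path/to/airlinesmall.csv are illustrative assumptions, and the real location of airlinesmall.csv inside your MATLAB install area must be substituted.

```shell
# Create the HDFS folder and copy the CSV file into it.
# Argument 1: local path to airlinesmall.csv (placeholder in this sketch).
stage_airline_data() {
    local_csv=$1
    hdfs_dir=/datasets/airlinemod
    # HADOOP_BIN is a hypothetical override; it defaults to the hadoop CLI.
    ${HADOOP_BIN:-hadoop} fs -mkdir -p "$hdfs_dir" &&
    ${HADOOP_BIN:-hadoop} fs -put "$local_csv" "$hdfs_dir/"
}

# Dry run: with HADOOP_BIN=echo the two commands are printed, not executed.
# Remove this assignment to run against a real cluster.
HADOOP_BIN=echo
stage_airline_data /path/to/airlinesmall.csv
```

Running the dry run first is a cheap way to confirm the target folder matches the hdfs:// paths passed to the generated shell script later in the procedure.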
Procedure:
1. At the MATLAB command prompt, use the mcc command to generate a jar file and shell script for the MATLAB application meanArrivalDemo.m:
>> mcc -vCW 'Spark:meanArrivalDemoApp' meanArrivalDemo.m
or, if using Spark version 2:
>> mcc -vCW 'Spark:meanArrivalDemoApp, 2' meanArrivalDemo.m
This creates a jar file named meanArrivalDemoApp.jar and a shell script named run_meanArrivalDemoApp.sh.
Note: To use the shell script, the environment variables HADOOP_PREFIX, HADOOP_CONF_DIR and SPARK_HOME must be set correctly.
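Before running the generated script, it can help to confirm that the three variables from the note (HADOOP_PREFIX, HADOOP_CONF_DIR and SPARK_HOME) are actually set in the current shell. The following is a minimal sketch; the function name check_spark_env is hypothetical, and it only tests that each variable is non-empty, not that it points to a valid installation.

```shell
# Report which of the required environment variables are unset or empty.
# Returns 0 when all three are set, 1 otherwise.
check_spark_env() {
    missing=0
    for name in HADOOP_PREFIX HADOOP_CONF_DIR SPARK_HOME; do
        # Look up the value of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

if check_spark_env; then
    echo "environment looks ready"
else
    echo "set the variables listed above before running the script" >&2
fi
```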
2. You can execute the shell script in yarn-client mode or yarn-cluster mode. In yarn-client mode, the driver runs on the desktop. In yarn-cluster mode, the driver runs in the Application Master process in the cluster.
The general syntax to execute the shell script is:
./run_meanArrivalDemoApp.sh <runtime install root> [Spark arguments] [Application arguments]
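To make the argument order concrete, the sketch below prints (without executing) a command line assembled in the order given by the general syntax. It reflects one reading of the examples in this article: options such as --deploy-mode and --master are treated as Spark arguments, while the master value, input file and output file are the application arguments. The helper name print_run_command is hypothetical.

```shell
# Print the command line in the documented order:
#   <script> <runtime install root> [Spark arguments] [Application arguments]
print_run_command() {
    script=$1
    runtime_root=$2
    spark_args=$3   # may be empty, as in the yarn-client example
    app_args=$4
    # ${spark_args:+ $spark_args} expands to nothing when spark_args is empty.
    echo "$script $runtime_root${spark_args:+ $spark_args} $app_args"
}

# yarn-cluster form: Spark arguments first, then the application arguments.
print_run_command ./run_meanArrivalDemoApp.sh \
    /usr/local/MATLAB/MATLAB_Runtime/v91 \
    "--deploy-mode cluster --master yarn" \
    "yarn-cluster hdfs://hadoop01glnxa64:54310/datasets/airlinemod/airlinesmall.csv hdfs://hadoop01glnxa64:54310/user/someuser/meanArrivalResult"
```

Passing an empty string as the third argument yields the yarn-client form, where the runtime install root is followed directly by the application arguments.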
a. yarn-client mode
Run the following command from a Linux terminal:
$ ./run_meanArrivalDemoApp.sh \
    /usr/local/MATLAB/MATLAB_Runtime/v91 \
    yarn-client \
    hdfs://hadoop01glnxa64:54310/datasets/airlinemod/airlinesmall.csv \
    hdfs://hadoop01glnxa64:54310/user/someuser/meanArrivalResult
To examine the result, enter the following from the MATLAB command prompt:
>> ds = datastore('hdfs://hadoop01glnxa64:54310/user/someuser/meanArrivalResult/*');
>> readall(ds)
b. yarn-cluster mode
Run the following command from a Linux terminal:
$ ./run_meanArrivalDemoApp.sh \
    /usr/local/MATLAB/MATLAB_Runtime/v91 \
    --deploy-mode cluster --master yarn yarn-cluster \
    hdfs://hadoop01glnxa64:54310/datasets/airlinemod/airlinesmall.csv \
    hdfs://hadoop01glnxa64:54310/user/someuser/meanArrivalResult
In yarn-cluster mode, the driver runs on a worker node in the cluster, so standard output from the MATLAB function is not displayed on your desktop. In addition, files saved without an explicit location can end up anywhere on the cluster. To avoid this, the example uses the write function to explicitly save the results to a specific location in HDFS.
Example 2:
Deploy Applications Using the MATLAB API for Spark
This example shows you how to deploy a MATLAB application developed using the MATLAB API for Spark against a Cloudera Spark enabled Hadoop cluster. The application flightsByCarrierDemo.m computes the number of airline carrier types from airline data. The inputs to the application are:
master—URL to the Spark cluster.
inputFile—the file containing the input data.
Prerequisites:
- Install the MATLAB Runtime in the default location on the desktop. The commands in this example use /usr/local/MATLAB/MATLAB_Runtime/v91 as the MATLAB Runtime location.
- Install the MATLAB Runtime on every worker node.
- Copy airlinesmall.csv from your MATLAB install area into the Hadoop Distributed File System (HDFS™) folder /datasets/airlinemod.
If you don't have the MATLAB Runtime, you can download it from the website at: https://www.mathworks.com/products/compiler/matlab-runtime.html
Procedure:
1. At the MATLAB command prompt, use the mcc command to generate a jar file and shell script for the MATLAB application flightsByCarrierDemo.m:
>> mcc -C -W 'Spark:flightsByCarrierDemoApp' flightsByCarrierDemo.m
This creates a jar file named flightsByCarrierDemoApp.jar and a shell script named run_flightsByCarrierDemoApp.sh.
2. You can execute the shell script in yarn-client mode or yarn-cluster mode. In yarn-client mode, the driver runs on the desktop. In yarn-cluster mode, the driver runs in the Application Master process in the cluster. The results of the computation in both cases are saved to a text file on HDFS by calling the saveAsTextFile method on the RDD.
a. yarn-client mode
Run the following command from a Linux terminal:
$ ./run_flightsByCarrierDemoApp.sh \
    /usr/local/MATLAB/MATLAB_Runtime/v91 \
    yarn-client \
    hdfs://hadoop01glnxa64:54310/datasets/airlinemod/airlinesmall.csv
To examine the results, enter the following from a Linux terminal:
$ hadoop fs -cat flightsByCarrierResults/*
b. yarn-cluster mode
Run the following command from a Linux terminal:
$ ./run_flightsByCarrierDemoApp.sh \
    /usr/local/MATLAB/MATLAB_Runtime/v91 \
    --deploy-mode cluster --master yarn yarn-cluster \
    hdfs://hadoop01glnxa64:54310/datasets/airlinemod/airlinesmall.csv