Why am I unable to validate my TORQUE/PBS Pro configuration in the Parallel Computing Toolbox?

I have MATLAB Distributed Computing Server (MDCS) set up on a Linux cluster running TORQUE/PBS Pro. When I attempt to validate the cluster configuration, the validation fails.

Accepted Answer

MathWorks Support Team on 6 Mar 2019
Edited: MathWorks Support Team on 6 Mar 2019
Several issues can prevent the cluster from validating. Run the tests below to make sure that your configuration is set up properly. If at any point you receive an error message, you can submit a request to Installation support using the link at the bottom of the page. When submitting a request, be sure to include the following:
- Your license number
- The release of MATLAB on the client and the cluster
- The output of your validation (click details to get the full information)
- The results of the tests below
1) Test the licensing of MDCS
The first step is to ensure that licensing for MATLAB Distributed Computing Server works on your cluster. This also tests whether MATLAB crashes on startup on the cluster. Log in to one of the cluster nodes, open a terminal window, and run the following commands, where $MATLAB is the MATLAB installation folder on the cluster:
cd $MATLAB/bin
./matlab -dmlworker -nodisplay -logfile /var/tmp/output.txt -r "ver;exit"
This generates an output.txt file in /var/tmp that contains the ver output from the cluster. If the log file contains a network license manager error, that error is the cause of the validation failure. In that case, search MATLAB Answers for the license manager error number and resolve the license error before proceeding.
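To make this log check repeatable across nodes, the sketch below scans the log file for a license manager error. The function name check_license_log is hypothetical, and the pattern assumes the error text contains the phrase "License Manager Error" followed by the error number (for example "License Manager Error -15"); adjust the pattern if your log words the error differently. An empty or missing log is reported separately, since it may indicate that MATLAB did not start at all.

```shell
#!/bin/sh
# Hypothetical helper (not a MathWorks tool): scan the log written by
#   ./matlab -dmlworker -nodisplay -logfile /var/tmp/output.txt -r "ver;exit"
# for a license manager error. ASSUMPTION: the error text contains
# "License Manager Error" followed by an error number such as -15.
check_license_log() {
    log=${1:-/var/tmp/output.txt}
    # An empty or absent log may mean MATLAB crashed before writing anything.
    if [ ! -s "$log" ]; then
        echo "MISSING: $log is empty or absent" >&2
        return 2
    fi
    # -i: ignore case; -o: print only the matched text, including the number
    errs=$(grep -io 'License Manager Error -\{0,1\}[0-9]*' "$log" | sort -u)
    if [ -n "$errs" ]; then
        echo "LICENSE ERROR: $errs"
        return 1
    fi
    echo "OK: no license manager error found in $log"
    return 0
}
```

For example, after running the two commands above on a node, run check_license_log /var/tmp/output.txt; a return code of 0 means the assumed error pattern was not found.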
2) Check the releases of MATLAB on the cluster and on the client where you ran the validation
If the log file contains the output of the ver command, check the release (R20XXx) reported for every product in the list. All products should report the same release, and that release should match the release installed on the client where you ran the validation. To check the release on the client, run the ver command in the MATLAB Command Window. If the releases of Parallel Computing Toolbox and MATLAB on the client do not match the releases of MATLAB and MATLAB Distributed Computing Server on the cluster, you will not be able to use this configuration until the installations are at the same release.
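The release comparison can be scripted as a quick sanity check. This sketch assumes that each product line printed by ver ends with a release tag in parentheses, such as (R2019a). The function name check_releases and its optional second argument (the release you read from ver on the client) are illustrative, not part of any MathWorks tool.

```shell
#!/bin/sh
# Hypothetical helper: list the distinct releases found in a ver log and,
# optionally, compare them with the release reported on the client.
# ASSUMPTION: each product line of ver ends with a tag like "(R2019a)".
check_releases() {
    log=$1
    expected=${2:-}   # optional, e.g. R2019a from ver on the client
    # Extract every release tag, strip the parentheses, keep unique values.
    releases=$(grep -o '(R20[0-9][0-9][ab])' "$log" | tr -d '()' | sort -u)
    if [ -z "$releases" ]; then
        echo "NO RELEASE FOUND in $log"
        return 2
    fi
    count=$(printf '%s\n' "$releases" | wc -l | tr -d ' ')
    if [ "$count" -gt 1 ]; then
        # Unquoted on purpose: prints the releases on one line.
        echo "MIXED RELEASES:" $releases
        return 1
    fi
    if [ -n "$expected" ] && [ "$releases" != "$expected" ]; then
        echo "MISMATCH: cluster $releases, client $expected"
        return 1
    fi
    echo "OK: $releases"
    return 0
}
```

For example, check_releases /var/tmp/output.txt R2019a reports whether every product in the cluster log is at the same release and whether that release matches the client.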
3) Check that your configuration meets the scheduler requirements
To use MATLAB Distributed Computing Server with TORQUE/PBS Pro, your setup must meet some additional requirements. For more details, see the scheduler requirements page:
Additionally, this configuration requires the following:
- The scheduler binaries must be accessible from the MATLAB client that runs Parallel Computing Toolbox. If the client does not have the binaries, we recommend logging in remotely to one of the cluster nodes and running the MATLAB client there.
- Your cluster should be completely homogeneous. Mixing different platforms or distributions is not recommended, especially for parallel computation.
- This configuration requires that job data be stored on a shared file system accessible to both the clients and the cluster nodes. When creating the configuration, set the "JobStorageLocation" property to a path that is accessible from all computers.
- The client machine should run the same operating system as the cluster (generally Linux). PBS Pro also supports Windows clusters; in that case, it is appropriate to use a Windows client for PCT.
- Because "JobStorageLocation" must be accessible by the same path from all computers, you cannot use a client machine of a different platform (for example, a Windows client accessing a Linux cluster).
- If you intend to use parallel jobs with a TORQUE cluster, passwordless SSH access must be set up between the compute nodes in order to submit these jobs.
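Two of these requirements, scheduler binaries on the client and passwordless SSH between compute nodes, can be checked from a shell. The helpers below are hypothetical; they assume qsub and qstat are the scheduler commands you expect on PATH and that the node names you pass are reachable hostnames. Passing -o BatchMode=yes makes ssh fail instead of prompting for a password, so a password prompt shows up as a failure.

```shell
#!/bin/sh
# Hypothetical pre-flight helpers for the scheduler requirements.

# Report any requested command that is not on PATH
# (e.g. qsub and qstat, the TORQUE/PBS Pro submit and status commands).
check_commands() {
    missing=""
    for cmd in "$@"; do
        command -v "$cmd" > /dev/null 2>&1 || missing="$missing $cmd"
    done
    if [ -n "$missing" ]; then
        echo "MISSING:$missing"
        return 1
    fi
    echo "OK: all commands found"
}

# Check non-interactive SSH from this node to each named node.
# BatchMode=yes: fail rather than ask for a password or passphrase.
check_ssh() {
    rc=0
    for node in "$@"; do
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$node" true > /dev/null 2>&1; then
            echo "OK: $node"
        else
            echo "FAIL: $node"
            rc=1
        fi
    done
    return $rc
}
```

For example, run check_commands qsub qstat on the client machine, and check_ssh node01 node02 on a compute node (node01 and node02 are placeholder hostnames).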
If the requirements above are not met, the default TORQUE/PBS Pro configurations are not supported. However, it might still be possible to submit jobs to the cluster. For this setup, see the related solution: http://www.mathworks.com/matlabcentral/answers/91693
4) Check that the client configuration is correct
In your client MATLAB, open the Parallel menu and select Manage Cluster Profiles. Select your TORQUE/PBS Pro configuration and click Edit. Set the appropriate values for ClusterMatlabRoot (the folder where MATLAB is installed on the cluster), JobStorageLocation (where job data is stored; NOTE: this must be accessible by the same path from all computers), and HasSharedFilesystem (should be set to true).
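Because ClusterMatlabRoot and JobStorageLocation must resolve on every machine, one way to catch a wrong value is to test both paths on the client and on each cluster node. This sketch assumes ClusterMatlabRoot contains the bin/matlab launcher used in step 1 and treats "the folder exists and is writable" as the storage check; the function check_profile_paths and its probe file name are hypothetical.

```shell
#!/bin/sh
# Hypothetical check of the paths entered in the cluster profile.
# Run it on the client and on every cluster node with the same arguments.
#   $1: value of ClusterMatlabRoot   $2: value of JobStorageLocation
check_profile_paths() {
    root=$1
    jobdir=$2
    rc=0
    # ASSUMPTION: the MATLAB launcher lives at <ClusterMatlabRoot>/bin/matlab.
    if [ -x "$root/bin/matlab" ]; then
        echo "OK: $root/bin/matlab"
    else
        echo "FAIL: no executable $root/bin/matlab"
        rc=1
    fi
    # The job storage folder must exist at this exact path and be writable.
    probe="$jobdir/.write_test.$$"
    if [ -d "$jobdir" ] && touch "$probe" 2> /dev/null; then
        rm -f "$probe"
        echo "OK: $jobdir is writable"
    else
        echo "FAIL: $jobdir is missing or not writable"
        rc=1
    fi
    return $rc
}
```

For example, check_profile_paths /usr/local/MATLAB/R2019a /shared/matlab_jobs (both hypothetical paths; substitute your profile's values) should print two OK lines on every machine if the profile paths are consistent.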
If you have confirmed all of the settings above, check whether every stage fails during validation or only the parallel job and matlabpool/parpool stages. If the distributed job stage passes, the validation may be reporting false errors. To confirm, you can validate your cluster manually by following the instructions in this article:
If the manual tests pass, your configuration is working and you should be able to submit jobs.
If you are still having an issue, contact Installation support here:
