Why do I get MPI_Abort errors when trying to submit a parallel job?

The core of my job submission code is below:
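% Options passed to multiDCMset1 (i, j, k, l and jj are defined by the enclosing code, not shown)
jopt.email_notif = 0;
jopt.toggleleft = left_list(j);
jopt.toggleCausalDir = dir_list(k);
jopt.toggleChoice = choice(l);
jopt.od_number = od_list(i);
jopt.connectivity = 1;
% Find the scheduler via the 'NeuroEcon.local' configuration and request 20 minutes of walltime
sched = findResource('scheduler', 'configuration', 'NeuroEcon.local');
set(sched, 'SubmitArguments', '-l walltime=0:20:00');
% Create a parallel job restricted to exactly one worker
pjob = createParallelJob(sched);
set(pjob, 'FileDependencies', {'multiDCMset1.m'});
set(pjob, 'MaximumNumberOfWorkers', 1);
set(pjob, 'MinimumNumberOfWorkers', 1);
% One task with one output argument; keep the task handle for later inspection
t = createTask(pjob, @multiDCMset1, 1, {jopt});
t_all{1,jj} = t; jj = jj + 1;
submit(pjob);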
---------------------------------------
Below is the error message I get in the job submission log after the job finishes running. I don't understand the error or what could cause it. The same script runs fine on another person's computer. Do I need specific settings to submit parallel jobs?
------------------
Node file: /opt/torque/aux//2075983.neuroecon.caltech.edu
Starting SMPD on compute-1-30 ...
ssh compute-1-30 "/opt/matlab//bin/mw_smpd" -s -phrase MATLAB -port 25983
All SMPDs launched
"/opt/matlab//bin/mw_mpiexec" -phrase MATLAB -port 25983 -l -n 1 -machinefile /opt/torque/aux//2075983.neuroecon.caltech.edu -genvlist MDCE_DECODE_FUNCTION,MDCE_STORAGE_LOCATION,MDCE_STORAGE_CONSTRUCTOR,MDCE_JOB_LOCATION,MDCE_DEBUG "/opt/matlab/bin/worker" -parallel
[0]which: no shopt in (/opt/matlab/bin:/usr/kerberos/bin:/usr/java/latest/bin:/opt/intel/itac/7.1/bin:/opt/intel/fce/10.1.018/bin:/opt/intel/idbe/10.1.018/bin:/opt/intel/cce/10.1.018/bin:/usr/local/bin:/bin:/usr/bin:/opt/ganglia/bin:/opt/ganglia/sbin:/opt/openmpi/bin/:/opt/maui/bin:/opt/torque/bin:/opt/torque/sbin:/opt/rocks/bin:/opt/rocks/sbin)
[0] < M A T L A B (R) >
[0] Copyright 1984-2009 The MathWorks, Inc.
[0] Version 7.8.0.347 (R2009a) 64-bit (glnxa64)
[0] February 12, 2009
[0]
[0] To get started, type one of these: helpwin, helpdesk, or demo.
[0] For product information, visit www.mathworks.com.
[0]
job aborted:
rank: node: exit code[: error message]
0: compute-1-30: -2: application called MPI_Abort(MPI_COMM_WORLD, 42) - process 0
Stopping SMPD on compute-1-30 ...
ssh compute-1-30 "/opt/matlab//bin/mw_smpd" -shutdown -phrase MATLAB -port 25983
Exiting with code: 42
Edric Ellis on 23 May 2014
Is there any error in the task of the job? Check using:
pjob.Tasks(1).Error
or even
getReport(pjob.Tasks(1).Error)
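If the job has several tasks, or the `Error` property is not available in your release, the check above can be generalized roughly as follows. This is only a sketch: `summarizeTaskErrors` is a hypothetical helper name, and it assumes each task exposes `ErrorIdentifier` and `ErrorMessage` properties that can be read with dot notation.

```matlab
function report = summarizeTaskErrors(tasks)
% summarizeTaskErrors  Hypothetical helper, not part of any toolbox.
% Builds one line of text per task describing its error state.
% Assumes every element of TASKS exposes ErrorIdentifier and ErrorMessage
% (task objects, or plain structs with those fields).
%
% Possible usage once the job has finished:
%   waitForState(pjob, 'finished');
%   report = summarizeTaskErrors(get(pjob, 'Tasks'));
%   fprintf('%s\n', report{:});
report = cell(numel(tasks), 1);
for n = 1:numel(tasks)
    msg = tasks(n).ErrorMessage;
    if isempty(msg)
        % Nothing was recorded for this task
        report{n} = sprintf('Task %d: no error recorded', n);
    else
        % Show the identifier in brackets, followed by the message
        report{n} = sprintf('Task %d: [%s] %s', n, tasks(n).ErrorIdentifier, msg);
    end
end
end
```

An empty report entry for a task does not rule out a problem outside the task function itself, so the scheduler-side log is still worth reading alongside it.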
