Puzzling performance trends in FOR/PARFOR/FOR-DRANGE

I'm seeing some puzzlingly poor performance when trying to accelerate a certain for-loop using Parallel Computing Toolbox functions. The comparison below was done in a 12-worker matlabpool with 24GB RAM available.
The first for-loop is the one that I am trying to accelerate/parallelize.
The second implementation is an attempt using parfor. It is considerably slower than a plain serial for-loop. Note that because of parfor's variable-classification constraints, I need to make a fresh local copy of Xtypical (called "input") on each pass through the loop to get the same effect.
Initially, I thought that this additional copy was responsible for the slow performance, but when I implement the same code using a plain serial for-loop (the 3rd implementation) I see only a marginal slowdown, nowhere near as slow as the parfor implementation.
The 4th implementation uses a for-drange loop inside an spmd block, which let me avoid making the extra copy of Xtypical. I know for-drange loops are not as well optimized as parfor; however, I was hoping that the reduced memory copying and reduced data broadcasting might offset this. Instead, for-drange turned out to be roughly 4x slower than even the parfor version, which I was not prepared for, especially since the Task Manager showed 100% CPU usage.
Can someone help me understand these performance differences? I can't see what's slowing things down.
P=100;
fun=@(X)imrotate(X,30,'crop');
Xtypical=zeros(P);
M=P^2;
N=P^2;
%% FOR implementation
% Elapsed time is 7.022314 seconds.
A=zeros(M,N);
tic;
for j=1:N
    Xtypical(j)=1;
    T=fun(Xtypical); %#ok<*PFBNS>
    A(:,j)=T(:);
    Xtypical(j)=0;
end
toc;
%% PARFOR implementation
% Elapsed time is 30.397402 seconds.
A=zeros(M,N);
tic;
parfor j=1:N
    input=Xtypical;
    input(j)=1;
    T=fun(input); %#ok<*PFBNS>
    A(:,j)=T(:);
end
toc;
%% FOR implementation 2
% Elapsed time is 7.205893 seconds.
A=zeros(M,N);
tic;
for j=1:N
    input=Xtypical;
    input(j)=1;
    T=fun(input); %#ok<*PFBNS>
    A(:,j)=T(:);
end
toc;
%% SPMD implementation
% Elapsed time is 110.855077 seconds.
A=distributed.zeros(M,N);
tic;
spmd
    for j=drange(1:N)
        Xtypical(j)=1; %#ok<*SPRIX>
        T=fun(Xtypical);
        A(:,j)=T(:);
        Xtypical(j)=0;
    end
end
toc;
  2 Comments
Edric Ellis on 11 Dec 2013
How many "real" cores does your machine have - i.e. what's the default size for a local pool? You may be seeing poor performance of PARFOR if you oversubscribe the real cores.
What OS and version of MATLAB are you using?
On my 6-core GLNXA64 machine, I see reasonable speedup from PARFOR. (Even oversubscribing to 12 workers, I still get some speedup.)
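In case oversubscription is the issue, here is one quick way to match the pool size to the physical core count (a sketch only: feature('numcores') is undocumented but widely used, and matlabpool is the R2012b-era syntax, replaced by parpool in later releases):
% Close any existing pool, then open one worker per physical core.
if matlabpool('size') > 0
    matlabpool('close');
end
nCores = feature('numcores');      % physical cores, not hyperthreads
matlabpool('open', 'local', nCores);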
Unfortunately, your SPMD version is being hit by the relatively poor performance of indexing into a codistributed array. I would not recommend using codistributed for stuff like this.
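For reference, one pattern that avoids indexing into the codistributed array inside the loop is to compute each lab's columns into an ordinary local matrix and assemble the distributed result once at the end. A sketch of that idea (untested here; codistributor1d, globalIndices, and codistributed.build are Parallel Computing Toolbox functions):
spmd
    % Describe a column-wise distribution of the M-by-N result.
    codistr = codistributor1d(2, codistributor1d.unsetPartition, [M N]);
    myCols  = globalIndices(codistr, 2);   % columns owned by this lab
    localA  = zeros(M, numel(myCols));     % plain local storage
    Xlocal  = Xtypical;                    % ordinary local copy
    for k = 1:numel(myCols)
        j = myCols(k);
        Xlocal(j) = 1;
        T = fun(Xlocal);
        localA(:,k) = T(:);
        Xlocal(j) = 0;
    end
    % Assemble the codistributed result in a single call.
    A = codistributed.build(localA, codistr);
end
This keeps all per-iteration indexing on plain local arrays; the only codistributed operation is the final build.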
Matt J on 11 Dec 2013
Edited: Matt J on 11 Dec 2013
Hi Edric,
This is on a dual hexacore machine. Default matlabpool size is 12, which I am using for these experiments. Windows 7, R2012b.
"Unfortunately, your SPMD version is being hit by the relatively poor performance of indexing into a codistributed array. I would not recommend using codistributed for stuff like this."
That seems like a terribly troublesome limitation of codistributed data, Edric. There doesn't seem to be any other way in the PCT for different loop iterations (processed by the same lab) to share data. Is there an alternative data-sharing strategy that I'm not seeing?

Answers (0)