SPMD vs PARFOR and Memory Usage

Cem on 30 Sep 2014
Commented: Cem on 1 Oct 2014
Dear all,
I am trying to migrate some of my parfor code to spmd so that I can use codistributed arrays and save a considerable amount of memory (parfor copies the huge matrix var2 to every worker). My hope was to distribute var2 across the workers instead. I am running a toy example on our Linux servers using the following function:
function accum_results=try_memory_smpd()
    tic
    var1=repmat(linspace(1,100,100),100,1);
    var2=2*linspace(0.001,0.02,600000);
    accum_results_temp=zeros(100,600000);
    upper_bound=100;
    lower_bound=1;
    spmd
        D = codistributed(var1,codistributor('1d', 2));
        temp=getLocalPart(D);
        globalInd=globalIndices(D, 2);
        local_lower_bound = find(globalInd == lower_bound, 1);
        if ~isempty(local_lower_bound)
            fprintf('The lower bound found in Lab %d, indice %d\n',labindex,local_lower_bound);
        end
        if isempty(local_lower_bound) && min(globalInd)>lower_bound
            local_lower_bound=1;
        end
        local_upper_bound = find(globalInd == upper_bound, 1);
        if ~isempty(local_upper_bound)
            fprintf('The upper bound found in Lab %d, indice %d\n',labindex,local_upper_bound);
        end
        if isempty(local_upper_bound) && max(globalInd)<upper_bound
            local_upper_bound=size(temp,2);
        end
        if ~(isempty(local_upper_bound) || isempty(local_lower_bound))
            for j = local_lower_bound:local_upper_bound
                accum_results_temp = accum_results_temp+bsxfun(@times,var2,temp(:,j));
            end
            fprintf('Lab %d works between indice %d and %d \n',labindex,local_lower_bound,local_upper_bound);
        else
            fprintf('No work for Lab %d!!\n',labindex);
        end
        D=[];
        temp=[];
        var2=[];
    end
    accum_results=zeros(100,600000);
    for cell_ind=1:length(accum_results_temp)
        accum_results=accum_results+accum_results_temp{cell_ind};
    end
    toc
end
Note that the sizes are fairly large, so you may need to reduce the matrix sizes. When I profile the code, the bottleneck appears to be the final for loop, which adds up all the entries of the Composite object returned by the spmd block (each entry is a full 100x600000 matrix). The PARFOR implementation of the same code finishes in about half the time. The additional logic in the spmd code (the if-checks etc.) has no visible impact on performance. Using something like cell2mat would defeat the purpose of the spmd implementation, since it would create a copy of the data already stored on the workers.
I'd be very grateful for any idea or inspiration that lets me keep the code parallel without excessive memory usage. Thanks in advance.
Cem
P.S. Here's the PARFOR implementation:
function accum_results=try_memory()
    tic
    var1=repmat(linspace(1,100,100),100,1);
    var2=2*linspace(0.001,0.02,600000);
    accum_results=zeros(100,600000);
    parfor i=1:100
        accum_results=accum_results+bsxfun(@times,var2,var1(:,i));
    end
    toc
end
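For checking either implementation, a serial reference at a reduced size can help. The sketch below is not from the original post: it assumes M is shrunk from 600000 to 6000 so it runs quickly, and it relies on the observation that every column of var1 holds a single constant value i, so each row of the result should equal (1 + 2 + ... + N) * var2. The names accum_ref, expected, and max_rel_err are introduced here for illustration only.

```matlab
% Reduced-size serial reference (M shrunk from 600000; an assumption for speed).
N = 100;
M = 6000;
var1 = repmat(linspace(1, N, N), N, 1);   % every row is 1..N
var2 = 2 * linspace(0.001, 0.02, M);      % 1-by-M row vector

accum_ref = zeros(N, M);
for i = 1:N
    % var1(:,i) is an N-by-1 column filled with the value i
    accum_ref = accum_ref + bsxfun(@times, var2, var1(:, i));
end

% Each row should equal (1 + 2 + ... + N) * var2, up to rounding.
expected = repmat((N * (N + 1) / 2) * var2, N, 1);
max_rel_err = max(abs(accum_ref(:) - expected(:)) ./ abs(expected(:)));
fprintf('max relative error: %g\n', max_rel_err);
```

Comparing the output of a parallel version against accum_ref (within a small tolerance) is one way to confirm that a rewrite has not changed the result.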

Accepted Answer

Edric Ellis on 30 Sep 2014
Here's a version that I've reworked quite a bit to use codistributed arrays hopefully a little more effectively (it runs about twice as quickly on my machine here).
tic
N = 100;
M = 600000;
spmd
    % Build 'var1' on the workers directly to avoid communication
    var1=linspace(1,N,N);
    % Build 'var2' directly in codistributed form to save memory
    var2=2 * codistributed.linspace(0.001,0.02,M);
    %
    % We will operate on the local part of var2
    var2_lp = getLocalPart(var2);
    var2_cod = getCodistributor(var2);
    %
    % Build accum_results as a codistributed array directly, and ensure
    % it uses the same codistributor as var2 so that we can operate
    % on the local parts of the arrays together.
    accum_results = codistributed.zeros(N, M, var2_cod);
    %
    % Get the local part out of accum_results so that we can operate on it directly.
    ar_local = getLocalPart(accum_results);
    ar_codist = getCodistributor(accum_results);
    %
    % Loop 1:N applying the BSXFUN to the relevant local parts
    for idx = 1:N
        ar_local = ar_local + bsxfun(@times, var2_lp, repmat(var1(idx), N, 1));
    end
    % Put accum_results back together
    accum_results = codistributed.build(ar_local, ar_codist);
end
% Gather the results back to the host
accum_results=gather(accum_results);
toc
  3 Comments
Edric Ellis on 1 Oct 2014
Actually, the different labs are working on different slices of var2 and putting their results into their own slice of accum_results (via the local part ar_local). codistributed.build then takes the local part slices and reconstitutes a whole array.
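The slicing described here can be seen on a tiny example. This is a minimal sketch rather than part of the original answer: it assumes Parallel Computing Toolbox with an open pool, uses deliberately small sizes (N = 4, M = 10), and follows the same pattern of calls as the answer's code. The names v, cols, A, A_lp, and A_full are illustrative.

```matlab
% Small-size sketch; assumes Parallel Computing Toolbox and an open pool.
N = 4;
M = 10;
spmd
    % Row vector distributed along its columns
    v = 2 * codistributed.linspace(0.001, 0.02, M);
    codist = getCodistributor(v);
    cols = globalIndices(v, 2);   % global column indices held by this worker
    v_lp = getLocalPart(v);       % this worker's slice of v

    % Codistributed result sharing v's distribution scheme
    A = codistributed.zeros(N, M, codist);
    A_lp = getLocalPart(A);       % N rows, one column per entry of cols
    A_codist = getCodistributor(A);

    % Each worker fills only its own columns
    A_lp = A_lp + bsxfun(@times, v_lp, ones(N, 1));

    % Reassemble the slices into one codistributed array
    A = codistributed.build(A_lp, A_codist);
    fprintf('Worker %d holds columns %s\n', labindex, mat2str(cols));
end
A_full = gather(A);               % full N-by-M array on the client
```

Each worker reports a different set of column indices, and no worker ever allocates more than its own N-by-numel(cols) block.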
Cem on 1 Oct 2014
Yes, that makes so much sense: this way we don't need to create the huge accum_results array at full size on each of the workers! Thanks once more.
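To put rough numbers on that saving, the arithmetic can be sketched as follows. The worker count W = 4 is a hypothetical value chosen for illustration; the calculation assumes double precision (8 bytes per element) and a near-even split of columns, so the largest slice has ceil(M / W) columns.

```matlab
N = 100;
M = 600000;
W = 4;                      % hypothetical number of workers
bytes_per_double = 8;

% Full-size accumulator, as held by every worker in the original spmd version
full_MB = N * M * bytes_per_double / 1e6;

% Largest per-worker slice when the columns are split evenly across W workers
slice_MB = N * ceil(M / W) * bytes_per_double / 1e6;

fprintf('full array: %g MB, largest slice: %g MB\n', full_MB, slice_MB);
```

With these values the full array is 480 MB and each slice is 120 MB, so the per-worker footprint of the accumulator shrinks roughly in proportion to the number of workers.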
