Efficiency of for vs parfor loop for iterating a moderately large matrix operation

Hi.
Running a program at the moment that takes several hours to complete. Roughly half of that time is spent on one line of code (repeated hundreds of thousands of times), which does multiplication, comparison, and addition of a ~300x300 matrix (not sparse) and some ~1x300 vectors.
I wanted to look into parallel-processing to speed up the program a bit. Technically the program is supposed to run linearly, where each iteration relies on output from the last, but I can do the for loop iterations in batches with the same initial values. However that's more of a follow-up question.
The question I have pertains to the results I'm getting with using a for loop, vs a parfor loop for essentially just executing this one line of code several thousand times.
The test code below runs two for-loops and records the time to execute each. The second for-loop runs the mathematical process in one line. The first for-loop runs the same algorithm, but separates it into four smaller operations and then combines them to complete the calculation. Each for-loop runs 10,000 times. This is repeated 20 times, and the mean time for both for-loops is recorded. Then the program is run with parfor replacing the for-loops, with the worker limit set to 1, 2, 3, 4, and 6.
The code, and results are below:
%%%%parfor test
%%%%figure out how the hell to use parfor
L_rate = .01;
cd = 1;
Vl0 = rand(1,256);
Hl0 = rand(1,301);
Vlcd = rand(1,257);
Hlcd = rand(1,301);
dwnet = zeros(257,301);
%%%calculation to perform:
%%%dwnet = L_rate/sqrt(cd)*([Vl0,1]'*(rand(size(Hl0))<Hl0) - Vlcd'*Hlcd);
for k = 1:20
    tic
    parfor (n=1:10000,1)
        dwnet = zeros(257,301);
        v1 = L_rate/sqrt(cd);
        v2 = [Vl0,1]';
        v3 = rand(size(Hl0))<Hl0;
        v4 = Vlcd'*Hlcd;
        dwnet = v1*(v2*v3-v4);
    end
    time1(k,1) = toc;
    tic
    parfor (n=1:10000,1)
        dwnet = zeros(257,301);
        dwnet = L_rate/sqrt(cd)*([Vl0,1]'*(rand(size(Hl0))<Hl0) - Vlcd'*Hlcd);
    end
    time2(k,1) = toc;
end
plot([time1,time2])
mean([time1,time2])
%%%%Time results
%%%             forloop_1   forloop_2
%%%for          14.89 sec   13.89 sec
%%%parfor,1      9.95 sec   10.98 sec
%%%parfor,2      9.74 sec   10.60 sec
%%%parfor,3      9.91 sec   10.90 sec
%%%parfor,4      9.92 sec   10.95 sec
%%%parfor,6      9.93 sec   10.93 sec
%%%Run on an Intel Core i7-Q720 (first-gen i-series; 4 cores, 8 logical threads)
So, as I expected, with a regular for-loop it is in fact faster to do everything in one line. This makes sense, since I'm adding time by writing v1, v2, v3, and v4 on top of ultimately doing the same calculation for dwnet.
But what fails to make sense to me is that when I use parfor, suddenly that becomes MORE efficient, by roughly 10%. Also, increasing the core-limit while using parfor actually seems to slightly slow the algorithm down - though that could just be a result of my poor understanding of parfor.
Question: Why is the first for-loop more efficient than the 2nd when using parfor? Feel free to expound upon parfor as much as you care to.
Follow-up Question: Say that Vl0 and Hl0 rely on dwnet, and should be updated with every iteration, and Vlcd changes with each iteration. But I could leave Vl0 and Hl0 and dwnet at the same values for several iterations while changing Vlcd (say, 10-100 iterations per batch). Any advice on how to do that with parfor? I need to get the cumulative sum of dwnets from each iteration to update Vl0 and Hl0, so I'd have to add up the dwnets for each worker, and then add them all up between the workers. But I'm fuzzy on how that's handled, and how parfor handles variables.
Any help answering this question is greatly appreciated.

Accepted Answer

Matt J
Matt J on 12 Jun 2014
Edited: Matt J on 12 Jun 2014
The statements
dwnet = zeros(257,301);
are unnecessary overhead and cloud the issue. I re-implemented your code as below, removing all three occurrences of that statement and also handling v1 more efficiently. On my machine, the results with a plain for-loop are not significantly different, which is what I expected.
When running with parfor, I find that loop 1 is indeed about 25% faster than loop 2. I assume it's because parfor pre-parses the contents of the loop, analyzing which variables are temporary and which have other roles, see Classification of Variables. Because of this, I imagine parfor can pre-allocate temporary variables or optimize their use in some other obscure way.
v1 = L_rate/sqrt(cd);
Hlcd = v1*Hlcd;
for k = 1:5
    tic
    parfor (n=1:10000,2)
        v2 = v1*[Vl0,1].';
        v3 = rand(size(Hl0))<Hl0;
        v4 = Vlcd'*Hlcd;
        dwnet = v2*v3-v4;
    end
    time1(k,1) = toc;
    tic
    parfor (n=1:10000,2)
        dwnet = (v1*[Vl0,1]')*(rand(size(Hl0))<Hl0) - Vlcd'*Hlcd;
    end
    time2(k,1) = toc;
end
Follow-up Question: Say that Vl0 and Hl0 rely on dwnet, and should be updated with every iteration, and Vlcd changes with each iteration. But I could leave Vl0 and Hl0 and dwnet at the same values for several iterations while changing Vlcd (say, 10-100 iterations per batch).
You would probably have to use a double for-loop
for i=1:N
parfor j=1:M
end
end
where the inner parfor loop loops over the batches over which Vl0 and Hl0 remain constant. It might not be worthwhile, though. Batches that small could probably be vectorized, circumventing a for-loop altogether; at least, your simplified example certainly can be.
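For instance, the simplified dwnet update can be summed over a whole batch without any loop, because a sum of outer products collapses into a single matrix product. A hedged sketch only (the batch size M and the stand-in Vlcds matrix are my assumptions, not your real data):

```matlab
% Sketch only: vectorize one batch of M iterations in which Vl0 and Hl0
% stay fixed while Vlcd and the random draws change each iteration.
M     = 50;                                   % assumed batch size
V2    = [Vl0,1]';                             % 257x1, constant over the batch
R     = bsxfun(@lt, rand(M,numel(Hl0)), Hl0); % Mx301 binary draws, one row each
Vlcds = rand(M, numel(Vlcd));                 % stand-in: one Vlcd per row
% sum_n [Vl0,1]'*r_n = V2*sum(R,1)  and  sum_n vlcd_n'*Hlcd = sum(Vlcds,1)'*Hlcd
dwnet_batch = v1*(V2*sum(R,1) - sum(Vlcds,1)'*Hlcd);
```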
You also have to factor in communication to/from the labs. A normal parfor loop would have to broadcast its data to/from the client every time it begins and ends, so in your case at the beginning and end of every i-th iteration of the outer for-loop. It may help, however, to use Edric Ellis' Worker Object Wrapper, which can make data persist between parfor-loops.
  2 Comments
Brian Moore
Brian Moore on 12 Jun 2014
Hi. Thanks for responding so quickly.
I left the dwnet=zeros() terms in there to make sure the program didn't catch on that the elements in dwnet didn't need changing. Though I just tested the run-through with multiplying v1 by "n", so that dwnet is different each iteration, with essentially the same result. So you're right, that's unnecessary.
The link you provided is a great break-down. I agree that's probably where the efficiency is gained - letting it do some of the calculations assigned to temporary variables.
A side note: you actually can't implement v1 more efficiently. For this equation, it's dwnet = v1*(v2*v3-v4), and you replaced it with dwnet = (v1*v2*v3-v4). v4 needs to be scaled by v1. But that's not much of an issue.
Follow-up: Yeah, the batches would definitely be a nested for-loop with 10-100 iterations each. It isn't obvious, because I only included the test code for optimizing dwnet, but the algorithm for generating Vl0 and Hl0, which runs within the same for-loop, already involves two matrices being multiplied - so it's already as vectorized as it can get.
That link you provided mentions a Reduction type variable within parfor, which is probably what I'm looking for. It would accumulate the sum of the dwnets generated each iteration, in no particular order. Which should probably work fine. I was worried I have to make something like: dwnet1, dwnet2, dwnet3... for each worker somehow, and then dwnet=dwnet+dwnet1+dwnet2... after each batch. Thankfully it seems like matlab handles that on its own.
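As a hedged sketch of what such a reduction variable would look like here (loop bound arbitrary; parfor does the per-worker partial sums and the final combine itself):

```matlab
% Sketch only: parfor recognizes the "X = X + expr" pattern as a reduction,
% so the per-worker partial sums of dwnet are combined automatically --
% no manual dwnet1, dwnet2, ... bookkeeping is needed.
dwnet = zeros(257,301);
parfor n = 1:100
    dwnet = dwnet + L_rate/sqrt(cd)*([Vl0,1]'*(rand(size(Hl0))<Hl0) - Vlcd'*Hlcd);
end
```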
Last thing - any reason changing the number of max cores (parfor,2 vs parfor,4) didn't change the processing time significantly?
I'll run a few tests, but I think I've got a good idea on this now. Thanks for your input.
Matt J
Matt J on 13 Jun 2014
Edited: Matt J on 13 Jun 2014
A side note: you actually can't implement v1 more efficiently... v4 needs to be scaled by v1.
Notice that I pre-scaled Hlcd prior to the loop
Hlcd=v1*Hlcd;
This translates into the appropriate scaling on v4. Even if Hlcd were changing inside the loop, it makes sense to pre-scale it there. To put it another way, you have a difference of two vector outer products, a*b' - c*d'. Scaling as
v1*(a*b'- c*d')
will require O(N^2) multiplies with v1 whereas scaling as
(v1*a)*b' - (v1*c)*d'
will require only O(N).
That link you provided mentions a Reduction type variable within parfor, which is probably what I'm looking for. It would accumulate the sum of the dwnets generated each iteration
I'm very doubtful about whether this is the best approach, though my advice is of course clouded by not seeing your full true computation. In general, though, it is rather inefficient to be adding up outer products, or differences of outer products. Imagine for example that you had matrices V2 and V3 with respective dimensions 257x10000 and 301x10000 whose columns are the individual vectors v2 and v3 that you are generating within the parfor loop. Then the summation of all the v2*v3.' terms is equivalent to the single matrix multiplication V2*V3.', which MATLAB can do very fast, is already internally multi-threaded for you, and doesn't involve allocating memory for a 257x301 matrix 10000 times.
So, a better alternative for you might be to generate the columns V2(:,n) and V3(:,n) with parfor (see "Sliced Variables" in the link I gave you) and then post-multiply after the parfor loop. You would do the same thing with the summation of the outer products v4 and subtract the two results.
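A hedged sketch of that idea, using the simplified test code, and assuming Hlcd has been pre-scaled by v1 as above so the v4 term needs no extra scaling:

```matlab
% Sketch only: fill sliced columns inside parfor, then replace the sum of
% 10000 outer products with one multi-threaded matrix multiplication.
N  = 10000;
V2 = zeros(257,N);                    % sliced output variables
V3 = zeros(301,N);
parfor n = 1:N
    V2(:,n) = v1*[Vl0,1].';
    V3(:,n) = (rand(size(Hl0)) < Hl0).';
end
% Vlcd is constant in the test code, so the v4 sum is just N copies of it
dwnet_sum = V2*V3.' - N*(Vlcd'*Hlcd);
```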
Last thing - any reason changing the number of max cores (parfor,2 vs parfor,4) didn't change the processing time significantly?
I haven't had a chance to test on a machine with more than 2 cores. I wonder if that would still be true with the operations
dwnet = zeros(257,301)
removed. Basically, to get the best advantage out of parallel cores, you want to be doing a lot more computation than memory transfer. Concatenation operations like [Vl0,1]' might be expensive: they force MATLAB to allocate memory of a priori unknown size and probably involve some RAM traffic. Since Vl0 is constant throughout the loop, you should probably do that concatenation prior to the parfor. I speculate that function calls like zeros() and rand() might also slow things down. I wonder if it wouldn't be better to generate a 301x10000 random array prior to the parfor-loop and pass it in as a sliced variable.
I'll run a few tests, but I think I've got a good idea on this now. Thanks for your input.
OK. Make sure you're aware of what the Accept Answer button is meant for, and use it if appropriate.
