Vectorizing nonlinear matrix operation on many small matrices

Question

1 vote

I am trying to optimize the following generic matrix operation:

m = 3; % small number in general
n = 2^20; % large power of 2 in general
A = rand(m,n);
B = zeros(m^2,m^2);
for ii = 1:size(A,2)
    a = A(:,ii);
    r = a*a';
    B = B + kron(r,r);
end
% return B

On my computer the above takes ~7s. By compiling to a MEX file with MATLAB Coder I can improve this by ~15x. I have tried compiling to CUDA with GPU Coder, but this seems to be quite inefficient.

I think the difficulty comes from two different sources:

1) I am not sure of an efficient way to vectorize the creation of the "r" matrices from the columns of the A matrix, so I have to resort to the outer for loop.

2) I think the Kronecker product is inefficient to implement on the GPU because the matrices are so small.

The speedup from compiling to MEX is nice, but I have the feeling that I am still doing something quite inefficiently. I would appreciate it if anyone has ideas on how to optimize the above calculation, either along the lines of the two difficulties outlined above or via a different approach.

2 Comments

David Goodmanson on 19 Dec 2020

Hi Adam,

if you replace

B = B + kron(r,r);

with

r = r(:);
BB = BB + r*r';

the loop runs about 5 times faster (here BB plays the role of B and should be initialized the same way, as zeros(m^2,m^2)). The substitution itself is faster than that, but the unchanged steps in the loop still have to be included in the timing.
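The substitution works because r = a*a' has rank one, so r(:) equals kron(a,a) and r(:)*r(:)' equals kron(r,r); it would not hold for an arbitrary matrix r. A minimal sketch to check this numerically on one random column (the names K, V and maxDiff are only for illustration, and the two sides agree up to floating-point rounding, not bit for bit):

```matlab
% Check, for one random column a, that kron(r,r) matches r(:)*r(:)' when r = a*a'.
m = 3;
a = rand(m,1);
r = a*a';            % rank-1 outer product, as in the loop
K = kron(r,r);       % original update term
v = r(:);            % vec(r), which equals kron(a,a) for rank-1 r
V = v*v';            % replacement update term
maxDiff = max(abs(K(:) - V(:)));   % expected to be at rounding level
fprintf('max abs difference: %g\n', maxDiff);
```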

Matt J on 19 Dec 2020

@Adam,

It may be important to know what you plan to do with B, once you've computed it.

Answer 1

Matt J on 19 Dec 2020

Edited: Matt J on 19 Dec 2020

2 votes

m = 3; % small number in general
n = 2^20; % large power of 2 in general
A = rand(m,n);
tic;
    B = zeros(m^2,m^2);
    for ii = 1:size(A,2)
        a = A(:,ii);
        r = a*a';
        B = B + kron(r,r);
    end
toc;
Elapsed time is 6.800329 seconds.
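tic;
    % Stack the outer products a*a' for every column a of A (m-by-m-by-n)
    C=reshape(A,m,1,n).*reshape(A,1,m,n);
    % Flatten each m-by-m slice into one column of length m^2
    C=reshape(C,m^2,n);
    % Sum of kron(r,r) over all columns, as a single matrix product
    B=C*C.';
toc;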
Elapsed time is 0.081757 seconds.
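The reason the second version gives the same B: for each column a, kron(a*a',a*a') equals v*v' with v = vec(a*a'), so summing over all columns is the same as C*C.' where the columns of C are those vectors v. A small sketch to confirm this against the loop, assuming a smaller n than in the thread so the loop finishes quickly (Bloop, Bvec and relErr are illustrative names; the elementwise product relies on implicit expansion, available from R2016b):

```matlab
m = 3;
n = 2^12;                 % smaller than in the thread, to keep the loop quick
A = rand(m,n);

% Reference: the loop from the question.
Bloop = zeros(m^2,m^2);
for ii = 1:n
    a = A(:,ii);
    r = a*a';
    Bloop = Bloop + kron(r,r);
end

% Vectorized: column ii of C is vec(a*a') for a = A(:,ii).
C = reshape(A,m,1,n).*reshape(A,1,m,n);   % m-by-m-by-n stack of outer products
C = reshape(C,m^2,n);                     % one flattened outer product per column
Bvec = C*C.';

% The two results should agree up to floating-point rounding.
relErr = norm(Bloop - Bvec,'fro')/norm(Bloop,'fro');
fprintf('relative difference: %g\n', relErr);
```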

7 Comments

Adam Shaw on 19 Dec 2020

Edited: Adam Shaw on 19 Dec 2020

I normally do

tic; CODE; wait(gpu); toc;

when evaluating GPU speed, which I believe is equivalent to using gputimeit? I am still seeing about the same speedups with your gputimeit code above (on a 1660 Ti). More specifically, tcpu is ~40 ms and tgpu is ~5 ms, which is more than 1000x faster than the ~6 s the original method took.
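For reference, a comparison of this kind could be set up roughly as follows. This is only a sketch and not necessarily the code from the hidden comments: the handles makeC, sq and computeB are hypothetical names, the GPU branch assumes Parallel Computing Toolbox and a supported device, and canUseGPU needs R2019b or later. Both timeit and gputimeit take a zero-argument function handle, and gputimeit waits for the GPU work to finish before it records the time.

```matlab
m = 3;
n = 2^20;
A = rand(m,n);

% Vectorized computation wrapped in function handles (names are hypothetical).
makeC    = @(X) reshape(reshape(X,m,1,[]).*reshape(X,1,m,[]), m^2, []);
sq       = @(C) C*C.';
computeB = @(X) sq(makeC(X));

% CPU timing.
tcpu = timeit(@() computeB(A));
fprintf('tcpu = %.1f ms\n', 1e3*tcpu);

% GPU timing, only if a usable GPU is present.
if canUseGPU()
    Ag = gpuArray(A);                      % copy the input to the GPU once
    tgpu = gputimeit(@() computeB(Ag));
    fprintf('tgpu = %.1f ms\n', 1e3*tgpu);
end
```

A single tic/toc with an explicit wait on the device measures one run, so it may fluctuate more than these repeated measurements.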

Matt J on 19 Dec 2020

Ah well, you have a really good GPU...!
