Vectorizing nonlinear matrix operation on many small matrices

I am trying to optimize the following generic matrix operation:
m = 3; % small number in general
n = 2^20; % large power of 2 in general
A = rand(m,n);
B = zeros(m^2,m^2);
for ii = 1:size(A,2)
a = A(:,ii);
r = a*a';
B = B + kron(r,r);
end
% return B
On my computer the above takes ~7s. By compiling to a MEX file with MATLAB Coder I can improve this by ~15x. I have tried compiling to CUDA with GPU Coder, but this seems to be quite inefficient.
I think the difficulty comes from two different sources:
1) I am not sure of an efficient way to vectorize the creation of the "r" matrices from the columns of the A matrix, and so have to resort to the outer for loop approach
2) I think the Kronecker product is inefficient to implement on the gpu due to the small matrix size
The speedup from compiling to MEX is nice, but I just have this feeling that I am still doing something quite inefficiently. I would appreciate if anyone has any ideas on how to optimize the above calculation, either along the lines of the two difficulties I outlined above, or via a different approach.

2 Comments

Hi Adam,
if you replace
B = B + kron(r,r);
with
r = r(:);
BB = BB + r*r';
the loop runs about 5 times faster. (The actual substitution runs faster than that, but the nonchanged steps in the loop still of course have to be included).
@Adam,
It may be important to know what you plan to do with B, once you've computed it.

Sign in to comment.

 Accepted Answer

m = 3; % small number in general
n = 2^20; % large power of 2 in general
A = rand(m,n);
tic;
B = zeros(m^2,m^2);
for ii = 1:size(A,2)
a = A(:,ii);
r = a*a';
B = B + kron(r,r);
end
toc;
Elapsed time is 6.800329 seconds.
tic;
C=reshape(A,m,1,n).*reshape(A,1,m,n);
C=reshape(C,m^2,n);
B=C*C.';
toc;
Elapsed time is 0.081757 seconds.

7 Comments

HI Matt,
what a remarkable solution. The most elegant I have seen on this site in 2020.
Very kind of you, David. I still wonder if the OP really needs B. Outer-products are generally inefficient things in most scenarios.
Matt, this really is such an elegant solution. I see an ~85x improvement, same as you, and a ~1000x improvement when putting it on the GPU. So thanks very much. As for the ultimate goal, I need the eigenvalues of this B matrix - if theres a quick fix to find them without actually constructing the B matrix I'm all ears!
I see an ~85x improvement, same as you, and a ~1000x improvement when putting it on the GPU.
Unfortunately, I doubt you're really seeing a 12x improvement when going from CPU to the GPU. Most likely, you have applied tic...toc without synchronizing the GPU, which will give misleading timing results. On the GTX 1080 Ti, I see only about a 3.5x improvement.
m = 3; % small number in general
n = 2^20; % large power of 2 in general
A = rand(m,n);
tcpu=timeit(@() runit(A,m,n));
A=gpuArray(A);
tgpu=gputimeit(@() runit(A,m,n));
speedup = tcpu/tgpu
function runit(A,m,n)
C=reshape(A,m,1,n).*reshape(A,1,m,n);
C=reshape(C,m^2,n);
B=C*C.';
end
I normally do
tic; CODE; wait(gpu); toc;
when evaluating gpu speed, which I believe is equivalent to doing gputimeit? I still am seeing about the same speedups using your gputimeit code above (using a 1660ti). I guess more specifically tcpu is ~40 ms, and tgpu is ~5 ms. Which is ~>1000x faster than the ~6 s the original method took.
Ah well, you have a really good GPU...!

Sign in to comment.

More Answers (0)

Products

Release

R2019b

Asked:

on 18 Dec 2020

Commented:

on 19 Dec 2020

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!