Speedup of mldivide by parallelisation

Dear All,
in my applications we repeatedly use mldivide to solve linear systems, so speeding the solver up would be very important. Since it runs on only 2 CPUs (on Linux, top shows less than 200% usage), I guess parallelisation would be the right technique to achieve an improvement. I found a solution online that gave a speedup of more than a factor of 2, which would be a tremendous improvement. Unfortunately it does not work in my case. What do I need to change? Here is the example:
%%
clear all; clc; close all;
%%
delete(gcp('nocreate'));
parpool();
A = distributed.rand(442195,442195);
b = sum(A, 2);
tic
x_distributed = A\b;
x = gather(x_distributed);
toc
The only difference from the original example is the size of the variable A, which there was:
A = distributed.rand(20000,2000);
What do I need to change to get this effective again? Another solver?
Thanks Joerg
  1 Comment
Catalytic
Catalytic on 15 Mar 2022
A matrix of the size you've shown would consume 182GB. Do you actually have a system that can hold such a large matrix?


Accepted Answer

Matt J
Matt J on 14 Mar 2022
Edited: Matt J on 14 Mar 2022
mldivide is already highly optimized internally, without the need for an explicit parpool. It is not clear how distributing the columns of A across parpool workers would ever be superior. Perhaps distributing the columns of b (assuming it had multiple columns) instead of A could help, but even then I would be surprised if it improved on mldivide's default multithreading. My recommendation would be to make A and b into gpuArrays if you want some speed-up, assuming of course that you have a good graphics card.
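A minimal sketch of the gpuArray approach, assuming Parallel Computing Toolbox and a CUDA-capable NVIDIA GPU are available (the matrix size n here is a hypothetical small value for illustration; a dense 442195-by-442195 double matrix would not fit in any GPU's memory):

```matlab
% Solve A*x = b on the GPU; mldivide dispatches to GPU-accelerated
% routines automatically when its operands are gpuArrays.
n = 10000;
A = rand(n, n, 'gpuArray');   % create the matrix directly on the GPU
b = sum(A, 2);                % right-hand side, also a gpuArray
tic
x_gpu = A \ b;                % mldivide runs on the GPU
wait(gpuDevice);              % ensure the timing includes all GPU work
toc
x = gather(x_gpu);            % copy the result back to host memory
```

Note that GPU operations are asynchronous, which is why the wait(gpuDevice) call is needed for a meaningful tic/toc measurement.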
  24 Comments
Joss Knight
Joss Knight on 20 Mar 2022
The answer depends on how interested you are in Numerical Methods and Analysis.
GPU parallelism is data parallelism. Simplistically, I want to put all cores to work factorizing the matrix to get the best speedup. CPU parallelism is more flexible. Most CPUs have a vector processing unit on which some vector operations (e.g. 128-bit or 256-bit) can be performed, but the physical cores are mostly independent and can happily be doing different things on small amounts of data.
I've never written this algorithm myself, and maybe an academic can step in and expand for me. But I know on paper how the algorithm works, and it seems likely that as I'm computing the factors I can be immediately using the results to produce partial solutions to the equation, which I can update continuously as I compute the factors. On the CPU this could be done efficiently by giving different cores different chunks of the problem, whereas on a GPU this would no doubt be inefficient with memory and resources. The GPU is much better utilized when the fewest kernels are running on the most data, and very badly utilized if you need those kernels to communicate as we would in this case.
By "sequentially" I am, as you surmise, referring to the dependency between the different steps of the algorithm. This is kind of orthogonal to the above discussion. I'm just trying to say, in simple terms, that where results have to be computed in sequence, a CPU can still be well utilized while a GPU will generally struggle because the raw FLOPs of a single core are not good.
I hope that helps...but perhaps it doesn't really matter. If you really want to know how these algorithms are computed I could recommend some books if you like.
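The factor-then-solve structure described above can be made explicit in plain MATLAB. This is a standard textbook sketch using lu and triangular solves, not MathWorks' internal implementation; for a square dense system, A\b performs essentially these steps in one call:

```matlab
% Solve A*x = b via an explicit LU factorization with partial pivoting,
% the same mathematical structure mldivide uses for square dense systems.
n = 2000;
A = rand(n) + n*eye(n);       % diagonally dominant, hence well conditioned
b = sum(A, 2);                % constructed so the exact solution is all ones
[L, U, P] = lu(A);            % factorization step: P*A = L*U
y = L \ (P*b);                % forward substitution with the lower factor
x = U \ y;                    % back substitution with the upper factor
err = norm(x - ones(n, 1), inf)   % should be tiny (rounding error only)
```

The factorization dominates the cost (O(n^3) versus O(n^2) for the two triangular solves), which is why the discussion above focuses on how well the factorization itself maps onto CPU cores versus GPU kernels.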
Walter Roberson
Walter Roberson on 20 Mar 2022
The number of simultaneous kernels you can run has varied a fair bit over time, and can depend on the exact model. Some models offer more control modules that control fewer cores, with each control module potentially being independent. And whether you are using single precision or double precision can make important differences in conjunction with the exact model.
So if you picked out just the right NVIDIA model, you might be able to do well on sparse computations, but on many (most) NVIDIA models it could be pretty inefficient.


More Answers (0)

Release

R2022a