Using gpuArrays to speed up a simulation (utilizing an NVIDIA GPU)

I have a Matlab simulation which updates an array:
Array=zeros(1,1000);
as follows:
for j=1:100000
Array=Array+rand(1,1000);
end
My question is the following: This loop is sequential, so its iterations cannot be parallelized, but the different slots of the array are updated independently. So, naturally, Matlab performs array operations such as this in parallel, using all the cores of the CPU.
I wish to get the same calculation performed on my NVIDIA GPU, in order to speed it up (utilizing the larger number of cores there).
The problem is that naively doing
tic
Array=gpuArray(zeros(1,1000));
for j=1:100000
Array=Array+gpuArray(rand(1,1000));
end
toc
results in the calculation time being 8 times longer!
What am I doing wrong?

Accepted Answer

Jan on 17 Jan 2018
rand(1, 1000) is created on the CPU and then copied to the graphics board. This communication is slow. It is better to create the random values directly on the GPU: https://www.mathworks.com/help/distcomp/examples/generating-random-numbers-on-a-gpu.html
Nevertheless, the code is not meaningful with random numbers. It might be useful to show us the real problem.
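The transfer cost described above can be measured directly. The following is a minimal sketch, assuming Parallel Computing Toolbox and a supported NVIDIA GPU are available; the variable names n, tCopy and tDirect are chosen for illustration only, and the actual timings depend on the hardware.

```matlab
% Assumes Parallel Computing Toolbox and a supported NVIDIA GPU.
n = 1000;

% Generate on the CPU, then copy to the GPU (one host-to-device transfer per call).
tCopy = gputimeit(@() gpuArray(rand(1, n)));

% Generate directly in GPU memory (no host-to-device transfer).
tDirect = gputimeit(@() rand(1, n, 'gpuArray'));

fprintf('CPU rand + copy: %.3g s, rand on GPU: %.3g s\n', tCopy, tDirect);
```

gputimeit is used instead of tic/toc because it waits for the GPU to finish before it stops the clock.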
  2 Comments
Erez on 17 Jan 2018
Edited: Erez on 17 Jan 2018
Thank you. After reading your link, it is still not clear to me how to generate a simple array of 1000 uniformly distributed random numbers (in the interval [0,1]) in each iteration of the loop above. Can you please demonstrate exactly what the adapted simple code would look like? My purpose at the moment is to understand the basics.
Jan on 17 Jan 2018
@Erez: Sorry if I ask, but did you read the link? There you find:
Typically these numbers are generated using the functions rand,
randi, and randn. Parallel Computing Toolbox™ provides three
corresponding functions for generating random numbers directly
on a GPU: gpuArray.rand, gpuArray.randi, and gpuArray.randn.
Try:
tic
Array = gpuArray(zeros(1,1000));
for j = 1:100000
Array = Array + gpuArray.rand(1,1000);
end
toc
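The same loop can also be written with the 'gpuArray' option of zeros and rand, which likewise creates the data directly on the GPU. This is a sketch under the assumption of a reasonably recent MATLAB release with Parallel Computing Toolbox; the names dev and result are illustrative. Because GPU operations are queued asynchronously, the sketch calls wait on the device before toc so that the measured time covers the finished computation.

```matlab
% Assumes Parallel Computing Toolbox and a supported NVIDIA GPU.
dev = gpuDevice;                       % handle to the currently selected GPU
Array = zeros(1, 1000, 'gpuArray');    % allocate directly in GPU memory

tic
for j = 1:100000
    Array = Array + rand(1, 1000, 'gpuArray');   % random numbers generated on the GPU
end
wait(dev);                             % block until all queued GPU work has finished
toc

result = gather(Array);                % copy the result back to host memory
```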
Concerning your original question: "So, naturally Matlab performs array operations such as this in parallel using all the cores of the CPU." This is not true under some circumstances. You can check this with the Task Manager: Adding a [1 x 1000] vector to another is a very cheap job. AVX code can add multiple doubles in each instruction, and starting a thread on each core of the CPU would be far too expensive. Therefore I assume that the loop is processed on one core only:
for j = 1:100000
Array = Array + rand(1, 1000);
end
This might be different for rand(16, 10000). For the sum() command, for example, the limit is 88999 elements: While sum(rand(1, 88999)) uses one core only, sum(rand(1, 89000)) runs on 2 cores, and is slower on some machines. This limit is a rough guess only, and how much time starting a thread needs depends on the CPU.
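One way to check on a given machine whether multithreading helps for a given array size is to time the same call with one computational thread and with the default number. This is only a sketch: the sizes 1000 and 1e7 are arbitrary choices for illustration, and timeit may warn that very fast calls are measured inaccurately.

```matlab
for n = [1000, 1e7]
    x = rand(1, n);

    nOld = maxNumCompThreads(1);   % restrict to one thread; returns the previous setting
    tSingle = timeit(@() sum(x));

    maxNumCompThreads(nOld);       % restore the previous thread count
    tMulti = timeit(@() sum(x));

    fprintf('n = %d: 1 thread %.3g s, %d threads %.3g s\n', n, tSingle, nOld, tMulti);
end
```

If the two timings for a size are about equal, the operation is probably running on one core for that size anyway.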
Note that your operation Array + rand(1, 1000) spends most of its time creating the random numbers, which is much more expensive than the cheap addition. Therefore the huge number of GPU cores might not give a substantial boost.
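For this toy example, one possible way to give the GPU more work per call is to generate many rows of random numbers at once and sum them, instead of issuing 100000 tiny operations. This is a sketch under stated assumptions: it relies on the iterations being independent accumulations (which may not hold for the real simulation), the batch size chunk is a hypothetical value, and the result is statistically equivalent to the original loop but does not use the same random number sequence.

```matlab
% Assumes Parallel Computing Toolbox and a supported NVIDIA GPU.
nIter = 100000;
chunk = 1000;                          % hypothetical batch size; 1000-by-1000 doubles is about 8 MB
Array = zeros(1, 1000, 'gpuArray');

for k = 1:(nIter / chunk)
    % One large random matrix per pass; summing along dimension 1 adds
    % chunk rows to each of the 1000 slots in a single operation.
    Array = Array + sum(rand(chunk, 1000, 'gpuArray'), 1);
end

result = gather(Array);                % copy the result back to host memory
```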
