Strange: Execution time per iteration increases, when GPU arrays are being used
Hi,
the following code computes the product of two rather large 1-D arrays on the GPU 20,000 times:
1 reset(gpuDevice);
2 a = 1:1:(256*256*100);
3 b = a;
4 c = a;
5 a = gpuArray(a);
6 b = gpuArray(b);
7 c = gpuArray(c);
8 tic
9 for z=1:20000
10 tstart(z) = tic;
11 c = a .* b;
12 telapsed(z) = toc(tstart(z));
13 end
14 toc
If lines 5-7 are commented out, the product is computed on the CPU instead.
Here are the execution times per iteration:
1. CPU usage - execution time: 386.48 seconds
2. GPU usage (driver 332.21 which can be found on the nvidia page) - execution time: 17.11 seconds
3. GPU usage (driver 332.21 which is included in the current cuda toolkit version 5.5) - execution time: 16.99 seconds
Here are my settings:
  • no memory leak (I've checked that - memory usage on the GPU is constant)
  • no other CPU or GPU processes are running or interfering with the computation
  • GPU: GeForce GTX 570
The execution times for cases 2 and 3 are very strange.
  1. Do you know why the execution time increases in case 2?
  2. Do you know why there is a 'pattern' in case 2 and 3?
It would be perfect if the execution time in case 3 stayed at 1e-4 s per iteration (as it does for the first 6000 iterations).
I'm running this on Windows 7. I know that it is hard to measure times below 1e-2 s per iteration on Windows. However, I can confirm that in case 3 the overall execution time for the first 6000 iterations is 0.32 seconds, so all 20,000 iterations should complete in less than 1 second (I've measured 17 seconds). This confirms that the measurements are correct.
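For reference, here is a minimal sketch of that overall-time check (assuming the Parallel Computing Toolbox; the wait(gpuDevice) call forces all queued GPU work to finish before the clock stops):

```matlab
% Minimal sketch of the overall-time check (array sizes match the
% example above).
a = gpuArray(1:(256*256*100));
b = a;
N = 6000;
tic
for z = 1:N
    c = a .* b;           % product computed on the GPU
end
wait(gpuDevice);          % force all queued kernels to finish
fprintf('%.2e s per iteration on average\n', toc/N);
```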
Is this a known bug in MATLAB? I don't see any reason why the execution time per iteration should increase.
Thanks for your advice!
Best,
Maurice
  4 Comments
Maurice
Maurice on 14 Feb 2014
Thanks for the hints, I've updated the title and other places where I mixed this up.
Joss Knight
Joss Knight on 14 Mar 2014
Maurice, what is the difference between 2 and 3? Your comments imply the driver is different, but the version number is the same.


Answers (3)

Anton
Anton on 4 Mar 2014
I have the same problem using gpuArray (but still no solution). I would also guess it is a memory problem. Maybe there is an input transfer buffer on the NVIDIA graphics card. If someone has a solution, please post it!
  3 Comments
Maurice
Maurice on 5 Mar 2014
Jill, you're right that the timing using tic and toc may not be accurate. However, one can analyze the total running time as I wrote above: 6000 iterations on the GPU run in 0.32 seconds in total, which is 0.32/6000 = 5.3*10^-5 seconds per iteration. However, 20,000 iterations run in approx. 17 seconds in total; in this case the time per iteration is more than one order of magnitude higher (approx. 8.5*10^-4 seconds), and the execution speed decreases with further iterations. Best,
Maurice



Joss Knight
Joss Knight on 14 Mar 2014
Edited: Joss Knight on 19 Mar 2014
As Jill points out, your timings are basically meaningless because you are using tic and toc. c = a.*b returns as soon as the GPU kernel is queued. All you've done is queue up a huge number of kernels - the step change is just when the queue is full and so you've started actually measuring the kernel execution time.
The only fair way to measure the time is to divide the overall time by the number of iterations, or to use gputimeit. You could use wait(gpuDevice) inside the loop, but this unfairly adds the cost of waiting for the kernel to complete and return control to MATLAB before queuing the next one. I ran your code using both gputimeit and wait(gpuDevice) and the execution time was dead flat throughout.
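One can see the asynchronous behaviour directly by comparing the two timings in a minimal sketch (again assuming the Parallel Computing Toolbox); the first toc measures only the cost of queueing the kernel, while the second, taken after wait(gpuDevice), includes the kernel's actual execution:

```matlab
% Sketch: compare the "queued" time against the "completed" time for
% a single element-wise product on the GPU.
a = gpuArray(rand(256*256*100, 1));
b = a;

tstart = tic;
c = a .* b;               % returns as soon as the kernel is queued
tQueued = toc(tstart);

wait(gpuDevice);          % block until the kernel has completed
tDone = toc(tstart);

fprintf('queued: %.2e s, completed: %.2e s\n', tQueued, tDone);
```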
You also need to pre-allocate your timing array, otherwise you're including the time taken to grow the array. This will necessarily take more and more time each time more memory is allocated because of the cost of copying the existing data to the new location. This is probably the reason for your growing execution time.
Try this instead and let me know whether it gives you flat timings:
telapsed = zeros(20000,1); % Pre-allocate
for z = 1:20000
    telapsed(z) = gputimeit(@() a.*b);
end
It will take a long time to execute because gputimeit runs the code at least 10 times to get an average value. You might want to reduce the number of iterations. If you don't think this code is testing the timing properly, do it with wait(gpuDevice) - just be aware that the actual execution time is much less:
telapsed = zeros(20000,1); % Pre-allocate
for z = 1:20000
    tstart = tic;
    c = a .* b;
    wait(gpuDevice);
    telapsed(z) = toc(tstart);
end

Iain
Iain on 10 Feb 2014
I suspect that it has to do with caching.
A slow main DRAM of 1 GB or whatever will have "slow" write access times (tens to hundreds of ns). A cached SRAM of 16 MB (or whatever) will have "fast" write access times (sub-ns).
Once the cached SRAM fills, it needs to start unloading to the main DRAM, which may restrict your ability to write to the SRAM and say "done".
  2 Comments
Maurice
Maurice on 14 Feb 2014
There is no write-back to main RAM during the iterations; everything is computed on either the CPU or the GPU.
Iain
Iain on 5 Mar 2014
On a GPU, you'll still have some local static RAM running at very high clock rates (a few MB is likely), a much bigger RAM space (hundreds to thousands of MB) still on the graphics card, then a further, main RAM on the motherboard (ones to tens of GB), and then a further massive "RAM" located on the machine's hard drive (hundreds to thousands of GB).
It might not be writing back to motherboard RAM, but it's almost certain to be writing back to a slow, large-scale RAM on the graphics card.

