Strange: Execution time per iteration increases, when GPU arrays are being used
Hi,
the following code computes the product of two rather large 1-D arrays on the GPU 20,000 times:
1 reset(gpuDevice);
2 a = 1:1:(256*256*100);
3 b = a;
4 c = a;
5 a = gpuArray(a);
6 b = gpuArray(b);
7 c = gpuArray(c);
8 tic
9 for z=1:20000
10 tstart(z) = tic;
11 c = a .* b;
12 telapsed(z) = toc(tstart(z));
13 end
14 toc
If lines 5-7 are commented out, the product is computed on the CPU instead.
Here are the execution times per iteration:
1. CPU usage - execution time: 386.48 seconds
2. GPU usage (driver 332.21 which can be found on the nvidia page) - execution time: 17.11 seconds
3. GPU usage (driver 332.21 which is included in the current cuda toolkit version 5.5) - execution time: 16.99 seconds
Here are my settings:
  • no memory leak (I've checked that - memory usage on the GPU is constant)
  • no other CPU or GPU processes are running or interfering with the computation
  • GPU: GeForce GTX 570
The execution times for cases 2 and 3 are very strange.
  1. Do you know why the execution time increases in case 2?
  2. Do you know why there is a 'pattern' in case 2 and 3?
It would be perfect if the execution time in case 3 stayed at 1e-4 s per iteration (as it does for the first 6000 iterations).
I'm running this on Windows 7. I know that it is hard to measure times below 1e-2 s per iteration on Windows. However, I can confirm that in case 3 the overall execution time for the first 6000 iterations is 0.32 seconds, so all 20,000 iterations should complete in less than 1 second (I've measured 17 seconds). This confirms that the measurements are correct.
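For reference, here is a minimal sketch of that overall-time check (assuming the Parallel Computing Toolbox; the wait(gpuDevice) call forces all queued GPU work to finish before the clock stops):

```matlab
% Minimal sketch of the overall-time check (array sizes match the
% example above).
a = gpuArray(1:(256*256*100));
b = a;
N = 6000;
tic
for z = 1:N
    c = a .* b;           % product computed on the GPU
end
wait(gpuDevice);          % force all queued kernels to finish
fprintf('%.2e s per iteration on average\n', toc/N);
```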
Is this a known bug in MATLAB? I don't see any reason why the execution time per iteration should increase.
Thanks for your advice!
Best,
Maurice
  4 Comments
Maurice
Maurice on 14 Feb 2014
Thanks for the hints, I've updated the title and other places where I mixed this up.
Joss Knight
Joss Knight on 14 Mar 2014
Maurice, what is the difference between 2 and 3? Your comments imply the driver is different, but the version number is the same.


Answers (3)

Anton
Anton on 4 Mar 2014
I have the same problem using gpuArray (but still no solution). I would also guess it is a memory problem. Maybe there is an input transfer buffer on the NVIDIA graphics card. If someone has a solution, please post it!
  3 Comments
Maurice
Maurice on 5 Mar 2014
Jill, you're right that the timing using tic and toc may not be accurate. However, one can analyze the total running time as I wrote above: 6000 iterations on the GPU run in 0.32 seconds in total, which is 0.32/6000 = 5.3*10^-5 seconds per iteration. However, 20,000 iterations run in approx. 17 seconds in total; in this case the time per iteration is more than one order of magnitude higher (approx. 8.5*10^-4 seconds), and the execution speed decreases with further iterations. Best,
Maurice



Joss Knight
Joss Knight on 14 Mar 2014
Edited: Joss Knight on 19 Mar 2014
As Jill points out, your timings are basically meaningless because you are using tic and toc. c = a.*b returns as soon as the GPU kernel is queued. All you've done is queue up a huge number of kernels - the step change is just when the queue is full and so you've started actually measuring the kernel execution time.
The only fair way to measure the time is to divide the overall time by the number of iterations, or to use gputimeit. You could use wait(gpuDevice) inside the loop, but this unfairly adds the cost of waiting for the kernel to complete and return control to MATLAB before queuing the next one. I ran your code using both gputimeit and wait(gpuDevice) and the execution time was dead flat throughout.
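One can see the asynchronous behaviour directly by comparing the two timings in a minimal sketch (again assuming the Parallel Computing Toolbox); the first toc measures only the cost of queueing the kernel, while the second, taken after wait(gpuDevice), includes the kernel's actual execution:

```matlab
% Sketch: compare the "queued" time against the "completed" time for
% a single element-wise product on the GPU.
a = gpuArray(rand(256*256*100, 1));
b = a;

tstart = tic;
c = a .* b;               % returns as soon as the kernel is queued
tQueued = toc(tstart);

wait(gpuDevice);          % block until the kernel has completed
tDone = toc(tstart);

fprintf('queued: %.2e s, completed: %.2e s\n', tQueued, tDone);
```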
You also need to pre-allocate your timing array, otherwise you're including the time taken to grow the array. This will necessarily take more and more time each time more memory is allocated because of the cost of copying the existing data to the new location. This is probably the reason for your growing execution time.
Try this instead and let me know whether it gives you flat timings:
telapsed = zeros(20000,1); % Pre-allocate
for z = 1:20000
    telapsed(z) = gputimeit(@() a.*b);
end
It will take a long time to execute because gputimeit runs the code at least 10 times to get an average value. You might want to reduce the number of iterations. If you don't think this code is testing the timing properly, do it with wait(gpuDevice) - just be aware that the actual execution time is much less:
telapsed = zeros(20000,1); % Pre-allocate
for z = 1:20000
    tstart = tic;
    c = a .* b;
    wait(gpuDevice);
    telapsed(z) = toc(tstart);
end

Iain
Iain on 10 Feb 2014
I suspect that it has to do with caching.
A slow main DRAM of 1 GB or whatever will have "slow" write access times (tens to hundreds of ns). A cached SRAM of 16 MB (or whatever) will have "fast" write access times (sub-ns).
Once the cached SRAM fills, it needs to start unloading to the main DRAM, which may restrict your ability to write to the SRAM and say "done".
  2 Comments
Maurice
Maurice on 14 Feb 2014
There is no write-back to main RAM during the iterations; everything is computed on either the CPU or the GPU.
Iain
Iain on 5 Mar 2014
On a GPU, you'll still have some local static RAM running at very high clock rates (a few MB is likely), a much bigger RAM space (hundreds to thousands of MB) still on the graphics card, then a further, main RAM on the motherboard (ones to tens of GB), and then a further massive "RAM" located on the machine's hard drive (hundreds to thousands of GB).
It might not be writing back to motherboard RAM, but it's almost certain to be writing back to a slow, large-scale RAM on the graphics card.

