Difference in GPU processing inside parfor between R2011a and R2013a

I have two MATLAB functions that use the GPU for processing (through the Parallel Computing Toolbox). These functions are called -- along with many other functions -- from a parent function, which is itself called many times inside a parfor loop (each time with a different image loaded from disk).
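For reference, the overall structure is roughly like this (a sketch; all names here are placeholders, not the actual code):

% Sketch of the structure described above: each parfor iteration loads one
% frame from disk and processes it; part of the processing uses gpuArray.
frameFiles = {'frame001.png', 'frame002.png'};     % example inputs
results    = cell(size(frameFiles));
parfor k = 1:numel(frameFiles)
    img        = imread(frameFiles{k});            % disk read on the worker
    results{k} = processFrame(img);                % mixes CPU and GPU work
end

% processFrame.m (separate function file)
function out = processFrame(img)
    x   = gpuArray(single(img));                   % move the frame to device memory
    y   = real(ifft2(fft2(x) .* conj(fft2(x))));   % stand-in for the real GPU work
    out = gather(y);                               % copy the result back to the host
end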
When I run this code in R2011a everything works fine. That is, all of the workers are processing, their disk I/O read activity is high, the GPU is processing (GPU-Z shows ~70% GPU load, ~30% memory controller load, and memory usage fluctuating between 60-90%), and the code runs much faster with the GPU than without. I have run this code many times on multiple systems running Windows 7 64-bit, each with a different graphics card -- GTX 580, GTX 680, and GTX 780.
After moving to R2013a, this exact same code takes an extremely long time to process even a single frame. Task Manager shows the workers with high CPU usage and GPU-Z shows high GPU load, but the GPU memory controller load is zero and the disk I/O read activity for the CPU workers is zero. Meanwhile, the GPU memory usage is pegged at 99% the entire time. If I run the code using a regular for loop in R2013a everything works fine, and the GPU load/memory usage mirrors that of R2011a with a regular for loop.
Any idea what changed between R2011a and R2013a that would cause this difference? Or how I could get the R2011a behavior in R2013a? My guess is that in R2011a GPU access is serialized between the workers, but in R2013a it is not. I tried rewriting the GPU code so that it would require less memory (using for loops to break up big matrix calculations) and was able to cut the GPU memory usage significantly in the regular for loop case. However, it didn't seem to make any difference in the parfor case, even though the GPU should have plenty of memory for the workers (based on the GPU memory usage with a regular for loop).
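The kind of rewrite described above looks roughly like this (a sketch with placeholder names; the actual calculation is different):

% Sketch of the memory-reduction rewrite: work on row blocks so that only
% one block's worth of intermediate arrays lives on the GPU at a time.
A         = rand(4096, 'single');                  % example large matrix
blockRows = 512;
out       = zeros(size(A), 'single');
for r = 1:blockRows:size(A, 1)
    rows         = r:min(r + blockRows - 1, size(A, 1));
    Ag           = gpuArray(A(rows, :));           % one block on the device
    out(rows, :) = gather(exp(-Ag.^2));            % stand-in for the real calculation
end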
  2 Comments
Jill Reese on 23 Dec 2013
In R2012b there was a change in the area of parfor and GPU support, but without knowing how many workers and GPUs you are running on, and how you want workers to be mapped to your GPU(s), I cannot tell if this is the issue. However, this is what the R2012b release notes have to say:
Main point: Automatic detection and selection of specific GPUs on a cluster node when multiple GPUs are available on the node
Details: When multiple workers run on a single compute node with multiple GPU devices, the devices are automatically divided up among the workers. If there are more workers than GPU devices on the node, multiple workers share the same GPU device. If you put a GPU device in 'exclusive' mode, only one worker uses that device. As in previous releases, you can change the device used by any particular worker with the gpuDevice function.
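For example, you can check which device each worker has selected, or assign one explicitly, with something like this (a sketch; it assumes a worker pool is already open):

% Sketch: report which GPU each worker has selected (assumes a pool is
% already open, e.g. with matlabpool in these releases).
spmd
    d = gpuDevice();                 % the device currently selected on this worker
    fprintf('Worker %d of %d is using GPU %d (%s)\n', ...
        labindex, numlabs, d.Index, d.Name);
    % gpuDevice(idx) would select a different device, if one were available.
end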
Kevin Stone on 24 Dec 2013
Sorry, I should have made it clearer: all of these stations are single-GPU machines with a multicore (6-8 core) CPU. The GPU computation accounts for ~15-20% of the total computation per frame, and a different frame is processed by each worker. It is not clear to me from the documentation how the case of a single GPU and multiple workers is handled internally by MATLAB, but something seems to have changed between R2011a and R2013a. The answer at http://www.mathworks.com/matlabcentral/answers/98849 says that the GPU computations are serialized across workers, which IIRC is what happens by default with CUDA when multiple processes use the same GPU. Perhaps this has changed in newer CUDA toolkits? I haven't done any CUDA programming since 4.2.
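As a sanity check (again just a sketch), each worker's view of the shared device, including its compute mode and free memory, can be inspected like this:

% Sketch: with one GPU and several workers, every worker should report the
% same device; ComputeMode and FreeMemory show how the device is being shared.
spmd
    d = gpuDevice();
    fprintf('Worker %d: GPU %d (%s), compute mode %s, free memory %.0f MB\n', ...
        labindex, d.Index, d.Name, d.ComputeMode, d.FreeMemory / 2^20);
end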

Answers (0)
