Can't get a speedup!
Hi,
I've just learned about MATLAB's Parallel Computing Toolbox and I'm studying it. To start, and to motivate myself, I tried some simple codes to get a speedup, but all of my parallel results were worse than the serial ones. I know about the overhead of data communication between cores, so I wrote code with as little data communication as possible, but as before, the parallel execution time was longer than the serial time. I run the parallel code on two workers. My CPU is an Intel Core 2 Duo at 2.26 GHz. CPU usage is 100% while running the parallel code and 50% while running the serial code.
I also tried some code that I found on the net. Its author claimed a speedup of 1.92 for that code using 2 workers, but I got 0.96!
I'm really puzzled!
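For reference, a minimal sketch of how such a serial-vs-parallel comparison can be timed. It assumes a parallel pool is already open before the script runs, so worker startup is not counted; the workload (nIter, nWork and the sum-of-sines body) is hypothetical, chosen so that each iteration does substantial arithmetic but sends back only one scalar. If the Parallel Computing Toolbox is not installed, parfor executes as an ordinary for loop, so the script still runs, just without a speedup.

```matlab
% Hypothetical workload: heavy arithmetic per iteration, one scalar returned,
% so very little data travels between the client and the workers.
nIter = 8;
nWork = 2e6;

% Serial baseline.
serialOut = zeros(1, nIter);
tic;
for it = 1:nIter
    serialOut(it) = sum(sin((1:nWork) + it));
end
tSerial = toc;

% Same loop with parfor; parOut is a sliced output, nWork is broadcast.
parOut = zeros(1, nIter);
tic;
parfor it = 1:nIter
    parOut(it) = sum(sin((1:nWork) + it));
end
tParallel = toc;

fprintf('serial %.3f s, parfor %.3f s, speedup %.2f\n', ...
    tSerial, tParallel, tSerial / tParallel);
```

Repeating the measurement a few times is fairer than a single run, since the first parfor call may also include one-time costs such as getting the code onto the workers.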
2 Comments
Walter Roberson
on 1 Dec 2012
If you have hyperthreading enabled, turn it off.
john
on 2 Dec 2012
Answers (2)
john
on 2 Dec 2012
2 Comments
Walter Roberson
on 2 Dec 2012
Try reversing the order of the subscripts, producing a 4-million-by-2 output, so that there is no cache-line contention. Also, try vectorizing, e.g.,
A(j, :) = sin(j + (1:4000000));
with no "for k" loop.
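A sketch of both suggestions, assuming the original loop filled a 2-by-4000000 matrix with A(j,k) = sin(j + k) (the code from the question is not shown in this thread, so this is a reconstruction inferred from the vectorized line above). The first loop keeps the 2-by-N layout but drops the "for k" loop; the second uses the transposed N-by-2 layout, so each j writes one contiguous column.

```matlab
% Hypothetical reconstruction: A(j,k) = sin(j + k) for j = 1:2, k = 1:N.
N = 4000000;

% Vectorized version of the 2-by-N layout: one whole row per j.
% Neighbouring elements of a row are 2 elements apart in memory.
A = zeros(2, N);
parfor j = 1:2
    A(j, :) = sin(j + (1:N));
end

% Transposed N-by-2 layout: each j fills one column, which is a
% contiguous block of memory in MATLAB's column-major storage.
B = zeros(N, 2);
parfor j = 1:2
    B(:, j) = sin(j + (1:N)');
end

% Same values, different memory layout.
fprintf('max difference: %g\n', max(max(abs(A - B.'))));
```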
Jan
on 29 Dec 2012
Slightly faster: sin(j + 1:j + 4000000)
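Jan's version relies on MATLAB's operator precedence: + binds more tightly than the colon operator, so j + 1:j + 4000000 is parsed as (j+1):(j+4000000). That gives the same vector as j + (1:4000000) without first building 1:4000000 and then adding j to every element. A small check, with j and N as arbitrary illustrative values:

```matlab
j = 2;
N = 10;
a = j + (1:N);     % builds 1:N, then adds j to each element
b = j + 1:j + N;   % parsed as (j+1):(j+N), built directly
disp(isequal(a, b))   % displays 1 (true)
```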
6 Comments
Walter Roberson
on 2 Dec 2012
Elements are arranged in memory going down columns, but you are writing across rows. So for any given K, A(1,K) and A(2,K) are adjacent in memory. When items are adjacent in memory, hardware considerations can require that one process temporarily be blocked from writing until the other finishes. That access was being negotiated each iteration of the "k" loop. When vectorized forms are used instead, the negotiation between threads is handled in chunks, reducing the overhead.
Also, even things as simple as adding a constant (j in this case) to a vector can be handled more efficiently as chunks rather than one-by-one.
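The column-major layout can be checked directly with sub2ind, which converts a (row, column) pair into a linear index; for an ordinary full array, linear-index order is the order in which elements are stored. A small sketch with a 2-row matrix, matching the 2-by-N shape discussed above:

```matlab
A = reshape(1:6, 2, 3);   % A = [1 3 5; 2 4 6]

% Two vertically adjacent elements, A(1,2) and A(2,2).
iTop    = sub2ind(size(A), 1, 2);
iBottom = sub2ind(size(A), 2, 2);

% Two horizontally adjacent elements, A(1,2) and A(1,3).
iLeft  = sub2ind(size(A), 1, 2);
iRight = sub2ind(size(A), 1, 3);

% Prints: down a column: 3 -> 4, across a row: 3 -> 5
fprintf('down a column: %d -> %d, across a row: %d -> %d\n', ...
    iTop, iBottom, iLeft, iRight);
```

Moving down a column steps one element in memory; moving across a row steps size(A,1) elements, and A(:).' returns 1:6, confirming column-by-column storage.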
Bradley Stiritz
on 29 Dec 2012
Walter, you have an amazingly deep understanding of parallel code execution! I was just looking around Amazon this evening for a good reference on multithreaded algorithms. I gave up & decided to browse here instead!
Are there any basic citations or references you could mention, please, that might give more detail & further examples of software / hardware considerations, as in your explanation?
Thanks & happy holidays, Brad
Walter Roberson
on 29 Dec 2012
Thanks, Brad, but there is a fair bit about parallel processing that I do not know. I have not had a chance to use MathWorks' Parallel Computing Toolbox.
I learned most of what I know about parallel code informally.
One of the tools that did help me was SGI's IRIX APO (Automatic Parallelization Option) for their Fortran and C compilers. The warnings and diagnostics from it were helpful in learning which patterns worked and which did not. Points such as cache coherency were important in that environment because SGI's machines were designed for unified memory access across up to 65535 processors -- designed for "fine grained" parallelism, tightly communicating. Most systems these days are designed for loosely-coupled communications where processes can run for fair chunks of time before having to synchronize.
Bradley Stiritz
on 29 Dec 2012
OK, thanks Walter for the background info. I'm developing strictly in MATLAB though, so unfortunately I won't be able to check out those compilers. I'll do some further digging & maybe post a new Question specifically about this, or a Service Request. Will post back to this thread if I come up with anything.
Walter Roberson
on 29 Dec 2012
Those machines haven't been sold for a number of years. And MATLAB has not been supported on them for a fair number of releases.
Bradley Stiritz
on 22 Jan 2013
Regarding Walter's 12/2/2012 comment, I requested a documentation reference from MathWorks support & received this reply:
------------------------------------------------
"While working with MATLAB in general and also with PCT, row-wise access in same column will be faster than column-wise access in same row. This is because, in MATLAB, matrix elements belonging to the same column are located in consecutive locations of memory, while elements belonging to the same row of the matrix are located in non-consecutive locations of memory.
The following link from the Mathworks website highlights the above mentioned fact and also provides additional information on "Speeding up MATLAB Applications"
Also, please see the following MATLAB Documentation link for additional details on profiling and improving parallel code:
Unfortunately, we cannot recommend any books. However, it would be highly recommended to go through the webinars mentioned in the link below:
------------------------------------------------
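A sketch of the loop-order point in that reply: the same matrix filled once with the inner loop running down each column and once with it running across each row. The sizes are arbitrary, and the measured gap depends on the machine, the matrix size and the MATLAB release (the JIT can narrow it), so no particular timing is implied.

```matlab
% Arbitrary sizes for illustration.
nRows = 500;
nCols = 500;

% Inner loop runs down one column: consecutive memory locations.
M1 = zeros(nRows, nCols);
tic;
for c = 1:nCols
    for r = 1:nRows
        M1(r, c) = r + c;
    end
end
tDownColumns = toc;

% Inner loop runs across one row: each step jumps nRows elements in memory.
M2 = zeros(nRows, nCols);
tic;
for r = 1:nRows
    for c = 1:nCols
        M2(r, c) = r + c;
    end
end
tAcrossRows = toc;

fprintf('down columns: %.4f s, across rows: %.4f s\n', tDownColumns, tAcrossRows);
```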
Hope this helps.