Can't get a speedup!

Hi,
I've just learned about the MATLAB Parallel Computing Toolbox and am studying it. To begin with, and for motivation, I tried some simple codes to get a speedup, but all of my parallel results were worse than the serial ones. I know about the overhead of data communication between cores, so I wrote a code with minimal data communication, but as before the parallel execution time was longer than the serial one. I run the parallel code on two workers. My CPU is an Intel Core 2 Duo at 2.26 GHz. CPU usage is 100% while running the parallel code and 50% while running the serial code.
I also tried a code that I found on the net. The author claimed a speedup of 1.92 for the code using 2 workers, but I got 0.96!
I'm so disturbed!

2 Comments

If you have hyperthreading enabled, turn it off.
john on 2 Dec 2012
I checked the BIOS and didn't find any CPU settings or a hyperthreading option. I think this CPU (Core 2 Duo) doesn't have the hyperthreading feature.
Any other suggestions? Please help me.


Answers (2)

john on 2 Dec 2012
This is my serial code:
clear
A = zeros(2,4000000);
tic
for j = 1:2
    for k = 1:4000000
        A(j,k) = sin(j + k);
    end
end
toc
And the parallel one:
clear
A = zeros(2,4000000);
tic
parfor j = 1:2
    for k = 1:4000000
        A(j,k) = sin(j + k);
    end
end
toc
Very simple! The parfor has just two iterations, and I expected each iteration to be executed by one core, giving a speedup of about 2. But the run time of the first is 4 seconds and of the second is 14!
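For reference, the speedup figures quoted in this thread are the serial time divided by the parallel time. A minimal sketch of measuring that ratio in one script follows. It assumes a pool of two workers is already open before the timing starts; without one, pool startup or a serial fallback can distort the result. The names N, B, tSerial, tParallel and speedup are illustrative and not from the original post, and N is reduced from 4000000 so the sketch runs quickly.

```matlab
% Illustrative problem size (the post uses 4000000).
N = 400000;

% Serial version, timed.
A = zeros(2, N);
tic
for j = 1:2
    for k = 1:N
        A(j,k) = sin(j + k);
    end
end
tSerial = toc;

% Parallel version, timed the same way.
B = zeros(2, N);
tic
parfor j = 1:2
    for k = 1:N
        B(j,k) = sin(j + k);
    end
end
tParallel = toc;

% Speedup > 1 means the parallel version was faster.
speedup = tSerial / tParallel;
fprintf('serial %.3f s, parallel %.3f s, speedup %.2f\n', tSerial, tParallel, speedup);
```

A single tic/toc pair is noisy, so repeating each measurement a few times and comparing the typical values is probably more reliable than one run.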

2 Comments

Try reversing the order of the subscripts, producing a 4 million by 2 output, so that there would not be any cache-line contention. Also, try vectorizing, e.g.,
A(j, :) = sin(j + (1:4000000));
with no "for k" loop.
Jan on 29 Dec 2012
Slightly faster: sin(j + 1:j + 4000000)
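The two suggestions above (swapping the subscripts so each worker writes a whole column, and vectorizing away the inner loop) might be combined roughly as follows. This is only a sketch: N is reduced from 4000000 for a quick run, and the names N, jj and sameRange are illustrative. The last lines check that the shorter form j + 1:j + N covers the same range as j + (1:N), since the colon operator has lower precedence than +.

```matlab
% Illustrative problem size (the thread uses 4000000).
N = 400000;

% Transposed layout: N-by-2, so each parfor iteration fills one contiguous column.
A = zeros(N, 2);
parfor j = 1:2
    A(:, j) = sin(j + (1:N)');   % vectorized, no inner "for k" loop
end

% j + 1:j + N parses as (j+1):(j+N), the same values as j + (1:N).
jj = 2;
sameRange = isequal(jj + 1:jj + N, jj + (1:N));
```

If the original 2-by-N orientation is needed afterwards, transposing the result once after the loop is one option.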


john on 2 Dec 2012
Edited: john on 2 Dec 2012
Thanks! I tried this code
A(j, :) = sin(j + (1:4000000));
and got a speedup of about 1! This wasn't as disappointing as the earlier samples. Then I tried making each iteration heavier:
A(j, :) = sin(j + (1:4000000)) .* sin(j - (1:4000000)) ...
.* cos(j - (1:4000000)) .* cos(j + (1:4000000));
and the speedup was about 1.3!
Can you please explain what effect the code you suggested (A(j, :) = ...) has? Why was my first code bad?
Regards

6 Comments

Elements are arranged in memory going down columns, but you are writing across rows. So for any given k, A(1,k) and A(2,k) are adjacent in memory. When items are adjacent in memory, hardware considerations can require that one process temporarily be blocked from writing until the other finishes. That access was being negotiated on each iteration of the "k" loop. When vectorized forms are used instead, the negotiation between threads is handled in chunks, reducing the overhead.
Also, even things as simple as adding a constant (j in this case) to a vector can be handled more efficiently as chunks rather than one-by-one.
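The adjacency described above can be checked with linear indices, which follow MATLAB's column-major storage order. A small sketch on a 2-by-4 matrix (the variable names are illustrative):

```matlab
% 2-by-4 matrix in which each element holds its own linear (storage-order) index.
A = reshape(1:8, 2, 4);

k = 3;
idxRow1 = sub2ind(size(A), 1, k);           % linear index of A(1,k)
idxRow2 = sub2ind(size(A), 2, k);           % linear index of A(2,k)
idxNextInRow = sub2ind(size(A), 1, k + 1);  % linear index of A(1,k+1)

fprintf('A(1,%d) -> %d, A(2,%d) -> %d, A(1,%d) -> %d\n', ...
    k, idxRow1, k, idxRow2, k + 1, idxNextInRow);
```

Here A(1,3) and A(2,3) get indices 5 and 6, so they sit next to each other, while A(1,3) and A(1,4) are two positions apart (5 and 7). In the 2-by-4000000 array from the question, the two elements written by the two parfor iterations for the same k are therefore always neighbors in memory.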
Walter, you have amazingly deep understanding of parallel code execution! I was just looking around Amazon this evening for a good reference on multi-threaded algorithms. I gave up & decided to browse here instead!
Are there any basic citations or references you could mention, please, that might give more detail & further examples of software / hardware considerations, as in your explanation?
Thanks & happy holidays, Brad
Thanks, Brad, but there is a fair bit about parallel processing that I do not know. I have not had a chance to use MathWorks' Parallel Computing Toolbox.
I learned most of what I know about parallel code informally.
One of the tools that did help me was SGI's IRIX APO (Automatic Parallelization Option) for their Fortran and C compilers. The warnings and diagnostics from it were helpful in learning which patterns worked and which did not. Points such as cache coherency were important in that environment because SGI's machines were designed for unified memory access across up to 65535 processors -- designed for "fine grained" parallelism, tightly communicating. Most systems these days are designed for loosely-coupled communications where processes can run for fair chunks of time before having to synchronize.
OK, thanks Walter for the background info. I'm developing strictly in MATLAB though, so unfortunately I won't be able to check out those compilers. I'll do some further digging & maybe post a new Question specifically about this, or a Service Request. Will post back to this thread if I come up with anything.
Those machines haven't been sold for a number of years. And MATLAB has not been supported on them for a fair number of releases.
Regarding Walter's 12/2/2012 comment, I requested a documentation reference from MathWorks support and received this reply:
------------------------------------------------
"While working with MATLAB in general and also with PCT, row-wise access in same column will be faster than column-wise access in same row. This is because, in MATLAB, matrix elements belonging to the same column are located in consecutive locations of memory, while elements belonging to the same row of the matrix are located in non-consecutive locations of memory.
The following link from the Mathworks website highlights the above mentioned fact and also provides additional information on "Speeding up MATLAB Applications"
Also, please see the following MATLAB Documentation link for additional details on profiling and improving parallel code:
Unfortunately, we cannot recommend any books. However, it would be highly recommended to go through the webinars mentioned in the link below:
------------------------------------------------
Hope this helps.
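One way to see the access-order effect from the support reply is to time the same double loop twice, once with the inner loop running down a column and once across a row. This is a rough sketch, not a benchmark: the size n and all variable names are illustrative, and the measured gap will vary with the machine and MATLAB release.

```matlab
% Illustrative square matrix.
n = 1000;
M = rand(n);

% Inner loop walks down a column: consecutive memory locations.
tic
sCol = 0;
for c = 1:n
    for r = 1:n
        sCol = sCol + M(r, c);
    end
end
tCol = toc;

% Inner loop walks across a row: each step jumps n elements in memory.
tic
sRow = 0;
for r = 1:n
    for c = 1:n
        sRow = sRow + M(r, c);
    end
end
tRow = toc;

fprintf('down columns: %.4f s, across rows: %.4f s\n', tCol, tRow);
```

Both loops add up the same elements, so sCol and sRow agree up to floating-point rounding; only the traversal order differs.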


Asked:

on 1 Dec 2012
