Fastest Possible Way to convert a table containing Only 2 strings to numbers

Hello all,
I have an NxM array of strings. Most of the cells are empty. However, those that are not empty contain only 2 values, 'Het' or 'Hom', for heterozygous vs. homozygous.
I want to:
1. Create an NxM matrix.
2. Put a 1 into the matrix at position (i,j) for every instance of the string 'Het' at position (i,j) in the array.
3. Put a 2 into the matrix at position (i,j) for every instance of the string 'Hom' at position (i,j) in the array.
(The 1 and 2 should be numbers, not strings.)
EXAMPLE:
Array = 'Het' '' '' 'Het' '' 'Hom'
'' 'Het' 'Hom' '' '' ''
would become
Matrix = [1 0 0 1 0 2; 0 1 2 0 0 0] (could be NaN instead of 0; that doesn't matter to me)
Now, I can think of a bunch of workarounds for this.
I could call strfind a ton of times. I could use uint8, then divide that output by a set number, or something like that.
But all the workarounds I can think of are slow.
What is the fastest way to make this conversion on a very large array?
I do have parallel computing toolbox in principle, but I have never used it so I would need clear instructions...
Thank you very much in advance!

Accepted Answer

Edric Ellis on 5 Feb 2014
You can do this with UNIQUE or ISMEMBER.
values = {'Het', '', '', 'Het', '', 'Hom'; '', 'Het', 'Hom', '', '', ''}
% use third output of UNIQUE directly:
[u, ~, idx] = unique(values);
% idx is the wrong shape, so reshape it:
out = reshape(idx, size(values))
% Or, using the second output of ISMEMBER:
[~, idx] = ismember(values, u)
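Note that UNIQUE sorts its output, so u here is {''; 'Het'; 'Hom'} and the indices above come out as 1 for empty, 2 for 'Het' and 3 for 'Hom', rather than the 0, 1 and 2 asked for in the question. A minimal sketch of one way to get 0/1/2 directly, assuming the only non-empty labels are 'Het' and 'Hom', is to pass just those two labels to ISMEMBER, which returns 0 for anything not in the list:

```matlab
% Sample data copied from the question
values = {'Het', '', '', 'Het', '', 'Hom'; '', 'Het', 'Hom', '', '', ''};

% Only the two labels are listed, so empty cells are "not found" and map to 0,
% 'Het' maps to 1 (first entry) and 'Hom' maps to 2 (second entry).
[~, out] = ismember(values, {'Het', 'Hom'});
```

The second output already has the same shape as values, so no reshape is needed. Subtracting 1 from the reshaped UNIQUE index should give the same result.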
You could next parallelize it by using PARFOR over the rows of values. For example, let's make a larger 'values' by replicating it
values = repmat(values, 100, 10);
% Preallocate out at the new size of values
out = zeros(size(values));
parfor rowIdx = 1:size(values,1)
    [~, out(rowIdx, :)] = ismember(values(rowIdx, :), u);
end
It's not clear to me whether applying PARFOR like this would make things faster though - in general, PARFOR works well when you are doing lots of work per amount of data transferred, and I'm not convinced that's the case here.
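One way to settle that question is simply to time both variants on your own data. The following is only a rough sketch using tic/toc; the variable names outSerial, outPar, tSerial and tPar are made up for illustration, and the array size is the small replicated example rather than a realistic one. Keep in mind that the first PARFOR in a session may include the time needed to start the worker pool, so it is worth running the comparison twice.

```matlab
% Build the replicated example and the lookup list
values = repmat({'Het', '', '', 'Het', '', 'Hom'; '', 'Het', 'Hom', '', '', ''}, 100, 10);
u = unique(values);

% Variant 1: a single vectorized ISMEMBER call
tic;
[~, outSerial] = ismember(values, u);
tSerial = toc;

% Variant 2: PARFOR over the rows, with the output preallocated
outPar = zeros(size(values));
tic;
parfor rowIdx = 1:size(values, 1)
    [~, outPar(rowIdx, :)] = ismember(values(rowIdx, :), u);
end
tPar = toc;

fprintf('single call: %.4f s, parfor: %.4f s\n', tSerial, tPar);
```

Both variants should produce the same matrix, so isequal(outSerial, outPar) is a cheap sanity check before comparing the timings.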
  1 Comment
Sarutahiko on 6 Feb 2014
Edric - Thank you so much for this post.
I am self-taught, so at times I just completely miss things like ismember. I'll check it out!


More Answers (1)

Walter Roberson on 6 Feb 2014
No one knows the fastest possible way. It is going to depend upon your exact computer details, including the processor, architecture, amount of primary cache, kind of connection to your secondary cache, secondary cache speed, third-level cache speed, amount of RAM, which version of MATLAB you are using, what else is running on your system, and other such details. It also depends upon doing a lot of analysis of the most efficient possible algorithm for the task, considering processor details such as pre-fetch, cache-line size, speculative execution, hyperthreading, out-of-order execution, pipelining...
The fastest possible method might involve sending the data over to an FPGA and having it do the calculations. Or perhaps you could do even better with custom ASICs.
You should also be considering writing a mex routine to do the analysis.
One thing that I would point out is that if the string is empty you know the output immediately (0), and if it is not empty then you only need to check the second character: if it is 'o' then the result is 2 and otherwise the result is 1. You do not need ismember() or to process other parts of the string.
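A rough sketch of that second-character idea in plain MATLAB (not a mex routine) might look like the following. It assumes every non-empty cell is exactly 'Het' or 'Hom' and that at least one cell is non-empty; the names nonEmpty and c are just for illustration. Whether it actually beats ismember() would have to be measured on your data.

```matlab
% Sample data copied from the question
values = {'Het', '', '', 'Het', '', 'Hom'; '', 'Het', 'Hom', '', '', ''};

% Empty cells keep the default value 0
out = zeros(size(values));

% Logical mask of the cells that hold a string
nonEmpty = ~cellfun('isempty', values);

% Stack the non-empty strings into a K-by-3 char matrix, one string per row
c = char(values(nonEmpty));

% Second character 'o' means 'Hom' (2); anything else is taken to be 'Het' (1)
out(nonEmpty) = 1 + (c(:, 2) == 'o');
```

For the sample array this should give [1 0 0 1 0 2; 0 1 2 0 0 0], matching the matrix in the question.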
  2 Comments
Sarutahiko on 7 Feb 2014
Why does 90% of your answer focus on 1% of the words in my post?
Just a question.
Walter Roberson on 7 Feb 2014
Figuring out "the fastest possible way" to do something is often much much more than 90% of the time involved in a problem. Consider, for example, the amount of human effort that has gone into finding "the fastest possible way" to factor large numbers.
