How to SPEED UP the computation time of a fuzzy string similarity algo? (need the code to be at least 10X FASTER)

4 views (last 30 days)
Hi,
I need to make the following code much faster. It basically computes a string similarity score. The problem is that, given a string as input, I need to compute the similarity score between the input string and a list of about 40000 strings, which takes about 40 seconds. I am using cellfun rather than a loop which is speeding up the computation time, but I need it to be faster: it should take max a few seconds.
The similarity score is computed by combining two different measures: one is the Damerau–Levenshtein distance (which takes approximately 1/3 of the computation time) and the second one is the Dice Coefficient (which takes about 2/3 of the computation time).
Any idea? Thanks in advance for any help.
Rossella
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%MAIN FUNCTION
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% MySTRING = string that the user wants to find the possible matches for
% TargetList = cellarray with 40000 rows with a list of strings
[DamLev, DiceCoeff] = ComputeStringDistance(MySTRING,TargetList);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%ComputeStringDistance FUNCTION
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% ComputeStringDistance calculates the Damerau-Levenshtein distance and the DICE COEFFICIENT
function [NormalizedDistance, DiceCoefficient] = ComputeStringDistance(MySTRING, TargetList)
Nk = numel(key);
NormalizedDistance = cellfun(@(x) ComputeLevDamDistance(x, key, Nk), TargetList);
words1 = GetSingleWords(key);
% for each word, find all bigrams
bigrams1 = cellfun(@GetBigrams, words1, 'UniformOutput',0);
% put them together on one string
big_key = [];
for b = 1:numel(bigrams1)
big_key = [big_key, bigrams1{b}];
end
N_big_key = numel(big_key);
DiceCoefficient = cellfun(@(x) FastDiceCoefficient(x, big_key, N_big_key),TargetList);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%ComputeLevDamDistance FUNCTION
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function NormalizedDistance = ComputeLevDamDistance(Target, key, Nk)
Nt = numel(Target);
% compute demerau lev. disdtance
d = zeros(Nt+1,Nk+1);
% initialize distance matrix
d(1,:) = 0:Nk;
d(:,1) = 0:Nt;
for i=2:Nt+1
for j=2:Nk+1
if key(j-1) == Target(i-1)
cost = 0;
else
cost = 1;
end
d(i,j)=min([ ...
d(i, j-1) + 1, ... % deletion
d(i-1, j) + 1, ... % insertion
d(i-1,j-1) + cost ... % substitution
]);
if (i-1)>1 && (j-1)>1 && i-1 <= Nk && j-1 <= Nt && key((i-1))==Target((j-1)-1) && key((i-1)-1) == Target(j-1)
% transposition
d(i,j) = min([d(i,j), d(i-1,j-1)]);
end
end
end
% normalize the distance
NormalizedDistance = 1 - (d(Nt+1,Nk+1)/max(Nt, Nk));
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%GetSingleWords FUNCTION
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function words = GetSingleWords(s)
space_s = [0 regexp(s,'\s') length(s)+1];
words = cell(1,length(space_s) - 1);
for w = 1:length(space_s)-1
words{w} = s(space_s(w)+1:space_s(w+1)-1);
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%GetBigrams FUNCTION
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function bigram = GetBigrams(s)
bigram{1} = s(1:end);
for i = 1:length(s)-1
bigram{i} = s(i:i+1);
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%FastDiceCoefficient FUNCTION
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function DiceCoefficient = FastDiceCoefficient( Target, big_key, N_big_key)
words2 = GetSingleWords(Target);
% for each word, find all bigrams
bigrams2 = cellfun(@GetBigrams, words2, 'UniformOutput',0);
% put them together on one cell
big2 = [];
for b = 1:numel(bigrams2)
big2 = [big2, bigrams2{b}];
end
DiceCoefficient = sum([ismember(big2,big_key), ismember(big_key,big2)])/(N_big_key + numel(big2));

Answers (0)

Categories

Find more on Fuzzy Logic Toolbox in Help Center and File Exchange

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!