Merging the content of two tables

5 views (last 30 days)
William Schaefer
William Schaefer on 18 Oct 2017
Edited: Life is Wonderful on 15 May 2020
I have two tables Ta and Tb that have different number of rows and different number of columns. The content is all cell text, but maybe in the future it could also contain cell number.
.
I want to merge the content of these tables together under the following set of rules:
  • Take the value of Ta(i,j) if Tb(i*,j*) is empty and vice versa.
  • If both are available, then take the value of Ta(i,j) (or optionally, check whether they are the same).
.
The tricky part however is that we do not have unique row keys, we only have unique column keys. Note above that I make a distinction between i* and i. The reason is that the row in Ta can be at a different index than Tb, same holds for the columns j* and j. The implications are:
  • that we first need to identify which row for Ta corresponds to the row of Tb and vice versa. We can do this by trying to crossmatch any of the columns that the tables share in common.
To make it also harder:
  • the number of rows are not equal for the tables (sometimes there are rows in Ta but not in Tb and vice versa).
.
How can we merge the content of these two tables together in the most efficient way?
.
Example variables to play with:
Ta = cell2table({...
'a1', 'b1', 'c1'; ...
'a2', 'b2', 'c2'}, ...
'VariableNames', {'A','B', 'C'})
Tb = cell2table({...
'b2*', 'c2', 'd2'; ...
'b3', 'c3', 'd3'; ...
'b4', 'c4', 'd4'}, ...
'VariableNames', {'B','C', 'D'})
The resulting table Tc should be something like this:
Tc = cell2table({...
'a1' 'b1' 'c1' ''; ...
'a2' 'b2' 'c2' 'd2'; ...
'' 'b3' 'c3' 'd3'; ...
'' 'b4' 'c4' 'd4'}, ...
'VariableNames', {'A', 'B','C', 'D'})
  6 Comments
Guillaume
Guillaume on 19 Oct 2017
My understanding is that you get a match if any key in the keyset match, not just all the keys (or all the non-null keys which would be more manageable). When one key matches but not the other you take any of the value as the value for the result.
William Schaefer
William Schaefer on 19 Oct 2017
Edited: William Schaefer on 19 Oct 2017
@Peter Perkins:
the basic goal is to combine information coming from two tables as efficient as possible.
The crossmatch was just an idea to identify which row in table 1 corresponds to row in table 2. The reason why we need to do this is that we do not have unique row keys, and we also cannot construct ones from our data: we often have missing values in either 1 or 2, making it hard to use one of the columns as row key identifiers.
As for the values, you can use whatever you want foo, bla, bar, x1, x5, the idea was just to show that the values are supposed to be different, despite that c2 is here used to identify that we are talking about some second "row key", but we have different values for the column B.
.
@Guillaume: that is correct, we take one or the other.

Sign in to comment.

Answers (1)

KSSV
KSSV on 18 Oct 2017
Tc = outerjoin(Ta,Tb,'MergeKeys', 1)
  6 Comments
William Schaefer
William Schaefer on 18 Oct 2017
Edited: William Schaefer on 18 Oct 2017
Okay that makes sense. I see more what you mean but I am not entirely getting how to merge all rows which have at least one identical value in same column, regardless of the values in other column.
As for the indexing, I am not sure how to index it as there is not really a unique row key. Sometimes it is the ISIN, sometimes the SECCODE, it really depends on what is available and within the table there can be many gaps. There is not one column that is available for each row, and especially not cross-table.
That being said, I have tried to minimize the for loops in the code. This is my best effort so far (reduced the time to 2.7 minutes):
function Tc = mergecontentTables2(Ta, Tb, colsToCheck)
% A generic function to merge the content of the tables. This is a work
% under progress. Current best run time: 4.5 minutes for 16000 x 29.
%
% The following inputs are relevant:
%
% - Ta/Tb: a table with independent dimensions
%
% - colsToCheck: which columsn to check for common rows (e.g. identifiers
% SECCODE and TICKER_BB)
% outer join the tables
Tbig = outerjoin(Ta, Tb, 'MergeKeys', true);
colsAll = Tbig.Properties.VariableNames;
if nargin < 3
colsToCheck = colsAll;
end
t = 1;
T = height(Tbig);
nbMerged = 1;
while t < T
% check if we can find a common row
if mod(nbMerged, 100)==0; fprintf(' * %4.2f (%.0f merged)\n',t/T, nbMerged); end
commonrows = {};
for c=1:length(colsToCheck)
colname = colsToCheck{c};
value = Tbig.(colname)(t);
if ~isempty(char(value))
commonrows = find(strcmp(Tbig.(colname), value));
if length(commonrows)>1
break
end
end
end
% for the common row, merge the diff, delete the row and append the
% merged
LCR = length(commonrows);
if LCR>1
mergedRow = cell2table( repmat({''},1,length(colsAll)), 'VariableNames', colsAll);
for c=1:length(colsAll)
colname = colsAll{c};
uvalues = unique(Tbig.(colname)(commonrows));
binzero = strcmp(uvalues,{''});
uvalues(binzero)=[];
if ~isempty(uvalues)
mergedRow.(colname) = uvalues(1);
end
end
% update table
Tbig(commonrows,:) = [];
T = T - LCR;
t = t - 1;
Tbig = [Tbig; mergedRow]; %#ok<AGROW>
T = T + 1;
nbMerged = nbMerged + LCR;
end
% ++
t = t + 1;
end
Tc = Tbig;
end
Life is Wonderful
Life is Wonderful on 14 May 2020
Edited: Life is Wonderful on 15 May 2020
Suggestion :
Original Requirement was
A B C D
__________ ______ ______ __________
{'a1' } {'b1'} {'c1'} {0×0 char}
{'a2' } {'b2'} {'c2'} {'d2' }
{0×0 char} {'b3'} {'c3'} {'d3' }
{0×0 char} {'b4'} {'c4'} {'d4' }
Actual Output of
Tbig =
4×4 table
A B C D
__________ ______ ______ __________
{'a1' } {'b1'} {'c1'} {0×0 char}
{0×0 char} {'b3'} {'c3'} {'d3' }
{0×0 char} {'b4'} {'c4'} {'d4' }
{'a2' } {'b2'} {'c2'} {'d2' }
The incorrect/Bug outcome is from
Tbig = [Tbig; mergedRow]; %#ok<AGROW>
Suggestion and proposal is
Tbig = outerjoin(Tbig,mergedRow,'MergeKeys', true);
You get the desired outcome
A B C D
__________ ______ ______ __________
{'a1' } {'b1'} {'c1'} {0×0 char}
{'a2' } {'b2'} {'c2'} {'d2' }
{0×0 char} {'b3'} {'c3'} {'d3' }
{0×0 char} {'b4'} {'c4'} {'d4' }

Sign in to comment.

Categories

Find more on Creating, Deleting, and Querying Graphics Objects in Help Center and File Exchange

Products

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!