Convert cell array of mostly strings into matrix of numbers

3 views (last 30 days)
I'm trying to write a general file loading tool, that loads up large .csv data files of unknown type. So I know they are data files with headers, units and big table of data. But I don't know the delimiter, or the number of header lines etc.
So I load up the files line by line using:
NewLine = fgets(CurrentFileID);
I then strtrim each line and save it to WholeFile.
This means WholeFile is a large 1 column cell array that I can process.
I work through WholeFile and identify the delimiter, header row, units row etc. I then get to the point where I want to convert the bulk of WholeFile into a matrix of numbers. A couple of problems:
1) WholeFile is 1 column of data separated by delimiters, so it needs to separated out. The delimiter is defined by the string called: 'Delimiter' whose contents is the delimiter string.
2) Some of WholeFile is non-numbers that need to be replaced with 0.
I've tried:
1)
%Looping line by line and running:
CurrentRow = char(NonDelimitedData(ii));
CurrentRowSplit = strsplit(CurrentRow, sprintf(Delimiter));
cell2mat(CurrentRowSplit) %But this just puts everything together into 1 long cell, not separate doubles.
2)
text scan such as this:
Temp = textscan(CurrentRow, '%s','delimiter',sprintf(Delimiter))';
Data(ii,1:nColumns) = Temp{1,1}'
but I cant convert the data type into doubles or 0 if its not a number.
3)
looping line by line and entry by entry! Which works perfectly but takes ages!
NonDelimitedData = WholeFile(nFirstDataRow:end,:); %INPUT - From end of header info to end of file.
Data = zeros(length(NonDelimitedData),nColumns); %OUTPUT - Prealocated
%Loop line by line through data
for ii = 1 : length(NonDelimitedData)
CurrentRow = char(NonDelimitedData(ii));
CurrentRowSplit = strsplit(CurrentRow, sprintf(Delimiter));
%Loop entry by entry through each line
for iii = 1 : nColumns
CurrentEntry = str2double(CurrentRowSplit{1,iii});
if isempty(CurrentEntry)
CurrentEntry = 0;
end
Data(ii,iii) = CurrentEntry;
end
end
Can anyone work out a better way of doing what option 3 above does? As it just takes far to long.
  3 Comments
Gregory Smith
Gregory Smith on 20 Feb 2017
Edited: Walter Roberson on 20 Feb 2017
Thanks Walter that's helped. I've implemented and simplified the code a little. It still required a loop within a loop through which is the problem.
%Extract data
NonDelimitedData = WholeFile(nFirstDataRow:end,:); %INPUT - From header info to end of file.
Data = zeros(length(NonDelimitedData),nColumns); %OUTPUT - Prealocated
%Loop line by line through data
for ii = 1 : length(NonDelimitedData)
CurrentRow = char(NonDelimitedData(ii));
CurrentRowSplit = strsplit(CurrentRow, sprintf(Delimiter));
%Loop entry by entry through each line
for iii = 1 : nColumns
CurrentEntry = str2double(CurrentRowSplit{1,iii});
Data(ii,iii) = CurrentEntry;
end
end
%Replace any NaN's with 0's
Data(isnan(Data)) = 0;
Gregory Smith
Gregory Smith on 20 Feb 2017
Edited: Walter Roberson on 20 Feb 2017
I have also found that textscan used like this works when looping line by line. It avoids looping through each element within each line but somehow takes longer. So I guess it doesn't really get my anywhere.
Temp = textscan(CurrentRow, '%s','delimiter',sprintf(Delimiter));
Data(ii,:) = str2double(Temp{1,1})';

Sign in to comment.

Answers (2)

Guillaume
Guillaume on 17 Feb 2017
Have you tried using readtable which is usually clever enough to figure out the format on its own and would avoid you having to do any work?

Walter Roberson
Walter Roberson on 20 Feb 2017
filecontent = fileread('NameOfYourFile.txt');
content_as_lines = regexp(filecontent, '\r?\n', 'split');
content_as_fields = regexp(content_as_lines, Delimiter, 'split');
nline = length(content_as_fields);
val_cell = cell(nline, 1);
for K = 1 : nline
line_as_numeric = str2double(content_as_fields{K});
mask =
line_as_numeric(isnan(line_as_numeric)) = 0;
val_cell{K} = line_as_numeric;
end
You cannot get away from looping over the lines because you could have a different number of columns for each line and str2double() can only handle a single string or a rectangular cell array (because it wants to create a purely numeric matrix and purely numeric matrices must be uniform size.)
  3 Comments
Gregory Smith
Gregory Smith on 21 Feb 2017
Edited: Walter Roberson on 22 Feb 2017
Walter,
For example, this works:
NonDelimitedData = WholeFile(nFirstDataRow:end,:); %Cut out headers
DelimitedData = regexp(NonDelimitedData,sprintf(Delimiter),'split'); %Apply delimiter
% Convert single column of 1*nColumns cells into n*nColumns cell array of strings
for ii = 1 : length(DelimitedData)
SeperatedArray(ii,1:nColumns) = DelimitedData{ii};
end
Data = str2double(SeperatedArray);
Data(isnan(Data)) = 0;
But there must be away of getting rid of that loop in the middle. I'm trying to convert single column of 1*nColumns cells into n*nColumns cell array of strings. There must be a way to do this without using a loop.
Thanks for your help.
Walter Roberson
Walter Roberson on 22 Feb 2017
"I don't understand why I have to go line by line, as there is a constant number of columns"
That is new and important information for this purpose. You had indicated that the files were of "unknown type". csv files are not restricted to having a constant number of fields. csv files are also required to use "," as the delimiter, whereas you have indicated that the delimiter is flexible and available in the variable Delimiter. csv files that are output from spreadsheet software do typically have a consistent number of fields per line, but that is not a hard requirement. (And, interestingly, in .xlsx format, cells that are empty and have no special formatting or macro information are literally not output in the .xlsx...)
In the case where S represents the newline-delimited string of file content with the headers removed, and there are a consistent number of columns, then:
temp = regexp(regexp(S,'\r?\n', 'split'), Delimiter, 'split');
results = str2double( vertcat(temp{:}) );

Sign in to comment.

Categories

Find more on Data Type Conversion in Help Center and File Exchange

Products

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!