Reading a very large text file of an almost regular data with empty value

4 views (last 30 days)
Hello everybody,
I am trying to import an almost regular matrix into matlab. I used textscan with EmptyValue option to do it.
But it always give a error message 'badly formated string'. I do not understand why. Could you please give me a hand.
Below is the data file. The problem with the text file is:
first, there is empty in it. It would be better if I can get a NaN or 0 to replace the empty at the end
Second, between the column 3 and 4, sometimes, the values are attached. It also makes the input difficult.
4.417E-03 1.000E+00 2.200E+05 462 2.543878E+00 5.440884E+01
4.417E-03 1.000E+00 2.200E+05 468 2.544193E+00 7.315421E+01
4.417E-03 1.000E+00 2.200E+05 687 2.255183E+00 5.011286E+01
4.417E-03 1.000E+00 2.200E+05 943 7.015397E+00
4.417E-03 1.000E+00 2.200E+05 947 1.877077E+01
4.417E-03 1.000E+00 2.200E+0511135 2.543452E+00
4.417E-03 1.000E+00 2.200E+0511138
4.417E-03 1.000E+00 2.200E+0511141
4.417E-03 1.000E+00 2.200E+0511144 2.543891E+00 4.701584E+01
4.417E-03 1.000E+00 2.200E+0511351 2.255163E+00 4.291446E+01
4.417E-03 1.000E+00 2.200E+05 1591 2.544160E+00 2.182716E+01
4.417E-03 1.000E+00 2.200E+05 1596 2.543892E+00 3.667904E+01
4.417E-03 1.000E+00 2.200E+05 1598
4.417E-03 1.000E+00 2.200E+05 2350
4.417E-03 1.000E+00 2.200E+05 2356
4.417E-03 1.000E+00 2.200E+05 2522
4.417E-03 1.000E+00 2.200E+05 2711
The matrix I wanna obtain is
4.417E-03 1.000E+00 2.200E+05 462 2.543878E+00 5.440884E+01
4.417E-03 1.000E+00 2.200E+05 468 2.544193E+00 7.315421E+01
4.417E-03 1.000E+00 2.200E+05 687 2.255183E+00 5.011286E+01
4.417E-03 1.000E+00 2.200E+05 943 7.015397E+00 NaN
4.417E-03 1.000E+00 2.200E+05 947 1.877077E+01 NaN
4.417E-03 1.000E+00 2.200E+05 11135 2.543452E+00 NaN
4.417E-03 1.000E+00 2.200E+05 11138 NaN NaN
4.417E-03 1.000E+00 2.200E+05 11141 NaN NaN
4.417E-03 1.000E+00 2.200E+05 11144 2.543891E+00 4.701584E+01
4.417E-03 1.000E+00 2.200E+05 11351 2.255163E+00 4.291446E+01
4.417E-03 1.000E+00 2.200E+05 1591 2.544160E+00 2.182716E+01
4.417E-03 1.000E+00 2.200E+05 1596 2.543892E+00 3.667904E+01
4.417E-03 1.000E+00 2.200E+05 1598 NaN NaN
4.417E-03 1.000E+00 2.200E+05 2350 NaN NaN
4.417E-03 1.000E+00 2.200E+05 2356 NaN NaN
4.417E-03 1.000E+00 2.200E+05 2522 NaN NaN
4.417E-03 1.000E+00 2.200E+05 2711 NaN NaN
Ps. the file is very large. Data contains thousands of rows and columns. Should I do some optimisation for reading files? Thanks in advance to help me. Thank you very much.
[EDITED]
Format='%*10E %10E %9f %5d %E %E'
opt = {'EmptyValue',NaN,'CollectOutput',1};
tmp = textscan(fid,Format,'Delimiter','','Whitespace','',opt{:});

Accepted Answer

Jacob Palczynski
Jacob Palczynski on 25 Aug 2011
Since your matrix is just almost regular, you will not be able to work with textscan easily. I guess the problem with EmptyValue is that there are no delimiters for the empty fields in your file.
I would suggest to use fgetl, and extract digits from the resulting char array with regexp. Then you get a cell array of char arrays, which you can convert to doubles.
For example:
clear
fid = fopen('test.txt','rt');
% maximum number of columns
maxlength = 6;
% preallocation
step = 2;
tm = nan(step,maxlength);
% reading line by line
k = 1;
while ~feof(fid)
thisline = fgetl(fid);
thisline = regexp(thisline,'\d*','match');
thisline = cellfun(@str2num,thisline);
tm(k,1:length(thisline)) = thisline;
r = size(tm,1);
% need to preallocate more?
if k == r
tm((r+1):(r+step), 1:maxlength) = nan;
end
k = k + 1;
end
fclose(fid);
For large files it makes sense to increase step.
  8 Comments
gringoire
gringoire on 26 Aug 2011
Otherwise, do you know how we can do the optimization for the time to read the text? Since the file has 500*10000 data to be read...
Jacob Palczynski
Jacob Palczynski on 26 Aug 2011
When I run the script in the profiler with a 10000*6-data-file, most time (~20 seconds) is spend in the cellfun line. Only 7 seconds are needed to read the file.
Hence, the conversion which we need due to the data format takes most time.

Sign in to comment.

More Answers (3)

Oleg Komarov
Oleg Komarov on 25 Aug 2011
The txt I created is (with a newline character at the end, otherwise won't read properly):
4.417E-03 1.000E+00 2.200E+05 462 2.543878E+00 5.440884E+01
4.417E-03 1.000E+00 2.200E+05 468 2.544193E+00 7.315421E+01
4.417E-03 1.000E+00 2.200E+05 687 2.255183E+00 5.011286E+01
4.417E-03 1.000E+00 2.200E+05 943 7.015397E+00
4.417E-03 1.000E+00 2.200E+05 947 1.877077E+01
4.417E-03 1.000E+00 2.200E+0511135 2.543452E+00
4.417E-03 1.000E+00 2.200E+0511138
4.417E-03 1.000E+00 2.200E+0511141
4.417E-03 1.000E+00 2.200E+0511144 2.543891E+00 4.701584E+01
4.417E-03 1.000E+00 2.200E+0511351 2.255163E+00 4.291446E+01
4.417E-03 1.000E+00 2.200E+05 1591 2.544160E+00 2.182716E+01
4.417E-03 1.000E+00 2.200E+05 1596 2.543892E+00 3.667904E+01
4.417E-03 1.000E+00 2.200E+05 1598
4.417E-03 1.000E+00 2.200E+05 2350
4.417E-03 1.000E+00 2.200E+05 2356
4.417E-03 1.000E+00 2.200E+05 2522
4.417E-03 1.000E+00 2.200E+05 2711
.
fid = fopen('test.txt');
% Import 3rd and 4th column as fixed char, the other directly as doubles
out = textscan(fid,'%f%f%9c%5c%f%f','EmptyValue',NaN);
% Line of blank to pad char fields
pad = repmat(' ',1,numel(out{1}));
% Read in the fixed width char fields as double
out{3} = cell2mat(textscan([pad; out{3}.'],'%f'));
out{4} = cell2mat(textscan([pad; out{4}.'],'%f'));
fclose(fid);
cell2mat(out)
  4 Comments
Oleg Komarov
Oleg Komarov on 26 Aug 2011
Works also on the data posted by gringoire except I have to add a newline at its end.
Soni huu
Soni huu on 29 Jun 2012
Edited: Soni huu on 29 Jun 2012
what about this, can you solve?? this data is 9 column
02:38:00 R- .026 065.457 01** 4862 0097 0074 +19
02:39:00 NaN .000 065.457 01** 4862 0101 0074 +19
02:40:00 NaN .000 065.457 01** 4862 0099 0074 +19
02:41:00 NaN .000 065.457 01** 4862 0097 0074 +19
02:42:00 R- .129 065.459 01** 4862 0111 0074 +19
02:43:00 R- .051 065.460 01** 4862 0099 0074 +19
note: NaN is empty or 2space(" ")
thanks b4

Sign in to comment.


Jan
Jan on 25 Aug 2011
Please post the command, which fails. It is impossible to guess exactly, what you have done.
This is a serious problem: "Second, between the column 3 and 4, sometimes, the values are attached. It also makes the input difficult." It is not trivial to split the string "2.200E+0511135". Without additional restirctions, it is even impossible. I assume, the exponent is limited to 2 digits - can you confirm this? Actually values > 1.0e100 are valid, but then it I'd consider the data file as damaged.
The file uses 4 significant digits per number in the leading columns. This is rather inprecise. I'd never draw any conclusions about the regularity of a large matrix with millions of elements, if the data are represented with such a low accuracy. Do you see any possibilities to obtain the values with SINGLE or better DOUBLE precision?
It is impossible to use TEXTSCAN due to the touching numbers. Therefore you have to insert a space after the 29th character at first and create a new file:
fidIN = fopen(FileName1, 'r');
if fidIN < 0, error('Cannot open file %s', FileName1); end
fidOUT = fopen(FileName2, 'w'); % EDITED: 'r'->'w'
if fidOUT < 0, error('Cannot open file %s', FileName2); end
while 1
S = fgetl(fidIN);
if ~ischar(S)
break;
end
% Split '4.417E-03 1.000E+00 2.200E+0511135'
S = [S(1:29), ' ', S(30:end)];
fwrite(fidOUT, S, 'uchar');
fwrite(fidOUT, 10, 'uchar'); % Or [13,10] for DOS line breaks
end
fclose(fidIN);
fclose(fidOUT);
Then TEXTSCAN has a chance to read the data. You need the 'MultipleDelimsAsOne' and the 'EmptyValue' flags.
PS. Please tell the programer who has created the file, that the format is trashy. It is near to be unusable. The strategy to format ASCII files are easy: Let the file be readable with the minimum number of rules. But you have repeated delimiters, missing data, touching columns and a large data set with low precision.
  5 Comments
Jan
Jan on 25 Aug 2011
@gringoire: Talking in English about MATLAB is not easy.
Look at "help textscan" -> Supported FORMAT specifiers: http://www.mathworks.de/help/techdoc/ref/textscan.html
There is no "%E" specifier. You can define the width as in "%10E" in newer MATLAB versions only - e,g, not in 2009a.
Your format specifier belong to FSCANF - which is btw. also a good idea to read the file line by line.

Sign in to comment.


Antti
Antti on 25 Aug 2011
Hi,
I think Jan's example code is good, but I suppose 3rd line should be:
fidOUT = fopen(FileName2, 'w'); % 'w' like write

Categories

Find more on Data Import and Export in Help Center and File Exchange

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!