Strings from a text file to a matrix containing double precision floating numbers

Asked by Thomas on 15 Jan 2013

Hi

I have a text file containing a text header, and rows containing numeric values, with varying numbers of values, characters and numeric formats:

# Bundle file v0.3
9 2532
6.8302313857e+002 -1.4826175815e-001 8.1715222947e-002
9.3709731863e-001 -2.8772865743e-001 -1.9763814183e-001
194 144 45
5 6 1496 289.0000 199.0000 7 1235 308.0000 125.0000 5 1614 285.0000 163.0000 4 2122 173.0000 142.0000 0 911 148.5000 165.5000
2.4321163035e+000 -9.1469082482e-001 -6.6122261943e+000
219 194 76

I want to remove the header and store the numeric values in a matrix, one file row per matrix row, padded with NaNs to compensate for the differing row lengths. At present, I am using this code:

    % open file and save contents to cell array, c
    fid = fopen('C:\transform\bundle.out','r');
    c = textscan(fid,'%s','delimiter', '','whitespace','');
    fclose(fid);
    %create m x 1 cell C and remove the header
    C = c{1};
    C(1,:)=[];
    % convert C to a matrix using cell2mat / cellfun
    maxLength=max(cellfun(@(x)numel(x),C));
    out = cell2mat(cellfun(@(x)cat(2,x,zeros(1,maxLength-length(x))),C,'UniformOutput',false));

The problem with this approach is that it creates a character array in which each row is a single string, so I cannot use str2num or str2double to convert the numeric values to individual doubles (they return [] / NaN because the row as a whole does not parse as a number). That is, it produces:

   '9 2532                                                 ';
   '6.8302313857e+002 -1.4826175815e-001 8.1715222947e-002 ';
   '9.3709731863e-001 -2.8772865743e-001 -1.9763814183e-001';

rather than:

   '9' '2532';
   '6.8302313857e+002' '-1.4826175815e-001' '8.1715222947e-002';
   '9.3709731863e-001' '-2.8772865743e-001' '-1.9763814183e-001';

I can work around this by separating each row into its own row vector (e.g. out1,..,outn) and then using:

    splitstring = textscan(out1,'%s');
    splitstring = splitstring{1};

I then use str2double and flipdim or similar to return rows of doubles, and vertcat with NaN padding to get the desired matrix, but this seems very unwieldy to code. Can anyone suggest a simpler way of getting the desired output? Any suggestions would be appreciated.

Thomas


3 Answers

Answer by Thomas on 16 Jan 2013
Accepted answer

I have worked out the answer for those with a similar problem:

I use textscan and cellfun to split the strings, de-nest and rearrange the output using vertcat and cellfun/transpose, then convert the single strings to doubles using cellfun/str2double:

    fid = fopen('C:\transform\bundle.out','r');
    c = textscan(fid,'%s','delimiter', '','whitespace','', 'HeaderLines', 1);
    fclose(fid);
    C = c{1};
    C = cellfun(@(x) textscan(x,'%s','Delimiter', ' ')',C ,'UniformOutput',false);
    Y = vertcat(C{:}); 
    X = cellfun(@transpose,Y,'UniformOutput',false);
    Z = cellfun(@str2double,X,'UniformOutput',false);

The output matrix can then be built with cellfun/cell2mat, padding each row to the maximum row length (maxLength):

    maxLength=max(cellfun(@(x)numel(x),Z));
    out = cell2mat(cellfun(@(x)cat(2,x,zeros(1,maxLength-length(x))),Z,'UniformOutput',false));

Note this code pads out the values with zeros rather than NaNs.
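
If NaN padding is wanted, as in the original question, one option is to swap zeros for nan in the padding step. The following is a minimal sketch, assuming Z is an m-by-1 cell array of numeric row vectors as produced above; the sample values and the helper name padRow are made up for illustration:

```matlab
% Stand-in for Z: one numeric row vector per data line (illustrative values)
Z = {[9 2532]; [683.02 -0.148 0.0817]; [5 6 1496 289]};

% The longest row determines the number of columns
maxLength = max(cellfun(@numel, Z));

% Pad each row on the right with NaN instead of zeros
% (padRow is a hypothetical helper name, not from the original code)
padRow = @(x) [x, nan(1, maxLength - numel(x))];
out = cell2mat(cellfun(padRow, Z, 'UniformOutput', false));
```

Here out has one row per cell of Z, with the short rows filled out by NaN.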

Answer by per isakson on 15 Jan 2013
Edited by per isakson on 17 Jan 2013

If the file isn't huge (compared to available RAM and address space) and you have an idea of the maximum number of columns and rows, then I guess the simplest way is to loop over all rows.

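    M = nan( nrow, ncol ); % allocate memory, pre-filled with NaN
    fid = fopen( ... );
    str = fgetl( fid ); % skip the header line
    row = 0;
    while not( feof(fid) )
       row = row + 1;
       str = fgetl( fid );           % read one line as text
       val = sscanf( str, '%f' );    % parse all numbers on that line
       M( row, 1:numel(val) ) = val; % remaining columns stay NaN
    end
    fclose( fid );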

Then trim the unused rows and columns of M. Something like this.
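
The trimming step could look like the sketch below. It assumes deliberately generous guesses for nrow and ncol and uses a small in-memory cell array, rawLines, in place of the file; both are illustrative choices rather than part of the original code.

```matlab
% Illustrative input: data lines as fgetl would return them
rawLines = {'9 2532', '6.83e+002 -1.48e-001 8.17e-002', '194 144 45'};

nrow = 10;  ncol = 8;          % over-sized guesses
M = nan(nrow, ncol);           % preallocate, pre-filled with NaN

for row = 1:numel(rawLines)
    val = sscanf(rawLines{row}, '%f');   % column vector of the parsed numbers
    M(row, 1:numel(val)) = val;          % element counts match, so the row is filled
end

% Trim: drop the columns and rows that were never written (still all NaN)
M(:, all(isnan(M), 1)) = [];
M(all(isnan(M), 2), :) = [];
```

One caveat of trimming this way: a data line that yields no numbers at all would leave an all-NaN row, which is removed along with the unused rows.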

[Edit: 2013-01-16]

Working code

Here is a comparison between three solutions. The first two, cssm and cssm1, follow my outline above. The last, OP, is the one proposed by the OP. I ran this script a few times.

    %% read ragged text file
    clc
    tic, M1 = cssm; toc
    tic, M2 = cssm1(  10000,  100 ); toc
    tic, M3 = cssm1( 100000, 1000 ); toc
    tic, M4 = OP(); toc

which returns

    Elapsed time is 0.238691 seconds.
    Elapsed time is 0.131869 seconds.
    Elapsed time is 0.960397 seconds.
    Elapsed time is 0.709025 seconds.

The output is

    >> whos
      Name         Size             Bytes  Class     Attributes
      M1        2464x21            413952  double              
      M2        2464x21            413952  double              
      M3        2464x21            413952  double              
      M4        2464x21            413952  double              

In cssm.m, the required numbers of rows and columns are determined in two separate steps, each of which reads the file. Thus the function cssm reads the file three times in total.

With cssm1 the number of rows and columns are guessed. In one case the "guesses" are 4x the actual size and in the other 40x.

The function, OP, is OP's code made into a function and ZEROS replaced by NAN to honor the question.

With about 2500 rows, cssm is three times faster than the loop-free code (OP). cssm1 is five times faster when allocating 4x4 times more memory than needed, and a bit slower than the loop-free code when allocating 40x40 times more memory.

Conclusions:

  • Loops are not always slow.
  • Reading from the file cache is fast.
  • Code with loops is often easier to write and understand (IMO).
  • Don't hesitate to use the RAM if it is available.

The files involved are

    function  M = cssm()
        fid = fopen( 'cssm.txt' );
        cup = onCleanup( @() fclose( fid ) );
        cac  = textscan( fid, '%s', 'Delimiter', '\n', 'HeaderLines', 1 );
        nrow = numel( cac{:} ); 
        clear cup
        fid = fopen( 'cssm.txt' );
        cup = onCleanup( @() fclose( fid ) );
        [~] = fgetl( fid );         % header line
        ncol = 0;
        while not( feof( fid ) )
           ncol = max( ncol, numel( sscanf( fgetl(fid), '%f' ) ) );
        end
        clear cup
        M = cssm_( nrow, ncol );
    end
    function  M = cssm_( nrow, ncol )
        M  = nan( nrow, ncol );      % allocate memory
        fid = fopen( 'cssm.txt' );
        cup = onCleanup( @() fclose( fid ) );
        [~] = fgetl( fid );         % header line
        row = 0;
        while not( feof( fid ) )
           row = row + 1;
           val = sscanf( fgetl(fid), '%f' );
           M( row, 1:numel(val) ) = val;
        end    
    end

and

    function  M = cssm1( nrow, ncol )
        M   = nan( nrow, ncol );      % allocate memory
        fid = fopen( 'cssm.txt' );
        cup = onCleanup( @() fclose( fid ) );
        [~] = fgetl( fid );         % header line
        row = 0;
        while not( feof( fid ) )
           row = row + 1;
           val = sscanf( fgetl(fid), '%f' );
           M( row, 1:numel(val) ) = val;
        end    
        M( :, all( isnan( M ), 1 )    ) = [];
        M(    all( isnan( M ), 2 ), : ) = [];
    end

The text file, cssm.txt, contains 2465 lines: repetitions of OP's data.
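
To reproduce the timings, a test file of this shape could be generated roughly as follows. This is a guess at how cssm.txt was built: one header line, then the seven data lines from the question repeated 352 times (7 x 352 = 2464 rows). The names block, nrep and fname are illustrative, and the file is written to a temporary path rather than to cssm.txt.

```matlab
% The seven data lines from the question, one per cell
block = {
    '9 2532'
    '6.8302313857e+002 -1.4826175815e-001 8.1715222947e-002'
    '9.3709731863e-001 -2.8772865743e-001 -1.9763814183e-001'
    '194 144 45'
    '5 6 1496 289.0000 199.0000 7 1235 308.0000 125.0000 5 1614 285.0000 163.0000 4 2122 173.0000 142.0000 0 911 148.5000 165.5000'
    '2.4321163035e+000 -9.1469082482e-001 -6.6122261943e+000'
    '219 194 76'
    };

nrep  = 352;                   % assumed repetition count: 7 * 352 = 2464 data rows
fname = [tempname '.txt'];     % temporary file, not the original cssm.txt

fid = fopen(fname, 'w');
fprintf(fid, '# Bundle file v0.3\n');      % header line from the question
for k = 1:nrep
    fprintf(fid, '%s\n', block{:});        % format is reused for each cell
end
fclose(fid);
```

Copying or renaming the resulting file to cssm.txt would let the functions above run unchanged.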

2 Comments

Thomas on 16 Jan 2013

Thanks for your response

Unfortunately, the number of rows is unknown, as is the number of variables and characters in each row (i.e. the example in the original question). A for loop may work, though acting on the cell array might be more RAM friendly. I'll have a look at a possible solution.

per isakson on 16 Jan 2013

I have added working code above to illustrate the approach I proposed.

Answer by Ryan Livingston on 15 Jan 2013

Will think more about the harder question of formatting the numeric data but you could use the properties 'CommentStyle' and/or 'HeaderLines' to skip your header.

Missing numeric fields are indeed padded with NaNs by default so doing:

    a = textscan(fid, '%f %f %f\n',1,'HeaderLines',1)

returns:

    a = 
        [9]    [2532]    [NaN]

This is controlled by the property 'EmptyValue'. Getting the right format string and properties will do all of the padding for you.
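
As a small illustration of 'EmptyValue', the sketch below parses a character vector instead of a file and uses a comma delimiter so that the empty field is unambiguous. Both choices are assumptions made for the example and differ from the space-delimited file in the question.

```matlab
txt = '1,,3';   % three comma-separated fields, the middle one empty

% Default behaviour: the empty numeric field is returned as NaN
a = textscan(txt, '%f %f %f', 'Delimiter', ',');

% With 'EmptyValue', the empty field is returned as the given value instead
b = textscan(txt, '%f %f %f', 'Delimiter', ',', 'EmptyValue', -1);
```

In both calls the result is a 1-by-3 cell array; only the content of the second cell differs.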

Could you elaborate on the desired format of the output array? Are you viewing the text file as a matrix and you would like the dimensions to be number_of_lines - by - max_number_of_values (8 -by- 16 in this example) or something else?

1 Comment

Thomas on 16 Jan 2013

Hi

The desired output would be number of rows (unknown) by maximum number of values:

    [9 2532 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN;
    6.83e+002 -1.48e-001 8.17e-002 NaN NaN NaN NaN NaN NaN NaN NaN NaN;
    9.37e-001 -2.87e-001 -1.97e-001 NaN NaN NaN NaN NaN NaN NaN NaN NaN;
    194 144 45 NaN NaN NaN NaN NaN NaN NaN NaN NaN;
    5 6 1496 289.0000 199.0000 7 1235 308.0000 125.0000 5 1614 285.0000]

With the maximum number of values in this case being 11 (< max row padded with NaN).
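
When neither the number of rows nor the maximum row length is known in advance, one possible approach is to parse every line first and size the matrix from the result. This is a sketch under assumptions: rawLines stands in for the cell array of line strings that textscan(fid,'%s','Delimiter','\n','HeaderLines',1) would return, and the variable names are illustrative.

```matlab
% Stand-in for the lines read from the file (header already skipped)
rawLines = {'9 2532'; '6.83e+002 -1.48e-001 8.17e-002'; ...
            '194 144 45'; '5 6 1496 289.0000 199.0000'};

% Parse each line into a numeric row vector
vals = cellfun(@(s) sscanf(s, '%f').', rawLines, 'UniformOutput', false);

% Exact dimensions come from the parsed data, so no guessing is needed
nrow = numel(vals);
ncol = max(cellfun(@numel, vals));

% Fill a NaN matrix row by row; short rows keep their trailing NaNs
M = nan(nrow, ncol);
for k = 1:nrow
    M(k, 1:numel(vals{k})) = vals{k};
end
```

The cost of this variant is that all parsed rows are held in a cell array before the matrix is allocated.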
