Reading a file and deleting lines of data that were written incorrectly

I am processing data from an instrument that periodically has lines (csv data file) where the data was written incorrectly. I have been deleting these by hand in a text editor but would like to automate the quality control.
I have tried looking at the fopen command, but can't determine from the help files how to simply delete one line out of the file.
I had so far produced some code as follows, that attempts to load the file, and then reads the errant line number from the error message:
exit_i = 1;                 % must be initialized or the loop never runs
while exit_i == 1
    try
        exit_i = 0;
        raw_data = load(counts_file);
    catch err
        % Parse the offending line number out of load's error message
        space_array = isspace(err.message);
        space_index = find(space_array > 0);
        line_num = str2double(err.message(space_index(5):space_index(6)));
        fileID = fopen(counts_file, 'rt+');
        exit_i = 1;
    end
end
Now I would like to open the file and delete that line, and then repeat the loop until it clears through the try sequence and the file is completely cleaned up.
So, my question is, what combination of fopen etc. commands can accomplish this with information on line number?
Thanks, Ryan
  4 Comments
Cedric on 16 Jun 2014
Edited: Cedric on 17 Jun 2014
Did dpb's answer solve your issue? If not, we really need a true sample, which you can attach to the post using the paper clip icon. Then copy-paste two lines as a comment, formatted as code (no problem if lines wrap), and indicate why one is correct and the other is not. About your example: do you have a stream in your file which looks like what you mention (a single line), or do you have one sequence per row, as in what follows?
072752.24, 5,6,7,8,9,10
7,8,9,10
072753.24, 5,6,7,8,9,10
If you have a stream, how do you identify sequence starts? Are they those floating-point numbers? How do you determine sequence length(s)? Are they constant (or should they be)? If sequence starts are floats (implying that all other values in a sequence are integers), your example seems to indicate that there can be missing sequence starts. Could there also be missing integers? Or both? In that case, how would you eliminate cases where one sequence is truncated at its end, the next is truncated at its start, and the two appended together match the correct sequence length?
dpb on 16 Jun 2014
The solution I posted would not work for such a file; one would also have to test the number of fields in each parsed line, not merely whether it is a well-formed csv record, which is all the previous checks do.
As you point out, in general with all the possibilities of an error it's quite complicated to clean up data.
Once upon a time, many moons ago, we got data back from the plant computers (commercial nuclear power stations) on punched paper tape. It was notoriously flaky for dropped punches and/or misreads. We had a terrible time developing code to salvage as much of the data as possible, since at the time (late '60s/early '70s) it was our only way to get data out of the system other than by hand or off the line printer.


Accepted Answer

dpb on 13 Jun 2014
Edited: dpb on 15 Jun 2014
Sequential files are, well, "sequential".
There is no way to open a sequential file at a given record, and MATLAB doesn't implement direct-access files, so you can't do it with fopen alone.
What you can do is to read the full file into memory as a cell array, one record per cell. Then, if you can identify the lines/records that are in error, simply delete them and then rewrite the file.
There's been an extensive thread on a similar problem of deleting unwanted records in a file at http://www.mathworks.com/matlabcentral/answers/133545-remove-elements-appearing-sequentially-in-a-larger-text#answer_140284. There I posted an example code that reads a file, builds an index array to delete, does that and rewrites the corrected file. Slight modification of that should work here.
Also Cedric points out that perhaps regexp might help solve the problem...
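One way that regexp idea might look (a sketch only -- the 62-field record layout is an assumption borrowed from Ryan's later comma-count check, so adjust the pattern to your actual format):

```matlab
% Assumption: a good record is 62 comma-separated numeric fields.
num   = '-?\d+(\.\d+)?';                          % one numeric field
pat   = ['^\s*' num '(\s*,\s*' num '){61}\s*$'];  % 62 fields total
lines = textread(counts_file, '%s', 'delimiter', '\n', 'whitespace', '');
good  = ~cellfun('isempty', regexp(lines, pat, 'once'));
lines = lines(good);                              % malformed records dropped
```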
ADDENDUM
An implementation of Joseph's process of parsing line-by-line could look something like--
fid = fopen(files_count, 'r');
while ~feof(fid)
    l = fgetl(fid);
    % Try to parse the whole line as comma-separated numbers
    [c, pos] = textscan(l, '%f', 'delimiter', ',', 'collectoutput', 1);
    if pos < length(l), continue, end   % bum line, skip
    % either write _l_ to a new file, collect c in memory, whatever...
end
fclose(fid);
Worked for a very small set of trial cases here...like two :)
Assumes if textscan can parse a full line w/ comma delimiter specified, it's good; otherwise not. That's basically what load ends up doing with text files internally; I didn't precisely figure out how it generated the error line.
ADDENDUM 2
...what combination of ... commands can accomplish this with information on line number?
I guess to answer the question as posed...
file = textread(files_count, '%s', 'delimiter', '\n', 'whitespace', '');
file(line_num) = [];                 % drop the offending record
fid = fopen(files_count, 'w');
for i = 1:length(file)
    fprintf(fid, '%s\n', file{i});
end
fclose(fid);
NB: The above rewrites the original file; be sure to have a backup while debugging...
I don't recommend the above solution, btw... unless the files are pretty small, it's about the most overhead-intensive solution possible short of reading/writing a character at a time. But it does make use of your rather cute code for finding the line number. Too bad I can't think of a way to get load to pick up from where it left off and thus build an array of bad lines without having to rewrite the file before each test, but I came up blank other than the kind of idea posted above.
  1 Comment
Joseph Cheng on 13 Jun 2014
Perhaps what you can do is use fopen and read the file in line by line, whilst writing the desired lines to another file with a prefix, e.g. 'Mod_(original file).csv'. Unless you absolutely know what is causing the faulty data, I personally wouldn't overwrite the original raw data.
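A sketch of that copy-to-a-new-file approach (the parse check is borrowed from dpb's answer; the 'Mod_' output filename is just an example):

```matlab
fin  = fopen(counts_file, 'r');
fout = fopen(['Mod_' counts_file], 'w');   % leave the raw data untouched
while ~feof(fin)
    l = fgetl(fin);
    [~, pos] = textscan(l, '%f', 'delimiter', ',');
    if pos == length(l)                    % whole line parsed --> keep it
        fprintf(fout, '%s\n', l);
    end
end
fclose(fin);
fclose(fout);
```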


More Answers (1)

Ryan on 17 Jun 2014
Here is my final solution after some iterations:
I read each line and kept it as a string because of the mixed data types. I ran checks on whether it had the right number of cells (by counting the commas) and also on the position of the first delimiter, since the first cell has a fixed width. Then I concatenated the strings of all valid lines together with carriage returns added back in, closed the file, reopened it in writable mode, wrote over the file, and closed it.
try
    raw_data = load(counts_file);
catch err
    fid = fopen(counts_file, 'r');
    raw_data_string = '';
    while ~feof(fid)
        l = fgetl(fid);
        Del_array = strfind(l, ',');
        if length(Del_array) ~= 61, continue, end  % skip line if it does not have the right length (right number of commas)
        if Del_array(1) ~= 10, continue, end       % skip line if the timestamp has been truncated (first digit missing)
        raw_data_string = [raw_data_string, l, sprintf('\r\n')];
    end
    fclose(fid);
    fid = fopen(counts_file, 'w');
    % Use an explicit '%s' format so stray % or \ in the data can't be
    % misinterpreted as format specifiers
    fprintf(fid, '%s', raw_data_string);
    fclose(fid);
end
Thank you all for your help!
  1 Comment
dpb on 17 Jun 2014
The one thing you might do would be to include your previous trick of finding the first failing line number; then you could use textscan or the like to slurp up the lines before that point in one pass, saving the fgetl line-by-line work until you need it. If the files are relatively small, the difference may not matter enough to bother.
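For instance, something along these lines (a sketch only; line_num is the value recovered from load's error message as in the original post):

```matlab
fid  = fopen(counts_file, 'r');
% Slurp the first line_num-1 lines in one pass; load already vouched for them
head = textscan(fid, '%s', line_num - 1, 'delimiter', '\n', 'whitespace', '');
good_lines = head{1};
fgetl(fid);          % discard the known-bad line itself
% ... continue with the line-by-line checks for the rest of the file ...
fclose(fid);
```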

