Reading a file and deleting lines of data that were written incorrectly

I am processing data from an instrument that periodically has lines (csv data file) where the data was written incorrectly. I have been deleting these by hand in a text editor but would like to automate the quality control.
I have tried looking at the fopen command, but can't determine from the help files how to simply delete one line out of the file.
I had so far produced some code as follows, that attempts to load the file, and then reads the errant line number from the error message:
exit_i = 1;                 % must be initialized or the loop never runs
while exit_i == 1
    try
        exit_i = 0;
        raw_data = load(counts_file);
    catch err
        % Parse the offending line number out of load's error message
        space_array = isspace(err.message);
        space_index = find(space_array > 0);
        line_num = str2double(err.message(space_index(5):space_index(6)));
        fileID = fopen(counts_file, 'rt+');
        exit_i = 1;
    end
end
Now I would like to open the file and delete that line, and then repeat the loop until it clears through the try sequence and the file is completely cleaned up.
So, my question is, what combination of fopen etc. commands can accomplish this with information on line number?
Thanks, Ryan
  4 Comments
Cedric on 16 Jun 2014
Edited: Cedric on 17 Jun 2014
Did dpb's answer solve your issue? If not, we really need a true sample, which you can attach to the post using the paper clip icon. Then copy-paste two lines as a comment, formatted as code (no problem if lines wrap), and indicate why one is correct and the other is not. About your example: do you have a stream in your file which looks like what you mention (a single line), or do you have one sequence per row, as in what follows?
072752.24, 5,6,7,8,9,10
7,8,9,10
072753.24, 5,6,7,8,9,10
If you have a stream, how do you identify sequence starts? Are they those floating-point numbers? How do you determine sequence length(s)? Are they constant (or should they be)? If sequence starts are floats (implying that all other values in a sequence are integers), your example seems to indicate that there can be missing sequence starts. Could there also be missing integers? Or both? In that case, how would you eliminate cases where one sequence is truncated at its end, the next is truncated at its start, and the two appended together match the correct sequence length?
dpb on 16 Jun 2014
The solution I posted would not work for such a file; one would also have to test the number of fields in each parsed line, not merely whether it is a well-formed csv record, which is all the previous checks do.
As you point out, in general with all the possibilities of an error it's quite complicated to clean up data.
Once upon a time, many moons ago, we got data back from the plant computers (commercial nuclear power stations) on punched paper tape. It was notoriously flaky for dropped punches and/or misreads. We had a terrible time developing code to salvage as much of the data as possible, since at the time (late '60s/early '70s) it was our only way to get data out of the system other than by hand or off the line printer.


Accepted Answer

dpb on 13 Jun 2014
Edited: dpb on 15 Jun 2014
Sequential files are, well, "sequential".
There is no way to open a sequential file at a given record, and MATLAB doesn't implement direct-access files, so you can't do it with fopen alone.
What you can do is to read the full file into memory as a cell array, one record per cell. Then, if you can identify the lines/records that are in error, simply delete them and then rewrite the file.
There's been an extensive thread on a similar problem of deleting unwanted records in a file at http://www.mathworks.com/matlabcentral/answers/133545-remove-elements-appearing-sequentially-in-a-larger-text#answer_140284. There I posted an example code that reads a file, builds an index array to delete, does that and rewrites the corrected file. Slight modification of that should work here.
Also Cedric points out that perhaps regexp might help solve the problem...
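One way that regexp idea might look (a sketch only -- the 62-field record layout is an assumption borrowed from Ryan's later comma-count check, so adjust the pattern to your actual format):

```matlab
% Assumption: a good record is 62 comma-separated numeric fields.
num   = '-?\d+(\.\d+)?';                          % one numeric field
pat   = ['^\s*' num '(\s*,\s*' num '){61}\s*$'];  % 62 fields total
lines = textread(counts_file, '%s', 'delimiter', '\n', 'whitespace', '');
good  = ~cellfun('isempty', regexp(lines, pat, 'once'));
lines = lines(good);                              % malformed records dropped
```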
ADDENDUM
An implementation of Joseph's process of parsing line-by-line could look something like--
fid = fopen(files_count, 'r');
while ~feof(fid)
    l = fgetl(fid);
    % Try to parse the whole line as comma-separated numbers
    [c, pos] = textscan(l, '%f', 'delimiter', ',', 'collectoutput', 1);
    if pos < length(l), continue, end   % bum line, skip
    % either write _l_ to a new file, collect c in memory, whatever...
end
fclose(fid);
Worked for a very small set of trial cases here...like two :)
Assumes if textscan can parse a full line w/ comma delimiter specified, it's good; otherwise not. That's basically what load ends up doing with text files internally; I didn't precisely figure out how it generated the error line.
ADDENDUM 2
...what combination of ... commands can accomplish this with information on line number?
I guess to answer the question as posed...
file = textread(files_count, '%s', 'delimiter', '\n', 'whitespace', '');
file(line_num) = [];                 % drop the offending record
fid = fopen(files_count, 'w');
for i = 1:length(file)
    fprintf(fid, '%s\n', file{i});
end
fclose(fid);
NB: The above rewrites the original file; be sure to have a backup while debugging...
I don't recommend the above solution, btw... unless the files are pretty small, it's about the most overhead-intensive solution possible short of reading/writing a character at a time. But it does make use of your rather cute code for finding the line number. Too bad I can't think of a way to get load to pick up from where it left off and thus build an array of bad lines without having to rewrite the file before each test, but I came up blank other than the kind of idea posted above.
  1 Comment
Joseph Cheng on 13 Jun 2014
Perhaps what you can do is use fopen and read the file in line by line, whilst writing the desired lines to another file with a prefix, e.g. 'Mod_(original file).csv'. Unless you absolutely know what is causing the faulty data, I personally wouldn't overwrite the original raw data.
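A sketch of that copy-to-a-new-file approach (the parse check is borrowed from dpb's answer; the 'Mod_' output filename is just an example):

```matlab
fin  = fopen(counts_file, 'r');
fout = fopen(['Mod_' counts_file], 'w');   % leave the raw data untouched
while ~feof(fin)
    l = fgetl(fin);
    [~, pos] = textscan(l, '%f', 'delimiter', ',');
    if pos == length(l)                    % whole line parsed --> keep it
        fprintf(fout, '%s\n', l);
    end
end
fclose(fin);
fclose(fout);
```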


More Answers (1)

Ryan on 17 Jun 2014
Here is my final solution after some iterations:
I read each line and kept it as a string because of the mixed data types. I ran checks on whether it had the right number of cells (by counting the commas) and also on the position of the first delimiter, since the first cell has a fixed width. Then I concatenated the strings of all valid lines together with carriage returns added back in, closed the file, reopened it in writable mode, wrote over the file, and closed it.
try
    raw_data = load(counts_file);
catch err
    fid = fopen(counts_file, 'r');
    raw_data_string = '';
    while ~feof(fid)
        l = fgetl(fid);
        Del_array = strfind(l, ',');
        if length(Del_array) ~= 61, continue, end  % skip line if it does not have the right length (right number of commas)
        if Del_array(1) ~= 10, continue, end       % skip line if the timestamp has been truncated (first digit missing)
        raw_data_string = [raw_data_string, l, sprintf('\r\n')];
    end
    fclose(fid);
    fid = fopen(counts_file, 'w');
    % Use an explicit '%s' format so stray % or \ in the data can't be
    % misinterpreted as format specifiers
    fprintf(fid, '%s', raw_data_string);
    fclose(fid);
end
Thank you all for your help!
  1 Comment
dpb on 17 Jun 2014
The one thing you might do would be to include your previous trick of finding the first failing line number; then you could use textscan or the like to slurp up the lines before that point in one pass, saving the fgetl line-by-line work until you need it. If the files are relatively small, the difference may not matter enough to bother.
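For instance, something along these lines (a sketch only; line_num is the value recovered from load's error message as in the original post):

```matlab
fid  = fopen(counts_file, 'r');
% Slurp the first line_num-1 lines in one pass; load already vouched for them
head = textscan(fid, '%s', line_num - 1, 'delimiter', '\n', 'whitespace', '');
good_lines = head{1};
fgetl(fid);          % discard the known-bad line itself
% ... continue with the line-by-line checks for the rest of the file ...
fclose(fid);
```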

