How to inclusively extract rows of a large cell array between cells given start and end patterns?
Hello Folks,
I am searching for the most efficient method to parse a large text file (typically 2-4 GB) for occurrences of a message. I have to search ~100 large files for dozens of messages, so efficiency will be quite significant. I have attached a sample_input.txt with two occurrences of a message specified in the considerations below.
Considerations:
1) start of the message is: 'Hello_Message.pdf'
2) end of the message is: '&&&'
3) store all lines of each occurrence of the message to an array within a structure
4) all messages have a header pattern '.*\.[a-zA-Z]{3}\n\r' and end with pattern '&&&\n\r'
5) hopeful to avoid for loops by filtering using a function, for example extractBetween, contains, regexpPattern, or other function(s)
The code below does not work but hopefully it provides an idea of what I was thinking...
clear
close all
clc
Input_fid = fopen('sample_input.txt');
ftext = textscan(Input_fid,'%s','Delimiter','\n\r');
fclose(Input_fid);
% I want to inclusively capture the start of the message 'Hello_Message.pdf' and the end
% of the message '&&&' along with all rows between the start and end of each occurrence
% of the message
for check = 1:height(ftext{1})
HelloMsgs.Occurrences(check) = extractBetween(ftext{1},regexpPattern('Hello_Message.pdf.*\n\r'),regexpPattern('&&&\n\r'));
end
Desired Output:
HelloMsgs.Occurrences(1) <--- cell array of all lines of first occurrence of the Hello_Message in its
own row cell
HelloMsgs.Occurrences(2) <--- cell array of all lines of second occurrence of the Hello_Message in its
own row cell
HelloMsgs.Occurrences(3) <--- cell array of all lines of third occurrence of the Hello_Message in its
own row cell
Thank you in advance for your time. I am new to posting a coding question in a forum so hopefully I explained
the problem well enough.
4 Comments
There are anomalies in the file that keep this approach from working correctly.
My attempt —
type('sample_input.txt')
Hello_Message.pdf
2341234342 3214234 ert
2341234342 3214234 abc
2341234342 3214234
Some_ting
23453425
Blah_bleh
Sadf_5
Ouch 4
TEST
Asdff: sdf_sdf
Is_sdf: asdf
IS_ssg: sadf
NJ_T: adfgh
Some_data_: 4 sadf sadf asdf 45676578675 sdaf sadf asdf asdf sadf 4365436546 sdfdsf 0 sadfsdffds 0 <Item> toaster</Item> dsfasdf sadfdsakfdsfklj sdafsdafdsa fds
&&&
Hello_Message.txt
2341234342 3214234 ert
2341234342 3214234 abc
2341234342 3214234
Some_ting
23453425
Blah_bleh
Sadf_5
Ouch 4
TEST
Asdff: sdf_sdf
Is_sdf: asdf
IS_ssg: sadf
NJ_T: adfgh
Some_data_: 4 sadf sadf asdf 45676578675 sdaf sadf asdf asdf sadf 4365436546 sdfdsf 0 sadfsdffds 0 <Item> thisdata</Item> dsfasdf sadfdsakfdsfklj sdafsdafdsa fds
&&&
Bye_Message.pdf
2341234342 3214234 ert
2341234342 3214234 abc
2341234342 3214234
Some_ting
23453425
Blah_bleh
Sadf_5
Ouch 4
TEST
Asdff: sdf_sdf
Is_sdf: asdf
IS_ssg: sadf
NJ_T: adfgh
Some_data_: 4 sadf sadf asdf 45676578675 sdaf sadf asdf asdf sadf 4365436546 sdfdsf 0 sadfsdffds 0 <Item> sadfsdfdsfasdf</Item> dsfasdf sadfdsakfdsfklj sdafsdafdsa fds
&&&
Hello_Message.pdf
2341234342 3214234 ert
2341234342 3214234 abc
2341234342 3214234
Some_ting
23453425
Blah_bleh
Sadf_5
Ouch 4
TEST
Asdff: sdf_sdf
Is_sdf: asdf
IS_ssg: sadf
NJ_T: adfgh
Some_data_: 4 sadf sadf asdf 45676578675 sdaf sadf asdf asdf sadf 4365436546 sdfdsf 0 sadfsdffds 0 <Item> iron </Item> dsfasdf sadfdsakfdsfklj sdafsdafdsa fds\
&&&
fidi = fopen('sample_input.txt','rt')
fidi = 3
k = 1;
while ~feof(fidi)
Line{k,:} = fgetl(fidi);
k = k+1;
end
fclose(fidi);
k
k = 92
Line
Line = 91×1 cell array
{0×0 char }
{'Hello_Message.pdf' }
{'2341234342 3214234 ert' }
{'2341234342 3214234 abc' }
{'2341234342 3214234' }
{'Some_ting' }
{'23453425' }
{'Blah_bleh' }
{'Sadf_5' }
{'Ouch 4' }
{'TEST' }
{' ' }
{' ' }
{' ' }
{'Asdff: sdf_sdf' }
{'Is_sdf: asdf' }
{'IS_ssg: sadf' }
{'NJ_T: adfgh' }
{0×0 char }
{'Some_data_: 4 sadf sadf asdf 45676578675 sdaf sadf asdf asdf sadf 4365436546 sdfdsf 0 sadfsdffds 0 <Item> toaster</Item> dsfasdf sadfdsakfdsfklj sdafsdafdsa fds'}
{'&&&' }
{0×0 char }
{0×0 char }
{0×0 char }
{'Hello_Message.txt' }
{'2341234342 3214234 ert' }
{'2341234342 3214234 abc' }
{'2341234342 3214234' }
{'Some_ting' }
{'23453425' }
for k1 = 1:k-1
    if ~isempty(Line{k1,:})
        if strmatch(Line{k1,:},'Hello_Message.pdf')
            % Start(k1) = 1
            sprintf('Start = %2d',k1)
        end
        if strmatch(Line{k1}, '&&&')
            % End(k1) = 1;
            sprintf('End = %2d',k1)
        end
    end
end
ans = 'Start = 2'
ans = 'End = 21'
ans = 'End = 44'
ans = 'End = 65'
ans = 'Start = 72'
ans = 'End = 91'
.
You specifically stated:
‘start of the message is: 'Hello_Message.pdf'’
so that is all I considered. If you want to get all of them, there are ways to do that, for example with the extractBefore function, and then comparing only the part up to the end of the file prefix. I changed it in my posted Answer.
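The prefix comparison described above could look roughly like the sketch below. It is not the code from the posted Answer; the small `Line` cell array is made-up stand-in data, and the `hasDot` guard is an assumption added so that `extractBefore` is only applied to lines that actually contain a period.

```matlab
% Stand-in data (hypothetical), mimicking a few lines of sample_input.txt
Line = {'Hello_Message.pdf'; '2341234342 3214234 ert'; '&&&'; ...
        'Hello_Message.txt'; 'Some_ting'; '&&&'; 'Bye_Message.pdf'; ''};

% Only lines containing a '.' can be header lines of the form name.ext
hasDot = contains(Line, '.');

% Keep only the text before the first '.' on those lines
prefix = Line;
prefix(hasDot) = extractBefore(Line(hasDot), '.');

% Compare the file prefix only, so both .pdf and .txt headers match
isHello = hasDot & strcmp(prefix, 'Hello_Message');
HelloStartIdx = find(isHello).'
```

With this stand-in data the headers on lines 1 and 4 are flagged, while 'Bye_Message.pdf' is not.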
Jude
on 18 Oct 2023
Hi Star Strider,
Thank you very much for your time and patience with me. Looks like I could have done better with how I explained the problem. I am reviewing your solution.
Thank you.
I substituted extractBetween for extractBefore since that gives the appropriate result in my ‘Extract’ cell array.
Accepted Answer
Star Strider
on 18 Oct 2023
Edited: Star Strider
on 18 Oct 2023
fidi = fopen('sample_input.txt','rt');
fidi = 3
k = 1;
while ~feof(fidi)
Line{k,:} = fgetl(fidi);
k = k+1;
end
fclose(fidi);
k
k = 92
% Line
for k1 = 1:k-1
    if ~isempty(Line{k1,:})
        Lc = strfind(extractBetween(Line{k1,:},'_','.'),'Message');
        if ~isempty(Lc)
            Start(k1) = 1;
            % sprintf('Start = %2d',k1)
        end
        if strfind(Line{k1}, '&&&')
            End(k1) = 1;
            % sprintf('End = %2d',k1)
        end
    end
end
StartIdx = find(Start)
StartIdx = 1×4
2 25 46 72
EndIdx = find(End)
EndIdx = 1×4
21 44 65 91
for k = 1:numel(StartIdx)
    Extract{k,:} = Line(StartIdx(k):EndIdx(k));
end
Extract{1}
ans = 20×1 cell array
{'Hello_Message.pdf' }
{'2341234342 3214234 ert' }
{'2341234342 3214234 abc' }
{'2341234342 3214234' }
{'Some_ting' }
{'23453425' }
{'Blah_bleh' }
{'Sadf_5' }
{'Ouch 4' }
{'TEST' }
{' ' }
{' ' }
{' ' }
{'Asdff: sdf_sdf' }
{'Is_sdf: asdf' }
{'IS_ssg: sadf' }
{'NJ_T: adfgh' }
{0×0 char }
{'Some_data_: 4 sadf sadf asdf 45676578675 sdaf sadf asdf asdf sadf 4365436546 sdfdsf 0 sadfsdffds 0 <Item> toaster</Item> dsfasdf sadfdsakfdsfklj sdafsdafdsa fds'}
{'&&&' }
Extract{end}
ans = 20×1 cell array
{'Hello_Message.pdf' }
{'2341234342 3214234 ert' }
{'2341234342 3214234 abc' }
{'2341234342 3214234' }
{'Some_ting' }
{'23453425' }
{'Blah_bleh' }
{'Sadf_5' }
{'Ouch 4' }
{'TEST' }
{' ' }
{' ' }
{' ' }
{'Asdff: sdf_sdf' }
{'Is_sdf: asdf' }
{'IS_ssg: sadf' }
{'NJ_T: adfgh' }
{0×0 char }
{'Some_data_: 4 sadf sadf asdf 45676578675 sdaf sadf asdf asdf sadf 4365436546 sdfdsf 0 sadfsdffds 0 <Item> iron </Item> dsfasdf sadfdsakfdsfklj sdafsdafdsa fds\'}
{'&&&' }
EDIT — (18 Oct 2023 at 03:42)
I am a bit lost with respect to ‘start key’ and ‘stop key’. My code defines ‘StartIdx’ and ‘EndIdx’ as the indices that define the ‘Message’ and ‘&&&’ entries. The ‘Extract’ cell arrays are those lines and all the lines between them.
My initial approach was to use the fileread function and then do ‘logical indexing’, however that failed so the loop was the only other available option.
My code here is the same code I posted as a Comment, changed to test for all the ‘Message’ lines and not only ‘Hello_Message.pdf’ that was initially specified.
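For files small enough to fit in memory, the whole-file idea can also be tried with regexp instead of logical indexing. The following is only a sketch of that different technique, not the approach that was attempted above: the text is built in memory as a stand-in for the result of fileread, and the unanchored lazy pattern is an assumption that relies on '.' matching newlines, which is the default in MATLAB's regexp.

```matlab
% Stand-in for txt = fileread('sample_input.txt'); (hypothetical content)
txt = sprintf(['junk\nHello_Message.pdf\na 1\n&&&\n' ...
               'Bye_Message.pdf\nb 2\n&&&\n' ...
               'Hello_Message.pdf\nc 3\n&&&\n']);

% Lazy match from each start marker to the nearest following end marker,
% inclusive of both markers
blocks = regexp(txt, 'Hello_Message\.pdf.*?&&&', 'match');

% Split each matched block into its lines (one column cell array per block)
HelloMsgs.Occurrences = cellfun(@splitlines, blocks, 'UniformOutput', false);

numel(HelloMsgs.Occurrences)
HelloMsgs.Occurrences{1}
```

The pattern is not anchored to the start of a line, so a line that merely contains 'Hello_Message.pdf' somewhere would also start a match. For multi-gigabyte files the memory cost of holding the whole text may make this impractical.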
.
6 Comments
Jude
on 18 Oct 2023
Hi Star Strider. I reviewed your code above to make certain I understand what is happening. I am not sure how I would modify the line with Lc to only get the 'Hello_Message.pdf' messages, everything I have tried still extracts all the messages.
I made a modified version, code below, that only pulls out the 'Hello_Message.pdf' messages. I greatly appreciate your time! The code below looks inefficient: after the first message is found (StartIdx = 2, EndIdx = 21), the outer loop "for k1 = 1:k-1" resumes at k1 = 3 instead of continuing from line 22, so the lines of the message that was just extracted are scanned again.
I would like to try to get the results of the code below but with your method used in the code above.
clear
clc
fidi = fopen('sample_input.txt','rt');
k=1;
while ~feof(fidi)
Line{k,1} = fgetl(fidi);
k = k+1;
end
fclose(fidi);
for k1 = 1:k-1
    if strfind(Line{k1,1}, 'Hello_Message.pdf')
        Start(k1)=1;
        StartIdx = find(Start);
        for k2 = k1+1:k-1
            if strfind(Line{k2,1}, '&&&')
                End(k2) = 1;
                EndIdx = find(End);
                break
            end
        end
    end
end
for k = 1:numel(StartIdx)
    Extract{k,1} = Line(StartIdx(k):EndIdx(k)); % contains the solution I am looking for
end
Filtered_Msgs_StartEnd = {StartIdx,EndIdx}
Cnt_Filtered_Msgs = numel(Filtered_Msgs_StartEnd)
Extract{:}
I just now ran this and it seems to do what you want.
What specifically would you want to change?
clear
clc
fidi = fopen('sample_input.txt','rt');
k=1;
while ~feof(fidi)
Line{k,1} = fgetl(fidi);
k = k+1;
end
fclose(fidi);
for k1 = 1:k-1
    if strfind(Line{k1,1}, 'Hello_Message.pdf')
        Start(k1)=1;
        StartIdx = find(Start);
        for k2 = k1+1:k-1
            if strfind(Line{k2,1}, '&&&')
                End(k2) = 1;
                EndIdx = find(End);
                break
            end
        end
    end
end
for k = 1:numel(StartIdx)
    Extract{k,1} = Line(StartIdx(k):EndIdx(k)); % contains the solution I am looking for
end
Filtered_Msgs_StartEnd = {StartIdx,EndIdx}
Filtered_Msgs_StartEnd = 1×2 cell array
{[2 72]} {[21 91]}
Cnt_Filtered_Msgs = numel(Filtered_Msgs_StartEnd)
Cnt_Filtered_Msgs = 2
Extract{:}
ans = 20×1 cell array
{'Hello_Message.pdf' }
{'2341234342 3214234 ert' }
{'2341234342 3214234 abc' }
{'2341234342 3214234' }
{'Some_ting' }
{'23453425' }
{'Blah_bleh' }
{'Sadf_5' }
{'Ouch 4' }
{'TEST' }
{' ' }
{' ' }
{' ' }
{'Asdff: sdf_sdf' }
{'Is_sdf: asdf' }
{'IS_ssg: sadf' }
{'NJ_T: adfgh' }
{0×0 char }
{'Some_data_: 4 sadf sadf asdf 45676578675 sdaf sadf asdf asdf sadf 4365436546 sdfdsf 0 sadfsdffds 0 <Item> toaster</Item> dsfasdf sadfdsakfdsfklj sdafsdafdsa fds'}
{'&&&' }
ans = 20×1 cell array
{'Hello_Message.pdf' }
{'2341234342 3214234 ert' }
{'2341234342 3214234 abc' }
{'2341234342 3214234' }
{'Some_ting' }
{'23453425' }
{'Blah_bleh' }
{'Sadf_5' }
{'Ouch 4' }
{'TEST' }
{' ' }
{' ' }
{' ' }
{'Asdff: sdf_sdf' }
{'Is_sdf: asdf' }
{'IS_ssg: sadf' }
{'NJ_T: adfgh' }
{0×0 char }
{'Some_data_: 4 sadf sadf asdf 45676578675 sdaf sadf asdf asdf sadf 4365436546 sdfdsf 0 sadfsdffds 0 <Item> iron </Item> dsfasdf sadfdsakfdsfklj sdafsdafdsa fds\'}
{'&&&' }
.
Jude
on 19 Oct 2023
The output for the code above is correct but the way I am iterating is a bit off and I would like to make it more efficient. The first instance of the 'Hello_Message.pdf' is on sample_input.txt lines 2-21 with the second instance of the 'Hello_Message.pdf' occurring between lines 72-91...
On the first iteration of the first for loop the first message is extracted as intended; however, at the start of the second iteration of that first for loop k1 is back at line 3. I can't figure out how to avoid repeating lines 3-21 in the first for loop and instead proceed from the line after the first extracted message, so that I don't spend any time reparsing lines of the first extracted message. Put another way: for iteration 2 of the first for loop, instead of starting from line 3, I would like it to start from line 23 and continue searching for the next instance of the StartIdx, so that previously parsed lines don't get reparsed.
Perhaps it would be better for me to just modify the Lc variable from your response (on 18 Oct 2023 at 2:12), but nothing I have tried seems to result in the same output as the code from my most recent response.
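One way to avoid revisiting lines that were already consumed is to replace the for loop with a while loop whose index can jump past the end marker. This is a minimal sketch of that idea on made-up stand-in data, assuming exact matches on the marker lines via strcmp; it is not taken from any of the posted code.

```matlab
% Stand-in data (hypothetical), mimicking the structure of sample_input.txt
Line = {''; 'Hello_Message.pdf'; 'a 1'; '&&&'; ...
        'Bye_Message.pdf'; 'b 2'; '&&&'; ...
        'Hello_Message.pdf'; 'c 3'; '&&&'};

n = numel(Line);
Extract = {};
k1 = 1;
while k1 <= n
    if strcmp(Line{k1}, 'Hello_Message.pdf')
        % Scan forward from the start marker to the next end marker
        k2 = k1 + 1;
        while k2 <= n && ~strcmp(Line{k2}, '&&&')
            k2 = k2 + 1;
        end
        k2 = min(k2, n);                 % no end marker: take through the last line
        Extract{end+1, 1} = Line(k1:k2); % inclusive of both markers
        k1 = k2 + 1;                     % resume after the extracted message
    else
        k1 = k1 + 1;
    end
end
numel(Extract)
```

Each line is visited once here, because the outer index continues from the line after the end marker rather than from the line after the start marker.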
Without altering my previous code significantly, I added an extra for loop to extract the next ‘EndIdx’ value greater than the preceding ‘StartIdx’ value, and then saved those values (initially assigned to ‘NextEnd’) to ‘EndIdx’ afterwards. That produced two sets of consecutive ‘StartIdx’ and ‘EndIdx’ values, one for each section.
fidi = fopen('sample_input.txt','rt');
k = 1;
while ~feof(fidi)
Line{k,:} = fgetl(fidi);
k = k+1;
end
fclose(fidi);
k
k = 92
% Line
for k1 = 1:k-1
    if ~isempty(Line{k1,:})
        Lc = strfind(Line{k1,:}, 'Hello_Message.pdf');
        % Lc = strfind(extractBetween(Line{k1,:},'_','.'),'Message');
        if ~isempty(Lc)
            Start(k1) = 1;
            % sprintf('Start = %2d',k1)
        end
        if strfind(Line{k1}, '&&&')
            End(k1) = 1;
            % sprintf('End = %2d',k1)
        end
    end
end
StartIdx = find(Start);
EndIdx = find(End);
for k = 1:numel(StartIdx)
    NextEnd(k) = EndIdx(find(EndIdx > StartIdx(k), 1));
end
StartIdx
StartIdx = 1×2
2 72
EndIdx = NextEnd
EndIdx = 1×2
21 91
for k = 1:numel(StartIdx)
    Extract{k,:} = Line(StartIdx(k):EndIdx(k));
end
Extract{1}
ans = 20×1 cell array
{'Hello_Message.pdf' }
{'2341234342 3214234 ert' }
{'2341234342 3214234 abc' }
{'2341234342 3214234' }
{'Some_ting' }
{'23453425' }
{'Blah_bleh' }
{'Sadf_5' }
{'Ouch 4' }
{'TEST' }
{' ' }
{' ' }
{' ' }
{'Asdff: sdf_sdf' }
{'Is_sdf: asdf' }
{'IS_ssg: sadf' }
{'NJ_T: adfgh' }
{0×0 char }
{'Some_data_: 4 sadf sadf asdf 45676578675 sdaf sadf asdf asdf sadf 4365436546 sdfdsf 0 sadfsdffds 0 <Item> toaster</Item> dsfasdf sadfdsakfdsfklj sdafsdafdsa fds'}
{'&&&' }
Extract{end}
ans = 20×1 cell array
{'Hello_Message.pdf' }
{'2341234342 3214234 ert' }
{'2341234342 3214234 abc' }
{'2341234342 3214234' }
{'Some_ting' }
{'23453425' }
{'Blah_bleh' }
{'Sadf_5' }
{'Ouch 4' }
{'TEST' }
{' ' }
{' ' }
{' ' }
{'Asdff: sdf_sdf' }
{'Is_sdf: asdf' }
{'IS_ssg: sadf' }
{'NJ_T: adfgh' }
{0×0 char }
{'Some_data_: 4 sadf sadf asdf 45676578675 sdaf sadf asdf asdf sadf 4365436546 sdfdsf 0 sadfsdffds 0 <Item> iron </Item> dsfasdf sadfdsakfdsfklj sdafsdafdsa fds\'}
{'&&&' }
This approach (adding an additional loop) is likely the most efficient way to choose the correct ‘EndIdx’ for each ‘StartIdx’.
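For comparison, the same pairing of each ‘StartIdx’ with the next larger ‘EndIdx’ can be written without an explicit loop. The sketch below uses made-up stand-in data and implicit expansion (R2016b or later); it builds a numel(StartIdx)-by-numel(EndIdx) comparison matrix, so whether it is faster or affordable for very many markers is an open question, not a claim.

```matlab
% Stand-in data (hypothetical), mimicking the structure of sample_input.txt
Line = {''; 'Hello_Message.pdf'; 'a 1'; '&&&'; ...
        'Bye_Message.pdf'; 'b 2'; '&&&'; ...
        'Hello_Message.pdf'; 'c 3'; '&&&'};

StartIdx = find(strcmp(Line, 'Hello_Message.pdf')).';  % row vector of start lines
EndIdx   = find(strcmp(Line, '&&&')).';                % row vector of end lines

% isAfter(i,j) is true when end marker j lies after start marker i
isAfter = EndIdx > StartIdx.';
hasEnd  = any(isAfter, 2);                 % starts that have a later end marker
[~, pos] = max(double(isAfter), [], 2);    % first true entry in each row

StartIdx = StartIdx(hasEnd);
NextEnd  = EndIdx(pos(hasEnd));

% arrayfun is still a loop internally; it only keeps the code compact
Extract = arrayfun(@(s, e) Line(s:e), StartIdx, NextEnd, 'UniformOutput', false);
```

The `hasEnd` check matters because `max` returns position 1 for a row with no true entry, which would otherwise pair a trailing start marker with the first end marker.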
.
Jude
on 20 Oct 2023
Thank you Star Strider, your solution works great. Thanks again for your time and patience with me.
As always, my pleasure!
More Answers (1)
Why do you want to avoid loops? Reading the file completely in order to apply vectorized methods requires 8 GB of contiguous free RAM for a 4 GB file (16 bits per char). I'd choose such an approach only on computers with >= 32 GB RAM, while a loop method is less demanding concerning RAM. In addition, filtering while reading avoids keeping the complete text in RAM.
S = ParseFile("sample_input.txt");
S{1}
ans = 18×1 cell array
{'2341234342 3214234 ert' }
{'2341234342 3214234 abc' }
{'2341234342 3214234' }
{'Some_ting' }
{'23453425' }
{'Blah_bleh' }
{'Sadf_5' }
{'Ouch 4' }
{'TEST' }
{' ' }
{' ' }
{' ' }
{'Asdff: sdf_sdf' }
{'Is_sdf: asdf' }
{'IS_ssg: sadf' }
{'NJ_T: adfgh' }
{0×0 char }
{'Some_data_: 4 sadf sadf asdf 45676578675 sdaf sadf asdf asdf sadf 4365436546 sdfdsf 0 sadfsdffds 0 <Item> toaster</Item> dsfasdf sadfdsakfdsfklj sdafsdafdsa fds'}
function S = ParseFile(File)
    startKey = "Hello_Message.pdf";
    stopKey = "&&&";
    fid = fopen(File, 'r');
    assert(fid > 0, "Cannot open file: %s", File);
    bS = 1000; % Pre-allocate output in blocks
    nS = bS;
    iS = 0;
    S = cell(1, nS);
    buffer = cell(20, 1); % Grows iteratively at first
    ibuffer = 0;
    doGrab = false;
    while ~feof(fid)
        Line = fgetl(fid);
        if startsWith(Line, startKey)
            buffer(:) = {[]}; % Clear the buffer
            ibuffer = 0;
            doGrab = true; % Start grabbing in next line
        elseif doGrab && startsWith(Line, stopKey) % Ignore stop keys of other messages
            doGrab = false; % Stop grabbing
            iS = iS + 1; % Expand output S in blocks on demand
            if iS > nS
                nS = nS + bS;
                S{nS} = [];
            end
            S{iS} = buffer(1:ibuffer); % Store the buffer
        elseif doGrab
            ibuffer = ibuffer + 1;
            buffer{ibuffer} = Line;
        end
    end
    fclose(fid);
    if doGrab % Store last buffer, if stopKey is missing
        iS = iS + 1;
        S{iS} = buffer(1:ibuffer); % Store the collected lines, not only the last line read
    end
    S = S(1:iS); % Crop pre-allocated output cells
end
1 Comment
Jude
on 18 Oct 2023
Hi Jan,
With regards to my reason for wanting to avoid for loops, I "assumed" there could be a more resource/time efficient way to accomplish what I was trying to do. The input files are maintained on a network and not stored locally on the machine (64GB RAM) where matlab is being executed.
I do like your approach a lot and will be looking at it in detail so that I understand what is happening...
How would your solution/code be modified to so that the startkey and stopkey for the messages are included in the cell arrays captured by S?
Perhaps the startkey would need to be defined as regexpPattern('.*\.[a-z]{3}') then a filter for the message where line1 is equal to "Hello_Message.pdf" applied?
Thank you for your time.
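One possible way to include the start and stop lines in the captured cell arrays is sketched below. `ParseFileInclusive` is a hypothetical variant written for illustration, not Jan's code: it takes the keys as arguments, drops the block pre-allocation for brevity, and is exercised on a small temporary file whose content is made up.

```matlab
% Build a small stand-in file so the sketch runs on its own (hypothetical content)
tmpFile = [tempname '.txt'];
fid = fopen(tmpFile, 'w');
fprintf(fid, '%s\n', 'junk', 'Hello_Message.pdf', 'a 1', '&&&', ...
    'Bye_Message.pdf', 'b 2', '&&&', 'Hello_Message.pdf', 'c 3', '&&&');
fclose(fid);

S = ParseFileInclusive(tmpFile, "Hello_Message.pdf", "&&&");
delete(tmpFile);
S{1}

function S = ParseFileInclusive(File, startKey, stopKey)
    % Hypothetical variant of ParseFile that keeps the start and stop lines
    fid = fopen(File, 'r');
    assert(fid > 0, "Cannot open file: %s", File);
    closer = onCleanup(@() fclose(fid)); % Close the file even if an error occurs
    S = {};
    buffer = {};
    doGrab = false;
    while true
        Line = fgetl(fid);
        if ~ischar(Line) % fgetl returns -1 at the end of the file
            break
        end
        if startsWith(Line, startKey)
            buffer = {Line}; % The start line is the first stored line
            doGrab = true;
        elseif doGrab
            buffer{end+1, 1} = Line; % Store the line, including the stop line
            if startsWith(Line, stopKey)
                S{end+1} = buffer;
                doGrab = false;
            end
        end
    end
end
```

A message whose stop key is missing is silently dropped in this sketch; whether it should be kept instead depends on how the real files behave. Filtering on a general header pattern first and on "Hello_Message.pdf" afterwards, as suggested in the comment above, would need an extra check on the first stored line.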