How to inclusively extract rows of a large cell array between cells given start and end patterns?

Question

Hello Folks,

I am searching for the most efficient method to parse a large text file (typically 2-4 GB) for occurrences of a message. I have to search ~100 large files for dozens of messages, so efficiency matters a great deal. I have attached sample_input.txt, which contains two occurrences of a message specified in the considerations below.

Considerations:

1) start of the message is: 'Hello_Message.pdf'

2) end of the message is: '&&&'

3) store all lines of each occurrence of the message in an array within a structure

4) all messages have a header pattern '.*\.[a-zA-Z]{3}\n\r' and end with pattern '&&&\n\r'

5) I hope to avoid for loops by filtering with a function such as extractBetween, contains, regexpPattern, or other function(s)

The code below does not work but hopefully it provides an idea of what I was thinking...

clear
close all
clc
Input_fid = fopen('sample_input.txt');
ftext = textscan(Input_fid,'%s','Delimiter','\n\r');
fclose(Input_fid);
% I want to inclusively capture the start of the message 'Hello_Message.pdf' and the end
% of the message '&&&' along with all rows between the start and end of each occurrence
% of the message
for check = 1:height(ftext{1})
    HelloMsgs.Occurrences(check) = extractBetween(ftext{1},regexpPattern('Hello_Message.pdf.*\n\r'),regexpPattern('&&&\n\r'));
end

Desired Output:

HelloMsgs.Occurrences(1) <--- cell array of all lines of first occurrence of the Hello_Message in its own row cell

HelloMsgs.Occurrences(2) <--- cell array of all lines of second occurrence of the Hello_Message in its own row cell

HelloMsgs.Occurrences(3) <--- cell array of all lines of third occurrence of the Hello_Message in its own row cell

Thank you in advance for your time. I am new to posting a coding question in a forum so hopefully I explained the problem well enough.

4 Comments

Star Strider on 17 Oct 2023

There are anomalies in the file that keep this approach from working correctly.

My attempt —

type('sample_input.txt')
Hello_Message.pdf
2341234342 3214234 ert
2341234342 3214234 abc
2341234342 3214234
Some_ting
23453425
Blah_bleh
Sadf_5
Ouch 4
TEST
                                
                                
                                
Asdff: sdf_sdf
Is_sdf: asdf
IS_ssg: sadf
NJ_T: adfgh

Some_data_:    4 sadf sadf asdf 45676578675 sdaf sadf asdf asdf sadf 4365436546 sdfdsf 0 sadfsdffds 0 <Item> toaster</Item> dsfasdf sadfdsakfdsfklj sdafsdafdsa fds
&&&



Hello_Message.txt
2341234342 3214234 ert
2341234342 3214234 abc
2341234342 3214234
Some_ting
23453425
Blah_bleh
Sadf_5
Ouch 4
TEST
                                
                                
                                
Asdff: sdf_sdf
Is_sdf: asdf
IS_ssg: sadf
NJ_T: adfgh

Some_data_:    4 sadf sadf asdf 45676578675 sdaf sadf asdf asdf sadf 4365436546 sdfdsf 0 sadfsdffds 0 <Item> thisdata</Item> dsfasdf sadfdsakfdsfklj sdafsdafdsa fds
&&&

Bye_Message.pdf
2341234342 3214234 ert
2341234342 3214234 abc
2341234342 3214234
Some_ting
23453425
Blah_bleh
Sadf_5
Ouch 4
TEST
                                
                                
                                
Asdff: sdf_sdf
Is_sdf: asdf
IS_ssg: sadf
NJ_T: adfgh

Some_data_:    4 sadf sadf asdf 45676578675 sdaf sadf asdf asdf sadf 4365436546 sdfdsf 0 sadfsdffds 0 <Item> sadfsdfdsfasdf</Item> dsfasdf sadfdsakfdsfklj sdafsdafdsa fds
&&&






Hello_Message.pdf
2341234342 3214234 ert
2341234342 3214234 abc
2341234342 3214234
Some_ting
23453425
Blah_bleh
Sadf_5
Ouch 4
TEST
                                
                                
                                
Asdff: sdf_sdf
Is_sdf: asdf
IS_ssg: sadf
NJ_T: adfgh

Some_data_:    4 sadf sadf asdf 45676578675 sdaf sadf asdf asdf sadf 4365436546 sdfdsf 0 sadfsdffds 0 <Item> iron </Item> dsfasdf sadfdsakfdsfklj sdafsdafdsa fds\
&&&
fidi = fopen('sample_input.txt','rt')
fidi = 3
k = 1;
while ~feof(fidi)
    Line{k,:} = fgetl(fidi);
    k = k+1;
end
fclose(fidi);
k
k = 92
Line
Line = 91×1 cell array
    {0×0 char                                                                                                                                                             }
    {'Hello_Message.pdf'                                                                                                                                                  }
    {'2341234342 3214234 ert'                                                                                                                                             }
    {'2341234342 3214234 abc'                                                                                                                                             }
    {'2341234342 3214234'                                                                                                                                                 }
    {'Some_ting'                                                                                                                                                          }
    {'23453425'                                                                                                                                                           }
    {'Blah_bleh'                                                                                                                                                          }
    {'Sadf_5'                                                                                                                                                             }
    {'Ouch 4'                                                                                                                                                             }
    {'TEST'                                                                                                                                                               }
    {'                                '                                                                                                                                   }
    {'                                '                                                                                                                                   }
    {'                                '                                                                                                                                   }
    {'Asdff: sdf_sdf'                                                                                                                                                     }
    {'Is_sdf: asdf'                                                                                                                                                       }
    {'IS_ssg: sadf'                                                                                                                                                       }
    {'NJ_T: adfgh'                                                                                                                                                        }
    {0×0 char                                                                                                                                                             }
    {'Some_data_:    4 sadf sadf asdf 45676578675 sdaf sadf asdf asdf sadf 4365436546 sdfdsf 0 sadfsdffds 0 <Item> toaster</Item> dsfasdf sadfdsakfdsfklj sdafsdafdsa fds'}
    {'&&&'                                                                                                                                                                }
    {0×0 char                                                                                                                                                             }
    {0×0 char                                                                                                                                                             }
    {0×0 char                                                                                                                                                             }
    {'Hello_Message.txt'                                                                                                                                                  }
    {'2341234342 3214234 ert'                                                                                                                                             }
    {'2341234342 3214234 abc'                                                                                                                                             }
    {'2341234342 3214234'                                                                                                                                                 }
    {'Some_ting'                                                                                                                                                          }
    {'23453425'                                                                                                                                                           }
for k1 = 1:k-1
    if ~isempty(Line{k1,:})
        if strmatch(Line{k1,:},'Hello_Message.pdf')
            % Start(k1) = 1
            sprintf('Start = %2d',k1)
        end
        if strmatch(Line{k1}, '&&&')
            % End(k1) = 1;
            sprintf('End = %2d',k1)
        end
    end
end
ans = 'Start =  2'
ans = 'End = 21'
ans = 'End = 44'
ans = 'End = 65'
ans = 'Start = 72'
ans = 'End = 91'


Jude on 18 Oct 2023

Hi Star Strider,

Thank you very much for your time and patience with me. Looks like I could have done better with how I explained the problem. I am reviewing your solution.

Star Strider on 18 Oct 2023

Thank you.

I substituted extractBetween for extractBefore since that gives the appropriate result in my ‘Extract’ cell array.

Answer 1

Star Strider on 18 Oct 2023

Edited: Star Strider on 18 Oct 2023

type('sample_input.txt')
Hello_Message.pdf
2341234342 3214234 ert
2341234342 3214234 abc
2341234342 3214234
Some_ting
23453425
Blah_bleh
Sadf_5
Ouch 4
TEST
                                
                                
                                
Asdff: sdf_sdf
Is_sdf: asdf
IS_ssg: sadf
NJ_T: adfgh

Some_data_:    4 sadf sadf asdf 45676578675 sdaf sadf asdf asdf sadf 4365436546 sdfdsf 0 sadfsdffds 0 <Item> toaster</Item> dsfasdf sadfdsakfdsfklj sdafsdafdsa fds
&&&



Hello_Message.txt
2341234342 3214234 ert
2341234342 3214234 abc
2341234342 3214234
Some_ting
23453425
Blah_bleh
Sadf_5
Ouch 4
TEST
                                
                                
                                
Asdff: sdf_sdf
Is_sdf: asdf
IS_ssg: sadf
NJ_T: adfgh

Some_data_:    4 sadf sadf asdf 45676578675 sdaf sadf asdf asdf sadf 4365436546 sdfdsf 0 sadfsdffds 0 <Item> thisdata</Item> dsfasdf sadfdsakfdsfklj sdafsdafdsa fds
&&&

Bye_Message.pdf
2341234342 3214234 ert
2341234342 3214234 abc
2341234342 3214234
Some_ting
23453425
Blah_bleh
Sadf_5
Ouch 4
TEST
                                
                                
                                
Asdff: sdf_sdf
Is_sdf: asdf
IS_ssg: sadf
NJ_T: adfgh

Some_data_:    4 sadf sadf asdf 45676578675 sdaf sadf asdf asdf sadf 4365436546 sdfdsf 0 sadfsdffds 0 <Item> sadfsdfdsfasdf</Item> dsfasdf sadfdsakfdsfklj sdafsdafdsa fds
&&&






Hello_Message.pdf
2341234342 3214234 ert
2341234342 3214234 abc
2341234342 3214234
Some_ting
23453425
Blah_bleh
Sadf_5
Ouch 4
TEST
                                
                                
                                
Asdff: sdf_sdf
Is_sdf: asdf
IS_ssg: sadf
NJ_T: adfgh

Some_data_:    4 sadf sadf asdf 45676578675 sdaf sadf asdf asdf sadf 4365436546 sdfdsf 0 sadfsdffds 0 <Item> iron </Item> dsfasdf sadfdsakfdsfklj sdafsdafdsa fds\
&&&
fidi = fopen('sample_input.txt','rt');
fidi = 3
k = 1;
while ~feof(fidi)
    Line{k,:} = fgetl(fidi);
    k = k+1;
end
fclose(fidi);
k
k = 92
% Line
Line = 91×1 cell array
    {0×0 char                                                                                                                                                             }
    {'Hello_Message.pdf'                                                                                                                                                  }
    {'2341234342 3214234 ert'                                                                                                                                             }
    {'2341234342 3214234 abc'                                                                                                                                             }
    {'2341234342 3214234'                                                                                                                                                 }
    {'Some_ting'                                                                                                                                                          }
    {'23453425'                                                                                                                                                           }
    {'Blah_bleh'                                                                                                                                                          }
    {'Sadf_5'                                                                                                                                                             }
    {'Ouch 4'                                                                                                                                                             }
    {'TEST'                                                                                                                                                               }
    {'                                '                                                                                                                                   }
    {'                                '                                                                                                                                   }
    {'                                '                                                                                                                                   }
    {'Asdff: sdf_sdf'                                                                                                                                                     }
    {'Is_sdf: asdf'                                                                                                                                                       }
    {'IS_ssg: sadf'                                                                                                                                                       }
    {'NJ_T: adfgh'                                                                                                                                                        }
    {0×0 char                                                                                                                                                             }
    {'Some_data_:    4 sadf sadf asdf 45676578675 sdaf sadf asdf asdf sadf 4365436546 sdfdsf 0 sadfsdffds 0 <Item> toaster</Item> dsfasdf sadfdsakfdsfklj sdafsdafdsa fds'}
    {'&&&'                                                                                                                                                                }
    {0×0 char                                                                                                                                                             }
    {0×0 char                                                                                                                                                             }
    {0×0 char                                                                                                                                                             }
    {'Hello_Message.txt'                                                                                                                                                  }
    {'2341234342 3214234 ert'                                                                                                                                             }
    {'2341234342 3214234 abc'                                                                                                                                             }
    {'2341234342 3214234'                                                                                                                                                 }
    {'Some_ting'                                                                                                                                                          }
    {'23453425'                                                                                                                                                           }
for k1 = 1:k-1
    if ~isempty(Line{k1,:})
        Lc = strfind(extractBetween(Line{k1,:},'_','.'),'Message');
        if ~isempty(Lc)
            Start(k1) = 1;
            % sprintf('Start = %2d',k1)
        end
        if strfind(Line{k1}, '&&&')
            End(k1) = 1;
            % sprintf('End = %2d',k1)
        end
    end
end
StartIdx = find(Start)
StartIdx = 1×4
     2    25    46    72
EndIdx = find(End)
EndIdx = 1×4
    21    44    65    91
for k = 1:numel(StartIdx)
    Extract{k,:} = Line(StartIdx(k):EndIdx(k));
end
Extract{1}
ans = 20×1 cell array
    {'Hello_Message.pdf'                                                                                                                                                  }
    {'2341234342 3214234 ert'                                                                                                                                             }
    {'2341234342 3214234 abc'                                                                                                                                             }
    {'2341234342 3214234'                                                                                                                                                 }
    {'Some_ting'                                                                                                                                                          }
    {'23453425'                                                                                                                                                           }
    {'Blah_bleh'                                                                                                                                                          }
    {'Sadf_5'                                                                                                                                                             }
    {'Ouch 4'                                                                                                                                                             }
    {'TEST'                                                                                                                                                               }
    {'                                '                                                                                                                                   }
    {'                                '                                                                                                                                   }
    {'                                '                                                                                                                                   }
    {'Asdff: sdf_sdf'                                                                                                                                                     }
    {'Is_sdf: asdf'                                                                                                                                                       }
    {'IS_ssg: sadf'                                                                                                                                                       }
    {'NJ_T: adfgh'                                                                                                                                                        }
    {0×0 char                                                                                                                                                             }
    {'Some_data_:    4 sadf sadf asdf 45676578675 sdaf sadf asdf asdf sadf 4365436546 sdfdsf 0 sadfsdffds 0 <Item> toaster</Item> dsfasdf sadfdsakfdsfklj sdafsdafdsa fds'}
    {'&&&'                                                                                                                                                                }
Extract{end}
ans = 20×1 cell array
    {'Hello_Message.pdf'                                                                                                                                                 }
    {'2341234342 3214234 ert'                                                                                                                                            }
    {'2341234342 3214234 abc'                                                                                                                                            }
    {'2341234342 3214234'                                                                                                                                                }
    {'Some_ting'                                                                                                                                                         }
    {'23453425'                                                                                                                                                          }
    {'Blah_bleh'                                                                                                                                                         }
    {'Sadf_5'                                                                                                                                                            }
    {'Ouch 4'                                                                                                                                                            }
    {'TEST'                                                                                                                                                              }
    {'                                '                                                                                                                                  }
    {'                                '                                                                                                                                  }
    {'                                '                                                                                                                                  }
    {'Asdff: sdf_sdf'                                                                                                                                                    }
    {'Is_sdf: asdf'                                                                                                                                                      }
    {'IS_ssg: sadf'                                                                                                                                                      }
    {'NJ_T: adfgh'                                                                                                                                                       }
    {0×0 char                                                                                                                                                            }
    {'Some_data_:    4 sadf sadf asdf 45676578675 sdaf sadf asdf asdf sadf 4365436546 sdfdsf 0 sadfsdffds 0 <Item> iron </Item> dsfasdf sadfdsakfdsfklj sdafsdafdsa fds\'}
    {'&&&'                                                                                                                                                               }

EDIT — (18 Oct 2023 at 03:42)

I am a bit lost with respect to ‘start key’ and ‘stop key’. My code defines ‘StartIdx’ and ‘EndIdx’ as the indices that define the ‘Message’ and ‘&&&’ entries. The ‘Extract’ cell arrays are those lines and all the lines between them.

My initial approach was to use the fileread function and then do ‘logical indexing’, however that failed so the loop was the only other available option.

My code here is the same code I posted as a Comment, changed to test for all the ‘Message’ lines and not only ‘Hello_Message.pdf’ that was initially specified.

The regexp approach is not specific enough for this requirement.
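
On the ‘logical indexing’ idea mentioned above: once the lines are already in a cell array, one possible way to locate the blocks without an explicit for loop is sketched here. It is only a sketch under stated assumptions: the start and end keys are taken to occupy a whole line exactly, every start is assumed to be followed by a terminator, and the short ‘Line’ array is a hypothetical stand-in for the real file contents. Note that arrayfun still iterates internally, so this is not guaranteed to be faster than the loop.

```matlab
% Hypothetical, shortened stand-in for the lines read from the file.
Line = {''; 'Hello_Message.pdf'; 'abc'; '&&&'; ''; ...
        'Hello_Message.txt'; 'def'; '&&&'; ...
        'Hello_Message.pdf'; 'ghi'; 'jkl'; '&&&'};

startKey = 'Hello_Message.pdf';   % assumed to match the whole line exactly
endKey   = '&&&';                 % assumed to match the whole line exactly

% Logical indexing over the whole cell array at once.
StartIdx = find(strcmp(Line, startKey));   % rows holding the wanted header
AllEnds  = find(strcmp(Line, endKey));     % rows holding any terminator

% Pair each start with the first terminator at or after it.
% (Errors if a start has no following terminator; see assumptions above.)
EndIdx = arrayfun(@(s) AllEnds(find(AllEnds >= s, 1)), StartIdx);

% Slice each inclusive block into its own cell.
Extract = arrayfun(@(s, e) Line(s:e), StartIdx, EndIdx, 'UniformOutput', false);
```

With this stand-in data, only the two ‘Hello_Message.pdf’ blocks end up in ‘Extract’; the ‘Hello_Message.txt’ block is skipped because its header never matches ‘startKey’.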


6 Comments

Jude on 18 Oct 2023

Hi Star Strider. I reviewed your code above to make certain I understand what is happening. I am not sure how I would modify the line with Lc to only get the 'Hello_Message.pdf' messages; everything I have tried still extracts all the messages.
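
One possible way to narrow the start test, sketched under assumptions rather than taken from the answer above: compare the whole line against the wanted header instead of looking for 'Message' between '_' and '.'. The short 'Line' array is a hypothetical stand-in for the real one, and the strtrim call assumes the real files might carry stray blanks around the header.

```matlab
% Hypothetical, shortened stand-in for the 'Line' cell array built by the fgetl loop.
Line = {''; 'Hello_Message.pdf'; 'abc'; '&&&'; 'Hello_Message.txt'; 'def'; '&&&'};

Start = false(numel(Line), 1);
for k1 = 1:numel(Line)
    % Exact, whole-line comparison; strtrim guards against stray
    % leading/trailing blanks (an assumption about the real files).
    if strcmp(strtrim(Line{k1}), 'Hello_Message.pdf')
        Start(k1) = true;
    end
end
StartIdx = find(Start);   % only rows whose text is exactly the header
```

With only the start test changed, the '&&&' test would still flag the terminator of every message, so each start would still have to be paired with the first '&&&' that follows it.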

I made a modified version (code below) that pulls out only the 'Hello_Message.pdf' messages. I greatly appreciate your time! The code below looks inefficient: after the first message is found at StartIdx = 2 and EndIdx = 21, the outer loop "for k1 = 1:k-1" resumes at line 3 instead of continuing from line 22, so the lines of each message are scanned twice.

I would like to try to get the results of the code below but with your method used in the code above.

clear
clc
fidi = fopen('sample_input.txt','rt');
k=1;
while ~feof(fidi)
    Line{k,1} = fgetl(fidi);
    k = k+1;
end
fclose(fidi);
for k1 = 1:k-1
    if strfind(Line{k1,1}, 'Hello_Message.pdf')
        Start(k1)=1;
        StartIdx = find(Start);
        for k2 = k1+1:k-1
            if strfind(Line{k2,1}, '&&&')
                End(k2) = 1;
                EndIdx = find(End);
                break
            end
        end
    end
end
for k = 1:numel(StartIdx)
    Extract{k,1} = Line(StartIdx(k):EndIdx(k)); % contains the solution I am looking for
end
Filtered_Msgs_StartEnd = {StartIdx,EndIdx}
Cnt_Filtered_Msgs = numel(Filtered_Msgs_StartEnd)
Extract{:}
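
The rescanning concern described above could be avoided with a single forward pass that remembers whether it is currently inside a message. This is a minimal sketch, not a tested solution for the real files: the 'Line' array is a hypothetical stand-in, 'blockStart' is an illustrative name, and the keys are assumed to occupy a whole line exactly.

```matlab
% Hypothetical, shortened stand-in for the lines read from the file.
Line = {''; 'Hello_Message.pdf'; 'abc'; '&&&'; ...
        'Bye_Message.pdf'; 'x'; '&&&'; ...
        'Hello_Message.pdf'; 'ghi'; '&&&'};

startKey = 'Hello_Message.pdf';   % assumed to match the whole line exactly
endKey   = '&&&';                 % assumed to match the whole line exactly

Extract    = {};
blockStart = 0;                   % 0 means "not currently inside a wanted message"
for k1 = 1:numel(Line)
    if blockStart == 0
        % Outside a wanted message: only look for the header.
        if strcmp(Line{k1}, startKey)
            blockStart = k1;
        end
    elseif strcmp(Line{k1}, endKey)
        % Inside a wanted message and the terminator was reached:
        % store the inclusive block and leave the message.
        Extract{end+1, 1} = Line(blockStart:k1);
        blockStart = 0;
    end
end
```

Each line is visited exactly once, and terminators belonging to other messages (here the 'Bye_Message.pdf' block) are ignored because no wanted message is open when they are reached.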

Star Strider on 18 Oct 2023

I just now ran this and it seems to do what you want.

What specifically would you want to change?

clear
clc
fidi = fopen('sample_input.txt','rt');
k=1;
while ~feof(fidi)
    Line{k,1} = fgetl(fidi);
    k = k+1;
end
fclose(fidi);
for k1 = 1:k-1
    if strfind(Line{k1,1}, 'Hello_Message.pdf')
        Start(k1)=1;
        StartIdx = find(Start);
        for k2 = k1+1:k-1
            if strfind(Line{k2,1}, '&&&')
                End(k2) = 1;
                EndIdx = find(End);
                break
            end
        end
    end
end
for k = 1:numel(StartIdx)
    Extract{k,1} = Line(StartIdx(k):EndIdx(k)); % contains the solution I am looking for
end
Filtered_Msgs_StartEnd = {StartIdx,EndIdx}
Filtered_Msgs_StartEnd = 1×2 cell array
    {[2 72]}    {[21 91]}
Cnt_Filtered_Msgs = numel(Filtered_Msgs_StartEnd)
Cnt_Filtered_Msgs = 2
Extract{:}
ans = 20×1 cell array
    {'Hello_Message.pdf'                                                                                                                                                  }
    {'2341234342 3214234 ert'                                                                                                                                             }
    {'2341234342 3214234 abc'                                                                                                                                             }
    {'2341234342 3214234'                                                                                                                                                 }
    {'Some_ting'                                                                                                                                                          }
    {'23453425'                                                                                                                                                           }
    {'Blah_bleh'                                                                                                                                                          }
    {'Sadf_5'                                                                                                                                                             }
    {'Ouch 4'                                                                                                                                                             }
    {'TEST'                                                                                                                                                               }
    {'                                '                                                                                                                                   }
    {'                                '                                                                                                                                   }
    {'                                '                                                                                                                                   }
    {'Asdff: sdf_sdf'                                                                                                                                                     }
    {'Is_sdf: asdf'                                                                                                                                                       }
    {'IS_ssg: sadf'                                                                                                                                                       }
    {'NJ_T: adfgh'                                                                                                                                                        }
    {0×0 char                                                                                                                                                             }
    {'Some_data_:    4 sadf sadf asdf 45676578675 sdaf sadf asdf asdf sadf 4365436546 sdfdsf 0 sadfsdffds 0 <Item> toaster</Item> dsfasdf sadfdsakfdsfklj sdafsdafdsa fds'}
    {'&&&'                                                                                                                                                                }
ans = 20×1 cell array
    {'Hello_Message.pdf'                                                                                                                                                 }
    {'2341234342 3214234 ert'                                                                                                                                            }
    {'2341234342 3214234 abc'                                                                                                                                            }
    {'2341234342 3214234'                                                                                                                                                }
    {'Some_ting'                                                                                                                                                         }
    {'23453425'                                                                                                                                                          }
    {'Blah_bleh'                                                                                                                                                         }
    {'Sadf_5'                                                                                                                                                            }
    {'Ouch 4'                                                                                                                                                            }
    {'TEST'                                                                                                                                                              }
    {'                                '                                                                                                                                  }
    {'                                '                                                                                                                                  }
    {'                                '                                                                                                                                  }
    {'Asdff: sdf_sdf'                                                                                                                                                    }
    {'Is_sdf: asdf'                                                                                                                                                      }
    {'IS_ssg: sadf'                                                                                                                                                      }
    {'NJ_T: adfgh'                                                                                                                                                       }
    {0×0 char                                                                                                                                                            }
    {'Some_data_:    4 sadf sadf asdf 45676578675 sdaf sadf asdf asdf sadf 4365436546 sdfdsf 0 sadfsdffds 0 <Item> iron </Item> dsfasdf sadfdsakfdsfklj sdafsdafdsa fds\'}
    {'&&&'                                                                                                                                                               }

.

Jude on 19 Oct 2023

The output for the code above is correct but the way I am iterating is a bit off and I would like to make it more efficient. The first instance of the 'Hello_Message.pdf' is on sample_input.txt lines 2-21 with the second instance of the 'Hello_Message.pdf' occurring between lines 72-91...

On the first iteration of the first for loop the first message is extracted as intended; however, at the start of the second iteration of that loop K1 is back at line 3. I can't figure out how to avoid repeating lines 3-21 and instead proceed from the line after the first extracted message, so that I don't spend any time reparsing lines of the first extracted message. In other words, for iteration 2 of the first for loop I would like it to start from line 23, rather than line 3, and continue searching for the next instance of StartIdx so that previously parsed lines don't get reparsed.

Perhaps it would be better for me to just modify the Lc variable from your response (on 18 Oct 2023 at 2:12), but nothing I have tried seems to produce the same output as the code from my most recent response.

Star Strider on 19 Oct 2023

sample_input.txt

Without altering my previous code significantly, I added an extra for loop to extract the next ‘EndIdx’ value greater than the preceding ‘StartIdx’ value, and then saved those values (initially assigned to ‘NextEnd’) to ‘EndIdx’ afterwards. That produced two pairs of consecutive ‘StartIdx’ and ‘EndIdx’ values, one for each section.

type('sample_input.txt')
Hello_Message.pdf
2341234342 3214234 ert
2341234342 3214234 abc
2341234342 3214234
Some_ting
23453425
Blah_bleh
Sadf_5
Ouch 4
TEST
                                
                                
                                
Asdff: sdf_sdf
Is_sdf: asdf
IS_ssg: sadf
NJ_T: adfgh

Some_data_:    4 sadf sadf asdf 45676578675 sdaf sadf asdf asdf sadf 4365436546 sdfdsf 0 sadfsdffds 0 <Item> toaster</Item> dsfasdf sadfdsakfdsfklj sdafsdafdsa fds
&&&



Hello_Message.txt
2341234342 3214234 ert
2341234342 3214234 abc
2341234342 3214234
Some_ting
23453425
Blah_bleh
Sadf_5
Ouch 4
TEST
                                
                                
                                
Asdff: sdf_sdf
Is_sdf: asdf
IS_ssg: sadf
NJ_T: adfgh

Some_data_:    4 sadf sadf asdf 45676578675 sdaf sadf asdf asdf sadf 4365436546 sdfdsf 0 sadfsdffds 0 <Item> thisdata</Item> dsfasdf sadfdsakfdsfklj sdafsdafdsa fds
&&&

Bye_Message.pdf
2341234342 3214234 ert
2341234342 3214234 abc
2341234342 3214234
Some_ting
23453425
Blah_bleh
Sadf_5
Ouch 4
TEST
                                
                                
                                
Asdff: sdf_sdf
Is_sdf: asdf
IS_ssg: sadf
NJ_T: adfgh

Some_data_:    4 sadf sadf asdf 45676578675 sdaf sadf asdf asdf sadf 4365436546 sdfdsf 0 sadfsdffds 0 <Item> sadfsdfdsfasdf</Item> dsfasdf sadfdsakfdsfklj sdafsdafdsa fds
&&&






Hello_Message.pdf
2341234342 3214234 ert
2341234342 3214234 abc
2341234342 3214234
Some_ting
23453425
Blah_bleh
Sadf_5
Ouch 4
TEST
                                
                                
                                
Asdff: sdf_sdf
Is_sdf: asdf
IS_ssg: sadf
NJ_T: adfgh

Some_data_:    4 sadf sadf asdf 45676578675 sdaf sadf asdf asdf sadf 4365436546 sdfdsf 0 sadfsdffds 0 <Item> iron </Item> dsfasdf sadfdsakfdsfklj sdafsdafdsa fds\
&&&
fidi = fopen('sample_input.txt','rt');
k = 1;
while ~feof(fidi)
    Line{k,:} = fgetl(fidi);
    k = k+1;
end
fclose(fidi);
k
k = 92
% Line
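% Flag the lines that start or end a message (preallocated logical row vectors,
% so stale workspace values cannot leak in and find() works even with no match)
Start = false(1, k-1);
End   = false(1, k-1);
for k1 = 1:k-1
    if ~isempty(Line{k1,:})
        Lc = strfind(Line{k1,:}, 'Hello_Message.pdf');
        % Lc = strfind(extractBetween(Line{k1,:},'_','.'),'Message');
        if ~isempty(Lc)
            Start(k1) = true;
            % sprintf('Start = %2d',k1)
        end
        if ~isempty(strfind(Line{k1}, '&&&'))
            End(k1) = true;
            % sprintf('End = %2d',k1)
        end
    end
end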
StartIdx = find(Start);
EndIdx = find(End);
for k = 1:numel(StartIdx)
    NextEnd(k) = EndIdx(find(EndIdx > StartIdx(k), 1));
end
StartIdx
StartIdx = 1×2
     2    72
EndIdx = NextEnd
EndIdx = 1×2
    21    91
for k = 1:numel(StartIdx)
    Extract{k,:} = Line(StartIdx(k):EndIdx(k));
end
Extract{1}
ans = 20×1 cell array
    {'Hello_Message.pdf'                                                                                                                                                  }
    {'2341234342 3214234 ert'                                                                                                                                             }
    {'2341234342 3214234 abc'                                                                                                                                             }
    {'2341234342 3214234'                                                                                                                                                 }
    {'Some_ting'                                                                                                                                                          }
    {'23453425'                                                                                                                                                           }
    {'Blah_bleh'                                                                                                                                                          }
    {'Sadf_5'                                                                                                                                                             }
    {'Ouch 4'                                                                                                                                                             }
    {'TEST'                                                                                                                                                               }
    {'                                '                                                                                                                                   }
    {'                                '                                                                                                                                   }
    {'                                '                                                                                                                                   }
    {'Asdff: sdf_sdf'                                                                                                                                                     }
    {'Is_sdf: asdf'                                                                                                                                                       }
    {'IS_ssg: sadf'                                                                                                                                                       }
    {'NJ_T: adfgh'                                                                                                                                                        }
    {0×0 char                                                                                                                                                             }
    {'Some_data_:    4 sadf sadf asdf 45676578675 sdaf sadf asdf asdf sadf 4365436546 sdfdsf 0 sadfsdffds 0 <Item> toaster</Item> dsfasdf sadfdsakfdsfklj sdafsdafdsa fds'}
    {'&&&'                                                                                                                                                                }
Extract{end}
ans = 20×1 cell array
    {'Hello_Message.pdf'                                                                                                                                                 }
    {'2341234342 3214234 ert'                                                                                                                                            }
    {'2341234342 3214234 abc'                                                                                                                                            }
    {'2341234342 3214234'                                                                                                                                                }
    {'Some_ting'                                                                                                                                                         }
    {'23453425'                                                                                                                                                          }
    {'Blah_bleh'                                                                                                                                                         }
    {'Sadf_5'                                                                                                                                                            }
    {'Ouch 4'                                                                                                                                                            }
    {'TEST'                                                                                                                                                              }
    {'                                '                                                                                                                                  }
    {'                                '                                                                                                                                  }
    {'                                '                                                                                                                                  }
    {'Asdff: sdf_sdf'                                                                                                                                                    }
    {'Is_sdf: asdf'                                                                                                                                                      }
    {'IS_ssg: sadf'                                                                                                                                                      }
    {'NJ_T: adfgh'                                                                                                                                                       }
    {0×0 char                                                                                                                                                            }
    {'Some_data_:    4 sadf sadf asdf 45676578675 sdaf sadf asdf asdf sadf 4365436546 sdfdsf 0 sadfsdffds 0 <Item> iron </Item> dsfasdf sadfdsakfdsfklj sdafsdafdsa fds\'}
    {'&&&'                                                                                                                                                               }

This approach (adding an additional loop) is likely the most efficient way to choose the correct ‘EndIdx’ for each ‘StartIdx’.
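
As a rough illustration of the same pairing idea, the per-line strfind loop could also be replaced by logical masks built with startsWith, keeping only the short loop over the start indices. This is a minimal sketch, not the code used above: the in-memory sample lines are hypothetical, and treating an unterminated message as running to the last line is an assumption.

```matlab
% Hypothetical in-memory sample standing in for the lines read from the file
Line = {'junk'; 'Hello_Message.pdf'; 'a'; 'b'; '&&&'; ...
        'Bye_Message.pdf'; 'c'; '&&&'; 'Hello_Message.pdf'; 'd'; '&&&'};

% Logical masks, one element per line (no explicit loop over the lines)
isStart = startsWith(Line, 'Hello_Message.pdf');
isEnd   = startsWith(Line, '&&&');

StartIdx = find(isStart);
EndIdx   = find(isEnd);

% For each start, pick the first end marker that follows it
Extract = cell(numel(StartIdx), 1);
for k = 1:numel(StartIdx)
    nextEnd = EndIdx(find(EndIdx > StartIdx(k), 1));
    if isempty(nextEnd)
        nextEnd = numel(Line);   % assumption: unterminated message runs to the last line
    end
    Extract{k} = Line(StartIdx(k):nextEnd);   % inclusive of start and end lines
end
```

With this sample, the end marker that belongs to 'Bye_Message.pdf' is skipped because it does not follow a 'Hello_Message.pdf' start more closely than the correct one. Note that this still needs all lines in memory, so it suits files that fit comfortably in RAM.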

.

Jude on 20 Oct 2023

Thank you Star Strider, your solution works great. Thanks again for your time and patience with me.

Star Strider on 20 Oct 2023

As always, my pleasure!

Answer 2

Jan on 18 Oct 2023

Edited: Jan on 18 Oct 2023

1 vote

sample_input.txt

Why do you want to avoid loops? Reading the complete file in order to apply vectorized methods requires 8 GB of contiguous free RAM for a 4 GB file (16 bits per char). I'd choose such an approach only on computers with >= 32 GB RAM, while a loop method is less demanding concerning RAM. In addition, filtering while reading avoids keeping the complete text in RAM.

S = ParseFile("sample_input.txt");
S{1}
ans = 18×1 cell array
    {'2341234342 3214234 ert'                                                                                                                                             }
    {'2341234342 3214234 abc'                                                                                                                                             }
    {'2341234342 3214234'                                                                                                                                                 }
    {'Some_ting'                                                                                                                                                          }
    {'23453425'                                                                                                                                                           }
    {'Blah_bleh'                                                                                                                                                          }
    {'Sadf_5'                                                                                                                                                             }
    {'Ouch 4'                                                                                                                                                             }
    {'TEST'                                                                                                                                                               }
    {'                                '                                                                                                                                   }
    {'                                '                                                                                                                                   }
    {'                                '                                                                                                                                   }
    {'Asdff: sdf_sdf'                                                                                                                                                     }
    {'Is_sdf: asdf'                                                                                                                                                       }
    {'IS_ssg: sadf'                                                                                                                                                       }
    {'NJ_T: adfgh'                                                                                                                                                        }
    {0×0 char                                                                                                                                                             }
    {'Some_data_:    4 sadf sadf asdf 45676578675 sdaf sadf asdf asdf sadf 4365436546 sdfdsf 0 sadfsdffds 0 <Item> toaster</Item> dsfasdf sadfdsakfdsfklj sdafsdafdsa fds'}
function S = ParseFile(File)
startKey = "Hello_Message.pdf";
stopKey  = "&&&";
fid = fopen(File, 'r');
assert(fid > 0, "Cannot open file: %s", File);
bS = 1000;  % Pre-allocate output in blocks
nS = bS;
iS = 0;
S  = cell(1, nS);
buffer  = cell(20, 1);  % Grows iteratively at first
ibuffer = 0;
doGrab  = false;
while ~feof(fid)
    Line = fgetl(fid);
    if startsWith(Line, startKey)
        buffer(:) = {[]};  % Clear the buffer
        ibuffer   = 0;
        doGrab    = true;  % Start grabbing in next line
        
    elseif startsWith(Line, stopKey)
        doGrab = false;    % Stop grabbing
        
        iS = iS + 1;       % Expand output S in blocks on demand
        if iS > nS
            nS    = nS + bS;
            S{nS} = [];
        end
        S{iS} = buffer(1:ibuffer);    % Store the buffer
       
    elseif doGrab
        ibuffer         = ibuffer + 1;
        buffer{ibuffer} = Line;
    end
end
fclose(fid);
if doGrab     % Store the last buffer if the stopKey is missing
    iS    = iS + 1;
    S{iS} = buffer(1:ibuffer);   % the collected lines, not just the last line read
end
S = S(1:iS);  % Crop pre-allocated output cells
end

1 Comment

Jude on 18 Oct 2023

Hi Jan,

With regard to my reason for wanting to avoid for loops, I "assumed" there could be a more resource- and time-efficient way to accomplish what I was trying to do. The input files are maintained on a network and are not stored locally on the machine (64 GB RAM) where MATLAB is being executed.

I do like your approach a lot and will be looking at it in detail so that I understand what is happening...

How would your solution/code be modified so that the startkey and stopkey for the messages are included in the cell arrays captured by S?

Perhaps the startkey would need to be defined as regexpPattern('.*\.[a-z]{3}') then a filter for the message where line1 is equal to "Hello_Message.pdf" applied?

Thank you for your time.
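
One way the ParseFile loop above might be adapted so that each captured block includes its start and stop lines is sketched below. It is only an illustration under stated assumptions: the temporary sample file and its contents are hypothetical, and the output S is grown with end+1 for brevity rather than with the block pre-allocation used in ParseFile.

```matlab
% Write a small hypothetical sample to a temporary file
tmpFile = [tempname '.txt'];
fid = fopen(tmpFile, 'w');
fprintf(fid, '%s\n', 'junk', 'Hello_Message.pdf', 'a', 'b', '&&&', ...
    'Bye_Message.pdf', 'c', '&&&', 'Hello_Message.pdf', 'd', '&&&');
fclose(fid);

startKey = "Hello_Message.pdf";
stopKey  = "&&&";

S      = {};        % one cell per captured message (grown on demand for brevity)
buffer = {};        % lines of the message currently being captured
doGrab = false;

fid = fopen(tmpFile, 'r');
assert(fid > 0, "Cannot open file: %s", tmpFile);
while true
    Line = fgetl(fid);
    if ~ischar(Line)                 % fgetl returns -1 at end of file
        break
    end
    if startsWith(Line, startKey)
        buffer = {Line};             % start a new message and keep the start line
        doGrab = true;
    elseif doGrab
        buffer{end+1, 1} = Line;     % keep every line, including the stop line
        if startsWith(Line, stopKey)
            S{end+1, 1} = buffer;    % message complete: store it
            doGrab = false;
        end
    end
end
fclose(fid);
delete(tmpFile);

if doGrab                            % stopKey missing for the last message
    S{end+1, 1} = buffer;
end
```

The only change in logic is that the start line initialises the buffer and the stop line is appended before the buffer is stored. For files with many messages, the block-wise pre-allocation from ParseFile would probably be the better choice.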
