Trying to sort a text document by alphabetical order and count how many times a word appears

Hello,
I am trying to import a .txt file, count how many times each word appears, and then sort all the words alphabetically.
I want to read the .txt file into a string, split the string into individual words, and then put the words into a matrix along with their word counts.
Take the following sentence as an example: "The cat is a cat and would like to have a cat."
The output would look like the following:
word:a count:2
word:and count:1
word:cat count:3
word:have count:1
word:is count:1
word:like count:1
word:the count:1
word:to count:1
word:would count:1
Here is what I have right now.
fid = fopen('Theodore_Roosevelt_The_Duties_Of_American_Citizenship.txt');
Line = fgetl(fid);
textfile = strings(1,1);
k = 1;
while ischar(Line)
textfile(k,1) = Line;
Line = fgetl(fid);
k = k+1;
end
fclose(fid);
% replace the listed punctuation marks with spaces
punc = [".",";",",","(",")","--","-"];
textfile = replace(textfile,punc," ");
textfile=lower(textfile);
% the for loop splits each line of 'textfile' into individual words
words = strings(0);
for i = 1:length(textfile)
words = [words;split(textfile(i))];
end
k=convertStringsToChars(words);
h=1;
F=2;
G=1;
sorted=zeros(1);
for j = 1:length(k)
T=strncmp(k(h),k(F),1);
if T==1
%if true, put h word in sorted before word F
sorted=sorted(k(h),G);
G=G+1;
sorted=sorted(k(F),G);
end
h=h+2;
F=F+2;
end
disp(sorted)
This is the error I get when executing the code:
Error using sort
Input argument must be a cell array of character vectors.
Error in sorted (line 28)
[ignored,index] = sort([meshsites(:).' sites(:).']);
Error in fgetltest (line 35)
sorted=sorted(k(h),G);
This is for a homework question, and I am lost on how to build the matrix that should hold the words and their counts.
  3 Comments
Andy T on 3 Nov 2019
Edited: Andy T on 3 Nov 2019
My MATLAB release is 2019. I also forgot to put those in when I was copying the code into the question; let me fix that. I also forgot to even create a vector called sorted.
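The stack trace in the question looks consistent with that last point: "Error in sorted (line 28)" suggests that, with no variable named sorted in the workspace, MATLAB resolved sorted(k(h),G) as a call to a function file named sorted somewhere on the path. A minimal sketch of the difference, assuming a clean workspace (the words "cat" and "and" are only placeholder data):

```matlab
clear sorted                         % make sure no variable of that name exists
isVar = exist('sorted', 'var');      % 0: sorted(...) would be looked up as a function

sorted = strings(0, 1);              % create an empty string column first
sorted(end+1, 1) = "cat";            % index on the LEFT side to store an element
sorted(end+1, 1) = "and";
isVarNow = exist('sorted', 'var');   % 1: sorted(...) now indexes the variable
```

Note that sorted = sorted(k(h),G) reads from the variable rather than storing into it, so even with the variable defined, the assignment needs the index on the left-hand side as above.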


Accepted Answer

Adam Danz on 3 Nov 2019
Edited: Adam Danz on 3 Nov 2019
Here's a different approach. See inline comments for details.
% Read text
C = fileread('Theodore_Roosevelt_The_Duties_Of_American_Citizenship.txt');
% Split into words by spaces
words = strsplit(strtrim(C));
% Remove problematic characters.
% Be careful: this strips every non-letter from each word, so
% cat's becomes cats. Without it, "cats" (in quotes) or cats! would
% not be counted as cats. If losing apostrophes is a problem,
% use a regular expression instead.
words = cellfun(@(x)x(isletter(x)), words, 'UniformOutput', false);
% Drop tokens left empty by the step above (e.g. a bare "--" or a number)
words(cellfun(@isempty, words)) = [];
% make all letters lower case
words = lower(words);
% sort them into alphabetical order
words = sort(words);
% Count frequency of each word
wordList = unique(words);
wordCount = histcounts(categorical(words), categorical(wordList));
% Output table
T = table(wordList(:), wordCount(:), 'VariableNames', {'Word', 'Count'});
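To check the approach against the sentence from the question, here is a small sketch that runs the same cleaning steps on that sentence instead of the file. It swaps histcounts for the third output of unique combined with accumarray, which is just another way to get the counts, and it prints in the "word:... count:..." layout the question asks for. The line that drops empty tokens is an extra safeguard for tokens that contain no letters.

```matlab
% Example sentence from the question (stands in for the file contents)
C = 'The cat is a cat and would like to have a cat.';

% Cleaning steps from the answer, plus dropping empty tokens
words = strsplit(strtrim(C));
words = cellfun(@(x) x(isletter(x)), words, 'UniformOutput', false);
words = lower(words);
words(cellfun(@isempty, words)) = [];

% unique returns the distinct words in sorted order;
% idx gives, for every word, its position in wordList
[wordList, ~, idx] = unique(words);

% accumarray adds 1 for each occurrence of every index
wordCount = accumarray(idx(:), 1);

% Print one line per distinct word
for n = 1:numel(wordList)
    fprintf('word:%s count:%d\n', wordList{n}, wordCount(n));
end
```

For this sentence the loop prints nine lines, from word:a count:2 through word:would count:1, with cat counted three times.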