Using TFIDF with Naive Bayes
I'm building a sentiment classification model using TF-IDF and naive Bayes, but the model keeps misclassifying the second class. I have used TF-IDF with other models, such as SVM and random forest, and it worked fine there. Below I describe my data and the steps I used. I have 2000 comments (1000 positive, 1000 negative).
1) Data preprocessing
cleanTextData = erasePunctuation(textData);
cleanTextData = lower(cleanTextData);
words = stopWords;
cleanDocuments = tokenizedDocument(cleanTextData);
cleanDocuments = removeWords(cleanDocuments,words);
cleanDocuments = normalizeWords(cleanDocuments);
cleanDocuments(1:10)
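%% Bag of Words
cleanBag = bagOfWords(cleanDocuments)
cleanBag = removeInfrequentWords(cleanBag,2) % remove words with frequency less than or equal to 2
%% Remove empty documents caused by preprocessing
% idx holds the indices of the removed documents; the matching entries of
% response must be removed as well, e.g. response(idx) = [];
[cleanBag,idx] = removeEmptyDocuments(cleanBag);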
2) Then I computed the TF-IDF matrix:
predictors = tfidf(cleanBag,'Normalized',true,'TFWeight','log','IDFWeight','smooth');
3) Then I passed the result to my naive Bayes model:
t = templateNaiveBayes('DistributionNames','mvmn');
CVMdl = fitcecoc(predictors,response,'KFold',10,'Learners',t,'FitPosterior',true,'Coding','onevsone','ResponseName','response');
But the confusion matrix gives the following results:
C1 C2
____ __
990 10
1000 0
It seems to be assigning almost all of the 2000 observations to a single class. Please advise; I have tried almost everything I know and whatever others have suggested. This is related to my master's thesis, and I have only a few weeks left to submit it.
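One thing that may be worth checking, offered as a hedged sketch rather than a confirmed diagnosis: with 'DistributionNames','mvmn', naive Bayes treats every predictor as categorical, so each distinct TF-IDF value becomes its own category level. The 'mn' (multinomial, bag-of-tokens) option models word counts instead. The example below uses synthetic counts as a hypothetical stand-in for full(cleanBag.Counts); the sizes, rates, and variable names are assumptions made only for illustration.

```matlab
% Hypothetical stand-in data: in the real pipeline, counts would be
% full(cleanBag.Counts) and response would be the existing label vector.
rng(0);                                        % reproducible data and folds
numDocs  = 200;
numWords = 30;
isPos    = (1:numDocs)' > numDocs/2;           % second half is "positive"
counts   = poissrnd(0.5, numDocs, numWords);   % background word counts
% Make the first five words more frequent in positive documents.
counts(isPos,1:5) = counts(isPos,1:5) + poissrnd(2, nnz(isPos), 5);
labels   = ["negative"; "positive"];
response = categorical(labels(1 + isPos));

% Multinomial naive Bayes on raw counts, with 10-fold cross-validation.
CVMdl     = fitcnb(counts, response, 'DistributionNames','mn', 'KFold',10);
predicted = kfoldPredict(CVMdl);
C         = confusionmat(response, predicted)  % rows: true class, columns: predicted
```

Whether this helps on the real comments is untested here; it only shows the count-based variant of the same model.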
4 Comments
Christopher Creutzig
on 26 Nov 2018
Edited: Christopher Creutzig
on 26 Nov 2018
Do you have to use naïve Bayes, or did you try other models and get even worse results?
With only two classes, I do not see why you use fitcecoc, which is an interface to use multiple binary classifiers to build a multi-class one. You could use fitclinear instead, which in my experience is pretty good at the kind of high-dimensional fitting required in text analytics.
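A minimal sketch of the fitclinear suggestion above, under stated assumptions: the sparse matrix X is a synthetic stand-in for the tfidf output (predictors), Y stands in for response, and the logistic learner is just one possible choice, not something the comment prescribes.

```matlab
% Hypothetical stand-in data: replace X with the tfidf matrix (predictors)
% and Y with the label vector (response) from the question.
rng(0);                                   % reproducible data and folds
X = sprand(200, 50, 0.1);                 % sparse matrix, one row per document
score  = full(sum(X(:,1:25),2) - sum(X(:,26:50),2));
labels = ["negative"; "positive"];
Y = categorical(labels(1 + (score > 0)));

% Binary linear classifier with 10-fold cross-validation.
CVMdl     = fitclinear(X, Y, 'Learner','logistic', 'KFold',10);
predicted = kfoldPredict(CVMdl);          % out-of-fold predicted labels
C         = confusionmat(Y, predicted)    % rows: true class, columns: predicted
cvError   = kfoldLoss(CVMdl)              % cross-validated classification error
```

fitclinear accepts sparse input directly, so the TF-IDF matrix does not need to be converted with full first.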
Oscar Green
on 10 May 2019
One thing I've done in the past is to aggregate/discretize into log-frequency buckets and treat those as features. It's a bit of a hack, but so is naive Bayes, and it ends up working pretty well.
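A rough sketch of what the log-frequency bucket idea could look like. The bucket formula floor(log2(1 + count)), the synthetic counts, and the variable names are all assumptions for illustration; the comment does not say which bucketing was actually used.

```matlab
% Hypothetical stand-in data: in the real pipeline, counts would be
% full(cleanBag.Counts) and response the existing label vector.
rng(0);
numDocs = 200;
isPos   = (1:numDocs)' > numDocs/2;
counts  = poissrnd(0.5, numDocs, 30);
counts(isPos,1:5) = counts(isPos,1:5) + poissrnd(2, nnz(isPos), 5);
labels   = ["negative"; "positive"];
response = categorical(labels(1 + isPos));

% Discretize counts into log-frequency buckets:
% 0 -> 0, 1-2 -> 1, 3-6 -> 2, 7-14 -> 3, and so on.
buckets = floor(log2(1 + counts));

% Each bucket index is a category level, which matches what 'mvmn' expects.
Mdl       = fitcnb(buckets, response, 'DistributionNames','mvmn');
predicted = predict(Mdl, buckets);             % resubstitution, optimistic
C         = confusionmat(response, predicted)  % rows: true class, columns: predicted
```

The confusion matrix here is computed on the training data, so it is optimistic; for a real evaluation the same model would need to be cross-validated or scored on held-out comments.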
Answers (0)