How to eliminate values in a matrix so I have a high R-squared linear regression

Suppose I have this matrix:
M=[0.6900 12.4020
1.9400 15.5160
2.0090 21.8970
2.0930 31.6430
2.1780 40.9470
2.2630 48.9080
2.3480 56.8650]
X = M(:,1);
Y = M(:,2);
p = polyfit(X,Y,1)
Yfit = polyval(p,X);
yresid = Y - Yfit;
SSresid = sum(yresid.^2);
SStotal = (length(Y)-1) * var(Y);
rsq = 1 - SSresid/SStotal
I use polyfit and polyval to create a linear regression and calculate R-squared. I need something that evaluates the data in my matrix and eliminates outliers until I have a minimum R-squared of 0.99.
For example, eliminating the first row of my matrix changes the R-squared value from 0.57 to 0.99,
so my final matrix should be
M=[1.9400 15.5160
2.0090 21.8970
2.0930 31.6430
2.1780 40.9470
2.2630 48.9080
2.3480 56.8650]
I need the deletion to be done by the program, because I have a really big dataset and can't delete values manually.
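One way to automate this (a sketch, not from the thread; the 0.99 threshold variable and the leave-one-out deletion strategy are assumptions) is to repeatedly delete whichever single row most improves R-squared, stopping once the threshold is reached:

```matlab
M = [0.6900 12.4020
     1.9400 15.5160
     2.0090 21.8970
     2.0930 31.6430
     2.1780 40.9470
     2.2630 48.9080
     2.3480 56.8650];
thresh = 0.99;                       % assumed minimum R-squared
while size(M,1) > 2
    X = M(:,1);  Y = M(:,2);
    p = polyfit(X,Y,1);
    r = Y - polyval(p,X);
    rsq = 1 - sum(r.^2)/((numel(Y)-1)*var(Y));
    if rsq >= thresh, break, end
    % leave-one-out: delete whichever row raises R-squared the most
    best = -Inf;
    for k = 1:size(M,1)
        Mk = M;  Mk(k,:) = [];       % candidate matrix with row k removed
        pk = polyfit(Mk(:,1),Mk(:,2),1);
        rk = Mk(:,2) - polyval(pk,Mk(:,1));
        rsqk = 1 - sum(rk.^2)/((numel(rk)-1)*var(Mk(:,2)));
        if rsqk > best, best = rsqk;  drop = k; end
    end
    M(drop,:) = [];                  % delete the most harmful row, refit
end
```

On the matrix above this removes the first row on the first pass and then stops, since removing it brings R-squared up to about 0.99. Leave-one-out is slower than simply dropping the largest residual, but it avoids picking the wrong point when the outlier has high leverage.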
Thanks

Accepted Answer

Star Strider on 27 Aug 2014
If you get the S structure from polyfit and ask polyval for the delta output, it will return error estimates on the fitted values at the X-values. You can use those values to eliminate the outliers.
In your particular situation, I would switch the X and Y vectors to detect and eliminate the outlier:
Y = M(:,1);
X = M(:,2);
[p,S] = polyfit(X,Y,1);
[Yfit,delta] = polyval(p,X,S);
figure(1)
plot(X, Y, '+b')
hold on
plot(X,Yfit,'-r', X,Yfit+delta,'-g', X,Yfit-delta,'-g')
hold off
grid
Any value outside the delta ranges can be considered an outlier and deleted.
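For example, one possible deletion step (a sketch using the Y, Yfit, and delta computed above):

```matlab
keep = abs(Y - Yfit) <= delta;   % true for points inside the +/- delta band
M = M(keep,:);                   % drop the rows flagged as outliers
```

Logical indexing preserves the order of the remaining rows, so M can be refitted directly afterwards.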
dpb on 28 Aug 2014
I'd never thought of using the regression error as the outlier detection criterion, Star...not a bad idea that!
Star Strider on 28 Aug 2014
Thank you!
This is what I’ve always used to detect and eliminate outliers. I remember reading a statistical justification for it, but that was back in the cuneiform-and-abacus days. I can’t find that reference now.
I like your derivative idea, but I couldn’t arrive at a statistical test to remove outliers with it, other than polyfit and polyval, and it adds a step to those analyses. Maybe doing a t-transform (or z-transform with large enough N) on the derivative vector would work, and then eliminating those beyond a given threshold, but I only just thought of that now.
It’s good to have a statistical rationale for eliminating errant data. It’s something I always designed into every experimental protocol I ever wrote.


More Answers (1)

dpb on 27 Aug 2014
If it's known the data are linear with an outlier or two, possibly the most sensitive test would be something based on the gradient of the raw data. In your example, looking at outliers of the fit without visualization leads to choosing the wrong point if one just uses the residual magnitude. If one looks at the shape of the residual curve it's simpler to tell what's going on, but that's not necessarily easy to code.
Consider for your data the following, however...
>> g=diff(M(:,2))./ diff(M(:,1))
g =
2.4912
92.4783
116.0238
109.4588
93.6588
93.6118
>> g-mean(g)
ans =
-82.1292
7.8578
31.4034
24.8384
9.0384
8.9913
>>
The max() of that difference is pretty clear, either in absolute terms or compared to the mean, and its location indicates the bum point is either the first or the second. That the second difference is reasonable indicates it's actually the first point that's the bad one, not the second.
Alternatively, of course, one can simply begin "one at a time" rejection if datasets are small.
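The gradient test above can be sketched in code (an illustration only; the tie-break for deciding which endpoint of the deviant segment is bad is an assumption that holds for a single isolated outlier):

```matlab
g = diff(M(:,2))./diff(M(:,1));   % local slopes between successive points
d = abs(g - mean(g));             % deviation of each slope from the mean
[~,k] = max(d);                   % most deviant segment, between rows k and k+1
% The bad point is one endpoint of segment k: if the previous segment is
% well-behaved, row k+1 is the culprit; otherwise it is row k.
if k == numel(d) || (k > 1 && d(k-1) < d(k+1))
    bad = k+1;
else
    bad = k;
end
M(bad,:) = [];                    % delete the offending row
```

For the data above, d is largest for the first segment and k == 1, so the else branch fires and the first row is removed, matching the reasoning in the comment.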
