How to eliminate values in a matrix so I have a high R-squared linear regression

Suppose I have this matrix:
M=[0.6900 12.4020
1.9400 15.5160
2.0090 21.8970
2.0930 31.6430
2.1780 40.9470
2.2630 48.9080
2.3480 56.8650]
X = M(:,1);
Y = M(:,2);
p = polyfit(X,Y,1)
Yfit = polyval(p,X);
yresid = Y - Yfit;
SSresid = sum(yresid.^2);
SStotal = (length(Y)-1) * var(Y);
rsq = 1 - SSresid/SStotal
I use polyfit and polyval to create a linear regression and calculate R-squared. I need something that evaluates the data in my matrix and eliminates outliers until I have a minimum R-squared of 0.99.
For example, eliminating the first row of my matrix changes the R-squared value from 0.57 to 0.99,
so my final matrix should be
M=[1.9400 15.5160
2.0090 21.8970
2.0930 31.6430
2.1780 40.9470
2.2630 48.9080
2.3480 56.8650]
I need the deletion to be done by the program, because I have a really big dataset and can't delete values manually.
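One way to automate this (a sketch, not from the thread; the 0.99 threshold variable and the leave-one-out deletion strategy are assumptions) is to repeatedly delete whichever single row most improves R-squared, stopping once the threshold is reached:

```matlab
M = [0.6900 12.4020
     1.9400 15.5160
     2.0090 21.8970
     2.0930 31.6430
     2.1780 40.9470
     2.2630 48.9080
     2.3480 56.8650];
thresh = 0.99;                       % assumed minimum R-squared
while size(M,1) > 2
    X = M(:,1);  Y = M(:,2);
    p = polyfit(X,Y,1);
    r = Y - polyval(p,X);
    rsq = 1 - sum(r.^2)/((numel(Y)-1)*var(Y));
    if rsq >= thresh, break, end
    % leave-one-out: delete whichever row raises R-squared the most
    best = -Inf;
    for k = 1:size(M,1)
        Mk = M;  Mk(k,:) = [];       % candidate matrix with row k removed
        pk = polyfit(Mk(:,1),Mk(:,2),1);
        rk = Mk(:,2) - polyval(pk,Mk(:,1));
        rsqk = 1 - sum(rk.^2)/((numel(rk)-1)*var(Mk(:,2)));
        if rsqk > best, best = rsqk;  drop = k; end
    end
    M(drop,:) = [];                  % delete the most harmful row, refit
end
```

On the matrix above this removes the first row on the first pass and then stops, since removing it brings R-squared up to about 0.99. Leave-one-out is slower than simply dropping the largest residual, but it avoids picking the wrong point when the outlier has high leverage.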
Thanks

Accepted Answer

Star Strider on 27 Aug 2014
If you get the S structure from polyfit and ask polyval for the delta output, it will return error estimates on the fitted values at the X-values. You can use those values to eliminate the outliers.
In your particular situation, I would switch the X and Y vectors to detect and eliminate the outlier:
Y = M(:,1);
X = M(:,2);
[p,S] = polyfit(X,Y,1);
[Yfit,delta] = polyval(p,X,S);
figure(1)
plot(X, Y, '+b')
hold on
plot(X,Yfit,'-r', X,Yfit+delta,'-g', X,Yfit-delta,'-g')
hold off
grid
Any value outside the delta ranges can be considered an outlier and deleted.
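For example, one possible deletion step (a sketch using the Y, Yfit, and delta computed above):

```matlab
keep = abs(Y - Yfit) <= delta;   % true for points inside the +/- delta band
M = M(keep,:);                   % drop the rows flagged as outliers
```

Logical indexing preserves the order of the remaining rows, so M can be refitted directly afterwards.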
dpb on 28 Aug 2014
I'd never thought of using the regression error as the outlier detection criterion, Star...not a bad idea that!
Star Strider on 28 Aug 2014
Thank you!
This is what I’ve always used to detect and eliminate outliers. I remember reading a statistical justification for it, but that was back in the cuneiform-and-abacus days. I can’t find that reference now.
I like your derivative idea, but I couldn’t arrive at a statistical test to remove outliers with it, other than polyfit and polyval, and it adds a step to those analyses. Maybe doing a t-transform (or z-transform with large enough N) on the derivative vector would work, and then eliminating those beyond a given threshold, but I only just thought of that now.
It’s good to have a statistical rationale for eliminating errant data. It’s something I always designed into every experimental protocol I ever wrote.


More Answers (1)

dpb on 27 Aug 2014
If it's known the data are linear with an outlier or two, possibly the most sensitive test would be something based on the gradient of the raw data. In your example, looking at outliers of the fit without visualization leads to choosing the wrong point if one just uses the residual magnitude. If one looks at the shape of the residual curve it's simpler to tell what's going on, but that's not necessarily easy to code.
Consider for your data the following, however...
>> g=diff(M(:,2))./ diff(M(:,1))
g =
2.4912
92.4783
116.0238
109.4588
93.6588
93.6118
>> g-mean(g)
ans =
-82.1292
7.8578
31.4034
24.8384
9.0384
8.9913
>>
The max() of that difference is pretty clear, either in absolute terms or compared to the mean, and its location indicates the bum point is either the first or the second. That the second difference is reasonable indicates it's actually the first point that's the bad one, not the second.
Alternatively, of course, one can simply begin "one at a time" rejection if datasets are small.
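The gradient test above can be sketched in code (an illustration only; the tie-break for deciding which endpoint of the deviant segment is bad is an assumption that holds for a single isolated outlier):

```matlab
g = diff(M(:,2))./diff(M(:,1));   % local slopes between successive points
d = abs(g - mean(g));             % deviation of each slope from the mean
[~,k] = max(d);                   % most deviant segment, between rows k and k+1
% The bad point is one endpoint of segment k: if the previous segment is
% well-behaved, row k+1 is the culprit; otherwise it is row k.
if k == numel(d) || (k > 1 && d(k-1) < d(k+1))
    bad = k+1;
else
    bad = k;
end
M(bad,:) = [];                    % delete the offending row
```

For the data above, d is largest for the first segment and k == 1, so the else branch fires and the first row is removed, matching the reasoning in the comment.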
