Derive full formula from fitlm(X,y,"quadratic")

I have X, a 258x6 double, and y, a 258x1 double. I use them to fit a model with fitlm, and the quadratic specification yields the best results.
X = outTableNumeric{:, corr_ids}; % Using the most correlated features
y = outTableNumeric.target; % Target variable
% Fit a linear regression model
mdl = fitlm(X, y, "quadratic");
However, when I try to derive a formula that can be deployed outside MATLAB, I find it extremely obscure how to combine the quadratic terms with the coefficients.
I tried to build the formula programmatically, but the result does not match when I test it on a row of X.
coefficients = mdl.Coefficients.Estimate;
% Extract the simple notation of the formula
intercept = coefficients(1);
formula = sprintf('%.2f', intercept);
for i = 2:length(coefficients)
    if coefficients(i) ~= 0
        feature_name = mdl.CoefficientNames{i}; % name already matches coefficient i
        formula = sprintf('%s + %.2f * %s', formula, coefficients(i), feature_name);
    end
end
formula = strrep(formula, ':', '*'); % assign the result; fitlm writes interactions as 'x1:x2'
disp('Simple notation of the formula:');
disp(formula);
Can you please help me extract the full equation?

 Accepted Answer

X = rand(258,6);
y = rand(258,1);
mdl = fitlm(X, y, "quadratic")
mdl =
Linear regression model:
    y ~ 1 + x1*x2 + x1*x3 + x1*x4 + x1*x5 + x1*x6 + x2*x3 + x2*x4 + x2*x5 + x2*x6 + x3*x4 + x3*x5 + x3*x6 + x4*x5 + x4*x6 + x5*x6 + x1^2 + x2^2 + x3^2 + x4^2 + x5^2 + x6^2

Estimated Coefficients:
                    Estimate       SE        tStat       pValue
                   __________    _______    ________    ________
    (Intercept)       0.57576    0.29322      1.9636    0.050783
    x1               -0.37307    0.39362     -0.9478     0.34423
    x2                 0.1415    0.39079     0.36209     0.71762
    x3                 0.2982    0.39912     0.74715     0.45573
    x4               -0.15672    0.41253     -0.3799     0.70437
    x5                -0.2706    0.40666    -0.66542     0.50645
    x6               -0.18681    0.39256    -0.47587     0.63462
    x1:x2             0.11095    0.22534     0.49235     0.62294
    x1:x3             0.11595    0.22555     0.51408     0.60769
    x1:x4            -0.21297    0.23823    -0.89399     0.37226
    x1:x5            0.018241    0.22852    0.079822     0.93645
    x1:x6             0.14906      0.224     0.66547     0.50642
    x2:x3           -0.078433    0.22451    -0.34935     0.72714
    x2:x4            -0.10602    0.24031    -0.44116     0.65951
    x2:x5           -0.054747    0.23808    -0.22996     0.81833
    x2:x6             0.20974    0.24169      0.8678     0.38641
    x3:x4            -0.29942    0.22792     -1.3137     0.19026
    x3:x5            -0.03888    0.21461    -0.18116      0.8564
    x3:x6           -0.080697    0.25933    -0.31117     0.75595
    x4:x5             0.33241    0.24641       1.349     0.17866
    x4:x6            0.030806    0.24789     0.12427     0.90121
    x5:x6            -0.10127    0.24546    -0.41257     0.68031
    x1^2              0.23249    0.26395     0.88083     0.37933
    x2^2             -0.17792    0.25219    -0.70551     0.48121
    x3^2             0.013372    0.25903    0.051623     0.95887
    x4^2              0.22872     0.2511     0.91086     0.36332
    x5^2             0.098667    0.25524     0.38657     0.69943
    x6^2              0.18421    0.25429     0.72438     0.46957

Number of observations: 258, Error degrees of freedom: 230
Root Mean Squared Error: 0.299
R-squared: 0.078,  Adjusted R-Squared: -0.0303
F-statistic vs. constant model: 0.72, p-value = 0.844
There are 6 variables, here called x1...x6, corresponding to the 6 columns of your array, and 28 terms in the model. You can see them listed above, as represented by the linear regression model.
If you are not sure how to use the result of fitlm, here are the tools that can be applied:
methods(mdl)
Methods for class LinearModel:

    addTerms    compact    gather               plotAdjustedResponse    plotPartialDependence    random
    anova       disp       partialDependence    plotDiagnostics         plotResiduals            removeTerms
    coefCI      dwtest     plot                 plotEffects             plotSlice                step
    coefTest    feval      plotAdded            plotInteraction         predict
And of course, you can perform a prediction at any point as:
predict(mdl,X(1,:))
ans = 0.4734
If you do extract the coefficients themselves from the model, do NOT extract them only as 5-digit approximations. Do NOT just retype the numbers you see in the Estimate column. (That is perhaps the most common mistake we see made in MATLAB: thinking that because we see a number written as 0.12097, that is the actual number. WRONG. The actual number is a double precision value. Use it properly.) So we can extract the coefficients as the vector:
coeffs = mdl.Coefficients.Estimate;
format long g
coeffs(1) % the constant term...
ans =
0.57575799544073
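If the goal is to move the model outside MATLAB, one way (a sketch, assuming mdl is the fitted model from above) is to print every coefficient at full double precision next to its term name, so nothing is lost to display rounding:

```matlab
% Print each coefficient at full double precision ('%.17g' round-trips
% a double exactly), paired with its term name from the model.
names  = mdl.CoefficientNames;
coeffs = mdl.Coefficients.Estimate;
for k = 1:numel(coeffs)
    fprintf('%-12s  % .17g\n', names{k}, coeffs(k));
end
```

The printed names are in the same order as the entries of coeffs, which is what you need to rebuild the equation elsewhere.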

13 Comments

Do you know why the linear terms appear among the coefficients, but not in the displayed regression model?
Interesting. I did not notice that. I just assumed the linear terms were listed somewhere to the right in that line, but they are not. Looks like a bug in the display utility for a fitlm object. For example, if I do this:
X = rand(258,6);
y = rand(258,1);
mdl = fitlm(X,y)
mdl =
Linear regression model:
    y ~ 1 + x1 + x2 + x3 + x4 + x5 + x6

Estimated Coefficients:
                     Estimate        SE         tStat       pValue
                    __________    ________    _________    __________
    (Intercept)        0.48658    0.082306       5.9119    1.0993e-08
    x1              -0.0069772    0.066331     -0.10519       0.91631
    x2              -0.0050201    0.066071    -0.075979        0.9395
    x3               0.0074364    0.065044      0.11433       0.90907
    x4                0.089513    0.067478       1.3265       0.18587
    x5                0.012954    0.067464      0.19201       0.84789
    x6               -0.091138    0.065475       -1.392       0.16517

Number of observations: 258, Error degrees of freedom: 251
Root Mean Squared Error: 0.305
R-squared: 0.016,  Adjusted R-Squared: -0.00752
F-statistic vs. constant model: 0.68, p-value = 0.666
We see the constant term listed, plus all the linear terms, so 7 terms in total.
mdl2= fitlm(X,y,'quadratic')
mdl2 =
Linear regression model:
    y ~ 1 + x1*x2 + x1*x3 + x1*x4 + x1*x5 + x1*x6 + x2*x3 + x2*x4 + x2*x5 + x2*x6 + x3*x4 + x3*x5 + x3*x6 + x4*x5 + x4*x6 + x5*x6 + x1^2 + x2^2 + x3^2 + x4^2 + x5^2 + x6^2

Estimated Coefficients:
                    Estimate        SE        tStat       pValue
                   __________    _______    _________    ________
    (Intercept)       0.61635    0.28777       2.1418    0.033257
    x1                0.23025    0.37541      0.61333     0.54026
    x2               -0.16666    0.40479     -0.41173     0.68092
    x3              -0.065841    0.39528     -0.16657     0.86786
    x4                0.26667    0.40126      0.66458     0.50699
    x5                0.13509    0.38994      0.34645     0.72932
    x6               -0.72313    0.39228      -1.8434    0.066555
    x1:x2             0.20728    0.25501      0.81284     0.41715
    x1:x3           -0.056925    0.22702     -0.25075     0.80223
    x1:x4            -0.30899    0.25506      -1.2115     0.22696
    x1:x5            -0.20711    0.24053     -0.86103     0.39012
    x1:x6             0.11804    0.23793      0.49612     0.62029
    x2:x3           -0.095032    0.24044     -0.39524     0.69303
    x2:x4             0.41023    0.24129       1.7001    0.090456
    x2:x5            0.030326    0.23436       0.1294     0.89716
    x2:x6             0.16344    0.24957       0.6549     0.51319
    x3:x4            -0.59079    0.25365      -2.3291    0.020721
    x3:x5             0.26037    0.25723       1.0122     0.31251
    x3:x6             0.44535     0.2344       1.8999    0.058695
    x4:x5           -0.056338    0.25525     -0.22072     0.82551
    x4:x6           0.0091123    0.24503     0.037188     0.97037
    x5:x6             0.51868    0.24629        2.106    0.036288
    x1^2             -0.16212    0.26105     -0.62103      0.5352
    x2^2             -0.14084     0.2834     -0.49696     0.61969
    x3^2              0.10528    0.26642      0.39515      0.6931
    x4^2              0.07584    0.28514      0.26597      0.7905
    x5^2             -0.42532    0.26958      -1.5777       0.116
    x6^2           -0.0040015    0.26542    -0.015076     0.98798

Number of observations: 258, Error degrees of freedom: 230
Root Mean Squared Error: 0.305
R-squared: 0.101,  Adjusted R-Squared: -0.00401
F-statistic vs. constant model: 0.962, p-value = 0.523
If I tell it to use a quadratic model, then it knows there are 28 terms. That corresponds to a constant term, 6 linear terms, 6 squared terms, and 15 cross-product terms. But the line that says "Linear regression model" seems to have dropped the linear terms from the display.
If I call it with only one independent variable, it shows a correct model:
fitlm(rand(10,1),rand(10,1),'quadratic')
ans =
Linear regression model:
    y ~ 1 + x1 + x1^2

Estimated Coefficients:
                   Estimate      SE        tStat      pValue
                   ________    _______    ________    _______
    (Intercept)     0.25159    0.13968      1.8012    0.11468
    x1              0.68865     1.2586     0.54715    0.60128
    x1^2           -0.66133     1.5905    -0.41579    0.69001

Number of observations: 10, Error degrees of freedom: 7
Root Mean Squared Error: 0.204
R-squared: 0.101,  Adjusted R-Squared: -0.156
F-statistic vs. constant model: 0.394, p-value = 0.688
However, when I do the same with two or more independent variables, it seems to be dropping the linear terms from that display.
fitlm(rand(10,2),rand(10,1),'quadratic')
ans =
Linear regression model:
    y ~ 1 + x1*x2 + x1^2 + x2^2

Estimated Coefficients:
                   Estimate      SE        tStat       pValue
                   ________    _______    ________    ________
    (Intercept)     -1.0662    0.62065     -1.7179     0.16094
    x1               4.3206     1.6753       2.579    0.061389
    x2               3.7438     2.0977      1.7847     0.14886
    x1:x2           -1.0273     1.4708    -0.69852     0.52333
    x1^2            -3.0734     1.5127     -2.0317     0.11199
    x2^2            -5.0901     2.7917     -1.8233     0.14233

Number of observations: 10, Error degrees of freedom: 4
Root Mean Squared Error: 0.142
R-squared: 0.862,  Adjusted R-Squared: 0.689
F-statistic vs. constant model: 4.99, p-value = 0.0722
Again, it does know the terms in the model. They can be seen as:
S = struct(mdl2);
Warning: Calling STRUCT on an object prevents the object from hiding its implementation details and should thus be avoided. Use DISP or DISPLAY to see the visible public details of an object. See 'help struct' for more information.
S.CoefficientNames
ans = 1x28 cell array
  Columns 1 through 14

    {'(Intercept)'}    {'x1'}    {'x2'}    {'x3'}    {'x4'}    {'x5'}    {'x6'}    {'x1:x2'}    {'x1:x3'}    {'x1:x4'}    {'x1:x5'}    {'x1:x6'}    {'x2:x3'}    {'x2:x4'}

  Columns 15 through 28

    {'x2:x5'}    {'x2:x6'}    {'x3:x4'}    {'x3:x5'}    {'x3:x6'}    {'x4:x5'}    {'x4:x6'}    {'x5:x6'}    {'x1^2'}    {'x2^2'}    {'x3^2'}    {'x4^2'}    {'x5^2'}    {'x6^2'}
Just that the line in the display seems to be incorrect for problems with 2 or more independent variables. So a bug, but nothing that ever triggered a bug report, because nothing ever actually uses that line from the display. I can post it as a bug to tech support. I would guess the problem lies somewhere in the sub-functions of:
.../MATLAB_R2023a.app/toolbox/stats/classreg/@LinearModel/LinearModel.m
Edit: I've reported this as a bug. However, it is PURELY a bug in the display method for a fitlm object. Since there appear to be no issues in performance due to that bug, that makes it likely they won't put a high priority on fixing it. Of course, since it is surely something simple to fix, I'd not be surprised to see it fixed soon in some new release.
That is a good question from @Torsten, about why the linear terms do not appear in the regression model. It looks to me like a mistake in the console display of the model. The linear terms are being used in the model predictions (see below), even though the console display of the model does not include the linear terms. Simple demonstration of this is below.
y=rand(50,1); X=rand(50,2);
m=fitlm(X,y,'quadratic')
m =
Linear regression model:
    y ~ 1 + x1*x2 + x1^2 + x2^2

Estimated Coefficients:
                    Estimate        SE        tStat         pValue
                   __________    _______    __________    ________
    (Intercept)       0.23545    0.23148        1.0172     0.31463
    x1                 1.0101    0.66222        1.5253     0.13435
    x2                0.34731    0.61128       0.56817      0.5728
    x1:x2          -0.0015628    0.50775    -0.0030779     0.99756
    x1^2              -1.0717    0.58869       -1.8205    0.075488
    x2^2            -0.069418    0.55139       -0.1259     0.90039

Number of observations: 50, Error degrees of freedom: 44
Root Mean Squared Error: 0.283
R-squared: 0.117,  Adjusted R-Squared: 0.017
F-statistic vs. constant model: 1.17, p-value = 0.339
Xa=[ones(50,1),X,X(:,1).*X(:,2),X.^2];
b=m.Coefficients.Estimate;
ypred=Xa*b;
disp([y(1:3),m.Fitted(1:3),ypred(1:3)])
    0.3825    0.5955    0.5955
    0.9898    0.6727    0.6727
    0.1772    0.4683    0.4683
The values m.Fitted and ypred (columns 2 and 3 above) are the same, which shows that the regression model is in fact using the linear terms to compute the Fitted values. Even though the linear regression model displayed on the console, by fitlm(), does not include the linear terms.
Yes. I have already reported it as a bug to support. However, it is a bug of little real importance, just in that one line of the display.
It seems that behavior is part of Wilkinson notation.
Though I did not see the point clearly made in doc fitlm.
Thanks a lot!
But I still don't understand how to end up with standard notation. I would like an explicit equation, like y = 2 + 1.5x, that I can use outside MATLAB.
You can reconstruct the equation as
fun = @(x) mdl.Coefficients.Estimate(1) + mdl.Coefficients.Estimate(2)*x(1)+mdl.Coefficients.Estimate(3)*x(2) ...
and so on and call the function with a 6x1 input argument.
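Writing out all 28 terms by hand is error-prone. One alternative (a sketch, assuming mdl is the fitted quadratic model, and using a hypothetical helper name evalQuadModel) is to decode each entry of mdl.CoefficientNames ('x4', 'x1:x2', 'x3^2', ...) and accumulate the terms in a loop. Put the helper in its own file, or at the end of a script:

```matlab
% Evaluate the fitted quadratic model at one point x (a row vector) by
% parsing the coefficient names instead of hard-coding all 28 terms.
function y = evalQuadModel(mdl, x)
    b     = mdl.Coefficients.Estimate;
    names = mdl.CoefficientNames;
    y = b(1);                             % '(Intercept)'
    for k = 2:numel(b)
        nm = names{k};
        if contains(nm, ':')              % interaction term, e.g. 'x1:x2'
            ij = sscanf(nm, 'x%d:x%d');
            t  = x(ij(1)) * x(ij(2));
        elseif contains(nm, '^')          % power term, e.g. 'x3^2'
            ie = sscanf(nm, 'x%d^%d');
            t  = x(ie(1))^ie(2);
        else                              % linear term, e.g. 'x4'
            t  = x(sscanf(nm, 'x%d'));
        end
        y = y + b(k) * t;
    end
end
```

The result should agree with predict(mdl, x) up to floating-point order of operations.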
But does that work when my coefficients and formula look like this? I can't get that approach to output the same numbers as predict does. I think I might have some fundamental misunderstanding here.
y ~ 1 + x1*x3 + x1*x4 + x1*x5 + x1*x6 + x2*x3 + x2*x4 + x2*x6 + x3*x4 + x3*x5 + x3*x6 + x4*x5 + x4*x6 + x5*x6 + x1^2 + x2^2 + x3^2 + x4^2 + x5^2 + x6^2
Estimate SE tStat pValue
___________ __________ ________ __________
(Intercept) -9971.9 3128.8 -3.1872 0.0016362
x1 1.1872 2.0597 0.57641 0.5649
x2 -0.27842 0.17296 -1.6097 0.10883
x3 125.52 47.162 2.6614 0.0083297
x4 -0.69604 0.22471 -3.0974 0.0021949
x5 12651 4758.7 2.6585 0.0084008
x6 296.1 68.311 4.3346 2.1842e-05
x1:x2 -6.3289e-05 7.6113e-05 -0.83151 0.40655
x1:x3 -0.052199 0.012239 -4.2651 2.9197e-05
x1:x4 -0.00091916 0.00016133 -5.6973 3.704e-08
x1:x5 42.644 4.6921 9.0886 4.7754e-17
x1:x6 -0.269 0.048581 -5.5372 8.3613e-08
x2:x3 0.0023564 0.0013033 1.8081 0.071903
x2:x4 3.6937e-05 4.2682e-06 8.6541 8.7413e-16
x2:x5 -0.13992 0.1395 -1.003 0.31689
x2:x6 0.0041256 0.0014358 2.8734 0.0044408
x3:x4 0.0042098 0.0016095 2.6157 0.009495
x3:x5 -213.37 48.137 -4.4324 1.4428e-05
x3:x6 -1.2349 0.46077 -2.68 0.0078945
x4:x5 -1.3272 0.31259 -4.2458 3.1625e-05
x4:x6 0.029593 0.0038887 7.6099 7.0092e-13
x5:x6 -460.26 117.08 -3.931 0.00011194
x1^2 0.010879 0.0017756 6.1272 3.8429e-09
x2^2 -7.9515e-06 2.1123e-06 -3.7643 0.00021208
x3^2 -0.32064 0.17503 -1.8319 0.068256
x4^2 -6.1587e-05 7.9081e-06 -7.7879 2.3169e-13
x5^2 54855 10628 5.1612 5.3029e-07
x6^2 -2.4288 0.49161 -4.9405 1.4997e-06
X = rand(258,2);
y = rand(258,1);
mdl = fitlm(X, y, "quadratic")
mdl =
Linear regression model:
    y ~ 1 + x1*x2 + x1^2 + x2^2

Estimated Coefficients:
                    Estimate       SE       tStat       pValue
                   _________    _______    ________    __________
    (Intercept)      0.55183    0.10718      5.1488    5.2846e-07
    x1             -0.060941    0.28228    -0.21589       0.82925
    x2              -0.23265    0.31614    -0.73591       0.46247
    x1:x2           -0.19381    0.23901    -0.81089       0.41819
    x1^2             0.15189    0.24491      0.6202       0.53569
    x2^2             0.27773     0.2677      1.0375       0.30051

Number of observations: 258, Error degrees of freedom: 252
Root Mean Squared Error: 0.294
R-squared: 0.0107,  Adjusted R-Squared: -0.00897
F-statistic vs. constant model: 0.543, p-value = 0.744
p = mdl.Coefficients.Estimate
p = 6x1
    0.5518
   -0.0609
   -0.2327
   -0.1938
    0.1519
    0.2777
fun = @(x)p(1)+p(2)*x(1)+p(3)*x(2)+p(4)*x(1)*x(2)+p(5)*x(1)^2+p(6)*x(2)^2;
format long
ypred = predict(mdl,[1 6])
ypred =
8.082192401474730
fun([1 6])
ans =
8.082192401474730
Unfortunately, they make it more difficult than they should, at least in my opinion, because this was not well documented in the help, AND because they don't have a simple way to extract the explicit order of the terms.
You need to read about Wilkinson notation.
So, when you see this:
Linear regression model:
y ~ 1 + x1*x2 + x1*x3 + x1*x4 + x1*x5 + x1*x6 + x2*x3 + x2*x4 + x2*x5 + x2*x6 + x3*x4 + x3*x5 + x3*x6 + x4*x5 + x4*x6 + x5*x6 + x1^2 + x2^2 + x3^2 + x4^2 + x5^2 + x6^2
the linear terms are all implicitly in the model too. Unfortunately, that does not help you here a lot. Instead, you can find the sequence of terms like this:
X = randn(100,3);y = rand(100,1);
mdl = fitlm(X,y,'quadratic')
mdl =
Linear regression model:
    y ~ 1 + x1*x2 + x1*x3 + x2*x3 + x1^2 + x2^2 + x3^2

Estimated Coefficients:
                    Estimate        SE        tStat       pValue
                   __________    ________    ________    __________
    (Intercept)       0.47887    0.044888      10.668    1.1808e-17
    x1             0.00089959    0.024696    0.036426       0.97102
    x2               0.026874     0.02949      0.9113       0.36457
    x3             -0.0070283    0.031603    -0.22239       0.82451
    x1:x2          -0.0075917    0.030142    -0.25186       0.80172
    x1:x3           0.0040778    0.033902     0.12028       0.90453
    x2:x3           0.0034223    0.033627     0.10177       0.91916
    x1^2            0.0013212    0.016642    0.079389        0.9369
    x2^2            0.0011447    0.025451    0.044977       0.96422
    x3^2             0.045289    0.031263      1.4487        0.1509

Number of observations: 100, Error degrees of freedom: 90
Root Mean Squared Error: 0.26
R-squared: 0.0362,  Adjusted R-Squared: -0.0602
F-statistic vs. constant model: 0.375, p-value = 0.944
mdl.CoefficientNames
ans = 1x10 cell array
{'(Intercept)'} {'x1'} {'x2'} {'x3'} {'x1:x2'} {'x1:x3'} {'x2:x3'} {'x1^2'} {'x2^2'} {'x3^2'}
As you can see, what they call the linear regression model is valid Wilkinson notation, but it does not lead easily to reconstructing the model with the coefficients in the proper order. However, mdl.CoefficientNames does, at least more directly.
@John D'Errico, Thank you for explaining that fitlm() uses Wilkinson notation to display the formula.
If using the quadratic option, then it is not hard to know the right order for the coefficients. As you point out, they are in mdl.CoefficientNames. Or you can simply add the linear terms, in order, after the initial "1". Example.
X=rand(50,3); y=rand(50,1);
mdl=fitlm(X,y,'quadratic');
Display the formula (Wilkinson notation)
mdl.Formula.LinearPredictor
ans = '1 + x1*x2 + x1*x3 + x2*x3 + x1^2 + x2^2 + x3^2'
The full formula, with the terms in the right order, is
fullFormula=['1 + x1 + x2 + x3 + ',mdl.Formula.LinearPredictor(5:end)]
fullFormula = '1 + x1 + x2 + x3 + x1*x2 + x1*x3 + x2*x3 + x1^2 + x2^2 + x3^2'
If the model order is higher than 2 in some or all predictors, then the simplest approach for the full formula is to actually use mdl.CoefficientNames.
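Either way, once the ordered term names are in hand, the full-precision coefficients can be spliced in to produce an explicit equation. A sketch (assuming mdl is a fitted model as above); note that fitlm writes interactions as 'x1:x2', so ':' is swapped for '*', and '%.17g' preserves full double precision:

```matlab
% Build an explicit equation string, term by term, in the order given
% by mdl.CoefficientNames (which matches mdl.Coefficients.Estimate).
terms = strrep(mdl.CoefficientNames, ':', '*');
b     = mdl.Coefficients.Estimate;
eqn   = sprintf('y = %.17g', b(1));
for k = 2:numel(b)
    eqn = [eqn, sprintf(' %+.17g*%s', b(k), terms{k})]; %#ok<AGROW>
end
disp(eqn)
```

The '%+.17g' format prints each coefficient with its sign, so the string reads as a complete equation that can be pasted into another environment.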
Thanks everyone - I got what I needed.
After some discussion with tech support, I learned a resolution, though I did not find it myself. The documentation for LinearModel needs to be enhanced IMHO.
X=rand(50,3); y=rand(50,1);
mdl=fitlm(X,y,'quadratic')
mdl =
Linear regression model:
    y ~ 1 + x1*x2 + x1*x3 + x2*x3 + x1^2 + x2^2 + x3^2

Estimated Coefficients:
                     Estimate       SE        tStat         pValue
                    __________    _______    __________    _________
    (Intercept)        0.92378    0.27333        3.3798    0.0016297
    x1                -0.68435    0.69208      -0.98883      0.32869
    x2                -0.69012    0.78414      -0.88011      0.38406
    x3                -0.39798    0.74644      -0.53317      0.59687
    x1:x2              0.62579    0.55378          1.13       0.2652
    x1:x3              0.40651    0.58199       0.69848      0.48892
    x2:x3             -0.35752    0.68687       -0.5205      0.60559
    x1^2            -0.0047517    0.57365    -0.0082833      0.99343
    x2^2               0.62704    0.73182       0.85682      0.39665
    x3^2               0.44112    0.68853       0.64067      0.52539

Number of observations: 50, Error degrees of freedom: 40
Root Mean Squared Error: 0.303
R-squared: 0.126,  Adjusted R-Squared: -0.0707
F-statistic vs. constant model: 0.64, p-value = 0.756
exponents = mdl.Formula.Terms(:,1:end-1)
exponents = 10x3
     0     0     0
     1     0     0
     0     1     0
     0     0     1
     1     1     0
     1     0     1
     0     1     1
     2     0     0
     0     2     0
     0     0     2
Note that I needed to drop the last column. But this is essentially a full description of each term in the model. It identifies the exponents of each variable that would be necessary to reconstruct the model. How can we use this? First of all, we should see it does work properly.
syms x1 x2 x3
vpa(mdl.Coefficients.Estimate.'*prod([x1,x2,x3].^exponents,2),4)
ans = (symbolic display of the full quadratic polynomial in x1, x2, x3, with coefficients shown to 4 digits)
For a single point newX, as a row vector, we can just do this:
mdlfun = @(newX) mdl.Coefficients.Estimate.'*prod(newX.^exponents,2);
mdlfun([1 2 3])
ans = 4.4647
And for comparison, we see:
predict(mdl,[1 2 3])
ans = 4.4647
With just a little more effort, we could even vectorize the function.
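For example, a vectorized version might look like this (a sketch, assuming mdl and exponents from above). It evaluates the model for every row of an n-by-3 input at once via implicit expansion:

```matlab
% Vectorized model evaluation: each row of Xn (n-by-3) is one point.
% reshape(Xn,[],1,3) is n-by-1-by-3; reshape(exponents,1,[],3) is
% 1-by-10-by-3. Implicit expansion gives n-by-10-by-3, and the product
% over dim 3 yields the n-by-10 design matrix for all terms at once.
mdlfunV = @(Xn) prod( reshape(Xn, [], 1, size(Xn,2)) ...
                   .^ reshape(exponents, 1, [], size(Xn,2)), 3 ) ...
                * mdl.Coefficients.Estimate;
mdlfunV([1 2 3; 4 5 6])   % two points in one call
```

Each row of the result should match predict(mdl, ...) on the corresponding input row.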


