Derive full formula from fitlm(X,y,"quadratic")

I have X, a 258x6 double, and y, a 258x1 double. I use them to fit a model with fitlm, and the quadratic specification yields the best results.
X = outTableNumeric{:, corr_ids}; % Using the most correlated features
y = outTableNumeric.target; % Target variable
% Fit a linear regression model
mdl = fitlm(X, y, "quadratic");
However, when I try to derive a formula that can be deployed outside MATLAB, I find it extremely obscure how to combine the quadratic terms with the coefficients.
I tried to build the formula programmatically, but the result does not match when I test it on a row of X.
coefficients = mdl.Coefficients.Estimate;
% Extract the simple notation of the formula
intercept = coefficients(1);
formula = sprintf('%.2f', intercept);
for i = 2:length(coefficients)
    if coefficients(i) ~= 0
        feature_name = mdl.CoefficientNames{i}; % name already matches coefficient i
        formula = sprintf('%s + %.2f * %s', formula, coefficients(i), feature_name);
    end
end
formula = strrep(formula, ':', '*'); % assign the result; fitlm writes interactions as 'x1:x2'
disp('Simple notation of the formula:');
disp(formula);
Can you please help me extract the full equation?

 Accepted Answer

X = rand(258,6);
y = rand(258,1);
mdl = fitlm(X, y, "quadratic")
mdl =
Linear regression model:
    y ~ 1 + x1*x2 + x1*x3 + x1*x4 + x1*x5 + x1*x6 + x2*x3 + x2*x4 + x2*x5 + x2*x6 + x3*x4 + x3*x5 + x3*x6 + x4*x5 + x4*x6 + x5*x6 + x1^2 + x2^2 + x3^2 + x4^2 + x5^2 + x6^2

Estimated Coefficients:
                    Estimate       SE        tStat       pValue
                   __________    _______    ________    ________
    (Intercept)       0.57576    0.29322      1.9636    0.050783
    x1               -0.37307    0.39362     -0.9478     0.34423
    x2                 0.1415    0.39079     0.36209     0.71762
    x3                 0.2982    0.39912     0.74715     0.45573
    x4               -0.15672    0.41253     -0.3799     0.70437
    x5                -0.2706    0.40666    -0.66542     0.50645
    x6               -0.18681    0.39256    -0.47587     0.63462
    x1:x2             0.11095    0.22534     0.49235     0.62294
    x1:x3             0.11595    0.22555     0.51408     0.60769
    x1:x4            -0.21297    0.23823    -0.89399     0.37226
    x1:x5            0.018241    0.22852    0.079822     0.93645
    x1:x6             0.14906      0.224     0.66547     0.50642
    x2:x3           -0.078433    0.22451    -0.34935     0.72714
    x2:x4            -0.10602    0.24031    -0.44116     0.65951
    x2:x5           -0.054747    0.23808    -0.22996     0.81833
    x2:x6             0.20974    0.24169      0.8678     0.38641
    x3:x4            -0.29942    0.22792     -1.3137     0.19026
    x3:x5            -0.03888    0.21461    -0.18116      0.8564
    x3:x6           -0.080697    0.25933    -0.31117     0.75595
    x4:x5             0.33241    0.24641       1.349     0.17866
    x4:x6            0.030806    0.24789     0.12427     0.90121
    x5:x6            -0.10127    0.24546    -0.41257     0.68031
    x1^2              0.23249    0.26395     0.88083     0.37933
    x2^2             -0.17792    0.25219    -0.70551     0.48121
    x3^2             0.013372    0.25903    0.051623     0.95887
    x4^2              0.22872     0.2511     0.91086     0.36332
    x5^2             0.098667    0.25524     0.38657     0.69943
    x6^2              0.18421    0.25429     0.72438     0.46957

Number of observations: 258, Error degrees of freedom: 230
Root Mean Squared Error: 0.299
R-squared: 0.078,  Adjusted R-Squared: -0.0303
F-statistic vs. constant model: 0.72, p-value = 0.844
There are 6 variables, here called x1...x6, corresponding to the 6 columns of your array, and 28 terms in the model. You can see them listed above, as represented by the linear regression model.
If you are not sure how to use the result of fitlm, here are the tools that can be applied:
methods(mdl)
Methods for class LinearModel:

    addTerms    compact    gather               plotAdjustedResponse    plotPartialDependence    random
    anova       disp       partialDependence    plotDiagnostics         plotResiduals            removeTerms
    coefCI      dwtest     plot                 plotEffects             plotSlice                step
    coefTest    feval      plotAdded            plotInteraction         predict
And of course, you can perform a prediction at any point as:
predict(mdl,X(1,:))
ans = 0.4734
If you do extract the coefficients themselves from the model, do NOT extract them only as 5-digit approximations. Do NOT just retype the numbers you see in the Estimate column. (That is perhaps the most common mistake we see made in MATLAB: thinking that because we see a number written as 0.12097, that is the actual number. WRONG. The actual number is a double precision value. Use it properly.) So we can extract the coefficients as the vector:
coeffs = mdl.Coefficients.Estimate;
format long g
coeffs(1) % the constant term...
ans =
0.57575799544073
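If the goal is to move the model outside MATLAB, one way (a sketch, assuming mdl is the fitted model from above) is to print every coefficient at full double precision next to its term name, so nothing is lost to display rounding:

```matlab
% Print each coefficient at full double precision ('%.17g' round-trips
% a double exactly), paired with its term name from the model.
names  = mdl.CoefficientNames;
coeffs = mdl.Coefficients.Estimate;
for k = 1:numel(coeffs)
    fprintf('%-12s  % .17g\n', names{k}, coeffs(k));
end
```

The printed names are in the same order as the entries of coeffs, which is what you need to rebuild the equation elsewhere.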

13 Comments

Do you know why the linear terms appear among the coefficients, but not in the displayed regression model?
Interesting. I did not notice that. I just assumed the linear terms were listed somewhere to the right in that line, but they are not. Looks like a bug in the display utility for a fitlm object. For example, if I do this:
X = rand(258,6);
y = rand(258,1);
mdl = fitlm(X,y)
mdl =
Linear regression model:
    y ~ 1 + x1 + x2 + x3 + x4 + x5 + x6

Estimated Coefficients:
                     Estimate        SE         tStat       pValue
                    __________    ________    _________    __________
    (Intercept)        0.48658    0.082306       5.9119    1.0993e-08
    x1              -0.0069772    0.066331     -0.10519       0.91631
    x2              -0.0050201    0.066071    -0.075979        0.9395
    x3               0.0074364    0.065044      0.11433       0.90907
    x4                0.089513    0.067478       1.3265       0.18587
    x5                0.012954    0.067464      0.19201       0.84789
    x6               -0.091138    0.065475       -1.392       0.16517

Number of observations: 258, Error degrees of freedom: 251
Root Mean Squared Error: 0.305
R-squared: 0.016,  Adjusted R-Squared: -0.00752
F-statistic vs. constant model: 0.68, p-value = 0.666
We see the constant term listed, plus all the linear terms, so 7 terms in total.
mdl2= fitlm(X,y,'quadratic')
mdl2 =
Linear regression model:
    y ~ 1 + x1*x2 + x1*x3 + x1*x4 + x1*x5 + x1*x6 + x2*x3 + x2*x4 + x2*x5 + x2*x6 + x3*x4 + x3*x5 + x3*x6 + x4*x5 + x4*x6 + x5*x6 + x1^2 + x2^2 + x3^2 + x4^2 + x5^2 + x6^2

Estimated Coefficients:
                    Estimate        SE        tStat       pValue
                   __________    _______    _________    ________
    (Intercept)       0.61635    0.28777       2.1418    0.033257
    x1                0.23025    0.37541      0.61333     0.54026
    x2               -0.16666    0.40479     -0.41173     0.68092
    x3              -0.065841    0.39528     -0.16657     0.86786
    x4                0.26667    0.40126      0.66458     0.50699
    x5                0.13509    0.38994      0.34645     0.72932
    x6               -0.72313    0.39228      -1.8434    0.066555
    x1:x2             0.20728    0.25501      0.81284     0.41715
    x1:x3           -0.056925    0.22702     -0.25075     0.80223
    x1:x4            -0.30899    0.25506      -1.2115     0.22696
    x1:x5            -0.20711    0.24053     -0.86103     0.39012
    x1:x6             0.11804    0.23793      0.49612     0.62029
    x2:x3           -0.095032    0.24044     -0.39524     0.69303
    x2:x4             0.41023    0.24129       1.7001    0.090456
    x2:x5            0.030326    0.23436       0.1294     0.89716
    x2:x6             0.16344    0.24957       0.6549     0.51319
    x3:x4            -0.59079    0.25365      -2.3291    0.020721
    x3:x5             0.26037    0.25723       1.0122     0.31251
    x3:x6             0.44535     0.2344       1.8999    0.058695
    x4:x5           -0.056338    0.25525     -0.22072     0.82551
    x4:x6           0.0091123    0.24503     0.037188     0.97037
    x5:x6             0.51868    0.24629        2.106    0.036288
    x1^2             -0.16212    0.26105     -0.62103      0.5352
    x2^2             -0.14084     0.2834     -0.49696     0.61969
    x3^2              0.10528    0.26642      0.39515      0.6931
    x4^2              0.07584    0.28514      0.26597      0.7905
    x5^2             -0.42532    0.26958      -1.5777       0.116
    x6^2           -0.0040015    0.26542    -0.015076     0.98798

Number of observations: 258, Error degrees of freedom: 230
Root Mean Squared Error: 0.305
R-squared: 0.101,  Adjusted R-Squared: -0.00401
F-statistic vs. constant model: 0.962, p-value = 0.523
If I tell it to use a quadratic model, then it knows there are 28 terms. That corresponds to a constant term, 6 linear terms, 6 squared terms, and 15 cross-product terms. But the line that says "Linear regression model" seems to have dropped the linear terms from the display.
If I call it with only one independent variable, it shows a correct model:
fitlm(rand(10,1),rand(10,1),'quadratic')
ans =
Linear regression model:
    y ~ 1 + x1 + x1^2

Estimated Coefficients:
                   Estimate      SE        tStat      pValue
                   ________    _______    ________    _______
    (Intercept)     0.25159    0.13968      1.8012    0.11468
    x1              0.68865     1.2586     0.54715    0.60128
    x1^2           -0.66133     1.5905    -0.41579    0.69001

Number of observations: 10, Error degrees of freedom: 7
Root Mean Squared Error: 0.204
R-squared: 0.101,  Adjusted R-Squared: -0.156
F-statistic vs. constant model: 0.394, p-value = 0.688
However, when I do the same with two or more independent variables, it seems to be dropping the linear terms from that display.
fitlm(rand(10,2),rand(10,1),'quadratic')
ans =
Linear regression model:
    y ~ 1 + x1*x2 + x1^2 + x2^2

Estimated Coefficients:
                   Estimate      SE        tStat       pValue
                   ________    _______    ________    ________
    (Intercept)     -1.0662    0.62065     -1.7179     0.16094
    x1               4.3206     1.6753       2.579    0.061389
    x2               3.7438     2.0977      1.7847     0.14886
    x1:x2           -1.0273     1.4708    -0.69852     0.52333
    x1^2            -3.0734     1.5127     -2.0317     0.11199
    x2^2            -5.0901     2.7917     -1.8233     0.14233

Number of observations: 10, Error degrees of freedom: 4
Root Mean Squared Error: 0.142
R-squared: 0.862,  Adjusted R-Squared: 0.689
F-statistic vs. constant model: 4.99, p-value = 0.0722
Again, it does know the terms in the model. They can be seen as:
S = struct(mdl2);
Warning: Calling STRUCT on an object prevents the object from hiding its implementation details and should thus be avoided. Use DISP or DISPLAY to see the visible public details of an object. See 'help struct' for more information.
S.CoefficientNames
ans = 1x28 cell array
  Columns 1 through 14

    {'(Intercept)'}    {'x1'}    {'x2'}    {'x3'}    {'x4'}    {'x5'}    {'x6'}    {'x1:x2'}    {'x1:x3'}    {'x1:x4'}    {'x1:x5'}    {'x1:x6'}    {'x2:x3'}    {'x2:x4'}

  Columns 15 through 28

    {'x2:x5'}    {'x2:x6'}    {'x3:x4'}    {'x3:x5'}    {'x3:x6'}    {'x4:x5'}    {'x4:x6'}    {'x5:x6'}    {'x1^2'}    {'x2^2'}    {'x3^2'}    {'x4^2'}    {'x5^2'}    {'x6^2'}
Just that the line in the display seems to be incorrect for problems with 2 or more independent variables. So a bug, but nothing that ever triggered a bug report, because nothing ever actually uses that line from the display. I can post it as a bug to tech support. I would guess the problem lies somewhere in the sub-functions of:
.../MATLAB_R2023a.app/toolbox/stats/classreg/@LinearModel/LinearModel.m
Edit: I've reported this as a bug. However, it is PURELY a bug in the display method for a fitlm object. Since there appear to be no issues in performance due to that bug, that makes it likely they won't put a high priority on fixing it. Of course, since it is surely something simple to fix, I'd not be surprised to see it fixed soon in some new release.
That is a good question from @Torsten, about why the linear terms do not appear in the regression model. It looks to me like a mistake in the console display of the model. The linear terms are being used in the model predictions (see below), even though the console display of the model does not include the linear terms. Simple demonstration of this is below.
y=rand(50,1); X=rand(50,2);
m=fitlm(X,y,'quadratic')
m =
Linear regression model:
    y ~ 1 + x1*x2 + x1^2 + x2^2

Estimated Coefficients:
                    Estimate        SE        tStat         pValue
                   __________    _______    __________    ________
    (Intercept)       0.23545    0.23148        1.0172     0.31463
    x1                 1.0101    0.66222        1.5253     0.13435
    x2                0.34731    0.61128       0.56817      0.5728
    x1:x2          -0.0015628    0.50775    -0.0030779     0.99756
    x1^2              -1.0717    0.58869       -1.8205    0.075488
    x2^2            -0.069418    0.55139       -0.1259     0.90039

Number of observations: 50, Error degrees of freedom: 44
Root Mean Squared Error: 0.283
R-squared: 0.117,  Adjusted R-Squared: 0.017
F-statistic vs. constant model: 1.17, p-value = 0.339
Xa=[ones(50,1),X,X(:,1).*X(:,2),X.^2];
b=m.Coefficients.Estimate;
ypred=Xa*b;
disp([y(1:3),m.Fitted(1:3),ypred(1:3)])
    0.3825    0.5955    0.5955
    0.9898    0.6727    0.6727
    0.1772    0.4683    0.4683
The values m.Fitted and ypred (columns 2 and 3 above) are the same, which shows that the regression model is in fact using the linear terms to compute the Fitted values. Even though the linear regression model displayed on the console, by fitlm(), does not include the linear terms.
Yes. I have already reported it as a bug to support. However, it is a bug of little real importance, just in that one line of the display.
It seems that behavior is part of Wilkinson notation.
Though I did not see the point clearly made in doc fitlm.
Thanks a lot!
But I still don't understand how to end up with standard notation. I would like an explicit equation, like y = 2 + 1.5x, that I can use outside MATLAB.
You can reconstruct the equation as
fun = @(x) mdl.Coefficients.Estimate(1) + mdl.Coefficients.Estimate(2)*x(1)+mdl.Coefficients.Estimate(3)*x(2) ...
and so on and call the function with a 6x1 input argument.
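Writing out all 28 terms by hand is error-prone. One alternative (a sketch, assuming mdl is the fitted quadratic model, and using a hypothetical helper name evalQuadModel) is to decode each entry of mdl.CoefficientNames ('x4', 'x1:x2', 'x3^2', ...) and accumulate the terms in a loop. Put the helper in its own file, or at the end of a script:

```matlab
% Evaluate the fitted quadratic model at one point x (a row vector) by
% parsing the coefficient names instead of hard-coding all 28 terms.
function y = evalQuadModel(mdl, x)
    b     = mdl.Coefficients.Estimate;
    names = mdl.CoefficientNames;
    y = b(1);                             % '(Intercept)'
    for k = 2:numel(b)
        nm = names{k};
        if contains(nm, ':')              % interaction term, e.g. 'x1:x2'
            ij = sscanf(nm, 'x%d:x%d');
            t  = x(ij(1)) * x(ij(2));
        elseif contains(nm, '^')          % power term, e.g. 'x3^2'
            ie = sscanf(nm, 'x%d^%d');
            t  = x(ie(1))^ie(2);
        else                              % linear term, e.g. 'x4'
            t  = x(sscanf(nm, 'x%d'));
        end
        y = y + b(k) * t;
    end
end
```

The result should agree with predict(mdl, x) up to floating-point order of operations.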
But does that work when my coefficients and formula look like this? I can't get that approach to output the same numbers as predict does. I think I might have some fundamental misunderstanding here.
y ~ 1 + x1*x3 + x1*x4 + x1*x5 + x1*x6 + x2*x3 + x2*x4 + x2*x6 + x3*x4 + x3*x5 + x3*x6 + x4*x5 + x4*x6 + x5*x6 + x1^2 + x2^2 + x3^2 + x4^2 + x5^2 + x6^2
Estimate SE tStat pValue
___________ __________ ________ __________
(Intercept) -9971.9 3128.8 -3.1872 0.0016362
x1 1.1872 2.0597 0.57641 0.5649
x2 -0.27842 0.17296 -1.6097 0.10883
x3 125.52 47.162 2.6614 0.0083297
x4 -0.69604 0.22471 -3.0974 0.0021949
x5 12651 4758.7 2.6585 0.0084008
x6 296.1 68.311 4.3346 2.1842e-05
x1:x2 -6.3289e-05 7.6113e-05 -0.83151 0.40655
x1:x3 -0.052199 0.012239 -4.2651 2.9197e-05
x1:x4 -0.00091916 0.00016133 -5.6973 3.704e-08
x1:x5 42.644 4.6921 9.0886 4.7754e-17
x1:x6 -0.269 0.048581 -5.5372 8.3613e-08
x2:x3 0.0023564 0.0013033 1.8081 0.071903
x2:x4 3.6937e-05 4.2682e-06 8.6541 8.7413e-16
x2:x5 -0.13992 0.1395 -1.003 0.31689
x2:x6 0.0041256 0.0014358 2.8734 0.0044408
x3:x4 0.0042098 0.0016095 2.6157 0.009495
x3:x5 -213.37 48.137 -4.4324 1.4428e-05
x3:x6 -1.2349 0.46077 -2.68 0.0078945
x4:x5 -1.3272 0.31259 -4.2458 3.1625e-05
x4:x6 0.029593 0.0038887 7.6099 7.0092e-13
x5:x6 -460.26 117.08 -3.931 0.00011194
x1^2 0.010879 0.0017756 6.1272 3.8429e-09
x2^2 -7.9515e-06 2.1123e-06 -3.7643 0.00021208
x3^2 -0.32064 0.17503 -1.8319 0.068256
x4^2 -6.1587e-05 7.9081e-06 -7.7879 2.3169e-13
x5^2 54855 10628 5.1612 5.3029e-07
x6^2 -2.4288 0.49161 -4.9405 1.4997e-06
X = rand(258,2);
y = rand(258,1);
mdl = fitlm(X, y, "quadratic")
mdl =
Linear regression model:
    y ~ 1 + x1*x2 + x1^2 + x2^2

Estimated Coefficients:
                    Estimate       SE       tStat       pValue
                   _________    _______    ________    __________
    (Intercept)      0.55183    0.10718      5.1488    5.2846e-07
    x1             -0.060941    0.28228    -0.21589       0.82925
    x2              -0.23265    0.31614    -0.73591       0.46247
    x1:x2           -0.19381    0.23901    -0.81089       0.41819
    x1^2             0.15189    0.24491      0.6202       0.53569
    x2^2             0.27773     0.2677      1.0375       0.30051

Number of observations: 258, Error degrees of freedom: 252
Root Mean Squared Error: 0.294
R-squared: 0.0107,  Adjusted R-Squared: -0.00897
F-statistic vs. constant model: 0.543, p-value = 0.744
p = mdl.Coefficients.Estimate
p = 6x1
    0.5518
   -0.0609
   -0.2327
   -0.1938
    0.1519
    0.2777
fun = @(x)p(1)+p(2)*x(1)+p(3)*x(2)+p(4)*x(1)*x(2)+p(5)*x(1)^2+p(6)*x(2)^2;
format long
ypred = predict(mdl,[1 6])
ypred =
8.082192401474730
fun([1 6])
ans =
8.082192401474730
Unfortunately, they make it more difficult than they should, at least in my opinion, because this was not well documented in the help, AND because they don't have a simple way to extract the explicit order of the terms.
You need to read about Wilkinson notation.
So, when you see this:
Linear regression model:
y ~ 1 + x1*x2 + x1*x3 + x1*x4 + x1*x5 + x1*x6 + x2*x3 + x2*x4 + x2*x5 + x2*x6 + x3*x4 + x3*x5 + x3*x6 + x4*x5 + x4*x6 + x5*x6 + x1^2 + x2^2 + x3^2 + x4^2 + x5^2 + x6^2
the linear terms are all implicitly in the model too. Unfortunately, that does not help you here a lot. Instead, you can find the sequence of terms like this:
X = randn(100,3);y = rand(100,1);
mdl = fitlm(X,y,'quadratic')
mdl =
Linear regression model:
    y ~ 1 + x1*x2 + x1*x3 + x2*x3 + x1^2 + x2^2 + x3^2

Estimated Coefficients:
                    Estimate        SE        tStat       pValue
                   __________    ________    ________    __________
    (Intercept)       0.47887    0.044888      10.668    1.1808e-17
    x1             0.00089959    0.024696    0.036426       0.97102
    x2               0.026874     0.02949      0.9113       0.36457
    x3             -0.0070283    0.031603    -0.22239       0.82451
    x1:x2          -0.0075917    0.030142    -0.25186       0.80172
    x1:x3           0.0040778    0.033902     0.12028       0.90453
    x2:x3           0.0034223    0.033627     0.10177       0.91916
    x1^2            0.0013212    0.016642    0.079389        0.9369
    x2^2            0.0011447    0.025451    0.044977       0.96422
    x3^2             0.045289    0.031263      1.4487        0.1509

Number of observations: 100, Error degrees of freedom: 90
Root Mean Squared Error: 0.26
R-squared: 0.0362,  Adjusted R-Squared: -0.0602
F-statistic vs. constant model: 0.375, p-value = 0.944
mdl.CoefficientNames
ans = 1x10 cell array
{'(Intercept)'} {'x1'} {'x2'} {'x3'} {'x1:x2'} {'x1:x3'} {'x2:x3'} {'x1^2'} {'x2^2'} {'x3^2'}
As you can see, what they call the linear regression model is valid Wilkinson notation, but it does not lead easily to reconstructing the model with the coefficients in the proper order. However, mdl.CoefficientNames does, at least more directly.
@John D'Errico, Thank you for explaining that fitlm() uses Wilkinson notation to display the formula.
If using the quadratic option, then it is not hard to know the right order for the coefficients. As you point out, they are in mdl.CoefficientNames. Or you can simply add the linear terms, in order, after the initial "1". Example.
X=rand(50,3); y=rand(50,1);
mdl=fitlm(X,y,'quadratic');
Display the formula (Wilkinson notation)
mdl.Formula.LinearPredictor
ans = '1 + x1*x2 + x1*x3 + x2*x3 + x1^2 + x2^2 + x3^2'
The full formula, with the terms in the right order, is
fullFormula=['1 + x1 + x2 + x3 + ',mdl.Formula.LinearPredictor(5:end)]
fullFormula = '1 + x1 + x2 + x3 + x1*x2 + x1*x3 + x2*x3 + x1^2 + x2^2 + x3^2'
If the model order is higher than 2 in some or all predictors, then the simplest approach for the full formula is to actually use mdl.CoefficientNames.
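Either way, once the ordered term names are in hand, the full-precision coefficients can be spliced in to produce an explicit equation. A sketch (assuming mdl is a fitted model as above); note that fitlm writes interactions as 'x1:x2', so ':' is swapped for '*', and '%.17g' preserves full double precision:

```matlab
% Build an explicit equation string, term by term, in the order given
% by mdl.CoefficientNames (which matches mdl.Coefficients.Estimate).
terms = strrep(mdl.CoefficientNames, ':', '*');
b     = mdl.Coefficients.Estimate;
eqn   = sprintf('y = %.17g', b(1));
for k = 2:numel(b)
    eqn = [eqn, sprintf(' %+.17g*%s', b(k), terms{k})]; %#ok<AGROW>
end
disp(eqn)
```

The '%+.17g' format prints each coefficient with its sign, so the string reads as a complete equation that can be pasted into another environment.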
Thanks everyone - I got what I needed.
After some discussion with tech support, I learned a resolution, though I did not find it myself. The documentation for LinearModel needs to be enhanced IMHO.
X=rand(50,3); y=rand(50,1);
mdl=fitlm(X,y,'quadratic')
mdl =
Linear regression model:
    y ~ 1 + x1*x2 + x1*x3 + x2*x3 + x1^2 + x2^2 + x3^2

Estimated Coefficients:
                     Estimate       SE        tStat         pValue
                    __________    _______    __________    _________
    (Intercept)        0.92378    0.27333        3.3798    0.0016297
    x1                -0.68435    0.69208      -0.98883      0.32869
    x2                -0.69012    0.78414      -0.88011      0.38406
    x3                -0.39798    0.74644      -0.53317      0.59687
    x1:x2              0.62579    0.55378          1.13       0.2652
    x1:x3              0.40651    0.58199       0.69848      0.48892
    x2:x3             -0.35752    0.68687       -0.5205      0.60559
    x1^2            -0.0047517    0.57365    -0.0082833      0.99343
    x2^2               0.62704    0.73182       0.85682      0.39665
    x3^2               0.44112    0.68853       0.64067      0.52539

Number of observations: 50, Error degrees of freedom: 40
Root Mean Squared Error: 0.303
R-squared: 0.126,  Adjusted R-Squared: -0.0707
F-statistic vs. constant model: 0.64, p-value = 0.756
exponents = mdl.Formula.Terms(:,1:end-1)
exponents = 10x3
     0     0     0
     1     0     0
     0     1     0
     0     0     1
     1     1     0
     1     0     1
     0     1     1
     2     0     0
     0     2     0
     0     0     2
Note that I needed to drop the last column. But this is essentially a full description of each term in the model. It identifies the exponents of each variable that would be necessary to reconstruct the model. How can we use this? First of all, we should see it does work properly.
syms x1 x2 x3
vpa(mdl.Coefficients.Estimate.'*prod([x1,x2,x3].^exponents,2),4)
ans = (symbolic display of the full quadratic polynomial in x1, x2, x3, with coefficients shown to 4 digits)
For a single point newX, as a row vector, we can just do this:
mdlfun = @(newX) mdl.Coefficients.Estimate.'*prod(newX.^exponents,2);
mdlfun([1 2 3])
ans = 4.4647
And for comparison, we see:
predict(mdl,[1 2 3])
ans = 4.4647
With just a little more effort, we could even vectorize the function.
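For example, a vectorized version might look like this (a sketch, assuming mdl and exponents from above). It evaluates the model for every row of an n-by-3 input at once via implicit expansion:

```matlab
% Vectorized model evaluation: each row of Xn (n-by-3) is one point.
% reshape(Xn,[],1,3) is n-by-1-by-3; reshape(exponents,1,[],3) is
% 1-by-10-by-3. Implicit expansion gives n-by-10-by-3, and the product
% over dim 3 yields the n-by-10 design matrix for all terms at once.
mdlfunV = @(Xn) prod( reshape(Xn, [], 1, size(Xn,2)) ...
                   .^ reshape(exponents, 1, [], size(Xn,2)), 3 ) ...
                * mdl.Coefficients.Estimate;
mdlfunV([1 2 3; 4 5 6])   % two points in one call
```

Each row of the result should match predict(mdl, ...) on the corresponding input row.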


