Main Content

regstats

Regression diagnostics

Description

regstats(y,X) performs a multiple linear regression of the responses in y on the predictors in X, and computes regression diagnostic statistics, which you specify in a dialog box. For a list of supported statistics, see StatNames.

regstats(y,X,modelspec) performs a multiple linear regression using the terms specified in modelspec.

regstats(y,X,modelspec,StatNames) computes only the statistics specified in StatNames. To compute all supported statistics, use regstats(y,X,modelspec,"all").

example

stats = regstats(___) returns the regression diagnostic statistics in the structure stats, for any of the previous syntaxes. If you do not specify StatNames, the function returns all supported statistics.

example

Examples

collapse all

Load the patients data set, which contains simulated data on 100 hospital patients.

load patients

Perform a multiple linear regression of the Diastolic values on the Height, Weight, and Age predictors. Return all available diagnostic statistics in the structure stats.

stats = regstats(Diastolic,[Height Weight Age])
stats = struct with fields:
        source: 'regstats'
             Q: [100×4 double]
             R: [4×4 double]
          beta: [4×1 double]
          covb: [4×4 double]
          yhat: [100×1 double]
             r: [100×1 double]
           mse: 46.9215
       rsquare: 0.0533
    adjrsquare: 0.0237
      leverage: [100×1 double]
        hatmat: [100×100 double]
          s2_i: [100×1 double]
        beta_i: [4×100 double]
      standres: [100×1 double]
       studres: [100×1 double]
       dfbetas: [4×100 double]
         dffit: [100×1 double]
        dffits: [100×1 double]
      covratio: [100×1 double]
         cookd: [100×1 double]
         tstat: [1×1 struct]
         fstat: [1×1 struct]
        dwstat: [1×1 struct]

The stats structure contains diagnostic statistics for the regression fit. To access individual statistics, use dot notation. For example, stats.beta displays the regression coefficients.

stats.beta
ans = 4×1

   72.3544
   -0.0064
    0.0571
    0.0585

Load the hald data set, which measures the effect of cement composition on its hardening heat.

load hald

This data set includes the variables ingredients and heat. The matrix ingredients contains the percent composition of four chemicals present in the cement. The vector heat contains the values for the heat hardening after 180 days for each cement sample.

Fit a multiple linear regression model to the data. Use a model that includes interaction terms, and return only the fitted values and residuals to the structure stats.

stats = regstats(heat,ingredients,"interactions",["yhat" "r"])
stats = struct with fields:
    source: 'regstats'
      yhat: [13×1 double]
         r: [13×1 double]

Plot the residuals versus the fitted heat values, and add a red dashed line to indicate the mean residual value.

scatter(stats.yhat,stats.r)
ylabel("Residual")
xlabel("Fitted Heat Value")
hold on
yline(mean(stats.r),"r--")
hold off

Figure contains an axes object. The axes object with xlabel Fitted Heat Value, ylabel Residual contains 2 objects of type scatter, constantline.

The plot shows that the mean residual value is close to 0, and that the residual values range from –0.7 to 0.6.

Input Arguments

collapse all

Response variable, specified as an n-by-1 numeric vector, where n is the number of rows (observations) in X. Each entry in y is the response for the corresponding row of X. The software treats NaN values as missing values, and omits observations with missing response values from the regression fit.

Data Types: single | double

Predictor variables, specified as an n-by-p numeric matrix, where n is the number of observations and p is the number of predictor variables. Each column of X represents one variable, and each row represents one observation.

By default, the model includes a constant term unless you explicitly remove it, so do not include a column of 1s in X.

The software treats NaN values as missing values, and omits observations with missing values from the regression fit.

Data Types: single | double

Model terms, specified as a value in the following table or as a numeric matrix.

ValueModel Contents
"linear" or "additive" (default)Constant and linear terms
"interactions"Constant, linear, and interaction terms
"quadratic"Constant, linear, interaction, and squared terms
"purequadratic"Constant, linear, and squared terms

If you specify modelspec as a numeric matrix, it must contain one column for each predictor and one row for each polynomial term in the model. The entries in each row are exponents for the predictors in the columns of X. For example, if a model has predictors X1, X2, and X3, then row [0 1 2] in modelspec specifies the term X10X21X32. A row of all zeros in modelspec specifies a constant term, which can be omitted.

Data Types: single | double | char | string

Names of the diagnostic statistics to compute, specified as a character vector, string array, or cell array of character vectors. You can specify any combination of values in the following table, or specify "all" to compute all of the statistics.

ValueDescriptionMore Information
QQ from the QR decomposition of the design matrixSee Factorizations
RR from the QR decomposition of the design matrix
betaRegression coefficientsSee What Is a Linear Regression Model?
yhatFitted values of the response data
covbCovariance of regression coefficientsSee Coefficient Standard Errors and Confidence Intervals
mseMean squared error
rResidualsSee Residuals
studresStudentized residuals
standresStandardized residuals
rsquareR2 statistic. This statistic might be negative for models without a constant term, which indicates that the model is not appropriate for the data.See Coefficient of Determination (R-Squared)
adjrsquareAdjusted R2 statistic
leverageLeverageSee Hat Matrix and Leverage
hatmatHat matrix
s2_iDelete-1 varianceSee Delete-1 Statistics
beta_iDelete-1 coefficients
dfbetasScaled change in regression coefficients
dffitChange in fitted values
dffitsScaled change in fitted values
covratioChange in covariance
cookdCook's distanceSee Cook’s Distance
tstatt statistics and p-values for coefficientsSee F-statistic and t-statistic
fstatF-statistic and p-value. The software computes this statistic with the assumption that the model contains a constant term. The F-statistic is incorrect for models without a constant term.
dwstatDurbin-Watson statistic and p-valueSee Durbin-Watson Test
allAll supported statistics

If you do not specify StatNames or return stats, the function displays a dialog box that contains a list of supported diagnostic statistics. The names of the workspace variables are displayed on the right-hand side of the interface. You can change the name of a workspace variable to any valid MATLAB® variable name. When you select check boxes corresponding to the statistics you want to compute and click OK, the function returns the selected statistics to the MATLAB workspace as variables.

Regstats Export to Workspace dialog box displaying a list of diagnostic statistics to select.

Data Types: string | char | cell

Output Arguments

collapse all

Diagnostic statistics, returned as a structure. For a description of the statistics, see the StatNames input argument.

References

[1] Belsley, D. A., E. Kuh, and R. E. Welsch. Regression Diagnostics. Hoboken, NJ: John Wiley & Sons, Inc., 1980.

[2] Chatterjee, S., and A. S. Hadi. "Influential Observations, High Leverage Points, and Outliers in Linear Regression." Statistical Science. Vol. 1, 1986, pp. 379–416.

[3] Cook, R. D., and S. Weisberg. Residuals and Influence in Regression. New York: Chapman & Hall/CRC Press, 1983.

[4] Goodall, C. R. "Computation Using the QR Decomposition." Handbook in Statistics. Vol. 9, Amsterdam: Elsevier/North-Holland, 1993.

Version History

Introduced before R2006a