Documentation Center

  • Trials
  • Product Updates

Linear Regression

Prepare Data

To begin fitting a regression, put your data into a form that fitting functions expect. All regression techniques begin with input data in an array X and response data in a separate vector y, or input data in a table or dataset array tbl and response data as a column in tbl. Each row of the input data represents one observation. Each column represents one predictor (variable).

For a table or dataset array tbl, indicate the response variable with the 'ResponseVar' name-value pair:

mdl = fitlm(tbl,'ResponseVar','BloodPressure');
% or
mdl = fitglm(tbl,'ResponseVar','BloodPressure');

The response variable is the last column by default.

You can use numeric categorical predictors. A categorical predictor is one that takes values from a fixed set of possibilities.

  • For a numeric array X, indicate the categorical predictors using the 'Categorical' name-value pair. For example, to indicate that predictors 2 and 3 out of six are categorical:

    mdl = fitlm(X,y,'Categorical',[2,3]);
    % or
    mdl = fitglm(X,y,'Categorical',[2,3]);
    % or equivalently
    mdl = fitlm(X,y,'Categorical',logical([0 1 1 0 0 0]));
  • For a table or dataset array tbl, fitting functions assume that these data types are categorical:

    • Logical

    • Categorical (nominal or ordinal)

    • String or character array

    If you want to indicate that a numeric predictor is categorical, use the 'Categorical' name-value pair.

Represent missing numeric data as NaN. To represent missing data for other data types, see Missing Group Values.

Dataset Array for Input and Response Data

To create a dataset array from an Excel® spreadsheet:

ds = dataset('XLSFile','hospital.xls',...
    'ReadObsNames',true);

To create a dataset array from workspace variables:

load carsmall
ds = dataset(MPG,Weight);
ds.Year = ordinal(Model_Year);

Table for Input and Response Data

To create a table from an Excel spreadsheet:

tbl = readtable('hospital.xls',...
    'ReadRowNames',true);

To create a table from workspace variables:

load carsmall
tbl = table(MPG,Weight);
tbl.Year = ordinal(Model_Year);

Numeric Matrix for Input Data, Numeric Vector for Response

For example, to create numeric arrays from workspace variables:

load carsmall
X = [Weight Horsepower Cylinders Model_Year];
y = MPG;

To create numeric arrays from an Excel spreadsheet:

[X Xnames] = xlsread('hospital.xls');
y = X(:,4); % response y is systolic pressure
X(:,4) = []; % remove y from the X matrix

Notice that the nonnumeric entries, such as sex, do not appear in X.

Choose a Fitting Method

There are three ways to fit a model to data:

Least-Squares Fit

Use fitlm to construct a least-squares fit of a model to the data. This method is best when you are reasonably certain of the model's form, and mainly need to find its parameters. This method is also useful when you want to explore a few models. The method requires you to examine the data manually to discard outliers, though there are techniques to help (see Residuals — Model Quality for Training Data).

Robust Fit

Use fitlm with the RobustOpts name-value pair to create a model that is little affected by outliers. Robust fitting saves you the trouble of manually discarding outliers. However, step does not work with robust fitting. This means that when you use robust fitting, you cannot search stepwise for a good model.

Stepwise Fit

Use stepwiselm to find a model, and fit parameters to the model. stepwiselm starts from one model, such as a constant, and adds or subtracts terms one at a time, choosing an optimal term each time in a greedy fashion, until it cannot improve further. Use stepwise fitting to find a good model, which is one that has only relevant terms.

The result depends on the starting model. Usually, starting with a constant model leads to a small model. Starting with more terms can lead to a more complex model, but one that has lower mean squared error. See Compare large and small stepwise models.

You cannot use robust options along with stepwise fitting. So after a stepwise fit, examine your model for outliers (see Residuals — Model Quality for Training Data).

Choose a Model or Range of Models

There are several ways of specifying a model for linear regression. Use whichever you find most convenient.

For fitlm, the model specification you give is the model that is fit. If you do not give a model specification, the default is 'linear'.

For stepwiselm, the model specification you give is the starting model, which the stepwise procedure tries to improve. If you do not give a model specification, the default starting model is 'constant', and the default upper bounding model is 'interactions'. Change the upper bounding model using the Upper name-value pair.

Brief String

StringModel Type
'constant'Model contains only a constant (intercept) term.
'linear'Model contains an intercept and linear terms for each predictor.
'interactions'Model contains an intercept, linear terms, and all products of pairs of distinct predictors (no squared terms).
'purequadratic'Model contains an intercept, linear terms, and squared terms.
'quadratic'Model contains an intercept, linear terms, interactions, and squared terms.
'polyijk'Model is a polynomial with all terms up to degree i in the first predictor, degree j in the second predictor, etc. Use numerals 0 through 9. For example, 'poly2111' has a constant plus all linear and product terms, and also contains terms with predictor 1 squared.

For example, to specify an interaction model using fitlm with matrix predictors:

mdl = fitlm(X,y,'interactions');

To specify a model using stepwiselm and a table or dataset array tbl of predictors, suppose you want to start from a constant and have a linear model upper bound. Assume the response variable in tbl is in the third column.

mdl2 = stepwiselm(tbl,'constant',...
    'Upper','linear','ResponseVar',3);

Terms Matrix

A terms matrix is a t-by-(p + 1) matrix specifying terms in a model, where t is the number of terms, p is the number of predictor variables, and plus one is for the response variable.

The value of T(i,j) is the exponent of variable j in term i. Suppose there are three predictor variables A, B, and C:

[0 0 0 0] % Constant term or intercept
[0 1 0 0] % B; equivalently, A^0 * B^1 * C^0
[1 0 1 0] % A*C
[2 0 0 0] % A^2
[0 1 2 0] % B*(C^2)

The 0 at the end of each term represents the response variable. In general,

  • If you have the variables in a table or dataset array, then 0 must represent the response variable depending on the position of the response variable. The following example illustrates this.

    Load the sample data and define the dataset array.

    load hospital
    ds = dataset(hospital.Sex,hospital.BloodPressure(:,1),hospital.Age,...
    hospital.Smoker,'VarNames',{'Sex','BloodPressure','Age','Smoker'});

    Represent the linear model 'BloodPressure ~ 1 + Sex + Age + Smoker' in a terms matrix. The response variable is in the second column of the dataset array, so there must be a column of 0s for the response variable in the second column of the terms matrix.

    T = [0 0 0 0;1 0 0 0;0 0 1 0;0 0 0 1]
    
    T =
    
         0     0     0     0
         1     0     0     0
         0     0     1     0
         0     0     0     1

    Redefine the dataset array.

    ds = dataset(hospital.BloodPressure(:,1),hospital.Sex,hospital.Age,...
    hospital.Smoker,'VarNames',{'BloodPressure','Sex','Age','Smoker'});
    

    Now, the response variable is the first term in the dataset array. Specify the same linear model, 'BloodPressure ~ 1 + Sex + Age + Smoker', using a terms matrix.

    T = [0 0 0 0;0 1 0 0;0 0 1 0;0 0 0 1]
    T =
    
         0     0     0     0
         0     1     0     0
         0     0     1     0
         0     0     0     1
  • If you have the predictor and response variables in a matrix and column vector, then you must include 0 for the response variable at the end of each term. The following example illustrates this.

    Load the sample data and define the matrix of predictors.

    load carsmall
    X = [Acceleration,Weight];
    

    Specify the model 'MPG ~ Acceleration + Weight + Acceleration:Weight + Weight^2' using a term matrix and fit the model to the data. This model includes the main effect and two-way interaction terms for the variables, Acceleration and Weight, and a second-order term for the variable, Weight.

    T = [0 0 0;1 0 0;0 1 0;1 1 0;0 2 0]
    
    T =
    
         0     0     0
         1     0     0
         0     1     0
         1     1     0
         0     2     0
    

    Fit a linear model.

    mdl = fitlm(X,MPG,T)
    mdl = 
    
    Linear regression model:
        y ~ 1 + x1*x2 + x2^2
    
    Estimated Coefficients:
                       Estimate       SE            tStat      pValue    
        (Intercept)         48.906        12.589     3.8847    0.00019665
        x1                 0.54418       0.57125    0.95261       0.34337
        x2               -0.012781     0.0060312    -2.1192      0.036857
        x1:x2          -0.00010892    0.00017925    -0.6076         0.545
        x2^2            9.7518e-07    7.5389e-07     1.2935       0.19917
    
    Number of observations: 94, Error degrees of freedom: 89
    Root Mean Squared Error: 4.1
    R-squared: 0.751,  Adjusted R-Squared 0.739
    F-statistic vs. constant model: 67, p-value = 4.99e-26

    Only the intercept and x2 term, which correspond to the Weight variable, are significant at the 5% significance level.

    Now, perform a stepwise regression with a constant model as the starting model and a linear model with interactions as the upper model.

    T = [0 0 0;1 0 0;0 1 0;1 1 0];
    mdl = stepwiselm(X,MPG,[0 0 0],'upper',T)
    1. Adding x2, FStat = 259.3087, pValue = 1.643351e-28
    
    mdl = 
    
    Linear regression model:
        y ~ 1 + x2
    
    Estimated Coefficients:
                       Estimate      SE           tStat      pValue    
        (Intercept)        49.238       1.6411     30.002    2.7015e-49
        x2             -0.0086119    0.0005348    -16.103    1.6434e-28
    
    Number of observations: 94, Error degrees of freedom: 92
    Root Mean Squared Error: 4.13
    R-squared: 0.738,  Adjusted R-Squared 0.735
    F-statistic vs. constant model: 259, p-value = 1.64e-28

    The results of the stepwise regression are consistent with the results of fitlm in the previous step.

Formula

A formula for a model specification is a string of the form

'Y ~ terms',

  • Y is the response name.

  • terms contains

    • Variable names

    • + to include the next variable

    • - to exclude the next variable

    • : to define an interaction, a product of terms

    • * to define an interaction and all lower-order terms

    • ^ to raise the predictor to a power, exactly as in * repeated, so ^ includes lower order terms as well

    • () to group terms

    Tip   Formulas include a constant (intercept) term by default. To exclude a constant term from the model, include -1 in the formula.

Examples:

'Y ~ A + B + C' is a three-variable linear model with intercept.
'Y ~ A + B + C - 1' is a three-variable linear model without intercept.
'Y ~ A + B + C + B^2' is a three-variable model with intercept and a B^2 term.
'Y ~ A + B^2 + C' is the same as the previous example, since B^2 includes a B term.
'Y ~ A + B + C + A:B' includes an A*B term.
'Y ~ A*B + C' is the same as the previous example, since A*B = A + B + A:B.
'Y ~ A*B*C - A:B:C' has all interactions among A, B, and C, except the three-way interaction.
'Y ~ A*(B + C + D)' has all linear terms, plus products of A with each of the other variables.

For example, to specify an interaction model using fitlm with matrix predictors:

mdl = fitlm(X,y,'y ~ x1*x2*x3 - x1:x2:x3');

To specify a model using stepwiselm and a table or dataset array tbl of predictors, suppose you want to start from a constant and have a linear model upper bound. Assume the response variable in tbl is named 'y', and the predictor variables are named 'x1', 'x2', and 'x3'.

mdl2 = stepwiselm(tbl,'y ~ 1','Upper','y ~ x1 + x2 + x3');

Fit Model to Data

The most common optional arguments for fitting:

  • For robust regression in fitlm, set the 'RobustOpts' name-value pair to 'on'.

  • Specify an appropriate upper bound model in stepwiselm, such as set 'Upper' to 'linear'.

  • Indicate which variables are categorical using the 'CategoricalVars' name-value pair. Provide a vector with column numbers, such as [1 6] to specify that predictors 1 and 6 are categorical. Alternatively, give a logical vector the same length as the data columns, with a 1 entry indicating that variable is categorical. If there are seven predictors, and predictors 1 and 6 are categorical, specify logical([1,0,0,0,0,1,0]).

  • For a table or dataset array, specify the response variable using the 'ResponseVar' name-value pair. The default is the last column in the array.

For example,

mdl = fitlm(X,y,'linear',...
    'RobustOpts','on','CategoricalVars',3);
mdl2 = stepwiselm(tbl,'constant',...
    'ResponseVar','MPG','Upper','quadratic');

Examine Quality and Adjust the Fitted Model

After fitting a model, examine the result.

Model Display

A linear regression model shows several diagnostics when you enter its name or enter disp(mdl). This display gives some of the basic information to check whether the fitted model represents the data adequately.

For example, fit a linear model to data constructed with two out of five predictors not present and with no intercept term:

X = randn(100,5);
y = X*[1;0;3;0;-1]+randn(100,1);
mdl = fitlm(X,y)
mdl = 

Linear regression model:
    y ~ 1 + x1 + x2 + x3 + x4 + x5

Estimated Coefficients:
                   Estimate     SE          tStat       pValue    
    (Intercept)     0.038164    0.099458     0.38372       0.70205
    x1               0.92794    0.087307      10.628    8.5494e-18
    x2             -0.075593     0.10044    -0.75264       0.45355
    x3                2.8965    0.099879          29    1.1117e-48
    x4              0.045311     0.10832     0.41831       0.67667
    x5              -0.99708     0.11799     -8.4504     3.593e-13

Number of observations: 100, Error degrees of freedom: 94
Root Mean Squared Error: 0.972
R-squared: 0.93,  Adjusted R-Squared 0.926
F-statistic vs. constant model: 248, p-value = 1.5e-52

Notice that:

  • The display contains the estimated values of each coefficient in the Estimate column. These values are reasonably near the true values [0;1;0;3;0;-1].

  • There is a standard error column for the coefficient estimates.

  • The reported pValue (which are derived from the t statistics under the assumption of normal errors) for predictors 1, 3, and 5 are extremely small. These are the three predictors that were used to create the response data y.

  • The pValue for (Intercept), x2 and x4 are much larger than 0.01. These three predictors were not used to create the response data y.

  • The display contains R2, adjusted R2, and F statistics.

ANOVA

To examine the quality of the fitted model, consult an ANOVA table. For example, use anova on a linear model with five predictors:

X = randn(100,5);
y = X*[1;0;3;0;-1]+randn(100,1);
mdl = fitlm(X,y);
tbl = anova(mdl)
tbl = 
             SumSq      DF    MeanSq     F          pValue    
    x1        106.62     1     106.62     112.96    8.5494e-18
    x2       0.53464     1    0.53464    0.56646       0.45355
    x3        793.74     1     793.74     840.98    1.1117e-48
    x4       0.16515     1    0.16515    0.17498       0.67667
    x5        67.398     1     67.398      71.41     3.593e-13
    Error     88.719    94    0.94382

This table gives somewhat different results than the default display (see Model Display). The table clearly shows that the effects of x2 and x4 are not significant. Depending on your goals, consider removing x2 and x4 from the model.

Diagnostic Plots

Diagnostic plots help you identify outliers, and see other problems in your model or fit. For example, load the carsmall data, and make a model of MPG as a function of Cylinders (nominal) and Weight:

load carsmall
tbl = table(Weight,MPG,Cylinders);
tbl.Cylinders = ordinal(tbl.Cylinders);
mdl = fitlm(tbl,'MPG ~ Cylinders*Weight + Weight^2');ˋ

Make a leverage plot of the data and model.

plotDiagnostics(mdl)

There are a few points with high leverage. But this plot does not reveal whether the high-leverage points are outliers.

Look for points with large Cook's distance.

plotDiagnostics(mdl,'cookd')

There is one point with large Cook's distance. Identify it and remove it from the model. You can use the Data Cursor to click the outlier and identify it, or identify it programmatically:

[~,larg] = max(mdl.Diagnostics.CooksDistance);
mdl2 = fitlm(tbl,'MPG ~ Cylinders*Weight + Weight^2',...
    'Exclude',larg);

Residuals — Model Quality for Training Data

There are several residual plots to help you discover errors, outliers, or correlations in the model or data. The simplest residual plots are the default histogram plot, which shows the range of the residuals and their frequencies, and the probability plot, which shows how the distribution of the residuals compares to a normal distribution with matched variance.

Load the carsmall data, and make a model of MPG as a function of Cylinders (nominal) and Weight:

load carsmall
tbl = table(Weight,MPG,Cylinders);
tbl.Cylinders = ordinal(tbl.Cylinders);
mdl = fitlm(tbl,'MPG ~ Cylinders*Weight + Weight^2');

Examine the residuals:

plotResiduals(mdl)

The observations above 12 are potential outliers.

plotResiduals(mdl,'probability')

The two potential outliers appear on this plot as well. Otherwise, the probability plot seems reasonably straight, meaning a reasonable fit to normally distributed residuals.

You can identify the two outliers and remove them from the data:

outl = find(mdl.Residuals.Raw > 12)
outl =

    90
    97

To remove the outliers, use the Exclude name-value pair:

mdl2 = fitlm(tbl,'MPG ~ Cylinders*Weight + Weight^2',...
    'Exclude',outl);

Examine a residuals plot of mdl2:

plotResiduals(mdl2)

The new residuals plot looks fairly symmetric, without obvious problems. However, there might be some serial correlation among the residuals. Create a new plot to see if such an effect exists.

plotResiduals(mdl2,'lagged')

The scatter plot shows many more crosses in the upper-right and lower-left quadrants than in the other two quadrants, indicating positive serial correlation among the residuals.

Another potential issue is when residuals are large for large observations. See if the current model has this issue.

plotResiduals(mdl2,'fitted')

There is some tendency for larger fitted values to have larger residuals. Perhaps the model errors are proportional to the measured values.

Plots to Understand Predictor Effects

This example shows how to understand the effect each predictor has on a regression model using a variety of available plots.

  1. Create a model of mileage from some predictors in the carsmall data.

    load carsmall
    tbl = table(Weight,MPG,Cylinders);
    tbl.Cylinders = ordinal(tbl.Cylinders);
    mdl = fitlm(tbl,'MPG ~ Cylinders*Weight + Weight^2');
  2. Examine a slice plot of the responses. This displays the effect of each predictor separately.

    plotSlice(mdl)

    You can drag the individual predictor values, which are represented by dashed blue vertical lines. You can also choose between simultaneous and non-simultaneous confidence bounds, which are represented by dashed red curves.

  3. Use an effects plot to show another view of the effect of predictors on the response.

    plotEffects(mdl)

    This plot shows that changing Weight from about 2500 to 4732 lowers MPG by about 30 (the location of the upper blue circle). It also shows that changing the number of cylinders from 8 to 4 raises MPG by about 10 (the lower blue circle). The horizontal blue lines represent confidence intervals for these predictions. The predictions come from averaging over one predictor as the other is changed. In cases such as this, where the two predictors are correlated, be careful when interpreting the results.

  4. Instead of viewing the effect of averaging over a predictor as the other is changed, examine the joint interaction in an interaction plot.

    plotInteraction(mdl,'Weight','Cylinders')

    The interaction plot shows the effect of changing one predictor with the other held fixed. In this case, the plot is much more informative. It shows, for example, that lowering the number of cylinders in a relatively light car (Weight = 1795) leads to an increase in mileage, but lowering the number of cylinders in a relatively heavy car (Weight = 4732) leads to a decrease in mileage.

  5. For an even more detailed look at the interactions, look at an interaction plot with predictions. This plot holds one predictor fixed while varying the other, and plots the effect as a curve. Look at the interactions for various fixed numbers of cylinders.

    plotInteraction(mdl,'Cylinders','Weight','predictions')

    Now look at the interactions with various fixed levels of weight.

    plotInteraction(mdl,'Weight','Cylinders','predictions')

Plots to Understand Terms Effects

This example shows how to understand the effect of each term in a regression model using a variety of available plots.

  1. Create a model of mileage from some predictors in the carsmall data.

    load carsmall
    tbl = table(Weight,MPG,Cylinders);
    tbl.Cylinders = ordinal(tbl.Cylinders);
    mdl = fitlm(tbl,'MPG ~ Cylinders*Weight + Weight^2');
  2. Create an added variable plot with Weight^2 as the added variable.

    plotAdded(mdl,'Weight^2')

    This plot shows the results of fitting both Weight^2 and MPG to the terms other than Weight^2. The reason to use plotAdded is to understand what additional improvement in the model you get by adding Weight^2. The coefficient of a line fit to these points is the coefficient of Weight^2 in the full model. The Weight^2 predictor is just over the edge of significance (pValue < 0.05) as you can see in the coefficients table display. You can see that in the plot as well. The confidence bounds look like they could not contain a horizontal line (constant y), so a zero-slope model is not consistent with the data.

  3. Create an added variable plot for the model as a whole.

    plotAdded(mdl)

    The model as a whole is very significant, so the bounds don't come close to containing a horizontal line. The slope of the line is the slope of a fit to the predictors projected onto their best-fitting direction, or in other words, the norm of the coefficient vector.

Change Models

There are two ways to change a model:

If you created a model using stepwiselm, step can have an effect only if you give different upper or lower models. step does not work when you fit a model using RobustOpts.

For example, start with a linear model of mileage from the carbig data:

load carbig
tbl = table(Acceleration,Displacement,Horsepower,Weight,MPG);
mdl = fitlm(tbl,'linear','ResponseVar','MPG')
mdl = 


Linear regression model:
    MPG ~ 1 + Acceleration + Displacement + Horsepower + Weight

Estimated Coefficients:
                    Estimate      SE            tStat       pValue    
    (Intercept)         45.251         2.456      18.424    7.0721e-55
    Acceleration     -0.023148        0.1256     -0.1843       0.85388
    Displacement    -0.0060009     0.0067093    -0.89441       0.37166
    Horsepower       -0.043608      0.016573     -2.6312      0.008849
    Weight          -0.0052805    0.00081085     -6.5123    2.3025e-10


Number of observations: 392, Error degrees of freedom: 387
Root Mean Squared Error: 4.25
R-squared: 0.707,  Adjusted R-Squared 0.704
F-statistic vs. constant model: 233, p-value = 9.63e-102

Try to improve the model using step for up to 10 steps:

mdl1 = step(mdl,'NSteps',10)
1. Adding Displacement:Horsepower, FStat = 87.4802, pValue = 7.05273e-19

mdl1 = 


Linear regression model:
    MPG ~ 1 + Acceleration + Weight + Displacement*Horsepower

Estimated Coefficients:
                               Estimate      SE            tStat      pValue    
    (Intercept)                    61.285        2.8052     21.847    1.8593e-69
    Acceleration                 -0.34401       0.11862       -2.9     0.0039445
    Displacement                -0.081198      0.010071    -8.0623    9.5014e-15
    Horsepower                   -0.24313      0.026068    -9.3265    8.6556e-19
    Weight                     -0.0014367    0.00084041    -1.7095      0.088166
    Displacement:Horsepower    0.00054236    5.7987e-05     9.3531    7.0527e-19


Number of observations: 392, Error degrees of freedom: 386
Root Mean Squared Error: 3.84
R-squared: 0.761,  Adjusted R-Squared 0.758
F-statistic vs. constant model: 246, p-value = 1.32e-117

step stopped after just one change.

To try to simplify the model, remove the Acceleration and Weight terms from mdl1:

mdl2 = removeTerms(mdl1,'Acceleration + Weight')
mdl2 = 


Linear regression model:
    MPG ~ 1 + Displacement*Horsepower

Estimated Coefficients:
                               Estimate      SE           tStat      pValue     
    (Intercept)                    53.051        1.526     34.765    3.0201e-121
    Displacement                -0.098046    0.0066817    -14.674     4.3203e-39
    Horsepower                   -0.23434     0.019593     -11.96     2.8024e-28
    Displacement:Horsepower    0.00058278    5.193e-05     11.222     1.6816e-25


Number of observations: 392, Error degrees of freedom: 388
Root Mean Squared Error: 3.94
R-squared: 0.747,  Adjusted R-Squared 0.745
F-statistic vs. constant model: 381, p-value = 3e-115

mdl2 uses just Displacement and Horsepower, and has nearly as good a fit to the data as mdl1 in the Adjusted R-Squared metric.

Predict or Simulate Responses to New Data

There are three ways to use a linear model to predict or simulate the response to new data:

predict

This example shows how to predict and obtain confidence intervals on the predictions using the predict method.

  1. Load the carbig data and make a default linear model of the response MPG to the Acceleration, Displacement, Horsepower, and Weight predictors.

    load carbig
    X = [Acceleration,Displacement,Horsepower,Weight];
    mdl = fitlm(X,MPG);
  2. Create a three-row array of predictors from the minimal, mean, and maximal values. There are some NaN values, so use functions that ignore NaN values.

    Xnew = [nanmin(X);nanmean(X);nanmax(X)]; % new data
  3. Find the predicted model responses and confidence intervals on the predictions.

    [NewMPG NewMPGCI] = predict(mdl,Xnew)
    NewMPG =
       34.1345
       23.4078
        4.7751
    
    NewMPGCI =
       31.6115   36.6575
       22.9859   23.8298
        0.6134    8.9367

    The confidence bound on the mean response is narrower than those for the minimum or maximum responses, which is quite sensible.

feval

When you construct a model from a table or dataset array, feval is often more convenient for predicting mean responses than predict. However, feval does not provide confidence bounds.

This example shows how to predict mean responses using the feval method.

  1. Load the carbig data and make a default linear model of the response MPG to the Acceleration, Displacement, Horsepower, and Weight predictors.

    load carbig
    tbl = table(Acceleration,Displacement,Horsepower,Weight,MPG);
    mdl = fitlm(tbl,'linear','ResponseVar','MPG');
  2. Create a three-row array of predictors from the minimal, mean, and maximal values. There are some NaN values, so use functions that ignore NaN values.

    X = [Acceleration,Displacement,Horsepower,Weight];
    Xnew = [nanmin(X);nanmean(X);nanmax(X)]; % new data

    The Xnew array has the correct number of columns for prediction, so feval can use it for predictions.

  3. Find the predicted model responses.

    NewMPG = feval(mdl,Xnew)
    NewMPG =
       34.1345
       23.4078
        4.7751

random

The random method simulates new random response values, equal to the mean prediction plus a random disturbance with the same variance as the training data.

This example shows how to simulate responses using the random method.

  1. Load the carbig data and make a default linear model of the response MPG to the Acceleration, Displacement, Horsepower, and Weight predictors.

    load carbig
    X = [Acceleration,Displacement,Horsepower,Weight];
    mdl = fitlm(X,MPG);
  2. Create a three-row array of predictors from the minimal, mean, and maximal values. There are some NaN values, so use functions that ignore NaN values.

    Xnew = [nanmin(X);nanmean(X);nanmax(X)]; % new data
  3. Generate new predicted model responses including some randomness.

    rng('default') % for reproducibility
    NewMPG = random(mdl,Xnew)
    NewMPG =
    
       36.4178
       31.1958
       -4.8176
  4. Because a negative value of MPG does not seem sensible, try predicting two more times.

    NewMPG = random(mdl,Xnew)
    NewMPG =
    
       37.7959
       24.7615
       -0.7783
    NewMPG = random(mdl,Xnew)
    NewMPG =
    
       32.2931
       24.8628
       19.9715

    Clearly, the predictions for the third (maximal) row of Xnew are not reliable.

Share Fitted Models

Suppose you have a linear regression model, such as mdl from the following commands:

load carbig
tbl = table(Acceleration,Displacement,Horsepower,Weight,MPG);
mdl = fitlm(tbl,'linear','ResponseVar','MPG');

To share the model with other people, you can:

  • Provide the model display.

    mdl
    mdl = 
    
    
    Linear regression model:
        MPG ~ 1 + Acceleration + Displacement + Horsepower + Weight
    
    Estimated Coefficients:
                        Estimate      SE            tStat       pValue    
        (Intercept)         45.251         2.456      18.424    7.0721e-55
        Acceleration     -0.023148        0.1256     -0.1843       0.85388
        Displacement    -0.0060009     0.0067093    -0.89441       0.37166
        Horsepower       -0.043608      0.016573     -2.6312      0.008849
        Weight          -0.0052805    0.00081085     -6.5123    2.3025e-10
    
    
    Number of observations: 392, Error degrees of freedom: 387
    Root Mean Squared Error: 4.25
    R-squared: 0.707,  Adjusted R-Squared 0.704
    F-statistic vs. constant model: 233, p-value = 9.63e-102
  • Provide just the model definition and coefficients.

    mdl.CoefficientNames
    ans = 
    
        '(Intercept)'    'Acceleration'    'Displacement'    'Horsepower'    'Weight'
    mdl.Coefficients.Estimate
    ans =
    
       45.2511
       -0.0231
       -0.0060
       -0.0436
       -0.0053
    mdl.Formula
    ans = 
    
    MPG ~ 1 + Acceleration + Displacement + Horsepower + Weight

Linear Regression Workflow

This example shows how to fit a linear regression model. A typical workflow involves the following: import data, fit a regression, test its quality, modify it to improve the quality, and share it.

Step 1. Import the data into a dataset array.

hospital.xls is an Excel® spreadsheet containing patient names, sex, age, weight, blood pressure, and dates of treatment in an experimental protocol. First read the data into a table.

patients = readtable('hospital.xls',...
    'ReadRowNames',true);

Examine the first row of data.

patients(1,:)
ans = 

                name      sex    age    wgt    smoke    sys    dia    trial1
               _______    ___    ___    ___    _____    ___    ___    ______

    YPL-320    'SMITH'    'm'    38     176    1        124    93     18    


               trial2    trial3    trial4
               ______    ______    ______

    YPL-320    -99       -99       -99   

The sex and smoke fields seem to have two choices each. So change these fields to nominal.

patients.smoke = nominal(patients.smoke,{'No','Yes'});
patients.sex = nominal(patients.sex);

Step 2. Create a fitted model.

Your goal is to model the systolic pressure as a function of a patient's age, weight, sex, and smoking status. Create a linear formula for 'sys' as a function of 'age', 'wgt', 'sex', and 'smoke' .

modelspec = 'sys ~ age + wgt + sex + smoke';
mdl = fitlm(patients,modelspec)
mdl = 


Linear regression model:
    sys ~ 1 + sex + age + wgt + smoke

Estimated Coefficients:
                   Estimate        SE        tStat        pValue  
                   _________    ________    ________    __________

    (Intercept)       118.28      7.6291      15.504    9.1557e-28
    sex_m            0.88162      2.9473     0.29913       0.76549
    age              0.08602     0.06731       1.278       0.20438
    wgt            -0.016685    0.055714    -0.29947       0.76524
    smoke_Yes          9.884      1.0406       9.498    1.9546e-15


Number of observations: 100, Error degrees of freedom: 95
Root Mean Squared Error: 4.81
R-squared: 0.508,  Adjusted R-Squared 0.487
F-statistic vs. constant model: 24.5, p-value = 5.99e-14

The sex, age, and weight predictors have rather high $p$ -values, indicating that some of these predictors might be unnecessary.

Step 3. Locate and remove outliers.

See if there are outliers in the data that should be excluded from the fit. Plot the residuals.

plotResiduals(mdl)

There is one possible outlier, with a value greater than 12. This is probably not truly an outlier. For demonstration, here is how to find and remove it.

Find the outlier.

outlier = mdl.Residuals.Raw > 12;
find(outlier)
ans =

    84

Remove the outlier.

mdl = fitlm(patients,modelspec,...
    'Exclude',84);

mdl.ObservationInfo(84,:)
ans = 

               Weights    Excluded    Missing    Subset
               _______    ________    _______    ______

    WXM-486    1          true        false      false 

Observation 84 is no longer in the model.

Step 4. Simplify the model.

Try to obtain a simpler model, one with fewer predictors but the same predictive accuracy. step looks for a better model by adding or removing one term at a time. Allow step take up to 10 steps.

mdl1 = step(mdl,'NSteps',10)
1. Removing wgt, FStat = 4.6001e-05, pValue = 0.9946
2. Removing sex, FStat = 0.063241, pValue = 0.80199

mdl1 = 


Linear regression model:
    sys ~ 1 + age + smoke

Estimated Coefficients:
                   Estimate       SE       tStat       pValue  
                   ________    ________    ______    __________

    (Intercept)     115.11       2.5364    45.383    1.1407e-66
    age            0.10782     0.064844    1.6628       0.09962
    smoke_Yes       10.054      0.97696    10.291    3.5276e-17


Number of observations: 99, Error degrees of freedom: 96
Root Mean Squared Error: 4.61
R-squared: 0.536,  Adjusted R-Squared 0.526
F-statistic vs. constant model: 55.4, p-value = 1.02e-16

step took two steps. This means it could not improve the model further by adding or subtracting a single term.

Plot the effectiveness of the simpler model on the training data.

plotResiduals(mdl1)

The residuals look about as small as those of the original model.

Step 5. Predict responses to new data.

Suppose you have four new people, aged 25, 30, 40, and 65, and the first and third smoke. Predict their systolic pressure using mdl1.

ages = [25;30;40;65];
smoker = {'Yes';'No';'Yes';'No'};
systolicnew = feval(mdl1,ages,smoker)
systolicnew =

  127.8561
  118.3412
  129.4734
  122.1149

To make predictions, you need only the variables that mdl1 uses.

Step 6. Share the model.

You might want others to be able to use your model for prediction. Access the terms in the linear model.

coefnames = mdl1.CoefficientNames
coefnames = 

    '(Intercept)'    'age'    'smoke_Yes'

View the model formula.

mdl1.Formula
ans = 

sys ~ 1 + age + smoke

Access the coefficients of the terms.

coefvals = mdl1.Coefficients(:,1); % table
coefvals = table2array(coefvals)
coefvals =

  115.1066
    0.1078
   10.0540

The model is sys = 115.1066 + 0.1078*age + 10.0540*smoke, where smoke is 1 for a smoker, and 0 otherwise.

Was this topic helpful?