Making predictions from a simple linear regression model
Last updated on 2024-03-12
Estimated time: 20 minutes
Overview
Questions
- How can predictions be manually obtained from a simple linear regression model?
- How can R be used to obtain predictions from a simple linear regression model?
Objectives
- Calculate a prediction from a simple linear regression model using parameter estimates given by the model output.
- Use the make_predictions() function to generate predictions from a simple linear regression model.
One of the features of linear regression is prediction: a model
presents predicted mean values for the outcome variable for any values
of the explanatory variables. We have already seen this in the previous
episodes through our effect_plot() outputs, which showed mean predicted
responses as straight lines (episode 2) or individual points for levels
of a categorical variable (episode 3). Here, we will see how to obtain
predicted values and the uncertainty surrounding them.
Calculating predictions manually
First, we can calculate a predicted value manually. From the summ()
output associated with our Weight_Height_lm model from episode 2, we can
write the model as
\(E(\text{Weight}) = \beta_0 + \beta_1 \times \text{Height} = -70.194 + 0.901 \times \text{Height}\).
The output can be found again below. If we take a height of 165 cm, then
our model predicts an average weight of
\(-70.194 + 0.901 \times 165 = 78.471\) kg.
R
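# Fit the model to adult participants (Age > 17), using dat as in the previous episodes
Weight_Height_lm <- dat %>%
  filter(Age > 17) %>%
  lm(formula = Weight ~ Height)

# Print the model summary, including the parameter estimates
summ(Weight_Height_lm)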
OUTPUT
MODEL INFO:
Observations: 6177 (320 missing obs. deleted)
Dependent Variable: Weight
Type: OLS linear regression
MODEL FIT:
F(1,6175) = 1398.22, p = 0.00
R² = 0.18
Adj. R² = 0.18
Standard errors: OLS
-------------------------------------------------
                      Est.   S.E.   t val.      p
----------------- -------- ------ -------- ------
(Intercept)         -70.19   4.06   -17.29   0.00
Height                0.90   0.02    37.39   0.00
-------------------------------------------------
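The manual calculation above can also be sketched in R. This is a minimal illustration that hard-codes the rounded estimates from the summ() output rather than reading them from the fitted model; the names b0, b1 and predict_weight are hypothetical helpers introduced only for this example.

```r
# Rounded parameter estimates copied from the summ() output
# (assumed values, not extracted from the fitted model object)
b0 <- -70.194  # intercept
b1 <- 0.901    # slope for Height (kg per cm)

# Hypothetical helper: predicted mean weight (kg) for a given height (cm)
predict_weight <- function(height_cm) {
  b0 + b1 * height_cm
}

predict_weight(165)
#> [1] 78.471
```

Because the arithmetic is vectorised, predict_weight(c(150, 160, 170, 180)) returns one prediction per height. Results can differ slightly from predictions based on the unrounded estimates stored in the model.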
Exercise
Given the summ() output from our BPSysAve_AgeMonths_lm model, the model
can be described as
\(E(\text{BPSysAve}) = \beta_0 + \beta_1 \times \text{Age (months)} = 101.812 + 0.033 \times \text{Age (months)}\).
What average systolic blood pressure does the model predict for an individual with an age of 480 months?
\(101.812 + 0.033 \times 480 = 117.652 \text{ mmHg}\).
Making predictions using make_predictions()
Using the make_predictions() function brings two advantages. First, when
calculating multiple predictions, we are saved the effort of inserting
multiple values into our model manually and doing the calculations.
Second, make_predictions() returns 95% confidence intervals around the
predictions, giving us a sense of the uncertainty around the predictions.
To use make_predictions(), we need to create a tibble with the
explanatory variable values for which we wish to have mean predictions
from the model. We do this using the tibble() function. Note that the
column name must correspond to the name of the explanatory variable in
the model, i.e. Height. In the code below, we create a tibble with the
values 150, 160, 170 and 180. We then provide make_predictions() with
this tibble, alongside the model from which we wish to have predictions.
By default, 95% confidence intervals are returned.
We see that the model predicts an average weight of 64.88 kg for an individual with a height of 150 cm, with a 95% confidence interval of [63.9 kg, 65.9 kg].
R
Heights <- tibble(Height = c(150, 160, 170, 180))
make_predictions(Weight_Height_lm, new_data = Heights)
OUTPUT
# A tibble: 4 × 4
  Height Weight  ymin  ymax
   <dbl>  <dbl> <dbl> <dbl>
1    150   64.9  63.9  65.9
2    160   73.9  73.3  74.5
3    170   82.9  82.4  83.4
4    180   91.9  91.2  92.6
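For comparison, base R's predict() function can return the same kind of table when called with interval = "confidence". The sketch below is a stand-in rather than the lesson's model: since dat is not defined here, it fits a model to R's built-in women data set (heights in inches, weights in pounds), and the names women_lm and new_heights are hypothetical.

```r
# Stand-in model on the built-in women data set (height in inches,
# weight in pounds); this is NOT the Weight_Height_lm model from the lesson
women_lm <- lm(weight ~ height, data = women)

# The column name must match the explanatory variable in the model
new_heights <- data.frame(height = c(60, 65, 70))

# fit = predicted mean, lwr/upr = bounds of the 95% confidence interval
preds <- predict(women_lm, newdata = new_heights,
                 interval = "confidence", level = 0.95)

# Show the requested heights next to their predictions
cbind(new_heights, preds)
```

The fit, lwr and upr columns play the same role as the Weight, ymin and ymax columns in the make_predictions() output above.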
Exercise
- Using the make_predictions() function, obtain the mean average systolic blood pressure levels predicted by the BPSysAve_AgeMonths_lm model for individuals with an age of 300, 400, 500 and 600 months.
- Obtain 95% confidence intervals for these predictions.
- How are these confidence intervals interpreted?
R
BPSysAve_AgeMonths_lm <- dat %>%
  filter(Age > 17) %>%
  lm(formula = BPSysAve ~ AgeMonths)

ages <- tibble(AgeMonths = c(300, 400, 500, 600))
make_predictions(BPSysAve_AgeMonths_lm, new_data = ages)
OUTPUT
# A tibble: 4 × 4
  AgeMonths BPSysAve  ymin  ymax
      <dbl>    <dbl> <dbl> <dbl>
1       300     112.  111.  113.
2       400     115.  114.  116.
3       500     118.  118.  119.
4       600     121.  121.  122.
Recall that 95% of 95% confidence intervals are expected to contain the population mean. Therefore, we can be fairly confident that the true population means lie within the bounds of the intervals, provided that the model is appropriate for the data.
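This interpretation can be checked with a small simulation. The sketch below assumes a made-up population in which the true relationship is known: the values of true_b0, true_b1 and the residual standard deviation are invented for illustration and are not estimates from dat. It repeatedly draws samples, fits a model with lm(), and records how often the 95% confidence interval for the mean at one height contains the true mean.

```r
set.seed(1)

# Invented "true" population parameters (assumptions for illustration only)
true_b0 <- -70
true_b1 <- 0.9
x0 <- 170                              # height at which we predict
true_mean <- true_b0 + true_b1 * x0    # true population mean at x0

n_sims <- 1000
covered <- logical(n_sims)

for (i in seq_len(n_sims)) {
  # Draw a fresh sample of 200 individuals from the invented population
  sim <- data.frame(height = runif(200, min = 150, max = 190))
  sim$weight <- true_b0 + true_b1 * sim$height + rnorm(200, sd = 15)

  # Fit the model and get the 95% confidence interval for the mean at x0
  fit <- lm(weight ~ height, data = sim)
  ci <- predict(fit, newdata = data.frame(height = x0),
                interval = "confidence", level = 0.95)

  # Record whether the interval contains the true mean
  covered[i] <- ci[, "lwr"] <= true_mean && true_mean <= ci[, "upr"]
}

# Proportion of intervals that contain the true mean
mean(covered)
```

The proportion should come out close to 0.95. This holds here because the simulated data satisfy the model's assumptions by construction.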