The purpose of this section is not to teach linear models from scratch, but to remind you of core ideas that you may have learned in previous subjects, in case you are a little rusty before moving into more advanced material.
Linear models are the most fundamental statistical models. They are
mathematical tools that describe linear relationships between variables
in a dataset. Fitting models is a crucial component of statistics,
enabling both statistical inference and prediction.
A response variable is the variable whose behaviour we are
trying to model using other variables, called
explanatory (or predictor) variables. In any model there is
one response variable and one or more predictor variables. The choice of
model depends on the types of the variables and the questions we seek to
answer. For example, for a continuous response variable and one continuous
predictor variable, we will use a simple linear regression model.
Exploratory data analysis is a very good first step, providing insight into the nature of your data before fitting a particular model.
An essential element of modelling is the use of model formulas to specify the variables in the model and any interactions among the predictor variables. These model formulas are passed to the functions that carry out operations such as linear regression or analysis of variance (ANOVA) in R. A model formula has the form \[y \sim x_1+x_2\]
where \(y\) is the response variable and \(x_1\) and \(x_2\) are the predictor variables and \(\sim\) means “is modeled as a function of”.
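A model formula is itself an ordinary R object that can be created and inspected before being passed to a fitting function. A minimal sketch (the names y, x1 and x2 are placeholders, not columns of any real dataset):

```r
# A model formula is a first-class R object of class "formula";
# the variables it names need not exist until the model is fitted
f <- y ~ x1 + x2
class(f)     # "formula"
all.vars(f)  # the variables referenced: "y" "x1" "x2"
```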
Linear models may be the most widely used statistical models. A linear model is used when the response variable is continuous. For the response variable \(y\) and predictor variables \(x_1, x_2, \ldots, x_p\), we fit a model of the form \[y= \beta_0 + \beta_1 x_1+\cdots+\beta_p x_p + \epsilon, \ \ \mbox{ where } \epsilon \sim N(0, \sigma^2).\] Our aim is to estimate the parameters \(\beta_1, \beta_2, \ldots, \beta_p\).
The function used for fitting linear models in R is
lm() (short for linear model). The
lm( ) function has many arguments, but the most important is
the first, which specifies the model you want to fit using a
model formula. As noted earlier, the type of linear model we fit
depends on the type of data in the predictor variable(s).
If the predictor variable is continuous, we will fit a simple linear regression model.
Model formula : \(y \sim x\)
Model: lm(y ~ x)
If the predictor variable is categorical, we will fit an
ANOVA type model.
Model formula : \(y \sim x\), where \(x\) is categorical
Model: aov(y ~ x)
If we have two or more predictor variables, we will fit a multiple linear regression.
Model formula : \(y \sim x_1 + x_2 + x_3 + \ldots\)
Model: lm(y ~ x_1 + x_2 + x_3 + ...)
If there is an interaction between two predictor variables, we can fit a linear model with an interaction term.
Model formula : \(y \sim x_1 * x_2\)
Model: lm(y ~ x_1 * x_2)
Linear relationship: There exists a linear relationship between the response and the explanatory variable(s).
Independence: The residuals are independent.
Constant variance: The residuals must have constant variance.
Normality: The residuals of the model are normally distributed.
We will use residual plots to check these assumptions.
A simple linear regression models the relationship between a continuous response variable and continuous predictor variable by fitting a linear equation to observed data.
We will demonstrate simple linear regression using the R built-in dataset called iris. You can read more about the dataset using
?iris
The dataset contains sepal and petal measurements for three species of iris flowers. Let us model the relationship between petal width and petal length.
library(GGally)
ggpairs(iris[, c("Petal.Length", "Petal.Width")],
        aes(colour = as.factor(iris$Species)))

We can see from the plots that the relationship between petal width and petal length differs by species. Let us for now focus only on the versicolor species.
Let us first create a subset of the data, versicolor, which contains information about the versicolor species only.
versicolor <- subset(iris, Species == "versicolor" )
head(versicolor, 5)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 51          7.0         3.2          4.7         1.4 versicolor
## 52          6.4         3.2          4.5         1.5 versicolor
## 53          6.9         3.1          4.9         1.5 versicolor
## 54          5.5         2.3          4.0         1.3 versicolor
## 55          6.5         2.8          4.6         1.5 versicolor
Let's plot petal width against petal length:
ggplot(versicolor, aes(x = Petal.Length, y = Petal.Width)) + geom_point()

We can see that there is a strong pattern in the data. The relationship looks quite strong and linear.
### The lm() function

Let us use petal width as the response variable and petal length as the explanatory variable. We will fit a model of the form \[\hat{y} = \beta_0 + \beta_1 x\]
mod1 <- lm(Petal.Width ~ Petal.Length, data=versicolor)
The lm() function comes with a few utility functions:

- coef() to extract the model coefficients.
- fitted() to extract the fitted values.
- residuals() to extract the residuals.
- summary() to produce model summaries.
- plot() to produce diagnostic plots.
- predict() to make predictions using the model.

For example, the model coefficients are
coef(mod1)
## (Intercept) Petal.Length
## -0.08428835 0.33105360
i.e., \[\widehat{Petal.Width} = -0.084 + 0.331 \times Petal.Length\]
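As a check on our understanding, the coefficients reported by lm() can be reproduced with the closed-form least-squares formulas for simple regression. A self-contained sketch:

```r
# Verify lm()'s estimates with the textbook least-squares solutions:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
versicolor <- subset(iris, Species == "versicolor")
x <- versicolor$Petal.Length
y <- versicolor$Petal.Width
b1 <- cov(x, y) / var(x)      # approx. 0.331
b0 <- mean(y) - b1 * mean(x)  # approx. -0.084
c(intercept = b0, slope = b1)
```

These agree with coef(mod1) up to floating-point error.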
We can add the fitted model to the graph, including the standard error band:
ggplot(versicolor, aes(x = Petal.Length, y = Petal.Width)) + geom_point() + geom_smooth(method = "lm", se = TRUE)

### The summary() function

The function summary() gives a summary of the fitted linear model:
summary(mod1)
##
## Call:
## lm(formula = Petal.Width ~ Petal.Length, data = versicolor)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.273031 -0.073373 -0.006479 0.086271 0.295231
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.08429 0.16070 -0.525 0.602
## Petal.Length 0.33105 0.03750 8.828 1.27e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1234 on 48 degrees of freedom
## Multiple R-squared: 0.6188, Adjusted R-squared: 0.6109
## F-statistic: 77.93 on 1 and 48 DF, p-value: 1.272e-11
This gives:
- The F-statistic (77.93 on 1 and 48 DF, p-value 1.272e-11) is significant, so Petal.Length is useful in predicting Petal.Width.
- \(R^2 = 0.619\) means that 61.9% of the variation in Petal.Width is explained by this model using Petal.Length.
The predicted model equation is
\[\widehat{Petal.Width} = -0.084 + 0.331 \times Petal.Length\]

### Interpretation of coefficients

The slope 0.331 means that, on average, petal width increases by 0.331 cm for each 1 cm increase in petal length. The intercept -0.084 is the predicted petal width when petal length is 0, which is not practically meaningful here.
### The plot() function

par(mfrow = c(2,2))
plot(mod1)

Residuals vs Fitted plot: check whether the residuals show any clear pattern; they should be randomly scattered around 0.
Q-Q Residuals plot: check whether the residuals follow the straight line, with no obvious curvature or systematic deviation.
Scale-Location plot: similar to the first plot, but using the square root of the standardized residuals; useful for checking constant variance.
Residuals vs Leverage: check for obvious groups and for points that are too far from the centre (influential observations).
### The predict() function

Using the fitted model, let us predict the Petal.Width of the versicolor species with Petal.Length = 4.8.
petal_new <- data.frame(Petal.Length = 4.8) # Define a new data frame with Petal.Length = 4.8
mod1.pred = predict(mod1, newdata = petal_new )
mod1.pred
## 1
## 1.504769
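For a simple linear model, predict() is nothing more than plugging the new value into the fitted equation; a self-contained check:

```r
# predict() for a simple linear model is just beta0 + beta1 * x_new
versicolor <- subset(iris, Species == "versicolor")
mod1 <- lm(Petal.Width ~ Petal.Length, data = versicolor)
by_hand <- coef(mod1)[1] + coef(mod1)[2] * 4.8
unname(by_hand)  # approx. 1.5048, matching predict() above
```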
When we have two or more continuous predictors, we will fit a model of the form
\[\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots\]
Let us include the additional variable Sepal.Width in order to estimate Petal.Width.
mod2 <- lm(Petal.Width ~ Petal.Length +Sepal.Width, data=versicolor)
summary(mod2)
##
## Call:
## lm(formula = Petal.Width ~ Petal.Length + Sepal.Width, data = versicolor)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.270959 -0.064119 -0.005774 0.072671 0.248525
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.32521 0.16312 -1.994 0.05201 .
## Petal.Length 0.25433 0.04118 6.177 1.45e-07 ***
## Sepal.Width 0.20496 0.06166 3.324 0.00173 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1122 on 47 degrees of freedom
## Multiple R-squared: 0.6914, Adjusted R-squared: 0.6783
## F-statistic: 52.65 on 2 and 47 DF, p-value: 1.002e-12
The F-statistic (52.65 on 2 and 47 DF, p-value: 1.002e-12) is significant, so at least one predictor is useful in predicting petal width. We can see that both predictors are also individually statistically significant.
\(R^2 = 0.69\) means that 69% of the variation in petal width is explained by this model using petal length and sepal width.
The predicted model equation is
\[\widehat{Petal.Width} = -0.325 + 0.254 \times Petal.Length+ 0.205 \times Sepal.Width \]
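The coefficients above are the solution of the normal equations; as an illustration (not something you would do in practice), they can be computed directly from the design matrix:

```r
# Multiple regression estimates via the normal equations:
# beta_hat = (X'X)^{-1} X'y
versicolor <- subset(iris, Species == "versicolor")
X <- cbind(1, versicolor$Petal.Length, versicolor$Sepal.Width)  # design matrix
y <- versicolor$Petal.Width
beta_hat <- drop(solve(t(X) %*% X, t(X) %*% y))
round(beta_hat, 3)  # approx. -0.325, 0.254, 0.205
```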
par(mfrow = c(2,2))
plot(mod2)

When fitting a multiple linear regression, we need to be careful of
multicollinearity. Multicollinearity is the occurrence of high correlation
between two or more predictor variables in a multiple regression model.
Each regression coefficient is interpreted as the mean change in the
response variable when that predictor changes by one unit, assuming the
other variables are held constant. But if two or more predictor variables
are highly correlated, it is difficult to change one variable without
changing another. One way to detect multicollinearity is the
variance inflation factor (VIF), which measures the
strength of correlation between the predictor variables in a regression
model. We can use the vif() function from the car
package to calculate the VIF for each predictor variable in the model.
Generally, a VIF of less than 5 is not an issue.
library(car)
vif(mod2)
## Petal.Length Sepal.Width
## 1.458119 1.458119
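With only two predictors, the VIF can also be computed by hand as \(1/(1-R_j^2)\), where \(R_j^2\) comes from regressing one predictor on the other; a quick check:

```r
# VIF for a predictor = 1 / (1 - R^2) from regressing that predictor
# on the remaining predictor(s); with two predictors both VIFs coincide
versicolor <- subset(iris, Species == "versicolor")
r2 <- summary(lm(Petal.Length ~ Sepal.Width, data = versicolor))$r.squared
1 / (1 - r2)  # approx. 1.458, matching vif(mod2)
```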
Analysis of variance (ANOVA) is the modelling technique used when the response variable is continuous and the explanatory variable is categorical. When the explanatory variable has only two levels, this is essentially equivalent to a two-sample t-test.
An ANOVA in R is fitted with the aov( ) function. From
the iris data, let us look at the differences in petal
length between the three species. Let's first visualize the relationship
between petal length and species using a boxplot.
ggplot(iris, mapping=aes(x=Species, y=Petal.Length, fill = Species))+geom_boxplot()

Calculate the standard deviation of petal length within each species:
tapply(iris$Petal.Length,iris$Species, sd)
## setosa versicolor virginica
## 0.1736640 0.4699110 0.5518947
mod3 <- aov(Petal.Length~Species, data=iris)
summary(mod3)
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 437.1 218.55 1180 <2e-16 ***
## Residuals 147 27.2 0.19
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see the F-statistic and its associated p-value. The p-value is significant, so we can say that there is a difference in mean petal length among the species.
To identify which groups differ, we can use pairwise
tests via pairwise.t.test() in R. We specify
p.adj='bonf' to adjust the p-values with the Bonferroni
correction for multiple comparisons.
pairwise.t.test(iris$Petal.Length, iris$Species, p.adj='bonf')
##
## Pairwise comparisons using t tests with pooled SD
##
## data: iris$Petal.Length and iris$Species
##
## setosa versicolor
## versicolor <2e-16 -
## virginica <2e-16 <2e-16
##
## P value adjustment method: bonferroni
There are 3 groups, so there are 3 pairwise comparisons to look at. We can see that all 3 p-values are significant.
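An alternative to Bonferroni-adjusted pairwise t-tests is Tukey's honest significant difference, which in base R works directly on an aov() fit:

```r
# Tukey's HSD: all pairwise differences in mean petal length,
# with family-wise adjusted p-values and confidence intervals
mod3 <- aov(Petal.Length ~ Species, data = iris)
TukeyHSD(mod3)
```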
### lm() for a categorical predictor

For a categorical predictor with levels A, B, C, ..., we can also fit a similar linear model of the form \[\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots\] where \(x_1, x_2, \ldots\) are 0/1 indicator variables for the non-base levels, and:

- The intercept \(\beta_0\) is the mean for the base level (A).
- \(\beta_1\) is the mean difference between level B and level A.
- \(\beta_2\) is the mean difference between level C and level A.
Let's fit the linear model for the question above:
mod4 <- lm(Petal.Length~Species, data=iris)
summary(mod4)
##
## Call:
## lm(formula = Petal.Length ~ Species, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.260 -0.258 0.038 0.240 1.348
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.46200 0.06086 24.02 <2e-16 ***
## Speciesversicolor 2.79800 0.08607 32.51 <2e-16 ***
## Speciesvirginica 4.09000 0.08607 47.52 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4303 on 147 degrees of freedom
## Multiple R-squared: 0.9414, Adjusted R-squared: 0.9406
## F-statistic: 1180 on 2 and 147 DF, p-value: < 2.2e-16
We can see that the overall F-statistic (1180 on 2 and 147 DF, p-value < 2.2e-16) is the same as in the ANOVA above.
Here the base/reference level is setosa. If you don't specify one, R orders the levels alphabetically and uses the first as the base.
The predicted model equation is
\[\widehat{Petal.Length} = 1.46 + 2.80 \times Species_{Versicolor} + 4.09 \times Species_{Virginica}\]
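Behind the scenes, R turns the factor into 0/1 indicator (dummy) columns, and the intercept is just the mean of the base level; a quick check:

```r
# R encodes a k-level factor as k-1 dummy columns in the design matrix;
# the intercept is then the mean response for the base level (setosa)
head(model.matrix(~ Species, data = iris), 2)
mean(iris$Petal.Length[iris$Species == "setosa"])  # approx. 1.462 = intercept
```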
par(mfrow = c(2,2))
plot(mod4)

The multiple regression model assumes that when one of the predictors is changed by 1 unit and the values of the other variables remain constant, the predicted response changes by \(\beta_j\).
A statistical interaction occurs when this assumption is not true, such that the effect of one explanatory variable on the response depends on other explanatory variables.
Here we specifically examine interaction in a two-variable setting, where one of the predictors is categorical and the other is numerical.
For example, the effect of petal width on petal length may be dependent on the species of the flowers. This dependency is known as an interaction effect.
We can use the interact_plot() function from the
interactions package, as follows:
library(interactions)
mod5<- lm(Petal.Length ~ Petal.Width * Species, data = iris)
interact_plot(mod5, pred = Petal.Width, modx = Species,
plot.points = TRUE)

Here we will fit a model of the form
\[\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2+ \beta_3 x_3 + \beta_4 x_1 \times x_2 + \beta_5 x_1 \times x_3 \] where \(x_1\) is Petal.Width, and \(x_2\) and \(x_3\) are the indicator variables for versicolor and virginica.
mod5<- lm(Petal.Length ~ Petal.Width * Species, data = iris)
anova(mod5)
## Analysis of Variance Table
##
## Response: Petal.Length
## Df Sum Sq Mean Sq F value Pr(>F)
## Petal.Width 1 430.48 430.48 3294.5561 < 2.2e-16 ***
## Species 2 13.01 6.51 49.7891 < 2.2e-16 ***
## Petal.Width:Species 2 2.02 1.01 7.7213 0.0006525 ***
## Residuals 144 18.82 0.13
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod5)
##
## Call:
## lm(formula = Petal.Length ~ Petal.Width * Species, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.84099 -0.19343 -0.03686 0.16314 1.17065
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.3276 0.1309 10.139 < 2e-16 ***
## Petal.Width 0.5465 0.4900 1.115 0.2666
## Speciesversicolor 0.4537 0.3737 1.214 0.2267
## Speciesvirginica 2.9131 0.4060 7.175 3.53e-11 ***
## Petal.Width:Speciesversicolor 1.3228 0.5552 2.382 0.0185 *
## Petal.Width:Speciesvirginica 0.1008 0.5248 0.192 0.8480
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3615 on 144 degrees of freedom
## Multiple R-squared: 0.9595, Adjusted R-squared: 0.9581
## F-statistic: 681.9 on 5 and 144 DF, p-value: < 2.2e-16
We can see that the overall model F-statistic (681.9 on 5 and 144 DF, p-value < 2.2e-16) is significant. From the ANOVA table, the interaction and main effect terms are all significant.
The predicted model equation is
\[\widehat{Petal.Length} = 1.328 + 0.547 \times Petal.Width + 0.454 \times Species_{Versicolor} + 2.913 \times Species_{Virginica} + 1.323 \times Petal.Width \times Species_{Versicolor} + 0.101 \times Petal.Width \times Species_{Virginica} \]
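Under this parameterization, each species gets its own slope: setosa's slope is the Petal.Width coefficient, and the other two species add their interaction terms. A sketch recovering the three slopes:

```r
# Per-species slopes implied by the interaction model:
# setosa uses the main Petal.Width effect; the other species add
# their Petal.Width:Species interaction coefficient
mod5 <- lm(Petal.Length ~ Petal.Width * Species, data = iris)
b <- coef(mod5)
slopes <- c(setosa     = unname(b["Petal.Width"]),
            versicolor = unname(b["Petal.Width"] + b["Petal.Width:Speciesversicolor"]),
            virginica  = unname(b["Petal.Width"] + b["Petal.Width:Speciesvirginica"]))
slopes  # approx. 0.55, 1.87, 0.65
```

The large versicolor slope reflects the significant Petal.Width:Speciesversicolor interaction seen in the summary.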
par(mfrow = c(2,2))
plot(mod5)

### The predict() function

Using the fitted model, let us predict the Petal.Length of the versicolor species with Petal.Width = 1.4.
petal_new <- data.frame(Petal.Width = 1.4, Species = "versicolor")
mod5.pred = predict(mod5, newdata = petal_new )
mod5.pred
## 1
## 4.39833
Logistic regression is used when there is a binary response variable (yes/no, success/failure) and one or more continuous or categorical predictor variables, which are related to the probability, or odds, of the response.
If \(p\) is the probability of an outcome occurring then,
\[\mbox{ odds } = \frac{p}{1-p} \]
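For concreteness, a probability of 0.75 corresponds to odds of 3 ("3 to 1"), and the transformation is easily inverted:

```r
# Converting a probability to odds and back
p <- 0.75
odds <- p / (1 - p)          # 3
p_back <- odds / (1 + odds)  # 0.75
c(odds = odds, p = p_back)
```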
So, suppose \(Y\) is a binary response variable and \(x_1, x_2, x_3, \ldots\) are predictor variables; we fit a model of the form
\[ \ln\left(\frac{\hat{p}}{1-\hat{p}}\right) = \beta_0 + \beta_1 x_1+ \beta_2 x_2 + \cdots\] where \(\hat{p}\) is the expected probability of success.
### The glm() function

We can use the glm() function to fit a logistic regression
model. As with lm(), the most important argument is the model
formula. We also need to specify family = binomial.
We will demonstrate logistic regression using the R built-in dataset called mtcars. You can read more about the dataset using
?mtcars
data(mtcars)
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
As vs (engine shape: 0 = V-shaped, 1 = straight) is a binary
variable, we will use it as the response variable, with mpg
(miles per US gallon) as a continuous predictor variable.
table(mtcars$vs)
##
## 0 1
## 18 14
mod6 <- glm(vs ~ mpg , data=mtcars, family=binomial)
summary(mod6)
##
## Call:
## glm(formula = vs ~ mpg, family = binomial, data = mtcars)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.8331 3.1623 -2.793 0.00522 **
## mpg 0.4304 0.1584 2.717 0.00659 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 43.860 on 31 degrees of freedom
## Residual deviance: 25.533 on 30 degrees of freedom
## AIC: 29.533
##
## Number of Fisher Scoring iterations: 6
### Deviance

Deviance is a measure of the goodness of fit of a GLM. The residual deviance has an approximate \(\chi^2\) distribution, so we can test the goodness of fit as follows:
pchisq(df=30,q=25.533,lower.tail=F)
## [1] 0.6987382
This is not significant, hence the model provides an adequate fit to the data.
The predicted model equation is
\[ ln(\frac{\hat{p}}{1-\hat{p}}) = -8.833 + 0.430 \times mpg\]
- The intercept: the log-odds of vs = 1 when mpg = 0 is -8.833; i.e., the odds of vs = 1 are \(e^{-8.833} = 0.00014\) when mpg = 0.
- The slope: for every one-unit increase in mpg, the log-odds of vs = 1 increase by 0.430; i.e., the odds of vs = 1 are multiplied by \(e^{0.430} = 1.537\).
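The fitted log-odds can be converted back to a probability with the inverse-logit function, plogis() in base R. For example, at an assumed value of mpg = 25:

```r
# Predicted probability that vs = 1 for a car with mpg = 25,
# via the inverse logit of the fitted log-odds
mod6 <- glm(vs ~ mpg, data = mtcars, family = binomial)
log_odds <- coef(mod6)[1] + coef(mod6)[2] * 25
unname(plogis(log_odds))  # same value as predict() with type = "response"
```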