Motor Trend Car Road Tests (mtcars) is a built-in R dataset contained in the datasets package. The data were extracted from the 1974 Motor Trend US magazine and describe 32 automobiles (1973-74 models). The dataset has 32 rows and 11 columns: each row represents a car model, while each column represents one of 11 aspects of automobile design and performance.
We can load the data using the data(mtcars) command and then review the dataset’s dimensions and its first five rows.
data(mtcars)
dim(mtcars)
[1] 32 11
head(mtcars, 5)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Variable definitions
We use ?mtcars to open the help page that describes the variables contained in the mtcars dataset.
- mpg – Denotes miles per (US) gallon.
- cyl – Denotes the number of cylinders.
- disp – Denotes engine displacement (cubic inches).
- hp – Denotes the gross horsepower: the power an engine produces.
- drat – Denotes the rear axle ratio.
- wt – Denotes the weight of the vehicle (in 1000 lbs).
- qsec – Denotes the quarter-mile time (seconds).
- vs – Denotes the shape of the engine (0 = V-shaped, 1 = straight).
- am – Denotes the transmission mode (0 = automatic, 1 = manual).
- gear – Denotes the number of forward gears.
- carb – Denotes the number of carburetors.
Determinants of mpg and fuel efficiency
Several factors may lead to variations in the distance a vehicle covers per gallon of gasoline or diesel. Past studies have indicated that weight is the most significant predictor of fuel efficiency. The American Physical Society reported a 6 to 7% increase in fuel efficiency for every 10% reduction in a vehicle’s weight (1). Similarly, Knittle found a 4.26% increase in fuel efficiency associated with a 10% decrease in a passenger car’s weight (13). Both studies point to a negative association between weight and fuel efficiency.
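As a rough illustration of the weight effect in this particular dataset, the sketch below fits a log-log regression of mpg on wt, whose slope can be read as an approximate elasticity. The object names (cars, fit_elasticity, elasticity, pct_change_mpg) are our own, and the model is an assumption made for illustration only: it has no control variables and covers only 32 cars from 1973-74, so its result is not directly comparable to the cited studies.

```r
# Work on a fresh copy so the example does not depend on earlier steps.
cars <- datasets::mtcars

# Log-log regression: the slope approximates the percentage change in mpg
# associated with a 1% change in weight (an elasticity).
fit_elasticity <- lm(log(mpg) ~ log(wt), data = cars)
elasticity <- unname(coef(fit_elasticity)["log(wt)"])

# Implied percentage change in mpg for a 10% reduction in weight.
pct_change_mpg <- (0.9^elasticity - 1) * 100

print(round(c(elasticity = elasticity, pct_change_mpg = pct_change_mpg), 2))
```

A negative elasticity corresponds to the negative association between weight and fuel efficiency described above.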
Other factors that are negatively associated with fuel efficiency include horsepower, automatic transmission, and the number of cylinders. Vehicle aerodynamics is another significant determinant of fuel efficiency (Gautam 5). An aerodynamic design reduces air resistance, allowing the vehicle to move through the air with less energy. Less energy translates to lower fuel consumption per unit distance. Smaller vehicles generally experience less drag than bigger vehicles.
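These associations can be checked informally in mtcars itself. The following sketch, with illustrative object names of our own choosing, computes the correlation of mpg with weight, horsepower, and number of cylinders, and compares mean mpg by transmission type. It is a descriptive check only, not a substitute for the studies cited above.

```r
# Fresh copy of the built-in data (am is still coded 0/1 here).
cars <- datasets::mtcars

# Pearson correlation of mpg with each candidate predictor.
cor_with_mpg <- sapply(cars[c("wt", "hp", "cyl")],
                       function(x) cor(cars$mpg, x))

# Mean mpg by transmission: 0 = automatic, 1 = manual (see ?mtcars).
mean_mpg_by_am <- tapply(cars$mpg, cars$am, mean)
names(mean_mpg_by_am) <- c("automatic", "manual")

print(round(cor_with_mpg, 2))
print(round(mean_mpg_by_am, 2))
```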
Variable labelling
We use the str() function from the utils package to check the structure of the eleven variables in our dataset.
str(mtcars)
'data.frame': 32 obs. of 11 variables:
 $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num 160 160 108 258 360 ...
 $ hp  : num 110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num 2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num 16.5 17 18.6 19.4 17 ...
 $ vs  : num 0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num 1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Variables such as mpg, disp, drat, wt, and qsec are continuous and are correctly stored as numeric. However, cyl, hp, gear, and carb take only whole-number values and are better stored as integers, while vs and am are categorical variables that are stored as numeric and should be factors. The following code converts these variables to the appropriate types.
library(tidyverse)
mtcars <- mtcars %>%
  mutate(cyl = as.integer(cyl),   # whole-number counts become integers
         hp = as.integer(hp),
         gear = as.integer(gear),
         carb = as.integer(carb),
         vs = as.factor(vs),      # binary categories become factors
         am = as.factor(am))
str(mtcars)
'data.frame': 32 obs. of 11 variables:
 $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : int 6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num 160 160 108 258 360 ...
 $ hp  : int 110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num 2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num 16.5 17 18.6 19.4 17 ...
 $ vs  : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
 $ am  : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
 $ gear: int 4 4 4 3 3 3 3 4 4 4 ...
 $ carb: int 4 4 1 1 2 1 4 2 2 4 ...
Rounding off the wt variable
Weight (wt) is recorded to three decimal places.
mtcars$wt
[1] 2.620 2.875 2.320 3.215 3.440 3.460 3.570 3.190 3.150 3.440 3.440 4.070
[13] 3.730 3.780 5.250 5.424 5.345 2.200 1.615 1.835 2.465 3.520 3.435 3.840
[25] 3.845 1.935 2.140 1.513 3.170 2.770 3.570 2.780
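We round it to two decimal places and store the result in a new variable, wtR. Values whose third decimal is 5 may round either up or down (for example, 2.875 becomes 2.88 while 3.215 becomes 3.21) because most decimal fractions cannot be represented exactly in binary floating point.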
mtcars$wtR <- round(mtcars$wt, 2)
mtcars$wtR
[1] 2.62 2.88 2.32 3.21 3.44 3.46 3.57 3.19 3.15 3.44 3.44 4.07 3.73 3.78 5.25
[16] 5.42 5.34 2.20 1.61 1.83 2.46 3.52 3.44 3.84 3.85 1.94 2.14 1.51 3.17 2.77
[31] 3.57 2.78
Association between mpg and other variables
The following are the results of a multiple regression model with mpg as the dependent variable and the other ten variables as predictors.
mod1 <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb, data = mtcars)
summary(mod1)
Predictors   Estimate   95% CI             P-value
Intercept    12.30      -26.62 to 51.23    0.52
cyl          -0.11      -2.28 to 2.06      0.92
disp         0.01       -0.02 to 0.05      0.46
hp           -0.02      -0.07 to 0.02      0.34
drat         0.79       -2.61 to 4.19      0.64
wt           -3.72      -7.65 to 0.22      0.06
qsec         0.82       -0.70 to 2.34      0.27
vs[1]        0.32       -4.06 to 4.69      0.89
am[1]        2.52       -1.76 to 6.80      0.23
Multiple R-squared: 0.869; adjusted R-squared: 0.8066. CI, confidence interval.
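Although none of the associations is statistically significant at the 5% level, weight, number of cylinders, and horsepower are negatively associated with miles per gallon. Weight has the largest estimated effect and comes closest to significance (p = 0.06). The table does not list the gear and carb coefficients, although both predictors are included in the model.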
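A table of this kind can be assembled in base R from coef(summary()) and confint(). The sketch below is one possible way to do it, not necessarily how the table above was produced. It refits the same model on a fresh copy of the data; the names cars, fit, and reg_table are our own. With vs and am stored as factors, R labels their coefficients vs1 and am1.

```r
# Fresh copy with vs and am as factors, mirroring the conversion above.
cars <- datasets::mtcars
cars$vs <- factor(cars$vs)
cars$am <- factor(cars$am)

fit <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb,
          data = cars)

coefs <- coef(summary(fit))        # estimates, standard errors, t and p values
ci <- confint(fit, level = 0.95)   # 95% confidence intervals

# One row per coefficient, including the intercept.
reg_table <- data.frame(
  Estimate = round(coefs[, "Estimate"], 2),
  CI_lower = round(ci[, 1], 2),
  CI_upper = round(ci[, 2], 2),
  P_value  = round(coefs[, "Pr(>|t|)"], 2)
)
print(reg_table)

# Model fit statistics.
r2 <- summary(fit)$r.squared
adj_r2 <- summary(fit)$adj.r.squared
print(round(c(r2 = r2, adj_r2 = adj_r2), 4))
```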
Missing values
In R, missing values are denoted as NA. We use the is.na() function to find missing values.
table(is.na(mtcars))
FALSE
384
All 384 cells (32 rows × 12 columns, including the new wtR variable) return FALSE, which implies that the mtcars dataset has no missing values. When a dataset does contain missing values, there are several ways of handling them.
Removing observations with missing values is a common way of dealing with missing data. The method involves deleting every observation that has at least one missing value in any column. However, deletion should be applied with caution: it reduces the sample size and consequently limits the extent to which we can generalize our findings.
Imputing the missing values is a method that addresses this limitation of deletion. Imputation means replacing each missing value with a substitute value. Mean and median imputation are the most common methods for numerical data, while the mode is commonly used for categorical or factor data. However, these simple methods can distort a variable’s distribution and its relationships with other variables, which motivates model-based alternatives such as bagged-tree imputation (“bag impute”).
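Because mtcars has no missing values, the sketch below first introduces a few NA values artificially into a copy of the data and then applies deletion and simple mean and median imputation. The positions of the NA values and all object names are arbitrary choices made for illustration.

```r
# Copy the data and introduce missing values artificially (illustration only).
cars_na <- datasets::mtcars
cars_na$mpg[c(2, 5)] <- NA
cars_na$hp[10] <- NA

# Count missing values per column.
na_per_column <- colSums(is.na(cars_na))
print(na_per_column[na_per_column > 0])

# Option 1: listwise deletion drops every row with at least one NA.
complete_only <- na.omit(cars_na)

# Option 2: simple imputation (mean for mpg, median for hp).
imputed <- cars_na
imputed$mpg[is.na(imputed$mpg)] <- mean(imputed$mpg, na.rm = TRUE)
imputed$hp[is.na(imputed$hp)] <- median(imputed$hp, na.rm = TRUE)

print(c(rows_after_deletion = nrow(complete_only),
        rows_after_imputation = nrow(imputed)))
```

Deletion shrinks the sample, whereas imputation keeps every row at the cost of replacing unknown values with estimates.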
Minimum and maximum values for numerical variables
Minimum
mtcars %>%
select_if(is.numeric) %>%
map_dbl(min)
mpg cyl disp hp drat wt qsec gear carb wtR
10.400 4.000 71.100 52.000 2.760 1.513 14.500 3.000 1.000 1.510
Maximum
mtcars %>%
select_if(is.numeric) %>%
map_dbl(max)
mpg cyl disp hp drat wt qsec gear carb wtR
33.900 8.000 472.000 335.000 4.930 5.424 22.900 5.000 8.000 5.420
The ranges of all the variables look plausible. If a value looks implausible, the appropriate action is to check the data source for a possible entry error.
Density plots for numerical variables
Histograms for numerical variables
Both histograms and density plots help visualize the distribution of numerical variables; histograms are generally more informative for discrete data, while density plots are more informative for continuous data.
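Plots of this kind can be drawn with base R graphics, as in the minimal sketch below. The split into discrete and continuous variables follows the variable types discussed earlier; the panel layout and object names are our own assumptions, and the plots above may have been produced differently.

```r
cars <- datasets::mtcars

# Variables grouped by type, following the earlier discussion.
discrete_vars <- c("cyl", "hp", "gear", "carb")
continuous_vars <- c("mpg", "disp", "drat", "wt", "qsec")

# Histograms for the discrete variables (2 x 2 panel).
old_par <- par(mfrow = c(2, 2))
for (v in discrete_vars) {
  hist(cars[[v]], main = paste("Histogram of", v), xlab = v)
}

# Kernel density plots for the continuous variables (2 x 3 panel).
par(mfrow = c(2, 3))
for (v in continuous_vars) {
  plot(density(cars[[v]]), main = paste("Density of", v), xlab = v)
}

# Restore the previous panel layout.
par(old_par)
```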
Boxplots for numerical variables
Boxplots are most meaningful for continuous data. In R, potential outliers are observations below q1 - 1.5 × IQR or above q3 + 1.5 × IQR, where q1 is the 25th percentile, q3 is the 75th percentile, and IQR = q3 - q1 is the interquartile range. In most cases, whether to remove or keep an outlier depends on the context of the analysis and on how robust the chosen statistical method is to extreme values. In our case, R flags potential outliers in mpg, hp, wt, qsec, and carb. However, the flagged values are not far enough from the rest of the data for us to treat the observations above q3 + 1.5 × IQR as genuine outliers.
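The 1.5 × IQR rule described above can be applied directly with quantile(). In the sketch below, flag_outliers is a hypothetical helper of our own, not a built-in function. Note that base R's boxplot() computes the box edges from hinges (see ?boxplot.stats), which can differ slightly from the percentiles used here, so the two approaches may not flag exactly the same points.

```r
cars <- datasets::mtcars
num_vars <- c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "gear", "carb")

# Hypothetical helper: return the values outside the 1.5 x IQR fences,
# using the 25th and 75th percentiles as q1 and q3.
flag_outliers <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  x[x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr]
}

outliers <- lapply(cars[num_vars], flag_outliers)

# Keep only the variables with at least one flagged value.
flagged <- outliers[lengths(outliers) > 0]
print(flagged)
```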
Summary statistics for numerical variables
mtcars %>% select_if(is.numeric) %>%
map_df(summary)
Variable   Min     1st Qu.   Median   Mean     3rd Qu.   Max
mpg        10.40   15.43     19.20    20.09    22.80     33.90
cyl        4.00    4.00      6.00     6.19     8.00      8.00
disp       71.10   120.83    196.30   230.72   326.00    472.00
hp         52.00   96.50     123.00   146.69   180.00    335.00
drat       2.76    3.08      3.70     3.60     3.92      4.93
wt         1.51    2.58      3.33     3.22     3.61      5.42
qsec       14.50   16.89     17.71    17.85    18.90     22.90
gear       3.00    3.00      4.00     3.69     4.00      5.00
carb       1.00    2.00      2.00     2.81     4.00      8.00
Min, minimum; Max, maximum; 1st Qu., first quartile; 3rd Qu., third quartile.
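For readers who prefer base R, a variable-by-statistic table can also be built with sapply() and t(). This is a sketch under our own naming (cars, num_vars, summary_table), offered as an alternative to the tidyverse pipeline rather than a reproduction of it.

```r
cars <- datasets::mtcars
num_vars <- c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "gear", "carb")

# summary() returns six statistics per variable; sapply() stacks them as
# columns, and t() turns variables into rows.
summary_table <- round(t(sapply(cars[num_vars], summary)), 2)
print(summary_table)
```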
vs and am variables
vs is a factor variable and denotes the shape of the engine. 0 denotes a V-shaped engine while 1 denotes a straight engine.
table(mtcars$vs)
0 1
18 14
round(prop.table(table(mtcars$vs)), 2)
0 1
0.56 0.44
56% of the vehicles in the mtcars dataset had a V-shaped engine.
am is a factor variable and denotes the mode of transmission. 0 denotes automatic transmission while 1 denotes manual transmission.
table(mtcars$am)
0 1
19 13
round(prop.table(table(mtcars$am)), 2)
0 1
0.59 0.41
59% of the vehicles in the mtcars dataset had an automatic transmission.
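The 0/1 codes are easier to read when the factor levels carry descriptive labels. The sketch below attaches the labels defined above to a fresh copy of the data and adds a cross-tabulation of the two variables; the label strings and object names are our own choices.

```r
cars <- datasets::mtcars

# Attach descriptive labels to the 0/1 codes (coding taken from ?mtcars).
cars$vs <- factor(cars$vs, levels = c(0, 1),
                  labels = c("V-shaped", "straight"))
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("automatic", "manual"))

# Frequencies and proportions now print with readable labels.
vs_counts <- table(cars$vs)
am_share <- round(prop.table(table(cars$am)), 2)
print(vs_counts)
print(am_share)

# Cross-tabulation of engine shape by transmission type.
engine_by_transmission <- table(engine = cars$vs, transmission = cars$am)
print(engine_by_transmission)
```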
Works cited
American Physical Society. “Energy Future: Think Efficiency.” (September 2008).
Gautam, Suman. “What Factors Affect Average Fuel Economy of US Passenger Vehicles?” (2010).
R Core Team. “R: A Language and Environment for Statistical Computing.” (2013).