GoalZone is a fitness club chain providing fitness classes in Canada. Finally, the fitness classes schedule is back to normal after the COVID-19 restrictions are lifted. However, they have received many complaints from the customers about having a hard time booking a fitness class.
The Problem: Some classes are always fully booked, and fully booked classes often have a low attendance rate.
Goalzone wants to increase the number of spaces available for classes. They want to do this by predicting whether the member will attend the class or not. If they can predict a member will not attend the class, they can make another space available
The dataset contains the attendance information for the class scheduled this year so far. The data comes from DataCamp as a part of their Data Science Certificate Project. The data can also be downloaded from Kaggle.com
Column Name | Criteria |
---|---|
booking_id | Nominal. The unique identifier of the booking. Missing values are not possible due to the database structure. |
months_as_member | Discrete. The number of months as this fitness club member, minimum 1 month. Replace missing values with the overall average month. |
weight | Continuous. The member’s weight in kg, rounded to 2 decimal places. The minimum possible value is 40.00 kg. Replace missing values with the overall average weight. |
days_before | Discrete. The number of days before the class the member registered, minimum 1 day. Replace missing values with 0. |
day_of_week | Nominal. The day of the week of the class. One of “Mon”, “Tue”, “Wed”, “Thu”, “Fri”, “Sat” or “Sun”. Replace missing values with “unknown”. |
time | Ordinal. The time of day of the class. Either “AM” or “PM”. Replace missing values with “unknown”. |
category | Nominal. The category of the fitness class. One of “Yoga”, “Aqua”, “Strength”, “HIIT”, or “Cycling”. Replace missing values with “unknown”. |
attended | Nominal. Whether the member attended the class (1) or not (0). Missing values should be removed. |
library(tidyverse)
library(ggthemes)
library(rsample)
library(pROC)
library(rpart)
library(rpart.plot)
# Import data
goalzone_fc <- read_csv("https://raw.githubusercontent.com/xrander/fitness_class_project/master/fitness_class_2212.csv")
## [1] 1500 8
The data contains 1500 observations and 8 variables.
I’ll start by checking if the data loaded follows the data description above.
## spc_tbl_ [1,500 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ booking_id : chr [1:1500] "0001" "0002" "0003" "0004" ...
## $ months_as_member: num [1:1500] 17 10 16 5 15 7 11 9 23 7 ...
## $ weight : num [1:1500] 79.6 79 74.5 86.1 69.3 ...
## $ days_before : chr [1:1500] "8" "2" "14" "10" ...
## $ day_of_week : chr [1:1500] "Wed" "Mon" "Sun" "Fri" ...
## $ time : chr [1:1500] "PM" "AM" "AM" "AM" ...
## $ category : chr [1:1500] "Strength" "HIIT" "Strength" "Cycling" ...
## $ attended : num [1:1500] 0 0 0 0 0 0 0 0 1 0 ...
## - attr(*, "spec")=
## .. cols(
## .. booking_id = col_character(),
## .. months_as_member = col_double(),
## .. weight = col_double(),
## .. days_before = col_character(),
## .. day_of_week = col_character(),
## .. time = col_character(),
## .. category = col_character(),
## .. attended = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
All the variables do not follow the description provided in the dataset section above, except “attended”, “months_as_member”, and “weight” column. The first column, “booking_id” is the unique_id and not really of interest here.
Before correcting the variables, I will check for duplicated entries and unique values.
## [1] FALSE
There are no duplicate observations.
## $days_before
## [1] "8" "2" "14" "10" "6" "4" "9"
## [8] "12" "5" "3" "7" "13" "12 days" "20"
## [15] "1" "15" "6 days" "11" "13 days" "3 days" "16"
## [22] "1 days" "7 days" "8 days" "10 days" "14 days" "17" "5 days"
## [29] "2 days" "4 days" "29"
##
## $day_of_week
## [1] "Wed" "Mon" "Sun" "Fri" "Thu" "Wednesday"
## [7] "Fri." "Tue" "Sat" "Monday"
##
## $time
## [1] "PM" "AM"
##
## $category
## [1] "Strength" "HIIT" "Cycling" "Yoga" "-" "Aqua"
##
## $attended
## [1] 0 1
The variable days_before should be numeric, day_of_week have some inconsistent data, time.
# Mutate columns that do not follow data description to write data type
goalzone_fc <- goalzone_fc %>%
mutate(booking_id = as.numeric(booking_id),
#trim strings, remove all days which is a recurring character and coarse the figures into numeric data type
days_before = as.numeric(str_remove_all(str_trim(days_before), "days")),
# rename the observations to aid consistency and coarse to factor data type
day_of_week = factor(ifelse(day_of_week == "Wednesday", "Wed",
ifelse(day_of_week == "Fri.", "Fri",
ifelse(day_of_week == "Monday", "Mon", day_of_week))), level = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")),
time = factor(time, levels = c("AM", "PM"),ordered = T),
category = factor(ifelse(category == "-", "Unknown", category), ordered = F),
attended = as.numeric(attended))
## booking_id months_as_member weight days_before
## Min. : 1.0 Min. : 1.00 Min. : 55.41 Min. : 1.000
## 1st Qu.: 375.8 1st Qu.: 8.00 1st Qu.: 73.49 1st Qu.: 4.000
## Median : 750.5 Median : 12.00 Median : 80.76 Median : 9.000
## Mean : 750.5 Mean : 15.63 Mean : 82.61 Mean : 8.347
## 3rd Qu.:1125.2 3rd Qu.: 19.00 3rd Qu.: 89.52 3rd Qu.:12.000
## Max. :1500.0 Max. :148.00 Max. :170.52 Max. :29.000
## NA's :20
## day_of_week time category attended
## Mon:228 AM:1141 Aqua : 76 Min. :0.0000
## Tue:195 PM: 359 Cycling :376 1st Qu.:0.0000
## Wed:116 HIIT :667 Median :0.0000
## Thu:241 Strength:233 Mean :0.3027
## Fri:305 Unknown : 13 3rd Qu.:1.0000
## Sat:202 Yoga :135 Max. :1.0000
## Sun:213
From the summary output above, weight is having missing data and this will be replaced with the mean.
# replace missing data with mean of weights and round to 2 decimal place
goalzone_fc <- goalzone_fc %>%
mutate(weight = round(replace_na(weight, mean(weight, na.rm = T)), 2))
summary(goalzone_fc)
## booking_id months_as_member weight days_before
## Min. : 1.0 Min. : 1.00 Min. : 55.41 Min. : 1.000
## 1st Qu.: 375.8 1st Qu.: 8.00 1st Qu.: 73.56 1st Qu.: 4.000
## Median : 750.5 Median : 12.00 Median : 81.03 Median : 9.000
## Mean : 750.5 Mean : 15.63 Mean : 82.61 Mean : 8.347
## 3rd Qu.:1125.2 3rd Qu.: 19.00 3rd Qu.: 89.38 3rd Qu.:12.000
## Max. :1500.0 Max. :148.00 Max. :170.52 Max. :29.000
##
## day_of_week time category attended
## Mon:228 AM:1141 Aqua : 76 Min. :0.0000
## Tue:195 PM: 359 Cycling :376 1st Qu.:0.0000
## Wed:116 HIIT :667 Median :0.0000
## Thu:241 Strength:233 Mean :0.3027
## Fri:305 Unknown : 13 3rd Qu.:1.0000
## Sat:202 Yoga :135 Max. :1.0000
## Sun:213
Some insights into the data.
goalzone_fc %>%
ggplot(aes(category, fill = factor(attended)))+
geom_bar(position = "dodge",
width = 1)+
labs(fill = "Attended",
x = "Category",
y = "count")+
expand_limits(y = (0:500))+
scale_fill_manual(values = c("olivedrab2", "slateblue1"),
labels = c("No", "Yes"))+
theme_igray()
For all the categories of exercise in the gym. members miss gym classes than attend them.
goalzone_fc %>%
ggplot(aes(months_as_member)) +
geom_histogram(binwidth = 10,
fill = "mistyrose4")+
labs()+
facet_wrap(~category, scales = "free_y")
For all the categories of classes most of the members have been members for less than 50 months.
## [1] -0.4655935
ggplot(goalzone_fc, aes(weight, months_as_member))+
geom_point()+
geom_smooth(method = "lm", se = F, col = "red")+
labs(x = "weight (kg)",
y = "Months as Member",
title = "Relationship between Months as Member and Weight")+
theme_clean()+
geom_text(x = 100, y = 100, label = "r = -4.66", col = "red")+
theme(plot.title = element_text(face = "plain"))
The chart shows that people who have been gym members for long tend to have their weight reduced, but this is not so for all the case.
goalzone_fc %>% ggplot(aes(months_as_member, attended))+
geom_point()+
geom_smooth(se = F,
method = "glm",
method.args = list(family = "binomial")) +
labs(x = "Months as member",
y = "Attended",
title = "Relationship between Attendance and Number of Months as Member")+
theme_igray()
## `geom_smooth()` using formula = 'y ~ x'
The plot shows a logistic relationship
The Problem: Predict which member will or will not attend a gym class.
My first approach is to use logistic regression. This will be the baseline model, and it is used for its simplicity.
My second approach is to use Decision Trees approach. This will be used as the comparison model.
# preparing the data
goalzone_fc_md_data <- goalzone_fc %>%
select(-1)
# Set seed for reproducibility
set.seed(50)
# Split the data to train and test data
split <- initial_split(goalzone_fc_md_data, prop = 0.7)
goalzone_train <- training(split) #training data
goalzone_test <- testing(split) #testing data
goalzone_model <- glm(attended ~ .,
data = goalzone_train,
family = "binomial")
summary(goalzone_model)
##
## Call:
## glm(formula = attended ~ ., family = "binomial", data = goalzone_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.059442 0.845104 -1.254 0.2100
## months_as_member 0.121614 0.011037 11.019 <2e-16 ***
## weight -0.009192 0.007682 -1.197 0.2314
## days_before -0.313646 0.153185 -2.048 0.0406 *
## day_of_weekTue 0.627051 0.432571 1.450 0.1472
## day_of_weekWed 0.559122 0.753202 0.742 0.4579
## day_of_weekThu 1.976939 0.973334 2.031 0.0422 *
## day_of_weekFri 2.352317 1.258583 1.869 0.0616 .
## day_of_weekSat 3.198875 1.567332 2.041 0.0413 *
## day_of_weekSun 4.039935 1.876025 2.153 0.0313 *
## time.L -0.094105 0.146445 -0.643 0.5205
## categoryCycling -0.425041 0.361030 -1.177 0.2391
## categoryHIIT -0.116860 0.343543 -0.340 0.7337
## categoryStrength -0.536814 0.385840 -1.391 0.1641
## categoryUnknown -1.469571 1.117743 -1.315 0.1886
## categoryYoga -0.415716 0.431120 -0.964 0.3349
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1308.78 on 1049 degrees of freedom
## Residual deviance: 997.43 on 1034 degrees of freedom
## AIC: 1029.4
##
## Number of Fisher Scoring iterations: 5
# Make predictions with the model
goalzone_test$pred <-predict(goalzone_model, goalzone_test, type = "response")
goalzone_test <- goalzone_test %>%
mutate(pred = ifelse(pred < 0.5, 0, 1))
The logistic model is having a 0.78% accuracy.
Model will be compared using the area under the ROC curve and confusion matrix.
##
## 0 1
## 0 294 33
## 1 65 58
##
## 0 1
## 0 288 39
## 1 68 55
Model1 performs slightly better than Model2 .