Goalzone Fitness Class: Predict the Attendance of Members

Company Description

GoalZone is a fitness club chain providing fitness classes in Canada. Finally, the fitness classes schedule is back to normal after the COVID-19 restrictions are lifted. However, they have received many complaints from the customers about having a hard time booking a fitness class.

The Problem: Some classes are always fully booked, and fully booked classes often have a low attendance rate.

Goalzone wants to increase the number of spaces available for classes. They want to do this by predicting whether the member will attend the class or not. If they can predict a member will not attend the class, they can make another space available

Customer Question

  • The operation manager has asked to answer the following: - Can you predict the member that will attend a class or not?

Dataset

The dataset contains the attendance information for the class scheduled this year so far. The data comes from DataCamp as a part of their Data Science Certificate Project. The data can also be downloaded from Kaggle.com

Column Name Criteria
booking_id Nominal. The unique identifier of the booking. Missing values are not possible due to the database structure.
months_as_member Discrete. The number of months as this fitness club member, minimum 1 month. Replace missing values with the overall average month.
weight Continuous. The member’s weight in kg, rounded to 2 decimal places. The minimum possible value is 40.00 kg. Replace missing values with the overall average weight.
days_before Discrete. The number of days before the class the member registered, minimum 1 day. Replace missing values with 0.
day_of_week Nominal. The day of the week of the class. One of “Mon”, “Tue”, “Wed”, “Thu”, “Fri”, “Sat” or “Sun”. Replace missing values with “unknown”.
time Ordinal. The time of day of the class. Either “AM” or “PM”. Replace missing values with “unknown”.
category Nominal. The category of the fitness class. One of “Yoga”, “Aqua”, “Strength”, “HIIT”, or “Cycling”. Replace missing values with “unknown”.
attended Nominal. Whether the member attended the class (1) or not (0). Missing values should be removed.

Load Libraries and Import the Data

library(tidyverse)
library(ggthemes)
library(rsample)
library(pROC)
library(rpart)
library(rpart.plot)

# Import data
goalzone_fc <- read_csv("https://raw.githubusercontent.com/xrander/fitness_class_project/master/fitness_class_2212.csv")
goalzone_fc #view document

Data Validation

Data Structure and Variables

## Check the structure of the data to see if column follow the data description
dim(goalzone_fc)
## [1] 1500    8

The data contains 1500 observations and 8 variables.

I’ll start by checking if the data loaded follows the data description above.

## Check the structure of the data to see if column follow the data description
str(goalzone_fc)
## spc_tbl_ [1,500 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ booking_id      : chr [1:1500] "0001" "0002" "0003" "0004" ...
##  $ months_as_member: num [1:1500] 17 10 16 5 15 7 11 9 23 7 ...
##  $ weight          : num [1:1500] 79.6 79 74.5 86.1 69.3 ...
##  $ days_before     : chr [1:1500] "8" "2" "14" "10" ...
##  $ day_of_week     : chr [1:1500] "Wed" "Mon" "Sun" "Fri" ...
##  $ time            : chr [1:1500] "PM" "AM" "AM" "AM" ...
##  $ category        : chr [1:1500] "Strength" "HIIT" "Strength" "Cycling" ...
##  $ attended        : num [1:1500] 0 0 0 0 0 0 0 0 1 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   booking_id = col_character(),
##   ..   months_as_member = col_double(),
##   ..   weight = col_double(),
##   ..   days_before = col_character(),
##   ..   day_of_week = col_character(),
##   ..   time = col_character(),
##   ..   category = col_character(),
##   ..   attended = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

All the variables do not follow the description provided in the dataset section above, except “attended”, “months_as_member”, and “weight” column. The first column, “booking_id” is the unique_id and not really of interest here.

Before correcting the variables, I will check for duplicated entries and unique values.

unique(duplicated(goalzone_fc[,1:8])) # To show if observations are duplicated or not
## [1] FALSE

There are no duplicate observations.

lapply(goalzone_fc[, c(4:8)], unique)
## $days_before
##  [1] "8"       "2"       "14"      "10"      "6"       "4"       "9"      
##  [8] "12"      "5"       "3"       "7"       "13"      "12 days" "20"     
## [15] "1"       "15"      "6 days"  "11"      "13 days" "3 days"  "16"     
## [22] "1 days"  "7 days"  "8 days"  "10 days" "14 days" "17"      "5 days" 
## [29] "2 days"  "4 days"  "29"     
## 
## $day_of_week
##  [1] "Wed"       "Mon"       "Sun"       "Fri"       "Thu"       "Wednesday"
##  [7] "Fri."      "Tue"       "Sat"       "Monday"   
## 
## $time
## [1] "PM" "AM"
## 
## $category
## [1] "Strength" "HIIT"     "Cycling"  "Yoga"     "-"        "Aqua"    
## 
## $attended
## [1] 0 1

The variable days_before should be numeric, day_of_week have some inconsistent data, time.

# Mutate columns that do not follow data description to write data type
goalzone_fc <- goalzone_fc %>%
  mutate(booking_id = as.numeric(booking_id),
         #trim strings, remove all days which is a recurring character and coarse the figures into numeric data type
         days_before = as.numeric(str_remove_all(str_trim(days_before), "days")),
         # rename the observations to aid consistency and coarse to factor data type
         day_of_week = factor(ifelse(day_of_week == "Wednesday", "Wed",
                              ifelse(day_of_week == "Fri.", "Fri",
                                     ifelse(day_of_week == "Monday", "Mon", day_of_week))), level = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")),
         
         time = factor(time, levels = c("AM", "PM"),ordered = T),
         category = factor(ifelse(category == "-", "Unknown", category), ordered = F),
         attended = as.numeric(attended))
summary(goalzone_fc)
##    booking_id     months_as_member     weight        days_before    
##  Min.   :   1.0   Min.   :  1.00   Min.   : 55.41   Min.   : 1.000  
##  1st Qu.: 375.8   1st Qu.:  8.00   1st Qu.: 73.49   1st Qu.: 4.000  
##  Median : 750.5   Median : 12.00   Median : 80.76   Median : 9.000  
##  Mean   : 750.5   Mean   : 15.63   Mean   : 82.61   Mean   : 8.347  
##  3rd Qu.:1125.2   3rd Qu.: 19.00   3rd Qu.: 89.52   3rd Qu.:12.000  
##  Max.   :1500.0   Max.   :148.00   Max.   :170.52   Max.   :29.000  
##                                    NA's   :20                       
##  day_of_week time          category      attended     
##  Mon:228     AM:1141   Aqua    : 76   Min.   :0.0000  
##  Tue:195     PM: 359   Cycling :376   1st Qu.:0.0000  
##  Wed:116               HIIT    :667   Median :0.0000  
##  Thu:241               Strength:233   Mean   :0.3027  
##  Fri:305               Unknown : 13   3rd Qu.:1.0000  
##  Sat:202               Yoga    :135   Max.   :1.0000  
##  Sun:213

Missing Data

From the summary output above, weight is having missing data and this will be replaced with the mean.

# replace missing data with mean of weights and round to 2 decimal place
goalzone_fc <- goalzone_fc %>%
  mutate(weight = round(replace_na(weight, mean(weight, na.rm = T)), 2))

summary(goalzone_fc)
##    booking_id     months_as_member     weight        days_before    
##  Min.   :   1.0   Min.   :  1.00   Min.   : 55.41   Min.   : 1.000  
##  1st Qu.: 375.8   1st Qu.:  8.00   1st Qu.: 73.56   1st Qu.: 4.000  
##  Median : 750.5   Median : 12.00   Median : 81.03   Median : 9.000  
##  Mean   : 750.5   Mean   : 15.63   Mean   : 82.61   Mean   : 8.347  
##  3rd Qu.:1125.2   3rd Qu.: 19.00   3rd Qu.: 89.38   3rd Qu.:12.000  
##  Max.   :1500.0   Max.   :148.00   Max.   :170.52   Max.   :29.000  
##                                                                     
##  day_of_week time          category      attended     
##  Mon:228     AM:1141   Aqua    : 76   Min.   :0.0000  
##  Tue:195     PM: 359   Cycling :376   1st Qu.:0.0000  
##  Wed:116               HIIT    :667   Median :0.0000  
##  Thu:241               Strength:233   Mean   :0.3027  
##  Fri:305               Unknown : 13   3rd Qu.:1.0000  
##  Sat:202               Yoga    :135   Max.   :1.0000  
##  Sun:213

Exploratory Data Analysis

Some insights into the data.

Day of the Week

goalzone_fc %>%
  ggplot(aes(day_of_week)) +
  geom_bar(fill = "springgreen3")+
  labs(x = "Day of the Week",
       y = "Count",
       title = "Total Attendance for Classes Per Day")+
  theme_igray()

Distribution of Days Before Classes Start

goalzone_fc %>%
  ggplot(aes(days_before))+
  geom_bar()+
  labs(x = "Days Before",
       title = "Number of Days Before Member Attends the Gym")+
  geom_text(x = 25,
            y = 150,
            label = "Most of the members are having their classes in 10 days to come",
            color = "tan4",
            size = 2.8)+
  coord_flip()+
  theme_igray()

Weight Distribution of Members

goalzone_fc %>%
  ggplot(aes(weight))+
  geom_density()+
  expand_limits(x = c(45, 180))+
  labs(x = "weight(kg)",
       title = "Weight Distribution of Gym Members")+
  theme_igray()

Member’s Attendance the Gym According to Category

goalzone_fc %>%
  ggplot(aes(category, fill = factor(attended)))+
  geom_bar(position = "dodge",
           width = 1)+
  labs(fill = "Attended",
       x = "Category",
       y = "count")+
  expand_limits(y = (0:500))+
  scale_fill_manual(values = c("olivedrab2", "slateblue1"),
                    labels = c("No", "Yes"))+
  theme_igray()

For all the categories of exercise in the gym. members miss gym classes than attend them.

Distribution of Number of Months as Members across all categories

goalzone_fc %>%
  ggplot(aes(months_as_member)) +
  geom_histogram(binwidth = 10,
                 fill = "mistyrose4")+
  labs()+
  facet_wrap(~category, scales = "free_y")

For all the categories of classes most of the members have been members for less than 50 months.

Relationship between Months as Member and Weight of Member

cor(goalzone_fc$months_as_member, goalzone_fc$weight)
## [1] -0.4655935
ggplot(goalzone_fc, aes(weight, months_as_member))+
  geom_point()+
  geom_smooth(method = "lm", se = F, col = "red")+
  labs(x = "weight (kg)",
       y = "Months as Member",
       title = "Relationship between Months as Member and Weight")+
  theme_clean()+
  geom_text(x = 100, y = 100, label = "r = -4.66", col = "red")+
  theme(plot.title = element_text(face = "plain"))

The chart shows that people who have been gym members for long tend to have their weight reduced, but this is not so for all the case.

Relationship Between Attendance and Months as Member

goalzone_fc %>% ggplot(aes(months_as_member, attended))+
  geom_point()+
  geom_smooth(se = F,
              method = "glm",
              method.args = list(family = "binomial")) +
  labs(x = "Months as member",
       y = "Attended",
       title = "Relationship between Attendance and Number of Months as Member")+
  theme_igray()
## `geom_smooth()` using formula = 'y ~ x'

The plot shows a logistic relationship

Model Fitting

Prediction Models

  • The Problem: Predict which member will or will not attend a gym class.

  • My first approach is to use logistic regression. This will be the baseline model, and it is used for its simplicity.

  • My second approach is to use Decision Trees approach. This will be used as the comparison model.

Splitting the datasets

# preparing the data
goalzone_fc_md_data <- goalzone_fc %>%
  select(-1) 

# Set seed for reproducibility
set.seed(50)

# Split the data to train and test data
split <- initial_split(goalzone_fc_md_data, prop = 0.7)
goalzone_train <- training(split) #training data
goalzone_test <- testing(split) #testing data

Training the Model (Logistic Model)

goalzone_model <- glm(attended ~ .,
                      data = goalzone_train,
                      family = "binomial")

summary(goalzone_model)
## 
## Call:
## glm(formula = attended ~ ., family = "binomial", data = goalzone_train)
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -1.059442   0.845104  -1.254   0.2100    
## months_as_member  0.121614   0.011037  11.019   <2e-16 ***
## weight           -0.009192   0.007682  -1.197   0.2314    
## days_before      -0.313646   0.153185  -2.048   0.0406 *  
## day_of_weekTue    0.627051   0.432571   1.450   0.1472    
## day_of_weekWed    0.559122   0.753202   0.742   0.4579    
## day_of_weekThu    1.976939   0.973334   2.031   0.0422 *  
## day_of_weekFri    2.352317   1.258583   1.869   0.0616 .  
## day_of_weekSat    3.198875   1.567332   2.041   0.0413 *  
## day_of_weekSun    4.039935   1.876025   2.153   0.0313 *  
## time.L           -0.094105   0.146445  -0.643   0.5205    
## categoryCycling  -0.425041   0.361030  -1.177   0.2391    
## categoryHIIT     -0.116860   0.343543  -0.340   0.7337    
## categoryStrength -0.536814   0.385840  -1.391   0.1641    
## categoryUnknown  -1.469571   1.117743  -1.315   0.1886    
## categoryYoga     -0.415716   0.431120  -0.964   0.3349    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1308.78  on 1049  degrees of freedom
## Residual deviance:  997.43  on 1034  degrees of freedom
## AIC: 1029.4
## 
## Number of Fisher Scoring iterations: 5

Predictions with Logistic Model

# Make predictions with the model
goalzone_test$pred <-predict(goalzone_model, goalzone_test, type = "response")

goalzone_test <- goalzone_test %>%
  mutate(pred = ifelse(pred < 0.5, 0, 1)) 

The logistic model is having a 0.78% accuracy.

Training Model_2 (Decision Tree)

goalzone_model2 <-  rpart(attended ~ .,
                          goalzone_train,
                          method = "class")

# Visualize the decision tree
rpart.plot(goalzone_model2,
           type = 4,
           box.palette = c("red", "green"),
           fallen.leaves = TRUE)

Predictions with Decision Tree

# Predict the outcome using model 2
goalzone_test <- goalzone_test %>%
  mutate(pred2 = as.numeric(as.character(predict(goalzone_model2, goalzone_test, type = "class"))))

The decision tree is having a 0.76`

Model Comparison

Model will be compared using the area under the ROC curve and confusion matrix.

ROC Curve

ROC Curve (logistic regression - ROC1)

# Calculate the ROC
roc_curve <- roc(goalzone_test$attended, goalzone_test$pred)

# Visual model performance using ROC(Receiver Operating Characteristics Cost)
plot(roc_curve,
     col = "red",
     main = "Logistic Regression ROC Curve")

The auc of roc10.6853136

ROC Curve (decision trees- ROC2)

roc_curve2<- roc(goalzone_test$attended, goalzone_test$pred2)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_curve2,
     col = "blue",
     main = "Decision Tree ROC Curve")

The auc of the roc2 is 0.6639442.

Confusion Matrix

# Logistic regression
table(goalzone_test$attended, goalzone_test$pred)
##    
##       0   1
##   0 294  33
##   1  65  58
# Decision tree
table(goalzone_test$attended, goalzone_test$pred2)
##    
##       0   1
##   0 288  39
##   1  68  55

Model1 performs slightly better than Model2 .

Homepage