Case Study: Bellabeat

Ishmael Roslan

2022-07-24

1 Ask

1.1 Case Study Briefing

1.1.1 Scenario

You are a junior data analyst working on the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. You have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights you discover will then help guide marketing strategy for the company. You will present your analysis to the Bellabeat executive team along with your high-level recommendations for Bellabeat’s marketing strategy.

1.1.2 Characters and Products

1.1.2.1 Characters

Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team.
Bellabeat marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy. You joined this team six months ago and have been busy learning about Bellabeat’s mission and business goals, as well as how you, as a junior data analyst, can help Bellabeat achieve them.

1.1.2.2 Products

Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.
Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.
Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.
Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.
Bellabeat membership: Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.

1.1.3 About the Company

Urška Sršen and Sando Mur founded Bellabeat, a high-tech company that manufactures health-focused smart products. Sršen used her background as an artist to develop beautifully designed technology that informs and inspires women around the world. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits. Since it was founded in 2013, Bellabeat has grown rapidly and quickly positioned itself as a tech-driven wellness company for women.

By 2016, Bellabeat had opened offices around the world and launched multiple products. Bellabeat products became available through a growing number of online retailers in addition to their own e-commerce channel on their website. The company has invested in traditional advertising media, such as radio, out-of-home billboards, print, and television, but focuses on digital marketing extensively. Bellabeat invests year-round in Google Search, maintains active Facebook and Instagram pages, and consistently engages consumers on Twitter. Additionally, Bellabeat runs video ads on YouTube and display ads on the Google Display Network to support campaigns around key marketing dates. Sršen knows that an analysis of Bellabeat’s available consumer data would reveal more opportunities for growth. She has asked the marketing analytics team to focus on a Bellabeat product and analyze smart device usage data in order to gain insight into how people are already using their smart devices. Then, using this information, she would like high-level recommendations for how these trends can inform Bellabeat marketing strategy.

1.2 The Business Task

To identify trends in usage of activity trackers and their associated apps.
Recommend features and inform marketing strategy to give Bellabeat a competitive advantage and to increase market share.

1.2.1 Key Questions

What are the trends in usage of the trackers?
Are there correlations between sleep, activity and calories burned?
What features would improve user experience whilst also promoting better health?
Which features should be the focus of the marketing strategy?

2 Prepare

2.1 Data Sources

Where was the data stored?

FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius): This Kaggle data set contains personal fitness tracker data from thirty Fitbit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.

2.2 Data Import and Store

2.2.1 Importing Data

List Files in Directory

library(data.table)
files <- list.files(path = "data", full.names = TRUE)
files

##  [1] "data/dailyActivity_merged.csv"          
##  [2] "data/dailyCalories_merged.csv"          
##  [3] "data/dailyIntensities_merged.csv"       
##  [4] "data/dailySteps_merged.csv"             
##  [5] "data/heartrate_seconds_merged.csv"      
##  [6] "data/hourlyCalories_merged.csv"         
##  [7] "data/hourlyIntensities_merged.csv"      
##  [8] "data/hourlySteps_merged.csv"            
##  [9] "data/minuteCaloriesNarrow_merged.csv"   
## [10] "data/minuteCaloriesWide_merged.csv"     
## [11] "data/minuteIntensitiesNarrow_merged.csv"
## [12] "data/minuteIntensitiesWide_merged.csv"  
## [13] "data/minuteMETsNarrow_merged.csv"       
## [14] "data/minuteSleep_merged.csv"            
## [15] "data/minuteStepsNarrow_merged.csv"      
## [16] "data/minuteStepsWide_merged.csv"        
## [17] "data/sleepDay_merged.csv"               
## [18] "data/weightLogInfo_merged.csv"

Remove the minute and Wide tables, as they are replicated in the hourly and Narrow tables respectively. dailyCalories, dailySteps and dailyIntensities are also duplicated in dailyActivity. Keep minuteSleep for extra data on sleep stages.

files <- files[grep("(hourly|dailyA|weight|sleepD|minuteSl|heart)", files)]
files

## [1] "data/dailyActivity_merged.csv"     "data/heartrate_seconds_merged.csv"
## [3] "data/hourlyCalories_merged.csv"    "data/hourlyIntensities_merged.csv"
## [5] "data/hourlySteps_merged.csv"       "data/minuteSleep_merged.csv"      
## [7] "data/sleepDay_merged.csv"          "data/weightLogInfo_merged.csv"

Extract Table Names from path
Read in all files to a list of tables
Clean the column names of the nested tables
Assign each table as a separate variable

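library(janitor)
# Table name = file name without the directory and the "_merged.csv" suffix
tablenames <- gsub("(.*/)(.*)(_.*)", r"(\2)", files)
# Read every file, then convert the column names to snake_case
l <- lapply(files, fread, sep = ",", na.strings = c(""))
l <- lapply(l, clean_names)
# Expose each table as its own variable
for (row in seq_along(tablenames)) {
  assign(tablenames[row], l[[row]])
}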
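As a small illustration of the table-name step, the sketch below applies the same regular expression to two of the paths listed above. The `paths` vector and the named-list variant at the end are illustrative assumptions, not part of the original analysis.

```r
# Two of the paths from the listing above, used purely as sample input
paths <- c("data/dailyActivity_merged.csv",
           "data/heartrate_seconds_merged.csv")

# Group 2 is everything between the last "/" and the last "_",
# so the "_merged.csv" suffix is dropped while inner underscores survive
table_names <- gsub("(.*/)(.*)(_.*)", "\\2", paths)
print(table_names)

# Possible alternative to assign(): keep everything in one named list
# (the path strings stand in for the real tables here)
tables <- setNames(as.list(paths), table_names)
print(names(tables))
```

Keeping the tables in a named list avoids creating variables dynamically, at the cost of writing `tables$dailyActivity` instead of `dailyActivity`.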

2.2.2 Daily Data

In all cases below, date will need to be parsed from a character variable and id converted to a factor variable. Given that there are fewer users in the sleep and weight tables, id should be converted after merging.

Note: This data is in a tidy format, with one observational unit being a single user per date.

library(funModeling)
df_status(dailyActivity)

##                      variable q_zeros p_zeros q_na p_na q_inf p_inf      type unique
## 1                          id       0    0.00    0    0     0     0 integer64     33
## 2               activity_date       0    0.00    0    0     0     0 character     31
## 3                 total_steps      77    8.19    0    0     0     0   integer    842
## 4              total_distance      78    8.30    0    0     0     0   numeric    615
## 5            tracker_distance      78    8.30    0    0     0     0   numeric    613
## 6  logged_activities_distance     908   96.60    0    0     0     0   numeric     19
## 7        very_active_distance     413   43.94    0    0     0     0   numeric    333
## 8  moderately_active_distance     386   41.06    0    0     0     0   numeric    211
## 9       light_active_distance      85    9.04    0    0     0     0   numeric    491
## 10  sedentary_active_distance     858   91.28    0    0     0     0   numeric      9
## 11        very_active_minutes     409   43.51    0    0     0     0   integer    122
## 12      fairly_active_minutes     384   40.85    0    0     0     0   integer     81
## 13     lightly_active_minutes      84    8.94    0    0     0     0   integer    335
## 14          sedentary_minutes       1    0.11    0    0     0     0   integer    549
## 15                   calories       4    0.43    0    0     0     0   integer    734

head(dailyActivity)

All complete, showing activity of 33 users over 31 dates.

df_status(sleepDay)

##               variable q_zeros p_zeros q_na p_na q_inf p_inf      type unique
## 1                   id       0       0    0    0     0     0 integer64     24
## 2            sleep_day       0       0    0    0     0     0 character     31
## 3  total_sleep_records       0       0    0    0     0     0   integer      3
## 4 total_minutes_asleep       0       0    0    0     0     0   integer    256
## 5    total_time_in_bed       0       0    0    0     0     0   integer    242

head(sleepDay)

24 users’ sleep logs over 31 days. This appears to have been summarised from minuteSleep. No information on Sleep Type though, so will need to get from minuteSleep.

df_status(weightLogInfo)

##           variable q_zeros p_zeros q_na  p_na q_inf p_inf      type unique
## 1               id       0    0.00    0  0.00     0     0 integer64      8
## 2             date       0    0.00    0  0.00     0     0 character     56
## 3        weight_kg       0    0.00    0  0.00     0     0   numeric     34
## 4    weight_pounds       0    0.00    0  0.00     0     0   numeric     34
## 5              fat       0    0.00   65 97.01     0     0   integer      2
## 6              bmi       0    0.00    0  0.00     0     0   numeric     36
## 7 is_manual_report      26   38.81    0  0.00     0     0   logical      2
## 8           log_id       0    0.00    0  0.00     0     0 integer64     56

head(weightLogInfo)

Weight logs of 8 unique users over 56 unique datetimes. This means we need to parse the date from this.

2.2.3 Hourly Data

In all cases below, date and time will need to be parsed from a character variable. All three tables have 33 unique users and 736 unique hours so can be joined on these variables.

Note: This data is in a tidy format, with one observational unit being a single user per hour.

df_status(hourlyCalories)

##        variable q_zeros p_zeros q_na p_na q_inf p_inf      type unique
## 1            id       0       0    0    0     0     0 integer64     33
## 2 activity_hour       0       0    0    0     0     0 character    736
## 3      calories       0       0    0    0     0     0   integer    442

head(hourlyCalories)

Looks good.

df_status(hourlyIntensities)

##            variable q_zeros p_zeros q_na p_na q_inf p_inf      type unique
## 1                id       0    0.00    0    0     0     0 integer64     33
## 2     activity_hour       0    0.00    0    0     0     0 character    736
## 3   total_intensity    9097   41.16    0    0     0     0   integer    175
## 4 average_intensity    9097   41.16    0    0     0     0   numeric    175

head(hourlyIntensities)

Looks good. It appears that total intensity is a weighted sum of LightlyActiveMinutes, FairlyActiveMinutes and VeryActiveMinutes from the minuteIntensities table, whereas average_intensity divides this by 60 to get a value per minute.
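The relationship between the two intensity columns can be checked with a one-line comparison. The sketch below uses made-up rows rather than the real hourlyIntensities table, and the rounding of the averages to six decimals is an assumption.

```r
# Made-up hourly rows: a total and the per-minute average reported alongside it
toy <- data.frame(total_intensity   = c(0, 20, 13),
                  average_intensity = c(0, 0.333333, 0.216667))

# If the average is total / 60, the two columns agree up to rounding
matches <- all.equal(toy$average_intensity,
                     toy$total_intensity / 60,
                     tolerance = 1e-5)
print(matches)
```

Running the same comparison on the real table would confirm or refute the "divide by 60" reading.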

df_status(hourlySteps)

##        variable q_zeros p_zeros q_na p_na q_inf p_inf      type unique
## 1            id       0    0.00    0    0     0     0 integer64     33
## 2 activity_hour       0    0.00    0    0     0     0 character    736
## 3    step_total    9297   42.07    0    0     0     0   integer   2222

head(hourlySteps)

Looks good.

2.2.4 Heart Rate

df_status(heartrate_seconds)

##   variable q_zeros p_zeros q_na p_na q_inf p_inf      type unique
## 1       id       0       0    0    0     0     0 integer64     14
## 2     time       0       0    0    0     0     0 character 961274
## 3    value       0       0    0    0     0     0   integer    168

head(heartrate_seconds)

Heart rate data looks good, but could be averaged by hour and joined with the hourly data.

Note: This data is in a tidy format, with one observational unit being a single user per second.

2.2.5 Sleep

df_status(minuteSleep)

##   variable q_zeros p_zeros q_na p_na q_inf p_inf      type unique
## 1       id       0       0    0    0     0     0 integer64     24
## 2     date       0       0    0    0     0     0 character  49773
## 3    value       0       0    0    0     0     0   integer      3
## 4   log_id       0       0    0    0     0     0 integer64    459

head(minuteSleep)

Note: This data is in a tidy format, with one observational unit being a single user per minute.

The sleep data poses a few challenges.

The value column identifies the type of sleep and this must be reflected in the data.
It would be more useful to summarise this data into a wide format, with the minutes of each sleep type recorded per day. This could then be merged with the daily tables.
total_minutes_asleep from the sleepDay table does not appear to match with this table.

2.3 Key Questions

Are there issues with bias or credibility in the data?

Reliable: The data was not particularly reliable. There were inconsistencies in the data collected but these were corrected to the best of my ability as described in the cleaning and wrangling sections.
Original: I cannot locate the original data source that was provided for this case study. It is a Kaggle repository with data offered by the public.
Comprehensive: The data is not comprehensive. This is a relatively small dataset, and many tables have a lot of missing values or very few rows. The data was also volunteered by Fitbit users, so it is neither a random sample nor stratified in any way. It would be dangerous to draw conclusions from this data alone.
Current: These data were collected in 2016 and the landscape of the fitness tracking industry has changed a lot in the last 6 years. It would be best to seek newer data.
Cited: I have cited the original source to the best of my knowledge above.

3 Process

3.1 Data Wrangling

3.1.1 Sleep

I believe the problem arises as sleep events usually span midnight and therefore can occur on two separate dates. There are two potential ways to collate the minuteSleep data to make sleepDay:

Add together the number of minutes of sleep per calendar date and record that. The observation unit would be minutes of sleep per id per date
Add together the number of minutes of sleep per id per log_id and assign the date that the sleep event began as the date.

The second of these paradigms is far more complicated, so let’s explore the first.

minuteSleep[,date := lubridate::mdy_hms(date)]
minuteSleep[,date := lubridate::date(date)]
sleep1 <- minuteSleep[, .N, .(id,date)]
sleepDay[,sleep_day := lubridate::mdy_hms(sleep_day)]
sleepDay[,sleep_day := lubridate::date(sleep_day)]

head(sleepDay)

head(sleep1)

The new sleep1 correlates quite well with total_time_in_bed but there are some discrepancies and also missing data in sleepDay. I will therefore replace sleepDay with my own summarised version of minuteSleep, which will provide transparency and coherence.

# Count the minutes of each sleep value per user and calendar date
dailySleep <- dcast(minuteSleep,
      id + date ~ value,
      fun.aggregate = length)
dailySleep[, total_sleep := `1` + `2` + `3`]
setnames(dailySleep, c("1", "2", "3"), c("rem", "light", "deep"))
head(dailySleep)

The new dailySleep table shows, for each user and calendar date, the minutes recorded in each sleep stage and the total minutes.
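To make the difference between the two collation paradigms concrete, here is a minimal sketch on a made-up sleep event that spans midnight. The `toy_sleep` table and its values are invented for illustration; only the column names follow minuteSleep.

```r
library(data.table)

# Toy minute-level log (made-up values): one sleep event from 23:58 to 00:02
toy_sleep <- data.table(
  id       = 1L,
  log_id   = 100L,
  datetime = as.POSIXct("2016-04-12 23:58:00", tz = "UTC") + 60 * (0:4),
  value    = c(1L, 1L, 2L, 1L, 3L)
)

# Paradigm 1: minutes per calendar date (the event is split across two dates)
by_date <- toy_sleep[, .(minutes = .N), by = .(id, date = as.Date(datetime))]

# Paradigm 2: minutes per sleep event, dated by when the event began
by_event <- toy_sleep[, .(date = as.Date(min(datetime)), minutes = .N),
                      by = .(id, log_id)]

print(by_date)
print(by_event)
```

Under the first paradigm the five minutes are split between 12 and 13 April, whereas the second assigns all five to 12 April.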

3.1.2 Daily Data

Parse datetimes.
Set keys.
Merge tables

dailyActivity[, date := lubridate::mdy(activity_date)][, activity_date := NULL]
weightLogInfo[,date := lubridate::mdy_hms(date)]
weightLogInfo[,date := lubridate::date(date)]
setkeyv(dailySleep, c("id", "date"))
setkeyv(dailyActivity, c("id", "date"))
setkeyv(weightLogInfo, c("id", "date"))
daily <- weightLogInfo[dailySleep][dailyActivity]
daily[, id := factor(id)]
head(daily)
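One property of the chained join is worth illustrating: in data.table, `X[Y]` keeps every row of `Y` and only the matching rows of `X`. The sketch below uses made-up tables and an explicit `on =` instead of keys; whether any real weight logs fall on dates without a sleep record has not been checked here.

```r
library(data.table)

keys <- c("id", "date")
day  <- as.Date("2016-04-12")

# Toy tables (made-up values): user 2 has a weight log but no sleep record
weights  <- data.table(id = c(1L, 2L), date = day, weight_kg = c(60, 75))
sleep    <- data.table(id = 1L, date = day, total_sleep = 420L)
activity <- data.table(id = c(1L, 2L), date = day, total_steps = c(9000L, 4000L))

# Same shape as the merge above: user 2's weight is dropped at weights[sleep]
chained <- weights[sleep, on = keys][activity, on = keys]

# Joining each table onto activity in turn keeps user 2's weight
left <- sleep[weights[activity, on = keys], on = keys]

print(chained)
print(left)
```

Both results have one row per activity record, but only `left` retains the weight for the user without sleep data.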

3.1.3 Hourly Data

Summarise heartrate by date, hour.

heartrate_seconds[, datetime := lubridate::mdy_hms(time)]
heartrate_seconds[, date := lubridate::date(datetime)]
heartrate_seconds[, hour := lubridate::hour(datetime)][, time := NULL]
# Mean heart rate per user, date and hour
hourlyHeartrate <- heartrate_seconds[,
                                     .(heartrate = as.integer(mean(value))),
                                     by = .(id, date, hour)]
head(hourlyHeartrate)

Parse datetimes.
Set keys.
Merge tables

hourlyCalories[, datetime := lubridate::mdy_hms(activity_hour)]
hourlyCalories[, date := lubridate::date(datetime)]
hourlyCalories[, hour := lubridate::hour(datetime)][, activity_hour := NULL][, datetime := NULL]
head(hourlyCalories)

hourlyIntensities[, datetime := lubridate::mdy_hms(activity_hour)]
hourlyIntensities[, date := lubridate::date(datetime)]
hourlyIntensities[, hour := lubridate::hour(datetime)][, activity_hour := NULL][, datetime := NULL]
head(hourlyIntensities)

hourlySteps[, datetime := lubridate::mdy_hms(activity_hour)]
hourlySteps[, date := lubridate::date(datetime)]
hourlySteps[, hour := lubridate::hour(datetime)][, activity_hour := NULL][, datetime := NULL]
head(hourlySteps)

setkeyv(hourlyCalories, c("id", "date", "hour"))
setkeyv(hourlyHeartrate, c("id", "date", "hour"))
setkeyv(hourlyIntensities, c("id", "date", "hour"))
setkeyv(hourlySteps, c("id", "date", "hour"))
hourly <- hourlyHeartrate[hourlyCalories][hourlyIntensities][hourlySteps]
hourly[, id := factor(id)]
# Tidy environment
rm(list=ls()[! ls() %in% c("daily","hourly")])
head(hourly)
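The three hourly tables are parsed with the same steps, so the parsing could be wrapped in a small helper. The function name `split_activity_hour` is hypothetical, and the sample timestamp strings are an assumption about the raw format (month/day/year with AM/PM).

```r
library(data.table)

# Hypothetical helper: replace a character datetime column with date and hour
split_activity_hour <- function(dt, col = "activity_hour") {
  datetime <- lubridate::mdy_hms(dt[[col]])
  dt[, date := lubridate::date(datetime)]
  dt[, hour := lubridate::hour(datetime)]
  dt[, (col) := NULL]   # drop the original character column
  dt[]
}

# Made-up rows in the assumed raw format
toy <- data.table(id = 1L,
                  activity_hour = c("4/12/2016 12:00:00 AM",
                                    "4/12/2016 1:00:00 PM"),
                  calories = c(81L, 120L))

split_activity_hour(toy)   # modifies toy by reference
print(toy)
```

Because `:=` updates by reference, the helper changes the table passed in, which matches how the tables are modified in the code above.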

3.1.4 Feature Engineering - User Type

Using id as a factor variable is useful; however, it would be good to split the users into groups based on their activity to spot any trends in usage. I shall use k-means clustering to group the users.

3.1.4.1 Normalisation

Select only the Activity Columns and then center and scale the data.

library(tidyverse)
library(tidymodels)
activity <-
  daily %>%
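  select(c(very_active_minutes:sedentary_minutes)) %>%
  # scale() returns a one-column matrix, so convert back to a plain numeric vector
  mutate(across(everything(), ~ as.numeric(scale(.x))))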
head(activity)
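For reference, centering and scaling means subtracting the column mean and dividing by the sample standard deviation. The short sketch below checks this on made-up values.

```r
# Made-up active minutes for four days
minutes <- c(10, 20, 30, 40)

# scale() centers on the mean and divides by the sample standard deviation
scaled <- as.numeric(scale(minutes))
manual <- (minutes - mean(minutes)) / sd(minutes)

print(all.equal(scaled, manual))
print(round(c(mean = mean(scaled), sd = sd(scaled)), 10))
```

After scaling, every column has mean 0 and standard deviation 1, so no single activity level dominates the Euclidean distances used by k-means.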

3.1.4.2 Try k = 3 : 6

set.seed(1234)

kclusts <-
  tibble(k = 3:6) %>%
  mutate(
    kclust = map(k, ~kmeans(activity, .x)),
    tidied = map(kclust, tidy),
    glanced = map(kclust, glance),
    augmented = map(kclust, augment, activity)
    )

clusters <- 
  kclusts %>%
  unnest(cols = c(tidied))

assignments <- 
  kclusts %>% 
  unnest(cols = c(augmented))

clusterings <- 
  kclusts %>%
  unnest(cols = c(glanced))

assignments %>%
  select(-c(kclust, tidied,glanced)) %>%
  group_by(k, .cluster) %>%
  mutate(.cluster = fct_reorder(.cluster, very_active_minutes)) %>%
  summarise(across(where(is.numeric), mean)) %>%
  pivot_longer(c(very_active_minutes:sedentary_minutes), names_to = "Lifestyle", values_to = "Minutes") %>%
  mutate(Lifestyle = fct_relevel(Lifestyle, c("sedentary_minutes", "lightly_active_minutes", "fairly_active_minutes", "very_active_minutes"))) %>%
  ggplot(aes(fill= .cluster, y = Minutes, x = .cluster)) +
  geom_col(position = "dodge") +
  facet_grid(k~Lifestyle)

k = 3 appears to give the most coherent clusters in terms of user activity.
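A common complement to visual inspection is to compare the total within-cluster sum of squares across values of k, which `glance()` reports as `tot.withinss`. The sketch below uses base R on made-up data with three well-separated groups; it is an illustration of the idea, not a re-analysis of the tracker data.

```r
set.seed(1234)

# Made-up, already-scaled data with three loose groups
toy <- rbind(
  matrix(rnorm(60, mean = -2, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean =  0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean =  2, sd = 0.3), ncol = 2)
)

# Total within-cluster sum of squares for each candidate k
ks  <- 1:6
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 25)$tot.withinss)

# A sharp bend (the "elbow") suggests a reasonable number of clusters
plot(ks, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")
```

On data like this the curve drops steeply up to k = 3 and flattens afterwards.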

3.1.4.3 Map Cluster to UserType

Double check that the Lifestyle assignments make sense.

lifs <- assignments %>%
  filter(k == 3) %>%
  select(-c(k, kclust, tidied,glanced)) %>%
  mutate(.cluster = fct_reorder(.cluster, (very_active_minutes+fairly_active_minutes))) %>%
  pull(.cluster)

daily <- daily %>%
  mutate(
    Lifestyle = lifs,
    Lifestyle = fct_recode(Lifestyle,
                           "Sedentary" = "1",
                           "Fairly Active" = "2",
                           "Very Active" = "3")
  )
# Sanity Check
daily %>%
  select(c(very_active_minutes:sedentary_minutes),Lifestyle) %>%
  group_by(Lifestyle) %>%
  summarise(across(where(is.numeric), mean)) %>%
  arrange(very_active_minutes+fairly_active_minutes)

rm(list=ls()[! ls() %in% c("daily","hourly")])
head(daily)
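Because k-means numbers its clusters arbitrarily, a mapping from the labels "1", "2", "3" to lifestyle names depends on the sanity check above holding for the chosen seed. A possible alternative, sketched here in base R on made-up data, is to rank the cluster centres and label by rank.

```r
set.seed(1234)

# Made-up scaled "active minutes" for 30 days in three loose groups
active <- c(rnorm(10, -1, 0.1), rnorm(10, 0, 0.1), rnorm(10, 1, 0.1))
fit <- kmeans(matrix(active, ncol = 1), centers = 3, nstart = 10)

# Rank the cluster centres from least to most active ...
rank_of_cluster <- rank(fit$centers[, 1])

# ... then label each observation by the rank of its cluster
lifestyle <- factor(rank_of_cluster[fit$cluster],
                    levels = 1:3,
                    labels = c("Sedentary", "Fairly Active", "Very Active"))

means <- tapply(active, lifestyle, mean)
print(means)
```

With this approach the group means always increase from "Sedentary" to "Very Active", whichever numbers k-means happens to assign.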

3.2 Key Questions

3.2.1 What tools are you choosing and why?

I chose R and data.table as these allow for efficient processing and visualisation. They also allow me to document the analysis using R Markdown to produce a document that meets the requirements of “Reproducible Research”.

4 Analyse

4.1 What are some trends in smart device usage?

4.1.1 Which features are people using?

Let’s see how many users use various combinations of the following features: activity (calories/steps), heart rate, weight and sleep tracking.

library(ggplot2)
library(gridExtra)
library(plotly)

How many days did each user wear the tracker? Let’s assume that if Calories >0 then the tracker was worn.

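# Days worn per user (for the ECDF) and users wearing the tracker per date (for the line plot)
days_worn <- daily[calories > 0, .N, by = .(id)]
df <- daily[calories > 0, .N, by = .(date)]
p1 <-
  ggplot(data = days_worn, aes(x = N)) +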
  stat_ecdf(geom = "point") +
  labs(x = "Number of Days Worn",
       y = "Percentage of Users") +
  scale_y_continuous(labels =  scales::percent_format()) +
  theme_bw()

p2 <-
  ggplot(data = df, aes(x = `date`, y = N)) +
  geom_line(size = 1) +
  labs(x = "Date", y = "Number of Users") +
  theme_bw()

grid.arrange(p1,p2, nrow = 1)

Most users wore the device on nearly all of the 31 dates measured. After two weeks, usage began to drop off.

Let’s investigate the number of users who logged their weight:

df <- daily[!is.na(weight_kg), .N, by = .(date, id)]
ggplotly(
  ggplot(data = df, aes(x = date, y = N, fill = id)) +
  geom_col(show.legend = FALSE) +
  labs(x = "Date", y = "Users") +
  theme_minimal()
)

5/33 users logged their weight.

One user logged almost every day.

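# Days logged per user (for the ECDF); df keeps the users per date for the line plot that follows
sleep_days <- daily[!is.na(total_sleep), .N, by = .(id)]
df <- daily[!is.na(total_sleep), .N, by = .(date)]
ggplotly(
  ggplot(data = sleep_days, aes(x = N)) +
  stat_ecdf(geom = "point") +
  labs(x = "Number of Days Logged",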
       y = "Percentage of Users") +
  scale_y_continuous(labels =  scales::percent_format()) +
  theme_bw()
)

No users logged their sleep for more than 17 days.

ggplotly(
  ggplot(data = df, aes(x = `date`, y = N)) +
  geom_line(size = 1) +
  labs(x = "Date", y = "Number of Users") +
  theme_bw()
)