Case Study: Bellabeat

Ishmael Roslan

2022-07-24

1 Ask

1.1 Case Study Briefing

1.1.1 Scenario

You are a junior data analyst working on the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. You have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights you discover will then help guide marketing strategy for the company. You will present your analysis to the Bellabeat executive team along with your high-level recommendations for Bellabeat’s marketing strategy.

1.1.2 Characters and Products

1.1.2.1 Characters

  • Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer

  • Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team.

  • Bellabeat marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy. You joined this team six months ago and have been busy learning about Bellabeat’s mission and business goals, as well as how you, as a junior data analyst, can help Bellabeat achieve them.

1.1.2.2 Products

  • Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.

  • Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.

  • Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.

  • Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.

  • Bellabeat membership: Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.

1.1.3 About the Company

Urška Sršen and Sando Mur founded Bellabeat, a high-tech company that manufactures health-focused smart products. Sršen used her background as an artist to develop beautifully designed technology that informs and inspires women around the world. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits. Since it was founded in 2013, Bellabeat has grown rapidly and quickly positioned itself as a tech-driven wellness company for women.

By 2016, Bellabeat had opened offices around the world and launched multiple products. Bellabeat products became available through a growing number of online retailers in addition to their own e-commerce channel on their website. The company has invested in traditional advertising media, such as radio, out-of-home billboards, print, and television, but focuses on digital marketing extensively. Bellabeat invests year-round in Google Search, maintains active Facebook and Instagram pages, and consistently engages consumers on Twitter. Additionally, Bellabeat runs video ads on YouTube and display ads on the Google Display Network to support campaigns around key marketing dates. Sršen knows that an analysis of Bellabeat’s available consumer data would reveal more opportunities for growth. She has asked the marketing analytics team to focus on a Bellabeat product and analyze smart device usage data in order to gain insight into how people are already using their smart devices. Then, using this information, she would like high-level recommendations for how these trends can inform Bellabeat marketing strategy.

1.2 The Business Task

  • Identify trends in the usage of activity trackers and their associated apps.

  • Recommend features and inform marketing strategy to give Bellabeat a competitive advantage and to increase market share.

1.2.1 Key Questions

  • What are the trends in usage of the trackers?
  • Are there correlations between sleep, activity and calories burned?
  • What features would improve user experience whilst also promoting better health?
  • Which features should be the focus of the marketing strategy?

2 Prepare

2.1 Data Sources

Where was the data stored?

FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius): This Kaggle data set contains personal fitness tracker data from thirty Fitbit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.

2.2 Data Import and Storage

2.2.1 Importing Data

  1. List Files in Directory
library(data.table)
files <- list.files(path = "data", full.names = T)
files
##  [1] "data/dailyActivity_merged.csv"          
##  [2] "data/dailyCalories_merged.csv"          
##  [3] "data/dailyIntensities_merged.csv"       
##  [4] "data/dailySteps_merged.csv"             
##  [5] "data/heartrate_seconds_merged.csv"      
##  [6] "data/hourlyCalories_merged.csv"         
##  [7] "data/hourlyIntensities_merged.csv"      
##  [8] "data/hourlySteps_merged.csv"            
##  [9] "data/minuteCaloriesNarrow_merged.csv"   
## [10] "data/minuteCaloriesWide_merged.csv"     
## [11] "data/minuteIntensitiesNarrow_merged.csv"
## [12] "data/minuteIntensitiesWide_merged.csv"  
## [13] "data/minuteMETsNarrow_merged.csv"       
## [14] "data/minuteSleep_merged.csv"            
## [15] "data/minuteStepsNarrow_merged.csv"      
## [16] "data/minuteStepsWide_merged.csv"        
## [17] "data/sleepDay_merged.csv"               
## [18] "data/weightLogInfo_merged.csv"
  2. Remove the minute-level and Wide tables, as they are replicated in the hourly and Narrow tables respectively. dailyCalories, dailySteps and dailyIntensities are also duplicated in dailyActivity. Keep minuteSleep for extra data on sleep stages.
files <- files[grep("(hourly|dailyA|weight|sleepD|minuteSl|heart)", files, invert = FALSE)]
files
## [1] "data/dailyActivity_merged.csv"     "data/heartrate_seconds_merged.csv"
## [3] "data/hourlyCalories_merged.csv"    "data/hourlyIntensities_merged.csv"
## [5] "data/hourlySteps_merged.csv"       "data/minuteSleep_merged.csv"      
## [7] "data/sleepDay_merged.csv"          "data/weightLogInfo_merged.csv"
  3. Extract Table Names from path

  4. Read in all files to a list of tables

  5. Clean the column names of the nested tables

  6. Assign each table as a separate variable

library(janitor)
tablenames <- gsub("(.*/)(.*)(_.*)", r"(\2)", files)
l <- lapply(files, fread, sep = ",", na.strings = c(""))
l <- lapply(l,clean_names)
for (row in 1:length(tablenames)) {
  assign(tablenames[row], l[[row]])
}

2.2.2 Daily Data

In all cases below, the date will need to be parsed from a character variable and id converted to a factor. Given that there are fewer users in the sleep and weight tables, id should be converted after merging.

Note: This data is in a tidy format, with one observational unit being a single user per date.

library(funModeling)
df_status(dailyActivity)
##                      variable q_zeros p_zeros q_na p_na q_inf p_inf      type
## 1                          id       0    0.00    0    0     0     0 integer64
## 2               activity_date       0    0.00    0    0     0     0 character
## 3                 total_steps      77    8.19    0    0     0     0   integer
## 4              total_distance      78    8.30    0    0     0     0   numeric
## 5            tracker_distance      78    8.30    0    0     0     0   numeric
## 6  logged_activities_distance     908   96.60    0    0     0     0   numeric
## 7        very_active_distance     413   43.94    0    0     0     0   numeric
## 8  moderately_active_distance     386   41.06    0    0     0     0   numeric
## 9       light_active_distance      85    9.04    0    0     0     0   numeric
## 10  sedentary_active_distance     858   91.28    0    0     0     0   numeric
## 11        very_active_minutes     409   43.51    0    0     0     0   integer
## 12      fairly_active_minutes     384   40.85    0    0     0     0   integer
## 13     lightly_active_minutes      84    8.94    0    0     0     0   integer
## 14          sedentary_minutes       1    0.11    0    0     0     0   integer
## 15                   calories       4    0.43    0    0     0     0   integer
##    unique
## 1      33
## 2      31
## 3     842
## 4     615
## 5     613
## 6      19
## 7     333
## 8     211
## 9     491
## 10      9
## 11    122
## 12     81
## 13    335
## 14    549
## 15    734
head(dailyActivity)

All complete, showing activity of 33 users over 31 dates.

df_status(sleepDay)
##               variable q_zeros p_zeros q_na p_na q_inf p_inf      type unique
## 1                   id       0       0    0    0     0     0 integer64     24
## 2            sleep_day       0       0    0    0     0     0 character     31
## 3  total_sleep_records       0       0    0    0     0     0   integer      3
## 4 total_minutes_asleep       0       0    0    0     0     0   integer    256
## 5    total_time_in_bed       0       0    0    0     0     0   integer    242
head(sleepDay)

24 users’ sleep logs over 31 days. This appears to have been summarised from minuteSleep. There is no information on sleep type though, so that will need to be derived from minuteSleep.

df_status(weightLogInfo)
##           variable q_zeros p_zeros q_na  p_na q_inf p_inf      type unique
## 1               id       0    0.00    0  0.00     0     0 integer64      8
## 2             date       0    0.00    0  0.00     0     0 character     56
## 3        weight_kg       0    0.00    0  0.00     0     0   numeric     34
## 4    weight_pounds       0    0.00    0  0.00     0     0   numeric     34
## 5              fat       0    0.00   65 97.01     0     0   integer      2
## 6              bmi       0    0.00    0  0.00     0     0   numeric     36
## 7 is_manual_report      26   38.81    0  0.00     0     0   logical      2
## 8           log_id       0    0.00    0  0.00     0     0 integer64     56
head(weightLogInfo)

Weight logs of 8 unique users over 56 unique datetimes. The date will therefore need to be parsed from the datetime column.

2.2.3 Hourly Data

In all cases below, date and time will need to be parsed from a character variable. All three tables have 33 unique users and 736 unique hours so can be joined on these variables.

Note: This data is in a tidy format, with one observational unit being a single user per hour.

df_status(hourlyCalories)
##        variable q_zeros p_zeros q_na p_na q_inf p_inf      type unique
## 1            id       0       0    0    0     0     0 integer64     33
## 2 activity_hour       0       0    0    0     0     0 character    736
## 3      calories       0       0    0    0     0     0   integer    442
head(hourlyCalories)

Looks good.

df_status(hourlyIntensities)
##            variable q_zeros p_zeros q_na p_na q_inf p_inf      type unique
## 1                id       0    0.00    0    0     0     0 integer64     33
## 2     activity_hour       0    0.00    0    0     0     0 character    736
## 3   total_intensity    9097   41.16    0    0     0     0   integer    175
## 4 average_intensity    9097   41.16    0    0     0     0   numeric    175
head(hourlyIntensities)

Looks good. It appears that total_intensity is a weighted sum of the lightly, fairly and very active minutes recorded in the minuteIntensities table, whereas average_intensity divides this by 60 to give a per-minute value.
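If that relationship holds, the two columns should differ only by a factor of 60. A quick sanity check (a sketch, not part of the original pipeline):

# Sketch: the differences should all be (close to) zero if average_intensity
# is simply total_intensity divided by 60
hourlyIntensities[, summary(average_intensity - total_intensity / 60)]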

df_status(hourlySteps)
##        variable q_zeros p_zeros q_na p_na q_inf p_inf      type unique
## 1            id       0    0.00    0    0     0     0 integer64     33
## 2 activity_hour       0    0.00    0    0     0     0 character    736
## 3    step_total    9297   42.07    0    0     0     0   integer   2222
head(hourlySteps)

Looks good.

2.2.4 Heart Rate

df_status(heartrate_seconds)
##   variable q_zeros p_zeros q_na p_na q_inf p_inf      type unique
## 1       id       0       0    0    0     0     0 integer64     14
## 2     time       0       0    0    0     0     0 character 961274
## 3    value       0       0    0    0     0     0   integer    168
head(heartrate_seconds)

Heart rate data looks good, but could be averaged by hour and joined with the hourly data.

Note: This data is in a tidy format, with one observational unit being a single user per second.

2.2.5 Sleep

df_status(minuteSleep)
##   variable q_zeros p_zeros q_na p_na q_inf p_inf      type unique
## 1       id       0       0    0    0     0     0 integer64     24
## 2     date       0       0    0    0     0     0 character  49773
## 3    value       0       0    0    0     0     0   integer      3
## 4   log_id       0       0    0    0     0     0 integer64    459
head(minuteSleep)

Note: This data is in a tidy format, with one observational unit being a single user per minute.

The sleep data poses a few challenges.

  1. The value column encodes the type of sleep, and this needs to be made explicit in the data.

  2. It would be more useful to summarise this data into a wide format, with the minutes of each sleep type per day recorded. This could then be merged with the daily tables.

  3. total_minutes_asleep from the sleepDay table does not appear to match this table.

2.3 Key Questions

Are there issues with bias or credibility in the data?

  • Reliable: The data was not particularly reliable. There were inconsistencies in the data collected but these were corrected to the best of my ability as described in the cleaning and wrangling sections.

  • Original: I cannot verify the original source of the data provided for this case study. It is a Kaggle repository containing data volunteered by the public.

  • Comprehensive: The data is not comprehensive. This is a relatively small dataset, and many tables have a lot of missing values or very few rows. The data was also volunteered by Fitbit users, so it is neither a random sample nor stratified in any way. It would be dangerous to draw conclusions from this data alone.

  • Current: These data were collected in 2016 and the landscape of the fitness tracking industry has changed a lot in the last 6 years. It would be best to seek newer data.

  • Cited: I have cited the original source to the best of my knowledge above.

3 Process

3.1 Data Wrangling

3.1.1 Sleep

I believe the discrepancy arises because sleep events often span midnight and can therefore fall on two separate dates. There are two potential ways to collate the minuteSleep data to make sleepDay:

  1. Add together the number of minutes of sleep per calendar date and record that. The observation unit would be minutes of sleep per id per date.

  2. Add together the number of minutes of sleep per id per log_id and assign the date that the sleep event began as the date.

The second of these paradigms is far more complicated, so let’s explore the first.
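For reference, a rough sketch of how the second option could be approached (hypothetical; it is not used in the rest of this analysis):

# Sketch only: count minutes per sleep event (log_id), assign each event to the
# calendar date on which it began, then total the minutes per id per date
sleep2 <- minuteSleep[, .(start = min(lubridate::mdy_hms(date)), minutes = .N),
                      by = .(id, log_id)]
sleep2 <- sleep2[, .(minutes_asleep = sum(minutes)),
                 by = .(id, date = lubridate::date(start))]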

minuteSleep[,date := lubridate::mdy_hms(date)]
minuteSleep[,date := lubridate::date(date)]
sleep1 <- minuteSleep[, .N, .(id,date)]
sleepDay[,sleep_day := lubridate::mdy_hms(sleep_day)]
sleepDay[,sleep_day := lubridate::date(sleep_day)]

head(sleepDay)
head(sleep1)

The new sleep1 correlates quite well with total_time_in_bed but there are some discrepancies and also missing data in sleepDay. I will therefore replace sleepDay with my own summarised version of minuteSleep, which will provide transparency and coherence.
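One way to make that comparison (a minimal sketch; N is the per-day minute count from sleep1):

# Sketch: line the two tables up and look at how the per-date minute counts
# compare with total_time_in_bed
comparison <- merge(sleep1, sleepDay,
                    by.x = c("id", "date"), by.y = c("id", "sleep_day"))
comparison[, summary(N - total_time_in_bed)]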

# Overwrite original table
dailySleep <- dcast(minuteSleep,
      id + date ~ value,
      value.var = "log_id", fun.aggregate = length) # minutes per sleep stage
dailySleep[, total_sleep := `1` + `2` + `3`]
setnames(dailySleep, c("1", "2", "3"), c("rem", "light", "deep"))
head(dailySleep)

The new dailySleep table shows the minutes slept on each day, broken down by sleep stage.

3.1.2 Daily Data

  1. Parse datetimes.

  2. Set keys.

  3. Merge tables

dailyActivity[,date := lubridate::mdy(activity_date)]
dailyActivity[,date := lubridate::date(date)][,activity_date :=NULL]
weightLogInfo[,date := lubridate::mdy_hms(date)]
weightLogInfo[,date := lubridate::date(date)]
setkeyv(dailySleep, c("id", "date"))
setkeyv(dailyActivity, c("id", "date"))
setkeyv(weightLogInfo, c("id", "date"))
daily <- weightLogInfo[dailySleep][dailyActivity]
daily[, id := factor(id)]
head(daily)

3.1.3 Hourly Data

Summarise heart rate by date and hour.

heartrate_seconds[, datetime := lubridate::mdy_hms(time)]
heartrate_seconds[, date := lubridate::date(datetime)]
heartrate_seconds[, hour := lubridate::hour(datetime)][, time := NULL]
hourlyHeartrate <- unique(heartrate_seconds[,
                                            heartrate := as.integer(mean(value)),
                                            .(id, date, hour)][, .(id, date, hour, heartrate)])
head(hourlyHeartrate)
  1. Parse datetimes.

  2. Set keys.

  3. Merge tables

hourlyCalories[, datetime := lubridate::mdy_hms(activity_hour)]
hourlyCalories[, date := lubridate::date(datetime)]
hourlyCalories[, hour := lubridate::hour(datetime)][, activity_hour := NULL][, datetime := NULL]
head(hourlyCalories)
hourlyIntensities[, datetime := lubridate::mdy_hms(activity_hour)]
hourlyIntensities[, date := lubridate::date(datetime)]
hourlyIntensities[, hour := lubridate::hour(datetime)][, activity_hour := NULL][, datetime := NULL]
head(hourlyIntensities)
hourlySteps[, datetime := lubridate::mdy_hms(activity_hour)]
hourlySteps[, date := lubridate::date(datetime)]
hourlySteps[, hour := lubridate::hour(datetime)][, activity_hour := NULL][, datetime := NULL]
head(hourlySteps)
setkeyv(hourlyCalories, c("id", "date", "hour"))
setkeyv(hourlyHeartrate, c("id", "date", "hour"))
setkeyv(hourlyIntensities, c("id", "date", "hour"))
setkeyv(hourlySteps, c("id", "date", "hour"))
hourly <- hourlyHeartrate[hourlyCalories][hourlyIntensities][hourlySteps]
hourly[, id := factor(id)]
# Tidy environment
rm(list=ls()[! ls() %in% c("daily","hourly")])
head(hourly)

3.1.4 Feature Engineering - User Type

Using id as a factor variable is useful; however, it would be good to split the users into groups based on their activity to spot any trends in usage. I shall use k-means clustering to group the users.

3.1.4.1 Normalisation

Select only the activity columns, then center and scale the data.

library(tidyverse)
library(tidymodels)
activity <-
  daily %>%
  select(c(very_active_minutes:sedentary_minutes)) %>%
  mutate(across(.fns=scale))
head(activity)

3.1.4.2 Try k = 3 : 6

set.seed(1234)

kclusts <-
  tibble(k = 3:6) %>%
  mutate(
    kclust = map(k, ~kmeans(activity, .x)),
    tidied = map(kclust, tidy),
    glanced = map(kclust, glance),
    augmented = map(kclust, augment, activity)
    )

clusters <- 
  kclusts %>%
  unnest(cols = c(tidied))

assignments <- 
  kclusts %>% 
  unnest(cols = c(augmented))

clusterings <- 
  kclusts %>%
  unnest(cols = c(glanced))
assignments %>%
  select(-c(kclust, tidied,glanced)) %>%
  group_by(k, .cluster) %>%
  mutate(.cluster = fct_reorder(.cluster, very_active_minutes)) %>%
  summarise(across(where(is.numeric), mean)) %>%
  pivot_longer(c(very_active_minutes:sedentary_minutes), names_to = "Lifestyle", values_to = "Minutes") %>%
  mutate(Lifestyle = fct_relevel(Lifestyle, c("sedentary_minutes", "lightly_active_minutes", "fairly_active_minutes", "very_active_minutes"))) %>%
  ggplot(aes(fill= .cluster, y = Minutes, x = .cluster)) +
  geom_col(position = "dodge") +
  facet_grid(k~Lifestyle)

k = 3 appears to give the most coherent clusters in terms of user activity.
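As a complementary check, the total within-cluster sum of squares from the clusterings table computed above can be plotted against k to look for an elbow (a sketch; output not shown):

# Sketch: an "elbow" plot of total within-cluster sum of squares against k
clusterings %>%
  ggplot(aes(x = k, y = tot.withinss)) +
  geom_line() +
  geom_point()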

3.1.4.3 Map Cluster to UserType

Double check that the Lifestyle assignments make sense.

lifs <- assignments %>%
  filter(k == 3) %>%
  select(-c(k, kclust, tidied,glanced)) %>%
  mutate(.cluster = fct_reorder(.cluster, (very_active_minutes+fairly_active_minutes))) %>%
  pull(.cluster)

daily <- daily %>%
  mutate(
    Lifestyle = lifs,
    Lifestyle = fct_recode(Lifestyle,
                           "Sedentary" = "1",
                           "Fairly Active" = "2",
                           "Very Active" = "3")
  )
# Sanity Check
daily %>%
  select(c(very_active_minutes:sedentary_minutes),Lifestyle) %>%
  group_by(Lifestyle) %>%
  summarise(across(where(is.numeric), mean)) %>%
  arrange(very_active_minutes+fairly_active_minutes)
rm(list=ls()[! ls() %in% c("daily","hourly")])
head(daily)

3.2 Key Questions

3.2.1 What tools are you choosing and why?

I chose R and data.table as these allow for efficient processing and visualisation. R also allows me to document the analysis using R Markdown, producing a document that meets the requirements of “Reproducible Research”.

4 Analyse

5 Share and Act

5.1 Key Findings

5.1.1 Usage

  1. Activity tracking (calories/steps) is by far the most popular feature, possibly because all you need to do is wear the tracker (see the usage sketch after this list).

  2. It appears that several users charged their devices overnight and therefore did not take advantage of the sleep tracking.

  3. Given that heart rate should be measured automatically whilst wearing the device, it was surprising that so many users did not record it. This could be because, in 2016, heart rate tracking was considered a premium feature, whereas it is now more commonplace.

  4. Weight tracking was not handled by the device but had to be manually input into the app, so it is unsurprising that it was rarely logged.
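A rough sketch (hypothetical, using the merged daily and hourly tables) of how this usage comparison can be made, counting the distinct users who logged each type of data at least once:

# Sketch: distinct users with at least one logged value per feature
daily %>%
  summarise(
    activity_users = n_distinct(id[!is.na(total_steps)]),
    sleep_users    = n_distinct(id[!is.na(total_sleep)]),
    weight_users   = n_distinct(id[!is.na(weight_kg)])
  )
hourly %>%
  summarise(heart_rate_users = n_distinct(id[!is.na(heartrate)]))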

5.1.2 Correlations

  1. Activity, Steps and Calories were highly correlated, as might be expected.

  2. Sleep the night before, or on the same day, did not correlate with calories burned (see the sketch below).
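A minimal sketch (hypothetical) of how the same-day relationships can be checked from the daily table; checking sleep the night before would additionally require lagging the sleep column by one day per user:

# Sketch: pairwise correlations between steps, active minutes, sleep and calories
daily %>%
  select(total_steps, very_active_minutes, total_sleep, calories) %>%
  cor(use = "pairwise.complete.obs") %>%
  round(2)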

5.1.3 What features should be the focus of marketing strategy?

  1. Marketing strategy should focus on the automation available in the product ecosystem. Users were much more likely to record data when collection was automated rather than requiring manual input.

  2. Promote fast charging, as this allows for sleep recording (instead of charging) overnight.

  3. Market the automatic insights and encouragement generated by the app, which are personalised to the data collected and encourage further data collection.

5.1.4 Limitations of Data

As previously mentioned, there are some concerns over the data.

  • Comprehensive: The data is not comprehensive. This is a relatively small dataset, and many tables have a lot of missing values or very few rows. This was also volunteered data from Fitbit users, so it is neither a random sample nor stratified in any way. It would be dangerous to draw conclusions from this data alone.

  • Current: These data were collected in 2016 and the landscape of the fitness tracking industry has changed a lot in the last 6 years.

I would recommend gathering more comprehensive and current data before acting upon conclusions drawn in this document.

5.2 Final Deliverable

Click here to view the final deliverable as a PowerPoint Presentation.