1 Ask
1.1 Case Study Briefing
1.1.1 Scenario
You are a junior data analyst working on the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. You have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights you discover will then help guide marketing strategy for the company. You will present your analysis to the Bellabeat executive team along with your high-level recommendations for Bellabeat’s marketing strategy
1.1.2 Characters and Products
1.1.2.1 Characters
Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team.
Bellabeat marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy. You joined this team six months ago and have been busy learning about Bellabeat’’s mission and business goals — as well as how you, as a junior data analyst, can help Bellabeat achieve them.
1.1.2.2 Products
Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.
Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.
Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.
Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.
Bellabeat membership: Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.
1.1.3 About the Company
Urška Sršen and Sando Mur founded Bellabeat, a high-tech company that manufactures health-focused smart products. Sršen used her background as an artist to develop beautifully designed technology that informs and inspires women around the world. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits. Since it was founded in 2013, Bellabeat has grown rapidly and quickly positioned itself as a tech-driven wellness company for women.
By 2016, Bellabeat had opened offices around the world and launched multiple products. Bellabeat products became available through a growing number of online retailers in addition to their own e-commerce channel on their website. The company has invested in traditional advertising media, such as radio, out-of-home billboards, print, and television, but focuses on digital marketing extensively. Bellabeat invests year-round in Google Search, maintaining active Facebook and Instagram pages, and consistently engages consumers on Twitter. Additionally, Bellabeat runs video ads on Youtube and display ads on the Google Display Network to support campaigns around key marketing dates. Sršen knows that an analysis of Bellabeat’s available consumer data would reveal more opportunities for growth. She has asked the marketing analytics team to focus on a Bellabeat product and analyze smart device usage data in order to gain insight into how people are already using their smart devices. Then, using this information, she would like high-level recommendations for how these trends can inform Bellabeat marketing strategy.
1.2 The Business Task
To identify trends in usage of activity trackers and their associated apps.
Recommend features and inform marketing strategy to give Bellabeat a competitive advantage and to increase market share.
1.2.1 Key Questions
- What are the trends in usage of the the trackers?
- Are there correlations between sleep, activity and calories burned?
- What features would improve user experience whilst also promoting better health?
- Which features should be the focus of the marketing strategy?
2 Prepare
2.1 Data Sources
Where was the data stored?
FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius): This Kaggle data set contains personal fitness tracker from thirty fitbit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.
2.2 Data Import and Store
2.2.1 Importing Data
- List Files in Directory
library(data.table)
files <- list.files(path = "data", full.names = T)
files
## [1] "data/dailyActivity_merged.csv"
## [2] "data/dailyCalories_merged.csv"
## [3] "data/dailyIntensities_merged.csv"
## [4] "data/dailySteps_merged.csv"
## [5] "data/heartrate_seconds_merged.csv"
## [6] "data/hourlyCalories_merged.csv"
## [7] "data/hourlyIntensities_merged.csv"
## [8] "data/hourlySteps_merged.csv"
## [9] "data/minuteCaloriesNarrow_merged.csv"
## [10] "data/minuteCaloriesWide_merged.csv"
## [11] "data/minuteIntensitiesNarrow_merged.csv"
## [12] "data/minuteIntensitiesWide_merged.csv"
## [13] "data/minuteMETsNarrow_merged.csv"
## [14] "data/minuteSleep_merged.csv"
## [15] "data/minuteStepsNarrow_merged.csv"
## [16] "data/minuteStepsWide_merged.csv"
## [17] "data/sleepDay_merged.csv"
## [18] "data/weightLogInfo_merged.csv"
- Remove
minutesandWidetables as they they are replicated inhourlyandNarrowtables respectively.dailyCalories, `dailySteps`anddailyIntensitiesare also duplicated indailyActivity. KeepminuteSleepfor extra data on sleep stages.
files <-files[grep("(hourly|dailyA|weight|sleepD|minuteSl|heart)", files, invert = FALSE)]
files
## [1] "data/dailyActivity_merged.csv" "data/heartrate_seconds_merged.csv"
## [3] "data/hourlyCalories_merged.csv" "data/hourlyIntensities_merged.csv"
## [5] "data/hourlySteps_merged.csv" "data/minuteSleep_merged.csv"
## [7] "data/sleepDay_merged.csv" "data/weightLogInfo_merged.csv"
Extract Table Names from path
Read in all files to a list of tables
Clean the column names of the nested tables
Assign each table as a separate variable
library(janitor)
tablenames <- gsub("(.*/)(.*)(_.*)", r"(\2)", files)
l <- lapply(files, fread, sep = ",", na.strings = c(""))
l <- lapply(l,clean_names)
for (row in 1:length(tablenames)) {
assign(tablenames[row], l[[row]])
}
2.2.2 Daily Data
In all cases below, date will need to be parsed from a character
variable and Id as a factor variable. Given that there are
fewer users in the sleep and weight tables, Id should be
parsed after merging.
Note: This data is in a tidy format,with one observational unit being a single user per date.
library(funModeling)
df_status(dailyActivity)
## variable q_zeros p_zeros q_na p_na q_inf p_inf type
## 1 id 0 0.00 0 0 0 0 integer64
## 2 activity_date 0 0.00 0 0 0 0 character
## 3 total_steps 77 8.19 0 0 0 0 integer
## 4 total_distance 78 8.30 0 0 0 0 numeric
## 5 tracker_distance 78 8.30 0 0 0 0 numeric
## 6 logged_activities_distance 908 96.60 0 0 0 0 numeric
## 7 very_active_distance 413 43.94 0 0 0 0 numeric
## 8 moderately_active_distance 386 41.06 0 0 0 0 numeric
## 9 light_active_distance 85 9.04 0 0 0 0 numeric
## 10 sedentary_active_distance 858 91.28 0 0 0 0 numeric
## 11 very_active_minutes 409 43.51 0 0 0 0 integer
## 12 fairly_active_minutes 384 40.85 0 0 0 0 integer
## 13 lightly_active_minutes 84 8.94 0 0 0 0 integer
## 14 sedentary_minutes 1 0.11 0 0 0 0 integer
## 15 calories 4 0.43 0 0 0 0 integer
## unique
## 1 33
## 2 31
## 3 842
## 4 615
## 5 613
## 6 19
## 7 333
## 8 211
## 9 491
## 10 9
## 11 122
## 12 81
## 13 335
## 14 549
## 15 734
head(dailyActivity)
All complete, showing activity of 33 users over 31 dates.
df_status(sleepDay)
## variable q_zeros p_zeros q_na p_na q_inf p_inf type unique
## 1 id 0 0 0 0 0 0 integer64 24
## 2 sleep_day 0 0 0 0 0 0 character 31
## 3 total_sleep_records 0 0 0 0 0 0 integer 3
## 4 total_minutes_asleep 0 0 0 0 0 0 integer 256
## 5 total_time_in_bed 0 0 0 0 0 0 integer 242
head(sleepDay)
24 users’ sleep logs over 31 days. This appears to have been
summarised from minuteSleep. No information on Sleep Type
though, so will need to get from minuteSleep.
df_status(weightLogInfo)
## variable q_zeros p_zeros q_na p_na q_inf p_inf type unique
## 1 id 0 0.00 0 0.00 0 0 integer64 8
## 2 date 0 0.00 0 0.00 0 0 character 56
## 3 weight_kg 0 0.00 0 0.00 0 0 numeric 34
## 4 weight_pounds 0 0.00 0 0.00 0 0 numeric 34
## 5 fat 0 0.00 65 97.01 0 0 integer 2
## 6 bmi 0 0.00 0 0.00 0 0 numeric 36
## 7 is_manual_report 26 38.81 0 0.00 0 0 logical 2
## 8 log_id 0 0.00 0 0.00 0 0 integer64 56
head(weightLogInfo)
Weight logs of 8 unique users over 56 unique datetimes. This means we need to parse the date from this.
2.2.3 Hourly Data
In all cases below, date and time will need to be parsed from a character variable. All three tables have 33 unique users and 736 unique hours so can be joined on these variables.
Note: This data is in a tidy format,with one observational unit being a single user per hour.
df_status(hourlyCalories)
## variable q_zeros p_zeros q_na p_na q_inf p_inf type unique
## 1 id 0 0 0 0 0 0 integer64 33
## 2 activity_hour 0 0 0 0 0 0 character 736
## 3 calories 0 0 0 0 0 0 integer 442
head(hourlyCalories)
Looks good.
df_status(hourlyIntensities)
## variable q_zeros p_zeros q_na p_na q_inf p_inf type unique
## 1 id 0 0.00 0 0 0 0 integer64 33
## 2 activity_hour 0 0.00 0 0 0 0 character 736
## 3 total_intensity 9097 41.16 0 0 0 0 integer 175
## 4 average_intensity 9097 41.16 0 0 0 0 numeric 175
head(hourlyIntensities)
Looks good. It appears that total intensity is a weighted sum of
LightlyActiveMinutes, FairlyActiveMinutes and
VeryActiveMinutes from the minuteIntensities
table, whereas average_intensity divides this by 60 to get
a value per minute.
df_status(hourlySteps)
## variable q_zeros p_zeros q_na p_na q_inf p_inf type unique
## 1 id 0 0.00 0 0 0 0 integer64 33
## 2 activity_hour 0 0.00 0 0 0 0 character 736
## 3 step_total 9297 42.07 0 0 0 0 integer 2222
head(hourlySteps)
Looks good.
2.2.4 Heart Rate
df_status(heartrate_seconds)
## variable q_zeros p_zeros q_na p_na q_inf p_inf type unique
## 1 id 0 0 0 0 0 0 integer64 14
## 2 time 0 0 0 0 0 0 character 961274
## 3 value 0 0 0 0 0 0 integer 168
head(heartrate_seconds)
Heart rate data looks good, but could be averaged by hour and joined with the hourly data.
Note: This data is in a tidy format,with one observational unit being a single user per second.
2.2.5 Sleep
df_status(minuteSleep)
## variable q_zeros p_zeros q_na p_na q_inf p_inf type unique
## 1 id 0 0 0 0 0 0 integer64 24
## 2 date 0 0 0 0 0 0 character 49773
## 3 value 0 0 0 0 0 0 integer 3
## 4 log_id 0 0 0 0 0 0 integer64 459
head(minuteSleep)
Note: This data is in a tidy format,with one observational unit being a single user per minute.
The sleep data poses a few challenges.
The
valuecolumn identifies the type of sleep and this must be reflected in the data.It would be more useful to summarise this data into a wide format, with the minutes of each sleep type per day were recorded. This could then be merged with the daily tables.
total_minutes_asleepfrom thesleepDaytable does not appear to match with this table.
2.3 Key Questions
Are there issues with bias or credibility in the data?
Reliable: The data was not particularly reliable. There were inconsistencies in the data collected but these were corrected to the best of my ability as described in the cleaning and wrangling sections.
Original: I cannot locate the original data source that was provided for this case study. It is a Kaggle repository with data offered by the public.
Comprehensive: The data is not comprehensive, this is a relatively small dataset and many tables have a lot of missing values, or just very few rows. This was also volunteered data from Fitbit users and so is neither a random sample, nor stratified in any way. IT would be dangerous to draw conclusions from this data alone.
Current: These data were collected in 2016 and the landscape of the fitness tracking industry has changed a lot in the last 6 years. It would be best to seek newer data.
Cited: I have cited the original source to the best of my knowledge above.
3 Process
3.1 Data Wrangling
3.1.1 Sleep
I believe the problem arises as sleep events usually span midnight
and therefore can occur on two separate dates. There are two potential
ways to collate the minuteSleep data to make
sleepDay:
Add together the number of minutes of sleep per calendar date and record that. The observation unit would be minutes of sleep per
idperdateAdd together the number of minutes of sleep per
idperlog_idand assign the date that the sleep event began as thedate.
The second of these paradigms is far more complicated, so lets explore the first.
minuteSleep[,date := lubridate::mdy_hms(date)]
minuteSleep[,date := lubridate::date(date)]
sleep1 <- minuteSleep[, .N, .(id,date)]
sleepDay[,sleep_day := lubridate::mdy_hms(sleep_day)]
sleepDay[,sleep_day := lubridate::date(sleep_day)]
head(sleepDay)
head(sleep1)
The new sleep1 correlates quite well with
total_time_in_bed but there are some discrepancies and also
missing data in sleepDay. I will therefore replace
sleepDay with my own summarised version of
minuteSleep, which will provide transparency and
coherence.
# Overwrite original table
dailySleep <- dcast(minuteSleep,
id + date ~ value)
dailySleep[, total_sleep := `1` + `2` + `3`]
setnames(dailySleep, c("1", "2", "3"), c("rem", "light", "deep"))
head(dailySleep)
The new dailySleep table shows the minutes of each day
slept (for the longest period of sleep), and in which stages.
3.1.2 Daily Data
Parse datetimes.
Set keys.
Merge tables
dailyActivity[,date := lubridate::mdy(activity_date)]
dailyActivity[,date := lubridate::date(date)][,activity_date :=NULL]
weightLogInfo[,date := lubridate::mdy_hms(date)]
weightLogInfo[,date := lubridate::date(date)]
setkeyv(dailySleep, c("id", "date"))
setkeyv(dailyActivity, c("id", "date"))
setkeyv(weightLogInfo, c("id", "date"))
daily <- weightLogInfo[dailySleep][dailyActivity]
daily[, id := factor(id)]
head(daily)
3.1.3 Hourly Data
Summarise heartrate by date, hour.
heartrate_seconds[, datetime := lubridate::mdy_hms(time)]
heartrate_seconds[, date := lubridate::date(datetime)]
heartrate_seconds[, hour := lubridate::hour(datetime)][, time := NULL]
hourlyHeartrate <- unique(heartrate_seconds[,
heartrate := as.integer(mean(value)),
.(id, date, hour)][, .(id, date, hour, heartrate)])
head(hourlyHeartrate)
Parse datetimes.
Set keys.
Merge tables
hourlyCalories[, datetime := lubridate::mdy_hms(activity_hour)]
hourlyCalories[, date := lubridate::date(datetime)]
hourlyCalories[, hour := lubridate::hour(datetime)][, activity_hour := NULL][, datetime := NULL]
head(hourlyCalories)
hourlyIntensities[, datetime := lubridate::mdy_hms(activity_hour)]
hourlyIntensities[, date := lubridate::date(datetime)]
hourlyIntensities[, hour := lubridate::hour(datetime)][, activity_hour := NULL][, datetime := NULL]
head(hourlyIntensities)
hourlySteps[, datetime := lubridate::mdy_hms(activity_hour)]
hourlySteps[, date := lubridate::date(datetime)]
hourlySteps[, hour := lubridate::hour(datetime)][, activity_hour := NULL][, datetime := NULL]
head(hourlySteps)
setkeyv(hourlyCalories, c("id", "date", "hour"))
setkeyv(hourlyHeartrate, c("id", "date", "hour"))
setkeyv(hourlyIntensities, c("id", "date", "hour"))
setkeyv(hourlySteps, c("id", "date", "hour"))
hourly <- hourlyHeartrate[hourlyCalories][hourlyIntensities][hourlySteps]
hourly[, id := factor(id)]
# Tidy environment
rm(list=ls()[! ls() %in% c("daily","hourly")])
head(hourly)
3.1.4 Feature Engineering - User Type
Using id as a factor variable is useful, however, it
would be good to split the users into groups based on their Activity to
spot any trends is usage. I shall use k-means clustering to group the
users
3.1.4.1 Normalisation
Select only the Activity Columns and then center and scale the data.
library(tidyverse)
library(tidymodels)
activity <-
daily %>%
select(c(very_active_minutes:sedentary_minutes)) %>%
mutate(across(.fns=scale))
head(activity)
3.1.4.2 Try k = 3 : 7
set.seed(1234)
kclusts <-
tibble(k = 3:6) %>%
mutate(
kclust = map(k, ~kmeans(activity, .x)),
tidied = map(kclust, tidy),
glanced = map(kclust, glance),
augmented = map(kclust, augment, activity)
)
clusters <-
kclusts %>%
unnest(cols = c(tidied))
assignments <-
kclusts %>%
unnest(cols = c(augmented))
clusterings <-
kclusts %>%
unnest(cols = c(glanced))
assignments %>%
select(-c(kclust, tidied,glanced)) %>%
group_by(k, .cluster) %>%
mutate(.cluster = fct_reorder(.cluster, very_active_minutes)) %>%
summarise(across(where(is.numeric), mean)) %>%
pivot_longer(c(very_active_minutes:sedentary_minutes), names_to = "Lifestyle", values_to = "Minutes") %>%
mutate(Lifestyle = fct_relevel(Lifestyle, c("sedentary_minutes", "lightly_active_minutes", "fairly_active_minutes", "very_active_minutes"))) %>%
ggplot(aes(fill= .cluster, y = Minutes, x = .cluster)) +
geom_col(position = "dodge") +
facet_grid(k~Lifestyle)
k = 3 appears to give the most coherent clusters in terms of user activity.
3.1.4.3 Map Cluster to UserType
Double check that the Lifestyle assignments make sense.
lifs <- assignments %>%
filter(k == 3) %>%
select(-c(k, kclust, tidied,glanced)) %>%
mutate(.cluster = fct_reorder(.cluster, (very_active_minutes+fairly_active_minutes))) %>%
pull(.cluster)
daily <- daily %>%
mutate(
Lifestyle = lifs,
Lifestyle = fct_recode(Lifestyle,
"Sedentary" = "1",
"Fairly Active" = "2",
"Very Active" = "3")
)
# Sanity Check
daily %>%
select(c(very_active_minutes:sedentary_minutes),Lifestyle) %>%
group_by(Lifestyle) %>%
summarise(across(where(is.numeric), mean)) %>%
arrange(very_active_minutes+fairly_active_minutes)
rm(list=ls()[! ls() %in% c("daily","hourly")])
head(daily)
3.2 Key Questions
3.2.1 What tools are you choosing and why?
I chose R and data.table as these allow for efficient processing and visualisation. It also allows for me to document the analysis using Rmarkdown to produce a document that meets the requirements of “Reproducible Research”.
4 Analyse
4.1 What are some trends in smart device usage?
4.1.1 Which features are people using?
Let’s see how many users use various combinations of the following features: activity (calories/steps), heartrate, weight and sleep tracking;
library(ggplot2)
library(gridExtra)
library(plotly)
How many days did each user wear the tracker? Let’s assume that if
Calories >0 then the tracker was worn.
df <- daily[calories > 0,.N,by = .(date)]
p1 <-
ggplot(data = df, aes(x = N)) +
stat_ecdf(geom = "point") +
labs(x = "Number of Days Worn",
y = "Percentage of Users") +
scale_y_continuous(labels = scales::percent_format()) +
theme_bw()
p2 <-
ggplot(data = df, aes(x = `date`, y = N)) +
geom_line(size = 1) +
labs(x = "Date", y = "Number of Users") +
theme_bw()
grid.arrange(p1,p2, nrow = 1)
Most users wore the device for 30-32 out of the 32 days measured. After 2 weeks, usage began to drop off.
Lets investigate the number of users who logged their weight;
df <- daily[!is.na(weight_kg), .N, by = .(date, id)]
ggplotly(
ggplot(data = df, aes(x = date, y = N, fill = id)) +
geom_col(show.legend = FALSE) +
labs(x = "Date", y = "Users") +
theme_minimal()
)
5/33 users logged their weight.
One user logged almost every day.
df <- daily[!is.na(total_sleep),.N,.(date)]
ggplotly(
ggplot(data = df, aes(x = N)) +
stat_ecdf(geom = "point") +
labs(x = "Number of Days Worn",
y = "Percentage of Users") +
scale_y_continuous(labels = scales::percent_format()) +
theme_bw()
)
No users logged their sleep for more than 17 days.
ggplotly(
ggplot(data = df, aes(x = `date`, y = N)) +
geom_line(size = 1) +
labs(x = "Date", y = "Number of Users") +
theme_bw()
)
No users logged their sleep for more than 17 days.
df <-
hourly %>%
drop_na(heartrate) %>%
group_by(date,id) %>%
summarise(heartrate = mean(heartrate)) %>%
ungroup(id) %>%
count()
ggplotly(
ggplot(data = df, aes(x = n)) +
stat_ecdf(geom = "point") +
labs(x = "Number of Days Worn",
y = "Percentage of Users") +
scale_y_continuous(labels = scales::percent_format()) +
theme_bw()
)
No more than 13 users recorded heartrate.
ggplotly(
ggplot(data = df, aes(x = `date`, y = n)) +
geom_line(size = 1) +
labs(x = "Date", y = "Number of Users") +
theme_bw()
)
No more than 13 users recorded heartrate.
Let’s look at the intersection of feature usage by user
#library(ggvenn)
library(UpSetR)
upset(
fromList(
list(
activity = unique(daily[calories > 0, as.character(id)]),
sleep = unique(daily[total_sleep > 0, as.character(id)]),
weight = unique(daily[weight_kg > 0, as.character(id)]),
heart = unique(hourly[heartrate > 0, as.character(id)])
)
),
order.by = "freq",
mb.ratio = c(0.30, 0.70),
point.size = 3.5,
line.size = 2,
text.scale = 1.5,
sets.x.label = "Users"
)
An UpSet plot showing how many users used each feature on at least one day.
4.1.2 Are there correlations between sleep, activity and calories burned?
Is there a relationship between total sleep the night before and calories burned by user type?
# Engineer total_sleep
yesterday <- daily %>%
select(id, date, total_sleep) %>%
mutate(date = date + lubridate::days(1)) %>%
dplyr::rename(yesterday_sleep = total_sleep)
daily %>%
left_join(yesterday, by = c("id","date")) %>%
#select(date, yesterday_sleep, calories, Lifestyle) %>%
ggplot(aes(x = yesterday_sleep, y = calories)) +
geom_point() +
geom_smooth()