forked from dataquestio/solutions
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Mission516Solutions.Rmd
147 lines (115 loc) · 4.12 KB
/
Mission516Solutions.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
---
title: "Guided Project: Creating An Efficient Data Analysis Workflow, Part 2"
output: html_document
---
```{r}
library(tidyverse)
library(lubridate)
sales <- read_csv("sales2019.csv")
```
# Data Exploration
```{r}
# How big is the dataset?
dim(sales)
```
```{r}
# What are the column names?
colnames(sales)
```
The `date` column shows the date that the order of books was made. This will help us distinguish between orders that were made before and after the new program was implemented. `quantity` describes how many books were made, and `user_submitted_review` looks like it's a hand typed review of the books themselves. `customer_type` indicates whether or not the customer was an individual or a business. It seems that the company has started selling in bulk to other business too.
```{r}
# What are the types of all the columns?
for (col in colnames(sales)) {
paste0(col, " : ", typeof(sales[[col]])) %>% print
}
```
```{r}
# Is there missing data anywhere?
for (col in colnames(sales)) {
paste0(col,
", number of missing data rows: ",
is.na(sales[[col]]) %>% sum) %>% print
}
```
The `user_submitted_review` column has some missing data in it. We'll have to handle this later in the data cleaning, but at least we know about it ahead of time. The `total_purchased` column also has missing data, which we'll handle with imputation.
# Handling Missing Data
```{r}
# Remove the rows with no user_submitted_review
complete_sales <- sales %>%
filter(
!is.na(user_submitted_review)
)
# Calculate the mean of the total_purchased column, without the missing values
purchase_mean <- complete_sales %>%
filter(!is.na(total_purchased)) %>%
pull(total_purchased) %>%
mean
# Assign this mean to all of the rows where total_purchased was NA
complete_sales <- complete_sales %>%
mutate(
imputed_purchases = if_else(is.na(total_purchased),
purchase_mean,
total_purchased)
)
```
# Processing Review Data
```{r}
complete_sales %>% pull(user_submitted_review) %>% unique
```
The reviews range from outright hate ("Hated it") to positive ("Awesome!"). We'll create a function that uses a `case_when()` function to produce the output. `case_when()` functions can be incredibly bulky in cases where there's many options, but housing it in a function to `map` can make our code cleaner.
```{r}
is_positive <- function(review) {
review_positive = case_when(
str_detect(review, "Awesome") ~ TRUE,
str_detect(review, "OK") ~ TRUE,
str_detect(review, "Never") ~ TRUE,
str_detect(review, "a lot") ~ TRUE,
TRUE ~ FALSE # The review did not contain any of the above phrases
)
}
complete_sales <- complete_sales %>%
mutate(
is_positive = unlist(map(user_submitted_review, is_positive))
)
```
# Comparing Book Sales Between Pre- and Post-Program Sales
```{r}
complete_sales <- complete_sales %>%
mutate(
date_status = if_else(mdy(date) < ymd("2019/07/01"), "Pre", "Post")
)
complete_sales %>%
group_by(date_status) %>%
summarize(
books_purchased = sum(imputed_purchases)
)
```
It doesn't seem that the program has increased sales. Maybe there were certain books that increased in sales?
```{r}
complete_sales %>%
group_by(date_status, title) %>%
summarize(
books_purchased = sum(imputed_purchases)
) %>%
arrange(title, date_status)
```
It turns out that certain books actually got more popular after the program started! R For Dummies and Secrets of R For Advanced Students got more popular.
# Comparing Book Sales Within Customer Type
```{r}
complete_sales %>%
group_by(date_status, customer_type) %>%
summarize(
books_purchased = sum(imputed_purchases)
) %>%
arrange(customer_type, date_status)
```
Baserd on the table, it looks like businesses started purchasing more books after the program! There was actually a drop in individual sales.
# Comparing Review Sentiment Between Pre- and Post-Program Sales
```{r}
complete_sales %>%
group_by(date_status) %>%
summarize(
num_positive_reviews = sum(is_positive)
)
```
There's slightly more reviews before the program, but this difference seems negigible.