ECON 575 – Data Analysis
10 Points
All answers to problem set questions must be typed so they can be reviewed by Turnitin.
These are premium bottle warmers! (10 points)
This problem requires the posted baby_data.csv file.
You have just begun working for a baby products company called BabiesRWe (in no way related to another company by a similar name). The company has a database of 1 million existing customers who are registered with an account on its website.
Just prior to your arrival, the company sent an advertisement via email to a random sample of its existing customers for a premium bottle warmer. Now, the company wants to send out another advertisement to another sample of existing customers for the same product. However, instead of choosing randomly, they want to target the customers most likely to respond based on the results of the first round of advertisements. You have been given a data set by your manager (baby_data.csv) with the goal of creating a classification model to send targeted ads.
Target variable:
• purchased: whether the customer used the discount offer
Attributes:
• repeat_customer: whether the customer has previously purchased a product from BabiesRWe
• total_spent: the total amount of money the customer has spent on BabiesRWe products
• children: how many children the customer has
• adults: how many adults live in the customer’s household
You also have the following information about the product and advertisement:
• Bottle warmer price: $40
• Bottle warmer cost: $10
• Advertisement cost: $0.50
A) [1 point] Create a cost/benefit matrix for this situation using the information above (I recommend just using the create table function in Word).
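As a rough illustration of how a cost/benefit matrix can be laid out, the sketch below fills in one cell per combination of decision and outcome. It rests on assumptions that are not stated in the problem: an ad is sent only to customers the model predicts will purchase, customers who receive no ad contribute nothing, and the function name cost_benefit_matrix is hypothetical. Other conventions (for example, counting a missed buyer as an opportunity cost) would give different cells.

```python
def cost_benefit_matrix(price, unit_cost, ad_cost):
    """Return the per-customer profit for each (decision, outcome) pair.

    Assumption (not from the problem text): customers who are not sent
    an ad generate neither revenue nor cost.
    """
    return {
        ("send", "purchase"): price - unit_cost - ad_cost,  # true positive
        ("send", "no purchase"): -ad_cost,                  # false positive
        ("no send", "purchase"): 0.0,                       # false negative
        ("no send", "no purchase"): 0.0,                    # true negative
    }


# Values taken from the product and advertisement information above.
matrix = cost_benefit_matrix(price=40.0, unit_cost=10.0, ad_cost=0.50)
for outcome, value in matrix.items():
    print(outcome, value)
```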
B) [2 points] Upload baby_data.csv to BigML. Split the data into Training (80%) and Test (20%) sets and enable the “linear split” option by clicking the button.
Use the training set to create 2 models with purchased as the target variable: a decision tree and a logistic regression. Evaluate both models on the test set and report the precision, recall, and ROC AUC for each model at a 50% probability threshold. Explain what each of these measures means in this context.
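For reference, the three measures can be computed by hand from a confusion matrix and a list of predicted probabilities. The sketch below uses the textbook definitions; the counts and scores are made-up toy values, not results from baby_data.csv, and BigML may round or present the numbers differently.

```python
def precision(tp, fp):
    """Share of customers predicted to purchase who actually purchased."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Share of actual purchasers that the model predicted to purchase."""
    return tp / (tp + fn) if tp + fn else 0.0


def roc_auc(y_true, scores):
    """Probability that a random purchaser is scored above a random
    non-purchaser (ties count as half), computed over all pairs."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical confusion-matrix counts at a 50% threshold.
print(precision(tp=30, fp=10))  # 0.75
print(recall(tp=30, fn=20))     # 0.6

# Hypothetical labels (1 = purchased) and predicted probabilities.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.55, 0.45, 0.4, 0.3, 0.1]
print(roc_auc(y_true, scores))  # 0.6875
```

Note that precision and recall depend on the chosen threshold, while ROC AUC is computed from the ranking of scores and does not.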
C) [1 point] Explain what the probability threshold means in this context and discuss the relationship between precision and recall that you see in each model as you vary the probability threshold.
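To see how the threshold drives the trade-off, the following sketch reclassifies the same toy predictions at several thresholds. The labels and scores are invented for illustration and the function name precision_recall_at is hypothetical; with real data the pattern is usually similar but need not be perfectly monotonic.

```python
def precision_recall_at(y_true, scores, threshold):
    """Predict 'purchase' when score >= threshold, then score the result."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels (1 = purchased) and predicted probabilities.
y_true = [1, 1, 0, 1, 0, 1, 0, 0]
scores = [0.95, 0.85, 0.75, 0.65, 0.55, 0.35, 0.25, 0.15]

for threshold in (0.3, 0.6, 0.8):
    p, r = precision_recall_at(y_true, scores, threshold)
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

In this toy example, raising the threshold sends ads to fewer customers, so precision rises while recall falls.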
D) [1 point] Report the lift of each model (which is a percent in BigML) at 10% of positive instances and interpret what the model's lift means in this context.
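One common textbook definition of lift is the purchase rate among the top-scored fraction of customers divided by the overall purchase rate. The sketch below applies that definition to invented data; it is an assumption that this matches the quantity BigML reports, since BigML expresses lift as a percent and may define the chart axes differently. The function name lift_at is hypothetical.

```python
def lift_at(y_true, scores, fraction):
    """Purchase rate in the top `fraction` of customers ranked by score,
    divided by the purchase rate across all customers."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    k = max(1, round(fraction * len(ranked)))
    top_rate = sum(y for _, y in ranked[:k]) / k
    base_rate = sum(y_true) / len(y_true)
    return top_rate / base_rate


# Hypothetical data: 20 customers, 5 of whom purchased (1 = purchased).
y_true = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [1 - i / 20 for i in range(20)]  # strictly decreasing scores

print(lift_at(y_true, scores, 0.10))  # 4.0
```

A lift of 4.0 here means that the top 10% of customers by score purchase at four times the rate of a randomly chosen customer.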
E) [5 points] Suppose that you are given a fixed budget of $50,000 to email targeted ads for the bottle warmer and that you decide to set your models to a modestly conservative 60% probability threshold.
Using the confusion matrices from the BigML output and the cost/benefit matrix from Part A, what is the expected profit for each targeted advertisement sent when using the decision tree and when using logistic regression subject to your budget constraint?
Based on your calculations, which model yields a greater expected profit, and would you recommend BabiesRWe send targeted ads?
This part requires several steps and new model evaluations (use the appropriate range to non-randomly get the test sample you need; see video for details). Think carefully about how to use the information available to you to make your calculations and, ultimately, your modeling decision and recommendation. I recommend showing as much work as you can.
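The arithmetic in Part E can be organized along the following lines. This is a sketch of one possible reading, not a prescribed method: the confusion-matrix counts, test-set size, and per-cell profits are hypothetical placeholders, and the code assumes that ads go only to customers predicted to purchase and that the test-set rates carry over to the full customer base.

```python
def expected_profit_per_ad(tp, fp, benefit_tp, cost_fp):
    """Average profit per ad sent, given that ads go only to
    predicted purchasers (true positives and false positives)."""
    return (tp * benefit_tp + fp * cost_fp) / (tp + fp)


# Hypothetical test-set results at a 60% threshold (replace with your own).
tp, fp, n_test = 120, 80, 1000

# Hypothetical cost/benefit cells; take these from your Part A matrix.
benefit_tp, cost_fp = 29.5, -0.5

budget, ad_cost = 50_000.0, 0.50
pool_size = 1_000_000  # customer database size stated in the problem

ads_affordable = int(budget / ad_cost)
# Assumed: the share of customers flagged in the test set holds for the pool.
flagged_in_pool = pool_size * (tp + fp) // n_test
ads_sent = min(ads_affordable, flagged_in_pool)

per_ad = expected_profit_per_ad(tp, fp, benefit_tp, cost_fp)
print(ads_sent, per_ad, ads_sent * per_ad)
```

Running the same calculation with each model's confusion matrix gives two per-ad figures that can be compared directly.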