Udacity Data Science Nanodegree Project
For this project, three datasets were provided that mimic customer behavior on the Starbucks rewards mobile app. A few times a week, Starbucks sends out offers to users of the mobile app. Each offer has its own expiration date and time. Some of the challenges are as follows:
- Not all users receive the same offer
- Not many customers completed the offer
- Some users might receive an offer and never open it, but still complete its requirement.
Types of offers
- Discount (e.g., spend $10 and get $2 off)
- BOGO (buy one, get one free)
- Informational (7-day validity)
1. The business problem that I would like to solve is how to get more customers to complete the offers.
2. How can Starbucks reward customers who make purchases during the promotional period without viewing the offer, for their loyalty?
Data is contained in three files:
- portfolio.json: offer ids and metadata about each offer (duration, type, etc.)
- profile.json: demographic data for each customer
- transcript.json: records for transactions, offers received, offers viewed, and offers completed
Here is the schema and explanation of each dataset:
portfolio.json
- id (string): offer id
- offer_type (string): type of offer, i.e., BOGO, discount, informational
- difficulty (int): minimum required spend to complete an offer
- reward (int): reward given for completing an offer
- duration (int): time for offer to be open, in days
- channels (list of strings)
profile.json
- age (int): age of the customer
- became_member_on (int): date when customer created an app account
- gender (str): gender of the customer (note some entries contain ‘O’ for other rather than M or F)
- id (str): customer id
- income (float): customer’s income
transcript.json
- event (str): record description (i.e., transaction, offer received, offer viewed, etc.)
- person (str): customer id
- time (int): time in hours since start of test. The data begins at time t=0
- value (dict of strings): either an offer id or transaction amount depending on the record
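Because the value field holds either an offer id or a transaction amount, it usually has to be flattened before any analysis. The following is a minimal sketch of one way to do that, not the code used in the project. The toy rows, the helper name expand_value, and the two key spellings ("offer id" and "offer_id") are assumptions made for illustration.

```python
import pandas as pd

# Toy rows shaped like transcript.json; ids and amounts are made up.
transcript = pd.DataFrame({
    "person": ["p1", "p1", "p2"],
    "event": ["offer received", "transaction", "offer completed"],
    "time": [0, 6, 12],
    "value": [{"offer id": "o1"}, {"amount": 4.5}, {"offer_id": "o1", "reward": 2}],
})


def expand_value(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: flatten the `value` dict into two plain columns."""
    out = df.drop(columns="value").copy()
    # Assumption: the offer id may sit under either "offer id" or "offer_id".
    out["offer_id"] = df["value"].apply(lambda v: v.get("offer id", v.get("offer_id")))
    # Rows that are not transactions have no amount and end up missing.
    out["amount"] = df["value"].apply(lambda v: v.get("amount"))
    return out


flat = expand_value(transcript)
print(flat)
```

After this step, each row has a plain offer_id or amount column, which makes later grouping and merging simpler.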
Exploratory Data Analysis on Datasets
1. Load all the necessary libraries, then read in all three datasets. To get a better understanding of the data, inspect each dataframe with head, shape, info, describe, and isnull.
- profile shape (14825, 8)
- transcript shape (306534, 4)
- portfolio shape (10, 6)
2. Check for missing values in all three datasets. There are no missing values in portfolio or transcript. However, the profile dataset has 2175 missing values for income and age.
3. Fill NaN values with zero, and for income replace the missing values with the mean. Also, age has outliers at 118. Drop those rows, which look like fake accounts.
4. Split the “channels” column in the portfolio dataset into 4 additional columns and drop the original “channels” column. The new column names are channel_email, channel_mobile, channel_web, and channel_social.
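The cleaning steps above can be sketched on toy data as follows. This is an illustrative sketch rather than the project's actual code. The sample rows are invented, and dropping the age-118 rows before filling income with the mean is an assumed ordering.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for profile.json and portfolio.json; all values are made up.
profile = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "age": [35, 118, 62, 48],
    "gender": ["F", None, "M", "O"],
    "income": [52000.0, np.nan, 88000.0, np.nan],
})
portfolio = pd.DataFrame({
    "id": ["o1", "o2"],
    "channels": [["email", "mobile", "social"], ["web", "email"]],
})

# Drop the placeholder age of 118 (assumed to mark fake or incomplete accounts),
# then fill any remaining income gaps with the mean income.
profile_clean = profile[profile["age"] != 118].copy()
profile_clean["income"] = profile_clean["income"].fillna(profile_clean["income"].mean())

# Create one 0/1 indicator column per channel, then drop the original list column.
for channel in ["email", "mobile", "web", "social"]:
    portfolio[f"channel_{channel}"] = portfolio["channels"].apply(lambda c: int(channel in c))
portfolio = portfolio.drop(columns="channels")

print(profile_clean)
print(portfolio)
```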
Visualize each dataset to look for insights before the merge.
Transaction insightful information
The figures show how many people received an offer, viewed the offer, and completed the offer. Only 11% of the customers completed the offer, and that is one of the biggest challenges: how can Starbucks get more customers to complete the offer?
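The text does not say how the 11% figure was derived. One possible reading is the share of each event type among all records, which can be sketched as below. The event counts are invented for illustration and do not reproduce the real dataset.

```python
import pandas as pd

# Made-up event log: 5 received, 3 viewed, 1 completed, 1 transaction.
events = pd.Series(
    ["offer received"] * 5 + ["offer viewed"] * 3
    + ["offer completed"] * 1 + ["transaction"] * 1,
    name="event",
)

# normalize=True returns fractions of the total instead of raw counts.
share = events.value_counts(normalize=True)
completed_share = share["offer completed"]
print(share)
```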
Portfolio insightful information
It looks like the customers' favorite is the BOGO (buy one, get one free) deal, compared to the other offers. The informational offer is the least favored, and discount offers have some potential that can be improved to attract more customers.
The most effective channel for the promotion is mobile, followed by social. This is expected, because many people check social media on their phones far more consistently than they check email or visit a website.
Profile insightful information
More males than females use the app, and 212 customers provided no gender information.
Here is an overview of age, income, and the date when a customer became a member, along with their percentiles. Membership keeps increasing over the years.
Age ranges from 18 to 101 with a mean of 54. Only a small fraction of customers are around 100.
There are more older females than males, and the other group (O), whose gender is undisclosed, has an age range almost equal to that of females.
Income ranges from 30,000 to 120,000. Most people fall close to the mean, which is about 65,000, with a standard deviation of about 21,000. The scatterplot shows more males than females in the lower income range. This might be another opportunity to attract females to sign up and use the mobile app.
Preprocessing Steps for Supervised Machine Learning Model
1. Merge all datasets into one dataframe and review it for duplicates and null values. Fill all NaN values with appropriate values for each column, reindex the column names, and convert “became_member_on” to a datetime value.
- Print the shape of the merged dataset: (272772, 17)
2. Perform one-hot encoding, using pd.get_dummies to convert all categorical values into 1 and 0 for the machine learning model. Then drop the original categorical columns and merge the new dataframe with the existing one.
3. Split the dataset into train and test sets with random_state=1: 80 percent for training and 20 percent for testing. The model is trained on the 80 percent split, then used to make predictions on the 20 percent test split to evaluate performance.
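The preprocessing steps above can be sketched end to end on toy data. This is a minimal sketch under stated assumptions, not the project's code. The three small tables, the join keys, and the target column name offer_completed are hypothetical. pd.get_dummies and train_test_split are the real pandas and scikit-learn functions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for the cleaned tables; every value here is made up.
profile = pd.DataFrame({
    "id": [f"p{i}" for i in range(10)],
    "gender": ["M", "F", "F", "M", "O", "M", "F", "M", "F", "M"],
    "income": [40000, 65000, 90000, 52000, 75000, 61000, 38000, 83000, 57000, 99000],
})
portfolio = pd.DataFrame({
    "id": ["o1", "o2"],
    "offer_type": ["bogo", "discount"],
    "difficulty": [5, 10],
})
offers = pd.DataFrame({
    "person": [f"p{i}" for i in range(10)],
    "offer_id": ["o1", "o2"] * 5,
    "offer_completed": [1, 0, 1, 0, 1, 1, 0, 1, 0, 1],  # hypothetical target
})

# Merge customer and offer attributes onto each offer record.
merged = (
    offers
    .merge(profile.rename(columns={"id": "person"}), on="person", how="left")
    .merge(portfolio.rename(columns={"id": "offer_id"}), on="offer_id", how="left")
)

# get_dummies replaces the listed columns with 0/1 indicator columns in one call.
encoded = pd.get_dummies(merged, columns=["gender", "offer_type"], dtype=int)

X = encoded.drop(columns=["person", "offer_id", "offer_completed"])
y = encoded["offer_completed"]

# 80/20 split with the random state named in the text.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
print(X_train.shape, X_test.shape)
```

Passing columns= to get_dummies drops the original categorical columns and appends the indicators in a single step, so no separate drop-and-merge is needed in this sketch.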
Problem Solving Strategy
Build a machine learning model to predict whether a customer will complete an offer based on different features, e.g., demographics, income, gender, age, offer_type, difficulty of the promo, and reward.
Model Selection and Justification.
I have chosen two models: Logistic Regression and Random Forest.
Logistic regression was used as the baseline model since I am trying to solve a binary classification problem. Logistic regression is appropriate when the dependent variable is binary, with two possible outcomes. In this situation, the possible outcomes are whether the customer is likely to complete the offer or not. Logistic regression also helps to illustrate the relationship between the dependent variable and the independent variables, and gives a better understanding of which variables have the most influence on the result.
Random Forest
A Random Forest classifier was used because it is more flexible and often gives good results even without hyperparameter tuning. By averaging many decision trees, it tends to produce more accurate and stable predictions. In my analysis, a large amount of data was missing for income, age, and gender. Random Forest was an appropriate choice because, once those gaps are filled, tree ensembles are fairly robust to imputed values and can still predict accurately.
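As a rough illustration of the two model choices, the sketch below fits both classifiers on synthetic data from make_classification, which stands in for the merged Starbucks dataframe. The data, the dictionary keys, and max_iter=1000 are assumptions for the example. The project's own settings are not given in the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the real features.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=1),
}

predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)

# Logistic regression exposes one signed coefficient per feature, while the
# forest exposes non-negative importances that sum to 1.
print(models["logistic_regression"].coef_)
print(models["random_forest"].feature_importances_)
```

The coefficients and importances are two different views of which features matter, which is one reason a linear baseline is useful alongside a tree ensemble.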
Metric used to evaluate the model
F1-score:
- Logistic Regression: 0.94
- Random Forest: 0.93
The F1-score measures binary classification performance as the harmonic mean of the model's precision and recall. It strikes a balance between precision and recall when the class distribution is uneven, which made it the most suitable measure of prediction quality in this situation, because the data is slightly imbalanced, for example in gender: 57% male, 41% female, and 1.4% undisclosed.
Although accuracy for both Logistic Regression and Random Forest is 0.88, accuracy is not the best metric for model performance when there is a slight class imbalance. Therefore, the F1-score is the better fit for measuring performance.
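To make the difference between the two metrics concrete, here is a small worked example on hand-made labels. They are not the project's predictions. With 8 positives and 2 negatives, precision and recall are both 7/8, so F1 is 0.875 while accuracy is 0.8.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hand-made labels: 8 positives, 2 negatives (a mildly imbalanced toy case).
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 0, 1, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 7 / 8
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 7 / 8

# F1 is the harmonic mean of precision and recall.
f1_manual = 2 * precision * recall / (precision + recall)
f1 = f1_score(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total = 8 / 10

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```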
I also used a GradientBoostingClassifier model, without any hyperparameter tuning, to try to improve the results and see how the scores compared. There was not much change in the result. Hence, more can be done to improve the model by performing hyperparameter tuning and using other algorithms. (This can be future work.)
Complications during the process
The challenges faced during the process were cleaning the data and selecting relevant variables that have an impact on the business problem, as well as dealing with data format errors, missing values, categorical variables, outliers, and duplicates.
Based on all the findings, the conclusion is that more customers prefer BOGO (buy one, get one free). Since the BOGO promotion is doing better than the discount, it is time to revamp the discount promotion to attract more customers, which will help increase the completion rate. The current discount promo is to spend $10 within 10 days and get a $2 reward. This seems like a lot of pressure for customers, because 10 days is not a long period and it goes by very fast.
A new option for the discount promotion could be, for example, to spend $25 from February 1st to 28th and get $5 in reward as a freebie or gift card to spend in the future. Extending the time period gives more customers the opportunity to complete the offer and take advantage of the promo.
For loyal customers who spend without taking advantage of the offer, a loyalty tracking system should be in place to reward them with a freebie when they meet promo requirements unknowingly. It is like a sweet surprise to thank customers for their loyalty.
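A first step toward such a tracking system could be to flag customers who completed an offer they never viewed. The sketch below does this on a toy event log. The column names follow the transcript schema, but the rows and the variable surprise_reward_customers are hypothetical. It also ignores event timing (a view that happens after completion would still count as viewed), which a real system would need to handle.

```python
import pandas as pd

# Toy event log: p1 viewed the offer before completing it, p2 never viewed it.
events = pd.DataFrame({
    "person": ["p1", "p1", "p1", "p2", "p2"],
    "offer_id": ["o1", "o1", "o1", "o1", "o1"],
    "event": ["offer received", "offer viewed", "offer completed",
              "offer received", "offer completed"],
})

# Count each event type per (person, offer) pair.
counts = pd.crosstab([events["person"], events["offer_id"]], events["event"])

# Keep pairs that were completed at least once but never viewed.
unviewed = counts[(counts["offer completed"] > 0) & (counts["offer viewed"] == 0)]
surprise_reward_customers = unviewed.index.get_level_values("person").unique().tolist()
print(surprise_reward_customers)
```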
Future work for this project would be to improve model performance through hyperparameter tuning and to evaluate it using different parameters and metrics.
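As a starting point for that future work, a grid search over a few Random Forest parameters scored by F1 might look like the sketch below. The synthetic data and the parameter grid are assumptions chosen only to keep the example small. They are not tuned values from the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the merged Starbucks features.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=1)

# A deliberately small, illustrative grid.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid,
    scoring="f1",  # optimise the same metric used for evaluation
    cv=3,
)
search.fit(X, y)

best_params = search.best_params_
best_f1 = search.best_score_
print(best_params, round(best_f1, 3))
```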