Starbucks Capstone Project

Udacity Data Science Nanodegree project


Project Overview

For this project, three datasets were provided that mimic customer behavior on the Starbucks rewards mobile app. A few times a week, Starbucks sends out offers to users of the mobile app. Each offer has its own expiration date and time. Some of the challenges are as follows:

  • Not many customers completed the offers.
  • Some users might receive an offer and never open it, but still complete its requirement.

Types of offers

  1. Discount (e.g. spend $10 and get $2 off)
  2. BOGO (buy one, get one free)
  3. Informational (valid for 7 days)

Business Objective

  1. The business problem that I would like to solve is how to get more customers to complete the offers

Dataset Review

Data is contained in three files:

  • portfolio.json: offer metadata (offer type, difficulty, reward, duration, channels)
  • profile.json: demographic data for each customer
  • transcript.json: records for transactions, offers received, offers viewed, and offers completed

portfolio.json fields:

  • offer_type (string): type of offer, i.e. BOGO, discount, or informational
  • difficulty (int): minimum required spend to complete an offer
  • reward (int): reward given for completing an offer
  • duration (int): time the offer stays open, in days
  • channels (list of strings)

profile.json fields:

  • became_member_on (int): date when the customer created an app account
  • gender (str): gender of the customer (note that some entries contain ‘O’ for other rather than M or F)
  • id (str): customer id
  • income (float): customer’s income

transcript.json fields:

  • person (str): customer id
  • time (int): time in hours since the start of the test; the data begins at time t=0
  • value (dict of strings): either an offer id or a transaction amount, depending on the record

Dataset shapes:

  • transcript shape (306534, 4)
  • portfolio shape (10, 6)
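
Because the value column of the transcript holds a dict whose content depends on the record, it usually has to be flattened before any analysis. The snippet below is a minimal sketch of one way to do that. The sample rows, the `event` column name, and the dict keys ("offer id", "offer_id", "amount") are assumptions made for illustration and should be checked against the actual file.

```python
import pandas as pd

# Hypothetical sample rows shaped like transcript.json (values are made up).
transcript = pd.DataFrame({
    "person": ["p1", "p1", "p2"],
    "event": ["offer received", "transaction", "offer completed"],
    "time": [0, 6, 12],
    "value": [{"offer id": "o1"}, {"amount": 4.5}, {"offer_id": "o1", "reward": 2}],
})


def flatten_value(df: pd.DataFrame) -> pd.DataFrame:
    """Split the `value` dict into separate `offer_id` and `amount` columns."""
    out = df.drop(columns="value").copy()
    # Offer records are assumed to use either "offer id" or "offer_id" as the key.
    out["offer_id"] = df["value"].apply(lambda v: v.get("offer id", v.get("offer_id")))
    # Transaction records are assumed to carry an "amount" key.
    out["amount"] = df["value"].apply(lambda v: v.get("amount"))
    return out


flat = flatten_value(transcript)
```

After this step, offer records have an `offer_id` and a missing `amount`, while transaction records have the opposite.
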

Exploratory analysis (figures not reproduced here): transaction value counts; gender value counts; summary statistics from the describe method for age, membership date, and income; age groups; gender by age group; income ranges.

Preprocessing Steps for Supervised Machine Learning Model

  1. Merged all datasets into one data frame, reviewed it for duplicates and null values, filled all NaN values with appropriate values for each column, renamed the columns, and converted “became_member_on” to a datetime value.
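
The preprocessing step above can be sketched roughly as follows. The miniature tables, the portfolio `id` column, the `offer_id` column on the transcript side, and the fill strategy (median income, "unknown" gender) are all assumptions for illustration, not necessarily the choices made in this project.

```python
import pandas as pd

# Hypothetical miniature versions of the three tables (values are made up).
portfolio = pd.DataFrame({
    "id": ["o1", "o2"],  # assumed offer id column
    "offer_type": ["bogo", "discount"],
    "difficulty": [10, 10],
    "reward": [10, 2],
    "duration": [7, 10],
})
profile = pd.DataFrame({
    "id": ["p1", "p2"],
    "gender": ["F", None],
    "income": [72000.0, None],
    "became_member_on": [20170715, 20180102],
})
offers = pd.DataFrame({  # transcript rows with an assumed offer_id column
    "person": ["p1", "p2", "p2"],
    "offer_id": ["o1", "o2", "o2"],
    "event": ["offer received", "offer received", "offer received"],
})

# Rename the id columns so the join keys are unambiguous.
portfolio = portfolio.rename(columns={"id": "offer_id"})
profile = profile.rename(columns={"id": "person"})

merged = (
    offers.merge(profile, on="person", how="left")
    .merge(portfolio, on="offer_id", how="left")
    .drop_duplicates()
    .reset_index(drop=True)
)

# One possible fill strategy (an assumption): median income, "unknown" gender.
merged["income"] = merged["income"].fillna(merged["income"].median())
merged["gender"] = merged["gender"].fillna("unknown")

# became_member_on is stored as an integer such as 20170715.
merged["became_member_on"] = pd.to_datetime(
    merged["became_member_on"].astype(str), format="%Y%m%d"
)
```

Left joins keep every offer record even when a customer or offer has no match, which makes any resulting gaps visible as NaN values instead of silently dropping rows.
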

Problem Solving Strategy

Build a machine learning model to predict whether a customer will complete an offer, based on features such as demographics (income, gender, age), offer_type, offer difficulty, and reward.
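
The features listed here mix numeric and categorical columns, so the categorical ones need to be encoded before they reach a classifier. A minimal sketch, assuming a merged table with a hypothetical 0/1 label column named `completed` (the rows below are made up):

```python
import pandas as pd

# Hypothetical merged rows; `completed` (1 = offer completed) is an assumed label.
data = pd.DataFrame({
    "age": [34, 58, 41, 25],
    "income": [52000.0, 91000.0, 67000.0, 38000.0],
    "gender": ["F", "M", "O", "M"],
    "offer_type": ["bogo", "discount", "bogo", "informational"],
    "difficulty": [10, 10, 5, 0],
    "reward": [10, 2, 5, 0],
    "completed": [1, 0, 1, 0],
})

# One-hot encode the categorical features; numeric columns pass through unchanged.
X = pd.get_dummies(data.drop(columns="completed"), columns=["gender", "offer_type"])
y = data["completed"]
```

Each category becomes its own indicator column (for example `gender_O` or `offer_type_bogo`), which both models discussed below can consume directly.
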

Model Selection and Justification.

I have chosen two models: Logistic Regression and Random Forest.

Logistic Regression

Logistic regression was used as the baseline model, since I am trying to solve a binary classification problem. It is appropriate when the dependent variable is binary, that is, when there are two possible outcomes. In this situation, the two outcomes are whether the customer is likely to complete the offer or not. Logistic regression also helps to illustrate the relationship between the dependent variable and the independent variables, and gives a better understanding of which variables have the most influence on the result.
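
As a rough illustration of how such a baseline can be set up and how its coefficients hint at variable influence, here is a sketch on synthetic data. The data, the scaling step, and the split settings are assumptions for illustration; this is not the project's actual code or data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Starbucks data): two features, binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
# The label depends strongly on feature 0 and only weakly on feature 1.
noise = rng.normal(scale=0.5, size=400)
y = (2.0 * X[:, 0] + 0.2 * X[:, 1] + noise > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scaling first makes the coefficient magnitudes comparable across features.
baseline = make_pipeline(StandardScaler(), LogisticRegression())
baseline.fit(X_train, y_train)

coefficients = baseline.named_steps["logisticregression"].coef_[0]
test_accuracy = baseline.score(X_test, y_test)
```

On standardized inputs, a larger absolute coefficient suggests a stronger association with the outcome, which is what makes this model useful as an interpretable baseline.
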

Random Forest

A random forest classifier was used because it is more flexible and often gives good results even without hyperparameter tuning. By averaging many decision trees, a random forest tends to produce more accurate and stable predictions than a single tree. In my analysis, a large amount of data was missing for income, age, and gender. Random forest seemed the more appropriate choice because it can cope with a dataset that has many missing values, once they are filled in during preprocessing, and still predict accurately.
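
The sketch below shows one way to pair imputation with a random forest using default hyperparameters, and how to read feature importances afterwards. The synthetic data and the median imputation strategy are assumptions for illustration; whether a forest accepts NaN values directly depends on the scikit-learn version, so imputing first is the conservative route.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data with missing entries (not the Starbucks data).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] > 0).astype(int)          # only feature 0 determines the label
X[rng.random(X.shape) < 0.1] = np.nan  # blank out roughly 10% of the values

# Impute with the column median, then fit a forest with default hyperparameters.
forest = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestClassifier(random_state=0),
)
forest.fit(X, y)

importances = forest.named_steps["randomforestclassifier"].feature_importances_
```

The importances sum to 1, and here the first feature should dominate because it is the only one that drives the label.
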

Metric used to evaluate the model


  • Random Forest: 0.93
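
The score above reads as an accuracy, since the model improvement section refers to accuracy. As a reminder of what that number means, here is a small sketch with made-up labels and predictions (they are not the project's results):

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels and predictions, purely for illustration.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Accuracy = number of correct predictions / total number of predictions.
manual = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
accuracy = accuracy_score(y_true, y_pred)
```

Both values agree: six of the eight predictions match, so the accuracy is 0.75.
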

Improve Model

I also tried a GradientBoostingClassifier, without any hyperparameter tuning, to see whether it would improve the accuracy. There was not much change in the result. Hence, more can be done to improve the model by performing hyperparameter tuning and trying other algorithms (this is left for future work).
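
For the future work mentioned here, hyperparameter tuning could look roughly like the sketch below, which uses scikit-learn's GridSearchCV. The synthetic data and the grid values are assumptions chosen to keep the example small; they are not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the Starbucks data).
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)

# A deliberately small, illustrative grid; the values are assumptions.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="accuracy",
    cv=3,  # 3-fold cross-validation for each parameter combination
)
search.fit(X, y)

best_params = search.best_params_
best_cv_accuracy = search.best_score_
```

Comparing `best_cv_accuracy` against the untuned model's cross-validated accuracy shows whether tuning actually helps, rather than relying on a single train/test split.
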

Complications during the process

The main challenges during the process were cleaning the data and selecting the relevant variables that would make an impact on the business problem, as well as dealing with data format errors, missing values, categorical variables, outliers, and duplicates.


Conclusion

Based on all the findings, the conclusion is that more customers prefer the BOGO (buy one, get one free) offer. Since the BOGO promotion is doing better than the discount, it is time to revamp the discount promotion to attract more customers, which should help increase the completion rate. The current discount promotion is: spend $10 within 10 days and get a $2 reward. This seems like a lot of pressure for customers, because 10 days is not a long period and it goes by very fast.
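
A comparison like the one between BOGO and discount offers can be computed as a completion rate per offer type. The sketch below uses made-up rows and an assumed 0/1 column named `completed`; it does not reproduce the project's figures.

```python
import pandas as pd

# Hypothetical offer-level rows; `completed` is an assumed 0/1 column.
offers = pd.DataFrame({
    "offer_type": ["bogo", "bogo", "bogo", "discount", "discount", "discount"],
    "completed": [1, 1, 0, 1, 0, 0],
})

# The mean of a 0/1 column is the share of offers that were completed.
completion_rate = offers.groupby("offer_type")["completed"].mean()
```

Tracking this rate before and after a change to the discount offer would show whether the revamp had the intended effect.
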

Data Science and Machine Learning student at Udacity. Looking to transition into a new career path, from banking to technology.
