Starbucks customer segmentation

Xiaodong Xi
13 min read · Dec 6, 2020


Project Overview

Coffee has become a very popular beverage in modern society. Over the past century, a number of coffee shop chains have expanded into international corporations, and Starbucks is no doubt the most prominent of them.

Starbucks is famous not only for the outstanding quality and taste of its coffee, but also for its customer care. Over the past decade, as digital technology has transformed many aspects of our lives, Starbucks has become more digitised in its customer interactions. It uses channels such as the web and mobile apps to promote its products, and it runs a membership programme through which customers can enjoy exclusive offers.

Problem Statement

Starbucks wants to understand how effective its offers are and how responsive customers are to them. Using the data sets provided by Starbucks, we can perform analyses and fit a model to predict how likely a customer is to respond to a particular offer.

The data processing includes cleaning the data sets (removing missing values and outliers), merging the data sets (so that all the information about a customer or a transaction can be accessed in one instance), encoding categorical variables (using one-hot encoding), and creating labels (based on the outcome of the offer / marketing activities).

Metrics

Since we are creating a model to predict the outcome (successful / unsuccessful) of a particular offer on a particular customer, we need certain metrics to assess how well the model performs.

To fully assess performance, both the effectiveness of training and the predictive ability on unseen data, I split the data set into training and testing sets. The testing set is held out from the training process to simulate and estimate the model's performance on unseen data points.

The chosen metrics are accuracy and F1-score. Accuracy is computed as the number of correctly predicted instances divided by the total number of instances. The F1-score is the harmonic mean of precision and recall. Both metrics are applied to the training and testing sets to evaluate performance. Accuracy and F1-score are very common metrics for classifiers. Accuracy can be used here because the successful and unsuccessful classes are almost balanced. A good F1-score represents a good trade-off between precision and recall, so that there are not too many false negatives or false positives.
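As a rough sketch of the split-and-score setup described above (the toy feature matrix and trivial "model" here are purely illustrative, not the project's data or classifier):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

# Toy stand-ins for the real feature matrix and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# Hold out 20% as an unseen testing set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Given predictions y_pred on the testing set:
y_pred = (X_test[:, 0] > 0).astype(int)  # a trivial stand-in "model"
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
```

The same two calls are applied to the training set as well, so that training and testing performance can be compared side by side.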

Data Exploration

1. Before data cleaning

Data exploration is done first on the granular data, before the merge. This gives an idea of the percentage of missing data: if there is more missing data than we can handle, there would be very limited insight to gain from further analyses. It is therefore worth determining at an early stage whether the project is doable and whether it will generate any value.

For example, the customer profile data contains some missing values. To decide whether we can safely get rid of them, we need to know the percentage of missing data.

We can see that gender and income contain missing values, but not so many as to prevent further analysis. Therefore, we can remove those entries and perform the analysis on the rest, where the information is available to us.
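The missing-value check can be done in one line with pandas; the small frame below is a made-up stand-in for the real profile table:

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the customer profile table.
profile = pd.DataFrame({
    "gender": ["F", "M", None, "F", None, "M", "F", "M"],
    "age": [25, 40, 118, 33, 118, 52, 61, 29],
    "income": [50000, 72000, np.nan, 61000, np.nan, 85000, 90000, 48000],
})

# Fraction of missing values per column.
missing_pct = profile.isna().mean()
print(missing_pct)
```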

2. After data cleaning

After the data set has been cleaned, exploratory data analysis is performed to understand the distribution of the data points. If we could not use advanced machine learning, we might need very detailed insights to drive our modelling, for example to build a scorecard. Nevertheless, it is still important to perform exploratory data analysis on the cleaned data even when machine learning algorithms are available. For example, for each offer we can output the statistics below to describe the customer profile:

From the above results, we can see that the offers tend to work better when promoted to older customers and customers with higher income. Females are also more likely to react to a promotion, and a customer's average transaction value differs markedly between successful and unsuccessful offers.

Data Visualisation

Beyond outputting these statistics above, we can also visualise the distributions through plotting tools.

Taking the above metrics as an example, we can plot the graphs below:

The plots clearly show the points made in the exploratory analysis stage. One advantage of data visualisation is that we can sometimes spot anomalies more easily than by looking at raw numbers. From the plots above, we can see that the offers do indeed work better on older customers, but there are far fewer customers with a long membership history than newer ones. It is therefore crucial for Starbucks to address this, because the return from offers will likely be limited if newer customers are given little consideration.

Data Preprocessing

Before proceeding with the formal preprocessing steps, note that each member has a date indicating when their membership started. There may be some relationship between the time a customer has spent with Starbucks and their likelihood of accepting an offer. However, it is probably sufficient to consider only the year in which the customer became a member. One could argue that seasonality justifies keeping the month, and for certain seasonal products that might make a difference, but we have no further information indicating that any of the offers are seasonal. Adding the month (or even the day) might therefore introduce noise that gives us wrong signals.
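Assuming the membership date is stored as an integer in YYYYMMDD form (as it is in the raw profile data), extracting just the year might look like:

```python
import pandas as pd

# Illustrative rows; the raw data stores the join date as an int like 20170715.
profile = pd.DataFrame({"became_member_on": [20170715, 20180203, 20151120]})

# Keep only the year the customer joined, discarding month and day.
profile["member_year"] = (
    pd.to_datetime(profile["became_member_on"].astype(str), format="%Y%m%d")
    .dt.year
)
```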

Data preprocessing steps include removing missing entries, removing outliers, one-hot encoding categorical variables, merging tables, and assigning labels.

The first step is treating missing values. Gender and income are missing for some customers, but since only about 13% of the entries are affected, we simply remove them.

The second step is treating outliers. Some customers have an age of 118, which is peculiar. These values may originally have been missing and, for some reason, imputed with 118. Since we cannot trace the origin of these numbers or confirm that they are appropriate, we remove customers with an age of 118 as well.
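Both cleaning steps reduce to two pandas filters; the frame here is a toy example, not the real profile data:

```python
import numpy as np
import pandas as pd

profile = pd.DataFrame({
    "gender": ["F", None, "M", None],
    "age": [25, 118, 118, 40],
    "income": [50000, np.nan, 61000, np.nan],
})

# Drop rows with missing gender or income, then drop the suspicious age-118 rows.
clean = profile.dropna(subset=["gender", "income"])
clean = clean[clean["age"] != 118]
```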

The third step is one-hot encoding. This is important because we have categorical variables whose values are strings, which carry no meaning to the model as raw text. We therefore transform each categorical variable into a binary matrix indicating which category the instance belongs to. Note that for age, we first bin the customers into age groups and then apply one-hot encoding.
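A minimal sketch of the bin-then-encode step (the bin edges and group labels here are illustrative, not the project's actual choices):

```python
import pandas as pd

profile = pd.DataFrame({"gender": ["F", "M", "O"], "age": [22, 47, 66]})

# Bin ages into groups, then one-hot encode the categorical columns.
profile["age_group"] = pd.cut(
    profile["age"],
    bins=[17, 30, 45, 60, 105],
    labels=["18-30", "31-45", "46-60", "61+"],
)
encoded = pd.get_dummies(profile, columns=["gender", "age_group"])
```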

We merge the tables because it is important to have a single training set in which each instance is a single vector; many modelling setups do not work with multiple tables. When merging, we join the tables on the customer, which makes it easier to keep track of the event log associated with each particular customer.

The last step is assigning labels. Here we use the following definition: an offer is deemed successful for a particular customer if there is a transaction within the offer period. Since offer periods are usually short, it is highly likely that the customer made the transaction because of the offer. In the worst case, we are still predicting based on the correlation between offers and transactions.
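The labelling rule above can be sketched as a simple window check. The column names and hourly time units below are assumptions for illustration, not the raw schema:

```python
import pandas as pd

# Hypothetical merged event log: one offer sent per row, plus transaction times.
offers = pd.DataFrame({
    "customer": ["a", "b"],
    "offer_start": [0, 24],
    "duration": [72, 48],   # offer validity window, in hours
})
transactions = pd.DataFrame({
    "customer": ["a", "c"],
    "time": [30, 10],
})

def label_offer(row, tx):
    """Successful (1) if the customer transacted inside the offer window."""
    window = tx[(tx["customer"] == row["customer"])
                & (tx["time"] >= row["offer_start"])
                & (tx["time"] <= row["offer_start"] + row["duration"])]
    return int(len(window) > 0)

offers["success"] = offers.apply(label_offer, axis=1, tx=transactions)
```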

Analysis on Training data

The training data contains 21 features, including customer features (gender, age, year joined, income, etc.) and offer features (reward, duration, difficulty, channel information, etc.). The feature matrix is therefore likely to be a good representation of the system we are trying to model.

There are about 54 thousand rows, which is a decent size for modelling: not so small that the model cannot generalise from the data, nor so large that training becomes impractical in time or runtime memory.

Model Implementation and Refinement

To fit a model to the cleaned data set, we first create a benchmark with a dummy model. Since this is a binary classification problem, logistic regression might be suitable. A random forest is another option, and it has an additional advantage: we can inspect the feature importances to see whether the trained model makes sense to domain experts or stakeholders.

To refine the model, we test performance on the test set so that we can identify any over-fitting. Over-fitting happens when a model is tuned to idiosyncrasies that exist only in the training data set: the model fits itself to correctly predict the noise in the training set and hence performs poorly on unseen data. A way to test the model's performance on unseen data is to use a hold-out testing set, taken as 20% of the original data set. This set is representative of the population because it is randomly sampled from it, and it is unseen because it is excluded from the training process.

The models are tuned using a randomised search for a set of hyper-parameters that gives good performance. Tuning a single candidate will not take very long because the training data has a reasonable size, so we can afford a fairly thorough search over the hyper-parameter space. Hyper-parameters involved in the tuning include, for example, the penalty function for logistic regression and the number of estimators for the random forest.

The search space for logistic regression is as below:

The search space for random forest is as below:
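The exact search spaces used in the project were shown as images and are not reproduced here; as an illustration of the setup, a randomised search over hypothetical spaces for both models might look like this with scikit-learn (toy data throughout):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)

# Illustrative search spaces, not the project's actual grids.
lr_space = {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1, 10]}
rf_space = {"n_estimators": randint(50, 300), "max_depth": randint(3, 15)}

lr_search = RandomizedSearchCV(
    LogisticRegression(solver="liblinear"),  # liblinear supports both penalties
    lr_space, n_iter=5, random_state=0)
rf_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0), rf_space, n_iter=5, random_state=0)

lr_search.fit(X, y)
rf_search.fit(X, y)
```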

It is also worth noting that, with more computing power and memory, a grid search would also be possible. The difference is that a random search evaluates only random combinations of hyper-parameters and may settle on a local optimum, whereas a grid search exhaustively evaluates every combination on the grid and finds the best one there. The main thing to watch out for is the computing constraint of the machine.

Model Evaluation and Validation

The first step is to use a dummy model as a benchmark: if a developed model performs worse than the dummy model, something must be wrong with it.

The dummy models chosen are: 1. a model that predicts every offer as successful for any customer, and 2. a model that predicts every offer as unsuccessful for any customer. Their performance is shown below:
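Both benchmarks can be built with scikit-learn's `DummyClassifier`; the class balance below is made up for illustration:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

X = np.zeros((100, 3))                 # features are ignored by a dummy model
y = np.array([1] * 55 + [0] * 45)      # roughly balanced classes, as in the data

# Benchmark 1: predict every offer as successful.
always_success = DummyClassifier(strategy="constant", constant=1).fit(X, y)
# Benchmark 2: predict every offer as unsuccessful.
always_fail = DummyClassifier(strategy="constant", constant=0).fit(X, y)

print(accuracy_score(y, always_success.predict(X)))  # 0.55
print(accuracy_score(y, always_fail.predict(X)))     # 0.45
```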

Using a logistic regression with default setup, we get the following performance:

This is a good performance, at least much better than the dummy models. However, we may be able to fine-tune the model using a random search over the hyper-parameter space. Using the tuned logistic regression, we get the performance below:

This is a much better outcome, and because the testing performance is comparable to the training performance, there is minimal over-fitting in the training process.

To better visualise the performance, a precision-recall curve and a receiver operating characteristics (ROC) curve are drawn to reflect the distinguishing power of the logistic regression.
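Both curves are built from the model's predicted probabilities; a sketch of the computation (on toy data, with the plotting left to matplotlib) might be:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]  # probability of the positive class

precision, recall, _ = precision_recall_curve(y_te, scores)
fpr, tpr, _ = roc_curve(y_te, scores)
auc = roc_auc_score(y_te, scores)
# precision/recall and fpr/tpr can then be plotted, e.g. with matplotlib.
```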

We can see that the model is doing very well in terms of ability to differentiate between successful and unsuccessful outcomes of an offer.

Then a random forest model is trained on the data as well, to see whether it offers any performance uplift over the regression. A random forest with the default setup gives the performance below:

We can see that the training results are perfect but the testing results are not as good, which indicates that the model is overfitting. We therefore tune the model further to mitigate the overfitting, because an overfitted model is likely to deteriorate quickly on unseen data: it is tuned to the idiosyncrasies (noise) that are only present in the training data. From a randomised search, we obtain the tuned random forest model, whose performance is below:

We see that the training and testing performances are similar to each other, indicating that over-fitting is unlikely. Hence, the robustness of this model is validated. The final setup is a random forest model with the following hyper-parameters:

Justification

As mentioned above, the model can be justified through further analysis of the feature importances. The importance of a feature indicates the magnitude of its impact on the model's predictions. The most important features in the random forest classifier are shown below:
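Extracting the ranking from a fitted forest is a one-liner; the feature names below are illustrative stand-ins for the real columns:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
feature_names = ["amount", "reward", "duration", "income", "email"]  # illustrative

model = RandomForestClassifier(random_state=0).fit(X, y)
importances = (
    pd.Series(model.feature_importances_, index=feature_names)
    .sort_values(ascending=False)
)
print(importances)
```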

We can see that the transaction amount is the most important feature. This makes sense because the transaction amount is a good description of the customer's relationship with Starbucks: a customer who regularly buys products from Starbucks is likely to be more responsive to an offer. And since Starbucks products are fairly similar in price, the transaction amount largely reflects how frequently purchases occur between the customer and the coffee shop.

The next few features relate to the offer itself. Different offers have different expected success rates; unfortunately, those expectations are not provided as part of the project, so it is difficult to judge whether the observed success rates meet them. Still, this tells us that the model is making predictions on reasonable grounds, because the mechanics of an offer will greatly affect customers' reactions.

The next most important customer feature is income. This is again justified because customers with higher income naturally tend to spend more on goods like coffee, and are therefore more willing to accept an offer and spend. We also saw above that newer customers tend not to convert offers into transactions, which is reflected again in the feature importance ranking.

The least important feature is whether the offer was sent through email. This is because every offer is sent through email at a minimum, so this feature gives the model no valuable information: there is no observable distinction between the successful and unsuccessful groups with respect to receiving an email.

Since the feature importance ranking makes sense, the random forest modelling setup is well justified. Note, however, that this may not be the only viable solution; there may be better-tuned models that also serve our purposes well.

Reflection

These results show that the most effective way for Starbucks to improve the success rate of its offers is to modify the offers themselves, in terms of reward amount and offer duration.

The model can also help Starbucks decide which customers to show a promotion to. From the model output, we get the predicted probability, which indicates a customer's likelihood of reacting to a particular offer. Starbucks can rank-order the customers by this likelihood to choose the group to show a promotion or other marketing prompt. For example, if Starbucks wants to target a promotion at 50% of customers, it can pick those in the top 50% of the likelihood ranking. In this way, Starbucks can maximise its return from that offer.
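The rank-and-target step is a simple sort over predicted probabilities; the fitted model and data below are toy stand-ins:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Predicted likelihood of each customer responding to the offer.
likelihood = model.predict_proba(X)[:, 1]

# Target the top 50% of customers by predicted likelihood.
n_target = len(likelihood) // 2
target_idx = np.argsort(likelihood)[::-1][:n_target]
```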

Improvement

This piece of work is rather rudimentary for deciding how to improve the success rate of the offers and which customers to show them to. Further analysis, with more detailed customer segmentation, could be performed to gain more insight into the problem.

Also, more advanced modelling techniques have not been explored in this project. For example, gradient-boosted trees or even neural networks could be used to deliver more accurate predictions.

Reference

The GitHub repository containing the analysis is https://github.com/xixiaodong/Starbucks-customer-segmentation.
