Starbucks Project
- High-level overview:
This blog investigates a dataset provided by Starbucks, containing information about its customers and its offers (promotions). By merging three separate tables, this project highlights the main points about the data, identifies problems that can be fixed, and discusses how to make predictions using the existing features.
Problem domain
This project explores whether Starbucks users' behaviour in response to the offers they receive can be predicted.
Project origin
This data set contains simulated data that mimics customer behavior on the Starbucks rewards mobile app.
- Description of Input Data
Data sets
- portfolio.json - offer ids and metadata about each offer
- profile.json - demographic data for each customer
- transcript.json - records for transactions, offers received, offers viewed, and offers completed.
portfolio.json
- id (string) - offer id
- offer_type (string) - type of offer, i.e. BOGO, discount, informational
- difficulty (int) - minimum required spend to complete an offer
- reward (int) - reward given for completing an offer
- duration (int) - time for offer to be open, in days
- channels (list of strings) - channels through which the offer is sent
profile.json
- age (int) - age of the customer
- became_member_on (int) - date when customer created an app account
- gender (str) - gender of the customer (note some entries contain 'O' for other rather than M or F)
- id (str) - customer id
- income (float) - customer's income
transcript.json
- event (str) - record description (i.e. transaction, offer received, offer viewed, etc.)
- person (str) - customer id
- time (int) - time in hours since start of test. The data begins at time t=0
- value - (dict of strings) - either an offer id or transaction amount depending on the record
Problem Statement
This project aims to predict the effect of offers on buyers' behaviour.
- Strategy for solving the problem
Investigating the data; cleaning the data; merging the data; EDA; modelling and predicting. In the modelling section, the project uses a Random Forest Classifier as the main method.
- Expected solution
Modelling based on a consolidated dataframe built from the three original tables: portfolio, profile and transcript. I hope to identify meaningful variables that influence the decisions of Starbucks users.
- Metrics
As the model aims to predict event_transaction, a binary variable, the F1 score is used to judge how well the model is doing: it is the harmonic mean of precision and recall, and is a better measure than accuracy, especially when the classes are imbalanced.
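To illustrate, the F1 score can be computed with scikit-learn; the toy labels below are for illustration only:
from sklearn.metrics import f1_score
# F1 = 2 * (precision * recall) / (precision + recall)
y_true = [0, 1, 1, 0, 1]  # toy true labels
y_hat = [0, 1, 0, 0, 1]   # toy predictions
print(f1_score(y_true, y_hat))  # 0.8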
- Investigating data
To begin, each table is examined (by peeking at its top 5 rows with df.head()). This helps us understand how many columns there are and what the data type of each column is.
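For example, the tables can be loaded and inspected like this (a sketch; the file paths are placeholders):
import pandas as pd
# each JSON file stores one record per line
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)
profile = pd.read_json('data/profile.json', orient='records', lines=True)
transcript = pd.read_json('data/transcript.json', orient='records', lines=True)
# peek at the top 5 rows, then the column names and dtypes
print(portfolio.head())
portfolio.info()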
- EDA
Clean data
First, the portfolio table is examined. Here, I rename id to offer_id to distinguish it from user_id. Then, I create dummy features from the channels column, as sketched below.
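A minimal sketch of this step, assuming the channel values are web, email, mobile and social:
# rename id so it cannot clash with customer ids later
portfolio = portfolio.rename(columns={'id': 'offer_id'})
# one dummy column per channel, e.g. channel_web
for channel in ['web', 'email', 'mobile', 'social']:  # assumed values
    portfolio['channel_' + channel] = portfolio['channels'].apply(lambda x: int(channel in x))
portfolio = portfolio.drop(columns=['channels'])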
Second, in the profile table, I rename id to user_id to distinguish it from offer_id. After that, the became_member_on column is converted to datetime.
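In code, this could look like the following (assuming became_member_on is stored as an integer in YYYYMMDD form):
# rename id to user_id to distinguish it from offer_id
profile = profile.rename(columns={'id': 'user_id'})
# parse the YYYYMMDD integer into a proper datetime
profile['became_member_on'] = pd.to_datetime(
    profile['became_member_on'].astype(str), format='%Y%m%d')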
Third, in the transcript table, I rename person to user_id to match the other tables. After finding that the event column has four unique values ('offer received', 'offer viewed', 'transaction', 'offer completed'), I create dummy variables from them. Finally, the value column stores dictionaries; I extract their contents into two separate columns: offer_id and amount.
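A sketch of this cleaning; the exact dictionary keys ('offer id', 'offer_id', 'amount') are assumptions based on the dataset description:
# align the customer id column name with the other tables
transcript = transcript.rename(columns={'person': 'user_id'})
# one dummy per event type, e.g. event_transaction, event_offer_received
events = pd.get_dummies(transcript['event'].str.replace(' ', '_'), prefix='event')
transcript = pd.concat([transcript, events], axis=1)
# unpack the value dicts into offer_id and amount (keys assumed)
transcript['offer_id'] = transcript['value'].apply(lambda v: v.get('offer id', v.get('offer_id')))
transcript['amount'] = transcript['value'].apply(lambda v: v.get('amount'))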
Last, I merge the three tables together on their shared columns.
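The merge itself might look like this; left joins keep every transcript record even when no offer or profile row matches:
df = transcript.merge(portfolio, on='offer_id', how='left')
df = df.merge(profile, on='user_id', how='left')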
Data visualisation
As the age histogram shows, there is likely an outlier at age 118. My decision is to remove these rows from the dataframe.
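Removing them is a one-liner:
# drop the rows carrying the implausible age of 118
df = df[df['age'] != 118]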
A heatmap is also provided to visualise the correlations among variables.
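For instance, with seaborn (a sketch):
import seaborn as sns
import matplotlib.pyplot as plt
# correlation heatmap over the numeric columns
sns.heatmap(df.corr(numeric_only=True), cmap='coolwarm')
plt.show()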
- Data Preprocessing
Before applying the Random Forest Classifier to predict whether event_transaction is 0 or 1 (i.e. whether a transaction arises from the event), a few preprocessing steps are taken.
First, I get dummies of these variables: offer_type, offer_id, gender. In these cases, I prefer one-hot encoding to label encoding because the variables are not ordinal. However, for user_id, since there are too many unique values, I label encode the column instead. Despite its drawbacks, this saves memory when modelling.
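A sketch of both encodings:
from sklearn.preprocessing import LabelEncoder
# one-hot encode the non-ordinal categoricals
df = pd.get_dummies(df, columns=['offer_type', 'offer_id', 'gender'])
# user_id has too many levels for one-hot, so label encode it instead
df['user_id'] = LabelEncoder().fit_transform(df['user_id'])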
Second, I extract day, month, and year from the original datetime column, which represents the date a user joined the Starbucks membership program.
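With pandas' .dt accessor this is straightforward (the member_* column names are illustrative):
df['member_day'] = df['became_member_on'].dt.day
df['member_month'] = df['became_member_on'].dt.month
df['member_year'] = df['became_member_on'].dt.year
df = df.drop(columns=['became_member_on'])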
Third, I handle null values by filling them with the mean of each continuous variable. The exception is the amount column: a null value there means the amount is 0, so in this particular column I fill nulls with zeros.
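A sketch of the imputation, with income as one example of a continuous column:
# a missing amount means the event involved no payment
df['amount'] = df['amount'].fillna(0)
# remaining continuous columns get their column mean
df['income'] = df['income'].fillna(df['income'].mean())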
- Modeling
The train set and test set are split using train_test_split with a 70:30 ratio.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3)
Using RandomForestClassifier() as the main method
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
Fit the random forest model on the training data X_train, y_train
rf.fit(X_train, y_train)
Predict y for X_test with the trained model
y_pred = rf.predict(X_test)
Import classification_report and confusion_matrix to see the performance of this model (both take the true labels first, then the predictions). As a result, the F1 score on the test set is 0.94, a strong result for an initial attempt.
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
- Hyperparameter Tuning
To refine the model, I apply GridSearchCV, searching over slight variations of the random forest classifier's hyperparameters.
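A sketch of such a search; the grid values below are illustrative assumptions, not the exact ones used:
from sklearn.model_selection import GridSearchCV
# small illustrative grid over tree count and depth
param_grid = {'n_estimators': [100, 200], 'max_depth': [None, 10, 20]}
grid = GridSearchCV(RandomForestClassifier(), param_grid, scoring='f1', cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)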
- Results
- Comparison Table
- Conclusion
- Improvements
In the next version, more algorithms should be employed to compare a variety of results; this may yield better forecasting ability from the existing data.
- Acknowledgment
- References