Random Forest: a Machine Learning Model in Marketing Strategy & Decision Making, Better than A/B Testing?
Starbucks: Optimizing the Results with Random Forest after A/B Testing, Part II
1. Introduction
The aim of this article is to optimize the A/B testing results from the previous article (“Why should your company do an A/B testing? Starbucks A/B Testing Step by Step Part I”). This time I will use Random Forest, a machine learning algorithm, and compare several configurations until the best one is found, with the goal of achieving a better score than we obtained with A/B testing alone.
The article covers the supervised Random Forest Classifier, parameter tuning and feature importance. These methods improve the results in two ways: on the one hand, our advertising managers learn which customers to target and can thus optimize the product advertising campaign; on the other hand, we minimise advertising spending by not targeting non-responders.
In the previous article, we saw why A/B testing is so important, how to evaluate the results and how to draw conclusions. For this purpose, I used the data provided by Starbucks to find out whether sending an advertising promotion significantly increases profit.
Even though we detected a significant improvement in the Incremental Response Rate (IRR) metric, we could not achieve satisfactory results on the Net Incremental Revenue (NIR) metric. For this reason, we could not roll out the advertising promotion to all clients. New approaches are therefore required to increase sales cost-effectively.
2. Task Description
One solution for a successful and cost-effective promotion campaign is to combine the data collected during A/B testing with Machine Learning techniques in order to detect hidden patterns and thereby identify the target audience. The crucial point is to send the promotion only to the customers who are more receptive, instead of to everyone.
In our case we are dealing with a classification problem: the customer either bought the product (1) or did not (0), as recorded in the column purchase. The columns V1-V2 are abstract features describing the customer.
In addition, the major challenge is that we are working with an imbalanced dataset: of all customers, only 1% actually made a purchase.
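To make the imbalance problem concrete, here is a minimal sketch. It does not use the Starbucks data: the `purchase` array is a hypothetical stand-in with exactly 1% buyers. It shows why plain accuracy is misleading in this setting.

```python
import numpy as np

# Hypothetical stand-in for the `purchase` column: 1,000 customers, 10 buyers (1%).
purchase = np.array([1] * 10 + [0] * 990)

# Share of buyers in the data.
buyer_share = float(purchase.mean())

# A "model" that labels every customer as a non-buyer.
always_no = np.zeros_like(purchase)
accuracy = float((always_no == purchase).mean())

# How many actual buyers did this "model" identify?
buyers_found = int(((always_no == 1) & (purchase == 1)).sum())

print(f"buyer share: {buyer_share:.2%}")
print(f"accuracy of always predicting 0: {accuracy:.2%}")
print(f"buyers identified: {buyers_found}")
```

The trivial classifier looks excellent on accuracy and yet finds no buyer at all, which is why the rest of the article focuses on class weighting and on business metrics rather than on accuracy.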
3. Random Forest Classifier
In this project, I will use the Random Forest Classifier, one of the more powerful approaches, especially for imbalanced datasets.
Random forest is a supervised learning algorithm used for both classification and regression. It can help companies make strategic decisions, especially in sectors such as e-commerce, banking and medicine. The algorithm builds decision trees on samples of the data, obtains a prediction from each tree and finally selects the best solution by voting.
Why Random Forest (RF)
- Good classification performance
- Better generalization performance and less prone to overfitting, especially in comparison to single Decision Trees
- An effective approach for imbalanced data thanks to the additional class weighting option
- The ability to work with large datasets with higher dimensionality
- Measure of feature importance
- Can handle missing values in the dataset
How does Random Forest work?
The random forest ensemble method is made up of a large number of decision trees, called estimators, each of which produces its own prediction. The random forest model combines these classifiers into one meta-classifier to ensure better generalization and accuracy than an individual classifier.
Steps
1. Draw a bootstrap sample of size n with replacement (the same sample can be drawn multiple times; for more information, see 5.3 Bootstrapping)
2. Build a decision tree from the bootstrap sample. At each node:
a) Randomly select a subset of d features (not all of them)
b) Split the node using the feature that best separates the samples (e.g. by maximizing the information gain)
3. Repeat steps 1–2 k times (this variety of many trees is what makes the random forest stronger than an individual decision tree)
4. Aggregate the predictions of all trees by voting (“Bagging”).
Bagging, or bootstrap aggregating, means that the final prediction is made by majority voting: we select the class label predicted by the majority of the trees. Instead of training every tree on the same data, however, each tree is trained on its own bootstrap sample.
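The four steps can be sketched by hand with scikit-learn's DecisionTreeClassifier. This is an illustration only, not the code used in the project: the data are synthetic (make_classification), and the helper names fit_bagged_trees and predict_majority are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def fit_bagged_trees(X, y, k=25, seed=0):
    """Steps 1-3: fit k trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for _ in range(k):
        # Step 1: draw n row indices with replacement (bootstrap sample).
        idx = rng.integers(0, n, size=n)
        # Step 2: max_features="sqrt" makes the tree look at only a random
        # subset of the features when searching for each split.
        tree = DecisionTreeClassifier(
            max_features="sqrt",
            random_state=int(rng.integers(1_000_000)),
        )
        trees.append(tree.fit(X[idx], y[idx]))
    return trees


def predict_majority(trees, X):
    """Step 4: every tree votes and the most frequent label wins (0/1 labels)."""
    votes = np.stack([tree.predict(X) for tree in trees])  # shape (k, n_samples)
    # With an odd k there are no ties: label 1 wins if more than half vote 1.
    return (votes.mean(axis=0) > 0.5).astype(int)


# Synthetic binary data with 7 features (an assumption for illustration).
X, y = make_classification(n_samples=500, n_features=7, random_state=0)
trees = fit_bagged_trees(X, y)
y_pred = predict_majority(trees, X)
train_accuracy = float((y_pred == y).mean())
print("training accuracy:", train_accuracy)
```

Note that scikit-learn's RandomForestClassifier does not count hard votes like this sketch; it averages the class probabilities predicted by the trees. The idea of combining many trees trained on bootstrap samples is the same.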
4. Random Forest Parameter Tuning
The scikit-learn implementation of RandomForestClassifier offers many parameters that we can tune to improve the classification performance. For more information, see the official scikit-learn documentation.
In this project I will concentrate on the major ones. Because Random Forest is an ensemble method, we do not need to prune the trees: the final decision is aggregated over the individual trees, which makes it robust to noise.
1. n_estimators: the number of trees in the forest.
The first parameter we should include in the optimization is the number of trees. On the one hand, a larger number of trees generally leads to better performance. On the other hand, it increases the computational cost.
2. class_weight: {“balanced”, “balanced_subsample”, None}. This is the most important parameter in our project for working with imbalanced data, because imbalance biases machine learning algorithms towards the majority class. Our goal is to detect the customers who are more willing to make a purchase, rather than to achieve good accuracy by labelling all customers as non-buyers.
The scikit-learn library offers us three options:
a) None: all classes are assigned a weight of one.
b) “balanced”: the weights are adjusted automatically, “inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))”
c) “balanced_subsample”: almost the same as “balanced”, except that the weights are computed from the “bootstrap sample for every tree grown”
To sum up, weights express the importance of each class and enter the calculation of the training criterion. During training, the error on each sample is multiplied by the weight of its class, so the estimator tries to minimise the error on the most important classes (those with heavier weights). Weighting the minority class higher means that this class is treated as more important.
3. max_depth: int, default=None. This parameter controls the maximum depth of the tree. With the default value, “nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples”.
This parameter is worth optimizing because large unpruned trees can cause:
a) overfitting
b) huge memory consumption
4. max_samples: the size of the bootstrap sample drawn for each tree (only if bootstrap=True, the default). It is a less common parameter, but helpful in case of overfitting. Decreasing the size of the bootstrap samples increases the randomness, i.e. the diversity of the individual trees, but it can reduce overall performance. Increasing the size, in contrast, can increase the degree of overfitting because the individual trees become more similar to each other. Finding the optimal size can therefore improve the performance of the classifier. The default is the size of the training data.
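The four parameters above can be tuned together with a grid search. The sketch below is an illustration under several assumptions: the data are synthetic, the grid values are arbitrary, and the F1 score is used as the selection metric only because it is built into scikit-learn (the project itself is evaluated on IRR and NIR). It also checks the “balanced” weight formula by hand.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.utils.class_weight import compute_class_weight

# Synthetic, imbalanced stand-in for the campaign data (roughly 5% positives).
X, y = make_classification(
    n_samples=600, n_features=7, weights=[0.95], random_state=0
)

# "balanced" weights by hand: n_samples / (n_classes * np.bincount(y))
manual_weights = len(y) / (2 * np.bincount(y))
sklearn_weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print("weights for class 0 and class 1:", manual_weights)

# Illustrative grid over the parameters discussed in this section.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, 6],
    "max_samples": [0.5, None],  # None means "use the full training size"
    "class_weight": ["balanced", "balanced_subsample"],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1",  # assumption: accuracy would reward predicting "no purchase"
    cv=3,
)
search.fit(X, y)
print("best parameters:", search.best_params_)
```

Because the minority class is rare, its “balanced” weight is much larger than the weight of the majority class, which is exactly the effect described for class_weight above.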
5. Random Forest Feature Importance
Feature importance is a great method that I really like and use to evaluate the results. The good news is that scikit-learn already provides an implementation. Feature importance keeps us from treating the algorithm as a black box and helps us explore the relative importance of each feature. Why do we need it?
- Feature importance can highlight which features we can drop from our model and so
a) decrease computational costs
b) deal with overfitting. Unfortunately, more features in machine learning do not automatically mean better performance
- It gives us a better understanding of, and insights into, the customers: what has the most impact on their behaviour, e.g. to buy or not
Warning! Be careful with feature importance. You can gather useful information from it, but it isn’t the absolute truth. A higher importance score for one feature than for another may mean that this feature is more important, but you can’t be 100% sure!
If some features are statistically dependent/correlated, they will get a lower score than an equally important uncorrelated feature.
5.1 Random Forest Feature Importance: feature_importances_
It is “computed as the mean and standard deviation of accumulation of the impurity decrease within each tree.” The score is calculated automatically and scaled so that the values sum to one.
Because this attribute is available out of the box, it is a good starting point for exploring the results. However, this impurity-based feature importance is computed from statistics of the training dataset.
This can lead to several disadvantages such as:
- bias towards features with high cardinality (many unique values)
- the calculated feature importance may not generalise, so it does not necessarily reflect the real situation on the test set.
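A minimal sketch of reading the impurity-based importances is shown below. The data are synthetic and the feature names V1 to V7 are hypothetical placeholders, so the ranking it prints says nothing about the real campaign data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; the names below are placeholders for illustration.
X, y = make_classification(
    n_samples=500, n_features=7, n_informative=3, random_state=0
)
feature_names = [f"V{i}" for i in range(1, 8)]

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Mean impurity decrease per feature, scaled so that the values sum to one.
importances = forest.feature_importances_

# Spread of the same quantity across the individual trees.
spread = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)

ranked = sorted(zip(feature_names, importances, spread), key=lambda row: -row[1])
for name, mean, std in ranked:
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```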
5.2 Permutation Importance: permutation_importance
An alternative to the impurity-based feature importance is permutation importance. The advantage is that we can compute it on the training set as well as on the test set.
Permutation importance is calculated for a fitted estimator (“an object which manages the estimation and decoding of a model”) on a given dataset: the values of one feature are randomly shuffled, and the resulting drop in the model score indicates how much the model relies on that feature.
Overall, calculating the feature importance on the test set shows the generalization power of the model. If some features are much less important on the test set than on the training set, this indicates that the model suffers from overfitting.
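The comparison between training and test importance can be sketched as follows. The data and the train/test split are synthetic assumptions; only permutation_importance and its result attributes (importances_mean, importances) come from scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data and split, for illustration only.
X, y = make_classification(
    n_samples=600, n_features=7, n_informative=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Shuffle each feature 10 times on each split and record the score drop.
splits = {"train": (X_train, y_train), "test": (X_test, y_test)}
results = {
    name: permutation_importance(forest, X_part, y_part, n_repeats=10, random_state=0)
    for name, (X_part, y_part) in splits.items()
}

for i in range(X.shape[1]):
    train_imp = results["train"].importances_mean[i]
    test_imp = results["test"].importances_mean[i]
    print(f"feature {i}: train={train_imp:.3f}  test={test_imp:.3f}")
```

A feature whose importance is clearly positive on the training split but close to zero on the test split is a candidate sign of overfitting, as described above.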
6. Results
The table below clearly shows that optimization with Machine Learning techniques is worth doing. In comparison to “blindly” sending the promotion to random customers (A/B Testing), both metrics increase. To be more precise, the IRR almost doubled with the Benchmark (a 1.98-fold increase), and Random Forest achieved an even higher increase of 2.1 times.
The results on the NIR metric are especially noticeable. Whereas A/B Testing produced triple-digit negative NIR values, the Benchmark achieved a positive outcome. The Random Forest approach beat the Benchmark in turn, raising the NIR from 189.45 to 430, which is about 2.3 times more.
7. Conclusion
In this part of the Random Forest project, I have shown the positive impact of Machine Learning on the advertising/promotion strategy. I have explained how to combine the outcomes of our earlier A/B testing with Machine Learning techniques to optimize our advertising strategy: by identifying and targeting the right customers, we increase sales and use marketing resources efficiently, instead of wasting valuable resources on advertising to the wrong markets/customers, which brings no additional value.
In addition, I have shown that finding hidden patterns with the Random Forest Classifier is worthwhile, since sending the ads only to the customers who are more receptive is more cost-effective (ROI) than sending them to all.
Recommendation to the company: Random Forest (RF) is a very resourceful approach for making the accurate predictions needed in strategic decision making in organizations. For the above reasons, I would recommend that companies seeking to improve their products and services use Random Forest. The fact that applying RF is a continuous process that requires innovative ideas, entrepreneurial spirit and courage should not deter companies from using it, especially after A/B testing: RF can help your company minimise risks and maximise outcomes by basing your marketing strategy on data-driven decisions. In general, it is advisable to make data-driven decisions, and the following methods can help your company increase its Return on Investment (ROI):
1. Use A/B Testing
2. Investigate/ Explore your data and results
3. Apply Machine Learning to optimize the outcomes
For more information, please check out my code on GitHub
Coming soon:
I) Sentiment Analysis for Amazon Consumer Reviews with LLM
II) A Deep Dive into XGBoost