Mail-order company customer analysis

Clare
Oct 18, 2021

As part of the Udacity Data Science Nanodegree, I have chosen to work with data from a mail-order company for the final capstone project.

Project Overview:

The data available is as follows:

  1. Demographics data for a sample of ~900k individuals from the general population of Germany
  2. Demographics data for ~190k customers of a mail-order company
  3. Demographics data for ~86k individuals subject to a marketing campaign.

The demographics data includes 366 features, ranging from information at an individual level to grouped-level data about their household, building and neighbourhood, and consists mostly of categorical variables with some continuous ones. The marketing campaign data is equally split into a test and a train set, where the labels indicating whether the marketing campaign was successful are withheld for the test set. In addition to the data, data dictionaries are also provided, giving insight into the meaning of features and their values.

Problem Statement:

The aims of the project are two-fold:

  • Produce a customer-segmentation report, outlining how the customer population (data source 2) differs from the general population of Germany (data source 1). This must be done using unsupervised learning techniques.
  • Use supervised learning techniques to predict which customers are likely to respond positively to a marketing campaign run by the mail-order company (data source 3).

In order to complete this project, I have worked through the following steps:

Customer Segmentation Report:

  1. Data cleaning and preparation
  2. Dimensionality reduction
  3. Clustering

Marketing Campaign Model:

  1. Data cleaning and preparation
  2. Customer segmentation model
  3. Initial Approach
  4. Re-sampling Approach

The report below outlines the process and results for each step, including conclusions drawn. To view the analysis in more detail, please check out the GitHub repository.

Metrics:

As will be seen later, the data for the marketing campaign model was highly class imbalanced, with ~1.2% positive responses. This meant that metrics other than accuracy had to be used, since any model predicting 100% negative responses would automatically be ~98.8% accurate. I chose to use ROC AUC as the main metric of good performance, as well as graphing the percentage of positive responses captured as a function of the predicted positive probability. ROC AUC should be a good indicator of how well a model ranks true positives above negatives, and therefore of whether, at a different prediction threshold, the model could do a better job of picking up the positive responses.
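To make the evaluation concrete, below is a minimal sketch (not taken from the project code) of how these metrics can be computed with scikit-learn, assuming a fitted classifier `model` and a held-out test set `X_test`/`y_test`; all variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Predicted probability of the positive class for each individual
y_scores = model.predict_proba(X_test)[:, 1]

# ROC AUC is threshold-independent: it rewards ranking true positives above
# negatives even when a 0.5 threshold would predict everything as negative.
print("ROC AUC:", roc_auc_score(y_test, y_scores))

# Percentage of positive responses captured when targeting the individuals
# with the highest predicted positive probability (the capture curve above).
order = np.argsort(-y_scores)
sorted_labels = np.asarray(y_test)[order]
for pct in (5, 10, 20, 50):
    k = int(len(sorted_labels) * pct / 100)
    captured = sorted_labels[:k].sum() / sorted_labels.sum()
    print(f"Top {pct}% of scores captures {captured:.1%} of positive responses")
```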

Customer Segmentation Report

Splitting customers into segments based on aspects such as behaviour, product holdings, or interactions is a common tool used across data science and analytics to help companies better understand the needs of their customers, and to predict how they may respond to products or interventions. Customer segments can be used as a filter to break down campaign results to better understand them, to understand a portfolio of customers more deeply, or it can even be used as one of many inputs to a larger model that predicts future behaviour.

In this particular case, very large and thorough datasets capturing information about the general population and the customers of a mail-order company are used to try to tease out the most significant differences between the two populations, and analyse why these differences may be present.

Data Exploration

1. Data Cleaning and Preparation

The first step with all data analysis is to ensure the data is clean and can be worked with. The data provided was not clean and contained many nulls, as well as many features with ‘unknowns’ that were mapped to different values.

Histogram showing the percentage of nulls for columns in the general population dataset.

After extensive investigation of the data and the data dictionaries, a few steps were carried out (a rough code sketch of some of these steps follows the list):

  1. Some columns were updated or simplified based on the values of other columns to avoid nulls but retain information. For example, the ALTER_KIND<X> columns were binned into a new categorical column denoting whether the individual has:
    0 — no children
    1 — child(ren) under 5
    2 — child(ren) aged 5–10
    3 — child(ren) aged 10–15
    4 — child(ren) aged 15–20
    5 — child(ren) over 20
    rather than having separate columns for the age of each child.
    In addition, many of the D19__ONLINE_QUOTE columns were updated based on the values in the _ONLINE_DATUM columns.
  2. Many columns that had nulls also had an ‘unknown’ category, so null values were filled with this.
  3. Any columns that at this point were more than 50% null were dropped (this affected one column in both the general population and customer population dataset — EXTSEL992)
  4. Any rows that were more than 50% null were dropped (this affected no rows in either the general population or customer population dataset)
  5. Otherwise, nulls were filled with the mode, as this is more suitable for categorical data, and most of the features are categorical
  6. The ‘Unknown’ categories were amended so that only one category for each column represents unknown (-1)
  7. Mixed type columns were fixed by casting as strings
  8. Categorical columns were converted to categorical type

After this, some columns that were either irrelevant, or whose meaning was completely unknown due to not being present in the data dictionary, were dropped (EINGEFUEGT_AM, MIN_GEBAEUDEJAHR), and the datasets were randomly sampled to allow analysis as follows:

  • 5% of the general population data was taken (44561 rows)
  • 15% of the customer population data was taken (28748 rows)

Implementation:

2. Dimensionality Reduction

Given the number of features remaining in the datasets, it is necessary to apply a dimension reduction technique in order to view the spread of the data and ascertain where the most variation within and between populations exists.

Different dimension reduction techniques should be used for different kinds of data. Other people have provided much better explanations than I could of the different techniques and when each should be used, but for the purposes of this analysis it is only necessary to be aware that PCA, a very common dimension reduction technique, is best suited to continuous data, while other techniques or preparations should be used for data that is categorical, or a mix of the two. After trialling FAMD for mixed data, which should be the natural choice for this dataset, I found a problem in the implementation and was unable to continue. I therefore instead split the data into two parts, one containing only categorical data and one containing only continuous data. I then applied MCA to the categorical data and PCA to the continuous data.
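The sketch below shows the general shape of this split-and-reduce step, assuming a cleaned DataFrame `df` whose categorical columns have been cast to the 'category' dtype. The MCA implementation is assumed here to come from the `prince` library (the original analysis does not name one), and exact method names may differ between versions.

```python
import prince                        # assumed MCA implementation
from sklearn.decomposition import PCA

cat_cols = df.select_dtypes(include="category").columns
num_cols = df.select_dtypes(exclude="category").columns

# MCA on the categorical features (fit on the general population,
# then the same object can be reused to transform the customer data)
mca = prince.MCA(n_components=2).fit(df[cat_cols])
mca_scores = mca.transform(df[cat_cols])      # per-row component scores

# PCA on the continuous features
pca = PCA(n_components=2)
pca_scores = pca.fit_transform(df[num_cols])
```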

Both techniques look to find the perpendicular axes (or components) along which the most inertia (or variability) in the data is present, in descending order of inertia explained. The influence each feature has on each component is referred to as the ‘loading’, and by multiplying the value of each feature by its loading for each component and each observation (in this case customer), component ‘scores’ for each customer can be obtained. By doing this for the first two components, a lot of the variability in a dataset can (hopefully!) be accounted for, and observations can then be plotted in two-dimensional space, allowing for clustering analysis.

In order to get around problems with different values being present in my general population and customer population datasets, I had to one-hot-encode the categorical data before applying MCA, and then create dummy columns in the customer data for categories that were not present. This is a result that is interesting in and of itself — showing which values for different features are in one dataset but not the other, and may be worth additional attention. However, one-hot-encoding so many multi-categorical columns creates a very sparse dataset, and possibly means that the first two principal components alone cannot explain a large share of the inertia.
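A minimal sketch of this alignment step, assuming `general_cat` and `customers_cat` hold the categorical features of the two datasets; `reindex` both adds zero-filled dummy columns for categories that only appear in the general population and keeps the column order identical.

```python
import pandas as pd

general_dummies = pd.get_dummies(general_cat, columns=list(general_cat.columns))
customer_dummies = pd.get_dummies(customers_cat, columns=list(customers_cat.columns))

# Align the customer dummies to the general-population dummy columns:
# missing categories become zero-filled columns, extras are dropped.
customer_dummies = customer_dummies.reindex(columns=general_dummies.columns, fill_value=0)
```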

Fitting MCA and PCA objects to the general population dataset, then applying the same decomposition to the customers dataset results in the following distribution of observations:

Here we can see that the populations are similar overall, with small differences in the distribution of individuals. By analysing the loadings of each component we can get a feel for which features have a high impact on the spread of data within the components. Looking at the features with the most negative loadings tells us which features' large values would push an observation towards the negative end of the axis, while looking at the largest positive loadings tells us what pushes an observation towards the positive end.

For the categorical features, since the data has been one-hot-encoded, each ‘feature’ is actually a feature-value combination, so the effect of the specific category an individual falls into can be assessed. From the table of loadings (below) we can therefore deduce that component 1 varies from high-online, high-telephony purchasers at low values, to highly educated, WFH (work-from-home), possibly renting individuals at high values, while component 2 varies from retired, low-online-affinity customers at low values, to high-online/telephony purchasers at high values.

Principal component loadings for MCA fitted to the general population dataset. Question marks denote where the feature was not present in the data dictionary and its meaning was inferred from similar features in the data dictionary

A similar analysis can be done for the PCA on the continuous variables, though only six continuous features in total are left after cleaning. Here we can see that component 1 goes from older customers with more cars at low values, to more highly educated customers at higher values, while component 2 indicates higher density of housing and more academic title holders at high values, and more densely occupied houses and greater car ownership at low values.

Principal component loadings for PCA fitted to the general population dataset. Question marks denote where the feature was not present in the data dictionary.

In order to quantify and better understand the differences between the two sets of data, these distributions are split into groups using clustering algorithms.

3. Clustering

At first, K-means was trialled for this dataset, but as can be seen below, it was unable to capture and differentiate the clusters.

Using the elbow method, three clusters were found to explain a good amount of the spread of data, but this did not pick up data points in the PC1 0.5–1 range. Increasing the number of cluster centres resulted in further splitting of the leftmost cluster before picking up the middle one.
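For reference, the elbow method here amounts to fitting K-means for a range of cluster counts and looking for a kink in the inertia curve; a minimal sketch, assuming the two-dimensional component scores are held in `scores`:

```python
from sklearn.cluster import KMeans

inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(scores)
    inertias.append(km.inertia_)

# Plotting inertias against k and looking for the "elbow" suggested
# three clusters for this dataset.
```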

Refinement:

DBSCAN was more successful, owing to the elongated shapes of the clusters visible in the plots. DBSCAN builds clusters of points based on their distance from each other, so clusters of any arbitrary shape can be found, provided they are of sufficient density, unlike with K-means. However, because the technique does not find cluster centres, there is no fitted object (as there is with K-means) that can be applied to an unseen dataset. Instead, DBSCAN must be run on each dataset separately and the resulting clusters compared. This is not such a problem for this dataset as the two populations are quite similar.
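A minimal sketch of this step with scikit-learn, assuming `mca_scores` and `pca_scores` hold the two-dimensional component scores for one population; the epsilon values are those quoted below, while `min_samples` is illustrative.

```python
from sklearn.cluster import DBSCAN

mca_labels = DBSCAN(eps=0.1, min_samples=10).fit_predict(mca_scores)
pca_labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(pca_scores)

# A label of -1 marks noise points (shown in black in the plots below).
```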

The clusters found by the DBSCAN algorithm, using an epsilon of 0.1 for the MCA scores and 0.5 for the PCA scores, are shown below, with noise points in black:

Here we can see that in each case the same clusters have been found (though the labels are not the same for the MCA clusters), but that they have slightly different shapes, sizes and densities. By looking at the percentage of each population that falls into each cluster we can see where the biggest differences lie.

For the categorical features:

Table showing the percentage of each population (general and customer, with customer labels changed to match those found in the general population) that sits in each cluster, as well as the difference in these values, given as a percentage of the general population percentage

We can see that for the MCA, clusters 1 and 2 are under-represented in the customer population, while clusters 0 and 3 are over-represented. The main difference is in clusters 2 and 3, with clusters 0 and 1 showing less of a difference.

Cross-referencing this back to the loadings of each of the components, we can say that clusters 1 (orange) and 2 (yellow), with high values for component 1 and medium values for component 2, represent highly educated, possibly working-from-home individuals with not-so-high online affinity.

Cluster 3 (green) sits between clusters 0 and 1, with around 0 for component 2, and moderately positive component 1 scores. This cluster may represent customers who are similar to clusters 1 and 2, but perhaps less highly educated or less likely to work from home.

Representation of the loadings for each MCA component shown for the general population dataset clusters

Interestingly, cluster 0 (maroon), the largest cluster in each case, with lower component 1 values and the whole range of component 2 values, represents possibly two groups of customers (see below), one with high online affinity and high online purchase rates, and one with low online affinity, high telephony purchase rates and likely to be retired. Although these groups are not differentiated by DBSCAN at this epsilon level, from visually assessing the clusters, it may be that the customer dataset has fewer individuals in the second group (retirees), compared to the overall population dataset.

Possible split of cluster 0 in the general population MCA plot fitted with DBSCAN clusters

Summarising this, it may be said that the mail-order company might do better to focus on attracting customers that fall into cluster 3 — not so highly educated, possibly working from home, but not high online purchasers (as you might expect) — rather than retired customers, or those who already purchase most of their items online.

For the continuous features:

Table showing the percentage of each population (general and customer) that sits in each cluster, as well as the difference in these values, given as a percentage of the general population percentage

We can see that for PCA, there are only two clusters, and both seem to be slightly over-represented in the customer dataset.

Again looking back to the loadings of each of the components, we can see that cluster 1 — with higher component 1 values than cluster 0 — may represent customers with fewer cars and higher education levels. This cluster also has less positive component 2 values, indicating that perhaps these customers live in housing with lower densities, but in professional roles.

Summarising this, it could be that this cluster represents individuals who are not in city centres, but are less likely to have cars to travel and shop. Higher education and professional titles may indicate a likelihood of working from home. By combining PCA component 1 with the MCA results and plotting in three dimensions, we can see that, indeed, high PCA component 1 scores coincide with the clusters that have high values for the small office/home office flag — clusters 1 (orange) and 2 (yellow), with high MCA component 1 scores:

3-dimensional plot showing how the PCA component 1 aligns with the MCA clusters

Marketing Campaign Model

A predictive model can learn to pinpoint individuals more or less likely to respond positively to a marketing campaign, and by targeting marketing materials towards those identified, companies can use resources as efficiently as possible. Even if a predictive model is only slightly better than chance at identifying promising leads, the cost/benefit ratio of a marketing campaign can be improved, so the business incentive for this is clear.

1. Data Cleaning and Preparation

The mailout data was cleaned and prepared in exactly the same way as the previous data.

2. Customer Segmentation Model

Given the large differences observed between the general population and the customers dataset in the percentage of individuals in each cluster, my approach to this task was to use insights from the segmentation report to engineer an additional feature denoting which of these segments an individual would fall into. I did this using a multi-class classification model trained on the general population dataset, with the cluster labels learned from MCA appended. Given that the MCA clusters aligned well with PCA component 1, I did not include information about the PCA clusters.

Trialling two different approaches (sklearn’s GradientBoostingClassifier, and RandomForestClassifier) I found that both did a good job predicting the rarer segments on a 0.33 test set:

Given that performance was so similar, due to its speed, the Random Forest model was taken forwards. This model was then used to predict the segments (clusters) for the observations in the mailout training set, which would then be used as a feature for the final predictive model.
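Roughly, this step looks like the sketch below, assuming `general_features` holds the prepared general-population features, `cluster_labels` the MCA cluster assignments, and `mailout_features` the prepared mailout training data; the names are illustrative, and noise points (label -1) are simply kept as their own class here.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    general_features, cluster_labels, test_size=0.33, random_state=42)

segment_model = RandomForestClassifier(random_state=42)
segment_model.fit(X_train, y_train)

# Predicted segment for each mailout individual, to be appended as an
# extra feature for the final response model
mailout_segments = segment_model.predict(mailout_features)
```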

Looking at the predictions made by the segmentation model for the mailout training set, we can see that no customers were predicted to be in cluster 3, and only a very few in cluster 2, which were the clusters most indicative of an individual being a customer of the company. This could suggest that the dimension reduction/clustering is not very good, that the data it was trained on was not representative, or that a positive response to the marketing campaign is not necessarily a 100% match to becoming a customer of the company. It may, in fact, be better to do something else with this customer segmentation report, other than use it as an input to the final model. In any case, it is included as a trial.

3. Initial Approach

Initially, the class imbalance was addressed by specifying balanced class weighting (where supported). However, a variety of classification models and data preparation approaches did not do a great job of predicting positive responses to the marketing campaign. Each baseline model was trained on different combinations of the following (a sketch of one combination follows the list):

  • Data preparation: (1) one-hot encoding of string-containing categorical columns only, (2) one-hot encoding of all categorical columns, or (3) one-hot encoding of all categorical columns followed by ensuring all columns in test match those in train. Note that although one-hot-encoding all categorical variables should be the best approach (since the values for each category are not related in magnitude), it does result in a sparse dataframe.
  • Scaling approach: (1) MinMaxScaler set to between 0 and 1 or (2) StandardScaler fitted to train and applied to test set.
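As a rough sketch of one such combination, the snippet below uses option (3) for data preparation, a MinMaxScaler, and balanced class weights; the model type (logistic regression here) and variable names are illustrative rather than the project's exact configuration.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# (3) one-hot encode all categorical columns, then align test to train
X_train_enc = pd.get_dummies(X_train, columns=list(categorical_cols))
X_test_enc = pd.get_dummies(X_test, columns=list(categorical_cols))
X_test_enc = X_test_enc.reindex(columns=X_train_enc.columns, fill_value=0)

# (1) MinMax scaling to the range 0-1, fitted on train only
scaler = MinMaxScaler(feature_range=(0, 1))
X_train_scaled = scaler.fit_transform(X_train_enc)
X_test_scaled = scaler.transform(X_test_enc)

# class_weight='balanced' reweights the loss so the rare positive class
# contributes as much as the majority class
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train_scaled, y_train)
```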

Model Evaluation and Validation:

Performance of baseline models of various types to assess suitability of model type. Due to the application of the model (finding the most promising leads), accuracy is not a good metric, and more attention should be paid to recall and ROC AUC

Hyperparameter tuning of the most promising model using GridSearchCV did not achieve improvement, and if anything seems to have resulted in worse performance, possibly due to the test and train sets being different for the GridSearchCV than for the initial baseline model survey:

A gradient boosting classifier, optimised through GridSearchCV, does not improve the ROC AUC, and the false negative rate is high.

Here the best parameters were found to be (a sketch of the search setup follows the list):

  • ‘learning_rate’: 0.01,
  • ‘max_depth’: 10,
  • ‘max_features’: None,
  • ‘n_estimators’: 100
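For reference, a rough sketch of the grid search setup, assuming the prepared training data in `X_train`/`y_train`; the grid shown is illustrative, built around the best parameters listed above, and uses ROC AUC as the scoring metric in line with the rest of the analysis.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "learning_rate": [0.01, 0.1],
    "max_depth": [3, 5, 10],
    "max_features": [None, "sqrt"],
    "n_estimators": [100, 200],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_)
```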

I had hoped that by setting a cap on the number of features used, those with a high percentage of unknown values could have been removed and a better model found, but this does not seem to be the case.

It seems that the model that achieves best performance classifies all instances as negative, and in fact all the positive instances are assigned the lowest probabilities of being positive. This is due to the high class imbalance leading to biased models. For this reason another approach to address the class imbalance was trialled — under-sampling the negative class with imblearn’s RandomUnderSampler and (due to the very small number of positive class instances), boosting the positive class with one of imblearn’s RandomOverSampler or SMOTE, whichever performs best.

4. Re-sampling Approach

Baseline Model Comparisons

In order to compare different models with different over-sampling techniques, each baseline model was trained on different combinations of:

  • Data preparation: as before.
  • Scaling approach: as before.
  • Resampling: (1) no resampling at all, or (2) undersampling of the negative class followed by oversampling of the positive class to give a 50% split of classes, using three different ratios (undersampled to 10%, 20% or 30% before oversampling to 50%) and two different over-sampling techniques (RandomOverSampler and SMOTE). Note that in future SMOTE-NC should be trialled as it is suited to categorical data. A sketch of one such resampling combination follows this list.
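As a rough illustration of one of the resampling combinations above (undersampling to a 10% positive rate, then SMOTE to a 50/50 split), assuming the encoded and scaled training data in `X_train`/`y_train`:

```python
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

# Undersample the negative class until the positive:negative ratio is 0.1
# (roughly a 10% positive class rate)...
under = RandomUnderSampler(sampling_strategy=0.1, random_state=42)
X_under, y_under = under.fit_resample(X_train, y_train)

# ...then oversample the positive class with SMOTE to a 50/50 split
over = SMOTE(sampling_strategy=1.0, random_state=42)
X_resampled, y_resampled = over.fit_resample(X_under, y_under)
```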

Models trialled were: RandomForestClassifier, GradientBoostingClassifier, LogisticRegression, LGBMClassifier, MLPClassifier, AdaBoostClassifier.

Predictions on the test set (that had not been resampled, but had otherwise been prepared in the same way as the train set) were calculated and performance metrics collected for all combinations of the above, with most attention paid to ROC AUC.

Comparing over 250 individual model runs, the best baseline performer was found to be a Logistic Regression model. The top approach uses one-hot-encoding for all categorical variables followed by a random train-test-split, undersampling of the negative class to a 10% positive class rate, then SMOTE oversampling to a 50% positive class rate, and a MinMaxScaler. Recall values, though important for this application, can be improved by adjusting the threshold for predictions.

Final Model Hyperparameter Tuning

Training and testing data was prepared according to the results of baseline model testing (one-hot-encoding for all categorical variables followed by a random train-test-split, scaling with MinMaxScaler), and in order to ensure that over/undersampling was only applied to the train set in cross-validation, imblearn’s pipeline module was used to apply this. After tuning, in terms of ROC AUC, the final model is no better than before hyperparameter tuning (since a different test set was used in the initial baseline model testing), but results are an improvement on the model attained after tuning without any re-sampling of the classes.
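A rough sketch of this setup, using imblearn's Pipeline so the samplers are applied only to the training folds during cross-validation; the regularisation grid for the logistic regression is illustrative.

```python
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ("under", RandomUnderSampler(sampling_strategy=0.1, random_state=42)),
    ("over", SMOTE(sampling_strategy=1.0, random_state=42)),
    ("clf", LogisticRegression(max_iter=1000)),
])

param_grid = {"clf__C": [0.01, 0.1, 1, 10]}

search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=5)
search.fit(X_train_scaled, y_train)
```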

A logistic regression model, optimised through GridSearchCV, does not improve the ROC AUC, and a large proportion of positives still have a probability of 0, so the false negative rate for this model is higher than would be wanted.

Justification:

We can see that the approach of oversampling the positive class and undersampling the negative class did improve performance in regards to capturing more true positives at a non-zero predicted probability. This makes sense — by ensuring that the classes are more balanced a less biased model can be found. Many of the best performing models were Logistic Regression models, and most had one-hot-encoded features. This is as expected, as doing this pre-processing step ensures that the values of the categorical features are not interpreted as being related in magnitude to each other. No one re-sampling or scaling method seemed to be clearly better than others.

Conclusions

Reflection:

In terms of the solution, the model tuned using re-sampled data to balance the classes performed slightly better than the model trained on highly imbalanced data with adjusted class weighting. I would therefore suggest that the mail-order company could implement this model, with a lower prediction threshold, to (slightly) narrow down the population of customers to target with a marketing campaign, but this would mean that the majority of those likely to respond would still unfortunately be missed! I could not therefore recommend this solution be put into practice as it stands, but I would recommend that further work be done.

For the customer segmentation report, I would be more confident in recommending the outputs for use, as clusters were well defined and stable between populations/runs. Some clear and not so surprising conclusions could be drawn from this work, and these could be a real advantage to the company in terms of anticipating customer need and behaviour.

Improvement:

The final model is still not good and if I had more time I would certainly look to improve performance.

My first priority would be feature engineering and feature selection to reduce the feature space. This would be easier if I was able to work with someone more familiar with the datasets provided, who had the subject matter expertise to guide in what features might be promising, and what all the features mean exactly. Initial trials in reducing features by dropping columns with high rates of ‘unknown’ values, and using Random Forest stumps to find the features that drive the most information gain (discarding the features with zero importance) did not result in better performance of baseline models, so more investigation is needed.

In addition I would look to incorporate SMOTE-NC, and perhaps look at more over/under-sampling strategies, trial more baseline models (KNN Classifier, XGBoost, Support Vector Classifier…etc) or FLAML, and investigate stacking models.

Due to resource constraints, the customer segmentation report has only been run using random subsamples, which are quite small and may not be representative. In future I would look to scale this up and run the procedures on larger subsets of the data.

A huge thank you to Udacity for their support and the opportunity to work on this interesting and complex project.
