Mail-order company customer analysis

Project Overview:

  1. Demographics data for a sample of ~900k individuals from the general population of Germany
  2. Demographics data for ~190k customers of a mail-order company
  3. Demographics data for ~86k individuals subject to a marketing campaign.

Problem Statement:

  • Produce a customer-segmentation report outlining how the customer population (data source 2) differs from the general population of Germany (data source 1). This must be done using unsupervised learning techniques. The steps are:
    1. Data cleaning and preparation
    2. Dimensionality reduction
    3. Clustering
  • Use supervised learning techniques to predict which customers are likely to respond positively to a marketing campaign run by the mail-order company (data source 3). The steps are:
    1. Data cleaning and preparation
    2. Customer segmentation model
    3. Initial approach
    4. Re-sampling approach

Metrics:

Customer Segmentation Report

Data Exploration

1. Data Cleaning and Preparation

Histogram showing the percentage of nulls for columns in the general population dataset.
  1. Some columns were updated or simplified based on the values of other columns, to avoid nulls while retaining information. For example, the ALTER_KIND<X> columns were binned into a single new categorical column with the following values:
    0: no children
    1: child(ren) under 5
    2: child(ren) aged 5–10
    3: child(ren) aged 10–15
    4: child(ren) aged 15–20
    5: child(ren) over 20
    This replaces the separate columns for the age of each child.
    In addition, many of the D19__ONLINE_QUOTE columns were updated based on the values in the _ONLINE_DATUM columns.
  2. Many columns that had nulls also had an ‘unknown’ category, so null values were filled with this.
  3. Any columns that were more than 50% null at this point were dropped. This affected one column, EXTSEL992, in both the general population and customer population datasets.
  4. Any rows that were more than 50% null were dropped. This affected no rows in either the general population or customer population dataset.
  5. Otherwise, nulls were filled with the mode, which suits categorical data better than the mean or median; most of the features are categorical.
  6. The ‘Unknown’ categories were amended so that only one category for each column represents unknown (-1)
  7. Mixed type columns were fixed by casting as strings
  8. Categorical columns were converted to categorical type
  • 5% of the general population data was taken (44561 rows)
  • 15% of the customer population data was taken (28748 rows)
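
The null-handling steps above (3 to 5, 7 and 8) can be sketched in pandas roughly as follows. This is a minimal illustration, not the code used for the project: the function name `clean_demographics`, the `categorical_cols` argument, and the toy column names are all hypothetical, and the handling of the 'unknown' categories (steps 2 and 6) is left out because it depends on the data dictionary.

```python
import numpy as np
import pandas as pd


def clean_demographics(df, categorical_cols, max_null_frac=0.5):
    """Hypothetical helper: drop sparse columns/rows, fill nulls with the mode."""
    # Step 3: drop columns that are more than 50% null.
    df = df.loc[:, df.isna().mean(axis=0) <= max_null_frac].copy()
    # Step 4: drop rows that are more than 50% null.
    df = df.loc[df.isna().mean(axis=1) <= max_null_frac].copy()
    # Step 5: fill the remaining nulls with each column's most frequent value.
    for col in df.columns:
        if df[col].isna().any():
            df[col] = df[col].fillna(df[col].mode(dropna=True).iloc[0])
    # Steps 7 and 8: cast to string (fixes mixed types), then to categorical.
    for col in categorical_cols:
        if col in df.columns:
            df[col] = df[col].astype(str).astype("category")
    return df


# Toy data standing in for the real demographics tables.
toy = pd.DataFrame({
    "A": [1, 1, 2, np.nan],
    "B": ["x", None, "y", "y"],
    "MOSTLY_NULL": [np.nan, np.nan, np.nan, 1.0],
})
cleaned = clean_demographics(toy, categorical_cols=["A", "B"])
print(cleaned)
```

A random subset like the 5% and 15% samples above could then be drawn with `DataFrame.sample(frac=0.05, random_state=...)`; whether the project sampled this way is not stated.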

Implementation:

2. Dimensionality Reduction

Principal component loadings for MCA fitted to general population dataset. Question marks denote where the feature was not present in the data dictionary and its meaning was inferred from similar features in the data dictionary
Principal component loadings for MCA fitted to general population dataset. Question marks denote where the feature was not present in the data dictionary.
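
For readers unfamiliar with MCA (multiple correspondence analysis), the idea can be sketched with NumPy alone: one-hot encode the categorical features into an indicator matrix, then run correspondence analysis on it through an SVD of the standardised residuals. The article does not say which library was used, so this is a textbook-style illustration rather than the project's implementation; the function name and toy features are hypothetical.

```python
import numpy as np
import pandas as pd


def mca_row_coordinates(df, n_components=2):
    """Correspondence analysis of the one-hot indicator matrix of `df`."""
    Z = pd.get_dummies(df.astype(str)).to_numpy(dtype=float)  # indicator matrix
    P = Z / Z.sum()              # correspondence matrix (entries sum to 1)
    r = P.sum(axis=1)            # row masses
    c = P.sum(axis=0)            # column masses
    # Standardised residuals from the independence model r * c^T.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    k = n_components
    # Principal coordinates of rows (individuals) and columns (categories).
    rows = (U[:, :k] * sigma[:k]) / np.sqrt(r)[:, None]
    cols = (Vt.T[:, :k] * sigma[:k]) / np.sqrt(c)[:, None]
    return rows, cols, sigma[:k] ** 2  # squared singular values = inertia


# Toy categorical data standing in for the demographics features.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "f1": rng.choice(["a", "b", "c"], size=50),
    "f2": rng.choice(["x", "y"], size=50),
    "f3": rng.choice(["p", "q", "r"], size=50),
})
rows, cols, inertia = mca_row_coordinates(toy, n_components=2)
print(rows[:3])
print(inertia)
```

The column coordinates play the role of the loadings in the figures: categories with large absolute values on a component are the ones that drive it.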

3. Clustering

Using the elbow method, three clusters were found to capture most of the spread of the data, but this did not pick out the data points in the PC1 0.5–1 range as their own cluster. Increasing the number of cluster centres resulted in further splitting of the leftmost cluster before picking up the middle one.
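
As a reminder of how the elbow method works, the sketch below fits k-means for a range of cluster counts and records the inertia (within-cluster sum of squares) for each; the "elbow" is the point after which adding clusters stops reducing inertia much. It assumes scikit-learn's `KMeans`, which the article does not name explicitly, and uses synthetic two-dimensional blobs as a hypothetical stand-in for the MCA coordinates.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical stand-in for MCA row coordinates: three well-separated blobs.
centres = np.array([[-2.0, 0.0], [0.75, 0.0], [3.0, 1.0]])
X = np.vstack([c + 0.2 * rng.standard_normal((100, 2)) for c in centres])

inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_  # within-cluster sum of squared distances

for k, value in inertias.items():
    print(k, round(value, 1))
```

On data like this the inertia drops sharply up to k = 3 and flattens afterwards. On the real data the elbow is usually less clear-cut, which is consistent with the middle group being missed as described above.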

Refinement:

Table showing the percentage of each population (general and customer, with customer labels changed to match those found in the general population) that sits in each cluster, as well as the difference in these values, given as a percentage of the general population percentage
Representation of the loadings for each MCA component shown for the general population dataset clusters
Possible split of cluster 0 in the general population MCA plot fitted with DBSCAN clusters
Table showing the percentage of each population (general and customer) that sits in each cluster, as well as the difference in these values, given as a percentage of the general population percentage
3-dimensional plot showing how the PCA component 1 aligns with the MCA clusters
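
The cluster-share tables described in the captions above can be computed along these lines. This is a sketch under assumptions: the function and column names are hypothetical, and the sign convention (customer share minus general share, divided by general share) is a guess at what "difference given as a percentage of the general population percentage" means.

```python
import pandas as pd


def cluster_share_table(general_labels, customer_labels):
    """Percentage of each population per cluster, plus the relative difference."""
    general = pd.Series(general_labels).value_counts(normalize=True) * 100
    customer = pd.Series(customer_labels).value_counts(normalize=True) * 100
    table = pd.DataFrame({"general_pct": general, "customer_pct": customer})
    table = table.fillna(0.0).sort_index()
    # Assumed convention: positive means over-represented among customers.
    table["relative_diff_pct"] = (
        (table["customer_pct"] - table["general_pct"]) / table["general_pct"] * 100
    )
    return table


# Toy cluster labels standing in for the real cluster assignments.
table = cluster_share_table(
    general_labels=[0, 0, 0, 0, 1, 1, 1, 2, 2, 2],
    customer_labels=[0, 0, 0, 0, 0, 0, 1, 1, 2, 2],
)
print(table.round(1))
```

In the toy example cluster 0 holds 40% of the general population but 60% of customers, so it is over-represented by 50% relative to the general population.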

Marketing Campaign Model

1. Data Cleaning and Preparation

2. Customer Segmentation Model

3. Initial Approach

  • Data preparation: (1) one-hot encoding of string-containing categorical columns only, (2) one-hot encoding of all categorical columns, or (3) one-hot encoding of all categorical columns followed by ensuring all columns in test match those in train. Note that although one-hot-encoding all categorical variables should be the best approach (since the values for each category are not related in magnitude), it does result in a sparse dataframe.
  • Scaling approach: (1) MinMaxScaler set to between 0 and 1 or (2) StandardScaler fitted to train and applied to test set.
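
Data-preparation option (3) and scaling option (2) might look like the sketch below: one-hot encode train and test separately, reindex the test columns to match train, then fit the scaler on train only. The column names `CAT` and `NUM` and the toy frames are hypothetical; only `pandas.get_dummies`, `DataFrame.reindex`, and scikit-learn's `StandardScaler` are real APIs.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frames; the test set contains a category ("d") never seen in train.
train = pd.DataFrame({"CAT": ["a", "b", "c", "a"], "NUM": [1.0, 2.0, 3.0, 4.0]})
test = pd.DataFrame({"CAT": ["a", "d"], "NUM": [2.0, 5.0]})

train_enc = pd.get_dummies(train, columns=["CAT"], dtype=float)
test_enc = pd.get_dummies(test, columns=["CAT"], dtype=float)

# Align test to train: add missing dummy columns as 0, drop unseen ones.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0.0)

# Fit the scaler on train only, then apply the same transform to test.
scaler = StandardScaler().fit(train_enc)
train_scaled = scaler.transform(train_enc)
test_scaled = scaler.transform(test_enc)

print(list(test_enc.columns))
```

Fitting the scaler on the training data alone keeps information from the test set out of the model, and the reindex step guarantees both matrices have identical columns in the same order.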

Model Evaluation and Validation:

Performance of baseline models of various types to assess suitability of model type. Because the model is meant to find the most promising leads, accuracy is not a good metric; more attention should be paid to recall and ROC AUC.
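
To see why accuracy is misleading here, consider a classifier that predicts "no response" for everyone. The sketch below uses a hypothetical 1% response rate (the real rate is not quoted in this article) and scikit-learn's metric functions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

# Hypothetical imbalanced target: 10 responders out of 1000 individuals.
y_true = np.array([1] * 10 + [0] * 990)

# A "model" that always predicts the negative class with the same score.
y_pred = np.zeros_like(y_true)
scores = np.zeros(len(y_true), dtype=float)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred, zero_division=0)
auc = roc_auc_score(y_true, scores)

print(accuracy, recall, auc)
```

The always-negative model is 99% accurate yet finds no responders (recall 0) and ranks no better than chance (ROC AUC 0.5), which is why recall and ROC AUC are the more informative metrics for this problem.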
A gradient boosting classifier, optimised through GridSearchCV, does not improve the ROC AUC, and the false negative rate is high.
  • ‘learning_rate’: 0.01,
  • ‘max_depth’: 10,
  • ‘max_features’: None,
  • ‘n_estimators’: 100
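
A grid search of this kind can be set up as sketched below. The grid itself is an assumption: the article reports only the selected values, so the alternatives listed next to them are hypothetical, as is the synthetic dataset. `GridSearchCV` and `GradientBoostingClassifier` are scikit-learn classes, and `scoring="roc_auc"` selects the metric discussed above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic, imbalanced dataset standing in for the campaign data.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.9, 0.1], random_state=0
)

# Hypothetical grid that contains the reported values among its options.
param_grid = {
    "learning_rate": [0.01, 0.1],
    "max_depth": [3, 10],
    "max_features": [None, "sqrt"],
    "n_estimators": [100],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # optimise ranking quality rather than accuracy
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

`best_score_` is the mean cross-validated ROC AUC of the best parameter combination, so it can be compared directly with the baseline models.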

4. Re-sampling Approach

  • Data preparation: as before.
  • Scaling approach: as before.
  • Resampling: (1) no resampling at all, or (2) undersampling of the negative class followed by oversampling of the positive class to give a 50% split of classes. Three undersampling ratios were trialled (undersampled to 10%, 20% or 30% before oversampling to 50%), along with two over-sampling techniques (RandomOverSampler and SMOTE). Note that in future SMOTE-NC should be trialled, as this is suited to categorical data.
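
The two-stage resampling can be illustrated with plain pandas, as a stand-in for the RandomOverSampler step named above. Several things here are assumptions: the target column name `RESPONSE`, the helper name `under_then_over`, and the reading of "undersampled to 10%" as "negatives are dropped until positives make up 10% of the rows".

```python
import pandas as pd


def under_then_over(df, target="RESPONSE", under_pos_frac=0.1, random_state=0):
    """Undersample negatives, then randomly oversample positives to 50/50."""
    pos = df[df[target] == 1]
    neg = df[df[target] == 0]
    # Keep just enough negatives for positives to be `under_pos_frac` of rows.
    n_neg = int(round(len(pos) * (1 - under_pos_frac) / under_pos_frac))
    n_neg = min(len(neg), n_neg)
    neg_kept = neg.sample(n=n_neg, random_state=random_state)
    # Duplicate positives at random until both classes are the same size.
    pos_over = pos.sample(n=n_neg, replace=True, random_state=random_state)
    combined = pd.concat([neg_kept, pos_over])
    # Shuffle so the classes are not grouped together.
    return combined.sample(frac=1.0, random_state=random_state).reset_index(drop=True)


# Toy data: 20 positives among 1000 rows.
toy = pd.DataFrame({"x": range(1000), "RESPONSE": [1] * 20 + [0] * 980})
balanced = under_then_over(toy, under_pos_frac=0.1)
print(len(balanced), balanced["RESPONSE"].mean())
```

With 20 positives and a 10% target, 180 negatives are kept and the positives are resampled up to 180, giving 360 rows split evenly. Resampling of this kind is normally applied to the training split only, so that evaluation still reflects the original class balance.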
A logistic regression model, optimised through GridSearchCV, does not improve the ROC AUC, and a large proportion of positives still have a probability of 0, so the false negative rate for this model is higher than would be wanted.

Justification:

Conclusions

Reflection:

Improvement:
