Link to Github repo
Introduction
In this project we apply several unsupervised machine learning clustering algorithms, namely k-Means, HDBSCAN and agglomerative clustering, to the Kaggle credit card transaction dataset in order to segment credit card customers based on distinct transaction behavior patterns. We then use silhouette scores to determine which algorithm produces the best-separated clusters, which in turn can serve as the basis for administering differentiated marketing treatments to each customer group.
Dataset
In its raw form, the Kaggle credit card transaction dataset contains 8,950 distinct anonymized customer observations spanning 18 features, such as tenure, credit limit, total purchase amounts and total payment amounts, aggregated over a six-month period. Because it includes variables capturing both frequencies and spend amounts for different transaction types (installment purchases, large one-off purchases, cash advances, etc.) over this six-month window, the dataset paints a rather colorful picture of the behavioral patterns that can be used to segment customers into different groups.
A complete data dictionary of these variables can be found in Figure 1 below, along with a truncated sample of the dataframe head in Figure 2:
Data Preprocessing
i) Missing value imputation
The first step in the data science process is to clean our data and impute any missing values. Fortunately we find only two columns with missing values: MINIMUM_PAYMENTS with 313 missing values (3.5%) and CREDIT_LIMIT with one. For MINIMUM_PAYMENTS, we notice that the dataset contains no non-missing values less than or equal to zero for this variable, and that PRC_FULL_PAYMENT, the percent of the full balance paid by the user, is zero for all of these missing-value observations, so we proceed with zero imputation for these values.
Similarly, given that length of credit history and payment history are important factors influencing credit limits per several industry sources such as Investopedia, and noticing that the average credit limit of customers with <10y tenure hovers below $3K versus above $4K for >=10y tenure customers in our dataset (Figure 3 below), we choose to impute the single missing value, which belongs to a customer with 6y tenure, as exactly $3,000.
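A minimal pandas sketch of these two imputation steps, assuming the raw data has been loaded into a dataframe named df (the file name is an assumption based on the Kaggle dataset):

import pandas as pd

# Load the raw dataset (file name assumed)
df = pd.read_csv('CC GENERAL.csv')

# Zero-impute the missing MINIMUM_PAYMENTS values
df['MINIMUM_PAYMENTS'] = df['MINIMUM_PAYMENTS'].fillna(0)

# Impute the single missing CREDIT_LIMIT value as $3,000
df['CREDIT_LIMIT'] = df['CREDIT_LIMIT'].fillna(3000)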
ii) Outlier Treatment
We find several outlier values for CASH_ADVANCE_FREQUENCY greater than one that should be impossible given that this variable by definition varies between 0 and 1. We therefore elect to winsorize these outlier values to 1.
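A one-line sketch of this capping step using pandas' clip (an assumption; any winsorization approach would do):

# Cap CASH_ADVANCE_FREQUENCY at its theoretical maximum of 1
df['CASH_ADVANCE_FREQUENCY'] = df['CASH_ADVANCE_FREQUENCY'].clip(upper=1)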
Additionally, while we observe that values of certain variables such as PURCHASES_TRX and CREDIT_LIMIT meet the strict outlier definition of [> 75th percentile + 1.5 * interquartile range], it is still plausible for certain customers to have posted >300 credit card purchases in the last six months (the dataset maximum is 358 PURCHASES_TRX) or to hold >$25K credit limits (the dataset maximum is $30K CREDIT_LIMIT), so we choose to keep these values in our dataset for now.
Exploratory Data Analysis
To gain intuition about some of the linear relationships between variables in our dataset, we visualize correlation values with respect to PURCHASES and CREDIT_LIMIT, given that we would expect these two variables, which capture customers' credit card purchase and payment histories, to be significant factors in segmenting our various customer types.
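A minimal sketch of how these correlations might be computed and visualized with pandas and seaborn (the sorting and plotting choices here are assumptions, not the exact figures from the original analysis):

import seaborn as sns
import matplotlib.pyplot as plt

# Correlations of every numeric feature with PURCHASES and CREDIT_LIMIT
corr = df.corr(numeric_only=True)[['PURCHASES', 'CREDIT_LIMIT']].sort_values('PURCHASES')

# Heatmap of the two correlation columns
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()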
Unsurprisingly, PURCHASES shows highly statistically significant positive relationships with our purchase transaction subcategory variables PURCHASES_TRX, INSTALLMENTS_PURCHASES and ONEOFF_PURCHASES_FREQUENCY. More interestingly, we also observe negative correlations between PURCHASES and the cash advance variables, potentially suggesting that both sets of variables could be taken as proxies for financial health: customers exhibiting greater spending behavior appear to be at lower risk of needing to take out high-interest credit card cash advances to, for instance, meet emergency cash shortfalls in their daily lives.
Visualizing correlations with CREDIT_LIMIT, we interestingly observe a statistically significant (p-value < 0.01) positive correlation with CASH_ADVANCE_FREQUENCY, suggesting that a higher credit limit may not be as strong an indicator of financial stability as we first hypothesized, since these customers also display a higher likelihood of taking out high-interest cash advances. We furthermore, and somewhat unsurprisingly, observe a highly statistically significant positive relationship between CREDIT_LIMIT and TENURE, suggesting that older customers with longer payment histories are rewarded with higher credit limits.
To further inform our feature engineering, we use boxplots to investigate the relationship between TENURE and some of our higher-variance variables such as CREDIT_LIMIT and CASH_ADVANCE. We observe that the median and interquartile range of CREDIT_LIMIT increase with tenure, as expected. This is in line with the intuition that customers tend to reach greater financial stability with age, and that credit card companies become better over time at deciding which customers to grant credit increases as more information on spending behavior is gathered, such that the absolute spread of credit limits should widen over time.
Similarly, we observe that the median value of cash advances taken as loans is highest among the 7-9y TENURE groups and that average cash advance frequencies are highest among the 6-9y TENURE groups, seemingly confirming our earlier intuition that CASH_ADVANCE can be taken as an indicator of unexpected short-term financial stress, given we would expect younger customer groups, all else equal, to display less financial stability than their older counterparts. The significantly higher number of outliers in the >=12y TENURE group is in line with the fact that this tenure group accounts for approximately 85% of our dataset observations.
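A short sketch of one of these boxplots using seaborn, assuming df is the preprocessed dataframe (the exact styling of the original figures is not reproduced here):

import seaborn as sns
import matplotlib.pyplot as plt

# Distribution of CREDIT_LIMIT across tenure groups
sns.boxplot(data=df, x='TENURE', y='CREDIT_LIMIT')
plt.show()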
To better visualize the positive correlations of CREDIT_LIMIT with PURCHASES and PAYMENTS respectively, we can plot scatter plots of these variables with fitted linear regression lines. The first visualization below interestingly appears to show that PURCHASES may not be as strongly driven by CREDIT_LIMIT as we had hypothesized, suggesting that simply increasing a customer's credit limit may not translate into sustained increases in purchasing behavior over time:
Similarly, we observe a moderately positive linear relationship between CREDIT_LIMIT and PAYMENTS, which seemingly provides additional support for the above insight given that a customer's payments to a credit card issuer should necessarily be related to previous purchases made with that card:
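A minimal sketch of these regression scatter plots using seaborn's regplot (one call per pairing; the transparency setting is an assumption):

import seaborn as sns
import matplotlib.pyplot as plt

# Scatter plots with fitted linear regression lines
sns.regplot(data=df, x='CREDIT_LIMIT', y='PURCHASES', scatter_kws={'alpha': 0.3})
plt.show()
sns.regplot(data=df, x='CREDIT_LIMIT', y='PAYMENTS', scatter_kws={'alpha': 0.3})
plt.show()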
Lastly, to build better intuition around our purchase subcategory variables, we bin the ONEOFF_PURCHASES_FREQUENCY and PURCHASES_INSTALLMENTS_FREQUENCY variables and plot them against ONEOFF_PURCHASES and INSTALLMENTS_PURCHASES respectively, showing that the frequency of one-off and installment purchases increases with increasing one-off and installment purchase amounts. This may indicate that certain customer sub-groups use their credit cards more exclusively for certain purchase subcategories, such as one-off or installment purchases, which could provide additional dimensions along which to separate our customer groups later on.
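A sketch of how such a binned comparison might be produced for the one-off variables (the bin edges and the bar plot are assumptions):

import pandas as pd
import matplotlib.pyplot as plt

# Bin one-off purchase frequency and compare the mean one-off spend per bin
df['ONEOFF_FREQ_BIN'] = pd.cut(df['ONEOFF_PURCHASES_FREQUENCY'], bins=[0, 0.25, 0.5, 0.75, 1.0], include_lowest=True)
df.groupby('ONEOFF_FREQ_BIN')['ONEOFF_PURCHASES'].mean().plot(kind='bar')
plt.ylabel('Mean ONEOFF_PURCHASES')
plt.show()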
Feature Engineering
As our EDA showed that one-off and installment purchase behaviors may provide informative dimensions along which to separate our customer groups, we create ONEOFF_PURCHASES/PURCHASES and INSTALLMENTS_PURCHASES/PURCHASES variables capturing the percent of all purchases accounted for by one-off and installment purchases respectively. Leveraging our EDA insights further, we also create variables capturing the average value of purchase and cash advance transactions, as well as the percentage of all payments accounted for by minimum payments for each customer, per the code below:
# Share of total purchases made as one-off vs installment purchases
df['ONEOFF_PURCHASES/PURCHASES'] = df['ONEOFF_PURCHASES'] / df['PURCHASES']
df['INSTALLMENTS_PURCHASES/PURCHASES'] = df['INSTALLMENTS_PURCHASES'] / df['PURCHASES']
# Share of total payments accounted for by minimum payments
df['MINIMUM_PAYMENTS/PAYMENTS'] = df['MINIMUM_PAYMENTS'] / df['PAYMENTS']
# Average ticket size per purchase and per cash advance transaction
# (note: customers with zero transactions in the denominator yield NaN/inf values that need handling)
df['Average_Purchase_Amount'] = df['PURCHASES'] / df['PURCHASES_TRX']
df['Average_Cash_Advance_Amount'] = df['CASH_ADVANCE'] / df['CASH_ADVANCE_TRX']
Clustering
i) KMeans
At a high level, the KMeans algorithm follows the steps below to find its cluster groups (a minimal sketch of the algorithm follows the list):
- Select k as the number of groups to cluster for
- Randomly pick k points in our data as centroid points
- Assign each non-centroid point to its closest centroid
- Recalculate centroid points by taking the average of all surrounding points assigned to that cluster
- Repeat steps 3-4 until the calculated centroid points no longer move
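A minimal NumPy sketch of these steps (Lloyd's algorithm), purely for illustration; in practice we rely on sklearn's KMeans implementation:

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k random points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of the points assigned to it
        # (empty clusters are not handled in this simple sketch)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids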
Since, by definition, an unsupervised ML task has no 'ground truth' labels telling us the number of customer segments to aim for in step (1), we produce an elbow plot of the number of KMeans clusters vs inertia, i.e. the sum of samples' squared Euclidean distances to their assigned cluster center, which shows that n = 4 clusters appears to optimize the tradeoff between the number of clusters and cluster separability:
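A sketch of how such an elbow plot might be generated with sklearn, where the range of k values is an assumption and df_scaled denotes the standard-scaled feature matrix described in the following paragraphs:

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Fit KMeans for a range of cluster counts and record the inertia of each fit
ks = range(2, 11)
inertias = [KMeans(n_clusters=k, random_state=42, n_init=10).fit(df_scaled).inertia_ for k in ks]

plt.plot(ks, inertias, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia')
plt.show()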
Given that our feature-engineered variables capturing the percent of total purchases attributable to one-off and installment purchases should provide better insight into customer purchase behavior than the untransformed ONEOFF_PURCHASES and INSTALLMENTS_PURCHASES variables, we choose to exclude the latter from our clustering.
We further apply standard scaling (z = (x - mean) / std) to our dataset, since not doing so would cause the KMeans algorithm to place undue emphasis on large-scale variables (e.g. PURCHASES or PAYMENTS, ranging from 0 to 50,000) relative to small-scale variables (e.g. frequency features ranging from 0 to 1) when optimizing the spatial positioning of cluster centroids. Fitting our KMeans model with n=4 centroids on this scaled dataset produces the clusters below (a sketch of the scaling and fitting step follows the cluster descriptions), which interestingly appear to confirm our EDA-based hypotheses that cash advance and purchase subcategory variables provide informative dimensions along which to segment customers, given how prominently they figure in differentiating the groups:
- Blue (12% of customers): Lower tenured, higher credit limit customers using their credit cards disproportionately for cash advance transactions
- Red (33% of customers): Medium tenured, lower credit limit customers using their credit cards primarily for installment purchases
- Green (38% of customers): Lower tenured, lower credit limit customers displaying relatively more muted credit card spending activity
- Purple (17% of customers): Older tenured, higher credit limit customers likely using their credit cards on a daily basis across all types of purchases and for relatively more one-off purchases
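A minimal sketch of the scaling and fitting step described above, assuming df now holds only the numeric features after preprocessing (e.g. with the CUST_ID identifier dropped); the column list and random seed are assumptions:

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Drop the untransformed purchase subcategory variables and standard-scale the features
features = df.drop(columns=['ONEOFF_PURCHASES', 'INSTALLMENTS_PURCHASES'])
scaler = StandardScaler()
df_scaled = scaler.fit_transform(features)

# Fit KMeans with 4 centroids and attach the cluster labels back to the dataframe
kmeans4 = KMeans(n_clusters=4, random_state=42, n_init=10)
df['CLUSTER'] = kmeans4.fit_predict(df_scaled)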
Finally, we can develop customer persona types by leveraging the inverse-transformed average values of these variables across our groups, producing the persona categories below:
- Blue (12% of customers): Cash advance customers with high credit limits
- Red (33% of customers): Installment purchase customers with low credit limits
- Green (38% of customers): Lower spend activity customers with low credit limits
- Purple (17% of customers): All-purchase type, daily credit card users with high credit limits
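A sketch of how these inverse-transformed cluster averages might be recovered, assuming the scaler, features and kmeans4 objects from the previous sketch:

import pandas as pd

# Convert the cluster centers back to original units so the personas are interpretable
centers_unscaled = pd.DataFrame(
    scaler.inverse_transform(kmeans4.cluster_centers_),
    columns=features.columns,
)
print(centers_unscaled.round(2))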
For the sake of experimental completeness, we can also fit KMeans with n=3 and n=5 centroids. While producing somewhat more dispersed results, these runs interestingly show the same differentiating patterns emerging across groups as our n=4 clustering, with variables related to cash advances, one-off purchases and installment purchases again working as important differentiators:
ii) HDBSCAN
To potentially improve upon our KMeans results, we can fit HDBSCAN on our dataset. While we won't delve into the math underlying HDBSCAN and how it improves over its predecessor DBSCAN, since other existing tutorials provide detailed overviews, it is enough to understand that HDBSCAN uncovers point clusters of varying densities by considering a range of density thresholds and selecting the result with the best clustering stability over those values. This varying-density approach can therefore produce better clustering outcomes on datasets containing many noise observations and displaying different point densities in different regions, as demonstrated in the following tutorial.
Despite these theoretical improvements over KMeans' more rigid, centroid-based clustering approach, our HDBSCAN algorithm fails to produce better clusters after an exhaustive search over a range of values for its two main hyperparameters, min_cluster_size and min_samples: the best combination of min_cluster_size = 150 and min_samples = 5 leaves 62% of our observations unassigned to any cluster and therefore classified as noise, while the two detected clusters account for 31% and 7% of observations respectively. While we will weigh our KMeans results more heavily than these HDBSCAN clusters, we can still visualize a dendrogram displaying how the sizes of the two detected clusters vary with the density threshold. The hyperparameter search itself was implemented with the following code:
import time

import hdbscan
import numpy as np
import pandas as pd

# tdec is the author's timing decorator; inv_transform2 inverse-transforms the scaled data
# back to original units (both assumed to live in the author's utils2 module)
from utils2 import tdec, inv_transform2


@tdec
def HDBSCAN_hyperparameter_search(df, min_cluster_sizes, min_sample_sizes, scaler, cols):
    """Grid-search HDBSCAN over min_cluster_size and min_samples values."""
    dfs = []
    for min_cluster_size in min_cluster_sizes:
        for min_samples in min_sample_sizes:
            start_time = time.time()
            clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size, min_samples=min_samples)
            cluster_labels = clusterer.fit_predict(df)
            print(np.unique(cluster_labels))
            clusters = pd.DataFrame(cluster_labels, columns=['label'])
            dfg, data_inv = get_hdbscan_cluster_stats_unscaled(df, scaler, cols=cols, clusters=clusters)
            dfs.append((dfg, data_inv, clusters, min_cluster_size, min_samples))
            print('{} sec to complete HDBSCAN with {} min_cluster_size and {} min_samples'.format(
                np.round(time.time() - start_time, 0), min_cluster_size, min_samples))
    return dfs


def get_hdbscan_cluster_stats_unscaled(data, scaler, cols, clusters):
    """Compute per-cluster means (in original units) and each cluster's share of observations."""
    data_inv = inv_transform2(data=data, scaler=scaler)
    data_inv = pd.DataFrame(data_inv, columns=cols)
    data_inv['labels'] = clusters['label']
    # Aggregate every feature by mean, and also count observations via the first feature
    groupby_dict = {}
    for idx, col in enumerate([c for c in data_inv.columns if c != 'labels']):
        groupby_dict[col] = ['mean', 'count'] if idx == 0 else ['mean']
    df = data_inv.groupby('labels').agg(groupby_dict).reset_index()
    df.columns = ['_'.join(col) for col in df.columns]
    # Share of all observations that fall into each cluster (column 2 holds the counts)
    df['perc'] = df.iloc[:, 2] / df.iloc[:, 2].sum()
    return df, data_inv
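A hypothetical invocation of this search, reusing the df_scaled, scaler and features objects from the earlier KMeans sketch; the hyperparameter grids below are illustrative assumptions, not the exact grids used in the project:

results = HDBSCAN_hyperparameter_search(
    df_scaled,
    min_cluster_sizes=[50, 100, 150, 200],
    min_sample_sizes=[5, 10, 15],
    scaler=scaler,
    cols=features.columns,
)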
iii) Hierarchical Agglomerative Clustering
Finally, we can apply hierarchical agglomerative clustering to our dataset, leveraging sklearn's implementation via the code below:
from sklearn.cluster import AgglomerativeClustering

# Ward-linkage agglomerative clustering with 4 clusters
# (note: newer sklearn versions use the metric= argument instead of affinity=)
cluster4 = AgglomerativeClustering(n_clusters=4, affinity='euclidean', linkage='ward')
cluster4.fit_predict(df_scaled)
Visualizing the distributions of our cluster groups against some of our most important variables, such as PAYMENTS and CASH_ADVANCE, we can observe that this algorithm picks up on similar group differences as our earlier KMeans clustering:
We can further run the algorithm with n=3 clusters, which once more shows clustering patterns similar to our KMeans run with n=3 centroids:
Finally, we can visualize these agglomerative clustering results with n=3 groups in 3-dimensional space, broadly showing (a plotting sketch follows the list):
- Blue: Low-tenure and low-purchase customers
- Brown: Low-tenure and higher-purchase customers
- Light blue: Higher-tenure and higher-purchase customers
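A sketch of this kind of 3D scatter plot with matplotlib; the three plotted axes are assumptions, and any three informative features would work:

from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

# Fit the n=3 agglomerative clustering and color points by cluster label
labels3 = AgglomerativeClustering(n_clusters=3, linkage='ward').fit_predict(df_scaled)

fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(df['TENURE'], df['PURCHASES'], df['CREDIT_LIMIT'], c=labels3, alpha=0.5)
ax.set_xlabel('TENURE'); ax.set_ylabel('PURCHASES'); ax.set_zlabel('CREDIT_LIMIT')
plt.show()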
iv) Silhouette Scores
We can finally use silhouette scores to better quantify how well our various clustering algorithms separate the data. The equation for the silhouette score can be seen below, where b represents the mean distance from a sample to the points of the nearest cluster it is not a part of, while a represents the mean distance from that sample to all other samples in its own cluster:
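s = (b - a) / max(a, b)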
Silhouette scores range between -1 and 1, where 1 indicates clusters that are well distanced and therefore well separated, 0 indicates clusters that sit closer together with greater blending, and -1 indicates highly dispersed and blended clusters. We can summarize the possible silhouette score values under the following scenarios for different combinations of a and b:
- b is large (clusters are far apart; good) and a is small (each cluster is tightly grouped; good) -> the numerator reduces to b while the denominator becomes b, giving a score of ~1 (good)
- b is large (clusters are far apart; good) and a is large (each cluster is dispersed; bad) -> the numerator reduces to ~0 while the denominator becomes a or b, giving a score of ~0 (indifferent)
- b is small (clusters are close together; bad) and a is large (each cluster is dispersed; bad) -> the numerator reduces to -a while the denominator becomes a, giving a score of ~-1 (bad)
Calculating silhouette scores for each of our algorithms using sklearn's implementation supports our earlier cluster interpretations: the KMeans and agglomerative clustering algorithms achieve moderately good separability in segmenting our customer base, with silhouette scores in the 0.16-0.20 range, while HDBSCAN produces relatively poor separability with a silhouette score of -0.07. Interestingly, n=5 clusters appears to provide better separability than n=4 for both KMeans and agglomerative clustering, suggesting that incorporating silhouette scores as an additional data point when determining the optimal number of clusters may be beneficial for future projects, rather than relying solely on inertia-based elbow plots as in our original methodology.
from sklearn.metrics import silhouette_score

# Silhouette score for the n=4 agglomerative clustering on the scaled feature matrix
labels4 = cluster4.fit_predict(df_scaled)
print(f'Silhouette Score: {silhouette_score(df_scaled, labels4)}')
[Out]: Silhouette Score: 0.16082164138890545
Model | Silhouette Score
--- | ---
KMeans (n=3) | 0.181
KMeans (n=4) | 0.183
KMeans (n=5) | 0.195
HDBSCAN | -0.065
Agglomerative Clustering (n=3) | 0.158
Agglomerative Clustering (n=4) | 0.161
Agglomerative Clustering (n=5) | 0.167
Conclusion
In this project we applied three unsupervised clustering algorithms, KMeans, HDBSCAN and hierarchical agglomerative clustering, to the Kaggle credit card transaction dataset in order to segment our customer base on the basis of the behavioral spending patterns uncovered through clustering. Applying silhouette scoring to quantify the separability of the cluster groups produced by each model showed a relative outperformance of KMeans versus both HDBSCAN and agglomerative clustering. Next steps for the project would be to investigate additional clustering algorithms such as mean-shift clustering and Gaussian Mixture Models ("GMM").
Thanks for reading!