• Berthold

Customer Segmentation from Customer Behavior: A Quantitative Approach

Whether a new marketing strategy has to be devised or product pricing needs a revision, the success of such measures benefits greatly from detailed knowledge of the customer base. Here, a customer segmentation analysis can help you learn how your customers can be separated into different groups, each of which has different needs and shopping behaviors.

In general, there are two main strategies for customer segmentation. A priori segmentation divides customers into groups by apparent customer properties, such as gender or age. This approach is inexpensive but may fall short in capturing the whole picture and in providing insight beyond managerial intuition.

Post hoc segmentation, in contrast, is a data-driven approach where input variables such as customer behavior are fed to algorithms that derive insights from customer data. While different post hoc techniques exist, all share the ability to reveal patterns that would likely remain hidden when combing through your data by hand.

Post hoc segmentation is, of course, what we enjoy doing here at DataQuotient. In this blog post, our goal is to provide you with an introduction to customer segmentation by applying clustering techniques in Python.

In this tutorial, we focus on how customers can be separated into different groups based on their shopping behavior. In doing so, we demonstrate ways to derive customer behavior from standard customer data you may already have stored in your data lake. The data set we use for this demonstration is the publicly available Instacart Market data, which can be downloaded as .csv files from Kaggle. To segment the customers based on their shopping behavior, we will employ the K-Modes clustering algorithm.

The Data Set

Let's get started by first setting up the environment we need. The K-Modes library we use for clustering is not a standard package and has to be installed separately.

For this example, we will focus on the main data set, orders, which contains the time stamp and some user information for each order and thus allows us to distill information on customer behavior.

The data set contains information on more than 3 million orders from more than 200 thousand unique customers - wow! To keep the computational cost within limits, we subset the data set and only work with the orders of 10000 randomly chosen unique customers.

For simplicity, we also drop all data points that contain missing values. This corresponds to 5 % of the total data. In a realistic use case, one would likely fill these gaps with one of several imputation approaches, but let us keep it simple for now.
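The subsetting and cleaning step described above can be sketched as follows. This is a minimal sketch rather than the original notebook code: the helper name subsample_users and the tiny synthetic table are our own, and in practice the table would come from reading the Kaggle .csv file (for example with pd.read_csv; the exact file name is an assumption we do not rely on here).

```python
import numpy as np
import pandas as pd


def subsample_users(orders: pd.DataFrame, n_users: int, seed: int = 0) -> pd.DataFrame:
    """Keep all orders of n_users randomly chosen customers, then drop incomplete rows."""
    rng = np.random.default_rng(seed)
    unique_users = orders["user_id"].unique()
    # Sample customers (not rows), so that every chosen customer keeps all of their orders.
    chosen = rng.choice(unique_users, size=n_users, replace=False)
    subset = orders[orders["user_id"].isin(chosen)]
    # Drop every row that has a missing value in any column.
    return subset.dropna().reset_index(drop=True)


# Tiny synthetic stand-in for the orders table (hypothetical values).
orders = pd.DataFrame({
    "order_id": range(1, 9),
    "user_id": [1, 1, 2, 2, 3, 3, 4, 4],
    "order_dow": [0, 3, 1, 1, 5, 6, 2, 0],
    "order_hour_of_day": [9, 14, 10, 11, 20, 8, 13, 17],
    "days_since_prior_order": [np.nan, 7.0, np.nan, 14.0, np.nan, 30.0, np.nan, 3.0],
})

sample = subsample_users(orders, n_users=2)
print(sample)
```

Sampling user ids instead of rows matters here: sampling rows would leave most customers with an incomplete order history, which would distort the behavior variables we derive later.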

In the second step, we merge the orders data set with an auxiliary data set, using order_id as the key for an inner join. The second data set tells us how many products each individual order contained. Knowing how many products a customer purchases per order is probably useful for learning about their shopping behavior.

This merged data set contains ten columns and about 1.4 million rows. Regarding the columns, we will, in the following, focus on:

  • add_to_cart_order: its maximum value per order gives us the number of products in the order.

  • order_dow: day of the week on which an order has been placed, encoded as an integer.

  • order_hour_of_day: hour of the day when an order has been placed.

  • days_since_prior_order: days between present order and previous order.

Most of the other columns are identifiers or of miscellaneous content. Each row represents one product that belongs to one order id placed by one customer. Each order typically contains multiple products and, in addition, many different orders belong to one individual user identified by a user id.
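To make the join and the role of add_to_cart_order concrete, here is a small sketch on synthetic data. The name order_products for the auxiliary table and all values are assumptions for illustration only.

```python
import pandas as pd

# Synthetic stand-ins for the two tables (hypothetical values).
orders = pd.DataFrame({
    "order_id": [10, 11, 12],
    "user_id": [1, 1, 2],
    "order_dow": [0, 3, 1],
    "order_hour_of_day": [9, 14, 10],
    "days_since_prior_order": [7.0, 7.0, 14.0],
})
order_products = pd.DataFrame({
    "order_id":          [10, 10, 10, 11, 12, 12, 99],
    "product_id":        [5, 6, 7, 5, 8, 9, 1],
    "add_to_cart_order": [1, 2, 3, 1, 1, 2, 1],
})

# Inner join: only order ids present in both tables survive (order 99 is dropped).
merged = orders.merge(order_products, on="order_id", how="inner")

# One row per product; the highest cart position equals the number of products in the order.
cart_size = merged.groupby("order_id")["add_to_cart_order"].max()
print(merged.shape)
print(cart_size)
```

After the merge, the order-level columns are repeated on every product row, which is why the table grows to one row per product and why the next step condenses it again.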

Data Transformation Part 1

In the next step, we will therefore work on data transformation and condense all information related to one user found in many rows into one single data point represented by one individual row for each user.

Concretely, we loop through all orders of each user (by the way, a step that has great potential for parallelization when working on massive data sets) and determine the average number of products ordered by each user. In addition, we take a majority vote to determine the day, the hour of the day and the frequency at which each user typically places an order, and store all results together with the user id in a new data frame df1. This example demonstrates a relatively easy and quick transformation step, but one can envision how to capture even more fine-grained detail of customer behavior at this point.
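One way to express this condensation without an explicit loop is a pandas groupby. The sketch below is our own variant, not the original code: the helper names are hypothetical, we assume the vote is taken over orders rather than over product rows, and we round the average cart size to an integer.

```python
import pandas as pd


def most_common(values: pd.Series):
    """Majority vote: the most frequent value (ties resolved towards the smallest)."""
    return values.mode().iloc[0]


def summarize_users(merged: pd.DataFrame) -> pd.DataFrame:
    # First collapse to one row per order, so that large orders do not dominate the vote.
    per_order = merged.groupby(["user_id", "order_id"], as_index=False).agg(
        n_products=("add_to_cart_order", "max"),
        order_dow=("order_dow", "first"),
        order_hour_of_day=("order_hour_of_day", "first"),
        days_since_prior_order=("days_since_prior_order", "first"),
    )
    # Then collapse to one row per user.
    return per_order.groupby("user_id", as_index=False).agg(
        ordsum=("n_products", lambda s: int(round(s.mean()))),
        ordday=("order_dow", most_common),
        ordh=("order_hour_of_day", most_common),
        ordfreq=("days_since_prior_order", most_common),
    )


# Synthetic merged table: one row per product (hypothetical values).
columns = ["user_id", "order_id", "add_to_cart_order",
           "order_dow", "order_hour_of_day", "days_since_prior_order"]
rows = [
    (1, 10, 1, 0, 9, 7.0), (1, 10, 2, 0, 9, 7.0), (1, 10, 3, 0, 9, 7.0),
    (1, 11, 1, 0, 14, 7.0),
    (1, 13, 1, 3, 9, 14.0), (1, 13, 2, 3, 9, 14.0),
    (2, 12, 1, 1, 10, 14.0), (2, 12, 2, 1, 10, 14.0),
]
merged = pd.DataFrame(rows, columns=columns)

df1 = summarize_users(merged)
print(df1)
```

Because each user is processed independently, the same logic distributes naturally across workers, which is the parallelization potential mentioned above.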

Looking at the newly transformed data set below, it is great to see how we could reduce the table size significantly from (1424924 x 10) to (10000 x 5) but still retain most of the contained information.

Data Transformation Part 2

Before we move on and apply a clustering algorithm to these data, we should take a quick break and have a second look at the actual data types in our transformed data frame. All four data columns take integer values, but only ordsum and ordfreq are countable quantities, namely the average number of products per shopping cart and the typical number of days between two orders for each user, respectively. In contrast, ordday and ordh are so-called categorical variables, where each integer in a column represents a category, such as 0 for Saturday or 1 for Sunday.

This may sound like a technicality, but the mixed variable types in our data set do require our attention before we can move on and do all the magic. Without going into too much detail, it is worth noting that many clustering algorithms work by calculating the distance between two data points. Using a Euclidean metric, for instance, one can infer that the distance between two items and one item in the shopping cart is smaller than the distance between three items and one item. Categorical variables, on the other hand, cannot be projected onto an axis, and therefore one cannot infer that Friday is larger than Monday or vice versa, for instance.
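The difference between the two kinds of distance can be illustrated in a few lines. The function names are our own, and the simple matching dissimilarity shown here is one common choice for categorical data, not necessarily the only one.

```python
import numpy as np


def euclidean(a, b) -> float:
    """Euclidean distance between two numerical records."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))


def matching_dissimilarity(a, b) -> int:
    """Number of positions in which two categorical records differ."""
    return int(np.sum(np.asarray(a) != np.asarray(b)))


# Numerical: two items are closer to one item than three items are.
print(euclidean([2], [1]))  # 1.0
print(euclidean([3], [1]))  # 2.0

# Categorical: Friday is no "further" from Monday than from Tuesday.
print(matching_dissimilarity(["Fri", "morning"], ["Mon", "morning"]))  # 1
print(matching_dissimilarity(["Fri", "morning"], ["Tue", "morning"]))  # 1
```

For categories, the only meaningful question is whether two values are equal or not, so every mismatch contributes the same amount to the distance.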

The solution to this problem of mixed variable types is either to convert all numerical variables into categorical variables or to project the categorical variables into a Euclidean space using multiple correspondence analysis. Certainly, both approaches have their strengths and weaknesses. Here, we go with the former approach and convert the two numerical variables ordsum and ordfreq into categorical variables.

In addition, we want to keep the total number of categories reasonably small. First, a vast number of categories makes it hard for clustering algorithms to clearly distinguish between different clusters. Second, a customer who typically has one item in her cart may not display different shopping characteristics from a person with two items in the cart, but can likely be distinguished from someone with ten items in her cart. Hence, we want to bin each of the four variables that describe the shopping behavior of the customer and thereby reduce the total number of categories. To perform this task most efficiently, we inspect the data distribution of each variable:

As can be seen, the order frequency displays a seven-day periodicity, indicating that many customers place weekly orders. Hence, we use a week as the bin width for that variable, telling us whether a customer orders on a weekly, two-weekly etc. basis, which yields a total of 5 bins. Second, regarding the weekday binning, we see that a similar number of orders is placed on the two weekend days, encoded as 0 and 1, whereas the number of orders on each weekday is lower by an almost constant offset. Here, we choose a binary binning that records whether an order has been placed on a weekend or not, with two bins in total. As for the hour of the day, no clear pattern emerges from the histogram other than that orders are low at night and peak between 9am and 6pm. One viable approach among others is to cut the day into four 6-hour bins, from midnight to 6am, from 6am to 12pm and so on. As to the number of products per order, a clear pattern is missing as well. We deliberately choose to separate customers who buy just one individual product from those with moderate basket sizes of up to 10 items and those with large baskets of more than 10 items, yielding three bins.
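The binning rules above could be written down roughly as follows. This is a sketch under assumptions: the output column names, the integer codes and the exact bin edges (in particular for the order frequency) are our own choices, since the text only fixes the number of bins.

```python
import numpy as np
import pandas as pd


def bin_behavior(df1: pd.DataFrame) -> pd.DataFrame:
    binned = pd.DataFrame({"user_id": df1["user_id"]})
    # Time of day: four 6-hour bins (0: 0-5h, 1: 6-11h, 2: 12-17h, 3: 18-23h).
    binned["hour_bin"] = df1["ordh"].astype(int) // 6
    # Weekend flag: days 0 and 1 are treated as the weekend, as in the text.
    binned["weekend_bin"] = df1["ordday"].isin([0, 1]).astype(int)
    # Cart size: 0 = single product, 1 = 2 to 10 products, 2 = more than 10 products.
    binned["cart_bin"] = pd.cut(df1["ordsum"], bins=[0, 1, 10, np.inf], labels=False)
    # Order frequency: week-wide bins giving 5 categories (edges are an assumption).
    binned["freq_bin"] = pd.cut(
        df1["ordfreq"], bins=[-1, 7, 14, 21, 28, np.inf], labels=False
    )
    return binned


# Synthetic per-user table (hypothetical values).
df1 = pd.DataFrame({
    "user_id": [1, 2, 3],
    "ordsum": [1, 10, 11],
    "ordday": [0, 3, 1],
    "ordh": [5, 6, 23],
    "ordfreq": [7, 8, 30],
})

binned = bin_behavior(df1)
print(binned)
```

With 4, 2, 3 and 5 bins, a customer can fall into at most 120 distinct combinations, which keeps the categorical space small enough for the clustering step.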

Clustering with K-Modes

Finally, we are in good shape to move on, do the magic and find segments among the 10000 customers based on their shopping behavior. To cluster our data, we use the so-called K-Modes algorithm. This algorithm is the counterpart for categorical data of the well-known K-Means algorithm for numerical data: it assigns each data point to the cluster at the smallest distance, where the total number of clusters is pre-defined. Instead of means and Euclidean distances, it uses the most frequent category of each variable as the cluster center and counts mismatching categories as the distance. It should be noted that the number of algorithms for clustering categorical data is much smaller than the number of available algorithms for clustering numerical data (a nice overview and comparison can be found here). Hence, moving all data into a Euclidean space by using multiple correspondence analysis may prove helpful if K-Modes does not suffice.
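To show what happens under the hood, here is a deliberately simplified K-Modes sketch in NumPy. It is not the implementation of the library used in this post: it uses plain random initialization and a single run, and all function names are our own.

```python
import numpy as np


def column_modes(rows: np.ndarray) -> np.ndarray:
    """Most frequent category in each column (ties resolved towards the smallest)."""
    modes = []
    for col in rows.T:
        values, counts = np.unique(col, return_counts=True)
        modes.append(values[np.argmax(counts)])
    return np.array(modes)


def k_modes(X: np.ndarray, n_clusters: int, n_iter: int = 20, seed: int = 0):
    rng = np.random.default_rng(seed)
    # Simplification: start from distinct records picked at random.
    unique_rows = np.unique(X, axis=0)
    picks = rng.choice(len(unique_rows), size=n_clusters, replace=False)
    centroids = unique_rows[picks]
    for _ in range(n_iter):
        # Matching dissimilarity between every record and every centroid.
        dist = (X[:, None, :] != centroids[None, :, :]).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centroids = centroids.copy()
        for k in range(n_clusters):
            members = X[labels == k]
            if len(members):
                new_centroids[k] = column_modes(members)
        if np.array_equal(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment and cost (sum of distances to the assigned centroid).
    dist = (X[:, None, :] != centroids[None, :, :]).sum(axis=2)
    labels = dist.argmin(axis=1)
    cost = int(dist.min(axis=1).sum())
    return centroids, labels, cost


# Two synthetic behavior profiles with one deviating record each (hypothetical values).
X = np.array([[0, 1, 2, 0]] * 4 + [[0, 1, 2, 1]] + [[3, 0, 0, 4]] * 4 + [[3, 0, 0, 3]])
centroids, labels, cost = k_modes(X, n_clusters=2)
print(centroids)
print(labels, cost)
```

The structure mirrors K-Means step by step: assign each record to the nearest center, recompute the centers, and repeat until nothing changes. Only the definitions of "center" and "distance" differ.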

Nevertheless, as we will see in the following, K-Modes does a great job in clustering categorical data. To determine the optimum number of clusters, one can employ the so-called elbow method. Briefly, one calculates the loss, a measure of how well the cluster assignment fits the data, for different numbers of clusters. Once the loss curve starts flattening out, at the so-called elbow, the optimum number of clusters has been reached. In this way, we determine the optimum number of clusters to be about 10 for our example, as the loss curve for our data shows.
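In this post the elbow is read off the loss curve by eye. If one prefers an automatic estimate, a simple heuristic is to pick the point that lies furthest below the straight line connecting the first and last point of the curve. The sketch below implements that heuristic; the function name is our own, and the cost values are synthetic numbers chosen only to exercise the function, not the losses from our data.

```python
import numpy as np


def find_elbow(ks, costs) -> int:
    """Return the k at which a decreasing cost curve bends most strongly."""
    ks = np.asarray(ks, dtype=float)
    costs = np.asarray(costs, dtype=float)
    # Scale both axes to [0, 1] so that the result does not depend on units.
    x = (ks - ks[0]) / (ks[-1] - ks[0])
    y = (costs - costs[-1]) / (costs[0] - costs[-1])
    # The chord runs from (0, 1) to (1, 0); the gap below it is proportional to 1 - x - y.
    gap = 1.0 - x - y
    return int(ks[np.argmax(gap)])


# Synthetic, made-up cost values for k = 1..8 (not results from the Instacart data).
ks = [1, 2, 3, 4, 5, 6, 7, 8]
costs = [100, 60, 35, 20, 17, 15, 14, 13]
elbow = find_elbow(ks, costs)
print(elbow)  # 4 for these synthetic values
```

Such a heuristic is a convenience, not a replacement for looking at the curve: on noisy or slowly decaying loss curves the "elbow" is a judgment call, which is why we speak of "about 10" clusters above.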

So far so good. Let's run the clustering with a total of 10 clusters on our data and see how we can separate the customers by their shopping behavior. The model output denotes each cluster center by a row vector, where the first column denotes the time-of-day bin, the second column the weekend bin, the third column the cart size and the last one the order frequency. As one can see immediately, three out of 10 clusters relate to weekend orders, but let us take a closer look.
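Row vectors of integers are hard to read, so a small helper that translates a cluster center back into words can be handy. The sketch below assumes the column order given above and a particular integer encoding of the bins (0 to 3 for the 6-hour bins, 1 for weekend, and so on); that encoding and the example center are hypothetical, not taken from the model output.

```python
# Assumed encodings of the four bins, in the column order described in the text.
HOUR_BINS = ["night (0-6h)", "morning (6-12h)", "afternoon (12-18h)", "evening (18-24h)"]
WEEKEND_BINS = ["weekday", "weekend"]
CART_BINS = ["single item", "2-10 items", "more than 10 items"]
FREQ_BINS = ["within 1 week", "1-2 weeks", "2-3 weeks", "3-4 weeks", "more than 4 weeks"]


def describe_centroid(centroid) -> str:
    """Translate [hour_bin, weekend_bin, cart_bin, freq_bin] into a readable label."""
    hour_bin, weekend_bin, cart_bin, freq_bin = (int(v) for v in centroid)
    return ", ".join([
        HOUR_BINS[hour_bin],
        WEEKEND_BINS[weekend_bin],
        CART_BINS[cart_bin],
        FREQ_BINS[freq_bin],
    ])


# Hypothetical cluster center, for illustration only.
description = describe_centroid([1, 1, 2, 0])
print(description)
```

Note that a cluster center only shows the most frequent category per variable; to judge how sharp a segment really is, one still has to look at the distribution of its members, as we do next.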

Looking at the shopping behavior of customers assigned to cluster '5', we see that all these people like to place large orders on weekends, mostly in the morning and afternoon hours, with less of a clear pattern in terms of order frequency. Customers belonging to this segment seem to place their weekly grocery orders on weekends.

Inspecting customers assigned to label 9, we can see a different pattern emerging. Here, we look at people who place small orders of only a few items on weekday mornings and, in the majority of cases, on a weekly basis.

For the sake of brevity, we will stop here, but we assure you that similarly clear-cut patterns are found for the other eight clusters not shown.


With this tutorial, we aimed to provide a brief introduction to customer segmentation with clustering techniques. More specifically, we were interested in how to find customer segments based on shopping behavior. In this context, we also showed an example of how one can derive customer behavior variables from regular customer data and reduce the total data size while preserving most of the contained information. Lastly, we employed the K-Modes algorithm to find customer segments based on the derived customer behavior variables.

Even the simplistic approach shown here is able to find ten well separated customer segments. In comparison, most manual approaches typically yield only a handful of segments. Insights found from this analysis, such as the customers who order their weekly groceries on weekend mornings and afternoons, will already help in optimizing marketing campaigns. One can easily envision how each customer segment can be individually targeted with well-tailored campaigns.

A possible next step would be to use the customer segments as data labels. In this way, one turns the previously unlabelled data set at hand into a labelled training data set for a predictive classification algorithm. Once the classifier is trained, the customer segment of a new customer is easily predicted.
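As a rough sketch of this idea, the following example trains a classifier on binned behavior variables and segment labels. Everything here is an assumption for illustration: the data are random, the stand-in labels are generated by a simple rule instead of coming from K-Modes, and a decision tree from scikit-learn is just one of many classifiers one could choose.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic binned behavior variables: hour, weekend, cart size, order frequency.
X = np.column_stack([
    rng.integers(0, 4, 200),
    rng.integers(0, 2, 200),
    rng.integers(0, 3, 200),
    rng.integers(0, 5, 200),
])
# Stand-in segment labels; in practice these would be the cluster labels from K-Modes.
segments = X[:, 1] * 3 + X[:, 2]

# One-hot encode the categories so the tree does not treat the codes as ordered numbers.
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    DecisionTreeClassifier(random_state=0),
)
model.fit(X, segments)

# Predict the segment of a new (hypothetical) customer.
new_customer = np.array([[1, 1, 2, 0]])
predicted = model.predict(new_customer)
train_accuracy = model.score(X, segments)
print(predicted, train_accuracy)
```

In a real project one would of course hold out part of the labelled data to check how well the classifier reproduces the segments before using it on new customers.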

We hope you enjoyed reading our introduction to quantitative customer segmentation. Comments and constructive criticism are much appreciated.

© 2017-2018 by DataQuotient GmbH. All rights reserved.