Selecting the Right Data for Your ML Model

Andrew Engel

If you search for help on churn modeling with machine learning, you will find plenty of articles on building churn models. These articles offer valuable information about building the models, and some cover the kinds of feature engineering that drive better models. However, they rarely discuss three other aspects of the problem: What is the definition of churn? What do we do when we predict a customer is likely to churn? How do we select the observations for the training set?

The first is key and not always obvious. In a subscription model, churn is reasonably straightforward, as we can identify cancellations and non-renewals. In businesses marked by individual transactions, such as retail or casinos, has a customer churned if they have not purchased in the last week, month, or year? This is a business question and needs to be decided in conversations with the business stakeholders (informed by data analysis on customer behavior).
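To make the transactional case concrete, here is a minimal sketch of how the churn label shifts with the inactivity window. The customers, dates, and candidate windows are all made up for illustration; the right window remains a business decision.

```python
import pandas as pd

# Hypothetical data: the most recent purchase date for four customers.
last_purchase = pd.Series(
    pd.to_datetime(["2016-06-28", "2016-06-10", "2016-03-15", "2015-05-01"]),
    index=["a", "b", "c", "d"],
)

# Hypothetical "as of" date on which churn is evaluated.
as_of = pd.Timestamp("2016-06-30")
days_inactive = (as_of - last_purchase).dt.days

# Label the same customers under three candidate windows (in days).
labels = pd.DataFrame(
    {f"churn_{window}d": days_inactive > window for window in (7, 30, 365)}
)

# The churn rate depends heavily on which window is chosen.
print(labels.mean())
```

Comparing the resulting churn rates against what the business considers a lost customer is one way to ground the stakeholder conversation in data.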

The second is vital, but outside the scope of this post. Keep in mind that understanding what actions will be taken to prevent churn will influence many of the modeling decisions, such as how far in advance churn needs to be predicted and whether all forms of churn matter or the model should focus on just a few.

The selection of observations to include may seem straightforward. Just include all of the records and use random sampling. Let’s examine how this can cause problems and discuss several possible approaches to select these observations and their effect on the performance of the models.

To generate the churn data, we use the data from the WSDM - KKBox’s Churn Prediction Challenge, with one modification: we have defined churn for every month. We train (including validation and testing) on records with expiration dates between January 1, 2015 and June 30, 2016, and use records with expiration dates from July 1, 2016 through January 31, 2017 as an external test set to check the performance of the selected model. We used Rasgo to hold the data and perform the feature engineering, and DataRobot to build the models.
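As a rough illustration of this out-of-time split, the sketch below separates a small table by expiration date using the boundaries given above. The table and its column names (customer_id, expire_date, is_churn) are assumptions for the example, not the actual KKBox or Rasgo schema.

```python
import pandas as pd

# Hypothetical monthly records: one row per customer per expiration month.
records = pd.DataFrame({
    "customer_id": ["a", "a", "b", "b", "c", "c"],
    "expire_date": pd.to_datetime([
        "2015-03-31", "2016-08-31", "2016-06-30",
        "2016-07-01", "2015-01-01", "2017-01-31",
    ]),
    "is_churn": [0, 1, 0, 0, 0, 1],
})

# Date boundaries taken from the text.
MODEL_START = pd.Timestamp("2015-01-01")
MODEL_END = pd.Timestamp("2016-06-30")
TEST_START = pd.Timestamp("2016-07-01")
TEST_END = pd.Timestamp("2017-01-31")

# Series.between is inclusive on both ends by default.
modeling = records[records["expire_date"].between(MODEL_START, MODEL_END)]
external_test = records[records["expire_date"].between(TEST_START, TEST_END)]

print(len(modeling), len(external_test))
```

Every row in the external set is later than every row used for modeling, which is what lets it stand in for the months the model will score in production.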

Random Sampling

Many data scientists will simply take this data, perform either random or stratified sampling to generate their train and test sets, and build their models. When we do this, we see what looks like a good model, with the lift chart

and ROC curve with an AUC of 0.9230.

This is astonishing performance, although, to be fair, I am generally concerned when my model performance is too good. That concern turns out to be well founded: there is a significant problem when we make predictions on our external test set. On the external test set, we see the following lift chart

and ROC with an AUC of 0.6228.

What went wrong? Random and stratified sampling are the wrong choices for this problem because most customers are on a month-to-month contract, so the data contains a record for each customer for each month they subscribed. Random and stratified sampling spread the records for a given customer across both the training and test sets in a way that causes problems. It is entirely possible that, for a given customer, a month in the training set is later than a month in the test set. This allows the model to, in effect, use future data to make predictions about the past. As a result, the test set is not an accurate representation of the kind of data that the model will see in future months.
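The overlap is easy to measure. The sketch below uses synthetic data (200 made-up customers with 12 monthly rows each, not the KKBox records) and a row-level 80/20 random split, then counts how many test customers also appear in training and how many test rows have a later month for the same customer in training.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 customers, each with one row per month for 12 months.
rows = pd.DataFrame({
    "customer_id": np.repeat(np.arange(200), 12),
    "month": np.tile(np.arange(1, 13), 200),
})

# Row-level random split: roughly 80% train, 20% test.
is_test = rng.random(len(rows)) < 0.2
train, test = rows[~is_test], rows[is_test]

# Share of test customers that also appear in the training set.
shared = set(train["customer_id"]) & set(test["customer_id"])
share_overlap = len(shared) / test["customer_id"].nunique()

# Share of test rows for which the same customer has a LATER month in training.
latest_train_month = train.groupby("customer_id")["month"].max()
test_latest = test["customer_id"].map(latest_train_month)
future_in_train = (test_latest > test["month"]).mean()

print(f"test customers also in train: {share_overlap:.0%}")
print(f"test rows with a later month in train: {future_in_train:.0%}")
```

With this many rows per customer, nearly every test customer is also a training customer, and most test rows are preceded in time by nothing the model has not already seen from that customer's future.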

So what do we do? One common response, since we have lots of data, is to take a single month and use it for training. Because each customer then appears at most once, the training and test portions we split from that month won’t suffer from the problem we just saw.
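A minimal sketch of the single-month selection, assuming a hypothetical table with customer_id and expire_date columns (illustrative names, not the actual schema):

```python
import pandas as pd

# Hypothetical records; column names are assumptions for this example.
records = pd.DataFrame({
    "customer_id": ["a", "a", "b", "c", "c"],
    "expire_date": pd.to_datetime([
        "2015-02-28", "2015-03-31", "2015-03-15", "2015-03-01", "2015-04-30",
    ]),
})

# Keep only the rows whose expiration date falls in the chosen month.
dates = records["expire_date"]
march_2015 = records[(dates.dt.year == 2015) & (dates.dt.month == 3)]

# With one record per customer per month, no customer repeats in the slice,
# so an ordinary random split of this slice cannot leak across customers.
print(march_2015["customer_id"].is_unique)
print(march_2015)
```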

Single Month

We selected March 2015 and used DataRobot in the same way as above. This gives the following lift chart

and ROC with an AUC of 0.8087.

This performance is much worse on validation than what we saw for the prior approach, but the real test is on the external test set. Examining that set, we see the following lift chart

and ROC with an AUC of 0.5849.

The performance of the model on the external test set is barely better than a random guess. While the model was not biased by the train-test split, selecting a single month may have caused a problem: if seasonal effects or trends over time affect the churn rate, one month does not give the model enough information to pick them up.

To alleviate these problems while avoiding the issues from the purely stratified sampling, we can randomly pick one record for each customer. Again, we ran this through DataRobot.
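In pandas, drawing one random row per customer can be sketched as follows. The data is synthetic and the column names are illustrative; `groupby(...).sample` requires pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 5 customers with 4 monthly rows each.
rows = pd.DataFrame({
    "customer_id": np.repeat(np.arange(5), 4),
    "month": np.tile(np.arange(1, 5), 5),
})

# Draw exactly one row at random from each customer's history.
# random_state is fixed only to make the example reproducible.
one_per_customer = rows.groupby("customer_id").sample(n=1, random_state=42)

print(one_per_customer)
```

Each customer contributes a single row from a randomly chosen month, so the sample still spans the calendar while no customer can land on both sides of a later train/test split.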

Single Record per Customer

Using a single record per customer gives the following lift chart

and ROC with an AUC of 0.8318.

The lift chart for the external test set is

with an ROC curve and an AUC of 0.8026.

The lift chart from the external test set is a little concerning as the model seems to be having more trouble identifying churn in the external set than it did during training. However, the overall performance is still close between training and the external set. As a data scientist who never likes throwing data away if it can be helped, this approach, while better than the two above, was never my favorite. However, there is a way we can use all of the data, but avoid the issues from the first method.

Group Stratification

If we select customers randomly, instead of selecting random rows, for our train and test sets, we avoid the problem we discussed in the first method while still using all of the data. This approach leads to the following lift chart

with the following ROC and an AUC of 0.8302.

The model’s lift chart on the external test set is shown below

with an ROC shown below and an AUC of 0.8243.

This is much more like it. The lift chart, ROC curve, and AUC all hold up in the external test set. There is some degradation in performance, but it is slight and not unexpected. So with a minor modification in how the training data is partitioned, we can build models that show strong performance on external, out-of-time data.
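For readers who want to reproduce the customer-level partition outside of a modeling platform, here is a minimal sketch on synthetic data. It is not the configuration used for the results above; the column names and the 20% holdout fraction are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 customers, each with 12 monthly rows.
rows = pd.DataFrame({
    "customer_id": np.repeat(np.arange(200), 12),
    "month": np.tile(np.arange(1, 13), 200),
})

# Shuffle the unique customers and hold out 20% of them (assumed fraction).
customers = rng.permutation(rows["customer_id"].unique())
n_test = int(0.2 * len(customers))
test_customers = customers[:n_test]

# Assign every row of a customer to the same side of the split.
in_test = rows["customer_id"].isin(test_customers)
train, test = rows[~in_test], rows[in_test]

# No customer appears on both sides, and no rows are discarded.
overlap = set(train["customer_id"]) & set(test["customer_id"])
print(len(overlap), len(train), len(test))
```

scikit-learn's GroupShuffleSplit and GroupKFold implement the same idea and fit into its cross-validation tooling if you prefer a library routine.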
