Quick note: I intend to write shorter posts for this blog. My desire to write ebbs and flows, and I’d rather release interesting pieces on a somewhat consistent basis instead of slowly producing more comprehensive ones. I find it easier to write two 1500-word pieces than one 3000-word one. As such, this series will contain multiple parts, each with its own subparts. Also, part 3 isn’t even about machine learning. Sorry to spoil the plot twist.
Recall the three types of machine learning algorithms from Part 1: prediction, grouping, and reinforcement learning. This article discusses the first type: prediction. For these algorithms, we need a set of features (i.e. known values for an entity) that can help us predict a target (i.e. a variable we know for past observations but not future ones). A common business example is customer churn. We might know the number of products a customer has purchased, their date of signup, or their age. These are features. We know whether or not that customer has churned in the past, but not whether or not they will churn in the future. That’s the target. In the first post, I argued that feature engineering (i.e. picking the right variables) constitutes the most important and difficult part of machine learning. Still, once that’s done, we need to a) find out if our model can help the business and b) share our results with the relevant stakeholders. That will be the topic of this article, because I see this either ignored or misunderstood in most machine learning pedagogy.
Why do we make predictions?
All good data science problems must start with an underlying business question. What problem is your team trying to solve? Without a solid answer to this question, you will wind up performing bullshit work that no one uses. To avoid bullshit work, let’s stay with a basic question: why do we make predictions? My answer: we make them to replace bad decisions with good ones. Let’s run through an example.
Imagine you work for the Church of Scientology and you’re trying to convert students on a college campus. You have limited time and money, so you can’t speak to every student. Thus, you need to target the students who are most likely to convert. Let’s say you’re camped out near the library, and you see two people walking by. One man looks lonely, dejected, and a bit lost. The other man is Pope Francis. To maximize the number of Scientologists in the world, we want to reach out to the person with the highest probability of converting. This requires an implicit probability prediction for both individuals, or, at least, a ranking of which man is more likely to convert. If we chose Pope Francis, we’d be wasting our resources. I imagine he’s pretty dogmatic in his religious beliefs, and he’s probably only on campus for the parties anyway. By making this distinction, we replace a bad decision (trying to convert the Pope) with a good decision (trying to convert the awkward guy).
In business, we lack the time to create an implicit probability for every observation. Thus, we need to create a model that separates the likely events from the unlikely ones. Some common business cases include:
Customer churn. We may want to spot risky customers and convince them to stay. Retaining customers involves both costs (employees from a customer success team contacting the customer) and a potential loss of revenue (offering discounts for the customer to stay). In this case, we’d want to replace offering concessions to reliable customers (bad decision) with offering concessions to shaky customers (good decision).
Marketing. Sending an offer may involve physical costs (like mail) and opportunity costs (sending a discount to a customer who would have bought at full price). Here, we’d like to replace unnecessary discounts (bad decision) with offers that induce a new purchase (good decision). If the offer costs a lot to send, we’d also like to replace sending offers that don’t result in a purchase (bad decision) with, uh, not doing that (good decision).
A sub-problem might involve which offer to send to a customer. A restaurant might send some customers a free beer and others a free wine. We’d want to know which customers would be enticed by which offers.
Medical diagnosis. The breast cancer dataset is a popular one for teaching predictive modeling. Here, we’d like to replace incorrect diagnoses (bad decisions) with correct ones (good decisions).
Predictive accuracy allows us to create value. However, data science content often forgets that adding value, not making accurate predictions, is the point of the exercise. Without this distinction in mind, the data scientist will report the wrong metrics. I will return to this distinction later in the article.
Class vs Probability
To start, I’ll clarify a few ideas. First, we might predict categories or values. In data science terminology, we call the former classification and the latter regression. I don’t like these terms, but they’ve stuck. Categories refer to discrete events, in contrast to continuous values. If we want to predict whether the Bengals or Rams win the Super Bowl, we would perform classification. If we want to predict how many points the Bengals score in the Super Bowl, we would perform regression. Most prediction problems in data science are classification, so I won’t discuss regression in this article.
Within classification, we can produce two types of predictions: classes and probabilities. Class refers to which category our model predicts. In the Super Bowl example, the classes would be 1) the Bengals and 2) the Rams. In our churn example, the classes would be 1) will churn and 2) won’t churn. In the breast cancer dataset, the classes are 1) the tumor is benign and 2) the tumor is malignant. A model can also predict non-binary classes. For example, a software company might want to predict whether a customer purchases a base, medium, or deluxe plan. In contrast, probability refers to the likelihood of each class occurring. One Super Bowl model might predict a 90% chance of a Bengals victory, while another might predict a 51% chance of the same outcome. In both cases, the models would produce the same class prediction (a Bengals victory), but they would differ in their predicted probability.
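To make the distinction concrete, here’s a minimal sketch with scikit-learn. The features and labels are randomly generated stand-ins for a real churn dataset, so treat it as illustration rather than a working churn model:

```python
# Sketch: class vs. probability predictions from the same model.
# X and y are random stand-ins for real customer features and churn labels.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))     # 1,000 customers, 5 features
y = rng.integers(0, 2, size=1000)  # 1 = churned, 0 = stayed

model = RandomForestClassifier(random_state=0).fit(X, y)

class_preds = model.predict(X[:5])              # class predictions: 0s and 1s
churn_probs = model.predict_proba(X[:5])[:, 1]  # probability of the "churn" class
```

The same model produces both outputs; the class prediction is just the probability with information thrown away.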
This distinction matters for data science problems. Consider the churn example above. In this case, our customer success agents might only have time to reach out to 200 customers each month. Thus, they’d want to know which 200 customers are most likely to churn. For the breast cancer or marketing examples, the decision might involve a complex calculation that accounts for potential costs and benefits.
Despite this, so much of the data science curriculum focuses on class-based metrics. One example is the confusion matrix and the various calculations derived from it. The confusion matrix displays the four possible outcomes of a binary prediction: we predict it to happen and it does, we predict it to happen and it doesn’t, we predict it to not happen and it doesn’t, and we predict it to not happen and it does.
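For reference, here’s how you’d compute one with scikit-learn (the labels below are made up):

```python
from sklearn.metrics import confusion_matrix

# scikit-learn's convention: rows are actual classes, columns are predicted.
# For binary 0/1 labels, the layout is:
# [[true negatives,  false positives],
#  [false negatives, true positives ]]
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]
print(confusion_matrix(y_true, y_pred))
```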
However, most machine learning models (including the random forest that began this series) can provide us the probability of an outcome occurring. From this, we can create a class outcome by predicting a positive (e.g. customer churns, customer makes a purchase, patient has breast cancer) when the probability is greater than or equal to 50% and a negative otherwise. We could do this, but we probably don’t need to.
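That conversion is nothing more than a cutoff applied to the probabilities. A one-line sketch, with hypothetical probabilities:

```python
import numpy as np

churn_probs = np.array([0.02, 0.45, 0.51, 0.97])  # hypothetical model output
class_preds = (churn_probs >= 0.5).astype(int)    # default 50% cutoff -> [0, 0, 1, 1]
```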
Consider the churn example above. Our customer success team will contact 200 customers every month. In this case, the class prediction doesn’t matter. In fact, the class predictions are likely negative for every customer. Unless you’re going bankrupt, very few customers churn in a given month, so your model might never produce a probability prediction north of 50%. In a more extreme case, like credit card fraud, the predictions may never exceed 5%. Imagine a situation where 99.9% of transactions show a less than 1% chance of being fraud. This could still help the company by allowing them to investigate the 0.1% of transactions with fraud probabilities above 1%.
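If the business constraint is “contact 200 customers,” the model only needs to rank. Here’s a sketch using made-up churn probabilities, skewed low the way real ones usually are:

```python
# Sketch: ranking customers by churn risk. The scores are made up,
# standing in for model.predict_proba(X)[:, 1] from a fitted model.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
customers = pd.DataFrame({
    "customer_id": np.arange(10_000),
    "churn_prob": rng.beta(1, 20, size=10_000),  # mean ~5%, almost never above 50%
})

# At a 50% cutoff, every class prediction here would be "won't churn,"
# but the ranking still tells the customer success team exactly who to call.
outreach_list = customers.nlargest(200, "churn_prob")
```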
Wait, why do we make predictions again?
Of course, we could just change the positivity threshold. Maybe we consider a transaction to be “likely fraudulent” if its probability sits above 5%, rather than the intuitive 50%. We could also try 6%, 7%, 8%, 9%... and, well, you can count. If we try every threshold from 0% to 100%, we create the receiver operating characteristic (ROC) curve. This curve provides two benefits. First, it makes an interesting visual: a curve closer to a right angle indicates a more accurate model, and a curve closer to a 45 degree line indicates a worse one. Second, we can calculate the area under the curve (AUC) to grade the model. A value close to 0.5 represents a model no better than random, and a value near 1.0 indicates a perfect model. This lets us select between competing models with a single statistic.
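Here’s a sketch of both the curve and its area with scikit-learn. The labels and scores are simulated so the example runs on its own:

```python
# Sketch: ROC curve and AUC on simulated data. Positives tend to
# score higher than negatives, but imperfectly, like a real model.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_score = np.clip(0.3 * y_true + rng.normal(0.35, 0.2, size=1000), 0, 1)

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (fpr, tpr) point per threshold
auc = roc_auc_score(y_true, y_score)               # ~0.5 is random, 1.0 is perfect
print(f"AUC: {auc:.3f}")
```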
Unfortunately, the ROC curve poses a couple of practical issues. One, it’s kinda hard to understand and explain. If you didn’t understand the previous paragraph, don’t worry, none of the stakeholders at work will either. In data science, your work only matters if it changes the operations of the business. If you can only meet with the Chief Marketing Officer for an hour each month, it’s probably not a good idea to spend that limited time explaining the ROC curve. Two, remember the point of predictive models: to replace bad decisions with good ones. Let’s think about the AUC for a second. Why should the business care if your model improved from a 0.7 AUC to a 0.75 AUC? How does that translate into profit? Chances are, few data scientists can answer these questions. Thus, I recommend only using the ROC curve “internally.” In other words, use it to evaluate your own models (and select between them), but don’t waste time presenting it to the business leaders. In Actually Existing ML & AI Part 2B, I will explain alternatives to the confusion matrix and ROC curve: calibration, lift, and expected value.