Last week, I found an ad from this Capital One campaign that boasted about its use of random forests to spot credit card fraud. I encourage every reader to stop right now and guess, on a scale of 1-5, how complicated and cutting-edge an algorithm a random forest is, 5 being the most complex. Capital One probably hopes that you guessed 4 or 5. I had a different reaction. As someone who’s deployed this model at my job, I thought “uhh... kinda weird that they’re bragging about using one of the simplest models.” On the 1-5 scale, I’d put a random forest in the 1.5-2 range. And that’s fine! Machine learning doesn’t need to be complicated.
Yet, multiple parties insist on the complexity of machine learning for a few reasons. One, companies want to appear cutting edge to both consumers and potential employees. They want to announce that they do what Google and Apple do. Data scientists also want to mystify their work, both to negotiate a higher salary and look more sophisticated on Tinder. And, of course, a lot of it really is complex. In my own projects, I’ve found academic papers about AI and ML that I can't understand.
However, this purported complexity doesn’t represent how most practicing data scientists use machine learning. A lot of online machine learning articles look something like this. A quick scroll through this article will bewilder the layperson. XGBoost? Hyperparameters? Gamma? Impurity measure? Grid search? Must be pretty advanced stuff, right? Well, maybe. Here’s the secret, though: most data scientists don’t care about any of it. I’ve seen many machine learning models at work, and produced some of my own. Almost nobody bothers grid searching, and those that do just pull a few numbers out of their asses. That article’s meticulous search for the optimal parameters just doesn’t happen.
Employing the models themselves requires little effort. Data scientists on LinkedIn often state that machine learning requires a solid understanding of linear algebra. It probably does if you research AI at a university, but the average data scientist doesn’t need to understand linear algebra any more than the average stewardess at Southwest Airlines needs to understand aerospace engineering. Someone already programmed these models, so we don’t need to do any of the math.
In this series, I’ll explain how practicing data scientists actually use AI and machine learning at work. As in, what these professionals physically do when they sit down at their computers. I admit I’m acting a bit snarky with the phrase “actually existing.” Of course, grid searching the gamma hyperparameter actually exists; it’s a thing people do. In this context, I’m referring to what “actually exists” in the majority of data science that occurs in private sector offices. Let’s start with the basics: what is machine learning?
What Is Machine Learning?
I’ll start with an apology: I don’t have a good definition of machine learning. I find the term difficult to define, and I don’t find most definitions illuminating. Wikipedia defines machine learning as “the study of computer algorithms that can improve automatically through experience and by the use of data.” This definition makes sense to me, but it doesn’t seem like it would help someone who wasn’t already familiar with machine learning.
Imagine you met a prehistoric caveman who somehow spoke and understood English. He heard of this thing called an “apple,” and wondered what that was. How would you explain the concept of an apple to him? I’m sure there’s some scientific definition of an apple, based on the genus of the trees or some genetic code. You could tell him about that, but he probably wouldn’t understand it (and neither would you). Personally, I’d prefer to show him some examples of apples and say “it’s fruit with a shape, texture, and taste kinda like these.” Such a definition would both illustrate the concept and probably represent how our own brains define the word.
I’ll do the same with machine learning. To me, machine learning consists of three sets of algorithms that, as Wikipedia reminds us, “improve automatically through experience and by the use of data.” Note that more sophisticated blogs will cut machine learning into two categories: supervised and unsupervised. This is “technically correct,” but I think three provides a better picture of the concept.
Before discussing the three categories, I need to define some terms. In most machine learning models, you will work with a “dataset” that looks like this:
The data contains columns, referred to as “features.” The one exception: sometimes one column is the “target.” The rows, meanwhile, represent “observations.” In the table above, each observation represents an iris flower.
Prediction (supervised learning)
The first type of task is prediction. Here, we need historical (i.e. in the past) data from which we can learn the relationship between the features (columns that we always know) and the target (the column that we don’t know and want to predict). Machine learning can’t be used for all types of predictions. It wouldn’t help us predict whether Russia will invade Ukraine, for example. For us to apply machine learning to that, we’d need thousands of universes, including some where Russia does invade and some where it doesn’t. Either these universes don’t exist, or we live in one of the universes that doesn’t know about the other ones. Or it’s all a simulation. Regardless, we don’t have the sort of historical data needed to make such a prediction.
Recall the credit card fraud example from above. For this, Capital One data scientists will utilize a historical dataset that contains many non-fraudulent transactions and (hopefully not very many) fraudulent ones. The target variable will be a binary indicator of whether or not a given transaction was fraudulent. The data scientist will also use a dataset of features about these transactions: time of day, amount, types of items purchased, etc. From this historical dataset, data scientists can “fit” (also known as “train”) a model to learn the relationship between the features and the target.
They will then have a dataset where they know the features, but not the target. In other words, they’ll know information about the transaction, but not whether or not it’s fraudulent. The data scientist will take the trained model and use it to predict the probability that one of these new transactions is fraudulent.
I’ve mentioned a mysterious “model” here, and readers might wonder what that looks like. Good news, I can show you pretty easily. In fact, let’s use the random forest that Capital One adores. For the Python code below, X refers to a dataset of features, while y refers to the target variable (fraudulent or not fraudulent).
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X, y)
Done! We have now fit the random forest to our historical data. Next, imagine we have a new set of transactions from which we want to make predictions. I’ll refer to this as “new_data.”
rf.predict_proba(new_data)
That’s it! You now know machine learning. Add it to your resume, and I’ll send you a certificate in the mail. Also, in case you’re wondering, “proba” is short for probability.
Grouping (unsupervised learning)
Recall the iris dataset above. Unsupervised models can answer two types of questions about these flowers. First, what types of iris flowers are there? Second, what are the latent features of the iris flower? Unlike prediction, we lack a “target” variable. There’s nothing to predict here; we just want to flatten a large dataset into more digestible chunks.
For the first question, we “cluster” the rows together to find similar observations. In a company, you might do this to demarcate different types of customers. Maybe some only buy discounted products, some only buy around Christmas, some purchase frequently at full price, etc. Clustering algorithms like K-means and Hierarchical Clustering can help us find these “types” in a more systematic way. Note that we have to tell the model how many groups we want it to find, and some heuristics help us pick the optimal number of clusters.
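To make this concrete, here’s a minimal sketch of K-means with scikit-learn, using a tiny made-up customer table (the two feature columns and their values are my own invention):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer features: [orders per year, share bought on discount]
customers = np.array([
    [2, 0.9], [3, 0.8], [1, 1.0],     # discount-only shoppers
    [20, 0.1], [25, 0.0], [18, 0.2],  # frequent full-price shoppers
    [4, 0.3], [5, 0.2],               # occasional shoppers
])

# We tell the model how many groups to find (here, 3)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)
print(labels)  # one cluster id (0, 1, or 2) per customer
```

Notice that, just like the random forest, the clustering itself is three lines; the work is in deciding what the feature columns should be.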
For the second question, we group the columns (rather than the rows) to find latent features. These features exist in the data, but we don’t have an easy way of measuring them. For example, we can measure someone’s height, weight, and age. However, we can’t pull out a ruler and measure their personality. Thus, these personality features are latent. Instead, psychologists use a test like this one, and these tests create a dataset with a lot of columns. For instance, a standard personality test may contain 100 questions. Such a test would produce a 100-column dataset, where each column represents one question. However, that dataset wouldn’t tell us anything interesting on its own. Instead, psychologists have reduced these sorts of test results to five “latent” features, better known as the Big Five: openness to experience, conscientiousness, extraversion, agreeableness, and neuroticism.
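Psychologists historically used factor analysis for this, but a related dimensionality-reduction technique, PCA, shows the same idea. Here’s a sketch on simulated survey data, where I plant two hidden traits and then recover them from the answer columns (all the numbers here are made up):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical survey: 200 respondents answering 10 questions.
# We simulate two latent traits secretly driving the answers.
latent = rng.normal(size=(200, 2))    # hidden "personality" scores
loadings = rng.normal(size=(2, 10))   # how each trait maps to each question
answers = latent @ loadings + rng.normal(scale=0.3, size=(200, 10))

# Reduce the 10 observed columns down to 2 latent features
pca = PCA(n_components=2)
traits = pca.fit_transform(answers)
print(traits.shape)  # (200, 2): one row per respondent, one latent column per trait
```

The model doesn’t name the traits; it just finds two columns that summarize the ten. Interpreting them (“openness,” “neuroticism”) is a human job.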
Reinforcement Learning (also supervised learning)
Like prediction, reinforcement learning also involves a target variable. Because of this similarity, blogs and textbooks refer to both as “supervised” learning. However, I think these models use the target variable much differently, so I don’t find much utility in grouping them together.
To explain reinforcement learning, I’ll start with the multi-armed bandit problem. Imagine two slot machines. For simplicity, these two slot machines provide binary outcomes: you either win nothing or win $1. Let’s say you have $1,000 to spend between them, and you don’t know which machine pays out more frequently. You could play each 250 times, find which one’s better, and then choose that one for the last 500 plays. However, you could waste a lot of money on the crappier machine. What if, after only 100 plays of each, it looks like machine A pays more? Should we stop playing machine B? Is that enough of a sample? What about after 50 plays? 20?
To solve this problem, we can use a reinforcement learning algorithm. We start with a prior for each machine. Maybe we think, for instance, that each machine pays out 50% of the time. We start with machine A and lose. Now we update our prior a bit, thinking it only pays out 49% of the time. From there, we simulate a random draw from each machine based on these priors. In this case, our simulation is slightly more likely to pick machine B than machine A, since we have a loss with one and not the other. We then check which machine performed better in the random simulation, and pick that one in real-life. Finally, we update our priors based on the real-life result and play again. Over time, we’ll become less likely to pick the stingier machine in our simulations, until we eventually stop playing that machine altogether.
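The loop described above (sample from each machine’s current belief, play whichever looked better in the simulation, then update the belief with the real result) is known as Thompson sampling. Here’s a minimal sketch, using Beta distributions as the priors and made-up payout rates that the algorithm can’t see:

```python
import random

random.seed(42)

# Hypothetical true payout rates (unknown to the algorithm)
true_rates = {"A": 0.45, "B": 0.55}

# Beta(1, 1) priors: one win and one loss pseudo-count per machine
wins = {"A": 1, "B": 1}
losses = {"A": 1, "B": 1}

plays = {"A": 0, "B": 0}
for _ in range(1000):
    # Simulate a random draw from each machine's current belief...
    sampled = {m: random.betavariate(wins[m], losses[m]) for m in true_rates}
    # ...and play the machine that performed better in the simulation
    choice = max(sampled, key=sampled.get)
    plays[choice] += 1
    # Update our belief based on the real-life result
    if random.random() < true_rates[choice]:
        wins[choice] += 1
    else:
        losses[choice] += 1

print(plays)  # counts of real-life plays per machine
```

Typically, the play counts drift toward the better machine as its posterior pulls ahead, which is exactly the “eventually stop playing the stingier machine” behavior described above.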
I can provide more details (such as how the simulation might work) in a future post. Here, I’m trying to impart the basic idea. We have a target we want to reach (win money from the machines), and we create an algorithm that progressively becomes better at reaching the target through trial and error. Modern chess AIs work in a similar way. They simulate playing a series of moves, and choose the move that reached the target (checkmating the opponent) most frequently in the simulated games. In a corporate setting, a data scientist might run two types of ads, and use a reinforcement learning algorithm to gradually increase the presence of the better one.
So Is Data Science Easy?
By now, you have the code for Capital One’s fraud algorithm and some use-cases for machine learning. Are you ready to apply for a $200k job? Not so fast. At one job, I held a recurring data science discussion session. After discussing a clustering algorithm, one of the junior analysts asked “ok, but how do you choose which variables to put in the model?” That’s a good question. The answer: well, uhh.... That’s the hard part of machine learning. The models might take 5 minutes to code, but obtaining the right features (referred to as “feature engineering”) takes a ton of time and thought.
Let’s imagine a straightforward problem that might occur on an e-commerce site. The company wants you to predict the probability of a customer purchasing a product in the next thirty days. They want to send offers to those with a low probability of buying, and they want to avoid sending discounts to those who would have purchased something anyway. You’d need to find features that would help you determine the probability of a purchase. To do so, we’d have to look at past purchasing behavior. The company probably owns hundreds of relevant tables, but let’s imagine a really simple one. This hypothetical dataset contains every customer’s visit to the website. It has three columns: user ID, login time, logout time.
Just from this dataset, let me brainstorm some potential features:
Number of logins in the past 30 days
Number of logins in the past 90 days
Number of weekend logins in the past 30 days
Number of logins in the first 30 days after account creation
Number of logins per day in the past 30 days
Average time spent per login
Of all days with logins, the 75th percentile of time spent logged in
Of all days with logins, the standard deviation of time spent logged in
Of all days with logins, the kurtosis of time spent logged in
Number of consecutive days with logins
Number of consecutive days with logins that lasted at least 5 minutes
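To show what computing a couple of these features actually looks like, here’s a sketch in pandas against a tiny made-up version of that three-column login table (the column names and values are my own invention):

```python
import pandas as pd

# Hypothetical login table: user ID, login time, logout time
logins = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "login_time": pd.to_datetime([
        "2022-01-03 09:00", "2022-01-10 14:00", "2022-01-28 20:00",
        "2022-01-05 08:00", "2022-01-06 08:30",
    ]),
    "logout_time": pd.to_datetime([
        "2022-01-03 09:30", "2022-01-10 14:05", "2022-01-28 21:00",
        "2022-01-05 09:00", "2022-01-06 08:45",
    ]),
})

# Logins in the past 30 days, relative to a chosen cutoff date
cutoff = pd.Timestamp("2022-01-31")
recent = logins[logins["login_time"] >= cutoff - pd.Timedelta(days=30)]

# Session length in minutes, for the average-time-per-login feature
logins["session_minutes"] = (
    logins["logout_time"] - logins["login_time"]
).dt.total_seconds() / 60

# One row per customer, one column per engineered feature
features = pd.DataFrame({
    "logins_past_30d": recent.groupby("user_id").size(),
    "avg_minutes_per_login": logins.groupby("user_id")["session_minutes"].mean(),
})
print(features)
```

Each feature is a few lines of groupby code, but multiply that by dozens of candidate features across hundreds of tables and you can see where the time goes.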
...and we could go on like this. You could probably think of dozens more, and that’s just for this one table! The data scientists must explore the data to figure out which features predict future purchases and which ones don’t - the proverbial signal and noise.
Furthermore, data scientists often describe real-life data as “dirty,” roughly meaning that users have to manipulate the data in a myriad of ways to obtain basic information. It might not be straightforward, for instance, to answer a question like “how many logins did we get last Tuesday?” When exploring this dataset, for example, the data scientists may end up noticing oddities like:
Login time sometimes occurs after logoff time
Many sessions don’t contain either a login time or a logoff time
Some sessions last for over 200 hours
Some customers registered thousands of logins per day
Some customers purchased products online without a corresponding login record for that day
Some customers logged in months before their accounts were created
Some customers logged in months after their account churned
Some customers logged in in 1935
Others in 2035
Last March, the software developers pushed an update to the system, which tripled the number of logins in the data. They don’t know why this occurred
Oh, and these aren’t actually logins. It’s actually authentication calls, which is kinda like a login but...
...and we could go on. This is how corporate data works, and it’s nothing like the iris dataset that college students will summarize in their bootcamps. Oh, and remember, this is to predict which customers will make a purchase in the next 30 days. The data engineering team will inform you that the purchase data can be found in 5 separate tables, but some of the records aren’t actually purchases. Oh, and what’s a “customer?” Well, the engineers said that “customer id” doesn’t actually represent a customer. Sometimes multiple customers use the same id. Sometimes a customer deletes her account and creates a new one, triggering two ids. Sometimes a customer deletes her account and creates a new one, keeping the same id. Oh, and some check out as guests even though they have an account. Also, last August, something broke and added a bunch of leading zeroes to the customer ids, but some customer ids do actually begin with 0s.
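Some of these oddities can at least be caught with quick sanity checks before any modeling starts. Here’s a sketch of what that looks like in pandas, on a made-up login table seeded with a few of the problems listed above:

```python
import pandas as pd

# Hypothetical login table containing some of the oddities described above
logins = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "login_time": pd.to_datetime([
        "2022-01-03 09:00", "2022-01-05 08:00",
        "1935-06-01 12:00", "2022-01-07 10:00",
    ]),
    "logout_time": pd.to_datetime([
        "2022-01-03 08:00",  # logout before login
        "2022-01-14 08:30",  # a 200+ hour session
        "1935-06-01 12:30",  # a login in 1935
        "2022-01-07 10:20",  # a normal session
    ]),
})

duration = logins["logout_time"] - logins["login_time"]
suspicious = (
    (duration < pd.Timedelta(0))                           # logout before login
    | (duration > pd.Timedelta(hours=200))                 # absurdly long sessions
    | (logins["login_time"] < pd.Timestamp("2000-01-01"))  # impossible dates
)
print(logins[suspicious])  # rows worth asking the data engineers about
```

Writing the checks is easy; deciding what to do with the flagged rows (drop them? fix them? ask engineering why they exist?) is the actual work.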
I think most practicing data scientists understand this, but I don’t think many outside the data science world do. I’ve seen some talk about AutoML, and how that could revolutionize data science. It won’t. Remember the random forest code from above? Cross out RandomForestClassifier(), enter the name of the AutoML model, and you’re done. The model is the easy part. Unfortunately, we still have to understand the business, think about the problem, discuss the data with key stakeholders, and construct a plan to reach a solution. You know, boring work stuff.