If you’ve been following the replication crisis in social science, you’ve probably heard the term “p-hacking.” This refers to researchers twisting the data until it provides interesting and publishable results.
First, I’ll provide a brief background on the p-value. Roughly, the p-value tells you how likely it is that you’d see a result at least as extreme as the one you observed if nothing but random chance were at work. In most cases, “at least as extreme” means “at least that far away from zero.” Let’s say I’m trying to measure the gender wage gap. I find that, in my model, women earn $2 less than men per hour, with a p-value of 0.03. I would therefore conclude that, if there were no real gap, there would be only a 3% chance of seeing a difference this large due to random noise.
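To make that concrete, here is a minimal sketch of how a coefficient and its p-value might be read off a regression with statsmodels. The data is fabricated purely for illustration (I’m assuming a $2/hour gap plus noise), not real wage data.

# Illustrative only: fabricate wages with an assumed $2/hour gap, then estimate it
import random
import pandas as pd
import statsmodels.formula.api as smf

random.seed(0)
df = pd.DataFrame(data={'female': random.choices([0, 1], k=1_000)})
# Hypothetical wages: a $20 base, a $2 penalty for women, and random noise
df['wage'] = [20 - 2 * f + random.normalvariate(0, 8) for f in df['female']]
result = smf.ols(formula='wage ~ female', data=df).fit()
print('estimated gap: ', result.params['female'])
print('p-value: ', result.pvalues['female'])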
Social scientists generally aim for a p-value below 0.05, or 5%. There’s nothing special about 5%, by the way; the threshold simply became convention. Hence, researchers treat results as noteworthy when there is only a 1-in-20 shot (or less) that those results could have occurred due to random chance. That seems like a reasonable threshold until you run multiple models. If I want to find a result with a 1-in-20 shot of occurring due to randomness, I just need to take 20 shots. Researchers can therefore run hundreds of comparisons and report only the ones that meet the threshold. This can make a pile of randomness look like a meaningful result.
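To see how quickly those 20 shots add up, here is a small sketch of my own (using scipy’s t-test; this is separate from the study below): it repeatedly runs 20 comparisons on pure noise and counts how often at least one of them clears the 0.05 bar.

# Illustrative only: 20 comparisons on pure noise, repeated 1,000 times
import random
from scipy import stats

random.seed(0)
trials, hits = 1_000, 0
for _ in range(trials):
    p_values = []
    for _ in range(20):  # 20 "shots", each comparing two groups of random numbers
        a = [random.normalvariate(0, 1) for _ in range(100)]
        b = [random.normalvariate(0, 1) for _ in range(100)]
        p_values.append(stats.ttest_ind(a, b).pvalue)
    if min(p_values) < 0.05:
        hits += 1
# With independent tests, roughly 1 - 0.95**20 (about 64%) of trials should contain a false positive
print('share of trials with at least one p < 0.05: ', hits / trials)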
I also encounter this at work. Stakeholders will filter dashboards by segment, sub-segment, timeframe, day of week, etc. They might find that segment A outperformed segment B in 2020 on weekdays for metric X but not Z. This could tell us something interesting about the two segments, but it could also just be a multiple comparison problem.
If you compare enough variables, some of them will differ, even if they’re randomly generated. You’ve probably read this point before, but I don’t know whether it really “sinks in” for people who don’t work with data in their daily lives.
To illustrate this, I will p-hack a medical study. The study will look at five patient outcomes before and after a treatment. The patients are randomly split into a treatment group and a control group. I have also grouped the patients into 8 segments via the 2 sexes and 4 blood types. That gives 40 possible comparisons: 5 medical outcomes for each of the 8 segments.
Remember, this means I’d expect about 2 statistically significant results from these comparisons (40 comparisons at the 1-in-20 threshold works out to 2), even if the data is totally random. That’s relevant for my study because, well, my data is totally random.
Yup, to illustrate the point, I fabricated 10,000 patients, randomly assigning each one a sex, a blood type, and a treatment group. I also randomly generated each of the five medical outcomes for every patient, drawing a separate value before and after the treatment. You can see the code below, and it should run in any standard Python environment. I don’t expect most readers to work through the code, but I leave it there for the curious.
So, how did this treatment perform?
Approaching Significance:
Lowered BMI by 0.3 points among men with type B blood (p=.11)
Raised A1C by 0.12 points among men with type B blood (p=.10)
Raised A1C by 0.13 points among women with type A blood (p=.11)
Raised Zombie Encounters by 0.2 among women with type B blood (p=0.08)
Lowered Systolic Blood Pressure by 1.3 points among women with type AB blood (p=0.09)
Raised Zombie Encounters by 0.2 among women with type AB blood (p=0.05; strictly it’s 0.052, so not quite significant)
Significant:
Raised BMI by 0.42 among men with type O blood (p=0.04)
Raised A1C by 0.2 among women with type B blood (p=0.01)
I hope you understand p-hacking a bit better now.
# Package Imports
import pandas as pd
import numpy as np
import random
import copy
import statsmodels.formula.api as smf

# The __name__ == '__main__' check just ensures the code runs only when this file is executed directly
if __name__ == '__main__':
    # Setting the random seed ensures that the random number generator will create the same data
    # every time the file is run
    random.seed(0)
    # Creating the initial patient list, setting their blood type, sex, and whether they received the treatment or control
    pre_data = pd.DataFrame(data={'blood': random.choices(['A', 'B', 'O', 'AB'], k=10_000),
                                  'sex': random.choices(['M', 'F'], k=10_000),
                                  'treatment': random.choices([0, 1], k=10_000)})
    pre_data['patient_id'] = pre_data.index
    pre_data['after'] = 0
    # Create a dataset with before and after parts. The patient info (sex, blood type, treatment/control group)
    # will stay the same, but the health outcomes will change
    post_data = copy.deepcopy(pre_data)
    post_data['after'] = 1
    combined_data = pd.concat([pre_data, post_data]).reset_index(drop=True)
    # Randomly generate health outcome data
    combined_data['bmi'] = [random.normalvariate(25, 2.5) for _ in range(20_000)]
    combined_data['a1c'] = [random.normalvariate(5, 1) for _ in range(20_000)]
    combined_data['systolic'] = [random.normalvariate(110, 10) for _ in range(20_000)]
    combined_data['cholesterol'] = [random.normalvariate(160, 30) for _ in range(20_000)]
    combined_data['zombie_encounters'] = [int(random.uniform(0, 5)) for _ in range(20_000)]
    # Loop through all possible cohorts and health outcomes, printing the results
    for sex in ['M', 'F']:
        for blood in ['A', 'B', 'O', 'AB']:
            subset_data = combined_data[(combined_data['sex'] == sex) &
                                        (combined_data['blood'] == blood)]
            outcome_li = ['bmi', 'a1c', 'systolic', 'cholesterol', 'zombie_encounters']
            for outcome in outcome_li:
                # The after*treatment interaction captures the before/after change in the treatment group
                # relative to the control group
                formula = outcome + ' ~ after*treatment'
                result = smf.ols(data=subset_data,
                                 formula=formula).fit()
                print('sex: ', sex,
                      '; blood type: ', blood,
                      '; health outcome: ', outcome,
                      '\n', 'effect size: ', result.params['after:treatment'],
                      '\n', 'p: ', np.round(result.pvalues['after:treatment'], 3)
                      )