Scott Alexander, on Astral Codex Ten, wrote an original post on Ivermectin as a treatment for Covid, and then, more recently, a follow-up post in which he writes:
“Yeah, I have no idea what went wrong here, but a few big RCTs didn’t find an effect, plus I have a super-high prior for any new medical thing being false, so whatever, let’s move on”, which I admit is unvirtuous but I’m not sure how to avoid it.
I share this prior, and I also find it challenging to articulate it in a virtuous way, one that doesn’t come across as “this looks wrong to me, because it looks like other things I’ve seen that were wrong.” I have wanted to walk through my heuristics before, and this is a reasonable case study, given the high profile and complexity of the issue.
I have zero ego attachment to whether Ivermectin works or not. Instead, I want to explain the specific heuristics I used that suggest the case for it being an extremely effective treatment is weak. For example, ivmmeta.com presents a meta-analysis claiming the probability that the positive signal is due to chance alone is about 1 in a trillion.
Modeling Heuristics.
I did not read every paper in detail, but I did read through a few from the set that Scott considered admissible. As an example, let’s do a deep dive on a study by Ghauri et al., about which Scott said: “I don’t hate this study, but I think the nonrandom assignment (and observed systematic differences) is a pretty fatal flaw. I can’t find anyone else talking about this one. At least no one seems to be saying anything bad.”
After looking through this paper for about two minutes, my initial credibility alarm, built from pattern matching against other papers I think are wrong, says this paper is probably worthless. Let’s read deeper and see whether that first impression holds up:
What I immediately notice is that they are using a statistical model with controls to deal with the lack of randomization. The math and statistics behind their model are not enumerated or documented properly. The logistic regression output is not clear on what the parameter encodings mean (how do I interpret the coefficient on gender?), and it excludes a lot of important fit data. This is a strong signal that the practitioners don’t understand these tools or how to do statistics.
If we actually read the statistics, is this model saying that being older is better for your outcome? Because I see a positive coefficient on age. The model also finds sex and age to be poor predictors, with non-significant p-values, which we know is wrong from other studies. It also finds zinc to be extremely useful, despite larger trials not finding zinc to be beneficial. And yet the coefficient on Ivermectin is extremely predictive. What do we make of this? The parameters in this model that we can pin down from other, larger trials don’t align with what better-powered studies have found. If those parameters are wrong, why should we trust the treatment parameter?
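For contrast, here is a minimal sketch, on simulated data of my own invention rather than the paper’s, of the kind of reporting that would let a reader check a logistic regression: explicit variable encodings, fit statistics, and coefficients you can sanity-check against what larger studies have found.

```python
# A minimal sketch, on simulated data (not the paper's), of properly
# reported logistic regression output: encodings, fit stats, odds ratios.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.normal(55, 12, n),         # years
    "male": rng.integers(0, 2, n),        # 1 = male, 0 = female (state the encoding!)
    "ivermectin": rng.integers(0, 2, n),  # 1 = treated
})
# Simulated outcome in which age and male sex genuinely worsen outcomes,
# and the drug does nothing.
logit_p = -4 + 0.06 * df["age"] + 0.5 * df["male"] + 0.0 * df["ivermectin"]
df["bad_outcome"] = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

X = sm.add_constant(df[["age", "male", "ivermectin"]])
fit = sm.Logit(df["bad_outcome"], X).fit(disp=0)

print(fit.summary())                          # coefficients, standard errors, p-values
print("Log-likelihood:", round(fit.llf, 2))
print("McFadden pseudo R^2:", round(fit.prsquared, 3))
print("Odds ratios:\n", np.exp(fit.params))   # OR > 1 on age means older = worse outcome
```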
In addition, if we actually look at the distribution between the treatment and control groups, we see the following in Table 2: 18 women to 36 men in the control group, and 18 women to 23 men in the treatment group. We know Covid is more deadly for men. As a result, it becomes necessary to introduce controls in the analysis to try to remedy the lack of randomization.
However, the moment you start adjusting for confounders, you’re in trouble. A randomized trial allows you to asymptotically ignore all confounders. As long as you get the randomization right, you can skip controlling for confounders entirely (e.g., whether there are more men or women in group A or B).
If you have a weakly powered study, or a non-randomized study, you now have to control for the differences between the two groups. In fact, the theory underpinning this is that you need to set up controls to approximate a randomized trial. The entire theory of inference is built around randomization, and if you don’t have it outright, you must build the statistical machinery to recover it.
With proper randomization, however, you don’t need to reason about which controls you need. Once you lose randomization, not only do you need to introduce controls, but you need to be damn sure you’ve correctly guessed every dimension you failed to randomize across, or you’re stuck with omitted variable bias.
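To make the omitted-variable problem concrete, here is a toy simulation; the setup is entirely my own and not the paper’s. An unmeasured severity variable drives both who gets treated and who recovers. The naive comparison is badly biased without randomization, and fine with it.

```python
# Toy illustration of omitted variable bias: the drug has NO true effect,
# but sicker patients are less likely to receive it in the non-randomized arm.
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
severity = rng.normal(0, 1, n)            # unmeasured confounder

# Non-randomized: healthier patients are more likely to be treated
p_treat = 1 / (1 + np.exp(severity))      # decreasing in severity
treated_obs = rng.binomial(1, p_treat)

# Randomized: coin-flip assignment, independent of severity
treated_rct = rng.binomial(1, 0.5, n)

# Outcome depends only on severity; the true treatment effect is zero
recovery = rng.binomial(1, 1 / (1 + np.exp(severity - 0.5)))

def naive_effect(treated):
    return recovery[treated == 1].mean() - recovery[treated == 0].mean()

print("Naive effect, non-randomized:", round(naive_effect(treated_obs), 3))  # spuriously positive
print("Naive effect, randomized:    ", round(naive_effect(treated_rct), 3))  # ~0
```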
To summarize my concerns here:
We have a non-randomized trial, that we know is imbalanced in critical ways.
Non-treatment parameter values are incorrect when compared to better-powered studies.
I have no idea how good the model fit is, because they don’t include those statistics.
I don’t understand why they chose fever resolution as the outcome variable, and we’re left assuming it is a good proxy for more important outcomes.
The researcher degrees of freedom in design and modeling are very high, and the authors have not convinced us they did not go down the garden of forking paths.
I would consider this paper a reasonable argument in favor of a more well-powered study. But I would not consider it useful incremental data, even for a meta-analysis. More on this later.
Methodology Heuristics.
Stepping away from this paper for a minute: any researcher who has worked with statistics begins to learn how easy it is to fabricate results without meaning to. Once you load a dataset into memory and begin to build a model, it is irritating how easy it is to make it say what you want it to say. Randomization not only gives the asymptotic guarantees; it also removes a significant amount of model complexity and disciplines the hand of the practitioner, who no longer needs to guess at or gather additional data to control for the failure to randomize.
This is the part where it starts to feel unvirtuous, because it’s really hard to articulate what I mean. But let’s consider how the garden of forking paths might look here. First of all, the paper above was not randomized. Why were the genders so imbalanced? What assignment mechanism was employed? Who made that decision, and why? Why was fever chosen as the outcome variable? When did they decide to go with fever? Who took the temperatures? Did they track other negative outcomes, and if so, why didn’t they report them? Who wrote the code? Did they do robust code and data review? Who built the model? How many specifications did they try? Are they capable statisticians? How many comparisons of the data did they run?
Perhaps they only took a single path down this tree, and that’s what we see now. In that case, great. But what if they didn’t? What if, at each juncture, they tried a few things, went with what looked good, and then walked down to the next node in the tree? If that’s the case, then the results in this paper almost surely won’t replicate. They failed to credibly present an experiment and analysis that didn’t wander down the garden of forking paths.
Part of the issue is that most scientists have no idea that you shouldn’t do this, and that’s why it’s so difficult. Even I know this, and I still do it when trying to construct backtests for a model. I’ll try a few things a few times and go with what works. After I do that a few times, I’ll have an algorithm that forecasts the future extremely well. It’s a strange thing to comprehend without seeing it in front of you, but each time you try some new comparison, you truly do poison your results, and you can watch it happen right in front of you.
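Here is a small sketch of that poisoning in action, using pure noise and a toy specification search of my own: pick the best of many candidate “signals” in-sample and it looks like a real forecaster, then it evaporates out-of-sample.

```python
# Specification search on pure noise: keep the best of many candidate signals
# in-sample, and watch the "discovery" vanish out-of-sample.
import numpy as np

rng = np.random.default_rng(7)
n_obs, n_candidates = 250, 200
returns = rng.normal(0, 1, 2 * n_obs)                   # the thing we want to forecast (noise)
signals = rng.normal(0, 1, (2 * n_obs, n_candidates))   # candidate predictors (also noise)

train, test = slice(0, n_obs), slice(n_obs, 2 * n_obs)

# In-sample: correlation of each candidate with returns; keep the best one
in_sample_corr = np.array([np.corrcoef(signals[train, j], returns[train])[0, 1]
                           for j in range(n_candidates)])
best = np.argmax(np.abs(in_sample_corr))

print(f"Best in-sample correlation: {in_sample_corr[best]:+.3f}")   # looks like a real edge
out_corr = np.corrcoef(signals[test, best], returns[test])[0, 1]
print(f"Same signal out-of-sample:  {out_corr:+.3f}")               # ~0
```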
It’s the accumulation of all these individual flaws that chips away at the central argument of this paper and lowers our belief in its predictive validity. And unfortunately, our prior on published research being false is already pretty high. This paper would have been a lot better if the authors had taken a page out of Gelman’s blog: “In some sense, the biggest problem with statistics in science is not that scientists don’t know statistics, but that they’re relying on statistics in the first place.” The authors tried to do a statistical study, but they didn’t know statistics.
Before I move on to the next section, I also want to say that if it were day 1 of the pandemic, and you sent this paper through a wormhole to my past self, I would read it, say “this paper looks like really shitty research,” and then go buy ivermectin. Why? Because however weak the evidence in this paper is, it’s strong enough, and the cost of taking ivermectin is low enough, that it still passes a cost-benefit test. In fact, I made the same point at length about how taking HCQ was probably smart in the early days of the pandemic.
I should also acknowledge that these researchers may have been doing the best they possibly could with their resources, while also treating patients during a pandemic. That seems really difficult, and I don’t begrudge them for any mistakes.
Randomization and Complexity:
Before I go on to the next section, I want to expand a little more on why RCTs are so important. The consensus view in causal inference is that the counterfactual framework is how we generate knowledge. To understand causality in the world we inhabit, we wish we could fork reality, change one thing, and then compare the two worlds. An RCT is the closest we can get to this.
The sublime beauty of an RCT is that it lets us randomize over things we don’t even know could be confounders. If you lack that randomization property, you must collect the additional data and build a statistical model around it. And even if you do collect the data, there is no guarantee your model can adjust for it properly. Are you sure a linear adjustment is sufficient? Are you sure you’re not introducing more estimation error?
And without even going into statistics, collecting the data means humans transcribing data, putting them into computers, building databases (or CSVs sent by email, if we’re being honest), and writing code by people who aren’t software engineers and don’t know how git works. It’s a disaster. It’s the worst parts of software and database management smashed together with the numerical issues of statistics. This all becomes significantly more complicated technically, and it also allows far more researcher degrees of freedom in how to set up a model. Which is a clever way of saying an analyst can make different data collection choices, clean the data many different ways, and fit multiple models with multiple controls, until they find one that works.
In the language of computer science, we would say that this non-randomized approach has significantly higher cyclomatic complexity in the code base compared to a randomized trial. In the language of statistics, we might say that the Bayesian Information Criterion would penalize a model with many controls far more than an extremely simple RCT comparison.
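As a toy illustration (simulated data and a setup of my own choosing, not a claim about any particular study): when a pile of controls doesn’t genuinely improve fit, BIC prefers the bare treatment comparison.

```python
# BIC penalizes parameter count: a pile of uninformative controls loses to
# the simple treatment-only comparison.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
treatment = rng.binomial(1, 0.5, n)
controls = rng.normal(0, 1, (n, 10))              # 10 noise covariates
outcome = 0.3 * treatment + rng.normal(0, 1, n)   # only treatment matters

simple = sm.OLS(outcome, sm.add_constant(treatment)).fit()
kitchen_sink = sm.OLS(outcome, sm.add_constant(
    np.column_stack([treatment, controls]))).fit()

print("BIC, treatment only:      ", round(simple.bic, 1))
print("BIC, treatment + controls:", round(kitchen_sink.bic, 1))  # larger, i.e. worse
```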
Effectively you are trying to build complex machinery to recover your counterfactual world estimates, but one small misstep, one small variable you forgot to control for, and your castle in the sky comes crashing down. (There remain lots of things RCTs can’t solve, which I sketched out here, and Gelman also highlights a lot of issues RCTs can’t solve here.)
I am terrified of complexity, since complexity is a world of smoke and mirrors, where models can be overfit, and researchers can find hidden pathways that result in compelling yet fake results. The RCT not only solves our confounding issue, but it drastically simplifies the research.
Can’t we solve this with Meta-Analysis and Triangulation?
To wade back into the actual debates taking place: in his section The Bermuda Triangulation, Alex makes the argument that we have a lot of studies on Ivermectin, many of them positive, some negative. The strategy he claims Scott (and others) have taken is to systematically chip away at the positive studies, in order from weakest to strongest, and stop once you’ve dismantled enough that all you’re left with is null results.
It’s a point worth taking seriously, given a lot of the motivations we have seen from bad science™ actors. I suspect that is what many people have done, as a sort of ‘arguments as soldiers’ war of attrition, where they looked at the proverbial soldiers on the Ivermectin side, tried to spot the weakest ones, and took them out. This is stupid, and if people did that, they are wrong. In fact, there is a lot of mood affiliation going on in these Ivermectin debates, which is disappointing, and many people are trying to score points by pointing out flaws.
(Mood affiliation is particularly bad here, because in a world where there is vastly more information on any one subject than one person can comprehend, if you spend all your time trying to support or reject a point, you can amass tremendous amounts of evidence in favor of your side and yet be hopelessly biased.)
Against this chipping-away approach, Alex references error-correcting models from the signal processing literature: “If you’re worried about implementation issues in your sensors, use many different sensors.” He also links to the wiki on consilience, which defines it as the principle that “evidence from independent, unrelated sources can ‘converge’ on strong conclusions.” This seems to be what the IVMMeta guys are doing as well: their meta-analysis claims a 1-in-a-trillion probability that the IVM results are from chance alone, which suggests they also view all these sources as unbiased and independent.
So what is the best justification for what Scott did? Why would it make sense to dismiss data outright? Shouldn’t we consider it strictly superior to always make full use of our data?
To start, let’s translate Alex’s argument into the language of statistics. Suppose each paper is a measurement of the signal of Ivermectin’s efficacy. Each paper independently samples from that process, using some methodology, and the samples are independent and identically distributed.
If you have this property, then you can count on the errors cancelling out across your papers. This is the statistical foundation behind multi-sensor analysis and model averaging: if the errors cancel out, you reduce the variance across all your signals. However, if your errors do not cancel out, or are correlated with one another, then you do not achieve that variance reduction; your multiple sensors only give the illusion of independent, unrelated sources, and you don’t get convergence to a strong conclusion. You don’t get the one-in-a-trillion guarantee.
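A quick sketch of the difference, with numbers invented purely for illustration: averaging many estimates with independent errors shrinks the combined error, while a bias shared across studies survives averaging no matter how many “sensors” you add.

```python
# Averaging k noisy estimates of a true effect of zero:
# independent errors cancel; a shared (correlated) bias does not.
import numpy as np

rng = np.random.default_rng(3)
true_effect = 0.0
k, trials = 40, 10_000                     # 40 studies, pooled 10,000 times

# Case 1: independent errors only
independent = true_effect + rng.normal(0, 0.2, (trials, k))

# Case 2: same noise, plus a bias shared by every study in a given world
shared_bias = rng.normal(0.15, 0.05, (trials, 1))   # e.g. publication / p-hacking pressure
correlated = independent + shared_bias

print("Pooled estimate, independent errors:", independent.mean(axis=1).mean().round(3),
      "+/-", independent.mean(axis=1).std().round(3))   # ~0, spread ~0.2/sqrt(40)
print("Pooled estimate, correlated errors: ", correlated.mean(axis=1).mean().round(3),
      "+/-", correlated.mean(axis=1).std().round(3))    # biased, spread floored by the shared bias
```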
So consider any given paper studying a medical treatment, not specific to Ivermectin. The vast majority of biases push toward type 1 errors: finding an effect where one does not exist. How do we meta-model this? Well, suppose that each study that comes out draws from a set of potential biases it could have. For the uncorrelated property to hold, we would need these biases to be just as likely to produce a type 1 error as a type 2 error.
Let’s consider two taxonomies of biases. The first is biases internal to a given analysis. The most commonly cited, and most potent, is p-hacking. Taleb enumerated it well in this brief paper, but effectively each study, particularly small-sample studies, studies that are not properly randomized, or studies in which the researcher has made multiple comparisons without correction, will be biased towards finding significance that doesn’t exist.
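The arithmetic behind uncorrected multiple comparisons is a standard family-wise error calculation (my own illustration, not specific to Taleb’s paper): run enough tests on noise and a “significant” result is close to guaranteed.

```python
# Probability of at least one false positive among k independent tests at
# alpha = 0.05, when every null hypothesis is true: 1 - (1 - alpha)^k.
alpha = 0.05
for k in (1, 5, 10, 20):
    print(f"{k:>2} comparisons -> P(at least one 'significant' result) = {1 - (1 - alpha) ** k:.2f}")
# 1 -> 0.05, 5 -> 0.23, 10 -> 0.40, 20 -> 0.64
```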
But there are also many more boring biases. We tend to double check our work less when we get results that we like, whereas we spend far more time deep-diving for bugs in our code when the results don’t look how we want.
The second taxonomy is biases that determine which research ends up in front of us in the first place. The top contender here is publication bias, or file-drawer bias, where null results don’t get shared.
If these two exist, then we lose the property of independence. To return to the sensor analogy, we now have two issues:
We have reason to believe our sensors will be wrong in the same direction. (Due to each study having biases in the same direction.)
We have reason to believe there is a set of sensors that we were never able to observe. (Due to publication bias)
The consequence of this is that errors do not cancel out when combined, and your meta-analysis breaks.
So how do you solve this issue?
To do a proper meta-analysis, we want to get back to having uncorrelated sensors, which means pruning the set of results.
To start with, you have to purge a lot of weak evidence, like the example in the first section, where there is no credible way to trust the quasi-experimental modeling approach, since the biases in these papers will be correlated across studies. We know that, for all the pitfalls that plague all research, it’s much, much harder for fake signals from multiple-comparison errors or p-hacking to slip into well-powered randomized studies, so those will survive the purge.
Next, we need to remove the set of data from weaker studies that we believe is unbalanced due to publication bias. The reason this penalizes smaller, quasi-experimental studies more is that they are far easier to run, and, due to the smaller sample size, it’s really easy to get a large and significant result from randomness alone.
When that happens, the author will share it; when it’s all null results, they may not bother, and will instead move on to the next experimental medicine. (Would you rather spend an additional two months writing up a null result, or move on to the next drug that might help your patients?)
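Here is a toy version of that selection process, with every parameter invented for illustration: small studies of a drug with zero effect, where only the positive, significant ones get written up, yield a pooled estimate that looks impressive.

```python
# Publication bias on a drug with ZERO true effect: run many small studies,
# "publish" only the positive, significant ones, then naively pool those.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
n_studies, n_per_arm = 500, 30
published = []

for _ in range(n_studies):
    control = rng.normal(0, 1, n_per_arm)   # true treatment effect is 0
    treated = rng.normal(0, 1, n_per_arm)
    _, p = stats.ttest_ind(treated, control)
    effect = treated.mean() - control.mean()
    if p < 0.05 and effect > 0:             # only flattering results get written up
        published.append(effect)

print(f"'Published' {len(published)} of {n_studies} studies")
print(f"Pooled effect among published studies: {np.mean(published):.2f}")  # well above 0
```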
Let’s also take a moment here to appreciate how frustrating it is that a pivotal input to our evidence is approximately how many null results we were never able to observe. And in trying to correct for this, the conclusion of a meta-analysis will flip from detecting a signal to not detecting one based on these judgement calls.
Lastly, once you do all this, if the amalgamation of small, more poorly run trials shows a strong signal, but the single well-powered and randomized trial shows a weak signal, it seems less likely that you’re aggregating uncorrelated weak measurements of the actual effect, and more likely that you’re aggregating a lot of small, correlated biases.
What you can do at this point, which I have not done, is really dive into the nitty-gritty details of each paper. What was the dosage? When did they start giving the treatment? Were the plasma concentrations different in different patients? Did the larger, seemingly more credible RCTs actually fail to give the correct dosage, while the seemingly less credible trials got the dosage right?
Through this you could propose an alternative credibility weighting scheme, where one dimension of the weighting is what I did above: adjusting for publication bias and proper randomization. I don’t really know anything about this, so I am not going to wade in here. Although if I’m wrong, not knowing more about this is probably why.
Conclusion:
A remaining counterargument is that there exists metis (knowledge we learn by practice) that we aren’t integrating. I’m extremely sympathetic to this, as a lot of the most hard-fought knowledge humans have ever gained was generated this way. The British finally learned that vitamin C cures scurvy through the genius of an early A/B test, giving some sailors citrus while the controls received none. But Native Americans had, for some reason, been chewing bark with vitamin C for centuries prior, and I don’t want to come across as racist here, but I’m pretty sure they hadn’t developed A/B testing at the time.
Of course, the counter to this counter is that people have been taking folk remedies for ages that do absolutely nothing, and it’s not clear that in the limit we eventually discard the useless ones, which is why we created large randomized trials to isolate a good signal.
Finally, I don’t care if people want to take Ivermectin. It’s possible it has a weak effect, and I’d be very shocked if it ended up having an incredibly strong effect, but I’d have no issue with being wrong here, or with further studies trying to tease this effect out. I dislike bad actors who dismiss it outright, and those who boldly claim it is a substitute for vaccination, but I didn’t want to engage with those points in this post, since they’re not interesting to me. Mostly I’m frustrated with medical ethics groups who have prevented human challenge trials and hindered our ability to get clear signals from volunteers.
more stats stuff!