Richard Tol’s 97% Scientific Consensus Gremlins
Last year Cook et al. released a paper that analysed the scientific consensus on anthropogenic global warming in the peer-reviewed scientific literature.
What they did in that study was look at almost 12,000 abstracts from 1991 to 2011 that matched the search “global climate change” or “global warming.” What they found after analysing these abstracts is that among those that expressed a position on global warming, 97% endorsed the consensus position that humans are causing global warming. They also contacted 8,547 authors to ask if they could rate their own papers and received 1,200 responses. The results for this again found that 97% of the selected papers stated that humans are causing global warming.
For anyone who is aware of other studies that did something similar, these results weren’t a surprise: studies like Oreskes 2004, Doran 2009 and Anderegg 2010 showed similar results. No matter how these studies approach the subject, they find this level of agreement among experts and in the scientific literature. This remarkable agreement exists because a scientific consensus rests on the weight and amount of research that is available in the literature. It’s the same kind of scientific evidence that led to the scientific consensus on, for example, evolution, plate tectonics, the big bang, and germ theory. Such a consensus only arises through meticulous study and hard work by scientists.
That is also the reason the Cook et al. study is so relentlessly attacked by science deniers and pseudo-sceptics. It’s the only tactic they really have: they can’t base their case on scientific research because they don’t have the supporting evidence to show that they are right. Most of the time they can only allude to nefarious goings-on that prevent such evidence from getting into the literature. But that ignores that falsifying a well-established scientific theory or concept advances the career and reputation of a scientist more than confirming it does.
One attacker of the Cook et al. paper surprised me though: the econometrician Richard Tol. Since the release of the Cook et al. paper he’s been criticising it and attempting to show his criticisms have merit. Most of it has played out via his blog, on Twitter, and in the comment sections of various websites. What he said there seems to be the basis for his paper critiquing Cook et al. After 4 attempts at 3 different journals, his fifth attempt at getting this paper accepted finally succeeded. After reading it I can understand why he had trouble getting it accepted, and I don’t understand how it managed to survive peer review.
The biggest issue I have with the paper Tol wrote is that it cites some questionable sources, or cites material that doesn’t support what he’s claiming. One example is when Tol says that “Legates et al. tried and failed to replicate part of Cook’s abstract ratings, showing that their definitions were inconsistently applied.” The paper he’s referring to is ‘Climate Consensus and ‘Misinformation’: A Rejoinder to Agnotology, Scientific Consensus, and the Teaching and Learning of Climate Change’.
The biggest problem with this claim is that this paper didn’t try to replicate the Cook et al. paper. Legates et al. used different categories than the Cook et al. paper which gave them a lower consensus percentage (which is why they did this). If that wasn’t already bad enough they also excluded several categories when they calculated their consensus percentage based on the entire literature sample, which includes papers that didn’t say anything about global warming. That’s the reason they found a 0.3% consensus in this paper instead of 97%.
But that’s nonsensical: you can’t use papers that don’t say anything about the question you’re trying to answer. Take for example a literature search on HIV to answer the question of whether HIV causes AIDS. When you do this you won’t only get papers that talk about this link; the majority will talk about something entirely different, for example how HIV is being tested as a possible carrier of genetic material in gene therapy (don’t worry, it doesn’t contain the RNA of HIV so it can’t cause AIDS). That is a very interesting topic and very promising for helping people with genetic disorders, but it doesn’t tell you if HIV causes AIDS. This simple analogy shows how weak the reasoning in this paper is.
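To make the denominator problem concrete, here is a minimal sketch of the arithmetic. The counts are invented round numbers chosen for illustration, not the actual Cook et al. or Legates et al. data, and the names (`share`, `narrow_endorse`, and so on) are my own. It only shows how the same sample can yield a percentage near 97% or near 0.3%, depending on which papers you count and what you divide by.

```python
def share(count, total):
    """Return count as a percentage of total."""
    return 100.0 * count / total


# Hypothetical counts for illustration only (NOT the real study data).
total_abstracts = 12_000      # everything the literature search returned
no_position = 8_000           # abstracts that say nothing about the cause
endorse = 3_880               # abstracts endorsing human-caused warming
reject_or_uncertain = 120     # abstracts rejecting it or uncertain
narrow_endorse = 40           # abstracts in one narrowly defined endorsement category

# The four groups together make up the whole (hypothetical) sample.
assert no_position + endorse + reject_or_uncertain == total_abstracts

# Consensus among abstracts that actually take a position.
among_position = share(endorse, endorse + reject_or_uncertain)

# One narrow category divided by the entire sample,
# including abstracts that never address the question.
narrow_over_all = share(narrow_endorse, total_abstracts)

print(f"{among_position:.1f}%")   # 97.0%
print(f"{narrow_over_all:.1f}%")  # 0.3%
```

Nothing about the underlying papers changes between the two numbers; only the numerator and the denominator do.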
However, what truly amazed me was that he got away with citing blog posts. For example:
Twelve volunteers rated on average 50 abstracts each, and another 12 volunteers rated an average of 1922 abstracts each. Fatigue may have been a problem,8 with low data quality as a result
For his fatigue point he cites the blog post ‘I Do Not Think it Means What You Think it Means‘ (archived here), which uses out-of-context quotes from private conversations. These were obtained by a hacker who managed to bypass the security measures of a private forum to gain access to this material. This means that Tol uses illegally obtained material from a private forum, quoted out of context, to support a claim. Not one of these quotes talks about fatigue. He’s basing this particular claim on a partial quote from a comment on that blog post (as explained in his footnote):
Indeed, one of the raters, Andy S, worries about the “side-effect of reading hundreds of abstracts” on the quality of his ratings.
This is the full quote:
Like Sarah, I sometimes get a “deja lu” feeling. But I’m not sure if that’s real or just a side-effect of reading hundreds of abstracts. I’ll maybe note the title when it happens so that John can check the database.
What Andy is talking about isn’t fatigue, just that some abstracts look really similar. Andy wasn’t sure if his impression that some abstracts were repeated in the database was real or just a side-effect of reading hundreds of abstracts. It was a side-effect: there weren’t any duplicate abstracts in the database, just abstracts on similar subjects with similar results and similar phrasing. That will create a déjà vu feeling when you’re rating abstracts over the course of a couple of months, but that’s not fatigue.
Tol citing stolen material and showing a willingness to use ethically questionable sources is part of a larger pattern. For example he cites a different hack on his blog that exploited a security hole to gain access to proprietary data used for the Cook et al. paper (archived here):
[The hacker] has now found part of the missing data.
Unfortunately, time stamps are still missing. These would allow us to check whether fatigue may have affected the raters, and whether all raters were indeed human.
Rater IDs are available now. I hope [the hacker] will release the data in good time. For now, we have to make do with his tests and graphs.
Cook’s university is sending legal threats to a researcher who found yet another chunk of data.
The reason this ‘legal threat’ was sent is that the stolen materials are proprietary and releasing them would violate the ethical approval for the Cook et al. paper. Some raters were promised anonymity (several aren’t credited in the Cook et al. paper). This was also done to prevent attacks on raters for how they rated papers, as this isn’t relevant for verifying whether the ratings were done correctly. With the hack of the private forum it’s impossible to anonymize this last bit of data to prevent that. The legal statement by the university that Tol is referring to explains this. Yet Tol still misrepresents this statement and apparently has no qualms about using stolen materials.
But this whole point about fatigue is nonsense. What Tol is referring to is survey fatigue, the tendency of people to quit or get less accurate when they fill in long surveys or a lot of surveys. But this was a team of raters who were free to rate abstracts at their leisure: they could start whenever they wanted, continue as fast or as slow as they wanted, and take a break when they wanted. There wasn’t a deadline for submission or for finishing the ratings. This is similar to the method used by Oreskes 2004, which Tol refers to as one of the “excellent surveys of the relevant literature.”
You would also expect raters to become more proficient in this type of situation. The author of the book that Tol cited to make his fatigue point confirmed that this is the case for this setup. If it weren’t, there wouldn’t be a 97% consensus from the abstract ratings and then also a 97% consensus from the authors rating their own papers.
However, one of the biggest points made by Tol is that the 97% consensus is actually a 91% consensus. This is based on a misapplication of Bayesian statistics, which should be part of Tol’s expertise as an econometrician. He assumes that 6.7% of all abstracts were incorrectly rated, no matter the category they’re in, and then applies this to the entire dataset. However, the same data he used for this error rate tells us that the error rate differs per category. When this error correction is applied correctly it actually reaffirms the consensus found in Cook et al.
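To see why a single flat error rate misleads, here is a toy sketch using a misclassification matrix. Every number in it is made up for illustration: these are not the Cook et al. counts, not the real per-category error rates, and not a reproduction of Tol’s calculation or of the correction done by Cook’s team. The only figure taken from the text above is the flat 6.7% rate. The sketch shows the general mechanism: when a flat rate is applied to every category, the very large categories “leak” abstracts into the tiny rejection category.

```python
def redistribute(observed, matrix):
    """Move observed counts between categories.

    matrix[i][j] is the assumed fraction of abstracts rated as category i
    that really belong in category j (each row sums to 1).
    """
    n = len(observed)
    return [sum(observed[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def consensus(counts):
    """Percentage endorsing among abstracts that take a position."""
    endorse, _no_position, reject = counts
    return 100.0 * endorse / (endorse + reject)


# Hypothetical observed ratings: [endorse, no position, reject].
observed = [3900, 8000, 100]

# Flat assumption: 6.7% of every category is wrong,
# split evenly over the two other categories.
flat = [
    [0.933, 0.0335, 0.0335],
    [0.0335, 0.933, 0.0335],
    [0.0335, 0.0335, 0.933],
]

# Hypothetical per-category rates: errors mostly move abstracts between
# "endorse" and "no position", and almost never into "reject".
per_category = [
    [0.95, 0.05, 0.0],
    [0.08, 0.919, 0.001],
    [0.0, 0.10, 0.90],
]

flat_counts = redistribute(observed, flat)
per_category_counts = redistribute(observed, per_category)

print(f"observed:     {consensus(observed):.1f}%")
print(f"flat rate:    {consensus(flat_counts):.1f}%")
print(f"per category: {consensus(per_category_counts):.1f}%")
```

With these invented inputs the flat rate turns 100 rejection abstracts into almost 500 and drags the consensus down to roughly 89%, while the per-category version stays at roughly 98%. The exact figures mean nothing; the direction of the distortion is the point.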
In other words if you account for what the raters actually did and then correct for this the consensus is just as strong. Tol’s incorrect method basically creates almost 300 papers rejecting the consensus from thin air. One hell of a mistake to make for an econometrician who has access to the information that he needs to calculate the different error rates. Especially when this misapplication of Bayesian statistics is already present in rejected versions of Tol’s paper. One of the reviewers had the following to say (bolding mine, archived here):
In large part, these figures suggest mostly that it was human beings doing the research and thus humans can get fatigued in such a large and daunting research project. This can lead to lower quality data, but rarely are data perfectly transcribed and meet all assumptions, especially when humans are categorizing and qualitatively assessing literature, and there’s no indication that this biased the findings. In fact, all indications, such as the self-rating by paper’s authors, suggest that the analysis was robust and the findings not influenced by fatigue or low data quality.
Another comment from a reviewer says the following about Tol’s accusations of hiding data (Tol mentions this in my earlier quotes about stolen materials, and the accusation is still present in his paper):
This section is not supported by the data presented and is also not professional and appropriate for a peer-reviewed publication. Furthermore, aspersions of secrecy and holding data back seem largely unjustified, as a quick google search reveals that much of the data is available online (http://www.skepticalscience.com/tcp.php?t=home), including interactive ways to replicate their research. This is far more open and transparent than the vast majority of scientific papers published. In fact, given how much of the paper’s findings were replicated and checked in the analyses here, I would say the author has no grounds to cast aspersions of data-hiding and secrecy.
But the most damning is the response from Environmental Research Letters in their rejection letter:
I do not see that the submission has identified any clear errors in the Cook et al. paper that would call its conclusions into question – in fact he agrees that the consensus documented by Cook et al. exists. [...]
Yes, you read that right. Tol actually agrees with the consensus, and I quote from the paper that got accepted by Energy Policy:
There is no doubt in my mind that the literature on climate change overwhelmingly supports the hypothesis that climate change is caused by humans. I have very little reason to doubt that the consensus is indeed correct
He has even stated in an earlier version of this paper that “It does not matter whether the exact number is 90% or 99.9%.” In the version of the paper I just cited he also said that “The claim that 97% of the scientific literature endorses anthropogenic climate change (Cook et al., 2013, Environmental Research Letters) does not stand.”
So to which Tol should I listen? The one that says that the scientific consensus on global warming is real, or the Tol that says it wasn’t found? The latter is based on errors and unsubstantiated accusations. Which leaves the question why he is doing this, and in this way. Fortunately he has explained why he is doing it this way:
I have three choices:
a. shut up
b. destructive comment
c. constructive comment
a. is wrong
c. is not an option. I don’t have the resources to redo what they did, and I think it is silly to search a large number of papers that are off-topic; there are a number of excellent surveys of the relevant literature already, so there is no point in me replicating that.
that leaves b
Even for a “destructive comment” the quality is appalling, which is why this paper got rejected multiple times with scathing remarks from reviewers. Probably the only reason he managed to get this error-riddled paper into the scientific literature is that the journal that accepted it, Energy Policy, normally addresses “the policy implications of energy supply and use from their economic, social, planning and environmental aspects.” That could simply mean they lacked the expertise to pick up on the mistakes Tol was making. However, it does baffle me why they allowed dubious sources and unsupported claims to slip past them.
This paper is just a slightly more formal version of an unfounded attack on the Cook et al. paper. From the very start he has attacked it with the same baseless accusations that are present in this paper. He has made them on his blog, on Twitter, and in the comment sections of other blogs; he even showed up here on Real Sceptic to repeat them. They were even repeated during a Committee Hearing where he said:
The 97% estimate is bandied about by basically everybody. I had a close look at what this study really did, and as far as I know, as far as I can see, this estimate just crumbles when you touch it. None of the statements in the paper are supported by any data that is actually in the paper. So unfortunately – I mean it’s pretty clear that most of the science agrees that climate change is real and most likely human-made, but this 97% is essentially pulled from thin air. It’s not based on any credible research whatsoever.
But it’s not pulled from thin air, the data and method it’s based on are robust. And it was confirmed by the author self-ratings that gave the same 97% consensus.
Tol also instantly dismissed one of the first articles that used the reviewer comments to point out the flaws in his paper, claiming that no errors were mentioned in that article. He has even gone as far as stating on Twitter that his original paper was unfairly rejected by Environmental Research Letters and that he had addressed all points of criticism from their reviewers. But by now it should be obvious that most of these points aren’t addressed at all in his published version. To me it looks like Tol cannot be critical of himself when he thinks he’s right, which leads to him rejecting valid criticism.
A good example of his tendency to reject criticism is what happened with his paper ‘The Economic Effects of Climate Change‘, which he had to correct. Tol attributed the mistakes that led to this correction to “gremlins,” but that’s an understatement. The errors were so significant that the results changed from showing economic benefits from warming below about 2°C to showing that impacts are always negative after the correction. Yet Tol has said that “Although the numbers have changed, the conclusions have not. The difference between the new and old results is not statistically significant. There is no qualitative change either.”
There’s no shame in being confused—statistics is hard. But if your goal is to do science, you really have to move beyond this sort of defensiveness and reluctance to learn [...] I’m sure you can go the rest of your career in this manner, but please take a moment to reflect. You’re far from retirement. Do you really want to spend two more decades doing substandard work, just because you can? You have an impressive knack for working on important problems and getting things published. Lots of researchers go through their entire careers without these valuable attributes. It’s not too late to get a bit more serious with the work itself.
At one point Tol said to Gelman “You’re a sound statistician. You got all the data. Get to work. Show me how it’s done.” That is exactly what the team behind Cook et al. have said all this time. Tol has the data to replicate their research and to check their results, yet he didn’t do this. He only managed to produce a paper that Environmental Research Letters described in their rejection letter as “[reading] more like a blog post than a scientific comment.” The now published version of this paper isn’t much better; it still has a small army of gremlins in it. John Cook and his team have already identified 24 errors in Tol’s paper, 11 of which had already been identified by the reviewers of Environmental Research Letters.
Fortunately science is self-correcting and this paper will fade from the literature in due course, hopefully sooner rather than later. That won’t prevent the usual suspects from spreading misinformation based on this flawed paper. But if this was the best Tol could do, “shutting up” would have been his best option in this case, especially considering what Tol said would be the consequences of publishing a flawed paper:
If I submit a comment that argues that the Cook data are inconsistent and invalid, even though they are not, my reputation is in tatters.