**The Climate Model Muddle**

Guest post by Ed Zuiderwijk

This is a posting about the epistemology of climate models: about what we can learn from them about the future. The answer will disappoint: not much. In order to convince you of that proposition I will first tell you a little story, an allegory if you want, regarding a thought experiment, a completely fictitious account of what a research project might look like, and then apply whatever insight we gained (if any) to the climate modelling scene.

**A thought experiment**

Here’s the thought experiment: we want to make a compound that produces colour somehow (the mechanism by which it does that is not really relevant). However, we specifically want a well-defined colour, prescribed by whatever application it is going to be used for. Say a shade of turquoise.

Now, our geologist and chemist colleagues have proposed some minerals and compounds that could be candidate materials for our colourful enterprise. Unfortunately there is no information whatsoever about what colours these substances produce. This circumstance is compounded by the fact that the minerals are exceedingly rare and therefore extremely expensive, while synthetic ones are really difficult to make and therefore even more pricey. So, how do we proceed; how do we find the best compounds to try? Getting a sample of each of the many compounds and testing each of them for the colour it produces is out of the question. Therefore, what we do is model the physics of the colour-producing process for each of the proposed compounds in order to find those which render turquoise, if there are any. This sounds straightforward enough, but it isn’t, because there are several different codes available (five in total) that purport to do such a simulation, each with its own underlying assumptions and idiosyncrasies. We run these codes for the proposed compounds and find that, unfortunately, the colours they predict are inconsistent for individual compounds and generally all over the place.

For instance, take the compound Novelium1. The predicted colours range from yellow-green to deep violet, with a few in between like green, blue or ultramarine: a factor 1.3 range in frequency; similar for the other candidates. In this situation the only way forward is doing an experiment. So we dig deep into the budget, get a sample of Novelium1, and see what colour it actually produces. It turns out to be orange-red, which is pretty disappointing. We are back where we started. And because of our budgetary limitations we are at the point of giving up.

May we here introduce a member of our team. Let’s call him Mike. Mike is a bit pushy, because he fully realises that were we to succeed in our aim it would get us some prestigious Prize or other, something he is rather keen on. He proposes the following: we take the model that predicted the colour closest to the actual one, that is, the model that gave us yellow-green, and tweak its parameters such that it predicts orange-red instead. This is not too difficult to do, and after a few days of jockeying on a keyboard he comes up with a tweaked model that produces the observed colour. Jubilation all around, except for one or two more sceptical team members who insist that the new model must be validated by having it correctly predict the colour of compound Novelium2. With that Prize riding on it this clearly is a must, so we scrape the bottom of the budget barrel and repeat the exercise for Novelium2. The tweaked model predicts yellow. The experiment gives orange.

We give up.

**What does it mean?**

Can we learn something useful from this story? In order to find out we have to answer three questions:

First, what do we know after the first phase of the project, the modelling exercise, before doing the experiment? Lamentably the answer is: nothing useful. With 5 different outcomes we only know for certain that *at least 4 of the models are wrong*, but not which ones. In fact, even if the colour we want (turquoise) shows up, we still know nothing, because how can one be certain that the code producing it gives the ‘correct result’, given the outcomes of the *a priori* equally valid other models? You can’t. If a model gave us turquoise it could just be a happy coincidence while the model itself is still flawed. The very fact that the models produce widely different outcomes therefore tells us that *most probably all models are wrong*. In fact, it is even worse: *we can’t even be sure that the true colour produced by Novelium1 is inside the range* yellow-green to violet, even if there were a model that produces the colour we want. In the addendum I give a simple probability-based analysis to support this and subsequent points.

Second, what do we know after the unexpected outcome of the actual experiment? We only know for certain that all models are wrong (and that it is not the compound we are looking for).

Third, why did Mike’s little trick fail so miserably? What happened there? The parameter setting of the original un-tweaked model encapsulates the best understanding by its makers (albeit incomplete, but that’s not really relevant) of the physics underpinning it. By modifying those parameters that understanding is diluted, and if the ‘tweaking’ goes far enough it disappears completely, like the Cheshire Cat disappears the more you look at it. Tweaking such a model in hindsight to fit observations is therefore tantamount to *giving up the claim that you understand the relevant physics* underlying the model. Any pretence of truly understanding the subject goes out of the window, and with it goes any predictive power the original model might have had. Your model has just become another very complex function fitted to a data set. As the mathematician and physicist John von Neumann once famously said of such practice: ‘with four parameters I can fit an elephant, and with five I can make him wiggle his trunk’. The tweaked model is most likely a new incorrect model that coincidentally produced a match with the data.

**An application to climate models**

Armed with the insights gleaned from the foregoing cautionary tale, we are now in a position to make some fundamental statements about IPCC climate models, for instance the group of 31 models that form the CMIP6 ensemble (Eyring *et al.*, 2019; Zelinka *et al.*, 2020). The quantity of interest is the Equilibrium Climate Sensitivity (ECS), the expected long-term warming after a doubling of the atmospheric CO2 concentration. The predicted ECS values in the ensemble span a range from 1.8C at the low end to 5.6C at the high end, a whopping factor 3 in range, more or less uniformly occupied by the 31 models. Nature, however, may be cunning, even devious, but it is not malicious. There is only one ‘true’ ECS value that corresponds to the doubling of the CO2 concentration in the real world.

Can we make any statement about this ensemble? Only these two observations:

First, most probably all those models are incorrect. This conclusion follows logically from the fact that there are many *a priori* equally valid models which cannot be simultaneously correct. At most one of these models can be correct, but given the 30 remaining incorrect models the odds are against any model at all being correct. In fact it can be shown that the probability that none of the models is correct can be as high as 0.6.

Second, we cannot even be sure that the true ECS is in the range of ECS values covered by the models. The probability of that being the case is 1.0-0.6=0.4, which means that the odds that the true ECS is in the range covered by the models are roughly 2 to 3 (and thus odds-on that the true ECS is outside the range). The often-made assumption that the ‘true’ ECS value must be somewhere in the range of outcomes from the models in the ensemble is based on a logical fallacy. We have absolutely no idea where the ‘true’ model – number 32, the ‘experiment’ – would land, inside or outside the range.
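The 0.6 figure can be checked numerically. Here is a minimal Python sketch of the formula F = (1-0.5/N)^N derived in the addendum; the chosen values of N are illustrative:

```python
# Probability that none of N a-priori equally valid models is correct,
# assuming at most one of them can be right (derivation in the addendum).
def prob_all_wrong(n: int) -> float:
    return (1.0 - 0.5 / n) ** n

# N = 31: the full CMIP6 ensemble; N = 3 to 7: plausible 'effective
# numbers' of independent models. All give roughly 0.6.
for n in (3, 5, 7, 31):
    print(n, round(prob_all_wrong(n), 3))
```

Note that F barely depends on N once N exceeds 3 or so, which is why the conclusion is insensitive to exactly how many of the 31 models one counts as truly independent.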

There are some qualifications to be made. What, for instance, does it mean that a model is ‘incorrect’? It means that it could be incomplete (there are concepts or principles missing that should be there) or, conversely, over-complete (containing things that should not be there), or that there are aspects of it which are just wrong or wrongly coded, or all of those. Further, because many models of the ensemble have similar or even identical elements, one might argue that the results of the ensemble models are not independent but correlated. That means that one should consider the ‘effective number’ N of independent models. If N = 1, all models would be essentially identical, with the range 1.8C to 5.6C an indication of the intrinsic error (which would be a pretty poor show). More likely N is somewhere in the range from 3 to 7 – with an intrinsic spread of, say, 0.5C for an individual model – and we are back at the hypothetical example above.

The odds of about 3 to 2 that none of the models is correct ought to be interesting politically speaking. Would you gamble a lot of your hard-earned cash on a horse with those odds? Is it wise to bet your country’s energy provision and therefore its whole economy on such odds?

**Hindcasting**

An anonymous reviewer of one of my earlier writings provided this candid comment, and I quote:

*‘The track record of the GCM’s has been disappointing in that they were unable to predict the observed temperature hiatus after 2000 and also have failed to predict that tropopause temperatures have not increased over the past 30 years. The failure of the GCM’s is not due to malfeasance but modelling the Earth’s climate is very challenging.’*

The true scientist knows that climate models are very much a work in progress. The pseudo-scientist, under pressure to make the ‘predictions’ stick, has to come up with a way to ‘reconcile’ the models with the real-world temperature data.

One way of doing so is to massage the temperature data in a process called ‘homogenisation’ (*e.g.* Karl *et al.*, 2015). Miraculously the ‘hiatus’ disappears. A curious aspect of such ‘homogenisation’ is that whenever it is applied, the ‘adjusted’ past temperatures are always lower, thus making the purported ‘man-made warming’ larger. Never the other way around. Obviously, you can do this sleight of hand only once, perhaps twice if nobody is watching. But after that even the village idiot will understand that he has been had and will put ‘homogenisation’ in the same dustbin of history as Lysenko’s ‘vernalisation’.

The other way is to tweak the model parameters to fit the observations (*e.g.* Hausfather *et al.*, 2019). Not surprisingly, given the many adjustable parameters, and keeping in mind von Neumann’s quip, such hindcasting can make the models match the data quite well. Jubilation all around in the sycophantic mainstream press, with sometimes hilarious results. For instance, a correspondent for a Dutch national newspaper enthusiastically proclaimed that the models had correctly predicted the temperatures of the last 50 years. This truly would be a remarkable feat, because the earliest software that can be considered a ‘climate model’ dates from the early 1980s. However, a more interesting question is: can we expect such a tweaked model to have predictive power, in particular regarding the future? The answer is a resounding ‘no’.

**Are climate models useless?**

Of course not. They can be very useful as tools for exploring those aspects of atmospheric physics and the climate system that are not understood, or even of which the existence is not yet known. What you can’t use them for is making predictions.

**References:**

Eyring V. *et al.*, Nature Climate Change, **9**, 727 (2019)

Zelinka M. *et al.*, Geophysical Research Letters, **47** (2020)

Karl T.R., Arguez A. *et al.*, Science, **348**, 1469 (2015)

Hausfather Z., Drake H.F. *et al.*, Geophysical Research Letters, **46** (2019)

**Addendum: an analysis of probabilities**

First the case of 5 models of which at most 1 can possibly be right. What is the probability that none of the models is correct? All models are *a priori* equally valid. We know that 4 of the models are not correct, so we know at once that the probability of any model being incorrect is at least 0.8. The remaining model may or may not be correct, and in the absence of any further information both possibilities could be equally likely. Thus the expectation is that, in a manner of speaking, half a model (of 5) is correct, which means the *a priori* probability of any model being incorrect is 1.0-0.5/5 = 0.9; for N models it is 1.0-0.5/N. The probability that all models fail then becomes F=(1-0.5/N)^N, which is about 0.6 (for N > 3). This gives us odds of 3 to 2 that none of the models is correct: it is more likely that none of the models is correct than that one of them is. (If we had instead taken F=(1-1/N)^N, the numbers are about 0.34, with odds of 1 to 2.)
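The arithmetic above can be verified with a few lines of Python; this sketch shows both variants of the prior for the N = 5 case:

```python
# N = 5 models, at most one of which can be correct.
N = 5

# Prior that any given model is incorrect: 1 - 0.5/N = 0.9
# ('half a model' of the 5 is expected to be correct).
F = (1.0 - 0.5 / N) ** N       # probability that all 5 models fail
print(round(F, 2))             # ~0.59, odds of roughly 3 to 2

# Alternative prior: exactly one full model is assumed correct.
F_alt = (1.0 - 1.0 / N) ** N   # probability that all 5 models fail
print(round(F_alt, 2))         # ~0.33, odds of roughly 1 to 2
```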

Now an altogether different question. Suppose one of the models does give us the correct experimental result: what is the *a posteriori* probability that this model is indeed correct, given the results of the other models? Or, alternatively, that the model is incorrect even though it gives the ‘right’ result (by coincidence)? This posterior probability can be calculated using Bayes’ theorem,

P(X|Y) = P(Y|X)*P(X)/P(Y),

where P(X|Y) stands for the probability of X given Y and P(X) and P(Y) are prior probabilities for X and Y. In this case, X stands for ‘the model is incorrect’ and Y for ‘the result is correct’, in abbreviated form M=false, R=true. So the theorem tells us:

P(M=false|R=true) = P(R=true|M=false) * P(M=false) / P(R=true)

On the right-hand side the first term denotes the false-positive rate of the models, the second term is the probability that the model is incorrect and the third is the average probability that the result predicted is accurate. Of these we already know P(M=false)=0.9 (for 5 models). In order to get a handle on the other two, the ‘priors’, consider this results table:

The ‘rate’ columns represent a number of possible ensembles of models differing in the badness of their incorrect models. The first lot still give relatively accurate results (incorrect models that often, but not always, return approximately the correct result; pretty unrealistic). The last contains seriously poor models which only on occasion give correct results (by happy coincidence), with a number of cases in between. Obviously, if a model is correct there is no false-negative (TF) rate. The false-positive rate is given by P(R=true|M=false) = FT. The average true result expected is given by 0.1*TT + 0.9*FT = 0.82 for the first group, 0.55 for the second, and so on.

With these priors, Bayes’ theorem gives these posterior probabilities that the model is incorrect even if the result is right: 0.87, 0.82, etc. Even for seriously poor models with only a 5% false-positive rate (the fifth set) the odds that a correct result came from an incorrect model are still 1 to 2. Only if the false-positive rate (of the incorrect models) drops dramatically (last column) can we conclude that a model producing the experimental result is likely to be correct. This circumstance is purely due to the presence of the incorrect models in the ensemble. Such examples show that in an ensemble with many invalid models the posterior likelihood of the correctness of a possibly correct model can be substantially diluted.
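These posterior numbers can be reproduced directly from Bayes’ theorem. A minimal Python sketch; the false-positive rates FT used here are illustrative values chosen to be consistent with the averages quoted above, and TT = 1 is assumed (a correct model always returns the right result):

```python
# Posterior probability that a model is incorrect even though it
# returned the observed ('right') result, for an ensemble of 5 models
# with prior P(M=false) = 0.9. TT = 1 is an assumption.
def posterior_incorrect(ft: float, p_incorrect: float = 0.9) -> float:
    # P(R=true) = P(R=true|M=true)*P(M=true) + P(R=true|M=false)*P(M=false)
    p_true = (1.0 - p_incorrect) * 1.0 + p_incorrect * ft
    # Bayes: P(M=false|R=true) = P(R=true|M=false)*P(M=false) / P(R=true)
    return p_incorrect * ft / p_true

# FT = 0.8 and 0.5 reproduce the quoted averages 0.82 and 0.55;
# FT = 0.05 is the 'seriously poor models' case.
for ft in (0.8, 0.5, 0.05):
    print(ft, round(posterior_incorrect(ft), 2))
```

Even in the FT = 0.05 case the posterior probability that the matching model is nevertheless incorrect is about 0.31, hence the quoted odds of roughly 1 to 2.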

——–