Assessing climate model software quality

by Anonymous

[O]ur notion of software quality with respect to climate models is theoretically and conceptually vague. It is not clear to us what differentiates high from low quality software; nor is it clear which aspects of the models or modelling processes we might reliably look to to make that assessment.

Assessing climate model software quality: a defect density analysis of three models

J. Pipitone and S. Easterbrook

Abstract. A climate model is an executable theory of the climate; the model encapsulates climatological theories in software so that they can be simulated and their implications investigated. Thus, in order to trust a climate model one must trust that the software it is built from is built correctly. Our study explores the nature of software quality in the context of climate modelling. We performed an analysis of defect reports and defect fixes in several versions of leading global climate models by collecting defect data from bug tracking systems and version control repository comments. We found that the climate models all have very low defect densities compared to well-known, similarly sized open-source projects. We discuss the implications of our findings for the assessment of climate model software trustworthiness.

Citation: Assessing climate model software quality: a defect density analysis of three models, by J. Pipitone and S. Easterbrook, Geosci. Model Dev. Discuss., 5, 347-382, 2012, doi:10.5194/gmdd-5-347-2012

JC comment: This journal is an online Discussion Journal (a format that I am a big fan of). To date, one Interactive Comment has been posted (oops just spotted another one here). Both of these reviews are favorable. I received the following review via email, from someone who wishes to remain anonymous (this review is not favorable). This person intends to submit an online review to the Discussion Journal, and will peruse the comments here to help sharpen the review before it is posted. Note: Easterbrook has a blog post on the paper [here], with a good comment by Nick Barnes.

Review by Anonymous of Pipitone and Easterbrook’s paper 


The following issues in the paper by Pipitone and Easterbrook are addressed in this Comment:

  1. lack of uniqueness of Global Climate (Circulation) Models (GCMs) software development procedures and processes
  2. effects of the Accelerated Strategic Computing Initiative (ASCI) project on the development of modern verification and validation methodologies
  3. lack of precision for simple defect counting as an indicator of software quality
  4. lack of consideration of the fitness for production applications for the software in the decision-support domain.

Scientific Software Development

Global Climate (Circulation) Models (GCMs) are not unique in terms of the large numbers of domains of deep expertise required, complexity of phenomena, large numbers of system response functions of interest, or in any other regard.

All scientific and engineering software associated with complex real-world applications requires close attention by experts who are knowledgeable about the physical domain, and these experts must be deeply involved with the software-development processes. The earth’s climate, and other natural phenomena and processes, are not exceptions in this regard. Some engineered equipment systems and associated physical phenomena and processes are easily as inherently complex as the earth’s climate systems. Models and software for such systems also require deep knowledge of a variety of different phenomena and processes that are important to the modeling, and the coupling of these to the equipment and between equipment components. For the same reasons, the users of the software are also required to have deep experience and expertise in the domain of the physical problem, i.e. the application arena.

As mentioned in the paper by Pipitone and Easterbrook, the developers of the models, methods, and software for these complex systems use the development process to learn about both the physical domain and the software domain. Tight iterative interactions between physical-domain, mathematical-domain, and software-domain experts and the software are the rule. GCMs are not unique or exceptional in this regard.

The Problem of Software Quality in Scientific Software

Easterbrook and Johns (2009) have presented a review of a few of the techniques that are used by some GCM model-development laboratories during development of GCM software. All the techniques described by the authors are standard operating procedures used during construction of research versions of all scientific and engineering software prior to release of code for production applications.

The activities described by Easterbrook and Johns are sometimes called developmental assessments: they are utilized by developers during the development process. The techniques described by the authors, however, are not sufficient to determine that the models, methods, and software have been correctly constructed, or that they are fit for production-grade applications. In particular, the developmental assessment activities described by the authors determine neither the order of convergence of the numerical methods nor the fidelity of the calculated results to all system response functions of interest.
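As an illustration of what determining the order of convergence involves, here is a minimal sketch, not taken from the paper or from any GCM. It assumes a stand-in numerical method (the composite trapezoidal rule, whose theoretical order is 2) and estimates the observed order from results on three systematically refined grids. The function names are hypothetical.

```python
import math


def trapezoid(f, a, b, n):
    """Composite trapezoidal rule with n equal intervals (stand-in numerical method)."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


def observed_order(coarse, medium, fine, ratio):
    """Observed order of convergence from three results on grids refined by a constant ratio."""
    return math.log(abs(coarse - medium) / abs(medium - fine)) / math.log(ratio)


# Integrate sin(x) on [0, pi] on three grids, each twice as fine as the previous one.
coarse = trapezoid(math.sin, 0.0, math.pi, 16)
medium = trapezoid(math.sin, 0.0, math.pi, 32)
fine = trapezoid(math.sin, 0.0, math.pi, 64)

p = observed_order(coarse, medium, fine, 2.0)
print(f"observed order of convergence: {p:.3f} (theoretical order: 2)")
```

An observed order that matches the theoretical order of the method is evidence that the method is coded as intended; a mismatch points to a coding mistake or to a solution that is not yet in the asymptotic range.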

Stevenson (1999), and the first two references cited in Stevenson, Gustafson (1998) and Larzelere (1998), were among the first papers to consider the ramifications of the Accelerated Strategic Computing Initiative (ASCI) of the Stockpile Stewardship Program relative to verification and validation of models, methods, and computer software. That project has the objective of replacing full-scale test experimentation with computing and smaller-scale experiments. The papers considered this objective to be among the more challenging ever undertaken for modeling complex physical phenomena and processes. These three papers rightly questioned the depth of verification and validation of the models, methods, and software that would be needed to ensure success of the Initiative. The authors of these papers were not optimistic that the immense verification and validation challenges presented by ASCI could be successfully addressed.

The objectives were associated with software development within the National Laboratory system, for which such verification and validation procedures and processes were not formally in place. However, all the laboratories involved in the ASCI project readily accepted the challenges and have made important contributions to development and successful application of modern verification and validation methodologies.

The developments in verification and validation initiated by Patrick Roache, with significant additional developments by William Oberkampf and his colleagues at Sandia National Laboratories, later joined by other National Laboratories, and with contributions by industrial and academic personnel and professional engineering societies, have answered all the concerns raised by those first papers by Stevenson and others.

Verification and Validation Methodologies Developed from ASCI

The modern methodologies for verification and validation of mathematical models, numerical methods, and associated computer software are far superior to counting defect density as an assessment of software quality. The books by Patrick Roache (1998, 2009) and Oberkampf and Roy (2010) have documented the evolution of the methodologies. Oberkampf and his colleagues at Sandia National Laboratories have additionally produced a large number of Laboratory Technical Reports, notably Oberkampf, Trucano, and Hirsch (2003). The methodologies have been successfully applied to a wide variety of scientific and engineering software, and they have been adopted by several scientific and engineering professional societies as requirements for publication in peer-reviewed journals. A Google, Google Scholar, or peer-reviewed journal search will lead to an enormous number of hits.

The book by Knupp and Salari (2002) on the Method of Manufactured Solutions (MMS), a methodology first introduced by Roache, demonstrates a powerful method for quantification of the fidelity of numerical methods to the associated theoretical performance, and for locating coding mistakes. Again, searches of the literature will lead to large numbers of useful reports and papers from a variety of application areas. The MMS is the gold standard for verification of numerical solution methods.
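A minimal sketch of the MMS idea follows. It is an illustration under stated assumptions, not an example from Knupp and Salari or from any climate code: the model problem (-u'' = f on [0, 1]), the manufactured solution, and all function names are chosen here for brevity. A solution is manufactured first, the source term and boundary values are derived from it, and the code is then checked for the theoretical second-order convergence of central differences.

```python
import math


def manufactured_u(x):
    """Manufactured solution, chosen freely (an assumption of this sketch)."""
    return math.sin(math.pi * x) + x * x


def manufactured_source(x):
    """Source term f = -u'' derived analytically from the manufactured solution."""
    return math.pi ** 2 * math.sin(math.pi * x) - 2.0


def max_error(n):
    """Solve -u'' = f on [0, 1] with n intervals of central differences.

    Returns the maximum absolute error against the manufactured solution.
    """
    h = 1.0 / n
    xs = [i * h for i in range(n + 1)]
    m = n - 1  # number of interior unknowns
    # Tridiagonal system: -u[i-1] + 2*u[i] - u[i+1] = h^2 * f[i]
    diag = [2.0] * m
    rhs = [h * h * manufactured_source(xs[i + 1]) for i in range(m)]
    rhs[0] += manufactured_u(0.0)   # boundary value taken from the manufactured solution
    rhs[-1] += manufactured_u(1.0)
    # Thomas algorithm; both off-diagonals are -1.
    for i in range(1, m):
        w = -1.0 / diag[i - 1]
        diag[i] += w
        rhs[i] -= w * rhs[i - 1]
    u = [0.0] * m
    u[-1] = rhs[-1] / diag[-1]
    for i in range(m - 2, -1, -1):
        u[i] = (rhs[i] + u[i + 1]) / diag[i]
    return max(abs(u[i] - manufactured_u(xs[i + 1])) for i in range(m))


errors = {n: max_error(n) for n in (20, 40, 80)}
order = math.log(errors[40] / errors[80]) / math.log(2.0)
for n, err in errors.items():
    print(f"n = {n:3d}  max error = {err:.3e}")
print(f"observed order: {order:.3f} (theoretical order: 2)")
```

Because the exact solution is known by construction, any coding mistake that degrades the discretization shows up as an observed order below the theoretical one.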

Counting Defects is Defective

The data on which the paper by Pipitone and Easterbrook is based have been gathered and presented by Pipitone (2010). That thesis, and the Discussion Paper itself, address the less-than-ideal characteristics of defect counting relative to determination of software quality. The book by Oberkampf and Roy (2010) devotes a single long paragraph to defect counting. The raw data given in the thesis show that very large numbers of defects, in absolute numbers, are present in the GCMs that were reviewed.

Defect counting does not lead to useful contributions in three of the most important attributes of software quality, as that phrase is used today: Verification and Validation (V&V) and uncertainty quantification (UQ). In modern software development activities, verification is a mathematical problem and validation is a physical-domain problem, including design of validation tests.

Defect counting would be more useful if the data were presented as a function of time after release of the software for production applications, so as to allow a measure of steady improvement and a close classification as to the type of defect. The counting also would be more useful if it were associated with only the post-release versions of the software, and not with the developmental assessment phase. The number of different system response functions covered by the user community is also of interest: a very rough approach to model / code coverage. In general, different system-response functions will be a rough proxy for focus on different important parts of the mathematical models.
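The presentation suggested above can be sketched as follows. This is a hypothetical illustration only: the record layout, the dates, and the function names are invented here and do not come from the paper, the thesis, or any bug tracker. Reports filed before the release date are treated as developmental assessment and excluded; the rest are binned by calendar months since release.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DefectReport:
    """Hypothetical defect record: the date it was reported and a short summary."""
    reported: date
    summary: str


def months_since(release, day):
    """Whole calendar months between the release date and a later date."""
    return (day.year - release.year) * 12 + (day.month - release.month)


def post_release_histogram(reports, release):
    """Count defects per month after release, ignoring pre-release reports."""
    counts = Counter()
    for report in reports:
        if report.reported >= release:
            counts[months_since(release, report.reported)] += 1
    return dict(sorted(counts.items()))


# Invented example data.
release = date(2011, 1, 15)
reports = [
    DefectReport(date(2010, 11, 3), "found during developmental assessment"),
    DefectReport(date(2011, 1, 20), "restart file not read"),
    DefectReport(date(2011, 2, 2), "wrong units in diagnostic output"),
    DefectReport(date(2011, 2, 27), "array bound exceeded at high resolution"),
    DefectReport(date(2011, 4, 9), "documentation unclear on input format"),
]

histogram = post_release_histogram(reports, release)
print(histogram)  # {0: 1, 1: 2, 3: 1}
```

A count that declines over successive months after release would be one rough indication of the steady improvement mentioned above.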

Defect counting would be much more useful if the type of defect were also recorded. There are four possible classes of defects, as follows: (1) user error, (2) coding mistake, (3) model or method limitation, and (4) model or method deficiency. Of these, only the second counts coding mistakes. The first, user error, might be an indication that improvement in the code documentation is required, for the theoretical basis of the models and methods and/or the application procedures and/or understanding of the basic nature of calculated system responses. The third, model or method limitation, means that some degree of representation is present in the code but a user has identified a limitation that was not anticipated by the development team. The fourth means that a user has identified a new application and/or response function which the original development did not anticipate. These generally require significant planning for model, method and software modifications relative to correcting the deficiency.
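To show how much the classification can change a defect density figure, here is a small sketch with invented numbers. The enum mirrors the four classes listed above; the counts and the code size are hypothetical and are not data from the paper or the thesis.

```python
from collections import Counter
from enum import Enum


class DefectClass(Enum):
    """The four defect classes named in the text above."""
    USER_ERROR = 1
    CODING_MISTAKE = 2
    MODEL_OR_METHOD_LIMITATION = 3
    MODEL_OR_METHOD_DEFICIENCY = 4


def density_per_kloc(count, lines_of_code):
    """Defects per thousand lines of code."""
    return 1000.0 * count / lines_of_code


# Invented example: ten classified reports against a 400,000-line code base.
classified = (
    [DefectClass.USER_ERROR] * 3
    + [DefectClass.CODING_MISTAKE] * 2
    + [DefectClass.MODEL_OR_METHOD_LIMITATION] * 4
    + [DefectClass.MODEL_OR_METHOD_DEFICIENCY] * 1
)
lines_of_code = 400_000

tally = Counter(classified)
raw_density = density_per_kloc(len(classified), lines_of_code)
coding_density = density_per_kloc(tally[DefectClass.CODING_MISTAKE], lines_of_code)

for defect_class in DefectClass:
    print(f"{defect_class.name}: {tally[defect_class]}")
print(f"all reports:          {raw_density:.3f} per KLOC")
print(f"coding mistakes only: {coding_density:.3f} per KLOC")
```

With these invented numbers, the unclassified figure is five times the coding-mistake figure, which is why a single undifferentiated density says little about either the code or the models.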

Items (3) and (4) might need a little more clarification. A model or method limitation can be illustrated by the case of a turbulent flow for which the personnel who developed the original model specified the constants in the turbulence-closure model to be those that correspond to parallel shear flows, and a user has attempted to compare model results with experimental data for which re-circulation is important, or with data for a flow in helically-coiled square flow channels. Item (4) can also be illustrated by a turbulent flow. Consider a case in which the developers used a numerical solution method that is valid only for parabolic / marching physical situations and a user has encountered an elliptic-type flow.

The paper by Pipitone and Easterbrook has provided little information about the kinds of defects that were encountered.

Fitness for Production Applications

The paper by Pipitone and Easterbrook does not address the fitness-for-duty aspects of the GCMs; instead the paper focused on the model-as-learning-tool aspects. As noted in this Comment, those aspects are common to all model, method, and software development whenever complexity is an important component – complexity in either or both of the physical and software domains.

The objectives of model and software development are production of tools and application procedures for predictions having sufficient fidelity in the real-world arena. The foundation for sufficient fidelity is validation of these by comparisons with measured data from the application areas. All system-response functions are required to be tested by validation. The Review Paper by Easterbrook does not address any aspects of validation as this concept is defined in the reports and papers that have defined the concept.

Validation is required of all models, methods, software, application procedures, and users whose results will form the basis of public-policy decisions. Validation must follow after verification. Generally, verification and validation for these tools and procedures are conducted by personnel independent of the team that develops the tools and procedures.

Counting defects, especially those that are encountered during developmental assessment, has nothing to offer in this regard.


The paper presents a very weak argument for the quality of GCM software. The widely accepted and successful modern verification and validation methodologies, which are used in a variety of scientific and engineering software projects, are not even mentioned in the paper. More importantly, the fitness of the GCMs for applications that affect public-policy decisions is also not mentioned. Simple defect counting cannot lead to information relative to validation and application to public-policy decisions.


Easterbrook, Steve M. and Johns, Timothy C., Engineering the Software for Understanding Climate, Computing in Science & Engineering, Vol. 11, No. 6, pp. 65-74, 2009.

Gustafson, John, Computational Verifiability and Feasibility of the ASCI Program, IEEE Computational Science & Engineering, Vol. 5, No. 1, pp. 36-45, 1998.

Knupp, Patrick and Salari, Kambiz, Verification of Computer Codes in Computational Science and Engineering, Chapman and Hall/CRC, Florida 2002.

Larzelere II, A. R., Creating Simulation Capabilities, IEEE Computational Science & Engineering, Vol. 5, No. 1, pp. 27-35, 1998.

Oberkampf, William F. and Roy, Christopher J., Verification and Validation in Scientific Computing, Cambridge University Press, Cambridge, 2010.

Oberkampf, William F., Trucano, T. G., and Hirsch, C., Verification, Validation, and Predictive Capability in Computational Engineering and Physics, Sandia National Laboratories Report SAND 2003-3769, 2003.

Pipitone, Jon, Software quality in climate modeling, Master of Science thesis, Graduate Department of Computer Science, University of Toronto, 2010.

Roache, Patrick J., Verification and Validation in Computational Science and Engineering, Hermosa Publishers, Socorro, New Mexico, 1998.

Roache, Patrick J., Fundamentals of Verification and Validation, Hermosa Publishers, Socorro, New Mexico, 2009.

Roache, Patrick J., Code Verification by the Method of Manufactured Solutions, Journal of Fluids Engineering, Vol. 114, No. 1, pp. 4-10, 2002.

Stevenson, D. E., A critical look at quality in large-scale simulations, IEEE Computational Science & Engineering, Vol. 1, No. 3, pp. 53-63, 1999.

JC comment:  For background, here are previous Climate Etc posts on the topic of climate model V&V:

My personal take on this topic is more in line with that presented by the anonymous reviewer than that presented by Pipitone and Easterbrook.

Moderation note:  This is a technical thread and comments will be moderated strictly for relevance.

290 responses to “Assessing climate model software quality”

  1. Arcs_n_Sparks

    One of the important outcomes of the Stockpile Stewardship Program is the Quantification of Margins and Uncertainties (QMU) effort. Senior weapon program managers realize the limitations of theory, modeling and simulation (supercomputers notwithstanding), and must be able to convey such in the absence of full-scale system tests.

    Fortunately (and the climate modeling community does not have this advantage), there have been past full-up system tests, so assessments can be made of how models are comparing against past data, and if designs (and associated models) are moving significantly away from the experience base.

  2. Bernie Schreiver

    The widely accepted and successful modern verification and validation methodologies, which are used in a variety of scientific and engineering software projects, are not even mentioned in the paper.

    A significant weakness of Anonymous’ conclusions resides in the vague meaning of the phrases “modern verification and validation methodologies” and “used in a variety of scientific and engineering software projects”.

    The noun “variety” is of course nearly meaningless, and as for the adjective “modern”, it might more specifically have been replaced by (for example) Jet Propulsion Laboratory standards document Design, Verification/Validation and Operations Principles for Flight Systems (DMIE-43913, 2006).

    What fraction of scientific works, in any discipline, adhere to the standards of DMIE-43913? My guess would be of order 0.1%. And yet it is fair to say too, that many (most?) practicing scientists and engineers are intimately familiar with DMIE-43913 and similar documents, and they *do* take care to incorporate these recommended verification and validation practices in their code-writing … and this accounts for the remarkably low density of software defects that Pipitone and Easterbrook report.

    In short, Anonymous’ critique is sufficiently vague as to contribute very little to the discussion of climate change modeling.

    • Pipitone and Easterbrook state: “In other words, modellers have ‘learned to live with a lower standard’ of code and development processes simply because they are good enough to produce legitimate scientific results.”

      That does not indicate that the modellers “take care to incorporate [DMIE-43913] recommended verification and validation practices in their code-writing.” It indicates the opposite.

      • The models have forecast warmer, warmer, warmer, for 15 years while the temperature has bounced around without warming. How is that good enough legitimate scientific results?
        Valid code based on flawed theory is still not forecasting what is really happening. The basic Consensus Climate Theory is flawed. They need to fix that first. Consensus Science is not science; they must become skeptical of their Theory and become Scientists again. They must recognize the uncertainty and question everything they believe and seek the right answers, even if they are not their own.

      • Latimer Alder

        ‘modellers have ‘learned to live with a lower standard’ of code and development processes simply because they are good enough to produce legitimate scientific results’

        Would I be even more than normally cynical if I surmised that ‘legitimate scientific results’ in this context should be interpreted as:

        ‘good enough to get me a paper pal-reviewed and published and so secure my next grant for another few years’

        rather than anything to do with accurate predictions of the climate?

        Because once the intimate connection between prediction and observation is broken (and in climate modelling this was severed a very long time ago), then there is nothing at all to anchor the once proud and trusty ship ‘Legitimate Result’ as it floats away across the Sea of Fantasy to the Land of Wishful Thinking.

        sorry – I am a

      • Steve Milesworthy

        Provide the full quote please. It was a quote from within a hypothesis put forward by the authors as to why defect counting *may* miss defects. It was not a statement that defects were ignored.

        The point was that “emphasizing scientific correctness may lead modellers to downplay or disregard defects which do not cause errors in correctness (e.g. defects in usability or “good” coding practices)”

      • Subpar usability or bad coding practices aren’t bugs, bugs WILL cause bad output. A project to improve usability or ease maintenance of the code would be an enhancement. A project to fix what truly are bugs would be, well, bug fixes.

      • Bugs MAY provide bad output, there are any number of trivial bugs which do not.

      • Your explanation still describes behavior incompatible with DMIE-43913. One of the principles delineated there is: “Written defect tracking shall begin at the earliest practical time.” If that principle were followed in developing the climate models, the authors would have had no reason to attempt counting defects since the defects would already be tracked.

    • “they *do* take care to incorporate these recommended verification and validation practices in their code-writing”

      354-30 “scientists typically run tests only to assess their theories and not their software”

      That paragraph explains how they generally do not even write tests for the code, so in fact they *don’t* incorporate anything resembling verification and validation, never mind more modern versions thereof.

  3. What’s the point of an open review system with anonymous reviewers?

    • Nick,

      Are you kidding? I would think it’s a positive.

      That said, @ Judith – I don’t know what the format of these reviews is, but Anonymous’ review reads more like a journal “comment” or rebuttal. It doesn’t really read like a review, and I think it would be more effective to restructure it to follow the format of a typical journal review: “Here, they said this, but…..”

      I don’t know enough about code to comment on the actual analysis. All my codes are full of bugs.

    • The reason I think it’s good is then you can’t play politics. It would be even better as double blind.

    • res ipsa loquitur

  4. Are the standards for defect detection and reporting between Climate models and comparison software systems identical? I seriously doubt it.
    I would expect the defect count to increase with increased usage, with well established criteria for identifying defects, and with rigorous reporting of the defects. With users largely consisting of the developers themselves, I doubt the statistics have much meaning. But then I haven’t read the full paper.

    • D Johnson | April 15, 2012 at 10:42 pm

      The same thoughts occurred to me also, and apparently also to the authors.

      It would be troubling were someone to ever claim, “the software has a low rate of complaint of defects, so the theory it represents is correct.”

      On the converse of such a claim, everything said in WordPress would be automatically invalidated.

  5. Norm Kalmanovitch

    The GCM climate models are fine; the problem is that these models are only capable of outputting short term predictions based on input initial conditions.
    The projection of global temperature 50 and 100 years into the future is based entirely on the projection of CO2, so the failure of the models is a combination of the false projection of CO2 increases coupled with the fabricated CO2 forcing parameter used to ascribe increased forcing from increased CO2 in a manner that has no scientific validity.
    The models are based on projected increases in CO2 of over 5ppmv/year but CO2 is only increasing at 2ppmv/year and this rate has remained nearly constant for the last 15 years so it is hardly likely to suddenly change by two and a half times!
    Don’t blame the models; blame those who fraudulently used sophisticated models to give credence and justification to their self serving and ludicrous conjecture that when the CO2 that we breathe out comes from fossil fuels it will cause catastrophic global warming!

  6. A couple points on defect counting.

    Typically, defect counting shows one thing really well. Namely, the degree/scope of testing undertaken on a given project. The defect discovery curve over time will typically express as a sine. It should achieve a smaller peak initially. The error composition at this point should primarily be catastrophic. (Obviously if it does not peak rather quickly, it’s time to push the release.) The first peak then tails to a trough as more time is required to more deeply exercise test plans. The next up tick is typically higher than the first as more errors of a transactional nature are revealed by (hopefully) detailed and complete test plans.

    From experience I’ll posit that the comprehensive nature of the test plan, in so far as its ability to exercise all possible paths through the application (or release as the case may be), is more important than defect counting or analysis. Stated differently, the test plan has to be designed to exercise the code base. Focusing overmuch on defects is a bit of the tail wagging the dog. In extreme cases it is necessary. However that’s not the sort of extreme you want to have to report out to executive management.

    One thing concerned me in the post above. Namely, Anonymous does not seem to address what I would see as a testing problem. If I read correctly, it was mentioned that at least one of the models showed a much lower than average defect count for a project of its size and complexity. At first glance, that might seem wonderful. In actuality what it usually indicates is that testing (either the plans, execution or the entire package) has been insufficient. You really really want to find bugs in testing because that is a much easier fix than rolling patches out to production.

    I’ve been asked to audit a few releases over the years (in telephony engineering) and the first place I go is to system integration test error counts. Second place is unit test error count. Third is regression test.

    • Nice perspective.

      Perhaps the methods of assessment of climate model software quality could be upscaled to provide an assessment of climate change science quality.

    • Steve Milesworthy

      Typically, code in climate and forecasting models has a relatively long life (years to decades) where just parts of it evolve at any given release. So the amount of production testing in a variety of scenarios can be substantial and happens over many years. Probably only a few different scenarios need be run for a few days each to exercise every line of code, yet scenarios are run for years and years.

      Coupled with a decent set of regression tests (to look out for unexpected side-effects of improvements) and ongoing scientific validation of the outputs from various scientific tests, as well as input from wider use of collaborators from other climate institutions or academia (to pick up platform related issues for example), one can see how a normally expected level of defects can gradually be ironed out.

      Integration testing is always going to involve a lot of scientific judgement. At this point one has a working model that looks somewhat like the earth, and the question is more whether it really is like the earth and less whether it contains any defects.

  7. I’ve been developing software for over 20 years, and have had a long interest in testing and verification. I read this with great interest and more than a little disbelief.

    The basic methodology of this study comes down to
    1) estimate the bugs during a six month span of a project (by searching for keywords in commit messages no less), and
    2) divide by the total lines of code in the entire project.

    They did have access to bug tracking systems, which list bugs individually, rate severity, date them, list known open bugs etc. I have no idea why they didn’t use that instead of commit message scanning, which is almost meaningless (one commit does not equal one bug, even if it says bug).

    They didn’t differentiate between new code and old code. Clearly if you are adding new features your bug rate will be highest. As bugs are discovered and fixed, the code becomes more stable. So counting less-total-fixes as meaning better code is pretty much upside down.

    They didn’t account for how many lines of code were added in the timespan measured, just the total of the project. That is a totally flawed and unnecessary proxy (they can easily count the lines added/edited in the commit). I worked on Eclipse 5 years ago, it was millions of lines of code back then. New code has more bugs, more code has more bugs, more effort results in more code and more fixes. As it is, it’s like estimating defect rates by taking the assembly line glitches fixed per month and dividing by the absolute size of the factory. New product lines score low and the workforce on strike scores high.

    This all still assumes that bugs are being found and fixed in the first place. The initial review doesn’t give a lot of confidence there:

    352-1 “software engineering notions of quality do not apply to software constructed as part of a scientific effort”. Must be weird to work in that no-gravity zone.

    354-3 “even with a bound on acknowledged error it is impossible to detect unacknowledged errors that fall within those bounds” (the skip-small-bugs-because-they-are-hard-to-find methodology)

    354-30 “scientists typically run tests only to assess their theories and not their software” (testing is done to find bugs, without it, you will certainly have fewer fixes – not a good thing)

    358-9 “a way to compare it with previous model runs” etc (all three verification systems listed compare against previous models, but we are only interested in how they were initially validated, which is not explained).

    367-15 “One climate modeller we interviewed explained that the climate is a ‘heterogeneous system with many ways of moving energy around from system to system’ which makes the theoretical system being modelled ‘tolerant to the inclusion of bugs.’ The combination of both factors means that code defects are either made obvious (and so immediately fixed) or made irrelevant by the nature of climate models themselves and therefore never reported as defects.” (just a big wow on that one, and that from a report trying to show how good the quality is in there. Actually I can’t believe I read that.)

    In general, the heavy use of parameterization and lack of known target/tests makes tweaking to assumptions almost inevitable. It is easy to imagine a model that outputs what is expected (or less generously, desired) is considered ‘correct’. Given the lack of rigor as above, that is impossible to detect.

    There is also a question of the data being input – errors there will certainly propagate, and looking at things like the HARRY_READ_ME file makes one think there may be at least.. umm.. one or two stray errors left there.

    Finally, when people assess for quality in a new and radical way, it is very important that the assessment can be fully assessed. In that light:

    359-14 “In the interests of privacy, the modelling centres remain anonymous in this report.”

    This makes the whole thing little more than a wild claim. The number one way to assess code is to look at it. With that, anyone reasonably trained could give a pretty good estimate on quality in an hour or so, giving something to calibrate with. Look at Eclipse, look at the climate software, make a call. Instead they opt to make a human-free computer model based on flawed assumptions, parameterize it with a count of ‘bug’ like words written over 6 months, and divide by meaningless project-lines-of-code estimations. And then only reveal the results – not even a sample code snippet. Naturally it isn’t possible to reproduce or even examine their findings, so once again we are left to the mercy of the hidden models and their mystic caretakers.

    • peterdavies252

      Good post, Robin. Seeing how you have assessed the results of this study against some of the verification and validation standards that you use in your work is quite an eyebrow-raiser to me.

      The models subjected to the study appear to be beyond redemption and any underlying theory would have zero credence. As Chief has already said, such models diverge from reality as soon as they are used.

      This subject is just another example of the junk that underpins much of the mainstream climate science and the alarmist prognostications that issue from it.

    • robin | April 16, 2012 at 12:27 am |

      Very much agreed; what the authors purport to have developed is something of a holy grail of software development, indeed a parapsychic extension of the dream of a uniform standard for comparison of dissimilar software across all conditions that also reads the minds of the users and developers to (it is implied) prove which developers really know how the universe works.

      It’s tempting to think that all software with low metrics in bug tracking is well-developed software. This is patently false; Facebook is a buzzing hive of bugginess that has satisfied the low demands of perfection of its clients so its many technical failings are often overlooked.

      It’s tempting to think that high quality in delivery of code (were the metric shown to be valid for this use, which remains in doubt) reflects some greater perfection of cognitive processes on the part of the developers or the design specification. This is patently false. A rare few hackers write beautifully elegant and robust hostile code that performs exactly to spec, but I wouldn’t exactly call a teenager living in his mother’s basement the epitome of scientific advancement.

      That said, I have a lot of regard for the diligence and the ambition of this research. Taken to a practical and moderate level (steering away from faulty implied rationalization), with appropriate caution, and developed with sufficient care to confirm results, maybe at least a little better picture of the world can be gotten through such a lens.

    • Latimer Alder

      ‘software engineering notions of quality do not apply to software constructed as part of a scientific effort’

      Let me finish the sentence:

      ‘and therefore we should not believe that the results are any better than garbage’

      If we are to take the output of climate models seriously (and their inability to accurately predict anything useful is making this an increasingly unlikely event), then they need to be subjected to the most rigorous scrutiny by the best software guys in the world. And not just a bunch of academic amateurs self-asserting that they are pretty good and must be allowed to play by their own rules.

      This attempt has damaged the credibility of climate modellers yet further, I doubt that I’d want to employ one if they ever escaped into the real IT job market.

    • Robin R.E. 358-9, does this mean that if a bug in an early model gave a result that generated a particular feature, all subsequent programs would be chosen to contain the same bug, because they are tested against this initial version?

      • Yes they would, good point. They list four techniques – there may be more, but all four test against either previous models or expected output.

        – visual inspection of graphical output – this either confirms expectations (eg we think clouds cause warming overall, and the model now shows this, which validates nothing), or is compared to previous output which propagates bugs as you say.

        – bit level comparisons of output – this will propagate bugs and assumptions.

        – comparing to the output of other models – the bugs and assumptions are now cross pollinating. This is a bit of a surprise. I’ve often read that ‘models are in agreement on X’, but if they are compared to each other and unsynced output is considered a bug to be fixed, then ‘agreement’ doesn’t mean much.

        – tweaking input parameters to see how it reacts. This is the only one that would potentially catch new bugs, however it seems to be focused on guiding the results (“358-16 This is done so as to compare the model’s response to different parameterizations, implementations, or to quantify output probabilities.”)

        If they were serious about improving their models they would spend a lot more time creating quality simulation data to test against (or any time at all, as it seems that isn’t done). Eg, when measuring SST you could have a data generator that creates random datasets algorithmically (thus can be measured to any desired resolution). They would still need to map accurately to known measurements over time (yes, this would be hard, as one would expect). This allows calibrating over known results, and leads to a better understanding of the types of variation that may be happening over distance and time. The data model can be continually refined as new data is measured (vertical temperature probes, horizontal changes at different depths, new measurements of currents etc), but needs to be based 100% on measured data and random interpolations, no theories on that side.
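        A minimal sketch of the data-generator idea, assuming a one-dimensional stand-in for an SST field. Every name here (`make_truth`, `estimate_mean`, `sample`) is hypothetical and not taken from the paper or from any climate code. The point is only that a generated "truth" has an exactly known answer, so the analysis code can be scored against it rather than against expectations.

```python
import math
import random

def make_truth(seed, n_modes=5):
    """Return a random but fully known 'temperature' profile T(x) on [0, 1].

    Hypothetical stand-in for a synthetic SST field: a constant plus cosine
    modes, so it can be evaluated at any resolution and its mean is exact.
    """
    rng = random.Random(seed)
    base = rng.uniform(10.0, 25.0)
    modes = [(rng.uniform(-2.0, 2.0), k) for k in range(1, n_modes + 1)]

    def truth(x):
        return base + sum(a * math.cos(2 * math.pi * k * x) for a, k in modes)

    # Each cosine mode integrates to zero over [0, 1], so the exact mean is base.
    return truth, base

def sample(truth, n, noise_sd, seed):
    """Simulate n noisy point measurements at random locations."""
    rng = random.Random(seed)
    points = [rng.random() for _ in range(n)]
    return [(x, truth(x) + rng.gauss(0.0, noise_sd)) for x in points]

def estimate_mean(samples):
    """Code under test: plain average of the measured values."""
    return sum(t for _, t in samples) / len(samples)

truth, exact_mean = make_truth(seed=1)
obs = sample(truth, n=20000, noise_sd=0.5, seed=2)
error = abs(estimate_mean(obs) - exact_mean)
print(f"error against known truth: {error:.4f}")
```

        The same harness can be rerun with sparser sampling or larger noise to see how quickly the code under test degrades, which is something a comparison against a previous model run cannot show.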

        This is still far from perfect, but at the moment things seem to be calibrated against expectations. Hard to imagine discovering anything new this way, never mind finding problems. The hardest bugs to find are the ones whose output you already think is accurate.

      • Not to be too nitpicking, but Robin’s quote “visual inspection of graphical output – this either confirms expectations (eg we think clouds cause warming overall, and the model now shows this, which validates nothing)” overlooks points about SW and scientific modeling.

        First, matching model performance against expected performance is important to confirm the SW, as built, matches specifications. It is a step in verifying SW quality.

        Second, it is a mistake in modeling to expect computer models to reveal new knowledge. They may, but it is likely they will reveal only our scientific expectations. When these expectations match nature, we think we have learned something. When they don’t we think we have an opportunity to learn something new (since we’re not matching nature).

        Third, one of the large unrecognized parts of the science behind GCM is an active effort to experimentally confirm (or refute) the built in assumptions within the models. It would be nice to see even a statement of these assumptions.

        To discuss SW quality in the absence of these hidden assumptions is to put the cart before the horse.

      • “Second, it is a mistake in modeling to expect computer models to reveal new knowledge”

        That is the most insane thing I have read in a while.
        This is precisely what models in science based fields do all the time. Indeed, the major difference between a model and a fit is that models are predictive in multidimensions. They show you things that were not obvious, then you design experiments to examine the area where the models indicate something is happening.

      • Philip, that is a fair point regarding the visual output, I guess it depends on how and where it’s done. Certainly if your software graphs equations you can have confidence in the expected output. If it tracks the dissipation of wind energy it is harder, but at least you can be sure all energy is accounted for etc. If it models behaviours you only have a theoretical understanding of, then the error bars in your code will always go up to and beyond the edges of your theory without more rigorous measures. Given that the visualizations are the final output, I think that suggests it is based on the results they expect rather than any form of algorithm checking. I agree that could have some use though, but it really depends on the case.

        (full quote: 358-6 validation notes: standardized visualisations of model outputs for visually assessing the scientific integrity *of the run* or
        as a way to *compare it with previous model runs*.)

        For your second point, the ‘new knowledge’ to me is the prediction, or giving weight to a theory in the model. I think we agree here and it is just semantics, certainly it would be rare to get more out of a model than you put in as far as ‘proofs’. For detecting errors in code though, relying on the result ‘meeting expectations of a theory’ is counter productive – you are just as likely to introduce errors as fix them if aiming for a guess.

        Totally agree on the need for a statement of assumptions. Much talk on quality has been about how well a program satisfies a spec. Looking at their process and comments, it is certain they are working mostly without tests, and in all likelihood, without even a spec. That is a minimum requirement if software quality is a goal, and a minimum for even measuring quality as attempted here.

      • Philip Lee
        “Third, one of the large unrecognized parts of the science behind GCM is an active effort to experimentally confirm (or refute) the built in assumptions within the models. ”
        Good observation.
        Cause/Effect: A key issue highlighted by Roy Spencer is cause vs effect – which comes first – the warming or the CO2?
        Does warming reduce clouds or do reduced clouds cause warming?
        The IPCC and GWMs assume CO2 then warming, and warming reduces clouds. However this could be the other way around and is difficult to test.

        One key diagnostic is David Stockwell’s solar accumulation theory, which finds ocean temperatures lag by Pi/2 (90 deg) from solar cycle variations.

    • Robin
      Re: “367-15 “One climate modeller we interviewed explained that the climate is a ‘heterogeneous system with many ways of moving energy around from system to system’ which makes the theoretical system being modelled ‘tolerant to the inclusion of bugs.’ “

      That makes me wonder if the programs have been validated under conservation of energy and conservation of mass.

      • Yeah, good question, that would be low hanging fruit for tests. Hey, maybe that was what they were talking about with the ‘travesty’ of not being able to account for the missing heat – maybe it was just a unit test they couldn’t fix ; )
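        As a sketch of what such a low-hanging test might look like, assume a toy chain of boxes exchanging energy in flux form. This is not code from any GCM; `step`, `total_drift` and the `gain` parameter are invented for illustration, with `gain` standing in for a deliberately seeded bug.

```python
def step(energy, k=0.1, gain=1.0):
    """One exchange step between neighbouring boxes, written in flux form.

    Whatever leaves box i should enter box i + 1. gain != 1.0 is a
    deliberately seeded defect: part of each flux is lost or created.
    """
    new = list(energy)
    for i in range(len(energy) - 1):
        flux = k * (energy[i] - energy[i + 1])
        new[i] -= flux
        new[i + 1] += gain * flux
    return new

def total_drift(energy, n_steps, **kwargs):
    """Relative change in total energy after n_steps of the toy model."""
    start = sum(energy)
    for _ in range(n_steps):
        energy = step(energy, **kwargs)
    return abs(sum(energy) - start) / abs(start)

boxes = [300.0, 280.0, 250.0, 220.0]
good_drift = total_drift(boxes, n_steps=1000)
leaky_drift = total_drift(boxes, n_steps=10, gain=0.99)

# The conservation check passes for the flux-form code and flags the leak.
TOLERANCE = 1e-9
print("conserving version ok:", good_drift < TOLERANCE)
print("leaky version caught: ", leaky_drift > TOLERANCE)
```

        The check knows nothing about what the answer should look like, only that the total must not change, so it cannot be satisfied by tuning the output towards expectations.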

      • “‘heterogeneous system with many ways of moving energy around from system to system’ which makes the theoretical system being modelled ‘tolerant to the inclusion of bugs.’ ”


      • Steve Milesworthy

        Of course they have been validated against conservation laws. Do you think a scientist would not be worried by that!? They don’t run the model and think “Well it hasn’t fallen over so its tolerance to bugs is acceptable”!

        That said, though, the point of many simulations is to evaluate the impact of perturbations. ie. one runs the model and it gives you a stable climate that looks similar to recent climate. Then one reruns the model with some additional perturbation (extra CO2, clouds of SO2, Saharan dust storms etc.) and one sees what is different. Now if the model is sufficiently “tolerant” of your bugs then the bugs have no different effect when you add the perturbation, so the before and after comparison remains valid. On the other hand, if the bugs behave differently your before and after comparison is invalid.
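        The two cases can be illustrated with a toy, assuming a linear stand-in for the model. The numbers and the `offset_bug` / `gain_bug` parameters are purely illustrative and do not describe any real defect.

```python
def model(forcing, offset_bug=0.0, gain_bug=1.0):
    """Toy equilibrium temperature; all numbers are illustrative only."""
    baseline, sensitivity = 14.0, 0.8
    return baseline + offset_bug + gain_bug * sensitivity * forcing

def response(perturbation, **bugs):
    """Perturbed run minus control run, both carrying the same bugs."""
    return model(perturbation, **bugs) - model(0.0, **bugs)

clean = response(3.7)
with_offset = response(3.7, offset_bug=0.5)  # bug independent of the forcing
with_gain = response(3.7, gain_bug=1.2)      # bug that scales with the forcing

print("offset bug changes the response by", abs(with_offset - clean))
print("gain bug changes the response by  ", abs(with_gain - clean))
```

        A bug that acts the same way in both runs cancels in the difference; a bug that interacts with the perturbation does not, and the before/after comparison cannot tell which kind is present.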

        While the latter scenario is plausible, the model testing process should cover a wide scope. Some bugs will thereby be minimised. Alternatively, such analysis will highlight areas of sensitivity of the model.

      • steve: Of course they have been validated against conservation laws. Do you think a scientist would not be worried by that!?

        Yes, I think scientists would not be worried by that. Read the paper, especially the summary of the previous paper by Hatton. Scientist-programmers have a tendency to worry about very little when it comes to validating their software. I saw (and did) the same sort of thing when I was a practicing actuary. Very smart developer-users act in very dumb ways because they trust too much in being smart.

    • Good post Robin.
      Having spent most of the last 20 years programming and developing business database software (much of it in the transport tracking and planning arena), I have to agree with what you are saying.

      I suppose, having read parts of the HARRY_READ_ME file I should have been preconditioned, but I still find myself shaking my head in amazement.

      I especially liked 367-15 “One climate modeller we interviewed explained that the climate is a ‘heterogeneous system with many ways of moving energy around from system to system’ which makes the theoretical system being modelled ‘tolerant to the inclusion of bugs.’”

      The problem with that statement is that it effectively means the models cannot be falsified. That leads to the situation where, if you can’t say when something is wrong, there is no way of actually knowing if they are right, which makes them of little more value than a random number generator.

  8. Steven Mosher

    Defect counting is meaningless without some idea of the code coverage involved in the testing. I test 100% of my code making sure that every line, every branch is covered. I get 5 defects. you test 10% of your code and report 3 defects. meaningless without the code coverage information
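    To put rough numbers on that example, here is a small sketch. It assumes both code bases are 10,000 lines (a made-up figure) and that defects are spread evenly through the code, which is a naive assumption used only to show how the ranking can flip.

```python
def raw_density(defects, total_lines):
    """Defects per thousand lines, ignoring how much code was tested."""
    return 1000.0 * defects / total_lines

def density_in_tested_code(defects, total_lines, coverage):
    """Defects per thousand lines of code that the tests actually exercised."""
    return 1000.0 * defects / (total_lines * coverage)

LINES = 10_000  # hypothetical size, identical for both projects

mine_raw = raw_density(5, LINES)
yours_raw = raw_density(3, LINES)
mine_tested = density_in_tested_code(5, LINES, coverage=1.0)
yours_tested = density_in_tested_code(3, LINES, coverage=0.1)

print(f"raw:    mine {mine_raw:.2f}  yours {yours_raw:.2f}")
print(f"tested: mine {mine_tested:.2f}  yours {yours_tested:.2f}")
```

    On the raw count the 10% project looks cleaner; per line actually tested it looks several times worse.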

    • I agree. A few additional comments on coverage and discovery rates.

      Code complexity.
      What could be added to this is that code coverage also needs consideration of code complexity. I may have 80% coverage but the 20% uncovered may be more complex code I did not have time to write proper tests for, even though they probably need more tests. Complex code of 5 lines may require 100s of lines to properly test.

      Probability of defect discovery.
      Often time constraints make it difficult to justify 100% coverage. In this case trivial code and the manner of defect discovery can play a role. Code that would clearly break in production with an exception thrown is different from code that may hide defects such as a complex algorithm where the numbers may differ in subtle ways or introduce rounding errors.

      Limitation of 100% coverage.
      I may for example execute every line of code (100% coverage), but in a single line a mathematical formula may behave differently on boundary numbers such as a minimum, maximum, even or odd input values. So a single line of code may need various test input values.

      Defects in test code.
      Defects in tests can also come into play. Are they logged to the same tracking system, to the same project, or are they included in a specific item.

      Code reviews.
      Code reviews also affect defect discovery rates. An informal review where code is changed as the review progresses, may lead to no defect being logged to the bug tracking system. Independent code reviews by “experts” may carry more weight than reviews by peers. Are tests reviewed as well?
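      The "Limitation of 100% coverage" point can be sketched with a one-line function. The leap-year rule is chosen only because it is short and has well-known boundary cases; the defect is deliberate and `failing_inputs` is a hypothetical helper.

```python
import calendar

def is_leap_year(year):
    """Single-line implementation with a deliberate boundary defect."""
    return year % 4 == 0

def failing_inputs(func, cases):
    """Return the inputs for which func disagrees with the expected value."""
    return [x for x, expected in cases if func(x) != expected]

# One typical case executes every line: 100% line coverage, no failure.
typical = [(2012, True)]

# Boundary cases push different values through the very same line.
boundary = [(y, calendar.isleap(y)) for y in (1900, 2000, 2011, 2012, 2100)]

print("typical case failures: ", failing_inputs(is_leap_year, typical))
print("boundary case failures:", failing_inputs(is_leap_year, boundary))
```

      The typical case reports no failures; the boundary set flags 1900 and 2100, with no change in line coverage between the two runs.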

      • Time.
        Do you end up spending all your time making and fixing the tests. Do the tests have tests? There’s a fine line between perfection and being non-productive.

      • Bob7, You mustn’t be too involved with serious software. Unless you are writing what I call throw away code, decades of code testing has proven these basic procedures.

  9. NASA has an Independent Verification and Validation Program (IV&V)

    Located in the heart of West Virginia’s emerging technology sector, the NASA IV&V Program was established in 1993 as part of an Agency-wide strategy to provide the highest achievable levels of safety and cost-effectiveness for mission critical software. . . .

    NASA Independent Verification & Validation Program Value Report 2008 & 2009

    Producing quality goods and services; doing the right thing; performing second to none; practicing continuous improvement; being distinctive, creative, and committed; leading in best practices; being efficient.
    Doing what was said would be done; having trust; being honest, fair, and accountable, both personally and organizationally; having steadfast ethical conduct; living by high standards of individual behavior.. . .
    Verification and validation (V&V) objectively assess the correctness and quality of software and should occur in phase with the development lifecycle of the project. Validation assesses that the correct system is being built. Verification assures that the correct system is built correctly. Both processes are guided by the following three questions, which are the foundation of the IV&V
    Program’s assessments throughout the development lifecycle:
    1. Will the system software do what it is supposed to do?
    2. Will the system software do what it is not supposed to do?
    3. Will the system software respond as expected under adverse conditions? . . .
    “Verification and validation activities produce their best results when performed by a V&V agent who operates independently of the development project or specification agent,” according to Barry Boehm, a frequently cited software engineering expert (Boehm, 1984, p. 76). To be considered independent, the Institute of Electrical and Electronics Engineers (IEEE) states that a program must be technically, managerially, and financially independent from the projects they serve (IEEE, 2004). . . .

    Because software failures have meant loss of dollars, loss of missions, and in some instances loss of life, developing strategies to mitigate these problems are vital to the software development community.

    The IV&V program has been applied to NASA’s climate satellites. e.g., Glory and National Polar‐Orbiting Operational Environmental Satellite System (NPOESS), the Orbiting Carbon Observatory (OCO) (which failed), Global Precipitation Measurement (GPM).
    However, has NASA IV&V methodology ever been applied to any GCM software? Note particularly the requirement for an INDEPENDENT evaluation. If “climate change” is a global critical issue, should not it be evaluated with similar effort? (and not another repetition of the HarryReadMe.file)

    Note NASA posted a: Tutorial on CFD Verification and Validation

    National Program for Applications-Oriented Research in CFD (NPARC Alliance).

    The NPARC Alliance has assembled and maintains a library of validation data sets, check cases, and experimental data in order to provide a validation audit trail for NPARC. The alliance may also design and perform code validation experiments when inadequate data exists.

    In a similar vein, have international standards for evaluating full uncertainty propagation ever been incorporated – especially systemic/bias uncertainty? e.g., see:
    MG Cox and PM Harris, Software Support for Metrology, Best Practice Guide No. 6, Uncertainty Evaluation, NPL Report DEM-ES-011, September 2006.

    • peterdavies252

      Steven Mosher has already written at length about the importance of independence of the reviewer in an effective verification and validation process. The climate models under discussion show no evidence of this actually occurring.

      • Thanks Peter. Any links to where?
        (nothing shows searching for “mosher independence”)

      • peterdavies252

        Judith had put up links to the previous threads on this general topic. Have a look at them but in answer to your question go to and use “find” function for Steven Mosher.

      • Steven Mosher

        its pretty simple. V&V is a QA/QC process. Under most TQM or best practices the QA/QC function has an independent chain of command to the executive level. To ensure that independence, outside firms can be employed to conduct the V&V. If you dont do outside V&V you can do it inside.

        In a company here is how it happens.
        I build a camera. I send it to my QA department. They do a standard drop test. It breaks and they halt production. I scream. They dont listen.They have the power of No.
        Then a meeting is called:
        sales, marketing, engineering, QA, support and legal. execs.

        The spec said: shall pass drop test. It failed.

        I make a case that the specification is bad. No camera needs to drop from 3 feet and survive. QA just states the facts. Legal has their say. marketing sales. everybody. A document is signed by everyone and the decision to
        go/no go is now traceable.

        Nobody who supports IV&V would argue that failing IV&V would mean the model cannot be used. The point is rather to document the spec. the test against the spec and failures/spec changes. My camera was good. It just didnt survive a 3 foot drop on concrete. 2 feet.. everything is cool.
        yes, there is subjectivity in this process.

        Some people argue that specs cant be written for GCMs. They are wrong.
        Specs can be written. They can be tested against. And failures can be addressed by changing the spec. or you can write a very loose spec if you have no clue when you start.

        All fights are over the spec.
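        A toy version of the camera story, only to show the shape of "test against the spec, and make spec changes traceable". The `Spec` class and `qa_drop_test` function are invented for this sketch and are not part of any IV&V tooling.

```python
from dataclasses import dataclass, field

@dataclass
class Spec:
    """A written requirement plus the record of every change made to it."""
    name: str
    drop_height_ft: float
    history: list = field(default_factory=list)

    def revise(self, new_height_ft, reason):
        # Keep old value, new value and the agreed reason: the paper trail.
        self.history.append((self.drop_height_ft, new_height_ft, reason))
        self.drop_height_ft = new_height_ft

def qa_drop_test(spec, survives_up_to_ft):
    """QA only states the facts: does the unit meet the written spec?"""
    return survives_up_to_ft >= spec.drop_height_ft

spec = Spec("camera", drop_height_ft=3.0)
first = qa_drop_test(spec, survives_up_to_ft=2.0)    # fails the 3 ft spec
spec.revise(2.0, "review meeting: 3 ft drop not required")
second = qa_drop_test(spec, survives_up_to_ft=2.0)   # passes the revised spec

print(first, second, spec.history)
```

        The product did not change between the two tests; the spec did, and the record shows who would have to answer for that decision.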

      • Mosher said “Some people argue that specs cant be written for GCMs”

        Which leads to the added problem not much talked about – design requirements capture errors. But from what I gather, the design requirements themselves aren’t well documented, much less the requirements capture process..

    • Steve Milesworthy

      Steve Easterbrook was a key researcher at this facility:

    • hagendl: “Validation assesses that the correct system is being built. Verification assures that the correct system is built correctly.”

      I.e.: Do the right thing in the right way.

  10. Designing, building and verifying good software is very hard. In many years of working with scientists, I have found it very uncommon for a scientist to be a skilled software engineer, though I have found many who thought they were. Just like an electrical engineer or a civil engineer, software engineers need to spend many years acquiring their skills.

    Defect counts are fairly useless as you can only count the defects you have found and probably have no idea how many others exist.

    In practice, it is very hard to produce 100 lines of code with close to zero defects. It is near impossible to produce 1000 lines of such code. If you want to build high quality software, it is usually best to use a building-block technique —
    – build small self-contained modules that can be easily verified.
    – combine small numbers of reliable modules to make bigger modules, verifying that these bigger modules are correct.
    – repeat as required, combining more modules until the whole application is built.
    – for every module, from the smallest to the biggest, before writing any code you should write a short spec describing exactly what the module does, including any i/o and error handling. Develop tests that verify every line of the spec.
    – code reviews are your friend.

    The above contains a few basic tips, but is not anything like a comprehensive guide.
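    A small sketch of the building-block approach, using made-up modules: each function carries a short spec in its docstring, each line of the spec gets a check, and the bigger module is assembled only from pieces already verified.

```python
def celsius_to_kelvin(t_c):
    """Spec: return t_c + 273.15; raise ValueError below absolute zero."""
    if t_c < -273.15:
        raise ValueError("temperature below absolute zero")
    return t_c + 273.15

def mean(values):
    """Spec: arithmetic mean of a non-empty sequence; ValueError if empty."""
    if not values:
        raise ValueError("mean of empty sequence")
    return sum(values) / len(values)

def mean_kelvin(temps_c):
    """Bigger module built only from the two verified pieces above."""
    return mean([celsius_to_kelvin(t) for t in temps_c])

def raises(exc, func, *args):
    """Test helper: True if func(*args) raises exc."""
    try:
        func(*args)
    except exc:
        return True
    return False

# One check per line of each spec, smallest modules first.
assert celsius_to_kelvin(0.0) == 273.15
assert raises(ValueError, celsius_to_kelvin, -300.0)
assert mean([1.0, 2.0, 3.0]) == 2.0
assert raises(ValueError, mean, [])
# Then the combined module, including the error path it inherits.
assert abs(mean_kelvin([0.0, 10.0]) - 278.15) < 1e-9
assert raises(ValueError, mean_kelvin, [])
```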

    • emmenjay | April 16, 2012 at 2:08 am |

      It’s not hard to produce even tens of thousands of lines of code with zero defects.

      Just don’t expose the code to a compiler.

      With the standard practice of code re-use, very high numbers of lines of code can be produced with little defect, too; provided the compiled code isn’t exposed to users.

      More to the point, defect counts reveal little about the design; looking at the particulars of the cases in question, where it appears the design itself was re-used and the users compliant to the needs of that design (a very unusual situation), it’s particularly meaningless to compare to the general case.

      • Bart R. Just exactly how do you specify that lines of code be written so that they can be reused with 0 defects in the new application??? Which, of course, is a rhetorical question.

        Goodness, most code cannot be proven to match the code design, much less be error free without testing.

      • harold | April 16, 2012 at 8:59 pm |

        Apparently most irony cannot be proven to match the code design, too.

        Else “provided the compiled code isn’t exposed to users” would have left more of an impression.

        Given the various tools of software design & development environments to map the development directly to the design, it’s been proven that any code team could match requirement to role to use case to design to validation to environment to code to system test to user acceptance test to deployment to production use and trace every step and defect throughout seamlessly. This is how some professional development teams go about avoiding failure and proving their systems.

        While it’s true most legacy and ad hoc code, and patches, and the product of some types of scope-creeping redacted-design development _don’t_ take this approach, and therefore cannot be proven to match design that doesn’t really exist, that’s a different shading of the question.

        However, in some of the world’s largest ‘zero defect’* systems, code reuse is mandatory where possible; development not strictly based on reuse of proven code escalates to a whole different level of v&v. (*Zero defects being the aspiration, often linked to job security. Amazing how much defect count drops on those terms.)

    • emmenjay,
      I would like to second your comments. Quality of code is not a function of errors detected and fixed but of style. While fewer errors is better than more, errors caught and fixed through detailed testing are likely to have a higher count than poorly tested code. Coding style, including modularity, readability, commenting, runtime error detection, boundary checks, and maintainability are far better measures of code quality. Simply counting bugs mentioned in code updates tells little about the code itself. Calling that verification or validation is far outside what large system programming experience would indicate as valid. Validation is testing against the real world, not against theoretical expectations.

  11. Latimer Alder

    How often does a climate model actually get executed? Not often by comparison with mission critical software created by IBM or Microsoft or Google or Oracle where it will be millions of times per second.

    And if the climate model fails, not much happens: just start over and pretend it never happened. If code from the guys above fails, the internet degrades, ATMs don’t work, your paycheque doesn’t get through, airports are closed, electricity grids shut down, Neil Armstrong has to land the LEM manually and on and on and on. We rely on big grown up software for so much of our lives that commercial software companies have spent fifty years trying to understand as much as they can about code quality and getting the b….y stuff to work reliably.

    So I’d be much more impressed if the analysis of model ‘quality’ came from experienced guys in development labs from those companies, rather than from academics who seem to know little – and care less – of the processes and practices in the commercial world.

    If the climate models are to be taken seriously (and that is a very very big ‘if’) they should be measured against the best quality standards in the world. Not just by one bunch of self-taught PhD students comparing themselves with another and unsurprisingly concluding to their own satisfaction that their ‘Team’ is better than the ones down the hall.

    • Steve Milesworthy

      Well has run 129,754,124 years of a variety of models as screen-savers, and the IPCC scenarios involve models being repeatedly run for 100 model years (assuming a half hour timestep that means the code is being stepped through 1.75 million times per scenario). Plus the models used for NWP will also be run in global and regional configurations at high resolution several times per day year in, year out.

      A failure of the model is not so important though. *Reproducibility* or otherwise of the failure is what is important. If you cannot reproduce your failure then you have little chance of debugging it. Reproducibility of model results down to the bit-level is a key requirement.
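      A minimal sketch of a bit-level reproducibility check, assuming a toy stochastic model in place of a real one. `run_model` and `fingerprint` are hypothetical names; the hashing uses only the standard library.

```python
import hashlib
import random
import struct

def run_model(seed, n_steps=1000):
    """Toy stand-in for a model run: a damped random walk."""
    rng = random.Random(seed)
    state, out = 0.0, []
    for _ in range(n_steps):
        state = 0.99 * state + rng.gauss(0.0, 1.0)
        out.append(state)
    return out

def fingerprint(values):
    """SHA-256 of the raw IEEE-754 bytes, so any changed bit changes it."""
    packed = struct.pack(f"<{len(values)}d", *values)
    return hashlib.sha256(packed).hexdigest()

same_a = fingerprint(run_model(seed=42))
same_b = fingerprint(run_model(seed=42))
other = fingerprint(run_model(seed=43))
print("rerun is bit-identical:", same_a == same_b)
print("different seed differs:", same_a != other)

# Floating-point addition is not associative, so merely changing the order
# of a sum (as a different processor layout might) can break bit-identity.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print("reordered sum is bit-identical:", fingerprint([left]) == fingerprint([right]))
```

      Note that a check like this establishes repeatability, not correctness: a run that is wrong in the same way every time passes it.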

  12. One obvious weakness of the P&E paper is that it, itself, fails V&V.
    Nobody can reproduce their results, because they do not identify the codes they are testing!
    (“In the interests of privacy, the modelling centres remain anonymous in this report. We use the identifiers C1, C2, and C3 to refer to the three climate models we studied.”)
    Can you imagine a medical journal reporting the results of the anonymous drugs C1, C2 and C3?
    The paper fails the fundamental requirement of reproducibility.

    • I’ve been seeing more of this though. Drug preliminary results are reported at conferences with code names and only by the time the new molecule is reaching approval for trials/commercialization are its name(s) made public. One of my colleagues’ papers was stuck in peer-review for over a year, the exact period during which a new classification method for a tumor was propounded and ‘established’. And now he cannot reproduce the system, and his paper won’t get through because it doesn’t play well with the new system. The original authors sent pictures of their positive and negative controls – and some of them look photoshopped (!). In other words, a bit more transparency at the initial stages would have sorted out the matter before the stage of publicization of the new scheme (ie., akin to drug trial result publication).

      • I have virgin copies of all my photographs as JPEG2000 files; containing the time setting and optical setup of the images. I use very long as they contain a description of what the image is.
        In the near future these will have to be uploaded to journals along with the images for background which have had the background removed and LUT settings altered.

  13. I would have thought the main strength with GCM software, unlike most software out there, is that there are several different GCM implementations.

    If there was only one GCM in the world then we might wonder if high climate sensitivity was the result of a few stray software bugs. But with many GCMs, and even simpler models showing high climate sensitivity, we can be assured that result is not due to software bugs.

    I think comparing results from different independent implementations is far more useful in both tracking down bugs in the software and in the physics than any other method.

    Most software developers don’t have the luxury of being able to do that because due to time or resource constraints they only can afford one implementation. With nothing to compare it to they have far more need to be paranoid.

    • Latimer Alder

      ‘I think comparing results from different independent implementations is far more useful in both tracking down bugs in the software and in the physics than any other method’

      And out here in the real world, we think that comparing results from the models with what the climate actually does would be the most powerful method of all!

      That ‘climate modellers’ as a group seem to have a visceral aversion to any such heretical thoughts is a big big minus point to their credibility. Even astrologers in the daily papers get their predictions tested thousands of times a day.

      • “comparing results from the models with what the climate actually does would be the most powerful method of all”

        They do that too. The point is not to overlook the strength of climate models: there are many of them. The spread of output from various models tells you a lot more than you could gather from a single model.

      • Latimer Alder


        ‘The spread of output from various models tells you a lot more than you could gather from a single model’

        I’ve seen this claimed before. Please give some concrete examples to illustrate your point which is not obvious – at least not to me.

      • Arcs_n_Sparks

        Actually, telling the time from many clocks averaged is better than any single clock. This is pretty well established.
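
        A minimal sketch of the clock-averaging claim, assuming each clock's error is an independent draw from the same zero-mean normal distribution (the clock count, error size and shared offset below are hypothetical). Under that assumption the error of the average shrinks roughly as one over the square root of the number of clocks. The second run adds an offset common to every clock, which averaging does not remove.

```python
import random
import statistics

def rms(values):
    """Root-mean-square of a list of errors."""
    return (sum(v * v for v in values) / len(values)) ** 0.5

def simulate(n_clocks, n_trials, sigma, shared_bias=0.0, seed=0):
    """Return (RMS error of one clock, RMS error of the average of all clocks)."""
    rng = random.Random(seed)
    single_errors = []
    mean_errors = []
    for _ in range(n_trials):
        # Each clock's error = bias shared by all clocks + its own random error.
        errors = [shared_bias + rng.gauss(0.0, sigma) for _ in range(n_clocks)]
        single_errors.append(errors[0])
        mean_errors.append(statistics.fmean(errors))
    return rms(single_errors), rms(mean_errors)

# Independent, unbiased errors: the average beats a single clock.
single_rms, mean_rms = simulate(n_clocks=25, n_trials=2000, sigma=10.0)

# Same clocks with a common 5-second offset: averaging cannot remove it.
biased_single, biased_mean = simulate(25, 2000, 10.0, shared_bias=5.0)

print(single_rms, mean_rms, biased_single, biased_mean)
```

        Whether model errors behave like the first case or the second is exactly what the replies here dispute; the sketch takes no side on that.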

      • Latimer Alder

        @arcs n sparks

        Please present a concrete example related to climate models so that I can understand your point.

      • Latimer Alder

        But how can they actually do that comparison with real world stuff? No climate model is capable of being judged over anything less than a fifty year period (climatology 101). And they haven’t been around for 50 years yet.

        Because the most powerful way to convince even sceptics like me that climate modelling really does have some merit would be to establish a consistent track record of testable predictions that are found to be true.
        Yet whenever this is suggested the meme that we mustn’t confuse weather and climate is trotted out and that short term fluctuations can mask the underlying trends.

        Even climatologists cannot have their cake and eat it. Either they are testable in the short term or they ain’t. If the former, publish the predictions and show how good they are. If the latter stop pretending that you do something else.

    • Latimer Alder

      ‘Most software developers don’t have the luxury of being able to do that because, due to time or resource constraints, they can only afford one implementation. With nothing to compare it to they have far more need to be paranoid.’

      And they also have the need to make it work first time in the hard light of reality. Wal Mart, for example, would not take kindly to suddenly being unable to fill their shelves one day because a software glitch caused their entire stock control system to crash. Commercial software is tested each way upwards backwards sideways and downwards by some very ingenious very crafty nasty people whose job it is to break it. And then it’s run in shadow production for maybe six months before it’s anywhere near being put on a live system.

      Why is it done this way? Because the consequences of failure are high.

      Compare and contrast this rigorous process with the climate models…whose output – we are assured – is of even more consequence than being unable to buy the right shampoo in WM tomorrow.

      A bunch of guys teach themselves Fortran. Write a few equations. Get really chuffed when the code actually executes without failure. Get even more chuffed when they discover that they’ve avoided a pitfall that another programmer has fallen into. Run the model a few times then write up a paper (sans any means of reproducibility or practical testability) which – to gain attention and hence citations – will reliably state that it’s all worse than anybody thought and that in 100 years the world will end.

      And then get some pals to write another paper saying that since there weren’t a lot of bugs reported you were all jolly good coders and your word should be taken as law. Pats on the back all round.

      This is all pathetically and tragically amateur.

      • You’d have a point if climate models weren’t orders of magnitude more complicated, requiring far more extensive and detailed expertise. Comparing it to stock planning software (which can be complicated, but also mundanely dumb) is like apples to oranges.

      • B.S.

        I guess you have never been involved in the development of complex commercial software, stuff that has to run 24/7, has to produce the right results every time, and if anything goes wrong it is you getting the phone call in the early hours of the morning to fix something in a hurry so that other people, working under pressure can get their job done.

        I have seen some of the climate model software code, and it is not impressive.

        Nobody who wrote code like that would get an interview with me for doing commercial software, the standards are in a completely different league.

        I have also had experience with taking modelling software and turning it into commercial grade software for use in the real world.

        It did not matter that I did not have the statistics expertise to create the formulae myself, all I had to do was turn it into decent software.

        Unless you have personal experience of developing end to end software that models and tracks a commercial industry, you are not qualified to make a judgement on orders of magnitude of complexity.

        Having done some of both, it is not apples and oranges at all. It is really all just applied algebra when it boils down to it.

      • Steven Mosher

        looks like you’ve walked through a climate model. Go look at ModelE online. code browser. helps if you know fortran.

        It’s neither overly horrible as some paranoids claim, nor overly complicated as hand wavers claim.

      • Having a look now, thanks for the link. Found a 13000 loc file with tons of state and no tests in sight, but keeping an open mind : ). I’m going to try a fortran lint tool to see what it says. I know that isn’t the final answer on anything, but I’ve used them over the years and they can give good advice.

        Using the Frozen version of ModelE for CMIP3 simulations (the other requires a password) at the bottom of

      • Latimer Alder


        Is your argument that because climate models are much more complicated and difficult than stock control systems they need *less* testing and care taken over them? Rather than more?

        Or that because the consequences of failure are so high (end of civilisation as we know it) that we should worry less about them than we do about tomorrow’s availability of Head and Shoulders in Rochester, Minnesota?

        If it is really the case that climate models are the only key to understanding the future course of planetary temperatures, then they are indeed the most important pieces of software in the world. For they can help us to survive as a species. Not much comes more important than that.

        And given that the IT industry has developed some techniques that help to minimise the risk of failure, do you not think that it might be a good idea if some of these were adopted in the model’s implementation and coding? Or do you assert that they are so complex that only anointed climatologists are capable of working on them and that professional software engineers would be useless?

        After all, it is just the future of humanity that rests on the abilities of these models. Nothing serious.

    • The many different climate models are all put together from the same flawed consensus climate theory. It is no surprise that they all produce the same flawed warm, warmer, warmest forecasts while earth temperature does not follow. CO2 sensitivity was derived based on flawed theory and they use false feedback parameters to match the theory. Earth does use her own Theory and Models which gives the correct results which is different from the Consensus Climate results.

      • That’s irrelevant to software quality. If the physics is wrong but it’s coded correctly, that’s not a software problem.

      • What if both the physics and the coding are wrong? Is that right then?

      • Latimer Alder

        Back when I sold commercial software that was an ‘undocumented feature’. But probably APARable.

      • In what universe?

        A bug is a bug, regardless of whether it is a logic or a syntax one.

        If the underlying formulae are incorrect then the software is incorrect and that is a bug.

        That is the real difference between academic software and real commercial software.

        In the real world it does not matter if the output is wrong because the formula is wrong, or the formula is implemented incorrectly or if there is some limitation in the programming environment, or if it is because the floating math is wrong in the CPU. If the answer is not correct then it is a bug, (and will be regarded as such by the end user) and it is the responsibility of the programmer to fix it.

    • Follow the footnote link though, they are not just comparing them against others, they are calibrating them.

      “This framework enables a diverse community of scientists to analyze GCMs in a systematic fashion, a process which serves to facilitate model improvement. Virtually the entire international climate modeling community has participated in this project since its inception in 1995.”

      If they all get the same answer that is one thing. If they get different answers and tweak until they are the same that is another.

      • Hans von Storch suggests there is at least some ‘social’ pressure towards convergence of models (interview here):

        “There might be small differences, maybe the equilibrium temperature for a doubling of the CO2 concentration in the atmosphere is 3°C for one model and 4°C for another model. But the general trends are reproduced with all models. Of course that is something, one could also be critical about. The scientist making these models know each other more or less. And if somebody then finds very unusual results, he might become shaky and say: Well, maybe my model is not as good as the other 17 models that are around. And then he tries to adjust his model to agree with the other 17 models. There is also a social process that leads to the agreement between all the different climate models.”

  14. Maybe I am just plain stupid, but to me the issue of validation is very simple. You have the model make predictions about the future, and then check these predictions against what actually happens. When the model successfully (whatever this means) forecasts the future a sufficient number of times that the results could not have occurred by chance, then the model is validated; and only then. When some professional puts his/her signature on the outcome of a future prediction, and if that prediction is wrong, and that professional is then held to be personally liable, then we have a validated model. Not before.
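
    One way to put a number on "could not have occurred by chance" is a one-sided binomial tail probability. The sketch below assumes each forecast is an independent pass/fail trial and that an uninformed guess would pass with a known probability; both are simplifying assumptions, and the counts used are hypothetical rather than taken from any model's record.

```python
from math import comb

def chance_probability(hits, trials, p_chance):
    """Probability of at least `hits` successes in `trials` independent
    forecasts if each one succeeds by chance with probability `p_chance`."""
    return sum(
        comb(trials, k) * p_chance**k * (1 - p_chance) ** (trials - k)
        for k in range(hits, trials + 1)
    )

# Hypothetical record: 8 correct forecasts out of 10, coin-flip baseline.
p_value = chance_probability(hits=8, trials=10, p_chance=0.5)
print(p_value)  # 56/1024 = 0.0546875
```

    The smaller this probability, the harder it is to attribute the track record to luck. Defining what counts as a "successful" forecast, and what the chance baseline is, remains the hard part.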

    • Latimer Alder

      But but but but you can’t possibly expect climatologists to be held accountable for what they do on our time and with our dime! That would be heresy. Accountability is for little people!

      The rules are these:

      1. Climatologists are – by definition – infallible and trustworthy.
      2. Their work is – by definition – already perfect and so cannot be criticised by anybody who is not a climatologist.
      3. The ‘lay’ public have no business even looking at their work. The public’s role is to provide lots of funding, oodles of adulation and lots of international jollies, not to get above themselves. And to do what they’re f***g told!
      4. In the very unlikely event of the actual observations and climatological predictions disagreeing, then this merely proves exactly how well-funded the Big Oil denier conspiracy is. But luckily there are ‘climatology centres of excellence’ where these inconvenient glitches can be homogenised, massaged and adjusted so that they do not upset the applecart.

      • Latimer, You and I are in agreement, so please take these remarks with that in mind. I am aiming them at our hostess, and I hope she will respond.

        First, your four points should have been preceded by “sarc on”. Our hostess specifically stated “Moderation note: This is a technical thread and comments will be moderated strictly for relevance.”

        If I am right, and I really believe I am right, then this whole thread is a load of scientific nonsense. And that is what I would like Judith to comment on. The whole of CAGW is built on the myth that climate models can predict the future. This has never been proven; the models have never been validated, and probably, can never be validated. All this theoretical and hypothetical stuff which is the basis of this thread, is completely irrelevant.

        What people seem to have difficulty understanding is that non-validated models can be extremely useful. The models themselves are never irrelevant. What is happening is the misuse of models: people setting themselves up as experts, using models for tasks for which the models are entirely unsuitable.

        Our hostess is remiss in even suggesting that the papers she refers to have any merit in discussing whether or not climate models have been validated.

      • blueice2hotsea

        Jim Cripwell:

        The whole of CAGW is built on the myth that climate models can predict the future. This has never been proven; the models have never been validated, and probably, can never be validated. All this theoretical and hypothetical stuff which is the basis of this thread, is completely irrelevant.

        What then is the better approach?

        A well-known alternative physics reprobate – whose name cannot be mentioned – wrote:

        Climate modeling is the only way of reaching understanding, since controlled experimenting is impossible.

        Seems reasonable to me.

      • Eric Ollivet

        Controlled experimenting is indeed impossible.
        Yet we have a nice scale 1 “experiment” with about 150 years of past climate data on records…

        We thus shall expect that models could faithfully hindcast those past data but the very inconvenient truth is that they don’t : models’ predictions are daily falsified by observations.

        This is exactly why Jim can state, without fear of being contradicted, that “the models have never been validated, and probably, can never be validated”

      • Pooh, Dixie

        “Climate modeling is the only way of reaching understanding, since controlled experimenting is impossible.”
        “Seems reasonable to me.”
        One way, but not the only way. Observation is an alternative.
        – Bending of light in strong gravitational fields.
        – Orbit of Mercury
        – Elliptical planetary orbits.
        – Etc.
        Note that Post Normal Science has the criterion “Experimentation – ‘not best suited’ – crucial experiments are unavailable.”

    • Norm Kalmanovitch

      It is not the models that have to be validated as much as the garbage that goes into them. The models are forced into outputting temperature values that have no relationship to the real world by a fabricated input parameter for CO2 forcing that is based on a relationship that has no scientific validity. The output from the models, which is in W/m^2, is then converted to degrees C with another fabricated factor called climate sensitivity, which too has no actual physical basis and was merely fabricated to make the hindcasts of the models fit the observed record back to 1960, but the models have failed to forecast even one year of global temperatures since.
      Garbage in garbage out is what is causing perfectly good models to project catastrophic global warming from CO2 increases when this is actually a physical impossibility!

      • Norm, I understand what you are saying. I have a slightly different way of looking at the problem. Someone has to use the model to try and do something. That someone is responsible for the whole ball of wax; model, input data, etc. I don’t differentiate. That is what I was trying to say: it is the misuse of models that is the problem. Users putting in wrong data, and claiming the results are valid, is just another way models are misused.

      • Pooh, Dixie

        “That someone is responsible for the whole ball of wax; model, input data, etc. “
        UNFCCC, perhaps?

  15. Judith: The problem with this report is that it is looking at one issue, while it will be interpreted as addressing another.

    By way of comparison: R is a very reliable statistics package. I imagine if you looked at the bug report rate, bug fix rate, update rate, etc, etc, it would rate very highly. That has nothing to do with the accuracy of my analysis done in R.

    I’m not aware of anyone who claims that the climate models are buggy. (This paper addresses that concern.) Rather, the discussion is over whether they are accurate. (This paper will be used to tout how “reliable” the results of these models are.)

    • Wayne2
      Buggy: All models are buggy – we appear to have little idea by how much. Thus the need for independent verification & validation.

      Chaotic: Weather and climate have chaotic uncertainty – most models are run with few replications – we don’t know how much of the difference between models and data is chaotic, how much is poor understanding of weather/climate and how much is climatic trends – or how much of the trends are natural vs anthropogenic – despite IPCC’s claimed > 90% confidence. e.g., See Fred Singer NIPCC vs. IPCC Addressing the Disparity between Climate Models and Observations: Testing the Hypothesis of Anthropogenic Global Warming (AGW)

      (2) Climate models are known to be chaotic. None of current models have a sufficient number of runs to overcome chaotic uncertainty and therefore cannot be validated against observations. . . .
      Attribution of observed warming trends to GH-gas increases is based largely on claimed agreement between observed (tropical) tropospheric trends and modeled ones [Santer et al., IJC 2008, Fig 6]. We show that the claimed consistency is spurious.

      Uncertain data. The data has major uncertainties – there are major issues trying to evaluate by how much. To validate models we need to compare models against climate data. e.g., See Nigel Fox of the National Physics Lab:

      Dr Nigel Fox, head of Earth Observation and Climate at NPL, says: “Nowhere are we measuring with uncertainties anywhere close to what we need to understand climate change and allow us to constrain and test the models. Our current best measurement capabilities would require >30 yrs before we have any possibility of identifying which model matches observations and is most likely to be correct in its forecast of consequential potentially devastating impacts. The uncertainties needed to reduce this are more challenging than anything else we have to deal with in any other industrial application, by close to an order of magnitude.”

      Uncertain climate models impair long-term climate strategies
      See Fox’s presentation: Resolving uncertainty in climate change data

      Climate science is still wandering in a wasteland of uncertainty. We are trying to make policy issues to spend billions per bug – while having wide variations between models for unknown reasons.

      Better to get the bugs out first, reduce the uncertainty in the data to be able to test models, understand the physics of climate especially the clouds, and then compare to see if there is a serious issue – so we can evaluate the pros/cons of what to do about it.

  16. Bernie Schreiver

    One very well-documented answer to the question “What validation and verification techniques work best in scientific simulation?” is provided by the twenty-year project Critical Assessment of techniques for protein Structure Prediction (CASP).

    The verification aspect of CASP is simple: a straight-up contest among simulation codes, held every two years, to determine which codes are most accurate.

    The validation aspect of CASP is simple too. Although CASP programmers are of course familiar with V&V classics like JPL/NASA’s DMIE-43913 (because the lead author is none other than Bjarne Stroustrup, whose works every software engineer knows), for the top-ranked CASP groups the actual process of scientific software validation is VERY different from anything that Anonymous describes or recommends.

    I had the pleasure of sitting in on a top-scoring CASP group’s weekly meetings, and it was instructive to observe the tuning of their model parameters. The procedure was simple: (a) trust the experts within the group, and (b) in cases of disagreement, vote within the group. An example that I personally witnessed was the constraints to be placed upon a key torsion angle associated to peptide bonding; should it be bounded by 120 degrees, 130 degrees, or 140 degrees? Following arguments, the bond-angle bound was set by voting, with grad students, post-docs, and senior faculty all having one vote (so that the value actually was set not by the most-experienced senior faculty, but rather by the least-experienced grad students). And how does an individual rise within the group to be regarded as an expert? The process is simple (but not easy): establish a good track record in the debates associated to the votes.

    In theoretical terms this chaotically democratic software development process looks horrible – but on practical grounds it works GREAT! To the best of my knowledge, *ALL* of the top-ten CASP prediction models were created by comparably democratic software development processes, and *NONE* were created by rigid adherence to preconceived V&V standards.

    The key point that Anonymous’ critique misses is the lesson of science-and-technology history: much better than one large scientific group slowly producing scientific models via rigorous adherence to V&V standards is numerous competing small groups that each rapidly produce scientific models by an evolutionary process of competition.

    The best way to appreciate this lesson is by the personal experience of writing competitive rapidly evolving scientific models, and there is nothing in Anonymous’ critique that indicates that (s)he has ever had this experience.

    Bottom Line: The software development principles that Anonymous advocates are grounded in abstract theory rather than in practical science-and-engineering, and these abstract principles work so badly in practice, that few (if any) scientific software groups embrace them.

    • Bernie Schreiver
      With CASP you can quantitatively compare models against quality data with short turn around times and have objective competitive evaluations.
      For climate, Nigel Fox of NPL states:

      Our current best measurement capabilities would require >30 yrs before we have any possibility of identifying which model matches observations and is most likely to be correct in its forecast of consequential potentially devastating impacts.

      Detection of subtle indicators from a background of natural variability requires measurements over a time base of decades. This places severe demands on the instrumentation used, requiring measurements of sufficient accuracy and sensitivity that can allow reliable judgements to be made decades apart.

      See: Accurate radiometry from space: an essential tool for climate studies doi: 10.1098/rsta.2011.0246 Phil. Trans. R. Soc. A 28 October 2011 vol. 369 no. 1953 4028-4063
      Statistical comparisons by Lucia at the Blackboard indicate IPCC model projections are warmer than the data.

      As you can see, if “we” believe that the underlying trend is linear and the noise is “red”, and using the trend since Jan 1980 to test the range of trends, the 0.2C/decade is currently excluded from the 2-σ range of trends. Specifically: the data says warming is slower than that. If we use ARIMA(1,0,1) (which I believe is… uhmm… someone’s… you can guess whose currently favored choice), the 0.2C/decade is also excluded.
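
      The kind of test quoted above (a linear trend with "red" noise and a 2-σ range) can be sketched as follows. This is not the Blackboard's code or method: it assumes AR(1) noise rather than ARIMA(1,0,1), uses the common effective-sample-size correction n_eff = n(1 - r1)/(1 + r1) on the trend's standard error, and runs on a synthetic series whose slope and noise parameters are hypothetical.

```python
import random

def fit_trend(y):
    """Ordinary least-squares slope per time step, plus residuals and Sxx."""
    n = len(y)
    t_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y))
    slope = sxy / sxx
    resid = [v - y_mean - slope * (t - t_mean) for t, v in enumerate(y)]
    return slope, resid, sxx

def trend_with_red_noise(y):
    """Return slope, naive standard error, AR(1)-adjusted standard error, r1."""
    n = len(y)
    slope, resid, sxx = fit_trend(y)
    sse = sum(r * r for r in resid)
    naive_se = (sse / (n - 2) / sxx) ** 0.5
    # Lag-1 autocorrelation of the residuals.
    r1 = sum(a * b for a, b in zip(resid, resid[1:])) / sse
    r1_pos = max(r1, 0.0)  # only widen the interval, never narrow it
    n_eff = n * (1 - r1_pos) / (1 + r1_pos)
    if n_eff <= 2:
        raise ValueError("series too short or too autocorrelated")
    adjusted_se = naive_se * ((n - 2) / (n_eff - 2)) ** 0.5
    return slope, naive_se, adjusted_se, r1

def synthetic_series(n, slope, phi, sigma, seed=0):
    """Linear trend plus AR(1) noise; all parameters are illustrative."""
    rng = random.Random(seed)
    noise = 0.0
    out = []
    for t in range(n):
        noise = phi * noise + rng.gauss(0.0, sigma)
        out.append(slope * t + noise)
    return out

true_slope = 0.2 / 120  # hypothetical: 0.2 C/decade expressed per month
series = synthetic_series(n=360, slope=true_slope, phi=0.6, sigma=0.1)
slope, naive_se, adjusted_se, r1 = trend_with_red_noise(series)

# A reference trend is "excluded" if it lies outside slope +/- 2 adjusted SE.
reference_excluded = abs(slope - true_slope) > 2 * adjusted_se
print(slope * 120, adjusted_se * 120, reference_excluded)
```

      Ignoring the autocorrelation (using `naive_se`) gives a narrower range and therefore excludes reference trends more readily, which is why the choice of noise model matters to the conclusion.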

      So how can you quantitatively evaluate climate models with unknown bugs in them, with major chaotic variations, highly uncertain data relative to the trends, and models where the physics is seriously wrong? – Where it takes 30 years to get quantitative comparisons – and longer than a scientific career to evaluate, fix and then prove correct? – And where you have very strong vested political interests and funding feedback supporting the position that most warming is from anthropogenic contributions with > 90% confidence? – and rejecting independent validation and verification?

      It looks to me that this will be an extraordinarily difficult debugging and independent verification and validation task which will likely require comparison against synthetic atmospheric models AND Fox’s order of magnitude improvement in satellite measurements.

      • Bernie Schreiver

        David L. Hagen, the points that your post makes can be aptly summarized in a single sentence: No rational grounds exist for skeptics to assert with high confidence that James Hansen’s predictions are wrong.

        Indeed, recent observations of accelerating ice-mass loss and sea-level rise constitute verifying evidence that Hansen’s predictions are right.

      • Latimer Alder

        Remind me of Hansen’s track record of previous predictions that have been subsequently shown to be true and accurate?


        He predicted sea level rise of x inches by year y
        Outcome: in year y the sea level rise was z. (z is close to x)

        Anything less definite than that and you might as well be consulting Nostradamus or a ouija board or the tea leaves.

        The rational grounds for disbelieving Hansen’s predictions are that he has never got any definite climate prediction right in his entire career. In that he is not unique, but it just puts him slap bang with the rest of the population, like me or little Tommy Tucker or Yogi Berra or Miss Piggy or my crazy aunt Agatha or David Icke and the Giant Lizards.

        But the converse of your argument *is* true. There are no rational grounds for alarmists to assert with confidence that Hansen’s predictions are true.

      • Bernie
        Re: “to assert with high confidence that James Hansen’s predictions are wrong.”
        I think you read the scientific method backwards. The burden of proof in science is on those proposing a new hypothesis. They have to prove that their hypothesis differs from the null hypothesis. In this case the null hypothesis is that nature will continue to vary as it has in the past, and that anthropogenic effects provide a small contribution.
        Lucia shows that IPCC’s models quantitatively fail to track our temperature within 2 sigma.
        Scafetta’s models based on natural variations with small anthropogenic contributions appear to better fit the data. See Scafetta’s solar-lunar cycle forecast -vs- global temperature

        I think you would do better here to learn and apply the scientific method to climate evaluations, rather than illogical political rhetoric.

      • David Springer

        Sea level rise has not been accelerating. We can’t measure ice-mass well enough to make any claims about it. We can measure ice-extent well enough and it’s flat with increased extent in the southern hemisphere almost perfectly offsetting decreased extent in the northern hemisphere. And very very bad for the CO2 model is that there is no observed surface temperature increase in the place where it should be most apparent – Antarctica. You see, Antarctica is the driest place in the world so it has the least amount of confounding effects from water vapor. It’s the best experimental platform we have for isolating the effects of higher CO2 and the result is that the effect of CO2 is precisely bupkis.

        Don’t get me wrong. I’m a luke warmer and I happen to believe that if everything else is equal we should see surface temperature rise 1.1C over dry land per doubling of CO2. I have no explanation for why the driest and most stable environment on the planet is cruising along without any warming or cooling despite the unabated acceleration of anthropogenic CO2 production.

        I don’t know about you but this gives me pause in my presumption that CO2 has any measurable greenhouse effect at all.

      • Bernie Schreiver

        David, a logical corollary of being a “luke-warmer” in the present year of 2012, is that if it should come about that the three flagship predictions of (1) sustained energy imbalance, (2) accelerating ice-mass loss, and (3) accelerating sea-level rise, are verified in the coming decade, then by the year 2020 (or so) the world will witness the mass conversion of previous climate-change “luke-warmers” to newly ardent climate-change “Hansenites” !

        In recent publications and predictions, James Hansen and his colleagues are laying the foundations for this future conversion, via the utterly mundane V&V method of valid physical science followed by verified predictions. … which is the traditional, slow, yet ultimately irresistible avenue by which (ever since Galileo) science has *ALWAYS* acted to catalyze conversions. :)

      • Bernie
        “V&V method of valid physical science followed by verified predictions.”
        The context of this post is on verifying and validating the software. i.e. each line of it.
        Lucia’s analysis indicates that Hansen’s predictions are wrong, not validated.

        Try a reality check and look at the actual data on ice and sea level

      • David Springer


        Without water vapor amplification and a *maximum* of 1.1C warming *over land only* with the majority of the warming at higher latitudes in the winter then CAGW turns into MBAGW (Massively Beneficial Global Warming). In other words if we weren’t pumping massive amounts of CO2 into the air as a beneficial byproduct of fossil fuel consumption we’d need to invent some other way of increasing atmospheric CO2.

        That hardly puts me into any semblance of reasonable agreement with the usual suspects in climate boffinry.

      • David Springer

        According to Hansen’s predictions from 25 years ago Lady Liberty’s feet should be wet from rise in sea level.

        Yeah, he’s got a great track record. /sarc

      • Hansen may be brilliant, but it makes the case for AGW seem weaker to paint him as a visionary.

      • Bernie,
        What a convoluted circular defense you make of Hansen and his predictions. Did we miss the part about where those making assertions have to prove them?

      • Eric Ollivet


        Ice mass loss and sea level rise are just proof of warming, not of human responsibility.

        In 1988, Hansen predicted a terrific sea level rise of several meters by 2100, and of more than 1 meter in 40 years, that would result in Hudson bay flooding… Obviously he was just plain wrong!
        No acceleration of sea level rise has been detected since 1993, and the latest satellite data even tend to show a deceleration.

        Hansen also predicted a significant temperature rise of 2°C in 20 years that has been falsified by observations (showing no warming over the past 15 years).

        Hansen is definitely more an activist than a scientist, and all his predictions have been proved plain wrong.

      • Eric Ollivet | April 21, 2012 at 7:30 pm |

        Uh, no on all counts.

        Saying there’s no proof for human responsibility given the body of knowledge from Physics and observations is akin to saying fingerprints, DNA, video and gunshot forensics don’t produce enough evidence to convict a criminal. Are you soft on crime?

        Wrt 1988. Check your source, check your facts, check what was actually said. Find and cite the entire quote and the conditions on the question actually put to Hansen.

        To conclude anything from the observations about acceleration of rate of rise on so short a span as 20 years on so lagged a response, other than that we’re foot-draggingly slow to track vital information, is simply illogical. We could and should have been making better sea-gauge efforts from the time we knew how to mark the tide line, and could and should have been making better efforts to track land-level changes from the time we figured out that art.

        See Hansen and Sato, p22, Fig. 7 for Hansen’s actual projection.

        Also, could you confirm the source of your “2°C in 20 years” prediction? Given your track record on Hansen quotes, one suspects you’ll find you’ve been misled.

  17. Climate models are incorrect.

    Models Vs Observation =>

  18. Jeff Corwith

    I’ll toss a couple of semi-random observations about modeling. These arise out of my perspective as one who builds and uses models of a different sort.

    -(Probably overly obvious but nevertheless): It’s important that the model honors the basic physics (‘material balance’ and laws of thermodynamics come to mind firstly) of the system. The model developers must test against problems to ensure that the delivered code does this.

    -There needs to be a distinction between the basic physics and assumed and/or empirical relationships/behaviors in the model. Ideally the latter won’t be hard coded into the model such that alternatives (however unlikely) can be tested – at least so the modeler can have an understanding on how much those assumptions drive end results. Those using the model(s) need to understand the assumptions behind their model and the limitations therein.

    -Model runs with alternative assumptions and parameters (uncertainties) are useful so that the modeler can develop an understanding of how much they drive the model results. Some uncertainties will be shown to have relatively minor impacts. Those uncertainties which effect the largest changes in outcomes are the ones on which future studies should focus.

    -finally – my own soap box – even though the models may with all likelihood be far from good at predictions, that doesn’t mean that they aren’t useful. Just developing an understanding of the drivers for prediction performance (while fully understanding the context of the model itself; being able to isolate physical processes from model and coding artifacts), will go a long way towards leveraging future research efforts.

    Given the political ‘climate’ around this sort of modeling, I do not envy these modelers in the least. There are going to be a lot of folks (from all sides of the issue) lying in wait to play “gotcha” with selected portions of their results.

    Best of luck!

    • Latimer Alder

      ‘the models may with all likelihood be far from good at predictions, that doesn’t mean that they aren’t useful. Just developing an understanding of the drivers for prediction performance (while fully understanding the context of the model itself; being able to isolate physical processes from model and coding artifacts), will go a long way towards leveraging future research efforts. ‘

      Maybe so.

      But they shouldn’t be claimed to be predictive tools unless they have shown some abilities in those areas. And yet climatologists and AGW advocates are far too ready to seize upon model outputs and use them as dire warnings of catastrophes to come. It’s just a reworking of the old 1970s idea ‘it must be true, it comes from a computer’.

      And after 30 years of throwing good money after bad to develop climate models that clearly can’t do any useful prediction now and have no hope of ever doing so in the future, I detect a groundswell of opinion that says there is little point in spending much more effort in this area. Time to go and do something more useful instead.

  19. The models must represent the observed data accurately.

    The current models don’t. They don’t have a cyclic component. They don’t have turning points and points of inflection. Until they do that, they will remain incorrect.

    • Girma | April 16, 2012 at 9:57 am |

      One imagines if the actual global temperature had real turning points and points of inflection, was actually cyclic instead of reflecting the sums of many recurring and random incidents, your statement might bear some similarity to truth.

      However, as the raw data is just data, and not the product of a mysterious linear function hidden in messages sent from the universe to those cunning enough to decipher them, what you say here is simple fantasy.

      • Look =>

        need I say more?

      • Steven Mosher

        simple question: what is your std error of prediction. And what final temperature in 2012 will make you change your mind and say
        “my model was wrong”

        1. What is your prediction for the 2012 anomaly.
        2. what error will cause you to come on line and say

        “I predicted X, the answer was Y, I was wrong”

        fill in X and Y.

        Actually, do that for May 2012.

        Then we will watch the temperatures and wait for your day of reckoning.

      • Steven

        The problem is the range of the noise is 0.4 deg C, making short-term prediction to better than +/-0.2 deg C impossible.

        Steven, my prediction, for each year until 2030, is that the global mean temperature will be between 0.2 and 0.6 deg C, with 95% confidence.

        Compare that with IPCC’s of 0.8 to 1.2 deg C for 2030.

      • Actually here is the figure for my prediction until 2030 =>

      • Girma | April 17, 2012 at 6:13 am |

        Mr. Orssengo, do you not understand what a derivative is?

        You have two graphs that you are using to draw conclusions, each one contributing something vital you miss in your narrative, such that the sum of the graphs contradicts your stated conclusion.

        Your temperature graph includes a constant rising term. The derivative of a constant is zero.

        The constant rising term does not participate in any way in your derivative curve.

        The rise in the derivative curve is due to accelerated warming seen throughout its length. (You overestimate this acceleration due to errors in methods, but we’re playing on your skewed field for the moment, so let’s pretend to accept the thesis presented in the graph.)

        So over the next 30 years, due to the very tiny negative component of the 30 year trend being approximately the same as the constant term of your temperature graph over the same 30 years, using your invalid application of logic uniformly we discover that, because the first dozen years of the 2000-2030 period are so very different from the first dozen years of the 1940-1970 period, to get anything like the effect your own 30-year trend graph is predicting at least one year in the coming decade must be a substantial record high global temperature. It’s a simple matter of sums.

        You can do sums, yes?

        The two predictions your graphs make by your own reasoning are mutually impossible and contradictory.

        Further, the graphs themselves rely on quite incredible gaps in methodology, and are invalid in and of themselves.

        What tests did you do to confirm each part of the data belongs to the same trend in isolation? The answer is none. You validate against the smoothed curve, not the raw data. How is that even remotely meaningful?

        And to claim that there will be a convenient change in the pattern at exactly the time that agrees with your predetermined conclusion, remembering that you have ventured mechanism for neither the graph nor the change in pattern, flouts that you are merely performing an exercise in fingerpainting.

      • selti1 | April 16, 2012 at 11:11 am |

        You understand your trend of 30 year trends projection, as it’s a derivative graph, is saying the following, according to your empirical reasoning:

        1. After 2014, there will be no future significant negative trends at the 30 year level, so long as the trend continues;

        2. The current ‘negative phase’ will end by 2014 (likely sooner due to the constant 0.6C slope of the temperature plot), and the 30 years from 2015 to 2045 will experience by far the highest temperature rise on record for every 30-year trend line, so much so that by 2020, you’re predicting a new hottest year on record.

        3. Even at 2075, the next predicted slowest rise in temperature, the temperature will be rising only (as implied by #1), meaning every subsequent decade will be getting warmer faster than the decade before it forever.

        In other words, your two predictions based on your two graphs are in conflict.

        Which prediction do you believe? That warming will exceed the 0.2C/decade rate by 2040 as your temperature graph suggests, or that it will exceed the 0.2C/decade rate by 2015 as your trends graph ‘predicts’?

        Personally, I don’t see a case for prediction at all from these squiggles. Though they’re getting rapidly less awful.

      • Bart

        Don’t forget that on the 30-years trend graph 2015 on the x-axis actually means the period from 2000-2030.

      • Mr. Orssengo

        Yes, I do understand that 2015 is at the midpoint of 2000-2030.

        By 2015, your 30 year trend is predicted to be flat from 2000-2030, with excursions on your temperature graph +/- 0.2C on the smoothed GMT.

        There’s no way to get that result without at least one year in the decade significantly higher than the highest temperature on record, given the current trend.

        And every decade from that midpoint on will be warmer than the decade before, forever.

      • Bart

        And every decade from that midpoint on will be warmer than the decade before, forever.

        Until the next change in the climate pattern of the last 100 years into something else.

      • Bart

        There’s no way to get that result without at least one year in the decade significantly higher than the highest temperature on record, given the current trend.

        Not necessarily.

        Based on the past pattern, after the 1940s, after point F3’, just after the warming trend changed to a flat trend, this change has also changed the noise (Residual GMT) pattern so that there is more frequent cooling. So after the 1940s the GMT was more frequently below the smoothed GMT. Similarly, until the 2030s the GMT will be more frequently below the current smoothed GMT value of about 0.4 deg C.

        The 2000-2030 pattern should mimic that from 1940-1970.

  20. You know, nothing’s stopping individual skeptics from producing their own models to challenge the ones they don’t like; with distributed computing power over the Internet, the only real obstacle is will.

    Look at Mr. Orssengo’s example. He’s spent years with his own models. Imagine what someone with even a little mathematical and programming ability could do with so much as a tenth of his stick-to-itiveness.

    • I was thinking about building a model. I thought it was going pretty good, but folks kept saying I was getting the wrong answers. I kept coming up with the conductive/convective impact of nearly 1/3 of the total cooling. 27.5 percent I think it was. Problem was, my model didn’t agree with the surface temperature data. Craziest thing. If I used an average surface temperature of 288K it seemed like there was a radiant imbalance of about 0.9 Wm-2 at the tropopause. That’s a lot.

      Then I found out that 70% of the Earth is water and that the average surface temperature of that water appears to be about 294.25 K, plus or minus a touch of course. Damnedest thing though, a one degree change from 294.25 K to 293.25 K is 0.3 Wm-2 more than a one degree change from 289K to 288K. That drops that estimated tropopause imbalance down to 0.6 Wm-2. Plus or minus a touch of course.
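The one-degree comparison above can be checked with a few lines of Python. This is only a sketch, assuming the surface radiates as an ideal black body (emissivity 1) under the Stefan-Boltzmann law; the function names are made up for illustration and are not from any climate code.

```python
# Stefan-Boltzmann constant in W m^-2 K^-4 (CODATA value).
SIGMA = 5.670374419e-8


def blackbody_flux(temp_k):
    """Flux radiated by an ideal black body (emissivity 1) at temp_k kelvin."""
    return SIGMA * temp_k ** 4


def flux_change(t_from, t_to):
    """Difference in radiated flux between two temperatures, in W m^-2."""
    return blackbody_flux(t_from) - blackbody_flux(t_to)


# One-degree steps at the quoted ocean mean and at the quoted global mean.
ocean_step = flux_change(294.25, 293.25)
global_step = flux_change(289.0, 288.0)
extra = ocean_step - global_step

print(f"flux at 294.25 K : {blackbody_flux(294.25):.1f} W/m^2")
print(f"ocean 1 K step   : {ocean_step:.2f} W/m^2")
print(f"global 1 K step  : {global_step:.2f} W/m^2")
print(f"difference       : {extra:.2f} W/m^2")
```

Under that black-body assumption the difference comes out at roughly 0.3 W/m^2, which is the figure quoted in the comment. Whether that difference says anything about a tropopause imbalance is a separate question the sketch does not address.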

    • I’d be happy if they’d just produce their own surface temperature record.

      • There is no surface temperature record, just a poorly representative hodge podge of unreliable records, which shows some warming, but is not to be believed. As usual you do not understand the scientific point. Do you want skeptics to go back 150 years and establish a reasonable surface temperature recording system, mostly in the ocean? So sorry, no can do. People have proposed a modern system and parts are being built. But going back in time is not an option.

        We do know that the modern satellite system shows the silly surface statistical system is wrong. But that is all we have.

      • Steven Mosher

        weird. The satellite record has been corrected over and over again. At its CORE it depends upon radiative physics models. Those VERY SAME models are used to predict that doubling CO2 leads to 3.7W of forcing. There is no escaping that logic. You accept the satellite record, you accept the physical MODELS used to generate that Data from raw sensor inputs. You accept those models. Those models say 3.7W of forcing from doubling CO2 from 280 to 560ppm.
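For readers wondering where a number like 3.7 comes from: the comment does not say how it was derived, but a commonly cited simplified fit for CO2 forcing is ΔF = 5.35 ln(C/C0) W/m^2. The sketch below assumes that fit (it stands in for the full radiative transfer calculation, which is far more involved) and uses hypothetical function names.

```python
import math

# Coefficient of the commonly cited simplified logarithmic fit, in W m^-2.
# Assumption: this fit approximates, and does not replace, a full
# line-by-line radiative transfer model.
ALPHA = 5.35


def co2_forcing(c_ppm, c0_ppm=280.0):
    """Approximate radiative forcing (W m^-2) for CO2 rising from c0_ppm to c_ppm."""
    return ALPHA * math.log(c_ppm / c0_ppm)


doubling = co2_forcing(560.0)
print(f"forcing for 280 -> 560 ppm: {doubling:.2f} W/m^2")
```

Because the fit is logarithmic, every doubling gives the same forcing under this approximation, whatever the starting concentration.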

      • 3.7W but where? The 3.7 is based on a 33C change from 255K to 288K. 70% of the surface is oceans with a current average temperature of ~ 294K meaning the land mass average is ~ 273K. An “average” radiant layer based on 288K is not physical. The ocean mean of 294K would be a mean flux of ~425Wm-2. An additional 7.4W of forcing would increase the surface temperature by 1.27C if the CO2 forcing hit the radiant layer sweet spot and there were no other feedbacks. The oceans are closest to surface radiant saturation meaning it is unlikely there would be no negative feedbacks. It is more likely that only 3.7W would be felt at the ocean surface indicating a doubling impact of 0.6C. Land surfaces would likely experience the full 1.2 to 1.5 C increase, but that is only 30% of the surface. That puts likely warming on the order of 0.8C plus or minus a touch. The Antarctic temperature is below the average radiant layer temperature so it will likely cool because of the reduced energy transfer to the upper troposphere, kinda like the satellite data indicates.

        None of the data or models are perfect; the combinations that make the most sense are likely the most reliable. That increasing Antarctic sea ice doesn’t jibe with the models or the surface station “models”.

      • Steve, help me understand please.
        I have the impression that the ‘radiative physics models’ you refer to as part of the satellite record are just a tiny part of the larger GCMs. Isn’t that true? One can accept the validity of the modeling used in engineering the various sensors without accepting the validity of the monster GCM models, right?

    • First of all, do you have any vague idea how many millions, if not billions, have been spent on these monster government models? Give me some of that and I can easily produce a model in which negative feedbacks cause the CO2 increase to produce cooling, via cloudiness for example. It is just a model, so you can make it do whatever you want.

      There are also numerous simple models that show all of the warming to be natural. They are ignored.

      You warmers really have no clue as to what the science includes.

      • David Wojick | April 16, 2012 at 6:09 pm |

        IIRC, you already get “some of that” government money for your efforts to promote your views directly to government, with unfettered access and support.

        It’s easy to balance an egg on its end once you’ve been shown how. The power of computing has so expanded since the first of the models were created, the availability of distributed computing power on the Web is so universal, and the number of people with skills of all sorts able to find each other on the Internet so common that there’s no excuse blaming money.

        Heck, isn’t the code available, too?

      • Latimer Alder

        But David is right that we have spent north of $100 billion on ‘climate research’ and a fair chunk of that has gone on climate modelling.

        It is a reasonable question to wonder just how many bangs we have got for all those bucks. And it seems pretty likely that the actual answer is ‘not very many at all’. We have no new insights or results that we didn’t know 20 years ago.

        Even if only 10% of the $100 billion went to modelling that is $10 billion that has disappeared. I wonder where it went?

        Any ideas?

      • ceteris non paribus

        Bangs for bucks:

        Global Climate Models have successfully predicted:
        That the globe would warm, and about how fast, and about how much.
        That the troposphere would warm and the stratosphere would cool.
        That nighttime temperatures would increase more than daytime temperatures.
        That winter temperatures would increase more than summer temperatures.
        Polar amplification (greater temperature increase – compared to former local averages – as you move toward the poles).
        That the Arctic would warm faster than the Antarctic.
        The magnitude (0.3 K) and duration (two years) of the cooling from the Mt. Pinatubo eruption.
        They made a retrodiction for Last Glacial Maximum sea surface temperatures which was inconsistent with the paleo evidence, and better paleo evidence showed the models were right.
        They predicted a trend significantly different and differently signed from UAH satellite temperatures, and then a bug was found in the satellite data, i.e. the models were right.
        The amount of water vapor feedback due to ENSO.
        The response of southern ocean winds to the ozone hole.
        The expansion of the Hadley cells.
        The poleward movement of storm tracks.
        The rising of the tropopause and the effective radiating altitude.
        The clear sky super greenhouse effect from increased water vapor in the tropics.
        The near constancy of relative humidity on global average.
        That coastal upwelling of ocean water would increase.


        100 billion? Pffft.

        Do you also worry that the $3 trillion spent on Iraq might not have caused a sufficient number of bangs?

      • ceteris non paribus | April 17, 2012 at 11:43 am |

        Seventeen out of 22? 22.73% failure rate on predictions?

        Successful enough to validate the premise (ie better than random chance by a wide margin) but not enough to itself be relied on to tell us the exact weather outcomes.

        I have to admit, I’d been skeptical the models would perform so well.

        The types of failings of the models appear mostly to break down as failure to adequately account for and explain major factors of natural variability. This allows us to speculate that the limit of the contribution of natural variability over anthropogenic will approach 23% in the limit over the long term.

        Looked at another way:

        “The globe would warm” would seem a 1:3 proposition (cool, no change, warm).
        “About how fast, and about how much” (equivalent to one metric) is a matter of how closely the model correlates with the observed, as compared to a ‘no change’ projection, so is pretty impressive.
        Trop=warmer+strat=cooler is a 1:9 proposition, so is pretty impressive.
        Night>day is 1:3.
        winter>summer again is 1:3.
        polar amplification is again 1:3.
        Arctic>Antarctic again 1:3.

        Overall getting 17 independent predictions of this sort right is a better than beating one in thirteen million odds.
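The arithmetic behind this kind of estimate can be reproduced in a few lines. This is a back-of-the-envelope sketch only: it assumes, as the comment does, that every outcome is equiprobable and independent of the others, and the labels are paraphrases of the items listed above.

```python
from fractions import Fraction
from math import prod

# Number of equally likely outcomes assumed for each listed prediction.
# Assumption: these are the "1:3" and "1:9" figures given in the comment.
listed_outcomes = {
    "globe warms": 3,
    "troposphere warms and stratosphere cools": 9,
    "nights warm more than days": 3,
    "winters warm more than summers": 3,
    "polar amplification": 3,
    "Arctic warms faster than Antarctic": 3,
}

# Chance of getting all listed items right by luck is 1 in listed_odds.
listed_odds = prod(listed_outcomes.values())

# If all 17 predictions were independent three-way guesses.
all_17_odds = 3 ** 17

# Failure rate implied by 17 successes out of 22 predictions.
failure_rate = Fraction(22 - 17, 22)

print(f"listed items only : 1 in {listed_odds:,}")
print(f"17 three-way calls: 1 in {all_17_odds:,}")
print(f"failure rate      : {float(failure_rate):.2%}")
```

On these assumptions, 17 independent three-way calls would be a longer shot than one in thirteen million, so the comment's figure holds as a lower bound. The result depends entirely on the independence assumption, which is what makes the exponent so large.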

      • Bart, one of the natural variables is the ocean heat content. Depending on how long and how much the little ice age was below average you would have recovery of the ocean heat content which would cause nights warming more than days warming, north pole warming more than equatorial warming, troposphere warming and some stratosphere cooling.

        The stratosphere leveling off since 1994 is definitely not a CO2 increase response. The models consistently miss the leveling off of the strat and the no warming to cooling of the Antarctic. They just predict slower warming in the Antarctic, not no warming in the Antarctic.

        Where the models consistently miss is an indication that the physics is wrong.

      • ceteris non paribus


        Overall getting 17 independent predictions of this sort right is a better than beating one in thirteen million odds.

        I understand your thinking here – But I would suggest that your assumption of outcome equiprobability (as in Laplace’s principle of insufficient reason) is unwarranted.

        We could claim, with no prior knowledge, that the probability of the Sun rising in the East tomorrow is 1/2 (Sun does not rise, Sun rises), OR we could claim that it is 1/5 (Sun does not rise, Sun rises in the East, Sun rises in the South, Sun rises in the West, Sun rises in the North).

        You assume equipartition of the outcomes – But we are in possession of prior knowledge (laws of physics, history).

        When viewed in this light, the model outcomes are not actually independent of one another. That’s the beauty of it – One physical theory gives us all these predictions, not 22 different theories.

        This allows us to speculate that the limit of the contribution of natural variability over anthropogenic will approach 23% in the limit over the long term.

        The fraction of climate models that turn out to be correct is not equivalent to the ratio of natural to anthropogenic variability (of which there are many ratios, of course – you could mean temperature or CO2 abundance). IOW, the fraction of our beliefs (models) that are true is not the same as the relative frequency of their predicted outcomes.

        To suggest that our “success ratio” at modelling is directly related to the degree of physical change due to human actions seems to me a fallacy of equivocation.

      • capt. dallas 0.8 +/-0.2 | April 18, 2012 at 11:14 am |
        ceteris non paribus | April 18, 2012 at 12:02 pm |

        Apologies, I ought have prefaced my post with something like, “Hanc marginis exiguitas non caparet” to reflect the back-of-the-envelope nature of my speculations.

        My intention is to reflect on the greater relative strength of consilience of multiple facets of a single experimental outcome than some readers might infer by he-said-she-said reasoning.

        While more careful examination than I’ve given is needed to verify all seventeen (or more) interconnected claims, the idea of prior knowledge is not appropriate to estimation of odds entirely. For one thing, it would be begging the question to assume the hypothesis right because the experiment was so successful, ergo our ‘prior knowledge’ amounts to certainty, which would certainly be implied by the statement that the results are not independent.

        So we must treat them as independent until we have commonly accepted independent evidence of the nature and kind of their relationships. Frankly, one doubts if people reject the models they are likely to come to common acceptance of much about physics with those who accept the models.

        For another, if the models are wrong, then certainly the outcomes are independent. Which gives rise to a one in thirteen million (give or take perhaps ten million) longshot. I’m not going to stand next to the model in a lightning storm, if so.

        And it’s only one of many ways of looking at the odds, depending on one’s needs.

      • Bart, really complex problems are great at making very intelligent people look like idiots. CO2 is going to make a whole bunch of intelligent idiots :)

        The GCMs are impressive and surprisingly informative. Since they are not truly independent though, there is the likely problem of common failures. Those are the really fun ones to find. Orders of magnitude idiot production.

        When I compared GISS modelE projections (or is it really predictions?), the Antarctic and the tropics were the biggest blunders. The paper by Stevens and Schwartz compares five models in figure 10. All five missed the Antarctic or the tropics. Believe it or not, UKMO was the cream of the crop with MPI, whoever they are, close.

        Most of the model biases are greater than 10Wm-2. That is a lot. The topic of this thread is assessing the model software quality, and that seems to be adequate. Being off over 10Wm-2 isn’t a software bug, it is a lack of understanding of some part of the physics. Something is out an order of magnitude. Trenberth missed 20Wm-2 then claimed the model data was accurate to +/- 0.18Wm-2. Kimoto used Trenberth’s data and his estimate was at least half of what it should be and Kimoto’s paper was on a common error :) The error is likely in the surface temperature data and most likely at the poles with the least coverage and greatest extremes. If everything is based on a common flaw, everything is going to be off but agree with each other remarkably well.

        So, I would get ready for that lightning strike.

        Here is that paper again,

        and here is GISS south pole versus RSS.

        That is a pretty serious discrepancy. So I recommend shorting CAGW :)

      • cnp,
        Don’t forget that GCMs also erased the LIA, MWP and RWP to make it happen. Models always look better when the model promoters can edit the past.

      • capt. dallas 0.8 +/-0.2 | April 18, 2012 at 9:38 pm |

        I’ve never found people to need CO2 to make them into idiots.

        Though I suppose at high concentrations, it could do the job.

        I don’t disagree that the models miss quite a bit.

        However, you’re still left with the pickle that while a common source of error may make several outcomes wrong, nothing but a common source of correct formulation will so widely beat random chance to make many outcomes right.

        I doubt model performance will get much better than it has been. However, I doubted we’d see much improvement in the global temperature record’s utility to confidently assert global warming, and see how BEST proved me wrong there.

      • Bart said, “However, you’re still left with the pickle that while a common source of error may make several outcomes wrong, nothing but a common source of correct formulation will so widely beat random chance to make many outcomes right.”

        Ah, random chance. Arrhenius made the same basic prediction in 1896; 1900 was one of the coldest starts of any new century. Strike one. Callendar in 1938, you have seen Girma’s graphs. Strike two. Then Hansen made his prediction. With three tries, someone finally hit a 1 in 3 chance, but now the observations aren’t agreeing as well with predictions yet again. The same physics for all three, just more refined with each attempt and there is still uncertainty.

      • capt. dallas 0.8 +/-0.2 | April 19, 2012 at 10:12 am |

        You don’t play much baseball, do you?

        A home run counts the same on the third swing as on the first.

        Also, your argument appears to miss out that the bases are loaded because your pitches miss the zone so widely, and so often.

        That’d be four runs for Hansen on a single swing bringing Arrhenius, Callendar and AGW home, while the Dallas team keeps getting stranded on base by its own strike outs.

      • ceteris non paribus


        Apologies, I ought have prefaced my post with something like, “Hanc marginis exiguitas non caparet” to reflect the back-of-the-envelope nature of my speculations.

        No apologies necessary – I like to swing at Fermat’s last theorem as much as the next guy.

      • ceteris non paribus

        It is just a model, so you can make it do whatever you want.

        Please – Make one that wiggles its trunk.

        There are also numerous simple models that show all of the warming to be natural. They are ignored.

        Since you just claimed that you can make models do whatever you want – Why shouldn’t they be ignored?

        You warmers really have no clue as to what the science includes.

        Careful – Your use of the definite article makes “the science” sound kinda “settled”. But of course “the” science can’t be settled – that’s why the models that get ignored are (somehow) correct, and the models that don’t get ignored are wrong. Right?

    • Here is a rock star model. The ‘Secret’ Mac Daddies are wondering if he might be a skeptic too.

      It looks like we have some time yet.


  21. Bernie Schreiver

    The single most famous maxim of the V&V literature is George E. P. Box’s “All models are wrong, but some models are useful.”

    Anonymous imagines a world in which scientific models in general, and climate models in particular, all are provably correct. Needless to say, Anonymous imagines a world of absolute truth and certainty, that is not presently and can never become, the world that we live in.

    The messy democratic process by which competing scientific models earn our trust is, as Winston Churchill said of democracy itself, “The worst of all systems, except for every other system that has ever been tried.” :)

    • Latimer Alder

      We may never get to a point where climate models are ‘provably correct’. But if climate modellers want to start to begin to earn our trust, they could at least pay some attention to the software tools and techniques that have been shown to be effective and necessary in other fields.

      Their smug self-assertion that anything that smacks of taking a professional approach to their work is completely unnecessary for them since they are a very special case does not give confidence that they know their arses from their elbows. Neither does their belief that their towering intellects make them immune from the errors and mistakes that afflict mere mortals.

      And if they really really wanted to build public confidence, they could publish their expected climatic results for the next few years so that we could actually assess how good they are before we have all died of boredom.

      Strenuously avoiding doing so by vigorous handwaving – while still whinging on about ‘Trust Us, We’re Climate Scientists’ – is rapidly eroding any lingering trust they may once have had. The models have not been shown to have any skill at all in forecasting anything. They are a waste of time and effort.

    • Bernie,
      “Anonymous imagines a world in which scientific models in general, and climate models in particular, all are provably correct. Needless to say, Anonymous imagines a world of absolute truth and certainty, that is not presently and can never become, the world that we live in. ”

      I read Anonymous’ words as not commenting on models in general but about a paper that claims scanning a software project’s development and maintenance logs for the words “bug” and “error” as a way to determine if a model is correct – in the real world. For those of us who have built large complex software systems, his words ring quite true.
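To make the objection concrete, here is roughly what a keyword-based defect count looks like. This is a minimal sketch, not the paper's actual procedure: the keyword list, the sample log messages and the line count are all hypothetical, and the function names are invented for illustration.

```python
import re

# Hypothetical keyword list; the paper's actual search terms may differ.
DEFECT_KEYWORDS = ("bug", "error", "fix", "defect")


def count_defect_commits(messages, keywords=DEFECT_KEYWORDS):
    """Count log messages containing a word that starts with any keyword."""
    alternatives = "|".join(re.escape(word) for word in keywords)
    pattern = re.compile(r"\b(?:" + alternatives + r")", re.IGNORECASE)
    return sum(1 for message in messages if pattern.search(message))


def defect_density(defect_count, source_lines):
    """Defects per thousand source lines of code (KSLOC)."""
    if source_lines <= 0:
        raise ValueError("source_lines must be positive")
    return defect_count / (source_lines / 1000.0)


# Made-up log entries, for illustration only.
sample_log = [
    "Fix array bounds error in radiation scheme",
    "Add new ocean mixing parameterisation",
    "Update documentation",
    "Bug: wrong sign in flux correction",
]

defects = count_defect_commits(sample_log)
density = defect_density(defects, source_lines=400_000)
print(f"{defects} defect-like commits, {density:.4f} per KSLOC")
```

Note what the count actually measures: how often developers wrote certain words in a log. A commit mentioning "error bars" would be counted, and a defect nobody noticed or logged would not, which is the weakness being pointed out here.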

      • Agreed. That is probably the most absurd measure of software quality I’ve ever seen. Even at that, they omitted the Eclipse results which I was curious about as I’ve worked on it and it was spearheaded by Erich Gamma, who co-wrote ‘the’ book on design patterns (focused on reusability and bug reduction), as well as JUnit, the standard in unit testing.

        I’m not a huge fan of that code base, but it is well made. I’d be interested in seeing a debate between Erich Gamma and people who don’t even write tests and still claim to have better software than he can write : ).

  22. David Springer

    The ultimate validation of a computer model is of course correctly predicting the future. We have decent models for predicting the flight characteristics of aircraft, for instance, but who in their right mind would produce 100 newly designed jumbo jets before the first one had been flown to make sure the characteristics of the actual craft lived up to model predictions?

    In effect this is what climate boffins insist we do because we just don’t have time to validate the models. Unfortunately for them there have been a number of more conservative minds insisting that some measure of validation is necessary before embarking upon expensive and quite possibly counter-productive courses of action. Even more unfortunately what little validation we have is indicating a seriously flawed product as atmospheric CO2 accumulation has accelerated along the fastest modeled track while the result in actual global temperature is beneath the slowest predicted outcome.

    The models now appear to be seriously flawed and the world should be immensely grateful to those of us who have successfully demanded that more validation be done before expensive action plans are undertaken.

    • Latimer Alder

      Ain’t that the Truth, Brother!


    • Steven Mosher

      ‘The ultimate validation of a computer model is of course correctly predicting the future.’

      nope. you need to specify what “counts” as “correctness” for a given purpose. No science correctly predicts the future. Science “predicts” selected aspects of reality to chosen levels of accuracy. If science were perfect we would call it logic or math. Even then it would be incomplete. It’s always less than complete, always imprecise, and sometimes useful for human purposes.

  23. Many if not most people in the sciences publish work that involves computer programming and simulation. Few if any of them expect to have to go through a VV&A process every time they do so. But on the other hand most scientists operate under the notion that their work is advancement of scientific knowledge, not a canon of “incontrovertible facts,” “settled science,” and clarion warnings to the world that it had better start reordering human activity now to prevent a global disaster. Given how much they want all of us to put on the line, a rigorous VV&A process is the least that we should expect of climate scientists for their software.

    • Latimer Alder

      @bob k

      You forget that the normal rules of science do not apply to self-selected ‘climatologits’. Their superior intellects (apparent only to themselves) have unshackled their visions from such trivial conventions as observations and theories and proofs and evidence.

      With their hectic schedule of hectoring the public, books to publicise, important lawsuits to fight, the well-funded Big Oil denier conspiracy to debunk, protests to attend and planets to be saved, they have risen above such things. Like the guy in North Korea, they merely have time to issue pronouncements and they just expect Nature and The Climate will obey.

  24. Bernie Schreiver

    Latimer, anyone can earn respect by making correct predictions. In particular, the consensus of folks who create climate change models includes the following flagship predictions:

    Prediction I: Continuing imbalance of Earth’s energy budget (as seen by ARGO sensors),

    Prediction II: Accelerating ice-mass loss, in consequence of 1), first in the north, then in the south (as seen by satellite imaging and radar), and

    Prediction III: Accelerating sea-level rise, in consequence of 1) and 2) (as seen both by land stations and by satellite).

    The rationale for these three predictions is simple (the climate codes merely add details):

    Principle I: CO2 is a greenhouse gas whose concentrations are increasing, and

    Principle II: Feedbacks act to amplify the CO2 greenhouse effect, and for this reason

    Principle III: The Earth’s energy budget is presently in a sustained state of imbalance.

    Supposing that you skeptically foresee these three key predictions won’t be verified, because some or all of these three key scientific principles are incorrect … then please suggest concrete alternative predictions, that are supported by concrete alternative principles.

    My own overall prediction is simply this: the above three physical principles are fundamentally correct, and as verification, the above three key predictions will be fulfilled in coming years.

    • Principle II has termites. Feeding. Then comes the back part.

    • Latimer Alder


      Wow. $100 billion of climate research, and the best we can come up with are some entirely qualitative hand-waving predictions without a timescale.

      So far my propensity to do anything at all about AGW based on these predictions remains stubbornly at zero. Try quantifying them and putting up some timescales and I might be prepared to modify that stance.

      PS. The nearer to ‘now’ the timescale, and hence the greater the possibility of validation (or not), the better.

    • “the above three physical principles are fundamentally correct”

      What evidence can you provide that any of the three are true?


      • Bernie Schreiver

        Andrew, way back in 1981, before the modern era of ARGO and satellite gravimetry, Hansen and his colleagues set forth specific physical principles (“validation”) linked to concrete experimental predictions (“verification”) in an article titled Climate impact of increasing atmospheric carbon dioxide.

        Thus in the language of Anonymous, the essential elements of “V&V” with respect to climate change, namely (1) valid foundations in physical science and (2) verified predictions of energy imbalance, ice-mass loss, and sea-level rise, have been reasonably established for three decades and more … readers of Climate Etc. are encouraged to verify this assertion for themselves, by reading Hansen’s 1981 article.

      • Bernie,

        An article is evidence of writers and writing. An article is not evidence of anything having to do with climate.


      • We’re pretty sure it warmed, Bernie, and we’re pretty sure we don’t know why.

      • Hi Bernie,

        Can I ask you directly, do you think counting ‘bug’ words in check-ins over 6 months and dividing by the lines of code in the entire code base is a valid estimation of software quality? Reading comments you seem to be ok with that, yet still knowledgeable about the software process. Just like to confirm as that is what this thread is about.

        I think this is important, the ‘big picture’ stuff people have pretty much decided for themselves already. Measuring and then improving the quality of modelling software and its development processes is something that is attainable without getting into the passions of the debate. A win here would have a positive impact on the science, I don’t think that is a controversial statement.
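For readers who want to see what the metric under discussion amounts to, here is a minimal sketch of a keyword-based defect density calculation as the comment above describes it: count commit messages that match a defect-like pattern, then divide by the size of the code base. The pattern `DEFECT_PATTERN`, the function name, and the sample messages are all hypothetical; the paper's actual search terms and data are not reproduced here.

```python
import re

# Hypothetical keyword pattern (an assumption for illustration only).
DEFECT_PATTERN = re.compile(r"\b(bug|fix(es|ed)?|defect|error)\b", re.IGNORECASE)


def defect_density(commit_messages, total_lines_of_code):
    """Return (defect_count, defects per thousand lines of code)."""
    if total_lines_of_code <= 0:
        raise ValueError("total_lines_of_code must be positive")
    # A commit counts as one defect if its message matches the pattern at all.
    defect_count = sum(1 for msg in commit_messages if DEFECT_PATTERN.search(msg))
    return defect_count, defect_count / (total_lines_of_code / 1000.0)


# Made-up commit messages, purely to exercise the function.
messages = [
    "Fix array bounds bug in radiation scheme",
    "Add new aerosol diagnostics",
    "Update documentation",
    "Fixed sign error in flux calculation",
]
count, density = defect_density(messages, total_lines_of_code=400_000)
print(count, density)
```

Note that the result depends on how people word their commit messages and on the size of the whole code base, not only on how many defects exist, which is the concern raised in the comment.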

      • Latimer Alder

        @bernie schreiver

        Show us some numbers. Not just vague principles.

        Here’s my Hansen-esque weather forecast. Note the lack of numbers or anything more than a soggy mass of platitudes.

        ‘When it isn’t raining or cloudy the sun will probably shine during daylight hours. Nighttimes will be darker and probably colder than the day time. Winds will vary from calm to considerably stronger. We confidently expect that summer temperatures will, on average, be warmer than those during the winter’

        That all of these things are pretty likely to come true does not make me an ace weather forecaster. But it does show that anyone who concludes that I am is a pretty gullible type. And I have a bridge I want to sell them……

        Without testable numeric predictions you have nothing of any value.

      • What you conveniently ignore Latimer is that Hansen predicted in the 80s that the world would warm. It could have cooled or even remained flat. But it didn’t. It warmed.

        Who else predicted warming back then? Not climate skeptics, they didn’t exist until afterwards.

        But once they did exist they played very hard to deny the warming that started to emerge in the temperature records. Even in the early 2000s we had climate skeptics citing a faulty UAH satellite record to cast doubt on the idea the world had warmed since 1980.

        It’s only relatively recently that skeptics have accepted the warming since 1980, generated ad-hoc explanations for it and cast the illusion that they expected the warming all along.

        Another big difference is that Hansen presented a physical explanation for the warming since 1980 that still stands today. There was and still is no physical based natural explanation for the warming. Skeptics who posit a natural explanation for the warming resort to hand-waving about cycles they can’t physically explain the basis for.

        Now we have another interesting situation. This time there’s a big group of people predicting the world will continue warming: those darned climate scientists again. Climate skeptics on the other hand have by and large thrown themselves at a contrarian prediction of global cooling. Again they can’t physically explain it. It’s more hand-waving about cycles.

        It’ll be crunch time again but this time it won’t be so easy to get out of it if climate scientists turn out right. Ironically it’s all those skeptic blog posts predicting global cooling that render history unrewritable.

      • Latimer Alder


        Great. Big Jim made a one in three call (up, down, same) in 1980 and got it right – until about 1998. And seemingly has been trading on that call ever since. We might also give a little credit to good old Svante Arrhenius who made the same call about 70 years beforehand, so Hansen can’t even claim he was original.

        So what has been done in the last 31 years to improve on this work? We’ve spent north of 100 billion dollars and seem to have nothing better than where we were in 1980.

        And I am not really interested in what other guys are saying about the climate. I pay the climatologists to do that on my behalf. But like the old proverb says ‘Trust but Verify’.

        And every time I look at the way the climatologists go about their business I see shoddy standards, laughable amateurism, highly questionable ethical practices and a smug arrogance that shouts ‘shysters’ loud and clear.

        And the discussion here about software quality has only reinforced my idea that most of them shouldn’t be allowed to run anything more complex than a whelk stall. The general conclusion is that climate modellers don’t even understand the concepts, let alone the implementation of any professional practices in this area. And this is their primary job: to write quality code that we can use to help make far-reaching policy decisions.

        And attempting to divert attention by pointing out that there are other amateur bloggers whose ideas may be a bit bizarre does nothing to absolve these supposed ‘professionals’ from their consistent failures to get their act together in a grown up way. They are the ones I pay through my taxes. They are the ones who (for the moment at least) policy makers listen to and they are the ones who should clean up their act. There is an awful lot of cleaning to do.

      • lolwot – So warming, flat, cooling; 1/3,1/3,1/3. The odds were 1/3 that Hansen got warming, not all that impressive. The warming could be wholly natural.

      • Steve Milesworthy

        A lot of warming (compared with the temperature record) estimated to within a few-tenths of a degree, sustained for over 30 years and observationally supported by detailed analysis of oceans, the atmosphere, and the cryosphere is not a one-in-three chance event.

      • Last time I checked, not one of Hansen’s three scenarios was tracking the global atmospheric temperature.

      • Latimer Alder

        @steve milesworthy

        ‘A lot of warming (compared with the temperature record) estimated to within a few-tenths of a degree, sustained for over 30 years and observationally supported by detailed analysis of oceans, the atmosphere, and the cryosphere is not a one-in-three chance event’

        Sounds impressive at first hearing. Especially the bit about ‘detailed analysis of the oceans etc etc’. But let’s cut the crap and boil it down to essentials.

        Hansen’s made some predictions about the temperature and got them right to ‘within a few tenths of a degree’. Since the overall temperature increase is only about 0.6 degrees, then being ‘within a few tenths of a degree’ doesn’t actually sound very good to me. It could be out by a factor of two and still meet your pretty vague test. And I have a sneaking suspicion that if I went back to good old Svante Arrhenius a hundred years ago then he’d have done equally well with log tables and graph paper.

        So no special cigar for Big Jim on that one I fear.

        He also fails on the post-1998 temperatures which don’t seem to be doing anything very much at all.

        So maybe in climatology you consider these predictions to be absolutely top notch and gee whizz spot on magic. Maybe they are genuinely the best that could have been made and we are being too harsh to dismiss this wondrous piece of futurology.

        But I don’t think so. The level of ‘accuracy’ does nothing to convince me that the models Hansen used then are any good and the very very visible lack of any significant testable predictions since then doesn’t make me any more confident that today’s models are any better. If they were, you guys would be shouting your successes from the rooftops rather than keeping a long long way from anything that could be construed as a prediction within a human lifetime.

      • Steve Milesworthy

        Jim2 and Latimer, in summary then the gish-galloping from “one in three” to (paraphrasing) “well he didn’t track the temperature accurately all the way” indicates we are apparently in agreement that the odds of Hansen being right just by chance were lower than one in three.

      • Ah, of course! We should not dare to suggest that the Great Prophet Hansen is anything but perfect. He is right even when he’s wrong.

      • Steve Milesworthy

        Your attempt to raise Hansen to sainthood so you can then knock him down is a common enough debating tactic that it is boring.

        Putting it succinctly, his estimate was good, based on reasonable science for its day. Assessments of his estimate above are based on burying head in sand.

      • It’s you guys that put Hansen on a pedestal.
        Just be careful you’re not standing underneath when he falls

    • Bernie, your so-called predictions and principles are neither, just vague claims. Given that climate is naturally a far-from-equilibrium system, the lack of energy balance is naturally to be expected. Other than that you got nothing, so you got nothing.

    • ‘Prediction III: Accelerating sea-level rise, in consequence of 1) and 2) (as seen both by land stations and by satellite).’

      Er, sea-level rise is not accelerating. In blue we have satellite-measured sea levels and in black the CUSUM (rate) of the detrended data.

      In point of fact the deceleration observed since the summer of 2010 is the greatest observed in the whole record.
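The plot referred to above is not reproduced here, and the commenter's exact procedure is not stated. As a rough sketch of what "CUSUM of the detrended data" usually means, the code below removes a least-squares linear trend from an evenly spaced series and accumulates the residuals. The function names and the sample series are hypothetical, not sea-level data.

```python
def detrend(values):
    """Subtract the least-squares straight line from an evenly spaced series."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a trend")
    mean_x = (n - 1) / 2.0
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return [y - (mean_y + slope * (x - mean_x)) for x, y in enumerate(values)]


def cusum(residuals):
    """Running (cumulative) sum of the residuals."""
    total = 0.0
    out = []
    for r in residuals:
        total += r
        out.append(total)
    return out


# Made-up accelerating series (a pure quadratic), for illustration only.
series = [float(i * i) for i in range(5)]
curve = cusum(detrend(series))
print(curve)
```

For a perfectly linear series the cumulative sum stays at zero, so any sustained drift in the curve reflects departures from a constant rate; how to read that drift is exactly what the commenters disagree about.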

    • Eric Ollivet

      Prediction II has been falsified by observations : sea ice extent (ice mass is difficult to measure accurately enough) has indeed decreased in Arctic but the loss has been compensated by the increase in Antarctic…

      Prediction III has also been falsified by observations : satellites and tide gauges measurements do not show any acceleration of sea level rise.
      Recent data even show a deceleration.

      Principle II has never been demonstrated : positive (i.e. amplifying) feedbacks would result in an unstable climate system, whereas this system has shown its extreme stability over the past millions of years, despite quite large variations of CO2 concentrations. Satellite data (ERBE for instance; cf. Lindzen & Choi) also show that the feedback factor is actually very small (much smaller than assessed by IPCC) and even negative.

      My alternative principle is that Earth climate, at any time scale, is mainly driven by natural processes and cycles (Milankovic cycles, PDO/AMO oscillations, Solar cycles, ENSO oscillations….).

      Based on this principle and on the (reproducible) climate patterns observed over the last 150 years, my alternative prediction is that there will be no significant warming until 2030, and that temperature rise won’t exceed 0.5°C by the end of the century. Corresponding sea level rise will also remain limited and won’t exceed 30 cm.

      Scafetta has published papers showing similar forecasts based on an astronomically based decadal-scale empirical harmonic model, in which he also demonstrates how inaccurate GCMs and IPCC predictions are.

      See you in 20 years and we’ll know who is right…

  25. blueice2hotsea

    1. lack of uniqueness

    A key difficulty to developing an executable theory of climate is the enormous breadth of required knowledge. The theme of Richard Feynman’s Nobel Lecture addressed the unlikelihood of doing successful science in such a circumstance. Feynman offers his advice as to the best approach. At times, the equations must be guessed with no physical basis. And for every area of required expertise the team must be able to provide multiple unique physical and mathematical interpretations of the same phenomena.

    The lack of uniqueness in the results of GCMs suggests to me that individuals are having difficulty representing alternative approaches within their specialty. One solution might be GCM teams with more members and which are more inclusive of scientists working outside consensus group-think.

  26. blueice2hotsea

    2. effects of the Advanced Strategic Computing Initiative (ASCI) project on the development of modern verification and validation methodologies.

    I hope that V&V eventually becomes a significant, even limiting factor in the power of GCMs to usefully simulate long-term climate evolution. It would mean that the more significant underlying problems are nearly resolved. Until that time, perhaps sticking with a more crude approach to V&V can suffice and maybe even be wisest. Too much more than that can be seen as a de facto declaration of impending success, that it’s almost there and soon it will be time to polish that apple and eat it.

  27. Software quality is an irrelevant side issue.
    You judge a model by the quality of its underlying science, the knowledge of the processes involved, the initial data.
    You don’t judge the food by the quality of the pot it is made in.
    I’m saying this as a software engineer.

    • I agree as far as elegance, but I assume we are focused on bugs that affect the outputs. Quality of the pot, not that big a deal. Adding salt instead of sugar, bigger problem.

      Certainly one can lead to the other in real life though, so it is important to have high overall quality for a task like this imo.

      • I can only presume you have never worked in a decent kitchen, or had serious aspirations to be a show-off cook at home. Pots are essential parts of the cooking process – the material and shape of the pot actively influence the way the food cooks and the flavour it takes. The molecular gastronomist school of cooking considers the method of cooking as important as the ingredients.

        In simple terms, to cook a proper Moroccan dish, you need a properly shaped clay container. Otherwise you get a spicy (and probably passable) stewed meat which lacks that edge. And, more obviously, there is a reason why we do not use say leather pots to cook on a gas stove – although they are a perfectly effective way of poaching things.

        In fact, I defy anyone to find a field in which the container in which a process takes place is not important. Last time I checked, there was an entire field of science dedicated to researching the best materials in which to undertake particular processes. Why this same principle does not apply to software models, I cannot see – it does not matter how good the model is if the container (the code) is injecting something unwanted in there.

      • Indeed I have never worked in a kitchen. I stand corrected, and informed : ).

  28. EternalOptimist

    apologies up front, I have not read all the comments above.
    I am a developer, I build massive applications and have done for the last 20 years.
    They are full of bugs at first, and I do a pass to remove as many as possible.
    Then there is an iterative process that includes the team, the users and testers. and it never ends. ever.
    But in the sense of climate models, as discussed here, I get the sense that you are not talking about what I am talking about.

    I get the sense that you are focussing on the algorithms and methodology that drive the output, and there is nothing wrong with that. That has to be right.

    But please be aware, it is only a tiny fraction of my job

  29. As the climate models don’t accurately describe the known observed data with its sinusoidal characteristics and the accompanying turning and inflection points, their chance of predicting the unknown future is nil.

  30. The world’s very best Climate Model is Mother Earth. The Climate of the past ten thousand years is the very best Climate Model for the next ten thousand years. If any Climate Models Forecast temperatures or sea levels outside the range of the past ten thousand years, based on a manmade fraction of a trace gas, those models are flawed and the theory that they are based on is flawed. If you really want to understand climate, read my theory.
    This is long, redundant and possibly boring to many of you. I explain multiple times, using different words to reach different people. If any can prove any of this to be wrong please do so. If any want to disagree with any of this, please do so. I am not a consensus scientist. I am an engineer. I am skeptical.
    If I am wrong, I do want to adjust my theory. I have worked very hard on this for four years and I am still learning. My science is not settled, I do not yet know everything. I suspect that my science will never be settled. I suspect that I will never know everything. I just do not understand how the Consensus Settled Climate Scientists ever achieved ultimate knowledge and still make flawed forecasts. I give thanks for Dr. Curry and other Skeptical Climate Scientists. There is still hope.

  31. Harry (not that one)

    I am flabbergasted with a number of reactions to this blog. Testing the quality of software by counting the bug reports is something I would not have imagined to be possible in 2012?

    The GCM models are trying to hindcast the past global temperature development. They will never be able to do so.
    Global temperature is artificial in itself. It contains a model which transforms data from individual stations, at unique timestamps, into a global average.

    What a GCM should be able to do, is provide it with a starting situation and location, and rerun the temperature record, using all available data. Then look at the similarity (correlation will be impossible).

    And about all the fuss regarding coding and testing practices: the GCM verification and validation is garbage. These mechanisms are not implemented, every change is ad hoc and mostly undocumented.
    Parametrization allows one to tweak even malfunctioning models to track the desired trace.

    Openness, documentation could solve this problem.

    Alas, there is no openness.

    • Latimer Alder

      ‘Openness, documentation could solve this problem’

      Well it ain’t ever going to happen then until the current ‘leaders’ of climatology have all retired from active service. As a matter of principle they are vehemently opposed to openness. It might lead to people asking awkward questions. Even people who are not climatologits. And that would never do.

      Like all influential cliques throughout history they love the power but hate the accountability.

    • yes, but this report proves that their quality is much better than that of open software – so openness would only make it worse.

      Yes, I am being sarcastic.

  32. Much of the discussion regarding climate models is nonsense.

    What is important is for each model’s developer to clearly state what attributes the model is designed to accurately predict (within what margin of error) over what timeframes. If that is done we can easily judge the reliability of a model for ourselves.

    Take a look at the current GCMs and try to explain why these models require such a huge margin of error for attributes they are designed to predict so near into the future. You should typically expect a model to have very tight margins in the near term and growing error budgets the farther into the future they are trying to forecast.

  33. Judith,
    It’s a pity you didn’t let me know you were going to write about our paper – I could have provided a response to your “anonymous” reviewer.

    There’s a few wonderful insightful comments in the discussion thread here, but it will take me some time to tease them out from the noise – It’s a shame many of your commenters don’t seem to have read our paper before weighing in with an opinion. Plus I’m busy planning a trip, so won’t have time to join in the discussion for a while.

    In the meantime, perhaps you could elaborate on what you mean by “My personal take on this topic is more in line with that presented by the anonymous reviewer than that presented by Pipitone and Easterbrook.” That’s very evasive. Having read our paper, do you have specific comments of your own to make? As you’ll know, it’s a very preliminary study, and much of the paper is concerned with how we would go about answering questions about software quality. I hope that you and your commenters can offer constructive suggestions on this – unfortunately, many of them seem to be using it just as an excuse to air their prejudices about climate modeling. As an experiment in “blog review” I’m not very impressed so far, but perhaps it’s too early to draw conclusions. I’ll check back in a few days when I have more time – hoping for some useful insights from you.

    • Steven Mosher

      I’ve read your paper.

    • Steve,

      I’ve been developing software for years, and like most developers have an interest in measuring quality and reducing bugs. So I read your paper very carefully, at first with interest, and eventually just to be sure I wasn’t seeing things. You offer a regex over six months of commit messages as bug count proxy, then compare that to the total code size. This result is supposed to be related to quality. You even offer hypotheses to explain the ‘high quality’ of the models.

      If that is where you start even as a preliminary study, I would suggest a better starting place would be with someone else. Quality analysis of software is not a new field, but even on the first day it was further ahead than this. Sorry to be so harsh, but your methodology is absurd.

    • Steve

      I read your paper and do not believe it offers a valid approach to determining the quality of the various programs. The approach you are using seems to ignore the key issue- how well has the software accurately forecasted the future conditions it was designed to predict.

    • Steve, I will go through the comments and weed out the noise. We look forward to your response when you have time.

    • For those that might be interested, Easterbrook and I had an exchange on this subject here:

  34. Albert Stienstra

    DocMartyn | April 16, 2012 at 12:41 pm |

    “Second, it is a mistake in modeling is to expect computer models to reveal new knowledge”
    That is the most insane thing I have read in a while.
    Now this is really insane. Of course computer models do not reveal new knowledge, whether in science/technology fields or others. They do not even represent reality, because it is impossible to fit all aspects in one model.
    I am very surprised you all let this stand.

  35. Eric Ollivet

    The software (i.e the code) and its quality are just a small part of the issue.
    Indeed the crux of the matter is the science that is behind.
    The code can be perfect but if the science behind is corrupted then the model is just bullshit.

    What actually makes sense is not code defects counting but model Verification & Validation i.e. the validation of the science that is behind.

    As ridiculous as it may appear, none of the climate modellers has ever applied any V&V process / standard. And as a matter of fact, none of those nice climate models has ever been formally validated. Climate models are only “inter-validated” (see previous posts on Climate etc…) but inter-validation is definitely not the state of the art for models’ validation, especially for such complex ones. A rigorous model’s validation process requires a confrontation between model’s outputs (for various runs with different sets of inputs / border conditions) and tests’ data (obtained with same inputs / border conditions).

    The very inconvenient truth is that AGW dogma is fully based on climate models but that none of these models would ever be able to pass any V&V process. Those complex models have proven scientifically inadequate in predicting climate only 1 or 2 decades in advance since their outputs are daily rebutted by observed climate data:
    (a) None of these models is able to reproduce the observed cooling trends over [1880 – 1910] and [1940 – 1970] periods
    (b) None of these models is able to reproduce the observed warming trend of 0.15°C/decade over [1910 – 1930] period, that is actually equivalent to the one observed over recent [1970 – 1998] period.
    Models only reproduce a warming trend of 0.06°C/decade that is almost 3 times lower than observed one.
    (c) None of these models has been able to forecast the pause observed since 1998 nor the slight cooling observed since 2002: all of them have predicted a warming of 0.25°C minimum over the past 15 years.

    The very bottom line is just that AGW theory is just a dogma resulting from corrupted & fraudulent science.

  36. Harry (not that one)


    I read your paper, and one of the things that struck me was table 1, copied from Pfleeger and Hatton 1997. The C language stands out for its low profile, while assembler ranks equal to Fortran and various languages used at IBM. In my humble opinion, this only indicates that a type-checking language reduces coding errors.

    Nothing more.

  37. Which is what bugs me about some climate modelers. The anomalies are the interesting part. The troposphere hot spot is not happening over the oceans, that is an interesting anomaly. The Antarctic is not only not warming, it is cooling, that is an interesting anomaly. A small solar variation of about 0.25Wm-2 is discernible with nearly 80% of the CO2 equivalent forcing, that is an interesting anomaly. The rate of convection is greater than anticipated, interesting anomaly. Mixed phase clouds in the Arctic have much greater impact at lower temperatures than anticipated, interesting anomaly. The anticipated cloud forcing in total appears to be of the opposite sign than anticipated, interesting anomaly. At least the code itself doesn’t have very many defects :)

  38. The core maths of computer programs is relatively easy. The user interface takes a large chunk of the code and generates lots of bugs. GCMs are predominantly math, which makes the code *relatively* simple.

    Many people here agree that the verification and validation is lacking.

    I have not seen a sensitivity analysis of the various input parameters. All of them have their own errors – measurement errors, statistical confidence bounds, etc. A sensitivity analysis could also be a way of testing the face validity of the models.
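As a rough illustration of the simplest kind of sensitivity analysis the comment has in mind, the sketch below perturbs one input at a time by its stated uncertainty and records how far the output moves. The function `toy_model`, the parameter names, and all the numbers are hypothetical stand-ins; nothing here is taken from an actual GCM.

```python
def toy_model(params):
    # Hypothetical stand-in for an expensive model run.
    return 2.0 * params["forcing"] + 0.5 * params["albedo"] - 1.0 * params["mixing"]


def one_at_a_time_sensitivity(model, baseline, uncertainties):
    """Return {name: (change at baseline - delta, change at baseline + delta)}."""
    base_output = model(baseline)
    result = {}
    for name, delta in uncertainties.items():
        low = dict(baseline)
        low[name] -= delta
        high = dict(baseline)
        high[name] += delta
        result[name] = (model(low) - base_output, model(high) - base_output)
    return result


# Made-up baseline values and uncertainty half-widths.
baseline = {"forcing": 1.0, "albedo": 0.3, "mixing": 0.2}
uncertainties = {"forcing": 0.1, "albedo": 0.05, "mixing": 0.02}
sensitivity = one_at_a_time_sensitivity(toy_model, baseline, uncertainties)
for name, (down, up) in sensitivity.items():
    print(name, down, up)
```

One-at-a-time perturbation misses interactions between parameters, which is one reason fuller designs are used when run time allows.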

    • Most of these bounds are unknown. That is the point of skepticism. If the warming is natural then the models are completely wrong. This is not a sensitivity analysis.

      • Steven Mosher

        Have you even looked at input parameters? I guess not.

      • Here are some random samples from the GISS ModelE code fwiw (there are tons – thousands of lines plus read in data). Hard to tell where the values are coming from for most of them, but I guess if you are in the know.

        C    1     2     3     4     5     6     7     8     9    10    11
        1 .500, .200, .267, .267, .233, .300, .200, .183, .267, .000, .200,
        2 .500, .206, .350, .300, .241, .218, .200, .183, .350, .000, .200,
        3 .500, .297, .364, .417, .297, .288, .250, .183, .364, .000, .200,
        4 .500, .255, .315, .333, .204, .218, .183, .183, .315, .000, .200,

        real*8, dimension(13) :: AERMIX=(/
        C Pre-Industrial+Natural 1850 Level  Industrial Process  BioMBurn
        C ---------------------------------  ------------------  --------
        C   1    2    3    4    5    6    7    8    9   10   11   12   13
        + 1.0, 1.0, .26, 1.0, 2.5, 2.5, 1.9, 1.0, 1.0, 2.5, 1.9, 2.5, 1.9/)

      • Forest, meet tree.

      • If scientists were to admit to this, how would they ever be able to make a good living?

    • There are two kinds of uncertainties in models. A primary uncertainty is whether we have chosen the correct set of natural processes to model the phenomena of interest. The second uncertainty is whether we model selected processes precisely enough. I suspect that we only have extremely crude estimates of this uncertainty. That’s probably why IPCC shows “uncertainty bounds” as the maximum and minimum values of 85% of models. Long live science.

      With this uncertainty an almost-forgotten computer technology could help. We should run the models not in a 32- or 64- or 128-bit precision, but in a 32- or 64- or 128-bit “interval arithmetic”, where each quantity is represented as a [lower bound, upper bound] pair. As an example, 1/3 would be represented as [1,1]/[3,3] = [0.3333,0.3334] because of rounding errors.

      The intervals capture not only an uncertainty resulting from a finite precision of computer calculations, but also a natural variability or uncertainty of data, be it a variation of a wind speed over an area, or a temperature distribution in a grid cell. Thus, a thermometer reading 12.0 probably means [11.9,12.1], or with a really good thermometer [11.95,12.05].
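
      A minimal sketch of the idea in Python, not a production interval library: the `Interval` class and its `_outward` helper are hypothetical names, and widening every result by one floating-point step with `math.nextafter` (Python 3.9+) is an assumed, cheap substitute for true directed rounding in hardware.

```python
import math
from fractions import Fraction


class Interval:
    """Closed interval [lo, hi], widened outward after every operation."""

    def __init__(self, lo, hi=None):
        self.lo = float(lo)
        self.hi = float(lo if hi is None else hi)

    @staticmethod
    def _outward(lo, hi):
        # Step one representable float down and up so the exact
        # result is guaranteed to stay inside the interval.
        return Interval(math.nextafter(lo, -math.inf),
                        math.nextafter(hi, math.inf))

    def __add__(self, other):
        return Interval._outward(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other):
        # Take the minimum and maximum over all endpoint products.
        products = [a * b for a in (self.lo, self.hi)
                    for b in (other.lo, other.hi)]
        return Interval._outward(min(products), max(products))

    def __truediv__(self, other):
        if other.lo <= 0.0 <= other.hi:
            raise ZeroDivisionError("divisor interval contains zero")
        quotients = [a / b for a in (self.lo, self.hi)
                     for b in (other.lo, other.hi)]
        return Interval._outward(min(quotients), max(quotients))

    def __repr__(self):
        return f"[{self.lo!r}, {self.hi!r}]"


third = Interval(1) / Interval(3)      # encloses the exact value 1/3
reading = Interval(11.9, 12.1)         # thermometer reading 12.0 +/- 0.1
doubled = reading + reading            # uncertainty propagates through the sum
```

      Here `third` brackets the exact 1/3 between two neighbouring doubles, and `doubled` carries the thermometer uncertainty through the addition. Done in software like this, every operation costs several floating-point operations plus the widening step.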

      On today’s computers the implementation of interval arithmetic with its required rounding up and down is extremely slow – slow enough to be completely impractical. But it should be relatively easy to design an interval arithmetic floating-point processing unit. We only need to run several FPUs in parallel, get both rounded-up and rounded-down results, and select maximums and minimums, e.g.,

      The design would only cost millions, not trillions (and, if mass-produced, the processor would be probably less than ten times more expensive than today’s floating-point processing unit). How much is a Met office’s new supercomputer?

      Even better would be a “triplex arithmetic” proposed by Prof. Nickel in the 1970s – represent a quantity as a triple [lower bound, standard result, upper bound] which makes it easy to compare model runs with older models.

    • Steven Mosher

      Yes. I have seen some sensitivity analysis run. But a full parameter test grid is not feasible due to runtimes. The approach is to run fractional factorials and create an emulation. Then run the emulation on the full space. Then check the emulation with additional runs on the full model.
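
      The workflow described above can be sketched roughly as follows. Everything here is a toy assumption: `expensive_model` is a hypothetical linear stand-in for a long model run, the three parameters take only two coded levels (-1/+1), and the emulator is a plain main-effects fit rather than whatever emulation technique a modelling centre actually uses.

```python
from itertools import product


def expensive_model(a, b, c):
    # Hypothetical stand-in for a long model run; a real GCM is not linear.
    return 2.0 + 1.5 * a - 0.5 * b + 0.25 * c


# Half-fraction 2^(3-1) design: two coded levels per factor,
# with the third factor set by the generator c = a*b.
design = [(a, b, a * b) for a, b in product((-1, 1), repeat=2)]
runs = [expensive_model(*point) for point in design]

# The design columns are orthogonal, so each coefficient is just a
# signed average of the run results.
n = len(design)
intercept = sum(runs) / n
coeffs = [sum(point[i] * y for point, y in zip(design, runs)) / n
          for i in range(3)]


def emulator(a, b, c):
    # Cheap surrogate fitted to the fractional-factorial runs.
    return intercept + coeffs[0] * a + coeffs[1] * b + coeffs[2] * c


# Evaluate the emulator over the full 2^3 grid, then spot-check it with
# extra runs of the full model at points that were not in the design.
full_grid = list(product((-1, 1), repeat=3))
held_out = [p for p in full_grid if p not in design]
max_error = max(abs(emulator(*p) - expensive_model(*p)) for p in held_out)
```

      The emulator reproduces the held-out runs only because the toy model has no interactions. In a half-fraction design the main effects are aliased with two-factor interactions, which is one reason the check against additional full-model runs matters.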

    • The paper by Farrell et al is interesting:
      P. E. Farrell et al.: Automated continuous verification for numerical simulation

      In relation to GCMs in section 2.3, they say:

      2.3 The limitations of testing
      It has been noted elsewhere Oreskes et al. (1994) that complex geoscientific models such as GCMs may be formally unverifiable simply because they do not constitute closed mathematical systems. Aspects of these models, particularly parameterisations, may be difficult to formulate analytic solutions for and may not, in fact, converge under mesh refinement. Nonetheless, individual components of models considered in isolation, for example the dynamic core or an individual parameterisation, must have well-defined and testable mathematical behaviour if there is to be any confidence in the model output at all. If there is to be any confidence in the output of a formally unverifiable model, it is surely a necessary condition that each verifiable component passes verification. The methodology explained here is therefore useful in at least this context. The automated verification of code stability (i.e. that the model result does not unexpectedly change) must also be regarded as a key tool in the verification of the most complex models.

      Surely this is the next hurdle of software quality of GCMs.
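
      As an illustration of what verifying one component in isolation can look like, here is a minimal sketch of an order-of-accuracy check under refinement. It is not taken from Farrell et al.; the function names are hypothetical, and the "component" is just a central finite difference whose theoretical order is 2.

```python
import math


def central_difference(f, x, h):
    # Second-order accurate approximation to f'(x).
    return (f(x + h) - f(x - h)) / (2.0 * h)


def observed_order(f, exact_derivative, x, h):
    # Halve the step and measure how fast the error shrinks.
    coarse_error = abs(central_difference(f, x, h) - exact_derivative(x))
    fine_error = abs(central_difference(f, x, h / 2.0) - exact_derivative(x))
    return math.log2(coarse_error / fine_error)


order = observed_order(math.sin, math.cos, x=1.0, h=0.1)

# Automated check: fail loudly if the scheme no longer converges at the
# rate the numerical analysis says it should.
assert abs(order - 2.0) < 0.1, f"expected second order, observed {order:.3f}"
```

      Run automatically on every code change, a check like this catches a component that silently stops converging at its designed rate, which is the kind of test that remains possible even when the full model is formally unverifiable.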

  39. Joshua Stults in this post noted an up-coming report from the National Academies.

    And with this post, notes the availability of this pre-publication report

    A summary in pdf is here.

    The complete report in PDF, Assessing the Reliability of Complex Models: Mathematical and Statistical Foundations of Verification, Validation, and Uncertainty Quantification.

    Climate modeling is mentioned.

    The V&V and UQ procedures and processes are those originally developed by Roache, Oberkampf, Roy and co-workers.

  40. DocMartyn quoted me in an unfair way – giving only part in the statements:

    while leaving off the full quote: “Second, it is a mistake in modeling is to expect computer models to reveal new knowledge. They may, but it is likely they will reveal only our scientific expectations.”

    Because of the errors (other than the quote), I’d like to elaborate the point I made. One great question of mathematical physics is whether the point mass model of the solar system, a Newtonian N-body problem, is stable. If that could be established, it would represent new knowledge about the N-body problem and open new areas of research on the motion of the physical solar system. That new knowledge could never be accomplished by any computer model. Nor could a computer model disprove stability.

    Any computer model of this problem isn’t even the math model of the problem because it involves approximations of non-linear differential equations whose errors will grow by reason of the non-linearity and eventually computational solutions will diverge from the mathematical one.

    Even though we can build SW that can guide us around the moon and planets, that SW creates no new knowledge about the math N-body problem. But SW orbital models have helped us to understand our lumpy earth by the deviations that earth produced on satellites which we demonstrated we understood by modeling the lumps and matching prediction to experience with nature.

    GCMs can’t predict the climate next year — nature is telling us that we don’t understand it well even if the SW has zero defects. There are a lot of reasons for that failure — failure to understand the science is the concern now, not SW quality.

    • Bernie Schreiver

      Philip, I definitely agree that numerical integrators for Newtonian dynamical systems (from molecules to planets to galaxies) illustrate many of the points under discussion. But the main lessons-learned (I would argue) include:

      (1) Development efforts that are small, diverse, rapid, and democratic consistently excel relative to development efforts that are large, monolithic, slow, and structured.

      (2) Fundamental theory learns similarly much from numerics, as numerics learns from fundamental theory.

      The classic example is the synergistic relation of (numerical) symplectic integrators to (fundamental) KAM theory. This is a case where multiple rapidly-developed numerical simulation codes stimulated similarly rapid advances in KAM theory … a classic case of “small-and-fast is beautiful.” :)
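
      For readers who have not met symplectic integrators, a minimal sketch of the difference they make, using a unit harmonic oscillator as an assumed toy problem (the function names, step size and step count are illustrative choices, not from any particular code):

```python
def energy(q, p):
    # Total energy of a unit-mass, unit-frequency harmonic oscillator.
    return 0.5 * (q * q + p * p)


def explicit_euler(q, p, h):
    # Both updates use the old state; energy grows every step.
    return q + h * p, p - h * q


def symplectic_euler(q, p, h):
    # Update the momentum first, then the position with the new momentum.
    p_new = p - h * q
    return q + h * p_new, p_new


def final_energy(step, h=0.1, n_steps=1000):
    q, p = 1.0, 0.0
    for _ in range(n_steps):
        q, p = step(q, p, h)
    return energy(q, p)


initial = energy(1.0, 0.0)
drift_explicit = final_energy(explicit_euler) / initial
drift_symplectic = final_energy(symplectic_euler) / initial
```

      With these settings the explicit scheme's energy grows by orders of magnitude, while the symplectic variant stays within a few percent of the starting value, because it exactly conserves a nearby modified energy.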

      In contrast, I am not aware that *ANY* significant advances in our understanding of Newtonian dynamics vis-a-vis KAM theory have *EVER* been associated to the implementation of formal V&V methods.

      • Bernie,
        Thanks for your thoughtful reply. Your KAM/numerical analysis example has a number of differences from climate modeling that suggest it is a poor notion to apply there.

        First, the Newtonian N-body model as a mathematical abstract of a problem in physics has been well developed for several centuries. In it, physical realities have been abstracted away in an understood way (the earth is a point mass, not lumpy for example). So, we note that there are physical properties stripped away for the mathematical model. If implemented in a computer code, the N-body problem becomes a new computational model which might deviate from the math model in significant ways.

        So, second, the computer implementation has more than implementation errors to worry about: it also needs to be concerned with whether the method of computation can produce the result (say the state of the N-body system some thousand years or one year into the future) with sufficient accuracy. It’s a distinct problem to determine whether the computer implementation is suitable for the intended purpose.

        In GCMs it appears we’ve jumped to computer models with no statement of a mathematical model (with all the equations and assumptions) to serve as an abstract of the physical model being studied. This conclusion seems justified by the recent excuse of aerosols for the failure of the earth to warm as predicted. Obviously, we’re jumping into computations without properly formulating the problem. And, we’re making no real effort to characterize the limits of the predictive power of the GCM computations with our accepted physical assumptions (in a math model). And where is the effort to determine GCM suitability?

        It might be useful to note that some ideas take decades (and the right minds) to develop. Throwing money at Newton would not have created KAM.

  41. My opening paragraph just posted was truncated — it should have been: [DocMartyn quoted me in an unfair way – giving only part in the statements: “Second, it is a mistake in modeling is to expect computer models to reveal new knowledge”
    That is the most insane thing I have read in a while.
    This is precisely what models in science based fields do all the time. Indeed, the major difference between a model and a fit is that models are predictive in multidimensions. ]

  42. Looking at ModelE code from GISS (on Steven Mosher’s recommendation)

    I’m surprised how ‘uncomplex’ this code is, around 70k LOC but a lot of that is just parameterization data or housekeeping type code. There are modules for the basic systems (ocean, ice, radiation, clouds, snow, vegetation etc). (I can’t help thinking climate can’t be anywhere near as simple as that, but I’m not an expert there!) It is definitely loaded with constant values that represent assumptions. I didn’t see weightings of confidence in the various numbers, but given the naming system I can’t rule out that being there. I’m guessing this may have been C1 in the report based on it being open, and the line count. Just a guess though.

    There are literally thousands of cryptic variables and no descriptions of them that I can find. There are no tests, but they do seem to pay careful attention to conservation of energy and water. There is no typing in Fortran, though almost everything seems to be a double anyway : ). There are plenty of commented out sections without explanation. Various sections can be turned on and off during runs.

    There certainly were bugs and crashes that were fixed (a good thing), from the faq:
    “modelE fixed a number of bugs (some major, some minor)…”
    or more worryingly:
    “modelE has made explicit most of the dependencies that were hidden in model II’. Thus minor changes are less likely to have devastating knock on effects.” <- this is more a sign of poorly designed software

    It has the feel of a large personal project rather than rigorously engineered software. It would be pretty hard for someone to evaluate the logic of what is being done (at the low levels) due to the naming and lack of comment/doc guidance. I would guess validation couldn't be done by anyone not intimately familiar with the papers various sections must have been based on (though usually references aren't listed).

    Open source projects like Eclipse are not exactly the high bar in software quality (eg – not like code for nuclear reactors). That said, this really doesn't compare very favorably at all. It was made and evolved as a tool, so the messy process and code may well be justifiable (not all code need be perfect), but to call it 'high quality' by any standard seems a stretch. It is flexible and it seems fairly easy to manipulate for different inputs and outputs, I'm sure those who use it find it useful. I'm sure there are plenty of bugs left in there too, like in all code of this nature.

    This speaks nothing to the accuracy of the output, other than bugs would reduce it. (It seems in measuring, mistakes can be averaged and will often cancel each other out. This is most certainly not the case in code.)

    • Latimer Alder

      Perhaps it is worth reminding ourselves of the trials and tribulations of Ian Harris at UEA/CRU while he was trying to make some sort of sense of the code and data that they had accumulated.

      Caution: Not for IT guys who are easily shocked. Nor for those who imagined that looking after the most important dataset in the world would be a cool and calm professional organisation. With documentation, audit trails, archives, backups…all that serious grown-up IT stuff that it’s taken us 60 years to learn the hard way.

      Nope…this is back to the earliest days when IT was ‘data processing’ and you dreaded dropping your stack of punch cards and getting them out of order. The place where IT meets Climatology. And it is not a marriage made in heaven!

      Raw stuff here:

      Selected highlights with hilarious (but profane) commentary here:

  43. Latimer Alder

    Seems to me that the discussion here has become a bit polarised.

    I may be wrong but I think it breaks down into two camps.

    Hardcore lifetime academics who see the climate models and their processes and practices as pretty good under the circumstances of them being difficult and complicated physics and not even understandable to outsiders without a doctorate in Radiative Physics. Their basis for comparison of good/bad/indifferent is with other climate models or other academic modelling projects.

    And professional IT engineers who view it very differently. They see them as just another IT project that happens to be about climate modelling. And their basis for comparison is much wider…commercial systems, highly regulated systems, operating systems, mission critical 24×7 systems and the like. And they know quite a bit about what causes IT projects to be successful or not. About good practices and bad ones. And – like anybody else – they discuss the horror stories with their peers and colleagues. And even write papers about them. They have a wide and varied base of expertise…not just in one very narrow very specialised field.

    So it should be very very worrying indeed to the climate modellers that they are getting pretty near unanimously negative criticism from the IT professionals about their IT practices. And this cannot just be explained by the plaintive cry that ‘climate models are so different that we must be judged only by (our own) different standards’.

    It may be that there are some very clever scientists writing climate models who understand the partial differential equations of radiative physics better than anybody else on the planet. But turning that into basic code is probably less than 10% of the work needed to get a successful implementation of any project.

    It is in the other 90% of the task that the climate modelling efforts are seen to be severely lacking.

    • Bernie Schreiver

      Latimer, as was previously noted, the experience of the molecular modeling community is precisely opposite to your recommendations. That is, in head-to-head competition, the most effective software development strategies are relatively smaller, faster, less formally structured, and more democratically organized.

      Formalized V&V methods claim to be better in principle … yet it is far less evident that they are better in practice. Experience suggests the opposite.

      The reason for this is partly common-sense. Ten rapidly, creatively, and democratically developed simulations — two of which are deficient, and two of which are brilliant — are easily superior in speed and quality to one slowly, rigorously, and rigidly developed simulation, whose performance is mediocre.

      Think of the distinction as creative capitalism versus soviet communism, and you’ll appreciate why, in practice, rigid V&V methods are seldom attempted and often perform poorly.

      • Latimer Alder


        Sorry – you are missing my point. It may well be that for the 10% of an IT project that is the actual creation of the algorithms that you are going to use then the method you describe is more effective. I don’t have an opinion about that. If it works..then great..use it.

        In IT we call these ‘skunk works’ and they are great for rapidly knocking up demonstrations or to show a client a ‘proof of concept’. Quick and dirty – and they are good ‘teasers’ or advertisements for the real thing.

        But they are only at the very beginning of the much much bigger job of making solid, reliable, maintainable ‘industrial strength’ software that can function well in different circumstances and meet a variety of needs. They are, at best, early prototypes of what such software might look like.

        It may be that you are content in your own mind that climate models should be no better than prototypes that might be capable of running without a failure under the careful tending of the authors (*). And I wouldn’t give a monkey’s about it if it weren’t for the inconvenient truth that it is only the climate models that give us any idea at all of there being an imminent climate catastrophe. There are no observations that lead us to this conclusion. No experiments. Just the say-so of the models.

        And if we need to restructure the entire world economy – with huge expense and disruption of billions of lives to do so, I would like to have some faint assurance that we are taking these momentous decisions on rather better advice than from some code written by ivory-towered Fortraners with no professional IT training or experience.

        Climate models are IT projects. They should be done to professional IT standards. If you wish to keep them in the state of being academic playthings, fine. But don’t simultaneously pretend that their results are anything other than toys. You cannot have your cake and eat it.

        (*) Been there, done that, got the t-shirt 35 years ago…in Algol!

      • Latimer Alder

        And just in case I haven’t made it clear, IT professionals are just as lazy and prone to take short cuts as anybody else. But our collective experience of hundreds of thousands of projects is that if you spend that bit of extra time at the beginning to do it right, – not just quick and dirty – you save unimaginable amounts of pain, grief, anguish and money later on down the line. Been there, worked the continuous support shifts, had the clients foaming at the mouth with rage and sobbing hopelessly with frustration. And the 500 angry lorry drivers wanting to get out on their rounds…. Trust me, it is better to do it right.

        And if that means a bit of extra work for the programmers…and maybe some new skills for them too, then that’s just tough. It’s only the lives of seven billion people that are being affected by the results. Don’t you think that it’s worth trying to get it right?

      • +100. I’m a programmer. The cost of fixing bugs in production is much, much higher than catching them in unit testing, system test, or even user acceptance testing.

      • And I might add, the cost of a bogus climate model will be huge both to our personal freedoms and money.

      • Bernie and LA:

        This exchange is great.

        LA, I think your initial comment directly above is misleading b/c you focus on the supposed difficulty of the subject and expertise of the experts, whereas Bernie’s point is more about the competitive, multi-team approach. LA, I think your major point comes through better in your response.

        Bernie – can you give an example of how the molecular modeling you described has been solidified into practice? A winning team goes public…or something.

      • Bernie Schreiver

        BillC, a well-respected starting reference for “small-fast-democratic-competitive” simulation practices is the textbook by Frenkel and Smit titled Understanding Molecular Simulation: from Algorithms to Applications, which is a compendium of “best simulation practices” extracted from hundreds of different projects and articles. For an up-to-date snapshot of this rapidly evolving discipline see the CASP web pages (as was referenced earlier).

        Similarly, a well-respected starting reference for “large-slow-monolithic-monopolistic” coding practices is the Lockheed-Martin Corporation’s Joint Strike Fighter Air Vehicle CPP Coding Standards for System Development.

        The difference between these two cultures is captured by a job interview that a friend of mine had with a large aerospace corporation:

        Interviewer “If we hire you, you will be expected to write, each working day, two lines of fully documented code.”

        Friend “No problem. I can write many more lines of code than that!”

        Interviewer “You don’t understand. You will be required to write precisely two lines of code each working day. No more, and no less.”

        Some folks (including almost all mathematicians and scientists) strongly prefer “small-fast-democratic-competitive” coding efforts, and other folks strongly prefer “large-slow-monolithic-monopolistic” coding efforts … this is largely a matter of individual taste.

      • Bernie,

        You didn’t answer my question. You can’t refer me to a textbook to explain how for instance, a winning team has gone commercial with its code.

        A better example than Lockheed Martin or CASP, would probably be Google.

      • Bernie Schreiver

        BillC, you need to appreciate that the present-day dominant business model in molecular simulation is not to license codes, but rather to release codes freely, and sell expertise in understanding the code.

        It’s true that some academic simulation codes have been commercialized, but this business strategy has encountered significant speed-bumps (see the web page Banned By Gaussian, for example).

        That is why, nowadays, hybrid public-private strategies increasingly dominate the simulation industry.

        The common-sense point is that even the most scrupulous V&V procedures do little to prevent incompetent modelers from generating wrong predictions. Thus the key question turns out to be “Does the operator understand the code’s underlying physics model?” — and it is this human understanding that the simulation industry profitably sells.

      • Latimer Alder

        Nice story. Unfortunately I don’t believe a word of it.

        And whether you use small fast methods or large slow methods is only about ‘how’ you do something. Whether you use a screw and a screwdriver or a hammer and a nail.

        What is really important in getting your project successful is ‘what’ you do. Not ‘how’ you do it. It is a big mistake to argue about the latter while missing the former.

        It may be that a small fast method can get you to the wrong place quicker than a big slow one. But however you got there, if you’re in the wrong place you have failed.

      • Bernie,

        I visited that website. It appears there could be a legitimate complaint there.

        I understand the issue of selling the expertise rather than the code. It is legitimate. At the same time, as both an academic and a consultant, it’s a great way to make money and reduce liability.

        Since my initial question is irrelevant, how about this one: What companies or government agencies, or NGOs, are using the open source code generated by the CASP program to successfully develop product?

      • Bernie Schreiver

        Readers of Climate Etc. (skeptics especially) are invited to verify for themselves that free software dominates the molecular modeling industry.

        The reason is simple: the dozens of open-source simulation packages are cross-fertilizing themselves with innovative ideas faster than the close-source packages can keep up — it’s pure evolution-in-action.

        Will climate simulation software evolve along a similar trajectory? This parallel evolution is obviously natural and may even be inexorable.

      • Latimer Alder

        @bernie schriever

        ‘Readers of Climate Etc. (skeptics especially) are invited to verify for themselves that free software dominates the molecular modeling industry’

        Sure. And free software is great because you get exactly what you pay for…no more and no less. For things where support and responsiveness on a best efforts (or not at all) basis are adequate, then free software is appropriate.

        But if you want to use it for anything ‘serious’, the lack of defined responsibilities and often the lack of ownership means that it can be a very fraught place for the unwary user.

        If I am betting my reputation/career/business/species on the way in which a bit of software works…and will continue to work for as long as I need it to, then it is (to my mind) much better to have a defined individual whose cojones one can squeeze gently in the palm of one’s hand as an extra incentive to get the problems fixed. Backed up (if in US) by a team of lawyers ready to sue the ass off the bastard if it doesn’t work.

        When the chips are really down, gentlemen’s agreements to help out the best that they can aren’t worth the paper they weren’t written on.

        Sorry if that sounds a bit harsh…but out in the big bad world that’s the way that serious software is often viewed.

      • Bernie Schreiver

        Hmmm … I wonder whether Climate Etc. hosts the world’s finest free-as-in-freedom typesetting software, namely \TeX/\LaTeX (TeX/LaTeX)?

        \int_0^\infty \exp(-x^2)\,dx = \frac{\sqrt{\pi}}{2}

        Did the above render OK? Whether “yes” or “no”, many readers of Climate Etc. will appreciate that the world’s leading scientific typesetting software (by far!) not only is completely, totally, and forever free, but also has on-line forums that for friendliness and responsiveness excel any commercial vendor by a wide margin.

        Already widespread among mathematicians, and increasingly among scientists and engineers, is the ideal that all professional software tools should be free-as-in-freedom.

        This emerging ideal is partly moral, but mainly pragmatic: in the internet era, free software tends to be better software.

    • You make some very good observations, Latimer. To summarize, the explanation for the two camps would be:

      1. Quality is in the eye of the beholder. The “academic” camp considers the GCMs to be scientific software used to understand the climate. The “IT” camp considers the GCMs to be engineering software used to design, or at least help justify, climate mitigation and remediation efforts. Since the requirements and stakeholders for scientific GCM software are different than those of GCM engineering software, the appropriate IV&V will be different. This is in spite of the fact that the software itself may be exactly the same. Software can be of high quality for one usage, and of low quality for another.

      2. The “academic” camp consider themselves to be a superset of the “IT” camp. That is, climate scientists believe themselves to be more expert about the quality of the GCM software (and its software development processes and metrics) for any usage of the software than software quality engineers in general. (If this is true, we will see that the “academic” camp will not be as worried as you think (or hope) they should be.)

    • Steve Milesworthy

      they [climate model developers] are getting pretty near unanimously negative criticism from the IT professionals about their IT practices

      Basing your analysis on a bunch of self-proclaimed IT professionals on the blogs you frequent doesn’t add up to much, particularly when the analogies you make (to ATMs failing etc. etc.) are so far off the mark to indicate that you have absolutely no clue as to what constitutes normal practice in developing, testing and supporting climate models.

      • Steve,
        Don’t forget the thrashing of the stats techniques used by AGW promoters in the stats journals.

  44. IPCC Climate Models

    The GMT difference between the observed and IPCC’s projection for 2011 is already below –2*sigma of –0.2 deg C!

    IPCC models have no skill. None!

    Only a couple of years for the error to go below –3*sigma!

    • selti1 | April 17, 2012 at 11:14 am |

      Will you zero that graph properly, and stop making unfounded claims on an unzeroed line!

      • Let others be the judge.

        Is there anything wrong in the following graph?

      • Other than not attributing your sources for the pink annual temperature series, and the empirical model of a flat temperature trend which doesn’t look so flat, and misquoting the IPCC, there is not much wrong with your graph.

        And you should make it very clear which part of the graph is your work and which part of the graph is the work of others, otherwise you could be charged with plagiarism.

      • misquoting the IPCC

        How did I do that? Please explain.

      • Ah, I see you removed the 0.2 C per decade part of the graph, are you conceding that that part is misleading?

        Go read the IPCC reports, and tell me what they actually say about the 0.2 C per decade projection. You will find that an important word is missing. Which you left out of your graph, that is, before you removed it altogether. And since this is a technical thread, maybe quotes from the executive summary of the IPCC reports are not appropriate.

  45. Statistical testing of IPCC’s 0.2 deg C per decade warming

    In the last 102 years, the observed data was below the lower error band limit of –2*sigma only three times in 1950, 1956 & 1976 as shown.

    Theoretically, in a normal distribution, the probability of values between +/- 2*sigma limits is 95.4%. This means that the probability of values outside the +/- 2*sigma limits is 4.6%, which in turn means the probability of values below the lower limit of –2*sigma is 2.3%. As a result, for a sample size of 102, the theoretical estimate of the number of values below the –2*sigma limit is 102*2.3/100 = 2.3, which is close to the three observed values for 1950, 1956 & 1976 above.
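
    The arithmetic above can be reproduced with a short sketch. It assumes, as the paragraph does, that the residuals are normally distributed; `normal_cdf` is a helper name chosen here for illustration.

```python
import math


def normal_cdf(z):
    # Standard normal cumulative distribution via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


p_within = normal_cdf(2.0) - normal_cdf(-2.0)   # about 0.954
p_below = normal_cdf(-2.0)                      # about 0.023
expected_count = 102 * p_below                  # about 2.3 values in 102 years
```

    The expected count of roughly 2.3 is an average over many hypothetical 102-year samples, so seeing two or three values below the limit in any one sample is unremarkable.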

    In contrast, for IPCC projection of 0.2 deg C / decade warming, just in the last four years the observed data was below IPCC’s –2*sigma limit twice in 2008 and 2011 as shown.

    This shows that the 0.2 deg C per decade warming of the IPCC is becoming statistically impossible.
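
    One way to put a number on "twice in four years" is a binomial tail, sketched below. This rests on a strong assumption that should be stated loudly: each year is treated as an independent draw with a 2.3% chance of falling below the limit. Annual temperature anomalies are correlated from year to year, so the result is only a rough guide, not a formal test.

```python
import math


def prob_at_least(k, n, p):
    # Binomial probability of k or more successes in n independent trials.
    return sum(math.comb(n, i) * p**i * (1.0 - p)**(n - i)
               for i in range(k, n + 1))


p_single = 0.023                      # chance of one year below -2*sigma
p_two_of_four = prob_at_least(2, 4, p_single)
```

    Under the independence assumption this probability comes out well below 1%. Positive autocorrelation between neighbouring years would make clustered exceedances more likely than this figure suggests.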

    As you can see in the Appendix here the long-term rate of increase in SST was about 0.06 C / decade between 1900 and 1930 and has now reduced to about 0.05 C / decade.

  47. Michael Hart

    Bernie Schreiver | April 17, 2012 at 11:37 am |
    “Readers of Climate Etc. (skeptics especially) are invited to verify for themselves that free software dominates the molecular modelling industry.”

    I disagree. I have found free academic molecular-modelling software [molecular mechanics, semi-empirical, ab-initio quantum mechanical computations] to be difficult to install and use, and to have appalling documentation and user-friendliness for anyone not acquainted with those who wrote it or use it regularly.

    As a chemistry grad student in the 1990’s I was taught molecular modelling using the Spartan program from Wavefunction Inc. It is one of the more popular molecular modelling programs in education and I find it reasonably intuitive to use, with attractive graphics. [The molecules I made certainly had some of the structural properties I was looking for when I did the modelling]

    Sure, arguments can always be made about the merits of ‘high end’ performance, depending on your specific needs, and the academic specialists will be involved in their own software writing. But when I went looking for some molecular modelling software about three years ago I ended up paying over $1000.00 of my own money to buy a copy of Spartan for my laptop. [I do not have any connection with Wavefunction Inc.]

  48. Identical 60-years global warming rates:

    Early 20th century from 1884-1944 => 0.06 deg C/decade
    Mid 20th century from 1910-1970 => 0.06 deg C / decade
    Late 20th century from 1939-1999 => 0.06 deg C/decade
    Whole 20th century from 1884-2004 => 0.06 deg C/decade

    No change in the global warming rate in the last 120 years!
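
    For anyone wanting to reproduce rates like these, a least-squares trend over a chosen window can be computed as in the sketch below. The temperature series here is synthetic (a pure 0.006 deg C/year ramp made up for illustration), not the dataset behind the figures above, and `trend_per_decade` is a hypothetical helper name.

```python
def trend_per_decade(years, temps):
    # Ordinary least-squares slope, converted from per-year to per-decade.
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(temps) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, temps))
    sxx = sum((x - mean_x) ** 2 for x in years)
    return 10.0 * sxy / sxx


# Synthetic series: an assumed linear ramp, NOT observed data.
years = list(range(1884, 2005))
temps = [0.006 * (year - 1884) for year in years]

# Trend over one 60-year window, e.g. 1910-1970.
window = [(y, t) for y, t in zip(years, temps) if 1910 <= y <= 1970]
rate = trend_per_decade([y for y, _ in window], [t for _, t in window])
```

    On real data the result depends on the start and end years chosen, so it is worth computing the slope for every 60-year window rather than a hand-picked few.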

    • Your conclusion is clearly backwards, Mr. Orssengo.

      The fact that you’ve produced only four identical 60-year slopes on 120 years of data tells us the global warming rate has varied widely in the last 120 years.

      Do you have no other sets of three or more slopes identical with each other for 60 years that appear in the dataset?

      This isn’t unexpected, given your derivative curve tells us to expect the slope to change over time, generally accelerating from the corollary slope of 60 years ago at every point.

      Well, provided we accept the logic of expecting the future to match a temporary-looking condition in a current highly tuned line, without doing further investigation.
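
      The point about hand-picked windows can be checked with a minimal sketch. Everything here is an assumption for illustration: the series is synthetic (a 0.06 deg C/decade linear trend plus a 60-year sine cycle with a made-up amplitude), not the observed record, and `ols_slope`, `synthetic_anomaly` and `window_slopes` are hypothetical helper names.

```python
import math

def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def synthetic_anomaly(year):
    """Hypothetical anomaly: linear trend of 0.06 C/decade plus a 60-year cycle."""
    trend = 0.006 * (year - 1884)
    cycle = 0.15 * math.sin(2 * math.pi * (year - 1884) / 60)
    return trend + cycle

def window_slopes(years, temps, span):
    """Trend in C/decade for every window covering `span` years, keyed by start year."""
    slopes = {}
    for i in range(len(years) - span):
        xs = years[i:i + span + 1]
        ys = temps[i:i + span + 1]
        slopes[xs[0]] = 10 * ols_slope(xs, ys)
    return slopes

years = list(range(1884, 2005))
temps = [synthetic_anomaly(y) for y in years]
slopes_60 = window_slopes(years, temps, 60)

# Print every tenth start year so the variation between windows is visible.
for start in sorted(slopes_60)[::10]:
    print(start, round(slopes_60[start], 3))
```

      In this toy series, windows centred on a peak or trough of the cycle recover the underlying 0.06 exactly, while windows starting in other years give noticeably different slopes. Finding a handful of matching windows therefore says little about whether the rate is constant.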

  49. Bart

    The R^2 for my result ( ) is 0.923 as shown =>

    • Sure, for this curve the r^2 isn’t as much of an issue, as you’re already talking about a derivative.

      Which means you should not label your derivative curve “Observed”, as you are not directly observing it. “Derived” would be fine.

      It’d be nice if you did include a reference line showing the actual observed data you’re basing your derivative on, but that’s not a big deal.

      Of all of your graphs, this one depicts the most startling result, and appears to be the most worth investigating. However, derivative curves are tricky and ought not be approached without due care and attention, which I’m concerned about in your case given past issues with other graphs and what you’ve said about this one.

      • Yes, it is interesting.

        My interpretation is that, from the bottom left side of the trend line, the warming phase of both the model and the observed trends starts and increases until they reach their maximum. They then reverse direction and decrease until they reach their minimum, which is now just above the previous minimum, then reverse again and increase until they reach their maximum, which is now just above the previous maximum, and reverse again (the trend has decreased from 0.19 deg C per decade to about 0.15 deg C per decade, and has a way to go to the next minimum).
        Here is my model

      • A rising derivative curve indicates acceleration.

        If you have discovered a truly novel relationship of physical factors that can be linked to an underlying mechanism, then what your graph is saying is for so long as that underlying mechanism lasts that:
        a) by 2024 we will have seen the last negative 30 year trend in global mean temperature;
        b) sometime in the next decade a new global temperature record for a calendar year will be set;
        c) from 2015 on, global warming will accelerate;
        d) every decade in the future will be warmer than the previous one.
        e) every three decades will be warmer than the three decades 60 years previous by more and more every year.
        f) there is a deep and uncharted six decade cycle operating in a way that synchronizes the global climate to it through temperature.

        I’d say taken together, these six claims are on very rocky ground.

        GHE adequately explains the observations, and better than your curve, and makes more plausible outcomes likely.

        However, it is a very interesting illusion to see, both compellingly convincing and at the same time utterly baseless to make any claims about at all.

  50. How about calling it “trend for the observed data”?

    • Seems wordy, inaccurate and confusing. Why use more words than are needed to be wrong and hard to follow? That’s my job.

  51. The only way to validate a climate MODEL is to determine how valid its predictions are after 10 to 20 years. PERIOD!

  52. Le Pétomane,

    I’m pretty sure that there are 2 criteria for judging climate models, neither of which involves correctness or a 30 year old carbon monoxide study.

    ‘AOS models are therefore to be judged by their degree of plausibility, not whether they are correct or best. This perspective extends to the component discrete algorithms, parameterizations, and coupling breadth: There are better or worse choices (some seemingly satisfactory for their purpose or others needing repair) but not correct or best ones. The bases for judging are a priori formulation, representing the relevant natural processes and choosing the discrete algorithms, and a posteriori solution behavior. Plausibility criteria are qualitative and loosely quantitative, because there are many relevant measures of plausibility that cannot all be specified or fit precisely. Results that are clearly discrepant with measurements or between different models provide a valid basis for model rejection or modification, but moderate levels of mismatch or misfit usually cannot disqualify a model. Often, a particular misfit can be tuned away by adjusting some model parameter, but this should not be viewed as certification of model correctness.’

    One notes the ‘plausibility criteria’ of ‘a posteriori solution behaviour’. It relates to the chaotic nature of the underlying maths of climate models. It rather immaturely pleases me to remark that – very much like you Le Pétomane – they pull it out of their arses.

    Robert I Ellison
    Chief Hydrologist

    • Robert I Ellison

      Why use more words than are needed to be wrong and hard to follow?

      Mr. Orssengo wants that job, too.

      Is the pay for being obscure and inaccurate so much better in Australia than in America?

      • Le Pétomane.

        And here’s me thinking you had cornered the market in trivial wrongheadedness.

        I will admit that the James McWilliams review is damn near impenetrable – through the effort I took to understand it. But McWilliams is not someone to blithely dismiss as wrong out of hand. – – Nor is this merely an appeal to authority – no matter how grand – as I have taken the effort to understand the meaning of ‘irreducible imprecision’ in ‘atmospheric and oceanic simulations’.

        For someone so keen on chaos in climate systems – it seems improbable that the mathematics of the Navier-Stokes partial differential equations and the limitations on the precision of quantification of both initial and boundary conditions escapes you – limitations in data, in physical theory and in parameterising sub-grid processes. This leads to a divergence of solutions within the bounds of feasible inputs – a limit of irreducible imprecision that is not at all defined in AOS. Hence the need to choose amongst non-unique solutions, whose limits of divergence are not defined, based on a subjective assessment of ‘a posteriori solution behaviour’. ‘Sensitive dependence and structural instability are humbling twin properties for chaotic dynamical systems, indicating limits about which kinds of questions are theoretically answerable.’

        This post seems focussed on the first plausibility criterion – the accuracy of depiction of ‘the relevant natural processes’ and in ‘choosing the discrete algorithms’ as well as in the potential for bugs to be embedded in the programming. The second of McWilliams’ plausibility criteria is much more fundamental to the ability of AOS to make credible projections.

        Robert I Ellison
        Chief Hydrologist

      • Ah…credible projections…not sometjhing that shuld worry you too much Bart

      • Ah…something that should…I guess I’ll go back to the drawng board.

  53. I have read the paper and given it the sort of peer review we gave each other as quality control in a consulting firm. I’m afraid I am not kind to the paper. Also, I find that most of the comments here (up until author Easterbrook’s post at least) are off topic. They deal with validation, verification, or error issues which have nothing to do with the paper in question.

    My review:

    Much of the early sections of the paper seem to be included mostly to prove that the authors have read the relevant literature. The actual ideas expressed were not actually used in the study and play no part in the analysis and conclusions. Examples include:

    352/18-21 The definitions of validation and verification. Nothing in the study is directly or indirectly concerned with validation or verification of the climate models. The issues are simply not addressable within the constraints of this study. Many of the comments on this blog address validation and verification; the study in question does not.

    353/1-8 Easterbrook’s distinction between intrinsic and internal quality is similarly of no direct or indirect relevance to the study.

    353/17-19 Also, the distinction between acknowledged and unacknowledged errors is not a relevant feature of the study, although it is relevant to the logical defects embodied in the study’s methodology. Unacknowledged errors are a key quality component, and as the authors admit, cannot be tracked by defect densities reported during software development and maintenance.

    The key issue, and the Achilles heel of the entire study, is acknowledged in the authors’ summary of Kelly and Sanders:

    From a scientist’s perspective, Kelly and Sanders observe, “the software is invisible” – that is, scientists conflate the theoretical and calculational systems – unless the software is suspected of not working correctly.

    This is a typical perspective which arises when a very intelligent, mathematically sophisticated community engages in developing its own software, trusting in their own amateur ability and smarts rather than the disciplines and practices developed by software professionals. I am intimately familiar with this hubris and overconfidence from 20 years of work as a consulting actuary. What is invisible is not the software, but a concern for software quality. It is simply assumed because the developer-users are so smart. This culture makes errors commonplace. Only a vigorous dedication to validation or verification can produce quality software in such a culture.

    The result of this culture is amply illustrated by the quoted Hatton study. The T1 portion of that study demonstrates that amateur scientific programmers routinely generate large numbers of program defects almost without awareness of the difficulty. It is a defect-tolerant programming culture. The T2 portion of the study proves conclusively that the scientist-programmer work product is very poorly verified. Hatton’s conclusion is inescapable:

    the results of scientific calculations carried out by many software packages should be treated with the same measure of disbelief researchers have traditionally attached to the results of unconfirmed physical experiments.

    Just as the study does not directly address validation and verification, it also is not directly concerned with the issue of errors, which the authors define as

    … the difference between a measured or computed quantity and the value of the quantity considered to be correct.

    The study nowhere suggests that its ‘defect density’ measures this type of failure, and their definition of defect makes explicit that errors were not intended to be measured in the study.

    However, substantial fuzziness emerges as the authors proceed from their explicit definition of ‘defect’ to their working definition of ‘any problem fixed’. It is not at all clear whether the interpretation of the revision logs specifically attempted to maintain their declared notions of ‘defect’ and ‘failure’. It is also unclear how the methodology of characterizing reports may have induced differences in density measures simply due to differences in vocabulary and terminology between the scientist-programmers and the open source projects used as ‘controls’.

    Finally, the authors fail dramatically to come to terms with their own logic and the full scope of the scientist-programmer problem. The problem is differences in behavior between the two software development cultures. As the authors say:

    In order to be counted, defects must be discovered and reported. This means that the defect density measure depends on the testing effort of the development team, as well as the number of users, and the culture of reporting defects. An untested, unused, or abandoned project may have a low defect density but an equally low level of quality.

    As documented by the cited Hatton study, low reported defect density and low quality are exactly what should be expected as the product of scientist-programmers. In this case, the low reported defect density is exactly what should have been expected as an indicator of the expected low quality of the software.
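
    A toy calculation makes the confounding concrete. This is only a sketch with hypothetical numbers: none of the figures come from the paper or from Hatton, and the `Project` class and its fields are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Project:
    name: str
    kloc: float           # size in thousands of lines of code
    latent_defects: int   # true number of defects (unknowable in practice)
    report_rate: float    # fraction of defects that get found, fixed and logged

    def reported_density(self) -> float:
        """Reported defects per KLOC: the only quantity a log-based study can see."""
        return self.latent_defects * self.report_rate / self.kloc

    def latent_density(self) -> float:
        """True defects per KLOC, shown only because this is a toy example."""
        return self.latent_defects / self.kloc

# Two hypothetical projects of identical size and identical true quality.
well_tested = Project("well_tested", kloc=400, latent_defects=2000, report_rate=0.8)
lightly_tested = Project("lightly_tested", kloc=400, latent_defects=2000, report_rate=0.1)

for p in (well_tested, lightly_tested):
    print(p.name, p.reported_density(), p.latent_density())
```

    Both projects carry the same latent defect density, yet the lightly tested one reports a density eight times lower. The reported figure tracks the reporting culture at least as much as it tracks quality.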

    The authors make explicit that they assume that finding, fixing and reporting behaviors are similar on the climate model projects and the comparator open source projects, and they admit:

    If they are not – and without any other information we have no way of knowing – then we suggest the defect density measure is effectively meaningless as a method of comparing the software quality, even roughly, between products.

    But, contrary to this assertion, they do have a reason to believe that the behavior is different between the groups. The cited paper by Hatton is a strong indication that the scientist-programmers of the climate models are likely, as other scientist-programmers do, to have a high defect tolerance and to write code with substantial undiscovered, unacknowledged errors, defects, and even failures.

    Accordingly, the authors’ conclusion that ‘low defect density suggests that the models are of high software quality’ is exactly wrong and does not follow logically from their results. The observed low defect density confirms the expectation of low software quality. The surprisingly low density is not surprising at all, but is instead strong confirmation of the hypothesis that the climate models are amateurish trash.

    As I said above, I have had extensive experience with the sort of amateur trash I suspect the GCMs actually are. I wrote enough of it and had to maintain enough of it written by others that I can usually spot when I see it. Fortran is my ‘native tongue’ as a programmer, so I spent 15 minutes or so poking around in the NASA model. Not as bad as it could be, but I don’t think the modules are small enough or adequately protected from unintended interactions with each other. Not as bad as it could be, but not reassuring. I would have to see evidence of dramatic, continuous validation and verification efforts before I bet more than a few dollars on the reliability of that software.

    • tom, thank you for your detailed analysis

    • No matter how well written the software is, if it produces the wrong output it is useless.

      No amount of quality control will cause modules to output the correct answers.

      I think current models are generated to create more funds. The only way to do this is to predict huge warming in 100 years. Predicting the actual temperatures is not the goal.

      Have you ever wondered why the later years have more warming when we know CO2’s effects are logarithmic ? Logically warming should decrease in the later years.
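
      The arithmetic behind this question can be sketched as follows. The sketch assumes the widely quoted simplified expression in which forcing is proportional to the logarithm of the concentration ratio; the coefficient, the starting concentration and both concentration paths are illustrative assumptions, not output from any model discussed here.

```python
import math

ALPHA = 5.35      # W/m^2; coefficient of a commonly quoted simplified fit (assumed)
C_START = 390.0   # ppm at year 0 (illustrative)

def forcing(c_ppm, c_ref=280.0):
    """Radiative forcing relative to c_ref, logarithmic in concentration."""
    return ALPHA * math.log(c_ppm / c_ref)

def linear_path(t, ppm_per_year=2.0):
    """Hypothetical scenario: concentration rises by a fixed amount each year."""
    return C_START + ppm_per_year * t

def exponential_path(t, growth=0.005):
    """Hypothetical scenario: concentration rises by a fixed fraction each year."""
    return C_START * math.exp(growth * t)

def half_century_increments(path):
    """Forcing added during years 0-50 and during years 50-100 along a path."""
    first = forcing(path(50)) - forcing(path(0))
    second = forcing(path(100)) - forcing(path(50))
    return first, second

print("linear:     ", half_century_increments(linear_path))
print("exponential:", half_century_increments(exponential_path))
```

      Under the linear path the second half-century adds less forcing than the first, as the logarithm suggests. Under the exponential path the two increments are equal, and any faster growth would make the later increment larger. How the later years compare therefore depends on the assumed concentration scenario, not on the logarithm alone.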

    • Mr. Tom Schaub,
      Thank you for your very clear description of what this stuff ‘is’.

      “The surprisingly low density is not surprising at all, but is instead strong confirmation of the hypothesis that the climate models are amateurish trash.”

      You made IT very understandable to this reader.

    • Eric Ollivet

      Many of the comments on this blog address validation and verification; the study in question does not.

      You may think most comments are off topic.
      That is partly true, to the extent that the study being commented on did not address V&V at all.
      But actually I would rather say that the study itself is off topic, because it didn’t address the crux of the matter, which is, indeed, Verification & Validation!

      The quality of a model shall not be assessed only on the basis of software / coding aspects but mainly on the soundness of the science that sustains the model. This means on the ability of coded equations to faithfully account for the system and processes they intend to simulate. This quality cannot be assessed by defects counting but only through a careful and independent Verification & Validation process.

      None of the climate models has ever passed any V&V process, and it is highly probable that none of them will ever be able to do so as long as they focus on CO2 as the thermal knob of our climatic system. They have all proved unable to provide any plausible and reliable forecast since their outputs are daily falsified by observations. This means they are formally invalidated and therefore useless.

      IOW, their quality is poor and even lamentable !

      • Prezakly. An internally consistent model of the climate based on phlogiston is worthless except as a playtoy; same with CO2 forcing.

    • John Carpenter

      “Finally, the authors fail dramatically to come to terms with their own logic and the full scope of the scientist-programmer problem.”

      You know there is a reason why quality control is a separate and independent function from production in manufacturing. A part of the solution to the scientist-programmer problem involves understanding why.

      • Related is the persistent effort to make “90%” and “95%” confidence levels sound respectable. The reason 3,4,&5-sigma levels are required in real science is to combat the powerful forces of data snooping, investigator bias, fraud, self-deception — plus the hiding of negative results in dark corners. All of which are rampant and unconstrained and unrestrained in climate “science”.
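
        For readers who want to compare the two conventions, here is a minimal sketch of the conversion. It assumes a normally distributed statistic and a two-sided interval, which is only one of several conventions in use; the function names are hypothetical.

```python
import math
from statistics import NormalDist

def two_sided_confidence(n_sigma):
    """Probability that a normal variate falls within n_sigma of its mean."""
    return math.erf(n_sigma / math.sqrt(2))

def sigma_for_confidence(confidence):
    """Number of standard deviations that gives the stated two-sided confidence."""
    return NormalDist().inv_cdf((1 + confidence) / 2)

for level in (0.90, 0.95):
    print(f"{level:.0%} confidence is about {sigma_for_confidence(level):.2f} sigma")

for k in (3, 4, 5):
    print(f"{k} sigma is about {two_sided_confidence(k):.7%} confidence")
```

        Under these assumptions the 90% and 95% levels sit below two sigma, which is the gap this comment is complaining about.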

    • Very good analysis.

      I have one quibble, though. Collecting and analyzing defect metrics is a verification activity.

      With all the complaints about lack of validation for climate models, the authors have good reason to explain the differences between verification and validation. It gives them safe harbor against claims that they were validating the models.

  54. Berényi Péter

    Do these climate modelling teams release their source code? Are subsequent versions archived? Documented? If not, there’s nothing to talk about.

    • There are rumours that some of them briefly considered it, but decided they were too busy and it wasn’t worth the effort.

      • Berényi Péter

        Wasn’t worth the effort? It’s kinda funny. One can of course say it’s not worth the effort to do science at all, but beyond that it is an utterly incomprehensible stance.

        The trouble is that mainstream climate science, with its computational climate models, does projections instead of predictions that could be compared to subsequent observations and measurements using the traditional toolset of science.

        You can’t do the same with a projection, because it is a prediction (generated by running computer code) over a particular scenario. If later on none of the scenarios considered at the time the projection was made occurs in reality, you are left with a bunch of fiction, literally nothing that could be related to reality.

        On the other hand, if source code were archived, documented and published, one could re-run several decades old computational climate models with a scenario that corresponds to historical reality to see how well its predictions matched actual climate behavior. This way one could either confirm or falsify the model itself.

        But with no source code available, it can never be done.

        The best part is that computational resources have increased tremendously over the past few decades, so old models, that required supercomputers of their time to run, could be tested on present day desktops.

    • “If not, there’s nothing to talk about.”

      And yet, all anyone seems to do is talk, and no one develops and releases their own rival model code.

      • They don’t because scientifically inclined bloggers would have a field day in writing mocking posts dissecting the errors.
        So they instead say that nature is hopelessly complex and chaotic and leave it at that.

      • Posts like this one remind me in no uncertain terms why I’m a sceptic.

      • Peter317 | April 21, 2012 at 5:28 pm |

        As opposed to a skeptic who writes his own code?

      • Right, skeptic Peter317 writes his own code and posts brilliant research studies on climate science in his spare time. I think his web site is right here :

      • ‘The global coupled atmosphere–ocean–land–cryosphere system exhibits a wide range of physical and dynamical phenomena with associated physical, biological, and chemical feedbacks that collectively result in a continuum of temporal and spatial variability. The traditional boundaries between weather and climate are, therefore, somewhat artificial. The large-scale climate, for instance, determines the environment for microscale (1 km or less) and mesoscale (from several kilometers to several hundred kilometers) processes that govern weather and local climate, and these small-scale processes likely have significant impacts on the evolution of the large-scale circulation (Fig. 1; derived from Meehl et al. 2001).

        The accurate representation of this continuum of variability in numerical models is, consequently, a challenging but essential goal. Fundamental barriers to advancing weather and climate prediction on time scales from days to years, as well as longstanding systematic errors in weather and climate models, are partly attributable to our limited understanding of and capability for simulating the complex, multiscale interactions intrinsic to atmospheric, oceanic and cryospheric fluid motions.

        The purpose of this paper is to identify some of the research questions and challenges that are raised by the movement toward a more unified modeling framework that provides for the hierarchical treatment of forecast and climate phenomena that span a wide range of space and time scales. This has sometimes been referred to as the “seamless prediction” of weather and climate (WCRP 2005; Palmer et al. 2008; Shapiro et al. 2009, manuscript submitted to BAMS; Brunet et al. 2009, manuscript submitted to BAMS). The central unifying theme is that all climate system predictions, regardless of time scale, share processes and mechanisms that consequently could benefit from the initialization of coupled general circulation models with best estimates of the observed state of the climate (e.g., Smith et al. 2007; Keenlyside et al. 2008; Pohlmann et al. 2009). However, what is the best method of initialization, given the biases in models that make observations possibly incompatible with the model climate state, and how can predictions best be performed and verified?’ Hurrell et al 2009

        Of course it is dynamically complex which is equivalent to chaotic precisely in physics – Webby. I don’t really care that the code for early beta climate models is or is not available – just the misuse they are put to by ignorant believers.

      • No, because it’s far easier to write mocking posts than it is to review and validate existing model code – which is probably why your activity is almost exclusively in the former rather than the latter.

      • The climate isn’t understood well enough to model temperatures 100 years from now .

  55. I think that the purpose of a climate model isn’t to predict temperatures 100 years in advance; it is to assure that funds will be available for writing more models. To assure that, the temperatures predicted must be very high. If a run of a model is too low, it will be rewritten until it is high enough.

    Climate pessimism demands fear and the models are written to generate this fear [and $$]!

