[O]ur notion of software quality with respect to climate models is theoretically and conceptually vague. It is not clear to us what differentiates high from low quality software; nor is it clear which aspects of the models or modelling processes we might reliably look to to make that assessment.
Assessing climate model software quality: a defect density analysis of three models
J. Pipitone and S. Easterbrook
Abstract. A climate model is an executable theory of the climate; the model encapsulates climatological theories in software so that they can be simulated and their implications investigated. Thus, in order to trust a climate model one must trust that the software it is built from is built correctly. Our study explores the nature of software quality in the context of climate modelling. We performed an analysis of defect reports and defect fixes in several versions of leading global climate models by collecting defect data from bug tracking systems and version control repository comments. We found that the climate models all have very low defect densities compared to well-known, similarly sized open-source projects. We discuss the implications of our findings for the assessment of climate model software trustworthiness.
Citation: Assessing climate model software quality: a defect density analysis of three models, by J. Pipitone and S. Easterbrook, Geosci. Model Dev. Discuss., 5, 347-382, 2012, www.geosci-model-dev-discuss.net/5/347/2012/, doi:10.5194/gmdd-5-347-2012
JC comment: This journal is an online Discussion Journal (a format that I am a big fan of). To date, one Interactive Comment has been posted (oops just spotted another one here). Both of these reviews are favorable. I received the following review via email, from someone who wishes to remain anonymous (this review is not favorable). This person intends to submit an online review to the Discussion Journal, and will peruse the comments here to help sharpen the review before it is posted. Note: Easterbrook has a blog post on the paper [here], with a good comment by Nick Barnes.
Review by Anonymous of Pipitone and Easterbrook’s paper
The following issues in the paper by Pipitone and Easterbrook are addressed in this Comment:
- lack of uniqueness of Global Climate (Circulation) Models (GCMs) software development procedures and processes
- effects of the Advanced Strategic Computing Initiative (ASCI) project on the development of modern verification and validation methodologies
- lack of precision of simple defect counting as an indicator of software quality
- lack of consideration of the software's fitness for production applications in the decision-support domain.
Scientific Software Development
Global Climate (Circulation) Models (GCMs) are not unique in terms of the large number of domains of deep expertise required, the complexity of the phenomena, the large number of system response functions of interest, or in any other regard.
All scientific and engineering software associated with complex real-world applications requires close attention by experts who are knowledgeable of the physical domain, and these experts must be deeply involved in the software-development processes. The earth's climate, and other natural phenomena and processes, are not exceptions in this regard. Some engineered equipment systems and their associated physical phenomena and processes are easily as inherently complex as the earth's climate system. Models and software for such systems also require deep knowledge of a variety of different phenomena and processes that are important to the modeling, and of the coupling of these to the equipment and between equipment components. For the same reasons, the users of the software are also required to have deep experience and expertise in the domain of the physical problem, i.e., the application arena.
As mentioned in the paper by Pipitone and Easterbrook, the developers of the models, methods, and software for these complex systems use the development process to learn about both the physical domain and the software domain. Tight iterative interactions between physical-domain, mathematical-domain, and software-domain experts and the software are the rule. GCMs are not unique or exceptional in this regard.
The Problem of Software Quality in Scientific Software
Easterbrook and Johns (2009) have presented a review of a few of the techniques that are used by some GCM model-development laboratories during development of GCM software. All the techniques described by the authors are standard operating procedures used during construction of research versions of all scientific and engineering software prior to release of code for production applications.
The activities described by Easterbrook and Johns are sometimes called developmental assessments: they are utilized by developers during the development process. The techniques described by the authors, however, are not sufficient to determine that the models, methods, and software have been correctly constructed, or that they are fit for production-grade applications. In particular, the developmental-assessment activities described by the authors do not determine the order of convergence of the numerical methods, or the fidelity of the calculated results for all system response functions of interest.
Stevenson (1999), and the first two references cited in Stevenson, Gustafson (1998) and Larzelere (1998), were among the first papers to consider the ramifications of the Advanced Strategic Computing Initiative (ASCI) of the Stockpile Stewardship Project relative to verification and validation of models, methods, and computer software. That project has an objective of replacing full-scale-test experimentation with computing and smaller-scale experiments. The papers considered this objective to be among the more challenging ever undertaken for modeling complex physical phenomena and processes. These three papers rightly questioned the depth of verification and validation of the models, methods, and software that would be needed to ensure the success of the Initiative. The authors of these papers were not optimistic that the immense verification and validation challenges presented by ASCI could be successfully addressed.
The objectives were associated with software development within the National Laboratory system, for which such verification and validation procedures and processes were not formally in place. However, all the laboratories involved in the ASCI project readily accepted the challenges and have made important contributions to development and successful application of modern verification and validation methodologies.
The developments in verification and validation initiated by Patrick Roache, with significant additional developments by William Oberkampf and his colleagues at Sandia National Laboratories, later joined by other National Laboratories, and with contributions by industrial and academic personnel and professional engineering societies, have answered all the concerns raised by those first papers by Stevenson and others.
Verification and Validation Methodologies Developed from ASCI
The modern methodologies for verification and validation of mathematical models, numerical methods, and associated computer software are far superior to counting defect density as an assessment of software quality. The books by Patrick Roache (1998, 2009) and Oberkampf and Roy (2010) have documented the evolution of the methodologies. Oberkampf and his colleagues at Sandia National Laboratories have additionally produced a large number of laboratory technical reports, notably Oberkampf, Trucano, and Hirsch (2003). The methodologies have been successfully applied to a wide variety of scientific and engineering software, and they have been adopted by several scientific and engineering professional societies as requirements for publication in peer-reviewed journals. A search of Google, Google Scholar, osti.gov, or the peer-reviewed literature will return an enormous number of hits.
The book by Knupp and Salari (2002) on the Method of Manufactured Solutions (MMS), a methodology first introduced by Roache, demonstrates a powerful method for quantifying the fidelity of numerical methods to their theoretical performance, and for locating coding mistakes. Again, searches of the literature will lead to large numbers of useful reports and papers from a variety of application areas. The MMS is the gold standard for verification of numerical solution methods.
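To make the workflow concrete, here is a minimal sketch of MMS applied to a 1-D Poisson problem. The equation, the manufactured solution, and the grid sizes are illustrative choices of this Comment, not drawn from Knupp and Salari or from any GCM; the point is the procedure itself: choose an exact solution, derive the forcing term analytically, solve numerically on a sequence of grids, and confirm that the observed order of convergence matches the theoretical order of the scheme.

```python
# MMS sketch for -u''(x) = f(x) on [0, 1] with u(0) = u(1) = 0.
# All choices here (problem, solution, grids) are illustrative only.
import numpy as np

def manufactured_u(x):
    return np.sin(np.pi * x)             # the chosen exact solution

def forcing(x):
    return np.pi**2 * np.sin(np.pi * x)  # f = -u'' for the chosen u

def solve_poisson(n):
    """Solve -u'' = f with second-order central differences on n interior points."""
    h = 1.0 / (n + 1)
    x = np.linspace(h, 1.0 - h, n)
    # Tridiagonal system: (-u[i-1] + 2 u[i] - u[i+1]) / h^2 = f(x[i]),
    # with homogeneous Dirichlet boundaries u(0) = u(1) = 0.
    A = (np.diag(2.0 * np.ones(n)) +
         np.diag(-np.ones(n - 1), k=1) +
         np.diag(-np.ones(n - 1), k=-1)) / h**2
    return x, np.linalg.solve(A, forcing(x))

errors = {}
for n in (16, 32, 64, 128):
    x, u = solve_poisson(n)
    errors[n] = np.sqrt(np.mean((u - manufactured_u(x))**2))  # discrete L2 error

# Observed order of convergence between successive refinements; a
# second-order scheme should give values near 2, and a coding mistake
# in the discretization typically destroys that order.
ns = sorted(errors)
for coarse, fine in zip(ns, ns[1:]):
    h_ratio = (fine + 1) / (coarse + 1)  # grid-refinement factor
    p = np.log(errors[coarse] / errors[fine]) / np.log(h_ratio)
    print(f"n={coarse:4d} -> n={fine:4d}: observed order = {p:.2f}")
```

For a code of GCM scale the algebra is far heavier and the test is applied component by component, but the logic is the same: an observed order that falls short of the theoretical order signals a mistake in either the implementation or the method.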
Counting Defects is Defective
The data on which the paper by Pipitone and Easterbrook is based have been gathered and presented by Pipitone (2010). That thesis, and the Discussion Paper itself, address the less-than-ideal characteristics of defect counting relative to determination of software quality. The book by Oberkampf and Roy (2010) devotes a single long paragraph to defect counting. The raw data given in the thesis show that very large numbers of defects, in absolute numbers, are present in the GCMs that were reviewed.
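For readers unfamiliar with the metric, the arithmetic behind defect density is worth making explicit. The figures in the sketch below are invented for illustration (the real data are in the thesis); the point is that a low density can coexist with a very large absolute number of defects when the code base is large.

```python
# Invented defect counts and code sizes, for illustration only; the
# actual figures are in Pipitone (2010). Density is defects per KSLOC
# (thousand source lines of code), the metric the Discussion Paper reports.
projects = {
    "hypothetical GCM":        (1200, 800.0),  # (defect count, KSLOC)
    "hypothetical small tool": (90,   20.0),
}
for name, (defects, ksloc) in projects.items():
    print(f"{name}: {defects} defects, {defects / ksloc:.2f} defects/KSLOC")
# The GCM shows the lower density (1.50 vs 4.50 defects/KSLOC) even
# though it reports more than ten times as many defects outright.
```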
Defect counting contributes nothing useful to three of the most important attributes of software quality, as that phrase is used today: verification, validation, and uncertainty quantification (V&V and UQ). In modern software-development activities, verification is a mathematical problem and validation is a physical-domain problem, including the design of validation tests.
Defect counting would be more useful if the data were presented as a function of time after release of the software for production applications, so as to allow a measure of steady improvement, together with close classification as to the type of defect. The counting also would be more useful if it were associated only with post-release versions of the software, with none drawn from the developmental-assessment phase. And the number of different system response functions exercised by the user community is also of interest: a very rough approach to model / code coverage. In general, different system-response functions will be a rough proxy for focus on different important parts of the mathematical models.
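The bookkeeping proposed here is straightforward. The sketch below assumes hypothetical bug-tracker records carrying a report date and the system response function the reporting user was exercising; the record layout, dates, and response-function names are invented, and only post-release reports are counted, binned by time since release.

```python
# Sketch of post-release defect bookkeeping. All records and the release
# date are hypothetical; real data would come from a bug tracker.
from collections import Counter
from datetime import date

# (date reported, system response function the reporting user was exercising)
reports = [
    (date(2011, 2, 10), "surface temperature"),
    (date(2011, 5, 3),  "precipitation"),
    (date(2011, 9, 21), "sea-ice extent"),
    (date(2012, 1, 14), "surface temperature"),
]
release_date = date(2011, 1, 1)  # hypothetical production-release date

per_quarter = Counter()
covered = set()
for reported, response_fn in reports:
    if reported < release_date:
        continue  # exclude developmental-assessment defects, per the argument above
    quarter = (reported - release_date).days // 91  # roughly quarterly bins
    per_quarter[quarter] += 1
    covered.add(response_fn)

print("defects per quarter since release:", dict(sorted(per_quarter.items())))
print("response functions exercised so far:", sorted(covered))
```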
Defect counting would be much more useful if the type of defect were also recorded. There are four possible classes of defects: (1) user error, (2) coding mistake, (3) model or method limitation, and (4) model or method deficiency. Of these, only the second counts coding mistakes. The first, user error, might indicate that improvement in the code documentation is required: in the theoretical basis of the models and methods, in the application procedures, and/or in the explanation of the basic nature of the calculated system responses. The third, model or method limitation, means that some degree of representation is present in the code but a user has identified a limitation that was not anticipated by the development team. The fourth means that a user has identified a new application and/or response function that the original development did not anticipate; these generally require significant planning for model, method, and software modifications to correct the deficiency.
Items (3) and (4) might need a little more clarification. A model or method limitation can be illustrated by the case of a turbulent flow for which the personnel who developed the original model specified the constants in the turbulence-closure model to be those that correspond to parallel shear flows, and a user has attempted to compare model results with experimental data for which re-circulation is important, or for flow in helically-coiled square channels. Item (4) can also be illustrated by a turbulent flow: consider a case in which the developers used a numerical solution method that is valid only for parabolic / marching physical situations and a user has encountered an elliptic-type flow.
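The four-way taxonomy is easy to operationalize. The sketch below is illustrative only: the example classifications are invented, not drawn from Pipitone (2010); the point is that only the coding-mistake share speaks to whether the software was built correctly.

```python
# Sketch of the four-way defect taxonomy described above; the class names
# follow the text, and the example reports are invented for illustration.
from collections import Counter
from enum import Enum

class DefectClass(Enum):
    USER_ERROR = 1        # may point to documentation gaps rather than code faults
    CODING_MISTAKE = 2    # the only class that reflects how the code was built
    MODEL_LIMITATION = 3  # e.g. a closure model applied outside its calibrated range
    MODEL_DEFICIENCY = 4  # e.g. a marching solver confronted with an elliptic flow

classified = [
    DefectClass.CODING_MISTAKE,
    DefectClass.USER_ERROR,
    DefectClass.MODEL_LIMITATION,
    DefectClass.CODING_MISTAKE,
    DefectClass.MODEL_DEFICIENCY,
]

tally = Counter(classified)
for cls in DefectClass:
    share = 100.0 * tally[cls] / len(classified)
    print(f"{cls.name:17s}: {tally[cls]} report(s), {share:.0f}% of total")
# A raw defect density conflates all four classes into a single number.
```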
The paper by Pipitone and Easterbrook has provided little information about the kinds of defects that were encountered.
Fitness for Production Applications
The paper by Pipitone and Easterbrook does not address the fitness-for-duty aspects of the GCMs; instead the paper focused on the model-as-learning-tool aspects. As noted in this Comment, those aspects are common to all model, method, and software development whenever complexity is an important component, whether in the physical domain, the software domain, or both.
The objectives of model and software development are the production of tools and application procedures for predictions having sufficient fidelity in the real-world arena. The foundation for sufficient fidelity is validation by comparison with measured data from the application areas. All system-response functions are required to be tested by validation. The review paper by Easterbrook and Johns does not address any aspect of validation as that concept is defined in the reports and papers cited above.
Validation is required of all models, methods, software, application procedures, and users, the results of which will form the basis of public-policy decisions. Validation must follow verification. Generally, verification and validation for these tools and procedures are conducted by personnel independent of the team that develops the tools and procedures.
Counting defects, especially those that are encountered during developmental assessment, has nothing to offer in this regard.
The paper presents a very weak argument for the quality of GCM software. The widely accepted and successful modern verification and validation methodologies, which are used in a variety of scientific and engineering software projects, are not even mentioned in the paper. More importantly, the fitness of the GCMs for applications that affect public-policy decisions is also not mentioned. Simple defect counting cannot provide information relevant to validation or to application in public-policy decisions.
Easterbrook, Steve M. and Johns, Timothy C., Engineering the Software for Understanding Climate, Computing in Science & Engineering, Vol. 11, No. 6, pp. 65-74, 2009.
Gustafson, John, Computational Verifiability and Feasibility of the ASCI Program, IEEE Computational Science & Engineering, Vol. 5, No. 1, pp. 36-45, 1998.
Knupp, Patrick and Salari, Kambiz, Verification of Computer Codes in Computational Science and Engineering, Chapman and Hall/CRC, Boca Raton, Florida, 2002.
Larzelere II, A. R., Creating Simulation Capabilities, IEEE Computational Science & Engineering, Vol. 5, No. 1, pp. 27-35, 1998.
Oberkampf, William L. and Roy, Christopher J., Verification and Validation in Scientific Computing, Cambridge University Press, Cambridge, 2010.
Oberkampf, William L., Trucano, T. G., and Hirsch, C., Verification, Validation, and Predictive Capability in Computational Engineering and Physics, Sandia National Laboratories Report SAND 2003-3769, 2003.
Pipitone, Jon, Software Quality in Climate Modeling, Master of Science thesis, Graduate Department of Computer Science, University of Toronto, 2010.
Roache, Patrick J., Verification and Validation in Computational Science and Engineering, Hermosa Publishers, Socorro, New Mexico, 1998.
Roache, Patrick J., Code Verification by the Method of Manufactured Solutions, Journal of Fluids Engineering, Vol. 114, No. 1, pp. 4-10, 2002.
Roache, Patrick J., Fundamentals of Verification and Validation, Hermosa Publishers, Socorro, New Mexico, 2009.
Stevenson, D. E., A Critical Look at Quality in Large-Scale Simulations, IEEE Computational Science & Engineering, Vol. 1, No. 3, pp. 53-63, 1999.
JC comment: For background, here are previous Climate Etc posts on the topic of climate model V&V:
- Climate model verification and validation
- Climate model verification and validation: Part II
- Verification, validation and uncertainty quantification in climate modeling
- What can we learn from climate models? Part II
My personal take on this topic is more in line with that presented by the anonymous reviewer than that presented by Pipitone and Easterbrook.
Moderation note: This is a technical thread and comments will be moderated strictly for relevance.