Climate Etc.

The Principles of Reasoning. Part II: Solving the Problem of Induction

by Terry Oldberg

This essay continues the argument which I initiated in Part I. To summarize, in Part I, I described a kind of model that was a procedure for making inferences. One kind of inference was a prediction from a known state of nature called a “condition” to an uncertain state of nature called an “outcome.” Conditions and outcomes were both examples of abstracted states. I pointed out that sets of conditions of infinite number could be defined on the Cartesian product space of a model’s independent variables and that each of these sets defined a different model. Thus, models of infinite number were candidates for being built.

Using this description of a model, I posed the problem of induction. As I posed it, this problem was to identify that unique model for which no inference made by it was incorrect. The correct inferences were identified by the principles of reasoning. The problem of induction was to identify these principles.

I introduced a terminological convention under which conditions were called “patterns” when they belonged to that unique model for which no inference made by it was incorrect.

I pointed out that events were of two kinds. “Observed” events were a product of observational science. “Virtual” events were a product of theoretical science. A model could be built from observed events or virtual events or both kinds of events.

A path toward a solution

A path toward a solution for the problem of induction opens up when one realizes that when we make a deductive inference to the outcome of an event of specific description, sometimes we err. In a sequence of statistically independent observations, the relative frequency with which we do not err can be measured. As the number of observations increases, sometimes we observe that the numerical value of the relative frequency gives the appearance of approaching a stable value. This value is called the “limiting relative frequency.” The limiting relative frequency is the empirical counterpart of the theoretical entity which we call “probability.”

The advent of the probabilistic logic

Writing in the 16th century, Girolamo Cardano took this path. In doing so, he provided the first description of the mathematical theory of probability. Though Cardano could not have known it, for measure theory would not be described for another 4 centuries, a probability was an example of a measure. In particular, it was the measure of an event.

In its mathematical aspects, probability theory was rooted in the deductive branch of logic. However, in its logical aspects probability theory potentially extended beyond the deductive logic and into the inductive branch of logic.

With it extended in this way, I’ll call this logic the “probabilistic logic.” In the deductive branch of logic, every proposition is associated with a variable that is called its “truth value.” The truth value takes on the values of true and false. In the probabilistic logic, the truth value of the proposition that an event of a particular description will be observed is replaced by the probability of this proposition. When the values of the probability are restricted to 0 or 1, this logic reduces to the deductive logic, for 0 for the probability corresponds to false and 1 to true. Otherwise, this logic is inductive.

The probabilistic logic having been described, I’ll proceed with the task of describing its principles of reasoning.

The law of non-contradiction

To commence discovery of the principles of reasoning, I’ll observe that the law of non-contradiction is a principle of reasoning. Unlike the other principles of reasoning, it cannot be derived. Rather, it serves as a part of the definition of what is meant by “logic.” By the definition of “logic,” a proposition is false if self-contradictory.

The principle of entropy maximization

Acting intuitively, Cardano stumbled into making an application of a principle of reasoning before this principle was articulated. This was the principle of entropy maximization.

When a model makes an inference to the numerical value that is assigned to the probability that an event of a particular description will be observed, inferences of infinite number are candidates for being made. Each of these inferences corresponds to a different numerical value for this probability. Which of these inferences is correct? This is a question that is posed by the problem of induction.

In the context of this question, it can be proved that the quantity which we call the “entropy” is the unique measure of an inference from state space A to state space B in which A contains a single state and this state is abstracted from the states in B.

Additionally, it can be proved that if and only if the states belonging to B are at the level of least abstraction, then the entropy possesses a maximum. I’ll call the states at the level of least abstraction the “ways in which an abstracted state can occur” or “ways” for short. In the roll of a pair of dice, there are 36 ways in which an abstracted state can occur, of which 2 ways are associated with the abstracted state (1, 2) OR (2, 1). Here, (1, 2) signifies that 1 dot is facing upward on die A and 2 dots are facing upward on die B while (2, 1) signifies that 2 dots are facing upward on A while 1 dot is facing upward on B.
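The counting in the dice example can be checked directly. The following Python sketch (the variable names are mine, not the author’s) enumerates the 36 ways and the 2 of them that are associated with the abstracted state (1, 2) OR (2, 1):

```python
from itertools import product

# A "way" is a state at the level of least abstraction: an ordered pair
# giving the upward-facing dots on die A and on die B.
ways = list(product(range(1, 7), repeat=2))

# The abstracted state "(1, 2) OR (2, 1)" collects two of those ways.
abstracted = [w for w in ways if w in {(1, 2), (2, 1)}]

print(len(ways), len(abstracted))  # 36 2
```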

In view of the uniqueness of the entropy as the measure of this kind of inference and provided that the states of B are examples of ways, the inductive question can be answered by optimization.

In particular, that inference is correct which maximizes the entropy, under constraints expressing the available information. As will be shown later, the entropy of the inference is the missing information in this inference about the state in B for a deductive conclusion about this state, given the state in A. Maximization of the entropy pulls the missing information upward. The constraints push the missing information downward by the amount of the available information. The constraints are applied mathematically but they reflect information collected by observations made in nature.

The principle that the model builder should maximize the entropy under the constraints is called the “principle of entropy maximization.” It joins the law of non-contradiction as one of several principles of reasoning for the probabilistic logic.

Cardano’s interest was in modeling games of chance. The designs of these games defined state-spaces containing the ways in which an abstracted state could occur. As I’ve already pointed out, in the throw of a pair of dice, there are 36 ways.

Though unaware of the principle of entropy maximization, Cardano maximized the entropy without constraints. This procedure uniquely identified an inference to the numerical values that should be assigned to the probabilities of the ways. Maximization of the entropy without constraints assigned equal numerical values to the probabilities of the ways. Thus, for example, it assigned 1/36 to the probability of each way in which an abstracted state could occur in a throw of two dice. By this assignment, Cardano’s model provided no information to the user of this model about the way in which an abstracted state would occur. A game that had this characteristic of providing no information was called “fair.”
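A numerical sketch makes Cardano’s result concrete: over the 36 ways, the uniform assignment of 1/36 to each way attains the maximum entropy, and any departure from uniformity lowers it. The skewed distribution below is a hypothetical comparison case of my own:

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)

uniform = [1 / 36] * 36   # Cardano's unconstrained maximum-entropy assignment
h_max = entropy(uniform)  # log2(36), about 5.17 bits

# A hypothetical non-uniform assignment over the same 36 ways (sums to 1).
skewed = [1 / 24] * 12 + [1 / 48] * 24
assert entropy(skewed) < h_max
```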

The probability of an abstracted state was the sum of the probabilities of the ways in which it could occur. Thus, for example, the probability of the abstracted state (1, 2) OR (2, 1) was 1/36 + 1/36.

Let t designate the number of ways in which an abstracted state can occur. Let f designate the number of ways in which this abstracted state cannot occur. Let Pr designate the probability of observing the event of the abstracted state. By entropy maximization without constraints one builds the inference that

Pr = t / (t + f)                                                            (1)

In equation (1), t is the frequency of virtual events of a particular abstracted state while t + f is the frequency of virtual events of any description.
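Equation (1) can be exercised on the dice example: for the abstracted state (1, 2) OR (2, 1), t = 2 and f = 34. A minimal sketch (the function name is mine):

```python
from fractions import Fraction

def maxent_probability(t, f):
    # Equation (1): Pr = t / (t + f), entropy maximization without constraints.
    return Fraction(t, t + f)

# Abstracted state (1, 2) OR (2, 1): 2 ways it can occur, 34 ways it cannot.
pr = maxent_probability(2, 34)
assert pr == Fraction(1, 18)  # 1/36 + 1/36
```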

Tying the probabilistic logic to observational science

In arriving at the rule that equal numerical values were assigned to the probabilities of the ways in which an abstracted state could occur, Cardano had made a purely theoretical argument. In order for the probabilistic logic to be tied to observational science, a mathematical procedure had to be found by which observed frequencies influenced assignments of numerical values to the probabilities that events of various descriptions would be observed. In the 18th century, the mathematician Thomas Bayes and the mathematical astronomer Pierre-Simon Laplace independently responded with a proposed solution. Their solution lay in the theorem from probability theory that became known as “Bayes’ inverse probability theorem.” Purportedly, this theorem proved the existence of a function that mapped event-descriptions, together with the frequencies of these descriptions in observed events, to the probability values of these descriptions. This function became known as the “posterior probability distribution function” (posterior PDF).

However, there was a catch. The catch was that an input to the posterior PDF was the “prior PDF.” The latter was similar to the posterior PDF but differed in the respect that it supposedly was known in the absence of observational data.

How could the prior PDF be known when there was no empirical evidence? In effect, Bayes and Laplace argued that the probabilities of the prior PDF must have equal numerical values because there was insufficient reason to argue otherwise. The logician John Venn countered that the choice of equal probability values was arbitrary thus violating the law of non-contradiction.

The followers of Venn sought means for accomplishment of the same task that required no prior PDF. The solution which they found was called “frequency probability.” This was a definition of the word “probability” under which the probability of observing an event of given description was identical to the limiting relative frequency of this event description. For the followers of Venn, probability was not just the theoretical counterpart of the limiting relative frequency; probability was the limiting relative frequency! This was a purely empirical definition of “probability” and stood in stark contrast to Cardano’s purely theoretical definition.

The followers of Bayes and Laplace became known as “Bayesians.” The followers of Venn became known as “frequentists.” The conflict between the two camps continues to this day.

From reading the literature one might get the impression that in building a model it is necessary to choose between the approach of the Bayesians and the approach of the frequentists. However, to choose between these two is logically unpalatable for both approaches violate the law of non-contradiction.

Excepting special circumstances, the critique of Bayesianism that is offered by the frequentists is accurate: the selection of the prior PDF really is arbitrary. However, frequentism has an element of arbitrariness of its own.

Under frequentism, supposedly a model exists in nature which is complete save the numerical values of its parameters. This supposition is wrong, for in nature there are not parameterized models for us to observe. Models belong to the theoretical world and not to nature. In nature, we observe events and their descriptions. We do not observe parameterized models. Thus, the choice of parameterized model is arbitrary, violating the law of non-contradiction.

Bayesianism fails to tie the probabilistic logic to observational science in a logical manner. Frequentism has the same failing. Fortunately, there is a third alternative that makes this tie in a logical manner. This tie is the principles of reasoning.

The advent of thermodynamics

In the 19th century, the physicist Rudolf Clausius found, in data on the efficiencies of steam engines, a previously unrecognized property of matter. Clausius called this property “entropy.” The entropy became a key ingredient of the theory of heat which became known as “thermodynamics.” Under the first law of thermodynamics, the energy of a closed system was conserved. Under the second law, the entropy of a closed system rose to a maximum.

Later in the same century, Ludwig Boltzmann and Willard Gibbs discovered, in effect, that the entropy was the measure of an inference to a state-space whose states were the ways in which the abstracted state called the “macrostate” could occur. The ways became known as “microstates” for they described a chunk of matter in microscopic detail. The entropy of the inference to the microstate of a chunk rose to a maximum under the constraint of energy conservation. In this way, the second law of thermodynamics was an application of the principle of entropy maximization.

Measure theory

Early in the 20th century, Henri Lebesgue published measure theory. Under the theory, a measure was a real valued function on a class of sets. Measure theory had a pair of precepts. They were: (1) the measure of a set is never negative and (2) the measure of the union of disjoint sets is the sum of the measures of these sets.

Entropy and information

Measure theory forges a link between the ideas of entropy and information. It follows from the precepts of measure theory that if A is a set and B is a set, the measure of the set difference B\A is the measure of B less the measure of the intersection of A with B.

Now, let A designate a state-space and let B designate a state-space. Let the set difference B\A designate the inference from A to B. From the precepts of probability theory it can be proved that the conditional entropy is the unique measure of this inference in the probabilistic logic.

Under measure theory, A, B, the intersection of A with B, and the set difference B\A each have a measure. The measure of A is the entropy of A. The measure of B is the entropy of B. The measure of the intersection of A with B is the information about the state in B given the state in A, as the word “information” is defined by the developer of information theory, Claude Shannon. Conversely, the measure of this intersection is the information about the state in A given the state in B.

It follows from the semantics that were imparted to the word “information” by Shannon that the conditional entropy of the inference from A to B is the missing information in this inference for a deductive conclusion about the state in B given the state in A. Provided that A does not intersect with B, knowledge of the state in A provides no information about the state in B; under this circumstance, the conditional entropy of the inference from A to B reduces to the entropy of this inference. Thus, the entropy of the inference is the missing information in this inference for a deductive conclusion about the state in B. Whether or not knowing the state in A provides information about the state in B, the measure of the inference from A to B is the missing information in it, for a deductive conclusion about the state in B given the state in A and the missing information is the unique measure of this inference. This finding is of great significance for philosophy as I’ll soon show.
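The relationships in the last two paragraphs can be verified numerically on a small joint distribution. The 2x2 distribution below is purely illustrative; the check confirms that the conditional entropy of the inference from A to B equals the entropy of B less the Shannon information:

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Hypothetical joint distribution over the states in A and the states in B.
joint = {("a0", "b0"): 0.4, ("a0", "b1"): 0.1,
         ("a1", "b0"): 0.2, ("a1", "b1"): 0.3}

p_a, p_b = {}, {}
for (a, b), p in joint.items():
    p_a[a] = p_a.get(a, 0.0) + p  # marginal distribution over A
    p_b[b] = p_b.get(b, 0.0) + p  # marginal distribution over B

h_joint = entropy(joint.values())
h_a, h_b = entropy(p_a.values()), entropy(p_b.values())

h_b_given_a = h_joint - h_a  # conditional entropy: the missing information
info = h_a + h_b - h_joint   # Shannon information about B, given the state in A

# Measure-theoretic identity: H(B|A) = H(B) - I(A;B)
assert abs(h_b_given_a - (h_b - info)) < 1e-12
```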

Optimization

That the missing information in an inference exists as the measure of an inference and that this measure is unique has the consequence that the problem of induction can be solved by optimization. There is a “principle of entropy maximization” which I’ve already covered. There is a “principle of conditional entropy minimization” which I’ll cover immediately below. There is a “principle of maximum entropy expectation” which I’ll cover later. Each of these principles satisfies the law of non-contradiction and is consistent with all of the other principles of reasoning for the probabilistic logic.

The principle of minimum conditional entropy

From part I, please recall that a model makes an inference from a known condition in a set of conditions to the uncertain outcomes of statistical events. Sets of conditions of infinite number are possibilities. Working in concert with all of the other principles of reasoning, the principle of minimum conditional entropy selects that unique set of conditions which maximizes the information about the outcome, given the condition. By terminological convention, these conditions are called “patterns.”

The principle of maximum entropy expectation

In the discovery of the set of patterns, the model builder must compute the conditional entropy that is associated with each set of conditions in the class of sets of all possible conditions. Before calculating the conditional entropy corresponding to a particular set of conditions, it is necessary to assign numerical values to the probabilities of the conditions and to the probabilities of the outcomes, given the conditions. Under the language of mathematical statistics, a certain procedure for doing so is both the “unbiased estimator” and the “maximum likelihood estimator.” This procedure makes the assignment

Pr = x / n                                                                   (2)

where x is the frequency of observed events of a particular description and n is the frequency of observed events of any description. Note that equation (2) makes an inference that is based entirely upon the frequencies of observed events. In this respect, equation (2) differs diametrically from equation (1), which makes an inference that is based entirely upon the frequencies of virtual events. The inference that is made by equation (1) belongs entirely to the theoretical world while the inference that is made by equation (2) belongs entirely to the empirical world. Generally the two inferences differ in the assignments that they make. That they differ violates the law of non-contradiction.
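Equation (2) can be sketched in a few lines; the counts are hypothetical. Note how a single observed event forces the assignment to 0 or 1:

```python
from fractions import Fraction

def relative_frequency(x, n):
    # Equation (2): Pr = x / n, the unbiased and maximum-likelihood assignment,
    # based entirely on the frequencies of observed events.
    return Fraction(x, n)

# Hypothetical data: 3 events of the description of interest in 10 observed events.
assert relative_frequency(3, 10) == Fraction(3, 10)

# A single observed event forces the assignment to 0 or 1.
assert relative_frequency(1, 1) == 1
assert relative_frequency(0, 1) == 0
```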

If equation (2) is used with conditional entropy minimization, the effect is to discover patterns for which each pattern is based upon a single observed event. The finding of the existence of each pattern lacks statistical significance and thus the model fails when tested. The failure is observable as a disparity between predicted probability values and observed relative frequency values. The failure can be traced to presumption by equation (2) of more information than is available in the observed events.

Entropy maximization provides a route of escape from this predicament. The possibility of applying the principle of entropy maximization to the problem of assigning a numerical value to Pr arises from the following set of facts. In 1 trial of an experiment, the relative frequency of a particular event-description will be 0 or 1. In 2 trials, the relative frequency will be 0 or ½ or 1. In N trials, the relative frequency will be 0 or 1/N or 2/N or…or 1. Note that the distance between adjacent values of the relative frequency possibilities is 1/N, signifying that the relative frequency possibilities are evenly spaced in the interval along the real line that lies between 0 and 1.

Now let N increase without limit. The relative frequencies become limiting relative frequencies. The limiting relative frequency possibilities are 0, 1/N, 2/N,…,1. As they are states at the level of least abstraction, these possibilities are examples of ways in which an abstracted state can occur. It follows that the principle of entropy maximization can be applied to the problem of identifying the correct inference to the set of limiting relative frequency possibilities. Under this principle, the entropy of the inference to the state-space { 0, 1/N, 2/N,…,1 } is maximized, under constraints expressing the available information. As it turns out, maximization of the entropy provides a unique answer to the question of the prior PDF over the values in the state-space { 0, 1/N, 2/N,…,1 }  and in this way, violation of the law of non-contradiction by the use of Bayes’ theorem is avoided. This invention of Ron Christensen exploits a loophole in the generalization that the prior PDF is arbitrary.

Given the existence of the prior PDF, the posterior PDF is determined by the observed frequencies of events of particular descriptions through Bayes’ theorem. The numerical value which is assigned to the probability of a condition or to the probability of an outcome, given a condition, is then the expected value of the posterior PDF.
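One well-known special case ties these pieces together: if the prior over the relative frequency possibilities is the uniform distribution that unconstrained entropy maximization selects, the expected value of the posterior reduces to Laplace’s rule of succession. This connection is my illustration, not a claim from the text:

```python
from fractions import Fraction

def posterior_mean_uniform_prior(x, n):
    # Expected value of the posterior under a uniform (Beta(1, 1)) prior,
    # after x events of the description in n observed events:
    # Laplace's rule of succession.
    return Fraction(x + 1, n + 2)

assert posterior_mean_uniform_prior(0, 0) == Fraction(1, 2)  # no data: maximal uncertainty
assert posterior_mean_uniform_prior(3, 10) == Fraction(1, 3)
```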

The forms of the prior and posterior PDF’s vary dependent upon the kinds of information that are available. In practice, they often take on the form of a Beta distribution. In this circumstance, maximum entropy expectation makes the assignment

Pr = (t + x) / (t + f + n)                                                   (3)

In equation (3), note that the value assigned to Pr lies between the purely theoretically based value of equation (1) and the purely empirically based value that is produced by equation (2). By equation (1), the information about the way in which an abstracted state will occur is nil. By equation (2), this information is overstated. As the value that is assigned by equation (3) lies between the value that is assigned by equation (1) and the value that is assigned by equation (2), there is the possibility of correctly representing the information that is available. In particular, t + f can be viewed as a parameter of the model and empirically tuned to that value for which the available information is correctly represented. In an elaboration of this idea, the virtual events that are generated by a mechanistic model (for example, a mechanistic climate model resembling today’s IPCC climate models) are conceived to provide an additional constraint on entropy maximization and these virtual events are weighted such that the information content of them is correctly represented.
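Equation (3) and its position between equations (1) and (2) can be checked numerically. The counts below are hypothetical:

```python
from fractions import Fraction

def maxent_expectation(t, f, x, n):
    # Equation (3): Pr = (t + x) / (t + f + n). The virtual frequencies t and f
    # act as pseudo-counts of a Beta prior; x and n are observed frequencies.
    return Fraction(t + x, t + f + n)

# Hypothetical counts: 2 of 36 virtual ways; 3 of 10 observed events.
t, f, x, n = 2, 34, 3, 10
pr1 = Fraction(t, t + f)              # equation (1): purely theoretical
pr2 = Fraction(x, n)                  # equation (2): purely empirical
pr3 = maxent_expectation(t, f, x, n)  # equation (3)

# The assignment of equation (3) lies between those of (1) and (2).
assert min(pr1, pr2) < pr3 < max(pr1, pr2)
```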

These are the ideas underlying Christensen’s principle of maximum entropy expectation. This procedure is unique in representing the available information correctly. By defining a unique procedure for assignment to Pr, the principle of maximum entropy expectation avoids violation of the law of non-contradiction.

Reduction to the deductive logic

To recapitulate, I’ve provided an argument for the existence of principles of reasoning for the probabilistic logic. These principles are: the law of non-contradiction, the principle of entropy maximization, the principle of conditional entropy minimization, and the principle of maximum entropy expectation.

You should understand that my claim is that these are the principles of reasoning for the deductive branch of logic as well as for the inductive branch of the probabilistic logic. These principles relate to traditional thinking about the deductive logic by constructing a model that makes the arguments known as Modus Ponens and Modus Tollens.

The nature of demonstrable knowledge

An interesting sidelight to discovery of the principles of reasoning for the probabilistic logic is that this discovery provides a logical definition for the Latin word scientia meaning “demonstrable knowledge” in English. The English word “science” is rooted in scientia. Thus, provision of a logical definition for scientia clarifies what one means by “science.”

Under its definition in the probabilistic logic, scientia is the information about the outcomes of events, given the associated patterns. Through conformity to the principles of reasoning in the construction of a model, the maximum possible scientia is created from fixed informational resources. The scientia is created by the discovery of patterns. Thus, the role of the scientific investigator is to support pattern discovery.

Neither the Bayesian approach to the construction of a model nor the frequentist approach discovers patterns. Thus, both approaches are stumbling blocks rather than helpmates for the scientific investigator.

Empirical support

As logic is a science, its principles are subject to falsification by the empirical evidence. There seems to be general agreement that the deductive branch of logic has frequently been tested but has never been falsified by the resulting evidence. Similarly, the inductive branch of the probabilistic logic has frequently been tested but has never been falsified by the resulting evidence.

In its application to the second law of thermodynamics, the principle of entropy maximization is continuously tested in the real world by the machines and processes which engineers construct on the assumption that entropy maximization is a law of nature. If a machine or process were to be discovered that violated entropy maximization, the principle of entropy maximization would be falsified by the evidence.

In communications theory, the information rate of a communications channel is the entropy of the encoder less the conditional entropy of the decoder. The information rate is maximized by maximizing the entropy of the encoder; this is an application of the principle of entropy maximization. The information rate is maximized by minimizing the conditional entropy of the decoder; this is an application of the principle of conditional entropy minimization. Virtually all modern communications channels incorporate encoders and decoders. There are huge numbers of these devices and they are used continuously. If an encoder were to be discovered whose performance exceeded the maximum entropy limit, then the principle of entropy maximization would be falsified by the evidence. If a decoder were to be discovered whose performance exceeded the minimum conditional entropy limit then the principle of conditional entropy minimization would be falsified by the evidence.
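As one concrete instance (my example, not from the text), consider a binary symmetric channel with crossover probability 0.1. Its information rate H(Y) − H(Y|X) is maximized exactly when the encoder’s output distribution has maximum entropy, that is, when P(1) = 0.5:

```python
from math import log2

def h2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def information_rate(p_one, eps):
    """I(X;Y) of a binary symmetric channel with input P(1) = p_one."""
    q = p_one * (1 - eps) + (1 - p_one) * eps  # output distribution P(Y = 1)
    return h2(q) - h2(eps)                     # H(Y) - H(Y|X); H(Y|X) = h2(eps)

eps = 0.1
rates = {p / 100: information_rate(p / 100, eps) for p in range(1, 100)}
best_input = max(rates, key=rates.get)

# The maximum-entropy input P(1) = 0.5 attains the capacity 1 - h2(eps).
assert best_input == 0.5
assert abs(rates[best_input] - (1 - h2(eps))) < 1e-12
```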

Tests of maximum entropy expectation are reported in the works of Christensen and his colleagues. Tests of the entire set of principles of reasoning for the probabilistic logic are reported by Christensen in a 1986 paper (Int. J. General Systems, 1986, Vol. 12, pp. 227–305). In the paper, Christensen compares the performance of all models for which, at the date of writing, there was a model that had been constructed under the principles of reasoning for the probabilistic logic and at least one model that had not been constructed under these principles from the same data. In every case, the model that had been constructed under the principles of reasoning exhibited superior performance. In some cases, the degree of outperformance was of an astounding order of magnitude.

A fuzzy logic

In the development of my topic, I have tacitly assumed that the elements of partitions of Cartesian product spaces are crisply defined sets. By this assumption, the probabilistic logic is created. By relaxation of this restriction, a generalization from the probabilistic logic can be created with principles of reasoning that are similar to those of the probabilistic logic. This example of a “fuzzy logic” has uses in model building when it is necessary for the abstracted states in a partition of the Cartesian product space of the independent variables or of the dependent variables of the model to be ordered.

Applications in meteorology

In meteorology, models built under the probabilistic logic and its principles of reasoning have consistently outperformed models built under the method of heuristics; the outperformance has been of an astounding order of magnitude. In the year 1978, a team that included an engineer-physicist (Christensen), a criminologist and a psychologist but no meteorologists set out to try to build the first long range weather forecasting model. Initially, the plan called for generation of virtual events by a general circulation model (GCM). However, this facet of the plan had to be aborted when it was discovered that the GCM was numerically unstable over periods of prediction of more than a few weeks.

In the construction of the model, more than 100,000 time series were examined for their potential for providing information about the outcome. These time series included ones that provided measurements of tree ring widths, stream flows, sea surface temperatures, atmospheric temperatures, atmospheric pressures, Palmer drought severity index and other measurements.

In 1980, Christensen et al published the characteristics of the first weather forecasting model to be built under the principles of reasoning for the probabilistic logic. At the time, centuries of meteorological research using the method of heuristics had made it possible to predict the weather with statistical significance no more than 1 month in advance. By using publicly available data but by replacement of the method of heuristics by the principles of reasoning, Christensen et al extended the time span over which precipitation could be forecasted to 12 to 36 months. This was a factor of 12 to 36 improvement and this improvement produced the first successful long range weather forecasting model. Later, replacement of the method of heuristics by the principles of reasoning was tried in building long-range weather forecasting models for all of the far Western states of the U.S.; the outcomes of these models included whether or not the spatially and temporally averaged temperature at Earth’s surface would be above the median as well as whether the spatially and temporally averaged precipitation would be above the median. The resulting models exhibited startling improvements in performance over heuristically based models (Int. J. General Systems, ibid.).

A barrier to application

There is a barrier to construction in IPCC climatology of a model under the probabilistic logic and its principles of reasoning. The barrier is that, as IPCC climatology has thus far been conceived, it has many fewer observed events at its disposal than meteorology.

In the work of Christensen et al that produced the first long range weather forecasting model, the weather was averaged over 1 year. Time series extending backward in time to the year 1852 provided 126 independent observed statistical events. By comparison, it is not unusual for an epidemiological study to have 100,000 independent observed events at its disposal.

For IPCC climatology, the situation is much worse than for meteorology because the averaging period is much longer. If the averaging period had been 10 years, Christensen et al would have had only 12 observed events at their disposal. If the averaging period had been 20 years, they would have had only 6 observed events and if the averaging period had been 100 years, they would have had only one observed event. In any case, there would have been far too few observed events for patterns to be discovered. The attempt at building a long range weather forecasting model would have failed.

In order for a successful climatological model to be built under the probabilistic logic and its principles of reasoning, it would be necessary to have available many more observed events than 1, 6 or 12. In order for this to happen, it would be necessary to reorient the plan for the research such that it yielded these events. This would result in the model being based upon paleoclimatological data; a consequence would be that it would be necessary to use a proxy for the surface temperature rather than the surface temperature itself.

Separating climatology from politics

Some of us favor separation of climatology from politics. The grounds for separation are suggested by the slogan that “science is value-free.” On the other hand, politics is value-laden.

How to go about making a field of inquiry value free is not completely obvious. Interestingly, an argument from the probabilistic logic establishes a procedure which, if followed, would make climatology value-free.  Under the probabilistic logic an inference has a unique measure that is unrelated to values such as the costs or benefits from regulating CO2 emissions. Under the probabilistic logic, the measure of an inference is uniquely the missing information in this inference for a deductive conclusion. By religiously measuring inferences by the missing information in them, climatologists would make their field value-free.

Resources for further learning

Readers who wish to learn more about my topic will find a bibliography at Ron Christensen’s web site and at my web site. The tutorial at my Web site could be helpful. For readers with serious interest in the topic, I recommend engagement of a competent tutor over trying to come up to speed by reading the literature as tutoring would be far more cost effective than unguided study.
