Patterns and Noise—Kolmogorov Complexity, Efficiency, and Marginal Improvements

By Matthew Powell

William of Ockham said that preference should be given to explanations with fewer elements. Claude Shannon said that any methodology that is an effective compressor is also an effective gambler. Andrey Kolmogorov said that the shortest possible program that expresses a piece of data gives the most useful assessment of its information content.

Admiration of brevity and capability of doing more with less runs across many professions. Poets celebrate the value of compression, lauding poems that say more than their words denote. Mathematicians save the greatest honors for the shortest proofs of deep results.

As actuaries, we are accustomed to working with complex models and predictive tools. In order to improve our business unit’s profits or productivity, we are often asked to refine those models to increase the efficacy of the output. But sometimes the incremental gain—in terms of increased profits or predictive power—does not justify the additional costs—in terms of time of opportunity cost.

This article introduces concepts from Kolmogorov complexity, which integrates concepts from the study of resource complexity in computer science with information theory and statistics. This approach can help in applying an actuary’s intuitive understanding of statistical inference and credibility theory to other domains. As an actuary, I might say that efficient use of data means not reading more from the data than the data itself has to say. Taking cues from computer programming and audio compression gives us a means to pursue the same kind of efficiency in modeling and business decisions.

Kolmogorov Complexity

There is an information metric called the Kolmogorov complexity that gives a unifying framework for questions relating to complexity and simplicity. Kolmogorov complexity, also referred to as descriptive complexity, is defined as the length of the shortest computer program (in some computer language that is sufficiently developed to be able to implement other computer language in a “universal” manner) that outputs a given piece of data. The sense is that a program that reproduces a piece of data encodes all of the essential details of that data. The smallest program encodes all of the essential details in the most compact manner possible.

The precise computer language that is used only matters significantly for relatively small pieces of data. There may very well be pieces of data that are easy to express in a particular computer language, but because we are measuring Kolmogorov complexity on a “universal” computer, we could include code that translates from that language, which makes the problem easy to our current working language.

It is not possible to directly calculate the precise Kolmogorov complexity of a piece of data. However, it is possible to use any particular approach to making data smaller as an approximation. The most direct approach to approximating Kolmogorov complexity is to use compression algorithms. Comparisons between file sizes after compressing as a ZIP file or in other formats can serve as a very rough comparison of relative Kolmogorov complexity of data sets.

Statistical analysis is also an avenue for compressing data. Estimates of the likelihood of certain sequences occurring in data provide a basis for more frequently occurring elements to be mapped onto shorter codes. General-purpose compression algorithms essentially perform statistical analysis of data based on some assumptions about the structure of the data that work reasonably well for common data formats but that will not necessarily identify deep structural relationships in data.

Data that is easy to compress has some form of redundancy in it—elements repeat in fixed patterns, there are patterns with predictable frequencies, or there are structural relationships where the values of certain elements can be used to predict other elements. Any form of compression that we apply identifies and exploits these patterns. The impossibility of precisely calculating Kolmogorov complexity relates to the fact that there are too many possible patterns to evaluate all of the possibilities.

Data that is impossible to compress is statistically random, as if drawn from a uniform distribution. It is also statistically impossible to compress data generated by a known statistical process to a size smaller than its Shannon entropy.^[1]

We can also examine the relative Kolmogorov complexity of two pieces of data by looking at how much they can be compressed jointly compared to how they can be compressed separately. Two pieces of data with significant “overlap” in their patterns can be compressed to a smaller size, while data with no overlap will compress to the same size as was achieved when the pieces of data were compressed separately.

What Might Have Been

Kolmogorov initially wanted to be an historian. He looked at the question as to whether taxes in Russia in the middle-ages were collected at the house level or at the village level. He analyzed tax data and showed that that data had a much simpler description if taxes were collected at the village level. (You can see the seeds of Kolmogorov complexity here.) He presented these results to the history faculty to great applause. Asking them whether he could publish such a paper he was told, ‘You only have one proof. You cannot publish in a history journal without at least two more proofs of your hypothesis.’ And so Kolmogorov left history for a field where one proof suffices.

From the blog “Computational Complexity”

(It should be noted, however, that the redundancy required by history journals serves a function of compensating for accepting methods of proof with less than 100 percent accuracy. Compression makes data more efficient and resolves it to its underlying explanatory value, while adding redundancy to data or processes makes them more robust against corruption or failure.)

Relationships Between Kolmogorov Complexity and Other Fields

The fundamental element of probability theory is uncertainty with exceptions. These exceptions can take the form of distributional assumptions or Bayesian priors. If a coin is fair, the outcome is unknowable and the coin generates one bit of Shannon entropy per flip. If we have knowledge that a coin is not fair, the outcome of each flip is at least slightly known in advance and on average it generates less than one bit of Shannon entropy per flip. In probability theory, Shannon entropy gives us a measure of the amount of irreducible uncertainty in an event, although measuring this quantity requires us to make some assumptions or use some prior knowledge. In classical probability theory, what we already know determines our understanding of the information content of the data.

From what has been called an “algorithmic probability” perspective, you look at a data set and identify the ways to isolate and extract the predictable content of the data, leaving behind only the incompressible residuals. Incompressible binary data does not have any more efficient representation than as a string of ones and zeros, or equivalently heads and tails outcomes. It is just as arbitrary as the outcome of a series of fair coin flips. What this means is that whenever we find a way to take input data and identify the predictable parts, leaving behind only incompressible residuals, we have isolated all of the predictable components of the data and separated them from the inherently unpredictable components. Compared to classical probability theory, we are working backward, determining the information content of the data and using that to assess what we know. Compression is inherently objective in that the result speaks for itself.

While this seems satisfying, we have no means of determining whether any particular piece of data is incompressible. At best we can say that compression algorithms, statistical methods, and heuristic optimization methods fail to provide us with a more efficient representation within the computation time that we have applied to the search. This may seem like a significant setback for this approach, but the success of the theory comes from the fact that the algorithmic probability perspective unifies a number of fields where it is already known that perfection is not achievable.

The fundamental yardstick of algorithmic probability—Kolmogorov complexity—serves as an abstract problem that can be used as the generalization of many concrete problems:

Kolmogorov complexity directly models the limits of data compression algorithms. Data compression itself has concrete value in terms of efficient storage and transmission of data.
When applied to data sets of observations of phenomena, the program that achieves the Kolmogorov complexity of the data can be viewed as the simplest explanation of that data without any use of prior beliefs, knowledge, or assumptions. This observation combined with the equivalence of maximum likelihood estimation and the Huffman coding compression algorithm^[2] can be used to formulate inferential principles such as the Minimum Description Length and Minimum Message Length principles.
When Kolmogorov complexity is applied to programs that accomplish specific tasks, such as performing a Fourier transformation or calculating the result of an actuarial model, the shortest program is necessarily organized in a manner that reflects the structure of the task itself. Computer programmers call the elements of a computer program that allow a problem to be tackled with less code “abstraction.” The shortest program uses the most effective abstractions for the task.
Search and optimization algorithms that are efficient in their use of computation time can be viewed as steadily accumulating nonredundant information, which either assists in locating the result or in proving that the solution is correct. Inefficient algorithms recalculate earlier results or make evaluations that can already be determined from the results of past operations.
The laws of physics can be treated as a computer because all practical computers are based on physical principles. Kolmogorov complexity could be used to model energy efficiency of physical processes by looking at the question of what is the shortest sequence of incremental energy inputs into a system that results in a task being completed. This example is counterintuitive in some respects because common tasks such as moving an object between two points would seem to be highly compressible, but there is no mechanism for removing this redundancy with the kinds of looping constructs available in computer languages. The resolution of this apparent paradox involves recognizing that while it is possible to compress information and computational processes that contain redundancy, it is not possible to compress redundant mechanical processes.

The problems above all involve some notion of “efficiency.” The fact that Kolmogorov complexity cannot be calculated deterministically and can only be approximated based on upper bounds directly relates to the fact that these are all problems that cannot be solved deterministically in the general case. The areas that these problems come from all require empirically grounded compromises and simplifications for real-world work.

For actuaries, a primary aspect of our work is making the best and most practical use of the evidence available to us for the benefit of a principal. This usually takes the form of application of statistical inference or informal principles collected from practice and deemed to be reasonable. Application of concepts from algorithmic probability can make it more clear what compromises we are making and what bodies of prior knowledge we are drawing on in our work. It can also reveal possible paths for borrowing from other disciplines.

As an example, it would be possible to formulate a compression-based version of credibility theory based on Minimum Description Length principles. An event frequency assumption that is set based on a particular large data set can be used to compress that data set using Huffman coding to produce a set of codes for events with lengths proportional to the log-likelihood of the event under the assumption. Improving the fit improves the compression, but any gains created by adding finer-grained breakdowns to the data increase the size of the code table, offsetting the compression. This potentially gives you a credibility framework that can be adjusted more fluidly to the structure of the problem.

If you append a smaller data set to this data set, the optimal assumption (yielding the best compression) for the combined data set will range between setting the assumption used for the smaller data set to be the same as the large data set to setting the assumption used for the small data set independently. The point in this range the new optimal assumption would occupy depends on the size of the smaller data set and how much different the smaller data set is from the larger data set. This does not necessarily have to take a predetermined form in the manner that Bayesian revisions typically do. As you scale these parameters, the solution might shift from varying the assumption on the smaller data set by applying modifiers to the existing assumption (which would not require as much information as a new code table) to developing more complex criteria as the data set grows.

This type of framework may or may not be more useful than existing credibility frameworks. In any case, examining this kind of concept gives you a more intuitive way of explaining the concept of credibility methods. A fully credible source of information is demonstrably statistically distinct from other candidate sources of information; to the extent that fully credible sources of information are not available, it is preferable to choose assumptions that can explain (effectively compress) the available sources of information.

Algorithmic Statistics and Incremental Refinements

So far we have discussed things in absolute terms, seeking the simplest or most efficient representation of a data set or process. But the algorithmic probability approach also allows evaluating partial explanations. A deterministic computer program with fixed inputs generates a specific piece of data as output, but randomized computer programs can have a chance to generate any particular piece of data. We can also look at what changes can be added to a program that generates random output so that the likelihood of producing a particular piece of data are increased. This can be viewed as a progressing sequence of partial explanations of that particular piece of data.

Here are some examples of what these sequences of partial explanations might entail:

One million draws of a normal distribution (with draws rounded to a certain number of decimals):

a. Uniformly distributed random numbers over a sufficiently large interval and with a sufficient large degree of decimal precision to cover all of the numbers that humans frequently use;

b. Uniformly distributed random numbers over a specified interval with a specified degree of precision;

c. Random numbers generated based on a normal distribution using only the most significant bits of the mean and standard deviation; and

d. The model can be further refined by using more precise values of the mean and standard deviation.

A specific English language sentence:

a. Randomly generated text giving all characters equal frequency;

b. Randomly generated text giving characters frequencies set based on an English language corpus;

c. Randomly generated text giving multi-character sequences frequencies set based on an English language corpus (e.g., character-based Markov model or neural net);

d. Randomly generated words giving multi-word sequences frequencies set based on an English language corpus (e.g., word-based Markov model or neural net);

e. The previous model, but with the program modified to draw again if the generated text is not validated by a parser that checks syntax; and

f. The previous model, with further validation to constrain generated sentences to maintain consistent subject matter and plausibility.

In these examples, the basic generation procedure is relatively simple and small, but there is some additional piece of information or a piece of code being supplied that augments its capability to generate the target data. In all of these examples, the probability of generating the specific piece of data is negligible, but that negligible probability is clearly increasing over the sequence. In the case of the normal distribution, the program improves as we add information that summarizes the range over which the numbers occur, and it further improves as we add information that specifies the distribution. The English language example improves as we specify frequencies of elements and gains higher-order structure as we supply sources of knowledge about the rules of the English language.

These pieces of data that improve the ability of this kind of program to generate a specific object are called algorithmic statistics of the object. There is a direct correspondence between minimal algorithmic statistics and minimum sufficient statistics of distributions. When this exercise is performed using a known distribution, the most efficient improvement we can make to the model is to specify the distribution and fill in estimates of the parameters based on minimum sufficient statistics. If we try to continue the exercise after this point and try to improve the chance of generating the target data further, we can only improve that chance by giving the program some information about the quantiles of the target data. At that point, we are not improving the model—we are just reproducing a specific sample.

An interesting implication of this type of thought experiment is that it gives a probabilistic grounding to non-probabilistic models. If you have a fixed estimate of a result and no estimate of the exact process by which actual results may vary, rounding the actual results to a materiality standard and assuming a discrete uniform distribution in increments based on the materiality standard may give a reasonable basis for comparing the performance of this model to a probabilistic model.

Narrative as Lossy Compression of Residuals

Data such as uncompressed still images, audio, and video carry more information than human sensory systems can process, and as such, they can be effectively compressed relative to their intended audience by discarding information that the human viewer is not very sensitive to. This practice can be viewed as using algorithmic statistics to summarize data but discarding the statistics that carry irrelevant data.

If we imagine that we have indifference over small residuals of an actuarial model, this relates to this notion of lossy compression.^[3] Truly random (incompressible) residuals indicate that we have addressed all aspects of the problem that can be easily anticipated (under our current form of analysis). If the residuals are sufficiently small, it may be appropriate to say that we don’t particularly need to explore alternative approaches that may be more accurate.

However, it is also notable that larger residuals may be tolerated if they can be attributed to exogenous factors (e.g., investment return, changes in a population’s composition or behavior, volume of sales or participation) that are expected to be nonrecurring or that are seen as being outside of the model’s “responsibility.” An explanation of the residuals may take the form of an adjustment to an assumption that makes the residual small, plus a narrative explanation of factors that may cause the deviation. This is an example of lossy compression in the sense that between the model and the explanation of the residual, we have a description of the essential details of the outcome.

Taking this into account, there may be a rationale for a stronger preference for simpler models than is given by established information criteria. From an actuary’s perspective, the ideal model may be the model that uses the fewest words to make an outcome seem reasonable relative to conditions while at the same time not obscuring factors that can be controlled to adjust the outcome. Furthermore, a model that is stable over time and does not incorporate elements that may lose relevance can be viewed as minimizing the amount of information that needs to be communicated to stakeholders over longer timeframes.

Relationship of Complexity With Utility Theory of Insurance

As an exercise, imagine if you were to give a customer, client, or plan participant a computer program that assists them with building and maintaining an adequate fund to self-insure for rare events or to prepare for difficult-to-measure long-term needs. Imagine the inputs the program would need and the outputs the program would generate. Some specific questions could be asked to characterize these elements:

How often would they need to run the program in order to be effective?
How much data would they need to input, and how frequently should it be updated?
In order to make the program effective, what kinds of data queries to public data sources, private data sources, or “crowdsourced” information would need to be included?

Essentially, this exercise is about identifying what it would require to take all of the evaluations that an actuary performs and transfer these tasks back to individuals. If you imagined that you actually distributed this program to a broad population and you recorded a log of the calculations and queries performed, how compressible is this log? If the log is highly compressible, we might assume that there is a great deal of redundant work being done by all of these individuals. We should note that we cannot expect to meaningfully eliminate redundancies that relate to outcome-relevant physical or financial events. However, all calculations and queries for information are fair game.

From this perspective, the economic value generated by insurance and risk pooling mechanisms does not necessarily come from differentials in risk preferences. Instead, the economic value from these risk pooling mechanisms can come from removing redundancy from the collective risk evaluations required by the group and facilitating information collection needed for all of the individuals. Pooling similar risks is more computationally efficient than managing them individually.

A key offset to this improvement is that pooling the risks changes the format of the solution that is needed, and additional work is needed to assure various stakeholders that risks are being managed appropriately—and if the risk is transferred to a for-profit venture, additional work is needed to ensure that risks are being managed profitably. In order for risk pooling mechanisms to be economically viable, the improvements under aggregation have to exceed the costs that are added.

Relationship of Complexity With Risk Management

Imagine that you have the task of selecting an investment policy for a fund with relatively few constraints. You have the options of investing primarily in Treasury securities, investing in corporate bonds, investing in equities, or possibly even permitting a more complex investment policy that may contemplate derivatives and OTC contracts. It is possible to write risk controls for any of these policies, but it is worth noting that it is less complex to verify that risk controls are being followed for the lower-risk options.

An investment policy that consists of a set of whitelisted securities requires evaluation of fewer metrics to confirm policy is being followed than does an investment policy that is permitted to create risk exposures. As you move through the risk spectrum, broader duties of due diligence are created. This creates work that needs to be verified under the supervision of whatever individuals have ultimate responsibility for the outcome. The relative ease of supervision of lower-risk investments may form the basis of an explanation of the equity risk premium. The size of risk premiums may be related to the scarcity of free time among individuals in supervisory positions in organizations with the capacity to invest in riskier investments.

This thinking can be extended to other types of risk controls as well. Organizations that maintain appropriate risk controls also implicitly choose policies that have supervision requirements that can adequately be met by the organization. The supervision requirements for the policy need to fit appropriately with the reporting and workflow of the organization. One form of organizational failure that can occur is for management to implement policies that it cannot hope to supervise due to these policies requiring a greater quantity of supervision than management is capable of providing.

Thought experiments may be helpful for evaluating this kind of issue as well. If we had a log of all of the activities of an organization, can the events needed to implement the policy be inserted into existing work in a simple manner? Putting this another way, are the requirements of the new policy correlated with requirements of existing work? If we add the new required events to the log, how much does the compressed size of the log increase? There is potentially more nuance that can be added to the analysis, but these kinds of questions may give rough estimates of how supervision requirements will change with the new policy.

Considerations about complexity are, well, complex. But understanding what we mean when we attempt to quantify complexity—and understanding the practical limitations and costs associated with incremental refinement to complex systems—helps us perform our actuarial duties better.

MATTHEW POWELL, MAAA, ASA is an assistant actuary with Segal Consulting based in Atlanta.

Endnotes

[1] Shannon entropy provides an absolute limit on the best possible average length of lossless encoding or compression of an information source.

[2] A Huffman code is a particular type of optimal prefix code that is commonly used for lossless data compression.

[3] Lossy compression is the class of data encoding methods that uses inexact approximations and partial data discarding to represent the content.

Patterns and Noise—Kolmogorov Complexity, Efficiency, and Marginal Improvements

Related posts

The Ticking Clock—Exploring Doomsday Risk Factors

An Imperfect Storm