Statistics as algorithmic summarization

Though statistics is a multifaceted and complex discipline, its greatest contribution is a rigorous framework for summarization. Statistics gives us reasonable procedures for estimating properties of a general population by examining only a few individuals from that population. In this regard, statistics is algorithmic: it provides randomized algorithms for extrapolation. In this blog, I’ll review some elementary stats (with as little mathematical formalism as possible) and try to crystallize why this algorithmic view is illuminating. In future blogs, I’ll build on this algorithmic perspective by applying it to more interesting examples in experiment design and prediction.

For a simple starter example, we know that every person on Earth has a height, defined as the distance from the bottom of their feet to the top of their head when standing upright. Suppose we would like to know the mean height of all living people. Computing it exactly would require tracking down every living individual, getting out a tape measure, and measuring this distance for each of them. To avoid such an exhaustive census, we could instead devise an efficient algorithm to estimate the quantity. What if we selected a subset at random and used this subset to estimate the mean? That is, we could collect a random sample of individuals from the population and measure the average height of the individuals in the sample.
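
In code, the estimation step is almost nothing. Here is a minimal sketch, assuming the population is handed to us as a list of heights (the names `population_heights` and `estimate_mean_height` are mine, not anything standard):

```python
import random
import statistics

def estimate_mean_height(population_heights, sample_size, seed=0):
    """Average the heights of a uniform random sample of individuals."""
    rng = random.Random(seed)
    sample = rng.sample(population_heights, sample_size)  # uniform, without replacement
    return statistics.mean(sample)
```

Of course, having the whole population sitting in a list defeats the purpose; the point is only that the estimator itself is a one-line average over a random subset.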

The average height measured on this random sample is itself a random quantity. Thus it has a mean and a variance, just like any other random quantity. If we sample each individual uniformly at random from all living humans, the expected value of the sample average is precisely the average height in the general population. Moreover, the variance of the sample average shrinks in inverse proportion to the number of sampled individuals. The more individuals you measure, the closer the average height of the sample will be to the average over the population.
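
In symbols (the standard identities for independent uniform draws, nothing specific to heights): if the sampled heights X₁, …, Xₙ each have mean μ and variance σ², then

```latex
\mathbb{E}\!\left[\bar{X}_n\right] = \mu,
\qquad
\operatorname{Var}\!\left(\bar{X}_n\right) = \frac{\sigma^2}{n},
\qquad
\text{where } \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i,
```

so the typical deviation of the sample average from the population average shrinks like 1/√n.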

Statistics provides powerful formulas that let us precisely quantify how close the sample average will be to the population average. For instance, we know a person’s height is a positive number and that no one is taller than nine feet. With these two facts, a simple application of a formula called Hoeffding’s inequality tells us that if we sample the heights of thirty thousand individuals, our sample average will be within an inch of the true average height with probability at least 98%. This assertion is true no matter how large the population is. The required sample size is dictated only by the variability of height, not by the total number of individuals.
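
To double-check that arithmetic, here is a small script evaluating the two-sided Hoeffding bound P(|sample mean − population mean| ≥ t) ≤ 2·exp(−2 n t² / (b − a)²) with the numbers from the paragraph (heights bounded between 0 and 108 inches, a one-inch tolerance, thirty thousand samples); the function name is my own:

```python
import math

def hoeffding_confidence(n, tolerance, value_range):
    """Lower bound on the probability that the sample mean lands within
    `tolerance` of the population mean, for n samples of a quantity whose
    values span at most `value_range`."""
    failure_prob = 2 * math.exp(-2 * n * tolerance**2 / value_range**2)
    return 1 - failure_prob

print(hoeffding_confidence(n=30_000, tolerance=1.0, value_range=108.0))
# roughly 0.988, i.e., confidence of at least 98%
```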

You could replace “height” in this example with almost any attribute that you are able to measure well. For quantities with reasonable variability, a uniform sample from a general population will give a high quality estimate of the average value. Statistics provides a powerful framework to reason about population-level properties by examining properties of small subsets.

In this example, statistics provided an algorithm for estimation. We began with a quantity whose average value we would like to estimate. We specified how much this quantity can vary in the population under study. Using formulas from statistics, we computed the number of individuals we needed to examine in order to achieve an estimate of the desired quality. We then did our best to draw a random sample of this size from the population, measured those individuals, and reported the sample mean. This is a complete procedure for estimating the mean value over the general population.
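
Putting the pieces together, here is a minimal sketch of the whole procedure, with the sample size chosen by inverting the Hoeffding bound above (all names are mine, and the callbacks `draw_random_individual` and `measure` stand in for whatever sampling and measurement machinery you actually have):

```python
import math
import statistics

def required_sample_size(tolerance, value_range, confidence):
    """Smallest n for which the two-sided Hoeffding bound guarantees the
    sample mean is within `tolerance` of the population mean with the
    requested confidence."""
    failure_prob = 1 - confidence
    return math.ceil(value_range**2 * math.log(2 / failure_prob) / (2 * tolerance**2))

def estimate_population_mean(draw_random_individual, measure,
                             tolerance, value_range, confidence=0.95):
    """Estimate a population mean by measuring a Hoeffding-sized uniform sample."""
    n = required_sample_size(tolerance, value_range, confidence)
    measurements = [measure(draw_random_individual()) for _ in range(n)]
    return statistics.mean(measurements)
```

For the height example (a 108-inch range, one-inch tolerance, 95% confidence), `required_sample_size` comes out to roughly 21,500.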

This algorithmic description of statistical summarization probably seems uncontroversial, but it differs from how statistics is often taught or conceptualized. As I’ve discussed in previous blogs, the statistical model is oftentimes given highest precedence. In this model-based framing of statistics, the world is governed by probabilistic rules. With enough effort, these rules can be specified by a generative statistical model, and we assume that measurement is equivalent to sampling from this generative model. The generative model has properties like a mean, and the law of large numbers tells us that if we sample enough times from the model, we can accurately estimate the mean and other properties.

This distinction between modeling the sampling and modeling the population may appear to be splitting hairs. In some sense, the two viewpoints differ only conceptually, as the algorithms for estimating the mean height of a population are identical. However, the interpretations differ: in the algorithmic view, one can use statistics to understand the physical world no matter how the general population arose. As our height example highlights, only the most minimal modeling assumptions are needed to make use of statistical methods. In the modeling view, we shoehorn ourselves into describing all processes with probability distributions. Not only is this unnecessary, but validating probabilistic models is also quite difficult. As I described in my last two blog posts (1) (2), in the vast majority of scientific studies the proposed statistical models are never validated.

The algorithmic view of statistics also illuminates which aspects of randomness we have under our control. For example, there are whole branches of computer science dedicated to deterministically generating numbers that look random for all intents and purposes. Random number generation is something we can control. We can focus on understanding how our measurements fail to be ideal, rather than on how the natural world fails to obey our models. This empowers statistical practitioners to tune their procedures to be more robust against the limitations of what they can measure.

I want to once more emphasize that the algorithmic perspective I’ve presented here is not novel at all. Thinking about the combinatorics of random permutations and how they could be used to design randomized experiments is nearly one hundred years old and was already prevalent in the works of Jerzy Neyman and the eccentric eugenicist Ronald Fisher by the 1930s. Fisher’s rambling discursion on the woman who can tell whether milk was added to a cup before the tea describes a randomization procedure that does not model the appearance of tea in cups, the composition of the tea cups, or the psychic abilities of the tea expert. It purely counts permutations and uses exact statistics to provide a way to evaluate the validity of a claim. I’ll review in more detail how to use these effectively model-free ideas for experimentation in the next post.
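
The counting itself is elementary. As a sketch, here is the exact calculation for the standard version of the experiment (eight cups, four prepared each way, the taster asked to pick out the four milk-first cups); the function name is mine:

```python
from math import comb

def tea_tasting_p_value(num_cups=8, num_milk_first=4, num_correct=4):
    """Probability that a pure guesser, choosing `num_milk_first` cups out of
    `num_cups`, correctly identifies at least `num_correct` of the true
    milk-first cups. Computed by counting arrangements, nothing more."""
    favorable = sum(
        comb(num_milk_first, k) * comb(num_cups - num_milk_first, num_milk_first - k)
        for k in range(num_correct, num_milk_first + 1)
    )
    return favorable / comb(num_cups, num_milk_first)

print(tea_tasting_p_value())  # 1/70, about 0.014: guessing all four by luck is rare
```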

There are obviously issues with the algorithmic view. Statistical sampling methods work best when the variability of the quantity of interest is low. In that case, small experiments quickly reveal insights about the population. When variances are large or effectively unbounded, the number of samples required for high-precision estimates may be impractical, and our estimators and algorithms need to be rethought. Many phenomena obey “power law” scaling, and such quantities are harder to work with in this rudimentary experimental framework. For them, the sample average produced by the uniform sampling algorithm we have described has enormous (effectively unbounded) variance, and hence is no longer an accurate estimate of the average over the population. More sophisticated methods must be deployed to estimate means of quantities with such high variation.
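
A quick simulation makes the contrast vivid. This is only a sketch: I use Pareto distributions as a stand-in for power-law data, comparing a shape parameter of 3.5 (finite variance) with 1.5 (finite mean, infinite variance):

```python
import random
import statistics

def sample_mean_spread(shape, sample_size, num_trials=200, seed=0):
    """Repeatedly compute the sample mean of Pareto(shape) draws and report
    how much the estimates vary from trial to trial."""
    rng = random.Random(seed)
    means = [
        statistics.mean(rng.paretovariate(shape) for _ in range(sample_size))
        for _ in range(num_trials)
    ]
    return min(means), max(means)

print(sample_mean_spread(shape=3.5, sample_size=10_000))  # estimates cluster tightly
print(sample_mean_spread(shape=1.5, sample_size=10_000))  # estimates swing widely
```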

Another drawback of statistical sampling is the requirement that individuals be sampled uniformly at random. Uniform sampling is an idealization and is often hard to implement in practice. For example, what does it mean to collect an i.i.d. sample of registered voters in a poll? Some voters may not respond to your phone calls. Even more problematic is when you cannot define the population well in advance, as is the case for the vague notion of “likely” voters. More sophisticated statistical analyses can provide guarantees for non-uniform sampling strategies, but care must be taken to ensure that the statistical bounds we compute reflect how the data collection was actually implemented. Understanding our sampling capabilities is crucial to understanding the validity of our estimates.

With these caveats in mind, our algorithmic framing of statistics gives us a focused way to think about experiment design and prediction methods. In the next blogs, I’ll explore these both in depth.