Commentary

Finding George E.P. Box

By Paul Conlin

Like many actuaries and financial analysts, I have spent the last half-decade hearing the buzzwords “Big Data,” “data analytics,” “predictive analysis,” and various other iterations of the same idea. Vast facilities are being built across the United States to house exabytes of data “in the cloud.” 

With all the data there, all sorts of statistical algorithms can be bounced against it, finding unexpected correlations, and drawing all sorts of inferences that only existed at great cost as recently as two decades ago. Some of those algorithms generate fascinating results, while others—say, if you are candidate Hillary Clinton in October 2016 running for president and being told by your data-centric campaign manager Robby Mook that you are handily defeating Donald Trump in Wisconsin (so, don’t bother visiting there for any political rallies)—are catastrophic. How do you tell which type of algorithm you’re looking at?

If you’re like me, you hit the textbooks.

You do so to fulfill your professional obligations to understand the results and models you are relying on. You do so to stay relevant in your career. You do so to be able to communicate with the new data scientists running around the office. But mostly—and, again, I emphasize the disclaimer “if you’re like me”—you do so because it’s sort of fascinating.

You studied mathematics for 16 years in getting your bachelor’s, but they didn’t get to the “magic” stuff like linear regression until your senior year of college, and by then you were tuned out, or at least distracted by your first real paycheck and your new life as an adult. You learned it again, part of it, because it was required for actuarial exams. But we all know exams aren’t directly useful on the job, so you set it aside after exams were over.

Well, would you like to find out what’s going on with all this data analytics stuff? Then have I got a story for you. Having gotten started on linear regression on a work project, I’ve spent the past three or four years hitting the books on a pretty continuous and sustained basis. I’ve learned a lot of statistics and techniques, and I’ve learned about the approaches mathematicians, data scientists, economists, and even physicists take to solving various problems and answering various real-world questions. 

I’ll touch here on some of the more usable concepts and approaches that I’ve stumbled upon. But first, and arguably more importantly, I’ve learned—for the first time in my life—how to read a textbook. I wanted to share some epiphanies I’ve had. I’m hoping that by doing so, I save you some of the aggravation I had in getting up the curve on this topic, and make it less intimidating for you to approach any complex field of knowledge.

✼ ✼ ✼

I’ll start with the punchline of this section, because it’s an assertion that might seem provocative, but is not meant that way. Here it is: An author of a textbook has a tone, theme, and “main idea,” every bit as much as an author of a novel does. It took me a while to arrive at this epiphany, partially because it’s so counterintuitive. What could be more opposite than a fictional novel and a technical textbook? The latter clings stubbornly to facts, the former spins fanciful yarns—the more unrealistic the better. In the latter, you go to the back and read the index, while the former doesn’t even have an index; a novel might have a table of contents, and even then, not necessarily. But never an index; you read it from front to back. In a textbook, you go straight to a specific section, and then perhaps start reading backward if you don’t understand the notation in the section you came to read about.

And then—admit it—the textbook goes back on the shelf. You’re not particularly interested in the author of the textbook, or what year he or she wrote it. And you’re certainly not treating the text as a whole work in and of itself. You’re not stepping back and asking why the author wrote the book, and more important, how he or she approached the topic. It’s hard enough to find time to read novels; who would actually read a textbook, cover to cover?

And although I offer no solution to the time issue, I am here to tell you that there’s a reward to reading a textbook. Not a single one, but multiple rewards, from multiple disciplines. You start to pick up on clues dropped by the author, intentionally or unintentionally. The author’s passion for the topic, for one; consider the investment of time needed to write a textbook. But more importantly, you start to understand how the author understands the topic, and how he or she chooses to explain it to strangers (the readers). And, eventually, it becomes the first thing you look for when you pick up your next textbook.

A quick aside on tactics: Textbooks are expensive. There’s no getting around it; it might be because of college loan availability, or corporate greed, or the narrowness of the topics that we actuaries specialize in. Buying used helps, but only a little. Many editions of many important statistical textbooks are just plain unavailable, or available in only limited quantities in questionable condition. And even then you need to spring for $40 or $50 or more, with an uncertain payoff. You can sometimes get lucky if you live near a good public library, but even that requires a significant investment of time. My only suggestion here is to view the effort as a multiyear enterprise, and to return to online booksellers periodically to see whether a “target text” you’ve been meaning to get your hands on becomes available, and/or has had a drop in price. The reward will be worth it when you get your hands on a quality copy of a quality text.

✼ ✼ ✼

So, assuming I haven’t lost you yet, you’re now motivated and “intellectually available.” You have a body of subject matter you want to get up to speed on, and a financial and temporal budget to draw upon. Where should you start, where should you move on to next, and where should you end? I’m here to give you a proposed “road map” to guide you along the way.

I started by accident as an undergrad, the last semester of my senior year of college, in 1987. The textbook was Introduction to Linear Regression Analysis, by Douglas Montgomery and Elizabeth Peck, the former a professor at Georgia Tech, the latter an executive at The Coca-Cola Company. Coke, in 1987, was the pinnacle of a profitable corporation, delivering mind-boggling stock market returns, hip commercials, and a pristine reputation across the globe at a time when not every American company was particularly welcome. The text is 500 pages long, but eminently readable and, more importantly, relentlessly practical, with a variety of real-life business and scientific problems seeking solutions. The chapters are logically structured, starting with simple single-variable linear regression (first, fitting a line, then performing diagnostics on the model), progressing to multivariate regression (including discussion of the underlying matrix mathematics), polynomial regression, and finally challenges of regression such as autocorrelation and multicollinearity. For 30 years, the textbook has stayed on my bookshelf, both for doing modeling of my own and for testing the modeling performed by peers.

One gem in the Montgomery and Peck textbook that I’ll touch on later in this article when we get to time series: Linear regression has a cool diagnostic test called a Durbin-Watson statistic. It lets us test whether there’s autocorrelation among the errors after we’ve chosen a best-fit regression model. Autocorrelation among the errors would violate the underlying assumption that the errors are random and normally distributed, which is the assumption that allows us to draw statistical inferences from the best-fit line we draw. We can always draw a best-fit line to any scatterplot of data; we can only draw statistical inferences from it if the errors are random. Durbin-Watson is neat because it (1) addresses exactly this issue, and (2) has pass/fail/inconclusive values. Of particular interest to health actuaries is that the per member per month (pmpm) costs I’ve seen modeled generate an inconclusive Durbin-Watson result: too much autocorrelation for them to be modeled by linear regression, too little for them to be modeled with time series. This is one reason “the robots” haven’t been able to start their own health plan yet, and why there is at least another generation of job security in store for health actuaries.
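For readers who want to see the arithmetic, the Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. It ranges from 0 to 4, and values near 2 suggest little first-order autocorrelation. What follows is a minimal Python sketch, not code from Montgomery and Peck; the function names and the pmpm figures are hypothetical and chosen only for illustration.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


# Hypothetical monthly pmpm costs, invented for illustration only.
months = list(range(1, 13))
pmpm = [402.0, 405.5, 411.0, 409.5, 415.0, 421.5,
        419.0, 424.5, 431.0, 428.5, 436.0, 440.5]

intercept, slope = fit_line(months, pmpm)
residuals = [y - (intercept + slope * x) for x, y in zip(months, pmpm)]
dw = durbin_watson(residuals)
print(f"Durbin-Watson statistic: {dw:.3f}")
```

Whether a given value passes, fails, or lands in the inconclusive zone depends on the published Durbin-Watson bounds for the sample size and number of regressors, which this sketch does not include.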

But here’s the plot twist: As complete as the text was (and is) on the topic of regression, it was short-sighted of me to ever consider it the final word. There is no “final word” in any science, including regression. There’s always a new problem to be solved, a new twist on an old problem, and advances in hardware and software that “shuffle the deck” on what models are practical versus which are impractical.

And there’s always a new author’s take on a familiar subject.

✼ ✼ ✼

Fast-forward 28 years to 2015, when I found myself working in the reserves area of one of the Big 5 health plans. We closed the books and estimated incurred but not reported (IBNR) claims every 30 days, and (as with most insurers) relied on regression models to pick the incomplete date of service months, especially the most recent one. I was charged with explaining our regression models to our senior management and to the actuarial staff of the enterprise relying on the IBNR outcomes to make pricing and forecasting decisions, and to evaluate the accuracy of the forecasting model over long periods of time. I dusted off my Montgomery and Peck tome … and quickly realized the model was using an approach very different from anything the Atlanta dynamic duo had modeled or even discussed. 

After searching online, I discovered a prolific author of several texts, a University of New Mexico professor by the name of Ronald Christensen. His flagship text had the clever title of Plane Answers to Complex Questions, and was in several editions spanning three decades: the 1990s, the 2000s, and the 2010s. Christensen’s models were not quite the same as the regression models used in our company’s IBNR modeling (more on that later), but I noticed Christensen discussed regression, and linear modeling in general, in a different way than Montgomery and Peck did. He always stayed grounded in what the underlying linear equations were to the model he was building, and then showed how the coefficients were solved using vector and/or matrix arithmetic. (Less impressive, he rarely derived his own modeling methods, choosing instead to reference models published by other statisticians and then derive for himself the underlying linear mathematics. This characteristic of Christensen is particularly evident in three [!!] texts he has out, also in multiple volumes over the course of decades: Advanced Linear Modeling, Log-Linear Models and Logistic Regression, and Bayesian Ideas and Data Analysis. While all three reinforce Christensen’s passion for mathematics and statistics—as well as an obvious lifetime of reading on the topic that puts my three-year foray to shame—they did not deliver the bang-for-the-buck I needed to make progress in this field of knowledge, and I can’t recommend them except to the most dedicated explorer.)

I walked away from this with two takeaways. First, every author I approached would give me a fresh perspective on a familiar topic. (Fun fact: Did you know “least squares” is not a statistical estimation technique, but simply a projective geometry outcome? See Chapter 2 of Plane Answers.) Second, I should really pay attention to the bibliography and citations provided by the author of a text; by doing so, I was relearning for myself the topic as originally learned by the author.
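The geometric reading of least squares can be checked numerically. In the sketch below (my own illustration, not Christensen’s derivation), the fitted values are the projection of the observed vector onto the space spanned by a column of ones and the x values, so the residual vector should be perpendicular to both. No probability assumption is used anywhere. The data and helper names are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def least_squares_line(x, y):
    """Intercept and slope that minimize the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = (
        sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
        / sum((xi - mean_x) ** 2 for xi in x)
    )
    return mean_y - slope * mean_x, slope


# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = least_squares_line(x, y)
fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
ones = [1.0] * len(x)

# Orthogonality checks: each dot product is zero up to rounding error.
print(dot(residuals, ones))
print(dot(residuals, x))
print(dot(residuals, fitted))
```

The three printed values are zero apart from floating-point rounding, which is exactly what a projection implies.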

One final dimension: Multiple disciplines frequently have different takes on the same subject. Just as physicists and engineers use calculus slightly differently than mathematicians do, so do economists use linear regression differently than mathematicians. I finally latched on to the fascinating universe of the econometricians with a gem of a book, Mostly Harmless Econometrics, published in 2009 by an MIT professor named Joshua Angrist.

I came across this book entirely by accident. I was reading a Health Care Cost Institute paper on some study of hospital unit costs, and a technique called Quantile Regression. There are lots of different types of regression out there, so surely (my naïve self thought) either Montgomery & Peck or Christensen will have mentioned this esoteric technique. I hit the index of all five texts (four of those being Christensen’s), and was perplexed to see no mention at all. At a loss, I went to Google, which led to a Wikipedia article (another underrated source on statistics; lots of fellow “stat-heads” out there are prolific Wikipedia contributors). And the Wikipedia article cited a single chapter of the Angrist book. I was hooked, and off to the races.

Angrist spends a lot of time at the start of the book describing one of the familiar mathematical mantras those of us who are math majors learned, sometimes as early as high school: “Correlation does not prove causation.” An experiment may happen to show a correlation between, say, ice cream sales and city pool drownings; but that doesn’t mean that either is causing the other. Correlation only gets us so far; in the real world, there’s more going on. Who cares more about the real world than economists? Angrist points out that in statistics, causation is the whole point. It’s not enough to just say “correlation doesn’t prove causation” and move on; we need to dig deeper. And that, in fact, is exactly what the econometrics profession has done, going all the way back to Victorian England. They have developed a plethora of statistical techniques: instrumental variables, two-stage least squares, differences-in-differences, regression discontinuity designs, and randomized trials (used to test the efficacy of new drugs, and thus important to all health actuaries), almost all of which are unfamiliar to mathematical statisticians, even at advanced levels. Instrumental variables remain to me a mysterious and odd omission from the mathematical field, somewhat analogous to it omitting the foundational physics concepts of Lagrangians and Hamiltonians. The discovery of these topics at such an advanced stage of my career made tangible the necessity of continuous learning, and was a humbling reminder that it’s impossible to know everything that’s out there—the best we can do is keep plugging away, and evangelizing on what we have learned that works.
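To make instrumental variables a little less mysterious, here is a small simulated sketch of my own (not an example from Angrist’s book). An unobserved confounder drives both x and y, so the ordinary regression slope overstates the true effect. An instrument z, which moves x but is unrelated to the confounder, recovers it through the ratio cov(z, y) / cov(z, x). Every number in the simulation, including the true effect of 2.0, is an assumption chosen for illustration.

```python
import random


def covariance(a, b):
    """Sample covariance of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)


random.seed(0)
n = 20000
true_effect = 2.0  # assumed causal effect of x on y

z = [random.gauss(0, 1) for _ in range(n)]  # instrument
u = [random.gauss(0, 1) for _ in range(n)]  # unobserved confounder
x = [zi + ui + random.gauss(0, 1) for zi, ui in zip(z, u)]
y = [true_effect * xi + 3.0 * ui + random.gauss(0, 1) for xi, ui in zip(x, u)]

ols_slope = covariance(x, y) / covariance(x, x)  # contaminated by the confounder
iv_slope = covariance(z, y) / covariance(z, x)   # uses only variation driven by z

print(f"OLS slope: {ols_slope:.2f}")
print(f"IV slope:  {iv_slope:.2f}")
```

Under this particular design the ordinary slope drifts toward 3 while the instrumental estimate settles near the assumed effect of 2. With a single instrument, this ratio gives the same answer as two-stage least squares.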

My next stop after digesting Angrist’s book and a similar follow-up of his, Mastering ‘Metrics, was a more nuts-and-bolts offering from 1994, Time Series Analysis, written by James Hamilton, a professor from Princeton. A doorstopper, coming in at 800 pages, Hamilton’s work is aesthetically stunning, both outside and in. In 22 concise-yet-complete chapters, Hamilton reviews nearly every technique, from the tried-and-true (Difference Equations) to those of recent origin (ARCH and GARCH). Particularly eye-opening was Chapter 8, the deceptively titled Linear Regression Analysis. The math is the same as what the original Montgomery/Peck text covered, but the actual linear model is a bit different: An autoregressive multiple linear regression model is used, meaning the independent variables and the dependent variable are all in the same units. We’re hypothesizing a multilinear relationship, strong enough to make inferences, between a series of variables, all of different time periods.
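The idea of regressing a series on its own earlier values can be sketched in a few lines. The example below is a simplified, single-lag version that I wrote for illustration; it is not Hamilton’s Chapter 8 treatment, which handles multiple lags. The series is generated without noise from an assumed rule, y[t] = 10 + 0.5 * y[t-1], so that the recovered coefficients can be checked by eye.

```python
def fit_ar1(series):
    """Regress y[t] on y[t-1] with an intercept, by ordinary least squares."""
    lagged = series[:-1]   # independent variable: the previous period
    current = series[1:]   # dependent variable: the current period
    n = len(current)
    mean_x = sum(lagged) / n
    mean_y = sum(current) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lagged, current))
    sxx = sum((x - mean_x) ** 2 for x in lagged)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Noise-free hypothetical series built from the assumed rule above.
series = [100.0]
for _ in range(9):
    series.append(10.0 + 0.5 * series[-1])

intercept, slope = fit_ar1(series)
next_value = intercept + slope * series[-1]  # one-step-ahead forecast

print(f"intercept={intercept:.4f}, slope={slope:.4f}, forecast={next_value:.4f}")
```

Both sides of the regression are in the same units because they are the same variable, one period apart.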

A quick digression. It’s at this point that the word “linear” actually becomes misleading, and I needed to put my mathematician hat back on. When you graph a best-fit line using the techniques of Hamilton Chapter 8, you get a squiggly set of line segments (depending on how many points of history and projected periods you’re modeling). That’s because the result isn’t a line—the “line” is the plane generated by the estimated coefficients, and the predicted values fall out of those coefficients being cross-multiplied with the observed actual values of the independent variables. Again, another epiphany: In many contexts, especially linear contexts, linear regression becomes a neutral method to evaluate data. The idea in business forecasting (usually) isn’t to get something right to 0.001% precision; it’s to set aside preconceived assumptions and to predict “inflection points” sooner than your competitors do. Sometimes the data scientists who didn’t get trained in rigorous mathematical statistics “get it right” in this area—if you are applying a tool consistently, if you can find an edge that lets you sense areas worthy of further research, you’re on the path to finding business insights.

Another digression. Those of you who wrapped up a fellowship in the Society of Actuaries (SOA) in the early 1990s may recall an attempt by the SOA to get us trained on, and eventually employed in, the asset side of the balance sheet of financial institutions. Wall Street salaries were just starting to really take off (Liar’s Poker had been published in 1989—another book, by the way, which should be on every actuary’s reading list), and the SOA had some anxiety that actuaries would be left behind. Which, in a sense, we were; actuaries on Wall Street are as rare today as they were back then, even though you would think our mathematical aptitude would make us naturals at options and derivatives. One reason the SOA’s efforts fell short, I believe, was in the hit-or-miss way they tried to teach us financial economics. The “no arbitrage” rule of financial economics (no asset can be simultaneously bought and then sold for a different price) violated defined benefit pension plan assumption picking, where you discount your liabilities at a higher rate if your stated investment strategy is more aggressive. For a while in the mid-1990s, this divergence was called “The Great Debate.” It was never settled; it just faded away as defined benefit pension plans went away. Financial economics still exists, though. It’s explained very well in The Econometrics of Financial Markets, by Campbell/Lo/MacKinlay.

✼ ✼ ✼

We’re getting to the punchline. You may have noticed the title of the Hamilton book is Time Series Analysis. Once we’re in the realm of time series, we’ve done two things. First, we’re now in the “sweet spot” of the most important problem facing health (and, in a slightly different way, property/casualty) actuaries: predicting medical pmpm’s by time period—be it month, quarter, or year. In the original linear regression readings we encountered, I mentioned the concept of autocorrelation—the correlation of predictive variables with themselves. Autocorrelation is a bad thing in “regular” linear regression, but a good thing in time series, and in fact something we look for, seek to harness, and then assume will recur.

Again, even after three years of wrestling with this issue, I remain amazed that mathematicians embrace linear regression but shy away from time series, leaving that field to the economists. Two exceptions are/were Peter Brockwell and Richard Davis, professors at the Department of Statistics at Colorado State University. In the 1980s, they published a veritable masterpiece, Time Series: Theory and Methods. Their textbook does it all: It starts with the rigorous theorems of mathematical probability (as structured by a Russian mathematician named Kolmogorov in the 1930s), building up to AR (Auto Regressive), MA (Moving Average), ARMA, and ARIMA (Auto Regressive Integrated Moving Average) models, and culminating in SARIMA (Seasonal Auto Regressive Integrated Moving Average) modeling, which in its most natural format allows prediction of data series on a monthly basis that (1) trend over time, and (2) have seasonal tendencies, such as heavy or low months. This can be anything from economic series such as retail sales, to stock prices. The possibilities are almost endless to the skilled modeler, limited only by the availability of reliable data, which obviously becomes more and more accessible (and costless) with each passing year to even amateurs.

R and Python (and even SAS) permit very easy modeling of ARIMA models. You’ll frequently see notation such as ARIMA(2, 1, 2), which means that the Auto Regressive elements are “second order,” the data is “differenced” once (meaning it has near-constant trend [the first derivative of the data {trend, in medical insurance} is significant, but the second derivative is near-zero]), and the Moving Average elements are “second order.” 
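The middle number in that notation is easy to see in action. Below is a minimal sketch of differencing, using a helper I have named `difference` and a hypothetical pmpm series with a perfectly constant trend, so the arithmetic can be followed by hand.

```python
def difference(series, order=1):
    """Apply first differencing `order` times: each pass replaces the series
    with the changes between consecutive values."""
    result = list(series)
    for _ in range(order):
        result = [later - earlier for earlier, later in zip(result, result[1:])]
    return result


# Hypothetical pmpm series rising by exactly 4 each month.
pmpm = [400, 404, 408, 412, 416, 420]

first = difference(pmpm)            # [4, 4, 4, 4, 4]: the trend
second = difference(pmpm, order=2)  # [0, 0, 0, 0]: nothing left to difference

print(first)
print(second)
```

Real data will not be this tidy, but the same logic applies: if the once-differenced series hovers around a constant and the twice-differenced series hovers around zero, differencing once is a reasonable starting point.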

The orders of the AR and MA elements of an ARIMA or SARIMA model have different implications for the signs of the derived coefficients and the behavior of the autocorrelations and partial autocorrelations, which the skilled model builder can use in developing “best models,” and then “best-fit” within a model. And, naturally, there’s a textbook that specifically addresses these tools and techniques: Forecasting with Univariate Box-Jenkins Models, by Alan Pankratz, a DePauw University professor, published in 1983. Pankratz’s forgotten masterpiece is perhaps the most relentlessly actionable statistics text I have ever read. In 15 specific, actual, real-world case-studies, he uses UBJ models to come up with a best model, and best-fit parameters within each model. All of his equations are regrouped and reparameterized until the formula is as concise as possible. Inverse and complement relationships across all the formulas are presented and commented on. Graphs and tables and text are all seamlessly melded together, making the text digestible at the same rate as a Stephen King thriller. The only slowdown to reading the text in a single sitting is the temptation to open your laptop and begin modeling yourself, either in R or Python or even Excel. If you began with all the above texts, you’ll be kicking yourself for not getting started with Pankratz, and you’ll be frustrated at your undergrad departments for not bringing the book to your attention while you were still in your 20s.
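The autocorrelations mentioned above are simple to compute by hand. The sketch below is my own illustration rather than anything taken from Pankratz; it computes the sample autocorrelation function for a deliberately extreme hypothetical series that flips sign every period, so the pattern is unmistakable. The function name `sample_acf` is an assumption of mine.

```python
def sample_acf(series, max_lag):
    """Sample autocorrelations for lags 0..max_lag.

    Each value is the lagged sum of cross-products of deviations from the
    mean, divided by the lag-0 sum of squared deviations.
    """
    n = len(series)
    mean = sum(series) / n
    deviations = [value - mean for value in series]
    c0 = sum(d * d for d in deviations)
    return [
        sum(deviations[t] * deviations[t - lag] for t in range(lag, n)) / c0
        for lag in range(max_lag + 1)
    ]


# Hypothetical series that alternates between +1 and -1 for 20 periods.
alternating = [1.0, -1.0] * 10

acf = sample_acf(alternating, max_lag=3)
print([round(value, 2) for value in acf])  # [1.0, -0.95, 0.9, -0.85]
```

An autocorrelation pattern that alternates in sign while slowly shrinking is the kind of fingerprint a model builder reads when choosing the AR and MA orders. Partial autocorrelations, which this sketch omits, complete the picture.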

Now, you may have noticed the title of the book at this point: “Box-Jenkins” models. Even by the time you get to Pankratz, you will have noticed “Box” (sometimes alone, sometimes as Box-Cox, sometimes as Box-Jenkins) in literally all of the other texts I mentioned before: Montgomery/Peck, Christensen, Campbell et al. You will have found yourself wondering: “‘Box’? What is that, some kind of cubical model? A person? It must be a person—it sometimes says ‘G.E.P. Box.’” 

✼ ✼ ✼

We have at long last arrived at the punchline to this article: “G.E.P. Box” was George Box, a statistics professor of British origin who ended his career at the University of Wisconsin in Madison. He collaborated in the middle part of his career with a Professor Cox, thus producing several Box-Cox papers and methods and formulas. But his final collaborators were Gwilym Jenkins and Gregory Reinsel, and together they published four editions of Time Series Analysis: Forecasting and Control. Just as theologians such as Warren Wiersbe write commentaries on the Bible, Pankratz’s UBJ Models text is a commentary on the Bible of time series, which is the Box/Jenkins/Reinsel text. It’s such a vital (to statistics enthusiasts, an almost-sacred) text that in May 2015, Greta Ljung published a Fifth Edition, crediting Box/Jenkins/Reinsel as the authors, even though all three were deceased.

Time Series Analysis is a 650-page monster, which somehow brings in the mathematical, statistical, and econometrics perspectives to modeling. It gives all the theorems and proofs, and yet never loses sight of the fact that it’s a real-world problem we’re attempting to solve. But more importantly, the reader is saturated with Box’s wisdom in model building: Use a log-transform to dampen outlier data, keep parameters to the minimum necessary to solve the problem, use indicator variables when appropriate, difference data to achieve stationarity, don’t overfit your model, use Akaike information criterion (AIC) and Bayesian information criterion (BIC) diagnostics to choose the best model, etc.
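The AIC and BIC advice can be made concrete. For a model fit by least squares with normally distributed errors, a common form (dropping constants that are the same for every model fit to the same data) is AIC = n ln(RSS/n) + 2k and BIC = n ln(RSS/n) + k ln(n), where n is the number of observations, RSS the residual sum of squares, and k the number of estimated parameters. Lower is better. The sketch below uses that form; the two candidate models and their RSS values are hypothetical.

```python
import math


def aic(rss, n, k):
    """Akaike information criterion for a Gaussian least squares fit,
    up to an additive constant shared by all models on the same data."""
    return n * math.log(rss / n) + 2 * k


def bic(rss, n, k):
    """Bayesian information criterion, same convention as aic()."""
    return n * math.log(rss / n) + k * math.log(n)


n = 60  # hypothetical: five years of monthly observations

# Hypothetical candidates: the larger model fits only slightly better.
small = {"rss": 120.0, "k": 2}
large = {"rss": 118.0, "k": 5}

aic_small = aic(small["rss"], n, small["k"])
aic_large = aic(large["rss"], n, large["k"])
bic_small = bic(small["rss"], n, small["k"])
bic_large = bic(large["rss"], n, large["k"])

print(f"AIC small={aic_small:.2f}, large={aic_large:.2f}")
print(f"BIC small={bic_small:.2f}, large={bic_large:.2f}")
```

Here both criteria favor the smaller model, because three extra parameters buy almost no reduction in error. That is the “keep parameters to the minimum necessary” advice expressed as arithmetic.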

Box/Jenkins/Reinsel end up in a more permanent place on the statistics Mount Rushmore, unlike Christensen, because although the latter shares the formers’ passion for the topic, Box, Jenkins, and Reinsel take an approach to statistical modeling and make it their own. If you ever inherit a model from a predecessor at work, you will be able to deduce if that predecessor was a disciple of George Box by the Box-like tools s/he used—or didn’t use—in building the model. And if you are ever charged with building a model yourself, you will train yourself to use Box-like tools when you hit obstacles in your model building. 

But most importantly, you will come to understand that building that model was not the “end” of your journey. It was a way station on the path of lifelong learning, which will continue even after your actuarial career is over. If anything, it will broaden even more when that day comes, because you’ll have more free time to tackle this grueling but rewarding topic.

PAUL CONLIN, MAAA, FSA is a senior actuarial director at Aetna, a CVS Health company, and works from home in Lake Zurich, Ill.
