## Posts Tagged ‘the simpsons’

### Well, well, well. Look who’s come crawling back

July 22, 2012

I’m not sure how one returns to blogging after a two-year hiatus. Probably it should be with something more substantial than this post, which is about regularization in statistics and how I came to understand it. (This account comes from Vapnik’s The Nature of Statistical Learning Theory, which I should have finished by now but I keep getting distracted by “classes” and “research” and also “television” and “movies.”)

One of the most basic questions I had when initially learning statistics was: Why do we need to regularize least-squares regression? The answer, I was told, was to prevent overfitting. But this wasn’t totally satisfactory; in particular, I never learned why overfitting is a phenomenon that should be expected to occur, or why ridge regression solves it.

The answer, it turns out, is right there on the Wikipedia page. Speaking very generally: we have some data, say $A, b$, and a class of functions $f$ from which we want to draw our model. The goal is to minimize the distance between $f(A)$ and $b$. If we can measure this distance with some metric $d$, we have a functional $R(f) = d(f(A), b)$, which we seek to minimize.

Now we want this functional to have the property that if the response data $b$ changes a tiny bit, then the $f$ at which our functional is minimized only changes a tiny bit. (Because we expect the data we measure to have some level of noise.) But this is not the case in general. Actually, it’s not even the case when $f$ is linear and our functional is simply $R(f) = |Af - b|^2$, i.e., when we’re doing least-squares linear regression: if the columns of $A$ are nearly collinear, a small change in $b$ can move the minimizer a long way. However, if we instead consider the related functional $R^*(f) = |Af - b_\delta|^2 + c_\delta\Omega(f)$, where $b_\delta$ is the noisy version of $b$ that we actually observe, $c_\delta > 0$ is a constant chosen according to the noise level $\delta$, and $\Omega$ is a penalty functional (taking $\Omega(f) = |f|^2$ gives ridge regression), then this does have the desired property: our problem is well-posed. And in general, we can regularize many implicit functionals in many paradigms in the same way.
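To see the instability numerically, here is a minimal sketch in Python with NumPy. Everything in it is an assumption made for illustration and not taken from Vapnik: the synthetic, nearly collinear design matrix, the size of the noise, the penalty $\Omega(f) = |f|^2$, and the penalty weight `c = 1e-2`. It fits the same data twice, once clean and once slightly perturbed, and measures how far the fitted coefficients move.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic design matrix with two nearly collinear columns (illustrative only).
n = 50
x = rng.normal(size=n)
A = np.column_stack([x, x + 1e-6 * rng.normal(size=n)])

# Clean response, and the same response with a tiny amount of noise added.
f_true = np.array([1.0, 1.0])
b = A @ f_true
b_delta = b + 1e-3 * rng.normal(size=n)


def least_squares(A, b):
    """Minimizer of |Af - b|^2."""
    return np.linalg.lstsq(A, b, rcond=None)[0]


def ridge(A, b, c):
    """Minimizer of |Af - b|^2 + c * |f|^2, via the regularized normal equations."""
    p = A.shape[1]
    return np.linalg.solve(A.T @ A + c * np.eye(p), A.T @ b)


c = 1e-2  # hypothetical penalty weight, chosen by hand for this example

# How far does the minimizer move when b changes by a tiny amount?
ols_shift = np.linalg.norm(least_squares(A, b_delta) - least_squares(A, b))
ridge_shift = np.linalg.norm(ridge(A, b_delta, c) - ridge(A, b, c))

print(f"size of the change in b:       {np.linalg.norm(b_delta - b):.2e}")
print(f"change in least-squares fit:   {ols_shift:.2e}")
print(f"change in ridge fit (c={c}): {ridge_shift:.2e}")
```

The ridge fit should move far less than the plain least-squares fit. The penalty bounds how much any direction in the data can be amplified, whereas least squares divides by the smallest singular value of $A$, which here is tiny.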

I also learned — though not from Vapnik — that you can derive ridge regression by putting a normal prior on the parameters in your model and then doing standard Bayesian things. But this doesn’t really explain why overfitting is a problem in the way the above account does — at least I don’t think so.
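For reference, here is a sketch of that Bayesian derivation. The setup is an assumption on my part and not something from the account above: Gaussian noise with variance $\sigma^2$ on the response, and an independent normal prior with variance $\tau^2$ on each coefficient of $f$.

$$
b \mid f \sim N(Af, \sigma^2 I), \qquad f \sim N(0, \tau^2 I)
$$

$$
-\log p(f \mid b) = \frac{1}{2\sigma^2}|Af - b|^2 + \frac{1}{2\tau^2}|f|^2 + \text{const}
$$

Minimizing this over $f$ is the same as minimizing $|Af - b|^2 + c\,|f|^2$ with $c = \sigma^2/\tau^2$, so the posterior mode is the ridge estimate, and a tighter prior (smaller $\tau$) corresponds to a heavier penalty.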

Question: Is there a natural correspondence between well-posed problems and “good” prior distributions on the parameters? (Meta-question: Is there a way to make that question more precise?)

(The title of this post is a Simpsons reference. Some things never change.)

### On bias

January 23, 2010

[From “Homer the Smithers”: Mr. Burns sends Smithers on a forced vacation and tasks him with finding a temporary replacement.]

Smithers: I’ve got to find a replacement who won’t outshine me. Perhaps if I search the employee evaluations for the word ‘incompetent…’

[Computer beeps, screen displays ‘714 Matches Found’]

Smithers: 714 names?! Heh. Better be more specific. ‘Lazy,’ ‘clumsy,’ ‘dimwitted,’ ‘monstrously ugly.’

[After a couple seconds, computer beeps, screen displays ‘714 Matches Found’]

Smithers: Ah, nuts to this, I’ll just go get Homer Simpson.

Actually, I tried to find the statistical nomenclature for this kind of thing, but couldn’t. Anyone have any idea what this is? (I want to say selection bias, but that’s not quite it…)

New Extremal Toolbox post should be coming later this weekend.