I’m not sure how one returns to blogging after a two-year hiatus. Probably it should be with something more substantial than this post, which is about regularization in statistics and how I came to understand it. (This account comes from Vapnik’s *Nature of Statistical Learning Theory*, which I should have finished by now but I keep getting distracted by “classes” and “research” and also “television” and “movies.”)

One of the most basic questions I had when initially learning statistics was: Why do we need to regularize least-squares regression? The answer, I was told, was to prevent overfitting. But this wasn’t totally satisfactory; in particular, I never learned why overfitting is a phenomenon that should be expected to occur, or why ridge regression solves it.

The answer, it turns out, is right there on the Wikipedia page. Speaking very generally: we have some data, say $(x, y)$, and a class of functions $\mathcal{F}$ from which we want to draw our model. The goal is to minimize the distance between $f(x)$ and $y$; if we can measure this distance, we have a functional $R(f)$, which we seek to minimize.

Now we want this functional to have the property that if the data of the response variable $y$ changes a tiny bit, then the $f$ at which our functional is minimized only changes a tiny bit. (Because we expect the data we measure to have some level of noise.) But this is not the case in general. Actually, it’s not even the case when our functional is simply $\|y - X\beta\|^2$, i.e., when we’re doing least-squares linear regression. However, if we instead consider the related functional $\|y - X\beta\|^2 + \lambda\|\beta\|^2$ with $\lambda > 0$, then this does have the desired property: our problem is well-posed. And in general, we can regularize many implicit functionals in many paradigms in the same way.
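To make the stability claim concrete, here is a small numerical sketch (mine, not Vapnik’s). It assumes a linear model with two nearly collinear predictors, which is one common way least squares becomes ill-posed. The function names `least_squares` and `ridge`, the sample size, the noise scale and the choice `lam = 1.0` are all made up for illustration.

```python
import numpy as np

def least_squares(X, y):
    # Minimizer of ||y - X b||^2, computed with NumPy's least-squares solver.
    return np.linalg.lstsq(X, y, rcond=None)[0]

def ridge(X, y, lam):
    # Minimizer of ||y - X b||^2 + lam * ||b||^2,
    # obtained from the regularized normal equations (X^T X + lam I) b = X^T y.
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
n = 50

# Two predictors that are almost identical, so X^T X is nearly singular.
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)
X = np.column_stack([x1, x2])

# A response, and the same response with a tiny amount of noise added.
y = x1 + x2
y_perturbed = y + 1e-3 * rng.normal(size=n)

# How far does the minimizer move when the response moves a tiny bit?
ols_shift = np.linalg.norm(least_squares(X, y_perturbed) - least_squares(X, y))
ridge_shift = np.linalg.norm(ridge(X, y_perturbed, 1.0) - ridge(X, y, 1.0))

print("shift in least-squares coefficients:", ols_shift)
print("shift in ridge coefficients:        ", ridge_shift)
```

In this setup the least-squares coefficients should move by far more than the perturbation itself, while the ridge coefficients barely move. That is the sense in which adding the penalty makes the problem well-posed.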

I also learned — though not from Vapnik — that you can derive ridge regression by putting a normal prior on the parameters in your model and then doing standard Bayesian things. But this doesn’t really explain why overfitting is a problem in the way the above account does — at least I don’t think so.
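For reference, the Bayesian derivation can be sketched as follows. I’m assuming Gaussian noise with variance $\sigma^2$ and an independent zero-mean normal prior with variance $\tau^2$ on each coefficient; these symbols are my own choices, not taken from any particular source.

```latex
\begin{align*}
y \mid \beta &\sim N(X\beta,\ \sigma^2 I), \qquad \beta \sim N(0,\ \tau^2 I) \\
-\log p(\beta \mid y) &= \frac{1}{2\sigma^2}\,\|y - X\beta\|^2
  + \frac{1}{2\tau^2}\,\|\beta\|^2 + \text{const} \\
\hat\beta_{\text{MAP}} &= \arg\min_\beta\ \|y - X\beta\|^2 + \lambda\,\|\beta\|^2,
  \qquad \lambda = \frac{\sigma^2}{\tau^2}
\end{align*}
```

So under these assumptions the posterior mode is the ridge estimate, and a tighter prior (smaller $\tau^2$) corresponds to a larger penalty $\lambda$.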

Question: Is there a natural correspondence between well-posed problems and “good” prior distributions on the parameters? (Meta-question: Is there a way to make that question more precise?)

(The title of this post is a Simpsons reference. Some things never change.)

Tags: embarrassing myself, math.ST, regularization, statistics, the simpsons

This entry was posted on July 22, 2012 at 16:46 and is filed under Uncategorized.

July 22, 2012 at 22:32

I love your “embarrassing myself” tag. Will adopt it!

July 22, 2012 at 22:40

Credit where credit’s due: I stole the tag from Scott Aaronson’s blog.
