Posts Tagged ‘embarrassing myself’

Well, well, well. Look who’s come crawling back

July 22, 2012

I’m not sure how one returns to blogging after a two-year hiatus. Probably it should be with something more substantial than this post, which is about regularization in statistics and how I came to understand it. (This account comes from Vapnik’s The Nature of Statistical Learning Theory, which I should have finished by now but I keep getting distracted by “classes” and “research” and also “television” and “movies.”)

One of the most basic questions I had when initially learning statistics was: Why do we need to regularize least-squares regression? The answer, I was told, was to prevent overfitting. But this wasn’t totally satisfactory; in particular, I never learned why overfitting is a phenomenon that should be expected to occur, or why ridge regression solves it.

The answer, it turns out, is right there on the Wikipedia page. Speaking very generally: we have some data, say A and b, and a class of functions from which we want to draw our model f. The goal is to minimize the distance between f(A) and b — if we can measure this distance, we have a functional R(f) = d(f(A), b), which we seek to minimize.

Now we want this functional to have the property that if the response data b changes a tiny bit, then the f at which our functional is minimized only changes a tiny bit. (Because we expect the data we measure to have some level of noise.) But this is not the case in general — actually, it’s not even the case when our functional is simply R(f) = |Af - b|^2, i.e., when we’re doing least-squares linear regression. However, suppose we only observe a noisy version b_\delta of b, and we instead minimize the related functional R^*(f) = |Af - b_\delta|^2 + c_\delta\Omega(f), where \Omega is a non-negative penalty (taking \Omega(f) = |f|^2 recovers ridge regression) and c_\delta is a coefficient chosen according to the noise level \delta. Then this does have the desired property — our problem is well-posed. And we can regularize many other ill-posed minimization problems, in many paradigms, in the same way.
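
Here is a small numerical sketch of that instability (just numpy and a made-up, nearly-collinear design matrix; the helper names are mine): the ordinary least-squares minimizer jumps around when b is perturbed by a tiny bit of noise, while the ridge minimizer with a modest penalty barely moves.

```python
import numpy as np

rng = np.random.default_rng(0)

# An ill-conditioned design matrix: two nearly collinear columns.
A = np.array([[1.0, 1.0],
              [1.0, 1.0001],
              [1.0, 0.9999]])
b = A @ np.array([1.0, 2.0]) + 0.01 * rng.standard_normal(3)

def least_squares(A, b):
    # Ordinary least squares: minimize |Af - b|^2.
    return np.linalg.lstsq(A, b, rcond=None)[0]

def ridge(A, b, c):
    # Tikhonov/ridge: minimize |Af - b|^2 + c|f|^2,
    # i.e. solve (A^T A + c I) f = A^T b.
    k = A.shape[1]
    return np.linalg.solve(A.T @ A + c * np.eye(k), A.T @ b)

# Perturb b by a "tiny bit" of noise and compare how much the minimizers move.
b_delta = b + 1e-3 * rng.standard_normal(3)

print("OLS shift:  ", np.linalg.norm(least_squares(A, b) - least_squares(A, b_delta)))
print("Ridge shift:", np.linalg.norm(ridge(A, b, 0.1) - ridge(A, b_delta, 0.1)))
```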

I also learned — though not from Vapnik — that you can derive ridge regression by putting a normal prior on the parameters in your model and then doing standard Bayesian things. But this doesn’t really explain why overfitting is a problem in the way the above account does — at least I don’t think so.
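
In case it’s useful, here is what those “standard Bayesian things” look like written out (my notation, not anyone’s book): a Gaussian likelihood for b given the parameters, plus a mean-zero Gaussian prior on the parameters, gives a negative log-posterior that is exactly the ridge objective, with regularization coefficient \sigma^2/\tau^2.

```latex
% Sketch, assuming b | f ~ N(Af, \sigma^2 I) and prior f ~ N(0, \tau^2 I):
\begin{align*}
  -\log p(f \mid A, b)
    &= \frac{1}{2\sigma^2}\lVert Af - b\rVert^2
     + \frac{1}{2\tau^2}\lVert f\rVert^2 + \text{const}, \\
  \hat{f}_{\mathrm{MAP}}
    &= \arg\min_f \; \lVert Af - b\rVert^2
     + \frac{\sigma^2}{\tau^2}\lVert f\rVert^2 .
\end{align*}
```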

Question: Is there a natural correspondence between well-posed problems and “good” prior distributions on the parameters? (Meta-question: Is there a way to make that question more precise?)

(The title of this post is a Simpsons reference. Some things never change.)

2^n \geq n: The graph theory proof?

February 15, 2010

Theorem. For every positive integer n, 2^n \geq n.

Proof. Consider a tree on n vertices T = (V, E) with root v_0. Assign to each edge \{v, w\} an element of the vector space GF(2)^V, obtained by setting 1s in the coordinates corresponding to v and w and 0s elsewhere. I claim that these vectors are linearly independent; for suppose otherwise, and let the vectors corresponding to some nonempty S \subset E sum to 0. There is a natural “distance” function on E w/r/t v_0; let e_0 = \{s, t\} have maximal distance in S, and suppose WLOG that t is farther than s from v_0. Then the coordinate corresponding to t is nonzero for exactly one element of S (any other edge containing t would have greater distance than e_0, so it cannot lie in S), and the sum over all elements of S is therefore nonzero in that coordinate. This is a contradiction. So in particular these |E| = n - 1 vectors are distinct and nonzero; since GF(2)^V contains only 2^n - 1 nonzero vectors, Pigeonhole gives n - 1 \leq 2^n - 1, i.e., n \leq 2^n.
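
If you’d rather see this checked numerically than argued, here is a throwaway numpy sketch (my own code, with a random tree built by giving each vertex a parent of smaller index): it forms the edge vectors in GF(2)^V and verifies by Gaussian elimination mod 2 that they really are linearly independent.

```python
import numpy as np

def tree_edge_vectors(n, seed=0):
    # Random tree on vertices 0..n-1 rooted at 0: each vertex i > 0
    # picks a parent uniformly among 0..i-1.
    rng = np.random.default_rng(seed)
    vecs = []
    for i in range(1, n):
        parent = int(rng.integers(0, i))
        v = np.zeros(n, dtype=np.uint8)
        v[i] = v[parent] = 1          # 1s at the edge's two endpoints
        vecs.append(v)
    return np.array(vecs)             # shape (n-1, n)

def rank_gf2(M):
    # Gaussian elimination over GF(2).
    M = M.copy() % 2
    rows, cols = M.shape
    rank, col = 0, 0
    while rank < rows and col < cols:
        pivot = next((r for r in range(rank, rows) if M[r, col]), None)
        if pivot is None:
            col += 1
            continue
        M[[rank, pivot]] = M[[pivot, rank]]   # swap pivot row into place
        for r in range(rows):
            if r != rank and M[r, col]:
                M[r] ^= M[rank]               # eliminate mod 2
        rank += 1
        col += 1
    return rank

for n in range(2, 12):
    E = tree_edge_vectors(n)
    assert rank_gf2(E) == n - 1       # the n-1 edge vectors are independent
print("edge vectors independent over GF(2) for n = 2..11")
```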

(more…)

Is the Super Bowl a Z/2Z-graded bowl?

February 8, 2010

And if so, what’s just a regular ungraded bowl?

(Hat tip to Vladimir S. for the bad pun; new post coming soon, really, I promise.)