## Posts Tagged ‘embarrassing myself’

### Well, well, well. Look who’s come crawling back

July 22, 2012

I’m not sure how one returns to blogging after a two-year hiatus. Probably it should be with something more substantial than this post, which is about regularization in statistics and how I came to understand it. (This account comes from Vapnik’s Nature of Statistical Learning Theory, which I should have finished by now but I keep getting distracted by “classes” and “research” and also “television” and “movies.”)

One of the most basic questions I had when initially learning statistics was: Why do we need to regularize least-squares regression? The answer, I was told, was to prevent overfitting. But this wasn’t totally satisfactory; in particular, I never learned why overfitting is a phenomenon that should be expected to occur, or why ridge regression solves it.

The answer, it turns out, is right there on the Wikipedia page.  Speaking very generally: we have some data, say $A, b$, and a class of functions $f$ from which we want to draw our model. The goal is to minimize the distance between $f(A)$ and $b$ — if we can measure this distance, we have a functional $R(f) = d(f(A), b)$, which we seek to minimize.

Now we want this functional to have the property that if the data of the response variable $b$ changes a tiny bit, then the $f$ at which our functional is minimized only changes a tiny bit. (Because we expect the data we measure to have some level of noise.) But this is not the case in general — actually, it’s not even the case when our functional is simply $R(f) = |Af - b|^2$, i.e., when we’re doing least-squares linear regression. However, if we instead consider the related functional $R^*(f) = |Af - b_\delta|^2 + c_\delta\Omega(f)$, then this does have the desired property — our problem is well-posed. And in general, we can regularize many implicit functionals in many paradigms in the same way.

I also learned — though not from Vapnik — that you can derive ridge regression by putting a normal prior on the parameters in your model and then doing standard Bayesian things. But this doesn’t really explain why overfitting is a problem in the way the above account does — at least I don’t think so.

Question: Is there a natural correspondence between well-posed problems and “good” prior distributions on the parameters? (Meta-question: Is there a way to make that question more precise?)

(The title of this post is a Simpsons reference. Some things never change.)

### 2^n \geq n: The graph theory proof?

February 15, 2010

Theorem. For every positive integer n, $2^n \geq n$.

Proof. Consider a tree on n vertices $T = (V, E)$ with root $v_0$. Assign to each edge $\{v, w\}$ an element of the vector space $GF(2)^V$, obtained by setting 1s in the the coordinates corresponding to v and w and 0s elsewhere. I claim that these vectors are linearly independent; for suppose otherwise, and letthe vectors corresponding to $S \subset E$ sum to 0. There is a natural “distance” function on E w/r/t $v_0$; let $e_0 = \{s, t\}$ have maximal distance in S, and suppose WLOG that t is farther than s from $v_0$. Then the coordinate corresponding to t is nonzero for exactly one element of S, and the sum over all elements of S must therefore be nonzero. This is a contradiction. So in particular these |E| = n-1 elements are distinct and nonzero, which means (by Pigeonhole) that there are at most $2^n-1$ of them.

### Is the Super Bowl a Z/2Z-graded bowl?

February 8, 2010

And if so, what’s just a regular ungraded bowl?

(Hat tip to Vladimir S. for the bad pun; new post coming soon, really, I promise.)