## Archive for the ‘Uncategorized’ Category

### Well, well, well. Look who’s come crawling back

July 22, 2012

I’m not sure how one returns to blogging after a two-year hiatus. Probably it should be with something more substantial than this post, which is about regularization in statistics and how I came to understand it. (This account comes from Vapnik’s *The Nature of Statistical Learning Theory*, which I should have finished by now but I keep getting distracted by “classes” and “research” and also “television” and “movies.”)

One of the most basic questions I had when initially learning statistics was: Why do we need to regularize least-squares regression? The answer, I was told, was to prevent overfitting. But this wasn’t totally satisfactory; in particular, I never learned why overfitting is a phenomenon that should be expected to occur, or why ridge regression solves it.

The answer, it turns out, is right there on the Wikipedia page. Speaking very generally: we have some data, say $A, b$, and a class of functions from which we want to draw our model $f$. The goal is to minimize the distance between $f(A)$ and $b$ — if we can measure this distance, we have a functional $R(f) = d(f(A), b)$, which we seek to minimize.

Now we want this functional to have the property that if the response data $b$ change a tiny bit, then the $f$ at which our functional is minimized also changes only a tiny bit. (We expect the data we measure to carry some level of noise.) But this is not the case in general; in fact, it already fails when our functional is simply $R(f) = |Af - b|^2$, i.e., when we’re doing least-squares linear regression with an ill-conditioned $A$. However, if we instead minimize the related functional $R^*(f) = |Af - b_\delta|^2 + c_\delta\Omega(f)$, where $b_\delta$ is the noisy measured data, $\Omega$ is a stabilizing functional (taking $\Omega(f) = |f|^2$ gives ridge regression), and the coefficient $c_\delta$ is matched to the noise level $\delta$, then the problem does have the desired property: it is well-posed. This is Tikhonov regularization, and the same device makes many other ill-posed minimization problems well-posed.
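To make the instability concrete, here is a minimal numerical sketch (a made-up $2 \times 2$ example of my own, not from Vapnik): a $10^{-4}$ perturbation of $b$ sends the least-squares solution flying, while the ridge-regularized solution barely moves.

```python
import numpy as np

# An ill-conditioned design matrix: the two columns are nearly parallel.
A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])
b1 = np.array([2.0, 2.0001])  # "clean" response
b2 = np.array([2.0, 2.0002])  # the same response, perturbed by 1e-4

# Plain least squares: the minimizer of |Af - b|^2.
ols1 = np.linalg.solve(A, b1)  # ~ [1, 1]
ols2 = np.linalg.solve(A, b2)  # ~ [0, 2]: a huge jump

# Ridge: the minimizer of |Af - b|^2 + c|f|^2, via the normal equations.
c = 1e-2
def ridge(b):
    return np.linalg.solve(A.T @ A + c * np.eye(2), A.T @ b)

d_ols = np.linalg.norm(ols1 - ols2)       # about 1.41
d_ridge = np.linalg.norm(ridge(b1) - ridge(b2))  # tiny
print(d_ols, d_ridge)
```

The penalty term $c|f|^2$ lifts the smallest eigenvalue of $A^TA$ away from zero, which is exactly what restores well-posedness here.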

I also learned — though not from Vapnik — that you can derive ridge regression by putting a normal prior on the parameters in your model and then doing standard Bayesian things. But this doesn’t really explain why overfitting is a problem in the way the above account does — at least I don’t think so.
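For the record, that derivation is short (a standard computation; the variance symbols below are my own choice of notation): with likelihood $b \mid f \sim N(Af, \sigma^2 I)$ and prior $f \sim N(0, \tau^2 I)$, the log-posterior is

$$\log p(f \mid b) = -\frac{1}{2\sigma^2}|Af - b|^2 - \frac{1}{2\tau^2}|f|^2 + \text{const},$$

so the MAP estimate minimizes $|Af - b|^2 + \frac{\sigma^2}{\tau^2}|f|^2$ — ridge regression with penalty $c = \sigma^2/\tau^2$.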

Question: Is there a natural correspondence between well-posed problems and “good” prior distributions on the parameters? (Meta-question: Is there a way to make that question more precise?)

(The title of this post is a Simpsons reference. Some things never change.)

### My game-design career comes to a complete stop

August 3, 2009

Serious post coming as soon as I think of something serious to post about.

### A FOCS-related question

July 27, 2009

Actually, it’s not so much related to FOCS as to a specific paper in FOCS. I’m talking about the paper of Basu and Zell in which they formulate and prove an analogue over the real numbers of Toda’s theorem. I happen to know virtually nothing about real complexity theory, and even in standard complexity theory Toda’s theorem is pretty far from my usual interests of circuit complexity and derandomization, but their abstract nevertheless caught my attention for a single reason: the method of proof.

As far as I know, all of the proofs of (the classical) Toda’s theorem rely on heavy machinery from complexity and combinatorics, and tend to be quite involved. Basu and Zell’s paper also seems to be pretty long and technical, but their result is formulated and proved using tools from algebraic topology. For instance, the real analogue of #P is defined using the Poincaré polynomial (the generating function of the Betti numbers, from a combinatorialist’s viewpoint), and Basu and Zell are kind enough to provide a “geometric” definition of #P as well. My question is whether this topological or geometric intuition can be transported back over $GF_2$ (that is, to classical complexity) to produce a new proof of Toda’s theorem.

By the way, if this post sounds like I don’t know what I’m talking about even more so than usual, that’s because I don’t know what I’m talking about even more so than usual. Apologies.

### Non-update

July 16, 2009

This is sort of a word of explanation as to why I haven’t been as active in the past week or two as I had previously been, and a bit of a preview of my plans for the blog in the near future.

First, my research project on counting acyclic orientations has been moving along, but I don’t quite feel comfortable (yet) posting my progress on this blog. The good news is that there seems to be a promising path to an asymptotic formula for at least a toy case; unfortunately, there have been other demands on my time lately and I haven’t been able to set aside a few hours to pursue it. Once I have that chance, I’ll probably say a little more about what’s happening on that front.


### Generating functions: they’re like buttah

June 28, 2009

An amusing passage from Doron Zeilberger’s PCM article on enumerative combinatorics (which I’ve finally gotten around to reading):

> According to the modern approach, pioneered by Pólya, Tutte, and Schützenberger, generating functions are neither “generating,” nor are they functions.

(The second post on Tutte(1, 0) is forthcoming; I need to get in touch with the student I’m working with before I post it. I’m also working on a post on category theory in combinatorics, which will hopefully be up on Monday.)

### Elementary properties of the Tutte polynomial at (1, 0)

June 25, 2009

The Tutte polynomial of a graph G is a widely-studied and very useful graph invariant. It’s a polynomial in two variables x, y that extends, among other things, the chromatic and flow polynomials of a graph. In a certain (well-defined) sense, it’s actually the most general object that obeys what’s called the deletion-contraction recurrence, which I’ll talk about in a bit.

One particular specialization of the Tutte polynomial of interest is its value at x = 1, y = 0. This has at least two useful combinatorial interpretations: it counts the number of acyclic orientations of the edges of G with a fixed unique source, or, equivalently, it counts what are called the “maximal parking functions” of G. As it happens, this second interpretation is the one I’m most interested in, but the first is more useful for this discussion, so we’ll only consider it here. Throughout this post, we’ll let M(G) = T_G(1, 0) be the Tutte polynomial evaluated at (1, 0).
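Since this post leans on deletion-contraction, here is a minimal sketch of the recurrence in Python (my own edge-list representation; nothing here comes from the post itself). A loop contributes a factor of y, a bridge a factor of x, and otherwise T(G) = T(G − e) + T(G/e):

```python
from collections import defaultdict

def connected(edges, a, b):
    """Is b reachable from a using the given edge multiset?"""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, stack = {a}, [a]
    while stack:
        n = stack.pop()
        if n == b:
            return True
        for m in adj[n] - seen:
            seen.add(m)
            stack.append(m)
    return False

def tutte(edges, x, y):
    """Evaluate T_G(x, y) by deletion-contraction; edges is a list of pairs."""
    if not edges:
        return 1
    (u, v), rest = edges[0], edges[1:]
    if u == v:                         # loop: factor of y
        return y * tutte(rest, x, y)
    # contracting (u, v): relabel v as u in the remaining edges
    contracted = [(u if a == v else a, u if b == v else b) for a, b in rest]
    if not connected(rest, u, v):      # bridge: factor of x
        return x * tutte(contracted, x, y)
    return tutte(rest, x, y) + tutte(contracted, x, y)  # delete + contract

K3 = [(0, 1), (1, 2), (0, 2)]  # the triangle
print(tutte(K3, 1, 1))  # 3: spanning trees of the triangle
print(tutte(K3, 1, 0))  # 2: acyclic orientations of K3 with a fixed unique source
```

For the triangle, T(x, y) = x² + x + y, so M(K3) = T(1, 0) = 2, matching the direct count: of the six acyclic orientations of K3, exactly two have any given vertex as their unique source. (This runs in exponential time, of course; it’s only for experimenting with small graphs.)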