Where can I find a good proof that the training objectives of CRF-based and logistic-regression-based models are convex? Is there a general trick to test/prove that a model or objective function is convex?
Best Answer
One trick is to rewrite objective functions in terms of functions which are known to be convex.
The objective function of a maximum-likelihood-trained log-linear model is a sum of negative log-likelihoods, one per datapoint. Since a sum of convex functions is convex, it's sufficient to show that the negative log-likelihood of each datapoint is convex.
Considering the datapoint fixed, we can write its negative log-likelihood term as
$$-\langle \theta,\phi(y)\rangle+\log \sum_{y'} \exp(\langle \theta,\phi(y')\rangle)$$
The first term is linear in $\theta$, so it's sufficient to show that the second term, known as the log-normalizer, is convex.
Write it as $f(\mathbf{g}(\theta))$ where $f(\mathbf{u})=\log \sum_{y'} \exp u_{y'}$ and $g_{y'}(\theta)=\langle \theta,\phi(y')\rangle$. Here $\mathbf{g}$ is a linear function of $\theta$, and $f$ is a known convex function called log-sum-exp; see page 72 of Boyd's Convex Optimization book. The composition of a convex function with a linear function is convex; see Section 3.2.2.
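If you want a quick numerical sanity check of this composition argument (a sketch, not a proof), the snippet below tests midpoint convexity of the per-datapoint negative log-likelihood on random parameter pairs. The feature matrix `Phi`, the sizes `K` and `d`, and the observed label `y` are made-up toy values; for multiclass logistic regression with input $x$ you would take $\phi(y')$ to be $x$ placed in the block corresponding to label $y'$, but any fixed `Phi` works for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 5, 8                       # toy label count and feature dimension

Phi = rng.normal(size=(K, d))     # row y' holds the feature vector phi(y')
y = 2                             # the fixed observed label

def nll(theta):
    """Per-datapoint NLL: -<theta,phi(y)> + log-normalizer."""
    scores = Phi @ theta                          # <theta, phi(y')> for every label
    m = scores.max()                              # stabilized log-sum-exp
    log_Z = m + np.log(np.exp(scores - m).sum())
    return -scores[y] + log_Z

# Midpoint convexity: f((a+b)/2) <= (f(a)+f(b))/2 must hold for convex f.
for _ in range(1000):
    a, b = rng.normal(size=d), rng.normal(size=d)
    assert nll((a + b) / 2) <= (nll(a) + nll(b)) / 2 + 1e-9
print("midpoint convexity held on all 1000 random pairs")
```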
Another approach is to use the fact that the log-normalizer is the cumulant generating function; see for instance Example 3.41 in Boyd's book, or Proposition 3.1 in Wainwright's "Graphical models, exponential families, and variational inference" manuscript. This means that its second derivative is the covariance matrix of the sufficient statistic $\phi$, which by definition is positive semi-definite, so the Hessian of the log-normalizer is positive semi-definite. A positive semi-definite Hessian guarantees that the function is convex; see Section 3.1.4 of Boyd's book.
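This identity is also easy to check numerically for a small discrete model. The sketch below (toy values again, with `eps` an arbitrary finite-difference step chosen for the example) compares a finite-difference Hessian of the log-normalizer against the covariance of $\phi$ under the model distribution, then inspects the eigenvalues for positive semi-definiteness:

```python
import numpy as np

rng = np.random.default_rng(1)
K, d = 4, 3
Phi = rng.normal(size=(K, d))      # rows are the sufficient statistics phi(y')
theta = rng.normal(size=d)

def log_Z(t):
    s = Phi @ t
    m = s.max()
    return m + np.log(np.exp(s - m).sum())

# Covariance of phi under p(y') proportional to exp<theta, phi(y')>.
p = np.exp(Phi @ theta - log_Z(theta))
mean_phi = p @ Phi
cov_phi = Phi.T @ (p[:, None] * Phi) - np.outer(mean_phi, mean_phi)

# Finite-difference Hessian of the log-normalizer for comparison.
eps = 1e-4
I = np.eye(d)
H = np.empty((d, d))
for i in range(d):
    for j in range(d):
        H[i, j] = (log_Z(theta + eps * (I[i] + I[j])) - log_Z(theta + eps * I[i])
                   - log_Z(theta + eps * I[j]) + log_Z(theta)) / eps**2

print(np.allclose(H, cov_phi, atol=1e-3))             # Hessian matches Cov[phi(Y)]
print((np.linalg.eigvalsh(cov_phi) >= -1e-12).all())  # positive semi-definite
```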
Technically, the log-normalizer is not the traditional cumulant generating function. The CGF is $g(\mathbf{u})=\log Z(\theta+\mathbf{u})-\log Z(\theta)$. However, the derivatives of the log-normalizer evaluated at $\theta$ are the same as the derivatives of the CGF evaluated at $\mathbf{0}$, so it produces cumulants just like the CGF.
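To see why, note that the two functions differ only by a shift of the argument and a constant offset:
$$\nabla_{\mathbf{u}}\, g(\mathbf{u})\Big|_{\mathbf{u}=\mathbf{0}}=\nabla_{\mathbf{u}}\left[\log Z(\theta+\mathbf{u})-\log Z(\theta)\right]\Big|_{\mathbf{u}=\mathbf{0}}=\nabla_{\theta}\log Z(\theta),$$
and since $\log Z(\theta)$ does not depend on $\mathbf{u}$, the same argument applies to all higher-order derivatives.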
I couldn't find a full proof of the equivalence; people usually omit it because it's just several steps of uninspiring algebra. A very terse derivation for a continuous output space is on page 5 of Xinhua Zhang's "Graphical Models" thesis. I believe I saw a full derivation in Lawrence D. Brown's "Fundamentals of Statistical Exponential Families".