Proximal Operators - Optimization Perspective


In many applications and domains, the crux of a problem can be boiled down to one of optimization. For simplicity, assume we are optimizing a function \(f: \mathbb{R}^d \rightarrow \mathbb{R}\) over \((\mathbb{R}^d,\lVert \cdot \rVert_2)\) that has nice properties - for instance, convexity, lower semicontinuity, etc. Without loss of generality, suppose we wish to minimize said function:

\[\min_{x \in \mathbb{R}^d} f(x).\]

For exposition’s sake, assume further that \(f \in C^1(\mathbb{R}^d)\); then one can apply a gradient descent style algorithm. That is, starting from some initialization \(x_0\), one obtains the sequence of iterates \(\{x_n\}_{n \geq 0}\) in an iterative fashion:

\[x_{n+1} = x_{n} - \alpha_{n+1} \nabla f(x_n),\]

where \(\{\alpha_n\}_{n \geq 1}\) is a sequence of step sizes.
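
As a concrete, minimal sketch of the update above, here is a short Python implementation; the quadratic test function, constant step size, and iteration count are illustrative choices, not anything prescribed by the discussion.

```python
import numpy as np

def gradient_descent(grad_f, x0, step_size=0.1, n_iters=100):
    """Iterate x_{n+1} = x_n - alpha * grad_f(x_n) with a constant step size alpha."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x - step_size * grad_f(x)
    return x

# Illustrative test: f(x) = 0.5 * ||x||_2^2 has gradient x, so the iterates
# should contract toward the minimizer at the origin.
x_star = gradient_descent(lambda x: x, x0=np.array([3.0, -2.0]))
print(x_star)
```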

Because of the ease of statement, and its prevalence within many machine learning centric disciplines, gradient descent - and its closely related optimization relatives - enjoys a lot of publicity and study. Serving also as a way of teaching students how to prove different types of convergence rates, and how a function class's characteristics affect the outcome, gradient descent style algorithms rightfully take much of the spotlight!

This being said, today I wanted to focus on the lesser-known proximal point method and the corresponding proximal operator. In some communities, where gradient descent is seen as forward Euler, the natural foil is backward Euler - which is exactly the proximal point method! If you have taken any numerical analysis or numerical PDEs course, it should come as no surprise that this method carries many theoretical benefits. Beyond its nice numerical stability properties, the proximal operator itself is of much interest for theoretical applications. The main theoretical application we wish to build towards is establishing the notion of a gradient flow in 2-Wasserstein space. Before leaving theory, we highlight how we may change our geometric interpretation of the proximal step. By doing so, we can also make connections with regularization-based ideas that often play a role in signal processing and statistical learning adjacent fields. Lastly, while these methods are generally impractical for arbitrary functions, we will highlight ideas for making them more computationally feasible - at the slight cost of using a surrogate for our functions of interest.
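
To preview the Euler analogy in symbols (a standard observation, stated here as context rather than as part of the formal development below): for the gradient flow ODE \(\dot{x}(t) = -\nabla f(x(t))\), a forward Euler step of size \(\alpha_{n+1}\) recovers the gradient descent update above, while a backward Euler step evaluates the gradient at the new iterate,

\[x_{n+1} = x_{n} - \alpha_{n+1} \nabla f(x_{n+1}),\]

which is exactly the first-order optimality condition of the proximal update defined in the next section.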

Proximal Operator and Proximal Point Method