Momentum Stiefel Optimizer, with Applications to Orthogonal Attention, and Optimal Transport
This talk will report a construction of momentum-accelerated gradient descent algorithms on Riemannian manifolds, focusing on a particular case known as the Stiefel manifold. The treatment will be based on, first, the design of continuous-time optimization dynamics on the manifold, and then a careful time-discretization that preserves all geometric structures. Since the Stiefel manifold corresponds to matrices satisfying an orthogonality constraint, two practical applications will also be described: (1) we markedly improved the performance of a Vision Transformer trained from scratch by appropriately placing orthogonality into its self-attention mechanism, and (2) our optimizer also makes the useful notion of Projection Robust Wasserstein Distance for high-dimensional optimal transport even more effective.
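To make the general idea concrete, a generic momentum step on the Stiefel manifold can be sketched as follows: project the Euclidean gradient onto the tangent space, take a heavy-ball momentum step, and retract back onto the manifold. This is only an illustrative recipe (using a QR retraction and a simple projection-based momentum transport), not the specific structure-preserving discretization developed in the talk; all function names here are hypothetical.

```python
import numpy as np

def qr_retract(Y):
    """Map a matrix back onto the Stiefel manifold via QR decomposition."""
    Q, R = np.linalg.qr(Y)
    # Fix column signs so the retraction is deterministic.
    return Q * np.sign(np.sign(np.diag(R)) + 0.5)

def stiefel_momentum_step(X, M, grad, lr=0.05, beta=0.9):
    """One heavy-ball step for X on the Stiefel manifold (illustrative only).

    X    : current point, an n-by-p matrix with X^T X = I
    M    : momentum buffer (tangent vector at X)
    grad : Euclidean gradient of the objective at X
    """
    # Project the Euclidean gradient onto the tangent space at X:
    # riem_grad = grad - X * sym(X^T grad)
    XtG = X.T @ grad
    riem_grad = grad - X @ (XtG + XtG.T) / 2
    # Heavy-ball momentum update in the tangent space.
    M = beta * M - lr * riem_grad
    # Retract the updated point back onto the manifold.
    X_new = qr_retract(X + M)
    # Transport momentum to the new point by tangent-space projection.
    XtM = X_new.T @ M
    M = M - X_new @ (XtM + XtM.T) / 2
    return X_new, M
```

For example, running this step on the objective -trace(X^T A X) for a symmetric A drives X toward the leading eigenvectors of A, while every iterate remains exactly orthonormal by construction of the QR retraction.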