ReLU nets are powerful memorizers: a tight analysis of finite sample expressive power
I will talk about finite-sample expressivity, i.e., the memorization power of ReLU networks. Recent results showed (unsurprisingly) that any N input data points can be perfectly memorized by a shallow ReLU network with a single hidden layer of N hidden nodes. I will describe a more careful construction that trades off width for depth to show that a ReLU network with 2 hidden layers, each with 2*sqrt(N) hidden nodes, can perfectly memorize any dataset of N points. Moreover, we prove that a width of Θ(sqrt(N)) is both necessary and sufficient for perfect memorization. A notable corollary of this result is that mild overparametrization suffices for a neural network to achieve zero training loss!
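As a purely illustrative, empirical sketch of this claim, the snippet below fits a 2-hidden-layer ReLU network of width about 2*sqrt(N) to N random points. Only the architecture sizes follow the abstract; the data, optimizer, learning rate, and step count are my own assumptions and this is not the paper's explicit construction, which only guarantees that interpolating weights exist.

```python
# Sketch: a width-2*sqrt(N), 2-hidden-layer ReLU net fit to N arbitrary points.
# Convergence of the optimizer is NOT implied by the expressivity result.
import math
import torch
import torch.nn as nn

N, d = 100, 10                            # number of samples and input dimension (assumed)
width = 2 * math.isqrt(N)                 # ~ 2*sqrt(N) hidden units per layer
X = torch.randn(N, d)                     # arbitrary (here: random) inputs
y = torch.randn(N, 1)                     # arbitrary real-valued targets

net = nn.Sequential(
    nn.Linear(d, width), nn.ReLU(),       # first hidden layer
    nn.Linear(width, width), nn.ReLU(),   # second hidden layer
    nn.Linear(width, 1),                  # linear output
)

opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for step in range(20_000):
    loss = nn.functional.mse_loss(net(X), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final training MSE: {loss.item():.3e}")  # typically very small for modest N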
We also extend our results to deeper networks and, combining them with recent results on the VC-dimension of deep nets, show that our bounds on memorization power are nearly tight. Time permitting, I will mention expressive power results for ResNets, as well as how SGD behaves when optimizing such networks.