How exactly is the capacity of a NN related to its expressivity? However informal their usage may be, do these terms refer to one and the same concept? or is there a subtle difference between them?
E.g. for neural network capacity:
"Informally a model's capacity is its ability to fit a wide variety of functions. Models with low capacity may struggle to fit the training set. Models with high capacity can overfit by memorizing properties of the training set that do not serve them well on the test set"
E.g. for neural network expressivity:
"The fundamental question of neural network expressivity; how the architectural properties of a NN (depth, width, layer type) affect the resulting functions it can compute, and its ensuing performance"
There definitely is a lot of overlap and interchangeability in how those terms are commonly used. I think the main distinction is that expressivity is often used to talk about what classes of functions a neural network can approximate / learn, while capacity measure some notion of how much "brute force" ability the network has to contort itself into fitting the data. This is not the "only" definition, but just what I most often come across when reading.
From Understanding deep learning requires rethinking generalization
Much effort has gone into characterizing the expressivity of neural networks, e.g, Cybenko (1989); Mhaskar (1993); Delalleau & Bengio (2011); Mhaskar & Poggio (2016); Eldan & Shamir (2016); Telgarsky (2016); Cohen & Shashua (2016). Almost all of these results are at the “population level” showing what functions of the entire domain can and cannot be represented by certain classes of neural networks with the same number of parameters.
The effective capacity of neural networks is sufficient for memorizing the entire data set
Commonly, "expressivity" is used in claims about what types of functions a particular architecture can fit. For example, from PointNet
Theoretically and experimentally we find that the expressiveness of our network is strongly affected by the dimension of the max pooling layer, i.e., K in (1).
(Followed by a theoretical analysis of the robustness of their model.)
From Revisiting Unreasonable Effectiveness of Data in Deep Learning Era
We also observe that to fully exploit 300M images, one needs higher capacity models. For example, in case of ResNet-50 the gain on COCO object detection is much smaller (1.87%) compared to (3%) when using ResNet-152.
This feels like it's referring more to the brute ability to fit more data than it is about any notion of flexibility or expressivity — after all, what can be "expressed" with 152 layers which can't with 50?
Suppose you want to learn some function which maps sets of objects to some label. A commonly used design pattern is to apply a per-object neural network to each object to obtain a feature vector for each object, then take the average/sum of the feature vectors and feed it into a second neural network.
If you make the neural networks big enough, perhaps you will have a very high capacity model. You might find that as you get more and more training data, your model can keep fitting them all without any problem. In fact even if you shuffle all the labels, the model has the capacity to just memorize what inputs should have what labels.
However, suppose later you find out that the inputs are actually ordered sets. Since the above architecture is totally unaware of the order of the input set (the average/sum operation throws that away), you'd realize it's not expressive enough when it comes to those types of problems (where order information is needed).
So, you can have a high capacity network, but with low expressivity with respect to a certain class of functions. You could also have an expressive model, but with limited capacity, for example if you didn't crank up the number of layers enough.
This is just my informal interpretation of the terms as they commonly appear in the "deep learning" literature. I am not aware of any canonical definition of either term, and to some degree they do get used interchangeably, so I think context is the most important thing here. Also, I don't closely follow the theory side of things, so it is entirely possible that community has assigned some more precise meanings to those terms.