* Given the training data and the architecture of the network, why does SGD with backprop find the particular f it does, versus any other of an infinite set of candidates?
* Why is there a whole set of functions f, each with zero loss, that work? (A toy sketch follows this list.)
* Given the weight space, and an f within it, why/when is a task/skill defined as a subset of that space covered by f?
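As a toy illustration of the second question (a minimal sketch of my own, not something from the post): fit an overparameterized one-hidden-layer tanh net to a handful of points with plain full-batch gradient descent and backprop. Two different random seeds should both drive the training loss to (near) zero, yet land on clearly different weights, i.e. different points in weight space that each implement a zero-loss fit; which one you get is exactly the selection question above.

```python
import numpy as np

# 4 training points; far fewer constraints than parameters below.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([[0.0], [1.0], [0.0], [1.0]])

def train(seed, hidden=20, steps=30000, lr=0.02):
    """Fit y = f(X) with a 1-hidden-layer tanh net via full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(1, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.5, size=(hidden, 1)); b2 = np.zeros(1)
    for _ in range(steps):
        h = np.tanh(X @ W1 + b1)                 # hidden activations, shape (4, hidden)
        pred = h @ W2 + b2                       # predictions, shape (4, 1)
        err = pred - y
        loss = np.mean(err ** 2)
        # Backprop: gradients of the mean-squared error w.r.t. each parameter.
        d_pred = 2.0 * err / len(X)
        dW2 = h.T @ d_pred
        db2 = d_pred.sum(axis=0)
        d_h = (d_pred @ W2.T) * (1.0 - h ** 2)   # tanh' = 1 - tanh^2
        dW1 = X.T @ d_h
        db1 = d_h.sum(axis=0)
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    return loss, np.concatenate([W1.ravel(), b1, W2.ravel(), b2])

loss_a, theta_a = train(seed=0)
loss_b, theta_b = train(seed=1)
print(f"seed 0 final loss: {loss_a:.2e}")   # expected near zero
print(f"seed 1 final loss: {loss_b:.2e}")   # expected near zero
print(f"parameter distance between the two fits: {np.linalg.norm(theta_a - theta_b):.2f}")
```

Both runs fit the same data, but the two solutions sit far apart in weight space (and, generically, differ off the training set), so "which f does training select, and why?" is a real question rather than a technicality.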
I think a major reason these questions are hard to answer is the assumption that NNs operate within an inferential-statistical context (i.e., recovering some latent structure in the data). But they're really bad at that. In my view, they are just representation-builders that find proxy representations in a proxy "task" space (roughly, a proxy = "the shadow of some real structure, as captured in an unrelated space").