The main problem I have with neural networks for computer vision is that they do not give me understanding. Even the best network, with 99.91% accuracy on the MNIST handwritten digits dataset, cannot give me any insight. It does not allow me to observe how it actually performs its classification.
I can study the architecture and thoroughly understand all its mechanisms, but the functionality is hidden in the weights of the network. And there might be millions of them.
It may be my training as a physicist, but I like vision systems to be based on a clear theory. I want to be able to understand how they work. A mathematical theory helps me get to the heart of the phenomenon.
For example, take my formula for determining the size of a ball resting on the ground from a single photo. The formula for the radius of a ball (of course expressed as quadrance) is
It depends on the height of the camera, the visual semi-spread of the ball in the projection, and the spread between the visual direction and the gravitational direction.
There is not a single free parameter in this formula: nothing is fitted or tuned. Perhaps more importantly, it gives extremely accurate experimental results.
Quite a number of assumptions have gone into this formula, but there is no dataset, no training, no adjustment of weights in a network. There is only a clear deterministic mechanism that gives very accurate results in situations in which the assumptions are valid.
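To make the idea of a deterministic, training-free mechanism concrete, here is a minimal sketch in Python. It is not necessarily the formula discussed above. It is a relation derived from the same three quantities under assumptions that the text does not state: the ball is a perfect sphere touching a level floor, the camera is higher than the ball's center, and the "visual direction" is the line from the camera to the ball's center. A spread is taken to be the squared sine of the corresponding angle. The function names `ball_quadrance` and `spreads_from_scene` are hypothetical.

```python
import math


def ball_quadrance(camera_height, semi_spread, gravity_spread):
    """Quadrance (squared radius) of a ball resting on a level floor.

    Assumed geometry (an illustration, not taken from the text):
      sin(alpha) = r / d        alpha: half-angle of the cone of sight around the ball
      cos(theta) = (h - r) / d  theta: angle between the line of sight and gravity
    with d the camera-to-center distance. Eliminating d gives
      r = h * sin(alpha) / (sin(alpha) + cos(theta)).
    """
    if not 0.0 < semi_spread < 1.0 or not 0.0 <= gravity_spread < 1.0:
        raise ValueError("need 0 < semi_spread < 1 and 0 <= gravity_spread < 1")
    sin_alpha = math.sqrt(semi_spread)           # spread = sin^2 of the angle
    cos_theta = math.sqrt(1.0 - gravity_spread)  # positive: camera above ball center
    radius = camera_height * sin_alpha / (sin_alpha + cos_theta)
    return radius * radius


def spreads_from_scene(camera_height, radius, floor_distance):
    """Synthesize the two spreads a camera would observe for a known scene."""
    drop = camera_height - radius                      # vertical offset to ball center
    d_squared = floor_distance ** 2 + drop ** 2        # quadrance camera -> center
    semi_spread = radius ** 2 / d_squared              # sin^2(alpha)
    gravity_spread = floor_distance ** 2 / d_squared   # sin^2(theta)
    return semi_spread, gravity_spread


# Round trip on a synthetic scene: camera 1.5 m up, ball of radius 0.11 m, 2 m away.
s, S = spreads_from_scene(1.5, 0.11, 2.0)
print(math.sqrt(ball_quadrance(1.5, s, S)))  # close to 0.11
```

The round trip only checks that the sketch is internally consistent; it says nothing about accuracy on real photographs, where measuring the two spreads from the image is the hard part.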
I am not against neural networks. They show some amazing capabilities. But I am uncomfortable with them. I prefer understanding.