What if you took these same neural networks as they exist now and tweaked the input and the parameters slightly? For the input, use the individual frames of an hour-long video of a leopard, in order. Instead of having the network just decide whether or not there is a leopard, have it identify which part of each image is the leopard, and have it try to predict the next frame.
This seems closer to the way that we learn to identify things. Then, once we have established an understanding of a base class (big cat), we can apply that same model to new cats that we have never seen before, given just a picture.
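
To make the idea a bit more concrete, here is a rough sketch of how the training data and objective might be set up. Everything in it is an assumption for illustration: frames are toy flattened lists of pixel values, and the names `make_training_pairs`, `mean_squared_error`, and `combined_loss` are hypothetical, not taken from any existing library. It only shows the shape of the setup (ordered frame pairs plus a loss with two terms), not an actual network.

```python
from typing import List, Sequence, Tuple

# Hypothetical toy representation: a frame is a flattened list of pixel values.
Frame = List[float]


def make_training_pairs(frames: Sequence[Frame]) -> List[Tuple[Frame, Frame]]:
    """Pair each frame with the frame that follows it, keeping video order."""
    return [(frames[t], frames[t + 1]) for t in range(len(frames) - 1)]


def mean_squared_error(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Average squared difference between two equally sized sequences."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


def combined_loss(
    predicted_mask: Sequence[float],
    true_mask: Sequence[float],
    predicted_next: Sequence[float],
    true_next: Sequence[float],
    mask_weight: float = 1.0,
    frame_weight: float = 1.0,
) -> float:
    """Weighted sum of two errors (the weights are assumed hyperparameters):
    how wrong the "where is the leopard" mask is, and
    how wrong the predicted next frame is.
    """
    mask_error = mean_squared_error(predicted_mask, true_mask)
    frame_error = mean_squared_error(predicted_next, true_next)
    return mask_weight * mask_error + frame_weight * frame_error
```

A network trained this way would see each `(current, next)` pair in order and be penalized on both terms at once, so it has to learn where the leopard is and how it moves, rather than only whether one is present.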