When we design a neural network, we use gradient descent to learn the parameters. Does this require the activation function to be differentiable?
No! For example, ReLU, a widely used activation function, is not differentiable at $z=0$. Such activation functions are usually non-differentiable at only a small number of points, and they have both a left derivative and a right derivative at those points. In practice we simply use one of the one-sided derivatives. This is reasonable because digital computers are subject to numerical error: an input of exactly $z=0$ was probably some small nonzero value that got rounded to zero. See Chapter 6 of the following book for more details on activation functions:
Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016, http://deeplearningbook.org
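To make the one-sided derivative idea concrete, here is a minimal sketch in plain Python. The function names (`relu`, `relu_grad`, `one_sided_derivatives`) and the `at_zero` parameter are my own, chosen for illustration, and returning the left derivative at $z=0$ by default is an assumption, not a rule; the right derivative would work equally well.

```python
def relu(z):
    """ReLU activation: max(0, z)."""
    return max(0.0, z)


def relu_grad(z, at_zero=0.0):
    """Derivative of ReLU used in place of the true derivative.

    For z != 0 this is the ordinary derivative. At z == 0 the derivative
    does not exist, so we return a chosen one-sided value:
    at_zero=0.0 is the left derivative, at_zero=1.0 is the right one.
    (The default is an illustrative assumption.)
    """
    if z > 0:
        return 1.0
    if z < 0:
        return 0.0
    return at_zero


def one_sided_derivatives(f, z, h=1e-6):
    """Finite-difference estimates of the left and right derivatives of f at z."""
    left = (f(z) - f(z - h)) / h
    right = (f(z + h) - f(z)) / h
    return left, right


# The two one-sided estimates disagree at the kink, so no single derivative exists there.
left, right = one_sided_derivatives(relu, 0.0)
print(left, right)

# Either convention gives gradient descent a usable value at z = 0.
print(relu_grad(0.0), relu_grad(0.0, at_zero=1.0))
```

Away from $z=0$ the two conventions are identical, so the choice only matters on the rare occasions when the pre-activation is exactly zero.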