Imagine that the only operations I have are scalar addition and scalar multiplication, and I want to implement various neural-network nonlinearities with them. The only option I see here is to use some polynomial approximations.

I can use Taylor series approximations, but they give poor results. For example, for the smooth ReLU $\ln(1 + e^x)$, the Taylor approximation with 9 terms diverges badly outside the range $[-4, 4]$ (see the plots on WolframAlpha). This is unacceptable because activations in neural networks can go far beyond this range. The picture is similar for other nonlinearities (logistic sigmoid and tanh). Of course, I could use many more terms, but that would be computationally intractable.
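A quick numerical check makes the divergence concrete. This is a sketch using SymPy (the choice of expansion point 0 and keeping terms through degree 8, i.e. roughly the "9 terms" above, are my assumptions):

```python
import sympy as sp

x = sp.symbols('x')
softplus = sp.log(1 + sp.exp(x))

# Taylor polynomial of the smooth ReLU about 0, keeping terms up to x**8
taylor = sp.series(softplus, x, 0, 9).removeO()

for x0 in (1.0, 4.0, 8.0):
    exact = float(softplus.subs(x, x0))
    approx = float(taylor.subs(x, x0))
    print(f"x = {x0}: exact = {exact:.4f}, Taylor = {approx:.4f}")
```

Inside the range the two values agree to several decimal places; already at $x = 8$ the Taylor polynomial is off by hundreds.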

So, I have two questions:

- Are there any good polynomial approximations of the nonlinearities used in neural networks (ReLU, logistic sigmoid, tanh)?
- What range should I care about? For example, if some function $f(x)$ is a good approximation of ReLU on the range $[-100, 100]$, will that be enough in practice? What range do the inputs to the nonlinearities span in AlexNet, GoogLeNet, VGG, etc.?


#### Best Answer

The problem you're having is due to the asymptotic behavior of the remainder between the Taylor approximation of a function and the function itself.

If $f$ is at least $k$ times differentiable and you approximate it with a $k$'th order polynomial, $$ P_k(x) = f(a) + f'(a)(x - a) + \frac{1}{2} f''(a)(x-a)^2 + \dots + \frac{1}{k!} f^{(k)}(a)(x-a)^k, $$

then the error $P_k(x) - f(x)$ scales as $o(|x - a|^k)$ as $x \rightarrow a$ (I'm using little-o notation); away from $a$ there is no such guarantee. This is particularly noticeable for an infinitely differentiable function, such as the smooth ReLU.

The example that's often used to illustrate this concept is $\sin x.$ The function $\sin x$ has infinitely many stationary points, which is what its constant oscillation requires. However, a polynomial of order $k$ can have at most $k-1$ stationary points. So a polynomial fit to a sine function can approximate it decently well around the point you're expanding about, and in fact, the higher the order of your polynomial, the wider the domain of validity of your approximation. But it will always eventually diverge.
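This widening-then-diverging behavior is easy to see numerically. Here is a minimal sketch (the helper name `sin_taylor` is mine) evaluating Taylor polynomials of $\sin$ of increasing order far from the expansion point:

```python
import math

def sin_taylor(x, k):
    """Taylor polynomial of sin about 0, keeping terms up to order k."""
    total, term = 0.0, x
    n = 1
    while n <= k:
        total += term
        term *= -x * x / ((n + 1) * (n + 2))  # next odd-order term
        n += 2
    return total

# Near 0 even low orders are excellent; at x = 10 a 9th-order
# polynomial diverges wildly, while a 29th-order one recovers accuracy.
for k in (9, 29):
    print(f"order {k}: sin_taylor(10) = {sin_taylor(10.0, k):.3f}, "
          f"true sin(10) = {math.sin(10.0):.3f}")
```

Raising the order pushes the breakdown point outward, but for any fixed order the polynomial still diverges eventually.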

The example you gave, the smooth ReLU, has its own assortment of problems. You can either approximate it well near the origin, or in its asymptotic regime, but not both. To see this, look at its asymptotic behavior as $x \rightarrow \infty$:

$$ \begin{split} \lim_{x \rightarrow \infty} \log (1 + e^x) &= \lim_{x \rightarrow \infty} \log \left[\frac{e^{-x} + 1}{e^{-x}}\right] \\ &= \lim_{x \rightarrow \infty} \log \left[ \frac{1}{e^{-x}} \right] \rightarrow x. \end{split} $$ That is, its asymptotic behavior at infinity is just that of the function $g(x) = x.$ However, there is a smooth bend near the origin, which cannot possibly be represented well by a first-order polynomial, so you need a higher-order polynomial to capture it. However, if you use a polynomial $P_k(x)$ of order $k>1$, then $$ \lim_{x \rightarrow \infty} \frac{P_k(x)}{x} = \infty. $$ That is, you cannot *possibly* have consistent behavior as you increase $x.$ The higher-order polynomial will always overtake $x$ at some point and continue to grow more rapidly, whereas $x$ is the asymptotic behavior you're looking for. Furthermore, its asymptotic behavior on the left side is $h(x) = 0,$ which is itself a polynomial. So you have zeroth-order behavior on the left, first-order behavior on the right, and a bend near the middle, which suggests a polynomial of at least order 2. You cannot possibly fit all of these behaviors with a single polynomial.
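The overtaking argument holds for any concrete polynomial of order above 1. This sketch uses a hypothetical quadratic $P(x) = 0.05x^2 + 0.5x$ (coefficients chosen arbitrarily for illustration) and shows the ratio $P(x)/x$ growing without bound:

```python
def P(x):
    # Hypothetical degree-2 polynomial; any order k > 1 behaves the same way
    return 0.05 * x * x + 0.5 * x

# The ratio P(x)/x = 0.05*x + 0.5 grows linearly, so P(x) cannot
# track the asymptote g(x) = x no matter how small its leading term is.
for x in (10.0, 100.0, 1000.0):
    print(f"x = {x}: P(x)/x = {P(x) / x}")
```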

It is a matter of scale. Looking at the sine-wave approximation, the polynomial is indeed a solid approximation of the function; in fact, this is how most calculators evaluate $\sin.$ But this only works within a certain range. If you want to approximate the activation functions in your neural network, keep in mind that the inputs to the activation functions are somewhat uncontrolled, because they're linear combinations of input nodes whose weights you are *fitting*. You have to ensure that your weights stay small enough that the domain of validity of your Taylor approximations holds. Regularization, such as $L_2$ weight decay, might do the trick, but it's no guarantee. Another way around this is to increase the order of the polynomial, to make the domain of validity larger. But, given the asymptotic arguments I illustrated above, there is no *general* way to ensure that a polynomial gives you the behavior you expect from an infinitely differentiable activation function.
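The trade-off can be seen with any polynomial fit on a fixed interval. Here is a sketch using NumPy's Chebyshev least-squares fit (the interval $[-4, 4]$ and degree 8 are arbitrary choices of mine, not recommendations): accuracy is good inside the fitted range and collapses immediately outside it.

```python
import numpy as np

def softplus(x):
    # Numerically stable smooth ReLU: log(1 + exp(x))
    return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0.0)

# Degree-8 Chebyshev least-squares fit on [-4, 4]
xs = np.linspace(-4, 4, 400)
cheb = np.polynomial.Chebyshev.fit(xs, softplus(xs), deg=8)

in_range_err = float(np.max(np.abs(cheb(xs) - softplus(xs))))
out_of_range_err = float(abs(cheb(10.0) - softplus(10.0)))
print(f"max error on [-4, 4]: {in_range_err:.2e}")
print(f"error at x = 10:      {out_of_range_err:.2e}")
```

If the activations stay inside the fitted interval (e.g. because of weight regularization), such a fit can be serviceable; one large input and the polynomial's own asymptotics take over.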