I'm doing Causal Impact analysis with this Python package. Since my control time series are on a much larger scale (100-10000 times larger) than my modeled variable, at some point I tried to scale the control variables. This had a huge effect on my results and sometimes even turned a statistically significant positive result into a negative one.
Should the time series be normalized before using this package? Or is there something else that I'm doing wrong?
Here is a self-contained example where changing the scale radically changes the error estimates:
import numpy as np
import pandas as pd
from statsmodels.tsa.arima_process import arma_generate_sample
from causal_impact.causal_impact import CausalImpact

np.random.seed(1)
x1 = arma_generate_sample(ar=[0.999], ma=[0.9], nsample=100) + 100
y = 1.2 * x1 + np.random.randn(100)
intervention = 70
y[intervention:] = y[intervention:] + 10

data = pd.DataFrame(np.array([y, x1]).T, columns=["y", "x1"])

ci = CausalImpact(data, intervention)
ci.run()
ci.plot()
data.x1 = data.x1 * 10000

ci = CausalImpact(data, intervention)
ci.run()
ci.plot()
Best Answer
We can actually see what's happening behind the scenes in both scenarios.
The package you mentioned uses statsmodels to fit the structural time series model.
It's actually quite simple to use it directly; you could do something like:
import numpy as np
from statsmodels.tsa.arima_process import arma_generate_sample
from statsmodels.tsa.statespace.structural import UnobservedComponents

np.random.seed(1)
x1 = arma_generate_sample(ar=[0.999], ma=[0.9], nsample=100) + 100
y = 1.2 * x1 + np.random.randn(100)
intervention = 70

x_pre = x1[:intervention]
x_pos = x1[intervention:]
y_pre = y[:intervention]
y_pos = y[intervention:]

model = UnobservedComponents(y_pre, level='llevel', exog=x_pre)
fitted_model = model.fit()
If you want a deeper understanding of the "local level" model, here's a good source.
In a nutshell, it models the time series using a component known as "irregular" (the observation noise), another component called "level", which tracks how much the series grows (or shrinks) from one point to the next, and finally the coefficients of a linear regression on the covariates (just as Google did in their proposed model).
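In standard state-space notation, a sketch of this local level plus regression specification looks like the following (this is the textbook form, and its parameters correspond to the names statsmodels reports, sigma2.irregular, sigma2.level and beta.x1):

$$
y_t = \mu_t + \beta x_t + \varepsilon_t, \qquad \varepsilon_t \sim N(0, \sigma^2_{\text{irregular}})
$$
$$
\mu_{t+1} = \mu_t + \eta_t, \qquad \eta_t \sim N(0, \sigma^2_{\text{level}})
$$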
Taking a look at the summary of our fitted model, here's what we find:
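To see the table yourself, print the fitted results with the standard statsmodels results API (assuming the snippet above ran successfully):

print(fitted_model.summary())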
As we can see, the model does a pretty good job at fitting the parameters of our data.
The noise variance (sigma2.irregular) is 0.896 and we simulated it with variance 1 (np.random.randn(100)), the coefficient (beta.x1) is 1.3223 and we used 1.2, and finally the level variance (sigma2.level) is essentially 0, which is correct since our simulated data has no level drift over time.
So far so good.
Let's see what happens though when we use your new suggested x1:
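This is roughly what the refit looks like (a sketch reusing y_pre and x_pre from the snippet above; only the covariate scale changes, mirroring your data.x1 * 10000 line):

# same model as before, but with the covariate blown up by a factor of 10000
model_scaled = UnobservedComponents(y_pre, level='llevel', exog=x_pre * 10000)
fitted_scaled = model_scaled.fit()
print(fitted_scaled.summary())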
Totally different fitted parameters.
The $\beta$ coefficient went to 0. Noise almost doubled and now there's a little bit of leveling.
Notice the warning we get when we try to fit this model:
/opt/conda/lib/python3.6/site-packages/statsmodels/base/model.py:508: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)
So what is happening is that the covariates are so large that the optimizer has a hard time fitting the regression coefficient, and every other component gets distorted as a result.
So yes, it's a good idea to apply standardization to your data.
The downside of the library you are using is that it offers no built-in support for doing so.
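If you want to keep using it, you can standardize the covariate yourself before building the model; here's a rough sketch (the choice of using the pre-intervention mean and standard deviation is mine, not something the package prescribes):

# scale x1 using pre-intervention statistics so the post period
# is transformed with the same parameters
mu = data["x1"].iloc[:intervention].mean()
sd = data["x1"].iloc[:intervention].std()
data_std = data.copy()
data_std["x1"] = (data_std["x1"] - mu) / sd

ci = CausalImpact(data_std, intervention)
ci.run()
ci.plot()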
I'd recommend using this new library instead as it offers standardization by default.