# Solved – Poisson Regression in R based on categorical time variables

I am trying to predict the number of visitors to a website and am wondering whether it can be done using only categorical time variables. After asking a question on Stack Overflow I received some guidance on how to structure my input, but my predictions are still nowhere near the real values, no matter what I have tried or how I have combined the data.

Here are the variables in my dataset:

• website_id

• date_int – would act as a time index

• month – would be used for seasonal effect

• type – variable derived from the response variable (number of visits), representing the size of the website by calculating an average of number of visits (ranges from 1 to 5).

• D1, D2, D3, D4, D5, D6 – variables used to capture the "seasonal effect" for the day of week.

• visits

I'm also leaving a link to some sample data in case it is relevant: train.csv.

This is what the code I've tried in R looks like.

```r
train = read.csv("trainData.csv", header = TRUE)
head(train)

dates <- as.factor(train$date_int)
dates

months <- as.factor(train$month)
months

model <- glm(visits ~ dates + type, train, family = poisson)
summary(model)

P = predict(model, newdata = train, type = "response")
imp = round(P)
imp
```

I'm new to R, but all the examples I've seen, even ones that should be similar (like estimating sales), use other variables besides the categorical time variables. I don't have any other features to base my prediction on, so I have to ask: can a prediction even be made from this input?


I recommend reading "Forecasting: Principles and Practice" (https://www.otexts.org/fpp/8/2); there are good R libraries, such as forecast, for handling this kind of data. Using those tools, I would generate a separate forecast for each website.

```r
require(forecast)

w = 99
WWWusage = ts(train$visits[train$website_id == w], frequency = 7)
fit <- auto.arima(WWWusage)
plot(forecast(fit, h = 32))
```

So to build all of them:

```r
for (w in sort(unique(train$website_id))) {
    WWWusage = ts(train$visits[train$website_id == w], frequency = 7)
    fit <- auto.arima(WWWusage)
    plot(forecast(fit, h = 32))
    title(paste("\n\nwebsite", w))
}
```
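The loop above only draws plots. If the forecast numbers themselves are needed, one possible sketch is to store the point forecasts in a list keyed by website id. The data frame `train_demo` below is made-up stand-in data (two websites, Poisson counts with a weekly pattern), not the poster's CSV, and the rounding and clipping at zero are my own assumptions about how one might want count forecasts reported.

```r
library(forecast)  # assumes the forecast package is installed

set.seed(42)

# Made-up stand-in for the real data: two websites, 20 weeks of daily counts
n_days <- 140
weekly <- rep(c(1.0, 1.2, 1.2, 1.1, 1.1, 0.9, 0.7), length.out = n_days)
train_demo <- rbind(
    data.frame(website_id = 1, date_int = seq_len(n_days),
               visits = rpois(n_days, 200 * weekly)),
    data.frame(website_id = 2, date_int = seq_len(n_days),
               visits = rpois(n_days, 50 * weekly))
)

h <- 32  # forecast horizon in days, as in the loop above
forecasts <- list()
for (w in sort(unique(train_demo$website_id))) {
    y <- ts(train_demo$visits[train_demo$website_id == w], frequency = 7)
    fit <- auto.arima(y)
    # $mean holds the point forecasts; round to whole visits and clip at 0
    point <- as.numeric(forecast(fit, h = h)$mean)
    forecasts[[as.character(w)]] <- pmax(round(point), 0)
}

str(forecasts)
```

With the real data, `train_demo` would simply be replaced by `train`; the forecast object also carries prediction intervals in its `lower` and `upper` components if uncertainty is of interest.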

If you want to get some insight on how autoregression works, I wrote this code:

```r
w = 99
train = read.csv("/tmp/trainData.csv", header = TRUE)
train = train[train$website_id == w, ]

# Return NA when no row exists for the requested lag
repair = function(x) {
    if (length(x) == 0) return(NA)
    return(x)
}

# Add lagged visit counts (5 to 9 weeks back, in steps of one week) as predictors
for (b in seq(7*5, 7*9, 7)) {
    train[[paste0('backshift', b)]] = rep(NA, nrow(train))
    for (r in 1:nrow(train)) { # slow way to do this
        train[r, paste0('backshift', b)] = repair(
            train$visits[train$website_id == train[r, 'website_id'] &
                         train$date_int == (train[r, 'date_int'] - b)])
    }
}

# Drop rows that lack a full set of lags
train = train[complete.cases(train), ]

# Root mean squared error, with k as an optional degrees-of-freedom correction
rmse = function(x, y, k = 0) {
    return(sqrt(sum((x - y)^2) / (length(x) - k)))
}

require(MASS)
train$months <- as.factor(train$month)
train$date = NULL

# Fit on all columns, then let stepwise AIC prune the predictors
model <- glm(visits ~ ., train, family = poisson)
model = stepAIC(model, trace = F)
summary(model)

# In-sample fit: predicted (red) against observed visits
P = predict(model, newdata = train, type = "response")
imp = round(P)
rmse(imp, train$visits)
train$fit = imp
with(train[train$website_id == w, ], {
    plot(date_int, visits, type = 'l')
    points(date_int, fit, col = 'red', type = 'l')
    title(w)
})
```

Here is the Poisson model:

```
Call:
glm(formula = visits ~ date_int + D1 + D2 + D3 + D4 + D5 + D6 +
    backshift35 + backshift42 + backshift49 + backshift56 + backshift63 +
    months, family = poisson, data = train)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-72.386   -4.389    1.126    6.912   54.939

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.003e+01  2.405e-02 416.935  < 2e-16 ***
date_int    -3.380e-03  1.125e-04 -30.044  < 2e-16 ***
D1          -4.000e-02  3.425e-03 -11.680  < 2e-16 ***
D2           7.957e-01  9.623e-03  82.686  < 2e-16 ***
D3           6.755e-01  8.258e-03  81.800  < 2e-16 ***
D4           6.502e-01  7.997e-03  81.299  < 2e-16 ***
D5           5.791e-01  7.470e-03  77.530  < 2e-16 ***
D6           4.544e-01  6.173e-03  73.602  < 2e-16 ***
backshift35  9.173e-06  5.502e-07  16.671  < 2e-16 ***
backshift42 -1.368e-05  5.191e-07 -26.353  < 2e-16 ***
backshift49  1.408e-06  5.656e-07   2.489  0.01280 *
backshift56 -2.305e-05  6.010e-07 -38.358  < 2e-16 ***
backshift63  1.957e-06  6.035e-07   3.242  0.00119 **
months9     -3.231e-01  1.332e-02 -24.258  < 2e-16 ***
months10    -2.589e-01  1.070e-02 -24.192  < 2e-16 ***
months11    -2.907e-01  7.710e-03 -37.706  < 2e-16 ***
months12    -3.915e-01  4.577e-03 -85.529  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 157970  on 134  degrees of freedom
Residual deviance:  39894  on 118  degrees of freedom
AIC: 41448

Number of Fisher Scoring iterations: 4
```
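As a reading aid for the summary above, here is a small sketch that turns two of its numbers into more interpretable quantities. The values are copied by hand from the output. Because a Poisson GLM uses a log link, exponentiating a coefficient gives a multiplicative effect on the expected count. Dividing the residual deviance by its degrees of freedom is a common rough check: values far above 1 suggest overdispersion, in which case the reported standard errors are probably too optimistic and something like `family = quasipoisson` may be worth trying. That last suggestion is my own, not part of the original answer.

```r
# Values copied from the summary output above
coef_D2   <- 7.957e-01
resid_dev <- 39894
resid_df  <- 118

# Log link: exp(coefficient) is the multiplicative effect on expected visits
rate_ratio_D2 <- exp(coef_D2)

# Rough overdispersion check; close to 1 for a well-behaved Poisson fit
dispersion_ratio <- resid_dev / resid_df

print(round(rate_ratio_D2, 2))     # 2.22
print(round(dispersion_ratio, 1))  # 338.1
```

So, holding the other terms fixed, the model expects a bit more than twice as many visits when D2 is 1 as when it is 0, and the deviance is several hundred times larger than a pure Poisson model would predict.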

You can try different values of website_id (w). I started the backshift lags at 5 weeks, which lets you forecast up to 5 weeks ahead directly from observed data. You can forecast farther than that by making predictions based on predictions, that is, by using earlier forecasts as the lagged inputs for later dates.
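A minimal sketch of that "predictions based on predictions" idea, using base R only. Everything here is illustrative: the data are simulated, and to keep the example short it uses a single 7-day lag (`lag7`) and a day-of-week factor (`dow`) instead of the 35 to 63 day backshift columns above. With a 7-day lag, any forecast more than one week ahead has to use an earlier forecast as its lagged input.

```r
set.seed(1)

# Simulated daily visits with a weekly pattern (made-up data)
n <- 140
dow <- factor((seq_len(n) - 1) %% 7)              # levels "0" to "6"
busy <- as.integer(dow) <= 5                      # first five days are busier
visits <- rpois(n, lambda = exp(4 + 0.5 * busy))

# Training frame with the count from 7 days earlier as a predictor
d <- data.frame(dow = dow, visits = visits)
d$lag7 <- c(rep(NA, 7), visits[1:(n - 7)])
fit <- glm(visits ~ dow + lag7, data = d[complete.cases(d), ], family = poisson)

# Recursive forecast: extend the series one day at a time
h <- 21
series <- c(visits, rep(NA, h))
for (i in (n + 1):(n + h)) {
    newrow <- data.frame(
        dow  = factor(as.character((i - 1) %% 7), levels = levels(d$dow)),
        lag7 = series[i - 7]  # observed for the first 7 steps, predicted afterwards
    )
    series[i] <- predict(fit, newdata = newrow, type = "response")
}

forecast_values <- series[(n + 1):(n + h)]
print(round(forecast_values))
```

Errors compound in this scheme, since each forecast inherits the error of the forecast it is built on, so the uncertainty grows with the horizon.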
