Solved – Poisson Regression in R based on categorical time variables

I am trying to predict the number of visitors to a website and am wondering whether it can be done using only categorical time variables. I received some guidance on how to structure my input after a question on Stack Overflow, but I still can't get predictions close to the real values, no matter what I try or how I combine the data.

Here are the variables in my dataset:

  • website_id

  • date_int – would act as a time index

  • month – would be used for seasonal effect

  • type – a variable derived from the response variable (number of visits) that represents the size of the website, computed from its average number of visits (ranges from 1 to 5).

  • D1, D2, D3, D4, D5, D6 – variables used to capture the "seasonal effect" for the day of week.

  • visits

I'm also leaving a link to some sample data in case it is relevant: train.csv.

This is the code I've tried in R:

    train = read.csv("trainData.csv", header = TRUE)
    head(train)

    dates <- as.factor(train$date_int)
    dates

    months <- as.factor(train$month)
    months

    model <- glm(visits ~ dates + type, train, family = poisson)
    summary(model)

    P = predict(model, newdata = train, type = "response")
    imp = round(P)
    imp

I'm new to R, but all the examples I've seen, even those that should be similar (like estimating sales), use other variables besides the categorical time variables. I don't have other features to base my prediction on, so I have to ask: can a prediction even be made from this input?

I recommend reading "Forecasting: principles and practice" (https://www.otexts.org/fpp/8/2); there are also good R libraries for handling this kind of data. Using those tools, I would generate a separate forecast for each website.

    require(forecast)
    w = 99
    WWWusage = ts(train$visits[train$website_id == w], frequency = 7)
    fit <- auto.arima(WWWusage)
    plot(forecast(fit, h = 32))

So to build all of them:

    for (w in sort(unique(train$website_id))) {
        WWWusage = ts(train$visits[train$website_id == w], frequency = 7)
        fit <- auto.arima(WWWusage)
        plot(forecast(fit, h = 32))
        title(paste("\n\nwebsite", w))
    }


If you want to get some insight on how autoregression works, I wrote this code:

    w = 99
    train = read.csv("/tmp/trainData.csv", header = TRUE)
    train = train[train$website_id == w, ]

    # Return NA when no matching row exists
    repair = function(x) { if (length(x) == 0) return(NA); return(x) }

    # Add lagged visits ("backshift") columns for 5 to 9 weeks back
    for (b in seq(7*5, 7*9, 7)) {
        train[[paste0('backshift', b)]] = rep(NA, nrow(train))
        for (r in 1:nrow(train)) { # slow way to do this
            train[r, paste0('backshift', b)] = repair(train$visits[train$website_id == train[r, 'website_id']
                                                      & train$date_int == (train[r, 'date_int'] - b)])
        }
    }
    # Keep only rows where every lag is available
    train = train[complete.cases(train), ]

    rmse = function(x, y, k = 0) {
        return(sqrt(sum((x - y)^2) / (length(x) - k)))
    }

    require(MASS)
    train$months <- as.factor(train$month)
    train$date = NULL

    # Fit on all columns, then let stepAIC prune the terms
    model <- glm(visits ~ ., train, family = poisson)
    model = stepAIC(model, trace = F)
    summary(model)

    P = predict(model, newdata = train, type = "response")
    imp = round(P)
    rmse(imp, train$visits)
    train$fit = imp
    with(train[train$website_id == w, ], {
        plot(date_int, visits, type = 'l')
        points(date_int, fit, col = 'red', type = 'l')
        title(w)
    })
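As an aside, the inner loop above is marked as slow. A vectorized sketch of the same lag construction might use base R's match(), assuming there is at most one row per website and date. The simulated data frame below is only a stand-in for trainData.csv, not the real data:

```r
# Simulated stand-in for one website's rows; column names follow the post.
set.seed(1)
train <- data.frame(
  website_id = 99,
  date_int   = 1:120,
  visits     = rpois(120, lambda = 200)
)

# Shifts of 5 to 9 weeks, as in the loop above.
shifts <- seq(7 * 5, 7 * 9, by = 7)

# Key each row by website and date, then look up the row b days earlier.
key <- paste(train$website_id, train$date_int)
for (b in shifts) {
  lagged_key <- paste(train$website_id, train$date_int - b)
  # match() gives NA when no earlier row exists, so the lag is NA there.
  train[[paste0("backshift", b)]] <- train$visits[match(lagged_key, key)]
}

# Drop rows whose lags fall before the start of the series.
train <- train[complete.cases(train), ]
```

This replaces the row-by-row lookups with one match() call per shift, which should matter mostly when the loop is run over many websites at once.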

Here is the Poisson model that stepAIC selected:

    Call:
    glm(formula = visits ~ date_int + D1 + D2 + D3 + D4 + D5 + D6 +
        backshift35 + backshift42 + backshift49 + backshift56 + backshift63 +
        months, family = poisson, data = train)

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -72.386   -4.389    1.126    6.912   54.939

    Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
    (Intercept)  1.003e+01  2.405e-02 416.935  < 2e-16 ***
    date_int    -3.380e-03  1.125e-04 -30.044  < 2e-16 ***
    D1          -4.000e-02  3.425e-03 -11.680  < 2e-16 ***
    D2           7.957e-01  9.623e-03  82.686  < 2e-16 ***
    D3           6.755e-01  8.258e-03  81.800  < 2e-16 ***
    D4           6.502e-01  7.997e-03  81.299  < 2e-16 ***
    D5           5.791e-01  7.470e-03  77.530  < 2e-16 ***
    D6           4.544e-01  6.173e-03  73.602  < 2e-16 ***
    backshift35  9.173e-06  5.502e-07  16.671  < 2e-16 ***
    backshift42 -1.368e-05  5.191e-07 -26.353  < 2e-16 ***
    backshift49  1.408e-06  5.656e-07   2.489  0.01280 *
    backshift56 -2.305e-05  6.010e-07 -38.358  < 2e-16 ***
    backshift63  1.957e-06  6.035e-07   3.242  0.00119 **
    months9     -3.231e-01  1.332e-02 -24.258  < 2e-16 ***
    months10    -2.589e-01  1.070e-02 -24.192  < 2e-16 ***
    months11    -2.907e-01  7.710e-03 -37.706  < 2e-16 ***
    months12    -3.915e-01  4.577e-03 -85.529  < 2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    (Dispersion parameter for poisson family taken to be 1)

        Null deviance: 157970  on 134  degrees of freedom
    Residual deviance:  39894  on 118  degrees of freedom
    AIC: 41448

    Number of Fisher Scoring iterations: 4

You can try different values of website_id (w). I had the backshift variables start at 5 weeks back, which makes it easy to forecast up to 5 weeks ahead; you can forecast farther than that by making predictions based on earlier predictions.
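To illustrate forecasting past the shortest lag, here is a minimal sketch under simplifying assumptions: simulated data instead of trainData.csv, a single 35-day lag, and a hypothetical day-of-week factor named dow in place of D1 to D6. For the first 35 forecast days the lag comes from observed visits; after that it comes from earlier forecasts.

```r
set.seed(42)
n <- 210                                        # 30 weeks of simulated history
dow_effect <- c(1.0, 1.3, 1.25, 1.2, 1.15, 1.1, 0.9)  # made-up weekday pattern

history_df <- data.frame(date_int = 1:n)
history_df$dow    <- factor((history_df$date_int - 1) %% 7 + 1, levels = 1:7)
history_df$visits <- rpois(n, lambda = 200 * dow_effect[as.integer(history_df$dow)])

# Lagged visits: value observed lag_days earlier (NA for the first lag_days rows).
lag_days <- 35
history_df$backshift35 <- c(rep(NA, lag_days), head(history_df$visits, -lag_days))

fit <- glm(visits ~ dow + backshift35, family = poisson,
           data = history_df[complete.cases(history_df), ])

# Forecast 70 days: twice the lag, so the second half relies on predictions.
horizon <- 70
series  <- c(history_df$visits, rep(NA_real_, horizon))
for (t in (n + 1):(n + horizon)) {
  newrow <- data.frame(
    dow         = factor((t - 1) %% 7 + 1, levels = 1:7),
    backshift35 = series[t - lag_days]   # observed value or an earlier forecast
  )
  series[t] <- predict(fit, newdata = newrow, type = "response")
}
forecast_values <- series[(n + 1):(n + horizon)]
```

Because later forecasts are built on earlier ones, errors can compound, so it may be worth checking this kind of forecast against held-out data before relying on it.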

