Solved – Linear regression with aggregated data or not

I am working on a simple linear regression model that predicts cost-per-truck based on distance between origin and destination. Let's assume there's a good linear relationship between Cost and Distance. I have historical data for ~250K shipments which corresponds to around ~2000 lanes (origin, destination pairs).

The goal of this model is to predict cost for new lanes (for which I have no historical data) and get the confidence/prediction intervals.

Some random dummy data –

# A tibble: 10 x 4      Lane   Distance ShipmentNo  Cost     <chr>      <dbl>      <int> <dbl>  1 A to B        100          1  812.  2 A to B        100          2 1055.  3 A to B        100          3  749.  4 A to B        100          4 1479.  5 A to B        100          5 1099.  6 C to D        500          6  754.  7 C to D        500          7 1146.  8 C to D        500          8 1221.  9 C to D        500          9 1173. 10 C to D        500         10  908. 

I am confused between 2 approaches –

  1. Build a model at the Shipment level i.e. every shipment is a data-point (250K points) on Cost vs. Distance scatter plot. Data to be divided into train and test data-sets i.e. test data represents unseen (new) shipments. In this case, the regression line represents mean shipment cost for a given distance. How would this translate into lane level prediction? Would it make sense to use the predicted mean shipment cost along with 95% confidence interval as a prediction for new lanes?

  2. Build a model at the Lane level (aggregated level) i.e. average lane rate (Total Cost / Total Shipments for each lane) is a data-point (now 2000 lanes/points) on Cost vs. Distance scatter plot. Aggregated data to be divided into train and test data-sets i.e. test data represents unseen (new) lanes. I suppose, in this case, the regression line represents the mean of average lane rate for a given distance. Would it make sense to use this mean of average lane rates along with 95% prediction interval as prediction for new lanes?

Not sure but option 2 seems right in terms of level of detail I need. However, I am concerned that by aggregating the data I may be losing information on (or even misrepresenting) the variation in cost of trucks for individual shipments. Also, in option 2 all data-points (lanes) seem to have same weight in spite of some lanes potentially having lot more shipments than others which doesn't seem right.

Based on other answers on this site, general opinions indicate that aggregation is a bad idea – however I am curious to know if it is justified in my case considering the goal is to predict lane rates and not individual shipment rate.

Looking for answers to my above questions and advise on best way to handle this problem. Thanks for the help!

I would suggest taking advantage of all the data by fitting a linear mixed effects model. This is a model that includes fixed effects and random effects. If you aggregate the data then you lose information, while if you fail to account for the probability that costs in one lane are more similar to costs in that lane than other lanes then you will obtain biased results.

To do this, you simply fit random intercepts for lane. Using the lme4 in R package as an example the would look like:

lmer(cost ~ distance + (1 | lane), data = mydata) 

The lane variable is treated as random in the sense that it can be thought of as a variable for which you have more than just a few (5-10 is often mentioned as being the minimum, although there is no consensus on the matter: you have 2000 so this is not an issue at all), and each lane then has it's own intercept which varies randomly around the global intercept (which is equivalent to just "the intercept" in a model without random effects). We do this in order to control for the non-independence of observations/measures within each lane, otherwise biased results are very likely.

You can also allow the effect of distance to be different for each lane by adding random slopes:

lmer(cost ~ distance + (distance | lane), data = mydata) 

And finally you can also allow distance to have a non-linear effect on cost by including nonlinear terms in the model equation, or using splines.

The above models will help you to understand the association between distance and cost for existing lanes, and it will also help you to understand the variation in costs by lane, and in the case of the random slopes model, to also see if the association between cost and distance differs between lanes. However, you say that you are interested in prediction for new lanes that have not been seen before. To get to grips with this problem, more information is needed about what the lanes actually are.

Similar Posts:

Rate this post

Leave a Comment