Solved – Including time of day in a linear regression model

The title may be entirely inappropriate for this question: that depends on whether I am on the right track. I am developing a statistical model to estimate flower temperature from air temperature. I'm not very good at anything statistics-related, but I'm decent at programming in Python, so I was thinking of building a linear regression there.

However, I have difficulties figuring out how to build my regression and whether even a linear regression is a good choice at all.

The following plot shows how air temperature and flower temperature vary with time (Month-day Hour).

The black line = Air temperature.

The coloured lines = various flowers whose temperatures were measured.

Perhaps important, but I can't figure out how to make use of them: the arrows at the top show the wind and direction, while the blue line shows the incident solar radiation.

The y-axis shows the temperature (in Celsius) while the x-axis shows the time.

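[Plot: air temperature (black), flower temperatures (coloured), incident solar radiation (blue) and wind arrows, all against time]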

If I were to compute a linear regression, it would have to be valid for any hour of the day. I therefore can't just evaluate how the flower temperature varies with the air temperature, because a 20-degree air temperature at 1 pm won't give the same flower temperature as a 20-degree air temperature at 1 am. I tried separating night and day (with daytime running from about 6 am to 7 pm), but even then the results were too chaotic.

After reading this post: Is time of the day (predictor in regression) a categorical or a continuous variable?, I thought that a categorical approach might work, but what I am getting from it is that I would have 24 different equations, one for each hour of the day, which seems a bit much. I suppose I am prepared to attempt such an approach, but I was hoping to get some advice before pressing on.

Perhaps I should simply use solar radiation instead of the time? But even then, the shape is periodic and I have no idea how to integrate a periodic component within a linear regression!

I think a partially linear modeling framework may be suitable for your problem. If you focus on one flower at a time, note that both the flower data and the air temperature data exhibit strong temporal cycles which peak at roughly the same time. So the simplest partially linear model you could consider for one flower would look like this:

FT_h = beta0 + beta1*AT_h + m(h) + epsilon_h,  

where FT_h is the flower temperature for the chosen flower at hour h, AT_h is the air temperature at hour h, m() is a smooth, unknown function meant to capture the temporal cycles you see in the temperature data and epsilon_h is an unknown error term. Here, h = 1, 2, 3, …, H is an index which counts how many hours you have represented in total in your flower data. In other words, this index counts your hours from the first to the last. If you have 9,000 hours represented in your data, for example, then H = 9,000. In this model, beta1 represents the hourly effect of air temperature on flower temperature, after controlling for temporal effects.
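As a rough illustration of the model above, here is a minimal Python sketch. It makes a simplifying assumption that is not part of the answer: the unknown smooth function m(h) is approximated by a few sine and cosine terms with a 24-hour period, so the whole model can be fitted by ordinary least squares. The function names (harmonic_basis, fit_partially_linear) and the synthetic data are made up for the example; a proper partially linear fit would estimate m() nonparametrically and choose its smoothness from the data.

```python
import numpy as np


def harmonic_basis(hours, n_harmonics=2, period=24.0):
    """Sine/cosine columns used here as a stand-in for the smooth term m(h)."""
    hours = np.asarray(hours, dtype=float)
    cols = []
    for k in range(1, n_harmonics + 1):
        angle = 2.0 * np.pi * k * hours / period
        cols.append(np.sin(angle))
        cols.append(np.cos(angle))
    return np.column_stack(cols)


def fit_partially_linear(ft, at, hours, n_harmonics=2):
    """Least-squares fit of FT_h = beta0 + beta1*AT_h + m(h) + epsilon_h.

    Returns [beta0, beta1, harmonic coefficients...].
    """
    at = np.asarray(at, dtype=float)
    X = np.column_stack(
        [np.ones_like(at), at, harmonic_basis(hours, n_harmonics)]
    )
    coef, *_ = np.linalg.lstsq(X, np.asarray(ft, dtype=float), rcond=None)
    return coef


# Synthetic data (purely illustrative): 10 days of hourly readings.
rng = np.random.default_rng(0)
h = np.arange(1, 241)  # h = 1, 2, ..., H
# Air temperature: a daily cycle plus weather noise, so that AT_h is
# not perfectly explained by the harmonic terms.
at = 15.0 + 6.0 * np.sin(2.0 * np.pi * (h - 9) / 24.0) + rng.normal(0.0, 2.0, h.size)
# Flower temperature built with a true beta1 of 0.8 and a daily cycle m(h).
ft = (
    1.0
    + 0.8 * at
    + 3.0 * np.cos(2.0 * np.pi * (h - 13) / 24.0)
    + rng.normal(0.0, 0.5, h.size)
)

coef = fit_partially_linear(ft, at, h)
print("estimated beta1:", coef[1])
```

Note that beta1 can only be estimated because the air temperature varies beyond its daily cycle. If AT_h were a pure 24-hour sinusoid, it would be indistinguishable from m(h).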

The model can be expanded by adding a linear effect for incident solar radiation (ISR):

FT_h = beta0 + beta1*AT_h + m(h) + beta2*ISR_h + epsilon_h.  

If you wanted to throw in wind direction as well, you could code this variable as taking the values North, South, East, West (or add variations like North-East, North-West, etc.) and include it in your model using dummy variables. For example, if you only code this variable as taking the values North, South, East or West, the flower-specific model could be expressed as:

FT_h = beta0 + beta1*AT_h + m(h) + beta2*ISR_h + beta3*NorthDummy_h + beta4*EastDummy_h + beta5*WestDummy_h + epsilon_h,  

where South is treated as the reference direction against which all others will be compared and NorthDummy_h is set to 1 if wind direction was North at hour h and 0 otherwise, EastDummy_h is set to 1 if wind direction was East at hour h and 0 otherwise and WestDummy_h is set to 1 if wind direction was West at hour h and 0 otherwise.
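The dummy coding described above can be sketched in a few lines of plain Python. The function name wind_dummies and the exact label spellings are assumptions for the example; South is the reference direction, so it maps to a row of zeros.

```python
def wind_dummies(directions, reference="South"):
    """Return rows of [NorthDummy, EastDummy, WestDummy] for each hour.

    The reference direction (South by default) is coded as all zeros.
    """
    levels = ["North", "East", "West", "South"]
    if reference not in levels:
        raise ValueError(f"unknown reference direction: {reference!r}")
    dummy_levels = [d for d in levels if d != reference]

    rows = []
    for d in directions:
        if d not in levels:
            raise ValueError(f"unknown wind direction: {d!r}")
        # 1 if the wind blew from this direction at hour h, 0 otherwise.
        rows.append([1 if d == level else 0 for level in dummy_levels])
    return rows


hourly_wind = ["North", "South", "East", "West", "North"]
for direction, row in zip(hourly_wind, wind_dummies(hourly_wind)):
    print(direction, row)
```

Each row can then be appended to the other predictors for that hour before fitting. Only three columns are needed for four directions, because a fourth would be perfectly collinear with the intercept.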

The challenging aspects of these models are:

  1. The need to estimate the (unknown) degree of smoothness of the (unknown) temporal effect m() carefully, given that this is just a nuisance effect and the real interest is in estimating beta1;

  2. The possibility that the error terms epsilon_h might be temporally correlated, which in turn can affect how item 1. above is addressed.
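For item 2, one simple first check (a diagnostic only, not a remedy, and not something the answer prescribes) is to look at the lag-1 autocorrelation of the residuals from a fitted model. The helper name lag1_autocorrelation is made up for this sketch; values near 0 suggest little correlation between consecutive hours, while values near 1 or -1 suggest strong correlation.

```python
def lag1_autocorrelation(residuals):
    """Sample lag-1 autocorrelation of a sequence of residuals."""
    r = [float(x) for x in residuals]
    n = len(r)
    if n < 2:
        raise ValueError("need at least two residuals")
    mean = sum(r) / n
    centred = [x - mean for x in r]
    denom = sum(c * c for c in centred)
    if denom == 0.0:
        raise ValueError("residuals are constant")
    # Sum of products of each centred residual with its predecessor.
    num = sum(centred[t] * centred[t - 1] for t in range(1, n))
    return num / denom


# Residuals that drift slowly give a positive value ...
print(lag1_autocorrelation([1.0, 2.0, 3.0, 4.0, 5.0]))
# ... and residuals that flip sign every hour give a negative one.
print(lag1_autocorrelation([1.0, -1.0, 1.0, -1.0]))
```

With hourly temperature data, a clearly positive value would not be surprising, and it is a signal that the smoothness of m() should not be chosen by a method that assumes independent errors.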

Many years ago, I conducted research on this very topic; see, for example, http://www.ghement.ca/217.pdf. However, I have not stayed current on the topic, so it's possible there have been several advances in ways to handle item 1.

Intuitively, the temporal signal seen in the data is really strong, while the air temperature signal is likely tiny by comparison. So you need to find the right balance when determining the degree of smoothness of the temporal effect, so as not to throw the baby out with the bathwater.

If you are interested in comparing effects of air temperature across flowers, you can expand the model even further. But I would start small to make sure I get a handle first on the simpler, flower-specific models.
