Solved – Simplifying variable effects in a GLM in R

Apologies, but it looks like my question is off topic for this forum. Thanks for all the excellent replies, though. For anyone who comes across this question while looking for something similar, the short answer to my question below is very likely 'no'.

Please note this question has been edited in the light of the excellent responses below.

Can anyone recommend a way to easily ‘simplify’ the effects of variables in an R based GLM model? By simplify I mean any or all of the following:

  • Group levels of a variable with similar effects.
  • Apply a curve to the effects of an ordinal variable.
  • Hand-smooth / alter the effects of a variable.
  • Band (and re-band) a continuous variable.

Given that one has accepted the pros and cons of taking any of these approaches, are there any packages that might help with this (see the sketch after this list for the kind of thing I mean)? I’ve drawn a blank so far.
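For concreteness, here is a minimal base R sketch of the kinds of manipulation listed above, using made-up data and variable names. It illustrates the operations themselves, not a recommended workflow or any particular package:

```r
## Invented data: a categorical, an ordinal, and a continuous predictor.
set.seed(1)
d <- data.frame(
  group = factor(sample(c("A", "B", "C"), 200, replace = TRUE)),
  score = sample(1:5, 200, replace = TRUE),   # ordinal predictor
  age   = runif(200, 18, 80)                  # continuous predictor
)
d$y <- rbinom(200, 1, plogis(-1 + 0.5 * (d$group == "C") + 0.2 * d$score))

## 1. Group levels with similar effects: merge "B" into "A" by
##    assigning duplicate level labels.
d$group2 <- d$group
levels(d$group2) <- c("A", "A", "C")

## 2. Apply a curve to an ordinal variable: model it as numeric with
##    a low-order polynomial instead of one coefficient per level.
fit_curve <- glm(y ~ group2 + poly(score, 2), family = binomial, data = d)

## 3. Band (and re-band, by changing the breaks) a continuous variable.
d$age_band <- cut(d$age, breaks = c(18, 30, 45, 60, 80),
                  include.lowest = TRUE)
fit_band <- glm(y ~ group2 + age_band, family = binomial, data = d)
```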

Thanks everyone for your responses.

For clarity (I hope) I have edited the question to remove some inadvertently confusing terminology, notably the word ‘factor’, which I used in the original wording as a general term for ‘predictor variable’ rather than in the specific R sense of a variable with defined levels (a distinction which, as a relatively new user of R, I unfortunately overlooked).

Also please note that I’m asking this question from an actuarial rather than a pure research standpoint.

In this field, there are reasonably well understood risk profiles (the chance of a car accident, illness, or death at different ages, for instance) and also a degree of leeway in applying ‘reasonable’ tweaks to the effects of individual variables, to the extent that most practitioners end up using software specifically designed for this purpose (a package called ‘Emblem’ being pretty much ubiquitous in the UK insurance industry).

I’m aware of the costs and benefits of treating any results in this way, but was just wondering if there were methods in R that could aid the process once the pros and cons are accepted.
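One base R device that can serve as a hand-smoothing mechanism of this kind, offered purely as an illustration, is to fix a coefficient at a hand-picked value by moving the term into an offset and refitting, so the remaining coefficients re-balance around it. The data and the chosen value below are invented:

```r
## Invented Poisson claim counts driven by age and region.
set.seed(2)
d <- data.frame(
  age    = runif(300, 18, 80),
  region = factor(sample(c("N", "S"), 300, replace = TRUE))
)
d$claims <- rpois(300, exp(-2 + 0.035 * d$age + 0.3 * (d$region == "S")))

## Unconstrained fit: the age coefficient lands wherever the data say.
fit_free <- glm(claims ~ age + region, family = poisson, data = d)
coef(fit_free)["age"]

## 'Hand-smoothed' fit: fix the age effect at a hand-picked 0.04 per
## year (a made-up value) via an offset; the other terms refit around it.
fit_fixed <- glm(claims ~ region + offset(0.04 * age),
                 family = poisson, data = d)
```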

Original Question:

Can anyone recommend a way to easily ‘simplify’ factors in an R based GLM model? By simplify I mean any or all of the following:

  • Combine levels of a factor with similar effects.
  • Apply a curve to the effects of a factor.
  • Hand-smooth / alter the effects of a factor.
  • Band (and re-band) factors in the model.

Are there any packages that might help with this? I’ve drawn a blank so far.

Answer:

It's a simplification to have fewer parameters, but simplifying a model in the light of particular results has a cost too. Statistically minded people differ in their views on modifying a model in the light of its results. Opposite arguments can both have merit: it's natural to want a model to be parsimonious as well as a good fit to the data (what else do we want?), but it's also risky to capitalise on chance and be over-responsive to what may be idiosyncrasies of a particular dataset. Some would argue that you are spending degrees of freedom in making such modifications; the question is whether the accountancy is honest and explicit. Whether there is a specific significance test that justifies a choice is not quite the issue here.

What is often neglected is the importance of fitting a model in a form that is both more transparent to the reader and easier for other researchers to compare when they have different data.

So consider this kind of statement:

"Factor $X$ as initially coded had 3 levels. But on an initial fit levels 2 and 3 were similar in their effects, so we merged them and the published model is for a two-level version."

(In some fields, you might be lucky to get this kind of explanation, as the project might be written up as if the final coding was that determined at the outset.)
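In R terms, the merge described in that statement might look like the following sketch, with invented data and level labels, checked by a likelihood-ratio test:

```r
## Invented binary outcome where levels 2 and 3 have similar effects.
set.seed(3)
d <- data.frame(x = factor(sample(1:3, 150, replace = TRUE)))
d$y <- rbinom(150, 1, plogis(c(-1, -0.4, -0.35)[as.integer(d$x)]))

## Full model: one coefficient per level of the three-level factor.
fit3 <- glm(y ~ x, family = binomial, data = d)

## Merge levels 2 and 3, then refit the two-level version.
d$x2 <- d$x
levels(d$x2) <- c("1", "2+3", "2+3")
fit2 <- glm(y ~ x2, family = binomial, data = d)

## Does keeping levels 2 and 3 separate improve the fit significantly?
anova(fit2, fit3, test = "Chisq")
```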

There are at least two problems with this other than those already hinted at (discussed in various literatures under headings such as "data snooping"):

  1. If it was thought worthwhile to use 3 levels in data production (e.g. these were the categories offered to people taking a questionnaire survey) there is information in the pattern of the coefficients, even if some seem similar in magnitude in one particular dataset.

  2. This kind of adhockery makes it difficult for other researchers to do similar studies and check whether patterns are similar with other datasets. It's not obvious that a simplification defensible in one dataset will make sense with another.

All of this implies, as a counsel of perfection, being open about initial and final models and why a model was modified. In practice there often need to be compromises, and there may be severe constraints depending on (e.g.) instructions from supervisors, reporting conventions in different fields, or pressure from reviewers and editors on the space allowed in journals.

This answer deliberately does not focus on what software might be used, within R or within any other statistical software, and the reason can now be explained. Model simplification does, and should, depend on substantive knowledge of what makes sense in the light of the underlying science, or in the light of practical knowledge about the situation in which the data were produced or in which the model is to be used. It is precisely the kind of decision that is, and should be, difficult to automate. That prejudice is, naturally, not proof that software purporting to help is impossible or non-existent.
