I am fairly new to statistics, so please forgive me for using probably the wrong vocabulary.

I have a dataset of locations with measurements of soil carbon `SOC1mTHA`

. I want to study the predictors that best explain `SOC1mTHA`

, among 15 potential predictors. I have used the `glmulti`

package, and finally concluded that based on AICc of all potential model combinations, the best set of predictors is:

`> global.model <- glm(SOC1mTHA ~ CN + pH + MATcru + PETsplash + MAPcru + + ECM + PGB2, data= dat2) > summary(global.model) Call: glm(formula = SOC1mTHA ~ CN + pH + MATcru + PETsplash + MAPcru + ECM + PGB2, data = dat2) Deviance Residuals: Min 1Q Median 3Q Max -246.02 -78.72 -31.71 101.80 289.56 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -165.25860 179.19404 -0.922 0.358695 CN 35.42207 5.54562 6.387 5.85e-09 *** pH 6.37200 1.80541 3.529 0.000639 *** MATcru 18.86837 5.68337 3.320 0.001270 ** PETsplash -0.87748 0.12285 -7.143 1.69e-10 *** MAPcru 0.11970 0.02665 4.491 1.96e-05 *** ECM -1.86698 0.51799 -3.604 0.000496 *** PGB2 2.76935 0.62110 4.459 2.22e-05 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for gaussian family taken to be 14572.72) Null deviance: 4772480 on 104 degrees of freedom Residual deviance: 1413554 on 97 degrees of freedom AIC: 1314.3 Number of Fisher Scoring iterations: 2 `

However, when visualising the regressions of the individual predictors with `visreg`

, it appears that one of the regressions is nonlinear. In particular, the relationship of soil `pH`

and `SOC1mTHA`

could be a humped-shaped gaussian curve around pH=7.0 (which makes sense in biological terms). Please, see attached figure.

`visreg(global.model) `

What would be the next step? Should I define `global.model`

as composed by 6 linear regressions + 1 nonlinear regression? How could I do it? Sorry if my question is silly or to broad. Some readings that would help me figure out the solution are also welcome.

**EDITED**

Following the suggestion by @adibender I have changed the strategy, running model selection and GAM in one single step with `select=TRUE`

in order to detect potential nonlinear predictors beyond `pH`

.

I have therefore included the full model with all potential predictors in GAM:

`> global.model2 <- gam(SOC1mTHA ~ s(CN) + s(pH) + s(MATcru) + s(PETsplash) + s(MAPcru) + + s(ECM) + s(PGB2) + s(moisture) + s(clay), + data= dat2, select = TRUE) > summary(global.model2) Family: gaussian Link function: identity Formula: SOC1mTHA ~ s(CN) + s(pH) + s(MATcru) + s(PETsplash) + s(MAPcru) + s(ECM) + s(PGB2) + s(moisture) + s(clay) Parametric coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 286.743 3.703 77.44 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Approximate significance of smooth terms: edf Ref.df F p-value s(CN) 5.285e-11 9 0.000 0.230133 s(pH) 7.145e+00 9 2.841 0.000272 *** s(MATcru) 7.475e+00 9 27.030 < 2e-16 *** s(PETsplash) 2.308e+00 9 3.510 3.22e-09 *** s(MAPcru) 7.441e+00 9 11.023 8.01e-16 *** s(ECM) 5.550e+00 8 15.786 < 2e-16 *** s(PGB2) 1.498e+00 8 1.987 2.80e-05 *** s(moisture) 4.146e-11 9 0.000 0.610718 s(clay) 1.935e+00 9 2.325 4.34e-07 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Rank: 81/82 R-sq.(adj) = 0.969 Deviance explained = 97.9% GCV = 2139.9 Scale est. = 1439.8 n = 105 `

How do I know what are the important predictors kept by

`gam`

, similar to what I would obtain with`glmulti`

, stepwise regression or a simple "best model" based on AIC? In other words, how do I know which predictors are unimportant, so I don't even need to gather data on them to predict`SOC1mTHA`

?I have never used a gam model, so I am not sure how to interpret the results. What would be the form of my final model that can be used to predict

`SOC1mTHA`

based on the results of`summary(global.model2)`

above?

**Contents**hide

#### Best Answer

**Non-linear effects**: If you have enough observations, you should be assuming potential non-linearity in all continuous covariates and fit a Generalized Additive Model (GAM) instead. If effects are linear, they will be estimated as such due to penalty.

To fit such models you can use `mgcv::gam`

`library(mgcv) global.model <- gam(SOC1mTHA ~ CN + s(pH) + MATcru + PETsplash + MAPcru + ECM + PGB2, data= dat2) plot(global.model) `

**Model/Variable selection**: Note, if you additionally set `select=TRUE`

in the call to `gam`

, this will also perform model selection, as individual terms may not only be penalized towards a linear effect, but also towards 0 (see this reference).

**Warning**: No matter what type of model selection scheme you choose, standard inference will no longer be valid. E.g. the p-values displayed in your output are worthless, unless you adjust for previous AIC selection. This field of statistics is called Post Selection Inference (POSI), but there are few *ready to use* solutions.