How to create a reliable regression model with a large number of variables and few observations in R

I am a newbie with R and I am trying to create a model that explains sales value. In particular, I want to explain how a series of variables (downloaded from http://data.un.org/ and merged with Excel) affects my sales value. To do this, I use linear regression (the *lm()* function) in R. My dataset is small for the number of variables I have:

frame_data --> 27 obs of 40 Variables 

When I run the model:

    linearMod <- lm(`Net Sales Value (Net of Inv. Disc.) - US Dollar` ~ ., data = frame_data)
    summary(linearMod)

the result is this:

    Call:
    lm(formula = `Net Sales Value (Net of Inv. Disc.) - US Dollar` ~
        ., data = numeric_frame_data)

    Residuals:
    ALL 14 residuals are 0: no residual degrees of freedom!

    Coefficients: (26 not defined because of singularities)
                                                                           Estimate Std. Error t value Pr(>|t|)
    (Intercept)                                                          -4.946e-09         NA      NA       NA
    `Net Sales Quantity`                                                  2.411e-14         NA      NA       NA
    `Net Sales Value (Net of Inv. Disc.) - Euro ACT`                      1.130e+00         NA      NA       NA
    `Population aged 0 to 14 years old (percentage)`                     -3.121e-11         NA      NA       NA
    `Population aged 60+ years old (percentage)`                          3.428e-11         NA      NA       NA
    `Population density`                                                 -7.512e-13         NA      NA       NA
    `Population mid-year estimates (millions)`                            4.933e-09         NA      NA       NA
    `Population mid-year estimates for females (millions)`               -5.170e-09         NA      NA       NA
    `Population mid-year estimates for males (millions)`                 -4.684e-09         NA      NA       NA
    `Sex ratio (males per 100 females)`                                   2.526e-11         NA      NA       NA
    `Surface area (thousand km2)`                                        -8.268e-15         NA      NA       NA
    `Tourism expenditure (millions of US dollars)`                       -1.624e-14         NA      NA       NA
    `Tourist/visitor arrivals (thousands)`                                5.703e-15         NA      NA       NA
    `Gross enrollement ratio - Primary (male)`                            2.413e-11         NA      NA       NA
    `Gross enrollment ratio - Primary (female)`                                  NA         NA      NA       NA
    `Gross enrollment ratio - Secondary (female)`                                NA         NA      NA       NA
    `Gross enrollment ratio - Secondary (male)`                                  NA         NA      NA       NA
    `Gross enrollment ratio - Tertiary (female)`                                 NA         NA      NA       NA
    `Gross enrollment ratio - Tertiary (male)`                                   NA         NA      NA       NA
    `Students enrolled in primary education (thousands)`                         NA         NA      NA       NA
    `Students enrolled in secondary education (thousands)`                       NA         NA      NA       NA
    `Students enrolled in tertiary education (thousands)`                        NA         NA      NA       NA
    `Assault rate per 100,000 population`                                        NA         NA      NA       NA
    `Intentional homicide rates per 100,000`                                     NA         NA      NA       NA
    `Kidnapping at the national level, rate per 100,000`                         NA         NA      NA       NA
    `Percentage of male and female intentional homicide victims, Female`         NA         NA      NA       NA
    `Percentage of male and female intentional homicide victims, Male`           NA         NA      NA       NA
    `Robbery at the national level, rate per 100,000 population`                 NA         NA      NA       NA
    `Theft at the national level, rate per 100,000 population`                   NA         NA      NA       NA
    `Total Sexual Violence at the national level, rate per 100,000`              NA         NA      NA       NA
    `GDP in constant 2010 prices (millions of US dollars)`                       NA         NA      NA       NA
    `GDP in current prices (millions of US dollars)`                             NA         NA      NA       NA
    `GDP per capita (US dollars)`                                                NA         NA      NA       NA
    `GDP real rates of growth (percent)`                                         NA         NA      NA       NA
    `Labour force participation - Female`                                        NA         NA      NA       NA
    `Labour force participation - Male`                                          NA         NA      NA       NA
    `Labour force participation - Total`                                         NA         NA      NA       NA
    `Unemployment rate - Female`                                                 NA         NA      NA       NA
    `Unemployment rate - Male`                                                   NA         NA      NA       NA
    `Unemployment rate - Total`                                                  NA         NA      NA       NA

    Residual standard error: NaN on 0 degrees of freedom
      (13 observations deleted due to missingness)
    Multiple R-squared:      1, Adjusted R-squared:    NaN
    F-statistic:   NaN on 13 and 0 DF,  p-value: NA

Reading online, I have understood that my dataset is too small [1].
When I reduce the number of variables:

    linearMod <- lm(`Net Sales Value (Net of Inv. Disc.) - US Dollar` ~        # Variable Y
                    `Population aged 0 to 14 years old (percentage)` +        # Variable X
                    `Population aged 60+ years old (percentage)`     +
                    `Population density`     +
                    `Population mid-year estimates (millions)`     +
                    `Population mid-year estimates for females (millions)`     +
                    `Population mid-year estimates for males (millions)`     +
                    # `Sex ratio (males per 100 females)`     +
                    # `Surface area (thousand km2)`     +
                    # `Tourism expenditure (millions of US dollars)`     +
                    # `Tourist/visitor arrivals (thousands)`     +
                    # `Gross enrollement ratio - Primary (male)`     +
                    # `Gross enrollment ratio - Primary (female)`     +
                    # `Gross enrollment ratio - Secondary (female)`     +
                    # `Gross enrollment ratio - Secondary (male)`     +
                    # `Gross enrollment ratio - Tertiary (female)`     +
                    # `Gross enrollment ratio - Tertiary (male)`     +
                    `Students enrolled in primary education (thousands)`     +
                    `Students enrolled in secondary education (thousands)`     +
                    `Students enrolled in tertiary education (thousands)`     +
                    `GDP in constant 2010 prices (millions of US dollars)`     +
                    `GDP in current prices (millions of US dollars)`     +
                    `GDP per capita (US dollars)`     +
                    `GDP real rates of growth (percent)`     +
                    #  `Assault rate per 100,000 population`     +
                    #  `Intentional homicide rates per 100,000`     +
                    #  `Kidnapping at the national level, rate per 100,000`     +
                    #  `Percentage of male and female intentional homicide victims, Female`     +
                    #  `Percentage of male and female intentional homicide victims, Male`     +
                    #  `Robbery at the national level, rate per 100,000 population`     +
                    #  `Theft at the national level, rate per 100,000 population`     +
                    #  `Total Sexual Violence at the national level, rate per 100,000`     +
                    #  `Labour force participation - Female`     +
                    #  `Labour force participation - Male`     +
                    #  `Labour force participation - Total`     +
                    #  `Unemployment rate - Female`     +
                    #  `Unemployment rate - Male`     +
                    `Unemployment rate - Total`,
                    data = frame_data)                                         # My dataframe
    summary(linearMod)

My new result is this:

    Call:
    lm(formula = `Net Sales Value (Net of Inv. Disc.) - US Dollar` ~
        `Population aged 0 to 14 years old (percentage)` + `Population aged 60+ years old (percentage)` +
            `Population density` + `Population mid-year estimates (millions)` +
            `Population mid-year estimates for females (millions)` +
            `Population mid-year estimates for males (millions)` +
            `Students enrolled in primary education (thousands)` +
            `Students enrolled in secondary education (thousands)` +
            `Students enrolled in tertiary education (thousands)` +
            `GDP in constant 2010 prices (millions of US dollars)` +
            `GDP in current prices (millions of US dollars)` + `GDP per capita (US dollars)` +
            `GDP real rates of growth (percent)` + `Unemployment rate - Total`,
        data = frame_data)

    Residuals:
        Min      1Q  Median      3Q     Max
    -377123 -127525   20489   95333  388344

    Coefficients:
                                                             Estimate Std. Error t value Pr(>|t|)
    (Intercept)                                             4.853e+06  3.148e+06   1.542   0.1671
    `Population aged 0 to 14 years old (percentage)`       -1.107e+05  7.591e+04  -1.459   0.1880
    `Population aged 60+ years old (percentage)`           -9.130e+04  7.583e+04  -1.204   0.2677
    `Population density`                                   -2.468e+02  7.284e+02  -0.339   0.7446
    `Population mid-year estimates (millions)`              3.789e+07  2.503e+07   1.513   0.1740
    `Population mid-year estimates for females (millions)` -3.698e+07  2.532e+07  -1.460   0.1876
    `Population mid-year estimates for males (millions)`   -3.875e+07  2.474e+07  -1.566   0.1613
    `Students enrolled in primary education (thousands)`    1.017e+03  4.584e+02   2.219   0.0620 .
    `Students enrolled in secondary education (thousands)` -1.526e+03  5.463e+02  -2.793   0.0268 *
    `Students enrolled in tertiary education (thousands)`   3.491e+02  6.538e+02   0.534   0.6099
    `GDP in constant 2010 prices (millions of US dollars)` -5.915e+00  3.390e+00  -1.745   0.1246
    `GDP in current prices (millions of US dollars)`        7.579e+00  3.463e+00   2.188   0.0649 .
    `GDP per capita (US dollars)`                           5.528e+00  8.886e+00   0.622   0.5536
    `GDP real rates of growth (percent)`                   -1.470e+05  1.274e+05  -1.154   0.2863
    `Unemployment rate - Total`                            -7.130e+04  4.676e+04  -1.525   0.1711
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 320700 on 7 degrees of freedom
      (5 observations deleted due to missingness)
    Multiple R-squared:  0.877, Adjusted R-squared:  0.6311
    F-statistic: 3.566 on 14 and 7 DF,  p-value: 0.0487

If I understand correctly, the model is starting to find something, but I don't know how to select the best variables for my model. Starting from my first model, I tried stepAIC and step [2][3], but I get:

AIC is -infinity for this model, so 'step' cannot proceed 

Maybe I'm just making a big mess.

References:

[1] https://stackoverflow.com/questions/47386290/summary-of-model-returning-na

[2] http://www.sthda.com/english/articles/37-model-selection-essentials-in-r/154-stepwise-regression-essentials-in-r/

[3] https://www.rdocumentation.org/packages/stats/versions/3.6.0/topics/step

You have far too many predictors for this regression. You started with 27 data points in your dependent variable, and missing data reduced that further: the first model was fitted on only 14 complete rows and estimated 14 coefficients, which leaves no residual degrees of freedom. That is why it fits perfectly (R-squared of 1) and why step reports an AIC of -infinity. Regression with a sample this size is not unheard of, but you have to be careful about what such an analysis can tell you.

With that in mind, before anyone can tell you what the 'best' variables are, you need to explain what the model is for. Are you using it primarily for prediction, or are you trying to build a model that explains sales value? If the former, a sample this small limits what you can achieve, but methods like penalized regression can work well when you have too many variables in your model. If the latter, you want to be guided by a theory, or at least an idea, of how these variables relate to the outcome and which are the most important.
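To make the penalized-regression idea concrete, here is a minimal sketch in base R. It is an illustration under stated assumptions, not a recipe for your data: the data are simulated (27 rows, 14 predictors, a few missing cells), every name in it (`X`, `y`, `ridge_coef`, `loo_mse`) is hypothetical, and the penalty is a plain ridge penalty solved in closed form. In practice a package such as glmnet is the more usual choice.

```r
# Simulated stand-in for frame_data: 27 rows, 14 predictors, some missing cells.
# All names and numbers are hypothetical, not the asker's data.
set.seed(1)
n <- 27
p <- 14
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- 2 * X[, 1] - 1.5 * X[, 2] + rnorm(n)
X[sample(length(X), 5)] <- NA              # knock out a few cells

# lm() silently drops incomplete rows, so check how many rows are really usable.
keep <- complete.cases(X, y)
n_used <- sum(keep)

# Standardise predictors so the penalty treats them equally; centre y so no
# intercept is needed. (Simplification: scaling is not redone inside each fold.)
X <- scale(X[keep, , drop = FALSE])
y <- y[keep] - mean(y[keep])

# Ridge estimate: solve (X'X + lambda * I) b = X'y
ridge_coef <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

# Leave-one-out cross-validated mean squared error for a single lambda
loo_mse <- function(X, y, lambda) {
  errs <- sapply(seq_along(y), function(i) {
    b <- ridge_coef(X[-i, , drop = FALSE], y[-i], lambda)
    y[i] - sum(X[i, ] * b)
  })
  mean(errs^2)
}

# Pick the penalty with the smallest cross-validated error
lambdas <- 10^seq(-2, 3, length.out = 30)
cv <- sapply(lambdas, function(l) loo_mse(X, y, l))
best_lambda <- lambdas[which.min(cv)]

b_ridge <- ridge_coef(X, y, best_lambda)   # shrunken coefficients
b_ols   <- ridge_coef(X, y, 0)             # unpenalised fit, for comparison
```

Comparing `b_ridge` with `b_ols` shows the shrinkage: the penalty trades a little bias for much lower variance, which is the point when there are almost as many predictors as observations. The cross-validated error, not the in-sample R-squared, is the number to judge the model by.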
