# Solved – How to create a reliable regression model with a large number of variables and a few observations in R

I am new to R and I am trying to build a model that explains sales value. In particular, I want to explain how a set of variables (downloaded from http://data.un.org/ and merged in Excel) affects my sales value. To do this, I use linear regression (the *lm()* function) in R. My dataset is small relative to the number of variables I have:

``frame_data --> 27 obs of 40 Variables ``

When I run the model:

``
linearMod <- lm(`Net Sales Value (Net of Inv. Disc.) - US Dollar` ~ ., data = frame_data)
summary(linearMod)
``

the result is this:

``
Call:
lm(formula = `Net Sales Value (Net of Inv. Disc.) - US Dollar` ~
    ., data = numeric_frame_data)
Residuals:
ALL 14 residuals are 0: no residual degrees of freedom!
Coefficients: (26 not defined because of singularities)
                                                                       Estimate Std. Error t value Pr(>|t|)
(Intercept)                                                          -4.946e-09         NA      NA       NA
`Net Sales Quantity`                                                  2.411e-14         NA      NA       NA
`Net Sales Value (Net of Inv. Disc.) - Euro ACT`                      1.130e+00         NA      NA       NA
`Population aged 0 to 14 years old (percentage)`                     -3.121e-11         NA      NA       NA
`Population aged 60+ years old (percentage)`                          3.428e-11         NA      NA       NA
`Population density`                                                 -7.512e-13         NA      NA       NA
`Population mid-year estimates (millions)`                            4.933e-09         NA      NA       NA
`Population mid-year estimates for females (millions)`               -5.170e-09         NA      NA       NA
`Population mid-year estimates for males (millions)`                 -4.684e-09         NA      NA       NA
`Sex ratio (males per 100 females)`                                   2.526e-11         NA      NA       NA
`Surface area (thousand km2)`                                        -8.268e-15         NA      NA       NA
`Tourism expenditure (millions of US dollars)`                       -1.624e-14         NA      NA       NA
`Tourist/visitor arrivals (thousands)`                                5.703e-15         NA      NA       NA
`Gross enrollement ratio - Primary (male)`                            2.413e-11         NA      NA       NA
`Gross enrollment ratio - Primary (female)`                                  NA         NA      NA       NA
`Gross enrollment ratio - Secondary (female)`                                NA         NA      NA       NA
`Gross enrollment ratio - Secondary (male)`                                  NA         NA      NA       NA
`Gross enrollment ratio - Tertiary (female)`                                 NA         NA      NA       NA
`Gross enrollment ratio - Tertiary (male)`                                   NA         NA      NA       NA
`Students enrolled in primary education (thousands)`                         NA         NA      NA       NA
`Students enrolled in secondary education (thousands)`                       NA         NA      NA       NA
`Students enrolled in tertiary education (thousands)`                        NA         NA      NA       NA
`Assault rate per 100,000 population`                                        NA         NA      NA       NA
`Intentional homicide rates per 100,000`                                     NA         NA      NA       NA
`Kidnapping at the national level, rate per 100,000`                         NA         NA      NA       NA
`Percentage of male and female intentional homicide victims, Female`         NA         NA      NA       NA
`Percentage of male and female intentional homicide victims, Male`           NA         NA      NA       NA
`Robbery at the national level, rate per 100,000 population`                 NA         NA      NA       NA
`Theft at the national level, rate per 100,000 population`                   NA         NA      NA       NA
`Total Sexual Violence at the national level, rate per 100,000`              NA         NA      NA       NA
`GDP in constant 2010 prices (millions of US dollars)`                       NA         NA      NA       NA
`GDP in current prices (millions of US dollars)`                             NA         NA      NA       NA
`GDP per capita (US dollars)`                                                NA         NA      NA       NA
`GDP real rates of growth (percent)`                                         NA         NA      NA       NA
`Labour force participation - Female`                                        NA         NA      NA       NA
`Labour force participation - Male`                                          NA         NA      NA       NA
`Labour force participation - Total`                                         NA         NA      NA       NA
`Unemployment rate - Female`                                                 NA         NA      NA       NA
`Unemployment rate - Male`                                                   NA         NA      NA       NA
`Unemployment rate - Total`                                                  NA         NA      NA       NA
Residual standard error: NaN on 0 degrees of freedom
  (13 observations deleted due to missingness)
Multiple R-squared:      1, Adjusted R-squared:    NaN
F-statistic:   NaN on 13 and 0 DF,  p-value: NA
``
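The "no residual degrees of freedom" message above comes down to counting. Of the 27 rows, 13 are dropped for missing values, which leaves 14 complete observations. A model with an intercept and 39 predictors asks for 40 coefficients, and 14 observations can determine at most 14 of them. The remaining 26 are reported as `NA`, and the fit is perfect but uninformative. Below is a minimal sketch that reproduces the same situation on synthetic data; `toy` and its random columns are invented here and only mimic the shape of `frame_data`.

```r
# Synthetic stand-in for frame_data: 27 rows, 39 predictors plus a response.
# All values are random; only the dimensions mimic the real data.
set.seed(1)
n <- 27
p <- 39
toy <- as.data.frame(matrix(rnorm(n * p), nrow = n))  # columns V1..V39
toy$y <- rnorm(n)

# Make 13 rows incomplete, matching the "13 observations deleted" message.
toy[1:13, 1] <- NA

n_complete <- sum(complete.cases(toy))  # rows lm() can actually use
fit <- lm(y ~ ., data = toy)            # lm() silently drops incomplete rows

n_complete             # usable observations
length(coef(fit))      # coefficients requested: intercept + 39 slopes
sum(is.na(coef(fit)))  # coefficients lm() could not estimate
df.residual(fit)       # degrees of freedom left for the error term
```

Note also that `~ .` uses every other column as a predictor, including `Net Sales Quantity` and the Euro sales value, which are presumably very close to the response itself.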

From reading online, I understand that my dataset is too small [1].
When I reduce the number of variables:

``
linearMod <- lm(`Net Sales Value (Net of Inv. Disc.) - US Dollar` ~        # Variable Y
                 `Population aged 0 to 14 years old (percentage)` +        # Variable X
                 `Population aged 60+ years old (percentage)` +
                 `Population density` +
                 `Population mid-year estimates (millions)` +
                 `Population mid-year estimates for females (millions)` +
                 `Population mid-year estimates for males (millions)` +
                 # `Sex ratio (males per 100 females)` +
                 # `Surface area (thousand km2)` +
                 # `Tourism expenditure (millions of US dollars)` +
                 # `Tourist/visitor arrivals (thousands)` +
                 # `Gross enrollement ratio - Primary (male)` +
                 # `Gross enrollment ratio - Primary (female)` +
                 # `Gross enrollment ratio - Secondary (female)` +
                 # `Gross enrollment ratio - Secondary (male)` +
                 # `Gross enrollment ratio - Tertiary (female)` +
                 # `Gross enrollment ratio - Tertiary (male)` +
                 `Students enrolled in primary education (thousands)` +
                 `Students enrolled in secondary education (thousands)` +
                 `Students enrolled in tertiary education (thousands)` +
                 `GDP in constant 2010 prices (millions of US dollars)` +
                 `GDP in current prices (millions of US dollars)` +
                 `GDP per capita (US dollars)` +
                 `GDP real rates of growth (percent)` +
                 # `Assault rate per 100,000 population` +
                 # `Intentional homicide rates per 100,000` +
                 # `Kidnapping at the national level, rate per 100,000` +
                 # `Percentage of male and female intentional homicide victims, Female` +
                 # `Percentage of male and female intentional homicide victims, Male` +
                 # `Robbery at the national level, rate per 100,000 population` +
                 # `Theft at the national level, rate per 100,000 population` +
                 # `Total Sexual Violence at the national level, rate per 100,000` +
                 # `Labour force participation - Female` +
                 # `Labour force participation - Male` +
                 # `Labour force participation - Total` +
                 # `Unemployment rate - Female` +
                 # `Unemployment rate - Male` +
                 `Unemployment rate - Total`,
               data = frame_data)                                         # My dataframe
summary(linearMod)
``
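Commenting terms in and out of a long formula is easy to get wrong. One alternative is to keep the chosen predictors in a character vector and build the formula with `reformulate()`. The sketch below does this on an invented `toy` data frame with a shortened, hypothetical response name. The backticks are added by hand because these column names are not syntactic R names.

```r
# Toy stand-in for frame_data; the response name and all values are invented.
set.seed(1)
toy <- data.frame(
  `Net Sales Value` = rnorm(20),
  `GDP per capita (US dollars)` = rnorm(20),
  `Unemployment rate - Total` = rnorm(20),
  `Population density` = rnorm(20),
  check.names = FALSE  # keep the spaces and parentheses in the names
)

response <- "Net Sales Value"
keep <- c("GDP per capita (US dollars)", "Unemployment rate - Total")

# Wrap each predictor in backticks; pass the response as a symbol.
f <- reformulate(sprintf("`%s`", keep), response = as.name(response))
fit <- lm(f, data = toy)

f
coef(fit)
```

Changing the model then only means editing `keep`, which can also be saved and compared between runs.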

My new result is this:

``
Call:
lm(formula = `Net Sales Value (Net of Inv. Disc.) - US Dollar` ~
    `Population aged 0 to 14 years old (percentage)` + `Population aged 60+ years old (percentage)` +
        `Population density` + `Population mid-year estimates (millions)` +
        `Population mid-year estimates for females (millions)` +
        `Population mid-year estimates for males (millions)` +
        `Students enrolled in primary education (thousands)` +
        `Students enrolled in secondary education (thousands)` +
        `Students enrolled in tertiary education (thousands)` +
        `GDP in constant 2010 prices (millions of US dollars)` +
        `GDP in current prices (millions of US dollars)` + `GDP per capita (US dollars)` +
        `GDP real rates of growth (percent)` + `Unemployment rate - Total`,
    data = frame_data)
Residuals:
    Min      1Q  Median      3Q     Max
-377123 -127525   20489   95333  388344
Coefficients:
                                                         Estimate Std. Error t value Pr(>|t|)
(Intercept)                                             4.853e+06  3.148e+06   1.542   0.1671
`Population aged 0 to 14 years old (percentage)`       -1.107e+05  7.591e+04  -1.459   0.1880
`Population aged 60+ years old (percentage)`           -9.130e+04  7.583e+04  -1.204   0.2677
`Population density`                                   -2.468e+02  7.284e+02  -0.339   0.7446
`Population mid-year estimates (millions)`              3.789e+07  2.503e+07   1.513   0.1740
`Population mid-year estimates for females (millions)` -3.698e+07  2.532e+07  -1.460   0.1876
`Population mid-year estimates for males (millions)`   -3.875e+07  2.474e+07  -1.566   0.1613
`Students enrolled in primary education (thousands)`    1.017e+03  4.584e+02   2.219   0.0620 .
`Students enrolled in secondary education (thousands)` -1.526e+03  5.463e+02  -2.793   0.0268 *
`Students enrolled in tertiary education (thousands)`   3.491e+02  6.538e+02   0.534   0.6099
`GDP in constant 2010 prices (millions of US dollars)` -5.915e+00  3.390e+00  -1.745   0.1246
`GDP in current prices (millions of US dollars)`        7.579e+00  3.463e+00   2.188   0.0649 .
`GDP per capita (US dollars)`                           5.528e+00  8.886e+00   0.622   0.5536
`GDP real rates of growth (percent)`                   -1.470e+05  1.274e+05  -1.154   0.2863
`Unemployment rate - Total`                            -7.130e+04  4.676e+04  -1.525   0.1711
--- Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 320700 on 7 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared:  0.877, Adjusted R-squared:  0.6311
F-statistic: 3.566 on 14 and 7 DF,  p-value: 0.0487
``
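The three population estimates get very large coefficients of opposite sign (about 3.8e+07, -3.7e+07 and -3.9e+07), which is what near-collinearity typically looks like. Total population is, by definition, the female estimate plus the male estimate, so up to rounding these columns carry the same information. The rough sketch below shows one way to check this: regress one predictor on the others and compute a variance inflation factor by hand. The data are synthetic and the column names (`total`, `females`, `males`) are hypothetical.

```r
# Synthetic data: 22 rows, like the 22 observations used in the fit above.
set.seed(42)
n <- 22
females <- runif(n, 1, 50)
males   <- females * runif(n, 0.95, 1.05)
total   <- round(females + males, 2)  # rounding keeps it from being exactly collinear
y       <- 3 * total + rnorm(n)
toy     <- data.frame(y, total, females, males)

# How well do the other predictors explain `total`?
r2 <- summary(lm(total ~ females + males, data = toy))$r.squared
vif_total <- 1 / (1 - r2)  # variance inflation factor for `total`

r2
vif_total

# The coefficient table shows the inflated standard errors.
summary(lm(y ~ total + females + males, data = toy))$coefficients
```

A variance inflation factor this large suggests keeping only one of the three columns. The same check could be applied to the two GDP totals and the three enrolment counts.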

If I understand correctly, the model is starting to find something, but I don't know how to select the best variables for my model. Starting from my first model, I tried stepAIC and step [2][3], but I get:

``AIC is -infinity for this model, so 'step' cannot proceed ``
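This message is consistent with the first model fitting the data perfectly. For an `lm` fit, `extractAIC()` computes `n * log(RSS / n) + 2 * edf`. When the residual sum of squares is zero the logarithm is `-Inf`, so `step()` has no finite value to compare against. Below is a small sketch on made-up data (`toy` is hypothetical) with as many coefficients as observations:

```r
# Six observations and five random predictors: intercept + 5 slopes = 6 coefficients.
set.seed(1)
n <- 6
toy <- as.data.frame(matrix(rnorm(n * 5), nrow = n))  # columns V1..V5
toy$y <- rnorm(n)

saturated <- lm(y ~ ., data = toy)   # as many coefficients as rows
rss_sat   <- deviance(saturated)     # residual sum of squares
aic_sat   <- extractAIC(saturated)[2]

smaller   <- lm(y ~ V1 + V2, data = toy)  # leaves residual degrees of freedom
aic_small <- extractAIC(smaller)[2]

df.residual(saturated)
rss_sat
aic_sat
aic_small
```

Only a model that leaves some residual degrees of freedom gives `step()` a finite AIC to start from. It also needs the same set of complete rows for every candidate model.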

Maybe I'm just making a big mess.

Reference:
