I know it's a very general question, but I'm wondering what kind of issues I should expect if the number of categories in my dependent variable (or even in predictor variables) is more than, for example, 10.


#### Best Answer

There are a number of ways to think about this question. Probably the first consideration is resource dependent, boiling down to where you are doing your analysis: laptop or massively parallel platform? You should ask how much RAM or memory is accessible. RAM impacts the ability of your software to, e.g., invert a cross-product matrix or converge to a solution with a closed-form algorithm. Quite obviously, the bigger the platform and the more RAM available, the bigger the matrix that can be handled. Next, there are software considerations: R, for instance, is notoriously limited in how much categorical information it can handle, whether in the target or the feature variables, while other packages such as SAS have much greater inherent capacity.
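To make the RAM point concrete, here is a rough back-of-the-envelope sketch (in Python, my choice of language, not anything from the answer) of how dummy coding a high-cardinality categorical predictor inflates the p × p cross-product matrix X'X that a closed-form solver must build and invert. The function name and example numbers are illustrative, not from any particular package.

```python
# Rough memory estimate for the p x p cross-product matrix X'X
# when categorical predictors are dummy (one-hot) coded.
def xtx_bytes(n_numeric, category_counts, dtype_bytes=8):
    """Size in bytes of X'X under dummy coding.

    Each categorical variable with k levels contributes k - 1 dummy
    columns (one level serves as the reference category), so the total
    column count p, and hence the p * p matrix, grows with cardinality.
    """
    p = n_numeric + sum(k - 1 for k in category_counts)
    return p * p * dtype_bytes

# Example: 5 numeric predictors plus a single categorical feature.
small = xtx_bytes(5, [10])       # a 10-level factor: a trivially small matrix
large = xtx_bytes(5, [40_000])   # a ZIP-code-like factor: roughly 12.8 GB
print(small, large / 1e9)
```

The quadratic growth in p is why a single "massively categorical" variable can exhaust memory long before the raw data itself does.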

Next, there is the issue of the approach or *theory* underpinning the analysis — e.g., frequentist or Bayesian? Inference or prediction and classification? Statistics or machine learning and computer science? Precise or approximate?

Historically, frequentists have thrown up their hands in defeat when, e.g., a cross-product matrix became too big to invert. A good example of this is probit models with more than 3 levels in the target. Using classic, closed-form statistical models, there isn't enough CPU in 10,000 years to solve this. Bayesians, on the other hand, were the first to identify workarounds to this problem. Let me illustrate this with a couple of examples. Fifteen years ago, Steenburgh and Ainslie wrote a paper, *Massively Categorical Variables: Revealing the Information in Zip Codes*, offering a hierarchical Bayesian solution to this problem. In your case, you have a multinomial target; their approach is readily generalizable from features to targets. That the Ainslie method (and many Bayesian models) generates a boatload of parameters is not insuperable; it just may not be the most efficient solution. Next, in their book *Data Analysis Using Regression and Multilevel/Hierarchical Models*, Gelman and Hill propose the possibility of Bayesian analysis with a multilevel categorical variable, some of whose levels contain only a single observation, i.e., very sparse information. The key to this counter-intuitive notion is that the information for that single observation will be summarized by the posterior across multiple draws. Note that these are *Bayesian* approximating heuristic workarounds.
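The Gelman-and-Hill idea of getting a usable estimate even for a single-observation level comes from partial pooling: sparse levels borrow strength from the rest of the data. As a hedged illustration (this is a toy empirical-Bayes shrinkage formula of my own construction, not the multilevel model from their book, and the variance parameters are assumed known for simplicity):

```python
from statistics import mean

def partial_pool(groups, tau2, sigma2):
    """Toy partial pooling of per-level means toward the grand mean.

    groups: dict mapping level -> list of observations.
    tau2:   assumed between-level variance.
    sigma2: assumed within-level variance.

    A level with n_j observations gets weight (n_j / sigma2) on its own
    mean, so a single-observation level is pulled strongly toward the
    grand mean instead of being taken at face value.
    """
    grand = mean(x for obs in groups.values() for x in obs)
    pooled = {}
    for level, obs in groups.items():
        n = len(obs)
        w = (n / sigma2) / (n / sigma2 + 1 / tau2)
        pooled[level] = w * mean(obs) + (1 - w) * grand
    return pooled

# Level "B" has only one observation; its pooled estimate lands between
# its raw mean (9.0) and the grand mean, rather than at the raw mean.
data = {"A": [4.8, 5.2, 5.0, 4.9], "B": [9.0]}
est = partial_pool(data, tau2=1.0, sigma2=1.0)
```

In a full Bayesian fit, the posterior draws do this shrinkage automatically, which is why a one-observation level is not fatal.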

Today, even frequentists have access to such heuristic, approximating workarounds, e.g., bootstrapping, jackknifing, Breiman's random forests, and computer-science-driven algorithms for massive data mining such as "divide and conquer" (D&C) or "bags of little jackknives" (BLJ); see, e.g., Wang et al.'s paper *A Survey of Statistical Methods and Computing for Big Data*. These approaches don't render Bayesian solutions obsolete (previously the only game in town when, e.g., huge cross-product matrices could not be inverted); they just make Bayesian approaches unnecessary. Software considerations arise again with these resampling methods insofar as I've heard that R doesn't easily permit the large, even massive, number of iterative loops required, but then, I'm not an R guy, so I could easily be wrong.
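The divide-and-conquer idea can be sketched in a few lines (my own minimal illustration, not code from the Wang et al. survey): fit the estimator on each chunk separately, then combine the per-chunk results, so no single solve ever sees the whole dataset.

```python
import random

def chunk_estimates(data, n_chunks, estimator):
    """Split-and-conquer: apply the estimator per chunk, then average.

    With equal-sized chunks and a well-behaved estimator, a simple
    average of the per-chunk estimates approximates the full-data
    estimate; each chunk can be processed independently (and in
    parallel) on modest hardware.
    """
    size = len(data) // n_chunks
    chunks = [data[i * size:(i + 1) * size] for i in range(n_chunks)]
    parts = [estimator(c) for c in chunks]
    return sum(parts) / len(parts)

random.seed(0)
data = [random.gauss(10, 2) for _ in range(100_000)]
full = sum(data) / len(data)
combined = chunk_estimates(data, n_chunks=100,
                           estimator=lambda c: sum(c) / len(c))
# For a linear statistic like the mean, the combined estimate matches
# the full-data estimate; for nonlinear ones it is an approximation.
```

More sophisticated combining rules (e.g., weighting by chunk size or by estimated variance) exist, but averaging conveys the core idea.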

Questions concerning the accuracy of these approximating workarounds have been addressed by Chen and Xie in a paper titled *A Split-and-Conquer Approach for Analysis of Extraordinarily Large Data*. They concluded that there was no significant loss of precision with these approaches relative to analyses based on the full, fixed dataset.

Finally, the consideration of inference in the face of massive information has had many implications for statistical analysis in the 21st century. To mention only one, classic 20th-century statistical analyses and approaches have to be adapted and updated to reflect today's realities. Efron and Hastie's book *Computer Age Statistical Inference* contains a multitude of suggestions with respect to deriving inferences from large amounts of information. In particular, I like their discussion in chapter 10 of bootstrapping and jackknifing versus, e.g., classic Taylor-expansion approaches.
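For a concrete sense of the bootstrap-versus-jackknife comparison discussed there, here is a small self-contained sketch (my own, assuming the textbook definitions of both resampling schemes, not code from the book) estimating the standard error of a sample mean both ways:

```python
import random
from statistics import mean, stdev

def bootstrap_se(xs, stat, B=2000, seed=42):
    """Bootstrap SE: resample with replacement B times, take the SD."""
    rng = random.Random(seed)
    reps = [stat([rng.choice(xs) for _ in xs]) for _ in range(B)]
    return stdev(reps)

def jackknife_se(xs, stat):
    """Jackknife SE: recompute the statistic leaving one point out."""
    n = len(xs)
    reps = [stat(xs[:i] + xs[i + 1:]) for i in range(n)]
    rep_mean = mean(reps)
    return ((n - 1) / n * sum((r - rep_mean) ** 2 for r in reps)) ** 0.5

random.seed(1)
xs = [random.gauss(0, 1) for _ in range(50)]
# For the sample mean, both should land near the textbook answer,
# s / sqrt(n); the jackknife reproduces it exactly for this statistic.
analytic = stdev(xs) / len(xs) ** 0.5
boot, jack = bootstrap_se(xs, mean), jackknife_se(xs, mean)
```

The appeal of both is that they apply unchanged to statistics with no tractable Taylor-expansion (delta-method) variance formula, at the cost of many repeated fits.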

### Similar Posts:

- Solved – Bayesian vs Frequentist: practical difference w.r.t. machine learning
- Solved – What frequentist statistics topics should I know before learning Bayesian statistics
- Solved – Use of information theory in applied data science
- Solved – the difference between a Frequentist approach with meta-analysis and a Bayesian approach