Solved – Faster option than glmnet for elastic net regularized regression

I attempted to predict crime category from X and Y coordinates using San Francisco crime data from Kaggle (https://www.kaggle.com/c/sf-crime). It turns out that glmnet is pretty slow at fitting this dataset: fitting a sample of 100k observations takes about 3 minutes, and extrapolating that time to the whole dataset (878k observations) suggests the full fit would take about 25 minutes.

So, my questions are:

  1. What is the maximum dataset size for working interactively in R?
  2. How much faster would other languages (Python, Java) be on a similar task?
  3. Does this performance mean I shouldn't attempt such a big problem in R at all, and should choose another language instead?

Best Answer

First, you're right to get a feel for things with a smaller sample and work out the kinks in your approach. And it's definitely true, as you've noticed, that things that take a while to run really break up your focus. It's annoying. But…

Second, you have to define "interactively". Do you mean instantaneously, 1 second, 10 seconds, or what?

Third, you need to account for your hardware, the software you're using, how much data you have, and what algorithm you're executing.

For example, running lm on 100K rows of data might feel "interactive" to you. Obviously glmnet is doing a lot more than that: rather than a single least-squares solve, it fits a whole path of penalized models across a grid of lambda values.
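
For a concrete feel, here is a minimal sketch comparing the two on simulated data (the sizes and the gaussian response are illustrative assumptions, not the Kaggle problem itself):

    # Compare a single least-squares fit to a full glmnet path fit.
    library(glmnet)

    set.seed(1)
    n <- 1e5; p <- 2                  # e.g. X and Y coordinates
    x <- matrix(rnorm(n * p), n, p)
    y <- rnorm(n)

    system.time(lm(y ~ x))            # one OLS solve
    system.time(glmnet(x, y))         # up to 100 penalized fits along the lambda path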

In terms of other languages, Python or Java may be faster if you have a lot of code outside of glmnet that you're executing. As one of the comments says, glmnet's core is compiled C or Fortran and will be about as fast as possible. If you're doing a lot of looping or glue code around your glmnet call, it's easier to write something inefficient in R than in general-purpose languages like Python or Java.
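
A hedged example of the kind of inefficiency that's easy to write in R (the function names here are made up for illustration):

    # Easy to write, slow to run: growing a vector inside a loop.
    slow_scale <- function(v) {
      out <- c()
      m <- max(v)
      for (i in seq_along(v)) out <- c(out, v[i] / m)  # copies `out` every pass
      out
    }

    # Idiomatic R: one vectorized expression doing the same work.
    fast_scale <- function(v) v / max(v)

    v <- runif(1e5)
    system.time(slow_scale(v))
    system.time(fast_scale(v))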

It's possible to parallelize some algorithms; I'm not sure whether glmnet's fit is one of them. If you find an implementation with a parallelized version (one that uses multiple cores, say, on a machine with multiple cores and enough RAM), that will speed things up sub-linearly: four cores won't be four times as fast, but you can expect roughly 2-3x.
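
For what it's worth, the core glmnet path fit is single-threaded as far as I know, but its cross-validation can at least run folds in parallel through a registered foreach backend. A minimal sketch, assuming a 4-core machine and simulated data:

    # Parallel cross-validation with cv.glmnet.
    library(glmnet)
    library(doParallel)

    registerDoParallel(cores = 4)     # assumes 4 cores are available

    set.seed(1)
    x <- matrix(rnorm(1e5 * 2), ncol = 2)
    y <- rnorm(1e5)

    # parallel = TRUE distributes the CV folds across the workers.
    cvfit <- cv.glmnet(x, y, parallel = TRUE)
    stopImplicitCluster()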

So, the answer to your third question is "no, probably not": the algorithm you choose will usually matter more than the language, and perhaps glmnet alone isn't the best option.
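
Before abandoning R, it may also be worth shrinking the work glmnet does per fit. A hedged sketch (nlambda and lambda.min.ratio are real glmnet arguments, but these values are illustrative, not tuned for your data):

    # Cheaper glmnet settings: a shorter, earlier-stopping lambda path.
    library(glmnet)

    set.seed(1)
    x <- matrix(rnorm(1e5 * 2), ncol = 2)
    y <- rnorm(1e5)

    fit <- glmnet(x, y,
                  nlambda = 20,             # default is 100 path points
                  lambda.min.ratio = 0.01)  # stop the path earlier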
