I am in the market for a new system (probably a laptop) that would be used primarily for Bayesian/MCMC analyses. If I had unlimited funds I would obviously buy very high-end hardware and be done with it. Unfortunately, this is the real world and I have to make my budget stretch. I am looking for advice about the type of system I should get, particularly in terms of processor(s) and amount of RAM. Please note that I am not looking for specific recommendations regarding RAM and processor; I am looking for general guidance about how to financially prioritize the various aspects of the computational hardware. I am fine with a PC or a Mac. If it is relevant, most of my computations would involve running R and/or C.
Best Answer
It's a little tough to provide specific recommendations, particularly without knowing much about your budget and goals. However:
A lot of data analysis can now be done on…nearly anything. If you plan on doing a lot of $t$-tests, ANOVAs, or regression modeling, I think you would be hard-pressed to find a system that was too slow, even with relatively large data sets (tens of thousands of observations).
However, some techniques are considerably more power-hungry. Bootstrapping or other permutation/resampling tests require a fair bit of computation, as does MCMC. Tuning and evaluating machine learning approaches can also eat as many cycles as you care to throw at them, particularly if you're being careful (e.g., nested cross-validation to find hyperparameters).
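To get a feel for how quickly resampling multiplies the work, here is a minimal sketch in base R (the toy data and the 10,000-replicate count are arbitrary); an MCMC run repeats a usually costlier likelihood evaluation in much the same way:

```r
set.seed(1)
x <- rnorm(5000)          # toy data, arbitrary size

# One statistic is cheap to compute...
system.time(mean(x))

# ...but a nonparametric bootstrap repeats it thousands of times
system.time(
  boot_means <- replicate(10000, mean(sample(x, replace = TRUE)))
)

quantile(boot_means, c(0.025, 0.975))   # simple percentile interval
```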
In most cases, having a high-end computer won't make previously intractable problems tractable, but being able to adjust some code and see the results sooner, rather than later, will make a big difference to your productivity/quality of life.
Disk space has rarely been an issue for me, so I would suggest focusing your money on RAM and CPU.
More and faster RAM is always better, obviously, and it's a big win if all of your data and intermediate computations fit into memory (and even better if they fit into the processor's cache). You could try to calculate your RAM needs, but I've noticed that RAM prices tend to have an "elbow" where they go non-linear: 2 GB costs about twice as much as 1 GB, and 4 GB about twice as much as 2 GB…but a single 64 GB module is much more expensive than 2×32 GB. Plus, RAM is fairly cheap and easy to upgrade later, particularly if you have some extra slots on the motherboard, so I'd buy just before the elbow.
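If you do want a back-of-the-envelope estimate of your needs, a double-precision number takes 8 bytes, so a numeric matrix needs roughly rows × columns × 8 bytes (plus whatever copies R makes along the way). A quick sketch, with purely illustrative dimensions:

```r
n <- 1e6    # observations (illustrative)
p <- 200    # variables (illustrative)

# Rough footprint of one numeric matrix, in GB
n * p * 8 / 1024^3        # ~1.5 GB; intermediate copies can easily double this

# Check the size of an actual object
m <- matrix(rnorm(1e6), nrow = 1000)
print(object.size(m), units = "Mb")
```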
CPUs vary in terms of speed, cache, and number of cores. More is better here too, obviously. Speed and cache size don't take any skill to exploit, but your ability to get a lot out of a large number of cores depends on your programming abilities and the type of analysis. MATLAB and Revolution R make it easier to exploit parallelism where it exists; in C, that will largely be your responsibility.
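In R itself, the parallel package that ships with the base distribution handles the easy, embarrassingly parallel cases; here is a minimal sketch (the bootstrap-a-regression task and the 1,000 replicates are just placeholders):

```r
library(parallel)                      # included with base R

n_cores <- max(1, detectCores() - 1)   # leave a core for the OS
cl <- makeCluster(n_cores)

# Embarrassingly parallel: fit the same model to many bootstrap resamples
fit_one <- function(i) {
  d <- mtcars[sample(nrow(mtcars), replace = TRUE), ]
  coef(lm(mpg ~ wt + hp, data = d))
}

res <- parLapply(cl, 1:1000, fit_one)
stopCluster(cl)

apply(do.call(rbind, res), 2, sd)      # bootstrap SEs of the coefficients
```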
In a similar vein, computing on GPUs has become increasingly popular, since (some) GPUs can be blazingly fast for massively parallel computations. If you go this route, picking out a GPU involves all the same trade-offs as selecting a CPU (number of cores, speed, amount of RAM). However, there are several competing standards (OpenCL and CUDA, mainly). If you/your libraries use CUDA, then you need an NVIDIA card; there are more options for OpenCL. As a practical note, since high-end graphics cards are a little atypical for normal office use, you should give your IT or purchasing department a heads-up so they don't think you are trying to build a gaming rig on the company dime (seriously!). Also, be aware that 1) this will take some work on your part and 2) it's not a panacea: moving data on and off the GPU is dog-slow!
If your code or data lives on a network, a fast Ethernet card can be nice. I previously worked somewhere where the home directories and data were all served off a (local) fileserver, and switching from 100 Mbit to gigabit Ethernet massively reduced the amount of time I spent waiting for large data sets to load. If you go this route, you'll also need to ensure that everything between you and the fileserver is upgraded as well. If all your data is local, an SSD can provide similar speed-ups.
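Either way, it is worth measuring whether I/O is actually your bottleneck before spending money on it. A rough sketch, with placeholder paths you would point at a file you actually use:

```r
# Time loading the same data set from local disk vs. the network share.
# Both paths are placeholders.
local_file   <- "~/data/big_dataset.rds"
network_file <- "/mnt/fileserver/big_dataset.rds"

system.time(d_local <- readRDS(local_file))
system.time(d_net   <- readRDS(network_file))

# If the model fitting dwarfs these load times, faster storage or
# networking won't buy you much.
```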
Finally, I'd suggest not driving yourself insane looking for the optimal machine. You can always rent time on EC2 or something if you find yourself in a bind.