Recently, a number of people who were struggling with slow R code have asked me for advice. Commendably, they had ticked off the usual suspects, such as
- Do you loop over things that could be vectorized?
- Do you store everything in arrays, avoid resizing, and avoid accessing elements by name with $?
- Have you used a profiler?
Curiously, however, when I asked about parallelization, hardly anyone had given it much thought. People were running a single instance of R on their local machine or on a cluster, even though most people nowadays have access to larger systems and the problem was, in most cases, trivially parallelizable.
What I mean by trivially parallelizable is that, in typical statistical calculations, you have to do the same thing many times. For example, people were doing cross-validations, where they made the same fits with different subsets of the data, or they repeated the same analysis many times, e.g. to look at the expected error of their statistical estimates. This type of problem is called embarrassingly parallel: it is trivial to split up such tasks and run them on different processors of your local computer, on a larger machine, or on a cluster system at your institute, and your speed-up will usually be close to the ideal case, where 1/runtime scales linearly with the number of cores.
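To make this concrete, here is a minimal sketch of such a pattern: a 5-fold cross-validation on made-up data. The data, the model, and the fold logic are all invented for illustration; the point is that each fold's fit is independent of the others, so each iteration could run on its own core.

```r
set.seed(1)
d <- data.frame(x = rnorm(100), y = rnorm(100))        # toy data
folds <- sample(rep(1:5, length.out = nrow(d)))        # assign each row to a fold

# Each iteration fits on 4 folds and evaluates on the held-out fold.
# The iterations are completely independent -- embarrassingly parallel.
cv_errors <- sapply(1:5, function(k) {
  fit  <- lm(y ~ x, data = d[folds != k, ])            # fit on training folds
  pred <- predict(fit, newdata = d[folds == k, ])      # predict held-out fold
  mean((pred - d$y[folds == k])^2)                     # test error for fold k
})
mean(cv_errors)                                        # overall CV error
```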
So, is it complicated to parallelize R code? Usually not. Here are a few solutions, in ascending order of technical complexity:
1) Trivial parallelization – the pedestrian’s method
Forgive me for insulting your intelligence with this one, but to keep the list complete: you can simply split up your job by hand into n subjobs, create a script for each of them, and then start n instances of R on your local machine, remote server, or cluster. Those jobs will typically write their results to file, and afterwards you collect the outputs with a further script to calculate the final results. Embarrassingly simple, but it works, and implementation time might be no more than 20 minutes, so why not?
If you keep on doing this, you might want to think about starting your n R instances from an R script, and collecting the results automatically after they have finished.
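The split / write-to-file / collect pattern can be sketched as below. The file names and the toy "computation" are made up for illustration; in real use, the middle loop would be n separate scripts, each running in its own R process.

```r
tasks <- split(1:100, rep(1:4, each = 25))    # split the job into 4 subjobs

# What each of the n subjob scripts would do:
for (i in seq_along(tasks)) {
  res <- sum(tasks[[i]])                      # stand-in for the real computation
  saveRDS(res, sprintf("subjob_%d.rds", i))   # write the partial result to file
}

# The collector script, run after all subjobs have finished:
partial <- sapply(seq_along(tasks),
                  function(i) readRDS(sprintf("subjob_%d.rds", i)))
total <- sum(partial)                         # 5050, same as sum(1:100)
```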
2) Consumer parallelization – the Volkswagen/Toyota/GM methods
Option 1) works all right in some situations, but if you need communication between processes, for example to share intermediate results, writing to file is no longer a good solution: we need some way for the different processes that are solving our problem to communicate directly.
Explicitly organizing this communication between parallel processes, however, is not trivial (see the third solution below). For simple tasks, it is better to use ready-made solutions that let us run typical tasks in parallel without having to worry much about how the communication is organized. I call such solutions “consumer parallelization” because they come at an affordable cost for most people while providing a lot of benefit at the same time.
Some basic parallelization of that kind has been provided directly in R since version 2.14, and there are a number of additional packages that provide simple methods for parallelizing loops, lapply etc., as well as specialized packages for certain tasks that include parallel functionality. Look at http://cran.r-project.org/web/views/HighPerformanceComputing.html. Which is best for you depends on what you want to parallelize, whether you are running your code on a shared-memory system or a distributed-memory system such as a cluster, and whether you want to run your code on some specific parallel architecture such as a grid or Amazon's EC2. In any case, many of these solutions are relatively easy to implement, so a moderately experienced R user should be able to parallelize simple code segments with the help of the manual in a moderate amount of time (I would guess on the order of hours for a typical real-life situation).
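As a small example of what this looks like in practice: with the parallel package that ships with R (since 2.14), replacing an lapply with a parallel version takes only a few lines. The toy task below is invented; substitute your own fit function.

```r
library(parallel)                       # shipped with R since 2.14

slow_task <- function(x) { Sys.sleep(0.01); x^2 }   # stand-in for a real fit

# Serial version:
serial <- lapply(1:8, slow_task)

# Parallel version: a cluster of worker processes
cl <- makeCluster(2)                    # 2 workers; use detectCores() in practice
par_res <- parLapply(cl, 1:8, slow_task)
stopCluster(cl)

identical(serial, par_res)              # same results, computed in parallel
```

On Linux or Mac, `mclapply(1:8, slow_task, mc.cores = 2)` is an even shorter fork-based alternative, but it does not work on Windows.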
3) Rocket science parallelization – the Ferrari method
If you have a problem that is very complex to parallelize, or a problem so big that you want to move to one of the really large clusters (I mean larger than university clusters), you will probably go for Rmpi, an R interface to MPI, the de facto standard in science for message passing between distributed computers. It is installed on most large clusters and gives you full control over the parallelization. However, if you haven't used Rmpi before, be sure to reserve some time, or better, take a course; you don't want to go this route before you have checked whether the other methods would also do the job.
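For orientation, a minimal Rmpi session might look like the sketch below. This is not runnable here: it requires an MPI installation and is typically launched via mpirun on the cluster, and the worker function is invented for illustration.

```r
library(Rmpi)

mpi.spawn.Rslaves(nslaves = 4)           # start 4 worker processes

my_fit <- function(i) mean(rnorm(1000))  # stand-in for the real computation
mpi.bcast.Robj2slave(my_fit)             # send the function to all workers

# Each worker executes the expression; results come back as a list
results <- mpi.remote.exec(my_fit(mpi.comm.rank()))

mpi.close.Rslaves()
mpi.quit()
```

Unlike the consumer solutions above, Rmpi also exposes the low-level send/receive primitives, which is what you need when workers must exchange intermediate results with each other rather than just report back to a master.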