Parallelization

Most desktop computers today have at least 2 CPUs, and some have 8 or more. R is inherently a ‘single-threaded’ application, meaning that on its own it uses only one CPU, no matter how many are available. This is changing rapidly. For a good overview of how to exploit multiple CPUs in R, read this excellent paper.

I personally use a few packages in R that exploit the multiple CPUs on my desktop (an 8-CPU Mac Pro Harpertown, early 2008, with 32GB RAM). These should work on any Unix-like box. There are several other packages available (see here), most of which I have not tried. Here are the ones I use:

pnmath0 - downloadable from Luke Tierney’s personal webpage. This package is nice because the user only has to load it and then forget about it; it requires NO OTHER CODING. It works behind the scenes to figure out which algorithms can be parallelized and then parallelizes them whenever a speed advantage is predicted. Note that this only speeds things up for those functions that Luke has coded to take advantage of parallel processes.
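For example (a minimal sketch; pnmath0 is not on CRAN, so the tarball path below is hypothetical; point install.packages() at whatever file you downloaded from Luke's page):

install.packages("~/Downloads/pnmath0.tar.gz", repos=NULL, type="source")  #install from the downloaded source tarball
require(pnmath0)  #load it and forget about it; no other coding needed
p <- runif(1e7)
system.time(q <- qbeta(p, 2, 5))  #an expensive vectorized math call of the sort pnmath0 parallelizes, when the vector is long enough to make it worthwhile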

The next three are used in conjunction:

foreach - downloadable from CRAN.
doMC - downloadable from CRAN.
multicore - downloadable from CRAN.

multicore merely provides the backend that makes foreach work: it enables R to fork some number of processes, each of which has access to the same shared memory. doMC registers multicore as the parallel backend, which lets foreach use the multiple CPUs. I use the foreach package mostly to do loops. Often a loop takes a very long time to run but doesn’t need much memory (RAM) per iteration; those are exactly the situations in which to use multiple processes. To do this, here’s the code:

require(foreach) #loads package foreach
require(doMC) #loads both doMC and multicore
search()   #make sure all 3 packages are loaded
multicore:::detectCores(all.tests=TRUE)  #to see how many cores are available
registerDoMC(cores=6) #tells R to use a maximum of 6 processes; I like having 2 CPUs in reserve
x <- foreach(i=1:10) %dopar% svd(matrix(rnorm(1000*1000),ncol=1000))
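By default foreach() returns its results as a list, one element per iteration; the .combine argument (used in the next example) changes that. A quick check of x from the line above:

length(x)      #10; one list element per iteration
names(x[[1]])  #"d" "u" "v"; each element is one svd() result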

#The following first cuts three columns out of each HapMap file using awk, then reads each result in and rbinds them using foreach()
chrs <- foreach(i=1:22, .combine='rbind') %dopar% {
  system(paste("awk '{print $1,$3,$4}' ~/Downloads/genotypes_chr",i,"_CEU_r27_nr.b36_fwd.txt > pos.chr",i,sep=''))
  read.table(paste("pos.chr",i,sep=''),header=TRUE,comment.char="")
}
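Assuming all 22 files are sitting in ~/Downloads, chrs comes back as a single data frame with the three awk'd columns stacked across chromosomes 1-22 (the column names depend on the headers in the HapMap files):

dim(chrs)   #rows = total number of SNPs across all 22 chromosomes; 3 columns
head(chrs)  #the first few rows, which come from chromosome 1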
 

Note that the foreach() syntax is similar but not identical to the for() syntax. It is important to use %dopar% if you want to use multiple processors; otherwise, use %do%. It should also be noted that if RAM is your limiting resource (i.e., if each iteration of a loop needs most of your RAM), using parallel processes will make you run out of RAM and will substantially SLOW DOWN your code! For more help on foreach(), see the manual and vignette, available here from CRAN.
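A quick way to check whether parallelization is actually paying off on your machine is to time the same loop under %do% and %dopar% (a minimal sketch; the speedup you see depends on how many cores you registered and on the per-iteration cost):

system.time(foreach(i=1:10) %do% svd(matrix(rnorm(500*500),ncol=500)))     #sequential version
system.time(foreach(i=1:10) %dopar% svd(matrix(rnorm(500*500),ncol=500)))  #parallel version; elapsed time should drop if the workers are being used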
