Parallel Computing in R
R is an open-source programming language and software environment for statistical computing and graphics. The core R installation provides the language interpreter and many statistical and modeling functions. Many statistical analyses, for example in bioinformatics, are computationally very intensive, yet many of them boil down to embarrassingly parallel computations. Multiple computers, or even the multiple processor cores found in today's standard desktop machines, can easily contribute to faster analyses. However, R itself was not designed with parallel or high performance computing (HPC) in mind.
State of the art
In general, parallel computing refers to hardware and software for computation in which many calculations are carried out simultaneously; its main goal is increased computing capacity. Parallel computing can be performed on a single machine (node) with multiple processor cores, on a cluster of multiple nodes connected by a LAN, or on a grid of multiple servers connected over a WAN. Several methods, available as add-on packages for R, support parallel computing with only small modifications to an R script. Currently no rigorous study of the performance of these methods has been conducted. In this article, we use the multicore and snowfall packages because of their ease of use.
Multicore-based parallelization using the R multicore package
The multicore package provides functions for parallel execution of R code on machines with multiple cores or CPUs. Unlike other parallel processing methods, all jobs share the full state of R when spawned, so no data or code needs to be initialized. The actual spawning is very fast as well, because no new R instance needs to be started. To install this package, use the following command:
install.packages("multicore");
The key function provided by the multicore package for running a script on multiple cores is mclapply, a parallelized version of lapply. In a nutshell,
mclapply(X, FUN, mc.cores = #cpucores, ...);
where X is a vector (atomic or list) or an expression vector; other objects (including classed objects) are coerced by as.list. FUN is the function to be applied to each element of X, and mc.cores is the maximum number of CPU cores used to run FUN. The return value is a list of the values returned by the instances of FUN, one per element of X. The following is a simple script using multicore:
library(multicore);
myfunc <- function(x) x^2;                       # toy function applied to each element
result <- mclapply(1:10, myfunc, mc.cores = 4);  # run myfunc on up to 4 cores
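Note that since R 2.14 the functionality of multicore has been folded into the parallel package that ships with R, where mclapply has the same interface. A quick sanity check, comparing the parallel result against plain lapply (a sketch; mc.cores = 2 is an arbitrary choice, and on Windows mclapply silently falls back to sequential execution):

```r
library(parallel)  # bundled with R >= 2.14; provides mclapply

# The parallel result should be identical to the sequential one
seq_res <- lapply(1:10, function(x) x^2)
par_res <- mclapply(1:10, function(x) x^2, mc.cores = 2)
identical(seq_res, par_res)  # TRUE
```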
Cluster-based parallelization using the R snowfall package
A cluster is a group of computers (nodes) connected to each other by a LAN. Cluster-based parallelization makes these nodes work together to accomplish a given job. snowfall is an R package for constructing such a cluster. It is installed as follows:
install.packages("snowfall");
The snowfall package provides the basic routines to construct a cluster, distribute data to the cluster nodes, and run an R script on the node CPU cores in parallel. The basic commands to set up a cluster, distribute data, and collect results are as follows:
-
sfInit(socketHosts=NULL, cpus=NULL, type=NULL, parallel=TRUE);
This function sets up the machines which will be used in the parallel calculation. By listing a machine multiple times, multiple CPUs on that machine will be used. The parameter "type" selects the messaging system, which is set to "SOCK" by default to use the TCP/IP protocol. The parameter "parallel" selects parallel operation (TRUE) or sequential operation (FALSE). The parameter "socketHosts" specifies the list of computing nodes to be used; these nodes must be SSH-accessible from the node running sfInit, without a password prompt, for example via an SSH host-key database.
-
sfExport(list_of_names_of_data_and_functions_to_be_distributed);
This function distributes copies of data and/or functions which will be needed on the remote nodes in the cluster.
-
sfLibrary(library_name);
This is called once for each library that will be needed in the distributed computation. The name of the library is given without quotes.
-
sfClusterSetupRNG();
This sets up the random number generator to ensure that each distributed machine gets a unique stream of random numbers.
-
sfLapply(compute.TaskList, clusterComputeNode);
This function invokes the routine clusterComputeNode in parallel, applying it to each element of compute.TaskList.
-
sfStop();
This should be called with no arguments when the distributed computations are finished, to close the connections with the other machines.
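Putting the commands above together, a minimal end-to-end snowfall session might look like the following sketch. The host names, the worker function, and the task list are hypothetical placeholders; snowfall and the required libraries must be installed on all nodes:

```r
library(snowfall);

# Hypothetical worker function: square one task value (placeholder for real work)
clusterComputeNode <- function(task) task^2;

compute.TaskList <- as.list(1:100);            # hypothetical task list

# Two hypothetical hosts, each listed twice to use 2 CPUs per host
sfInit(parallel = TRUE, type = "SOCK",
       socketHosts = rep(c("node1", "node2"), each = 2));

sfExport("clusterComputeNode");                # ship the function to all nodes
sfClusterSetupRNG();                           # independent random streams per node
results <- sfLapply(compute.TaskList, clusterComputeNode);
sfStop();                                      # close the connections to the nodes
```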
An example of using snowfall on HPC
High Performance Computing (HPC) systems provide a set of computing nodes, each with multiple CPU cores, connected to each other by a LAN. Each compute node has a LAN IP address, listed in the /etc/hosts file. In this example, I applied snowfall to an HPC system with 17 computing nodes whose IPs lie in the range 10.54.50.0 to 10.54.50.16, where 10.54.50.0 is assigned to node #0, 10.54.50.1 to node #1, and so on.
# IP list of all compute nodes (node #0 .. node #16)
COMPUTE_NODES <- paste("10.54.50.", 0:16, sep = "");
# Use 7 CPU cores on each node
sfInit(parallel = TRUE, type = "SOCK", socketHosts = rep(COMPUTE_NODES, each = 7));
That’s it. :) In one of my studies, the semantic similarities among the genes of the HGU133a microarray platform were computed. Selecting GO-BP annotations with a degree-of-belief cutoff of 0.7 resulted in 7999 annotated genes using 5194 unique GO-BP terms. This job took 22 hours of running time on an 8-core machine, but only 46 minutes on an HPC system with 10 nodes of 7 CPU cores each. Below is the status of the HPC nodes at a peak time.
Node State CPU 0 CPU 1 CPU 2 CPU 3 CPU 4 CPU 5 CPU 6 CPU 7 CPU 8 CPU 9 CPU 10 CPU 11
-1 up 0.0% 0.0% 0.0% 1.0% 2.0% 18.2% 0.0% 0.0% 0.0% 0.0% 1.0% 1.0%
0 up 4.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0%
1 up 2.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0%
2 up 2.0% 100.0% 100.0% 100.0% 0.0% 100.0% 100.0% 100.0% 100.0% 0.0% 0.0% 100.0%
3 up 100.0% 0.0% 0.0% 100.0% 100.0% 100.0% 100.0% 100.0% 0.0% 0.0% 0.0% 0.0%
4 up 100.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0%
5 up 97.0% 97.0% 5.9% 96.0% 93.0% 39.6% 0.0% 0.0% 96.0% 88.1% 68.0% 93.9%
6 up 97.0% 0.0% 98.0% 92.9% 99.0% 28.0% 4.0% 98.0% 84.2% 6.9% 97.0% 79.2%
7 up 96.0% 96.0% 0.0% 21.0% 97.0% 97.0% 44.6% 98.0% 43.0% 0.0% 86.1% 97.0%
8 up 99.0% 47.0% 98.0% 0.0% 81.0% 97.0% 89.2% 57.4% 97.0% 97.0% 0.0% 17.0%
9 up 98.0% 0.0% 97.0% 98.0% 0.0% 97.0% 97.0% 0.0% 97.0% 0.0% 97.0% 97.0%
10 up 97.0% 56.6% 38.4% 0.0% 97.0% 98.0% 97.0% 97.0% 98.0% 97.0% 0.0% 0.0%
11 up 98.0% 97.0% 4.0% 97.0% 0.0% 97.0% 97.0% 0.0% 0.0% 92.0% 98.0% 97.0%
12 up 98.0% 26.0% 17.0% 96.0% 98.0% 89.2% 96.0% 92.0% 8.0% 58.0% 96.0% 0.0%
13 up 97.0% 92.0% 96.0% 29.0% 97.0% 0.0% 82.2% 81.8% 97.0% 3.0% 98.0% 0.0%
14 up 97.0% 58.4% 0.0% 76.0% 89.0% 84.2% 66.0% 30.0% 66.3% 93.9% 71.7% 32.0%
15 up 97.0% 52.0% 56.0% 0.0% 96.0% 85.0% 86.0% 76.0% 0.0% 97.0% 97.0% 30.0%
16 up 100.0% 100.0% 0.0% 0.0% 100.0% 0.0% 0.0% 0.0% 100.0% 100.0% 0.0% 0.0%
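For reference, the quoted timings (22 hours versus 46 minutes) correspond to roughly a 29-fold speedup. A quick back-of-the-envelope check:

```r
serial_min   <- 22 * 60                 # 22 hours on the 8-core machine, in minutes
parallel_min <- 46                      # 46 minutes on 10 nodes x 7 cores
speedup <- serial_min / parallel_min
round(speedup, 1)                       # 28.7
```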