
Rdsm (version 1.1.0)

Rdsm: Threads Programming for R

Description

Rdsm provides a threads programming environment for R, not available within R itself. Moreover, it is usable both on a multicore machine and across a network of multiple machines. The package gives the ``look and feel'' of the shared-memory world view that ordinary system threads provide, again even across multiple machines on a network.

The ``dsm'' in ``Rdsm'' stands for distributed shared memory, a term from the parallel processing community in which nodes in a cluster share (real or conceptual) memory. It is based on a similar package the author wrote for Perl some years ago (Matloff (2002)).

Rdsm can be used for:

  • parallel computation, as with the program KNN.R included with this package
  • the development of ``dashboard'' controllers and parallel I/O, as with the program WebProbe.R
  • the development of collaborative tools, as with the program Auction.R
Rdsm can easily be used with variables produced by Jay Emerson and Mike Kane's bigmemory package, thus enhancing the latter package by adding a threads capability. In the bigmemory case, if the code runs on a multicore machine, then the shared memory is real, and access may be considerably faster than to Rdsm variables. Rdsm provides a function newbm() for creating bigmemory variables.

Arguments

Quick Introduction to <pkg>Rdsm</pkg>

The Rdsm code in MatMul.R in the examples included in this package serves as a quick introduction, using a matrix-multiply example common in parallel processing packages. There are especially detailed comments in this example, but here is an overview:

The code finds the product of matrices m1 and m2, placing the product in prd. The core lines of the code are

myid <- myinfo$myid  # this thread's ID
# determine number of columns of m1
k <- if (class(m1) == "big.matrix") dim(m1)[2] else m1$size[2]
nth <- myinfo$nclnt  # number of threads
chunksize <- k / nth
# determine which columns of m1 this thread will process
firstcol <- 1 + (myid-1) * chunksize
lastcol <- firstcol + chunksize - 1
# process this thread's share of the columns
prd[,firstcol:lastcol] <- m1[,] %*% m2[,firstcol:lastcol]

The work is parallelized by assigning each thread a certain set of columns of prd. Each thread computes its columns and places them in the proper section of prd. This is a classical shared-memory pattern, illustrating the point that Rdsm brings threads programming to R. The matrix prd here is a shared variable, created beforehand via a call to cnewdsm() in the case of an Rdsm variable, or via a call to newbm() if a bigmemory variable is desired.
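A full thread function wrapping these core lines might look like the following sketch. The function name matmul and the trailing barrier call are illustrative additions; MatMul.R in the package's examples/ directory is the authoritative version:

matmul <- function(m1, m2, prd) {
   myid <- myinfo$myid     # this thread's ID
   k <- if (class(m1) == "big.matrix") dim(m1)[2] else m1$size[2]
   nth <- myinfo$nclnt     # number of threads
   chunksize <- k / nth
   firstcol <- 1 + (myid-1) * chunksize
   lastcol <- firstcol + chunksize - 1
   # write this thread's share of the product into the shared matrix
   prd[,firstcol:lastcol] <- m1[,] %*% m2[,firstcol:lastcol]
   barr()                  # wait until all threads have written their columns
}

Each of the n clients would call matmul() on the same shared variables; the barrier ensures no thread reads prd before all columns are in place.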

Other examples, including directions for running them, are given in the examples/ and testscripts/ directories in this package.

Advantages of the Shared-Memory Paradigm

Whether the platform is a multicore machine or a set of networked computers, a major advantage of Rdsm is that it gives the programmer a shared-memory world view, considered by many in the parallel processing community to be one of the clearest forms of parallel programming (Chandra (2001), Hess et al (2003) etc.).

Suppose for instance we wish to copy x to y. In a message-passing setting such as Rmpi, x and y may reside in processes 2 and 5, say. The programmer would write code (described here in pseudocode)

send x to process 5

to run on process 2, and write code

receive data item from process 2
set y to received item

to run on process 5. By contrast, in a shared-memory environment, the programmer would merely write

y <- x

which is vastly simpler. (Brackets would actually be required, as explained below.) This also means that it is easy to convert sequential R code to parallel Rdsm code.

Packages such as snow, arguably in the message-passing realm, do feature more convenient messaging operations, but shared memory still tends to yield the simplest code. (It should be noted, though, that in some applications message passing can yield somewhat better performance.)
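To make the contrast concrete, here is a hedged sketch of the same copy in the two styles. The Rmpi functions mpi.send.Robj() and mpi.recv.Robj() are real, but all surrounding setup is omitted, and the variable names are of course arbitrary:

# message-passing style (Rmpi):
# on process 2:
#    mpi.send.Robj(x, dest=5, tag=0)
# on process 5:
#    y <- mpi.recv.Robj(source=2, tag=0)

# shared-memory style (Rdsm): any one thread simply writes
y[] <- x[]

In the shared-memory version there is no pairing of sends and receives to keep consistent, which is the main source of the simplicity claimed above.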

Communication

Rdsm runs via network sockets, and Rdsm shared variables are accessed via this mechanism. If one's code also contains bigmemory shared variables, these are handled in that package's environment, either physical shared memory or file backing in a shared file system.

Rdsm data communication is binary in the case of vectors and matrices, but serialize() and unserialize() are used for lists.

Launching <pkg>Rdsm</pkg>

Start R and load Rdsm.

Manual operation:

To run Rdsm manually, run R in n+1 different terminal (shell) windows, where n is the desired number of clients, i.e. the degree of parallelism. Each client runs one thread. You will use one of the n+1 instances of R for the server.

Then:

  • Run srvr() in your server window, with argument n, which is 2 by default.
  • In each client window, run init().
  • In each client window, run your Rdsm application function.
You may have several application functions to run, or may want to run the same one multiple times. This is fine as long as srvr() is still running; you do not need to rerun init() at the clients. Application-program Rdsm variables etc. will be retained from one run to the next.
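For instance, to run with n = 3 clients, one could type the following; the window labels are just annotations, and myapp() stands for whatever application function you have sourced:

# server window:
library(Rdsm)
srvr(3)            # serve 3 clients

# in each of the 3 client windows:
library(Rdsm)
init()             # connect to the server
source("myapp.R")  # load the application code
myapp()            # run the application

Since srvr() keeps running, a second call to myapp() in the client windows reuses the same connections and shared variables.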

Automatic launching:

If you are running on a Unix-family system (Linux, Mac OS, or Cygwin on Windows), Rdsm launch and management can be made much more convenient via Rdsm's autolaunch capability. One opens just one window, and autolaunch automatically creates windows for the server and clients, then in each window starts R and loads Rdsm (and optionally bigmemory).

Then each time the user wishes to issue a command to all the clients, say a command to run an Rdsm application, he/she merely types the command in the original window, and it will be sent to the client windows, thus saving a lot of typing.

Here's a quick summary example of autolaunch. Say we wish to run two threads, with our application consisting of a function x() contained in the source code file y.R. We would open a single terminal window, run R in it, and then run the following code:

alinit(2)                    # create clients
cmdtoclnts('source("y.R")')  # have clients source the app code
go()                         # set up server/client connections
cmdtoclnts('x(3,100)')       # first run of app
cmdtoclnts('x(12,5000)')     # second run of app
...

Here's what it does:

  • The call to alinit() opens two other terminal windows, starts R in them, and loads the Rdsm library.
  • The call to cmdtoclnts() then has the instances of R at the client windows load our application source file.
  • The call to go() then starts srvr() in the server window and init() in each client window.
  • We then run our application a couple of times, and of course could run other Rdsm applications after sourcing their code.

Accessing <pkg>Rdsm</pkg> Variables

The variables in a typical Rdsm application program consist of a few shared variables, produced by either Rdsm or bigmemory, and many ``ordinary'' variables. Regular R syntax is used to access the shared variables, just as with the ordinary ones.

For example, suppose your program includes m, a 4x5 shared matrix variable. If you wished to fill the second column with 1, 2, 3 and 4, you would write m[,2] <- 1:4 just as you would in ordinary R.

Note carefully that you must always use brackets with shared variables. For instance, to copy the shared vector x to an ordinary R variable y, write y <- x[]

not y <- x

Built-in Variables

Rdsm's built-in variables are stored in a single global (but not shared) variable myinfo, a list consisting of these components:
  • myid: the ID number of this client, starting with 1
  • nclnt: the total number of clients
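The typical use of myinfo is to partition the work among the clients. For example, a helper like the following hypothetical getmyidxs() (not part of Rdsm) splits the index range 1:n as evenly as possible:

# hypothetical helper: which indices of 1:n does this client own?
getmyidxs <- function(n) {
   chunk <- ceiling(n / myinfo$nclnt)
   first <- 1 + (myinfo$myid - 1) * chunk
   last <- min(n, first + chunk - 1)
   if (first > last) return(integer(0))  # more clients than work
   first:last
}

The matrix-multiply example above does essentially this by hand for the columns of prd.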

Built-in Synchronization Functions

Rdsm includes some built-in synchronization functions similar to those of threaded or other shared-memory programming systems:
  • barr(): barrier operation, synchs all threads to the same code line
  • lock(): lock operation, gives thread exclusive access to shared variables
  • unlock(): unlock operation, relinquishes exclusive access
  • wait(): wait operation; blocks the calling client until a signal arrives
  • signal(): signal operation; releases all waiting clients
  • signal1(): same as signal(), but releases only the first waiting client
  • fa(): fetch-and-add operation
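As a sketch of how these fit together, the following hypothetical fragment has each thread add a locally computed partial sum mysum into a shared 1-element Rdsm vector tot, guarding the read-modify-write with the lock. The argument conventions for lock() and unlock() (e.g. whether a lock name must be supplied) should be checked against the package help:

# tot is a shared 1-element Rdsm vector, initialized to 0 beforehand
lock()                     # enter critical section
tot[1] <- tot[1] + mysum   # read-modify-write must be atomic
unlock()                   # leave critical section
barr()                     # wait until every thread has added its share
grandtotal <- tot[1]       # now safe for all threads to read

Without the lock, two threads could read the same old value of tot[1] and one update would be lost; without the barrier, a thread could read the total before all contributions are in.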

Built-in Initialization/Shutdown Functions

  • init(): initializes a client's connection to the server
  • srvr(): initializes the server
  • dsmexit(): can be called when a client has finished its work (note: this will stop the server when all clients make this call, and thus this function should not be used in most applications)

Shared-Variable Creation Functions

  • cnewdsm(): creates an Rdsm variable
  • newbm(): creates a bigmemory variable
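A typical setup phase creates the shared variables before the computation starts. The argument lists below are illustrative only (a variable name, a type, and an initial value for cnewdsm(), and analogous information for newbm()); consult the help pages for these functions for the exact signatures:

# shared 4x5 Rdsm matrix of zeros (argument list illustrative)
cnewdsm("m", "dsmm", matrix(0, nrow=4, ncol=5))
# hypothetical bigmemory counterpart (argument list illustrative)
newbm("m", 4, 5)

After creation, all clients access m with the bracketed syntax described above, e.g. m[,2] <- 1:4.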

Internal Structure

Though this is transparent to the Rdsm programmer, Rdsm variables (but not bigmemory ones) internally have the following architecture.

The Rdsm application variables reside on the server. Each read from or write to an Rdsm variable involves a transaction with the server. Rdsm variables reference vectors, matrices and lists, but have the special Rdsm classes dsmv, dsmm and dsml, respectively. Indexing operations for these classes communicate with the server to read or write the desired objects.

See the bigmemory package for details of the structure used for those variables. These are of the matrix type only, class big.matrix. Of course, a vector can be represented as a one-row matrix. Again, all this is transparent to the programmer. However, as with any system, a good understanding of the internals can result in your writing much better code.

References

Chandra, Rohit (2001), Parallel Programming in OpenMP, Kaufmann, pp.10ff (especially Table 1.1).

Hess, Matthias et al (2003), Experiences Using OpenMP Based on Compiler Directive Software DSM on a PC Cluster, in OpenMP Shared Memory Parallel Programming: International Workshop on OpenMP Applications and Tools, Michael Voss (ed.), Springer, p.216.

Matloff, Norman (2002), PerlDSM: A Distributed Shared Memory System for Perl. Proceedings of PDPTA 2002, 2002, 63-68.