future (version 1.3.0)

makeClusterPSOCK: Create a Parallel Socket Cluster

Description

Create a Parallel Socket Cluster

Usage

makeClusterPSOCK(workers, makeNode = makeNodePSOCK, port = c("auto",
  "random"), ..., verbose = getOption("future.debug", FALSE))

makeNodePSOCK(worker = "localhost", master = NULL, port,
  connectTimeout = 2 * 60, timeout = 30 * 24 * 60 * 60, rscript = NULL,
  homogeneous = NULL, rscript_args = NULL, methods = TRUE, useXDR = TRUE,
  outfile = "/dev/null", renice = NA_integer_, rshcmd = "ssh", user = NULL,
  revtunnel = TRUE, rshopts = NULL, rank = 1L, manual = FALSE,
  dryrun = FALSE, verbose = FALSE)

Arguments

workers
The host names of workers (as a character vector) or the number of localhost workers (as a positive integer).
makeNode
A function that creates a "SOCKnode" or "SOCK0node" object, which represents a connection to a worker.
port
The port number of the master used for communicating with all the workers (via socket connections). If an integer vector of ports, then a random one among those is chosen. If "random", then a random port in 11000:11999 is chosen. If "auto" (default), then the default is taken from environment variable R_PARALLEL_PORT, otherwise "random" is used.
...
Optional arguments passed to makeNode(workers[i], ..., rank = i) where i = seq_along(workers).
verbose
If TRUE, informative messages are output.
worker
The host name or IP number of the machine where the worker should run.
master
The host name or IP number of the master / calling machine, as known to the workers. If NULL (default), then the default is Sys.info()[["nodename"]] unless worker is the localhost ("localhost" or "127.0.0.1") or revtunnel = TRUE, in which case it is "localhost".
connectTimeout
The maximum time (in seconds) allowed for each socket connection between the master and a worker to be established (defaults to 2 minutes). See note below on current lack of support on Linux and macOS systems.
timeout
The maximum time (in seconds) allowed to pass without the master and a worker communicating with each other (defaults to 30 days).
rscript, homogeneous
The system command for launching Rscript on the worker. If NULL (default), the default is "Rscript" unless homogeneous is TRUE, in which case it is file.path(R.home("bin"), "Rscript"). Argument homogeneous defaults to FALSE, unless master is the localhost ("localhost" or "127.0.0.1").
rscript_args
Additional arguments to Rscript (as a character vector).
methods
If TRUE, then the methods package is also loaded.
useXDR
If TRUE, the communication between master and workers, which is binary, will use big-endian (XDR).
outfile
Where to direct the stdout and stderr connection output from the workers.
renice
A numerical 'niceness' (priority) to set for the worker processes.
rshcmd
The command to be run on the master to launch a process on another host. Only applicable if machine is not localhost.
user
(optional) The user name to be used when communicating with another host.
revtunnel
If TRUE, a reverse SSH tunnel is set up for each worker such that the worker R process sets up a socket connection to its local port (port - rank + 1), which then reaches the master on port port. If FALSE, then the worker will try to connect directly to port port on master.
rshopts
Additional arguments to rshcmd (as a character vector).
rank
A unique one-based index for each worker (automatically set).
manual
If TRUE, the workers will need to be run manually.
dryrun
If TRUE, nothing is set up, but a message suggesting how to launch the worker from the terminal is output. This is useful for troubleshooting.
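
The rule described for the port argument can be sketched in base R as follows. The helper resolve_port() is hypothetical and is not part of the future package; it only mirrors the behaviour stated above ("auto" consults R_PARALLEL_PORT and falls back to "random", "random" draws from 11000:11999, and an integer vector has one element chosen at random). Treating a non-numeric value of R_PARALLEL_PORT as "random" is an assumption made for this sketch.

```r
## Hypothetical helper, for illustration only (not part of 'future')
resolve_port <- function(port = c("auto", "random")) {
  if (is.character(port)) {
    port <- match.arg(port)
    if (port == "auto") {
      ## Assumption: an unset or non-numeric R_PARALLEL_PORT means "random"
      env <- Sys.getenv("R_PARALLEL_PORT", unset = "")
      port <- suppressWarnings(as.integer(env))
      if (is.na(port)) port <- "random"
    }
    if (identical(port, "random")) port <- 11000:11999
  }
  port <- as.integer(port)
  stopifnot(length(port) >= 1L, !anyNA(port))
  ## sample(x, 1) on a length-one 'x' would sample from 1:x,
  ## so pick an index instead
  port[sample.int(length(port), size = 1L)]
}

resolve_port("random")           # some port in 11000:11999
resolve_port(c(8001L, 8002L))    # either 8001 or 8002
```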

Value

An object of class c("SOCKcluster", "cluster") consisting of a list of "SOCKnode" or "SOCK0node" workers. makeNodePSOCK() returns a "SOCKnode" or "SOCK0node" object representing an established connection to a worker.
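
As a minimal sketch of working with the returned object, the code below launches two workers on the local machine, runs a small task on them through the parallel package, and shuts the cluster down. It assumes that the future package is installed and that background R processes can be started on the current machine; the squaring task is only a placeholder.

```r
library(future)

## Two workers on localhost (no SSH involved)
cl <- makeClusterPSOCK(2L)

## The cluster can be used with functions of the 'parallel' package
stopifnot(inherits(cl, "SOCKcluster"), inherits(cl, "cluster"))
res <- parallel::parLapply(cl, 1:4, function(x) x^2)

## Always shut down the workers when done
parallel::stopCluster(cl)
```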

Connection time out

Argument connectTimeout does not work properly on Unix and macOS due to a limitation in R itself. For more details, please see the R-devel thread 'BUG?: On Linux setTimeLimit() fails to propagate timeout error when it occurs (works on Windows)' from 2016-10-26 (https://stat.ethz.ch/pipermail/r-devel/2016-October/073309.html). When used, the timeout will eventually trigger an error, but not until the socket connection timeout (argument timeout) itself is reached.

Details

The makeClusterPSOCK() function is similar to makePSOCKcluster() of the parallel package, but provides more flexibility in controlling the setup of the system calls that launch the background R workers and how to connect to external machines.

The default is to use reverse SSH tunneling for workers running on other machines. This avoids having to configure port forwarding in firewalls, which often requires a static IP address and which most users do not have the privileges to do themselves. It also has the advantage that the internal and / or public IP address / host name of the master does not need to be known.

If there is no communication between the master and a worker within the timeout limit, then the corresponding socket connection will be closed automatically. This will eventually result in an error in code trying to access the connection.
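
To make the reverse-tunnel port mapping concrete, the sketch below builds one ssh "-R" option per worker, taking the local port formula (port - rank + 1) exactly as stated in the description of revtunnel. The helper revtunnel_opts() is hypothetical and not part of the future package, and the use of "localhost" as the forwarding target is an assumption; the options actually generated can be inspected with dryrun = TRUE.

```r
## Hypothetical helper, for illustration only (not part of 'future')
revtunnel_opts <- function(port, ranks, master = "localhost") {
  ## Local port on the worker side, as given in the 'revtunnel' description
  local_port <- port - ranks + 1L
  sprintf("-R %d:%s:%d", local_port, master, port)
}

revtunnel_opts(11942L, ranks = 1:3)
## [1] "-R 11942:localhost:11942" "-R 11941:localhost:11942"
## [3] "-R 11940:localhost:11942"
```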

Examples

## Set up three R workers on two remote machines
workers <- c("n1.remote.org", "n2.remote.org", "n1.remote.org")
cl <- makeClusterPSOCK(workers, dryrun = TRUE)

## Same setup when the two machines are on the local network and
## have identical software setups
cl <- makeClusterPSOCK(
  workers,
  revtunnel = FALSE, homogeneous = TRUE,
  dryrun = TRUE
)

## Setup of remote worker with more detailed control on
## authentication and reverse SSH tunnelling
cl <- makeClusterPSOCK(
  "remote.server.org", user = "johnny",
  ## Manual configuration of reverse SSH tunnelling
  revtunnel = FALSE,
  rshopts = c("-v", "-R 11000:gateway:11942"),
  master = "gateway", port = 11942,
  ## Run Rscript nicely and skip any startup scripts
  rscript = c("nice", "/path/to/Rscript"),
  rscript_args = c("--vanilla"),
  dryrun = TRUE
)

## Setup of Docker worker running rocker/r-base
## (requires installation of future package)
cl <- makeClusterPSOCK(
  "localhost",
  ## Launch Rscript inside Docker container
  rscript = c(
    "docker", "run", "--net=host", "rocker/r-base",
    "Rscript"
  ),
  ## Install future package
  rscript_args = c(
    "-e", shQuote("install.packages('future')")
  ),
  dryrun = TRUE
)
                       

## Setup of udocker.py worker running rocker/r-base
## (requires installation of future package and extra quoting)
cl <- makeClusterPSOCK(
  "localhost",
  ## Launch Rscript inside Docker container (using udocker)
  rscript = c(
    "udocker.py", "run", "rocker/r-base",
    "Rscript"
  ), 
  ## Install future package and manually launch parallel workers
  ## (need double shQuote():s because udocker.py drops one level)
  rscript_args = c(
    "-e", shQuote(shQuote("install.packages('future')")),
    "-e", shQuote(shQuote("parallel:::.slaveRSOCK()"))
  ),
  dryrun = TRUE
)


## Launching worker on Amazon AWS EC2 running one of the
## Amazon Machine Images (AMI) provided by RStudio
## (http://www.louisaslett.com/RStudio_AMI/)
public_ip <- "1.2.3.4"
ssh_private_key_file <- "~/.ssh/my-private-aws-key.pem"
cl <- makeClusterPSOCK(
  ## Public IP number of EC2 instance
  public_ip,
  ## User name (always 'ubuntu')
  user = "ubuntu",
  ## Use private SSH key registered with AWS
  rshopts = c(
    "-o", "StrictHostKeyChecking=no",
    "-o", "IdentitiesOnly=yes",
    "-i", ssh_private_key_file
  ),
  ## Set up .libPaths() for the 'ubuntu' user and
  ## install future package
  rscript_args = c(
    "-e", shQuote("local({
      p <- Sys.getenv('R_LIBS_USER')
      dir.create(p, recursive = TRUE, showWarnings = FALSE)
      .libPaths(p)
    })"),
    "-e", shQuote("install.packages('future')")
  ),
  dryrun = TRUE
)


## Launching worker on Google Compute Engine (GCE) running a
## container based VM (with a #cloud-config specification)
public_ip <- "1.2.3.4"
user <- "johnny"
ssh_private_key_file <- "~/.ssh/google_compute_engine"
cl <- makeClusterPSOCK(
  ## Public IP number of GCE instance
  public_ip,
  ## User name (== SSH key label (sic!))
  user = user,
  ## Use private SSH key registered with GCE
  rshopts = c(
    "-o", "StrictHostKeyChecking=no",
    "-o", "IdentitiesOnly=yes",
    "-i", ssh_private_key_file
  ),
  ## Launch Rscript inside Docker container
  rscript = c(
    "docker", "run", "--net=host", "rocker/r-base",
    "Rscript"
  ),
  ## Install future package
  rscript_args = c(
    "-e", shQuote("install.packages('future')")
  ),
  dryrun = TRUE
)
