R is a programming language and software environment for statistical computing and graphics.

For use on the kingspeak, ember, and ash clusters, and on Linux desktops, we have installed R from the source code. We also installed a number of external R libraries. If there is another library that you want to use, please try to install the library in your own environment. If you run into trouble, feel free to ask us to perform the installation.

The currently supported version is 3.3.2 (CentOS 7). It was built with the Intel compilers and Intel's threaded Math Kernel Library (MKL). The presence of MKL may result in a considerable speed-up compared to R builds that rely solely on non-optimized mathematical libraries. As a rule of thumb, programs dominated by floating-point numerical calculations benefit the most from multi-threading.

To make parallel independent calculations easier, we have turned off multi-threading by default by setting the environment variable OMP_NUM_THREADS to 1, i.e.

setenv OMP_NUM_THREADS 1   # Tcsh/Csh Shell
export OMP_NUM_THREADS=1   # Bash Shell

If you want to run R in a multithreaded fashion (e.g. on a compute node), we strongly recommend using no more threads than there are physical cores on the node.
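
As a minimal sketch of the recommendation above, the following shell fragment caps a requested thread count at the number of CPUs the system reports. The helper name cap_threads and the requested value of 16 are illustrative assumptions. Note that nproc counts logical CPUs, which can exceed the number of physical cores when hyper-threading is enabled, so the cap may still be too generous on such nodes.

```shell
# cap_threads REQUESTED LIMIT  (hypothetical helper)
# Print REQUESTED, reduced to LIMIT when it exceeds LIMIT.
cap_threads() {
    requested=$1
    limit=$2
    if [ "$requested" -gt "$limit" ]; then
        echo "$limit"
    else
        echo "$requested"
    fi
}

# Number of CPUs visible to this shell (logical CPUs, not physical cores)
ncpu=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)

# Ask for 16 threads (illustrative value), but never more than ncpu
OMP_NUM_THREADS=$(cap_threads 16 "$ncpu")
export OMP_NUM_THREADS
echo "Using $OMP_NUM_THREADS thread(s)"
```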

How to load R in your environment?

You can obtain R in your environment by loading the R module:

module load R

The command R --version returns the version of R you have loaded:

R --version
R version 3.3.2 (2016-10-31) -- "Sincere Pumpkin Patch"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

The command which R returns the location where the R executable resides:

which R
/uufs/chpc.utah.edu/sys/installdir/R/3.3.2i/bin/R 

Note: if you use an ~/.Rprofile file, it should be independent of the version of R, i.e. library paths should NEVER be set within this file.

Running an R batch script on the command line

There are several ways to launch an R script on the command line:

  1. Rscript yourfile.R
  2. R CMD BATCH yourfile.R
  3. R --no-save < yourfile.R
  4. ./yourfile2.R

The first approach (using the Rscript command) writes its output to stdout. The second approach (using the R CMD BATCH command) redirects its output into a file (in this case yourfile.Rout). The third approach redirects the contents of the file yourfile.R to the standard input of the R executable. Note that with the third approach you must specify one of the following flags: --save, --no-save or --vanilla.

The R code can be launched as a Linux script (fourth approach) as well. In order to be run as a Linux script:

  • One needs to insert an extra line (#!/usr/bin/env Rscript) at the top of the file yourfile.R.
  • As a result we have a new file, yourfile2.R.
  • The permissions of the R script (i.e. yourfile2.R) need to be changed to make it executable (e.g. chmod u+x yourfile2.R).
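
The three steps above can be sketched as a small shell helper. The function name make_r_executable is a hypothetical name chosen for illustration; it is not part of R or of the CHPC installation.

```shell
# make_r_executable SRC DST  (hypothetical helper)
# Write DST as SRC with the Rscript shebang on the first line,
# then give the owner execute permission on DST.
make_r_executable() {
    src=$1
    dst=$2
    { printf '%s\n' '#!/usr/bin/env Rscript'; cat "$src"; } > "$dst"
    chmod u+x "$dst"
}
```

With this helper, make_r_executable yourfile.R yourfile2.R produces a script that can be started as ./yourfile2.R once the R module is loaded.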

The files seaice.R and seaice2.R can serve as examples for yourfile.R and yourfile2.R, respectively. Note that both scripts require the data file sea-ice.txt.

Sometimes we need to feed arguments to the R script. This is especially useful when running parallel independent calculations: different arguments can be used to differentiate between the calculations, e.g. by feeding in different initial parameters. To read the arguments, one can use the commandArgs() function, e.g., if we have a script called myScript.R:

## myScript.R
args <- commandArgs(trailingOnly =TRUE)
rnorm(n=as.numeric(args[1]), mean=as.numeric(args[2]))

then we can call it with arguments, e.g.:

> Rscript myScript.R 5 100
[1]  98.46435 100.04626  99.44937  98.52910 100.78853

Running an R batch script on the cluster (using SLURM)

In the previous section we described how to launch an R script on the command line. In order to run an R batch job on the compute nodes we just need to create a SLURM script/wrapper "around" the R command line.

Below you will find the content of the corresponding Slurm batch script runR.sl:

#!/bin/bash
#SBATCH --time=00:10:00 # Walltime
#SBATCH --nodes=1 # Use 1 Node (Unless code is multi-node parallelized)
#SBATCH --ntasks=1 # We only run one R instance = 1 task
#SBATCH --cpus-per-task=12 # number of threads we want to run on
#SBATCH --account=owner-guest
#SBATCH --partition=ember-guest
#SBATCH -o slurm-%j.out-%N
#SBATCH --mail-type=ALL
#SBATCH --mail-user=$USER@utah.edu # Your email address (write it out; variables are not expanded in #SBATCH lines)
#SBATCH --job-name=seaIce

export FILENAME=seaice.R
export SCR_DIR=/scratch/general/lustre/$USER/$SLURM_JOBID
export WORK_DIR=$HOME/TestBench/R/SeaIce

# Load R (version 3.3.2)
module load R

# Take advantage of all the threads (linear algebra)
# $SLURM_CPUS_ON_NODE returns actual number of cores on node
# rather than $SLURM_JOB_CPUS_PER_NODE, which returns what --cpus-per-task asks for
export OMP_NUM_THREADS=$SLURM_CPUS_ON_NODE

# Create scratch & copy everything over to scratch
mkdir -p $SCR_DIR
cd $SCR_DIR
cp -p $WORK_DIR/* .

# Run the R script in batch, redirecting the job output to a file
Rscript $FILENAME > $SLURM_JOBID.out

# Copy results over + clean up
cd $WORK_DIR
cp -pR $SCR_DIR/* .
rm -rf $SCR_DIR

echo "End of program at `date`"

We run the script under Slurm as sbatch runR.sl.

Running many independent R batch calculations in one job

We mentioned above that R was built using the multi-threaded MKL library. Thread-based parallelization is useful for vectorized R programs, but not all workflows vectorize. Therefore, if one has many independent calculations to run, it is more efficient to run single-threaded R and use SLURM's capability of running independent calculations within a job in parallel. The SLURM script below (myRArr.sl) lets you run an independent R job on each core of a node. Note that you also need one or several scripts which perform the actual calculation. The example consists of the SLURM script (myRArr.sl), the R wrapper script (rwrapper.sh) and the actual R script (mcex.r).

#!/bin/bash
#SBATCH --time=00:20:00
#SBATCH --nodes=1
#SBATCH --mail-type=FAIL,BEGIN,END
#SBATCH --mail-user=$USER@utah.edu # Write out your address; variables are not expanded in #SBATCH lines
#SBATCH -o out.%j
#SBATCH -e err.%j
#SBATCH --account=owner-guest
#SBATCH --partition=lonepeak-guest
#SBATCH --job-name=test-RArr

# Job Parameters
export EXE=./rwrapper.sh
export WORK_DIR=~/TestBench/Slurm/RMulti
export SCRATCH_DIR=/scratch/local/$SLURM_JOBID
export SCRIPT_DIR=$WORK_DIR/RFiles
export OUT_DIR=$WORK_DIR/`echo $UUFSCELL | cut -b1-4`/$SLURM_JOBID

# Load R
module load R

# Run an array of serial jobs
export OMP_NUM_THREADS=1

echo " Calculation started at:`date`"
echo " #$SLURM_TASKS_PER_NODE cores detected on `hostname`"

# Create the my.config.$UUFSCELL.$SLURM_JOBID file on the fly
# (inside $WORK_DIR, so that srun below finds it)
cd $WORK_DIR
for (( i=0; i < $SLURM_TASKS_PER_NODE ; i++ )); \
do echo $i $EXE $i $SCRATCH_DIR/$i $SCRIPT_DIR $OUT_DIR/$i ; \
done > my.config.$UUFSCELL.$SLURM_JOBID

# Running a task on each core
srun --multi-prog my.config.$UUFSCELL.$SLURM_JOBID

# Clean-up the root scratch dir
rm -rf $SCRATCH_DIR

echo " Calculation ended at:`date`"
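
To make the structure of the generated configuration file easier to see, here is a minimal sketch of the same loop as a stand-alone function. The function name write_multiprog_config and the directories in the example call are made up for illustration; in the job script above the values come from SLURM and from the exported job parameters.

```shell
# write_multiprog_config NTASKS EXE SCRATCH_DIR SCRIPT_DIR OUT_DIR
# (hypothetical helper) Print one line per task in the layout that
# srun --multi-prog expects: task rank, program, program arguments.
write_multiprog_config() {
    ntasks=$1
    exe=$2
    scratch=$3
    scripts=$4
    out=$5
    i=0
    while [ "$i" -lt "$ntasks" ]; do
        echo "$i $exe $i $scratch/$i $scripts $out/$i"
        i=$((i + 1))
    done
}

# Example with 3 tasks and made-up directories
write_multiprog_config 3 ./rwrapper.sh /scratch/local/12345 RFiles out/12345
# prints:
# 0 ./rwrapper.sh 0 /scratch/local/12345/0 RFiles out/12345/0
# 1 ./rwrapper.sh 1 /scratch/local/12345/1 RFiles out/12345/1
# 2 ./rwrapper.sh 2 /scratch/local/12345/2 RFiles out/12345/2
```

Each task therefore starts the wrapper with its own index, its own scratch subdirectory and its own output subdirectory, which keeps the independent calculations from overwriting each other's files.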

Parallel R

The R environment itself is not parallelized, which is important to keep in mind when running on CHPC cluster nodes, which have at least 8 CPU cores. Typical unvectorized R programs will use only a single core.

The R installation detailed above can run certain workloads (mostly linear algebra) using multiple threads through the Intel Math Kernel Library (MKL). We recommend benchmarking your first run with OMP_NUM_THREADS=1 and then with a higher thread count (e.g. OMP_NUM_THREADS=8 on an 8-core node) to see whether multi-threading achieves any speed-up.
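
One way to run such a benchmark is sketched below. The helper bench_threads is a hypothetical name, and the timing uses whole seconds from date, so it is only meaningful for runs that take at least several seconds. The demo call at the end uses a placeholder command instead of R so that the fragment runs anywhere.

```shell
# bench_threads "COUNTS" COMMAND [ARGS...]  (hypothetical helper)
# Run COMMAND once for every thread count in COUNTS, with
# OMP_NUM_THREADS set for that run only, and print the elapsed time.
bench_threads() {
    counts=$1
    shift
    for t in $counts; do
        start=$(date +%s)
        OMP_NUM_THREADS=$t "$@"
        end=$(date +%s)
        echo "threads=$t elapsed=$((end - start))s"
    done
}

# Demo with a placeholder command that only reports the setting it sees
bench_threads "1 2" sh -c 'echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"'
```

For a real measurement on an 8-core node one would call, for example, bench_threads "1 2 4 8" Rscript yourfile.R and compare the elapsed times.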

If multi-threading does not provide much speed-up, or if one needs to run on more than one node, some kind of parallelization of the R code is necessary. There are numerous R packages that implement various levels of parallelism, which are summarized at this CRAN page.

In our relatively limited experience, if the parallel tasks are independent of each other, the foreach package is relatively simple to use. Even better, run the parallel tasks completely independently through SLURM's srun --multi-prog option. If you need any assistance, contact us.

RStudio

RStudio is an Integrated Development Environment (IDE) for R. It includes a console, a syntax-highlighting editor that supports direct code execution, and tools for plotting, debugging, history and workspace management. For more information see the RStudio webpage.

RStudio is installed on Linux systems and can be invoked (after loading R) as follows:

module load RStudio
rstudio

Installing additional R packages 

R Library locations

There is a short training video that parallels this section of the documentation.

R packages are installed in libraries. Before addressing the installation of R packages as such, we will first detail the hierarchical structure of the R libraries that are installed on the CHPC Linux systems.

The command .libPaths() returns the names of the libraries (directories) which are accessible to the R executable which has been loaded in your environment.

In the recently installed R distributions, we can have three library levels:

  • Core/Default Library
  • Site Library
  • User Libraries

The Core & Default R Packages were installed in a subdirectory of the main installation directory when a new version of R was compiled. The location of this library is stored in the .Library variable. Among the packages in this library we have "base", "datasets", "utils", etc.

which R
/uufs/chpc.utah.edu/sys/installdir/R/3.3.2i/bin/R
R
> .Library
[1] "/uufs/chpc.utah.edu/sys/installdir/R/3.3.2i/lib64/R/library"

The Site Library contains all the external packages that have been installed by the CHPC staff for a specific version of R, i.e. each version of R has its own Site Library (note that each R version may have been compiled with a different version of a compiler, different compiler flags or for a different version of the OS). The location of the Site Library can be found within R using either .Library.site or Sys.getenv("R_LIBS_SITE"), or by invoking echo $R_LIBS_SITE in the shell.

echo $R_LIBS_SITE
/uufs/chpc.utah.edu/sys/installdir/RLibs/3.3.2i/
R
> .Library.site
[1] "/uufs/chpc.utah.edu/sys/installdir/RLibs/3.3.2i/"
> Sys.getenv("R_LIBS_SITE")
[1] "/uufs/chpc.utah.edu/sys/installdir/RLibs/3.3.2i/"

The User Library is a subdirectory in the user's space (e.g. $HOME) where users can install their own packages. Note that each version of R for which you want to install your own packages should have its own User Library directory. The User Library subdirectories are not present by default and must be created by users who want to install R packages themselves.

Setting up your own Library

In the following lines we will describe in detail how to set up your own User Library for R 3.3.2i. The same technique can be applied to other versions of R. 

module load R/3.3.2
which R
/uufs/chpc.utah.edu/sys/installdir/R/3.3.2i/bin/R

The R module that was loaded (i.e. R/3.3.2) is the following file:

/uufs/chpc.utah.edu/sys/modulefiles/CHPC/Core/R/3.3.2.lua

The set-up of the User Library goes through the following steps:

  1. If you have not yet created your own "modules" directory, create one (e.g. ~/MyModules).
    mkdir -p ~/MyModules
  2. Create an R subdirectory in ~/MyModules. The R subdirectory will contain all your own future R modules.
    mkdir ~/MyModules/R
  3. Copy the R/3.3.2 module from the CHPC modules directory into your own R module space.
    cp /uufs/chpc.utah.edu/sys/modulefiles/CHPC-c7/Core/R/3.3.2.lua ~/MyModules/R
  4. We now have 2 modules with exactly the same relative name. We must make the relative name of the new module unique. (You can modify the name of the new module by inserting e.g. your unid, ...)
    mv ~/MyModules/R/3.3.2.lua ~/MyModules/R/3.3.2.$USER.lua
  5. We can only load the new module if the newly created module directory is visible to LMOD, i.e. when it is included in the MODULEPATH environment variable. You can add it to the LMOD MODULEPATH variable as follows:
    module use ~/MyModules
    You can insert the module use ~/MyModules statement in your ~/.custom.sh or ~/.custom.csh file so that the new module is always visible.
  6. We will now create a new directory where we will install our new R packages that can be used with the CHPC 3.3.2i executable of R (i.e. /uufs/chpc.utah.edu/sys/installdir/R/3.3.2i/bin/R):
    mkdir -p ~/software/pkg/RLibs/3.3.2i
  7. Edit the newly created module e.g. ~/MyModules/R/3.3.2.$USER.lua, to add the following line:
    setenv("R_LIBS_USER","/uufs/chpc.utah.edu/common/home/u0xxyyzz/software/pkg/RLibs/3.3.2i/")
    The string u0xxyyzz must be replaced by your unid.
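
Steps 6 and 7 can also be done from the shell instead of with an editor. The following is only a sketch: the function name add_r_libs_user is hypothetical, and it assumes that appending the setenv line at the end of the copied module file is acceptable for your module.

```shell
# add_r_libs_user MODULEFILE LIBDIR  (hypothetical helper)
# Create LIBDIR (step 6) and append the matching setenv() line
# to MODULEFILE (step 7).
add_r_libs_user() {
    modulefile=$1
    libdir=$2
    mkdir -p "$libdir"
    printf 'setenv("R_LIBS_USER","%s/")\n' "$libdir" >> "$modulefile"
}
```

A call such as add_r_libs_user ~/MyModules/R/3.3.2.$USER.lua "$HOME/software/pkg/RLibs/3.3.2i" writes the full path of your home directory into the module file, so the unid does not have to be typed by hand.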

We have now set up our own User Library installation directory, where we can install packages that are compiled with the same compiler that was used to build CHPC's version of R 3.3.2i (i.e. the Intel 2017 compiler).

 Installing packages in your environment

 After setting up the module for our own version of R and the User Library we can install packages in our own environment. To start, we first need to load our own version of R:

module load R/3.3.2.$USER
echo $R_LIBS_USER

The content of $R_LIBS_USER should refer to your newly created directory (see step 7 in the previous section). Within your new R environment you can also use .libPaths() to see the paths to all your libraries.
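
Before installing anything it can help to confirm that the variable really points at a usable directory. The check below is a minimal sketch; the function name check_user_lib and the messages are illustrative choices, not part of the CHPC setup.

```shell
# check_user_lib  (hypothetical helper)
# Succeed only if R_LIBS_USER is set and names a writable directory.
check_user_lib() {
    if [ -z "${R_LIBS_USER:-}" ]; then
        echo "R_LIBS_USER is not set" >&2
        return 1
    fi
    if [ ! -d "$R_LIBS_USER" ] || [ ! -w "$R_LIBS_USER" ]; then
        echo "R_LIBS_USER=$R_LIBS_USER is not a writable directory" >&2
        return 1
    fi
    echo "User library: $R_LIBS_USER"
}

check_user_lib || echo "Set up the User Library before installing packages"
```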

We can now install libraries in 2 different ways:

  • High-level version using install.packages() (invoked within R)
  • Low-level version using R CMD INSTALL (invoked from a Linux Shell)

High-Level Installation

The high-level installation is the easiest way to install packages. It is the preferred way when the package to be installed does not depend on C, C++ or Fortran libraries that are installed in non-traditional directories. The R function to be invoked is install.packages():

R
> library(maRketSim)
Error in library(maRketSim) : there is no package called ‘maRketSim’
> # R does not expand $USER inside a string, so read the path from R_LIBS_USER
> install.packages(c("maRketSim"), lib=Sys.getenv("R_LIBS_USER"),
+                  repos=c("http://cran.us.r-project.org"), verbose=TRUE)
> library(maRketSim)

The library($PACKAGE) function tries to load a package $PACKAGE. If R can't find it, an error is printed. The install.packages() function takes several arguments. The lib argument must be set to the directory where you want to install the package (this should be $R_LIBS_USER). The installation output shows that install.packages() calls the low-level installation command (R CMD INSTALL), which is discussed in the next section:

'/uufs/chpc.utah.edu/sys/installdir/R/3.3.2i/lib64/R/bin/R CMD INSTALL -l \
'/uufs/chpc.utah.edu/common/home/$USER/software/pkg/RLibs/3.3.2i' \
/tmp/RtmpH90XAY/downloaded_packages/maRketSim_0.9.2.tar.gz'

Note that the lib argument can also be used with packages from other repositories, e.g. Bioconductor. As we have some Bioconductor packages installed in our default location, also use the lib.loc argument to tell Bioconductor where the "original" Bioconductor location is:

source("https://bioconductor.org/biocLite.R")
biocLite(pkgs, lib.loc = "/uufs/chpc.utah.edu/common/home/$USER/software/pkg/RLibs/3.3.2i", lib="/uufs/chpc.utah.edu/common/home/$USER/software/pkg/RLibs/3.3.2i")

 

Low-Level Installation

The low-level installation is to be used when you need to install R packages that depend on external libraries installed in non-default locations. As an example, consider the package RNetCDF (already installed within 3.3.2i).

The installation of this package depends on the external libraries netcdf-c and udunits2. The command to be invoked to install the RNetCDF package in a User Library is:

export PATH=/uufs/chpc.utah.edu/sys/installdir/netcdf-c/4.3.2i/bin:$PATH
export PATH=/uufs/chpc.utah.edu/sys/installdir/udunits/2.2.20/bin:$PATH
R CMD INSTALL --library=/uufs/chpc.utah.edu/common/home/$USER/software/pkg/RLibs/3.3.2i \
--configure-args="CPPFLAGS='-I/uufs/chpc.utah.edu/sys/installdir/udunits/2.2.20/include' \
LDFLAGS='-Wl,-rpath=/uufs/chpc.utah.edu/sys/installdir/netcdf-c/4.3.2i/lib \
-L/uufs/chpc.utah.edu/sys/installdir/netcdf-c/4.3.2i/lib -lnetcdf \
-Wl,-rpath=/uufs/chpc.utah.edu/sys/installdir/udunits/2.2.20/lib \
-L/uufs/chpc.utah.edu/sys/installdir/udunits/2.2.20/lib -ludunits2' \
--with-nc-config=/uufs/chpc.utah.edu/sys/installdir/netcdf-c/4.3.2i/bin/nc-config" RNetCDF_1.8-2.tar.gz

R CMD INSTALL calls ./configure under the hood. The best way to tackle such an installation is to download the tar.gz file first, find the appropriate installation flags (different for each package!) and then feed those flags to the R CMD INSTALL command.

 If you have trouble or questions, please send an email to issues@chpc.utah.edu.

Last Updated: 9/15/17