Center for High Performance Computing - The University of Utah

R is a programming language and software environment for statistical computing and graphics.

For use on the notchpeak, kingspeak, lonepeak, and granite clusters, and on Linux desktops, we have installed R from the source code. We also installed a significant number of external R libraries. If there is another library that you want to use, please try to install the library in your own environment. If you run into trouble, feel free to ask us to perform the installation.

The latest supported version is 4.4.2 (Rocky 8). It was built with the GNU compilers and a threaded version of OpenBLAS. Linking against OpenBLAS may result in a considerable speed-up compared to R builds that rely solely on non-optimized mathematical libraries. As a rule of thumb, programs that perform many floating-point calculations benefit most from multi-threading.

By default we have turned off multi-threading by setting the environment variable OMP_NUM_THREADS to 1, i.e.

setenv OMP_NUM_THREADS 1   # Tcsh/Csh Shell
export OMP_NUM_THREADS=1   # Bash Shell

to facilitate easier use of parallel independent calculations. If you want to run R in a multithreaded fashion (e.g. on a compute node), we strongly recommend not using more threads than there are physical cores on the node.
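The recommendation above can be sketched as follows. This is a minimal sketch, assuming a Linux node where lscpu is available (nproc is used as a fallback, and it counts logical rather than physical CPUs). The helper names physical_cores and cap_threads are illustrative and not part of the CHPC setup.

```shell
# Count physical cores as unique (core, socket) pairs reported by lscpu.
# Falls back to nproc, which counts logical CPUs (an over-estimate on
# nodes with hyper-threading enabled).
physical_cores() {
    n=0
    if command -v lscpu >/dev/null 2>&1; then
        n=$(lscpu -p=Core,Socket 2>/dev/null | grep -v '^#' | sort -u | wc -l)
        n=$((n + 0))
    fi
    if [ "$n" -lt 1 ]; then
        n=$(nproc)
    fi
    echo "$n"
}

# Print the smaller of the requested thread count and the core count.
cap_threads() {
    requested=$1
    cores=$2
    if [ "$requested" -gt "$cores" ]; then
        echo "$cores"
    else
        echo "$requested"
    fi
}

# Example: ask for 8 threads, but never exceed the physical core count.
OMP_NUM_THREADS=$(cap_threads 8 "$(physical_cores)")
export OMP_NUM_THREADS
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```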

How to load R in your environment

You can obtain R in your environment by loading the R module i.e.:

module load R

The command R --version returns the version of R you have loaded:

R --version

R version 4.4.2 (2024-10-31) -- "Pile of Leaves"
Copyright (C) 2024 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu

The command which R returns the location where the R executable resides:

which R

/uufs/chpc.utah.edu/sys/installdir/r8/R/4.4.2/bin/R

We also maintain a number of older versions of R. You can list these versions with the command "module spider R", and load a specific version with a command such as "module load R/4.1.1".

Note: if you use an ~/.Rprofile file, it should be independent of the version of R, i.e. library paths should NEVER be set within this file.
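A quick way to check an existing profile against this rule is a text search. The sketch below is a crude check under our own assumptions: the function name find_libpath_settings is hypothetical, and the pattern also flags harmless lines such as a bare .libPaths() query, so the reported lines still need to be reviewed by hand.

```shell
# Report lines in an R profile that appear to set library paths.
# Returns 0 (and prints the matching lines) if any are found, 1 otherwise.
find_libpath_settings() {
    profile=$1
    [ -f "$profile" ] || return 1
    grep -n -E '\.libPaths[[:space:]]*\(|R_LIBS' "$profile"
}

if find_libpath_settings "$HOME/.Rprofile"; then
    echo "Review the lines above: library paths should not be set in ~/.Rprofile"
else
    echo "No library path settings found in ~/.Rprofile"
fi
```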

Running an R batch script on the command line

There are several ways to launch an R script on the command line:

Rscript yourfile.R
R CMD BATCH yourfile.R
R --no-save < yourfile.R
./yourfile2.R

The first approach (i.e. using the Rscript command) writes its output to stdout. The second approach (i.e. using the R CMD BATCH command) writes its output to a file (in this case yourfile.Rout). The third approach redirects the content of the file yourfile.R to the standard input of the R executable. Note that in this third approach you must specify one of the following flags: --save, --no-save or --vanilla.

The R code can be launched as a Linux script (fourth approach) as well. In order to be run as a Linux script:

Insert an extra line (#!/usr/bin/env Rscript) at the top of the file yourfile.R.
Save the result as a new file, yourfile2.R.
Change the permissions of the R script (i.e. yourfile2.R) to make it executable (e.g. chmod u+x yourfile2.R).
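The steps listed above can be sketched as a small shell function. The function name make_executable_r_script is our own choice for illustration; the file names follow the text.

```shell
# Create an executable copy of an R script with an Rscript shebang line.
make_executable_r_script() {
    src=$1
    dest=$2
    {
        echo '#!/usr/bin/env Rscript'
        cat "$src"
    } > "$dest"
    chmod u+x "$dest"
}

# Usage (requires the R module to be loaded before running the result):
#   make_executable_r_script yourfile.R yourfile2.R
#   ./yourfile2.R
```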

The files seaice.R and seaice2.R can be used as examples for yourfile.R and yourfile2.R, respectively. Note that the scripts seaice.R and seaice2.R require the data file sea-ice.txt.

Sometimes we need to feed arguments to the R script. This is especially useful when running parallel independent calculations: different arguments can be used to differentiate between the calculations, e.g. by feeding in different initial parameters. To read the arguments, one can use the commandArgs() function, e.g., if we have a script called myScript.R:

## myScript.R
args <- commandArgs(trailingOnly = TRUE)
rnorm(n=as.numeric(args[1]), mean=as.numeric(args[2]))

then we can call it with arguments as e.g.:

> Rscript myScript.R 5 100
[1]  98.46435 100.04626  99.44937  98.52910 100.78853
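To illustrate how arguments can differentiate independent calculations, the sketch below prints one command line per value of the mean while keeping n fixed. The function name sweep_commands and the chosen values are illustrative only; the block performs a dry run and does not call R.

```shell
# Print one Rscript command line per mean value; n is held fixed.
sweep_commands() {
    n=$1
    shift
    for mean in "$@"; do
        echo "Rscript myScript.R $n $mean"
    done
}

# Dry run: show the commands that would be executed.
sweep_commands 5 0 50 100

# To actually run them (requires the R module to be loaded):
#   sweep_commands 5 0 50 100 | sh
```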

Running an R batch script on the cluster (using SLURM)

In the previous section we described how to launch an R script on the command line. In order to run an R batch job on the compute nodes we just need to create a SLURM script/wrapper "around" the R command line.

Below you will find the content of the corresponding Slurm batch script runR.sl:

#!/bin/bash
#SBATCH --time=00:10:00 # Walltime
#SBATCH --nodes=1 # Use 1 Node (Unless code is multi-node parallelized)
#SBATCH --ntasks=1 # We only run one R instance = 1 task
#SBATCH --cpus-per-task=32 # number of threads we want to run on
#SBATCH --account=owner-guest
#SBATCH --partition=notchpeak-guest
#SBATCH -o slurm-%j.out-%N
#SBATCH --mail-type=ALL
#SBATCH --mail-user=$USER@utah.edu # Your email address (write it out; shell variables are not expanded in #SBATCH lines)
#SBATCH --job-name=seaIce

export FILENAME=seaice.R
export SCR_DIR=/scratch/general/nfs1/$USER/$SLURM_JOBID
export WORK_DIR=$HOME/TestBench/R/SeaIce

# Load R (version 4.4.2)
module load R

# Take advantage of all the threads (linear algebra)
# $SLURM_CPUS_ON_NODE returns actual number of cores on node
# rather than $SLURM_JOB_CPUS_PER_NODE, which returns what --cpus-per-task asks for
export OMP_NUM_THREADS=$SLURM_CPUS_ON_NODE

# Create scratch & copy everything over to scratch
mkdir -p $SCR_DIR
cd $SCR_DIR
cp -p $WORK_DIR/* .

# Run the R script in batch, redirecting the job output to a file
Rscript $FILENAME > $SLURM_JOBID.out

# Copy results over + clean up
cd $WORK_DIR
cp -pR $SCR_DIR/* .
rm -rf $SCR_DIR

echo "End of program at `date`"

We run the script under Slurm as sbatch runR.sl.
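The scratch handling in runR.sl can be tried outside SLURM. The sketch below reproduces the copy-in, run, copy-out sequence as a function and adds a trap so that the scratch directory is removed even if the command fails. The trap is an addition of ours and not part of the script above, the function name run_in_scratch is hypothetical, and both directories are assumed to be absolute paths.

```shell
# Copy work_dir to scratch_dir, run a command there, copy results back.
# The body runs in a subshell so the cd and the trap do not affect the caller.
run_in_scratch() (
    work_dir=$1
    scratch_dir=$2
    shift 2
    mkdir -p "$scratch_dir" || exit 1
    # Remove the scratch directory when the subshell exits, even on failure.
    trap 'rm -rf "$scratch_dir"' EXIT
    cp -pR "$work_dir"/. "$scratch_dir"/ || exit 1
    cd "$scratch_dir" || exit 1
    "$@"
    status=$?
    # Copy everything back, including partial results of a failed run.
    cp -pR "$scratch_dir"/. "$work_dir"/
    exit "$status"
)

# Usage inside a batch script (variables as defined in runR.sl):
#   run_in_scratch "$WORK_DIR" "$SCR_DIR" Rscript "$FILENAME"
```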

Running many independent R batch calculations as one job

We mentioned above that R was built using the multi-threaded OpenBLAS library. Thread-based parallelization is useful for vectorized R programs, but not all workflows vectorize. Therefore, if one has many independent calculations to run, it is more efficient to run single-threaded R and use SLURM's capability of running independent calculations within a job in parallel. The SLURM script below (myRArr.sl) lets you run an independent R job on each core of a node. Note that you also need one or several scripts which perform the actual calculation. The example consists of three files: the SLURM script (myRArr.sl), the R wrapper script (rwrapper.sh) and the actual R script (mcex.r).

#!/bin/bash
#SBATCH --time=00:20:00
#SBATCH --nodes=1
#SBATCH --mail-type=FAIL,BEGIN,END
#SBATCH --mail-user=$USER@utah.edu   # Write out your email address; shell variables are not expanded in #SBATCH lines
#SBATCH -o out.%j
#SBATCH -e err.%j
#SBATCH --account=owner-guest
#SBATCH --partition=lonepeak-guest
#SBATCH --job-name=test-RArr

# Job Parameters
export EXE=./rwrapper.sh
export WORK_DIR=~/TestBench/Slurm/RMulti
export SCRATCH_DIR=/scratch/local/$USER/$SLURM_JOBID
export SCRIPT_DIR=$WORK_DIR/RFiles
export OUT_DIR=$WORK_DIR/`echo $UUFSCELL | cut -b1-4`/$SLURM_JOBID

# Load R
module load R

# Run an array of serial jobs
export OMP_NUM_THREADS=1

echo " Calculation started at:`date`"
echo " #$SLURM_TASKS_PER_NODE cores detected on `hostname`"

# Create the my.config.$UUFSCELL.$SLURM_JOBID file on the fly
for (( i=0; i < $SLURM_TASKS_PER_NODE ; i++ )); \
do echo $i $EXE $i $SCRATCH_DIR/$i $SCRIPT_DIR $OUT_DIR/$i ; \
done > my.config.$UUFSCELL.$SLURM_JOBID

# Running a task on each core
cd $WORK_DIR
srun --multi-prog my.config.$UUFSCELL.$SLURM_JOBID

# Clean-up the root scratch dir
rm -rf $SCRATCH_DIR

echo " Calculation ended at:`date`"
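To see what the loop in myRArr.sl writes into the configuration file read by srun --multi-prog, the same line format can be generated by hand for a small task count. This is a sketch only: the function name write_multiprog_config and the directory names in the example are made up, and the meaning of the four arguments passed to rwrapper.sh is inferred from the variable names in the script above.

```shell
# Write one line per task: <task rank> <program> <arguments...>
write_multiprog_config() {
    ntasks=$1
    exe=$2
    scratch_dir=$3
    script_dir=$4
    out_dir=$5
    i=0
    while [ "$i" -lt "$ntasks" ]; do
        echo "$i $exe $i $scratch_dir/$i $script_dir $out_dir/$i"
        i=$((i + 1))
    done
}

# Example with 3 tasks and made-up directories:
write_multiprog_config 3 ./rwrapper.sh /scratch/local/demo/123 RFiles out
# prints:
# 0 ./rwrapper.sh 0 /scratch/local/demo/123/0 RFiles out/0
# 1 ./rwrapper.sh 1 /scratch/local/demo/123/1 RFiles out/1
# 2 ./rwrapper.sh 2 /scratch/local/demo/123/2 RFiles out/2
```

Each task therefore receives its own index, its own scratch subdirectory and its own output subdirectory, which keeps the independent calculations from overwriting each other's files.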

Parallel R

The R environment itself is not parallelized, which is important to keep in mind when running on CHPC cluster nodes which have at least 8 CPU cores. Typical unvectorized R programs will run using only a single core.

The R installation detailed above can run certain workloads (mostly linear algebra) using multiple threads through the OpenBLAS library. We recommend benchmarking your first run with OMP_NUM_THREADS=1 and then with a higher thread count (e.g. OMP_NUM_THREADS=8 on an 8-core node) to see whether multi-threading achieves any speed-up.
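Such a benchmark can be scripted. The sketch below times a command once per thread count; the function name benchmark_threads is hypothetical, and the timing uses whole seconds from date, so it is only meaningful for runs that take at least several seconds.

```shell
# Time a command once for each thread count given.
# Usage: benchmark_threads '<command>' <threads>...
benchmark_threads() {
    cmd=$1
    shift
    for n in "$@"; do
        start=$(date +%s)
        # Set OMP_NUM_THREADS only for this one run of the command.
        OMP_NUM_THREADS=$n sh -c "$cmd" >/dev/null 2>&1
        end=$(date +%s)
        echo "threads=$n elapsed=$((end - start))s"
    done
}

# Usage (requires the R module to be loaded):
#   benchmark_threads 'Rscript seaice.R' 1 8
```

If the elapsed time barely changes between the two runs, the script is probably not dominated by threaded linear algebra and is better run with OMP_NUM_THREADS=1.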

If multi-threading does not provide much speed-up, or if one needs to run on more than one node, some kind of parallelization of the R code is necessary. There are numerous R packages that implement various levels of parallelism, which are summarized at this CRAN page.

If the computational tasks are independent of each other, one can relatively simply use the foreach package, or parallelized versions of the *apply functions, which use the parallel package's multiple R workers. It is most common to equate the number of R workers to the CPU cores (SLURM job tasks), and set OMP_NUM_THREADS=1 to turn off the multi-threading. For running on a single compute node, here are the SLURM script example and R code example.

To run on multiple cluster compute nodes, one also has to tell R what hosts to run on. This requires creating a list of hosts in the SLURM script, srun -n $SLURM_NTASKS hostname > hostlist.txt (like in this SLURM script), and inside of the R program, feeding this list to the makeCluster() function, as in the following example, which would work on any number of cluster nodes:

# load the parallel libraries
library(parallel)
library(foreach)
library(doParallel)
# import hostlist
hostlist <- paste(unlist(read.delim(file="hostlist.txt", header=F, sep =" ")))
# launch the parallel R workers
cl <- makeCluster(hostlist)
registerDoParallel(cl)
# Fix bug in R < 4.0, give the workers path to optional R packages
# clusterEvalQ(cl,.libPaths("/uufs/chpc.utah.edu/sys/installdir/RLibs/3.5.2i"))
# run the parallel calculation
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
trials <- 10000
system.time({
  r <- foreach(icount(trials), .combine=rbind) %dopar% {
    ind <- sample(100, 100, replace=TRUE)
    result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
    coefficients(result1)
  }
})
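Before starting R it can be useful to check that hostlist.txt contains what is expected, since makeCluster() starts one worker per line. The sketch below counts the entries per host; the function name workers_per_host is our own, and the file is assumed to hold one host name per line, as produced by the srun hostname command mentioned above.

```shell
# Count how many R workers will be started on each host.
workers_per_host() {
    sort "$1" | uniq -c | awk '{ print $2 ": " $1 }'
}

# Usage in the SLURM script, after hostlist.txt has been written:
#   workers_per_host hostlist.txt
```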

Finally, some R libraries have their own internal parallelization. The quickest way to find out whether the library/function that you are using can be run in parallel is to do a web search.

For example, to find if library momentuHMM has any parallel options, we can search for R momentuHMM parallel. The first hit is the library's manual. Searching for the parallel keyword in the manual, we can find a few functions that allow parallel processing.

RStudio

RStudio is an Integrated Development Environment (IDE) for R. It includes a console and a syntax-highlighting editor that supports direct code execution, as well as tools for plotting, debugging, history and workspace management. For more information see the RStudio webpage.

RStudio is installed on Linux systems and can be invoked (after loading R) as follows:

module load RStudio
rstudio

Installing additional R packages

R library locations

R packages are installed in libraries (locations where the code of the packages resides). Before addressing the installation of R packages as such, we will first detail the hierarchical structure of the R libraries that are installed on the CHPC Linux systems.

The command .libPaths() returns the names of the libraries (locations) which are accessible to the R executable which has been loaded in your environment.

In the recently installed R distributions, we can have three library levels:

Core/Default Library
Site Library
User Libraries

The Core & Default R Packages were installed in a sub directory of the main installation directory when a new version of R was compiled. The location of the library can be retrieved by the .Library command. Among the packages in this library we have "base", "datasets", "utils", etc.

which R
/uufs/chpc.utah.edu/sys/installdir/r8/R/4.4.2/bin/R

> .Library
[1] "/uufs/chpc.utah.edu/sys/installdir/r8/R/4.4.2/lib64/R/library"

The Site Library contains all the external packages that have been installed by the CHPC staff for a well-defined version of R, i.e. each version of R has its own Site Library (note that each R version may have been compiled with a different version of a compiler, different compiler flags or for a different version of the OS). The location of the Site Library can be found within R using either .Library.site or Sys.getenv("R_LIBS_SITE"), or by invoking echo $R_LIBS_SITE in the shell.

echo $R_LIBS_SITE
/uufs/chpc.utah.edu/sys/installdir/r8/RLibs/4.4.2

>R
> .Library.site
[1] "/uufs/chpc.utah.edu/sys/installdir/r8/RLibs/4.4.2"
> Sys.getenv("R_LIBS_SITE")
[1] "/uufs/chpc.utah.edu/sys/installdir/r8/RLibs/4.4.2"

The User Library is a subdirectory in the user's space (e.g. $HOME) where users can install their own packages. Note that each minor version series of R (e.g. 4.4.x) for which you want to install your own packages should have its own user library directory. Modern R versions automatically set and create a user library location in the user's home directory.

Check the existence of the R User Library

In the following lines we describe how to check whether the R User Library is active. Please note that a User Library is only compatible with R releases of the same minor version series (e.g. 4.4.x).

module load R
which R
/uufs/chpc.utah.edu/sys/installdir/r8/R/4.4.2/bin/R
R
> Sys.getenv("R_LIBS_USER")
[1] "~/R/x86_64-pc-linux-gnu-library/4.4"

This shows that R/4.4.2 has a User Library located at ~/R/x86_64-pc-linux-gnu-library/4.4.
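The directory name follows from the R version: the patch level is dropped, so all 4.4.x releases share one User Library. A minimal sketch of that mapping is shown below; the function name user_library_dir is hypothetical, and the platform string is taken from the example output above and may differ on other systems.

```shell
# Derive the User Library directory from a full R version string.
user_library_dir() {
    version=$1              # e.g. 4.4.2
    series=${version%.*}    # strip the patch level, e.g. 4.4
    echo "$HOME/R/x86_64-pc-linux-gnu-library/$series"
}

dir=$(user_library_dir 4.4.2)
if [ -d "$dir" ]; then
    echo "User library exists: $dir"
else
    echo "User library not created yet: $dir"
fi
```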

Installing packages in your environment

We can install packages in two different ways:

High-level version using install.packages() (invoked within R)
Low-level version using R CMD INSTALL (invoked from a Linux Shell)

High-Level Installation

The high-level installation is the easiest way to install packages. It is the preferred way when the package to be installed does not depend on C, C++, Fortran libraries which are installed in non-traditional directories, and particularly when the R code is available via CRAN, the Comprehensive R Archive Network. The R function to be invoked is install.packages():

R
>library(hermite)
Error in library(hermite) : there is no package called ‘hermite’
>install.packages(pkgs=c("hermite"),
                  repos=c("http://cran.us.r-project.org"),verbose=TRUE)
>library(hermite)

The library($PACKAGE) function tries to load the package $PACKAGE. If R can't find the package, an error message is printed. The install.packages() function takes several arguments. The first argument is the vector containing the names of the packages that you want to install. You can also specify the lib argument, i.e. the directory where you would like to install the package. If lib is not explicitly specified, the new package will be installed in the first entry returned by .libPaths() (for regular users this corresponds to the value of Sys.getenv("R_LIBS_USER")). The repos argument specifies the repository to be used; the verbose argument is FALSE by default and can be very handy when errors arise during the compilation or loading process. The installation output shows that install.packages() calls the low-level installation command (R CMD INSTALL), which is discussed in the next section:

DONE (hermite)
3): succeeded '/uufs/chpc.utah.edu/sys/installdir/r8/R/4.4.2/lib64/R/bin/R CMD INSTALL \
-l '/uufs/chpc.utah.edu/common/home/u0253283/R/x86_64-pc-linux-gnu-library/4.4' \
   '/tmp/Rtmpnx37ZX/downloaded_packages/hermite_1.1.2.tar.gz''

An alternative install function is used by the Bioconductor software repository. Bioconductor is the primary repository for R code for the life sciences, and uses the BiocManager::install() function:

BiocManager::install(pkgs)

Where "pkgs" is a character vector with one or more names of packages to be installed. This command, for example will install the Bioconductor DESeq2 package:

BiocManager::install("DESeq2")

BiocManager::install has a number of optional arguments. Run the command "?BiocManager::install" within R to see the complete documentation on the function.

Low-Level Installation

The low-level installation should only be used when you need to install R packages that depend on external libraries that are installed in non-default locations. As an example, consider the package RNetCDF (already installed within CHPC's R).

The installation of this package depends on the external libraries netcdf-c and udunits2. The command to be invoked to install the RNetCDF package in a User Library is (assuming Bash shell):

module load intel netcdf-c udunits
export PATH=$NETCDFC/bin:$PATH   # in tcsh shell: setenv PATH $NETCDFC/bin:$PATH
export PATH=$UDUNITS/bin:$PATH   # in tcsh shell: setenv PATH $UDUNITS/bin:$PATH
wget https://cran.r-project.org/src/contrib/RNetCDF_1.9-1.tar.gz
R CMD INSTALL --configure-args="CPPFLAGS='-I$UDUNITS/include'\
                LDFLAGS='-Wl,-rpath=$NETCDFC/lib \
               -L$NETCDFC/lib -lnetcdf \
               -Wl,-rpath=$UDUNITS/lib\
               -L$UDUNITS/lib -ludunits2 ' \
               --with-nc-config=$NETCDFC/bin/nc-config " RNetCDF_1.9-1.tar.gz

R CMD INSTALL calls ./configure under the hood. The best way to tackle such an installation is to download the tar.gz file first, find the appropriate installation flags (different for each package!) and then feed those flags to the R CMD INSTALL command.
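One way to find candidate flags is to look at the options that the package's configure script mentions. The sketch below extracts them straight from the downloaded tarball. It assumes that the tarball contains a configure script in a top-level directory named after the package, which is the usual layout of CRAN source packages; the function name list_configure_options is our own.

```shell
# List the --with-* and --enable-* options mentioned in the configure
# script of an R source package tarball, without unpacking it to disk.
list_configure_options() {
    tarball=$1
    pkg=$2      # top-level directory inside the tarball, e.g. RNetCDF
    tar -xzOf "$tarball" "$pkg/configure" 2>/dev/null |
        grep -o -E -e '--(with|enable)-[A-Za-z0-9_-]+' | sort -u
}

# Usage (tarball as downloaded above):
#   list_configure_options RNetCDF_1.9-1.tar.gz RNetCDF
```

The list only shows option names; the package's own installation notes remain the authoritative source for what each option expects.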

If you have trouble or questions, please send an email to helpdesk@chpc.utah.edu.

Potential Problems

Package installation in Open OnDemand RStudio Server App

The RStudio Server does not run X, the Linux graphical environment. Some R libraries require X to install, for example the library 'rpanel'. The symptom of this issue is an error message like:

Warning message:
In fun(libname, pkgname) : couldn't connect to display ":0"
Error in structure(.External(.C_dotTcl, ...), class = "tclObj") : 
  [tcl] couldn't connect to display ":0".

These packages have to be installed in the FastX terminal session, as follows:

- open FastX terminal to one of our clusters
- load the Open OnDemand R module, e.g.:
ml R/3.6.2-ood-geospatial
- start R and do the installation:
R > install.packages('rpanel')
(answer 'yes' to use personal library)
