R is a programming language and software environment for statistical computing and graphics.
For use on the kingspeak, ember, and ash clusters, and on Linux desktops, we have installed R from the source code. We also installed a number of external R libraries. If there is another library that you want to use, please try to install the library in your own environment. If you run into trouble, feel free to ask us to perform the installation.
The currently supported version is 3.3.2 (CentOS 7). It was built with the Intel compilers and the threaded Intel Math Kernel Library (MKL). MKL may provide a considerable speed-up compared to R builds that rely solely on non-optimized mathematical libraries. As a rule of thumb, programs that perform many floating-point calculations benefit most from multi-threading.
By default we have turned off multi-threading by setting the environment variable OMP_NUM_THREADS to 1, i.e.
setenv OMP_NUM_THREADS 1 # Tcsh/Csh Shell
export OMP_NUM_THREADS=1 # Bash Shell
to facilitate easier use of parallel independent calculations. If you want to run R in a multi-threaded fashion (e.g. on a compute node), we strongly recommend using no more threads than there are physical cores on the node.
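The recommendation above can be sketched in bash as follows. This is only an illustration: the variable REQUESTED_THREADS and the helper function physical_cores are hypothetical names, and counting unique core/socket pairs with lscpu is one possible way (an assumption, not the CHPC method) to determine the number of physical cores.

```shell
#!/bin/bash
# Count physical cores as unique (core, socket) pairs; fall back to the
# number of online logical CPUs when lscpu is missing or prints nothing.
physical_cores() {
    local n=0
    if command -v lscpu >/dev/null 2>&1; then
        n=$(lscpu -p=Core,Socket 2>/dev/null | grep -v '^#' | sort -u | wc -l)
    fi
    if [ "$n" -lt 1 ]; then
        n=$(getconf _NPROCESSORS_ONLN)
    fi
    echo "$n"
}

# REQUESTED_THREADS is a hypothetical variable holding the desired thread count.
REQUESTED_THREADS=${REQUESTED_THREADS:-8}
MAX_THREADS=$(physical_cores)

# Never use more threads than there are physical cores.
if [ "$REQUESTED_THREADS" -gt "$MAX_THREADS" ]; then
    OMP_NUM_THREADS=$MAX_THREADS
else
    OMP_NUM_THREADS=$REQUESTED_THREADS
fi
export OMP_NUM_THREADS
echo "Using $OMP_NUM_THREADS thread(s); $MAX_THREADS physical core(s) detected"
```

Any R process started from this shell afterwards inherits the capped OMP_NUM_THREADS value.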
You can obtain R in your environment by loading the R module:
module load R
R --version returns the version of R you have loaded:
R version 3.3.2 (2016-10-31) -- "Sincere Pumpkin Patch"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
which R returns the location where the R executable resides.
Note: if you use an ~/.Rprofile file, it should be independent of the version of R, i.e. library paths should NEVER be set within this file.
There are several ways to launch an R script on the command line:
Rscript yourfile.R
R CMD BATCH yourfile.R
R --no-save < yourfile.R
The first approach (i.e. using the Rscript command) writes its output to stdout. The second approach (i.e. using the R CMD BATCH command) redirects its output into a file (in this case yourfile.Rout). A third approach is to redirect the content of the file yourfile.R into the standard input of the R executable. Note that with this third approach you must specify one of the following flags: --save, --no-save or --vanilla.
The R code can be launched as a Linux script (fourth approach) as well. In order to be run as a Linux script:
- Insert an extra line (#!/usr/bin/env Rscript) at the top of the file yourfile.R.
- Save the result as a new file, yourfile2.R.
- Make the new R script (i.e. yourfile2.R) executable, e.g. with chmod u+x yourfile2.R.
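The steps of the fourth approach can be sketched in bash as shown below. The content of yourfile.R and the temporary directory are made up for the illustration; the script is only executed if Rscript is available in the current shell (i.e. after module load R).

```shell
#!/bin/bash
# Work in a temporary directory so that no existing files are touched.
DEMO_DIR=$(mktemp -d)
cd "$DEMO_DIR"

# A one-line example script (hypothetical content).
printf 'cat("Hello from R\\n")\n' > yourfile.R

# Prepend the interpreter line and store the result as yourfile2.R.
{ echo '#!/usr/bin/env Rscript'; cat yourfile.R; } > yourfile2.R

# Make the new script executable for its owner.
chmod u+x yourfile2.R

# Run it like any other Linux script, provided R is loaded.
if command -v Rscript >/dev/null 2>&1; then
    ./yourfile2.R
else
    echo "Rscript not found: load the R module before running ./yourfile2.R"
fi
```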
Sometimes we need to feed arguments to the R script. This is especially useful when running parallel independent calculations: different arguments can be used to differentiate between the calculations, e.g. by feeding in different initial parameters. To read the arguments, one can use the commandArgs() function. For example, if we have a script called myScript.R:
## myScript.R
args <- commandArgs(trailingOnly = TRUE)
rnorm(n = as.numeric(args[1]), mean = as.numeric(args[2]))
then we can call it with arguments, e.g.:
> Rscript myScript.R 5 100
98.46435 100.04626 99.44937 98.52910 100.78853
In the previous section we described how to launch an R script on the command line. In order to run an R batch job on the compute nodes we just need to create a Slurm script (a wrapper) around the R command line.
Below you will find the content of the corresponding Slurm batch script runR.sl:
#SBATCH --time=00:10:00 # Walltime
#SBATCH --nodes=1 # Use 1 Node (Unless code is multi-node parallelized)
#SBATCH --ntasks=1 # We only run one R instance = 1 task
#SBATCH --cpus-per-task=12 # number of threads we want to run on
#SBATCH -o slurm-%j.out-%N
#SBATCH --mail-user=$USER@utah.edu # Your email address
# Load R (version 3.3.2)
module load R
# Take advantage of all the threads (linear algebra)
# $SLURM_CPUS_ON_NODE returns actual number of cores on node
# rather than $SLURM_JOB_CPUS_PER_NODE, which returns what --cpus-per-task asks for
# Create scratch & copy everything over to scratch
# (WORK_DIR, SCR_DIR and FILENAME are assumed to be defined earlier in the script)
mkdir -p $SCR_DIR
cd $SCR_DIR
cp -p $WORK_DIR/* .
# Run the R script in batch, redirecting the job output to a file
Rscript $FILENAME > $SLURM_JOBID.out
# Copy results over + clean up
cd $WORK_DIR
cp -pR $SCR_DIR/* .
rm -rf $SCR_DIR
echo "End of program at `date`"
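The scratch handling used in the batch script (copy the input to scratch, run there, copy the results back, clean up) can be tried outside of Slurm with the following bash sketch. The function name stage_and_run is hypothetical, the directories are temporary ones created for the demonstration, and wc is only a stand-in for the Rscript call.

```shell
#!/bin/bash
# stage_and_run WORK_DIR SCR_DIR COMMAND [ARGS...]
# Copy the input files to scratch, run COMMAND there with its output
# redirected to job.out, copy everything back and remove the scratch dir.
stage_and_run() {
    local work_dir=$1 scr_dir=$2
    shift 2
    mkdir -p "$scr_dir"
    cp -p "$work_dir"/* "$scr_dir"/
    # Subshell: the caller's working directory stays unchanged.
    ( cd "$scr_dir" && "$@" > job.out )
    cp -pR "$scr_dir"/* "$work_dir"/
    rm -rf "$scr_dir"
}

# Demonstration with temporary directories and a stand-in command.
WORK_DIR=$(mktemp -d)
SCR_DIR=$WORK_DIR.scratch
printf 'cat("Hello from R\\n")\n' > "$WORK_DIR/yourfile.R"
stage_and_run "$WORK_DIR" "$SCR_DIR" wc -l yourfile.R
cat "$WORK_DIR/job.out"
```

In a real job the last call would read stage_and_run "$WORK_DIR" "$SCR_DIR" Rscript yourfile.R, with SCR_DIR pointing to a scratch file system.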
We submit the script to Slurm with:
sbatch runR.sl
We mentioned above that R was built using the multi-threaded MKL library. Thread-based parallelization is useful for vectorized R programs, but not all workflows vectorize. Therefore, if one has many independent calculations to run, it is more efficient to run single-threaded R and use Slurm's capability of running independent calculations within a job in parallel. The Slurm script below (myRArr.sl) lets you run an independent R job on each core of a node. Note that you also need one or several scripts which perform the actual calculation. In this example three files are used: the Slurm script (myRArr.sl), the R wrapper script (rwrapper.sh) and the actual R script (mcex.r).
#SBATCH -o out.%j
#SBATCH -e err.%j
# Job Parameters
export OUT_DIR=$WORK_DIR/`echo $UUFSCELL | cut -b1-4`/$SLURM_JOBID
# Load R
module load R
# Run an array of serial jobs
echo " Calculation started at:`date`"
echo " #$SLURM_TASKS_PER_NODE cores detected on `hostname`"
# Create the my.config.$SLURM_JOBID file on the fly
for (( i=0; i < $SLURM_TASKS_PER_NODE ; i++ )); \
do echo $i $EXE $i $SCRATCH_DIR/$i $SCRIPT_DIR $OUT_DIR/$i ; \
done > my.config.$UUFSCELL.$SLURM_JOBID
# Running a task on each core
srun --multi-prog my.config.$UUFSCELL.$SLURM_JOBID
# Clean-up the root scratch dir
rm -rf $SCRATCH_DIR
echo " Calculation ended at:`date`"
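To see what the configuration file for srun --multi-prog looks like, the loop from the script above can be run on its own. In the sketch below all variable values are placeholders chosen for the illustration (including the scratch path and the default of 4 tasks), not the settings of the original script. Each generated line has the form "task-rank executable arguments".

```shell
#!/bin/bash
# Fallback values so that the sketch also runs outside of a Slurm job.
# All values are placeholders for illustration only.
SLURM_TASKS_PER_NODE=${SLURM_TASKS_PER_NODE:-4}
SLURM_JOBID=${SLURM_JOBID:-12345}
EXE=./rwrapper.sh
SCRIPT_DIR=$PWD
SCRATCH_DIR=/scratch/local/${USER:-user}/$SLURM_JOBID
OUT_DIR=$PWD/out/$SLURM_JOBID
CONFIG=${TMPDIR:-/tmp}/my.config.$SLURM_JOBID

# One line per task: the task rank, followed by the command that this
# task executes together with its arguments.
for (( i = 0; i < SLURM_TASKS_PER_NODE; i++ )); do
    echo "$i $EXE $i $SCRATCH_DIR/$i $SCRIPT_DIR $OUT_DIR/$i"
done > "$CONFIG"

cat "$CONFIG"
# Inside a job the file would then be used as: srun --multi-prog "$CONFIG"
```

Inspecting the generated file before submitting a job is a cheap way to check that every task receives its own scratch and output directory.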
The R environment itself is not parallelized, which is important to keep in mind when running on CHPC cluster nodes which have at least 8 CPU cores. Typical unvectorized R programs will run using only a single core.
The R installation detailed above can run certain workloads (mostly linear algebra) using multiple threads through the Intel Math Kernel Library (MKL). We recommend benchmarking your first run with OMP_NUM_THREADS=1 and then with a higher thread count (e.g. OMP_NUM_THREADS=8 on an 8-core node) to see whether multi-threading achieves any speed-up.
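Such a comparison can be sketched with a small bash helper. The function name time_with_threads is hypothetical and the timing has a resolution of one second only; the sh -c command in the demonstration is a stand-in for the actual R call.

```shell
#!/bin/bash
# time_with_threads N COMMAND [ARGS...]
# Run COMMAND with OMP_NUM_THREADS=N and report the elapsed wall time.
time_with_threads() {
    local threads=$1 start end
    shift
    start=$(date +%s)
    OMP_NUM_THREADS=$threads "$@"
    end=$(date +%s)
    echo "threads=$threads elapsed=$(( end - start ))s"
}

# Demonstration with a stand-in command that prints the thread setting.
for t in 1 8; do
    time_with_threads "$t" sh -c 'echo "running with $OMP_NUM_THREADS thread(s)"'
done
```

For a real benchmark replace the stand-in by the R call, e.g. time_with_threads 8 Rscript yourfile.R, and compare the reported times.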
If multi-threading does not provide much speed-up, or if one needs to run on more than one node, some kind of parallelization of the R code is necessary. There are numerous R packages that implement various levels of parallelism; they are summarized in the CRAN Task View on High-Performance and Parallel Computing with R.
In our relatively limited experience, if the parallel tasks are independent of each other, the foreach package is a relatively simple option. Even better, run the parallel tasks completely independently through the srun --multi-prog option of Slurm. If you need any assistance, contact us.
RStudio is an Integrated Development Environment (IDE) for R. It includes a console and a syntax-highlighting editor that supports direct code execution, as well as tools for plotting, debugging, history and workspace management. For more information see the RStudio webpage.
RStudio is installed on Linux systems and can be invoked (after loading R) as follows:
module load RStudio
There is a short training video that parallels this section of the documentation.
R packages are installed in libraries. Before addressing the installation of R packages as such, we will first detail the hierarchical structure of the R libraries that are installed on the CHPC Linux systems.
.libPaths() returns the names of the libraries (directories) which are accessible to the R executable that has been loaded in your environment.
In the recently installed R distributions, we can have three library levels:
- Core/Default Library
- Site Library
- User Libraries
The Core & Default R packages were installed in a subdirectory of the main installation directory when a new version of R was compiled. The location of this library can be retrieved within R through the .Library variable. Among the packages in this library are "base", "datasets" and "utils".
The Site Library contains all the external packages that have been installed by the CHPC staff for a well-defined version of R, i.e. each version of R has its own Site Library (note that each R version may have been compiled with a different version of a compiler, different compiler flags or for a different version of the OS). The location of the Site Library can be found within R using Sys.getenv("R_LIBS_SITE") or by invoking echo $R_LIBS_SITE in the shell.
The User Library is a subdirectory in the user's space (e.g. $HOME) where users can install their own packages. Note that each version of R for which you want to install your own packages should have its own User Library directory. The User Library subdirectories are not present by default and must be created by users who want to install R packages themselves.
In the following lines we will describe in detail how to set up your own User Library for R 3.3.2i. The same technique can be applied to other versions of R.
module load R/3.3.2
The R module that was loaded (i.e. R/3.3.2) is the following file:
/uufs/chpc.utah.edu/sys/modulefiles/CHPC-c7/Core/R/3.3.2.lua
The set-up of the User Library goes through the following steps:
- If you have not yet created your own modules directory, create one:
mkdir -p ~/MyModules
- Create an R subdirectory in ~/MyModules. The R subdirectory will contain all your own future R modules.
mkdir -p ~/MyModules/R
- Copy the R/3.3.2 module from the CHPC modules directory into your own R module space.
cp /uufs/chpc.utah.edu/sys/modulefiles/CHPC-c7/Core/R/3.3.2.lua ~/MyModules/R
- We now have 2 modules with exactly the same relative name. We must make the relative name of the new module unique. (You can modify the name of the new module by inserting e.g. your unid, ...)
mv ~/MyModules/R/3.3.2.lua ~/MyModules/R/3.3.2.$USER.lua
- We can only load the new module if the newly created module directory is visible to LMOD, i.e. when it is included in the MODULEPATH environment variable. You can add it to the LMOD MODULEPATH variable as follows:
module use ~/MyModules
You can insert the module use ~/MyModules statement in your ~/.custom.csh file so that the new module is always visible.
- We will now create a new directory where we will install our new R packages that can be used with the CHPC 3.3.2i executable of R:
mkdir -p ~/software/pkg/RLibs/3.3.2i
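- Edit the newly created module, e.g. ~/MyModules/R/3.3.2.$USER.lua, to add the following line:
setenv("R_LIBS_USER", "/uufs/chpc.utah.edu/common/home/u0xxyyzz/software/pkg/RLibs/3.3.2i")
where u0xxyyzz must be replaced by your own unid.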
We have now set up our own User Library installation directory where we can install packages that can be compiled with the same compiler that was used to build CHPC's version of R 3.3.2i (i.e. the Intel 2017 compiler).
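The effect of this set-up on the shell environment can be sketched as follows, assuming that the edited module exports R_LIBS_USER (the variable that is checked in the next section). R_VERSION_TAG and RLIBS_ROOT are hypothetical helper variables; on the CHPC systems the module sets the variable for you, so this sketch is only meant for checking or reproducing the set-up by hand.

```shell
#!/bin/bash
# R_VERSION_TAG and RLIBS_ROOT are hypothetical helper variables.
R_VERSION_TAG=3.3.2i
RLIBS_ROOT=${RLIBS_ROOT:-$HOME/software/pkg/RLibs}

# Create the version-specific User Library directory if it is missing.
mkdir -p "$RLIBS_ROOT/$R_VERSION_TAG"

# Point R to the User Library (this is what the private module would do).
export R_LIBS_USER=$RLIBS_ROOT/$R_VERSION_TAG

# Check that the directory exists and is writable before installing packages.
if [ -d "$R_LIBS_USER" ] && [ -w "$R_LIBS_USER" ]; then
    echo "User Library ready: $R_LIBS_USER"
else
    echo "User Library missing or not writable: $R_LIBS_USER" >&2
fi
```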
After setting up the module for our own version of R and the User Library we can install packages in our own environment. To start, we first need to load our own version of R:
module load R/3.3.2.$USER
The content of $R_LIBS_USER should refer to your newly created directory (see step 7 of the previous section). Within your new R environment you can also use .libPaths() to see the paths to all your libraries.
We can now install libraries in 2 different ways:
- High-level version using install.packages() (invoked within R)
- Low-level version using R CMD INSTALL (invoked from a Linux shell)
The high-level installation is the easiest way to install packages. It is the preferred way when the package to be installed does not depend on C, C++ or Fortran libraries that are installed in non-standard directories.
The R function to be invoked is install.packages(). If the package has not been installed yet, loading it fails with an error such as:
Error in library(maRketSim) : there is no package called ‘maRketSim’
The library($PACKAGE) function tries to load the package $PACKAGE. If R can't find it, an error is printed. The install.packages() function takes several arguments. The lib argument must be followed by the directory where you want to install the package (e.g. $R_LIBS_USER). From the installation output we notice that the install.packages() function calls the low-level installation command (R CMD INSTALL), which is discussed in the next section:
'/uufs/chpc.utah.edu/sys/installdir/R/3.3.2i/lib64/R/bin/R CMD INSTALL -l \
Note that the lib flag can also be used with packages from other repositories, e.g. Bioconductor. As we have some Bioconductor packages installed in our default location, also use the lib.loc flag to tell Bioconductor where the "original" Bioconductor location is:
biocLite(pkgs, lib.loc = "/uufs/chpc.utah.edu/common/home/$USER/software/pkg/RLibs/3.3.2i", lib="/uufs/chpc.utah.edu/common/home/$USER/software/pkg/RLibs/3.3.2i")
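The high-level installation into a User Library can also be driven from the shell. The following bash sketch is an illustration under stated assumptions: the helper names r_install_expr and install_r_pkg are hypothetical, and the CRAN mirror https://cloud.r-project.org is just one possible choice for the repos argument. The first helper only prints the R expression, the second one passes it to Rscript.

```shell
#!/bin/bash
# r_install_expr PACKAGE LIBDIR [REPO]
# Print the R expression that installs PACKAGE into the library LIBDIR.
r_install_expr() {
    local pkg=$1 lib=$2 repo=${3:-https://cloud.r-project.org}
    printf 'install.packages("%s", lib="%s", repos="%s")\n' "$pkg" "$lib" "$repo"
}

# install_r_pkg PACKAGE [LIBDIR]
# Run the expression with Rscript; LIBDIR defaults to $R_LIBS_USER.
install_r_pkg() {
    Rscript -e "$(r_install_expr "$1" "${2:-$R_LIBS_USER}")"
}

# Show the expression for the example package used above.
r_install_expr maRketSim "${R_LIBS_USER:-$HOME/software/pkg/RLibs/3.3.2i}"
```

After loading your own R module, install_r_pkg maRketSim would perform the actual installation.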
The low-level installation is to be used when you need to install R packages that depend on external libraries that are installed in non-default locations. As an example, consider the package RNetCDF (already installed within 3.3.2i). The installation of this package depends on the external libraries netcdf-c and udunits2. The command to be invoked to install the RNetCDF package in a User Library is:
R CMD INSTALL --library=/uufs/chpc.utah.edu/common/home/$USER/RLibs/3.3.2i \
  --configure-args=" LDFLAGS='-L/uufs/chpc.utah.edu/sys/installdir/netcdf-c/4.3.2i/lib -lnetcdf \
  -L/uufs/chpc.utah.edu/sys/installdir/udunits/2.2.20/lib -ludunits2 ' \
  --with-nc-config=/uufs/chpc.utah.edu/sys/installdir/netcdf-c/4.3.2i/bin/nc-config " RNetCDF_1.8-2.tar.gz
R CMD INSTALL calls ./configure under the hood. The best way to tackle such an installation is to download the tar.gz file first, find the appropriate installation flags (different for each package!) and then feed those flags to the R CMD INSTALL command.
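One way to find these flags is to unpack the downloaded source package and ask its configure script for help. The following bash sketch shows the idea; the function name list_configure_options is hypothetical, and it assumes that the package ships a configure script in its top-level directory (packages without one print a short notice instead).

```shell
#!/bin/bash
# list_configure_options TARBALL
# Unpack an R source package into a temporary directory and print the
# help text of its configure script, which lists the supported flags.
list_configure_options() {
    local tarball=$1 tmp cfg
    tmp=$(mktemp -d)
    tar -xzf "$tarball" -C "$tmp"
    # Source packages unpack into a single top-level directory.
    cfg=$(find "$tmp" -maxdepth 2 -name configure -type f | head -n 1)
    if [ -n "$cfg" ]; then
        ( cd "$(dirname "$cfg")" && sh ./configure --help )
    else
        echo "no configure script found in $tarball"
    fi
    rm -rf "$tmp"
}

# Example call, executed only if the tarball has been downloaded already.
if [ -f RNetCDF_1.8-2.tar.gz ]; then
    list_configure_options RNetCDF_1.8-2.tar.gz
fi
```

The options reported this way (for example the --with-... flags) are the ones that can be passed on to R CMD INSTALL.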
If you have trouble or questions, please send an email to firstname.lastname@example.org.