
I can't ssh to a machine anymore, getting a serious error:

[screenshot of the SSH error message]

While it looks scary, this error is usually benign. It occurs when the SSH host keys on the machine you are trying to connect to change, most commonly after an operating system upgrade. There are two ways to get rid of this message and log in:

  1. open the file ~/.ssh/known_hosts in a text editor and delete the lines that contain the host name you are connecting to (see the sketch after this list for a non-interactive way)
  2. use the ssh-keygen command with the -R flag to remove the SSH keys for the given host,
    e.g. ssh-keygen -R kingspeak1.chpc.utah.edu
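
For the first option, the same edit can be done from the command line. This is a minimal sketch, assuming the host is kingspeak1.chpc.utah.edu and that your known_hosts entries are not hashed (ssh-keygen -R works in either case):

# delete every known_hosts line that mentions the host, same effect as editing the file by hand
sed -i '/kingspeak1\.chpc\.utah\.edu/d' ~/.ssh/known_hosts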

On the subsequent ssh connection, the machine will print something like the message below and let you log in:

Warning: Permanently added 'astro02.astro.utah.edu,155.101.26.110' (ECDSA) to the list of known hosts.

 


My calculations or other file operations complain that the file can't be accessed or does not exist, even though I have just created or modified it.

This error may have many incarnations but it may look something like this:

ERROR on proc 0: Cannot open input script in.npt-218K-continue (../lammps.cpp:327)

It also tends to occur randomly: sometimes the program works, sometimes it does not.

This error is most likely due to the way the file system writes files. For performance reasons, it writes parts of the file into a memory buffer, which gets periodically written to the disk. If another machine tries to access the file before the writing machine has flushed it to the disk, this error occurs. For NFS, which we use for all our home directories and group spaces, the behavior is well described here. There are several ways to deal with this:

  1. Use the Linux sync command to forcefully flush the buffers to the disk. Do this on both the machine that writes the file and the machine that reads it, BEFORE the file is accessed. To ensure that all compute nodes in the job sync, do "srun -n $SLURM_NNODES --ntasks-per-node=1 sync".
  2. Sometimes adding the Linux sleep command can help, to provide an extra time window for the syncing to occur (see the sketch after this list).
  3. Inside the code, use fflush for C/C++ or flush for Fortran. For other languages, such as Python and Matlab, search their documentation for "flush" to see what options are available.
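
As an illustration of the first two points, a job script fragment that flushes the buffers before a later step reads the freshly written file could look like the sketch below (post_process and output_file are hypothetical placeholders):

# flush the NFS buffers on every node of the job before another step reads the file
srun -n $SLURM_NNODES --ntasks-per-node=1 sync
sleep 10                      # optional: give the sync some extra time
./post_process output_file    # hypothetical program that reads the freshly written file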

If none of these help, please try another file system to see if the error persists (e.g. /scratch/global/lustre or /scratch/local), and let us know.


Starting the Emacs editor is very slow.

Emacs's initialization includes accessing many files, which can be slow in a network file system environment. The workaround is to run Emacs in server mode (as a daemon) and start each terminal session using the emacsclient command. The Emacs daemon stays in the background even if one disconnects from that particular system, so it needs to be started only once per system start.

The easiest way is to create an alias for the emacs command:

alias emacs emacsclient -a \"\"

Note the escaped double quote characters (\"). This will start Emacs as a daemon if it is not running already, and then connect in client mode.
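
The alias above uses csh/tcsh syntax. If your login shell is bash, a rough equivalent to add to your ~/.bashrc would be:

# start the Emacs daemon if needed, then connect as a client
alias emacs='emacsclient -a ""'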

Note that by default emacsclient starts in the terminal; to force it to start the Emacs GUI, add the "-c" flag, e.g. (assuming the aforementioned alias is in place) "emacs -c myfile.txt".

Another solution, suggested by a user, is to add this line to your .emacs file in your home directory:

(setq locate-dominating-stop-dir-regexp "\\`\\(?:/uufs/chpc.utah.edu/common/home/[^\\/]+/\\|~/\\)\\'")

 

Opening a file in Emacs is very slow

We have yet to find the root of this problem, but it is most likely caused by the number of files in a directory and the type of file that Emacs is filtering through. The workaround is to read the file without any contents conversion: M-x find-file-literally <Enter> filename <Enter>. After opening the file, one can tell Emacs to encode the file accordingly, e.g. to syntax highlight shell scripts, M-x sh-mode <Enter>.

To make this change permanent, edit the ~/.emacs file to add:

(global-set-key "\C-c\C-f" 'find-file-literally)

Troubleshooting Slurm jobs that won't start (errors and other reasons)

  • Batch job submission failed: Invalid account or account/partition combination specified

    This error usually indicates that one is trying to run a job in the general partition, but the research group does not have an allocation or has used all of its allocation for the current quarter. To view current allocation status, see this page. If your group is either not listed in the first table on this page or there is a 0 in the first column (allocation amount), your group does not have a current allocation. In this case, your group may want to consider completing an allocation request.

    Jobs without an allocation must run in the freecycle partition. They will have lower priority and will be preemptible. You can see what partitions and accounts you have access to by running the myallocation command. There are also examples of account-partition pairs on the Slurm documentation page. Alternatives include using unallocated clusters (kingspeak, lonepeak, and tangent) or running on owner nodes as a guest (with the possibility of preemption).

    This error can also be caused by an invalid combination of values for account and partition: not all accounts work on all partitions. Check the spelling in your batch script or interactive command and be sure you have access to the account and partition. To view the combinations you can use, run the sacctmgr command (a short sketch follows this list); more information, including example commands, can be found on the Slurm documentation page.

  • Batch job submission failed: Node count specification invalid

    The number of nodes that can be used for a single job is limited; attempting to submit a job that uses more will result in the above error. This limit is approximately one-half the total number of general nodes on each cluster (currently 32 on notchpeak, 24 on kingspeak, and 106 on lonepeak).

    The limit on the number of nodes can be exceeded with a reservation or QOS specification. Requests are evaluated on a case-by-case basis; please contact us (helpdesk@chpc.utah.edu) to learn more.

  • Required node not available (down, drained, or reserved)
    or job has "reason code" ReqNodeNotAvail

    This occurs when a reservation is in place on one or more of the nodes requested by the job. The "Required node not available (down, drained, or reserved)" message can occur when submitting a job interactively (with srun, for instance). When submitting a script (often with sbatch), however, the job will enter the queue without complaint and Slurm will assign it the "reason code" "ReqNodeNotAvail", which provides some insight into why the job has not yet started.

    The presence of a reservation on a node likely means it is in maintenance. It is possible there is a downtime on the cluster in question; please check the news page and subscribe to the mailing list via the User Portal so you will be notified of impactful maintenance periods.
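
Two quick command-line checks related to the errors above, as a sketch (the exact sacctmgr format fields and squeue output format may differ slightly on our systems):

# which account/partition combinations can I use?
myallocation
sacctmgr show assoc user=$USER format=account,partition,qos

# why is my queued job not starting? (%r prints the reason code, e.g. ReqNodeNotAvail)
squeue -u $USER -o "%.10i %.12P %.10T %.30r"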


I would like to change my shell (to bash or tcsh)

You can change your shell in the Edit Profile page by selecting the shell you'd like and clicking "Change." This change should take effect within fifteen minutes and you will need to log in again on any resources you were using at the time. That includes terminating all FastX sessions you may have running.

If you only need to use a different shell (but don't want to change your own), you can open the shell or pass commands as arguments (e.g. tcsh or tcsh -c "echo hello").



I would like to change my email address

You can change the email address CHPC uses to contact you in the Edit Profile page.


 My program crashed because /tmp filled up

Linux defines temporary file systems at /tmp and /var/tmp, where temporary user and system files are stored. CHPC cluster nodes set up these temporary file systems as a RAM disk with limited capacity. All interactive and compute nodes also have spinning disk local storage at /scratch/local. If a user program is known to need temporary storage, it is advantageous to set the environment variable TMPDIR, which defines the location of the temporary storage, and point it to /scratch/local. Or, even better, create a user specific directory, /scratch/local/$USER, and set TMPDIR to that, as shown in our sample /uufs/chpc.utah.edu/sys/modulefiles/templates/custom.[csh,sh] script.
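
A minimal sketch of doing this in a bash job script, assuming the program honors TMPDIR:

# use a per-user directory on the node's local scratch for temporary files
mkdir -p /scratch/local/$USER
export TMPDIR=/scratch/local/$USER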


I am getting a message "Disk quota exceeded" when logging in

Default CHPC home directories have a 50 GB storage limit; when it is exceeded, one can not write any more files to the home directory. Since some access tools, such as FastX, rely on storing small files in the user's home directory upon logging in, they will fail.

To display quota information, either run the mydiskquota command or log in to the CHPC user personal details page and scroll down to Filesystem Quotas. In the mydiskquota output, focus on the /home/hpc line; the first figure is the important one - the storage usage in GB. It has to be under 50 GB. The Overall capacity covers the whole file system, not just one's home directory. Also, be aware that the quota information refreshes once per hour, so after deleting files the report may not be updated immediately. However, if one removes enough data to get under the quota, one will be able to write again, even if the report still shows being over the quota.

The remedy for this is to clean up files in the home directory. To do that, log in using a terminal tool, such as PuTTY or Git Bash on Windows, the terminal on a Mac, or Open OnDemand Clusters -> Shell access. Delete large files (using the rm command). You may also be able to copy large files back to your desktop with WinSCP (Windows) or Cyberduck (Mac).

To keep large files, explore other storage solutions at CHPC. As a temporary solution, you may move the large files to one of the scratch servers, e.g.:

mkdir -p /scratch/general/vast/$USER
mv big_file /scratch/general/vast/$USER

To find large files, in the text terminal, run the ncdu command in your home directory to show disk space used per directory (the largest directory will be at the top). Then cd to the directory with the largest usage and continue until you find the largest files. Remove them or move them to the scratch as shown above. If you clean up a few files and are able to open a FastX session again, the graphical tool baobab is similar to ncdu but shows the usage in an easier to comprehend graphic.
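
If ncdu is not available in your environment, a rough du-based alternative is sketched below; it summarizes the space used by each top-level directory in your home, including hidden ones, with the largest listed last:

du -sh ~/* ~/.[!.]* 2>/dev/null | sort -h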


How do I check that my job is running efficiently

Run the following command: pestat -u $USER. Any information shown in red is a warning sign. In the example output below the user's jobs are only utilizing one CPU out of 16 or 28 available:

Hostname       Partition     Node Num_CPU  CPUload  Memsize  Freemem  Joblist
                            State Use/Tot              (MB)     (MB)  JobId User ...
   kp016      kingspeak*    alloc  16  16    1.00*    64000    55494  7430561 u0123456  
   kp378      schmidt-kp    alloc  28  28    1.00*   256000   250656  7430496 u0123456 

Another possibility is that the job is running low on memory, although we now limit the maximum memory used by the job via SLURM, so this is much less common than it used to be. However, if you notice low free memory along with low CPU utilization, like in the pestat example below, try to submit on nodes with more memory using the #SBATCH --mem=xxx option:

Hostname       Partition     Node Num_CPU  CPUload  Memsize  Freemem  Joblist
                            State Use/Tot              (MB)     (MB)  JobId User ...
   kp296       emcore-kp    alloc  24  24    1.03*   128000     2166* 7430458 u0123456

My calculations are running slower than expected

First, check how efficiently your jobs are using the compute nodes. This can give some clues about what the problem is.

There can be multiple reasons for this, ranging from user mistakes to hardware and software issues. The most common, roughly in order of frequency, are:

  • not parallelizing the calculation. In the HPC environment, we obtain speed by distributing the workload onto many processors. There are different ways to parallelize depending on the program workflow, ranging from independent calculations to explicit parallelization of the program using OpenMP, MPI, or interpreted languages like Python, R, or Matlab. For some basic information on explicit parallelization, see our Introduction to Parallel Computing lecture, or contact us.
  • user mistakes, such as not supplying the correct number of parallel tasks or hard coding the number of tasks to run instead of using SLURM variables like $SLURM_NTASKS or $SLURM_CPUS_PER_NODE (see the sketch after this list). Check your SLURM script and program input files, and if in doubt contact us.
  • inefficient parallelization. MPI in particular can be sensitive to how efficiently the program's parallelization is implemented. If you need help analyzing and fixing the parallel performance, contact us.
  • hardware or software issues on the cluster. If you have ruled out the issues listed above, please contact us.
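
To illustrate the second point, let SLURM supply the task count rather than hard coding it; a minimal sketch (my_program is a placeholder):

# launch as many MPI tasks as SLURM allocated to the job
mpirun -np $SLURM_NTASKS ./my_program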

Jobs are running out of memory

We enforce memory limits in jobs through the SLURM scheduler. If your job ends prematurely, please check whether the job output has "Killed" at or near the end. That signals that the job was killed due to running out of memory.

Occasionally the SLURM memory check does not work and your jobs end up either slowing down the nodes where the job runs or putting the nodes into a bad state. This requires sysadmin intervention to recover the nodes, and we usually notify the user and ask them to correct the problem, either by requesting more memory for the job (SLURM's --mem option) or by checking what they are doing and lowering their memory needs.

In these situations it is imperative to monitor the memory usage of the job. A good initial check is to run the pestat command. If you notice that the free memory is low, the next step is to ssh to the affected node and run the top command. Observe the memory load of the program; if it is high and you notice kswapd processes taking some CPU time, the program is using too much memory. Delete the job before it puts the node into a bad state and remedy the memory needs as suggested above.
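
A rough monitoring workflow, sketched with the node name and job id from the pestat example above as placeholders:

pestat -u $USER     # look for low Freemem or low CPUload marked in red
ssh kp296           # log into the affected node
top -u $USER        # watch the program's memory use; kswapd activity is a bad sign
scancel 7430458     # delete the job before the node ends up in a bad state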

I am running a campus VPN on my home computer and can't connect to certain websites

A Virtual Private Network (VPN) makes your computer look like it is on campus, even if you are off-site. However, because of the way the campus VPN is set up, you may not be able to access certain off-campus internet resources.

We recommend using the VPN only if one needs to:

  • map network drives to CHPC file servers
  • use remote desktop to connect to CHPC machines that allow remote desktop (e.g. Windows servers)
  • connect to the Protected Environment resources

All other resources do not need VPN. These include:

  • ssh or FastX to connect to CHPC general environment Linux clusters
  • accessing secure websites that require University authenticated login, such as CHPC's webpage, various other campus webpages (Canvas, HR, ...), Box, Google Drive, etc.

I would like to unsubscribe from CHPC e-mail messages

You can do that by going to our unsubscribe page. However, note that e-mail announcements are essential in keeping our user base aware of what's happening with our systems and services, therefore we strongly recommend staying subscribed. We try to keep the messages to a minimum, at most a couple per week, unless there is a computer downtime or other critical issue. For ease of e-mail navigation and deletion, we have adopted four standard headlines for all of our messages:

  • CHPC DOWNTIME - for announcements related to planned or unplanned computer downtimes
  • CHPC PRESENTATION - for CHPC lectures and presentations
  • CHPC ALLOCATIONS - for announcements regarding the CHPC resources allocation
  • CHPC INFORMATION - for announcements different from the three above

I can not log into CHPC, getting either a "host can not be reached" or a lockout message

One possibility for a connection failure is that the server is down, due to either a planned or unplanned downtime. To determine whether this is the case, check our latest news or try to connect to another CHPC machine in case the failure is localized. Also make sure you have an up to date e-mail contact in your CHPC Profile so that you receive our announcements.

Another common possibility is a connection disabling or a lockout after multiple failed login attempts due to incorrect passwords. Both CHPC and campus authentication implement various measures to prevent "brute force" login attacks. After a certain number of failed logins within a certain time period, the login will be disabled for a certain period of time. These parameters vary between the general and protected environments and the campus authentication, and are often not public for security reasons, but, in general, the period of disablement is an hour or less from the last failed login attempt.

The best approach to deal with this is to try to log into a different machine (e.g. kingspeak2 instead of kingspeak1, or even a different cluster), or, in the worst case, wait until the login is enabled again. It is counter-productive to keep trying to log in, even with the correct password, since each attempt resets the disablement timer.

If SSH login does not work on any CHPC Linux cluster, the problem may be a campus authentication lockout for SSH. CHPC uses the campus authentication to verify user identity. To check whether campus authentication works, try any of the CIS logins. If it does, you may still be able to access CHPC Linux resources via the ondemand.chpc.utah.edu web portal, which also provides terminal access to the Linux clusters in its Clusters -> Shell Access menu. Regardless, contact the campus help desk, inform them of the AD SSH lockout, and ask them to unlock your account.

How can I use all processors in multi-node job with different CPU core counts per node

In some cases (e.g. when an owner has several generations of nodes), it is desirable to run a multi-node job that spans nodes with different CPU core counts. To utilize all the CPUs on these nodes, instead of defining the job's process count with the SLURM_NTASKS variable provided by SLURM, one has to explicitly specify the total CPU core count on these nodes.

This is achieved by specifying only --nodes, not --ntasks, in the node/tasks request part of the SLURM batch script. The second step is to calculate how many CPU cores are available on all the job's nodes and store the result in the JOB_TASKS variable, which is then supplied, e.g. to mpirun, to specify the number of tasks.

#!/bin/bash
#SBATCH --nodes=12
....
# SLURM_JOB_CPUS_PER_NODE looks like e.g. "16(x2),28(x3)"; the sed turns it into an
# arithmetic expression (16*2+28*3) which bc evaluates to the total core count
JOB_TASKS=`echo $SLURM_JOB_CPUS_PER_NODE | sed -e 's/[()]//g' -e 's/x/*/g' -e 's/,/+/g' | bc`

mpirun -np $JOB_TASKS program_to_run

I can't access files from my group space or scratch in Jupyter or RStudio Server

Unfortunately, both Jupyter and RStudio Server set one's home directory as the root of their file browser, so one can't navigate above the home directory to reach the group file spaces.

There is a trick to get these portals to access a group space or a scratch space: create a symbolic link from that space into your home directory. To do this, in a terminal window, run

ln -s /uufs/chpc.utah.edu/common/home/my-group1 ~/

Replace my-group1 with the appropriate group space name. You'll then see the my-group1 directory in the root of your home directory and can access it this way.

Similarly for the scratch file systems, e.g. for /scratch/general/vast:

ln -s /scratch/general/vast/$USER ~/vast

I am getting violation of usage policy warnings, though I don't run anything

We monitor usage of the Linux cluster interactive nodes, since they are a shared resource, and send warnings if the CPU or memory usage of user processes exceeds limits. Even if one does not run time consuming calculations, sometimes the FastX remote desktop puts enough load on the system to trigger these warnings. If one receives these warnings, please log into the affected node using FastX and terminate the FastX session. Keep in mind that FastX sessions stay alive after the client is closed; they need to be explicitly terminated to quit them.

My RStudio Server session does not start, or crashes on start

Sometimes the RStudio Server session files stored in the user's home directory get corrupted, preventing new RStudio Server sessions from starting. This is especially common if one chooses to automatically save the workspace and does not terminate the RStudio Server session before deleting the job that runs it (e.g. in Open OnDemand).

To remedy this situation:

  1. First, try to remove the session files by running rm -rf ~/.local/share/rstudio/sessions/*.
  2. If that does not help, move away the whole RStudio settings directory: mv ~/.local/share/rstudio ~/.local/share/rstudio-old. Be aware that this will reset some customizations you may have done.
  3. If this does not work, move away the user settings: mv ~/.config/rstudio ~/.config/rstudio-old.
  4. Finally, if even this does not help, contact our help desk. We have seen cases where RStudio project files saved in certain user home directories were corrupted and had to be removed.

At least some of these situations can be prevented by terminating the Open OnDemand RStudio Server session via its File -> Quit Session menu, rather than just closing the web browser tab and deleting the job.

Last Updated: 10/3/23