Tangent Users - Issue with Jobs not starting - Center for High Performance Computing

October 24th update:

About three weeks ago, we posted an update that Tangent was back in service. This is a further update: while some jobs are running, a number of jobs are still attempting to start and failing.

These issues stem from changes being made by the administrators of the other use of the tangent cluster hardware (Rob Ricci’s Flux group; see the tangent user guide at https://www.chpc.utah.edu/documentation/guides/tangent.php for additional details on the nature of the tangent cluster and the Apt project), who are adding a reservation system for their use of the hardware. With this reservation system in place, we do not yet have a mechanism to identify which of the “free” nodes are actually reserved and which are really free for use as tangent compute nodes. Therefore, slurm sometimes tries to start a job on a node that it thinks is free but actually is not, and the job gets stuck because the node does not complete the provisioning step.

If you decide to continue to submit jobs to tangent, please watch them. If a job gets stuck in the Configuring (CF) state and you are unable to remove it from the queue, which will most likely be the case, please send in a ticket and we will clear the job for you. If you would rather not watch your jobs and resubmit those that get into this state, you may want to use other resources until we let you know that this issue has been resolved. We will also watch the batch queue on tangent, identify jobs in this state, and clear them.
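As a rough sketch of one way to watch for stuck jobs, the snippet below lists the IDs of your jobs that are currently in the CF state. The function name `cf_jobs` is only illustrative and is not part of slurm or of any CHPC tooling; the snippet assumes the standard `squeue` options `-u`, `-h`, and `-o` with the `%i` (job ID) and `%t` (compact state) format fields.

```shell
#!/bin/bash
# Illustrative helper (hypothetical name): print the IDs of jobs whose
# compact state is CF (Configuring).
# Reads lines of the form "<jobid> <state>" on standard input.
cf_jobs() {
    awk '$2 == "CF" { print $1 }'
}

# On a cluster login node, feed it your own jobs from squeue:
#   -h drops the header line, -o "%i %t" prints job ID and compact state.
# The check below simply skips this step where squeue is not installed.
if command -v squeue >/dev/null 2>&1; then
    squeue -u "$USER" -h -o "%i %t" | cf_jobs
fi
```

A job that briefly shows CF while its node is being provisioned is not necessarily stuck, so it is worth running this a few times over several minutes. Any job ID that keeps appearing can be included in the ticket.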

Thank you for your patience through this process.

October 3rd update:

Tangent is now back in service. Please feel free to submit jobs, but as you do so, keep in mind the dynamic nature of the tangent cluster. The group that administers the servers is still changing how these resources are allocated between different uses, and these changes have the potential to impact the servers' use as tangent cluster compute nodes. Therefore, if you observe any problems with tangent jobs, please report them to issues@chpc.utah.edu.

For those who were given access to the reservation a couple of weeks ago to help us with the testing, please make sure you remove the

#SBATCH --res

line from your script.

August 4th update:

Tangent update #2:

The issues with nodes not completing the configuration process continue. Since yesterday, no jobs have successfully started. Therefore, we have put a reservation on the nodes to stop jobs from trying to start. We are working with the Flux group to resolve the current issue and will test before returning the nodes to service.

We suggest that users of tangent look to the other resources available to run their jobs until we send out a message that tangent is again ready for use.

July 28th update:

Currently, the Flux group that administers the servers used for the tangent compute nodes is making changes. You can see details about the project that provides this resource in the tangent user’s guide at https://www.chpc.utah.edu/documentation/guides/tangent.php. Note that these changes are part of the dynamic nature of this project, and we anticipate that there may be additional inconsistencies in the behavior of the tangent batch system in the near future.

We are in communication with this group about the changes. As they make further changes, we will continue to work with them to evaluate the impact, provide feedback, and adapt.

We have attempted to address the configuration issue mentioned in the previous message in order to mitigate the impact as much as possible, and we will continue to make adjustments when possible.

If you notice additional problems, please report them to issues@chpc.utah.edu.

Original Post September 26th:

There is an issue on tangent where jobs are not starting. Jobs are trying to start, but the nodes are not completing the configuration step, and after about 25 minutes the jobs exit from the queue. We are working on resolving the issue and will post an update when we know more.