Here is all the latest news from the UCF ARCC:

Stokes Login Node Outages

Last night (November 23, 2020) at around 9:33 pm EST, euser1, the login node for the Stokes HPC cluster, went offline. It was brought back online at around 10:00 pm but went down again at around 3:00 am. It has been back up since 10:13 am. We are currently diagnosing the issue and will continue to monitor the status of the machine. Batch jobs on Stokes should not be affected by this outage, as the compute nodes and the management node, where the Slurm job scheduler runs, are all fine. The Newton GPU cluster is completely unaffected by this outage. We apologize for any inconvenience and will keep you apprised as the situation develops.

Thank you, Glenn & Jamie

Changes due to ARCC Summer 2020 Maintenance Cycle

Greetings ARCC users,

Stokes and Newton have returned to operation! Please remember that we schedule two such maintenance downtimes per year: one after the Fall term and one typically after the Spring term (this Summer's maintenance was the Spring cycle, delayed because of COVID).

Please take a moment to read over the changes:

  1. We retired all the old 12-core nodes (92 in all), and re-racked 28 new nodes. Each of the new nodes has 48 cores. This means we retired 1,104 cores and added 1,344 cores! The new machines have faster networking and a lot more memory (384GB in each node).
  2. We replaced one of the controllers in our file system and performance-tested the file system to confirm that the performance problems we saw over the Summer are fixed. We believe they are resolved.
  3. We moved all curated datasets into /datasets. The /datasets/ImageDataSets directory no longer exists.
  4. We upgraded the OS on all the nodes to CentOS 7.8.2003.
  5. We upgraded our resource manager, Slurm, to version 20.02.3.
  6. We upgraded our NVIDIA GPU drivers to the latest versions and made the CUDA 11 module available to users.
  7. We upgraded some of our internal configuration tools and ran regular benchmarking for compute, file system, and network performance; all show the clusters are operating as expected.
We appreciate your patience and wish you the best with your research!

    Glenn & Paul.
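For anyone sizing jobs against the new hardware, the core counts in item 1 can be checked with a short calculation. This is only an illustrative sketch: the node counts, cores per node, and memory per node come from the announcement, while the net core change and the aggregate memory figure are derived here and were not stated by the ARCC. The variable and function names are made up for this example.

```python
# Figures taken from item 1 of the change list above.
RETIRED_NODES = 92
RETIRED_CORES_PER_NODE = 12
NEW_NODES = 28
NEW_CORES_PER_NODE = 48
NEW_MEMORY_GB_PER_NODE = 384


def total(nodes: int, per_node: int) -> int:
    """Return the aggregate of a per-node quantity across all nodes."""
    return nodes * per_node


retired_cores = total(RETIRED_NODES, RETIRED_CORES_PER_NODE)
added_cores = total(NEW_NODES, NEW_CORES_PER_NODE)

# Derived values (not stated in the announcement itself).
net_core_change = added_cores - retired_cores
added_memory_gb = total(NEW_NODES, NEW_MEMORY_GB_PER_NODE)

print(f"Retired cores:    {retired_cores}")
print(f"Added cores:      {added_cores}")
print(f"Net core change:  {net_core_change:+d}")
print(f"Added memory:     {added_memory_gb} GB across the new nodes")
```

The retired and added totals match the 1,104 and 1,344 cores quoted in item 1.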

Stokes & Newton Down for A/C repair 4/28 -- 5/1

As indicated in our many listserv messages and on our Facebook page, the ARCC must take Stokes and Newton down from Tuesday, April 28 through Friday, May 1 so that the A/C system can be replaced.

Power loss in research park, ARCC outage

Unfortunately, there was a power loss in the Partnership III building last night, Thursday, January 16, from 1:15 am to 3:15 am. The good news is that the infrastructure functioned as designed: the critical servers (file system, management, etc.) remained up, and the UPS performed as intended. The bad news is that the outage lasted longer than our UPS could sustain, so nearly all compute nodes powered off and the jobs running on them were lost.

We are working to bring the nodes back online now. We apologize for the inconvenience and appreciate your patience.

About the UCF ARCC:
The University of Central Florida (UCF) Advanced Research Computing Center is managed by the Institute for Simulation and Training, with subsidies from the UCF Provost and Vice President for Research and Commercialization, for use by all UCF faculty and their students. Collaboration with other universities and industry is also possible.
Contact Info:
UCF Advanced Research Computing Center
3039 Technology Parkway, Suite 220
Orlando, FL 32826
P: 407-882-1147