Here is all the latest news from the UCF ARCC:

Changes due to Spring 2019 Maintenance Downtime

Stokes and Newton have returned to operation two days early! Please remember that we have two such scheduled maintenance downtimes per year, one after Fall term and one after Spring term (the one we just completed).

Please take a moment to read over the changes, most of which will not affect usage:

  1. We increased the default inode count for user accounts from 400K files to 1 million files.
  2. We upgraded our NVIDIA GPU drivers to the latest versions and made the CUDA 10.1 module available to users.
  3. We repaired a minor problem that was preventing SLURM from sending email about jobs when configured to do so.
  4. We revised our naming standards for ARCC email addresses to make them more general and more consistent with our other standards. The most relevant change for users is as follows. For the new address, check your email or go here. The old email address will redirect to the new one for some time; however, users should transition to the new address.
  5. Some minor configuration changes and testing were performed on our NFS server to address some performance issues. This is not the shared file system where user data resides, but is instead the file system where the modules and applications are stored. The system benchmarks acceptably now.
  6. We completed our regular benchmarking of the compute, file system, and network performance. All show the clusters operating as expected.
  7. A review of our anaconda installs indicated a number of errors. We have temporarily *removed* anaconda from our applications and modules list while we reconstruct these applications. This means that our JupyterHub gateways are also temporarily down. These will all be re-deployed soon.
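Two of the items above (SLURM email notification and the CUDA 10.1 module) can be exercised from a single job script. The sketch below is illustrative only: the module name `cuda/cuda-10.1` and the email address are assumptions, so check `module avail` on Stokes/Newton for the exact module name and substitute your own address.

```shell
#!/bin/bash
#SBATCH --job-name=cuda-check
#SBATCH --mail-type=BEGIN,END,FAIL    # SLURM will email on these job events
#SBATCH --mail-user=you@example.edu   # placeholder address -- use your own

# Module name is an assumption; run `module avail cuda` to confirm it.
module load cuda/cuda-10.1

nvcc --version    # confirm the CUDA 10.1 toolkit is on the PATH
```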

We appreciate your patience and wish you the best with your research!

Paul & Glenn.

Spring Maintenance cycle downtime, May 6 - May 12

Stokes and Newton will be taken down per our bi-annual routine maintenance cycle during early May. Specifically, the clusters will be unavailable from the morning of Monday, May 6 through the evening of Sunday, May 12.

Recall that we now routinely bring the system down twice a year, once in late Fall and once in late Spring. We will notify users in advance of such downtimes, but we recommend building these expectations into your workflow. Though we anticipate no data loss during this time, it is never a bad idea to back up your materials, so we suggest you use this opportunity to copy salient data and code off of Stokes and Newton before the downtime.

The Newton GPU cluster has expanded!

Today we brought another 10 nodes online on our Newton GPU cluster, bringing the total to 20 nodes with 40 GPUs. Each node has two NVIDIA V100 GPU cards. Twenty of these GPUs have 16 GB of memory on board the card (evc1-evc10), and the other twenty have 32 GB (evc11-evc20). These resources were made possible by two awards received by the ARCC to support education and research at UCF. The ARCC continues to do its best to support the university's growing computational needs.
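Jobs can request these GPUs through SLURM's generic-resource mechanism. The sketch below is a minimal job script under stated assumptions: the partition name `newton` and the idea of requesting both V100s on a node are illustrative, so check `sinfo` and your site documentation for the actual partition and GRES names.

```shell
#!/bin/bash
#SBATCH --partition=newton    # partition name is an assumption
#SBATCH --gres=gpu:2          # request both V100 cards on one node
#SBATCH --time=01:00:00

nvidia-smi    # list the GPUs that SLURM allocated to this job
```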

Changes due to Fall 2018 Maintenance Downtime

Stokes and Newton have returned to operation! Please remember that we have two such scheduled maintenance downtimes per year, one in late Fall (the one we just completed) and one in late Spring.

Please take a moment to read over the changes, some of which *will* affect usage:

  1. We have upgraded SLURM and changed the SLURM configuration in the following key ways:

    • We slightly reduced the amount of available memory on each node to mitigate node-thrashing problems we have been having. Submission scripts that explicitly requested the old maximum amount of memory will now produce an error. Issue the following call to see how much memory is now available on each node:
       sinfo -Nel -p normal
    • Because of the above, we also slightly reduced the default memory request per CPU from 1990 MB to 1950 MB. This will probably not affect most users; however, jobs that previously used *almost all* of the memory, but not quite, may be impacted. Submit a ticket request if you have such a case, and we will discuss how to address it.
    • Stokes and Newton will no longer permit direct ssh to compute nodes unless that node is allocated to one of your jobs. When you *do* ssh to a node where you have a job running, the ssh session is automatically added to that job and is killed when the job ends. This change addresses problems we have had with rogue processes remaining after a job completed and impacting other users.

  2. Several pieces of software that had been marked "deprecated" during a previous downtime were placed in an "unsupported" state to discourage their use. Contact us if you have a critical need for these, and we will either help you migrate to a newer solution or explain how you can gain access to older, unsupported modules. In all cases, six months to a year's notice was given. The unsupported software builds include:

  3. Several pieces of software are or were marked "deprecated" to be removed in a *later* downtime. These deprecated software builds include:
    gcc-4.9.2 and all software built with it
    ic-2013.0.079 and all software built with it
    ic/ic-2015.1.133 and all software built with it
    ic/ic-2015.3.187 and all software built with it
    All versions of openmpi lower than 2.0 and all software built with them
    jdk/jdk-1.8.0_025 and all software built with it
    jdk/jdk-1.8.0_112 and all software built with it
    jdk/jdk-1.8.0_131 and all software built with it

  4. Several new build tools were installed. Use "module avail" to see these. The new software includes:
    ic-2019.1.144  (Intel Parallel Studio 2019)

  5. The Newton GPU cluster was re-racked into a single rack and switched over to 60-amp power. This is the first step needed to expand Newton in January. This should not impact users.

  6. Several internal tools that we use for managing the clusters were upgraded, including our environment module system (lmod). This should not impact users.
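The memory changes in item 1 above can be accommodated by requesting memory explicitly in your job script. The sketch below is illustrative only: the task count and the choice to request exactly the new default are examples, and you should use the `sinfo` call from item 1 to see the real per-node limits before picking values.

```shell
#!/bin/bash
# Explicit memory request under the new limits (see item 1 above).
# 1950 MB per CPU is the new default; the values here are examples.
#SBATCH --ntasks=4
#SBATCH --mem-per-cpu=1950

# To check the current per-node memory first, run on a login node:
#   sinfo -Nel -p normal

srun ./my_program    # placeholder executable -- substitute your own
```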

About the UCF ARCC:
The University of Central Florida (UCF) Advanced Research Computing Center is managed by the Institute for Simulation and Training, with subsidies from the UCF Provost and Vice President for Research and Commercialization, for use by all UCF faculty and their students. Collaboration with other universities and industry is also possible.
Contact Info:
UCF Advanced Research Computing Center
3039 Technology Parkway, Suite 220
Orlando, FL 32826
P: 407-882-1147
Request Help