News


Here is all the latest news from the UCF ARCC:


The Newton GPU cluster has expanded!

Today we brought another 10 nodes online on our Newton GPU cluster, bringing the total to 20 nodes and 40 GPUs. Each of these nodes contains two NVIDIA V100 GPU cards. Twenty of these GPUs have 16 GB of memory on board the card (nodes evc1-evc10), and the other twenty have 32 GB (nodes evc11-evc20). These resources were made possible by two awards received by the ARCC to support education and research at UCF. The ARCC continues to do its best to support the university's growing computational needs.
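
If you want to target a particular card size when running on Newton, you can request GPUs and name the nodes explicitly in your submission script. The following is a minimal sketch, not an exact recipe: the job name, time limit, and generic --gres string are placeholder assumptions, the partition and account options your jobs normally use are omitted, and you should confirm node names and GPU settings with sinfo before relying on them.

    #!/bin/bash
    #SBATCH --job-name=v100-test    # placeholder job name
    #SBATCH --nodes=1
    #SBATCH --gres=gpu:2            # each Newton node has two V100 cards; the exact gres string may differ
    #SBATCH --nodelist=evc11        # evc11-evc20 carry the 32 GB cards; evc1-evc10 carry the 16 GB cards
    #SBATCH --time=00:10:00         # placeholder time limit

    nvidia-smi                      # report which GPUs the job received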

Changes due to Fall 2018 Maintenance Downtime

Stokes and Newton have returned to operation! Please remember that we have two such scheduled maintenance downtimes per year, one in late Fall (the one we just completed) and one in late Spring.

Please take a moment to read over the changes, some of which *will* affect usage:

  1. We have upgraded SLURM and changed the SLURM configuration in the following key ways:

    • We slightly reduced the amount of available memory on each node to mitigate some node thrashing problems we have been having. This means that submission requests that explicitly asked for the previous maximum amount of memory will now return an error. Issue the following command to see how much memory is now available on each node:
       sinfo -Nel -p normal
    • Because of the above, we had to slightly reduce the default memory request per CPU from 1990 MB to 1950 MB. This probably will not affect most people; however, jobs that previously used *almost all* of the memory, but not quite, may be impacted. Submit a ticket request if you have such a case, and we will discuss how to address it (a sample job script with an explicit memory request appears after this list).
    • Stokes and Newton will no longer permit direct ssh to compute nodes unless you have that node allocated by a job. When you *do* ssh to a node on which you have a running job, that ssh session will be automatically added to the job and will be killed when the job ends. This change addresses problems we have been having with rogue processes that remained after a job completed and impacted other users (see the note after this list for how to obtain an interactive session).

  2. Several pieces of software that had been marked "deprecated" during a previous downtime were placed in an "unsupported" state to discourage their use. Contact us if you have a critical need for any of these, and we will either help you migrate to a newer solution or explain how you can gain access to the older, unsupported modules. In all cases, between six months and a year of notice was given. The unsupported software builds include:
    gcc-4.5.2
    impi-2017.2.174
    migrate-3.6.11-impi-2017-ic-2017

  3. Several pieces of software have been marked "deprecated" and will be removed in a *later* downtime. These deprecated software builds include:
    gcc-4.9.2 and all software built with it
    ic-2013.0.079 and all software built with it
    ic/ic-2015.1.133 and all software built with it
    ic/ic-2015.3.187 and all software built with it
    All versions of openmpi lower than 2.0 and all software built with them
    jdk/jdk-1.8.0_025 and all software built with it
    jdk/jdk-1.8.0_112 and all software built with it
    jdk/jdk-1.8.0_131 and all software built with it
    cuda/cuda-8.0
    matlab-R2014a
    petsc/petsc-3.7.3-lapack-3.5.0-openmpi-1.8.6-ic-2015.3.187
    petsc/petsc-3.7.5-lapack-3.5.0-openmpi-2.0.1-ic-2017.1.043

  4. Several new build tools were installed; use "module avail" to see them, and see the sample job script after this list for an example of loading them. The new software includes:
    cuda-10.0
    openmpi-4.0.0  
    ic-2019.1.144  (Intel Parallel Studio 2019)

  5. The Newton GPU cluster was re-racked into a single rack and switched over to 60-amp power. This is the first step needed to expand Newton in January. This should not impact users.

  6. Several internal tools that we use for managing the clusters were upgraded, including our environment module system (lmod). This should not impact users.
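
As a companion to items 1 and 4, here is a minimal sketch of a submission script that requests memory explicitly under the new defaults and loads some of the newly installed builds. The resource numbers are illustrative, and the module paths are guesses based on the name/name-version pattern used elsewhere on the clusters; confirm the real values with "sinfo -Nel -p normal" and "module avail" before using them.

    #!/bin/bash
    #SBATCH --job-name=example      # placeholder job name
    #SBATCH --ntasks=4              # illustrative task count
    #SBATCH --mem-per-cpu=1950M     # matches the new default; request less if your job allows it
    #SBATCH --time=02:00:00         # placeholder time limit

    # Load the newly installed builds listed in item 4; confirm the exact
    # module names with "module avail" first.
    module load ic/ic-2019.1.144
    module load openmpi/openmpi-4.0.0
    module load cuda/cuda-10.0

    srun ./my_program               # my_program stands in for your own executable

Likewise, if you need a shell on a compute node under the new ssh policy, obtain an allocation first (for example with "srun --pty bash" or a batch job); once you hold a job on a node, an ssh session to that node will be attached to the job as described in item 1.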

Fall Maintenance Cycle (December 10-16, 2018)

Stokes and Newton will be taken down for our twice-yearly routine maintenance during mid-December. Specifically, the clusters will be unavailable from the morning of Monday, December 10 through the morning of Monday, December 17.

The primary objective during this downtime is to upgrade our scheduler, SLURM, and to change some of its default options. There will also be some changes made to the Python installs to bring more consistency across versions. We will provide more detail in the change log when we bring the system back online.

Recall that we now routinely bring the systems down twice a year, once in late Fall and once in late Spring. We will notify users in advance of such downtimes, but we recommend that you build these expectations into your workflow. Though we anticipate no data loss during this time, it is never a bad idea to back up your materials, so we suggest you use this opportunity to copy salient data and code off of Stokes before the downtime.
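
If you would like a copy of your data before the downtime, one simple approach is to pull it to your own machine with rsync. The command below is a minimal sketch run from your local workstation; <username>, <stokes-login-host>, and the paths are placeholders for your ARCC username, the login node you normally connect to, and the directories you care about.

    # Run from your local machine, not from the cluster itself.
    rsync -avh --progress <username>@<stokes-login-host>:~/my_project/ ~/stokes-backup/my_project/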

Changes due to ARCC Spring 2018 Maintenance Cycle

Stokes and Newton have returned to operation! Please remember that we have two such maintenance downtime periods per year, one in late Fall and one in late Spring (the one we just completed).

Please take a moment to read over the changes:

  1. The base OS of the nodes was upgraded to CentOS 7.4, and the resource manager was upgraded to SLURM 17.11.6. This has fixed a problem we were having with GPU reservations on Newton.
  2. The NFS file system that supports our shared applications area was expanded.
  3. The IST external core network switches were repaired, so our external links should be back to 20 Gb/s (they had been running at 10 Gb/s for several months).
  4. A new version of the node matrix is available for Stokes, and a node matrix is now available for Newton as well.
  5. Newer versions of a number of pieces of software were built, including gdal, geos, proj, openbugs, lapack, and openmpi.
  6. The applications in /apps/jags were rebuilt with newer compilers and renamed to fix some inconsistencies with our naming standards; the modules were also renamed. The old distributions are no longer present. This may affect some R users. You can find out more by typing:
      module avail jags/jags
    
  7. We have begun the process of rebuilding and cleaning up the R builds. Currently, the old builds and modules still exist; however, over the summer we will be transitioning to 3.5.0 under newer build tools. By the end of the summer we hope to have moved all users off of older versions of R. All library packages currently installed will still be available. If you are an R user and have concerns, please contact us.
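
For R users who want to see where they stand ahead of the transition, the commands below are a minimal sketch. The module name R/R-3.5.0 is an assumption about what the new build will be called, so confirm the exact names with "module avail" once the new build appears.

    module avail R                    # list the R builds currently installed
    module load R/R-3.5.0             # assumed name of the new build; confirm with module avail
    R --version                       # verify which version is active
    Rscript -e 'print(.libPaths()); print(rownames(installed.packages()))'   # see which library packages are visible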

We appreciate your attention and wish you the best of luck with your research. If you have any questions or concerns, please submit a ticket.

About the UCF ARCC:
The University of Central Florida (UCF) Advanced Research Computing Center is managed by the Institute for Simulation and Training, with subsidies from the UCF Provost and Vice President for Research and Commercialization, for use by all UCF faculty and their students. Collaboration with other universities and industry is also possible.
Contact Info:
UCF Advanced Research Computing Center
3039 Technology Parkway, Suite 220
Orlando, FL 32826
P: 407-882-1147