News


Here is all the latest news from the UCF ARCC:


ARCC to Hire Undergraduate Student

Greetings,

The Advanced Research Computing Center (ARCC) in Research Park is looking to hire an undergraduate student to help with our high-performance computing (HPC) resources. In the first few months, the student will be responsible for helping the ARCC build, configure, and optimize scientific visualization software and set up remote visualization capabilities. After that project is complete, the student will help resolve problems for users, investigate issues related to the management and performance of our clusters, build and optimize software packages for the clusters, work to improve our documentation, and get hands-on experience with HPC-specific hardware and software.

Candidates should have experience with the GNU/Linux operating system, including scripting and standard methods for building software in UNIX-like environments. Special consideration will be given to students with experience running parallel and distributed programs, as well as experience with scientific visualization tools.


For more information about the position, please contact R. Paul Wiegand (w...@ist.ucf.edu) or Glenn Martin (m...@ist.ucf.edu). For more information about the ARCC and its resources, please see our home page.


Thank You,
R. Paul Wiegand
Glenn Martin

Changes due to ARCC Fall 2019 Maintenance Cycle

Stokes and Newton have returned to operation four days early! Please remember that we have two such scheduled maintenance downtimes per year, one after Fall term (the one we just completed) and one after Spring term.

Please take a moment to read over the changes:

  1. We upgraded our resource manager, SLURM, to version 19.05.4. The new version has much better GPU support (see the job script sketch after this list). We will be revising our online documentation soon to show how to take advantage of these features. Until then, here is more information: https://slurm.schedmd.com/SLUG19/GPU_Scheduling_and_Cons_Tres.pdf
  2. Singularity support was deployed on both clusters. Singularity is a container service like Docker (and can run Docker images). To run Singularity, just load the singularity/singularity-3.5.2-go-1.13.5 module (see the usage sketch after this list). Again, we will be rolling out improved documentation in the coming weeks to explain how to use this. Until then, here is more information: https://sylabs.io/guides/3.5/user-guide/
  3. The module system (lmod) was upgraded, and its configuration was changed so that the module hierarchy will be cached. This should improve latencies when using the 'ml' and 'module' commands, including during login.
  4. We upgraded our NVIDIA GPU drivers to the latest versions and made the CUDA 10.2 module available to users (loaded in the job script sketch below).
  5. We completed internal tool configuration changes, as well as our regular benchmarking of the compute, file system, and network performance. All show the clusters operating as expected.
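
As an illustration of the new GPU scheduling options, here is a minimal job script sketch. The GPU flags shown (--gpus-per-task, --cpus-per-gpu, --mem-per-gpu) are standard SLURM 19.05 options, but the CUDA module name and the program name are assumptions; check "ml avail" and our documentation for the exact names on Stokes and Newton.

    #!/bin/bash
    #SBATCH --job-name=gpu-test
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --gpus-per-task=1     # new in SLURM 19.05: GPUs as a trackable resource
    #SBATCH --cpus-per-gpu=4      # CPUs allocated per GPU
    #SBATCH --mem-per-gpu=8G      # memory scaled per GPU rather than per node
    #SBATCH --time=01:00:00

    # Load the CUDA toolkit; the exact module name may differ on our clusters.
    module load cuda/cuda-10.2

    # SLURM sets CUDA_VISIBLE_DEVICES for the allocated GPUs automatically.
    nvidia-smi
    srun ./my_gpu_program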
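
And here is a minimal sketch of pulling and running a Docker image with Singularity. The module name is taken from item 2 above; the image itself is an arbitrary example.

    # Load the Singularity module deployed during this maintenance cycle.
    module load singularity/singularity-3.5.2-go-1.13.5

    # Pull a Docker image from Docker Hub; Singularity converts it to a SIF file.
    singularity pull docker://ubuntu:18.04

    # Run a command inside the resulting container.
    singularity exec ubuntu_18.04.sif cat /etc/os-release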

We appreciate your patience and wish you the best with your research!

Fall Maintenance Cycle Downtime, Dec 15 - Dec 22

Stokes and Newton will be taken down for our twice-yearly routine maintenance cycle in mid-December. Specifically, the clusters will be unavailable from Sunday, December 15 through Sunday, December 22.

The primary objectives during this downtime are to upgrade SLURM and our module system.

Recall that we now routinely bring the system down twice a year, once in late Fall and once in late Spring. We will notify users in advance of such downtimes, but we recommend you build such expectations into your workflow. Though we anticipate no data loss during this time, it is never a bad idea to back up your materials, so we suggest you use this opportunity to copy salient data and code off Stokes and Newton prior to the downtime, as in the sketch below.
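
For example, a single rsync from your own machine can mirror a project directory to local storage. The hostname, username, and paths below are hypothetical placeholders; substitute your own.

    # Copy ~/projects from Stokes to a local backup directory
    # (hostname and paths are hypothetical; adjust to your account).
    rsync -avz --progress myuser@stokes.ist.ucf.edu:~/projects/ ~/stokes-backup/projects/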

Power loss in Research Park, ARCC outage

Last night (Thursday, September 26) at 11:30 p.m., a commercial power loss in Research Park caused an outage in the Partnership buildings, where our clusters reside. The outage lasted 2.5 hours. The good news is that our system functioned precisely as designed: our UPS held the load of the cluster for the designed length of time, and the core infrastructure elements (e.g., file system, InfiniBand backbone) were protected the entire time via generator.

The bad news is that the length of the outage exceeded our UPS's design limits, which means that almost all compute nodes lost power. All jobs on Stokes and many jobs on Newton were lost. We apologize for the inconvenience.

The clusters are back up and accessible now. Queued jobs have been dispatched and are running.

About the UCF ARCC:
The University of Central Florida (UCF) Advanced Research Computing Center is managed by the Institute for Simulation and Training, with subsidies from the UCF Provost and the Vice President for Research and Commercialization, for use by all UCF faculty and their students. Collaboration with other universities and industry is also possible.
Contact Info:
UCF Advanced Research Computing Center
3039 Technology Parkway, Suite 220
Orlando, FL 32826
P: 407-882-1147