
News
Here is all the latest news from the UCF ARCC:
Changes due to Spring 2016 maintenance cycle
- Written by R. Paul Wiegand
- Published: 27 May 2016
Greetings Stokes users,
Stokes has returned to operation! Newton will remain down until next week. Please remember that we have two such maintenance downtimes per year, one in late Fall and one in late Spring (the one we just completed). Let us know about issues you are having by sending requests to req...@ist.ucf.edu.
Please take a moment to read over the changes:
- The scheduling software was upgraded to SLURM 16.05.0. For most of you, all commands and scripts will work the same as before, but we are hoping to eliminate a few problems with the upgrade. Also, the new version will give us better diagnostic capabilities.
- Some OpenMPI, MVAPICH2, and OpenFOAM builds had SLURM support compiled directly into them. These had to be rebuilt, and their module names have changed. If you use one of the modules listed at the bottom of this message, you will have to make some changes to your submission scripts (e.g., load a different module, as shown in the example after the module list below) and may have to rebuild your programs. We removed the old modules so there can be no confusion. If you were not using a version of one of those packages with "slurm" in the suffix, you are not affected.
- We conducted some diagnostics and firmware updates for components of our file system to address issues related to our outage last January.
- We racked, cabled, and configured 24 new nodes with 28 cores and 128 GB of memory each. These will become available very soon.
- There were some minor internal server changes relating to how we manage the cluster.
- The remaining IBM blades have been permanently removed from the cluster.
Modules whose names have changed (the xx.xx.x suffix changed from 15.08.3 to 16.05.0):
- mvapich2/mvapich2-2.1.0-ic-2015.3.187-slurm-xx.xx.x
- openfoam/openfoam-3.0.1-openmpi-1.8.6-ic-2015.3.187-slurm-xx.xx.x
- openfoam/openfoam-3.0.1-openmpi-1.8.6-gcc-4.9.2-slurm-xx.xx.x
- openmpi/openmpi-1.8.6-ic-2015.3.187-slurm-xx.xx.x
- openmpi/openmpi-1.8.6-gcc-4.9.2-slurm-xx.xx.x
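For example, if your submission script loaded one of the old builds, the only change is to load the corresponding rebuilt module instead. The sketch below uses one of the module names above with the new suffix filled in; run module avail to confirm the exact names available on Stokes:
# Old module (removed during this maintenance cycle):
#   module load openmpi/openmpi-1.8.6-ic-2015.3.187-slurm-15.08.3
# Rebuilt module to load in its place:
module load openmpi/openmpi-1.8.6-ic-2015.3.187-slurm-16.05.0
# Confirm which rebuilt modules are available:
module avail openmpi
If your program was linked against one of the removed builds, rebuild it after loading the new module.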
Glenn Martin & Paul Wiegand.
Spring maintenance cycle downtime, Mon.23.May - Fri.27.May
- Written by R. Paul Wiegand
- Published: 31 March 2016
Stokes and Newton will be taken down per our bi-annual routine maintenance cycle during mid-May. Specifically, the clusters will be unavailable from the morning of Monday, May 23rd through the evening of Friday, May 27th. You should organize your activities with this downtime in mind.
Unlike the last few maintenance cycles, there will be very few changes that affect Stokes users. We will be upgrading SLURM, installing some new compute resources (20 new 28-core nodes), and doing some simple maintenance on the new file system.
Newton, on the other hand, will change more substantially. Currently, nodes within Newton have a variety of co-processor resources. We will be replacing the existing co-processors so that every node has a pair of Nvidia GTX 980s. This will make the visualization cluster more consistent and upgrade its capabilities.
Recall that we now routinely bring the system down twice a year, once in late Fall and once in late Spring. We will notify users in advance of such downtimes, but we recommend you build these expectations into your workflow. Though we anticipate no data loss during this time, it is never a bad idea to back up your materials, so we suggest you use this opportunity to copy salient data and code off of Stokes prior to the downtime.
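For instance, from your local machine you could pull a copy of a project directory with rsync; the hostname and paths below are placeholders for illustration only:
# Minimal sketch for copying a project directory off Stokes before the downtime.
# "stokes-login" is a placeholder; substitute the actual login or data-transfer hostname.
rsync -av stokes-login:/home/$USER/my_project/ /path/on/local/machine/my_project/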
Thank you, Paul & Glenn.
Changes due to Fall 2015 Maintenance Cycle
- Written by R. Paul Wiegand
- Published: 21 December 2015
Greetings Stokes users,
Stokes has returned to operation! Please remember that we have two such maintenance downtimes per year, one in late Fall (the one we just completed) and one in late Spring.
There are several substantive changes, and we expect there will be a few unanticipated issues to resolve. Please remain patient with us as we iron these out -- and also, let us know about issues you are having by sending requests to req...@ist.ucf.edu.
Changes include the following:
- THE SCHEDULING SOFTWARE HAS CHANGED! We are now using SLURM rather than Torque/Moab. Your submit scripts must be rewritten, and you will have to learn some new commands; a minimal example script is sketched after this list. There is documentation available on our website under Help->Tutorials->Job Submission.
- We have entirely replaced our old GPFS file system with a new Lustre-based file system. Your user files are still in /home and your group files are still in /groups; however, there are some changes resulting from this.
- There are no longer any group quotas. The standard user quota is now 500GB and 400K files.
- The /gpfs/fs0 and /gpfs/fs1 directories no longer exist. If your scripts or code contain hard-coded paths to those directories, they will have errors; please use /home or /groups instead.
- To find out how much space you are using, now use the following command:
lfs quota -u $USER /lustre/fs0
- There are 7 new compute nodes (ec2-8), each with 28 cores and 128 GB of memory.
- Several resources are not yet fully operational:
- The IBM blades are currently down. They are more complicated to integrate into our new InfiniBand fabric, and we are working on getting them up and running. Consequently, ec99-ec147 will not be accessible for at least a few days.
- Newton is currently down. That cluster must be completely rebuilt given the new file system and fabric, and we prioritized getting Stokes running first. We expect Newton to be back up before the start of the term and will provide more information at that time.
- The data transfer node must be rebuilt to use the new file system. We expect to have this done by the start of the term.
- The Materials Studio gateway is not operational. We are still working on integrating it with the new scheduler.
- Queue constraint and operational changes include:
- There are now only two queues: normal and preemptable. By default, normal is used; you do not need to specify a queue.
- There are no longer any limitations on the normal queue in terms of the number of nodes or the length of time jobs may run.
- Instead, we use FairShare, which is explained on our new Job Submission tutorial page.
- When you reach your DPH limit at the end of the month, you can run via the preemptable queue -- though note that jobs on this queue run with lower priority and can be killed to make room for higher-priority jobs when no other resources are available. See our new Job Submission page for details.
- Performance changes include:
- The new network backplane is faster and the supporting switches are less oversubscribed. You should see overall improvement in network performance for distributed jobs.
- The new file system is much faster for reads (and slightly faster for writes), so you should see overall improvement for file system performance.
- The new scheduler does a better job of tracking memory and compute-core resources. You can no longer use resources that you did not request and that were not assigned to you. This means that some jobs that ran before may now crash; such jobs will have to be corrected so that they request the resources they actually use.
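To make the transition concrete, here is a rough sketch of a minimal SLURM submit script. The job name, resource requests, time limit, and program are placeholders, and the exact options and partition names for Stokes are documented on the Job Submission tutorial page:
#!/bin/bash
#SBATCH --job-name=example          # placeholder job name
#SBATCH --nodes=1                   # number of nodes requested
#SBATCH --ntasks=28                 # number of tasks (e.g., one per core on a 28-core node)
#SBATCH --time=08:00:00             # wall-clock limit for the job
##SBATCH --partition=preemptable    # uncomment to target the preemptable queue,
                                    #   e.g., after reaching your DPH limit

srun ./my_program                   # placeholder executable
Submit the script with sbatch (e.g., sbatch example.slurm) and check its status with squeue -u $USER; these replace the Torque/Moab qsub and qstat commands.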
The ARCC staff appreciates your patience and thanks you for using these resources.
Glenn Martin & Paul Wiegand.
Fall maintenance downtime, Sat.12.Dec - Sun.21.Dec
- Written by R. Paul Wiegand
- Published: 09 November 2015
Stokes and Newton will be taken down per our bi-annual routine maintenance cycle during the second full week of December. Specifically, the clusters will be unavailable from the morning of Saturday, December 12th through the evening of Sunday, December 21st. You should organize your activities with this downtime in mind.
Over the last few months, ARCC staff has been working on installing a number of upgrades to physical infrastructure. Our high-speed, parallel file system is being replaced, and we will have a whole new FDR InfiniBand network fabric. The focus of this maintenance cycle will be to finish up these upgrades by completing tasks that we cannot complete when Stokes is in operation. In addition, as we have advised for several months, our resource manager and scheduler will be changing to SLURM. After the maintenance cycle, we will be using the new scheduler.
Recall that we now routinely bring the system down twice a year, once in late Fall and once in late Spring. We will notify users in advance of such downtimes, but we recommend you build these expectations into your workflow. Though we anticipate no data loss during this time, it is never a bad idea to back up your materials, so we suggest you use this opportunity to copy salient data and code off of Stokes prior to the downtime.