News


Here is all the latest news from the UCF ARCC:


Changes due to Fall 2015 Maintenance Cycle

Greetings Stokes users,

Stokes has returned to operation! Please remember that we have two such maintenance downtimes per year, one in late Fall (the one we just completed) and one in late Spring.

There are several substantive changes, and we expect there will be a few unanticipated issues to resolve. Please remain patient with us as we iron these out -- and also, let us know about any issues you encounter by sending a request to req...@ist.ucf.edu.

Changes include the following:

  1. THE SCHEDULING SOFTWARE HAS CHANGED! We are now using SLURM rather than Torque/Moab. Your submit scripts must be rewritten, and you will have to learn some new commands. Documentation is available on our website under Help->Tutorials->Job Submission, and a brief example script follows this list.
  2. We have entirely replaced our old GPFS file system with a new Lustre-based file system. Your user files are still in /home and your group files are still in /groups; however, there are some changes resulting from this.
    • There are no longer any group quotas. The standard user quota is now 500GB and 400K files.
    • The /gpfs/fs0 and /gpfs/fs1 directories no longer exist. If your scripts or code hard-code paths to those directories, they will break -- please use /home or /groups instead.
    • To find out how much space you are using, now use the following command:
      lfs quota -u <username> /lustre/fs0
  3. There are 7 new compute nodes (ec2-8), each with 28 cores and 128 GB of memory.
  4. Several resources are not yet fully operational:
    • The IBM blades are currently down. They are more complicated to integrate into our new IB fabric, and we are working on getting them up and running. Consequently, ec99-ec147 will not be accessible for at least a few days.
    • Newton is currently down. This cluster must be completely rebuilt given the new file system and fabric. We prioritized getting Stokes running instead. We expect Newton to be back up before the start of Term. We will provide more information at that time.
    • The data transfer node must be rebuilt to use the new file system. We expect to have this done by the start of term.
    • Materials Studio gateway is not operational. We are still working on integrating this with the new scheduler.
  5. Queue constraint and operational changes include:
    • There are now only two queues: normal and preemptable. By default, normal is used; you do not need to specify a queue.
    • There are no longer any limitations on the normal queue in terms of the number of nodes or the length of time jobs may run.
    • Instead, we use FairShare, which is explained on our new Job Submission tutorial page.
    • When you reach your DPH limit at the end of the month, you can run via the preemptable queue -- though note that jobs on this queue run with lower priority and can be killed to make room for higher priority jobs if there are no other resources. See our new Job Submission page for details.
  6. Performance changes include:
    • The new network backplane is faster and the supporting switches are less oversubscribed. You should see overall improvement in network performance for distributed jobs.
    • The new file system is much faster for reads (and slightly faster for writes), so you should see overall improvement for file system performance.
    • The new scheduler does a better job of tracking memory and compute-core resources. You can no longer use resources that you did not request and that were not assigned to you. This means that some jobs that ran before will now crash; such jobs will have to be corrected to stay within the resources they request.
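
For those converting Torque/Moab scripts, the sketch below shows roughly what a minimal SLURM submit script can look like. The job name, task count, memory, time limit, module, and program names are illustrative placeholders rather than Stokes-specific defaults; the Job Submission tutorial on our website remains the authoritative reference.

  #!/bin/bash
  # Illustrative SLURM submit script -- every value below is a placeholder.
  #SBATCH --job-name=example_job      # job name shown by squeue
  #SBATCH --ntasks=28                 # number of tasks (e.g., MPI ranks)
  #SBATCH --mem-per-cpu=4G            # memory per core; only requested memory is available to the job
  #SBATCH --time=24:00:00             # wall-clock time limit
  ##SBATCH --partition=preemptable    # uncomment to use the preemptable queue once your DPH limit is reached

  # Load any needed modules and launch the program (module and program names are examples).
  module load openmpi
  srun ./example_program

Roughly speaking, sbatch replaces qsub, squeue replaces qstat, and scancel replaces qdel; see the Job Submission tutorial for the full command reference.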

The ARCC staff appreciates your patience and thanks you for your use of the resources.

Glenn Martin & Paul Wiegand.

Fall maintenance downtime, Sat.12.Dec - Sun.21.Dec

Stokes and Newton will be taken down for our twice-yearly routine maintenance cycle during the second full week of December. Specifically, the clusters will be unavailable from the morning of Saturday, December 12th through the evening of Sunday, December 21st. Please organize your activities with this downtime in mind.

Over the last few months, ARCC staff have been working on a number of upgrades to our physical infrastructure. Our high-speed, parallel file system is being replaced, and we will have an entirely new FDR InfiniBand network fabric. The focus of this maintenance cycle is to finish these upgrades by completing the tasks that cannot be done while Stokes is in operation. In addition, as we have advised for several months, our resource manager and scheduler are changing to SLURM; after the maintenance cycle, we will be using the new scheduler.

Recall that we now routinely bring the system down twice a year, once in late Fall and once in late Spring. We will notify users in advance of such downtimes, but we recommend building this expectation into your workflow. Though we anticipate no data loss during this time, it is never a bad idea to back up your materials, so we suggest using this opportunity to copy salient data and code off of Stokes prior to the downtime.
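
If you want to pull a copy of important files down beforehand, a transfer along the lines below (run from your local machine) is one simple option. The hostname, username, and paths here are placeholders rather than actual Stokes values -- substitute your own login host and directories.

  # Copy a project directory from Stokes to a local backup folder.
  # The hostname, username, and paths are placeholders -- replace them with your own.
  rsync -av youruser@stokes-login.example.edu:/home/youruser/my_project/ ~/stokes_backup/my_project/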

HUGO, September 22, 2015

The first meeting of the HPC User Group of Orlando (HUGO) this semester will be Tuesday, September 22, from 3:00-4:30pm in Partnership III, room 233. We will discuss which computational resources are available at the national level via the XSEDE program and how one goes about applying for access to them. The UCF Advanced Research Computing Center will provide refreshments.

Partnership III
3039 Technology Pkwy
Orlando, FL
https://map.ucf.edu/locations/8126/partnership-iii/

Stokes Returns to Service After Spring Downtime

We are pleased to report that the Stokes HPC is back up and running. We appreciate your patience with our scheduled downtime and would like to take this opportunity to remind you that we now have regular maintenance downtimes at least twice a year (once in late Fall and once in late Spring). This year, we may have another short downtime at the end of Summer, depending on equipment purchases.

While last Fall's maintenance cycle focused on OS and software changes, this cycle concentrated on hardware configuration changes. Consequently, there are very few changes that will affect the way Stokes users access the system. The following is a summary of the changes.

  1. The IBM DDR leaf of the HPC was removed. This means that the blades ec1-ec98 are permanently removed from the system. We now have just over 2,800 cores available, and there are no more 8-core blades. This was necessary for two reasons. First, much of the equipment on that leaf of the HPC was reaching end-of-life and beginning to fail (including the DDR IB switch). Second, we needed to make room in our machine room for new purchases at the end of the Summer. In the short term, Stokes has about 25% fewer cores than it had a week ago -- expect higher utilization and somewhat longer queue times.
  2. The main login node was replaced with a larger machine that has more cores and more memory. This allowed us to relax some of the ulimit constraints on users on the login node.
  3. The web server node has been replaced. This should improve the responsiveness of our website, http://arcc.ist.ucf.edu/
  4. The amount of memory registrable by the InfiniBand driver on each blade was marginally increased. This should not affect most people, but it will hopefully reduce the MPI warnings that some users see when sending very large messages with OpenMPI.
About the UCF ARCC:
The University of Central Florida (UCF) Advanced Research Computing Center is managed by the Institute for Simulation and Training, with subsidies from the UCF Provost and the Vice President for Research and Commercialization, for use by all UCF faculty and their students. Collaboration with other universities and industry is also possible.
Contact Info:
UCF Advanced Research Computing Center
3039 Technology Parkway, Suite 220
Orlando, FL 32826
P: 407-882-1147
Request Help