Louisiana Tech’s eXtreme Computing Research (XCR) group unveiled a
breakthrough development today in the RAS-aware runtime for
transparent job queue fault tolerance in HPC cluster
environments.
Dr. Box Leangsuksun, an associate professor in computer science,
explained that XCR’s breakthrough consists of High Availability,
Self-configuration, and Self-healing as enabling solutions. His
group of graduate students, led by Anand Tikotekar and Kshitij
Limaye, has implemented a proof-of-concept Beowulf cluster based on
HA-OSCAR 1.1 and a standard HPC resource management/job queue system
(e.g., PBS/TORQUE). Preliminary results suggest that MPI jobs can
continue their execution and that the job queue is preserved despite
failures at the head node or compute nodes.
The experiment runs standard MPI jobs without any modification
under LAM/MPI 7.0. The runtime handles both running and queued
jobs transparently, and it maintains the queue order even in the
face of a catastrophic failure. The open source HA-OSCAR multi-head
solution provides failover capability and transparently recovers
the job queue in the event of a head-node outage.
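As a rough illustration of the idea described above, the following toy model mirrors an ordered job queue from an active head node to a standby, then promotes the standby when the active node fails. It is only a sketch under stated assumptions: the names HeadNode, Cluster, submit, start_next, and fail_active are hypothetical, and the whole-queue copy stands in for whatever replication HA-OSCAR and the resource manager actually perform, which the announcement does not detail.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class HeadNode:
    """Toy head node (hypothetical model, not HA-OSCAR code)."""
    name: str
    queued: deque = field(default_factory=deque)   # jobs waiting, in order
    running: list = field(default_factory=list)    # jobs already dispatched


class Cluster:
    """Active/standby head-node pair with a mirrored job queue."""

    def __init__(self):
        self.active = HeadNode("head-primary")
        self.standby = HeadNode("head-standby")

    def _mirror(self):
        # Assumption: copy the full state on every change. A real system
        # would more likely replicate incrementally.
        self.standby.queued = deque(self.active.queued)
        self.standby.running = list(self.active.running)

    def submit(self, job_id):
        """Append a job to the end of the queue and mirror the change."""
        self.active.queued.append(job_id)
        self._mirror()

    def start_next(self):
        """Move the job at the front of the queue to the running list."""
        job_id = self.active.queued.popleft()
        self.active.running.append(job_id)
        self._mirror()
        return job_id

    def fail_active(self):
        """Simulate a head-node outage: promote the standby."""
        self.active = self.standby
        self.standby = HeadNode("head-replacement")
        self._mirror()


cluster = Cluster()
for job in ("job-1", "job-2", "job-3"):
    cluster.submit(job)
cluster.start_next()      # job-1 starts running
cluster.fail_active()     # the primary head node goes down
print(cluster.active.name, list(cluster.active.running), list(cluster.active.queued))
```

In this model the promoted node still lists job-1 as running and holds job-2 and job-3 in their original order, which is the property the announcement reports for both running and queued jobs.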
“This is very exciting for us,” said Leangsuksun. “This marks a
major milestone in our overarching goal–toward non-stop services
in an HPC environment. We expect that our breakthrough technology
is exactly what the community has been waiting for.”
Leangsuksun continued, “Our breakthrough is also expected to be
part of the next HA-OSCAR release that will have broad impacts in
HPC and telecom cluster environments, especially for
mission-critical applications.”
This RAS-aware runtime breakthrough came out of the MOLAR project
through a collaboration
between Louisiana Tech’s eXtreme Computing Research (XCR) group and
the Network and Cluster Computing (NCC) group at Oak Ridge National
Laboratory (ORNL).
Web Webster