Linux Today: Linux News On Internet Time.

PR: Louisiana Tech's RAS-ware Runtime Breakthrough in HPC Clusters

Nov 02, 2005, 22:00

Louisiana Tech's eXtreme Computing Research (XCR) group today unveiled a breakthrough in the RAS-ware runtime for transparent job queue fault tolerance in HPC cluster environments.

Dr. Box Leangsuksun, an associate professor in computer science, explained that XCR's breakthrough consists of High Availability, Self-configuration, and Self-healing as enabling solutions. His group of graduate students, led by Anand Tikotekar and Kshitij Limaye, has implemented a proof-of-concept Beowulf cluster based on HA-OSCAR 1.1 and a standard HPC resource management/job queue system (e.g., PBS/TORQUE). Preliminary results suggest that MPI jobs can continue their execution and that the job queue is preserved regardless of failures at the head node and compute nodes.

The experiment runs standard MPI jobs without any modification under LAM/MPI 7.0. The breakthrough handles both running and queued jobs transparently, and the queue order is maintained even in the face of a catastrophic failure. The open source HA-OSCAR multi-head solution provides failover capability and transparently recovers the job queue in the event of a head-node outage.

"This is very exciting for us," said Leangsuksun. "This marks a major milestone in our overarching goal--toward non-stop services in an HPC environment. We expect that our breakthrough technology is exactly what the community has been waiting for."

Leangsuksun continued, "Our breakthrough is also expected to be part of the next HA-OSCAR release that will have broad impacts in HPC and telecomm cluster environments, especially for mission critical applications."

This RAS-aware runtime breakthrough came out of the MOLAR project through a collaboration between Louisiana Tech's eXtreme Computing Research (XCR) group and the Network and Cluster Computing (NCC) group at Oak Ridge National Laboratory (ORNL).
