The 2005 IEEE International Conference on Cluster Computing
 
Burlington Marriott, Burlington, MA, USA

 

Towards Highly Available, Scalable, and Secure HPC Clusters

with HA-OSCAR

Level

 

The tutorial is intended for scientists and engineers interested in learning the state of the art in building highly available clusters for high performance and enterprise computing using Linux and Open Source tools and software.

 

Duration

A half day, 1:00-5:00pm, September 26, 2005.

 

Presenters

Dr. Chokchai (Box) Leangsuksun

Center for Entrepreneurship and Information Technology
Louisiana Tech University
P.O. Box 10348
Ruston, LA 71272
USA
Phone:   1.318.257.3291
Fax:      1.318.257.4922
Email:    box@latech.edu

 

Ibrahim Haddad

Open Source Development Labs
2725 SW Millikan Way, Suite 400
Beaverton, OR 97005
USA
Phone:   1.503.906.1914
Fax:      1.503.626.2436
Email:    ibrahim@osdl.org

 

Dr. Stephen L. Scott

Computer Science and Mathematics Division        

Oak Ridge National Laboratory
One Bethel Valley Road
P.O. Box , MS-6016

Oak Ridge, TN 37831-6016

USA
Phone:   1.865.574.3144
Fax:      1.865.576.5491
Email:    scottsl@ornl.gov

 

 

Abstract

 

March 2004 was a major milestone for the HA-OSCAR Working Group: it marked the first public release of the HA-OSCAR software package. HA-OSCAR is an Open Source project that aims to combine the power of high availability and high performance computing. HA-OSCAR enhances a Beowulf cluster system for mission-critical applications with high availability mechanisms such as component redundancy to eliminate single points of failure, self-healing, and failure detection and recovery, in addition to supporting automatic failover and fail-back.

The first release (version 1.0) supports new high availability capabilities for Linux Beowulf clusters based on the OSCAR 3.0 release from the Open Cluster Group. In this release of HA-OSCAR, we provide a graphical installation wizard and a web-based administration tool, which allow intuitive creation and configuration of a multi-head Beowulf cluster. In addition, we have included a default set of monitoring services to ensure that critical services, hardware components, and important cluster resources are always available at the control node. HA-OSCAR also supports new tailored services that can be configured and added via a WebMin-based HA-OSCAR administration tool.
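The monitoring, self-healing, and failover behavior described above can be illustrated with a minimal Python sketch. This is not HA-OSCAR code: the function names, the service names (`nfs`, `pbs`), and the head node names are hypothetical, and the service state is simulated in a dictionary rather than probed on a real system. The sketch assumes a simple policy in which a failed service is first restarted locally, and control moves to the standby head node only if a service stays down.

```python
from typing import Callable, Dict, List

Check = Callable[[], bool]     # returns True when the service is healthy
Restart = Callable[[], bool]   # returns True when the restart succeeded


def monitor_once(checks: Dict[str, Check], restarts: Dict[str, Restart]) -> List[str]:
    """Run every check once and try a local restart for each failed service.

    Returns the names of services that are still down after the restart
    attempt, i.e. the ones that would justify a failover.
    """
    unrecovered = []
    for name, check in checks.items():
        if check():
            continue  # service is healthy, nothing to do
        restart = restarts.get(name)
        # Self-healing step: restart, then re-check to confirm recovery.
        if restart is None or not restart() or not check():
            unrecovered.append(name)
    return unrecovered


def next_active(active: str, standby: str, unrecovered: List[str]) -> str:
    """Keep the current head node unless some service could not be recovered."""
    return standby if unrecovered else active


# Simulated state (hypothetical): both services start out failed.
state = {"nfs": False, "pbs": False}


def restart_nfs() -> bool:
    state["nfs"] = True  # pretend the restart brought the service back
    return True


checks = {"nfs": lambda: state["nfs"], "pbs": lambda: state["pbs"]}
restarts = {"nfs": restart_nfs, "pbs": lambda: False}  # the pbs restart fails

down = monitor_once(checks, restarts)
print(down)                                                # ['pbs']
print(next_active("head-primary", "head-standby", down))   # head-standby
```

In this run the simulated `nfs` service recovers through a local restart, while `pbs` does not, so the sketch selects the standby head node. A real deployment would replace the lambdas with actual probes and would repeat the check on a timer.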

This tutorial will address in detail the design and implementation issues related to building HA Linux Beowulf clusters using Linux and Open Source software as the base technology, with a focus on HA-OSCAR. We will present the architecture of HA-OSCAR, review the new features of the latest release, discuss how we implemented the HA and security features, and discuss our experiments covering modeling and testing of performance and availability on real systems.

 

Background on HA-OSCAR:

The HA-OSCAR project’s primary goal is to improve the existing Beowulf architecture and cluster management systems while providing high-availability and scalability capabilities for Linux clusters. HA-OSCAR introduces several enhancements and new features to OSCAR, mainly in the areas of availability, scalability, and security. The new features in the initial release are head node redundancy and self-recovery from hardware, service, and application outages. HA-OSCAR has been tested with several OSCAR distributions and should work with OSCAR 2.3, 2.3.1, and 3.0 based on Red Hat 9.0, and with OSCAR 4.0 based on Fedora Core 2. The first version (1.0) was released on March 22, 2004, bringing more than 5,000 hits to the HA-OSCAR site within 48 hours. The announcement was featured in LinuxWorld magazine and on O’Reilly, HPC Wire, ClusterWorld, and Slashdot.
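As a rough illustration of why head node redundancy matters, the following Python sketch applies the standard steady-state availability formula, availability = MTTF / (MTTF + MTTR), to a single head node and to a redundant pair. The function names and the MTTF/MTTR figures are assumptions chosen for illustration, not HA-OSCAR measurements, and the redundant case assumes independent failures and ideal (instant, always successful) failover.

```python
HOURS_PER_YEAR = 8760.0


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one component: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def redundant_availability(a: float, n: int = 2) -> float:
    """Availability of n redundant components, assuming independent failures
    and ideal failover: the system is down only when all n are down."""
    return 1.0 - (1.0 - a) ** n


def downtime_hours_per_year(a: float) -> float:
    """Expected yearly downtime implied by availability a."""
    return (1.0 - a) * HOURS_PER_YEAR


# Hypothetical figures for a single head node (illustrative only).
single = availability(mttf_hours=1000.0, mttr_hours=10.0)
dual = redundant_availability(single, n=2)

print(f"single head node: {single:.5f}, {downtime_hours_per_year(single):.2f} h/year down")
print(f"dual head nodes:  {dual:.5f}, {downtime_hours_per_year(dual):.2f} h/year down")
```

Real failover takes time and can itself fail, so a measured improvement would be smaller than this idealized estimate.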

 

Tutorial Detailed Description

 


Introduction (20%)

 

·         Introduction to HA clustering

·         Various levels of HA

·         Linux: the commodity component of the cluster stack

·         Software and hardware system architecture

·         Challenges in Designing and Prototyping HA/HPC Clusters

  • Booting the cluster
  • Storage
  • Traffic distribution mechanisms
  • Load balancing mechanisms
  • Building redundancy at various levels in the cluster:
  • Ethernet redundancy
  • DHCP/TFTP/NTP/NFS servers’ redundancy
  • Data redundancy using software RAID
  • File systems for HA Linux clusters

 

OSCAR (20%)

 

·         Introduction

·         Cluster Computing Overview

·         OSCAR - "The Beginning" - Overview / Strategy

·         OSCAR Components (Functional areas)

          o Core, Admin/Config, HPC Services 

          o Core Components: SIS, C3, Switcher, ODA, OPD

·         "The OSCAR Trail..."

          o Release survey (2.0 - 3.0 & comments on 4.x)

·         OSCAR Wizard (v3.0)

 

HA-OSCAR (50%)

 

·         HA-OSCAR overview

·         HA-OSCAR architecture and components

·         HA-OSCAR comparison with Beowulf architecture

·         HA features

·         Multi-head builder and Self-configuration

·         Monitoring

o        Service monitoring

o        Hardware monitoring

o        Resource monitoring

·         Self-healing and recovery mechanism

·         Test environment

·         Installation Steps

·         Experiments

·         Availability modeling, analysis and uptime improvement study between Beowulf and HA-OSCAR

·         Test results

·         Applications and feasibility studies

·         Grid-enabled HA cluster

·         HA-OSCAR and Distributed Security Infrastructure integration

 

Demonstration (with 4 laptops running the latest research release of HA-OSCAR)

 

Conclusion (10%)

 

·         HA-OSCAR Roadmap

·         Advanced research

·         Questions and answers


 

Presenters’ Bios

 

Dr. Chokchai Leangsuksun is an Associate Professor of Computer Science at Louisiana Tech University. In March 1995, Dr. Box Leangsuksun started his career at AT&T Network Systems, which later became Lucent Technologies, where he gained practical experience in telecom systems, reliability and high availability, and software engineering research and development. He led the Lucent Technologies R&D team, in both technical and project management roles, in creating a number of next-generation network and service management systems and in ensuring system reliability in several mission-critical products. In February 2002, he accepted an associate professor position at Louisiana Tech University, where he has taught Computer Science and played a significant R&D role in the Center for Entrepreneurship and Information Technology. Within a short time, Box has begun to establish his name and research recognition by founding and co-chairing a high availability and performance workshop and by serving on the program committees of various conferences and workshops. He recently released HA-OSCAR, the first field-grade HA Beowulf cluster software. In September 2003, he received an outstanding teaching award from the College of Engineering and Science, Louisiana Tech University.

 

Ibrahim Haddad is a member of the OSDL Engineering Department, acting as Strategic Program Manager for the Carrier Grade Linux Initiative. Prior to joining OSDL, Ibrahim was a Senior Researcher at the "Research and Innovation" Unit, Ericsson Research Corporate Unit, in Montreal, Canada, where he was involved with the server system architecture for 3G wireless IP networks and with promoting the use of Linux in telecommunications.

Ibrahim is Contributing Editor to the Linux Journal and LinuxWorld magazine. In addition, he contributes regularly to the O’Reilly Network, Sys Admin Magazine, and Linux User & Developer magazine. He has delivered a number of presentations and tutorials at local universities, IEEE and ACM conferences, Open Source forums, and international conferences.

Ibrahim contributed to two of Richard Petersen’s books, "Red Hat Linux Pocket Administrator" and "Red Hat: The Complete Reference (DVD Edition)", both published by McGraw-Hill/Osborne. He received his Bachelor’s and Master’s degrees in Computer Science from the Lebanese American University. He is currently a Dr. Sc. candidate at Concordia University in Montreal, researching “Scalable Architectures for High-Availability Web Server Clusters”.

The following is the list of tutorials previously presented by Ibrahim Haddad:

• “Design and Implementation of HA Linux Clusters”, IEEE Cluster Conference 2001

• “Design and Implementation of Benchmarking Environments”, ACM Sigmetrics 2002

• “Supporting IPv6 on Linux Servers”, Ottawa Linux Symposium 2002

• “Supporting IPv6 on Linux Clusters”, IEEE Cluster Conference 2002

• “Networking Protocols for UMTS and 3G Services”, ACM Multimedia 2002

• “IPv6: The New Internet Protocol - All You Wanted to Know”, Real World Linux 2003

• “IPv6: The New IP Protocol”, Internetworking 2003

• “Carrier Grade Linux Platforms: Characteristics and Development Efforts”, EuroPar 2003

• “Carrier Grade Linux”, Real World Linux 2004

• “Wireless Carrier Grade Platforms: Characteristics and Ongoing Development Efforts”, International Conference on E-Business and Telecommunication Networks 2004

• “HA-OSCAR: Building Highly Available Linux Clusters”, IEEE Cluster 2004.

 

Dr. Stephen L. Scott is a senior research scientist in the Network and Cluster Computing Group of the Computer Science and Mathematics Division of Oak Ridge National Laboratory (ORNL), USA. He received his Ph.D. and M.S. in computer science from Kent State University, Kent, Ohio, USA. At Oak Ridge National Laboratory, Stephen’s responsibilities include research and development efforts in high performance scalable cluster computing as well as directing research staff in this effort. His primary research interest is in experimental systems with a focus on high performance, scalable, distributed, heterogeneous, and parallel computing. He is also a contributor to the Parallel Virtual Machine (PVM) and Heterogeneous Adaptable Reconfigurable NEtworked SystemS (HARNESS) research efforts at ORNL. Stephen is a founding member as well as a steering committee member of The Open Cluster Group (OCG) (http://www.OpenClusterGroup.org). This organization is a consortium of research and industry that is dedicated to making cluster computing practical for high performance computing. Stephen is also a founding member and past working group chair of the OCG’s primary working group, Open Source Cluster Application Resources (OSCAR) (http://OpenClusterGroup.org/OSCAR). This working group is dedicated to bringing current “best practices” in cluster computing to all users via a self-installing cluster-on-a-CD suite. He is also a member of ACM, IEEE Computer, and the IEEE Task Force on Cluster Computing. Prior to attending graduate school, Stephen held various industry positions, including one as principal with a business-to-business chemical database startup company. Additional research information may be found at http://www.csm.ornl.gov/~sscott.