CSC 680: Fault Tolerant System

Offered Under: M.Sc. in Computer Science (CSC)
Description

Fault-tolerant computing is a generic term for redundant design techniques that use duplicate components or repeated computations to keep a system operating without interruption when components fail (faults). The central theme of this course is the use of reliability and availability computations as a means of comparing fault-tolerant designs. The course defines fault-tolerant computer systems and illustrates the importance of these techniques in improving the reliability and availability of digital systems.

Topics include:
  • Introduction to redundancy theory, limit theorems, decision theory in redundant systems.
  • Hardware fault tolerance: computer redundancy, detection of faults, replication and compression techniques, self-repairing techniques, concentrated and distributed voters, models of fault-tolerant computers.
  • Software fault tolerance: fault tolerance versus fault intolerance, fault tolerance objectives.
  • Errors and their management strategies, implementation of error management strategies.
  • Software fault tolerance techniques, software defence, protective redundancy.
  • Architectural support of fault-tolerant software protection mechanisms, recovery mechanisms.
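
As an illustration of the kind of reliability and availability computation the description refers to, the following is a minimal sketch that compares a single (simplex) module with a triple modular redundant (TMR) arrangement. It is not taken from the course material. It assumes a constant failure rate, independent module failures, and a perfect voter; the function names and the numeric values are hypothetical and chosen only for the example.

```python
import math


def simplex_reliability(failure_rate, t):
    """Reliability of one module at time t, assuming a constant failure rate."""
    return math.exp(-failure_rate * t)


def tmr_reliability(failure_rate, t):
    """Reliability of a 2-out-of-3 (TMR) system.

    Assumes independent, identical modules and a perfect voter:
    R_tmr = 3*R^2 - 2*R^3, where R is the single-module reliability.
    """
    r = simplex_reliability(failure_rate, t)
    return 3 * r**2 - 2 * r**3


def steady_state_availability(mttf, mttr):
    """Steady-state availability from mean time to failure and mean time to repair."""
    return mttf / (mttf + mttr)


if __name__ == "__main__":
    failure_rate = 1e-3  # failures per hour (illustrative value)
    for hours in (100, 1000):
        print(hours,
              round(simplex_reliability(failure_rate, hours), 4),
              round(tmr_reliability(failure_rate, hours), 4))
    print(steady_state_availability(mttf=990.0, mttr=10.0))
```

Under these assumptions, TMR is more reliable than a single module only while the module reliability stays above 0.5, that is, for mission times shorter than ln(2) divided by the failure rate. For longer missions the simplex design is the more reliable of the two, which is why such computations are useful for comparing designs.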


Prerequisites:
  • None

Course Type: Major
Credit Hours: 3
Lecture Hours: 45
Expected Outcome(s):
  • Understand the differences between fault, error and failure. 
  • Evaluate reliability of systems with permanent and temporary faults.
  • Understand the various methods for software fault tolerance.

Suggested Books:
  1. Fault-Tolerant Systems by Israel Koren and C. Mani Krishna
  2. Design and Analysis of Fault Tolerant Digital Systems by Barry W. Johnson

Grading Policy: