CIVILICA We Respect the Science
(Specialized publisher of national conferences / Publishing license number from the Ministry of Culture and Islamic Guidance: 8971)

A Review on Fault Tolerance Techniques for High Performance Computing

Paper title: A Review on Fault Tolerance Techniques for High Performance Computing
National paper ID: CSITM01_085
Published in: National Conference on Computer Engineering and Information Technology Management, 1393 (Solar Hijri calendar)
Authors:

Ahmad Fadaei Tehrani - Dept. of Computer, Najafabad Branch, Islamic Azad University of Najafabad
Framarz Safi - Dept. of Computer, Najafabad Branch, Islamic Azad University of Najafabad

Abstract:
Cloud computing is the next generation of computing. By using a large number of virtual machines for computationally intensive applications, it brings new capacity and flexibility to HPC (High Performance Computing) applications. Today's high performance computing systems are typically managed and operated privately by individual organizations. A cloud-based Infrastructure-as-a-Service (IaaS) approach for high performance computing applications promises cost savings and more flexibility. HPC systems may fail because of large workloads and the large number of servers involved. Fault tolerance techniques allow HPC systems on the cloud to execute computationally intensive applications across multiple nodes. Fault tolerance helps tasks perform as well as possible in the presence of hardware and software faults. However, the main failures are mostly hardware based. System availability is also very important, and fault tolerance techniques are used to detect and predict faults. This paper gives an overview of the most popular fault tolerance techniques in HPC, together with the prediction models and tools used in HPC.

Keywords:
High Performance Computing, Reactive Fault Tolerance, Proactive Fault Tolerance, Prediction models, Artificial Intelligent Computing, Time series models

Paper page and full-text download: https://civilica.com/doc/282626/