%0 Conference Proceedings
%T HPC-SFI: System-Level Fault Injection for High Performance Computing Systems
%+ Sino-German Joint Software Institute
%A Wang, Yanqi
%A Zhang, Qi
%A Liu, Yi
%A Qian, Depei
%< peer-reviewed
%( Lecture Notes in Computer Science
%B 15th IFIP International Conference on Network and Parallel Computing (NPC)
%C Muroran, Japan
%Y Feng Zhang
%Y Jidong Zhai
%Y Marc Snir
%Y Hai Jin
%Y Hironori Kasahara
%Y Mateo Valero
%I Springer International Publishing
%3 Network and Parallel Computing
%V LNCS-11276
%P 103-113
%8 2018-11-29
%D 2018
%R 10.1007/978-3-030-05677-3_9
%Z Computer Science [cs]; Conference papers
%X Resilience/fault tolerance has become a key challenge for large-scale parallel systems. To ensure the reliability of high performance computing (HPC) systems, various techniques have been proposed, such as hardware-level fault tolerance, checkpointing, replication, and algorithm-based fault tolerance. There are also many software systems that monitor and handle system failures, e.g., the management and job-scheduling systems of HPC systems. To evaluate the effectiveness of these systems, a tool is needed to inject failures into an HPC system. This paper proposes HPC-SFI, a system-level fault injection tool for HPC systems. HPC-SFI can generate three kinds of system failures in an HPC system: in-node faults, failures in the interconnection network, and failures of the storage/parallel file system. In addition, HPC-SFI can inject system faults according to a pseudo-random model driven by pre-defined parameters and probabilities. Preliminary experimental results demonstrate the effectiveness of the tool.
%G English
%Z TC 10
%Z WG 10.3
%2 https://inria.hal.science/hal-02279558/document
%2 https://inria.hal.science/hal-02279558/file/477597_1_En_9_Chapter.pdf
%L hal-02279558
%U https://inria.hal.science/hal-02279558
%~ IFIP-LNCS
%~ IFIP
%~ IFIP-TC
%~ IFIP-TC10
%~ IFIP-NPC
%~ IFIP-WG10-3
%~ IFIP-LNCS-11276