13/07/2020

Understanding and Finding Crash-Consistency Bugs in Parallel File Systems

Jinghan Sun, Chen Wang, Jian Huang, Marc Snir

Abstract: Parallel file systems (PFSes) and parallel I/O libraries have been the backbone of high-performance computing (HPC) infrastructures for decades. However, their crash-consistency bugs have not been extensively studied, and the corresponding bug-finding and testing tools are lacking. In this paper, we first conduct a thorough bug study of popular PFSes, such as BeeGFS and OrangeFS, with a cross-stack approach that covers the HPC I/O library, the PFS, and their interactions with local file systems. The study results drive our design of a scalable testing framework named PFSCheck. PFSCheck is easy to use and incurs low performance overhead: it automatically generates test cases for triggering potential crash-consistency bugs and traces essential file operations. PFSCheck also scales to large HPC clusters, as it exploits parallelism to facilitate the verification of persistent storage states.

The talk and the corresponding paper were published at the HotStorage 2020 virtual conference.
