Paper on checkpointing for ML accepted at ASPLOS'25
The paper "PCcheck: Persistent Concurrent Checkpointing for ML" by Foteini Strati*, Michal Friedman*, and Ana Klimovic was accepted to the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'25).
The paper addresses the frequent failures and resource preemptions encountered on large-scale clusters used for ML training, and proposes a novel checkpointing system that enables frequent snapshots of training state with minimal overhead.
* equal contribution