Time+Place: Sunday 15/04/2007 12:30 Room 337-8 Taub Bld. (NOTE UNUSUAL TIME)
Title: Scalable Breakpoints, Watchpoints, and Checkpoints for Supercomputers
Speaker: Larry Rudolph http://csg.csail.mit.edu/u/r/rudolph/public_html/
Affiliation: MIT CSAIL
Host: Ron Pinter

Abstract:

Both breakpoints and checkpoints are effective for dealing with
program crashes: breakpoints help programmers debug and checkpoints
help users mitigate the damage of a system crash.  Traditional
methods carry high overheads that limit their scalability;
this talk presents low-overhead mechanisms.

A checkpoint requires the computation to stop while the data is
written back to stable storage.  The extra load on the network can
also adversely affect other jobs.  If it takes time T to
checkpoint an application, it seems clear that the application should
wait at least time T between checkpoints.  But who decides when
a checkpoint should be taken?  The programmer knows the best places
in the code to perform a checkpoint, but the system may know the best
time to perform one.  We propose a scheme in which the programmer
liberally places checkpoints into the code but the system
conservatively chooses to skip some of them when the risk of a crash
before the next checkpoint is small.  This strategy is very effective
when the system has an accurate view of its own current state.
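
As a rough illustration of the skipping idea, here is a minimal sketch
in Python.  It is not the mechanism presented in the talk: the class
name CheckpointPolicy, the exponential failure model, and the
parameters failure_rate and risk_threshold are all assumptions made
for this example.  The program requests a checkpoint at every
programmer-chosen point, and the system honors the request only when
at least time T has passed and the estimated crash risk is no longer
small.

```python
import math


class CheckpointPolicy:
    """Hypothetical system-side policy that may skip requested checkpoints."""

    def __init__(self, checkpoint_cost, failure_rate, risk_threshold):
        self.checkpoint_cost = checkpoint_cost  # T: seconds needed to write a checkpoint
        self.failure_rate = failure_rate        # assumed crashes per second
        self.risk_threshold = risk_threshold    # skip while crash probability is below this
        self.last_checkpoint = 0.0              # time of the last checkpoint actually taken

    def crash_probability(self, interval):
        # Assumption: failures follow an exponential model with a constant rate.
        return 1.0 - math.exp(-self.failure_rate * interval)

    def request(self, now, expected_gap):
        """Called at a programmer-placed checkpoint; returns True if it is taken.

        expected_gap is the estimated time until the next checkpoint
        opportunity in the code.
        """
        since_last = now - self.last_checkpoint
        # Never checkpoint more often than once every T.
        if since_last < self.checkpoint_cost:
            return False
        # Work that would be lost if we skip and crash before the next opportunity.
        exposed = since_last + expected_gap
        if self.crash_probability(exposed) < self.risk_threshold:
            return False
        self.last_checkpoint = now
        return True
```

With checkpoint_cost=60, failure_rate=1e-4, risk_threshold=0.05 and a
request every 100 seconds, this sketch takes a checkpoint only at
t=500 and t=1000 out of the first ten requests; the rest are skipped.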

A memory breakpoint requires that every memory access be monitored.
There is often special hardware to track several particular memory
locations. But there are times when a programmer wishes to track a
million memory locations.  In particular, we describe a new
methodology for Ubiquitous Memory Introspection, which is online and
lightweight: it uses fast mini-simulations to analyze short memory
access traces recorded from frequently executed code regions.  The
simulations provide profiling results at varying granularities, down
to that of a single instruction or address.  These results can be used
as a debugging tool or as a metric for the state of the application,
which may help in checkpointing decisions.
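
To make the notion of a mini-simulation concrete, the sketch below
replays a short recorded trace through a tiny direct-mapped cache
model and reports misses per instruction and per address.  This is an
illustrative assumption, not the simulator described in the talk: the
function name simulate_cache, the (pc, address) trace format, and the
cache parameters are all hypothetical.

```python
from collections import Counter


def simulate_cache(trace, num_lines=4, line_size=64):
    """Replay (pc, address) pairs through a small direct-mapped cache.

    Returns two Counters: misses per instruction address (pc) and
    misses per data address.
    """
    tags = [None] * num_lines        # block currently held by each cache line
    misses_by_pc = Counter()
    misses_by_addr = Counter()
    for pc, addr in trace:
        block = addr // line_size    # memory block that contains this address
        index = block % num_lines    # cache line the block maps to
        if tags[index] != block:     # miss: load the block and charge it
            tags[index] = block
            misses_by_pc[pc] += 1
            misses_by_addr[addr] += 1
    return misses_by_pc, misses_by_addr


# A short trace in which two instructions keep evicting each other's data.
trace = [(0x400, 0), (0x404, 256), (0x400, 0), (0x404, 256),
         (0x408, 64), (0x408, 64)]
by_pc, by_addr = simulate_cache(trace)
for pc, count in sorted(by_pc.items()):
    print(f"pc={pc:#x} misses={count}")
```

Because the trace is short and the model is small, such a replay is
cheap enough to run while the application executes, and the
per-instruction counts point directly at the accesses that conflict.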