Finding Rare Numerical Stability Errors in Concurrent Computation

Speaker:
Karine Even, M.Sc. Thesis Seminar
Date:
Wednesday, 27.2.2013, 14:30
Place:
Taub 601
Advisor:
Prof. Eran Yahav and Dr. Hana Chockler

Stability of a numerical algorithm is an important factor in its analysis, since an error exceeding the bound can lead to disastrous results: for example, the Patriot missile failure in 1991 occurred due to rounding errors, explosion of the Ariane 5 rocket in 1996 was a result of an overflow error, and sinking of the Sleipner A offshore platform was caused by a combination of a numerical error and a physical error. Introduction of concurrency to numerical algorithms results in a significant increase in the number of possible computations of the same result and can lead to instability of previously stable algorithms, since rounding combined with the possibility of different interleaving of threads can result in a larger error than expected for some interleaving. Such errors can be very rare, since the particular combination of rounding can occur in only a very small percentage of interleaving, making the detection of such errors an especially challenging task. We present a framework for cross-entropy based testing of concurrent numerical programs with a large number of threads. The framework can be used to detect rare numerical stability errors in concurrent computations. Our framework combines ideas from cross-entropy based testing, together with abstractions that make these techniques effective for programs with large number of threads. We have implemented our approach in a tool ACE, and demonstrated its effectiveness on implementations of several numerical algorithms from the literature with concurrency and rounding by truncation of intermediate computations. These algorithms, with certain precision requirements, are no longer stable, but this instability cannot be detected by random testing. We describe several abstraction algorithms on top of the implementation of the cross-entropy method and show that with abstraction, our algorithms successfully find rare errors in programs with hundreds of threads. In fact, some of our abstractions lead to a state space whose size does not depend on the number of threads at all.

Back to the index of events