Technical Report MSC-2016-15

Title: Parallel Additive Fast Fourier Transform Algorithms
Authors: Matan Hamilis
Supervisors: Eli Ben-Sasson, Mark Silberstein
PDF: Currently accessible only within the Technion network
Abstract: Fast Fourier Transforms (FFTs), particularly over finite fields, play a central role in a wide range of applications in signal and image processing, coding, and cryptography. Computing additive FFTs over finite fields is considered simpler and more scalable than computing multiplicative FFTs, owing to the additive and recursive structure of finite fields. In this work we present an implementation of an algorithm that computes additive FFTs over finite fields of characteristic two (binary fields) to evaluate and interpolate polynomials of high degree over large affine subspaces. Whereas previous works applied only to linear subspaces, we make a small modification to an existing algorithm so that it computes additive FFTs over affine subspaces as well. We present a parallel implementation of this algorithm for the GPU architecture and discuss its performance.

The FFT algorithm relies on an implementation of finite field arithmetic. Binary fields are used in a variety of applications in cryptography and data storage. Multiplication of two finite field elements is a fundamental operation and a well-known computational bottleneck in many of these applications, as they often require multiplying a large number of elements. In this work we focus on accelerating multiplication in "large" binary fields, of sizes greater than 2^32. We devise a new parallel algorithm optimized for execution on GPUs. This algorithm makes it possible to multiply a large number of finite field elements, and achieves high performance via bit-slicing and fine-grained parallelization.

The key to the efficient implementation of the algorithm is a novel performance optimization methodology we call the register cache. This methodology speeds up an algorithm that caches its input in shared memory by transforming the code to use per-thread registers instead. We show how to replace shared memory accesses with the shuffle intra-warp communication instruction, thereby significantly reducing or even eliminating shared memory accesses. We thoroughly analyze the register cache approach and characterize its benefits and limitations.

We apply the register cache methodology to the implementation of the binary finite field multiplication algorithm on GPUs. We achieve up to a 138x speedup for fields of size 2^32 over the popular, highly optimized number theory library NTL, which uses the specialized CLMUL CPU instruction, and over 30x for larger fields of sizes up to 2^256. Our register cache implementation achieves up to 50% higher performance than the traditional shared-memory-based design.

Copyright: The above paper is copyrighted by the Technion, the author(s), or others. Please contact the author(s) for more information.

Remark: Any link to this technical report should be to this page, rather than to the URL of the PDF files directly. The latter URLs may change without notice.


Computer science department, Technion