In this chapter we document the parallel FFTW routines for parallel systems supporting the MPI message-passing interface. Unlike the shared-memory threads described in the previous chapter, MPI allows you to use distributed-memory parallelism, where each CPU has its own separate memory, and which can scale up to clusters of many thousands of processors. This capability comes at a price, however: each process only stores a portion of the data to be transformed, which means that the data structures and programming-interface are quite different from the serial or threads versions of FFTW.
Distributed-memory parallelism is especially useful when you are transforming arrays so large that they do not fit into the memory of a single processor. The storage per-process required by FFTW's MPI routines is proportional to the total array size divided by the number of processes. Conversely, distributed-memory parallelism can easily pose an unacceptably high communications overhead for small problems; the threshold problem size for which parallelism becomes advantageous will depend on the precise problem you are interested in, your hardware, and your MPI implementation.
A note on terminology: in MPI, you divide the data among a set of
“processes” which each run in their own memory address space.
Generally, each process runs on a different physical processor, but
this is not required. A set of processes in MPI is described by an
opaque data structure called a “communicator,” the most common of
which is the predefined communicator
refers to all processes. For more information on these and
other concepts common to all MPI programs, we refer the reader to the
documentation at the MPI home page.
We assume in this chapter that the reader is familiar with the usage of the serial (uniprocessor) FFTW, and focus only on the concepts new to the MPI interface.