Next: Multi-dimensional MPI DFTs of Real Data, Previous: 2d MPI example, Up: Distributed-memory FFTW with MPI

The most important concept to understand in using FFTW's MPI interface is the data distribution. With a serial or multithreaded FFT, all of the inputs and outputs are stored as a single contiguous chunk of memory. With a distributed-memory FFT, the inputs and outputs are broken into disjoint blocks, one per process.

In particular, FFTW uses a *1d block distribution* of the data,
distributed along the *first dimension*. For example, if you
want to perform a 100 × 200 complex DFT, distributed over 4
processes, each process will get a 25 × 200 slice of the data.
That is, process 0 will get rows 0 through 24, process 1 will get rows
25 through 49, process 2 will get rows 50 through 74, and process 3
will get rows 75 through 99. If you take the same array but
distribute it over 3 processes, then it is not evenly divisible so the
different processes will have unequal chunks. FFTW's default choice
in this case is to assign 34 rows to processes 0 and 1, and 32 rows to
process 2.

FFTW provides several ‘`fftw_mpi_local_size`’ routines that you can
call to find out what portion of an array is stored on the current
process. In most cases, you should use the default block sizes picked
by FFTW, but it is also possible to specify your own block size. For
example, with a 100 × 200 array on three processes, you can
tell FFTW to use a block size of 40, which would assign 40 rows to
processes 0 and 1, and 20 rows to process 2. FFTW's default is to
divide the data equally among the processes if possible, and as best
it can otherwise. The rows are always assigned in “rank order,”
i.e. process 0 gets the first block of rows, then process 1, and so
on. (You can change this by using `MPI_Comm_split`

to create a
new communicator with re-ordered processes.) However, you should
always call the ‘`fftw_mpi_local_size`’ routines, if possible,
rather than trying to predict FFTW's distribution choices.

In particular, it is critical that you allocate the storage size that
is returned by ‘`fftw_mpi_local_size`’, which is *not*
necessarily the size of the local slice of the array. The reason is
that intermediate steps of FFTW's algorithms involve transposing the
array and redistributing the data, so at these intermediate steps FFTW
may require more local storage space (albeit always proportional to
the total size divided by the number of processes). The
‘`fftw_mpi_local_size`’ functions know how much storage is required
for these intermediate steps and tell you the correct amount to
allocate.