Ideally, when you parallelize a transform over some P processes, each process should end up with work that takes equal time. Otherwise, all of the processes end up waiting on whichever process is slowest. This goal is known as “load balancing.” In this section, we describe the circumstances under which FFTW is able to load-balance well, and in particular how you should choose your transform size in order to load balance.
Load balancing is especially difficult when you are parallelizing over heterogeneous machines; for example, if one of your processors is an old 486 and another is a Pentium IV, obviously you should give the Pentium more work to do than the 486 since the latter is much slower. FFTW does not deal with this problem, however: it assumes that your processes run on hardware of comparable speed, and that the goal is therefore to divide the problem as equally as possible.
For a multi-dimensional complex DFT, FFTW can divide the problem equally among the processes if: (i) the first dimension n0 is divisible by P; and (ii) the product of the subsequent dimensions is divisible by P. (For the advanced interface, where you can specify multiple simultaneous transforms via some “vector” length howmany, a factor of howmany is included in the product of the subsequent dimensions.)
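The two conditions above can be checked before you settle on a transform size. The following is a minimal sketch rather than anything FFTW provides: the function name can_balance_multidim and its parameters are hypothetical, and it assumes all sizes are positive and that the transform has at least two dimensions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of the FFTW API.
 * n[0..rnk-1] are the dimensions of a multi-dimensional complex DFT
 * (rnk >= 2), howmany is the vector length (1 for a single transform),
 * and P is the number of processes. All values are assumed positive. */
bool can_balance_multidim(int rnk, const ptrdiff_t *n,
                          ptrdiff_t howmany, ptrdiff_t P)
{
    if (rnk < 2 || P < 1)
        return false;

    /* Condition (i): the first dimension must be divisible by P. */
    if (n[0] % P != 0)
        return false;

    /* Condition (ii): howmany * n[1] * ... * n[rnk-1] must be divisible
     * by P. Reducing modulo P at each step keeps the intermediate
     * values small instead of forming the full product. */
    ptrdiff_t rem = howmany % P;
    for (int i = 1; i < rnk; ++i)
        rem = (rem * (n[i] % P)) % P;

    return rem == 0;
}
```

For example, a 128 x 64 x 32 transform on 8 processes satisfies both conditions, whereas a 64 x 3 x 5 transform on 4 processes fails condition (ii) unless howmany contributes the missing factor of 4.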
For a one-dimensional complex DFT, the length N of the data should be divisible by P squared to be able to divide the problem equally among the processes.
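The one-dimensional rule is easy to test for a candidate length. The sketch below is illustrative only: can_balance_1d is a hypothetical name, not an FFTW function, and it assumes N and P are positive.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of the FFTW API.
 * Returns true when N is divisible by P squared. The test is written
 * as two successive divisions by P so that P * P is never formed,
 * which avoids overflow when P is large. */
bool can_balance_1d(ptrdiff_t N, ptrdiff_t P)
{
    if (N < 1 || P < 1)
        return false;

    return N % P == 0 && (N / P) % P == 0;
}
```

For N = 1024, this accepts P = 4 (16 divides 1024) and P = 32 (1024 is exactly 32 squared), but rejects P = 64, since 64 squared is 4096 and does not divide 1024.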