For one-dimensional distributed DFTs using FFTW, matters are slightly more complicated because the data distribution is more closely tied to how the algorithm works. In particular, you can no longer pass an arbitrary block size and must accept FFTW's default; also, the block sizes may be different for input and output. Moreover, the data distribution depends on the flags and transform direction, in order for forward and backward transforms to work correctly.
ptrdiff_t fftw_mpi_local_size_1d(ptrdiff_t n0, MPI_Comm comm,
                                 int sign, unsigned flags,
                                 ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
                                 ptrdiff_t *local_no, ptrdiff_t *local_o_start);
This function computes the data distribution for a 1d transform of size n0 with the given transform sign and flags. Both input and output data use block distributions. The input on the current process will consist of local_ni numbers starting at index local_i_start; e.g. if only a single process is used, then local_ni will be n0 and local_i_start will be 0. Similarly for the output, with local_no numbers starting at index local_o_start. The return value of fftw_mpi_local_size_1d will be the total number of elements to allocate on the current process (which might be slightly larger than the local size due to intermediate steps in the algorithm).
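To illustrate, here is a minimal sketch of how these pieces might fit together for an in-place forward transform; the size n0 is an arbitrary example value, and error checking is omitted:

#include <fftw3-mpi.h>

int main(int argc, char **argv)
{
    const ptrdiff_t n0 = 1024;   /* example size; must be composite */
    ptrdiff_t local_ni, local_i_start, local_no, local_o_start;
    fftw_complex *data;
    fftw_plan plan;

    MPI_Init(&argc, &argv);
    fftw_mpi_init();

    /* Ask FFTW how this process's share of the input and output is
       distributed, and how many elements to allocate. */
    ptrdiff_t alloc_local = fftw_mpi_local_size_1d(
        n0, MPI_COMM_WORLD, FFTW_FORWARD, FFTW_ESTIMATE,
        &local_ni, &local_i_start, &local_no, &local_o_start);

    data = fftw_alloc_complex(alloc_local);

    /* Create an in-place forward plan; the sign and flags should match
       those passed to fftw_mpi_local_size_1d. */
    plan = fftw_mpi_plan_dft_1d(n0, data, data, MPI_COMM_WORLD,
                                FFTW_FORWARD, FFTW_ESTIMATE);

    /* ... initialize data[j] for j = 0..local_ni-1, corresponding to
       global indices local_i_start + j ... */

    fftw_execute(plan);

    fftw_destroy_plan(plan);
    fftw_free(data);
    MPI_Finalize();
    return 0;
}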
As mentioned above (see Load balancing), the data will be divided equally among the processes if n0 is divisible by the square of the number of processes. For example, with 4 processes, n0 = 1024 is divisible by 16, so every process holds 256 elements of both the input and the output. In this case, local_ni will equal local_no. Otherwise, they may be different.
For some applications, such as convolutions, the order of the output data is irrelevant. In this case, performance can be improved by specifying that the output data be stored in an FFTW-defined “scrambled” format. (In particular, this is the analogue of transposed output in the multidimensional case: scrambled output saves a communications step.) If you pass FFTW_MPI_SCRAMBLED_OUT in the flags, then the output is stored in this (undocumented) scrambled order. Conversely, to perform the inverse transform of data in scrambled order, pass the FFTW_MPI_SCRAMBLED_IN flag.
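For example, a convolution-style round trip using these flags might be sketched as follows; allocation via fftw_mpi_local_size_1d (with the matching flags) is assumed to have happened as in the earlier example, and the pointwise multiplication step is only indicated by a comment:

/* Forward transform leaving the output in FFTW's scrambled order;
   the same flag should be passed when computing the local sizes. */
fftw_plan fwd = fftw_mpi_plan_dft_1d(n0, in, out, MPI_COMM_WORLD,
                                     FFTW_FORWARD,
                                     FFTW_ESTIMATE | FFTW_MPI_SCRAMBLED_OUT);

/* Inverse transform accepting input in that same scrambled order. */
fftw_plan bwd = fftw_mpi_plan_dft_1d(n0, out, in, MPI_COMM_WORLD,
                                     FFTW_BACKWARD,
                                     FFTW_ESTIMATE | FFTW_MPI_SCRAMBLED_IN);

fftw_execute(fwd);
/* ... multiply out[] pointwise by the transformed kernel; the order
   of the frequency components is irrelevant for this step ... */
fftw_execute(bwd);
/* in[] now holds the circular convolution, scaled by n0 */

Because the backward plan consumes exactly the scrambled ordering that the forward plan produces, the reordering communication step is avoided on both passes.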
In MPI FFTW, only composite sizes n0 can be parallelized; we have not yet implemented a parallel algorithm for large prime sizes.