Parallel matrix multiplication

One-way channels are extremely efficient and can be implemented in hardware. There used to be a computing element called a transputer, which contained a CPU, memory and four pairs of one-way channels on a single chip. Synchronous communication was implemented through channel statements that could be compiled directly to machine-language instructions of the transputer. This facilitated the construction of arrays of processors that could be programmed to execute truly parallel programs. While the transputer no longer exists, its architecture has influenced that of modern chips, especially digital signal processors. Here we will demonstrate an algorithm that can be efficiently implemented on such systems.

Consider the problem of multiplying two 3 x 3 matrices.

Each element of the resulting matrix can be computed independently of the other elements. For example, the element 30 in the bottom right corner of the result is obtained by the following computation, the inner product of the bottom row of the first matrix with the rightmost column of the second, which can be performed before, after or simultaneously with the computation of the other elements:

\(7 \cdot 2 + 8 \cdot 2 + 9 \cdot 0 = 30\)

We will implement matrix multiplication for 3 x 3 matrices using small numbers, although this algorithm is practical only for large matrices with large values, because of the overhead involved in the communications.
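Before turning to the parallel version, the independence of the elements can be expressed directly in code. The following sketch in Go (our notation, not part of the original algorithm) computes each element of the product in its own goroutine; the matrix values are placeholders chosen so that the bottom right element of the product is 30, as in the example above:

package main

import (
    "fmt"
    "sync"
)

// multiply computes the product of two 3 x 3 matrices, computing each
// element of the result in its own goroutine; no element of the result
// depends on any other element.
func multiply(a, b [3][3]int) [3][3]int {
    var c [3][3]int
    var wg sync.WaitGroup
    for i := 0; i < 3; i++ {
        for j := 0; j < 3; j++ {
            wg.Add(1)
            go func(i, j int) {
                defer wg.Done()
                sum := 0
                for k := 0; k < 3; k++ {
                    sum += a[i][k] * b[k][j]
                }
                c[i][j] = sum
            }(i, j)
        }
    }
    wg.Wait()
    return c
}

func main() {
    a := [3][3]int{{1, 2, 3}, {4, 5, 6}, {7, 8, 9}} // placeholder first matrix
    b := [3][3]int{{1, 0, 2}, {0, 1, 2}, {1, 0, 0}} // placeholder second matrix
    // The bottom right element of the product is 7·2 + 8·2 + 9·0 = 30.
    fmt.Println(multiply(a, b))
}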

Figure 8.1 shows 21 processors connected so as to compute the matrix multiplication. Each processor will execute a single process. The one-way channels are denoted by arrows. We will use geographical directions to denote the various neighbors of each processor.

The central array of 3 x 3 processes multiplies pairs of elements and computes the partial sums. One process is assigned to each element of the first matrix and initialized with the value of that element. The Source processes are initialized with the row vectors of the second matrix and the elements are sent one by one to the Multiplier processes to the south; they in turn pass on the elements until they are finally caught in the Sink processes. (The Sink processes exist so that the programming of all the Multiplier processes can be uniform without a special case for the bottom row of the array.) A Multiplier process receives a partial sum from the east, and adds to it the result of multiplying its element and the value received from the north. The partial sums are initialized by the Zero processes and they are finally sent to the Result processes on the west.

Let us study the computation of one element of the result:

[Figure: process array for matrix multiplication]

This is the computation of the corner element that we showed above as a mathematical equation. Since each process waits for its two input values before computing an output, we can read the diagram from right to left: first, \(9 \cdot 0 + 0 = 0\), where the partial sum of 0 comes from the Zero process; then, \(8 \cdot 2 + 0 = 16\), where the partial sum of 0 now comes from the Multiplier process labeled 9; finally, using this partial sum of 16, we get \(7 \cdot 2 + 16 = 30\), which is passed to the Result process.
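To make this dataflow concrete, here is a small self-contained sketch in Go of just this bottom row; the names and the use of Go are ours, and unbuffered channels stand in for the synchronous one-way channels. A Zero process, three Multiplier processes holding 7, 8 and 9, and a Result process are connected from east to west, and the values 2, 2 and 0 are fed in from the north:

package main

import "fmt"

func main() {
    // Partial sums flow from east to west: Zero -> [9] -> [8] -> [7] -> Result.
    // eastWest[3] leaves the Zero process; eastWest[0] enters the Result process.
    var eastWest [4]chan int
    for i := range eastWest {
        eastWest[i] = make(chan int) // unbuffered: a synchronous rendezvous
    }
    // Each Multiplier receives one value of the second matrix from the north.
    norths := [3]chan int{make(chan int), make(chan int), make(chan int)}

    // The Zero process supplies the initial partial sum.
    go func() { eastWest[3] <- 0 }()

    // The bottom row of Multiplier processes, from west to east: 7, 8, 9.
    elements := [3]int{7, 8, 9}
    for i := range elements {
        go func(element int, north, east <-chan int, west chan<- int) {
            value := <-north            // value from the second matrix
            sum := <-east               // partial sum from the east
            west <- sum + element*value // send the new partial sum west
        }(elements[i], norths[i], eastWest[i+1], eastWest[i])
    }

    // The rightmost column of the second matrix arrives from the north:
    // 2 for the Multiplier holding 7, 2 for 8, and 0 for 9.
    values := [3]int{2, 2, 0}
    for i := range values {
        go func(ch chan<- int, v int) { ch <- v }(norths[i], values[i])
    }

    // The Result process receives 7·2 + 8·2 + 9·0 = 30 at the west end.
    fmt.Println(<-eastWest[0]) // prints 30
}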

Except for the Multiplier processes, the algorithms for the other processes are trivial. A Zero process executes \(West \Leftarrow 0\) three times to initialize the Sum variables of a Multiplier process; a Source process executes \(South \Leftarrow Vector[i]\) for each of the three elements of its row of the second matrix; a Sink process executes \(North \Rightarrow dummy\) three times and ignores the values received; a Result process executes \(East \Rightarrow result\) three times and prints or otherwise uses the values received. A Result process knows which row and column each result belongs to: the row is determined by its position in the structure and the column by the order in which the values are received.
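For concreteness, these four processes can be sketched in Go as follows; the function and parameter names are our own, and unbuffered channels model the synchronous links. Together with the Multiplier and the wiring sketched below, they form one complete program:

package main

import "fmt"

// zero sits at the east end of a row and initializes each of the three
// partial sums to 0.
func zero(west chan<- int) {
    for i := 0; i < 3; i++ {
        west <- 0
    }
}

// source sits above a column and sends the elements of its row of the
// second matrix, one by one, to the Multiplier to its south.
func source(vector [3]int, south chan<- int) {
    for i := 0; i < 3; i++ {
        south <- vector[i]
    }
}

// sink sits below a column and ignores the three values that emerge from
// the bottom Multiplier.
func sink(north <-chan int) {
    for i := 0; i < 3; i++ {
        <-north
    }
}

// result sits at the west end of a row; it knows its row from its position
// in the structure, and the order of arrival determines the column.
func result(row int, east <-chan int) {
    for col := 0; col < 3; col++ {
        fmt.Printf("element (%d,%d) = %d\n", row, col, <-east)
    }
}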

Here is the algorithm for a Multiplier process:
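A sketch in Go, continuing the program begun above; three times over, the Multiplier receives a value from the north and a partial sum from the east, adds the product of the value and its own element to the sum, passes the value on to the south, and sends the new sum to the west:

// multiplier holds one element of the first matrix and accumulates its
// contribution to three partial sums, one for each column of the result.
func multiplier(firstElement int, north, east <-chan int, south, west chan<- int) {
    for i := 0; i < 3; i++ {
        secondElement := <-north
        sum := <-east
        sum = sum + firstElement*secondElement
        south <- secondElement
        west <- sum
    }
}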

Each Multiplier process must be initialized with five parameters: its element FirstElement and the four channels appropriate for its position in the structure. This configuration can be quite language-specific and we do not consider it further here.
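Nevertheless, as an illustration of what such a configuration might look like, here is one possible wiring of the 21 processes in Go, using the process functions sketched above; the matrix values are placeholders chosen so that the bottom right element of the product is \(7 \cdot 2 + 8 \cdot 2 + 9 \cdot 0 = 30\):

// main configures the 21 processes of Figure 8.1 and connects them with
// unbuffered channels.
func main() {
    a := [3][3]int{{1, 2, 3}, {4, 5, 6}, {7, 8, 9}} // placeholder first matrix
    b := [3][3]int{{1, 0, 2}, {0, 1, 2}, {1, 0, 0}} // placeholder second matrix

    // vert[k][j] carries values southwards in column j (vert[0][j] leaves the
    // Source, vert[3][j] enters the Sink); horiz[i][k] carries partial sums
    // westwards in row i (horiz[i][3] leaves the Zero, horiz[i][0] enters the Result).
    var vert [4][3]chan int
    var horiz [3][4]chan int
    for j := 0; j < 3; j++ {
        for k := 0; k < 4; k++ {
            vert[k][j] = make(chan int)
            horiz[j][k] = make(chan int)
        }
    }

    done := make(chan bool)
    for i := 0; i < 3; i++ {
        go zero(horiz[i][3])        // Zero at the east end of row i
        go source(b[i], vert[0][i]) // Source above column i holds row i of the second matrix
        go sink(vert[3][i])         // Sink at the south end of column i
        for j := 0; j < 3; j++ {
            // The Multiplier in row i, column j holds element (i,j) of the first matrix.
            go multiplier(a[i][j], vert[i][j], horiz[i][j+1], vert[i+1][j], horiz[i][j])
        }
        go func(i int) { // Result at the west end of row i
            result(i, horiz[i][0])
            done <- true
        }(i)
    }
    for i := 0; i < 3; i++ { // wait for the three Result processes to finish
        <-done
    }
}

Each Result process prints one row of the product; since the three Result processes run concurrently, their output lines may be interleaved.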

Selective input

In all our examples so far, an input statement ch => var names a single channel, and the statement blocks until another process executes an output statement on the corresponding channel. Languages with synchronous communication like CSP, Promela and Ada allow selective input, in which a process can attempt to communicate on more than one channel at the same time:

either
  ch1 => var1
or
  ch2 => var2
or
  ch3 => var3

Exactly one alternative of a selective input statement is selected. If, when the statement is executed, communication can take place on one or more of the channels, one of them is chosen nondeterministically; otherwise, the process blocks until communication can succeed on one of the alternatives.
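Go's select statement provides this behavior for channel operations, and its unbuffered channels are synchronous, so the construct can be illustrated directly (the channel names and values here are placeholders):

package main

import "fmt"

func main() {
    ch1 := make(chan int) // unbuffered channels: communication is synchronous
    ch2 := make(chan int)
    ch3 := make(chan int)

    // Three processes each attempt to output a value on their own channel.
    go func() { ch1 <- 10 }()
    go func() { ch2 <- 20 }()
    go func() { ch3 <- 30 }()

    var var1, var2, var3 int

    // Selective input: if several channels are ready, one alternative is
    // chosen nondeterministically; if none is ready, the statement blocks.
    select {
    case var1 = <-ch1:
        fmt.Println("received on ch1:", var1)
    case var2 = <-ch2:
        fmt.Println("received on ch2:", var2)
    case var3 = <-ch3:
        fmt.Println("received on ch3:", var3)
    }
    // The two unselected senders remain blocked when the program exits.
}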

The matrix multiplication program can use selective input to take advantage of additional parallelism in the algorithm. Instead of waiting for input first from North and then from East, we can wait for the first available input and then wait for the other one:
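In the Go sketch, the corresponding change replaces the two receives in the Multiplier by a select statement that commits to whichever of North and East is ready first and then receives from the other channel (again our rendering, not the original notation):

// multiplier with selective input, replacing the version sketched earlier:
// commit to whichever of north and east is ready first, then receive the
// other value before computing the new partial sum.
func multiplier(firstElement int, north, east <-chan int, south, west chan<- int) {
    for i := 0; i < 3; i++ {
        var secondElement, sum int
        select {
        case secondElement = <-north:
            sum = <-east
        case sum = <-east:
            secondElement = <-north
        }
        sum = sum + firstElement*secondElement
        south <- secondElement
        west <- sum
    }
}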

Once one alternative has been selected, the rest of the statements in that alternative are executed normally; in this case, the remaining statement is an input from the other channel. The use of selective input ensures that the processor to the east of this Multiplier is not blocked unnecessarily.