COMP 422 Parallel Computing Spring 2008 Assignment 3 Complete by 5pm Wednesday, April 2, 2008

## Iterative Averaging of a One-Dimensional Array

The goal of this assignment is to solve a One-Dimensional Iterative Averaging problem by developing a scalable message passing implementation using MPI. You can write your program in C+MPI or Fortran+MPI. You should use point-to-point and collective communications in MPI as appropriate.

Your program should take the following three inputs on the command line:

1. Input size, N, of type long.
2. Convergence threshold, epsilon, of type double.
3. Number of processors, P, of type int.

Given an input size, N, a solution to this iterative averaging problem must perform the following steps:

1. Allocate a (logically) global array of doubles, A, of size N+2 partitioned across the P processors. You may also allocate a second shadow array for A, if needed.
2. Initialize A[0] := 0 and A[N+1] := N+1.
3. In each iteration, replace A[i] by the average of its neighbors, i.e., A[i] := (A[i-1] + A[i+1])/2 for all 1 <= i <= N. This computation will entail communication of boundary elements on each processor.
4. Repeat the previous step until the absolute difference between two successive values of A[i] is <= epsilon, for all 1 <= i <= N. We will use delta(i) to denote this absolute difference.
5. Print the following values as output:
• Average and maximum values of delta(i) for 1 <= i <= N.
• Number of iterations
• Sum of final values of A[1...N], which should be close to N*(N+1)/2.

Your goal is to write a program that exhibits the best sequential and parallel performance for this problem. An incomplete description of a C+MPI implementation for this problem can be found in Figure 1 of the paper, The high-level parallel language ZPL improves productivity and performance, by Bradford L. Chamberlain, Sung-Eun Choi, Steven J. Deitz, and Lawrence Snyder, published in the Proceedings of the IEEE International Workshop on Productivity and Performance in High-End Computing (P-PHEC), 2004. You are welcome to use an alternate implementation approach, if you prefer.

You will run your MPI program on Ada and measure its performance and scalability. Run your program on 1-64 processors of Ada, using only processor counts that are powers of 2. Your program should measure the elapsed time of the iterative part of the program (steps 3 and 4 above), excluding initialization and final output. Feel free to use values of N and epsilon that yield favorable performance results for your program, but be sure to document your choices in your write-up -- we will perform tests with additional values.

Write a document that describes how your program works. This document should not include your programs, though it may include figures containing pseudo-code that sketch the key elements of your parallelization strategy. Explain how your program partitions the data and work, and how it exploits parallelism. If your program uses collective communication, explain how; if not, explain why not. Your writeup should include a table with timing measurements for your solver on 1-64 processors, for numbers of processors that are powers of 2. Plot the parallel efficiency of your program executions: the x-axis should show the number of processors, and the y-axis should show your measured parallel efficiency for the execution.

NOTE: Parallel efficiency is computed as S/(p * T(p)), where S is the wall clock time of a sequential execution of your program, p is the number of processors, and T(p) is the wall clock time of an execution on p processors. Don't make the mistake of using the execution time of a one-process run of your parallel implementation as the time of a sequential execution: it is typically larger than that of the original sequential implementation. When you compute parallel efficiency, always use the performance of the original sequential code as the baseline; otherwise, you will overestimate the value of your parallelization.

## Using MPI on Ada

General information about MPI and links to online documentation can be found on the COMP 422 Resources Page. For helpful details on developing and debugging your programs, see http://rcsg.rice.edu/ada/FAQ. The FAQ explains how to compile and run your programs, how to submit jobs to the batch queueing system, and how to use the TotalView debugger.

The queue status can be checked by running the qstat command on Ada.

You can write and compile your program on one of the master nodes (the node you log into when you say `ssh ada.rice.edu`). To run a parallel program, you need to use one of the batch queues. You can obtain an interactive session for debugging your program by submitting a request to the "interactive" queue with `qsub -I -q interactive`. You need to specify the number of processor nodes that you want, e.g., using `-l nodes=4`. If you don't specify how much time you want in the interactive queue, your session will be killed after 30 minutes; the upper time limit for the interactive queue is 3 hours. See `man qsub` for details on how to specify the wall clock time for your request.

We recommend using queue "interactive" for development on 1 or 2 nodes and then submitting timing experiments to the "short" queue. Specifying a realistic bound for the running time of your application when you submit to the "short" queue will accelerate your progress through the queue. The more time you ask for, the longer you will probably wait in the queue. However, if you don't ask for enough, your job may be killed before it completes if your execution time exceeds the amount you requested. Ask for enough, but not too much.
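A batch script along these lines may be a useful starting point for submissions to the "short" queue; the resource values, the program name `average`, and the `mpirun` invocation are illustrative guesses -- check the Ada FAQ for the exact options your site expects:

```shell
#PBS -q short
#PBS -l nodes=8
#PBS -l walltime=00:15:00
# hypothetical invocation: N=1000000, epsilon=1e-6, P=8
cd $PBS_O_WORKDIR
mpirun -np 8 ./average 1000000 0.000001 8
```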