Department of Computer Science
University of San Francisco

Parallel and Distributed Computing

Spring 2014

MWF 11:45-12:50, LS 307

**Professor:** Peter Pacheco

**Office:** Harney 540

**Phone:** 422-6630

**Email:** user: peter, domain: usfca.edu

**Office Hours:** M 4-5:30, F 1-2, and by appointment

Course Syllabus (Here's a PDF Version.)

**Programming Assignments**

- Programming assignment 1. Also see the Guide to using the penguin cluster and the GET_TIME macro. Note that the due date has been changed to Friday, February 7.
- Programming assignment 2. Note that the due date has been changed to Wednesday, February 26. The test input and output are in this directory.
- Programming assignment 3. Note that the due date has been changed to Monday, March 24. The test input is in this directory.
- Programming assignment 4. Note that the due date has been changed to Monday, April 14. The test data and output are in this directory.
- A list of possible projects for programming assignment 5. The speakers will be:

- Monday, May 5
  - Dustin
  - Minglu
  - Robin
  - Bin
  - Vincent
  - Roderick
- Wednesday, May 7
  - Hao
  - Xiaoou
  - Guangzhi
  - Pirakorn
  - Joseph


**Seminar Papers**

- Leslie Valiant, "A Bridging Model for Parallel Computation", Communications of the ACM, Vol 33, No 8, Aug 1990, pp 103-111. Jan 31. Roderick Lisam is presenting.
- William Gropp, "Changing How Programmers Think About Parallel Computation", July, 2013, http://learning.acm.org/webinar/. Feb 7. Robin Kalia is presenting.
- Michael Heroux and Jack Dongarra, "Toward a New Metric for Ranking High Performance Computing Systems", UTK EECS Tech Report, June 2013. Feb 14. Hao Chen is presenting.
- Wesley Bland, Aurelien Bouteiller, Thomas Herault, George Bosilca and Jack Dongarra, "Post-failure recovery of MPI communication capability: Design and rationale", International Journal of High Performance Computing Applications, vol 27, no 3, pp. 244-254, Fall 2013. Feb 21. Pirakorn Iam Charernying is presenting.
- David Culler, et al, "LogP: A Practical Model of Parallel Computation", Communications of the ACM, vol 39, no 11, pp 78-85, 1996. Feb 28. Xiaoou Li is presenting.
- Guy Blelloch, "Prefix Sums and Their Applications", in John H. Reif, ed., Synthesis of Parallel Algorithms, Morgan Kaufmann, 1991. (The presentation will cover pp. 35-47 of the paper.) Mar 7. Dustin Chesterman is presenting.
- Sarita Adve and Hans-J Boehm, "Memory Models: a Case for Rethinking Parallel Languages and Hardware", Communications of the ACM, vol 53, no 8, pp. 90-101, 2010. Mar 28. Guangzhi Li is presenting.
- James Larus and Christos Kozyrakis, "Transactional Memory", Communications of the ACM, vol 51, no 7, pp. 80-88, 2008. Apr 4. Minglu Ma is presenting.
- K. Kandalla, et al, "Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) Infiniband Clusters", Proc of 2013 IEEE 21st Annual Symposium on High-Performance Interconnects, pp. 63-70, 2013. Apr 11. Vincent Zhang is presenting.
- Junfeng Yang, et al, "Making Parallel Programs Reliable with Stable Multithreading", Communications of the ACM, vol 57, no 3, pp. 58-69, 2014. Apr 25. Bin Lu is presenting.
- Bradford Chamberlain, "Graph Partitioning Algorithms for Distributing Workloads of Parallel Computations", University of Washington Technical Report UW-CSE-98-10-03, October 1998. May 2. Joseph Tanigawa is presenting.

**Additional Course Information**

- How to set up a Subversion repository
- A PDF form for seminar paper evaluation
- A plain text form for seminar paper evaluation
- A guide to using the penguin cluster
- A list of some MPI collective communication functions
- Runtimes, speedups, and efficiencies of the original MPI implementation of the trapezoidal rule
- Estimated runtimes, speedups, and efficiencies of the MPI trapezoidal rule using the model developed in class and `t_a`, `t_s`, and `t_w` obtained empirically.
- Estimated runtimes, speedups, and efficiencies of the MPI trapezoidal rule using the model developed in class and `t_a`, `t_s`, and `t_w` obtained from a least-squares approximation to the actual runtimes.
- Runtimes, speedups, and efficiencies of the slightly improved MPI implementation of the trapezoidal rule
- Runtimes, speedups, and efficiencies of the MPI implementation of matrix-vector multiplication
- Midterm key. The mean was 19.1 and the median was 20. The distribution of the scores was:
  - >= 24: 1
  - 20-22: 6
  - 17-19: 1
  - 0-16: 3

- Runtimes of two implementations of matrix-vector multiplication using Pthreads: Version 1 and Version 2.
- Block diagram of an Nvidia Tesla GPU
- Some notes on distributed memory matrix multiplication
- A TSP Digraph and search tree

**Code**

- An MPI greetings program
- Macro that can be used for finding wall-clock times
- Trapezoidal Rule Code
- Serial trapezoidal rule code
- MPI implementation of the trapezoidal rule
- Slightly improved MPI implementation of the trapezoidal rule
- Pthreads implementation of the trapezoidal rule
- First OpenMP implementation of the trapezoidal rule: uses a parallel and a critical directive.
- Second OpenMP implementation of the trapezoidal rule: uses a parallel directive with a reduction clause.
- Third OpenMP implementation of the trapezoidal rule: uses a parallel for directive with a reduction clause.
- MPI implementation of the trapezoidal rule. This version includes code to time the execution of the trapezoidal rule.
- Pthreads implementation of the trapezoidal rule. This version uses a semaphore instead of a mutex.

- MPI implementation of matrix-vector multiplication. This version uses a block row distribution of the matrix and a block distribution of the vectors. The program includes code to time its execution.
- Pthreads barriers
- Using semaphores for producer-consumer synchronization
- Program that attempts to send messages among threads
- Program that uses semaphores to synchronize threads sending messages
- Program that uses semaphores to synchronize threads sending messages. This version uses named semaphores, and it can run under MacOS X.

- Pthreads implementation of matrix-vector multiplication
- CUDA examples
- CUDA dot product
- CUDA dot product. First version uses `atomicAdd`.
- CUDA dot product. This version uses tree-structured reduction and `__syncthreads`.
- CUDA dot product. This version also uses shared memory to store intermediate results.
- CUDA dot product. This version uses a different tree structure to reduce thread divergence.
- CUDA dot product. This version uses page-locked memory for the results of each block so that they don't need to be copied from device to host.
- CUDA dot product. This version "unrolls" the last five iterations of the reduction loop. This eliminates the need for the calls to `__syncthreads`.
- CUDA dot product. This version "unrolls" all the iterations of the reduction loop.
- CUDA dot product. This version does some intermediate sums on the device.

- TSP

Peter Pacheco 2014-05-05