Ph.D. Dissertation Defense: Jing Wu

Monday, July 21, 2014
2:00 p.m.
1146 AV Williams Bldg.
Maria Hoo
301 405 3681
mch@umd.edu

ANNOUNCEMENT: Ph.D. Dissertation Defense

Name: Jing Wu

Committee:
Professor Joseph F. JaJa, Chair/Advisor
Professor Shuvra S. Bhattacharyya
Professor Tudor Dumitras
Professor Manoj Franklin
Professor Amitabh Varshney

Date/Time: Monday, July 21st, 2pm

Location: Room 1146 AV Williams Building

Title: Optimization techniques for mapping algorithms and applications onto CUDA GPUs and CPU-GPU heterogeneous platforms.

Abstract:

An emerging trend in processor architecture is the doubling of the number of cores per chip every two years with little or no improvement to the clock speed. Of particular interest to this thesis is the class of many-core processors, which are becoming more attractive due to their high performance, low cost, and low power consumption. The main goal of this dissertation is to develop optimization techniques for mapping algorithms and applications onto CUDA GPUs and CPU-GPU heterogeneous platforms in a way that exploits the available computational power of the underlying architectures.

The fast Fourier transform (FFT) is a fundamental tool in computational science and engineering, and hence a GPU-optimized implementation is of paramount importance. We first study the mapping of the 3D FFT onto recent, highly multithreaded CUDA GPUs. We explore the high degree of multithreading offered by the CUDA environment while carefully managing the multiple levels of the memory hierarchy. As a result, we develop a new approach that minimizes the number of global memory accesses and overlaps the computations along the different dimensions, yielding some of the fastest known implementations of the multi-dimensional FFT.
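As a rough illustration of why the computation can be organized "along the different dimensions", the sketch below shows the standard separability property of the multi-dimensional DFT: a transform over all axes equals a sequence of 1D transforms along each axis. It uses 2D instead of 3D for brevity and a naive 1D DFT in place of an optimized FFT kernel. The function names are hypothetical, and nothing here reflects the memory-access scheme developed in the dissertation.

```python
import cmath

def dft_1d(x):
    """Naive O(n^2) 1D DFT; a stand-in for an optimized FFT kernel."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def dft_2d_by_axes(a):
    """Transform every row, then every column (row-column decomposition)."""
    rows = [dft_1d(row) for row in a]
    cols = [dft_1d(list(col)) for col in zip(*rows)]  # 1D transforms along the other axis
    return [list(r) for r in zip(*cols)]              # transpose back to row-major order

def dft_2d_direct(a):
    """Direct evaluation of the 2D DFT definition, for comparison only."""
    n1, n2 = len(a), len(a[0])
    return [[sum(a[j1][j2] * cmath.exp(-2j * cmath.pi * (j1 * k1 / n1 + j2 * k2 / n2))
                 for j1 in range(n1) for j2 in range(n2))
             for k2 in range(n2)]
            for k1 in range(n1)]
```

Each pass over an axis touches the whole array once, which is why the number of global memory accesses and the overlap between passes matter on a GPU.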

We then present a highly multithreaded FFT-based direct Poisson solver that is optimized for recent NVIDIA GPUs. In addition to the massive multithreading, our algorithm carefully manages the multiple layers of the memory hierarchy so that all global memory accesses are coalesced into 128-byte device memory transactions. As a result, we achieve up to 140 GFLOPS with a bandwidth of 70 GB/s on the Tesla C1060, and up to 375 GFLOPS with a bandwidth of 120 GB/s on the GTX 480. Moreover, we introduce a mixed-precision strategy in which only single-precision data are stored in the slower memory, while still achieving the second-order convergence that would otherwise require a double-precision implementation.
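The general idea behind a mixed-precision scheme (store data in single precision, carry out the sensitive arithmetic in double precision) can be illustrated with a simple summation. This is a generic sketch, not the solver's actual strategy: the helper names are hypothetical, and `to_float32` emulates single-precision storage by round-tripping through the standard `struct` module.

```python
import struct

def to_float32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def sum_single_accumulator(values):
    """Accumulate in single precision: round after every addition."""
    acc = 0.0
    for v in values:
        acc = to_float32(acc + v)
    return acc

def sum_double_accumulator(values):
    """Accumulate in double precision; the inputs are still single-precision values."""
    acc = 0.0
    for v in values:
        acc += v
    return acc
```

With one million copies of `to_float32(0.1)` as input, the single-precision accumulator drifts visibly from the true sum, while the double-precision accumulator does not, even though both read the same single-precision data.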

We further extend our methodology to CPU-GPU heterogeneous platforms for the case when the input is too large to fit in the GPU global memory. We develop optimization techniques for memory-bound applications, such as the 3D FFT and solving the Poisson equation, and for computation-bound applications, such as dense matrix multiplication. The main challenge here is to minimize data transfer between the CPU memory and the device memory and to overlap these transfers with kernel execution as much as possible. For memory-bound applications, we achieve a near-peak effective PCIe bus bandwidth of 9-10 GB/s and performance as high as 145 GFLOPS for multi-dimensional FFT computations and for solving the Poisson equation. We extend the CPU-GPU software pipeline used for memory-bound applications to a computation-bound application, DGEMM, and achieve the illusion of a device memory as large as the CPU memory together with a computation throughput similar to that of a pure GPU implementation. More specifically, we perform dense matrix multiplication at a near-peak GPU computational rate, achieving 1 and 2 TFLOPS using 1 and 2 GPUs, respectively, for extremely large matrices that do not fit in the device memory.
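The benefit of overlapping transfers with kernel execution can be estimated with a simple timing model. The sketch below is an idealized assumption, not the dissertation's pipeline: the input is split into equal chunks, each chunk passes through three stages (host-to-device copy, kernel, device-to-host copy), and the three stages can run concurrently on different chunks. The function names and the per-chunk times are hypothetical.

```python
def serial_time(n_chunks, t_h2d, t_kernel, t_d2h):
    """No overlap: every chunk pays for all three stages back to back."""
    return n_chunks * (t_h2d + t_kernel + t_d2h)

def pipelined_time(n_chunks, t_h2d, t_kernel, t_d2h):
    """Ideal three-stage pipeline: the first chunk fills the pipeline,
    then each further chunk completes at the pace of the slowest stage."""
    bottleneck = max(t_h2d, t_kernel, t_d2h)
    return (t_h2d + t_kernel + t_d2h) + (n_chunks - 1) * bottleneck

# Hypothetical per-chunk times (arbitrary units) for a transfer-dominated workload.
print(serial_time(8, 1.0, 0.5, 1.0))     # 20.0
print(pipelined_time(8, 1.0, 0.5, 1.0))  # 9.5
```

In this model, a memory-bound workload runs at the pace of the transfers, so the achievable rate is capped by the bus bandwidth. A computation-bound workload such as DGEMM runs at the pace of the kernel, with the transfers hidden behind it.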

Audience: Graduate Faculty
