This research has two major goals. First, we would like to design an efficient software communication layer that allows the hardware and software shared memory modules to cooperate as seamlessly as possible. This requires special cache-coherence protocols that are cognizant of the hierarchical nature in which shared memory is implemented. Such multigrain-aware protocols dynamically detect fine-grain sharing patterns that are clustered within a multiprocessor node and, for such sharing patterns, relax the coherence provided at the page level, thus eliminating all software-related overheads. As a result, the system delivers the efficiency of hardware cache-coherent shared memory mechanisms to applications that exhibit clustered fine-grain sharing patterns.
Second, our research investigates the behavior of shared memory applications when the underlying shared memory mechanisms are supported in a multigrain fashion. To provide a platform upon which to conduct our application study, we have built MGS, a prototype multigrain shared memory system, on the MIT Alewife multiprocessor. The MGS prototype constructs a DSSMP on top of a monolithic cache-coherent shared memory machine (e.g., Alewife) using a technique called virtual clustering. MGS partitions the Alewife machine into virtual clusters by disallowing the use of hardware communication mechanisms across virtual cluster boundaries, and by trapping to a page-based software-only shared memory layer when communication across virtual clusters is necessary. This technique allows us to configure the size of each multiprocessor node at runtime, thereby enabling the study of different DSSMP configurations.
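The idea behind virtual clustering can be illustrated with a small sketch. This is not the MGS implementation; it only assumes that processors are numbered 0 to P-1 and that clusters are formed from contiguous processor IDs, which the description above does not specify. The function names `cluster_of`, `communication_path`, and `hardware_fraction` are hypothetical, and the uniform all-to-all traffic used in `hardware_fraction` is a simplifying assumption chosen only to show how node size changes the hardware/software split.

```python
def cluster_of(proc_id: int, node_size: int) -> int:
    """Virtual cluster index of a processor (assumes contiguous-ID clusters)."""
    return proc_id // node_size


def communication_path(src: int, dst: int, node_size: int) -> str:
    """Return which mechanism would carry sharing between two processors.

    Same virtual cluster  -> hardware cache coherence.
    Different clusters    -> page-based software shared memory layer.
    """
    same = cluster_of(src, node_size) == cluster_of(dst, node_size)
    return "hardware" if same else "software"


def hardware_fraction(total_procs: int, node_size: int) -> float:
    """Fraction of ordered processor pairs (src != dst) that share a cluster.

    Assumes uniform all-to-all traffic, which is an illustration only;
    real applications have their own sharing patterns.
    """
    if node_size < 1 or total_procs % node_size != 0:
        raise ValueError("node_size must evenly divide total_procs")
    pairs = total_procs * (total_procs - 1)
    if pairs == 0:
        return 1.0  # a single processor never crosses a cluster boundary
    same_cluster_pairs = total_procs * (node_size - 1)
    return same_cluster_pairs / pairs


if __name__ == "__main__":
    P = 8
    for C in (1, 2, 4, 8):
        print(f"node size {C}: hardware fraction {hardware_fraction(P, C):.3f}")
```

With node size 1 every pair crosses a cluster boundary (all-software shared memory), and with node size P no pair does (all-hardware shared memory); intermediate node sizes fall in between. An application whose sharing is clustered would see a larger hardware share than this uniform-traffic estimate suggests.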
The flexibility provided by the MGS prototype in studying different DSSMPs allows us to fully characterize the performance of shared memory applications on multigrain systems. In particular, our application study has produced performance results like the following:
This graph shows the execution time of a hypothetical shared memory application on the MGS prototype as node size is varied between 1 and P processors (total machine size is kept fixed at P processors). With increasing multiprocessor node size, the application experiences improved performance, since larger multiprocessors mean that a greater fraction of the application's shared memory traffic is supported using hardware mechanisms. Our performance analysis also directly compares the performance of the multigrain systems to the monolithic architectures at node sizes of 1 (all-software shared memory) and P (all-hardware shared memory) processors, as indicated by the two performance metrics that have been labeled on the graph. The interested reader is encouraged to look at our MGS paper, which appeared in ISCA '96, for more details.
Traditionally, accurate modeling of memory system behavior is difficult because caches are highly unpredictable. The cache miss rate of an application depends on complex interactions between the reference stream of the application and the architecture of the memory hierarchy. Multiprocessor memory systems are even more unpredictable because of the additional interaction between the reference streams of multiple threads.
Software page-based shared memory systems, particularly those that support a release-consistent (RC) memory consistency model, present an opportunity for accurate performance modeling. Such shared memory systems place the onus of coherence management on the programmer through the annotation of shared memory code with special memory operations known as acquires and releases. The implication is that the state of each processor's page cache is managed explicitly by the application; furthermore, the exact state of the page cache can be computed by analyzing the acquire and release operations emitted by the application. In ongoing research, I have been developing a performance model for software page-based shared memory systems based on the analysis of acquire-release patterns, or what I call Synchronization Analysis.
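To make the claim that page cache state can be computed from acquires and releases concrete, here is a toy sketch. It is not the Synchronization Analysis model itself, and it does not reproduce the MGS protocol. It assumes a deliberately simplified eager invalidate protocol: writes are tracked per processor, and at each release the releasing processor's written pages are invalidated in every other processor's page cache. The trace format and the function name `count_page_misses` are hypothetical.

```python
def count_page_misses(trace, num_procs):
    """Replay a trace and count page misses per processor.

    Trace events are tuples (all names here are illustrative assumptions):
      ("read", proc, page), ("write", proc, page),
      ("acquire", proc), ("release", proc)
    """
    cache = [set() for _ in range(num_procs)]   # valid pages per processor
    dirty = [set() for _ in range(num_procs)]   # pages written since last release
    misses = [0] * num_procs

    for event in trace:
        op, proc = event[0], event[1]
        if op in ("read", "write"):
            page = event[2]
            if page not in cache[proc]:
                misses[proc] += 1               # page must be fetched in software
                cache[proc].add(page)
            if op == "write":
                dirty[proc].add(page)
        elif op == "release":
            # Assumed eager protocol: make writes visible by invalidating
            # every other processor's copy of each written page.
            for page in dirty[proc]:
                for other in range(num_procs):
                    if other != proc:
                        cache[other].discard(page)
            dirty[proc].clear()
        elif op == "acquire":
            pass  # no work at acquire in this eager variant
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return misses


if __name__ == "__main__":
    # Processor 0 produces page 7 twice; processor 1 consumes it twice.
    trace = [
        ("acquire", 0), ("write", 0, 7), ("release", 0),
        ("acquire", 1), ("read", 1, 7), ("release", 1),
        ("acquire", 0), ("write", 0, 7), ("release", 0),
        ("acquire", 1), ("read", 1, 7), ("release", 1),
    ]
    print(count_page_misses(trace, num_procs=2))  # prints [1, 2]
```

In this sketch the miss counts follow entirely from the access and synchronization events in the trace, with no dependence on timing, which is the property that makes this class of system amenable to modeling. A lazy variant would defer the invalidations to the next acquire instead of applying them at release.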
Preliminary efforts at developing such a performance model have focused on modeling the behavior of multigrain systems. A model of the MGS prototype has been developed, and validation of the model against the actual prototype has been encouraging. We have used the model to explore the entire space of multigrain architectures, as illustrated by the graph below.
This graph shows the space of multigrain architectures, parameterized by total number of processors along the X-axis, and number of processors per multiprocessor node along the Y-axis. For each point in this machine space, our model was used to predict the performance of the Water application from the SPLASH benchmark suite. The contours that cut through the machine space show lines of equivalent performance, where performance is increasing from the lower left corner to the upper right corner of the graph. For instance, our model predicts that a 128-processor all-hardware shared memory system exhibits the same performance as a 256-processor system built using 16-way SMPs.
Last updated: January 1999 by Donald Yeung (email@example.com)