% ------------------------------------------------------------------------- % % These bibtex bibliographic entries for the 24th % INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE % (1997) created by Julie Fingerson and Mark D. Hill % from author input. % % These entries are correct to the best of our knowledge, % but we accept no responsibility for the consequences of % any errors. Email corrections to hoffman@cs.wisc.edu. % Last change: Thu Mar 27 11:22:10 CST 1997. % % ------------------------------------------------------------------------- % % %Caching Techniques for Instruction Level Parallelism % % 1) Improving Superscalar Instruction Dispatch and Issue by Exploiting Dynamic Code Sequences @INPROCEEDINGS{Vajapeyam:isca97, AUTHOR = "Sriram Vajapeyam and Tulika Mitra", TITLE = "Improving Superscalar Instruction Dispatch and Issue by Exploiting Dynamic Code Sequences", BOOKTITLE = "Proceedings of the 24th Annual International Symposium on Computer Architecture", YEAR = 1997, PAGES = "???", MONTH = Jun, WHERE = "bound", ABSTRACT = " Superscalar processors currently have the potential to fetch multiple basic blocks per cycle by employing one of several recently proposed instruction fetch mechanisms. However, this increased fetch bandwidth cannot be exploited unless pipeline stages further downstream correspondingly improve. In particular, register renaming a large number of instructions per cycle is difficult. A large instruction window, needed to receive multiple basic blocks per cycle, will slow down dependence resolution and instruction issue. This paper addresses these and related issues by proposing (i) partitioning of the instruction window into multiple blocks, each holding a dynamic code sequence; (ii) logical partitioning of the register file into a global file and several local files, the latter holding registers local to a dynamic code sequence; (iii) the dynamic recording and reuse of register renaming information for registers local to a dynamic code sequence. Performance studies show these mechanisms improve performance over traditional superscalar processors by factors ranging from 1.5 to a little over 3 for the SPEC Integer programs. Next, it is observed that several of the loops in the benchmarks display vector-like behavior during execution, even if the static loop bodies are likely complex for compile-time vectorization. A dynamic loop vectorization mechanism that builds on top of the above mechanisms is briefly outlined. The mechanism vectorizes up to 60% of the dynamic instructions for some programs, albeit the average number of iterations per loop is quite small. "} % 2) Exploiting Instruction Level Parallelism in Processors by Caching Scheduled Groups @INPROCEEDINGS{nair:isca97, AUTHOR = "Ravi Nair and Martin E. Hopkins", TITLE = "Exploiting Instruction Level Parallelism in Processors by Caching Scheduled Groups", BOOKTITLE = "Proceedings of the 24th Annual International Symposium on Computer Architecture", YEAR = 1997, PAGES = "???", MONTH = Jun, WHERE = "bound", ABSTRACT = " Modern processors employ a large amount of hardware to dynamically detect parallelism in single-threaded programs and maintain the sequential semantics implied by these programs. The complexity of some of this hardware diminishes the gains due to parallelism because of longer clock period or increased pipeline latency of the machine. In this paper we propose a processor implementation which dynamically schedules groups of instructions while executing them on a fast simple engine and caches them for repeated execution on a fast VLIW-type engine. Our experiments show that scheduling groups spanning several basic blocks and caching these scheduled groups results in significant performance gain over fill buffer approaches for a standard VLIW cache. This concept, which we call DIF (Dynamic Instruction Formatting), unifies and extends principles underlying several schemes being proposed today to reduce superscalar processor complexity. This paper examines various issues in designing such a processor and presents results of experiments using trace-driven simulation of SPECint95 benchmark programs. "} % 3) DAISY: Dynamic Compilation for 100% Architectural Compatibility @INPROCEEDINGS{ebcioglu:isca97, AUTHOR = "Kemal Ebcioglu and E. R. Altman", TITLE = "DAISY: Dynamic Compilation for 100% Architectural Compatibility", BOOKTITLE = "Proceedings of the 24th Annual International Symposium on Computer Architecture", YEAR = 1997, PAGES = "???", MONTH = Jun, WHERE = "bound", ABSTRACT = " Although VLIW architectures offer the advantages of simplicity of design and high issue rates, a major impediment to their use is that they are not compatible with the existing software base. We describe new simple hardware features for a VLIW machine we call DAISY (Dynamically Architected Instruction Set from Yorktown). DAISY is specifically intended to emulate existing architectures, so that all existing software for an old architecture (including operating system kernel code) runs without changes on the VLIW. Each time a new fragment of code is executed for the first time, the code is translated to VLIW primitives, parallelized and saved in a portion of main memory not visible to the old architecture, by a Virtual Machine Monitor (software) residing in read only memory. Subsequent executions of the same fragment do not require a translation (unless cast out). We discuss the architectural requirements for such a VLIW, to deal with issues including self-modifying code, precise exceptions, and aggressive reordering of memory references in the presence of strong MP consistency and memory mapped I/O. We have implemented the dynamic parallelization algorithms for the PowerPC architecture. The initial results show high degrees of instruction level parallelism with reasonable translation overhead and memory usage. "} % % Networks and Input/Output % % 1) On Deadlocks in Interconnection Networks @INPROCEEDINGS{pinkston:isca97, AUTHOR = "Timothy Mark Pinkston and Sugath Warnakulasuriya", TITLE = "On Deadlocks in Interconnection Networks", BOOKTITLE = "Proceedings of the 24th Annual International Symposium on Computer Architecture", YEAR = 1997, PAGES = "???", MONTH = Jun, WHERE = "bound", ABSTRACT = " Deadlock avoidance-based and deadlock recovery-based routing algorithms have been proposed in recent years without full understanding of the likelihood and characteristics of actual deadlocks in interconnection networks. This work models the interrelationships between routing freedom, message blocking, correlated resource dependencies and deadlock formation. We empirically show that increasing routing freedom, as achieved by allowing unrestricted routing over multiple virtual channels, makes deadlocks highly improbable and reduces the likelihood of other types of correlated message blocking behavior that can degrade performance. Our results further substantiate that recovery-based routing algorithms have a higher potential performance advantage over deadlock avoidance-based routing algorithms which, inherently, allow less routing freedom. "} % 2) Implementing Multidestination Worms in Switch-Based Parallel Systems: Architectural Alternatives and their Impact @INPROCEEDINGS{stunkel:isca97, AUTHOR = "Craig B. Stunkel and Rajeev Sivaram and Dhabaleswar K. Panda", TITLE = "Implementing Multidestination Worms in Switch-Based Parallel Systems: Architectural Alternatives and their Impact", BOOKTITLE = "Proceedings of the 24th Annual International Symposium on Computer Architecture", YEAR = 1997, PAGES = "???", MONTH = Jun, WHERE = "bound", ABSTRACT = " Multidestination message passing has been proposed as an attractive mechanism for efficiently implementing multicast and other collective operations on direct networks. However, applying this mechanism to switch-based parallel systems is non-trivial. In this paper we propose alternative switch architectures with differing buffer organizations to implement multidestination worms on switch-based parallel systems. First, we discuss issues related to such implementation (deadlock-freedom, replication mechanisms, header encoding, and routing). Next, we demonstrate how an existing central-buffer-based switch architecture supporting unicast message passing can be enhanced to accommodate multidestination message passing. Similarly, implementing multidestination worms on an input-buffer-based switch architecture is discussed. Both of these implementations are evaluated against each other as well as against a software-based scheme using the central buffer organization. Simulation experiments under a range of traffic (multiple multicast, bimodal, varying degree of multicast, and message length) and system size are used for evaluation. The study demonstrates the superiority of the central-buffer-based switch architecture. It also indicates that under bimodal traffic the central-buffer-based hardware multicast implementation affects background unicast traffic less adversely compared to a software-based multicast implementation. Thus, multidestination message passing can easily be applied to switch-based parallel systems to deliver good collective communication performance. "} % 3) Tolerating Multiple Failures in RAID Architectures with Optimal Storage and Uniform Declustering @INPROCEEDINGS{alvarez:isca97, AUTHOR = "Guillermo A. Alvarez and Walter A. Burkhard and Flaviu Cristian", TITLE = "Tolerating Multiple Failures in RAID Architectures with Optimal Storage and Uniform Declustering", BOOKTITLE = "Proceedings of the 24th Annual International Symposium on Computer Architecture", YEAR = 1997, PAGES = "???", MONTH = Jun, WHERE = "bound", ABSTRACT = " We present Datum, a novel method for tolerating multiple disk failures in disk arrays. Datum is the first known method that can mask any given number of failures, requires an optimal amount of redundant storage space, and spreads reconstruction accesses uniformly over disks in the presence of failures without needing large layout tables in controller memory. Our approach is based on information dispersal, a coding technique that admits an efficient hardware implementation. As the method does not restrict the configuration parameters of the disk array, many existing RAID organizations are particular cases of Datum. A detailed performance comparison with two other approaches shows that Datum's response times are similar to those of the best competitor when two or less disks fail, and that the performance degrades gracefully when more than two disks fail. "} % %Improving Instruction Level Parallelism % % 1) Dynamic Speculation and Synchronization of Data Dependencies @INPROCEEDINGS{moshovos:isca97, AUTHOR = "Andreas Moshovos and Scott E. Breach and T. N. Vijaykumar and Gurindar S. Sohi", TITLE = "Dynamic Speculation and Synchronization of Data Dependences", BOOKTITLE = "Proceedings of the 24th Annual International Symposium on Computer Architecture", YEAR = 1997, PAGES = "???", MONTH = Jun, WHERE = "bound", ABSTRACT = " Data dependence speculation is used in instruction-level parallel (ILP) processors to allow early execution of an instruction before a logically preceding instruction on which it may be data dependent. If the instruction is independent, data dependence speculation succeeds; if not, it fails, and the two instructions must be synchronized. The modern dynamically scheduled processors that use data dependence speculation do so blindly (i.e., every load instruction with unresolved dependences is speculated). In this paper, we demonstrate that as dynamic instruction windows get larger, significant performance benefits can result when intelligent decisions about data dependence speculation are made. We propose dynamic data dependence speculation techniques: (i) to predict if the execution of an instruction is likely to result in a data dependence mis-speculation, and (ii) to provide the synchronization needed to avoid a mis-speculation. Experimental results evaluating the effectiveness of the proposed techniques are presented within the context of a Multiscalar processor. "} % 2) Dynamic Instruction Reuse @INPROCEEDINGS{sodani:isca97, AUTHOR = "Avinash Sodani and Gurindar S. Sohi", TITLE = "Dynamic Instruction Reuse", BOOKTITLE = "Proceedings of the 24th Annual International Symposium on Computer Architecture", YEAR = 1997, PAGES = "???", MONTH = Jun, WHERE = "bound", ABSTRACT = " This paper introduces the concept of dynamic instruction reuse. Empirical observations suggest that many instructions, and groups of instructions, having the same inputs, are executed dynamically. Such instructions do not have to be executed repeatedly -- their results can be obtained from a buffer where they were saved previously. This paper presents three hardware schemes for exploiting the phenomenon of dynamic instruction reuse, and evaluates their effectiveness using execution-driven simulation. We find that in some cases over 50% of the instructions can be reused. The speedups so obtained, though less striking than the percentage of instructions reused, are still quite significant. "} % 3) Complexity-Effective Superscalar Processors @INPROCEEDINGS{subbarao:isca97, AUTHOR = "Subbarao Palacharla and Norman. P. Jouppi and J. E. Smith", TITLE = "Complexity-Effective Superscalar Processors", BOOKTITLE = "Proceedings of the 24th Annual International Symposium on Computer Architecture", YEAR = 1997, PAGES = "???", MONTH = Jun, WHERE = "bound", ABSTRACT = " The performance tradeoff between hardware complexity and clock speed is studied. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, instruction window wakeup and selection logic, and operand bypassing are analyzed. Each is modeled and Spice simulated for feature sizes of 0.8um, 0.35um, and 0.18um. Performance results and trends are expressed in terms of issue width and window size. Our analysis indicates that window wakeup and selection logic as well as operand bypass logic are likely to be the most critical in the future. A microarchitecture that simplifies wakeup and selection logic is proposed and discussed. This implementation puts chains of dependent instructions into queues, and issues instructions from multiple queues in parallel. Simulation shows little slowdown as compared with a completely flexible issue window when performance is measured in clock cycles. Furthermore, because only instructions at queue heads need to be awakened and selected, issue logic is simplified and the clock cycle is faster -- consequently overall performance is improved. By grouping dependent instructions together, the proposed microarchitecture will help minimize performance degradation due to slow bypasses in future wide-issue machines. "} % % NUMA and COMA Architectures % % 1) Coherence Controller Architectures for an SMP-based CC-NUMA Multiprocessor @INPROCEEDINGS{michael:isca97, AUTHOR = "Maged M. Michael and Ashwini K. Nanda and Beng-Hong Lim and Michael L. Scott", TITLE = "Coherence Controller Architectures for SMP-Based CC-NUMA Multiprocessors", BOOKTITLE = "Proceedings of the 24th Annual International Symposium on Computer Architecture", YEAR = 1997, PAGES = "???", MONTH = Jun, WHERE = "bound", ABSTRACT = " Scalable distributed shared-memory architectures rely on coherence controllers on each processing node to synthesize cache-coherent shared memory across the entire machine. The coherence controllers execute coherence protocol handlers that may be hardwired in custom hardware or programmed in a protocol processor within each coherence controller. Although custom hardware runs faster, a protocol processor allows the coherence protocol to be tailored to specific application needs and may shorten hardware development time. Previous research show that the increase in application execution time due to protocol processors over custom hardware is minimal. With the advent of SMP nodes and faster processors and networks, the tradeoff between custom hardware and protocol processors needs to be reexamined. This paper studies the performance of custom-hardware and protocol-processor-based coherence controllers in SMP-node-based CC-NUMA systems on applications from the SPLASH-2 suite. Using realistic parameters and detailed models of existing state-of-the-art system components, it shows that the occupancy of coherence controllers can limit the performance of applications with high communication requirements, where the execution time using protocol processors can be twice as long as using custom hardware. To gain a deeper understanding of the tradeoff, we investigate the effect of varying several architectural parameters that influence the communication characteristics of the applications and the underlying system on coherence controller performance. We identify measures of applications' communication requirements and their impact on the performance penalty of protocol processors, which can help system designers predict performance penalties for other applications. We also study the potential of improving the performance of hardware-based and protocol-processor-based coherence controllers by separating or duplicating critical components. "} % 2) Reactive NUMA: A Design for Unifying S-COMA and CC-NUMA @INPROCEEDINGS{falsafi:isca97, AUTHOR = "Babak Falsafi and David A. Wood", TITLE = "Reactive NUMA: A Design for Unifying S-COMA and CC-NUMA", BOOKTITLE = "Proceedings of the 24th Annual International Symposium on Computer Architecture", YEAR = 1997, PAGES = "???", MONTH = Jun, WHERE = "bound", ABSTRACT = " This paper proposes and evaluates a new approach to directory-based cache coherence protocols called Reactive NUMA (R-NUMA). An R-NUMA system combines a conventional CC-NUMA coherence protocol with a more-recent Simple-COMA (S-COMA) protocol. What makes R-NUMA novel is the way it dynamically reacts to program and system behavior to switch between CC-NUMA and S-COMA and exploit the best aspects of both protocols. This reactive behavior allows each node in an R-NUMA system to independently choose the best protocol for a particular page, thus providing much greater performance stability than either CC-NUMA or S-COMA alone. Our evaluation is both qualitative and quantitative. We first show the theoretical result that R-NUMA's worst-case performance is bounded within a small constant factor (i.e., two to three times) of the best of CC-NUMA and S-COMA. We then use detailed execution-driven simulation to show that, in practice, R-NUMA usually performs better than either a pure CC-NUMA or pure S-COMA protocol, and no more than 57% worse than the best of CC-NUMA and S-COMA, for our benchmarks and base system assumptions. "} % 3) The SGI Origin: A ccNUMA Highly Scalable Server @INPROCEEDINGS{laudon:isca97, AUTHOR = "James Laudon and Daniel Lenoski", TITLE = "The SGI Origin: A ccNUMA Highly Scalable Server", BOOKTITLE = "Proceedings of the 24th Annual International Symposium on Computer Architecture", YEAR = 1997, PAGES = "???", MONTH = Jun, WHERE = "bound", ABSTRACT = " The SGI Origin 2000 is a cache-coherent non-uniform memory access (ccNUMA) multiprocessor designed and manufactured by Silicon Graphics, Inc. The Origin system was designed from the ground up as a multiprocessor capable of scaling to both small and large processor counts without any bandwidth, latency, or cost cliffs. The Origin system consists of up to 512 nodes interconnected by a scalable Craylink network. Each node consists of one or two R10000 processors, up to 4 GB of coherent memory, and a connection to a portion of the XIO IO subsystem. This paper discusses the motivation for building the Origin 2000 and then describes its ar chitecture and implementation. In addition, performance results are presented for the NAS Parallel Benchmarks V2.2 and the SPLASH2 applications. Finally, the Origin system is compared to other contemporary commercial ccNUMA systems. "} % % Issues in Shared Memory Systems % % 1 The Interaction of Software Prefetching with ILP processors in Shared-Memory Systems @INPROCEEDINGS{ranganathan:isca97, AUTHOR = "Parthasarathy Ranganathan and Vijay S. Pai and Hazim Abdel-Shafi and Sarita V. Adve", TITLE = "The Interaction of Software Prefetching with ILP Processors in Shared-Memory Systems", BOOKTITLE = "Proceedings of the 24th Annual International Symposium on Computer Architecture", YEAR = 1997, PAGES = "???", MONTH = Jun, WHERE = "bound", ABSTRACT = " Current microprocessors aggressively exploit instruction-level parallelism (ILP) through techniques such as multiple issue, dynamic scheduling, and non-blocking reads. Recent work has shown that memory latency remains a significant performance bottleneck for shared-memory multiprocessor systems built of such processors. This paper provides the first study of the effectiveness of software-controlled non-binding prefetching in shared-memory multiprocessors built of state-of-the-art ILP-based processors. We find that software prefetching results in significant reductions in execution time (12% to 31%) for three out of five applications on an ILP system. However, compared to previous-generation systems, software prefetching is significantly less effective in reducing the memory stall component of execution time on an ILP system. Consequently, even after adding software prefetching, memory stall time accounts for over 30% of the total execution time in four out of five applications on our ILP system. This paper also investigates the interaction of software prefetching with memory consistency models on ILP-based multiprocessors. In particular, we seek to determine whether software prefetching can equalize the performance of sequential consistency (SC) and release consistency (RC). We find that even with software prefetching, for three out of five applications, RC provides a significant reduction in execution time (15% to 40%) compared to SC. "} % 2) VM-Based Shared Memory on Low-Latency, Remote-Memory-Access Networks @INPROCEEDINGS{konthanassis:isca97, AUTHOR = "Leonidas Kontothanassis and Galen Hunt and Robert Stets and Nikolaos Hardavellas and Michal Cierniak and Srinivasan Parthasarathy and Wagner {Meira, Jr.} and Sandhya Dwarkadas and Michael Scott", TITLE = "VM-Based Shared Memory on Low-Latency, Remote-Memory-Access Networks", BOOKTITLE = "Proceedings of the 24th Annual International Symposium on Computer Architecture", YEAR = 1997, PAGES = "???", MONTH = Jun, WHERE = "bound", ABSTRACT = " Recent technological advances have produced network interfaces that provide users with very low-latency access to the memory of remote machines. We examine the impact of such networks on the implementation and performance of software DSM. Specifically, we compare two DSM systems -- Cashmere and TreadMarks -- on a 32-processor DEC Alpha cluster connected by a Memory Channel network. Both Cashmere and TreadMarks use virtual memory to maintain coherence on pages, and both use lazy, multi-writer release consistency. The systems differ dramatically, however, in the mechanisms used to track sharing information and to collect and merge concurrent updates to a page, with the result that Cashmere communicates much more frequently, and at a much finer grain. Our principal conclusion is that low-latency networks make DSM based on fine-grain communication competitive with more coarse-grain approaches, but that further hardware improvements will be needed before such systems can provide consistently superior performance. In our experiments, Cashmere scales slightly better than TreadMarks for applications with false sharing. At the same time, it is severely constrained by limitations of the current Memory Channel hardware. In general, performance is better for TreadMarks. "} % 3) Efficient Synchronization: Let Them Eat QOLB @INPROCEEDINGS{kagi:isca97, AUTHOR = "Alain K{\"{a}}gi and Doug Burger and James R. Goodman", TITLE = "Efficient Synchronization: Let Them Eat QOLB", BOOKTITLE = "Proceedings of the 24th Annual International Symposium on Computer Architecture", YEAR = 1997, PAGES = "???", MONTH = Jun, WHERE = "bound", ABSTRACT = " Efficient synchronization primitives are essential for achieving high performance in fine-grain, shared-memory parallel programs. One function of synchronization primitives is to enable exclusive access to shared data and critical sections of code. This paper makes three contributions. (1) We enumerate the five sources of overhead that locking synchronization primitives can incur. (2) We describe four mechanisms (local spinning, queue-based locking, collocation, and synchronized prefetch) that reduce these synchronization overheads. (3) With detailed simulations, we show the extent to which these four mechanisms can improve the performance of shared-memory programs. We evaluate the space of these mechanisms using seventeen synchronization constructs, which are formed from six base types of locks (Test&Set, Test&Test&Set, MCS, LH, M, and Qolb). We show that large performance gains (speedups of more than 1.5 for three of five benchmarks) can be achieved if at least three optimizing mechanisms are used simultaneously. We find that Qolb, which incorporates all four mechanisms, outperforms all other primitives (including reactive synchronization) in all cases. Finally, we demonstrate the superior performance of a low-cost implementation of Qolb, which runs on an unmodified cluster of commodity workstations. "} % % Multiprocessors % % 1) Hardware Fault Containment in Scalable Shared-Memory Multiprocessors @INPROCEEDINGS{teodosiu:isca97, AUTHOR = "Dan Teodosiu and Joel Baxter and Kinshuk Govil and John Chapin and Mendel Rosenblum and Mark Horowitz", TITLE = "Hardware Fault Containment in Scalable Shared-Memory Multiprocessors", BOOKTITLE = "Proceedings of the 24th Annual International Symposium on Computer Architecture", YEAR = 1997, PAGES = "???", MONTH = Jun, WHERE = "bound", ABSTRACT = " Current shared-memory multiprocessors are inherently vulnerable to faults: any significant hardware or system software fault causes the entire system to fail. Unless provisions are made to limit the impact of faults, users will perceive a decrease in reliability when they entrust their applications to larger machines. This paper shows that fault containment techniques can be effectively applied to scalable shared-memory multiprocessors to reduce the reliability problems created by increased machine size. The primary goal of our approach is to leave normal-mode performance unaffected. Rather than using expensive fault-tolerance techniques to mask the effects of data and resource loss, our strategy is based on limiting the damage caused by faults to only a portion of the machine. After a hardware fault, we run a distributed recovery algorithm that allows normal operation to be resumed in the functioning parts of the machine. Our approach is implemented in the Stanford FLASH multiprocessor. Using a detailed hardware simulator, we have performed a number of fault injection experiments on a FLASH system running Hive, an operating system designed to support fault containment. The results we report validate our approach and show that in conjunction with an operating system like Hive, we can improve the reliability seen by unmodified applications without substantial performance cost. Simulation results suggest that our algorithms scale well for systems up to 128 processors. "} % 2) Effects of Communication Latency, Overhead and Bandwidth in a Cluster Architecture @INPROCEEDINGS{martin:isca97, AUTHOR = "Richard P. Martin and Amin M. Vahdat and David E. Culler and Thomas E. Anderson", TITLE = "Effects of Communication Latency, Overhead, and\\Bandwidth in a Cluster Architecture", BOOKTITLE = "Proceedings of the 24th Annual International Symposium on Computer Architecture", YEAR = 1997, PAGES = "???", MONTH = Jun, WHERE = "bound", ABSTRACT = " This work provides a systematic study of the impact of communication performance on parallel applications in a high performance network of workstations. We develop an experimental system in which the communication latency, overhead, and bandwidth can be independently varied to observe the effects on a wide range of applications. Our results indicate that current efforts to improve cluster communication performance to that of tightly integrated parallel machines results in significantly improved application performance. We show that applications demonstrate strong sensitivity to overhead, slowing down by a factor of 60 on 32 processors when overhead is increased from 3 to 103 us. Applications in this study are also sensitive to per-message bandwidth, but are surprisingly tolerant of increased latency and lower per-byte bandwidth. Finally, most applications demonstrate a highly linear dependence to both overhead and per-message bandwidth, indicating that further improvements in communication performance will continue to improve application performance. "} % 3) The Mercury Interconnect Architecture: A Cost-effective Infrastructure for High-performance Servers @INPROCEEDINGS{weber:isca97, AUTHOR = "Wolf-Dietrich Weber and Stephen Gold and Pat Helland and Takeshi Shimizu and Thomas Wicki and Winfried Wilcke", TITLE = "The Mercury Interconnect Architecture: A Cost-effective Infrastructure for High-performance Servers", BOOKTITLE = "Proceedings of the 24th Annual International Symposium on Computer Architecture", YEAR = 1997, PAGES = "???", MONTH = Jun, WHERE = "bound", ABSTRACT = " This paper presents HAL's Mercury Interconnect Architecture, an interconnect infrastructure designed to link commodity microprocessors, memory, and I/O components into high-performance multiprocessing servers. Both shared-memory and message-passing systems, as well as hybrid systems are supported by the interconnect. The key attributes of the Mercury Interconnect Architecture are: low latency, high bandwidth, a modular and flexible design, reliability/availability/serviceability (RAS) features, and a simplicity that enables very cost-effective implementations. The first implementation of the architecture links multiple 4-processor Pentium\056 Pro based nodes. In a 4-node (16-processor) shared-memory configuration, this system achieves a remote read latency of just over 1\040micro-sec, and a maximum interconnect bandwidth of 6.4\040GByte/s. Both of these parameters far outpace comparable SCI-based solutions, while utilizing much fewer hardware components. "} % % Memory System Design % % 1) The Design and Analysis of a Cache Architecture for Texture Mapping @INPROCEEDINGS{hakura:isca97, AUTHOR = "Ziyad S. Hakura and Anoop Gupta", TITLE = "The Design and Analysis of a Cache Architecture for Texture Mapping", BOOKTITLE = "Proceedings of the 24th Annual International Symposium on Computer Architecture", YEAR = 1997, PAGES = "???", MONTH = Jun, WHERE = "bound", ABSTRACT = " The effectiveness of texture mapping in enhancing the realism of computer generated imagery has made support for real-time texture mapping a critical part of graphics pipelines. Despite a recent surge in interest in three-dimensional graphics from computer architects, high-quality high-speed texture mapping has so far been confined to costly hardware systems that use brute-force techniques to achieve high performance. One obstacle faced by designers of texture mapping systems is the requirement of extremely high bandwidth to texture memory. High bandwidth is necessary since there are typically tens to hundreds of millions of accesses to texture memory per second. In addition, to achieve the high clock rates required in graphics pipelines, low-latency access to texture memory is needed. In this paper, we propose the use of texture image caches to alleviate the above bottlenecks, and evaluate various tradeoffs that arise in such designs. We find that the factors important to cache behavior are (i) the representation of texture images in memory, (ii) the rasterization order on screen and (iii) the cache organization. Through a detailed investigation of these issues, we explore the best way to exploit locality of reference and determine whether this technique is robust with respect to different scenes and different amounts of texture. Overall, we observe that there is a significant amount of temporal and spatial locality and that the working set sizes are relatively small (at most 16KB) across all cases that we studied. Consequently, the memory bandwidth requirements of a texture cache system are substantially lower (at least three times and as much as fifteen times) than the memory bandwidth requirements of a system which achieves equivalent performance but does not utilize a cache. These results are very encouraging and indicate that caching is a promising approach to designing memory systems for texture mapping. "} % 2) Designing High Bandwidth On-Chip Caches @INPROCEEDINGS{wilson:isca97, AUTHOR = "Kenneth M. Wilson and Kunle Olukotun", TITLE = "Designing High Bandwidth On-Chip Caches", BOOKTITLE = "Proceedings of the 24th Annual International Symposium on Computer Architecture", YEAR = 1997, PAGES = "???", MONTH = Jun, WHERE = "bound", ABSTRACT = " In this paper we evaluate the performance of high bandwidth caches that employ multiple ports, multiple cycle hit times, on-chip DRAM, and a line buffer to find the organization that provides the best processor performance. Processor performance is measured in execution time using a dynamic superscalar processor running realistic benchmarks that include operating system references. The results show that a large dual-ported multi-cycle pipelined SRAM cache with a line buffer maximizes processor performance. A large pipelined cache provides both a low miss rate and a high CPU clock frequency. Dual-porting the cache and the use of a line buffer provide the bandwidth needed by a dynamic superscalar processor. In addition, the line buffer makes the pipelined dual-ported cache the best option by increasing cache port bandwidth and hiding cache latency. "} % 3) Memory System Design Considerations in Dynamically-Scheduled Processors @INPROCEEDINGS{farkas:isca97, AUTHOR = "Keith I. Farkas and Paul Chow and Norman P. Jouppi and Zvonko Vranesic" TITLE = "Memory-System Design Considerations for Dynamically-Scheduled Processors", BOOKTITLE = "Proceedings of the 24th Annual International Symposium on Computer Architecture", YEAR = 1997, PAGES = "???", MONTH = Jun, WHERE = "bound", ABSTRACT = " In this paper, we identify performance trends and design relationships between the following components of the data memory hierarchy in a dynamically-scheduled processor: the register file, the lockup-free data cache, the stream buffers, and the interface between these components and the lower levels of the memory hierarchy. Similar performance was obtained from all systems having support for fewer than four in-flight misses, irrespective of the register-file size, the issue width of the processor, and the memory bandwidth. While providing support for more than four in-flight misses did increase system performance, the improvement was less than that obtained by increasing the number of registers. The addition of stream buffers to the investigated systems led to a significant performance increase, with the larger increases for systems having less in-flight-miss support, greater memory bandwidth, or more instruction issue capability. The performance of these systems was not significantly affected by the inclusion of traffic filters, dynamic-stride calculators, or the inclusion of the per-load non-unity stride-predictor and the incremental-prefetching techniques, which we introduce. However, the incremental prefetching technique reduces the bandwidth consumed by stream buffers by 50% without a significant impact on performance."} % % Prefetching and Prediction % % 1) Prefetching Using Markov Predictors @INPROCEEDINGS{grunwald:isca97, AUTHOR = "Dirk Grunwald and Douglas Joseph", TITLE = "Prefetching Using Markov Predictors", BOOKTITLE = "Proceedings of the 24th Annual International Symposium on Computer Architecture", YEAR = 1997, PAGES = "???", MONTH = Jun, WHERE = "bound", ABSTRACT = " Prefetching is one approach to reducing the latency of memory operations in modern computer systems. In this paper, we describe the Markov prefetcher. This prefetcher acts as an interface between the on-chip and off-chip cache, and can be added to existing computer designs. The Markov prefetcher is distinguished by prefetching multiple reference predictions from the memory subsystem, and then prioritizing the delivery of those references to the processor. This design results in a prefetching system that provides good coverage, is accurate and produces timely results that can be effectively used by the processor. In our cycle-level simulations, the Markov Prefetcher reduces the overall execution stalls due to instruction and data memory operations by an average of 54 percent for various commercial benchmarks while only using two thirds the memory of a demand-fetch cache organization. "} % 2) Data Prefetching on the HP PA-8000 @INPROCEEDINGS{santhanam:isca97, AUTHOR = "Vatsa Santhanam and Edward H. Gornish and Wei-Chung Hsu", TITLE = "Data Prefetching on the HP PA-8000", BOOKTITLE = "Proceedings of the 24th Annual International Symposium on Computer Architecture", YEAR = 1997, PAGES = "???", MONTH = Jun, WHERE = "bound", ABSTRACT = " Memory latency is a major issue for many modern microprocessor based systems, including the Hewlett-Packard PA-8000. Due to its fast clock rate and wide issue capability, cache misses in the PA-8000 are very expensive. The PA-8000 combines out-of-order execution with multiple outstanding memory requests to tolerate memory latency; however, this approach has its limitations. In order to substantially reduce much of the memory latency penalty, the PA-8000 uses software-based data cache prefetching. In this paper, we discuss the implementation of the data prefetch generation algorithm in the Hewlett-Packard Precision Architecture (HP-PA) compiler. We present performance results for SPECfp95 on a PA-8000 system that show speedups, due to data prefetching, of up to 100% "} % 3) Target Prediction for Indirect Jumps @INPROCEEDINGS{chang:isca97, AUTHOR = "Po-Yung Chang and Eric Hao and Yale N. Patt", TITLE = "Target Prediction for Indirect Jumps", BOOKTITLE = "Proceedings of the 24th Annual International Symposium on Computer Architecture", YEAR = 1997, PAGES = "???", MONTH = Jun, WHERE = "bound", ABSTRACT = " As the issue rate and pipeline depth of high performance superscalar processors increase, the amount of speculative work issued also increases. Because speculative work must be thrown away in the event of a branch misprediction, wide-issue, deeply pipelined processors must employ accurate branch predictors to effectively exploit their performance potential. Many existing branch prediction schemes are capable of accurately predicting the direction of conditional branches. However, these schemes are ineffective in predicting the targets of indirect jumps achieving, on average, a prediction accuracy rate of 51.8\% for the SPECint95 benchmarks. In this paper, we propose a new prediction mechanism, the target cache, for predicting indirect jump targets. For the perl and gcc benchmarks, this mechanism reduces the indirect jump misprediction rate by 93.4\% and 63.3\% and the overall execution time by 14\% and 5\%. "} % % Branch Prediction % % 1) The Agree Predictor: A Mechanism for Reducing Negative Branch History Interference @INPROCEEDINGS{sprangle:isca97, AUTHOR = "Eric Sprangle, Robert S. Chappell, Mitch Alsup, and Yale N. Patt" TITLE = "The Agree Predictor: A Mechanism for Reducing Negative Branch History Interference" BOOKTITLE = "Proceedings of the 24th Annual International Symposium on Computer Architecture", YEAR = 1997, PAGES = "???", MONTH = Jun, WHERE = "bound", ABSTRACT = " Deeply pipelined, superscalar processors require accurate branch prediction to achieve high performance. Two-level branch predictors have been shown to achieve high prediction accuracy. It has also been shown that branch interference is a major contributor to the number of branches mispredicted by two-level predictors. This paper presents a new method to reduce the interference problem called agree prediction, which reduces the chance that two branches aliasing the same PHT ent ry will interfere negatively. We evaluate the performance of this scheme using full traces (both user and supervisor) of the SPECint95 benchmarks. The result is a reduction in the misprediction rate of gcc ranging from 8.62\% with a 64K--entry PHT up to 33 .3\% with a 1K--entry PHT. % 2) Trading Conflict and Capacity Aliasing in Conditional Branch Predictors @INPROCEEDINGS{michaud:isca97, AUTHOR = "Pierre Michaud and Andr\'e Seznec and Richard Uhlig", TITLE = "Trading Conflict and Capacity Aliasing in Conditional Branch Predictors", BOOKTITLE = "Proceedings of the 24th Annual International Symposium on Computer Architecture", YEAR = 1997, PAGES = "???", MONTH = Jun, WHERE = "bound", ABSTRACT = " As modern microprocessors employ deeper pipelines and issue multiple instructions per cycle, they are becoming increasingly dependent on accurate branch prediction. Because hardware resources for branch-predictor tables are invariably limited, it is not possible to hold all relevant branch history for all active branches at the same time, especially for large workloads consisting of multiple processes and operating-system code. The problem that results, commonly referred to as aliasing in the branch-predictor tables, is in many ways similar to the misses that occur in finite-sized hardware caches. In this paper we propose a new classification for branch aliasing based on the three-Cs model for caches, and show that conflict aliasing is a significant source of mispredictions. Unfortunately, the obvious method for removing conflicts -- adding tags and associativity to the predictor tables -- is not a cost-effective solution. To address this problem, we propose the skewed branch predictor, a multi-bank, tag-less branch predictor, designed specifically to reduce the impact of conflict aliasing. Through both analytical and simulation models, we show that the skewed branch predictor removes a substantial portion of conflict aliasing by introducing redundancy to the branch-predictor tables. Although this redundancy increases capacity aliasing compared to a standard one-bank structure of comparable size, our simulations show that the reduction in conflict aliasing overcomes this effect to yield a gain in prediction accuracy. Alternatively, we show that a skewed organization can achieve the same prediction accuracy as a standard one-bank organization but with half the storage requirements. "} % 3) A Language for Describing Predictors and Its Use to Automatically Synthesize Them @INPROCEEDINGS{emer:isca97, AUTHOR = "J. Emer and N. Gloy", TITLE = "A Language for Describing Predictors and Its Use to Automatically Synthesize Them", BOOKTITLE = "Proceedings of the 24th Annual International Symposium on Computer Architecture", YEAR = 1997, PAGES = "???", MONTH = Jun, WHERE = "bound", ABSTRACT = " As processor architectures have increased their reliance on speculative execution to improve performance, the importance of accurate prediction of what to execute speculatively has increased. Furthermore, the types of values predicted have expanded from the ubiquitous branch and call/return targets to the prediction of indirect jump targets, cache ways and data values. In general, the prediction process is one of identifying the current state of the system, and making a prediction for some as yet uncomputed value based on that state. Prediction accuracy is improved by learning what is a good prediction for that state using a feedback process at the time the predicted value is actually computed. While there have been a number of efforts to formally characterize this process, we have taken the approach of providing a simple algebraic-style notation that allows one to express this state identification and feedback process. This notation allows one to describe a wide variety of predictors in a uniform way. It also facilitates the use of an efficient search technique called genetic programming, which is loosely modeled on the natural evolutionary process, to explore the design space. In this paper we describe our notation and the results of the application of genetic programming to the design of branch and indirect jump predictors. "} % % Managing the Memory Hierarchy and Memory-Centric Architectures % % 1) Run-Time Adaptive Cache Hierarchy Management via Reference Analysis @INPROCEEDINGS{johnson:isca97, AUTHOR = "Teresa L. Johnson and Wen-mei Hwu", TITLE = "Run-Time Adaptive Cache Hierarchy Management via Reference Analysis", BOOKTITLE = "Proceedings of the 24th Annual International Symposium on Computer Architecture", YEAR = 1997, PAGES = "???", MONTH = Jun, WHERE = "bound", ABSTRACT = " Improvements in main memory speeds have not kept pace with increasing processor clock frequency and improved exploitation of instruction-level parallelism. Consequently, the gap between processor and main memory performance is expected to grow, increasing the number of execution cycles spent waiting for memory accesses to complete. One solution to this growing problem is to reduce the number of cache misses by increasing the effectiveness of the cache hierarchy. In this paper we present a technique for dynamic analysis of program data access behavior, which is then used to proactively guide the placement of data within the cache hierarchy in a location-sensitive manner. We introduce the concept of a macroblock, which allows us to feasibly characterize the memory locations accessed by a program, and a Memory Address Table, which performs the dynamic reference analysis. Our technique is fully compatible with existing Instruction Set Architectures. Results from detailed simulations of several integer programs show significant speedups."} % 2) The Energy Efficiency of IRAM Architectures @INPROCEEDINGS{fromm:isca97, AUTHOR = "Richard Fromm and Stylianos Perissakis and Neal Cardwell and Bruce McGaughy and Christoforos Kozyrakis and David Patterson and Thomas Anderson and Katherine Yelick", TITLE = "The Energy Efficiency of IRAM Architectures", BOOKTITLE = "Proceedings of the 24th Annual International Symposium on Computer Architecture", YEAR = 1997, PAGES = "???", MONTH = Jun, WHERE = "bound", ABSTRACT = " Portable systems demand energy efficiency in order to maximize battery life. IRAM architectures, which combine DRAM and a processor on the same chip in a DRAM process, are more energy efficient than conventional systems. The high density of DRAM permits a much larger amount of memory on-chip than a traditional SRAM cache design in a logic process. This allows most or all IRAM memory accesses to be satisfied on-chip. Thus there is much less need to drive high-capacitance off-chip buses, which contribute significantly to the energy consumption of a system. To quantify this advantage we apply models of energy consumption in DRAM and SRAM memories to results from cache simulations of applications reflective of personal productivity tasks on low power systems. We find that IRAM memory hierarchies consume as little as 22% of the energy consumed by a conventional memory hierarchy for memory-intensive applications, while delivering comparable performance. Furthermore, the energy consumed by a system consisting of an IRAM memory hierarchy combined with an energy efficient CPU core is as little as 40% of that of the same CPU core with a traditional memory hierarchy. "} % 3) DataScalar Architectures @INPROCEEDINGS{burger:isca97, AUTHOR = "Doug Burger and Stefanos Kaxiras and James R. Goodman", TITLE = "DataScalar Architectures", BOOKTITLE = "Proceedings of the 24th Annual International Symposium on Computer Architecture", YEAR = 1997, PAGES = "???", MONTH = Jun, WHERE = "bound", ABSTRACT = " DataScalar architectures improve memory system performance by running computation redundantly across multiple processors, which are each tightly coupled with an associated memory. The program data set (and/or text) is distributed across these memories. In this execution model, each processor broadcasts operands it loads from its local memory to all other units. In this paper, we describe the benefits, costs, and problems associated with the DataScalar model. We also present simulation results of one possible implementation of a DataScalar system. In our simulated implementation, six unmodified SPEC95 binaries ran from 7% slower to 50% faster on two nodes, and from 9% to 100% faster on four nodes, than on a system with a comparable, more traditional memory system. Our intuition and results show that DataScalar architectures work best with codes for which traditional parallelization techniques fail. We conclude with a discussion of how DataScalar systems may accommodate traditional parallel processing, thus improving performance over a much wider range of applications than is currently possible with either model. "}