% ------------------------------------------------------------------------- % % These bibtex bibliographic entries for the 23rd % INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE % (1996) created by Julie Fingerson and Mark D. Hill % from author input. % % These entries are correct to the best of our knowledge, % but we accept no responsibility for the consequences of % any errors. Email corrections to hoffman@cs.wisc.edu. % Last change: Mon Mar 18 14:01:53 CST 1996 % % ------------------------------------------------------------------------- % % % Session 2A: Branch Prediction % % 1) Using Hybrid Branch Predictors to Improve Branch Prediction Accuracy in the Presence of Context Switches @INPROCEEDINGS{evers:isca96, AUTHOR = "Marius Evers and Po-Yung Chang and Yale N. Patt", TITLE = "Using Hybrid Branch Predictors to Improve Branch Prediction Accuracy in the Presence of Context Switches", BOOKTITLE = "Proceedings of the 23rd Annual International Symposium on Computer Architecture", YEAR = 1996, PAGES = "3-11", MONTH = May, WHERE = "bound", ABSTRACT = " Pipeline stalls due to conditional branches represent one of the most significant impediments to realizing the performance potential of deeply pipelined, superscalar processors. Many branch predictors have been proposed to help alleviate this problem, including the Two-Level Adaptive Branch Predictor, and more recently, two-component hybrid branch predictors. In a less idealized environment, such as a time-shared system, code of interest involves context switches. Context switches, even at fairly large intervals, can seriously degrade the performance of many of the most accurate branch prediction schemes. In this paper, we introduce a new hybrid branch predictor and show that it is more accurate (for a given cost) than any previously published scheme, especially if the branch histories are periodically flushed due to the presence of context switches. "} % 2) An Analysis of Dynamic Branch Prediction Schemes on System Workloads @INPROCEEDINGS{gloy:isca96, AUTHOR = "Nicolas Gloy and Cliff Young and J. Bradley Chen and Michael D. Smith", TITLE = "An Analysis of Dynamic Branch Prediction Schemes on System Workloads", BOOKTITLE = "Proceedings of the 23rd Annual International Symposium on Computer Architecture", YEAR = 1996, PAGES = "12-21", MONTH = May, WHERE = "bound", ABSTRACT = " Recent studies of dynamic branch prediction schemes rely almost exclusively on user-only simulations to evaluate performance. We find that an evaluation of these schemes with user and kernel references often leads to different conclusions. By analyzing our own Atom-generated system traces and the system traces from the Instruction Benchmark Suite, we quantify the effects of kernel and user interactions on branch prediction accuracy. We find that user-only traces yield accurate prediction results only when the kernel accounts for less than 5% of the total executed instructions. Schemes that appear to predict well under user-only traces are not always the most effective on full-system traces: the recently-proposed two-level adaptive schemes can suffer from higher aliasing than the original per-branch 2-bit counter scheme. We also find that flushing the branch history state at fixed intervals does not accurately model the true effects of user/kernel interaction. "} % 3) Correlation and Aliasing in Dynamic Branch Predictors @INPROCEEDINGS{sechrest:isca96, AUTHOR = "Stuart Sechrest and Chih-Chieh Lee and Trevor N. Mudge", TITLE = "Correlation and Aliasing in Dynamic Branch Predictors", BOOKTITLE = "Proceedings of the 23rd Annual International Symposium on Computer Architecture", YEAR = 1996, PAGES = "22-32", MONTH = May, WHERE = "bound", ABSTRACT = " Previous branch prediction studies have relied primarily upon the SPECint89 and SPECint92 benchmarks for evaluation. Most of these benchmarks exercise a very small amount of code. As a consequence, the resources required by these schemes for accurate predictions of larger programs have not been clear. Moreover, many of these studies have simulated a very limited number of configurations. Here we report on simulations of a variety of branch prediction schemes using a set of relatively large benchmark programs that we believe to be more representative of likely system workloads. We have examined the sensitivity of these prediction schemes to variation in workload, in resources, and in design and configuration. We show that for predictors with small available resources, aliasing between distinct branches can have the dominant influence on prediction accuracy. For global history based schemes, such as GAs and gshare, aliasing in the predictor table can eliminate any advantage gained through inter branch correlation. For self- history based prediction scheme, such as PAs, it is aliasing in the buffer recording branch history, rather than the predictor table, that poses problems. Past studies have sometimes confused these effects and allocated resources incorrectly. "} % % Session 2B: Shared Memory % % 1) Decoupled Hardware Support for Distributed Shared Memory @INPROCEEDINGS{reinhardt:isca96, AUTHOR = "Steven K. Reinhardt and Robert W. Pfile and David A. Wood", TITLE = "Decoupled Hardware Support for Distributed Shared Memory", BOOKTITLE = "Proceedings of the 23rd Annual International Symposium on Computer Architecture", YEAR = 1996, PAGES = "34-43", MONTH = May, WHERE = "bound", ABSTRACT = " This paper investigates hardware support for fine-grain distributed shared memory (DSM) in networks of workstations. To reduce design time and implementation cost relative to dedicated DSM systems, we decouple the functional hardware components of DSM support, allowing greater use of off-the-shelf devices. We present two decoupled systems, Typhoon-0 and Typhoon-1. Typhoon-0 uses an off-the-shelf protocol processor and network interface; a custom access control device is the only DSM-specific hardware. To demonstrate the feasibility and simplicity of this access control device, we designed and built an FPGA-based version in under one year. Typhoon-1 also uses an off-the-shelf protocol processor, but integrates the network interface and access control devices for higher performance. We compare the performance of the two decoupled systems with two integrated systems via simulation. For six benchmarks on 32 nodes, Typhoon-0 ranges from 30% to 309% slower than the best integrated system, while Typhoon-1 ranges from 13% to 132% slower. Four of the six benchmarks achieve speedups of 12 to 18 on Typhoon-0 and 15 to 26 on Typhoon-1, compared with 19 to 35 on the best integrated system. Two benchmarks are hampered by high communication overheads, but selectively replacing shared-memory operations with message passing provides speedups of at least 16 on both decoupled systems. These speedups indicate that decoupled designs can potentially provide a cost-effective alternative to complex high-end DSM systems. "} % 2) MGS: A Multigrain Shared Memory System @INPROCEEDINGS{yeung:isca96, AUTHOR = "Donald Yeung and John Kubiatowicz and Anant Agarwal", TITLE = "MGS: A Multigrain Shared Memory System", BOOKTITLE = "Proceedings of the 23rd Annual International Symposium on Computer Architecture", YEAR = 1996, PAGES = "44-55", MONTH = May, WHERE = "", ABSTRACT = " Parallel workstations, each comprising 10-100 processors, promise cost-effective general-purpose multiprocessing. This paper explores the coupling of such small- to medium-scale shared memory multiprocessors through software over a local area network to synthesize larger shared memory systems. We call these systems Distributed Scalable Shared-memory Multiprocessors DSSMPs. This paper introduces the design of a shared memory system that uses multiple granularities of sharing, and presents an implementation on the Alewife multiprocessor, called MGS. Multigrain shared memory enables the collaboration of hardware and software shared memory, and is effective at exploiting a form of locality called multigrain locality. The system provides efficient support for fine-grain cache-line sharing, and resorts to coarse-grain page-level sharing only when locality is violated. A framework for characterizing application performance on DSSMPs is also introduced. Using MGS, an in-depth study of several shared memory applications is conducted to understand the behavior of DSSMPs. We find that unmodified shared memory applications can exploit multigrain sharing. Keeping the number of processors fixed, applications execute up to 85\% faster when each DSSMP node is a multiprocessor as opposed to a uniprocessor. We also show that tightly-coupled multiprocessors hold a significant performance advantage over DSSMPs on unmodified applications. However, a best-effort implementation of a kernel from one of the applications allows a DSSMP to almost match the performance of a tightly-coupled multiprocessor. "} % 3) COMA: An Opportunity for Building Fault-Tolerant Scalable Shared Memory Multiprocessors @INPROCEEDINGS{morin:isca96, AUTHOR = "Christine Morin and Alain Gefflaut and Michel Ban\^{a}tre and Anne-Marie Kermarrec", TITLE = "COMA: An Opportunity for Building Fault-Tolerant Scalable Shared Memory Multiprocessors", BOOKTITLE = "Proceedings of the 23rd Annual International Symposium on Computer Architecture", YEAR = 1996, PAGES = "56-65", MONTH = May, WHERE = "bound", ABSTRACT = " Due to the increasing number of their components, Scalable Shared Memory Multiprocessors (SSMMs) have a very high probability of experiencing failures. Tolerating node failures therefore becomes very important for these architectures particularly if they must be used for long-running computations. In this paper, we show that the class of Cache Only Memory Architectures (COMA) are good candidates for building fault-tolerant SSMMs. A backward error recovery strategy can be implemented without significant hardware modification to previously proposed COMA by exploiting their standard replication mechanisms and extending the coherence protocol to transparently manage recovery data. Evaluation of the proposed fault-tolerant COMA is based on execution-driven simulations using some of the Splash applications. We show that, for the simulated architecture, the performance degradation caused by fault-tolerance mechanisms varies from 5 \% in the best case to 35 \% in the worst case. The standard memory behavior is only slightly perturbed. Moreover, results also show that the proposed scheme preserves the architecture scalability and that the memory overhead remains low for parallel applications using mostly shared data. "} % % Session 3: Processor/Memory Tradeoffs % % 1) Evaluation of Design Alternatives for a Multiprocessor Microprocessor @INPROCEEDINGS{Nayfeh:isca96, AUTHOR = "Basem A. Nayfeh and Lance Hammond and Kunle Olukotun", TITLE = "Evaluation of Design Alternatives for a Multiprocessor Microprocessor", BOOKTITLE = "Proceedings of the 23rd Annual International Symposium on Computer Architecture", YEAR = 1996, PAGES = "67-77", MONTH = May, WHERE = "bound", ABSTRACT = " In the future, advanced integrated circuit processing and packaging technology will allow for several design options for multiprocessor microprocessors. In this paper we consider three architectures: shared-primary cache, shared-secondary cache, and shared-memory. We evaluate these three architectures using a complete system simulation environment which models the CPU, memory hierarchy and I/O devices in sufficient detail to boot and run a commercial operating system. Within our simulation environment, we measure performance using representative hand and compiler generated parallel applications, and a multiprogramming workload. Our results show that when applications exhibit fine-grained sharing, both shared-primary and shared-secondary architectures perform similarly when the full costs of sharing the primary cache are included. "} % 2) Memory Bandwidth Limitations of Future Microprocessors @INPROCEEDINGS{burger:isca96, AUTHOR = "Doug Burger and James R. Goodman and Alain {K\"{a}gi}", TITLE = "Memory Bandwidth Limitations of Future Microprocessors", BOOKTITLE = "Proceedings of the 23rd Annual International Symposium on Computer Architecture", YEAR = 1996, PAGES = "78-89", MONTH = May, WHERE = "bound", ABSTRACT = " This paper makes the case that pin bandwidth will be a critical consideration for future microprocessors. We show that many of the techniques used to tolerate growing memory latencies do so at the expense of increased bandwidth requirements. Using a decomposition of execution time, we show that for modern processors that employ aggressive memory latency tolerance techniques, wasted cycles due to insufficient bandwidth generally exceed those due to raw memory latencies. Given the importance of maximizing memory bandwidth, we calculate effective pin bandwidth, then estimate optimal effective pin bandwidth. We measure these quantities by determining the amount by which both caches and minimal-traffic caches filter accesses to the lower levels of the memory hierarchy. We see that there is a gap that can exceed two orders of magnitude between the total memory traffic generated by caches and the minimal-traffic caches-implying that the potential exists to increase effective pin bandwidth substantially. We decompose this traffic gap into four factors, and show they contribute quite differently to traffic reduction for different benchmarks. We conclude that, in the short term, pin bandwidth limitations will make more complex on-chip caches cost-effective. For example, flexible caches may allow individual applications to choose from a range of caching policies. In the long term, we predict that off-chip accesses will be so expensive that all system memory will reside on one or more processor chips. "} % 3) Missing the Memory Wall: The Case for Processor/Memory Integration @INPROCEEDINGS{saulsbury:isca96, AUTHOR = "Ashley Saulsbury and Fong Pong and Andreas Nowatzyk", TITLE = "Missing the Memory Wall: The Case for Processor/Memory Integration", BOOKTITLE = "Proceedings of the 23rd Annual International Symposium on Computer Architecture", YEAR = 1996, PAGES = "90-101", MONTH = May, WHERE = "bound", ABSTRACT = " Current high performance computer systems use complex, large superscalar CPUs that interface to the main memory through a hierarchy of caches and interconnect systems. These CPU- centric designs invest a lot of power and chip area to bridge the widening gap between CPU and main memory speeds. Yet, many large applications do not operate well on these systems and are limited by the memory subsystem performance. This paper argues for an integrated system approach that uses less-powerful CPUs that are tightly integrated with advanced memory technologies to build competitive systems with greatly reduced cost and complexity. Based on a design study using the next generation 0.25mm, 256Mbit dynamic random-access memory (DRAM) process and on the analysis of existing machines, we show that processor memory integration can be used to build competitive, scalable and cost-effective MP systems. We present results from execution driven uni- and multi-processor simulations showing that the benefits of lower latency and higher bandwidth can compensate for the restrictions on the size and complexity of the integrated processor. In this system, small direct mapped instruction caches with long lines are very effective, as are column buffer data caches augmented with a victim cache. "} % % Session 5A: Cache Organization % % 1) Don't Use the Page Number, But a Pointer To It @INPROCEEDINGS{seznec:isca96, AUTHOR = "Andr\'e Seznec", TITLE = "Don't Use the Page Number, But a Pointer To It", BOOKTITLE = "Proceedings of the 23rd Annual International Symposium on Computer Architecture", YEAR = 1996, PAGES = "104-113", MONTH = May, WHERE = "bound", ABSTRACT = " Most newly announced high performance microprocessors support 64-bit virtual addresses and the width of physical addresses is also growing. As a result, the size of the address tags in the L1 cache is increasing. The impact of on chip area is particularly dramatic when small block sizes are used. At the same time, the performance of high performance microprocessors depends more and more on the accuracy of branch prediction and for reasons similar to those in the case of caches the size of the Branch Target Buffer is also increasing linearly with the address width. In this paper, we apply the simple principle stated in the title for limiting the tag size of on-chip caches. In the resulting indirect- tagged cache, the duplication of the page number in processors (in TLB and in cache tags) is removed. The tag check is then simplified and the tag cost does not depend on the address width. Applying the same principle to Branch Target Buffers, we propose the Reduced Branch Target Buffer. The storage size in a Reduced Branch Target Buffer does not depend on the address width and is dramatically smaller than the size of the conventional implementation of a Branch Target Buffer. "} % 2) The Difference-bit Cache @INPROCEEDINGS{juan:isca96, AUTHOR = "Toni Juan and Tomas Lang and Juan J. Navarro", TITLE = "The Difference-bit Cache", BOOKTITLE = "Proceedings of the 23rd Annual International Symposium on Computer Architecture", YEAR = 1996, PAGES = "114-120", MONTH = May, WHERE = "bound", ABSTRACT = " The difference-bit cache is a two-way set-associative cache with an access time that is smaller than that of a conventional one and close or equal to that of a direct-mapped cache. This is achieved by noticing that the two tags for a set have to differ at least by one bit and by using this bit to select the way. In contrast with previous approaches that predict the way and have two types of hits (primary of one cycle and secondary of two to four cycles), all hits of the difference-bit cache are of one cycle. The evaluation of the access time of our cache organization has been performed using a recently proposed on-chip cache access model. "} % % Session 5B: Application Implications for MP Systems % % 1) Understanding the Performance of Shared Virtual Memory from an Applications Perspective @INPROCEEDINGS{iftode:isca96, AUTHOR = "Liviu Iftode and Jaswinder Pal Singh and Kai Li", TITLE = " Understanding the Performance of Shared Virtual Memory from an Applications Perspective", BOOKTITLE = "Proceedings of the 23rd Annual International Symposium on Computer Architecture", YEAR = 1996, PAGES = "122-133", MONTH = May, WHERE = "bound", ABSTRACT = " Many researchers have proposed interesting protocols for shared virtual memory (SVM) systems, and demonstrated performance improvements on parallel programs. However, there is still no clear understanding of the performance potential of SVM systems for different classes of applications. This paper begins to fill this gap, by studying the performance of a range of applications in detail and understanding it in light of application characteristics. We first develop a brief classification of the inherent data sharing patterns in the applications, and how they interact with system granularities to yield the communication patterns relevant to SVM systems. We then use detailed simulation to compare the performance of two SVM approaches---Lazy Released Consistency (LRC) and Automatic Update Release Consistency (AURC)-with each other and with an all- hardware CC-NUMA approach. We examine how performance is affected by problem size, machine size, key system parameters, and the use of less optimized program implementations. We find that SVM can indeed perform quite well for systems of at least up to 32 processors for several nontrivial applications. However, performance is much more variable across applications than on CC-NUMA systems, and the problem sizes needed to obtain good parallel performance are substantially larger. The hardware-assisted AURC system tends to perform significantly better than the all-software LRC under our system assumptions, particularly when realistic cache hierarchies are used. "} % 2) Application and Architectural Bottlenecks in Large Scale Distributed Shared Memory Machines @INPROCEEDINGS{holt:isca96, AUTHOR = "Chris Holt and Jaswinder Pal Singh and John Hennessy", TITLE = "Application and Architectural Bottlenecks in Large Scale Shared Memory Multiprocessors", BOOKTITLE = "Proceedings of the 23rd Annual International Symposium on Computer Architecture", YEAR = 1996, PAGES = "134-145", MONTH = May, WHERE = "bound", ABSTRACT = " Many of the programming challenges encountered in small to moderate- scale hardware cache-coherent shared memory machines have been extensively studied. While work remains to be done, the basic techniques needed to efficiently program such machines have been well explored. Recently, a number of researchers have presented architectural techniques for scaling a cache coherent shared address space to much larger processor counts. In this paper, we examine the extent to which applications can achieve reasonable performance on such large-scale, cache-coherent, distributed shared address space machines, by determining the problems sizes needed to achieve a reasonable level of efficiency. We also look at how much programming effort and optimization is needed to achieve high efficiency, beyond that needed at small processor counts. For each application, we discuss the main architectural bottlenecks that prevent smaller problem sizes or less optimized programs from achieving good efficiency. Our results show that while there are some applications that either do not scale or must be heavily optimized to do so, for most of the applications we studied it is not necessary to heavily modify the code or restructure algorithms to scale well up to several hundred processors, once the basic techniques for load balancing and data locality are used that are needed for small-scale systems as well. Programs written with some care perform well without substantially compromising the ease of programming advantage of a shared address space, and the problem sizes required to achieve good performance are surprisingly small. It is important to be careful about how data structures and layouts interact with system granularities, but these optimizations are usually needed for moderate-scale machines as well. "} % % Session 6A: Superscalar Memory Systems % % 1) Increasing Cache Port Efficiency for Dynamic Superscalar Microprocessors @INPROCEEDINGS{wilson:isca96, AUTHOR = "Kenneth M. Wilson and Kunle Olukotun and Mendel Rosenblum", TITLE = "Increasing Cache Port Efficiency for Dynamic Superscalar Microprocessors", BOOKTITLE = "Proceedings of the 23rd Annual International Symposium on Computer Architecture", YEAR = 1996, PAGES = "147-157", MONTH = May, WHERE = "bound", ABSTRACT = " The memory bandwidth demands of modern microprocessors require the use of a multi-ported cache to achieve peak performance. However, multi- ported caches are costly to implement. In this paper we propose techniques for improving the bandwidth of a single cache port by using additional buffering in the processor, and by taking maximum advantage of a wider cache port. We evaluate these techniques using realistic applications that include the operating system. Our techniques using a single-ported cache achieve 91% of the performance of a dual-ported cache. "} % 2) High-Bandwidth Address Translation for Multiple-Issue Processors @INPROCEEDINGS{austin:isca96, AUTHOR = "Todd M. Austin and Gurindar S. Sohi", TITLE = "High-Bandwidth Address Translation for Multiple-Issue Processors", BOOKTITLE = "Proceedings of the 23rd Annual International Symposium on Computer Architecture", YEAR = 1996, PAGES = "158-167", MONTH = May, WHERE = "bound", ABSTRACT = " In an effort to push the envelope of system performance, micro- processor designs are continually exploiting higher levels of instruction-level parallelism, resulting in increasing bandwidth demands on the address translation mechanism. Most current microprocessor designs meet this demand with a multi-ported TLB. While this design provides an excellent hit rate at each port, its access latency and area grow very quickly as the number of ports is increased. As bandwidth demands continue to increase, multi- ported designs will soon impact memory access latency. We present four high-bandwidth address translation mechanisms with latency and area characteristics that scale better than a multi- ported TLB design. We extend traditional high-bandwidth memory design techniques to address translation, developing interleaved and multi-level TLB designs. In addition, we introduce two new designs crafted specifically for high-bandwidth address translation. Piggyback ports are introduced as a technique to exploit spatial locality in simultaneous translation requests, allowing accesses to the same virtual memory page to combine their requests at the TLB access port. Pretranslation is introduced as a technique for attaching translations to base register values, making it possible to reuse a single translation many times. We perform extensive simulation-based studies to evaluate our designs. We vary key system parameters, such as processor model, page size, and number of architected registers, to see what effects these changes have on the relative merits of each approach. A number of designs show particular promise. Multi-level TLBs with as few as eight entries in the upper-level TLB nearly achieve the performance of a TLB with unlimited bandwidth. Piggyback ports combined with a lesser-ported TLB structure, e.g., an interleaved or multi-ported TLB, also perform well. Pretranslation over a single- ported TLB performs almost as well as a same-sized multi-level TLB with the added benefit of decreased access latency for physically indexed caches. "} % % Session 6B: I/O and Interrupts % % 1) DCD----Disk Caching Disk: A New Approach for Boosting I/O Performance @INPROCEEDINGS{huyang:isca96, AUTHOR = "Yiming Hu and Qing Yang", TITLE = "DCD----Disk Caching Disk: A New Approach for Boosting I/O Performance", BOOKTITLE = "Proceedings of the 23rd Annual International Symposium on Computer Architecture", YEAR = 1996, PAGES = "169-178", MONTH = May, WHERE = "bound", ABSTRACT = " This paper presents a novel disk storage architecture called DCD, Disk Caching Disk, for the purpose of optimizing I/O performance. The main idea of the DCD is to use a small log disk, referred to as cache-disk, as a secondary disk cache to optimize write performance. While the cache-disk and the normal data disk have the same physical properties, the access speed of the former differs dramatically from the latter because of different data units and different ways in which data are accessed. Our objective is to exploit this speed difference by using the log disk as a cache to build a reliable and smooth disk hierarchy. The log disk caches disk writes reliably and efficiently because seek time and rotation time are eliminated for most disk operations on the log disk. A small RAM buffer is used to collect small write requests to form a log which is transferred onto the cache-disk whenever the cache-disk is idle. Because of the temporal locality that exists in office/engineering work-load environments, the DCD system shows write performance close to the same size RAM (i.e. solid- state disk) for the cost of a disk. Moreover, the cache-disk can also be implemented as a logical disk in which case a small portion of the normal data disk is used as the log disk. Trace-driven simulation experiments are carried out to evaluate the performance of the proposed disk architecture. Under the office/engineering work-load environment, the DCD shows superb disk performance for writes as compared to existing disk systems. Performance improvements of up to two orders of magnitude are observed in terms of average response time for write operations. Furthermore, DCD is very reliable and works at the device or device driver level. As a result, it can be applied directly to current file systems without the need of changing the operating system. "} % 2) Polling Watchdog: Combining Polling and Interrupts for Efficient Message Handling @INPROCEEDINGS{maquelin:isca96, AUTHOR = "Olivier Maquelin and Guang R. Gao and Herbert H. J. Hum and Kevin B. Theobald and Xin-Min Tian", TITLE = "Polling Watchdog: Combining Polling and Interrupts for Efficient Message Handling", BOOKTITLE = "Proceedings of the 23rd Annual International Symposium on Computer Architecture", YEAR = 1996, PAGES = "179-188", MONTH = May, WHERE = "bound", ABSTRACT = " Parallel systems supporting multithreading, or message passing in general, have typically used either polling or interrupts to handle incoming messages. Neither approach is ideal; either may lead to excessive overheads or message-handling latencies, depending on the application. This paper investigates a combined approach -- Polling Watchdog, where both are used depending on the circumstances. The Polling Watchdog is a simple hardware extension that limits the generation of interrupts to the cases where explicit polling fails to handle the message quickly. As an added benefit, this mechanism also has the potential to simplify the interaction between interrupts and the network accesses performed by the program. We present the resulting performance for the EARTH-MANNA-S system, an implementation of the EARTH (Efficient Architecture for Running THreads) execution model on the MANNA multiprocessor. In contrast to the original EARTH-MANNA system, this system does not use a dedicated communication processor. Rather, synchronization and communication tasks are performed on the same processor as the regular computations. Therefore, an efficient message-handling mechanism is essential to good performance. Simulation results and performance measurements show that the Polling Watchdog indeed performs better than either polling or interrupts alone. In fact, this mechanism allows the EARTH-MANNA-S system to achieve the same level of performance as the original EARTH-MANNA multithreaded system. "} % % Session 7A: Processor Microarchitecture % % 1) Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous MultiThreading Processor @INPROCEEDINGS{tullsen:isca96, AUTHOR = "Dean M. Tullsen and Susan J. Eggers and Joel S. Emer and Henry M. Levy and Jack L. Lo and Rebecca L. Stamm", TITLE = "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor", BOOKTITLE = "Proceedings of the 23rd Annual International Symposium on Computer Architecture", YEAR = 1996, PAGES = "191-202", MONTH = May, WHERE = "bound", ABSTRACT = " Simultaneous multithreading is a technique that permits multiple independent threads to issue multiple instructions each cycle. In previous work we demonstrated the performance potential of simultaneous multithreading, based on a somewhat idealized model. In this paper we show that the throughput gains from simultaneous multithreading can be achieved {\em without} extensive changes to a conventional wide-issue superscalar, either in hardware structures or sizes. We present an architecture for simultaneous multithreading that achieves three goals: (1) it minimizes the architectural impact on the conventional superscalar design, (2) it has minimal performance impact on a single thread executing alone, and (3) it achieves significant throughput gains when running multiple threads. Our simultaneous multithreading architecture achieves a throughput of 5.4 instructions per cycle, a 2.5-fold improvement over an unmodified superscalar with similar hardware resources. This speedup is enhanced by an advantage of multithreading previously unexploited in other architectures: the ability to favor for fetch and issue those threads most efficiently using the processor each cycle, thereby providing the ``best'' instructions to the processor. "} % 2) Evaluation of Multithreaded Uniprocessors for Commercial Application Environments @INPROCEEDINGS{eickemeyer:isca96, AUTHOR = "Richard J. Eickemeyer and Ross E. Johnson and Steven R. Kunkel and Mark S. Squillante and Shiafun Liu", TITLE = "Evaluation of Multithreaded Uniprocessors for Commercial Application Environments", BOOKTITLE = "Proceedings of the 23rd Annual International Symposium on Computer Architecture", YEAR = 1996, PAGES = "203-212", MONTH = May, WHERE = "bound", ABSTRACT = " As memory speeds grow at a considerably slower rate than processor speeds, memory accesses are starting to dominate the execution time of processors, and this will likely continue into the future. This trend will be exacerbated by growing miss rates due to commercial applications, object-oriented programming and micro-kernel based operating systems. We examine the use of coarse-grained multi- threading to address this important problem in uniprocessor on-line transaction processing environments where there is a natural, coarse- grained parallelism between the tasks resulting from transactions being executed concurrently, with no application software modifications required. Our results suggest that multithreading can provide significant performance improvements for uniprocessor commercial computing environments. "} % 3) Performance Comparison of ILP Machines with Cycle Time Evaluation @INPROCEEDINGS{hara:isca96, AUTHOR = "Tetsuya Hara and Hideki Ando and Chikako Nakanishi and Masao Nakaya", TITLE = "Performance Comparison of ILP Machines with Cycle Time Evaluation", BOOKTITLE = "Proceedings of the 23rd Annual International Symposium on Computer Architecture", YEAR = 1996, PAGES = "213-224", MONTH = May, WHERE = "Philadelphia", ABSTRACT = " Many studies have investigated performance improvement through exploiting instruction-level parallelism (ILP) with a particular architecture. Unfortunately, these studies indicate performance improvement using the number of cycles that are required to execute a program, but do not quantitatively estimate the penalty imposed on the cycle time from the architecture. Since the performance of a microprocessor must be measured by its execution time, a cycle time evaluation is required as well as a cycle count speedup evaluation. Currently, superscalar machines are widely accepted as the machines which achieve the highest performance. On the other hand, because of hardware simplicity and instruction scheduling sophistication, there is a perception that the next generation of microprocessors will be implemented with a VLIW architecture. A simple VLIW machine, however, has a serious weakness regarding speculative execution. Thus, it is a question whether a simple VLIW machine really outperforms a superscalar machine. We recently proposed a mechanism called predicating that supports speculative execution for the VLIW machine, and showed a significant cycle count speedup over a scalar machine. Although the mechanism is simple, it is unknown how much it imposes a penalty on the cycle time, and how much the performance is improved as a result. This paper evaluates both the cycle count speedup and the cycle time for three ILP machines: a superscalar machine, a simple VLIW machine, and the VLIW machine with predicating. The evaluation results show that the simple VLIW machine slightly outperforms the superscalar machine, while the VLIW machine with predicating achieves a significant speedup of 1.41x over the superscalar machine. "} % % Session 7B: Networks % % 1) Rotating Combined Queueing (RCQ): Bandwidth and Latency Guarantees in Low-Cost, High-Performance Networks @INPROCEEDINGS{kim:isca96, AUTHOR = "Jae H. Kim and Andrew A. Chien", TITLE = "Rotating Combined Queueing (RCQ): Bandwidth and Latency Guarantees in Low-Cost, High-Performance Networks", BOOKTITLE = "Proceedings of the 23rd Annual International Symposium on Computer Architecture", YEAR = 1996, PAGES = "226-236", MONTH = May, WHERE = "bound", ABSTRACT = " Network service guarantees not only provide significant performance benefits to distributed computing systems (more balanced resource utilization, fast fault recovery, and fair network access), but they are also essential for many new applications requiring real-time communications with continuous data types (audio/video). Most existing algorithms which provide network service guarantees are too complicated to be feasible in high-speed, low-cost switches for multicomputer networks. The simpler algorithms proposed provide only limited service guarantees or waste significant network resources. We present a novel, cost-effective queueing and scheduling algorithm, called Rotating Combined Queueing (RCQ), which can efficiently support a range of service guarantees including deterministic delay bounds and bandwidth guarantees in multicomputer networks. By allowing bursty traffic to utilize unused network resources efficiently, RCQ also can provide competitive performance to best-effort data communications. Such cost-effective service guarantees not only provide substantial benefits to the overall system performance, but can also further expand the domain of multicomputer applications to encompass distributed multimedia applications requiring isochronous communications. "} % 2) A Router Architecture for Real-Time Point-to-Point Networks @INPROCEEDINGS{rexford:isca96, AUTHOR = "Jennifer Rexford and John Hall and Kang G. Shin", TITLE = "A Router Architecture for Real-Time Point-to-Point Networks", BOOKTITLE = "Proceedings of the 23rd Annual International Symposium on Computer Architecture", YEAR = 1996, PAGES = "237-246", MONTH = May, WHERE = "bound", ABSTRACT = " Parallel machines have the potential to satisfy the large computational demands of emerging real-time applications. These applications require a predictable communication network, where time-constrained traffic requires bounds on latency or throughput while good average performance suffices for best-effort packets. This paper presents a router architecture that tailors low-level routing, switching, arbitration and flow-control policies to the conflicting demands of each traffic class. The router implements deadline-based scheduling, with packet switching and table-driven multicast routing, to bound end-to-end delay for time-constrained traffic, while allowing best-effort traffic to capitalize on the low-latency routing and switching schemes common in modern parallel machines. To limit the cost of servicing time-constrained traffic, the router shares packet buffers and link-scheduling logic between the multiple output ports. Verilog simulations demonstrate that the design meets the performance goals of both traffic classes in a single-chip solution. "} % 3) Coherent Network Interfaces for Fine-Grain Communication @INPROCEEDINGS{mukherjee:isca96, AUTHOR = "Shubhendu S. Mukherjee and Babak Falsafi and Mark D. Hill and David A. Wood", TITLE = "Coherent Network Interfaces for Fine-Grain Communication", BOOKTITLE = "Proceedings of the 23rd Annual International Symposium on Computer Architecture", YEAR = 1996, PAGES = "247-258", MONTH = May, WHERE = "bound", ABSTRACT = " Historically, processor accesses to memory-mapped device registers have been marked uncachable to insure their visibility to the device. The ubiquity of snooping cache coherence, however, makes it possible for processors and devices to interact with cachable, coherent memory operations. Using coherence can improve performance by facilitating burst transfers of whole cache blocks and reducing control overheads (e.g., for polling). This paper begins an exploration of network interfaces (NIs) that use coherence---coherent network interfaces (CNIs)---to improve communication performance. We restrict this study to NI/CNIs that reside on coherent memory or I/O buses, to NI/CNIs that are much simpler than processors, and to the performance of fine-grain messaging from user process to user process. Our first contribution is to develop and optimize two mechanisms that CNIs use to communicate with processors. A cachable device register---derived from cachable control registers [39,40]---is a coherent, cachable block of memory used to transfer status, control, or data between a device and a processor. Cachable queues generalize cachable device registers from one cachable, coherent memory block to a contiguous region of cachable, coher ent blocks managed as a circular queue. Our second contribution is a taxonomy and comparison of four CNIs with a more conventional NI. Microbenchmark results show that CNIs can improve the round-trip latency and achievable bandwidth of a small 64-byte message by 37% and 125% respectively on the memory bus and 74% and 123% respectively on a coherent I/O bus. Experiments with five macrobenchmarks show that CNIs can improve the performance by 17-53% on the memory bus and 30-88% on the I/O bus. "} % % Session 8: Performance Evaluation and Optimization % % 1) Informing Memory Operations: Providing Memory Performance Feedback in Modern Processors @INPROCEEDINGS{horowitz:isca96, AUTHOR = "Mark Horowitz and Margaret Martonosi and Todd C. Mowry and Michael D. Smith", TITLE = "Informing Memory Operations: Providing Memory Performance Feedback in Modern Processors", BOOKTITLE = "Proceedings of the 23rd Annual International Symposium on Computer Architecture", YEAR = 1996, PAGES = "260-270", MONTH = May, WHERE = "bound", ABSTRACT = " Memory latency is an important bottleneck in system performance that cannot be adequately solved by hardware alone. Several promising software techniques have been shown to address this problem successfully in specific situations. However, the generality of these software approaches has been limited because current architectures do not provide a fine-grained, low-overhead mechanism for observing and reacting to memory behavior directly. To fill this need, we propose a new class of memory operations called informing memory operations, which essentially consist of a memory operation combined (either implicitly or explicitly) with a conditional branch-and-link operation that is taken only if the reference suffers a cache miss. We describe two different implementations of informing memory operations-one based on a cache-outcome condition code and another based on low-overhead traps-and find that modern in-order-issue and out-of-order-issue superscalar processors already contain the bulk of the necessary hardware support. We describe how a number of software-based memory optimizations can exploit informing memory operations to enhance performance, and look at cache coherence with fine-grained access control as a case study. Our performance results demonstrate that the runtime overhead of invoking the informing mechanism on the Alpha 21164 and MIPS R10000 processors is generally small enough to provide considerable flexibility to hardware and software designers, and that the cache coherence application has improved performance compared to other current solutions. We believe that the inclusion of informing memory operations in future processors may spur even more innovative performance optimizations. "} % 2) Instruction Prefetching of Systems Codes With Layout Optimized for Reduced Cache Misses @INPROCEEDINGS{xia:isca96, AUTHOR = "Chun Xia and Josep Torrellas", TITLE = "Instruction Prefetching of Systems Codes with Layout Optimized for Reduced Cache Misses", BOOKTITLE = "Proceedings of the 23rd Annual International Symposium on Computer Architecture", YEAR = 1996, PAGES = "271-282", MONTH = May, WHERE = "bound", ABSTRACT = " High-performing on-chip instruction caches are crucial to keep fast processors busy. Unfortunately, while on-chip caches are usually successful at intercepting instruction fetches in loop-intensive engineering codes, they are less able to do so in large systems codes. To improve the performance of the latter codes, the compiler can be used to lay out the code in memory for reduced cache conflicts. Interestingly, such an operation leaves the code in a state that can be exploited by a new type of instruction prefetching: guarded sequential prefetching. The idea is that the compiler leaves hints in the code as to how the code was laid out. Then, at run time, the prefetching hardware detects these hints and uses them to prefetch more effectively. This scheme can be implemented very cheaply: one bit encoded in control transfer instructions and a prefetch module that requires minor extensions to existing next-line sequential prefetchers. Furthermore, the scheme can be turned off and on at run time with the toggling of a bit in the TLB. The scheme is evaluated with simulations using complete traces from a 4-processor machine. Overall, for 16-Kbyte primary instruction caches, guarded sequential prefetching removes, on average, 66 percent of the instruction misses remaining in an operating system with an optimized layout, speeding up the operating system by 10 percent. Moreover, the scheme is more cost-effective and robust than existing sequential prefetching techniques. "} % 3) Compiler and Hardware Support for Cache Coherence in Large-Scale Multiprocessors: Design Considerations and Performance Evaluation @INPROCEEDINGS{choi:isca96, AUTHOR = "Lynn Choi and Pen-Chung Yew", TITLE = "Compiler and Hardware Support for Cache Coherence in Large-Scale Multiprocessors: Design Considerations and Performance Study", BOOKTITLE = "Proceedings of the 23rd Annual International Symposium on Computer Architecture", YEAR = 1996, PAGES = "283-294", MONTH = May, WHERE = "bound", ABSTRACT = " In this paper, we study a hardware-supported, compiler-directed (HSCD) cache coherence scheme, which can be implemented on a large-scale multiprocessor using off-the-shelf microprocessors, such as the Cray T3D. It can be adapted to various cache organizations, including multi-word cache lines and byte-addressable architectures. Several system related issues, including critical sections, inter-thread communication, and task migration have also been addressed. The cost of the required hardware support is small and proportional to the cache size. The necessary compiler algorithms, including intra- and interprocedural array data-flow analysis, have been implemented on the Polaris compiler. From our simulation study using the Perfect Club benchmarks, we found that, in spite of the conservative analysis made by the compiler, the performance of the proposed HSCD scheme can be comparable to that of a full-map hardware directory scheme. With its comparable performance and reduced hardware cost, the scheme can be a viable alternative for large-scale multiprocessors, such as the Cray T3D, that rely on users to maintain data coherence. "} % % Session 9: Systems % % 1) Early Experience with Message-Passing on the SHRIMP Multicomputer @INPROCEEDINGS{alpert:isca96, AUTHOR = "Richard D. Alpert and Angelos Bilas and Matthias A. Blumrich and Douglas W. Clark and Stefanos Damianakis and Cezary Dubnicki and Edward W. Felten and Liviu Iftode and Kai Li", TITLE = "Early Experience with Message-Passing on the SHRIMP Multicomputer", BOOKTITLE = "Proceedings of the 23rd Annual International Symposium on Computer Architecture", YEAR = 1996, PAGES = "296-307", MONTH = May, WHERE = "bound", ABSTRACT = " The SHRIMP multicomputer provides virtual memory-mapped communication (VMMC), which supports protected, user-level message passing, allows user programs to perform their own buffer management, and separates data transfers from control transfers so that a data transfer can be done without the intervention of the receiving node CPU. An important question is whether such a mechanism can indeed deliver all of the available hardware performance to applications which use conventional message-passing libraries. This paper reports our early experience with message-passing on a small, working SHRIMP multicomputer. We have implemented several user-level communication libraries on top of the VMMC mechanism, including the NX message-passing interface, Sun RPC, stream sockets, and specialized RPC. The first three are fully compatible with existing systems. Our experience shows that the VMMC mechanism supports these message-passing interfaces well. When zero-copy protocols are allowed by the semantics of the interface, VMMC can effectively deliver to applications almost all of the raw hardware's communication performance. "} % 2) STiNG: A CC-NUMA Computer System for the Commercial Marketplace @INPROCEEDINGS{lovett:isca96, AUTHOR = "Thomas D. Lovett and Russell M. Clapp", TITLE = "STiNG: A CC-NUMA Computer System for the Commercial Marketplace", BOOKTITLE = "Proceedings of the 23rd Annual International Symposium on Computer Architecture", YEAR = 1996, PAGES = "308-317", MONTH = May, WHERE = "bound", ABSTRACT = " STiNG is a Cache Coherent Non-Uniform Memory Access (CC-NUMA) Multiprocessor designed and built by Sequent Computer Systems, Inc. It combines four processor Symmetric Multiprocessor (SMP) nodes (called Quads), using a Scalable Coherent Interface (SCI) based coherent interconnect. The Quads are based on the Intel P6 processor and the external bus it defines. In addition to 4 P6 processors, each Quad may contain up to 4 GBytes of system memory, 2 Peripheral Component Interface (PCI) busses for I/O, and a Lynx board. The Lynx board provides the datapath to the SCI-based interconnect and ensures system-wide cache coherency. STiNG is one of the first commercial CC-NUMA systems to be built. This paper describes the motivation for building STiNG as well as its architecture and implementation. In addition, performance analysis is provided for On-Line Transaction Processing (OLTP) and Decision Support System (DSS) workloads. Finally, the status of the current implementation is reviewed. "}