% ------------------------------------------------------------------------- % % These bibtex bibliographic entries for the 25th % INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE % (1998) created by Rebecca Hoffman and Mark D. Hill % from author input. % % These entries are correct to the best of our knowledge, % but we accept no responsibility for the consequences of % any errors. Email corrections to hoffman@cs.wisc.edu. % Last change: Thu Mar 27 11:22:10 CST 1997. % ------------------------------------------------------------------------- % % Machine Measurement %1) Memory System Characterization of Commercial Workloads @INPROCEEDINGS{barroso:memory, AUTHOR = "Luiz Andre Barroso and Kourosh Gharachorloo and Edouard Bugnion", TITLE = "Memory System Characterization of Commercial Workloads", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "3--14", MONTH = June, WHERE = "bound", ABSTRACT = " Commercial applications such as databases and Web servers constitute the largest and fastest-growing segment of the market for multiprocessor servers. Ongoing innovations in disk subsystems, along with the ever increasing gap between processor and memory speeds, have elevated memory system design as the critical performance factor for such workloads. However, most current server designs have been optimized to perform well on scientific and engineering workloads, potentially leading to design decisions that are non-ideal for commercial applications. The above problem is exacerbated by the lack of information on the performance requirements of commercial workloads, the lack of available applications for widespread study, and the fact that most representative applications are too large and complex to serve as suitable benchmarks for evaluating trade-offs in the design of processors and servers. This paper presents a detailed performance study of three important classes of commercial workloads: online transaction processing (OLTP), decision support systems (DSS), and Web index search. We use the Oracle commercial database engine for our OLTP and DSS workloads, and the AltaVista search engine for our Web index search workload. This study characterizes the memory system behavior of these workloads through a large number of architectural experiments on Alpha multiprocessors augmented with full system simulations to determine the impact of architectural trends. We also identify a set of simplifications that make these workloads more amenable to monitoring and simulation without affecting representative memory system behavior. We observe that systems optimized for OLTP versus DSS and index search workloads may lead to diverging designs, specifically in the size and speed requirements for off-chip caches. "} %2) Performance Characterization of the Quad Pentium Pro SMP Using OLTP Workloads @INPROCEEDINGS{keeton:pprodb, AUTHOR = "Kimberly Keeton and David A. Patterson and Yong Qiang He and Roger C. Raphael and Walter E. Baker", TITLE = "Performance Characterization of a Quad Pentium Pro SMP Using OLTP Workloads", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "15--26", MONTH = June, WHERE = "Barcelona, Spain", ABSTRACT = " Commercial applications are an important, yet often overlooked, workload with significantly different characteristics from technical workloads. The potential impact of these differences is that computers optimized for technical workloads may not provide good performance for commercial applications, and these applications may not fully exploit advances in processor design. To evaluate these issues, we use hardware counters to measure architectural features of a four-processor Pentium Pro-based server running a TPC-C-like workload on an Informix database. We examine the effectiveness of out-of-order execution, branch prediction, speculative execution, superscalar issue and retire, caching, and multiprocessor scaling. We find that out-of-order execution, superscalar issue and retire, and branch prediction are not as effective for database workloads as they are for technical workloads, such as SPEC. We find that caches are effective at reducing processor traffic to memory; even larger caches would be helpful to satisfy more data requests. Multiprocessor scaling of this workload is good, but even modest bus utilization degrades application memory latency, limiting database throughput. "} %3) Execution Characteristics of Desktop Applications on Windows NT @INPROCEEDINGS{Lee:NT, AUTHOR = "Dennis C. Lee and Patrick J. Crowley and Jean-Loup Baer and Thomas E. Anderson and Brian N. Bershad", TITLE = "Execution Characteristics of Desktop Applications on {Windows NT}", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "27--38", MONTH = June, WHERE = "bound", ABSTRACT = " This paper examines the performance of desktop applications running on the Microsoft Windows NT operating system on Intel x86 processors, and contrasts these applications to the programs in the integer SPEC95 benchmark suite. We present measurements of basic instruction set and program characteristics, and detailed simulation results of the way these programs use the memory system and processor branch architecture. We show that the desktop applications have similar characteristics to the integer SPEC95 benchmarks for many of these metrics. However, compared to the integer SPEC95 applications, desktop applications have larger instruction working sets, execute instructions in a greater number of unique functions, cross DLL boundaries frequently and execute a greater number of indirect calls. "} %4) An Analysis of Database Workload Performance on Simultaneous Multithreaded Processors @INPROCEEDINGS{lo:smtdatabase, AUTHOR = "Jack L. Lo and Luiz Andre Barroso and Susan J. Eggers and Kourosh Gharachorloo and Henry M. Levy and Sujay S. Parekh", TITLE = "An Analysis of Database Workload Performance on Simultaneous Multithreaded Processors", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "39--50", MONTH = June, WHERE = "bound", ABSTRACT = " Simultaneous multithreading (SMT) is an architectural technique in which the processor issues multiple instructions from multiple threads each cycle. While SMT has been shown to be effective on scientific workloads, its performance on database systems is still an open question. In particular, database systems have poor cache performance, and the addition of multithreading has the potential to exacerbate cache conflicts. This paper examines database performance on SMT processors using traces of the Oracle database management system. Our research makes three contributions. First, it characterizes the memory-system behavior of database systems running on-line transaction processing and decision support system workloads. Our data show that while DBMS workloads have large memory footprints, there is substantial data reuse in a small, cacheable critical working set. Second, we show that the additional data cache conflicts caused by simultaneous-multithreaded instruction scheduling can be nearly eliminated by the proper choice of software- directed policies for virtual-to-physical page mapping and per-process address offsetting. Our results demonstrate that with the best policy choices, D-cache miss rates on an 8-context SMT are roughly equivalent to those on a single-threaded superscalar. Multithreading also leads to better inter-thread instruction cache sharing, reducing I-cache miss rates by up to 35%. Third, we show that SMT's latency tolerance is highly effective for database applications. For example, using a memory-intensive OLTP workload, an 8-context SMT processor achieves a 3-fold increase in instruction throughput over a single-threaded superscalar with similar resources. "} % Program Behaviour %1) An Analysis of Correlation and Predictability: What Makes Two-level Branch Predictors Work @INPROCEEDINGS{evers:corr, AUTHOR = "Marius Evers and Sanjay J. Patel and Robert S. Chappell and Yale N. Patt", TITLE = "An Analysis of Correlation and Predictability: What Makes Two-Level Branch Predictors Work", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "52--61", MONTH = June, WHERE = "bound", ABSTRACT = " Pipeline flushes due to branch mispredictions is one of the most serious problems facing the designer of a deeply pipelined, superscalar processor. Many branch predictors have been proposed to help alleviate this problem, including two-level adaptive branch predictors and hybrid branch predictors. Numerous studies have shown which predictors and configurations best predict the branches in a given set of benchmarks. Some studies have also investigated effects, such as pattern history table interference, that can be detrimental to the performance of these predictors. However, little research has been done on which characteristics of branch behavior make predictors perform well. In this paper, we investigate and quantify reasons why branches are predictable. We show that some of this predictability is not captured by the two-level adaptive branch predictors. An understanding of the predictability of branches m ay lead to insights ultimately resulting in better or less complex predictors. We also investigate and quantify what fraction of the branches in each benchmark is predictable using each of the methods described in this paper. "} %2) Branch Prediction Based on Universal Data Compression Algorithms @INPROCEEDINGS{federovsky:branch, AUTHOR = "Eitan Federovsky and Meir Feder and Shlomo Weiss", TITLE = "Branch Prediction Based on Universal Data Compression Algorithms", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "62--72", MONTH = June, WHERE = "bound", ABSTRACT = " Data compression and prediction are closely related. Thus prediction methods based on data compression algorithms have been suggested for the branch prediction problem. In this work we consider two universal compression algorithms: prediction by partial matching (PPM), and a recently developed method, context tree weighting (CTW). We describe the prediction algorithms induced by these methods. We also suggest adaptive algorithms - variations of the basic methods that attempt to fit limited memory constraints and to match the non-stationary nature of the branch sequence. Furthermore, we show how to incorporate address information and ne other relevant data. Finally, we present simulation results for selected programs from the SPECint95, SYSmark/32, SYSmark/NT, and transactional processing benchmarks. Our results are most promising in programs with difficult to predict branch behavior. "} %3) Modeling Program Predictability @INPROCEEDINGS{sazeidessmith:modeling, AUTHOR = "Yiannakis Sazeides and James E. Smith", TITLE = "Modeling Program Predictability", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "73--84", MONTH = June, WHERE = "Barcelona", ABSTRACT = " Basic properties of program predictability -- for both values and control -- are defined and studied. We take the view that program predictability originates at certain points during a program's execution, flows through subsequent instructions, and then ends at other points in the program. These key components of predictability: generation, propagation, and termination; are defined in terms of a model. The model is based on a graph derived from dynamic data dependences and a predictor. Using the SPEC95 benchmarks, we analyze the predictability phenomena both separately and in combination. Examples are provided to illustrate relationships between model-based characteristics and program constructs. It is shown that most predictability derives from program control structure and immediate values, not program input data. Furthermore, most predictability originates from a relatively small number of generate points. The analysis of obtained results suggests a number of ramifications regarding predictability and its use. "} % Graphics and IO %1) Multi-level Texture Caching for 3D Graphics Hardware @INPROCEEDINGS{cox:ionic, AUTHOR = "Michael Cox, Narendra Bhandari, and Michael Shantz", TITLE = "Multi-Level Texture Caching for 3D Graphics Hardware", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "86--97", MONTH = June, WHERE = "bound", ABSTRACT = " Traditional graphics hardware architectures implement what we call the push architecture for texture mapping. Local memory is dedicated to the accelerator for fast local retrieval of texture during rasterization, and the application is responsible for managing this memory. The push architecture has a bandwidth advantage, but disadvantages of limited texture capacity, escalation of accelerator memory requirements (and therefore cost), and poor memory utilization. The push architecture also requires the programmer to solve the bin-packing problem of managing accelerator memory each frame. More recently graphics hardware on PC-class machines has moved to an implementation of what we call the pull architecture. Texture is stored in system memory and downloaded by the accelerator as needed. The pull architecture has advantages of texture capacity, stems the escalation of accelerator memory requirements, and has good memory utilization. It also frees the programmer from accelerator texture memory management. However, the pull architecture suffers escalating requirements for bandwidth from main memory to the accelerator. In this paper we propose multi-level texture caching to provide the accelerator with the bandwidth advantages of the push architecture combined with the capacity advantages of the pull architecture. We have studied the feasibility of 2-level caching and found the following: (1) significant re-use of texture between frames; (2) L2 caching requires significantly less memory than the push architecture; (3) L2 caching requires significantly less bandwidth from host memory than the pull architecture; (4) L2 caching enables implementation of smaller L1 caches that would otherwise bandwidth-limit accelerators on the workloads in this paper. Results suggest that an L2 cache achieves the original advantage of the pull architecture -- stemming the growth of local texture memory -- while at the same time stemming the current explosion in demand for texture bandwidth between host memory and the accelerator. "} %2) Switcherland: A QoS Communication Architecture for Workstation Clusters @INPROCEEDINGS{eberle:switcherland, AUTHOR = "Hans Eberle and Erwin Oertli", TITLE = "Switcherland: A QoS Communication Architecture for Workstation Clusters", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "98--108", MONTH = June, WHERE = "bound", ABSTRACT = " Computer systems have become powerful enough to process continuous data streams such as video or animated graphics. While processing power and communication bandwidth of today's systems typically are sufficient, quality of service (QoS) guarantees as required for handling such data types cannot be provided by these systems in adequate ways. We present Switcherland, a scalable communication architecture based on crossbar switches that provides QoS guarantees for workstation clusters in the form of reserved bandwidth and bounded transmission delays. Similar to the ATM technology Switcherland provides QoS guarantees with the help of service classes, that is, data transfers are characterized as variable bit rate traffic or constant bit rate traffic. However, unlike LAN technologies, Switcherland is optimized for cluster computing in that (i) it serves as ackplane interconnection fabric as well as a LAN, (ii) it extends support for service classes by also covering the end nodes of the network, (iii) it provides low latency in the order of one microsecond per switch, and (iv) it uses a communication model based on a global memory to simplify programming. "} %3) Declustered Disk Array Architectures with Optimal and Near-optimal Parallelism @INPROCEEDINGS{alvarez:parallelism, AUTHOR = "Guillermo A. Alvarez and Walter A. Burkhard and Larry J. Stockmeyer and Flaviu Cristian", TITLE = "Declustered Disk Array Architectures with Optimal and Near-optimal Parallelism", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "109--120", MONTH = June, WHERE = "bound", ABSTRACT = " This paper investigates the placement of data and parity on redundant disk arrays. Declustered organizations have been traditionally used to achieve fast reconstruction of a failed disk's contents. In previous work, Holland and Gibson identified six desirable properties for ideal layouts; however, no declustered layout satisfying all properties has been published in the literature. We present a complete, constructive characterization of the collection of ideal declustered layouts possessing all six properties. Given that ideal layouts exist only for a limited set of configurations, we also present two novel layout families. Prime and Relpr can tolerate multiple failures in a wide variety of configurations with slight deviations from the ideal. Our simulation studies show that the new layouts provide excellent parallel access performance and reduced incremental loads during degraded operation, when compared with previously published layouts. For large accesses and under high loads, response times for the new layouts are typically smaller than those of previously published declustered layouts by a factor of 2.5. "} % Speculation %1) Confidence Estimation for Speculation Control @INPROCEEDINGS{grunwald98:confidence, AUTHOR = "Dirk Grunwald and Artur Klauser and Srilatha Manne and Andrew Ples zkun", TITLE = "Confidence Estimation for Speculation Control", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "122--131", MONTH = June, WHERE = "bound", ABSTRACT = " Modern processors improve instruction level parallelism by speculation. The outcome of data and control decisions is predicted, and the operations are speculatively executed and only committed if the original predictions were correct. There are a number of other ways that processor resources could be used, such as threading or eager execution. As the use of speculation increases, we believe more processors will need some form of speculation control to balance the benefits of speculation against other possible activities. Confidence estimation is one technique that can be exploited by architects for speculation control. In this paper, we introduce performance metrics to compare confidence estimation mechanisms, and argue that these metrics are appropriate for speculation control. We compare a number of confidence estimation mechanisms, focusing on mechanisms that have a small implementation cost and gain benefit by exploiting characteristics of branch predictors, such as clustering of mispredicted branches. We compare the performance of the different confidence estimation methods using detailed pipeline simulations. Using these simulations, we show how to improve some confidence estimators, providing better insight for future investigations comparing and applying confidence estimators. "} %2) Pipeline Gating: Speculation Control for Energy Reduction @INPROCEEDINGS{manne98:gating, AUTHOR = "Srilatha Manne and Artur Klauser and Dirk Grunwald", title = "Pipeline Gating: Speculation Control For Energy Reduction", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "132--141", MONTH = June, WHERE = "bound", ABSTRACT = " Branch prediction has enabled microprocessors to increase instruction level parallelism (ILP) by allowing programs to speculatively execute beyond control boundaries. Although speculative execution is essential for increasing the instructions per cycle (IPC), it does come at a cost. A large amount of unnecessary work results from wrong-path instructions entering the pipeline due to branch misprediction. Results generated with the SimpleScalar tool set using a 4-way issue pipeline and various branch predictors show an instruction overhead of 16% to 105% for every instruction committed. The instruction overhead will increase in the future as processors use more aggressive speculation and wider issue widths. In this paper, we present an innovative method for power reduction which, unlike previous work that sacrificed flexibility or performance, reduces power in high-performance microprocessors without impacting performance. In particular, we introduce a hardware mechanism called pipeline gating to control rampant speculation in the pipeline. We present inexpensive mechanisms for determining when a branch is likely to mispredict, and for stopping wrong-path instructions from entering the pipeline. Results show up to a 38% reduction in wrong-path instructions with a negligible performance loss (~ 1%). Best of all, even in programs with a high branch prediction accuracy, performance does not noticeably degrade. Our analysis indicates that there is little risk in implementing this method in existing processors since it does not impact performance and can benefit energy reduction. "} %3) Memory Dependence Prediction Using Store Sets @INPROCEEDINGS{chrysos:storesets, AUTHOR = "George Z. Chrysos and Joel S. Emer", TITLE = "Memory Dependence Prediction using Store Sets", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "142--153", MONTH = June, WHERE = "bound", ABSTRACT = " For maximum performance, an out-of-order processor must issue load instructions as early as possible, while avoiding memory-order violations with prior store in-structions that write to the same memory location. One approach is to use memory dependence prediction to identify the stores upon which a load depends, and com-municate that information to the instruction scheduler. We designate the set of stores upon which each load has depended as the load's \"store set\". The processor can discover and use a load's store set to accurately predict the earliest time the load can safely execute. We show that store sets accurately predict memory dependencies in the context of large instruction window, superscalar ma-chines, and allow for near-optimal performance compared to an instruction scheduler with perfect knowledge of memory dependencies. In addition, we explore the imple-mentation aspects of store sets, and describe a low cost implementation that achieves nearly optimal performance. "} % Prediction Techniques %1) Dynamic History-length Fitting: A Third Level of Adaptivity for Branch Prediction @INPROCEEDINGS{toni:dhlf, AUTHOR = "Toni Juan and Sanji Sanjeevan and Juan J. Navarro", TITLE = "Dynamic History-Length Fitting: A third level of adaptivity for branch prediction", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "155--166", MONTH = June, WHERE = "Barcelona", ABSTRACT = " Accurate branch prediction is essential for obtaining high performance in pipelined superscalar processors that execute instructions speculatively. Some of the best current predictors combine a part of the branch address with a fixed amount of global history of branch outcomes in order to make a prediction. These predictors cannot perform uniformly well across all workloads because the best amount of history to be used depends on the code, the input data and the frequency of context switches. Consequently, all predictors that use a fixed history length are therefore unable to perform up to their maximum potential. We introduce a method --called DHLF-- that dynamically determines the optimum history length during execution, adapting to the specific requirements of any code, input data and system workload. Our proposal adds an extra level of adaptivity to two-level adaptive branch predictors. The DHLF method can be applied to any one of the predictors that combine global branch history with the branch address. We apply the DHLF method to gshare (dhlf-gshare) and obtain near-optimal results for all SPECint95 benchmarks, with and without context switches. Some results are also presented for gskewed (dhlf-gskewed), confirming that other predictors can benefit from our proposal. "} %2) Accurate Indirect Branch Prediction AUTHOR = "Karel Driesen and Urs H\166lzle ", TITLE = "Accurate Indirect Branch Prediction", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "167--178", MONTH = June, WHERE = "bound", ABSTRACT = bstract: Indirect branch prediction is likely to become increasingly important in the future because indirect branches occur more frequently in object-oriented programs. With misprediction rates of around 25% on current processors, indirect branches can incur a significant fraction of branch misprediction overhead even though they remain less frequent than the more predictable conditional branches. We investigate a wide range of two-level predictors dedicated exclusively to indirect branches. Starting with predictors that use full-precision addresses and H unlimited tables, we progressively introduce hardware constraints and minimize the loss of predictor performance at each step. For programs from the SPECint95 suite as well as a suite of large C++ applications, a two-level predictor achieves a misprediction rate of 9.8% with a 1K-entry table and 7.3% with an 8K-entry table, representing more than a threefold improvement over an ideal BTB. A hybrid predictor further reduces the misprediction rates to 8.98% and 5.95%, respectively. "} %3) Using Prediction to Accelerate Coherence Protocols @INPROCEEDINGS{mukherjee:cosmos, AUTHOR = "Shubhendu S. Mukherjee and Mark D. Hill", TITLE = "Using Prediction to Accelerate Coherence Protocols", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "179--190", MONTH = June, WHERE = "bound", ABSTRACT = " Most large shared-memory multiprocessors use directory protocols to keep per-processor caches coherent. Some memory references in such systems, however, suffer long latencies for misses to remotely- cached blocks. To ameliorate this latency, researchers have aug mented standard coherence protocols with optimizations for specific sharing patterns, such as read-modify-write, producer-consumer, and migratory sharing. This paper seeks to replace these directed solutions with general prediction logic that monitors coherence activity and triggers appropriate coherence actions. This paper takes the first step toward using general prediction to accelerate coherence protocols by developing and evaluating the Cosmos coherence message predictor. Cosmos predicts the source and type of the next coherence message for a cache block using logic that is an extension of Yeh and Patt's two-level PAp branch predictor. For five scientific applications running on 16 processors, Cosmos has prediction accuracies of 62% to 93%. Cosmos' high prediction accuracy is a result of predictable coherence message signatures that arise from stable sharing patterns of cache blocks. "} % Memory Management %1) Active Pages: A Computation Model for Intelligent Memory OCEEDINGS{oskin:active-pages, AUTHOR = "Mark Oskin, Frederic T. Chong, Timothy Sherwood", TITLE = "Active Pages: A Computation Model for Intelligent Memory", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "192--203", MONTH = June, WHERE = "bound", ABSTRACT = " Microprocessors and memory systems suffer from a growing gap in performance. We introduce Active Pages, a computation model which addresses this gap by shifting data-intensive computations to the memory system. An Active Page consists of a page of data and a set of associated functions which can operate upon that data. We describe an implementation of Active Pages on RADram (Reconfigurable Architecture DRAM), a memory system based upon the integration of DRAM and reconfigurable logic. Results from the SimpleScalar simulator demonstrate up to 1000X speedups on several applications using the RADram system versus conventional memory systems. We also explore the sensitivity of our results to implementations in other memory technologies. "} %2) Increasing TLB Reach Using Superpages Backed by Shadow Memory @INPROCEEDINGS{swanson:superpages, AUTHOR = "Mark R. Swanson and Leigh Stoller and John Carter", TITLE = "Increasing TLB Reach Using Superpages Backed by Shadow Memory", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "204--213", MONTH = June, WHERE = "bound", ABSTRACT = " The amount of memory that can be accessed without causing a TLB fault, the reach of a TLB, is failing to keep pace with the increasingly large working sets of applications. We propose to extend TLB reach via a novel Memory Controller TLB (MTLB) that lets us aggressively create superpages from non-contiguous, unaligned regions of physical memory. This flexibility increases the OS's ability to use superpages on arbitrary application data. The MTLB supports shadow pages, regions of physical address space for which the MTLB remaps accesses to real physical pages. The MTLB preserves per-base-page referenced and dirty bits, which enables the OS to swap shadow-backed superpages a page at a time, unlike conventional superpages. Simulation of five applications, including two SPECint95 benchmarks, demonstrated that a modest-sized MTLB improves performance of applications with moderate-to-high TLB miss rates by 5-20 percent. Simulation also showed that this mechanism can more than double the effective reach of a processor TLB with no modification to the processor MMU. "} %3) Options for Dynamic Address Translation in COMAs @INPROCEEDINGS{qiu:options, AUTHOR = "Xiaogang Qiu and Michel Dubois", TITLE = "Options for Dynamic Address Translation in COMAs", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "214--225", MONTH = June, WHERE = "bound", ABSTRACT = " In modern processors, the dynamic translation of virtual addresses to support virtual memory is done before or in parallel with the first-level cache access. As processor technology improves at a rapid pace and the working sets of new applications grow insatiably the latency and bandwidth demands on the TLB (Translation Lookaside Buffer) are getting more and more difficult to meet. The situation is worse in multiprocessor systems, which run larger applications and are plagued by the TLB consistency problem. We evaluate and compare five options for virtual ddress translation in the context of COMAs (Cache Only Memory Architectures). The dynamic address translation mechanism can be located after the cache access provided the cache is virtual. In a particular design, which we call V-COMA for Virtual COMA, the physical address concept and the traditional TLB are eliminated. While still supporting virtual memory, V-COMA reduces the address translation overhead to a minimum. V-COMA scales well and works better in systems with large number of processors. As a machine running on virtual addresses, V-COMA provides a simple and consistent hardware model to the operating system and the compiler, in which further optimization opportunities are possible. "} % Prediction and Multipath Execution %1) Integrated Predicated and Speculative Execution in the IMPACT EPIC Architecture @INPROCEEDINGS{august:epic, AUTHOR = "D. I. August and D. A. Connors and S. A. Mahlke and J. W. Sias and K. M. Crozier and B. Cheng and P. R. Eaton and Q. B. Olaniran and W. W. Hwu", TITLE = "Integrated Predicated and Speculative Execution in the IMPACT EPIC Architecture", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "227--237", MONTH = June, WHERE = "bound", ABSTRACT = " Explicitly Parallel Instruction Computing (EPIC) architectures require the compiler to express program instruction level parallelism directly to the hardware. EPIC techniques which enable the compiler to represent control speculation, data dependence speculation, and predication have individually been shown to be very effective. However, these techniques have not been studied in combination with each other. This paper presents the IMPACT EPIC Architecture to address the issues involved in designing processors based on these EPIC concepts. In particular, we focus on new execution and recovery models in which microarchitectural support for predicated execution is also used to enable efficient recovery from exceptions caused by speculatively executed instructions. This paper demonstrates that a coherent framework to integrate the three techniques can be elegantly designed to achieve much better performance than each individual technique could alone provide. "} %2) Threaded Multiple Path Execution @INPROCEEDINGS{wallace:TME, AUTHOR = "Steven Wallace and Brad Calder and Dean M. Tullsen", TITLE = "Threaded Multiple Path Execution", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "238--249", MONTH = June, WHERE = "bound", ABSTRACT = " This paper presents Threaded Multi-Path Execution (TME), which exploits existing hardware on a Simultaneous Multithreading (SMT) processor to speculatively execute multiple paths of execution. When there are fewer threads in an SMT processor than hardware contexts, threaded multi-path execution uses spare contexts to fetch and execute code along the less likely path of hard-to-predict branches. This paper describes the hardware mechanisms needed to enable an SMT processor to efficiently spawn speculative threads for threaded multi-path execution. The Mapping Synchronization Bus is described, which enables the spawning of these multiple paths. Policies are examined for deciding which branches to fork, and for managing competition between primary and alternate path threads for critical resources. Our results show that TME increases the single program performance of an SMT with eight thread contexts by 14-23 percent on average, depending on the misprediction penalty, for programs with a high misprediction rate. "} %3) Selective Eager Execution on the PolyPath Architecture @INPROCEEDINGS{Klauser:polypath, AUTHOR = "Artur Klauser and Abhijit Paithankar and Dirk Grunwald", TITLE = "Selective Eager Execution on the PolyPath Architecture", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "250--259", MONTH = June, WHERE = "bound", ABSTRACT = " Control-flow misprediction penalties are a major impediment to high performance in wide-issue superscalar processors. In this paper we present Selective Eager Execution (SEE), an execution model to overcome mis-speculation penalties by executing both paths after diffident branches. We present the micro-architecture of the PolyPath processor, which is an extension of an aggressive superscalar, out-of-order architecture. The PolyPath architecture uses a novel instruction tagging and register renaming mechanism to execute instructions from multiple paths simultaneously in the same processor pipeline, while retaining maximum resource availability for single-path code sequences. Results of our execution-driven, pipeline-level simulations show that SEE can improve performance by as much as 36% for the go benchmark, and an average of 14% on SPECint95, when compared to a normal superscalar, out-of-order, speculative execution, mono-path processor. Moreover, our architectural model is both elegant and practical to implement, using a small amount of additional state and control logic. "} % Processor Microarchitecture %1) Improving Trace Cache Effectiveness with Branch Promotion and Trace Packing @INPROCEEDINGS{sanjay:tracecache, AUTHOR = "Sanjay J. Patel and Marius Evers and Yale N. Patt", TITLE = "Improving Trace Cache Effectiveness with Branch Promotion and Trace Packing", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "262--271", MONTH = June, WHERE = "bound", ABSTRACT = " The increasing widths of superscalar processors are placing greater demands upon the fetch mechanism. The trace cache meets these demands by placing logically contiguous instructions in physically contiguous storage. As a result, the trace cache delivers instructions at a high rate by supplying multiple fetch blocks each cycle. In this paper, we examine two techniques to improve the number of instructions delivered each cycle by the trace cache. The first technique, branch promotion, dynamically converts strongly biased branches into branches with static predictions. Because these promoted branches require no dynamic prediction, the branch predictor suffers less from the negative effects of interference. Branch promotion unlocks the potential of the second technique: trace packing. With trace packing, trace segments are packed with as many instructions as will fit, without regard to naturally occurring fetch block boundaries. With both techniques, the effective fetch rate of the trace cache jumps up 17% over a trace cache which implements neither. On a machine where the execution engine has a very aggressive memory disambiguator, the performance of a machine using branch promotion and trace packing is on average 11% higher than a machine using neither technique. "} %2) The Effect of Instruction Fetch Bandwidth on Value Prediction @INPROCEEDINGS{Gabbay:vp, AUTHOR = "Freddy Gabbay and Avi Mendelson", TITLE = "The Effect of Instruction Fetch Bandwidth on Value Prediction", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "272--281", MONTH = June, WHERE = "bound", ABSTRACT = " Value prediction attempts to eliminate true-data dependencies by dynamically pre dicting the outcome values of instructions and executing true-data dependent ins tructions based on that prediction. In this paper we attempt to understand the l imitations of using this paradigm in realistic machines. We show that the instru ction-fetch bandwidth and the issue rate have a very significant impact on the e fficiency of value prediction. In addition, we study how recent techniques to im prove the instruction-fetch rate affect the efficiency of value prediction and i ts hardware organization. "} %3) Dynamic IPC/Clock Rate Optimization @INPROCEEDINGS{albonesi:ipcclock, AUTHOR = "David H. Albonesi", TITLE = "Dynamic IPC/Clock Rate Optimization", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "282--292", MONTH = June, WHERE = "bound", ABSTRACT = " Current microprocessor designs set the functionality and clock rate of the chip at design time based on the configuration that achieves the best overall performance over a range of target applications. The result may be poor performance when running applications whose requirements are not well-matched to the particular hardware organization chosen. We present a new approach called Complexity-Adaptive Processors (CAPs) in which the IPC/clock rate tradeoff can be altered at runtime to dynamically match the changing requirements of the instruction stream. By exploiting repeater methodologies used increasingly in deep sub-micron designs, CAPs achieve this flexibility with potentially no cycle time impact compared to a fixed architecture. Our preliminary results in applying this approach to on-chip caches and instruction queues indicate that CAPs have the potential to significantly outperform conventional approaches on workloads containing both general-purpose and scientific applications. "} %4) Performance Modeling and Code Partitioning for the DS Architecture @INPROCEEDINGS{doe:DSarch, AUTHOR = "Yinong Zhang and George B. Adams III", TITLE = "Performance Modeling and Code Partitioning for the DS Architecture" , BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "293--304", MONTH = June, WHERE = "bound", ABSTRACT = " DS (Decoupled-Superscalar) is a new microarchitecture that combines decoupled and superscalar techniques to exploit instruction level parallelism. Issue bandwidth is increased while circuit complexity growth is controlled with little negative impact on performance. Programs for DS are compiled into two instruction substreams: the dominant substream navigates the control flow and the rest of computational task is shared between the dominant and subsidiary substreams. Each substream is processed by a separate superscalar core realizable with current VLSI technology. DS machines are binary compatible with superscalar machines having the same instruction set, and a family of DS machines is binary compatible without recompilation. DS run time behavior is examined with an analytical model. A novel technique for controlling slip between substreams is introduced. Code partitioning issues of instruction count balancing and residence time balancing, important to any split-stream scheme, are discussed. Simulation shows DS achieves performance comparable to an aggressive superscalar, but with potentially less complex hardware and faster clock rate. "} % Parallel Machines %1) Exploiting Fine-grain Thread Level Parallelism on the MIT Multi-ALU Processor @INPROCEEDINGS{Keckler:MIT, AUTHOR = "Stephen W. Keckler and William J. Dally and Daniel Maskit and Nicholas P. Carter and Andrew Chang and Whay S. Lee", TITLE = "Exploiting Fine-grain Thread Level Parallelism on the MIT Multi-ALU Processor", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "306--317", MONTH = June, WHERE = "bound", ABSTRACT = " Much of the improvement in computer performance over the last twenty years has come from faster transistors and architectural advances that increase parallelism. Historically, parallelism has been exploited either at the instruction level with a grain-siz e of a single instruction or by partitioning applications into coarse threads with grain-sizes of thousands of instructions. Fine-grain threads fill the parallelism gap between these extremes by enabling tasks with run lengths as small as 20 cycles. As th is fine-grain parallelism is orthogonal to ILP and coarse threads, it complements both methods and provides an opportunity for greater speedup. This paper describes the efficient communication and synchronization mechanisms implemented in the Multi-ALU Pr ocessor (MAP) chip, including a thread creation instruction, register communication, and a hardware barrier. These register-based mechanisms provide 10 times faster communication and 60 times faster synchronization than mechanisms that operate via a share d on-chip cache. With a three-processor implementation of the MAP, fine-grain speedups of 1.2-2.1 are demonstrated on a suite of applications. "} %2) Effects of Architectural and Technological Advances on the HP/Convex Exemplar's Memory and Communication Performance @INPROCEEDINGS{abandah:exemplar, AUTHOR = "Gheith A. Abandah and Edward S. Davidson", TITLE = "Effects of Architectural and Technological Advances on the HP/Convex Exemplar's Memory and Communication Performance", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "318--329", MONTH = June, WHERE = "bound", ABSTRACT = " Advances in microarchitecture, packaging, and manufacturing processes enable designers to build new systems with higher performance and scalability. Using microbenchmark techniques, we contrast the memory and communication performance of two generations of the HP/Convex Exemplar scalable parallel processing system. The SPP1000 and SPP2000 have significant architectural and implementation differences, but maintain upward binary compatibility. The SPP2000 employs manufacturing and packaging advances to obtain shorter system interconnects with wider data paths and improved functionality, thereby reducing the latency and increasing the bandwidth of remote communication. Although the memory latency is not significantly improved, newer out-of-order execution processors coupled with nonblocking caches achieve much higher memory bandwidth. The SPP2000 has a richer system interconnect topology that allows scalability to a larger number of processors. The SPP2000 also employs innovations in its coherence protocols to improve synchronization and communication performance. This paper characterizes the performance effects of these changes, and identifies some remaining inefficiencies, in the cache coherence protocol and the node configuration, that future systems should address. "} %3) Design Choices in the SHRIMP System: An Empirical Study @INPROCEEDINGS{blumrich:shrimp, AUTHOR = "Matthias A. Blumrich and Richard D. Alpert and Yuqun Chen and Doug las W. Clark and Stefanos N. Damianakis and Cezary Dubnicki and Edward W. Felten and Liviu Iftode and Kai Li and Margaret Martonosi and Robert A. Shillner", TITLE = "Design Choices in the SHRIMP System: An Empirical Study", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "330--341", MONTH = June, WHERE = "bound", ABSTRACT = " The SHRIMP cluster-computing system has progressed to a point of relative maturity; a variety of applications are running on a 16-node system. We have enough experience to understand what we did right and wrong in designing and building the system. In this paper we discuss some of the lessons we learned about computer architecture, and about the challenges involved in building a significant working system in an academic research environment. We evaluate significant design choices by modifying the network interface firmware and the system software in order to empirically compare our design to other approaches. "} %4) Flexible Use of Memory for Replication/Migration in Cache-coherent DMS Multiprocessors @INPROCEEDINGS{soundararajan:flexible, AUTHOR = "Vijayaraghavan Soundararajan and Mark Heinrich and Ben Verghese and Kourosh Gharachorloo and Anoop Gupta and John Hennessy", TITLE = "Flexible Use of Memory for Replication/Migration in Cache-Coherent DSM Multiprocessors", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "342--355", MONTH = June, WHERE = "bound", ABSTRACT = " Given the limitations of bus-based multiprocessors, CC-NUMA is the scalable architecture of choice for shared-memory machines. The most important characteristic of the CC-NUMA architecture is that the latency to access data on a remote node is considerably larger than the latency to access local memory. On such machines, good data locality can reduce memory stall time and is therefore a critical factor in application performance. In this paper we study the various options available to system designers to transparently decrease the fraction of data misses serviced remotely. This work is done in the context of the Stanford FLASH multiprocessor. FLASH is unique in that each node has a single pool of DRAM that can be used in a variety of ways by the programmable memory controller. We use the programmability of FLASH to explore different options for cache-coherence and data- locality in compute-server workloads. First, we consider two protocols for providing base cache-coherence, one with centralized directory information (dynamic pointer allocation) and another with distributed directory information (SCI). While several commercial systems are based on SCI, we find that a centralized scheme has superior performance. Next, we consider different hardware and software techniques that use some or all of the local memory in a node to improve data locality. Finally, we propose a hybrid scheme that combines hardware and software techniques. These schemes work on the same base platform with both user and kernel references from the workloads. The paper thus offers a realistic and fair comparison of replication/migration techniques that has not previously been feasible. "} % Caches and Memory Systems %1) Exploiting Spatial Locality in Data Caches Using Spatial Footprints @INPROCEEDINGS{doe:ionic, AUTHOR = "Sanjeev Kumar and Christopher Wilkerson", TITLE = "Exploiting Spatial Locality in Data Caches using Spatial Footprints ", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "357--368", MONTH = June, WHERE = "bound", ABSTRACT = " Modern cache designs exploit spatial locality by fetching large blocks of data called cache lines on a cache miss. Subsequent references to words within the same cache line result in cache hits. Although this approach benefits from spatial locality, less than half of the data brought into the cache gets used before eviction. The unused portion of the cache line negatively impacts performance by wasting bandwidth and polluting the cache by replacing potentially useful data that would otherwise remain in the cache. This paper describes an alternative approach to exploit spatial locality available in data caches. On a cache miss, our mechanism, called Spatial Footprint Predictor (SFP), predicts which portions of a cache block will get used before getting evicted. The high accuracy of the predictor allows us to exploit spatial locality exhibited in larger blocks of data yielding better miss ratios without significantly impacting the memory access latencies. Our evaluation of this mechanism shows that the miss rate of the cache is improved, on average, by 18% in addition to a significant reduction in the bandwidth requirement. "} %2) Low Load Latency through Sum-addressed Memory (SAM) @INPROCEEDINGS{lynch:analytic, AUTHOR ="William L. Lynch, Gary Lauterbach, and Joseph Chamdani", TITLE = "Low Load Latency through Sum-Addressed Memory (SAM)", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "369--379", MONTH = June, WHERE = "bound", ABSTRACT = " Load latency contributes significantly to execution time. Because most cache accesses hit, cache-hit latency becomes an important component of expected load latency. Most modern microprocessors have base+offset addressing loads; thus effective cache-hit latency includes an addition as well as the RAM access. This paper introduces a new technique used in the UltraSPARC III microprocessor, Sum-Addressed Memory (SAM), which performs true addition using the decoder of the RAM array, with very low latency. We compare SAM with other methods for reducing the add part of load latency. These methods include sum-prediction with recovery, and bitwise indexing with duplicate-tolerance. The results demonstrate the superior performance of SAM. "} %3) Analytic Evaluation of Shared-memory Parallel Systems with ILP Processors @INPROCEEDINGS{sorin:analytic, AUTHOR = "Daniel J. Sorin and Vijay S. Pai and Sarita V. Adve and Mary K. Vernon and David A. Wood", TITLE = "Analytic Evaluation of Shared-Memory Systems with ILP Processors", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "380--391", MONTH = June, WHERE = "bound", ABSTRACT = " This paper develops and validates an analytical model for evaluating various types of architectural alternatives for shared-memory systems with processors that aggressively exploit instruction-level parallelism. Compared to simulation, the analytical model is many orders of magnitude faster to solve, yielding highly accurate system performance estimates in seconds. The model input parameters characterize the ability of an application to exploit instruction-level parallelism as well as the interaction between the application and the memory system architecture. A trace-driven simulation methodology is developed that allows these parameters to be generated over 100 times faster than with a detailed execution-driven simulator. Finally, this paper shows that the analytical model can be used to gain insights into application performance and to evaluate architectural design trade-offs. "} From hoffman@cs.wisc.edu Tue Aug 4 13:18:56 1998 Received: from primost (localhost [127.0.0.1]) by primost.cs.wisc.edu (8.8.6/8.8.6) with SMTP id NAA00617 for ; Tue, 4 Aug 1998 13:18:54 -0500 (CDT) Sender: hoffman Message-ID: <35C7508D.CEC@cs.wisc.edu> Date: Tue, 04 Aug 1998 13:18:53 -0500 From: Rebecca Hoffman X-Mailer: Mozilla 3.01 (X11; U; SunOS 5.5.1 sun4m) MIME-Version: 1.0 To: hoffman Subject: bib Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Status: R %----------------------------------------------------------------------- % % These bibtex bibliographic entries for the 25th % INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE % (1998) created by Rebecca Hoffman and Mark D. Hill % from author input. % % These entries are correct to the best of our knowledge, % but we accept no responsibility for the consequences of % any errors. Email corrections to hoffman@cs.wisc.edu. % Last change: Thu Mar 27 11:22:10 CST 1997. % ------------------------------------------------------------------------ % % Machine Measurement %1) Memory System Characterization of Commercial Workloads @INPROCEEDINGS{barroso:memory, AUTHOR = "Luiz Andre Barroso and Kourosh Gharachorloo and Edouard Bugnion", TITLE = "Memory System Characterization of Commercial Workloads", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "3--14", MONTH = June, WHERE = "bound", ABSTRACT = " Commercial applications such as databases and Web servers constitute the largest and fastest-growing segment of the market for multiprocessor servers. Ongoing innovations in disk subsystems, along with the ever increasing gap between processor and memory speeds, have elevated memory system design as the critical performance factor for such workloads. However, most current server designs have been optimized to perform well on scientific and engineering workloads, potentially leading to design decisions that are non-ideal for commercial applications. The above problem is exacerbated by the lack of information on the performance requirements of commercial workloads, the lack of available applications for widespread study, and the fact that most representative applications are too large and complex to serve as suitable benchmarks for evaluating trade-offs in the design of processors and servers. This paper presents a detailed performance study of three important classes of commercial workloads: online transaction processing (OLTP), decision support systems (DSS), and Web index search. We use the Oracle commercial database engine for our OLTP and DSS workloads, and the AltaVista search engine for our Web index search workload. This study characterizes the memory system behavior of these workloads through a large number of architectural experiments on Alpha multiprocessors augmented with full system simulations to determine the impact of architectural trends. We also identify a set of simplifications that make these workloads more amenable to monitoring and simulation without affecting representative memory system behavior. We observe that systems optimized for OLTP versus DSS and index search workloads may lead to diverging designs, specifically in the size and speed requirements for off-chip caches. "} %2) Performance Characterization of the Quad Pentium Pro SMP Using OLTP Workloads @INPROCEEDINGS{keeton:pprodb, AUTHOR = "Kimberly Keeton and David A. Patterson and Yong Qiang He and Roger C. Raphael and Walter E. Baker", TITLE = "Performance Characterization of a Quad Pentium Pro SMP Using OLTP Workloads", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "15--26", MONTH = June, WHERE = "Barcelona, Spain", ABSTRACT = " Commercial applications are an important, yet often overlooked, workload with significantly different characteristics from technical workloads. The potential impact of these differences is that computers optimized for technical workloads may not provide good performance for commercial applications, and these applications may not fully exploit advances in processor design. To evaluate these issues, we use hardware counters to measure architectural features of a four-processor Pentium Pro-based server running a TPC-C-like workload on an Informix database. We examine the effectiveness of out-of-order execution, branch prediction, speculative execution, superscalar issue and retire, caching, and multiprocessor scaling. We find that out-of-order execution, superscalar issue and retire, and branch prediction are not as effective for database workloads as they are for technical workloads, such as SPEC. We find that caches are effective at reducing processor traffic to memory; even larger caches would be helpful to satisfy more data requests. Multiprocessor scaling of this workload is good, but even modest bus utilization degrades application memory latency, limiting database throughput. "} %3) Execution Characteristics of Desktop Applications on Windows NT @INPROCEEDINGS{Lee:NT, AUTHOR = "Dennis C. Lee and Patrick J. Crowley and Jean-Loup Baer and Thomas E. Anderson and Brian N. Bershad", TITLE = "Execution Characteristics of Desktop Applications on {Windows NT}", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "27--38", MONTH = June, WHERE = "bound", ABSTRACT = " This paper examines the performance of desktop applications running on the Microsoft Windows NT operating system on Intel x86 processors, and contrasts these applications to the programs in the integer SPEC95 benchmark suite. We present measurements of basic instruction set and program characteristics, and detailed simulation results of the way these programs use the memory system and processor branch architecture. We show that the desktop applications have similar characteristics to the integer SPEC95 benchmarks for many of these metrics. However, compared to the integer SPEC95 applications, desktop applications have larger instruction working sets, execute instructions in a greater number of unique functions, cross DLL boundaries frequently and execute a greater number of indirect calls. "} %4) An Analysis of Database Workload Performance on Simultaneous Multithreaded Processors @INPROCEEDINGS{lo:smtdatabase, AUTHOR = "Jack L. Lo and Luiz Andre Barroso and Susan J. Eggers and Kourosh Gharachorloo and Henry M. Levy and Sujay S. Parekh", TITLE = "An Analysis of Database Workload Performance on Simultaneous Multithreaded Processors", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "39--50", MONTH = June, WHERE = "bound", ABSTRACT = " Simultaneous multithreading (SMT) is an architectural technique in which the processor issues multiple instructions from multiple threads each cycle. While SMT has been shown to be effective on scientific workloads, its performance on database systems is still an open question. In particular, database systems have poor cache performance, and the addition of multithreading has the potential to exacerbate cache conflicts. This paper examines database performance on SMT processors using traces of the Oracle database management system. Our research makes three contributions. First, it characterizes the memory-system behavior of database systems running on-line transaction processing and decision support system workloads. Our data show that while DBMS workloads have large memory footprints, there is substantial data reuse in a small, cacheable critical working set. Second, we show that the additional data cache conflicts caused by simultaneous-multithreaded instruction scheduling can be nearly eliminated by the proper choice of software- directed policies for virtual-to-physical page mapping and per-process address offsetting. Our results demonstrate that with the best policy choices, D-cache miss rates on an 8-context SMT are roughly equivalent to those on a single-threaded superscalar. Multithreading also leads to better inter-thread instruction cache sharing, reducing I-cache miss rates by up to 35%. Third, we show that SMT's latency tolerance is highly effective for database applications. For example, using a memory-intensive OLTP workload, an 8-context SMT processor achieves a 3-fold increase in instruction throughput over a single-threaded superscalar with similar resources. "} % Program Behaviour %1) An Analysis of Correlation and Predictability: What Makes Two-level Branch Predictors Work @INPROCEEDINGS{evers:corr, AUTHOR = "Marius Evers and Sanjay J. Patel and Robert S. Chappell and Yale N. Patt", TITLE = "An Analysis of Correlation and Predictability: What Makes Two-Level Branch Predictors Work", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "52--61", MONTH = June, WHERE = "bound", ABSTRACT = " Pipeline flushes due to branch mispredictions is one of the most serious problems facing the designer of a deeply pipelined, superscalar processor. Many branch predictors have been proposed to help alleviate this problem, including two-level adaptive branch predictors and hybrid branch predictors. Numerous studies have shown which predictors and configurations best predict the branches in a given set of benchmarks. Some studies have also investigated effects, such as pattern history table interference, that can be detrimental to the performance of these predictors. However, little research has been done on which characteristics of branch behavior make predictors perform well. In this paper, we investigate and quantify reasons why branches are predictable. We show that some of this predictability is not captured by the two-level adaptive branch predictors. An understanding of the predictability of branches m ay lead to insights ultimately resulting in better or less complex predictors. We also investigate and quantify what fraction of the branches in each benchmark is predictable using each of the methods described in this paper. "} %2) Branch Prediction Based on Universal Data Compression Algorithms @INPROCEEDINGS{federovsky:branch, AUTHOR = "Eitan Federovsky and Meir Feder and Shlomo Weiss", TITLE = "Branch Prediction Based on Universal Data Compression Algorithms", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "62--72", MONTH = June, WHERE = "bound", ABSTRACT = " Data compression and prediction are closely related. Thus prediction methods based on data compression algorithms have been suggested for the branch prediction problem. In this work we consider two universal compression algorithms: prediction by partial matching (PPM), and a recently developed method, context tree weighting (CTW). We describe the prediction algorithms induced by these methods. We also suggest adaptive algorithms - variations of the basic methods that attempt to fit limited memory constraints and to match the non-stationary nature of the branch sequence. Furthermore, we show how to incorporate address information and ne other relevant data. Finally, we present simulation results for selected programs from the SPECint95, SYSmark/32, SYSmark/NT, and transactional processing benchmarks. Our results are most promising in programs with difficult to predict branch behavior. "} %3) Modeling Program Predictability @INPROCEEDINGS{sazeidessmith:modeling, AUTHOR = "Yiannakis Sazeides and James E. Smith", TITLE = "Modeling Program Predictability", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "73--84", MONTH = June, WHERE = "Barcelona", ABSTRACT = " Basic properties of program predictability -- for both values and control -- are defined and studied. We take the view that program predictability originates at certain points during a program's execution, flows through subsequent instructions, and then ends at other points in the program. These key components of predictability: generation, propagation, and termination; are defined in terms of a model. The model is based on a graph derived from dynamic data dependences and a predictor. Using the SPEC95 benchmarks, we analyze the predictability phenomena both separately and in combination. Examples are provided to illustrate relationships between model-based characteristics and program constructs. It is shown that most predictability derives from program control structure and immediate values, not program input data. Furthermore, most predictability originates from a relatively small number of generate points. The analysis of obtained results suggests a number of ramifications regarding predictability and its use. "} % Graphics and IO %1) Multi-level Texture Caching for 3D Graphics Hardware @INPROCEEDINGS{cox:ionic, AUTHOR = "Michael Cox, Narendra Bhandari, and Michael Shantz", TITLE = "Multi-Level Texture Caching for 3D Graphics Hardware", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "86--97", MONTH = June, WHERE = "bound", ABSTRACT = " Traditional graphics hardware architectures implement what we call the push architecture for texture mapping. Local memory is dedicated to the accelerator for fast local retrieval of texture during rasterization, and the application is responsible for managing this memory. The push architecture has a bandwidth advantage, but disadvantages of limited texture capacity, escalation of accelerator memory requirements (and therefore cost), and poor memory utilization. The push architecture also requires the programmer to solve the bin-packing problem of managing accelerator memory each frame. More recently graphics hardware on PC-class machines has moved to an implementation of what we call the pull architecture. Texture is stored in system memory and downloaded by the accelerator as needed. The pull architecture has advantages of texture capacity, stems the escalation of accelerator memory requirements, and has good memory utilization. It also frees the programmer from accelerator texture memory management. However, the pull architecture suffers escalating requirements for bandwidth from main memory to the accelerator. In this paper we propose multi-level texture caching to provide the accelerator with the bandwidth advantages of the push architecture combined with the capacity advantages of the pull architecture. We have studied the feasibility of 2-level caching and found the following: (1) significant re-use of texture between frames; (2) L2 caching requires significantly less memory than the push architecture; (3) L2 caching requires significantly less bandwidth from host memory than the pull architecture; (4) L2 caching enables implementation of smaller L1 caches that would otherwise bandwidth-limit accelerators on the workloads in this paper. Results suggest that an L2 cache achieves the original advantage of the pull architecture -- stemming the growth of local texture memory -- while at the same time stemming the current explosion in demand for texture bandwidth between host memory and the accelerator. "} %2) Switcherland: A QoS Communication Architecture for Workstation Clusters @INPROCEEDINGS{eberle:switcherland, AUTHOR = "Hans Eberle and Erwin Oertli", TITLE = "Switcherland: A QoS Communication Architecture for Workstation Clusters", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "98--108", MONTH = June, WHERE = "bound", ABSTRACT = " Computer systems have become powerful enough to process continuous data streams such as video or animated graphics. While processing power and communication bandwidth of today's systems typically are sufficient, quality of service (QoS) guarantees as required for handling such data types cannot be provided by these systems in adequate ways. We present Switcherland, a scalable communication architecture based on crossbar switches that provides QoS guarantees for workstation clusters in the form of reserved bandwidth and bounded transmission delays. Similar to the ATM technology Switcherland provides QoS guarantees with the help of service classes, that is, data transfers are characterized as variable bit rate traffic or constant bit rate traffic. However, unlike LAN technologies, Switcherland is optimized for cluster computing in that (i) it serves as ackplane interconnection fabric as well as a LAN, (ii) it extends support for service classes by also covering the end nodes of the network, (iii) it provides low latency in the order of one microsecond per switch, and (iv) it uses a communication model based on a global memory to simplify programming. "} %3) Declustered Disk Array Architectures with Optimal and Near-optimal Parallelism @INPROCEEDINGS{alvarez:parallelism, AUTHOR = "Guillermo A. Alvarez and Walter A. Burkhard and Larry J. Stockmeyer and Flaviu Cristian", TITLE = "Declustered Disk Array Architectures with Optimal and Near-optimal Parallelism", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "109--120", MONTH = June, WHERE = "bound", ABSTRACT = " This paper investigates the placement of data and parity on redundant disk arrays. Declustered organizations have been traditionally used to achieve fast reconstruction of a failed disk's contents. In previous work, Holland and Gibson identified six desirable properties for ideal layouts; however, no declustered layout satisfying all properties has been published in the literature. We present a complete, constructive characterization of the collection of ideal declustered layouts possessing all six properties. Given that ideal layouts exist only for a limited set of configurations, we also present two novel layout families. Prime and Relpr can tolerate multiple failures in a wide variety of configurations with slight deviations from the ideal. Our simulation studies show that the new layouts provide excellent parallel access performance and reduced incremental loads during degraded operation, when compared with previously published layouts. For large accesses and under high loads, response times for the new layouts are typically smaller than those of previously published declustered layouts by a factor of 2.5. "} % Speculation %1) Confidence Estimation for Speculation Control @INPROCEEDINGS{grunwald98:confidence, AUTHOR = "Dirk Grunwald and Artur Klauser and Srilatha Manne and Andrew Ples zkun", TITLE = "Confidence Estimation for Speculation Control", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "122--131", MONTH = June, WHERE = "bound", ABSTRACT = " Modern processors improve instruction level parallelism by speculation. The outcome of data and control decisions is predicted, and the operations are speculatively executed and only committed if the original predictions were correct. There are a number of other ways that processor resources could be used, such as threading or eager execution. As the use of speculation increases, we believe more processors will need some form of speculation control to balance the benefits of speculation against other possible activities. Confidence estimation is one technique that can be exploited by architects for speculation control. In this paper, we introduce performance metrics to compare confidence estimation mechanisms, and argue that these metrics are appropriate for speculation control. We compare a number of confidence estimation mechanisms, focusing on mechanisms that have a small implementation cost and gain benefit by exploiting characteristics of branch predictors, such as clustering of mispredicted branches. We compare the performance of the different confidence estimation methods using detailed pipeline simulations. Using these simulations, we show how to improve some confidence estimators, providing better insight for future investigations comparing and applying confidence estimators. "} %2) Pipeline Gating: Speculation Control for Energy Reduction @INPROCEEDINGS{manne98:gating, AUTHOR = "Srilatha Manne and Artur Klauser and Dirk Grunwald", title = "Pipeline Gating: Speculation Control For Energy Reduction", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "132--141", MONTH = June, WHERE = "bound", ABSTRACT = " Branch prediction has enabled microprocessors to increase instruction level parallelism (ILP) by allowing programs to speculatively execute beyond control boundaries. Although speculative execution is essential for increasing the instructions per cycle (IPC), it does come at a cost. A large amount of unnecessary work results from wrong-path instructions entering the pipeline due to branch misprediction. Results generated with the SimpleScalar tool set using a 4-way issue pipeline and various branch predictors show an instruction overhead of 16% to 105% for every instruction committed. The instruction overhead will increase in the future as processors use more aggressive speculation and wider issue widths. In this paper, we present an innovative method for power reduction which, unlike previous work that sacrificed flexibility or performance, reduces power in high-performance microprocessors without impacting performance. In particular, we introduce a hardware mechanism called pipeline gating to control rampant speculation in the pipeline. We present inexpensive mechanisms for determining when a branch is likely to mispredict, and for stopping wrong-path instructions from entering the pipeline. Results show up to a 38% reduction in wrong-path instructions with a negligible performance loss (~ 1%). Best of all, even in programs with a high branch prediction accuracy, performance does not noticeably degrade. Our analysis indicates that there is little risk in implementing this method in existing processors since it does not impact performance and can benefit energy reduction. "} %3) Memory Dependence Prediction Using Store Sets @INPROCEEDINGS{chrysos:storesets, AUTHOR = "George Z. Chrysos and Joel S. Emer", TITLE = "Memory Dependence Prediction using Store Sets", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "142--153", MONTH = June, WHERE = "bound", ABSTRACT = " For maximum performance, an out-of-order processor must issue load instructions as early as possible, while avoiding memory-order violations with prior store in-structions that write to the same memory location. One approach is to use memory dependence prediction to identify the stores upon which a load depends, and com-municate that information to the instruction scheduler. We designate the set of stores upon which each load has depended as the load's \"store set\". The processor can discover and use a load's store set to accurately predict the earliest time the load can safely execute. We show that store sets accurately predict memory dependencies in the context of large instruction window, superscalar ma-chines, and allow for near-optimal performance compared to an instruction scheduler with perfect knowledge of memory dependencies. In addition, we explore the imple-mentation aspects of store sets, and describe a low cost implementation that achieves nearly optimal performance. "} % Prediction Techniques %1) Dynamic History-length Fitting: A Third Level of Adaptivity for Branch Prediction @INPROCEEDINGS{toni:dhlf, AUTHOR = "Toni Juan and Sanji Sanjeevan and Juan J. Navarro", TITLE = "Dynamic History-Length Fitting: A third level of adaptivity for branch prediction", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "155--166", MONTH = June, WHERE = "Barcelona", ABSTRACT = " Accurate branch prediction is essential for obtaining high performance in pipelined superscalar processors that execute instructions speculatively. Some of the best current predictors combine a part of the branch address with a fixed amount of global history of branch outcomes in order to make a prediction. These predictors cannot perform uniformly well across all workloads because the best amount of history to be used depends on the code, the input data and the frequency of context switches. Consequently, all predictors that use a fixed history length are therefore unable to perform up to their maximum potential. We introduce a method --called DHLF-- that dynamically determines the optimum history length during execution, adapting to the specific requirements of any code, input data and system workload. Our proposal adds an extra level of adaptivity to two-level adaptive branch predictors. The DHLF method can be applied to any one of the predictors that combine global branch history with the branch address. We apply the DHLF method to gshare (dhlf-gshare) and obtain near-optimal results for all SPECint95 benchmarks, with and without context switches. Some results are also presented for gskewed (dhlf-gskewed), confirming that other predictors can benefit from our proposal. "} %2) Accurate Indirect Branch Prediction AUTHOR = "Karel Driesen and Urs H\166lzle ", TITLE = "Accurate Indirect Branch Prediction", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "167--178", MONTH = June, WHERE = "bound", ABSTRACT = bstract: Indirect branch prediction is likely to become increasingly important in the future because indirect branches occur more frequently in object-oriented programs. With misprediction rates of around 25% on current processors, indirect branches can incur a significant fraction of branch misprediction overhead even though they remain less frequent than the more predictable conditional branches. We investigate a wide range of two-level predictors dedicated exclusively to indirect branches. Starting with predictors that use full-precision addresses and unlimited tables, we progressively introduce hardware constraints and minimize the loss of predictor performance at each step. For programs from the SPECint95 suite as well as a suite of large C++ applications, a two-level predictor achieves a misprediction rate of 9.8% with a 1K-entry table and 7.3% with an 8K-entry table, representing more than a threefold improvement over an ideal BTB. A hybrid predictor further reduces the misprediction rates to 8.98% and 5.95%, respectively. "} %3) Using Prediction to Accelerate Coherence Protocols @INPROCEEDINGS{mukherjee:cosmos, AUTHOR = "Shubhendu S. Mukherjee and Mark D. Hill", TITLE = "Using Prediction to Accelerate Coherence Protocols", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "179--190", MONTH = June, WHERE = "bound", ABSTRACT = " Most large shared-memory multiprocessors use directory protocols to keep per-processor caches coherent. Some memory references in such systems, however, suffer long latencies for misses to remotely- cached blocks. To ameliorate this latency, researchers have aug mented standard coherence protocols with optimizations for specific sharing patterns, such as read-modify-write, producer-consumer, and migratory sharing. This paper seeks to replace these directed solutions with general prediction logic that monitors coherence activity and triggers appropriate coherence actions. This paper takes the first step toward using general prediction to accelerate coherence protocols by developing and evaluating the Cosmos coherence message predictor. Cosmos predicts the source and type of the next coherence message for a cache block using logic that is an extension of Yeh and Patt's two-level PAp branch predictor. For five scientific applications running on 16 processors, Cosmos has prediction accuracies of 62% to 93%. Cosmos' high prediction accuracy is a result of predictable coherence message signatures that arise from stable sharing patterns of cache blocks. "} % Memory Management %1) Active Pages: A Computation Model for Intelligent Memory OCEEDINGS{oskin:active-pages, AUTHOR = "Mark Oskin, Frederic T. Chong, Timothy Sherwood", TITLE = "Active Pages: A Computation Model for Intelligent Memory", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "192--203", MONTH = June, WHERE = "bound", ABSTRACT = " Microprocessors and memory systems suffer from a growing gap in performance. We introduce Active Pages, a computation model which addresses this gap by shifting data-intensive computations to the memory system. An Active Page consists of a page of data and a set of associated functions which can operate upon that data. We describe an implementation of Active Pages on RADram (Reconfigurable Architecture DRAM), a memory system based upon the integration of DRAM and reconfigurable logic. Results from the SimpleScalar simulator demonstrate up to 1000X speedups on several applications using the RADram system versus conventional memory systems. We also explore the sensitivity of our results to implementations in other memory technologies. "} %2) Increasing TLB Reach Using Superpages Backed by Shadow Memory @INPROCEEDINGS{swanson:superpages, AUTHOR = "Mark R. Swanson and Leigh Stoller and John Carter", TITLE = "Increasing TLB Reach Using Superpages Backed by Shadow Memory", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "204--213", MONTH = June, WHERE = "bound", ABSTRACT = " The amount of memory that can be accessed without causing a TLB fault, the reach of a TLB, is failing to keep pace with the increasingly large working sets of applications. We propose to extend TLB reach via a novel Memory Controller TLB (MTLB) that lets us aggressively create superpages from non-contiguous, unaligned regions of physical memory. This flexibility increases the OS's ability to use superpages on arbitrary application data. The MTLB supports shadow pages, regions of physical address space for which the MTLB remaps accesses to real physical pages. The MTLB preserves per-base-page referenced and dirty bits, which enables the OS to swap shadow-backed superpages a page at a time, unlike conventional superpages. Simulation of five applications, including two SPECint95 benchmarks, demonstrated that a modest-sized MTLB improves performance of applications with moderate-to-high TLB miss rates by 5-20 percent. Simulation also showed that this mechanism can more than double the effective reach of a processor TLB with no modification to the processor MMU. "} %3) Options for Dynamic Address Translation in COMAs @INPROCEEDINGS{qiu:options, AUTHOR = "Xiaogang Qiu and Michel Dubois", TITLE = "Options for Dynamic Address Translation in COMAs", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "214--225", MONTH = June, WHERE = "bound", ABSTRACT = " In modern processors, the dynamic translation of virtual addresses to support virtual memory is done before or in parallel with the first-level cache access. As processor technology improves at a rapid pace and the working sets of new applications grow insatiably the latency and bandwidth demands on the TLB (Translation Lookaside Buffer) are getting more and more difficult to meet. The situation is worse in multiprocessor systems, which run larger applications and are plagued by the TLB consistency problem. We evaluate and compare five options for virtual ddress translation in the context of COMAs (Cache Only Memory Architectures). The dynamic address translation mechanism can be located after the cache access provided the cache is virtual. In a particular design, which we call V-COMA for Virtual COMA, the physical address concept and the traditional TLB are eliminated. While still supporting virtual memory, V-COMA reduces the address translation overhead to a minimum. V-COMA scales well and works better in systems with large number of processors. As a machine running on virtual addresses, V-COMA provides a simple and consistent hardware model to the operating system and the compiler, in which further optimization opportunities are possible. "} % Prediction and Multipath Execution %1) Integrated Predicated and Speculative Execution in the IMPACT EPIC Architecture @INPROCEEDINGS{august:epic, AUTHOR = "D. I. August and D. A. Connors and S. A. Mahlke and J. W. Sias and K. M. Crozier and B. Cheng and P. R. Eaton and Q. B. Olaniran and W. W. Hwu", TITLE = "Integrated Predicated and Speculative Execution in the IMPACT EPIC Architecture", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "227--237", MONTH = June, WHERE = "bound", ABSTRACT = " Explicitly Parallel Instruction Computing (EPIC) architectures require the compiler to express program instruction level parallelism directly to the hardware. EPIC techniques which enable the compiler to represent control speculation, data dependence speculation, and predication have individually been shown to be very effective. However, these techniques have not been studied in combination with each other. This paper presents the IMPACT EPIC Architecture to address the issues involved in designing processors based on these EPIC concepts. In particular, we focus on new execution and recovery models in which microarchitectural support for predicated execution is also used to enable efficient recovery from exceptions caused by speculatively executed instructions. This paper demonstrates that a coherent framework to integrate the three techniques can be elegantly designed to achieve much better performance than each individual technique could alone provide. "} %2) Threaded Multiple Path Execution @INPROCEEDINGS{wallace:TME, AUTHOR = "Steven Wallace and Brad Calder and Dean M. Tullsen", TITLE = "Threaded Multiple Path Execution", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "238--249", MONTH = June, WHERE = "bound", ABSTRACT = " This paper presents Threaded Multi-Path Execution (TME), which exploits existing hardware on a Simultaneous Multithreading (SMT) processor to speculatively execute multiple paths of execution. When there are fewer threads in an SMT processor than hardware contexts, threaded multi-path execution uses spare contexts to fetch and execute code along the less likely path of hard-to-predict branches. This paper describes the hardware mechanisms needed to enable an SMT processor to efficiently spawn speculative threads for threaded multi-path execution. The Mapping Synchronization Bus is described, which enables the spawning of these multiple paths. Policies are examined for deciding which branches to fork, and for managing competition between primary and alternate path threads for critical resources. Our results show that TME increases the single program performance of an SMT with eight thread contexts by 14-23 percent on average, depending on the misprediction penalty, for programs with a high misprediction rate. "} %3) Selective Eager Execution on the PolyPath Architecture @INPROCEEDINGS{Klauser:polypath, AUTHOR = "Artur Klauser and Abhijit Paithankar and Dirk Grunwald", TITLE = "Selective Eager Execution on the PolyPath Architecture", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "250--259", MONTH = June, WHERE = "bound", ABSTRACT = " Control-flow misprediction penalties are a major impediment to high performance in wide-issue superscalar processors. In this paper we present Selective Eager Execution (SEE), an execution model to overcome mis-speculation penalties by executing both paths after diffident branches. We present the micro-architecture of the PolyPath processor, which is an extension of an aggressive superscalar, out-of-order architecture. The PolyPath architecture uses a novel instruction tagging and register renaming mechanism to execute instructions from multiple paths simultaneously in the same processor pipeline, while retaining maximum resource availability for single-path code sequences. Results of our execution-driven, pipeline-level simulations show that SEE can improve performance by as much as 36% for the go benchmark, and an average of 14% on SPECint95, when compared to a normal superscalar, out-of-order, speculative execution, mono-path processor. Moreover, our architectural model is both elegant and practical to implement, using a small amount of additional state and control logic. "} % Processor Microarchitecture %1) Improving Trace Cache Effectiveness with Branch Promotion and Trace Packing @INPROCEEDINGS{sanjay:tracecache, AUTHOR = "Sanjay J. Patel and Marius Evers and Yale N. Patt", TITLE = "Improving Trace Cache Effectiveness with Branch Promotion and Trace Packing", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "262--271", MONTH = June, WHERE = "bound", ABSTRACT = " The increasing widths of superscalar processors are placing greater demands upon the fetch mechanism. The trace cache meets these demands by placing logically contiguous instructions in physically contiguous storage. As a result, the trace cache delivers instructions at a high rate by supplying multiple fetch blocks each cycle. In this paper, we examine two techniques to improve the number of instructions delivered each cycle by the trace cache. The first technique, branch promotion, dynamically converts strongly biased branches into branches with static predictions. Because these promoted branches require no dynamic prediction, the branch predictor suffers less from the negative effects of interference. Branch promotion unlocks the potential of the second technique: trace packing. With trace packing, trace segments are packed with as many instructions as will fit, without regard to naturally occurring fetch block boundaries. With both techniques, the effective fetch rate of the trace cache jumps up 17% over a trace cache which implements neither. On a machine where the execution engine has a very aggressive memory disambiguator, the performance of a machine using branch promotion and trace packing is on average 11% higher than a machine using neither technique. "} %2) The Effect of Instruction Fetch Bandwidth on Value Prediction @INPROCEEDINGS{Gabbay:vp, AUTHOR = "Freddy Gabbay and Avi Mendelson", TITLE = "The Effect of Instruction Fetch Bandwidth on Value Prediction", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "272--281", MONTH = June, WHERE = "bound", ABSTRACT = " Value prediction attempts to eliminate true-data dependencies by dynamically pre dicting the outcome values of instructions and executing true-data dependent ins tructions based on that prediction. In this paper we attempt to understand the l imitations of using this paradigm in realistic machines. We show that the instru ction-fetch bandwidth and the issue rate have a very significant impact on the e fficiency of value prediction. In addition, we study how recent techniques to im prove the instruction-fetch rate affect the efficiency of value prediction and i ts hardware organization. "} %3) Dynamic IPC/Clock Rate Optimization @INPROCEEDINGS{albonesi:ipcclock, AUTHOR = "David H. Albonesi", TITLE = "Dynamic IPC/Clock Rate Optimization", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "282--292", MONTH = June, WHERE = "bound", ABSTRACT = " Current microprocessor designs set the functionality and clock rate of the chip at design time based on the configuration that achieves the best overall performance over a range of target applications. The result may be poor performance when running applications whose requirements are not well-matched to the particular hardware organization chosen. We present a new approach called Complexity-Adaptive Processors (CAPs) in which the IPC/clock rate tradeoff can be altered at runtime to dynamically match the changing requirements of the instruction stream. By exploiting repeater methodologies used increasingly in deep sub-micron designs, CAPs achieve this flexibility with potentially no cycle time impact compared to a fixed architecture. Our preliminary results in applying this approach to on-chip caches and instruction queues indicate that CAPs have the potential to significantly outperform conventional approaches on workloads containing both general-purpose and scientific applications. "} %4) Performance Modeling and Code Partitioning for the DS Architecture @INPROCEEDINGS{doe:DSarch, AUTHOR = "Yinong Zhang and George B. Adams III", TITLE = "Performance Modeling and Code Partitioning for the DS Architecture" , BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "293--304", MONTH = June, WHERE = "bound", ABSTRACT = " DS (Decoupled-Superscalar) is a new microarchitecture that combines decoupled and superscalar techniques to exploit instruction level parallelism. Issue bandwidth is increased while circuit complexity growth is controlled with little negative impact on performance. Programs for DS are compiled into two instruction substreams: the dominant substream navigates the control flow and the rest of computational task is shared between the dominant and subsidiary substreams. Each substream is processed by a separate superscalar core realizable with current VLSI technology. DS machines are binary compatible with superscalar machines having the same instruction set, and a family of DS machines is binary compatible without recompilation. DS run time behavior is examined with an analytical model. A novel technique for controlling slip between substreams is introduced. Code partitioning issues of instruction count balancing and residence time balancing, important to any split-stream scheme, are discussed. Simulation shows DS achieves performance comparable to an aggressive superscalar, but with potentially less complex hardware and faster clock rate. "} % Parallel Machines %1) Exploiting Fine-grain Thread Level Parallelism on the MIT Multi-ALU Processor @INPROCEEDINGS{Keckler:MIT, AUTHOR = "Stephen W. Keckler and William J. Dally and Daniel Maskit and Nicholas P. Carter and Andrew Chang and Whay S. Lee", TITLE = "Exploiting Fine-grain Thread Level Parallelism on the MIT Multi-ALU Processor", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "306--317", MONTH = June, WHERE = "bound", ABSTRACT = " Much of the improvement in computer performance over the last twenty years has come from faster transistors and architectural advances that increase parallelism. Historically, parallelism has been exploited either at the instruction level with a grain-siz e of a single instruction or by partitioning applications into coarse threads with grain-sizes of thousands of instructions. Fine-grain threads fill the parallelism gap between these extremes by enabling tasks with run lengths as small as 20 cycles. As th is fine-grain parallelism is orthogonal to ILP and coarse threads, it complements both methods and provides an opportunity for greater speedup. This paper describes the efficient communication and synchronization mechanisms implemented in the Multi-ALU Pr ocessor (MAP) chip, including a thread creation instruction, register communication, and a hardware barrier. These register-based mechanisms provide 10 times faster communication and 60 times faster synchronization than mechanisms that operate via a share d on-chip cache. With a three-processor implementation of the MAP, fine-grain speedups of 1.2-2.1 are demonstrated on a suite of applications. "} %2) Effects of Architectural and Technological Advances on the HP/Convex Exemplar's Memory and Communication Performance @INPROCEEDINGS{abandah:exemplar, AUTHOR = "Gheith A. Abandah and Edward S. Davidson", TITLE = "Effects of Architectural and Technological Advances on the HP/Convex Exemplar's Memory and Communication Performance", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "318--329", MONTH = June, WHERE = "bound", ABSTRACT = " Advances in microarchitecture, packaging, and manufacturing processes enable designers to build new systems with higher performance and scalability. Using microbenchmark techniques, we contrast the memory and communication performance of two generations of the HP/Convex Exemplar scalable parallel processing system. The SPP1000 and SPP2000 have significant architectural and implementation differences, but maintain upward binary compatibility. The SPP2000 employs manufacturing and packaging advances to obtain shorter system interconnects with wider data paths and improved functionality, thereby reducing the latency and increasing the bandwidth of remote communication. Although the memory latency is not significantly improved, newer out-of-order execution processors coupled with nonblocking caches achieve much higher memory bandwidth. The SPP2000 has a richer system interconnect topology that allows scalability to a larger number of processors. The SPP2000 also employs innovations in its coherence protocols to improve synchronization and communication performance. This paper characterizes the performance effects of these changes, and identifies some remaining inefficiencies, in the cache coherence protocol and the node configuration, that future systems should address. "} %3) Design Choices in the SHRIMP System: An Empirical Study @INPROCEEDINGS{blumrich:shrimp, AUTHOR = "Matthias A. Blumrich and Richard D. Alpert and Yuqun Chen and Doug las W. Clark and Stefanos N. Damianakis and Cezary Dubnicki and Edward W. Felten and Liviu Iftode and Kai Li and Margaret Martonosi and Robert A. Shillner", TITLE = "Design Choices in the SHRIMP System: An Empirical Study", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "330--341", MONTH = June, WHERE = "bound", ABSTRACT = " The SHRIMP cluster-computing system has progressed to a point of relative maturity; a variety of applications are running on a 16-node system. We have enough experience to understand what we did right and wrong in designing and building the system. In this paper we discuss some of the lessons we learned about computer architecture, and about the challenges involved in building a significant working system in an academic research environment. We evaluate significant design choices by modifying the network interface firmware and the system software in order to empirically compare our design to other approaches. "} %4) Flexible Use of Memory for Replication/Migration in Cache-coherent DMS Multiprocessors @INPROCEEDINGS{soundararajan:flexible, AUTHOR = "Vijayaraghavan Soundararajan and Mark Heinrich and Ben Verghese and Kourosh Gharachorloo and Anoop Gupta and John Hennessy", TITLE = "Flexible Use of Memory for Replication/Migration in Cache-Coherent DSM Multiprocessors", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "342--355", MONTH = June, WHERE = "bound", ABSTRACT = " Given the limitations of bus-based multiprocessors, CC-NUMA is the scalable architecture of choice for shared-memory machines. The most important characteristic of the CC-NUMA architecture is that the latency to access data on a remote node is considerably larger than the latency to access local memory. On such machines, good data locality can reduce memory stall time and is therefore a critical factor in application performance. In this paper we study the various options available to system designers to transparently decrease the fraction of data misses serviced remotely. This work is done in the context of the Stanford FLASH multiprocessor. FLASH is unique in that each node has a single pool of DRAM that can be used in a variety of ways by the programmable memory controller. We use the programmability of FLASH to explore different options for cache-coherence and data- locality in compute-server workloads. First, we consider two protocols for providing base cache-coherence, one with centralized directory information (dynamic pointer allocation) and another with distributed directory information (SCI). While several commercial systems are based on SCI, we find that a centralized scheme has superior performance. Next, we consider different hardware and software techniques that use some or all of the local memory in a node to improve data locality. Finally, we propose a hybrid scheme that combines hardware and software techniques. These schemes work on the same base platform with both user and kernel references from the workloads. The paper thus offers a realistic and fair comparison of replication/migration techniques that has not previously been feasible. "} % Caches and Memory Systems %1) Exploiting Spatial Locality in Data Caches Using Spatial Footprints @INPROCEEDINGS{doe:ionic, AUTHOR = "Sanjeev Kumar and Christopher Wilkerson", TITLE = "Exploiting Spatial Locality in Data Caches using Spatial Footprints ", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "357--368", MONTH = June, WHERE = "bound", ABSTRACT = " Modern cache designs exploit spatial locality by fetching large blocks of data called cache lines on a cache miss. Subsequent references to words within the same cache line result in cache hits. Although this approach benefits from spatial locality, less than half of the data brought into the cache gets used before eviction. The unused portion of the cache line negatively impacts performance by wasting bandwidth and polluting the cache by replacing potentially useful data that would otherwise remain in the cache. This paper describes an alternative approach to exploit spatial locality available in data caches. On a cache miss, our mechanism, called Spatial Footprint Predictor (SFP), predicts which portions of a cache block will get used before getting evicted. The high accuracy of the predictor allows us to exploit spatial locality exhibited in larger blocks of data yielding better miss ratios without significantly impacting the memory access latencies. Our evaluation of this mechanism shows that the miss rate of the cache is improved, on average, by 18% in addition to a significant reduction in the bandwidth requirement. "} %2) Low Load Latency through Sum-addressed Memory (SAM) @INPROCEEDINGS{lynch:analytic, AUTHOR ="William L. Lynch, Gary Lauterbach, and Joseph Chamdani", TITLE = "Low Load Latency through Sum-Addressed Memory (SAM)", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "369--379", MONTH = June, WHERE = "bound", ABSTRACT = " Load latency contributes significantly to execution time. Because most cache accesses hit, cache-hit latency becomes an important component of expected load latency. Most modern microprocessors have base+offset addressing loads; thus effective cache-hit latency includes an addition as well as the RAM access. This paper introduces a new technique used in the UltraSPARC III microprocessor, Sum-Addressed Memory (SAM), which performs true addition using the decoder of the RAM array, with very low latency. We compare SAM with other methods for reducing the add part of load latency. These methods include sum-prediction with recovery, and bitwise indexing with duplicate-tolerance. The results demonstrate the superior performance of SAM. "} %3) Analytic Evaluation of Shared-memory Parallel Systems with ILP Processors @INPROCEEDINGS{sorin:analytic, AUTHOR = "Daniel J. Sorin and Vijay S. Pai and Sarita V. Adve and Mary K. Vernon and David A. Wood", TITLE = "Analytic Evaluation of Shared-Memory Systems with ILP Processors", BOOKTITLE = "Proceedings of the 25th Annual International Symposium on Computer Architecture", YEAR = 1998, PAGES = "380--391", MONTH = June, WHERE = "bound", ABSTRACT = " This paper develops and validates an analytical model for evaluating various types of architectural alternatives for shared-memory systems with processors that aggressively exploit instruction-level parallelism. Compared to simulation, the analytical model is many orders of magnitude faster to solve, yielding highly accurate system performance estimates in seconds. The model input parameters characterize the ability of an application to exploit instruction-level parallelism as well as the interaction between the application and the memory system architecture. A trace-driven simulation methodology is developed that allows these parameters to be generated over 100 times faster than with a detailed execution-driven simulator. Finally, this paper shows that the analytical model can be used to gain insights into application performance and to evaluate architectural design trade-offs. "}