%------------------------------------------------------------------------- % % These bibtex bibliographic entries for the 27th % INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE % (2000) created by Alicia Walley and Mark D. Hill % from author input. % % These entries are correct to the best of our knowledge, % but we accept no responsibility for the consequences of % any errors. Email corrections to walley@cs.wisc.edu. % Last change: Wed 27 Apr 13:38:51 CST 2000 % %------------------------------------------------------------------------- % % % Using Threads % @INPROCEEDINGS{steffan:tls, AUTHOR = "J. G. Steffan and C. B. Colohan and A. Zhaia and T. C. Mowry", TITLE = "A Scalable Approach to Thread-Level Speculation", BOOKTITLE = "Proceedings of the 27th Annual International Symposium on Computer Architecture", YEAR = "2000", PAGES = "???", MONTH = "June", ABSTRACT = " While architects understand how to build cost-effective parallel machines across a wide spectrum of machine sizes (ranging from within a single chip to large-scale servers), the real challenge is how to easily create parallel software to effectively exploit all of this raw performance potential. One promising technique for overcoming this problem is Thread-Level Speculation (TLS), which enables the compiler to optimistically create parallel threads despite uncertainty as to whether those threads are actually independent. In this paper, we propose and evaluate a design for supporting TLS that seamlessly scales to any machine size because it is a straightforward extension of writeback invalidation-based cache coherence (which itself scales both up and down). Our experimental results demonstrate that our scheme performs well on both single-chip multiprocessors and on larger-scale machines where communication latencies are twenty times larger. "} @INPROCEEDINGS{cintra:speculation, AUTHOR = "Marcelo Cintra and Jose F. Martinez and Josep Torrellas", TITLE = "Architectural Support for Scalable Speculative Parallelization in Shared-Memory Multiprocessors", BOOKTITLE = "Proceedings of the 27th Annual International Symposium on Computer Architecture", YEAR = "2000", PAGES = "???", MONTH = "June", ABSTRACT = " Speculative parallelization aggressively executes in parallel codes that cannot be fully parallelized by the compiler. Past proposals of hardware schemes have mostly focused on single-chip multiprocessors (CMPs), whose effectiveness is necessarily limited by their small size. Very few schemes have attempted this technique in the context of scalable shared-memory systems. In this paper, we present and evaluate a new hardware scheme for scalable speculative parallelization. This design needs relatively simple hardware and is efficiently integrated into a cache-coherent NUMA system. We have the scheme in a hierarchical manner that largely abstracts away the internals of the node. We effectively utilize a speculative CMP as the building block for our scheme. Simulations show that the architecture proposed delivers good speedups at a modest hardware cost. For a set of important non-analyzable scientific loops, we report average speedups of 4.2 for 16 processors. We show that support for per-word speculative state is required by our applications, or else the performance suffers greatly. "} @INPROCEEDINGS{reinhardt:multithreading, AUTHOR = "Steven K. Reinhardt and Shubhendu S. Mukherjee", TITLE = "Transient Fault Detection via Simultaneous Multithreading", BOOKTITLE = "Proceedings of the 27th Annual International Symposium on Computer Architecture", YEAR = "2000", PAGES = "???", MONTH = "June", ABSTRACT = " Smaller feature sizes, reduced voltage levels, higher transistor counts, and reduced noise margins make future generations of microprocessors increasingly prone to transient hardware faults. Most commercial fault-tolerant computers use fully replicated hardware components to detect microprocessor faults. The components are lockstepped (cycle-by-cycle synchronized) to ensure that, in each cycle, they perform the same operation on the same inputs, producing the same outputs in the absence of faults. Unfortunately, for a given hardware budget, full replication reduces performance by statically partitioning resources among redundant operations. We demonstrate that a Simultaneous and Redundantly Threaded (SRT) processor-derived from a Simultaneous Multithreaded (SMT) processor-provides transient fault coverage with significantly higher performance. An SRT processor provides transient fault coverage by running identical copies of the same program simultaneously as independent threads. An SRT processor provides higher performance because it dynamically schedules its hardware resources among the redundant copies. However, dynamic scheduling makes it difficult to implement lockstepping, because corresponding instructions from redundant threads may not execute in the same cycle or in the same order. This paper makes four contributions to the design of SRT processors. First, we introduce the concept of the sphere of replication, which abstracts both the physical redundancy of a lockstepped system and the logical redundancy of an SRT processor. This framework aids in identifying the scope of fault coverage and the input and output values requiring special handling. Second, we identify two viable spheres of replication in an SRT processor, and show that one of them provides fault detection while checking only committed stores and uncached loads. Third, we identify the need for consistent replication of load values, and propose and evaluate two new mechanisms for satisfying this requirement. Finally, we propose and evaluate two mechanisms-slack fetch and branch outcome queue-that enhance the performance of an SRT processor by allowing one thread to prefetch cache misses and branch results for the other thread. Our results with 11 SPEC95 benchmarks show that an SRT processor can outperform an equivalently sized, on-chip, hardware-replicated solution by 16% on average, with a maximum benefit of up to 29%. "} % % Exploiting Traces % @INPROCEEDINGS{jacobson:preconstruction, AUTHOR = "Quinn Jacobson and J.E. Smith", TITLE = "Trace Preconstruction", BOOKTITLE = "Proceedings of the 27th Annual International Symposium on Computer Architecture", YEAR = "2000", PAGES = "???", MONTH = "June", ABSTRACT = " Trace caches enable high bandwidth, low latency instruction supply, but have a high miss penalty and relatively large working sets. Consequently, their performance may suffer due to capacity and compulsory misses. Trace preconstruction augments a trace cache by performing a function analogous to prefetching. The trace preconstruction mechanism observes the processor's instruction dispatch stream to detect opportunities for jumping ahead of the processor. After doing so, the preconstruction mechanism fetches static instructions from the predicted future region of the program, and constructs a set of traces in advance of when they are needed. Trace preconstruction can significantly increase both the performance of the trace cache and the robustness of the trace cache to varying workloads. All but one of the SPECint95 benchmarks see a notable reduction in trace cache miss rates from preconstruction: over a 50% reduction for some benchmarks. We also consider the integration of preconstruction with another trace-specific mechanism (preprocessing) to produce a high performance frontend. When combined, preconstruction and trace preprocessing produce an average speedup of 14% for the SPECint95 benchmarks. "} @INPROCEEDINGS{rakvic:completion, AUTHOR = "Ryan Rakvic, Bryan Black and John P. Shen", TITLE = "Completion Time Multiple Branch Prediction for Enhancing Trace Cache Performance", BOOKTITLE = "Proceedings of the 27th Annual International Symposium on Computer Architecture", YEAR = "2000", PAGES = "???", MONTH = "June", ABSTRACT = " The need for multiple branch prediction is inherent to wide instruction fetching. This paper presents a completion time multiple branch predictor called the Tree-based Multiple Branch Predictor (TMP) that builds on previous single branch prediction techniques. It employs a tree structure of branch predictors, or tree-node predictors, and achieves accurate multiple branch prediction by leveraging the high accuracies of the individual branch predictors. A highly-efficient TMP design uses the 2-bit saturating counters for the tree-node predictors. To achieve higher prediction rate, the TMP employs two-level schemes for the tree-node predictors resulting in a three-level TMP design. Placing the TMP at completion time reduces the critical latency in the front-end of the pipeline; the resultant longer update latency does not significantly impact the overall performance. In this paper the TMP is applied to a trace cache design and shown to be very effective in increasing its performance. Results: A realistic-size TMP (72KB) can predict 1, 2, 3, and 4 consecutive blocks with compounded prediction accuracies of 96%, 93%, 87%, and 82%, respectively. The block-based trace cache with this TMP achieves 4.75 IPC for SPECint95 on an idealized machine, which is a 20% IPC for SPECint95 on an idealized machine, which is a 20% performance improvement over the original design [1]. This improved performance is 8% above that of a conventional I-cache design with perfect single branch prediction. "} @INPROCEEDINGS{merten:hotspotrelayout, AUTHOR = "Matthew C. Merten and Andrew R. Trick and Erik M. Nystrom and Ronald D. Barnes and Wen-mei W. Hwu", TITLE = "A Hardware Mechanism for Dynamic Extraction and Relayout of Program Hot Spots", BOOKTITLE = "Proceedings of the 27th Annual International Symposium on Computer Architecture", YEAR = "2000", PAGES = "???", MONTH = "June", ABSTRACT = " This paper presents a new mechanism for collecting and deploying runtime optimized code. The code-collecting component resides in the instruction retirement stage and lays out hot execution paths to improve instruction fetch rate as well as enable further code optimization. The code deployment component uses an extension to the Branch Target Buffer to migrate execution into the new code without modifying the original code. No significant delay is added to the total execution of the program due to these components. The code collection scheme enables safe runtime optimization along paths that span function boundaries. This technique provides a better platform for runtime optimization than trace caches, because the traces are longer and persist in main memory across context switches. Additionally, these traces are not as susceptible to transient behavior because they are restricted to frequently executed code. Empirical results show that on average this mechanism can achieve better instruction fetch rates using only 12KB of hardware than a trace cache requiring 15KB of hardware, while producing long, persistent traces more suited to optimization. "} % % Modeling Performance and Power % @INPROCEEDINGS{oskin:hls, AUTHOR = "Mark Oskin and Frederic T. Chong and Matthew Farrens", TITLE = "{HLS}: Combining Statistical and Symbolic Simulation to Guide Microprocessor Designs ", BOOKTITLE = "Proceedings of the 27th Annual International Symposium on Computer Architecture", YEAR = "2000", PAGES = "???", MONTH = "June", ABSTRACT = " As microprocessors continue to evolve, many optimizations reach a point of diminishing returns. We introduce HLS, a hybrid processor simulator which uses statistical models and symbolic execution to evaluate design alternatives. This simulation methodology allows for quick and accurate contour maps to be generated of the performance space spanned by design parameters. We validate the accuracy of HLS through correlation with existing cycle-by-cycle simulation techniques and current generation hardware. We demonstrate the power of HLS by exploring design spaces defined by two parameters: code properties and value prediction. These examples motivate how HLS can be used to set design goals and individual component performance targets. "} @INPROCEEDINGS{dbrooks:wattch, AUTHOR = "David Brooks and Vivek Tiwari and Margaret Martonosi", TITLE = "Wattch: A Framework for Architectural-Level Power Analysis and Optimizations", BOOKTITLE = "Proceedings of the 27th Annual International Symposium on Computer Architecture", YEAR = "2000", PAGES = "???", MONTH = "June", ABSTRACT = " Power dissipation and thermal issues are increasingly significant in modern processors. As a result, it is crucial that power/performance tradeoffs be made more visible to chip architects and even compiler writers, in addition to circuit designers. Most existing power analysis tools achieve high accuracy by calculating power estimates for designs only after layout or floorplanning are complete. In addition to being available only late in the design process, such tools are often quite slow, which compounds the difficulty of running them for a large space of design possibilities. This paper presents Wattch, a framework for analyzing and optimizing microprocessor power dissipation at the architecture-level. Wattch is 1000X or more faster than existing layout-level power tools, and yet maintains accuracy within 10% of their estimates as verified using industry tools on leading-edge designs. This paper presents several validations of Wattch's accuracy. In addition, we present three examples that demonstrate how architects or compiler writers might use Wattch to evaluate power consumption in their design process. We see Wattch as a complement to existing lower-level tools; it allows architects to explore and cull the design space early on, using faster, higher-level tools. It also opens up the field of power-efficient computing to a wider range of researchers by providing a power evaluation methodology within the portable and familiar SimpleScalar framework. "} @INPROCEEDINGS{vijay:simplepower, AUTHOR = "N. Vijaykrishnan and Mahmut Kandemir and Mary Jane Irwin and Hyun Sik Kim and Wu Ye", TITLE = "Energy-Driven Integrated Hardware-Software Optimizations Using SimplePower", BOOKTITLE = "Proceedings of the 27th Annual International Symposium on Computer Architecture", YEAR = "2000", PAGES = "???", MONTH = "June", ABSTRACT = " With the emergence of a plethora of embedded and portable applications, energy dissipation has joined throughput, area, and accuracy/precision as a major design constraint. Thus, designers must be concerned with both optimizing and estimating the energy consumption of circuits, architectures, and software. Most of the research in energy optimization and/or estimation has focused on single components of the system and has not looked across the interacting spectrum of the hardware and software. The novelty of our new energy estimation framework, SimplePower, is that it evaluates the energy considering the system as a whole rather than just as a sum of parts, and that it concurrently supports both compiler and architectural experimentation. We present the design and use of the SimplePower framework that includes a transition-sensitive, cycle-accurate datapath energy model that interfaces with analytical and transition sensitive energy models for the memory and bus subsystems, respectively. We analyzed the energy consumption of ten codes from the multidimensional array domain, a domain that is important for embedded video and signal processing systems, after applying different compiler and architectural optimizations. Our experiments demonstrate that early estimates from the SimplePower energy estimation framework can help identify the system energy hotspots and enable architects and compiler designers to focus their efforts on these areas. "} % % Memory Hierarchy Improvements % @INPROCEEDINGS{hallnor:cache, AUTHOR = "Erik G. Hallnor and Steven K. Reinhardt", TITLE = "A Fully Associative Software-Managed Cache Design", BOOKTITLE = "Proceedings of the 27th Annual International Symposium on Computer Architecture", YEAR = "2000", PAGES = "???", MONTH = "June", ABSTRACT = " As DRAM access latencies approach a thousand instruction-execution times and on-chip caches grow to multiple megabytes, it is not clear that conventional cache structures continue to be appropriate. Two key features---full associativity and software management---have been used successfully in the virtual-memory domain to cope with disk access latencies. Future systems will need to employ similar techniques to deal with DRAM latencies. This paper presents a practical, fully associative, software-managed secondary cache system that provides performance competitive with or superior to traditional caches without OS or application involvement. We see this structure as the first step toward OS- and application-aware management of large on-chip caches. This paper has two primary contributions: a practical design for a fully associative memory structure, the indirect index cache (IIC), and a novel replacement algorithm, generational replacement, that is specifically designed to work with the IIC. We analyze the behavior of an IIC with generational replacement as a drop-in, transparent substitute for a conventional secondary cache. We achieve miss rate reductions from 8% to 85% relative to a 4-way associative LRU organization, matching or beating a (practically infeasible) fully associative true LRU cache. Incorporating these miss rates into a rudimentary timing model indicates that the IIC/generational replacement cache could be competitive with a conventional cache at today's DRAM latencies, and will outperform a conventional cache as these CPU-relative latencies grow. "} @INPROCEEDINGS{saulsbury:tlb, AUTHOR = "Ashley Saulsbury and Fredrik Dahlgren and Per Stenstrom", TITLE = "Recency-based TLB Preloading", BOOKTITLE = "Proceedings of the 27th Annual International Symposium on Computer Architecture", YEAR = "2000", PAGES = "???", MONTH = "June", ABSTRACT = " Abstract not provided by the authors. "} @INPROCEEDINGS{rixner:mas, AUTHOR = "Scott Rixner and William J. Dally and Ujval J. Kapasi and Peter Mattson and John D. Owens", TITLE = "Memory Access Scheduling", BOOKTITLE = "Proceedings of the 27th Annual International Symposium on Computer Architecture", YEAR = "2000", PAGES = "???", MONTH = "June", ABSTRACT = " The bandwidth and latency of a memory system are strongly dependent on the manner in which accesses interact with the ''3-D'' structure of banks, rows, and columns characteristic of contemporary DRAM chips. There is nearly an order of magnitude difference in bandwidth between successive references to different columns within a row and different rows within a bank. This paper introduces memory access scheduling, a technique that improves the performance of a memory system by reordering memory references to exploit locality within the 3-D memory structure. Conservative reordering, in which the first ready reference in a sequence is performed, improves bandwidth by 40% for traces from five media benchmarks. Aggressive reordering, in which operations are scheduled to optimize memory bandwidth, improves bandwidth by 93% for the same set of applications. Memory access scheduling is particularly important for media processors where it enables the processor to make the most efficient use of scarce memory bandwidth. "} % % Multiprocessors % @INPROCEEDINGS{lai:ltp, AUTHOR = "An-Chow Lai and Babak Falsafi", TITLE = "Selective, Accurate and Timely Self-Invalidation Using Last-Touch Prediction", BOOKTITLE = "Proceedings of the 27th Annual International Symposium on Computer Architecture", YEAR = "2000", PAGES = "???", MONTH = "June", ABSTRACT = " Communication in cache-coherent distributed shared memory (DSM) often requires invalidating (or writing back) cached copies of a memory block, incurring high overheads. This paper proposes Last-Touch Predictors (LTPs) that learn and predict the last touch to a memory block by one processor before the block is accessed and subsequently invalidated by another. By predicting a last-touch and (self-)invalidating the block in advance, an LTP hides the invalidation time, significantly reducing the coherence overhead. The key behind accurate last-touch prediction is trace-based correlation, associating a last-touch with the sequence of instructions (i.e., a trace) touching the block from a coherence miss until the block is invalidated. Correlating instructions enables an LTP to identify a last-touch to a memory block uniquely throughout an application's execution. In this paper, we use results from running shared-memory applications on a simulated DSM to evaluate LTPs. The results indicate that: (1) our base case LTP design, maintaining trace signatures on a per-block basis, substantially improves prediction accuracy over previous self-invalidation schemes to an average of 79%; (2) our alternative LTP design, maintaining a global trace signature table, reduces storage overhead but only achieves an average accuracy of 58%; (3) last-touch prediction based on a single instruction only achieves an average accuracy of 41% due to instruction reuse within and across computation; and (4) LTP enables selective, accurate, and timely self-invalidation in DSM, speeding up program execution on average by 11%. "} @INPROCEEDINGS{margolus:simd, AUTHOR = "Norman Margolus", TITLE = "An Embedded DRAM Architecture for Large-Scale Spatial-Lattice Computations", BOOKTITLE = "Proceedings of the 27th Annual International Symposium on Computer Architecture", YEAR = "2000", PAGES = "???", MONTH = "June", ABSTRACT = " Spatial-lattice computations with finite-range interactions are an important class of easily parallelized computations. This class includes many simple and direct algorithms for physical simulation, virtual-reality simulation, agent-based modeling, logic simulation, 2D and 3D image processing and rendering, and other volumetric data processing tasks. The range of applicability of such algorithms is completely dependent upon the lattice-sizes and processing speeds that are computationally feasible. Using embedded DRAM and a new technique for organizing SIMD memory and communications we can efficiently utilize 1Tbit/sec of sustained memory bandwidth in each chip in an indefinitely scalable array of chips. This allows a 10,000-fold speedup per memory chip for these algorithms compared to the CAM-8 lattice gas computer, and is about one million times faster per memory chip for these calculations than a CM-2. "} @INPROCEEDINGS{mai:smart, AUTHOR = "Ken Mai and Tim Paaske and Nuwan Jayasena and Ron Ho and William J. Dally and Mark Horowitz", TITLE = "Smart Memories: A Modular Reconfigurable Architecture", BOOKTITLE = "Proceedings of the 27th Annual International Symposium on Computer Architecture", YEAR = "2000", PAGES = "???", MONTH = "June", ABSTRACT = " Trends in VLSI technology scaling demand that future computing devices be narrowly focused to achieve high performance and high efficiency, yet also target the high volumes and low costs of widely applicable general purpose designs. To address these conflicting requirements, we propose a modular reconfigurable architecture called Smart Memories, targeted at computing needs in the 0.1um technology generation. A Smart Memories chip is made up of many processing tiles, each containing local memory, local interconnect, and a processor core. For efficient computation under a wide class of possible applications, the memories, the wires, and the computational model can all be altered to match the applications. To show the applicability of this design, two very different machines at opposite ends of the architectural spectrum, the Imagine stream processor and the Hydra speculative multiprocessor, are mapped onto the Smart Memories computing substrate. Simulations of the mappings show that the Smart Memories architecture can successfully map these architectures with only modest performance degradation. "} % % Analysis of Workloads and Systems % @INPROCEEDINGS{zilles:slices, AUTHOR = "Craig B. Zilles and Gurindar S. Sohi", TITLE = "Understanding the Backward Slices of Performance Degrading Instructions", BOOKTITLE = "Proceedings of the 27th Annual International Symposium on Computer Architecture", YEAR = "2000", PAGES = "???", MONTH = "June", ABSTRACT = " For many applications, branch mispredictions and cache misses limit a processor's performance to a level well below its peak instruction throughput. A small fraction of static instructions, whose behavior cannot be anticipated using current branch predictors and caches, contribute a large fraction of such performance degrading events. This paper analyzes the dynamic instruction stream leading up to these performance degrading instructions to identify the operations necessary to execute them early. The backward slice (the subset of the program that relates to the instruction) of these performance degrading instructions, if small compared to the whole dynamic instruction stream, can be pre-executed to hide the instruction's latency. To overcome conservative dependence assumptions that result in large slices, speculation can be used, resulting in speculative slices. This paper provides an initial characterization of the backward slices of L2 data cache misses and branch mispredictions, and shows the effectiveness of techniques, including memory dependence prediction and control independence, for reducing the size of these slices. Through the use of these techniques, many slices can be reduced to less than one tenth of the full dynamic instruction stream when considering the 512 instructions before the performance degrading instruction. "} @INPROCEEDINGS{lepak:svl, AUTHOR = "Kevin M. Lepak and Mikko H. Lipasti", TITLE = "On the Value Locality of Store Instructions", BOOKTITLE = "Proceedings of the 27th Annual International Symposium on Computer Architecture", YEAR = "2000", PAGES = "10", MONTH = "June", ABSTRACT = " Value locality, a recently discovered program attribute that describes the likelihood of the recurrence of previously-seen program values, has been studied enthusiastically in the recent published literature. Much of the energy has focused on refining the initial efforts at predicting load instruction outcomes, with the balance of the effort examining the value locality of either all register-writing instructions, or a focused subset of them. Surprisingly, there has been very little published characterization of or effort to exploit the value locality of data words stored to memory by computer programs. This paper presents such a characterization, proposes both memory-centric (based on message passing) and producer-centric (based on program structure) prediction mechanisms for stored data values, introduces the concept of silent stores and new definitions of multiprocessor false sharing based on these observations, and suggests new techniques for aligning cache coherence protocols and microarchitectural store handling techniques to exploit the value locality of stores. We find that realistic implementations of these techniques can significantly reduce multiprocessor data bus traffic and are more effective at reducing address bus traffic than the addition of Exclusive state to a MSI coherence protocol. We also show that squashing of silent stores can provide uniprocessor speedups greater than the addition of store-to-load forwarding. "} @INPROCEEDINGS{cvetanovic:analysis, AUTHOR = "Zarka Cvetanovic and R.E. Kessler", TITLE = "Performance Analysis of the Alpha 21264-based Compaq ES40 System", BOOKTITLE = "Proceedings of the 27th Annual International Symposium on Computer Architecture", YEAR = "2000", PAGES = "???", MONTH = "June", ABSTRACT = " Abstract not provided by the authors. "} % % Customizable Systems % @INPROCEEDINGS{faraboschi:lx, AUTHOR = "Paolo Faraboschi and Josh Fisher and Geoffrey Brown and Giuseppe Desoli and Fred Homewood", TITLE = "Lx: A Technology Platform for Customizable VLIW Embedded Processing", BOOKTITLE = "Proceedings of the 27th Annual International Symposium on Computer Architecture", YEAR = "2000", PAGES = "???", MONTH = "June", ABSTRACT = " Lx is a scalable and customizable VLIW processor technology platform that allows variation in instruction issue width, the number and capabilities of structures such as functional units and register files, and the processor instruction set. For Lx we developed the architecture and software from the beginning to support both scalability (variable numbers of identical processing resources) and customizability (special purpose resources). In this paper we consider the following issues. When is customization or scaling beneficial? How can one determine the degree of customization or scaling that will provide the greatest payoff for a particular application domain? What architectural compromises were made in the Lx project to contain the complexity inherent in a customizable and scalable processor family? The experiments described in the paper show that specialization for an application domain is effective, yielding to large gain factors in price/performance ratio. In addition, we show how scaling machine resources scales performance within a limited range, although not uniformly across all applications. Finally we show that, today, customization on an application-by-application basis, although doable in some case, is still very dangerous and much remains to be done for it to become a viable solution "} @INPROCEEDINGS{ranganathan:reconfigurable, AUTHOR = "Parthasarathy Ranganathan and Sarita Adve and Norman P. Jouppi", TITLE = "Reconfigurable Caches and their Application to Media Processing", BOOKTITLE = "Proceedings of the 27th Annual International Symposium on Computer Architecture", YEAR = "2000", PAGES = "???", MONTH = "June", ABSTRACT = " High performance general-purpose processors are increasingly being used for a variety of application domains - scientific, engineering, databases, and more recently, media processing. It is therefore important to ensure that architectural features that use a significant fraction of the on-chip transistors are applicable across these different domains. For example, current processor designs often devote the largest fraction of on-chip transistors (up to 80\%) to caches. Many workloads, however, do not make effective use of large caches; e.g., media processing workloads which often have streaming data access patterns and large working sets. This paper proposes a new reconfigurable cache design. This design enables the cache SRAM arrays to be dynamically divided into multiple partitions that can be used for different processor activities. These activities can benefit applications that would otherwise not use the storage allocated to large conventional caches. Our design involves relatively few modifications to conventional cache design, and analysis using a modification of the CACTI analytical model shows a small impact on cache access time. We evaluate one representative use of reconfigurable caches - instruction reuse for media processing. We find this use gives IPC improvements ranging from 1.04X to 1.20X in simulation across eight media processing benchmarks. "} @INPROCEEDINGS{ye:chimaera, AUTHOR = "Alex Ye and Andreas Moshovos and Scott Hauck and Prithviraj Banerjee", TITLE = "CHIMAERA: A High-Performance Architecture with a Tightly-Coupled Reconfigurable Functional Unit", BOOKTITLE = "Proceedings of the 27th Annual International Symposium on Computer Architecture", YEAR = "2000", PAGES = "???", MONTH = "June", ABSTRACT = " Reconfigurable hardware has the potential for significant perfor- mance improvements by providing support for application-specific operations. We report our experience with Chimaera, a prototype system that integrates a small and fast reconfigurable functional unit (RFU) into the pipeline of an aggressive, dynamically- scheduled superscalar processor. Chimaera is capable of perform- ing 9-input/1-output operations on integer data. We discuss the Chimaera C compiler that automatically maps computations for exe- cution in the RFU. Chimaera is capable of: (1) collapsing a set of instructions into RFU operations, (2) converting control-flow into RFU operations, and (3) supporting a more powerful fine- grain data-parallel model than that supported by current mul- timedia extension instruction sets (for integer operations). Using a set of multimedia and communication applications we show that even with simple optimizations, the Chimaera C compiler is able to map 22% of all instructions to the RFU on the average. A variety of computations are mapped into RFU operations ranging from as simple as add/sub-shift pairs to operations of more than 10 instructions including several branches. Timing experiments demonstrate that for a 4-way out-of-order superscalar processor Chimaera results in average performance improvements of 21%. This assuming a very aggressive core processor design (most pessimis- tic RFU latency model) and communication overheads from and to the RFU. "} % % Circuit Considerations % @INPROCEEDINGS{henry:circuits, AUTHOR = "Dana S. Henry and Bradley C. Kuszmaul and Gabriel H. Loh and Rahul Sami", TITLE = "Circuits for Wide-Window Superscalar Processors", BOOKTITLE = "Proceedings of the 27th Annual International Symposium on Computer Architecture", YEAR = "2000", PAGES = "???", MONTH = "June", ABSTRACT = " Our program benchmarks and simulations of novel circuits indicate that large-window processors are feasible. Using our redesigned superscalar components, a large-window processor implemented in today's technology can achieve an increase of 10-60% (geometric mean of 31%) in program speed compared to today's processors. The processor operates at clock speeds comparable to today's processors, but achieves significantly higher ILP. To measure the impact of a large window on clock speed, we design and simulate new implementations of the logic components that most limit the critical path of our large-window processor: the schedule logic and the wake-up logic. We use log-depth cyclic segmented prefix (CSP) circuits to reimplement these components. Our layouts and simulations of critical paths through these circuits indicate that our large-window processor could be clocked at frequencies exceeding 500MHz in today's technology. Our commit logic and rename logic can also run at these speeds. To measure the impact of a large window on ILP, we compare two microarchitectures, the first has a 128-instruction window, an 8-wide fetch unit, and 20-wide issue (four integer, branch, multiply, float, and memory units), whereas the second has a 32-instruction window, and a 4-wide fetch unit and is comparable to today's processors. For each, we simulate different window reuse and bypass policies. Our simulations show that the large-window processor achieves significantly higher IPC. This performance increase comes despite the fact that the large-window processor uses a wrap-around window while the small-window processor uses a compressing window, thus effectively increasing its number of outstanding instructions. Furthermore, the large-window processor sometimes pays an extra clock cycle for bypassing. "} @INPROCEEDINGS{agarwal:IPC, AUTHOR = "Vikas Agarwal and M. S. Hrishikesh and Stephen W. Keckler and Doug Burger", TITLE = "Clock Rate Versus IPC: The End of the Road for Conventional Microarchitectures", BOOKTITLE = "Proceedings of the 27th Annual International Symposium on Computer Architecture", YEAR = "2000", PAGES = "???", MONTH = "June", ABSTRACT = "The doubling of microprocessor performance every three years has been the result of two factors: more transistors per chip and superlinear scaling of the processor clock with technology generation. Our results show that, due to both diminishing improvements in clock rates and poor wire scaling as semiconductor devices shrink, the achievable performance growth of conventional microarchitectures will slow substantially. In this paper, we describe technology-driven models for wire capacitance, wire delay, and microarchitectural component delay. Using the results of these models, we measure the simulated performance---estimating both clock rate and IPC---of an aggressive out-of-order microarchitecture as it is scaled from a 250nm technology to a 35nm technology. We perform this analysis for three clock scaling targets and two microarchitecture scaling strategies: pipeline scaling and capacity scaling. We find that no scaling strategy permits annual performance improvements of better than 12.5\%, which is far worse than the annual 50-60 to which we have grown accustomed." } % % Extracting Parallelism % @INPROCEEDINGS{smith:vectors, AUTHOR = "J.E. Smith and Greg Faanes and Rabin Sugumar", TITLE = "Vector Instruction Set Support for Conditional Operations", BOOKTITLE = "Proceedings of the 27th Annual International Symposium on Computer Architecture", YEAR = "2000", PAGES = "???", MONTH = "June", ABSTRACT = " Vector instruction sets are receiving renewed interest because of their applicability to multimedia. Current multimedia instruction sets use short vectors with SIMD implementations, but long vector, pipelined implementations have a number of advantages and are a logical next step in multimedia ISA development. Support for conditional operations (as occur in loops containing IF statements) is an important aspect of a vector ISA. Seven ISA alternatives for implementing conditional operations are systematically explored. Performance considerations are discussed through evaluation of a typical IF loop over a range of vector lengths and true conditional values. An approach using masked operations is shown to be one of the better methods, especially if its implementation is able to skip over blocks of false mask bits. Additional analyses of complex IF loops and parallel pipeline implementations support the masked operation approach. The paper concludes with a practical implementation of masked operations that skips over power-of-2-length blocks of false values. This implementation is simpler than skipping arbitrary-length blocks and provides similar performance. "} @INPROCEEDINGS{chou:icop, AUTHOR = "Yuan Chou and John Paul Shen", TITLE = "Instruction Path Coprocessors", BOOKTITLE = "Proceedings of the 27th Annual International Symposium on Computer Architecture", YEAR = "2000", PAGES = "???", MONTH = "June", ABSTRACT = " This paper presents the concept of an Instruction Path Coprocessor (I-COP), which is a programmable on-chip coprocessor, with its own mini-instruction set, that operates on the core processors instructions to transform them into an internal format that can be more efficiently executed. It is located off the critical path of the core processor to ensure that it does not negatively impact the core processor's cycle time or pipeline depth. An I-COP is highly versatile and can be used to implement different types of instruction transformations to enhance the IPC of the core processor. We study four potential applications of the I-COP to demonstrate the feasibility of this concept and investigate the design issues of such a coprocessor. A prototype instruction set for the I-COP is presented along with an implementation framework that facilitates achieving high I-COP performance. Initial results indicate that the I-COP is able to efficiently implement the trace cache fill unit as well as the register move, stride data prefetching and linked data structure prefetching trace optimizations. "} @INPROCEEDINGS{barroso:piranha, AUTHOR = "Luiz Andre Barroso and Kourosh Gharachorloo and Robert McNamara and Andreas Nowatzyk and Shaz Qadeer and Barton Sano and Scott Smith and Robert Stets and Ben Verghese", TITLE = "Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing", BOOKTITLE = "Proceedings of the 27th Annual International Symposium on Computer Architecture", YEAR = "2000", PAGES = "???", MONTH = "June", ABSTRACT = " Abstract not available. "} % % Microarchitecture Innovations % @INPROCEEDINGS{radha:java, AUTHOR = "Ramesh Radhakrishnan and Deependra Talla and Lizy Kurian John", TITLE = "Allowing for ILP in an Embedded Java Processor", BOOKTITLE = "Proceedings of the 27th Annual International Symposium on Computer Architecture", YEAR = "2000", PAGES = "???", MONTH = "June", ABSTRACT = " Java processors are ideal for embedded and network computing applications such as Internet TV's, set-top boxes, smart phones, and other consumer electronics applications. In this paper, we investigate cost-effective microarchitectural techniques to exploit parallelism in Java bytecode streams. Firstly, we propose the use of a fill unit that stores decoded bytecodes into a decoded bytecode cache. This mechanism improves the fetch and decode bandwidth of Java processors by 2 to 3 times. These additional hardware units can also be used to perform optimizations such as instruction folding. This is particularly significant because experiments with the Verilog model of Sun Microsystems picoJava-II core demonstrates that instruction folding lies in the critical path. Moving folding logic from the critical path of the processor to the fill unit allows to improve the clock frequency by 25%. Out-of-order ILP exploitation is not investigated due to the prohibitive cost, but in-order dual-issue with a 64-entry decoded bytecode cache is seen to result in 10% to 14% improvement in execution cycles. Another contribution of the paper is a stack disambiguation technique that allows elimination of false dependencies between different types of stack accesses. Stack disambiguation further exposes parallelism and a dual in-order issue microengine with a 64-entry bytecode cache yields an additional 10% reduction in cycles, leading to an aggregate reduction of 17% to 24% in execution cycles. "} @INPROCEEDINGS{Bekerman:Tracking, AUTHOR = "Michael Bekerman, Adi Yoaz, Freddy Gabbay, Stephan Jourdan, Maxim Kalaev and Ronny Ronen", TITLE = "Early Load Address Resolution Via Register Tracking", BOOKTITLE = "Proceedings of the 27th Annual International Symposium on Computer Architecture", YEAR = "2000", PAGES = "???", MONTH = "June", ABSTRACT = " Higher microprocessor frequencies accentuate the performance cost of memory accesses. This is especially noticeable in the Intel's IA32 architecture where lack of registers results in increased number of memory accesses. This paper presents novel, non-speculative technique that partially hides the increasing load-to-use latency, by allowing the early issue of load instructions. Early load address resolution relies on register tracking to safely compute the addresses of memory references in the front-end part of the processor pipeline. Register tracking enables decode-time computation of register values by tracking simple operations of the form regħimmediate. Register tracking may be performed in any pipeline stage following instruction decode and prior to execution. Several tracking schemes are proposed in this paper: 1) Stack pointer tracking allows safe early resolution of stack references by keeping track of the value of the ESP register (the stack pointer). About 25% of all loads are stack loads and 95% of these loads may be resolved in the front-end. 2) Absolute address tracking allows the early resolution of constant-address loads. 3) Displacement-based tracking tackles all loads with addresses of the form regħimmediate by tracking the values of all general-purpose registers. This class corresponds to 82% of all loads, and about 65% of these loads can be safely resolved in the front-end pipeline. The paper describes the tracking schemes, analyzes their performance potential in a deeply pipelined processor and discusses the integration of tracking with memory disambiguation."} @INPROCEEDINGS{cruz:register, AUTHOR = "Jose-Lorenzo Cruz and Antonio Gonzalez and Mateo Valero and Nigel P. Topham", TITLE = "Multiple-Banked Register File Architectures", BOOKTITLE = "Proceedings of the 27th Annual International Symposium on Computer Architecture", YEAR = "2000", PAGES = "???", MONTH = "June", ABSTRACT = " The register file access time is one of the critical delays in current superscalar processors. Its impact on processor performance is likely to increase in future processor generations, as they are expected to increase the issue width (which implies more register ports) and the size of the instruction window (which implies more registers), and to use some kind of multithreading. Under this scenario, the register file access time could be a dominant delay and a pipelined implementation would be desirable to allow for high clock rates. However, a multi-stage register file has severe implications for processor performance (e.g. higher branch misprediction penalty) and complexity (more levels of bypass logic). To tackle these two problems, in this paper we propose a register file architecture composed of multiple banks. In particular we focus on a multi-level organization of the register file, which provides low latency and simple bypass logic. We propose several caching policies and prefetching strategies and demonstrate the potential of this multiple-banked organization. For instance, we show that a two-level organization degrades IPC by 10% and 2% with respect to a non-pipelined single-banked register file, for SpecInt95 and SpecFP95 respectively, but it increases performance by 87% and 92% when the register file access time is factored in. "}