%Z ------------------------------------------------------------------------- %Z %Z Refer/bib bibliographic entries for the 22st %Z INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE %Z (1995) created by Julie Fingerson and Mark D. Hill %Y from author input. %Z %Z These entries are correct to the best of our knowledge, %Z but we accept no responsibility for the consequences of %Z any errors. Email corrections to hoffman@cs.wisc.edu. %Z Last change: Tue Oct 24 15:48:43 CDT 1995 %Z %Z ------------------------------------------------------------------------- %Z %Y %Y Session 1: Multiprocessors and Applications %Y %Y 1) The MIT Alewife Machine: Architecture and Performance %T The MIT Alewife Machine: Architecture and Performance %A Anant Agarwal %A Ricardo Bianchini %A David Chaiken %A Kirk L. Johnson %A David Kranz %A John Kubiatowicz %A Beng-Hong Lim %A Kenneth Mackenzie %A Donald Yeung %J Proc. 22nd Annual Symposium on Computer Architecture %D June 1995 %P 2-13 %X Alewife is a multiprocessor architecture that supports up to 512 processing nodes connected over a scalable and cost-effective mesh network at a constant cost per node. The MIT Alewife machine, a prototype implementation of the architecture, demonstrates that a parallel system can be both scalable and programmable. Four mechanisms combine to achieve these goals: software-extended coherent shared memory provides a global, linear address space; integrated message passing allows compiler and operating system designers to provide efficient communication and synchronization; support for fine-grain computation allows many processors to cooperate on small problem sizes; and latency tolerance mechanisms -- including block multithreading and prefetching -- mask unavoidable delays due to communication; Microbenchmarks, together with over a dozen complete applications running on the 32-node prototype, help to analyze the behavior of the system. Analysis shows that integrating message passing with shared memory enables a cost-efficient solution to the cache coherence problem and provides a rich set of programming primitives. Block multithreading and prefetching improve performance by up to 25% individually, and 35% together. Finally language constructs that allow programmers to express fine-grain synchronization can improve performance by over a factor of two. %Y 2) The EM-X Parallel Computer: Architecture and Basic Performance %T The EM-X Parallel Computer: Architecture and Basic Performance %A Yuetsu Kodama %A Hirohumi Sakane %A Mitsuhisa Sato %A Hayato Yamana %A Shuichi Sakai %A Yoshinori Yamaguchi %J Proc. 22nd Annual Symposium on Computer Architecture %D June 1995 %P 14-23 %X Latency tolerance is essential in achieving high performance on parallel computers for remote function calls and fine-grained remote memory accesses. EM-X supports interprocessor communication on an execution pipeline with small and simple packets. It can create a packet in one cycle, and receive a packet from the network in the on-chip buffer without interruption. EM-X invokes threads on packet arrival, minimizing the overhead of thread switching. It can tolerate communication latency by using efficient multi-threading and optimizing packet flow of fine grain communication. EM-X also supports the synchronization of two operands, direct remote memory read/write operations and flexible packet scheduling with priority. This paper describes distinctive features of the EM-X architecture and reports the performance of small synthetic programs and larger more realistic programs. %Y 3) The SPLASH-2 Programs: Characterization and Methodological Considerations %T The SPLASH-2 Programs: Characterization and Methodological Considerations %A Steven Cameron Woo %A Moriyoshi Ohara %A Evan Torrie %A Jaswinder Pal Singh %A Anoop Gupta %J Proc. 22nd Annual Symposium on Computer Architecture %D June 1995 %P 24-37 %X The SPLASH-2 suite of parallel applications has recently been released to facilitate the study of centralized and distributed shared address space multiprocessors. In this context, this paper has two goals. One is to quantitatively characterize the SPLASH-2 programs in terms of fundamental properties and architectural interactions that are important to understand them well. The properties we study include the computational load balance, communication to computation ratio, important working set sizes, and issues related to spatial locality. We also discuss how each of these properties scales with problem size and the number of processors. The other, related goal is methodological: to assist researchers who will use these programs for architectural evaluations to prune the space of application and ne parameters in an informed and meaningful way. For example, by characterizing the working sets of the applications, we describe which operating regions in terms of cache size and problem size are representative of realistic situations, which are not, and which are redundant. Using SPLASH-2 as an example, we hope to convey the importance of understanding the interplay of problem size, number of processors, and working sets in designing experiments and interpreting their results. %Y %Y Session 2A: Cache Coherence %Y %Y 1) Efficient Strategies for Software-Only Directory Protocols in Shared Memory Multiprocessors %T Efficient Strategies for Software-Only Directory Protocols in Shared-Memory Multiprocessors %A Hekan Grahn %A Per Stenstrom %J Proc. 22nd Annual International Symposium on Computer Architecture %D June 1995 %P 38-47 %X The cost, complexity, and inflexibility of hardware-based directory protocols motivate us to study the performance implications of protocols that emulate directory management using software handlers executed on the compute processors. An important performance limitation of such software-only protocols is that software latency associated with directory management ends up on the critical memory access path for read miss transactions. We propose five strategies that support efficient data transfers in hardware whereas directory management is handled at a slower pace in the background by software handlers. Simulations show that this approach can remove the directory-management latency from the memory access path. Whereas the directory is managed in soft ware, the hardware mechanisms must access the memory state in order to enable data transfers at a high speed. Overall, our strategies reach between 60% and 86% of the hardware-based protocol performance. %Y 2) Dynamic Self-Invalidation: Reducing Coherence Overhead in Shared-Memory Multiprocessors %T Dynamic Self-Invalidation: Reducing Coherence Overhead in Shared-Memory Multiprocessors %A Alvin R. Lebeck %A David A. Wood %J Proc. 22nd Annual Symposium on Computer Architecture %D June 1995 %P 48-59 %X This paper introduces dynamic self-invalidation (DSI), a new technique for reducing cache coherence overhead in shared-memory multiprocessors. DSI eliminates invalidation messages by having a processor automatically invalidate its local copy of a cache block before a conflicting access by another processor. Eliminating invalidation overhead is particularly important under sequential consistency, where the latency of invalidating outstanding copies can increase a program's critical path. DSI is applicable to software, hardware, and hybrid coherence schemes. In this paper we evaluate DSI in the context of hardware directory-based write-invalidate coherence protocols. Our results show that DSI reduces execution time of a sequentially consistent full-map coherence protocol by as much as 41%. This is comparable to an implementation of weak consistency that uses a coalescing write-buffer to allow up to 16 outstanding requests for exclusive blocks. When used in conjunction with weak consistency, DSI can exploit tear-off blocks--which eliminate both invalidation and acknowledgment messages--for a total reduction in messages of up to 26%. %Y 3) Boosting the Performance of Hybrid Snooping Cache Protocols %T Boosting the Performance of Hybrid Snooping Cache Protocols %A Fredrik Dahlgren %J Proc. 22nd Annual Symposium on Computer Architecture %D June 1995 %P 60-70 %X Previous studies of bus-based shared-memory multiprocessors have shown hybrid write-invalidate/write-update snooping protocols to be incapable of providing consistent performance improvements over write-invalidate protocols. In this paper, we analyze the deficiencies of hybrid snooping protocols under release consistency, and show how these deficiencies can be dramatically reduced by using write caches and read snarfing. Our performance evaluation is based on program-driven simulation and a set of five scientific applications with different sharing behaviors including migratory sharing as well as producer-consumer sharing. We show that a hybrid protocol, extended with write caches as well as read snarfing, manages to reduce the number of coherence misses by between 83% and 95% as compared to a write-invalidate protocol for all five applications in this study. In addition, the number of bus transactions is reduced by between 36% and 60% for four of the applications and by 9% for the fifth application. Because of the small implementation cost of the hybrid protocol and the two extensions, we believe that this combination is an effective approach to boost the performance of bus-based multiprocessors. %Y %Y Session 2B: Interconnect Technology and I/O %Y %Y 1) S-Connect: from Networks of Workstations to Supercomputer Peformance %T S-Connect: from Networks of Workstations to Supercomputer Performance %A A. G. Nowatzyk %A M. C. Browne %A E. Kelly %A M. Parkin %J Proc. 22nd Annual Symposium on Computer Architecture %D June 1995 %P 71-82 %X S-Connect is a new high speed, scalable interconnect system that has been developed to support networks of workstations to efficiently share computing resources. It uses off-the-shelf CMOS technology to directly drive fiber-optic systems at speeds greater than 1 Gbit/sec and can realize bisection bandwidths comparable to high-end MPP systems while being >10x more cost-effective. S-Connect systems do not rely on centralized switches, but rather are composed of adaptive, topology independent routing elements that are integrated into each node. The S-Connect routing algorithm is optimized for fine grained, irregular traffic and is designed to support high traffic loads, that can utilize most of the physically available bandwidth. Such traffic is typical of a distributed shared memory system, which is one of the intended applications. S-Connect innovations include a novel distributed phase locking method that allows global synchronization, HW support for multiple message priorities, in-band monitoring and control facilities, and a low overhead channel protocol that supports multiple in-transit messages on the same fiber. The first version of the S-Connect switching element has been successfully implemented in a commercial, 0.65 mm CMOS process. %Y 2) Destage Algorithms for Disk Arrays with Nonvolatile Caches %T Destage Algorithms for Disk Arrays with Nonvolatile Caches %A Anujan Varma %A Quinn Jacobson %J Proc. 22nd Annual Symposium on Computer Architecture %D June 1995 %P 83-95 %X No abstract submitted by the authors. %Y 3) Evaluating Multi-Port Frame Buffer Designs for a Mesh-Connected Multicomputer %T Evaluating Multi-Port Frame Buffer Designs for a Mesh-Connected Multicomputer %A Gordon Stoll %A Bin Wei %A Douglas Clark %A Edward W. Felten %A Kai Li %A Patrick Hanrahan %J Proc. 22nd Annual Symposium on Computer Architecture %D June 1995 %P 96-105 %X No abstract submitted by the authors. %Y 4) Are Crossbars Really Dead? The Case for Optical Multiprocessor Interconnect Systems %T Are Crossbars Really Dead? The Case for Optical Multiprocessor Interconnect Systems %A A. G. Nowatzyk %A P. R. Prucnal %J Proc. 22nd Annual Symposium on Computer Architecture %D June 1995 %P 106-116 %X Crossbar switches are rarely considered for large, scalable multiprocessor interconnect systems because they require O(n^2) switching elements, are difficult to control efficiently and are hard to implement once their size becomes too large to fit on one integrated circuit. However these problems are technology dependent and a recent innovation in fiber optic devices has led to a new implementation of crossbar switches that does not share these problems while retaining the full advantages of a crossbar switch: low latency, high throughput, complete connectivity and multicast capability. Moreover, this new technology has several characteristics that allow a distributed control system which scales linearly in the number of attached nodes. The innovation that led to this research is an optical and-gate that can be used to demultiplex multiple high speed data streams that are carried on one common optical medium. Optical time domain multiplexing can combine the data from many nodes and broadcast the result back to all nodes. This paper discusses OTDM technology only to the extent necessary to understand its characteristics and capabilities. The main contribution lies in the description and analysis of interconnect architectures that utilize OTDM to achieve a level performance that is beyond electronic means. It is expected that cost-reduced OTDM systems will become competitive with the next generation of interconnect systems. %Y %Y Session 3: Instruction Level Parallelism %Y %Y 1) Exploring Configurations of Functional Units in an Out-of-Order Superscalar Processor %T Exploring Configurations of Functional Units in an Out-of-Order Superscalar Processor %A Stephan Jourdan %A Pascal Sainrat %A Daniel Litaize %J Proc. 22nd Annual Symposium on Computer Architecture %D June 1995 %P 117-125 %X This study has been carried out in order to determine cost-effective configurations of functional units for multiple-issue out-of-order superscalar processors. The trace-driven simulations were performed on the six integer and the fourteen floating-point programs from the SPEC92 suite. We first evaluate the number of instructions allowed to be concurrently processed by the execution stages of the pipeline. We then apply some restrictions on the execution issue of different instruction classes in order to define these configurations. We conclude that five to nine functional units are necessary to exploit Instruction-Level-Parallelism. An important point is that several data cache ports are required in a processor of degree 4 or more. Finally, we report on complementary results on the utilization rate of the functional units. %Y 2) Unconstrained Speculative Execution with Predicated State Buffering %T Unconstrained Speculative Execution with Predicated State Buffering %A Hideki Ando %A Chikako Nakanishi %A Tetsuya Hara %A Masao Nakaya %J Proc. 22nd Annual Symposium on Computer Architecture %D June 1995 %P 126-137 %X Speculative execution is execution of instructions before it is known whether these instructions should be executed. Compiler-based speculative execution has the potential to achieve both a high instruction per cycle rate and high clock rate. Pure compiler-based approaches, however, have greatly limited instruction scheduling due to a limited ability to handle side effects of speculative execution. Significant performance improvement is, thus, difficult in non-numerical applications. This paper proposes a new architectural mechanism, called predicating, which provides unconstrained speculative execution. Predicating removes restrictions which limit the compiler's ability to schedule instructions. Through our hardware support, the compiler is allowed to move instructions past multiple basic block boundaries from any succeeding control path. Predicating buffers the side effects of speculative execution with its predicate, and the buffered predicate efficiently commits or squashes the side effects. The mechanism also provides a speculative exception handling scheme. The scheme, called the future condition, properly postpones speculative exceptions and efficiently restarts the process. We show that our mechanism can be implemented through a modest amount of hardware with little complexity. The evaluation results show that our mechanism significantly improves performance, and achieves a 2.45x speedup over scalar machines. %Y 3) A Comparison of Full and Partial Predicated Execution Support for ILP Processors %T A Comparison of Full and Partial Predicated Execution Support for ILP Processors %A Scott Mahlke %A Richard Hank %A James McCormick %A David August %A Wen-mei Hwu %J Proc. 22nd Annual Symposium on Computer Architecture %D June 1995 %P 138-150 %X One can effectively utilize predicated execution to improve branch handling in instruction-level parallel processors. Although the potential benefits of predicated execution are high, the tradeoffs involved in the design of an instruction set to support predicated execution can be difficult. On one end of the design spectrum, architectural support for full predicated execution requires increasing the number of source operands for all instructions. Full predicate support provides for the most flexibility and the largest potential performance improvements. On the other end, partial predicated execution support, such as conditional moves, requires very little change to existing architectures. This paper presents a preliminary study to qualitatively and quantitatively address the benefit of full and partial predicated execution support. With our current compiler technology, we show that the compiler can use both partial and full predication to achieve speedup in large control-intensive programs. Some details of the code generation techniques are shown to provide insight into the benefit of going from partial to full predication. Preliminary experimental results are very encouraging: partial predication provides an average of 33% performance improvement for an 8-issue processor with no predicate support while full predication provides an additional 30% improvement. %Y %Y Session 4a: New Microarchitectures %Y %Y 1) Implementation Trade-offs in Using a Restricted Data Flow Architecture in a High Performance RISC Microprocessor %T Implementation Trade-offs in Using a Restricted Data Flow Architecture in a High Performance RISC Microprocessor %A M. Simone %A A. Essen %A A. Ike %A A. Krishnamoorthy %A T. Maruyama %A N. Patkar %A M. Ramaswami %A M. Shebanow %A V. Thirumalaiswamy %A D. Tovey %J Proc. 22nd Annual Symposium on Computer Architecture %D June 1995 %P 151-162 %X The implementation of a superscalar, speculative execution SPARC-V9 microprocessor incorporating Restricted Data Flow principles required many design trade-offs. Consideration was given to both performance and cost. Performance is largely a function of cycle time and instructions executed per cycle while cost is primarily a function of die area. Here we describe our Restricted Data Flow implementation and the means with which we arrived at its configuration. Future semiconductor technology advances will allow these trade-offs to be relaxed and higher performance Restricted Data Flow machines to be built. %Y 2) Performance Evaluation of the PowerPC 620 Microarchitecture %T Performance Evaluation of the PowerPC 620 Microarchitecture %A Trung A. Diep %A Christopher Nelson %A John Paul Shen %J Proc. 22nd Annual Symposium on Computer Architecture %D June 1995 %P 163-175 %X The PowerPC 620TM microprocessor is the most recent and performance leading member of the PowerPCTM family. The 64-bit PowerPC 620 microprocessor employs a two-phase branch prediction scheme, dynamic renaming for all the register files, distributed multi-entry reservation stations, true out-of- order execution by six execution units, and a completion buffer for ensuring precise exceptions. This paper presents an instruction-level performance evaluation of the 620 microarchitecture. A performance simulator is developed using the VMW (Visualization-based Microarchitecture Workbench) retargetable framework. The VMW-based simulator accurately models the microarchitecture down to the machine cycle level. Extensive trace-driven simulation is performed using the SPEC92 benchmarks. Detailed quantitative analyses of the effectiveness of all key microarchitecture features are presented. %Y %Y Session 4b: Managing Memory Hierarchies %Y %Y 1) Reducing TLB and Memory Overhead using Online Superpage Promotion %T Reducing TLB and Memory Overhead Using Online Superpage Promotion %A Theodore H. Romer %A Wayne H. Ohlrich %A Anna R. Karlin %A Brian N. Bershad %J Proc. 22nd Annual Symposium on Computer Architecture %D June 1995 %P 176-187 %X Modern microprocessors contain small TLBs that maintain a cache of recently used translations. A TLB's coverage is the sum of the number of bytes mapped by each entry. Applications with working sets larger than the TLB coverage will perform poorly due to high TLB miss rates. Superpages have been proposed as a mechanism for increasing TLB coverage. A superpage is a virtual memory page with size and alignment that are a power of two multiple of the system's base page size. In this paper, we describe online policies for superpage management that monitor TLB miss traffic to decide when a superpage should be constructed. Our policies take into account both the benefit of a superpage promotion (potential for preventing future misses) and the cost (page copying). Although our approach increases the cost of each TLB miss, the net effect is to improve total execution time by eliminating a large number of misses without significantly increasing memory usage, thereby improving system performance. %Y 2) Speeding up Irregular Applications in Shared-Memory Multiprocessors: Memory Binding and Group Prefetching %T Speeding up Irregular Applications in Shared-Memory Multiprocessors: Memory Binding and Group Prefetching %A Zheng Zhang %A Josep Torrellas %J Proc. 22nd Annual Symposium on Computer Architecture %D June 1995 %P 188-200 %X While many parallel applications exhibit good spatial locality, other important codes in areas like graph problem-solving or CAD do not. Often, these irregular codes contain small records accessed via pointers. Consequently, while the former applications benefit from long cache lines, the latter prefer short lines. One good solution is to combine short lines with prefetching. In this way, each application can exploit the amount of spatial locality that it has. However, prefetching, if provided, should also work for the irregular codes. This paper presents a new prefetching scheme that, while usable by regular applications, is specifically targeted to irregular ones: Memory Binding and Group Prefetching. The idea is to hardware-bind and prefetch together groups of data that the programmer suggests are strongly related to each other. Examples are the different fields in a record or two records linked by a permanent pointer. This prefetching scheme, combined with short cache lines, results in a memory hierarchy design that can be exploited by both regular and irregular applications. Overall, it is better to use a system with short lines (16-32 bytes) and our prefetching than a system with long lines (128 bytes) with or without our prefetching. The former system runs 6 out of 7 Splash-class applications faster. In particular, some of the most irregular applications run 25-40% faster. %Y %Y Session 5a: Interconnection Network Routing %Y %Y 1) An Efficient, Fully Adaptive Deadlock Recovery Scheme: DISHA %T An Efficient, Fully Adaptive Deadlock Recovery Scheme: DISHA %A K. V. Anjan %A Timothy Mark Pinkston %J Proc. 22nd Annual Symposium on Computer Architecture %D June 1995 %P 201-210 %X This paper presents a simple, efficient and cost effective routing strategy that considers deadlock recovery as opposed to prevention. Performance is optimized in the absence of deadlocks by allowing maximum flexibility in routing. Disha supports true fully adaptive routing where all virtual channels at each node are available to packets without regard for deadlocks. Deadlock cycles, upon forming, are efficiently broken by progressively routing one of the blocked packets through a deadlock-free lane. This lane is implemented using a central ``floating'' deadlock buffer resource in routers which is accessible to all neighboring routers along the path. Simulations show that the Disha scheme results in superior performance and is extremely simple, ensuring quick recovery from deadlocks and enabling the design of fast routers. %Y 2) Analysis and Implementation of Hybrid Switching %T Analysis and Implementation of Hybrid Switching %A Kang G. Shin %A Stuart W. Daniel %J Proc. 22nd Annual Symposium on Computer Architecture %D June 1995 %P 211-219 %X No abstract submitted by the authors. %Y 3) Configurable Flow Control Mechanisms for Fault-Tolerant Routing %T Configurable Flow Control Mechanisms for Fault-Tolerant Routing %A Binh Vien Dao %A Jose Duato %A Sudhakar Yalamanchili %J Proc. 22nd Annual Symposium on Computer Architecture %D June 1995 %P 220-229 %X No abstract submitted by the authors. %Y 4) NIFDY: A Low Overhead, High Throughput Network Interface %T NIFDY: A Low Overhead, High Throughput Network Interface %A Timothy Callahan %A Seth Copen Goldstein %J Proc. 22nd Annual Symposium on Computer Architecture %D June 1995 %P 230-242 %X In this paper we present NIFDY, a network interface that uses admission control to reduce congestion and ensures that packets are received by a processor in the order in which they were sent, even if the underlying network delivers the packets out of order. The basic idea behind NIFDY is that each processor is allowed to have at most one outstanding packet to any other processor unless the destination processor has granted the sender the right to send multiple unacknowledged packets. Further, there is a low upper limit on the number of outstanding packets to all processors. We present results from simulations of a variety of networks (meshes, tori, butterflies, and fat trees) and traffic patterns to verify NIFDY's efficacy. Our simulations show that NIFDY increases throughput and decreases overhead. The utility of NIFDY increases as a network's bisection bandwidth decreases. When combined with the increased payload allowed by in-order delivery NIFDY increases total bandwidth delivered for all networks. The resources needed to implement NIFDY are small and constant with respect to network size. %Y %Y Session 5B: Novel Memory Access Mechanisms %Y %Y 1) Vector Multiprocessors with Arbitrated Memory Access %T Vector Multiprocessors with Arbitrated Memory Access %A Montse Peiron %A Mateo Valero %A Eduard Ayguade %A Tomas Lang %J Proc. 22nd Annual Symposium on Computer Architecture %D June 1995 %P 243-252 %X The high latency of memory accesses is one of the factors that most contribute to reduce the performance of current vector supercomputers. The conflicts that can occur in the memory modules plus the collisions in the interconnection network in the case of multiprocessors make that the execution time of applications increases significantly. In this work we propose a memory access method that for both cases of vector uniprocessors and multiprocessors allows to perform stream accesses with the smallest possible latency in the majority of the cases. The basic idea is to arbitrate the memory access by defining the order in which the memory modules are visited. The stream elements are requested out of order. In addition, the access method also reduces the cost of the interconnection network. %Y 2) Design of Cache Memories for Multi-Threaded Dataflow Architecture %T Design of Cache Memories For Multi-Threaded Dataflow Architecture %A Krishna M. Kavi %A Ali R. Hurson %A Phenil Patadia %A Elizabeth Abraham %A Ponnarasu Shanmugam %J Proc. 22nd Annual Symposium on Computer Architecture %D June 1995 %P 253-264 %X Cache memories have proven their effectiveness in the von Neumann architecture when localities of reference govern the execution loci of programs. A pure dataflow program, in contrast, no locality of reference since the execution sequence is enforced only by the availability of arguments. Instruction locality may be enhanced if, dataflow programs are reordered. Enhancing the locality of data references in the dataflow architecture is a more challenging problem. In this paper we report our approaches to the design of instruction, data (operand) and I-Structure cache memories using the Explicit Token Store (ETS) model of dataflow systems. We will present the performance results obtained using various benchmark programs. %Y 3) Skewed Associativity Enhances Performance Predictability %T Skewed Associativity Enhances Performance Predictability %A Fran\,cois Bodin %A Andr\'e Seznec %J Proc. 22nd Annual Symposium on Computer Architecture %D June 1995 %P 265-275 %X Performance tuning becomes harder as computer technology advances. One of the factors is the increasing complexity of memory hierarchies. Most modern machines now use at least one level of cache memory. To reduce execution stalls, cache misses must be very low. Software techniques used to improve locality have been developed for numerical codes, such as loop blocking and copying. Unfortunately, the behavior of direct mapped and set associative caches is still erratic when large numerical data is accessed. Execution time can vary drasticly for the same loop kernel depending on uncontrolled factors such as array leading size. The only software method available to improve execution time stability is the copying of frequently used data, which is costly in execution time. Users are not usually cache organisation experts. They are not aware of such phenomena, and have no control over it. In this paper, we show that the recently proposed four-way skewed associative cache yields very stable execution times and good average miss ratios on blocked algorithms. As a result, execution time is faster and much more predictable than with conventional caches. As a result of its better comportment, it is possible to use larger blocks sizes with blocked algorithms, which will furthermore reduce blocking overhead costs. %Y %Y Session 6: Branch Prediction %Y %Y 1) A Comparative Analysis of Schemes for Correlated Branch Prediction %T A Comparative Analysis of Schemes for Correlated Branch Prediction %A Cliff Young %A Nicolas Gloy %A Michael D. Smith %J Proc. 22nd Annual Symposium on Computer Architecture %D June 1995 %P 276-286 %X Modern high-performance architectures require extremely accurate branch prediction to overcome the performance limitations of conditional branches. We present a framework that categorizes branch prediction schemes by the way in which they partition dynamic branches and by the kind of predictor that they use. The framework allows us to compare and contrast branch prediction schemes, and to analyze why they work. We use the framework to show how a static correlated branch prediction scheme increases branch bias and thus improves overall branch prediction accuracy. We also use the framework to identify the fundamental differences between static and dynamic correlated branch prediction schemes. This study shows that there is room to improve the prediction accuracy of existing branch prediction schemes. %Y 2) Next Cache Line and Set Prediction %T Next Cache Line and Set Prediction %A Brad Calder %A Dirk Grunwald %J Proc. 22nd Annual Symposium on Computer Architecture %D June 1995 %P 287-297 %X Accurate instruction fetch and branch prediction is increasingly important on today's wide-issue architectures. Fetch prediction is the process of determining the next instruction to request from the memory subsystem. Branch prediction is the process of predicting the likely out-come of branch instructions. Several researchers have proposed very effective fetch and branch prediction mechanisms including branch target buffers (BTB) that store the target addresses of taken branches. An alternative approach fetches the instruction following a branch by using an index into the cache instead of a branch target address. We call such an index a next cache line and set (NLS) predictor. A NLS predictor is a pointer into the instruction cache, indicating the target instruction of a branch. In this paper we examine the use of NLS predictors for efficient and accurate fetch and branch prediction. Previous studies associated each NLS predictor with a cache line and provided only one-bit conditional branch prediction. Our study examines the use of NLS predictors with highly accurate 2-level correlated branch architectures. We examine the performance of decoupling the NLS predictors from the cache line and storing them in a separate tag-less memory buffer. Our results show that the decoupled architecture performs better than associating the NLS predictors with the cache line, that the NLS architecture benefits from reduced cache miss rates and it is particularly effective for programs containing many branches. We also provide an in-depth comparison between the NLS and BTB architectures, showing that the NLS architecture is a competitive alternative to the BTB design. %Y %Y Session 7A: System Evaluation %Y %Y 1) A Comparison of Architectural Support for Messaging on the TMC CM-5 and the Cray T3D %T A Comparison of Architectural Support for Messaging in the TMC CM-5 and the Cray T3D %A Vijay Karamcheti %A Andrew A. Chien %J Proc. 22nd Annual Symposium on Computer Architecture %D June 1995 %P 298-307 %X Programming models based on messaging continue to be an important programming model for parallel machines. Messaging costs are strongly influenced by a machine's network interface architecture. We examine the impact of architectural support for messaging in two machines -- the TMC CM-5 and the Cray T3D -- by exploring the design and performance of several messaging implementations. The additional features in the T3D support remote operations: memory access, fetch-and-increment, atomic swaps, and prefetch. Experiments on the CM-5 show that requiring processor involvement for message reception can increase the communication overheads from 60% to 300% for moderate variations in computation grain size at the destination. In contrast, the T3D hardware for remote operations decouples message reception from processor activity, producing high-performance messaging independent of computation grain size or variability. In addition, hardware support for a shared address space in the T3D can be used to solve the output contention problem (output hot spots), producing messaging implementations that are robust over a wide variety of traffic patterns. Atomic swap hardware can be used to build a distributed message queue, enabling a pull messaging scheme where the destination requests data transfer upon receive. This scheme uses prefetches to mask receive latency. While this yields performance robust over output contention, its base cost is competitive only for small messages (up to 64 bytes) because of the high cost of issuing and resolving prefetches in the T3D. Emulation shows that if the interaction costs can be reduced by a factor of eight (250ns to 31ns), perhaps by moving the prefetch queue on chip, and there is a corresponding increase in the prefetch queue size, the pull scheme can give superior performance in all cases. %Y 2) Optimizing Memory System Performance for Communications in Parallel Computers %T Optimizing Memory System Performance for Communication in Parallel Computers %A T. Stricker %A T. Gross %J Proc. 22nd Annual Symposium on Computer Architecture %D June 1995 %P 308-319 %X Communication in a parallel system frequently involves moving data from the memory of one node to the memory of another; this is the standard communication model employed in message passing systems. Depending on the application, we observe a variety of patterns as part of communication steps, e.g., regular (i.e. blocks of data), strided, or irregular (indexed) memory accesses. The effective speed of these communication steps is determined by the network bandwidth and the memory bandwidth, and measurements on current parallel supercomputers indicate that the performance is limited by the memory bandwidth rather than the network bandwidth. Current systems provide a wealth of options to perform communication, and a compiler or user is faced with the difficulty of finding the communication operations that best use the available memory and network bandwidth. This paper provides a framework to evaluate different solutions for inter node communication and presents the copy-transfer model; this model captures the contributions of the memory system to inter-node communication. We demonstrate the usefulness of this simple model by applying it to two commercial parallel systems, the Cray T3D and the Intel Paragon. In particular we identify two methods to transfer data between nodes in these two machines. In buffer-packing transfers, a contiguous block of data is transferred across the network. If the data are not stored contiguously, they are copied to (gathering) or from (scattering) buffers in local memory before and after the transfer. Chained transfers perform gathering, transfer and scattering in one step, reading the data elements with some non-sequential pattern and immediately transferring them on to the destination. Our model and measurements indicate that chaining of the gather, transfer, and scatter operations results in better performance than buffer packing for many important access patterns. Most standard message passing libraries (like MPI, PVM or NX) force the parallelizing compiler (or the programmer) to employ the buffer-packing communication operations. However, the addition of hardware support dedicated to communication (e.g., DMAs, line-transfer units) now gives the compiler a wider range of options. %Y 3) Empirical Evaluation of the CRAY-T3D: A Compiler Perspective %T Empirical Evaluation of the CRAY-T3D: A Compiler Perspective %A Remzi H. Arpaci %A David E. Culler %A Arvind Krishnamurthy %A Steve G. Steinberg %A Kathy Yelick %J Proc. 22nd Annual Symposium on Computer Architecture %D June 1995 %P 320-332 %X Most recent MPP systems employ a fast microprocessor surrounded by a shell of communication and synchronization logic. The Cray T3D provides an elaborate shell to support global-memory access, prefetch, atomic operations, barriers, and block transfers. We provide a detailed empirical performance characterization of these primitives using micro-benchmarks and evaluate their utility in compiling for a parallel language. We have found that the raw performance of the machine is quite impressive and the most effective forms of communication are prefetch and write. Other shell provisions, such as the bulk transfer engine and the external Annex register set, are cumbersome and of little use. By evaluating the system in the context of a language implementation, we shed light on important trade-offs and pitfalls in the machine architecture. %Y %Y Session 7B: Instruction Fetching %Y %Y 1) Optimization of Instruction Fetch Mechanisms for High Issue Rates %T Optimization of Instruction Fetch Mechanisms for High Issue Rates %A Thomas M. Conte %A Kishore N. Menezes %A Patrick M. Mills %A Burzin A. Patel %J Proc. 22nd Annual Symposium on Computer Architecture %D June 1995 %P 333-344 %X Recent superscalar processors issue four instructions per cycle. These processors are also powered by highly-parallel superscalar cores. The potential performance can only be exploited when fed by high instruction bandwidth. This task is the responsibility of the instruction fetch unit. Accurate branch prediction and low I-cache miss ratios are essential for the efficient operation of the fetch unit. Several studies on cache design and branch prediction address this problem. However, these techniques are not sufficient. Even in the presence of efficient cache designs and branch prediction, the fetch unit must continuously extract multiple, non-sequential instructions from the instruction cache, realign these in the proper order, and supply them to the decoder. This paper explores solutions to this probelm and presents several schemes with varying degrees of performance and cost. The most- general scheme, the {\em collapsing buffer}, achieves near-perfect performance and consistently aligns instructions in excess of 90% of the time, over a wide range of issue rates. The performance boost provided by compiler optimization techniques is also investigated. Results show that compiler optimization can significantly enhance performance across all schemes. The {\em collapsing buffer} supplemented by compiler techniques remains the best-performing mechanism. The paper closes with recommendations and suggestions for future. %Y 2) Instruction Fetching: Coping with Code Bloat %T Instruction Fetching: Coping with Code Bloat %A Richard Uhlig %A David Nagle %A Trevor Mudge %A Stuart Sechrest %A Joel Emer %J Proc. 22nd Annual Symposium on Computer Architecture %D June 1995 %P 345-356 %X Previous research has shown that the SPEC benchmarks achieve low miss ratios in relatively small instruction caches. This paper presents evidence that current software-development practices produce applications that exhibit substantially higher instruction-cache miss ratios than do the SPEC benchmarks. To represent these trends, we have assembled a collection of applications, called the Instruction Benchmark Suite (IBS), that provides a better test of instruction-cache performance. We discuss the rationale behind the design of IBS and characterize its behavior relative to the SPEC benchmark suite. Our analysis is based on trace-driven and trap-driven simulations and takes into full account both the application and operating-system components of the workloads. This paper then reexamines a collection of previously-proposed hardware mechanisms for improving instruction-fetch performance in the context of the IBS workloads. We study the impact of cache organization, transfer bandwidth, prefetching, and pipelined memory systems on machines that rely on the use of relatively small primary instruction caches to facilitate increased clock rates. We find that, although of little use for SPEC, the right combination of these techniques substantially benefits IBS. Even so, under IBS, a stubborn lower bound on the instruction-fetch CPI remains as an obstacle to improving overall processor performance. %Y 3) Instruction Cache Fetch Policies for Speculative Execution %T Instruction Cache Fetch Policies for Speculative Execution %A Dennis Lee %A Jean-Loup Baer %A Brad Calder %A Dirk Grunwald %J Proc. 22nd Annual Symposium on Computer Architecture %D June 1995 %P 357-368 %X Current trends in processor design are pointing to deeper and wider pipelines and superscalar architectures. The efficient use of these resources requires speculative execution, a technique whereby the processor continues executing the predicted path of a branch before the branch condition is resolved. In this paper, we investigate the implications of speculative execution on instruction cache performance. We explore policies for managing instruction cache misses ranging from aggressive policies (always fetch on the speculative path) to conservative ones (wait until branches are resolved). We test these policies and their interaction with next-line prefetching by simulating the effects on instruction caches with varying architectural parameters. Our results suggest that an aggressive policy combined with next-line prefetching is best for small latencies while more conservative policies are preferable for large latencies. %Y %Y Session 8: Caches %Y %Y 1) Streamlining Data Cache Access with Fast Address Calculation %T Streamlining Data Cache Access with Fast Address Calculation %A Todd M. Austin %A Dionisios N. Pnevmatikatos %A Gurindar S. Sohi %J Proc. 22nd Annual Symposium on Computer Architecture %D June 1995 %P 369-380 %X For many programs, especially integer codes, untolerated load instruction latencies account for a significant portion of total execution time. In this paper, we present the design and evaluation of a fast address generation mechanism capable of eliminating the delays caused by effective address calculation for many loads and stores. Our approach works by predicting early in the pipeline (part of) the effective address of a memory access and using this predicted address to speculatively access the data cache. If the prediction is correct, the cache access is overlapped with non-speculative effective address calculation. Otherwise, the cache is accessed again in the following cycle, this time using the correct effective address. The impact on the cache access critical path is minimal; the prediction circuitry adds only a single OR operation before cache access can commence. In addition, verification of the predicted effective address is completely decoupled from the cache access critical path. Analyses of program reference behavior and subsequent performance analysis of this approach shows that this design is a good one, servicing enough accesses early enough to result in speedups for all the programs we tested. Our approach also responds well to software support, which can significantly reduce the number of mispredicted effective addresses, in many cases providing better program speedups and reducing cache bandwidth requirements. %Y 2) CAT -- Caching Address Tags: A Technique for Reducing Area Cost of On-Chip Caches %T CAT -- Caching Address Tags: A Technique for Reducing Area Cost of On-Chip Caches %A Hong Wang %A Tong Sun %A Qing Yang %J Proc. 22nd Annual Symposium on Computer Architecture %D June 1995 %P 381-391 %X This paper presents a technique for minimizing chip-area cost of implementing an on-chip cache memory of microprocessors. The main idea of the technique is Caching Address Tags, or CAT cache for short. The CAT cache exploits locality property that exists among addresses of memory references for the purpose of minimizing chip area-cost of address tags. By keeping only a limited number of distinct tags of cached data rather than having as many tags as cache lines, the CAT cache can reduce the cost of implementing tag memory by an order of magnitude without noticeable performance difference from ordinary caches. Therefore, CAT represents another level of caching for cache memories. Simulation experiments are carried out to evaluate performance of CAT cache as compared to existing caches. Performance results of SPEC92 programs show that the CAT cache with only a few tag entries performs as well as ordinary caches while chip-area saving is significant. Such area saving will increase as the address space of a processor increases. By allocating the saved chip area for larger cache capacity, or more powerful functional units, CAT is expected to have a great impact on overall system performance. %Y %Y Session 9: Processor Architecture %Y %Y 1) Simultaneous Multithreading: Maximizing On-Chip Parallelism %T Simultaneous Multithreading: Maximizing On-Chip Parallelism %A Dean M. Tullsen %A Susan J. Eggers %A Henry M. Levy %J Proc. 22nd Annual Symposium on Computer Architecture %D June 1995 %P 392-403 %X This paper examines simultaneous multithreading, a technique permitting several independent threads to issue instructions to a superscalar's multiple functional units in a single cycle. We present several models of simultaneous multithreading and compare them with alternative organizations: a wide superscalar, a fine-grain multithreaded processor, and single-chip, multiple- issue multiprocessing architectures. Our results show that both (single-threaded) superscalar and fine-grain multithreaded architectures are limited in their ability to utilize the resources of a wide-issue processor. Simultaneous multithreading has the potential to achieve 4 times the throughput of a superscalar, and double that of fine-grain multithreading. We evaluate several cache configurations made possible by this type of organization and evaluate tradeoffs between them. We also show that simultaneous multithreading is an attractive alternative to single-chip multiprocessors;simultaneous multithreaded processors with a variety of organizations outperform corresponding conventional multiprocessors with similar execution resources. While simultataneous multithreading has excellent potential to increase processor utilization, it can add substantial complexity to the design. We examine many of these complexities and evaluate alternative organizations in the design space. %Y 2) Architecture Validation for Processors %T Architecture Validation for Processors %A Richard C. Ho %A C. Han Yang %A Mark A. Horowitz %A David L. Dill %J Proc. 22nd Annual Symposium on Computer Architecture %D June 1995 %P 404-413 %X No abstract submitted by the authors. %Y 3) Multiscalar Processors %T Multiscalar Processors %A Gurindar S. Sohi %A Scott E. Breach %A T.N. Vijaykumar %J Proc. 22nd Annual Symposium on Computer Architecture %D June 1995 %P 414-425 %X Multiscalar processors use a new, aggressive implementation paradigm for extracting large quantities of instruction level parallelism from ordinary high level language programs. A single program is divided into a collection of tasks by a combination of software and hardware. The tasks are distributed to a number of parallel processing units which reside within a processor complex. Each of these units fetches and executes instructions belonging to its assigned task. The appearance of a single logical register file is maintained with a copy in each parallel processing unit. Register results are dynamically routed among the many parallel processing units with the help of compiler-generated masks. Memory accesses may occur speculatively without knowledge of preceding loads or stores. Addresses are disambiguated dynamically, many in parallel, and processing waits only for true data dependences. This paper presents the philosophy of the multiscalar paradigm, the structure of multiscalar programs, and the hardware architecture of a multiscalar processor. The paper also discusses performance issues in the multiscalar model, and compares the multiscalar paradigm with other paradigms. Experimental results evaluating the performance of a sample of multiscalar organizations are also presented.