Multi-cores / Many-cores 2007
Presentations
Abstract
Scale-up solutions in the form of large SMPs have represented the mainstream of commercial computing for the past several years. The major server vendors continue to provide increasingly larger and more powerful machines. More recently, scale-out solutions, in the form of clusters of smaller machines, have gained increased acceptance for commercial computing. Scale-out solutions are particularly effective in high-throughput web-centric applications. In this talk, we discuss the behavior of two competing approaches to parallelism, scale-up and scale-out, in emerging commercial applications. We show that a scale-out strategy can be the key to good performance even on a scale-up machine and we point out some existing limitations in scale-out for commercial computing. We also discuss how scale-out solutions offer better price/performance, although at an increase in management complexity.
Abstract
This talk describes the challenges that exploiting large-scale thread-level parallelism will present to programmers, and the potential solutions for allowing average programmers to achieve 90% of parallel performance with 10% more effort than sequential programming.
Abstract
We present a novel way to produce dense linear algebra factorization algorithms. The current state-of-the-art (SOA) dense linear algebra algorithms have a performance inefficiency and hence they give sub-optimal performance for most of LAPACK's factorizations. We show that standard Fortran and C two-dimensional arrays are the main reason for the inefficiency. For the other standard format (packed one-dimensional arrays for symmetric and/or triangular matrices) the situation is much worse. We show how to correct these performance inefficiencies by using new data structures (NDS) along with so-called kernel routines. The NDS generalize the current storage layouts for both standard layouts. We use the concept of Equivalence and Elementary matrices along with coordinate (linear) transformations to indicate why our method works for an entire class of dense linear algebra algorithms. Also, we use the Algorithms and Architecture approach to justify why our new method gives higher efficiency. The simplest forms of the new factorization algorithms are a direct generalization of the commonly used LINPACK algorithms. On many platforms they can be generated from simple, textbook-type codes by the platform's Fortran and/or C compiler. On the IBM Power3 processor our implementation of Cholesky factorization achieves 92% of peak performance, whereas conventional SOA full-format LAPACK DPOTRF achieves 77% of peak performance. The simple algorithm of LU factorization with partial pivoting for the NDS is a direct generalization of the LINPACK algorithm, DGEFA. All programming for our NDS can be accomplished in standard Fortran, through the use of higher than two-dimensional arrays. Thus, no new compiler support is necessary. For the multi-core type CELL platform we cite some new results of the PLASMA project at Univ. Tenn., Knoxville on the Linpack Benchmark. Key features of this work are the use of iterative refinement, NDS and the "lookahead" principle.
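To make the storage argument concrete, here is a minimal sketch of one possible blocked layout, written in Python only for illustration. It is not taken from the talk: the function names, the choice of square blocks ordered column-major, and the assumption that the block size divides the matrix order are all hypothetical. It compares how far apart the elements of one submatrix lie in a standard column-major array and in a layout that stores each block contiguously.

```python
def colmajor_offset(i, j, n):
    """Offset of element (i, j) in an n-by-n column-major (Fortran-style) array."""
    return i + j * n


def blocked_offset(i, j, n, nb):
    """Offset of element (i, j) when the matrix is stored as contiguous nb-by-nb blocks.

    Assumptions made for this sketch only: nb divides n, blocks are ordered
    column-major, and elements inside each block are column-major as well.
    """
    blocks_per_dim = n // nb
    bi, ii = divmod(i, nb)  # block row, row inside the block
    bj, jj = divmod(j, nb)  # block column, column inside the block
    return (bi + bj * blocks_per_dim) * nb * nb + ii + jj * nb


def block_span(bi, bj, nb, offset_of):
    """Length of the memory range touched by block (bi, bj) under a given layout."""
    offsets = [offset_of(bi * nb + ii, bj * nb + jj)
               for jj in range(nb) for ii in range(nb)]
    return max(offsets) - min(offsets) + 1


n, nb = 8, 4  # hypothetical sizes, chosen only to keep the numbers small
span_colmajor = block_span(1, 1, nb, lambda i, j: colmajor_offset(i, j, n))
span_blocked = block_span(1, 1, nb, lambda i, j: blocked_offset(i, j, n, nb))
print(span_colmajor, span_blocked)
```

With these sizes, the 16 elements of the lower-right block span 28 storage locations in the column-major array but exactly 16 in the blocked layout, so a kernel routine working on that block reads one contiguous region. The four indices (block row, block column, row in block, column in block) correspond to viewing the matrix as an array with more than two dimensions.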
Abstract
The PeakStream Platform is a software development and deployment environment that makes it easy to program a variety of multi-core processors, including GPUs, the AMD Stream Processor, and x86 multi-core CPUs. The PeakStream Platform includes a set of development tools, such as debuggers and profilers, and a runtime deployment system called the PeakStream Virtual Machine. The PeakStream programming model is a data parallel programming model. The platform offers developers a high-productivity means of programming a wide range of multi-core processors in a portable fashion. The PeakStream Platform is ideal for the computationally intensive applications created in the industries of Oil & Gas, Financial Services, Government, and Research & Academia.
Abstract
Multi-core processors are emerging as a new building block for designing next-generation large-scale clusters for HPC and data centers. In this talk, we propose new designs for multi-core aware middleware for such systems. For HPC environments, we propose new cache-aware designs for multi-core systems to enhance intra-CMP and inter-CMP shared-memory communication for the popular MPI library. For data-center environments, we also propose dedicated cores to handle memory copy and other functionalities (such as shared state). Performance benefits of these techniques for a number of applications are illustrated.
Abstract
It is clear that the industry is turning towards multicore design patterns to address the physical limitations hit by the race to ever higher clock rates that took place in recent years. Since the definition of a multicore processor is still very blurry and different solutions have been proposed by the chip makers, it is unclear how this new disruptive technology is going to affect Linear Algebra software. This talk presents the results of an experimental phase that could lead to the definition of a general approach to the development of Linear Algebra code for Multicore Architectures.
Abstract
Although multicore chips are a major hardware event for the computing community at-large, the basic architecture they represent is not qualitatively new. Shared memory parallel (SMP) computers have been commercially available for decades, and many of the experiences on SMPs can be carried forward to inform us about how to best utilize multicore chips.
In this talk we survey some of the important experiences gained from programming SMP computers, from the Cray X-MP through SMP workstations and on to distributed shared memory systems. Although the low cost and ubiquity of multicore chips, along with the vastly larger body of programmers and users, will surely have an impact, there are some basic truisms from SMP computing that will likely carry over to multicore.
Abstract
In an ideal HPC programming environment, scientists and engineers would express problems in the mathematical and technical language they are trained in, free of knowledge about machine details. The burden of programming efficiently for specific computer architectures would fall entirely to the creators of libraries and compilers. One reason we haven't achieved this ideal is that we retain the intellectual legacy of using floating point operations per second (FLOPS) as the performance currency, decades after memory performance became the dominant burden of HPC workloads.
We need to learn to ignore FLOPS, especially at the library and compiler level, and reinvent every aspect of our computing infrastructure that derives from the idea that FLOPS are the goal: computer science curricula, procurement requirements, product pricing, and over fifty years of algorithms that no longer express what most matters for achieving high performance. By making more aspects of the memory (size, latency, bandwidth, concurrency) explicit in the way we program computers, we can make libraries that survive many generations of architecture change with only minor, simple adjustments.
Abstract
As we consider programming methodologies for the upcoming era of multi-core processors (as well as clusters of them), it seems clear that the programming model must respect data locality and facilitate intelligent migration of data across memory hierarchies. In addition, it is desirable that programmers be freed from resource management issues such as load balancing. I will present arguments as to why the Charm++ model, a C++-based paradigm that supports data-driven objects, is the right model to meet these requirements. Because of its data-driven scheduler and object-based parallelization, it is able to decide on pre-fetching of data (as needed by the CELL BE processor, for example). The objects promote locality. Further, the data-driven model is familiar to a generation of desktop programmers who write event-driven programs. This makes its acceptability in this community better than that in the science/engineering community, where it has already created a strong niche. I will also give an overview of higher-level abstractions built on top of Charm++, including one that provides a race-free global view of data.
Abstract
I am interested in describing the experience of participating in a research project for the development of the initial versions of the streaming virtual machine abstraction across multiple architectures, and what we learned in doing so and in developing an automatic mapper for the abstraction. In the case of the Cell processor, there appear to be numerous standard abstractions for expressing the choreography of cores and the motion of data among and to/from the cores. I will examine the question of whether there could and ought to be a standard abstraction for Cell and other architectures, and what elements this common abstraction should or could contain. I am particularly interested in the opinions of the audience on this question.
Abstract
Suborbital space tourism may finally open up space to private enterprise and commerce. If economically successful, which studies suggest is likely, private suborbital flight will lead to continually increasing demands for more performance and capability, similar to the evolution path of commercial air travel, that will eventually include ultra-rapid point-to-point global travel and routine and affordable orbital transportation. Hypersonic space planes propelled by a combination of air-breathing engines and rockets are leading contenders for achieving these ends due to their efficiency, flexibility and potentially favorable economics. On the other hand, hypersonic vehicles are highly integrated, encounter complex flow physics, and suffer from high performance and economic prediction uncertainty due to less than complete knowledge of the physics they encounter and the difficulty associated with obtaining that knowledge through testing. To counter these challenges, high-fidelity physical modeling tools are continually being developed that can more accurately and rapidly predict hypersonic flow physics and structural performance under combined thermal and mechanical loads. Also being developed are integrated multidisciplinary design analysis and optimization (MDA/O) tools and processes. Due to the computationally intensive nature of high-fidelity analysis codes and MDA/O systems, faster and more efficient computing systems and algorithms are required to meet the design challenges that will be imposed by the unfolding demands of space commerce. This presentation will focus on these themes, defining and highlighting some requirements for creating this future.