Multi-cores / Many-cores 2007
Problem Statement
When considering the design issues for a problem area like highly scalable massively distributed systems (whatever this may actually mean), it is important to ground the discussion in a context that allows us to more concretely explore the problem space. The following description shouldn’t be considered a definitive prose on the subject. Rather some points are well grounded, others are speculative and some are simply meant to be provocative to engage a lively discussion. What is a highly scalable, massively distributed system? A rationale answer is simply a system that solves a large-scale problem.
Therefore, when discussing the design considerations for a substrate for a highly scalable massively distributed system, the discussion can be approached from two perspectives. Although these perspectives are often considered separate, they can also be thought of as two views of the same problem. We can either talk about the characteristics of the big problems people want to solve for example: large coupled-physics problems, parallel rendering, or massively multiplayer online gaming. Alternatively we can drive a discussion around how to build and utilize big hardware, where big doesn’t necessarily imply a big computer. In this case, big could be a large number of connected processing elements independent of their topology. Regardless, of our underlying computing infrastructure and it’s distributed programming model, our claim is that today, we are not effectively utilizing our processing capability.
Previous and Current Distributed Systems
As we review previous systems, we need to keep in mind the characteristics of the big problems that were solved by utilizing these systems, as well as their definition of big hardware. Additionally, we also need to keep in mind, that the definition of what constitutes big hardware changes over time. For systems like CORBA and DCOM, a large set of practical applications was distributed transaction processing for banks, insurance companies, etc. In these applications, scalability meant supporting a large number of accounts, as well as a large number of simultaneous users.
These distributed systems, often called object brokers, had a core set of fundamental system services including: naming, binding, object activation, and lifetime management. Remote procedure calls, served as the foundation of these distributed computing technologies as well as to their successors such as Java RMI and .NET Remoting. Unfortunately these systems tended to be tightly coupled and forced all parties to standardize on a single execution environment. As loosely coupled distributed systems which run on a variety of embedded devices started to come into favor, systems such as Sun’s Jini which relied on mobile code, and the SOAP standard, which was designed to securely cross firewalls were introduced. These systems tended to run on blade servers and clusters.
Issues for Future Scalable Distributed Systems
In the future, next generation applications will require ever-increasing computational resources. These “petascale” applications such as engineering design, utilizing parametric runs of multiple interacting physical simulations, the modeling of highly complicated and inter-related systems such as the human body, as well as future entertainment applications derived from the merging of massively multiplayer online computer games and movies, will all have strict performance and scalability requirements. In these systems, support for parallelism in any form is just a mechanism used to accomplish a larger goal and should simply be considered a means to an ends.
Furthermore, the programming model required to solve these next generation problems, will not be a “pure” model. Rather a hybrid solution will be required. Today, we have a plethora of systems built using separate and distinct technologies, i.e. remote procedure calls, messaging, transactional capabilities, cross-platform communication, etc. Recently, Microsoft developed the Indigo distributed programming model, to explicitly address these issues, and to serve as a unified solution. Unfortunately high performance was not a key design driver for Indigo.
As we look at solutions, which tend to scale, for example a GRID application like SETI@home: autonomy and asynchrony become important drivers in its design. The more autonomous a solution is, the less of a dependence it will have on coordination. In this case, it is important to recognize that any solution that requires global coordination hinders scalability and any solution that emphasizes local autonomy inherently increases scalability. Thus to maximize throughput, maintaining the autonomy of the computation (not having to coordinate) will be more important than the performance of that computation (how long we have to wait). This observation will become more and more apparent as we get more N-way in our computing infrastructure, whether this means multi-core, many-core, many-node, many-cluster, etc.
Panel Format
The objective of this panel is to have a relevant applications centric discussion on requirements for future substrates for highly scalable, massively distributed systems based on a few relevant application domains. The panelist will take into consideration the kinds of hardware, networks, and potential applications that will be online in the next five years. Panelists will give a 15-minute presentation on their views of what issues should be considered, when designing a future substrate. These presentations will be followed by a 30-minute group discussion.
The panelists can use the questions below to serve as a guide to formulate their presentations:
- What lessons have we learned from previous distributed systems, transaction systems, and distributed object databases?
- How do future applications drive explicit elements of the solution?
- What should the requirements be for a new low-level substrate be?
- What are the minimum features required for this low-level substrate?
- What is the event model, the messaging model, the security model, the wire protocol, etc?
- How do we integrate the distributed solution into the OS, what hooks are needed?
- What are the language integration issues?
- Should an object system be an integral part of the system?
- Alternatively, can we standardize a low level substrate and then bolt on top, multiple objects systems and their execution environments depending on the needs of the problem?
- If the solution is based on objects, how is identity defined in the distributed system?
- What are the issues related to versioning and evolution of the objects in the distributed system?
- What approach is used for lifetime management?
- How does the substrate support loosely coupled, strongly coupled, autonomous application needs?
- How is interoperability supported in the substrate?
- How does the solution explicitly support the needs of high performance applications?
- How is next generation hardware and it’s many forms of parallelism leveraged?