High performance cloud computing

High Performance Cloud Computing (HPC2) is a term coined by Robert L. Clay of Sandia National Laboratories for a body of work focused on providing a scalable application runtime environment that applies core notions from cloud computing (specifically, extreme hardware fault tolerance achieved through software) to high performance machine architectures (those with high cross-section bandwidth).

Work on HPC2 emerged in response to the perceived breakdown of several core assumptions of traditional high performance computing (HPC) at extreme scale (exascale and beyond). These assumptions include:

1. That compute nodes persist for the duration of a job.
2. That the MPI programming model will scale to arbitrary size.
3. That sufficiently reliable hardware can be built (so that fault-oblivious software is not required).
4. That capability machines are fundamentally different from capacity machines.

These assertions were intended as much for rhetorical purposes as for strictly technical observation.

An alternative set of assumptions was offered to replace these, based on the perception that at least the first three were in effect failing as machines and applications scaled.

This alternative set of assumptions includes:

1. That compute nodes, along with other hardware components, will fail during the execution of a job.
2. That the MPI cooperative computing model will not scale far enough.
3. That sufficiently reliable hardware is too expensive and impractical at scale.
4. That the capability machines of the future may be similar to capacity machines.

This fourth assertion was posed as a question.
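The first of these alternative assumptions, that nodes will fail mid-job, motivates the idea of fault tolerance through software rather than hardware. A minimal sketch of that idea is a runtime that reassigns a task whenever the worker executing it fails, so one lost node does not kill the whole job. The function and worker names below are hypothetical illustrations, not part of any actual HPC2 implementation.

```python
import random

def run_with_retries(tasks, workers, max_attempts=10, fail_prob=0.3):
    """Hypothetical sketch: execute each task, retrying on a simulated
    node failure, so a single failed worker never aborts the whole job."""
    results = {}
    for task_id, work in tasks.items():
        for attempt in range(max_attempts):
            worker = random.choice(workers)   # assign the task to some node
            if random.random() < fail_prob:   # simulate that node failing
                continue                      # reassign on the next attempt
            results[task_id] = work()         # task completed successfully
            break
        else:
            raise RuntimeError(f"task {task_id} failed on all attempts")
    return results

random.seed(0)  # deterministic simulation for this illustration
tasks = {i: (lambda i=i: i * i) for i in range(5)}
print(run_with_retries(tasks, workers=["node0", "node1", "node2"]))
```

The essential design point is that failure handling lives in the runtime's task loop, not in the hardware or in each task's own code; production systems add checkpointing and failure detection on top of this basic reassignment loop.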

The core notions driving HPC2 center on building an application runtime system that can scale to arbitrary size and that is not specific to any one hardware system design or configuration.