Fastflow (computing)

Fastflow is a skeletal programming framework for writing software programs that take advantage of multi-core processors. FastFlow is specifically targeted to support the development of efficient streaming applications on cache-coherent multi-core platforms, and it is realised as a C++ template library.

Pragmatics

Fastflow, similarly to Intel Threading Building Blocks, could be directly used to develop efficient multithreaded applications in a skeletal fashion. Using Fastflow, the user can describe quite general streaming networks, which include the arbitrary nesting of pipeline and farm skeletons (whereas TBB admit pipeline skeleton only). These compositions could be extended with feedback channels to build recursive streaming networks (such as parallel Divide&Conquer and Branch&Bound algorithms, parallels Dataflow interpreter, etc.). Fastflow has been mainly designed to support the development of high-level programming frameworks on multi-cores platforms, such as Problem Solving Environment and general-purpose programming environment for multi-core applications.

Implementation

Fastflow is implemented as a template library that offers a set of low-level mechanisms to support low-latency and high-bandwidth data flows in a network of threads running on a cache-coherent multi-core. On these architectures, the key performance issues concern memory fences, which are required to keep the various caches coherent. Fastflow provides the programmer with two basic mechanisms: efficient communication channels and a memory allocator. Communication channels, as typical is in streaming applications, are unidirectional and asynchronous. They are implemented via lock-free (and memory fence-free) Multiple-Producer-Multiple-Consumer (MPMC) queues. The memory allocator is built on top of these queues, thus taking advantage of their efficiency.

Traditionally, MPMC queues are built as passive entities: threads concurrently synchronise to access data; these synchronisations are usually supported by one or more atomic operations (e.g. Compare-And-Swap) that behave as memory fences. FastFlow design follows a different approach: in order to avoid any memory fence, MPMC queues are build by assembling fence-free Single-Producer-Single-Consumer (SPSC) queues by means of an arbiter thread, which gathers or multicast/unicast data items from input queues to output queues. These entities are called either Emitter (E) or Collector (C) according to their role; they read an item from one or more fence-free SPSC queues and write onto one or more lock-free SPSC queues. This requires a memory copy but no atomic operations (this is a trivial corollary of lock-free SPSC correctness.

Fastflow is open source under GPL.

Performance

Experiments demonstrate that Fastflow is often more efficient than state-of-the-art multi-core programming frameworks in a set of micro-benchmarks and on real world applications, such as the Smith-Waterman algorithm; the speedup edge of Fastflow over other solutions might be substantial for fine grain tasks, as an example +35% on OpenMP, +226% on Cilk, +96% on Intel Threading Building Blocks for the alignment of protein P01111 against UniProt DB using the Smith-Waterman algorithm.

See also

  • POSIX Threads
  • Intel Threading Building Blocks
  • Cilk
  • OpenMP