Tag Archives: HPCC’10

Highlights of HPCC 2010

The 12th IEEE Internal Conference on High Performance Computing and Communications was not the first conference I attended. However, it was the first one where I actually presented a paper in the main research track.

As usual, the conference covered a wide variety of different topics. For me the following presentations were the most interesting, since they discuss problems related to my own research.

GPGPUs are a hot topic and HPCC covered several different aspects. Often it was not the main focus of the presented paper, which was inherently interesting for me, instead it was the problems the authors were facing during their experiments. For instance the presentation on Sparse Matrix Formats Evaluation and Optimization on a GPU highlighted the difficulties of programming systems with such complex memory hierarchies as present in today’s GPGPU systems. Similar, OpenCL: Make Ubiquitous Supercomputing Possible gave an introduction on how to develop applications for GPGPU systems.

The presentation on Aggregation of Real-Time System Monitoring Data for Analyzing Large-Scale Parallel and Distributed Computing Environments gave interesting insights in how super computing center operate and what challenges they face. One interesting anecdote was that the power saving functionality of modern CPUs actually causes some unexpected trouble in such large-scale systems. The temperature differences caused by powering down CPUs can cause their dies to crack resulting in destroyed CPUs. Furthermore, the talk also included a projection of how future super computers will look like. Interestingly, this projection says that the tradeoff between power efficiency and sophisticated CPU optimizations for single-core performance will be decided in favor for power efficiency. Thus, they expect future super computers to have dramatically higher number of much simpler cores than what is used today. That means, optimizations like branch prediction and out-of-order execution will most likely not be present on CPU cores for exascale systems to be able to build high performance systems that fit into the practical energy envelope, i.e., are coolable.

Related to cluster computing was the presentation on Enabling GPU and Many-Core Systems in Heterogeneous HPC Environments Using Memory Considerations. Here the memory bandwidth limitations were considered to schedule tasks on heterogeneous clusters. The idea is that tasks that saturate the memory system will slow down the overall systems. Thus, better distribution of tasks, taking the available memory bandwidth into account will lead to better execution times.

The Evaluation of the Task Programming Model in the Parallelization of Wavefront Programs provides the arguments for the intuitive assumption that fine-grained parallelism is inefficient. However, the approach here was based on OpenMP and Intel’s TBB library, thus, there is no automatic solution for problems which are expressed naturally in a very fine-grained way. As far as I remember form the TiC Summer School, X10’s compiler is actually capable to coarsen up fine-grained task parallelism, and thus provides some automatic solution for this problem.

However, my very personal highlight of the conference was moment when I unexpectedly received the Best Student Paper Award for my paper on Insertion Tree Phasers.

Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

The last half year was an interesting departure from my actual PhD research. First, I though the idea of barriers and phasers might be interesting to incorporate into a virtual machine as part of my thesis, but as it turned out, they are much to high-level and are better off implemented in a library. The gain for direct support in a VM is just not proportional to the effort and restrictions which come with that step.

However, the time was not spend just for the sake of the academic exercise. I had a small idea how to improve the existing approaches and after quite some work, which proved that the initial idea was just broken, I had an algorithm that actually worked. Even so that idea is not contributing anything directly to my thesis, I was lucky enough to have a paper about it accepted at the IEEE HPCC’10 conference.

Abstract

This paper presents an algorithm and a data structure for scalable dynamic synchronization in fine-grained parallelism. The algorithm supports the full generality of phasers with dynamic, two-phase, and point-to-point synchronization. It retains the scalability of classical tree barriers, but provides unbounded dynamicity by employing a tailor-made insertion tree data structure.

It is the first completely documented implementation strategy for a scalable phaser synchronization construct. Our evaluation shows that it can be used as a drop-in replacement for classic barriers without harming performance, despite its additional complexity and potential for performance optimizations. Furthermore, our approach overcomes performance and scalability limitations which have been present in other phaser proposals.

  • Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism, Stefan Marr, Stijn Verhaegen, Bruno De Fraine, Theo D’Hondt, Wolfgang De Meuter, Proceedings of the 12th IEEE International Conference on High Performance Computing and Communications (HPCC), IEEE CS, September (2010) (to appear).
  • Paper: PDF
    ©IEEE, 2010. This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author’s copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.
  • BibTex: BibSonomy