Aug 4 2010

Poster at SPLASH’10

On a related note to the last post, my poster got also accepted for SPLASH’10.

It will get a minor revision, and include some of the latest ideas related to locality and encapsulation. However, the abstract which will appear in the proceedings is rather high-level, and does not go into detail. It is another case of reusing IWT report material to have another little dot on the CV, and free food ;)

Abstract

We propose to search for common abstractions for concurrency models to enable multi-language virtual machines to support a wide range of them. This would enable domain-specific solutions for concurrency problems. Furthermore, such an abstraction could improve portability of virtual machines to the vastly different upcoming many-core architectures.

  • Many-Core Virtual Machines: Decoupling Abstract from Concrete Concurrency, Stefan Marr, Theo D’Hondt, Proceedings of the first SPLASHCon, USA, ACM (2010).
  • Paper: PDF
    ©ACM, 2010. This is the author’s version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. To appear.
  • BibTex: BibSonomy

Jul 30 2010

Doctoral Symposium at SPLASH 2010

In October, I will give a brief presentation on the state of affairs with my PhD research at the SPLASH 2010 Doctoral Symposium. The basic idea has not changed since my last presentation at the TiC’10 summer school. I haven’t been able to do a lot of real work for it, but the ideas are a bit clearer now. The following two-page proposal will be published as part of the conference proceedings.

Abstract

We propose to search for common abstractions for different concurrency models to enable high-level language virtual machines to support a wide range of different concurrency models. This would enable domain-specific solutions for the concurrency problem. Furthermore, advanced knowledge about concurrency in the VM model will most likely lead to better implementation opportunities on top of the different upcoming many-core architectures. The idea is to investigate the concepts of encapsulation and locality to this end. Thus, we are going to experiment with different language abstractions for concurrency on top of a virtual machine, which supports encapsulation and locality, to see how language designers could benefit, and how virtual machines could optimize programs using these concepts.

  • Encapsulation And Locality: A Foundation for Concurrency Support in Multi-Language Virtual Machines?, Stefan Marr, Proceedings of the first SPLASHCon, USA, ACM (2010).
  • Paper: PDF
    ©ACM, 2010. This is the author’s version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. To appear.
  • BibTex: BibSonomy

Jul 6 2010

Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism

The last half year was an interesting departure from my actual PhD research. First, I though the idea of barriers and phasers might be interesting to incorporate into a virtual machine as part of my thesis, but as it turned out, they are much to high-level and are better off implemented in a library. The gain for direct support in a VM is just not proportional to the effort and restrictions which come with that step.

However, the time was not spend just for the sake of the academic exercise. I had a small idea how to improve the existing approaches and after quite some work, which proved that the initial idea was just broken, I had an algorithm that actually worked. Even so that idea is not contributing anything directly to my thesis, I was lucky enough to have a paper about it accepted at the IEEE HPCC’10 conference.

Abstract

This paper presents an algorithm and a data structure for scalable dynamic synchronization in fine-grained parallelism. The algorithm supports the full generality of phasers with dynamic, two-phase, and point-to-point synchronization. It retains the scalability of classical tree barriers, but provides unbounded dynamicity by employing a tailor-made insertion tree data structure.

It is the first completely documented implementation strategy for a scalable phaser synchronization construct. Our evaluation shows that it can be used as a drop-in replacement for classic barriers without harming performance, despite its additional complexity and potential for performance optimizations. Furthermore, our approach overcomes performance and scalability limitations which have been present in other phaser proposals.

  • Insertion Tree Phasers: Efficient and Scalable Barrier Synchronization for Fine-grained Parallelism, Stefan Marr, Stijn Verhaegen, Bruno De Fraine, Theo D’Hondt, Wolfgang De Meuter, Proceedings of the 12th IEEE International Conference on High Performance Computing and Communications (HPCC), IEEE CS, September (2010) (to appear).
  • Paper: PDF
    ©IEEE, 2010. This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author’s copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.
  • BibTex: BibSonomy

Jun 21 2010

Trends in Concurrency 2010, part 3

This is the last post about the TiC’10 summer school and covers the two remaining lectures. Mark Moir talked about concurrent data structures and transactional memory, and last but not least, Madan Musuvathi presented the work they did at Microsoft Research to improve the testing of concurrent applications.

Mark Moir: Concurrent Data Structures and Transactional Memory

Stacks are the basic data structure in computer science, so it seams, well, at least they are used to introduce concurrency issues like the ABA problem. After presenting the lock-free stack with a version counter, which he claimed is for some people even with 64bit not safe enough, he presented a strategy to implement a stack much more scalable. The idea is that pairs of push-pop operations leave the actual stack unchanged as a result. Thus, these pairs could meet up in parallel and independent of the actual stack to avoid bottlenecks and contention. However, such a strategy still has to maintain some correctness criterion. Which he defines to be that it is linearizable with respect to a specific sequential semantics.

Afterwards he discusses and defines non-blocking progress conditions, i.e., wait-free, lock-free, and obstruction-free. The general idea is that weakening the requirements can enable more efficient or simpler implementation strategies.

As an example of how weakening the requirements allows to achieve an implementation strategy, which matches better the performance requirements, he introduced their SNZI counter-like constructs. SNZI means Scalable Non-Zero Indicator.

In the second lecture, he reports about their experience with transactional memory, not only software transactional memory, but also hardware TM, and hybrid approaches. He presented several interesting insights they gained while experimenting with the TM provided by Sun’s Rock processors. Very interesting was the report on the several factors, which lead to problems in the HTM systems. For instance, one of the main problems was, that a transaction could fail because the branch predictor was going down the wrong path and speculatively executing instructions, which then in return lead to transaction aborts.

Thereafter, he discussed the performance they managed to gain by using the different TM or lock-based systems. The result is that it very much depends on the algorithms and its parallel behavior. But the general conclusion is: STM, HTM, and also hybrid approaches do not just yet provide the necessary performance for general applications.

Another experiment he reported on was the attempt to simplify the implementation of concurrent data structures, which was mostly a success. They also need carful tuning with TM but enable simplified algorithms and even performance improvements depending on the use-case/algorithm.

Madan Musuvathi: Concurrency Testing: Challenges, Algorithms, and Tools

First Madan tried to wake us up by engaging in a discussion about the tradeoffs of testing vs. verification. Well, this is obviously about the effort you have to spend to achieve reasonable results. However, if the question comes to critical systems, which might cause the loss of life, it ends up in weighting ethical concerns with economic decisions. So, yes, he achieved his goal and of course, people were expressing their strong opinions.

He then continued with giving a demo of CHESS. It is a tool for debugging concurrent programs. It tries to explore the possibly problematic instruction interleavings with different algorithms since a brute force approach would not be feasible for real programs. The tool supports two different modes. In the fast mode, it inserts scheduling points before every synchronization, volatile access, and interlocked operation. This allows to find many bugs in practices, but not all. The second mode, the data-race mode inserts scheduling operations before every memory access, which allows to identify race-conditions, since it captures all sequentially consistent executions. What the tool does not do is generating and testing different inputs, which is a hard problem on its own.

The remainder of the talk discusses two different strategies to approach the exploding number of possible interleavings. The first approach is a reduction approach. The goal is to eliminate behavioral equivalent interleavings. Here he describes a number of different opportunities. The second approach is to prioritize interleavings, which might exhibit problematic behavior. Well, his claim is that random is the best heuristic here but of course can be combined with the reduction approach.

In the second lecture he details the randomization approach and presents another tool called Cuzz, which allows disciplined randomization of schedules. Now he goes into the theory behind this approach to establish an estimate of how likely it is to find bugs with this approach. The general idea is that bugs have different complexity if it comes to the necessary preconditions and possible constellations of interleavings. Another point is, that many bugs share the same root cause. This is expressed by a metric called ‘bug depth’, which is the number of ordering constrains that are sufficient to find the bug. In his experience, most bugs have a rather small bug depth. With the bug depth, he can calculate the probability of being able to find every bug in the program.

His claim is, that with all the optimizations applied to the algorithms they use, in practice, Cuzz beats the theoretical boundary. Most programs exhibit a lot of bugs, and these bugs are executed often which allows to find many of them in a few hundred runs through Cuzz.


Jun 14 2010

Trends in Concurrency 2010, part 2

This post is a follow up on my first report on the TiC’10 summer school. It covers mainly the talks about X10 and formal aspects of concurrency.

Vijay Saraswat gave an overview about X10, Matthew Parkinson introduced separation logic and deny-guarantee reasoning, and Ganesan Ramalingam reported on Analysis and Verification of Concurrent Programs.

Vijay Saraswat: The X10 Language: Design and Implementation

The talk started with a brief description of the usual application scenarios IBM tries to address. They are looking for a language, which allows to program homogenous multi- or many-core systems as well as potentially heterogeneous cluster or HPC systems. The main goal is to design an expressive, productive and efficient programming model for application developers. The means to that end are a type system, static analysis, a comfortable debugger, and optimized standard libraries.

X10 is part of the PGAS family, but compared to UPC or Co-Array Fortran, X10 is an asynchronous PGAS (APGAS) language, which means that accesses over location boundaries are handled asynchronous by the runtime.

Some of the language properties, which were discussed, are briefly summarized below. As an APGAS language writes are only local. For remote writes you need to use a place-shift operation. The semantics of it is synchronous, but usually the implementation is asynchronous via messages. In that regard, Vijay also mentioned a tail-call like optimization to minimize the number of messages, which need to be exchanged. The atomic semantics is, well, rough. They use a per-place lock for atomics, but atomic operations on primitive values like integers are mapped to atomic operations provided by the underlying system, thus it is inherently ‘broken’, i.e., it is up to the programmer to ensure that different atomic blocks do not interfere with each other.

The implementation uses the notion of workers, which represent threads of the underlying system. There are 1 to n workers per place and activities are executed by workers. At the moment they use a model similar to Java’s fork/join, where the number of workers can grow over time. A problematic point is the implementation of the ‘when(C)’ construct, which currently requires to spawn new threads since the thread executing the activity which uses the ‘when’ construct will actually block on it. They will later change to a Cilk-like work-stealing system, which will allow to use a constant number of threads.

The place structure is static from the beginning, and currently a flat structure. In the future, this might be changed. Currently, they think about a hierarchical model as a better fit for GPGPUs. The plan is to explore two-level hierarchical places.

The design with workers and activates allows the compiler to coarsen up concurrency to remove the overhead of very fine-grained activities.

Another more flexible language construct they introduced recently is labeled finishes to allow the implementation of server-like programs where you do not want a pure hierarchical relation between asyncs and finishs but need more flexibility. Thus, an async can indicate on which finish it wants to announce its completion.

Matthew Parkinson: Deny-Guarantee Reasoning

The largest portion of his lectures was about the foundation for his own work. Thus, he was introducing separation logic. The goal is to have logic foundation to reason about programs, which can use pointer data structures, i.e., a heap like in usual object oriented programs. As far as I understood, this logic provides the means to establish statements about the behavior of programs. More precisely, it is possible to prove certain semantic assertions about their stacks and heaps.

One thing, which can be done, is to check whether the sets of memory locations modified by a program a disjoint, and thus, whether the program is race-free.

Rely-guarantee reasoning is a logic framework to reason about composed concurrent programs and is independent of separation logic. However, this logic is not able to capture the typical fork/join parallelism of many programs.

Deny-Guarantee reasoning is a reformulation of both, which explicitly supports dynamic parallelism, and thus, enables a formal reasoning about these kind of concurrent programs.

Well, I think I understand what it is useful for. The next thing I would need is a good textbook introducing all the basics very patiently. Might be necessary to prove that my barrier/phaser algorithm is actually correct :-/

Ganesan Ramalingam: Analysis & Verification of Concurrent Programs

Another Greek talk *sigh*. Anyway, lets see what it is good for: The goal of the work Mr. Ramalingam presented is to verify programs for some correctness criterion and for instance find concurrency bugs. To this end, they are using technique called concurrent shape analysis, which allows static verification in the presence of a dynamic heap and a statically unbounded number of threads by using abstract interpretation.

Before he goes into the details, he presents a typical stack example, and how it breaks if concurrency comes into play. To be able to reason about the correctness of the implementation a sequential specification of the stack is necessary. In a concurrent setting, this implies that the operations have to be linearizable in some form. The definition for that is: an interleaved execution is said to be linearizable iff it is equivalent to a legal sequential execution that preserves the order of non-overlapping operations.

Well, and then he goes on to explain the technique in detail.

In the second lecture, he starts with talking about serializability, similar to what is done in databases. His goal is to use the techniques presented earlier, to synthesize correct locking-schemes for originally sequential data-structures, while avoiding the trivial solution but providing an efficient scheme. The simplified idea is to compute a sequential proof, introduce the locks, and then afterwards eliminate redundant locks to achieve performance.