Tag Archives: summer school

Trends in Concurrency 2010, part 3

This is the last post about the TiC’10 summer school and covers the two remaining lectures. Mark Moir talked about concurrent data structures and transactional memory, and last but not least, Madan Musuvathi presented the work they did at Microsoft Research to improve the testing of concurrent applications.

Mark Moir: Concurrent Data Structures and Transactional Memory

Stacks are the basic data structure in computer science, so it seams, well, at least they are used to introduce concurrency issues like the ABA problem. After presenting the lock-free stack with a version counter, which he claimed is for some people even with 64bit not safe enough, he presented a strategy to implement a stack much more scalable. The idea is that pairs of push-pop operations leave the actual stack unchanged as a result. Thus, these pairs could meet up in parallel and independent of the actual stack to avoid bottlenecks and contention. However, such a strategy still has to maintain some correctness criterion. Which he defines to be that it is linearizable with respect to a specific sequential semantics.

Afterwards he discusses and defines non-blocking progress conditions, i.e., wait-free, lock-free, and obstruction-free. The general idea is that weakening the requirements can enable more efficient or simpler implementation strategies.

As an example of how weakening the requirements allows to achieve an implementation strategy, which matches better the performance requirements, he introduced their SNZI counter-like constructs. SNZI means Scalable Non-Zero Indicator.

In the second lecture, he reports about their experience with transactional memory, not only software transactional memory, but also hardware TM, and hybrid approaches. He presented several interesting insights they gained while experimenting with the TM provided by Sun’s Rock processors. Very interesting was the report on the several factors, which lead to problems in the HTM systems. For instance, one of the main problems was, that a transaction could fail because the branch predictor was going down the wrong path and speculatively executing instructions, which then in return lead to transaction aborts.

Thereafter, he discussed the performance they managed to gain by using the different TM or lock-based systems. The result is that it very much depends on the algorithms and its parallel behavior. But the general conclusion is: STM, HTM, and also hybrid approaches do not just yet provide the necessary performance for general applications.

Another experiment he reported on was the attempt to simplify the implementation of concurrent data structures, which was mostly a success. They also need carful tuning with TM but enable simplified algorithms and even performance improvements depending on the use-case/algorithm.

Madan Musuvathi: Concurrency Testing: Challenges, Algorithms, and Tools

First Madan tried to wake us up by engaging in a discussion about the tradeoffs of testing vs. verification. Well, this is obviously about the effort you have to spend to achieve reasonable results. However, if the question comes to critical systems, which might cause the loss of life, it ends up in weighting ethical concerns with economic decisions. So, yes, he achieved his goal and of course, people were expressing their strong opinions.

He then continued with giving a demo of CHESS. It is a tool for debugging concurrent programs. It tries to explore the possibly problematic instruction interleavings with different algorithms since a brute force approach would not be feasible for real programs. The tool supports two different modes. In the fast mode, it inserts scheduling points before every synchronization, volatile access, and interlocked operation. This allows to find many bugs in practices, but not all. The second mode, the data-race mode inserts scheduling operations before every memory access, which allows to identify race-conditions, since it captures all sequentially consistent executions. What the tool does not do is generating and testing different inputs, which is a hard problem on its own.

The remainder of the talk discusses two different strategies to approach the exploding number of possible interleavings. The first approach is a reduction approach. The goal is to eliminate behavioral equivalent interleavings. Here he describes a number of different opportunities. The second approach is to prioritize interleavings, which might exhibit problematic behavior. Well, his claim is that random is the best heuristic here but of course can be combined with the reduction approach.

In the second lecture he details the randomization approach and presents another tool called Cuzz, which allows disciplined randomization of schedules. Now he goes into the theory behind this approach to establish an estimate of how likely it is to find bugs with this approach. The general idea is that bugs have different complexity if it comes to the necessary preconditions and possible constellations of interleavings. Another point is, that many bugs share the same root cause. This is expressed by a metric called ‘bug depth’, which is the number of ordering constrains that are sufficient to find the bug. In his experience, most bugs have a rather small bug depth. With the bug depth, he can calculate the probability of being able to find every bug in the program.

His claim is, that with all the optimizations applied to the algorithms they use, in practice, Cuzz beats the theoretical boundary. Most programs exhibit a lot of bugs, and these bugs are executed often which allows to find many of them in a few hundred runs through Cuzz.

Trends in Concurrency 2010, part 2

This post is a follow up on my first report on the TiC’10 summer school. It covers mainly the talks about X10 and formal aspects of concurrency.

Vijay Saraswat gave an overview about X10, Matthew Parkinson introduced separation logic and deny-guarantee reasoning, and Ganesan Ramalingam reported on Analysis and Verification of Concurrent Programs.

Vijay Saraswat: The X10 Language: Design and Implementation

The talk started with a brief description of the usual application scenarios IBM tries to address. They are looking for a language, which allows to program homogenous multi- or many-core systems as well as potentially heterogeneous cluster or HPC systems. The main goal is to design an expressive, productive and efficient programming model for application developers. The means to that end are a type system, static analysis, a comfortable debugger, and optimized standard libraries.

X10 is part of the PGAS family, but compared to UPC or Co-Array Fortran, X10 is an asynchronous PGAS (APGAS) language, which means that accesses over location boundaries are handled asynchronous by the runtime.

Some of the language properties, which were discussed, are briefly summarized below. As an APGAS language writes are only local. For remote writes you need to use a place-shift operation. The semantics of it is synchronous, but usually the implementation is asynchronous via messages. In that regard, Vijay also mentioned a tail-call like optimization to minimize the number of messages, which need to be exchanged. The atomic semantics is, well, rough. They use a per-place lock for atomics, but atomic operations on primitive values like integers are mapped to atomic operations provided by the underlying system, thus it is inherently ‘broken’, i.e., it is up to the programmer to ensure that different atomic blocks do not interfere with each other.

The implementation uses the notion of workers, which represent threads of the underlying system. There are 1 to n workers per place and activities are executed by workers. At the moment they use a model similar to Java’s fork/join, where the number of workers can grow over time. A problematic point is the implementation of the ‘when(C)’ construct, which currently requires to spawn new threads since the thread executing the activity which uses the ‘when’ construct will actually block on it. They will later change to a Cilk-like work-stealing system, which will allow to use a constant number of threads.

The place structure is static from the beginning, and currently a flat structure. In the future, this might be changed. Currently, they think about a hierarchical model as a better fit for GPGPUs. The plan is to explore two-level hierarchical places.

The design with workers and activates allows the compiler to coarsen up concurrency to remove the overhead of very fine-grained activities.

Another more flexible language construct they introduced recently is labeled finishes to allow the implementation of server-like programs where you do not want a pure hierarchical relation between asyncs and finishs but need more flexibility. Thus, an async can indicate on which finish it wants to announce its completion.

Matthew Parkinson: Deny-Guarantee Reasoning

The largest portion of his lectures was about the foundation for his own work. Thus, he was introducing separation logic. The goal is to have logic foundation to reason about programs, which can use pointer data structures, i.e., a heap like in usual object oriented programs. As far as I understood, this logic provides the means to establish statements about the behavior of programs. More precisely, it is possible to prove certain semantic assertions about their stacks and heaps.

One thing, which can be done, is to check whether the sets of memory locations modified by a program a disjoint, and thus, whether the program is race-free.

Rely-guarantee reasoning is a logic framework to reason about composed concurrent programs and is independent of separation logic. However, this logic is not able to capture the typical fork/join parallelism of many programs.

Deny-Guarantee reasoning is a reformulation of both, which explicitly supports dynamic parallelism, and thus, enables a formal reasoning about these kind of concurrent programs.

Well, I think I understand what it is useful for. The next thing I would need is a good textbook introducing all the basics very patiently. Might be necessary to prove that my barrier/phaser algorithm is actually correct :-/

Ganesan Ramalingam: Analysis & Verification of Concurrent Programs

Another Greek talk *sigh*. Anyway, lets see what it is good for: The goal of the work Mr. Ramalingam presented is to verify programs for some correctness criterion and for instance find concurrency bugs. To this end, they are using technique called concurrent shape analysis, which allows static verification in the presence of a dynamic heap and a statically unbounded number of threads by using abstract interpretation.

Before he goes into the details, he presents a typical stack example, and how it breaks if concurrency comes into play. To be able to reason about the correctness of the implementation a sequential specification of the stack is necessary. In a concurrent setting, this implies that the operations have to be linearizable in some form. The definition for that is: an interleaved execution is said to be linearizable iff it is equivalent to a legal sequential execution that preserves the order of non-overlapping operations.

Well, and then he goes on to explain the technique in detail.

In the second lecture, he starts with talking about serializability, similar to what is done in databases. His goal is to use the techniques presented earlier, to synthesize correct locking-schemes for originally sequential data-structures, while avoiding the trivial solution but providing an efficient scheme. The simplified idea is to compute a sequential proof, introduce the locks, and then afterwards eliminate redundant locks to achieve performance.

Trends in Concurrency 2010, part 1

I already posted the presentation I gave at the summer school earlier. In the following posts, I will report a bit about the lectures of the summer school, similar to my posts about the TPLI summer school of last year.

Like last year, my thanks go to FWO and the summer school, for providing me with grants to cover all the costs. Thanks :)

The first two days covered four different topics. Adam Welc started with an introduction to Software Transactional Memory, Jean-Pierre Talpin gave lectures on the topic Virtual Prototyping Embedded Architectures, Satnam Singh reported on his experience with Data-Parallel Programming, and last but not least, Peter Sewell introduced us to the mystics of hardware memory models with his talk titled Low-Level Concurrency – For Real.

Adam Welc: Software Transactional Memory

Adam, who works for Intel Labs, talked in his first presentation mainly about the semantics of STM and the different approaches for designing a STM system. The second presentation concentrated on how the implementation of such a system could work.

His presentations were very much in the spirit of the STM system Intel provides as a prototype, but the general ideas are used in other systems as well.

The different options for semantics of an STM included open or closed transaction nesting, transaction flattening as well as abort and retry. Furthermore, he discussed the interaction for instance with legacy code, which leads to the question of weak and strong atomicity properties, or the semantics of single global lock atomicity. The last semantic question he discussed was how consistency should be provided, and how pointer-privatization could be supported by the system. Afterwards, he outlined the design space for STM systems, introducing optimistic vs. pessimistic transactions as well as write buffering vs. in-place updates.

In the second talk, he presented the implementation details.

For their C++-based STM the compiler needs additional information about the characteristics of all code, which is done by annotations on class and method level. Similar, the compiler supports atomic blocks, abort, and retry statements. Beside all the implementation details, he also discussed several optimizations the compiler attempts to reduce the STM overhead with regard to read/write-barriers wherever possible. My gut feeling tells me, someone should write a STM+TracingJIT story. Not sure whether that buys anything, but what I have seen with STM looks a lot like the trace-guards.

The last slides he presented were about performance. The basic conclusion still is: performance is an issue. However, he gave me an interesting pointer to one of his papers, which basically states, that transaction can be used for looks without contention, quite straight forwardly: Transparently reconciling transactions with locking for Java synchronization.

Jean-Pierre Talpin: Virtual Prototyping Embedded Architectures

Unfortunately, I missed the introductorily part of these lectures due to problems with our accommodation *sigh*. However, as far as I understood, in the domain of embedded systems for large scale applications like aircrafts, there is always a need to combine a large number of different technologies, and the final goal is to simulate, analyze, and verify the software before it is deployed into production use. Especially the verification part is important for critical systems like in avionics.

The systems in this domain are mostly event driven, i.e., in his terminology it is an asynchronous composition of reactive processes.

For these purpose, they developed Polychrony including an Eclipse integration. It allows using data-flow models for computation and mode automata for control. These models are verified with model-checking techniques and allow controller synthesis.

At the heart of this system is a data-flow language, which allows expressing such an asynchronous event system in terms of synchronous modules, which have better characteristics with respect to verification. Furthermore, they established the means to prove certain correctness criteria.

For an aerospace project, they used this as a basis and code-generation framework. One of their techniques uses a SSA as an intermediate representation to model existing software. They use SSA to translate the input to their synchronous data-flow formalism enabling their analysis’.

Satnam Singh: Data-Parallel Programming

The motivation behind his work is to allow users to utilize parallelism in a civil manner, i.e., with a reasonable effort/gain ratio. His premise is that programming models like SIMD, OpenMP, and MPI are inherently low-level and require very detailed knowledge about the target architecture, which restricts their applicability. Not only increases it the development cost, but it also restricts the application to a single platform. With the Microsoft Accelerator, they develop a high-level data-parallel library to target various platforms, starting with standard multi-core CPUs with SIMD instruction extensions, GPGPUs, and FPGAs.

The user describes the computation by defining a data-flow graph, which describes the intention instead of the detailed mapping to an execution. Most notably, they provide parallel array structures, on which operations can be performed without necessarily using an index, but deferring these details to the runtime, which can optimize the relevant array operations for the target architecture. Furthermore, it provides certain standard mechanisms to work in a data-parallel way. This includes various operations on the parallel arrays, which for instance can be combined with appropriate reduce operations. An important operation he emphasized is the array shift operation, which is extensively used to describe necessary data transformations. Instead of doing these kind of transformations explicitly, the runtime system can highly-optimize them to the target platform.

The resulting program description is just-in-time compiled for the target machine, and only then executed. For the typical problems they are approaching, the construction of the program describing graph and the JIT overhead are negligible.

However, this form of nested data parallelism brings also some problems. Especially the distribution of the problem over the computing platform is hard since the shape of the resulting computational graph depends on the input data.

Peter Sewell: Low-Level Concurrency – For Real

The basic conclusion of his talk is: it is impossible to write portable, correct, and efficient concurrent code for any existing hardware platform, because they do not actually tell you all the constrains/properties in their specifications.

So he basically starts to rant about all processor manufacturers and their specs. He starts with a seemingly simple example, and demonstrates that the expected outcome is not guaranteed.

Then he goes on and describes various examples for x86 and relates them to the tricks a CPU applies, which result in rather weak memory models.

Based on that, he presents a model they developed called x86-TSO (Total Store Order), which allows to reason about the correctness of concurrent x86 programs. This model is eventually used to introduce the notion of triangular races. This allows to reason about a class of programs, which can contain data races, for instance lock implementations.

From here, he goes on to rant about all the other existing hardware architectures and their semantics or lack of specification. They tried for years to come up with a formal model like x86-TSO but failed.

To conclude his lectures, he presents 10 different options how to define a sane memory model for instance for a programming language. He starts out with option 1: DON’T i.e. no concurrency. Well, my personal impression still is, that this is his favorite option. It has several advantages. Especially, it is very simple. In the end he comes to option 10, which means to have a memory model similar to C++0x. It is a data-race free model, which provides low-level concurrency primitives, and allows to break abstractions. Beside the notion of the data-race free part, that is something, I have expected/thought about earlier for a good VM.

Locality and Encapsulation: My Student’s Presentation at the TiC Summer School

On the first day here at the summer school, the organizers gave us the opportunity to present our research ideas to the lecturers and other participants.

I presented my idea on extending the virtual machine models with explicit support for concurrency models. My brief presentation focused on using multiple-language virtual machines to provide support for domain-specific concurrent languages. It basically states the question whether it is possible to use the concepts of locality and encapsulation as a common foundation for a wide range of different models. Since I am at the very beginning of working on this specific idea, there aren’t any answers yet.

Theory and Practice of Language Implementation, part 3

The third and last part of the summer school started with Ondrej Lhotak talking about pointer analysis. David Bacon presented a basic introduction to garbage collection and detailed description of the real-time Metronome GC. The last lecturer for this summer school was Patrick Cousot, who gave a very basic and thus understandable introduction to abstract interpretation.

Pointer Analysis

Ondrej’s lectures were focused on pointer analysis for C and Java-like languages. He introduced it as a static analysis for several problems like for instance optimization, call-graph construction, dependency analysis and optimization, cast elimination, side effect, and escape analysis. The major problem seems to be that it is a very broad subject with many different applications, and thus, many different techniques to do the analysis.

So, before applying pointer analysis the right abstraction for a specific problem has to be chosen. Usually, this kind of analysis will result in an over approximation of the actual problem. Under approximations are unsound and the exact points-to sets are not computable. Thus, a suitable over approximation has to be found which is exact enough for a specific question to be answered.

Just to recapitulate two examples, he introduced abstractions based on type filtering, i.e, points-to sets are calculated on the bases of types of the potential object during a program execution. Furthermore, he explained field sensitive abstractions. Here the main property of interest is the field to which an object is assigned and this information is used to calculate the points-to sets. More examples and more advance techniques like refinement demand-driven analysis are shown in the slide set which will hopefully appear on the summer school webpage.

Garbage Collection and the Metronome GC

David started his lectures by showing off a bit ;) He showed a nice demo of what the Metronome system is able to achieve in terms of real-time constraints. Definitely very interesting and it fortifies my belief that a Java-like language will be the next system programming language succeeding C/C++.

Afterwards, he gave a short introduction to mark/sweep and semi-space collectors. Useful was his explanation about the actual semantics of StopTheWorld, parallel, concurrent, and incremental GCs supported with an illustration (should remember that one). BTW: The world needs a better book on garbage collection. I don’t like the one of Jones and Lins. It does not solve my problems :(

Eventually, he started explaining the Metronome-2 systems. The paging approach let me wonder whether there is anything an operating system is actually doing for a modern VM. Well there is, but is it enough to justify the additional trouble? Dreaming of Singularity… The second lesson was basically dedicated to synchronization issues and parallel/concurrent data structures like lock-free stacks and which strategies are out there to implement them. This was a nice addition to what we discussed during Yannis’ lectures and gave me some basic inspiration to implement a lock-free ring buffer for multiple writers and a single reader.

His explanation of all the details of the metronome algorithm and how the different phases interact was quite difficult, and well, not to easy to follow especially since I lost the overall picture from time to time. But well, it is a hard problem and a lot of engineering without a simple solution as it seems.

Abstract Interpretation

The last topic at the summer school was abstract interpretation and the lectures were given by Mr. Abstract Interpretation Patrick Cousot himself. Even so it is a theoretical topic, I liked his introduction. Probably for some of the people it was to basic, but since I never had looked at it before, for me it was the right level of detail and complexity to get an idea what abstract interpretation is actually about.

Unfortunately, my own notes are very spares, and I was not able to retrieve the slide set up to now. However, abstract interpretation tries to provide a sound approximation of the semantics of a computer program, to enable various kinds of static analysis. The main problem here is, similar to pointer analysis, to choose the right abstraction for a given problem.

The examples I recall could have fit into Summit’s presentation about bound computation of loops as well. He introduced ideas like step-wise approximation and other techniques. But, well, unfortunately my memory is too blurry to write something down here with certainty. Beside the introduction of the formal techniques, he also spent some time on how to implement them with Ocaml examples. He also made quite some advertisement for his product ASTRÉE, which originally was developed to analyze code for avionic systems. Well, if I can get my hands on his slides, I might add some more information about the content, but for the moment that’s it.

Where to go next Year?

Need less to say, that I really enjoyed the summer school. The lectures had different levels and quality, but the overall experience was great. Hope, that there will be a VM summer school next year, somewhere on this nice planet. I would love to attend it, too. :)