Tag Archives: parallel

OOSPLA 2011 @SPLASH2011, Day 3

The third day started with Brendan Eich’s keynote on JavaScript’s world domination plan. It was a very technical keynote, not very typical I suppose. And he was rushing through his slides with an enormous speed. Good that I have some JavaScript background. Aside all the small things he mentioned, interesting for me is that he seemed to be very interested to get Intel’s RiverTrail approach to data-parallelism into ECMAScript in one or another form. That is kind of contradicting the position I heard so far, that ECMAScript would be done with having WebWorkers as a model for concurrency and parallel programming.

Language Implementation

The first session had two interesting VM talks for me. The first being JIT Compilation Policy for Modern Machines. With the assumption that you cannot get your multicore/manycore machines busy with application threads, they experimented how additional compilation threads can be used to optimize code better. I do not remember the details completely, but I think, after 7 compilation threads they reached a mark where it was not worthwhile anymore to add more threads.

The second talk was on Reducing Trace Selection Footprint for Large-scale Java Applications with no Performance Loss. The goal here is to reduce the number of trace in a tracing JIT that need to be kept around and optimized. Might be something interesting for PyPy and LuaJIT2 to consider.

Parallel and Concurrent Programming

The last session I attended was again on parallel and concurrent programing. The first paper A Simple Abstraction for Complex Concurrent Indexes was a pretty formal one.

The second paper is a pessimistic approach to implement atomic statements meant for systems programming: Composable, Nestable, Pessimistic Atomic Statements. It is not optimistic like STM, and does not use a global locking order, which would need to be determined statically. Instead they annotate the fields with so-called shelters. Shelters build a hierarchy, which describes the necessary parts to synchronize with.

The third paper was also a very interesting one for me: Delegated Isolation. It is an approach again similar to STM in a sense, but it avoids unbounded number of transaction retries. For that, the object graph is essentially partitioned in growing subgraphs that can be processed in parallel. Only when a conflict, a race-condition occurred, subgraphs are merged logically, and the computation is serialized. A neat idea and an interesting use of the object-ownership idea.

The last talk was on: AC: Composable Asynchronous IO for Native Languages. It was a presentation of work done in the context of the Barrelfish manycore OS. The goal was to have an easy to use programming model, that comes close to sequential programming, but has similar performance properties as typical asynchronous APIs in operating systems. The result is a model based on async/finish, that seems to be relatively nice. As I understand it, it is basically syntactic sugar and some library support to wrap typical asynchronous APIs. But it is a model that is purely focused on such request/response asynchrony, and does not handle concurrency/parallism.

And that was basically it. SPLASH is a nice conference when it comes to the content. Not so interesting when it comes to the social “events”, it wasn’t much of an event anyway. Not even the food was notable…

OOSPLA 2011 @SPLASH2011, Day 2

The second day of the technical tracks started with a keynote by Markus Püschel. He is not the typical programming language researcher you meet at OOPSLA, but he does research in automatic optimization of programs. In his keynote, he showed a number of examples how to get the best performance for a given algorithm out of a particular processor architecture. Today’s compilers are still not up to the task, and will probably never be up to it. Given a naïve implementation, hand-optimized C code can have 10x speedup when dependencies are made explicit, and the compiler knows that no aliasing can happen. He was then discussing how that can be approached in an automated way, and was also thinking about what programming languages could do.

Award Papers

Afterwards, I attended the session with the awarded OOPSLA papers. The Hybrid Partial Evaluation talk presented an approach to avoid the typical cost of use of reflection or ‘interpretation’. The presentation of SugarJ: Library-based Syntactic Languages felt like a déjà vu. I did not get where it is different from Helvetica other than that it is for Java. The third paper on Reactive Imperative Programming with Dataflow Constraints was interesting in that it used also memory protection tricks to realize a reactive model in C++. The last presentation: Two for the Price of One: A Model for Parallel and Incremental Computation was very interesting. I have not used incremental computations as far as I am aware of anywhere other than for course work, but bringing it together with parallel programming in a single programming model, gives plenty of opportunities for super-linear speedups.

Parallel and Concurrent Programming

The second session of the day was on parallel and concurrent programming. Kismet: Parallel Speedup Estimates for Sequential Programs tackled the problem to get an idea of what opportunities for parallelism are available in a given program without having to change the used algorithms and approaches to much. For that, it uses data dependency analysis to characterize the critical path on a data-flow level. Since that usually does not give realistic results because of overestimation of parallelizability, they use in addition a hierarchical model of loops and the knowledge of the available hardware parallelism to better predict possible speedups.

The second and the third talk where almost identical in terms of problem and goal. Essentially, they provide the necessary infrastructure to run different variants of sequential implementations in parallel and then chose either the winner in terms of runtime or precision. These approaches are especially interesting if the available algorithms have very different properties for different input or input sizes. For instance, some mathematical algorithms just do not converge to a solution for certain inputs while they are very fast for others.

The last talk of the session discussed Scalable Join Patterns. Join patterns are an old approach to describe synchronization mechanism flexibly and declaratively. The presented work provided a scalable implementation approach that seems to work quite well and when they would use a compilation based approach for the patters, I guess it could be a very feasible and flexible replacement for standard synchronization mechanism provided as libraries.

Panel

Instead of attending the third paper session of the day, I attended the panel on Multicore, Manycore, and Cloud Computing: Is a new Programming Language Paradigm required?. Well, it was entertaining :) Nothing really new, no surprising arguments as far as I recall, but certainly interesting to watch. I think, they also recorded it. So it might be floating around the web soon.

The Sly3 Programming Language

The following introduction and analysis of the Sly3 programming language was written by Pablo Inostroza Valdera as part of his course work for the Multicore Programming course of Tom Van Cutsem. The assignment was to write a blog post about a topic of their own choice, and I repost Pablo’s work here with his permission to spread the word about Sly a bit wider. We made his article also available as part of his work for the Renaissance project itself.

Multicore Programming, Course Work

The Sly3 Programming Language

Pablo Inostroza Valdera

< pinostro _ vub.ac.be >

23 June 2011

Introduction

Sly3 is a Smalltalk dialect that features two concepts in order to deal with parallelism: ensembles, which are collections of objects that can receive parallel messages, and modifiers, which decorate messages in order to modify their parallel behavior (adverbs) or to specify how to handle reductions (gerunds). In this article we analyze Sly3‘s programming model.

Sly3 is a language being developed in the context of the Renaissance Project, at IBM. With Sly3, their authors David Ungar and Sam Adams explore abstractions for nondeterministic parallel programming. Sly3 is the experimental successor of LY, a javascript-like interpreted language that was implemented on top of Smalltalk[UA10]. As the 3 in Sly3 indicates, Sly itself underwent already three iterations of design. The main idea behind the design of Sly3 is that multicore programming is commonly addressed by restricting the nondeterminism parts and trying to enforce strict determinism. As an example of this, Ungar and Adams refer to the actors paradigm. While the actor model certainly adds some order to concurrent computations, the truth is that some nondeterminism still persists because the global order in which messages arrive to an actor is nondeterministic. They proposed that instead of trying to restrict the nondeterminism, it should be embraced. The Ly language and, later on, the Sly3 language are steps towards that direction. In this article we discuss the latter.

Ensembles, or How to Represent A Flock of Birds

Sly3‘s key elements were conceived by observing the nature. Think of a flock of birds, an ant colony or a school of fishes. These are all examples for a multitude of independent agents, acting in a nondeterministic way, whose aggregated behavior is more complex than that of the individuals. In other words, a more complex behavior emerges. The idea of a whole formed from independent units is key to Sly3, and it has been reflected in the concept of an ensemble. An ensemble is a collection of objects, subject of receiving messages as a whole. In the following example, we create an ensemble and execute an operation on it:

  numbers := Sly3Ensemble with: {1. 2. 3. 4. 5}.
  squaredNumbers := numbers squared.
  

This code yields: %{1. 4. 9. 16. 25}. Recall that the curly braces represent an array in Smalltalk. In Sly3, curly braces preceded by % denote an ensemble.

Notice how the ensemble received a message as a whole, but it acted on each individual in order to produce a new ensemble with all of the elements squared. It is not surprising to know that the default behavior of this mapping is parallel, i.e., a different thread of execution is triggered for each individual computation. The standard execution strategy in this this simple example is as follows:

  • The message was delegated to members of the ensemble, creating a parallel process for the evaluation of each of the messages. We can say that each member was treated individually and was executing its own task.
  • For generating the final result, a new ensemble was created from the results of all the tasks of the previous step. We can say that we are ensembling together all the individual results.

Notice that we have put in bold letters two words: individually and ensembling. We have done so, because both words characterize a strategy for handling a parallel computation. Actually, this is the default strategy in Sly3, but the programming model was designed to be able to customize these behaviors.

Adverbs and Gerunds, or How to Modulate Parallel Behavior

The basic model of message handling for an ensemble is:

  1. The ensemble that receives the message is treated in accordance to its adverb. A receiver’s adverb precedes the message name. In the absence of an explicit one, the default adverb for receiver, namely individualLY, is used. We elaborate on the meaning of individualLY in subsequent paragraphs, important here is to note that adverbs’ names always end in LY.
  2. If there are operands, they have to be treated in accordance with their adverbs. An operand adverb follows the message name. In the absence of an explicit one, the default adverb for operands, namely wholLY, is used.
  3. The operation is executed according to an overall strategy, also determined by an adverb.
  4. The result of the operation is reduced in accordance to a gerund. In the absence of an explicit one, the default gerund, namely ensemblING, is used. Gerunds’ names always end in ING.

Regarding the point 3, it suffices to mention that currently there are only three overall strategies, namely, serially, plainly and the default one, which we can call parallelly. Note that the metaphor of the adverb is taken to an extreme in Sly3. Many words that do not sound like adverbs are adverbialized in order to fit in the conceptual framework. Serially forces serial executions, so no processes/threads are spawned to compute the results. This can be useful either for semantic reasons or when the cost of parallel execution is to high, i.e., when we are below the sequential threshold. The adverb plainly can be used to turn off ensemble behavior, and therefore, to treat the ensemble as a collection (e.g., we could ask for the size of an ensemble by using plainly, and this would not imply sending a message to each of its members). For the sake of brevity, we are not going to discuss the overall strategies further, since they are not relevant for grasping the essence of the Sly3 programming model.

Adverb Informal description
wholLY just pass me along as a whole ensemble
individualLY create a process/thread for each of my members
collectionLY treat me as a collection instead of an ensemble
duplicativeLY copy each of my members, to produce a new ensemble
roundLY create a process for each combination of the receiver’s members and my members
randomLY chose n of my members randomly, n is a parameter of this adverb
valueLY take a block and apply it to each of my members
Table 1: Some of the principal adverbs in Sly3.

The syntax of Sly3 is cumbersome, so next we present some examples in order to provide a more clear idea of how to express things in this language. Let’s analyze them in the light of three of the steps presented before (1, 2 and 4; 3 is excluded since are not going to discuss the overall strategy). In order to guide our analysis, figure 1 presents a table with brief descriptions of some of the main adverbs in Sly3, and their impact on the message evaluation process. We have not included a similar table for gerunds, because their names are in general more self descriptive.

Example 1

    %{true. true. false. true} andING
  
  • In this example, there are no adverbs specified, therefore, the receiver’s adverb is the default one, i.e., individually. This implies that the processing of the members is done independently (i.e. in different Smalltalk processes).
  • There are no arguments, therefore, no argument’s adverb.
  • There is a gerund andING that acts on the result of the operation reducing the independent results coming from different processes. As its name indicates, andING performs logical and over the members of the resulting ensemble. Notice that there no actual message specified here, thus, andING is applied to unchanged set of members of the ensemble.
  • Therefore, the result of this operation is: false.

Example 2

    %{1. 2. 3} plusRoundLY: %{10. 20. 30}
  
  • Here are no adverbs specified for the receiver, therefore, the receiver’s adverb is the default one, i.e., individually. This implies that the processing of the members is done independently (i.e. in different processes/threads).
  • roundLY is an argument to the operand (%10. 20. 30 ). This adverb says: create a process/thread for each combination of the receiver’s member and the argument’s member; acting as a cartesian product.
  • There is no gerund, which means that the default one (ensemblING) is used.
  • Therefore, the result of this operation is: %{11. 12. 13. 21. 22. 23. 31. 32. 33}.

Example 3

    %{1. 2. 3} valueLY: [ :x  | x * 10]
  
  • valueLY is an argument of the receiver, therefore, the receiver has to be mapped to a new ensemble using the given block as the mapper.
  • There are no arguments, therefore, no argument’s adverb. Recall that the block is an argument to the adverb. As we can see, the problems of Sly3′s syntax become evident in these cases.
  • Therefore, the result of this operation is: %{10. 20. 30}.

An Overview of Sly3′s Implementation

In order to have a comprehensive view of the spectrum of the current custom behaviors available for direct use, in figure 1 we show the hierarchy of gerunds and adverbs that we can find in the current image of Sly3. Notice that this forms a restricted vocabulary that we can use as a basis of more complex strategies, in a LEGO-like way. In fact, we can see this hierarchy as a custom domain-specific language suited for the description of common strategies for handling parallelism.

Figure 1: The hierarchy of modifiers of Sly3.

Based on what we have discussed so far, we can characterize Sly3 as:

  • An Object-oriented language, descendant of Smalltalk, which incorporates ensembles as first-class entities to model parallelism.
  • It features parallelism by default. As soon as a message is sent to an ensemble, it is assumed that the processing will be spawned as independent tasks over the members of this ensemble.
  • Implicit parallelism. The processes that handle messages for ensembles are spawned under the hood, and programmers interact without having to switch to a parallel mindset.
  • The way the receiver and the operands of a message are treated is specified by a set of adverbs.
  • The way the result is reduced is specified by a set of gerunds.
  • The messages are not reified as for instance in the case of Communicating Sequential Processes or actors.
  • Communication is assumed to be synchronous.
  • The processes are not reified as in the case of actor-based computing.

A Note on the Evaluation Process

The Sly3 language is implemented as an extension of a Smalltalk Virtual Machine, the RoarVM. The current strategy is that whenever a message to an ensemble is detected, a custom message dispatcher is launched in order to handle it. Thus, the message is parsed and evaluated at runtime. The message is analyzed by partitioning the receiver according to its adverb, and pairing it with the arguments according to arguments’ adverbs. Then, the actual operation is executed in whichever set of operands resulted from the previous processing. The final result is reduced according to the gerund. When we use the expression according to, it is worth to highlight that the actual behavior is encoded in the classes of the gerunds an adverbs.

Below we give as example the implementation of the totallING gerund. In the image it is the only method of the class Sly3ModTotalling. Notice, that it nothing more but the implementation of a simple summing policy.

    ; method of the class Sly3ModTotalling
    reduceForEvaluator: ev
      "Sum the elements in results and return the result.
      Assumes these elements understand +."
      | result results |
      results := ev result.

      results isEmpty ifTrue:
        [ self error: 'No members, how should we handle?  proceed for nil'.  
          ^nil ].

      result := 0.
      results do: [ :each | result := each + result ].
      ev result: result
  

Below we give the code for the two relevant method of the Sly3ModCollectionly class that implements the collectionLY adverb in Sly3. Although it is necessary to deeply understand the behavior of the process of evaluating a message in order to understand this code, at least we can have an intuition that what is happening at some point is that a collection is created and returned (see line 17). Eventually, this is how the creation of a collection from an ensemble (which is what collectionLY does) is implemented.

    ; methods of the class Sly3ModCollectionly
    amendOperandTuples: operandTuplesIncludingThisOne
    	| ens |
    	ens := (operandTuplesIncludingThisOne collect: [:ot | ot last]).
    	^ operandTuplesIncludingThisOne collect: [:ot |
    		  ot copy removeLast; addLast: ens; yourself]


    extendOperandTuples: operandTuplesSoFar 
               operand: operand
               membersOrNil: operandMembers
    	| mems |
    	mems := operandMembers ifNil: [operand] 
                                ifNotNil: [operandMembers].
    	^ operandTuplesSoFar
    	    ifEmpty: [OrderedCollection with: 
                             (OrderedCollection with: mems)]
    	    ifNotEmpty: [ operandTuplesSoFar collect: 
                               [:tuple | tuple copy addLast: mems; yourself]]
  

Therefore, each modifier in Sly3 is self descriptive and fully determines the impact that their presence provokes on the evaluation process.

Implementing MapReduce in Sly3

In order to show an example of how to use Sly3, we have implemented a naive MapReduce framework. Take a look at the course notes for an Erlang implementation of the same examples.

It is worth to stress that the conciseness of the code that we eventually obtain is a consequence of both using Sly3 and the application of an Object-oriented approach. The framework relies on ensembles that receive messages. By doing that, the messages are implicitly sent in parallel, so we do not have to specify this in the code. We next discuss the core excerpts of the code.

Imagine two well-known applications of MapReduce. First, we want to count the number of occurrences of words in a set of documents. Another application is to know in which documents some words are found. In order to do that, the MapReduce paradigm requires us to provide the problem-specific knowledge by means of a map procedure and a reduce procedure.

Since Sly3 being an object-oriented language, we have modeled the MapReduce problem-independent behavior as a class MapReduce. Any particular implementation has to subclass this class and override two abstract methods, namely, mapForKey:value: and reduceForKey:values:.

Facing the problem of counting words in a document, below are the methods of the class CountWordsMapReduce that implement this behavior in Sly3. Notice that for sakes of simplicity, we have modeled a filesystem as a Dictionary whose keys are strings that represent file names and whose values are ensembles corresponding to the set of words in a file.

    "Specific methods of the class CountWordsMapReduce"
    mapForKey: ignore value: fileName
    	| words |
    	words :=  self consult: fileName.
    	^ words valueLY: [:y| {y. 1}]

    reduceForKey: word values: counts
    	^ {word. counts totallING}

    consult: file
    	^ files at: file
  

Facing the problem of indexing the texts using the words that they contain, below are the methods of the class TextIndexingMapReduce that implement this behavior in Sly3.

    "Specific methods of the class TextIndexingMapReduce"
    mapForKey:  ignore value: fileName
    	|words |
    	words :=   self consult: fileName. 
    	^ words valueLY: [:word| {word. fileName}]

    reduceForKey: word  values: names
    	^ {word. names members asSet}

    consult: file
    	^ files at: file
  

Notice that in both cases, it suffices to use combinations of the existing adverbs to achieve the desired functionality (see lines 5 and 8 of the first listing and line 5 in the second one). Moreover let’s now analyze the class MapReduce, that implements all the machinery of our naive MapReduce implementation.

    "Specific methods of the class MapReduce"
    dictionaryToList: dict
    	|lst|
    	lst := OrderedCollection new.
    	dict associationsDo: 
              [:assoc| lst add:{assoc key. 
                       Sly3Ensemble 
                         withMembersFrom: assoc value}].
    	^ Sly3Ensemble withMembersFrom: lst

    do: input
    	|dict  nonFlatValues flatValues intermediateValues|
    	nonFlatValues:= input valueLY:  [ :pair |
         self mapForKey: (pair at: 1) value: (pair at: 2)].
    	flatValues := nonFlatValues members 
          inject:  (OrderedCollection new) 
          into: [:acc :cur | 
                  acc addAll: 
                        cur members asOrderedCollection. 
                  acc].
    	dict := self groupValuesByKey: flatValues.
    	intermediateValues:= self dictionaryToList: dict.
    	^ intermediateValues valueLY:
           [ :pair | self reduceForKey: (pair at: 1) 
                          values: (pair at: 2) ]

    groupValuesByKey: keyValues
    | dict|
    dict := Dictionary new.
    keyValues do:[:keyValue |
    	| key value |
    	key := keyValue at: 1.
    	value := keyValue at: 2.
    	(dict includesKey: key)
          ifTrue: 
            [(dict at:key) add: value]
          ifFalse: 
            [dict at: key 
                  put: (OrderedCollection newFrom: {value})] .
    	].
    ^ dict

    mapForKey: key value: value
    	self subclassResponsibility.

    reduceForKey: key values: values
    	self subclassResponsibility.
  

As it is possible to see in lines 9, 13 or 23 of the above code, the framework itself also profits from the fact that the data structures that are being passed are ensembles. By having ensembles as the main element being passed around, it is possible to have implicit parallel behavior, which is the key of the MapReduce approach. Evidently, in this case we are discussing a very limited version of MapReduce meant to be used on a single computer, but it is still useful to show how the implicit parallelism in Sly3 works.

Conclusions

In the spectrum of language abstractions for dealing with parallelism and concurrency, Sly3 is a language that has opted for a conservative style, in the sense that it is a specific element of the language (an ensemble) the one that is subject of parallelism. On the other hand, the stress has been put on giving a set of tools to programmers, enabling them to combine these tools. For customized strategies, it is the impression of the author of this article that the level of granularity is too coarse-grained. Although it is conceivable to extend the set of adverbs and gerunds by extending the hierarchy aforementioned, this implies to understand the Sly3 implementation in very detail, something that we most likely do not expect from application programmers. It is possible to think of a meta-programming framework that could inject custom functionalities in the language, lowering the complexity for extending the set of primitives.

Regarding the approach to parallelism, it is implicit and synchronous. Programmers are not aware of how computations are spawned or scheduled, and the only means they have to initiate parallel computations is by using ensembles.

As regards to the syntax, Sly3 uses decorated keyword based messages to model the way the message is handled by the virtual machine. Although it is conceivable that this schema can be improved, there are some problems originated from the fact that adverbs can decorate receivers, arguments, and the overall strategy. If we add to this complexity the gerunds, it is clear that it is difficult to express these multidimensional facets using a textual syntax.

To sum up, Sly3 is a language that incorporates in its design a series of patterns to deal with parallelism, but the mechanism for extending the set of parallel primitives could be reconsidered, in order to reduce the complexity of extending Sly3. However, the MapReduce example has shown that it is possible to concisely express complex patterns of interactions in Sly3, as soon as these patterns fall within the give set of Sly3‘s primitives.

Bibliography

UA10
David Ungar and Sam S. Adams.
Harnessing emergence for manycore programming: early experience integrating ensembles, adverbs, and object-based inheritance.
In Proceedings of the ACM international conference companion on Object oriented programming systems languages and applications companion, SPLASH ’10, pages 19-26, New York, NY, USA, 2010. ACM.

The Price of the Free Lunch: Programming in the Multicore Era

Last Friday was the annual Lab event of our Software Languages Lab. Like last year, many people related to the lab in one or the other way came to get an overview of what the current topics of our research are.

This year, we presented our research in the form of a Pecha Kucha talk. That means every presenter got 20 slides to present and each of the slides was shown exactly 20 seconds. That gives enough time to convey the general idea, but avoids boring the people with endless technical details.

All in all, that worked out pretty well.

My talk gave an overview of what the Parallel Programming Group is up to, on a very high level. It motivates why we are doing research in languages and language runtimes/virtual machines, and names our approaches to tackle the challenges. Well, for researchers in the field that is probably to vague, but everyone else might get just enough out of it to see in which direction we are going.

 

 

Trends in Concurrency 2010, part 1

I already posted the presentation I gave at the summer school earlier. In the following posts, I will report a bit about the lectures of the summer school, similar to my posts about the TPLI summer school of last year.

Like last year, my thanks go to FWO and the summer school, for providing me with grants to cover all the costs. Thanks :)

The first two days covered four different topics. Adam Welc started with an introduction to Software Transactional Memory, Jean-Pierre Talpin gave lectures on the topic Virtual Prototyping Embedded Architectures, Satnam Singh reported on his experience with Data-Parallel Programming, and last but not least, Peter Sewell introduced us to the mystics of hardware memory models with his talk titled Low-Level Concurrency – For Real.

Adam Welc: Software Transactional Memory

Adam, who works for Intel Labs, talked in his first presentation mainly about the semantics of STM and the different approaches for designing a STM system. The second presentation concentrated on how the implementation of such a system could work.

His presentations were very much in the spirit of the STM system Intel provides as a prototype, but the general ideas are used in other systems as well.

The different options for semantics of an STM included open or closed transaction nesting, transaction flattening as well as abort and retry. Furthermore, he discussed the interaction for instance with legacy code, which leads to the question of weak and strong atomicity properties, or the semantics of single global lock atomicity. The last semantic question he discussed was how consistency should be provided, and how pointer-privatization could be supported by the system. Afterwards, he outlined the design space for STM systems, introducing optimistic vs. pessimistic transactions as well as write buffering vs. in-place updates.

In the second talk, he presented the implementation details.

For their C++-based STM the compiler needs additional information about the characteristics of all code, which is done by annotations on class and method level. Similar, the compiler supports atomic blocks, abort, and retry statements. Beside all the implementation details, he also discussed several optimizations the compiler attempts to reduce the STM overhead with regard to read/write-barriers wherever possible. My gut feeling tells me, someone should write a STM+TracingJIT story. Not sure whether that buys anything, but what I have seen with STM looks a lot like the trace-guards.

The last slides he presented were about performance. The basic conclusion still is: performance is an issue. However, he gave me an interesting pointer to one of his papers, which basically states, that transaction can be used for looks without contention, quite straight forwardly: Transparently reconciling transactions with locking for Java synchronization.

Jean-Pierre Talpin: Virtual Prototyping Embedded Architectures

Unfortunately, I missed the introductorily part of these lectures due to problems with our accommodation *sigh*. However, as far as I understood, in the domain of embedded systems for large scale applications like aircrafts, there is always a need to combine a large number of different technologies, and the final goal is to simulate, analyze, and verify the software before it is deployed into production use. Especially the verification part is important for critical systems like in avionics.

The systems in this domain are mostly event driven, i.e., in his terminology it is an asynchronous composition of reactive processes.

For these purpose, they developed Polychrony including an Eclipse integration. It allows using data-flow models for computation and mode automata for control. These models are verified with model-checking techniques and allow controller synthesis.

At the heart of this system is a data-flow language, which allows expressing such an asynchronous event system in terms of synchronous modules, which have better characteristics with respect to verification. Furthermore, they established the means to prove certain correctness criteria.

For an aerospace project, they used this as a basis and code-generation framework. One of their techniques uses a SSA as an intermediate representation to model existing software. They use SSA to translate the input to their synchronous data-flow formalism enabling their analysis’.

Satnam Singh: Data-Parallel Programming

The motivation behind his work is to allow users to utilize parallelism in a civil manner, i.e., with a reasonable effort/gain ratio. His premise is that programming models like SIMD, OpenMP, and MPI are inherently low-level and require very detailed knowledge about the target architecture, which restricts their applicability. Not only increases it the development cost, but it also restricts the application to a single platform. With the Microsoft Accelerator, they develop a high-level data-parallel library to target various platforms, starting with standard multi-core CPUs with SIMD instruction extensions, GPGPUs, and FPGAs.

The user describes the computation by defining a data-flow graph, which describes the intention instead of the detailed mapping to an execution. Most notably, they provide parallel array structures, on which operations can be performed without necessarily using an index, but deferring these details to the runtime, which can optimize the relevant array operations for the target architecture. Furthermore, it provides certain standard mechanisms to work in a data-parallel way. This includes various operations on the parallel arrays, which for instance can be combined with appropriate reduce operations. An important operation he emphasized is the array shift operation, which is extensively used to describe necessary data transformations. Instead of doing these kind of transformations explicitly, the runtime system can highly-optimize them to the target platform.

The resulting program description is just-in-time compiled for the target machine, and only then executed. For the typical problems they are approaching, the construction of the program describing graph and the JIT overhead are negligible.

However, this form of nested data parallelism brings also some problems. Especially the distribution of the problem over the computing platform is hard since the shape of the resulting computational graph depends on the input data.

Peter Sewell: Low-Level Concurrency – For Real

The basic conclusion of his talk is: it is impossible to write portable, correct, and efficient concurrent code for any existing hardware platform, because they do not actually tell you all the constrains/properties in their specifications.

So he basically starts to rant about all processor manufacturers and their specs. He starts with a seemingly simple example, and demonstrates that the expected outcome is not guaranteed.

Then he goes on and describes various examples for x86 and relates them to the tricks a CPU applies, which result in rather weak memory models.

Based on that, he presents a model they developed called x86-TSO (Total Store Order), which allows to reason about the correctness of concurrent x86 programs. This model is eventually used to introduce the notion of triangular races. This allows to reason about a class of programs, which can contain data races, for instance lock implementations.

From here, he goes on to rant about all the other existing hardware architectures and their semantics or lack of specification. They tried for years to come up with a formal model like x86-TSO but failed.

To conclude his lectures, he presents 10 different options how to define a sane memory model for instance for a programming language. He starts out with option 1: DON’T i.e. no concurrency. Well, my personal impression still is, that this is his favorite option. It has several advantages. Especially, it is very simple. In the end he comes to option 10, which means to have a memory model similar to C++0x. It is a data-race free model, which provides low-level concurrency primitives, and allows to break abstractions. Beside the notion of the data-race free part, that is something, I have expected/thought about earlier for a good VM.