Publishing can really be an odyssey, and it all started with my Master thesis. Today, we have the 28th of December 2011. And I think I handed my thesis in somewhere around the 23rd of September 2008. I agree that there have been issues with the original version of the paper, but nothing was so fundamental that it would explain the more than three years it took to get it finally out. Thank for the warm welcome anyway.
Abstract
CSOM/PL is a software product line SPL derived from applying multi-dimensional separation of concerns MDSOC techniques to the domain of high-level language virtual machine VM implementations. For CSOM/PL, we modularised CSOM, a Smalltalk VM implemented in C, using VMADL virtual machine architecture description language. Several features of the original CSOM were encapsulated in VMADL modules and composed in various combinations. In an evaluation of our approach, we show that applying MDSOC and SPL principles to a domain as complex as that of VMs is not only feasible but beneficial, as it improves understandability, maintainability, and configurability of VM implementations without harming performance.
CSOM/PL: A Virtual Machine Product Line, Michael Haupt, Stefan Marr, Robert Hirschfeld, Journal of Object Technology, Volume 10, (2011), pp. 12:1-30.
Evaluating benchmark results with Excel became
too cumbersome and error prone for me so that I needed an alternative.
Especially, reevaluating new data for the same experiments was a hassle.
However, the biggest problem with Excel was that I did not know a good
way to query the raw data sets and group results easily to
be able to answer different kind of questions about the data.
Thus, I decided I need to learn how to do it in a better way.
Since, I was not really happy with the debuggability, traceability,
and reusability of my spreadsheets either, I gave up on Excel entirely.
I bet, Excel can do most of the things I need, but I wanted a text-based
solution for which I can use my normal tools, too.
While working on ReBench,
I got already in touch with
matplotlib to generate
simple graphs from benchmark results automatically.
But well, Python does not feel like the ultimate language for what I was
looking for either. Instead, R was
mentioned from time to time when it came to statistical evaluation of
measurements. And, at least for me, it turned out to be an interesting
language with an enormous amount of libraries. Actually, a bit
to enormous for my little needs, but it looked like a good starting
point to brush up on my statistics knowledge.
By now, I use it regularly and applied it to a number of problems,
including my work on the RoarVM
and a paper about
CSOM/PL.
Benchmark Execution
Before we can analyze any benchmark results, we need a reliable way to
execute them, ideally, as reproducible as possible. For that purpose,
I use a couple of tools:
A Python application that executes benchmarks based on a given
configuration file that defines the variables to be varied,
the executables to be used and the benchmarks and their
parameters.
This is mostly for the bigger picture, a web application
that provides the basic functionality to track long-term
performance of an application.
Beside having good tools, serious benchmarking requires some background
knowledge. Today’s computer systems are just to complex to get good
results with the most naive approaches.
In that regard, I highly suggest to read the following two
papers which discussing many of the pitfalls that one encounters when
working with modern virtual machines, but also just on any modern
operating system and with state-of-the-art processor tricks.
Blackburn, S. M.; McKinley, K. S.; Garner, R.; Hoffmann, C.; Khan, A. M.; Bentzur, R.; Diwan, A.; Feinberg, D.; Frampton, D.; Guyer, S. Z.; Hirzel, M.; Hosking, A.; Jump, M.; Lee, H.; Moss, J. E. B.; Phansalkar, A.; Stefanovik, D.; VanDrunen, T.; von Dincklage, D. & Wiedermann, B.
Now, we need to find the right tools to get started with R.
After searching a bit, I quickly settled on
RStudio. It feels pretty much like
a typical IDE with a REPL
at its heart. As you can see in the screenshot
below, there are four important areas. Top-left are your *.R files
in a code editor with syntax highlighting and code completion. Top-right
is your current R workspace with all defined variables and functions.
This allows also to view and explore your current data set easily.
On the bottom-right, we got a window that usually contains either
help texts, or plots.
Last but not least, in the bottom-left corner is the actual REPL.
Rscript is the interpreter that can be used on the command-line
and thus, is nice to have for automating tasks. I installed it using MacPorts.
This introduction is accompanied by a
script.
It contains all the code discussed here and can be used to follow the
introduction avoiding to much copy’n'past. Furthermore, the code here
is not as complete as the script, to be a little more concise.
The first thing to do after setting up R is to install some libraries.
This can be done easily from the REPL by executing the following code:
install.packages("plyr"). This will
install the plyr library providing ddply(),
which we are going to use later to process our data set.
Before it is going to install the code, it will ask for a mirror nearby
to download the necessary files.
To visualize the benchmark results afterwards, we are going to use bean plots
from the beanplot library. The doBy library
provides some convenience functions from which we are going to use
orderBy().
After the installing all libraries, they can be loaded by executing
library(lib_name).
The documentation for functions, libraries, and other topics is always
just a question mark away. For instance, to find out how libraries are installed,
execute the following code in the REPL: ?install.packages.
Preprocessing Benchmark Results
To give a little context, the experiment this introduction is based on
tries to evaluate the impact of using an object table instead of direct
references on the performance of our manycore Smalltalk, the RoarVM.
The data set is available here:
object-table-data.csv.bz2.
Download it together with the script and
put it into the same folder.
After opening the script in RStudio, you can follow the description here
while using the code in RStudio directly. The run-button will execute a
line or selection directly in the REPL.
Loading Data
First, change the working directory of R to the directory where you put
the data file: setwd("folder/with/data-file.csv.bz2/").
The file is a compressed comma-separated values file that contains
benchmark results, but no header line. Furthermore, some of the rows do not
contain data for all columns, but this is not relevant here and we can just
fill the gabs automatically.
Load the compressed data directly into R’s workspace by
evaluating the following code:
As you can see in the workspace on the right, the variable bench
contains now 60432 observations for 10 variables. Type the following
into the REPL to get an idea of what kind of data we are looking at:
summary(bench), or View(bench).
Just typing bench into the REPL will print the plain content,
too.
As you will see, every row represents one measurement with a set of
properties, like the executed benchmark, its parameters, and the virtual
machine used.
Transforming Raw Data
As a first step, we will do some parsing and rearrange the data to
be able to work with it more easily. The name of the virtual machine binary
encodes a number of properties that we want to be able to access directly.
Thus, we split up that information into separate columns.
bench <- ddply(bench,
~ VirtualMachine, # this formula groups the data by the value in VirtualMachine
transform,
# the second part of the VM name indicates whether it uses the object table
ObjectTable = strsplit(as.character(VirtualMachine), "-")[[1]][2] == "OT",
# the third part indicates the format of the object header
Header = factor(strsplit(as.character(VirtualMachine), "-")[[1]][3]))
This operation could use some more explanation, but most important
know now is that we add the two new columns and that the data
is being processed grouped by the values in the VirtualMachine
column.
The data set contains also results from another experiment and some of the
benchmarks include very detailed information. Neither of those information
is required at the moment and we can drop the irrelevant data points
by using subset.
bench <- subset(bench,
Header == "full" # concentrate on the VM with full object headers
& Criterion == "total", # use only total values of a measurement
select=c(Time, Platform, # use only a limited number of columns
ObjectTable, Benchmark, ExtraArguments, Cores))
Furthermore, we assume that all variations in measurements come from the
same non-deterministic influences. This allows to order the measurements,
before correlating them pair-wise. The data is order based on all columns,
where Time is used as the first ordering criterion.
In the next step, we can calculate the speed ratio between pairs of
measurements. To that end, we group the data based the unrelated variables
and divide the measured runtime by the corresponding measurement of the VM
that uses an object table. Afterwards we drop the measurements for the VM
with an object table, since the speed ratio here is obviously always 1.
Now we are at a point were we can start to make sense of the benchmark
results.
The summary function provides now a useful overview of the data and
we can for instance concentrate on the speed ratio alone.
As you might expect, you can also get properties like the standard
deviation, arithmetic mean, and the median easily:
However, that is very simple, a bit more interesting are R’s features
to query and process the data. The first question we are interested
in is, whether there is actually an impact on the performance difference
for different numbers of cores:
Starring at numbers is sometimes informative, but usually only useful
for small data sets.
Since we humans are better in recognizing patterns in visual representations,
beanplots are a better
way to make sense of the data. They are an
elegant visualization of the distribution of measurements.
beanplot(SpeedRatio ~ Platform,
data = norm_bench,
what = c(1,1,1,0), log="",
ylab="Runtime: noOT/OT",
las=2)
This beanplot tells us, like the numbers before, that the mean is smaller
than 1, which is the expected result and means that the indirection
of an object table slows the VM down. However, the numbers did not tell
use that there are various clusters of measurements that now become visible
in the beanplot and are worth investigating.
Let’s look at the results split up by number of cores.
beanplot(SpeedRatio ~ Cores,
data = norm_bench,
what = c(1,1,1,0), log="",
ylab="Runtime: noOT/OT")
Digging Deeper
While noticing that the variant without object table is faster
for more cores, we also see that the speed ratios are distributed unevenly.
While the bean for 1-core has a 2-parted shape, the 2-cores bean has a lot
more points where measurements clump.
That is probably related to the benchmarks:
beanplot(SpeedRatio ~ Benchmark,
data = subset(norm_bench, Cores==2),
what = c(1,1,1,0), log="",
ylab="Runtime: noOT/OT", las=2)
beanplot(SpeedRatio ~ Benchmark,
data = subset(norm_bench, Cores==16),
what = c(1,1,1,0), log="",
ylab="Runtime: noOT/OT", las=2)
Those two graphs are somehow similar, but you might notice that the float
loop benchmark has a couple of strong outliers.
I forgot the exact identifier of the float loop benchmark, so lets find out
with this code: levels(norm_bench$Benchmark)
Now let’s filer the data shown by that benchmark.
beanplot(SpeedRatio ~ Cores,
data = subset(norm_bench, Benchmark == "SMarkLoops.benchFloatLoop"),
what = c(1,1,1,0), log="", ylab="Runtime: noOT/OT")
This visualization reassures me, that there is something strange going on.
The distribution of result still looks clumped as if there is another
parameter influencing the result, which we have not regarded yet.
The only column we have not looked at is ExtraArguments,
so we will add them. Note that droplevels() is applied on the
data set this time before giving it to the beanplot() function.
This is necessary since the plot would contain all unused factor levels
instead, which would reduce readability considerably.
beanplot(SpeedRatio ~ Cores + ExtraArguments,
data = droplevels(subset(norm_bench, Benchmark == "SMarkLoops.benchFloatLoop")),
what = c(1,1,1,0), log="", ylab="Runtime: noOT/OT", las=2)
Fixing up a Mistake
This plot now shows us that there are three groups of results with
different ExtraArguments.
Think I forgot that the data set contains some very specific benchmarks.
The overall goal of the benchmarks is to test the weak scaling behavior of
the VM by increasing work-load and number-of-cores together.
However, for the questions we are interested here, only those weak-scaling
benchmarks are of interest.
Thus, we need to filter out a couple of more data points, since the results
will be unnecessarily biased otherwise.
To filter the data points we use grepl.
It matches the strings of ExtraArguments and allows us to
filter out the single-core and the 10x-load benchmarks.
norm_bench <- subset(norm_bench,
!grepl("^1 ", ExtraArguments) # those beginning with "1" put load on a single core
& !grepl("s0 ", ExtraArguments)) # those having "s0" in it put 10x load on each core
norm_bench <- droplevels(norm_bench)
And since I always wanted to say that:
The exercise to regenerate all interesting graphs and re-answer the
original questions are left to the interested reader.
Conclusion
As a final graph, we can plot all benchmarks for all cores,
and get an overview. The par function allows to adapt the
margins of the plot, which is necessary here to get the full benchmark
names on the plot.
par(mar=c(20, 4, 1, 1))
beanplot(SpeedRatio ~ Cores + Benchmark,
data = norm_bench,
what = c(1,1,1,0), log="", ylab="Runtime: noOT/OT", las=2)
As we already knew, we see an influence of the number of cores on the
results, but more importantly, we see most benchmarks benefitting from removing
the extra indirection through the object table. The float loops benefit by
far the strongest. The float objects are so small and usually used only
temporary that avoiding the object table pays off.
For the integer loops it does not make a difference, since the VMs uses
immediate values (tagged integers). Thus, the integers used here are not
objects allocated in the heap and the object table is not used either.
Beyond the won insight into the performance implications of an object table
this analysis also demonstrates the benefits of using a language like
R. Its language features allow us to filter and reshape data easily.
Furthermore, regenerating plots and tracing steps becomes easy, too.
Here it was necessary since some data points needed to be removed from
the data set to get to reasonable results. Reexecuting part of the script
or just exploring the data is convenient and done
fast, which allows me to ask more questions about the data and understand the
measurements more deeply. Furthermore, it is less a hassle to reassess
the data in case certain assumptions have changed, we made a mistake,
or the data set changed for other reasons.
From my experience, it is much more convenient than Excel, but that might
just be because I spend more time on learning R than on learning Excel.
In case you try it out yourself, you will certainly want to experiment
with other types of visualizations, save them to files, etc.
A few of these things can be found in my benchmarking scripts on
GitHub: BenchR.
Today, I gave a talk at the ExaScience Lab, Intel Labs Europe in Leuven at IMEC. I talked mainly about the idea of nondeterministic programming, the Sly programming language and some details on our Smalltalk manycore virtual machine that enables those experiments. Thus, tried to spread the word about our Renaissance project at bit further.
Below you can find the abstract and slides.
Abstract
The manycore future has several challenges ahead of us that suggest that fundamental assumptions of contemporary programming approaches do not apply anymore when scalability is required.
Sly is a language prototype designed to experiment with the inherently nondeterministic properties of parallel systems. It is designed to enable programmers to embrace nondeterminism instead of guiding them to fight it. Nature shows that complex system can be built from independent entities that achieve a common goal without global synchronization/communication. Sly is design to enable the prototyping of algorithms that show such emerging behavior. It will be introduced in the first part of the talk.
The second part of the talk will focus on the underlying problems of building virtual machines for the manycore future, which allow to harness the available computing power. The RoarVM was design to experiment on the Tilera TILE64 manycore processor architecture which provides 64 cores and characteristics that are distinctly different from today’s commodity multicore processors. Memory bandwidth, caches and communication are the biggest challenges on such architectures and this talk will give a brief overview over the design choices of the RoarVM which tackle the characteristics of the TILE64 architecture.
Acknowledgement: Sly and the RoarVM were designed and implemented by David Ungar and Sam Adams at IBM Research.
The following introduction and analysis of the Sly3 programming language was written by Pablo Inostroza Valdera as part of his course work for the Multicore Programming course of Tom Van Cutsem. The assignment was to write a blog post about a topic of their own choice, and I repost Pablo’s work here with his permission to spread the word about Sly a bit wider. We made his article also available as part of his work for the Renaissance project itself.
Sly3 is a Smalltalk dialect that features two
concepts in order to deal with parallelism: ensembles, which are
collections of objects that can receive parallel messages, and modifiers,
which decorate messages in order to modify their parallel behavior
(adverbs) or to specify how to handle reductions
(gerunds). In this article we analyze Sly3‘s programming model.
Sly3 is a language being developed in the
context of the Renaissance Project, at
IBM. With Sly3, their authors David Ungar and Sam
Adams explore abstractions for nondeterministic parallel programming. Sly3 is the experimental successor of LY,
a javascript-like interpreted language that was implemented on top of
Smalltalk[UA10]. As the 3 in Sly3 indicates, Sly
itself underwent already three iterations of design. The main idea behind
the design of Sly3 is that multicore programming is
commonly addressed by restricting the nondeterminism parts and trying to
enforce strict determinism. As an example of this, Ungar and Adams refer to
the actors paradigm. While the actor model certainly adds some order to
concurrent computations, the truth is that some nondeterminism still
persists because the global order in which messages arrive to an actor is
nondeterministic. They proposed that instead of trying to restrict the
nondeterminism, it should be embraced. The Ly
language and, later on, the Sly3 language are steps
towards that direction. In this article we discuss the latter.
Ensembles, or How to Represent A Flock of Birds
Sly3‘s key elements were conceived by observing the
nature. Think of a flock of birds, an ant colony or a school of fishes.
These are all examples for a multitude of independent agents, acting in a
nondeterministic way, whose aggregated behavior is more complex than that
of the individuals. In other words, a more complex behavior emerges. The idea of a whole formed from independent units
is key to Sly3, and it has been reflected in the concept of
an ensemble. An ensemble is a collection of
objects, subject of receiving messages as a whole. In the following
example, we create an ensemble and execute an operation on it:
This code yields: %{1. 4. 9. 16. 25}. Recall that the curly
braces represent an array in Smalltalk. In Sly3, curly braces preceded by %
denote an ensemble.
Notice how the ensemble received a message as a whole, but it acted on
each individual in order to produce a new ensemble with all of the elements
squared. It is not surprising to know that the default behavior of this
mapping is parallel, i.e., a different thread of execution is
triggered for each individual computation. The standard execution strategy
in this this simple example is as follows:
The message was delegated to members of the ensemble, creating a
parallel process for the evaluation of each of the messages. We can say
that each member was treated individually and
was executing its own task.
For generating the final result, a new ensemble was created from the
results of all the tasks of the previous step. We can say that we are
ensembling together all the individual
results.
Notice that we have put in bold letters two words: individually and
ensembling. We have done so, because both words characterize a strategy for
handling a parallel computation. Actually, this is the default strategy in
Sly3, but the programming model was designed to be able to
customize these behaviors.
Adverbs and Gerunds, or How to Modulate Parallel Behavior
The basic model of message handling for an ensemble is:
The ensemble that receives the message is treated in accordance to
its adverb. A receiver’s adverb precedes the
message name. In the absence of an explicit one, the default adverb for
receiver, namely individualLY, is used.
We elaborate on the meaning of individualLY in subsequent
paragraphs, important here is to note that adverbs’ names always end in LY.
If there are operands, they have to be treated in accordance with
their adverbs. An operand adverb follows the
message name. In the absence of an explicit one, the default adverb for
operands, namely wholLY, is used.
The operation is executed according to an overall strategy, also
determined by an adverb.
The result of the operation is reduced in accordance to a
gerund. In the absence of an explicit one,
the default gerund, namely ensemblING, is used. Gerunds’ names
always end in ING.
Regarding the point 3, it suffices to mention that currently there are
only three overall strategies, namely, serially, plainly and the
default one, which we can call parallelly.
Note that the metaphor of the adverb is taken to an extreme in
Sly3. Many words that do not sound like adverbs are
adverbialized in order to fit in the
conceptual framework.
Serially forces serial executions, so no processes/threads
are spawned to compute the results. This can be useful either for semantic
reasons or when the cost of parallel execution is to high, i.e., when we
are below the sequential threshold. The adverb plainly can be used to turn off ensemble behavior, and
therefore, to treat the ensemble as a collection (e.g., we could ask for
the size of an ensemble by using plainly, and this would not imply sending
a message to each of its members). For the sake of brevity, we are not
going to discuss the overall strategies further, since they are not
relevant for grasping the essence of the Sly3 programming
model.
Adverb
Informal description
wholLY
just pass me along as a whole ensemble
individualLY
create a process/thread for each of my members
collectionLY
treat me as a collection instead of an ensemble
duplicativeLY
copy each of my members, to produce a new ensemble
roundLY
create a process for each combination of the receiver’s members and my members
randomLY
chose n of my members randomly, n is a parameter of this
adverb
valueLY
take a block and apply it to each of my members
Table 1: Some of the principal adverbs in Sly3.
The syntax of Sly3 is cumbersome, so next we present some
examples in order to provide a more clear idea of how to express things in
this language. Let’s analyze them in the light of three of the steps
presented before (1, 2 and 4; 3 is excluded since are not going to discuss
the overall strategy). In order to guide our
analysis, figure 1 presents a table with brief
descriptions of some of the main adverbs in Sly3, and their
impact on the message evaluation process. We have not included a similar
table for gerunds, because their names are in general more self
descriptive.
Example 1
%{true. true. false. true} andING
In this example, there are no adverbs specified, therefore, the
receiver’s adverb is the default one, i.e.,
individually. This implies that the
processing of the members is done independently (i.e. in different
Smalltalk processes).
There are no arguments, therefore, no argument’s adverb.
There is a gerund andING that acts on the result of the
operation reducing the independent results coming from different
processes. As its name indicates, andING performs logical
and over the members of the resulting
ensemble. Notice that there no actual message specified here, thus,
andING is applied to unchanged set of members of the
ensemble.
Therefore, the result of this operation is: false.
Example 2
%{1. 2. 3} plusRoundLY: %{10. 20. 30}
Here are no adverbs specified for the receiver, therefore, the
receiver’s adverb is the default one, i.e.,
individually. This implies that the
processing of the members is done independently (i.e. in different
processes/threads).
roundLY is an argument to the operand (%10. 20. 30
). This adverb says: create a process/thread for each combination of the
receiver’s member and the argument’s member; acting as a cartesian
product.
There is no gerund, which means that the default one
(ensemblING) is used.
Therefore, the result of this operation is: %{11. 12. 13. 21. 22.
23. 31. 32. 33}.
Example 3
%{1. 2. 3} valueLY: [ | x * 10]
valueLY is an argument of the receiver, therefore, the
receiver has to be mapped to a new ensemble using the given block as the
mapper.
There are no arguments, therefore, no argument’s adverb. Recall that
the block is an argument to the adverb. As we can see, the problems of
Sly3′s syntax become evident in these cases.
Therefore, the result of this operation is: %{10. 20.
30}.
An Overview of Sly3′s Implementation
In order to have a comprehensive view of the spectrum of the current
custom behaviors available for direct use, in
figure 1 we show the hierarchy of gerunds and
adverbs that we can find in the current image of Sly3.
Notice that this forms a restricted vocabulary that we can use as a basis
of more complex strategies, in a LEGO-like way. In fact, we can see this
hierarchy as a custom domain-specific language suited for the description
of common strategies for handling parallelism.
Figure 1: The hierarchy of modifiers of Sly3.
Based on what we have discussed so far, we can characterize
Sly3 as:
An Object-oriented language, descendant of Smalltalk, which
incorporates ensembles as first-class entities to model parallelism.
It features parallelism by default. As soon as a message is sent to
an ensemble, it is assumed that the processing will be spawned as
independent tasks over the members of this ensemble.
Implicit parallelism. The processes that handle messages for
ensembles are spawned under the hood, and
programmers interact without having to switch to a parallel mindset.
The way the receiver and the operands of a message are treated is
specified by a set of adverbs.
The way the result is reduced is specified by a set of gerunds.
The messages are not reified as for instance in the case of
Communicating Sequential Processes or
actors.
Communication is assumed to be synchronous.
The processes are not reified as in the case of actor-based
computing.
A Note on the Evaluation Process
The Sly3 language is implemented as an extension of a
Smalltalk Virtual Machine, the RoarVM. The current strategy is that
whenever a message to an ensemble is detected, a custom message dispatcher
is launched in order to handle it. Thus, the message is parsed and
evaluated at runtime. The message is
analyzed by partitioning the receiver according to its adverb, and pairing
it with the arguments according to arguments’ adverbs. Then, the actual
operation is executed in whichever set of operands resulted from the
previous processing. The final result is reduced according to the gerund.
When we use the expression according to, it is
worth to highlight that the actual behavior is encoded in the classes of
the gerunds an adverbs.
Below we give as example the implementation of the totallING
gerund. In the image it is the only method of the class
Sly3ModTotalling. Notice, that it nothing more but the
implementation of a simple summing policy.
; method of the class Sly3ModTotalling
reduceForEvaluator: ev
"Sum the elements in results and return the result.
Assumes these elements understand +."
| result results |
results := ev result.
results isEmpty ifTrue:
[ self error: 'No members, how should we handle? proceed for nil'.
^nil ].
result := 0.
results do: [ :each | result := each + result ].
ev result: result
Below we give the code for the two relevant method of the
Sly3ModCollectionly class that implements the
collectionLY adverb in Sly3. Although it is necessary to deeply
understand the behavior of the process of evaluating a message in order to
understand this code, at least we can have an intuition that what is
happening at some point is that a collection is created and returned (see
line 17). Eventually, this is how the creation
of a collection from an ensemble (which is what collectionLY does)
is implemented.
Therefore, each modifier in Sly3 is self descriptive and
fully determines the impact that their presence provokes on the evaluation
process.
Implementing MapReduce in Sly3
In order to show an example of how to use Sly3, we have
implemented a naive MapReduce framework. Take a look at the
course notes
for an Erlang implementation of the
same examples.
It is worth to stress that the conciseness of
the code that we eventually obtain is a consequence of both using
Sly3 and the application of an Object-oriented approach. The
framework relies on ensembles that receive messages. By doing that, the
messages are implicitly sent in parallel, so we do not have to specify this
in the code. We next discuss the core excerpts of the code.
Imagine two well-known applications of MapReduce. First, we want to
count the number of occurrences of words in a set of documents. Another
application is to know in which documents some words are found. In order to
do that, the MapReduce paradigm requires us to provide the problem-specific
knowledge by means of a map procedure and a reduce procedure.
Since Sly3 being an object-oriented language, we have
modeled the MapReduce problem-independent behavior as a class
MapReduce. Any particular implementation has to subclass this
class and override two abstract methods, namely, mapForKey:value:
and reduceForKey:values:.
Facing the problem of counting words in a document, below are the
methods of the class CountWordsMapReduce that implement this
behavior in Sly3.
Notice that for sakes of simplicity, we have modeled a filesystem as
a Dictionary whose keys are strings that represent file names
and whose values are ensembles corresponding to the set of words in a
file.
"Specific methods of the class CountWordsMapReduce"
mapForKey: ignore value: fileName
| words |
words := self consult: fileName.
^ words valueLY: [:y| {y. 1}]
reduceForKey: word values: counts
^ {word. counts totallING}
consult: file
^ files at: file
Facing the problem of indexing the texts using the words that they
contain, below are the methods of the class TextIndexingMapReduce
that implement this behavior in Sly3.
"Specific methods of the class TextIndexingMapReduce"
mapForKey: ignore value: fileName
|words |
words := self consult: fileName.
^ words valueLY: [:word| {word. fileName}]
reduceForKey: word values: names
^ {word. names members asSet}
consult: file
^ files at: file
Notice that in both cases, it suffices to use combinations of the
existing adverbs to achieve the desired functionality (see lines 5 and 8 of the
first listing and line 5 in the second
one). Moreover let’s now analyze the class MapReduce, that
implements all the machinery of our naive
MapReduce implementation.
As it is possible to see in lines 9, 13 or 23 of the above code,
the framework itself also profits from the fact that the data structures
that are being passed are ensembles. By having ensembles as the main
element being passed around, it is possible to have implicit parallel
behavior, which is the key of the MapReduce approach. Evidently, in this
case we are discussing a very limited version of MapReduce meant to be used
on a single computer, but it is still useful to show how the implicit
parallelism in Sly3 works.
Conclusions
In the spectrum of language abstractions for dealing with parallelism
and concurrency, Sly3 is a language that has opted for a
conservative style, in the sense that it is a specific element of the
language (an ensemble) the one that is subject of parallelism. On the other
hand, the stress has been put on giving a set of tools to programmers,
enabling them to combine these tools. For customized strategies, it is the
impression of the author of this article that the level of granularity is
too coarse-grained. Although it is conceivable to extend the set of adverbs
and gerunds by extending the hierarchy aforementioned, this implies to
understand the Sly3 implementation in very detail, something
that we most likely do not expect from application programmers. It is
possible to think of a meta-programming framework that could inject custom functionalities in the language, lowering the
complexity for extending the set of primitives.
Regarding the approach to parallelism, it is implicit and synchronous.
Programmers are not aware of how computations are spawned or scheduled, and
the only means they have to initiate parallel computations is by using
ensembles.
As regards to the syntax, Sly3 uses decorated keyword
based messages to model the way the message is handled by the virtual
machine. Although it is conceivable that this schema can be improved, there
are some problems originated from the fact that adverbs can decorate
receivers, arguments, and the overall strategy. If we add to this
complexity the gerunds, it is clear that it is difficult to express these
multidimensional facets using a textual syntax.
To sum up, Sly3 is a language that incorporates in its
design a series of patterns to deal with parallelism, but the mechanism for
extending the set of parallel primitives could be reconsidered, in order to
reduce the complexity of extending Sly3. However, the
MapReduce example has shown that it is possible to concisely express
complex patterns of interactions in Sly3, as soon as these
patterns fall within the give set of Sly3‘s primitives.
We are happy to announce, now officially, RoarVM: the first single-image manycore virtual machine for Smalltalk.
The RoarVM supports the parallel execution of Smalltalk programs on x86 compatible multicore systems and Tilera TILE64-based manycore systems. It is tested with standard Squeak 4.1 closure-enabled images, and with a stripped down version of a MVC-based Squeak 3.7 image. Support for Pharo 1.2 is currently limited to 1 core, but we are working on it.
A small teaser: 1 core 66286897 bytecodes/sec; 2910474 sends/sec 8 cores 470588235 bytecodes/sec; 19825677 sends/sec
RoarVM is based on the work of David Ungar and Sam S. Adams at IBM Research. The port to x86 multicore systems was done by me. They open-sourced their VM, formerly know as Renaissance VM (RVM), under the Eclipse Public License. Official announcement of the IBM source code release: http://www.stefan-marr.de/rvm-open-source-release/
The source code of the RoarVM has been released as open source to enable the Smalltalk community to evaluate the ideas and possibly integrate them into existing systems. So, the RoarVM is meant to experiment with Smalltalk systems on multi- and manycore machines.
The open source project, and downloads can also be found on GitHub:
For more detailed information, please refer to the README file. Instructions to compile the RoarVM on Linux and OS X can be found in the INSTALL file. Windows is currently not supported, however, there are good chances that it will work with cygwin or pthreads for win32, but that has not be verified in anyway. If you feel brave, please give it a shot and report back.
If the community does not object, we would like to occupy the vm-dev@lists.squeakfoundation.org mailinglist for related discussions. So, if you run into any trouble while experimenting with the RoarVM, do not hesitate to report any problems and ask any questions.
I started as a PhD student in October 2008 at the Software Languages Lab of the Vrije Universiteit Brussel. My field of research are virtual machines especially with respect to concurrency issues and language constructs for modularization. Recently, I started working on concurrency support at the VM instruction set level.