My understanding of the workshop’s format is that its goal is to encourage the participants to actively interact. Far to often, workshops are just a collection of semi-related presentations, without a common problem and without a common goal. I fear a bit, the MISS workshop will have a similar problem. Being part of the program committee, I have seen all the submissions and the author do tend to prefer business as usual over actual position papers. From my perspective, this is really a pity. It is a lost chance to really exchange ideas actively and perhaps start collaborations with interesting people. A technical paper, with a few ideas and a work-in-progress prototype does not qualify as a position paper in my opinion. Usually, that kind of work only encourages discussion between people that have been working on similar things already. But let’s see how it turns out.
Our contribution to the workshop is a little experience report on how concurrency and modularity are related to each other in interpreter implementations. And, to make it short: modularity does matter to manage concurrency invariants, but things like AOP are far less important than some people might hope.
Abstract
In this paper, we review what we have learned from implementing languages for
parallel and concurrent programming, and investigate the role of modularity.
To identify the approaches used to facilitate correctness and maintainability,
we ask the following questions: What guides modularization? Are informal
approaches used to facilitate correctness? Are concurrency concerns
modularized? And, where is language support lacking most?
Our subjects are AmbientTalk, SLIP, and the RoarVM. All three evolved over the
years, enabling us to look back at specific experiments to understand the
impact of concurrency on modularity.
We conclude from our review that concurrency concerns are one of the strongest
drivers for the definition of module boundaries. It helps when languages offer
sophisticated modularization constructs. However, with respect to concurrency,
other language features like single-assignment are of greater importance.
Furthermore, tooling that enables remodularization taking concurrency
invariants into account would be of great value.
Modularity and Conventions for Maintainable Concurrent Language Implementations: A Review of Our Experiences and Practices, Stefan Marr, Jens Nicolay, Tom Van Cutsem, Theo D’Hondt, Proceedings of the 2nd Workshop on Modularity In Systems Software (MISS’2012), ACM (2012), to appear.
Publishing can really be an odyssey, and it all started with my Master thesis. Today, we have the 28th of December 2011. And I think I handed my thesis in somewhere around the 23rd of September 2008. I agree that there have been issues with the original version of the paper, but nothing was so fundamental that it would explain the more than three years it took to get it finally out. Thank for the warm welcome anyway.
Abstract
CSOM/PL is a software product line SPL derived from applying multi-dimensional separation of concerns MDSOC techniques to the domain of high-level language virtual machine VM implementations. For CSOM/PL, we modularised CSOM, a Smalltalk VM implemented in C, using VMADL virtual machine architecture description language. Several features of the original CSOM were encapsulated in VMADL modules and composed in various combinations. In an evaluation of our approach, we show that applying MDSOC and SPL principles to a domain as complex as that of VMs is not only feasible but beneficial, as it improves understandability, maintainability, and configurability of VM implementations without harming performance.
CSOM/PL: A Virtual Machine Product Line, Michael Haupt, Stefan Marr, Robert Hirschfeld, Journal of Object Technology, Volume 10, (2011), pp. 12:1-30.
The second conference day was unfortunately full of “conflicts of interest”… It was pretty hard to choose between all the talks on the schedule.
Beside DLS and VMIL, there was also a workshop on Actor systems. However, I did not make it to this one. Cherry picking between DLS and VMIL was already hard enough and worked out only partially.
The two DLS keynotes were certainly highlights of the whole SPLASH conference. Gilad’s talk, on Dart was not just interesting but quite entertaining; especially their take on the optional type system looks useful. I like the reasoning behind their approach a lot. Who needs sound type systems anyway? Well, then again, I had a few bugs were a complete type system could have notified me earlier to make me aware that I am doing something stupid. So, I wonder whether it would be hard to have another optional type system in addition, which could be enabled on demand for certain modules to check a more strict set of constraints?
In the end, the parallel revolution will need to find the balance between performance and correctness, since correctness is always bought by introducing delays and thus hampers performance. Another point he made more implicitly is that programming these systems becomes not only harder because of the parallel aspect, but also because of the ever-growing complexity of the underlying hardware. We just do not know anymore what is going on and why the performance is not as expected.
The VMIL workshop started with Eliot’s presentation. But, well, if you know how Eliot presents his work on the Cog VM then you know, that it is likely that people not acquainted with the details of Smalltalk VMs have a very hard time to follow. His paper made the message a bit clearer, I think. His main point was the VM integration of the Bochs processor emulator for debugging.
After the first break, Lars talked a bit too much about Dart as a language, and the VM details were rather scarce. They said that they are going for simplicity, and the one aspect that I remember was that they do not have any static initialization to avoid ‘spontaneous’ executions. And well, their Smalltalk and Java history certainly shows between the lines.
The talk titled “Virtual machines should be invisible” was essentially about how to build a VM that reuses the standard systems ABI to integrate better with the surrounding system. They use the DWARF debugging format and dynamic linking techniques to reach that goal. Interesting but not platform-independent and I do not really see why that would be desirable or necessary for what I consider a VM.
The talk titled “Intermediate Language Extensions for Parallelism” was interesting, but I found that this IL is rather language dependent and is only applicable for a X10 or Habanero-ish language. Sure, the Cilk-style seems to be useful for a number of parallel problems, but it is not the only paradigm out there, so I wonder how that IL would look like if it would be designed for a multi-language VM.
As preparation for SPLASH’11, here my paper for the VMIL workshop. It is a position paper discussing in which direction virtual machines should evolve in the future with regard to the challenges manycore architectures and concurrent programming bring.
As I said, this is a position paper, which hopefully provokes discussion. Feedback of any kind is welcome, and I am happy to adapt my presentation accordingly.
Abstract
While parallel programming for very regular problems has been used in the scientific community by non-computer-scientists successfully for a few decades now, concurrent programming and solving irregular problems remains hard. Furthermore, we shift from few expert system programmers mastering concurrency for a constrained set of problems to mainstream application developers being required to master concurrency for a wide variety of problems.
Consequently, high-level language virtual machine (VM) research faces interesting questions. What are processor design changes that have an impact on the abstractions provided by VMs to provide platform independence? How can application programmers’ diverse needs be facilitated to solve concurrent programming problems?
We argue that VMs will need to be ready for a wide range of different concurrency models that allow solving concurrency problems with appropriate abstractions. Furthermore, they need to abstract from heterogeneous processor architectures, varying performance characteristics, need to account for memory access cost and inter-core communication mechanisms but should only expose the minimal useful set of notions like locality, explicit communication, and adaptable scheduling to maintain their abstracting nature.
Eventually, language designers need to be enabled to guarantee properties like encapsulation, scheduling guarantees, and immutability also when an interaction between different problem-specific concurrency abstractions is required.
Which Problems Does a Multi-Language Virtual Machine Need to Solve in the Multicore/Manycore Era?, Stefan Marr, Mattias De Wael, Michael Haupt, Theo D’Hondt, Proceedings of the 5th Workshop on Virtual Machines and Intermediate Languages, USA, ACM (2011), to appear.
Evaluating benchmark results with Excel became
too cumbersome and error prone for me so that I needed an alternative.
Especially, reevaluating new data for the same experiments was a hassle.
However, the biggest problem with Excel was that I did not know a good
way to query the raw data sets and group results easily to
be able to answer different kind of questions about the data.
Thus, I decided I need to learn how to do it in a better way.
Since, I was not really happy with the debuggability, traceability,
and reusability of my spreadsheets either, I gave up on Excel entirely.
I bet, Excel can do most of the things I need, but I wanted a text-based
solution for which I can use my normal tools, too.
While working on ReBench,
I got already in touch with
matplotlib to generate
simple graphs from benchmark results automatically.
But well, Python does not feel like the ultimate language for what I was
looking for either. Instead, R was
mentioned from time to time when it came to statistical evaluation of
measurements. And, at least for me, it turned out to be an interesting
language with an enormous amount of libraries. Actually, a bit
to enormous for my little needs, but it looked like a good starting
point to brush up on my statistics knowledge.
By now, I use it regularly and applied it to a number of problems,
including my work on the RoarVM
and a paper about
CSOM/PL.
Benchmark Execution
Before we can analyze any benchmark results, we need a reliable way to
execute them, ideally, as reproducible as possible. For that purpose,
I use a couple of tools:
A Python application that executes benchmarks based on a given
configuration file that defines the variables to be varied,
the executables to be used and the benchmarks and their
parameters.
This is mostly for the bigger picture, a web application
that provides the basic functionality to track long-term
performance of an application.
Beside having good tools, serious benchmarking requires some background
knowledge. Today’s computer systems are just to complex to get good
results with the most naive approaches.
In that regard, I highly suggest to read the following two
papers which discussing many of the pitfalls that one encounters when
working with modern virtual machines, but also just on any modern
operating system and with state-of-the-art processor tricks.
Blackburn, S. M.; McKinley, K. S.; Garner, R.; Hoffmann, C.; Khan, A. M.; Bentzur, R.; Diwan, A.; Feinberg, D.; Frampton, D.; Guyer, S. Z.; Hirzel, M.; Hosking, A.; Jump, M.; Lee, H.; Moss, J. E. B.; Phansalkar, A.; Stefanovik, D.; VanDrunen, T.; von Dincklage, D. & Wiedermann, B.
Now, we need to find the right tools to get started with R.
After searching a bit, I quickly settled on
RStudio. It feels pretty much like
a typical IDE with a REPL
at its heart. As you can see in the screenshot
below, there are four important areas. Top-left are your *.R files
in a code editor with syntax highlighting and code completion. Top-right
is your current R workspace with all defined variables and functions.
This allows also to view and explore your current data set easily.
On the bottom-right, we got a window that usually contains either
help texts, or plots.
Last but not least, in the bottom-left corner is the actual REPL.
Rscript is the interpreter that can be used on the command-line
and thus, is nice to have for automating tasks. I installed it using MacPorts.
This introduction is accompanied by a
script.
It contains all the code discussed here and can be used to follow the
introduction avoiding to much copy’n'past. Furthermore, the code here
is not as complete as the script, to be a little more concise.
The first thing to do after setting up R is to install some libraries.
This can be done easily from the REPL by executing the following code:
install.packages("plyr"). This will
install the plyr library providing ddply(),
which we are going to use later to process our data set.
Before it is going to install the code, it will ask for a mirror nearby
to download the necessary files.
To visualize the benchmark results afterwards, we are going to use bean plots
from the beanplot library. The doBy library
provides some convenience functions from which we are going to use
orderBy().
After the installing all libraries, they can be loaded by executing
library(lib_name).
The documentation for functions, libraries, and other topics is always
just a question mark away. For instance, to find out how libraries are installed,
execute the following code in the REPL: ?install.packages.
Preprocessing Benchmark Results
To give a little context, the experiment this introduction is based on
tries to evaluate the impact of using an object table instead of direct
references on the performance of our manycore Smalltalk, the RoarVM.
The data set is available here:
object-table-data.csv.bz2.
Download it together with the script and
put it into the same folder.
After opening the script in RStudio, you can follow the description here
while using the code in RStudio directly. The run-button will execute a
line or selection directly in the REPL.
Loading Data
First, change the working directory of R to the directory where you put
the data file: setwd("folder/with/data-file.csv.bz2/").
The file is a compressed comma-separated values file that contains
benchmark results, but no header line. Furthermore, some of the rows do not
contain data for all columns, but this is not relevant here and we can just
fill the gabs automatically.
Load the compressed data directly into R’s workspace by
evaluating the following code:
As you can see in the workspace on the right, the variable bench
contains now 60432 observations for 10 variables. Type the following
into the REPL to get an idea of what kind of data we are looking at:
summary(bench), or View(bench).
Just typing bench into the REPL will print the plain content,
too.
As you will see, every row represents one measurement with a set of
properties, like the executed benchmark, its parameters, and the virtual
machine used.
Transforming Raw Data
As a first step, we will do some parsing and rearrange the data to
be able to work with it more easily. The name of the virtual machine binary
encodes a number of properties that we want to be able to access directly.
Thus, we split up that information into separate columns.
bench <- ddply(bench,
~ VirtualMachine, # this formula groups the data by the value in VirtualMachine
transform,
# the second part of the VM name indicates whether it uses the object table
ObjectTable = strsplit(as.character(VirtualMachine), "-")[[1]][2] == "OT",
# the third part indicates the format of the object header
Header = factor(strsplit(as.character(VirtualMachine), "-")[[1]][3]))
This operation could use some more explanation, but most important
know now is that we add the two new columns and that the data
is being processed grouped by the values in the VirtualMachine
column.
The data set contains also results from another experiment and some of the
benchmarks include very detailed information. Neither of those information
is required at the moment and we can drop the irrelevant data points
by using subset.
bench <- subset(bench,
Header == "full" # concentrate on the VM with full object headers
& Criterion == "total", # use only total values of a measurement
select=c(Time, Platform, # use only a limited number of columns
ObjectTable, Benchmark, ExtraArguments, Cores))
Furthermore, we assume that all variations in measurements come from the
same non-deterministic influences. This allows to order the measurements,
before correlating them pair-wise. The data is order based on all columns,
where Time is used as the first ordering criterion.
In the next step, we can calculate the speed ratio between pairs of
measurements. To that end, we group the data based the unrelated variables
and divide the measured runtime by the corresponding measurement of the VM
that uses an object table. Afterwards we drop the measurements for the VM
with an object table, since the speed ratio here is obviously always 1.
Now we are at a point were we can start to make sense of the benchmark
results.
The summary function provides now a useful overview of the data and
we can for instance concentrate on the speed ratio alone.
As you might expect, you can also get properties like the standard
deviation, arithmetic mean, and the median easily:
However, that is very simple, a bit more interesting are R’s features
to query and process the data. The first question we are interested
in is, whether there is actually an impact on the performance difference
for different numbers of cores:
Starring at numbers is sometimes informative, but usually only useful
for small data sets.
Since we humans are better in recognizing patterns in visual representations,
beanplots are a better
way to make sense of the data. They are an
elegant visualization of the distribution of measurements.
beanplot(SpeedRatio ~ Platform,
data = norm_bench,
what = c(1,1,1,0), log="",
ylab="Runtime: noOT/OT",
las=2)
This beanplot tells us, like the numbers before, that the mean is smaller
than 1, which is the expected result and means that the indirection
of an object table slows the VM down. However, the numbers did not tell
use that there are various clusters of measurements that now become visible
in the beanplot and are worth investigating.
Let’s look at the results split up by number of cores.
beanplot(SpeedRatio ~ Cores,
data = norm_bench,
what = c(1,1,1,0), log="",
ylab="Runtime: noOT/OT")
Digging Deeper
While noticing that the variant without object table is faster
for more cores, we also see that the speed ratios are distributed unevenly.
While the bean for 1-core has a 2-parted shape, the 2-cores bean has a lot
more points where measurements clump.
That is probably related to the benchmarks:
beanplot(SpeedRatio ~ Benchmark,
data = subset(norm_bench, Cores==2),
what = c(1,1,1,0), log="",
ylab="Runtime: noOT/OT", las=2)
beanplot(SpeedRatio ~ Benchmark,
data = subset(norm_bench, Cores==16),
what = c(1,1,1,0), log="",
ylab="Runtime: noOT/OT", las=2)
Those two graphs are somehow similar, but you might notice that the float
loop benchmark has a couple of strong outliers.
I forgot the exact identifier of the float loop benchmark, so lets find out
with this code: levels(norm_bench$Benchmark)
Now let’s filer the data shown by that benchmark.
beanplot(SpeedRatio ~ Cores,
data = subset(norm_bench, Benchmark == "SMarkLoops.benchFloatLoop"),
what = c(1,1,1,0), log="", ylab="Runtime: noOT/OT")
This visualization reassures me, that there is something strange going on.
The distribution of result still looks clumped as if there is another
parameter influencing the result, which we have not regarded yet.
The only column we have not looked at is ExtraArguments,
so we will add them. Note that droplevels() is applied on the
data set this time before giving it to the beanplot() function.
This is necessary since the plot would contain all unused factor levels
instead, which would reduce readability considerably.
beanplot(SpeedRatio ~ Cores + ExtraArguments,
data = droplevels(subset(norm_bench, Benchmark == "SMarkLoops.benchFloatLoop")),
what = c(1,1,1,0), log="", ylab="Runtime: noOT/OT", las=2)
Fixing up a Mistake
This plot now shows us that there are three groups of results with
different ExtraArguments.
Think I forgot that the data set contains some very specific benchmarks.
The overall goal of the benchmarks is to test the weak scaling behavior of
the VM by increasing work-load and number-of-cores together.
However, for the questions we are interested here, only those weak-scaling
benchmarks are of interest.
Thus, we need to filter out a couple of more data points, since the results
will be unnecessarily biased otherwise.
To filter the data points we use grepl.
It matches the strings of ExtraArguments and allows us to
filter out the single-core and the 10x-load benchmarks.
norm_bench <- subset(norm_bench,
!grepl("^1 ", ExtraArguments) # those beginning with "1" put load on a single core
& !grepl("s0 ", ExtraArguments)) # those having "s0" in it put 10x load on each core
norm_bench <- droplevels(norm_bench)
And since I always wanted to say that:
The exercise to regenerate all interesting graphs and re-answer the
original questions are left to the interested reader.
Conclusion
As a final graph, we can plot all benchmarks for all cores,
and get an overview. The par function allows to adapt the
margins of the plot, which is necessary here to get the full benchmark
names on the plot.
par(mar=c(20, 4, 1, 1))
beanplot(SpeedRatio ~ Cores + Benchmark,
data = norm_bench,
what = c(1,1,1,0), log="", ylab="Runtime: noOT/OT", las=2)
As we already knew, we see an influence of the number of cores on the
results, but more importantly, we see most benchmarks benefitting from removing
the extra indirection through the object table. The float loops benefit by
far the strongest. The float objects are so small and usually used only
temporary that avoiding the object table pays off.
For the integer loops it does not make a difference, since the VMs uses
immediate values (tagged integers). Thus, the integers used here are not
objects allocated in the heap and the object table is not used either.
Beyond the won insight into the performance implications of an object table
this analysis also demonstrates the benefits of using a language like
R. Its language features allow us to filter and reshape data easily.
Furthermore, regenerating plots and tracing steps becomes easy, too.
Here it was necessary since some data points needed to be removed from
the data set to get to reasonable results. Reexecuting part of the script
or just exploring the data is convenient and done
fast, which allows me to ask more questions about the data and understand the
measurements more deeply. Furthermore, it is less a hassle to reassess
the data in case certain assumptions have changed, we made a mistake,
or the data set changed for other reasons.
From my experience, it is much more convenient than Excel, but that might
just be because I spend more time on learning R than on learning Excel.
In case you try it out yourself, you will certainly want to experiment
with other types of visualizations, save them to files, etc.
A few of these things can be found in my benchmarking scripts on
GitHub: BenchR.
I started as a PhD student in October 2008 at the Software Languages Lab of the Vrije Universiteit Brussel. My field of research are virtual machines especially with respect to concurrency issues and language constructs for modularization. Recently, I started working on concurrency support at the VM instruction set level.