Can We Know Whether a Profiler is Accurate?

If you have been following the adventures of our hero over the last couple of years, you might remember that we can’t really trust sampling profilers for Java, and it’s even worse for Java’s instrumentation-based profilers.

For sampling profilers, the so-called observer effect gets in the way: when we profile a program, the profiling itself can change the program’s performance behavior. This means we can’t simply increase the sampling frequency to get a more accurate profile, because more frequent sampling perturbs the execution even more. So, how could we possibly know whether a profile correctly reflects an execution?

We could try to look at the code and estimate how long each bit takes, and then painstakingly compute what an accurate profile would be. Unfortunately, with the complexity of today’s processors and language runtimes, this would require a cycle-accurate simulator that models everything, from the processor’s pipeline, through the cache hierarchy, to memory and storage. While there are simulators that do this kind of thing, they are generally too slow to simulate a full JVM with JIT compilation for any interesting program within a practical amount of time. This means that simulation is currently impractical, and with it, so is determining a ground truth this way.

So, what other approaches might there be to determine whether a profile is accurate?

In 2010, Mytkowicz et al. already checked whether Java profilers were actionable by inserting computations at the Java bytecode level. On today’s VMs, that’s unfortunately an approach that changes performance in fairly unpredictable ways, because it interacts with the compiler optimizations. However, the idea of checking whether a profiler accurately reflects the slowdown of a program is sound. For example, an inaccurate profiler is less likely to correctly identify a change in the distribution of where a program spends its time. Similarly, if we change the overall amount of time a program takes without changing the distribution of where time is spent, an inaccurate profiler may attribute run time to the wrong parts of the program.

We can detect both of these issues by accurately slowing down a program. And, as you might know from the previous post, we are able to slow down programs fairly accurately. Figure 1 illustrates the idea with a stacked bar chart for a hypothetical distribution of run time over three methods. This distribution should remain identical, independent of any slowdown applied to the program. So, the absolute time measured per method should grow linearly with the slowdown, while the percentage of time per method should remain constant.

Figure 1: A stacked bar chart for a hypothetical program execution, showing the absolute time per method. A profiler should see the linear increase in run time taken by each method, but still report the same percentage of run time taken. If a profiler reports something else, we have found an inaccuracy.
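
To make the invariant a bit more tangible, here is a minimal sketch of how one could check it, assuming we already extracted per-method percentages and total run times from the profiles. The data structures, names, and tolerances are illustrative, not the actual implementation from the paper:

import java.util.Map;

// Sketch: under an accurate profiler, per-method percentages stay constant,
// while the total run time scales with the slowdown factor.
final class ProfileInvariantCheck {
  static boolean percentagesMatch(Map<String, Double> baselinePct,
                                  Map<String, Double> slowedPct,
                                  double tolerance) {
    return baselinePct.entrySet().stream().allMatch(e ->
        Math.abs(e.getValue() - slowedPct.getOrDefault(e.getKey(), 0.0))
            <= tolerance);
  }

  static boolean totalScalesLinearly(double baselineMs, double slowedMs,
                                     double slowdownFactor, double tolerance) {
    // e.g., with a 2x slowdown, the total run time should roughly double
    return Math.abs(slowedMs / baselineMs - slowdownFactor) <= tolerance;
  }
}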

With this slowdown approach, we can detect whether the profiler is accurate with respect to the predicted time increase. I’ll leave all the technical details to the paper. We can also slow down individual basic blocks accurately to make a particular method take more time. As it turns out, this is a good litmus test for the accuracy of profilers, and we find a number of examples where they fail to attribute the run time correctly. Figure 2 shows an example for the Havlak benchmark. The bar charts show how much change the four profilers detect after we slowed down Vector.hasSome to the level indicated by the red dashed line. In this particular example, async-profiler detects the change accurately. JFR is probably within the margin of error. However, JProfiler and YourKit are completely off. JProfiler likely can’t deal with inlining and attributes the change to the forEach method that calls hasSome. YourKit does not seem to see the change at all.

Figure 2: Bar chart with the change in run time between the baseline and slowed-down version, for the top 5 methods of the Havlak benchmark. The red dashed line indicates the expected change for the Vector.hasSome method. Only async-profiler and JFR come close to the expectation.
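
To illustrate the inlining pitfall that likely trips up JProfiler here, consider a sketch of the hot method; only the method names come from the benchmark, the code itself is made up:

import java.util.function.Predicate;

// Made-up sketch of the situation in the Havlak benchmark: once the JIT
// inlines hasSome into its caller (here: forEach), a profiler that ignores
// the compiler's inlining metadata attributes the hot loop's samples to
// the caller instead.
final class Vector<E> {
  private E[] storage;
  private int size;

  boolean hasSome(Predicate<E> predicate) {
    for (int i = 0; i < size; i++) {
      if (predicate.test(storage[i])) {  // hot loop: most samples land here
        return true;
      }
    }
    return false;
  }
}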

With this slowdown-based approach, we finally have a way to see how accurate sampling profilers are by approximating the ground truth profile. Since we can’t measure the ground truth directly, we found a way to sidestep a fundamental problem and arrived at a reasonably practical solution.

The paper details how we implement our divining approach, i.e., how we slow down programs accurately. It also has all the methodological details, research questions, benchmarking setup, and lots more numbers, especially in the appendix. So, please give it a read, and let us know what you think.

If you happen to attend the SPLASH conference, Humphrey is presenting our work today and on Saturday.

Questions, pointers, and suggestions are always welcome, for instance, on Mastodon, BlueSky, or Twitter.

Thanks to Octave for feedback on this post.

Abstract

Optimizing performance on top of modern runtime systems with just-in-time (JIT) compilation is a challenge for a wide range of applications from browser-based applications on mobile devices to large-scale server applications. Developers often rely on sampling-based profilers to understand where their code spends its time. Unfortunately, sampling of JIT-compiled programs can give inaccurate and sometimes unreliable results.

To assess accuracy of such profilers, we would ideally want to compare their results to a known ground truth. With the complexity of today’s software and hardware stacks, such ground truth is unfortunately not available. Instead, we propose a novel technique to approximate a ground truth by accurately slowing down a Java program at the machine-code level, preserving its optimization and compilation decisions as well as its execution behavior on modern CPUs.

Our experiments demonstrate that we can slow down benchmarks by a specific amount, which is a challenge because of the optimizations in modern CPUs, and we verified with hardware profiling that on a basic-block level, the slowdown is accurate for blocks that dominate the execution. With the benchmarks slowed down to specific speeds, we confirmed that async-profiler, JFR, JProfiler, and YourKit maintain original performance behavior and assign the same percentage of run time to methods. Additionally, we identify cases of inaccuracy caused by missing debug information, which prevents the correct identification of the relevant source code. Finally, we tested the accuracy of sampling profilers by approximating the ground truth by the slowing down of specific basic blocks and found large differences in accuracy between the profilers.

We believe our slowdown-based approach is the first practical methodology to assess the accuracy of sampling profilers for JIT-compiling systems and will enable further work to improve the accuracy of profilers.

  • Divining Profiler Accuracy: An Approach to Approximate Profiler Accuracy Through Machine Code-Level Slowdown
    H. Burchell, S. Marr; Proceedings of the ACM on Programming Languages, OOPSLA'25, ACM, 2025.
  • Paper: PDF
  • DOI: 10.1145/3763180
  • Appendix: online appendix
  • BibTeX: bibtex
    @article{Burchell:2025:Divining,
      abstract = {Optimizing performance on top of modern runtime systems with just-in-time (JIT) compilation is a challenge for a wide range of applications from browser-based applications on mobile devices to large-scale server applications. Developers often rely on sampling-based profilers to understand where their code spends its time. Unfortunately, sampling of JIT-compiled programs can give inaccurate and sometimes unreliable results.
      
      To assess accuracy of such profilers, we would ideally want to compare their results to a known ground truth. With the complexity of today's software and hardware stacks, such ground truth is unfortunately not available. Instead, we propose a novel technique to approximate a ground truth by accurately slowing down a Java program at the machine-code level, preserving its optimization and compilation decisions as well as its execution behavior on modern CPUs.
      
      Our experiments demonstrate that we can slow down benchmarks by a specific amount, which is a challenge because of the optimizations in modern CPUs, and we verified with hardware profiling that on a basic-block level, the slowdown is accurate for blocks that dominate the execution. With the benchmarks slowed down to specific speeds, we confirmed that async-profiler, JFR, JProfiler, and YourKit maintain original performance behavior and assign the same percentage of run time to methods. Additionally, we identify cases of inaccuracy caused by missing debug information, which prevents the correct identification of the relevant source code. Finally, we tested the accuracy of sampling profilers by approximating the ground truth by the slowing down of specific basic blocks and found large differences in accuracy between the profilers.
      
      We believe, our slowdown-based approach is the first practical methodology to assess the accuracy of sampling profilers for JIT-compiling systems and will enable further work to improve the accuracy of profilers.},
      acceptancerate = {0.356},
      appendix = {https://doi.org/10.5281/zenodo.16911348},
      articleno = {402},
      author = {Burchell, Humphrey and Marr, Stefan},
      blog = {https://stefan-marr.de/2025/10/can-we-know-whether-a-profiler-is-accurate/},
      doi = {10.1145/3763180},
      issn = {2475-1421},
      journal = {Proceedings of the ACM on Programming Languages},
      keywords = {Accuracy GroundTruth Java MeMyPublication Profiling Sampling myown},
      month = oct,
      number = {OOPSLAB25},
      numpages = {32},
      pdf = {https://stefan-marr.de/downloads/oopsla25-burchell-marr-divining-profiler-accuracy.pdf},
      publisher = {{ACM}},
      series = {OOPSLA'25},
      title = {{Divining Profiler Accuracy: An Approach to Approximate Profiler Accuracy Through Machine Code-Level Slowdown}},
      year = {2025},
      month_numeric = {10}
    }
    

First Day: A New Chapter at the JKU

It’s Wednesday. Is this important? It’s my first day in a new position. So, perhaps the real question is: what’s going to be important to me from now on?

Let’s get the titles out of the way first: Today is my first day as Universitätsprofessor. That’s a full professor, chair, W3 Professor, gewoon hoogleraar, or similar. Yeah, there are lots of different names in different countries. It’s also my first day as the head of the Institute for System Software. The term institute is used here for something that’s a research group in many other places. This means I have the opportunity to work with a number of very smart people to offer university courses in the field of programming languages, compilers, and more broadly system software. It also means I am asked to advise, mentor, and support others on their research journey, from taking their very first steps up to becoming independent academics, and professors in their own right. To me, this sounds fun. I am asked to help people learn, pursue knowledge, and develop their skills. That is something I not only enjoy, but also find important for preparing the next generation to tackle the problems of our time. However, this also means I have reached the end of a journey. That’s it. I am a full professor now, and I have convinced enough people that I am not entirely terrible at this job. Or so we all hope…

At this point, I already have to thank all the people at the JKU for the very warm welcome I received over the last few weeks. Particularly, thank you Peter, Herbert, Markus, and Karin, for all the support to get me started here! Similarly, I wouldn’t be here without my dear colleagues and mentors at Kent and in the wider programming language research community. You know who you are, I hope.

What Now?

With the new job and responsibilities, I need to think about what’s now important to me. What follows isn’t a detailed plan. I had already been asked to formulate one of those, and I’ll continue to work on realizing it. Instead, I want to think here a bit more broadly.

Teaching: Advocate for Fundamentals

Let’s start with teaching, since my first lectures will already be next week.

Our institute teaches various courses, including software development, compiler construction, advanced compiler construction, system software, dynamic compilation and run-time optimization, and principles of programming languages.

My impression from early discussions with colleagues is that I will need to work on making sure that we can keep teaching these fundamental topics in the future. While there seems to be a very strong push for AI everything, I have yet to be convinced that this makes the fundamentals any less important. On the contrary, it feels like we need to keep reminding people of classic techniques that are guaranteed to work, are correct, and are efficient. So, when it comes to teaching, I think an important part of my job will be advocating for the fundamentals.

Of course, looking at the material I’ll teach this term on compiler construction and system software, perhaps I can adapt it in future years. Currently, 6 out of 13 compiler construction lectures are on parsing. This makes me want to work out what the most useful learning outcomes for such a course should be today.

Research: Take Risks and Pursue Problems Too Hard for Industry

Some people seem to advocate for exploring new things and expanding one’s horizon when reaching this career level. Indeed, I have the chance to take risks, explore new research topics and communities, and ways of working.

If there’s a single tag line for the work I have in mind, it might be: improve language implementations to better enable old and new kinds of applications. After all, I like to explore ideas that enable developers to make better use of computing systems.

This will take new ways of looking at problems. For instance, with few exceptions, I have shied away from very formal work in the past. Though, a while ago I started dreaming of defining a new kind of high-level memory model, for which we may need a more formal approach in addition to building working prototypes. Looking at today’s memory models, they seem too low-level for dynamic languages such as Python and Ruby. I already gave a few talks about the background of this work and will also give one at SPLASH. This will be a huge project, and a risky one. Not least because it’s unclear whether the language communities will care enough about the issue before they start suffering more notably from not having a memory model.

And then there is interpreter performance, a topic I have been working on for a long time already. Since I am now in a group with a long history in the area of compilers, I would like to double down on generating fast interpreters. Interpreters, the way we build them today, have a lot of performance headroom: the classic ones implemented in C/C++, and even more so the ones built on top of meta-compilation systems. The work of Haoran Xu suggests that we can do much better. Unfortunately, it’s a really hard problem, for various reasons, and one that doesn’t fit into the short- and mid-term priorities of most companies. But we can chip away at it slowly and steadily, benefiting lots of programming languages in the process.

I’ll also continue to work with my colleagues at Oracle on compiler topics and with colleagues from PLAS. We’ll keep doing fun stuff, some of which we’ll present at SPLASH in two weeks, including work on making programs slower (yes, slower!) and approximating the ground truth profile for sampling profilers.

I’ll stop here for now. Seems like I do need to get on with the actual job… somewhere in Science Park 3. I am looking forward to starting to work with all my new colleagues at the JKU and seeing which new collaborations and cooperations we can begin. If you’re a student and interested in a project, please see the Open Projects page, where I will post more concrete project ideas in the future.

I suppose I’ll also occasionally still be on Mastodon, BlueSky, and Twitter.

How to Slow Down a Program? And Why it Can Be Useful.

Most research on programming language performance asks a variation of a single question: how can we make some specific program faster? Sometimes we may even investigate how we can use less memory. This means a lot of research focuses solely on reducing the amount of resources needed to achieve some computational goal.

So, why on earth might we be interested in slowing down programs then?

Slowing Down Programs is Surprisingly Useful!

Making programs slower can be useful to find race conditions, to simulate speedups, and to assess how accurate profilers are.

To detect race conditions, we may want to use an approach similar to fuzzing. Instead of exploring a program’s implementation by varying its input, we can explore different instruction interleavings and thread or event schedules by slowing down program parts to change timings. This approach allows us to identify concurrency bugs and is used by CHESS, WAFFLE, and NACD.
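
As a small illustration of why perturbing timing exposes races, consider this classic lost-update bug; the injected delay widens the race window so that a normally rare interleaving becomes almost certain. This toy is only meant to convey the idea, it is not how the tools above actually control schedules:

// Toy example: a tiny delay between read and write of an unsynchronized
// counter makes the lost-update race easy to observe.
final class RacyCounter {
  private int count = 0;  // not synchronized: increments can be lost

  void increment() throws InterruptedException {
    int current = count;  // read
    Thread.sleep(1);      // injected slowdown: widens the race window
    count = current + 1;  // write: may overwrite a concurrent increment
  }

  public static void main(String[] args) throws InterruptedException {
    RacyCounter c = new RacyCounter();
    Runnable task = () -> {
      for (int i = 0; i < 100; i++) {
        try { c.increment(); } catch (InterruptedException e) { return; }
      }
    };
    Thread t1 = new Thread(task);
    Thread t2 = new Thread(task);
    t1.start(); t2.start();
    t1.join();  t2.join();
    System.out.println("expected 200, got " + c.count);  // almost always less
  }
}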

The Coz profiler is an example of how slowing down programs can be used to simulate speedup. With Coz, we can estimate whether an optimization is beneficial before implementing it. Coz simulates it by slowing down all other program parts. The part we think might be optimizable stays at the same speed it was before, but is now virtually sped up, which allows us to see whether it gives enough of a benefit to justify a perhaps lengthy optimization project.
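
For intuition on what such a virtual speedup means for the whole program, here is a back-of-the-envelope sketch; this is merely Amdahl’s-law-style arithmetic, not Coz’s actual mechanism of pausing other threads:

// Sketch: if a program part takes fraction p of the run time and we
// virtually speed it up by factor s, the program should take
// (1 - p) + p / s of its original time.
final class VirtualSpeedup {
  static double remainingFraction(double p, double s) {
    return (1.0 - p) + p / s;
  }

  public static void main(String[] args) {
    // A part taking 30% of run time, hypothetically made 2x faster:
    double remaining = remainingFraction(0.30, 2.0);
    System.out.printf("program would take %.0f%% of its original time%n",
                      remaining * 100.0);  // prints 85%
  }
}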

And, as mentioned before, we can also use it to assess how accurate profilers are. Though, I’ll leave this for the next blog posts. :)

The current approaches to slowing down programs for these use cases are rather coarse-grained though. Race detection often adapts the scheduler or uses, for example, APIs such as Thread.sleep(). Similarly, Coz pauses the execution of the other threads. Work on measuring whether profilers give actionable results inserts bytecodes into Java programs to compute Fibonacci numbers.
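
Shown at the source level for readability, the Fibonacci-based slowdown looks roughly like the sketch below; the original work inserted the extra work as bytecode, and all names here are made up. The volatile sink keeps the JIT from removing the payload, which already hints at why this approach interacts with compiler optimizations:

// Sketch of a Fibonacci-based slowdown: extra work is injected into a hot
// method, and its result is consumed so the JIT cannot eliminate it.
final class SlowdownInjection {
  private static int fib(int n) {
    return n < 2 ? n : fib(n - 1) + fib(n - 2);
  }

  static volatile int sink;  // consume the result to keep the payload alive

  static int hotMethod(int x) {
    sink = fib(15);  // injected work; its cost on a JIT-compiling VM is hard
                     // to predict, since it interacts with inlining and
                     // other optimizations
    return x * x;
  }
}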

By using more fine-grained slowdowns, we think we could make race detection, speedup estimation, and profiler accuracy assessments more precise. Thus, we looked into inserting slowdown instructions into basic blocks.

Which x86 Instructions Allow Us to Consistently Slow Down Basic Blocks?

Let’s assume we run on some x86 processor and look at programs from the processor’s perspective.

When running a benchmark like Towers, OpenJDK’s HotSpot JVM may compile it to x86 instructions like this:

mov dword ptr [rsp+0x18], r8d
mov dword ptr [rsp], ecx
mov qword ptr [rsp+0x20], rsi
mov ebx, dword ptr [rsi+0x10]
mov r9d, edx
cmp edx, 0x1
jnz 0x... <Block 55>	

This is one of the basic blocks produced by HotSpot’s C2 compiler. For our purposes, it suffices to see that there are some memory accesses with the mov instructions, and we end up checking whether the edx register contains the value 1. If that’s not the case, we jump to Block 55. Otherwise, execution continues in the next basic block. A key property of a basic block is that there’s no control flow inside of it, which means once it starts executing, all of its instructions will execute.

Though, how can we slow it down?

x86 has many, many different instructions one could try to insert into the block, each of which will consume CPU cycles. However, modern CPUs try to execute as many instructions as possible at the same time using out-of-order execution. This means instructions in our basic block that do not directly depend on each other might be executed at the same time. For instance, the first three mov instructions access neither the same registers nor the same memory locations, so the order in which they are executed does not matter. Though, which optimizations a CPU applies depends on the program and the specific CPU generation, or rather its microarchitecture.

To find suitable instructions to slow down basic blocks, we experimented only on an Intel Core i5-10600 CPU, which has the Comet Lake-S microarchitecture. On other microarchitectures, things can be very different.

For the slowdown that we want, we can use nop or mov regX, regX instructions on Comet Lake-S. This mov moves the value from register X to itself, so it basically does nothing. These two instructions introduce a cost small enough to let us slow most blocks down accurately to a desired target speed, and the slowdown seems to affect only the specific block it is meant for.

Our basic block from earlier would then perhaps end up with nop instructions interleaved after each instruction. In practice, the number of instructions we need to insert depends on how much time a basic block takes in the program. Though, for illustration, it might look like this:

mov dword ptr [rsp+0x18], r8d
nop
mov dword ptr [rsp], ecx
nop
mov qword ptr [rsp+0x20], rsi
nop
mov ebx, dword ptr [rsi+0x10]
nop
mov r9d, edx
nop
cmp edx, 0x1
nop
jnz 0x... <Block 55>	
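
To give a feel for the proportionality, here is a rough sizing sketch; the cycle counts and the per-filler cost are hypothetical inputs, and this is only an illustration of the relationship, not the mechanism from the paper:

// Rough sizing sketch: the number of filler instructions needed depends on
// how long the block takes and on what one filler costs on average.
// All numbers here are hypothetical.
final class FillerSizing {
  static long fillersNeeded(double blockCycles, double targetOverhead,
                            double cyclesPerFiller) {
    // targetOverhead = 1.0 means the block should take twice as long
    return Math.round(blockCycles * targetOverhead / cyclesPerFiller);
  }

  public static void main(String[] args) {
    // Hypothetical block taking 12 cycles, doubled in duration, assuming
    // ~0.25 cycles per filler on a wide out-of-order core:
    System.out.println(fillersNeeded(12.0, 1.0, 0.25));  // prints 48
  }
}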

We tried six different candidates, including a push-pop sequence, to get a better impression of how Comet Lake-S deals with them. For more details of how and what we tried, please have a look at our short paper below, which we will present at the VMIL workshop.

When inserting these instructions into basic blocks, so that each individual basic block takes about twice as much time as before, we end up with a program that indeed is overall twice as slow, as one would hope. Even better, when we look at the Towers benchmark with the async-profiler for HotSpot, and compare the proportions of run time it attributes to each method, the slowed-down and the normal version match almost perfectly, as illustrated below. The same is not true for the other candidates we looked at.

Figure 1: A scatter plot per slowdown instruction with the median run-time percentage for the top six Java methods of Towers. The X=Y diagonal indicates that a method’s run-time percentage remains the same with and without slowdown.

The paper has a few more details, including a more detailed analysis of the slowdown each candidate introduces, how precise the slowdown is for all basic blocks in the benchmark, and whether it makes a difference when we put the slowdown all at the beginning, interleaved, or at the end.

Of course, this work is merely a stepping stone to more interesting things, which I will look at in a bit more detail in the next post.

Until then, the paper is linked below, and questions, pointers, and suggestions are welcome on Mastodon, BlueSky, or Twitter.

Abstract

Slowing down programs has surprisingly many use cases: it helps find race conditions, enables speedup estimation, and allows us to assess a profiler’s accuracy. Yet, slowing down a program is complicated because today’s CPUs and runtime systems can optimize execution on the fly, making it challenging to preserve a program’s performance behavior to avoid introducing bias.

We evaluate six x86 instruction candidates for controlled and fine-grained slowdown including NOP, MOV, and PAUSE. We tested each candidate’s ability to achieve an overhead of 100%, to maintain the profiler-observable performance behavior, and whether slowdown placement within basic blocks influences results. On an Intel Core i5-10600, our experiments suggest that only NOP and MOV instructions are suitable. We believe these experiments can guide future research on advanced developer tooling that utilizes fine-granular slowdown at the machine-code level.

  • Evaluating Candidate Instructions for Reliable Program Slowdown at the Compiler Level: Towards Supporting Fine-Grained Slowdown for Advanced Developer Tooling
    H. Burchell, S. Marr; In Proceedings of the 17th ACM SIGPLAN International Workshop on Virtual Machines and Intermediate Languages, VMIL'25, p. 8, ACM, 2025.
  • Paper: PDF
  • DOI: 10.1145/3759548.3763374
  • BibTeX: bibtex
    @inproceedings{Burchell:2025:SlowCandidates,
      abstract = {Slowing down programs has surprisingly many use cases: it helps finding race conditions, enables speedup estimation, and allows us to assess a profiler's accuracy. Yet, slowing down a program is complicated because today's CPUs and runtime systems can optimize execution on the fly, making it challenging to preserve a program's performance behavior to avoid introducing bias.
      
      We evaluate six x86 instruction candidates for controlled and fine-grained slowdown including NOP, MOV, and PAUSE. We tested each candidate’s ability to achieve an overhead of 100%, to maintain the profiler-observable performance behavior, and whether slowdown placement within basic blocks influences results. On an Intel Core i5-10600, our experiments suggest that only NOP and MOV instructions are suitable. We believe these experiments can guide future research on advanced developer tooling that utilizes fine-granular slowdown at the machine-code level.},
      author = {Burchell, Humphrey and Marr, Stefan},
      blog = {https://stefan-marr.de/2025/08/how-to-slow-down-a-program/},
      booktitle = {Proceedings of the 17th ACM SIGPLAN International Workshop on Virtual Machines and Intermediate Languages},
      doi = {10.1145/3759548.3763374},
      isbn = {979-8-4007-2164-9/2025/10},
      keywords = {Benchmarking HotSpot ISA Instructions Java MeMyPublication assembly evaluation myown slowdown x86},
      location = {Singapore},
      month = oct,
      pages = {8},
      pdf = {https://stefan-marr.de/downloads/vmil25-burchell-marr-evaluating-candidate-instructions-for-reliable-program-slowdown-at-the-compiler-level.pdf},
      publisher = {{ACM}},
      series = {VMIL'25},
      title = {{Evaluating Candidate Instructions for Reliable Program Slowdown at the Compiler Level: Towards Supporting Fine-Grained Slowdown for Advanced Developer Tooling}},
      year = {2025},
      month_numeric = {10}
    }
    
