Eventful GHC
Alp Mestanogullari - 2019-09-24
What can we do when GHC takes a long time to compile a given module? Where is it spending its time? How can we get a 10,000-foot view of what GHC is doing? This blog post covers one possible answer, using the eventlog mechanism.
(This post has also been published on the Well-Typed blog.)
Eventlog?
GHC has a mechanism that allows us to record all sorts of “events” to an
eventlog (a simple binary file) during a program’s execution, attaching a
timestamp to each event so that tools can later reconstruct as much of the
program’s execution as the events allow. This includes RTS events (garbage
collection, HEC/thread activity) but also user-defined events, where “user”
designates the author of a Haskell library or program; Debug.Trace in base
provides functions that anyone can use to emit events, in addition to
the ones that the RTS itself emits.
Those functions are implemented in terms of primitive operations that are backed by GHC’s runtime system. The design and implementation of this mechanism are discussed, along with other topics, in Parallel Performance Tuning for Haskell by Don Jones Jr., Simon Marlow and Satnam Singh.
After the program’s execution, one can then use libraries like ghc-events or tools like ghc-events-analyze or threadscope to consume the eventlog and gain some insight into where time was spent.
While profiling lets us gather more detailed information about where time is
spent and where allocations are made, it requires rebuilding our program’s code
and using a dedicated RTS flavour. The code generated by GHC for a profiled
program is quite different, adding lots of instrumentation to support profiling
the execution of Haskell code. The extra code generated to support profiling
can get in the way of some optimisations and can therefore drastically affect
the performance of a program. On the other hand, generating the eventlog for a
program only requires re-linking it (against a flavour of the RTS that supports
tracing) with -eventlog and running it with +RTS -l. The eventlog mechanism
also has a much lower impact on runtime performance, since emitting an event
merely amounts to putting a few values in a buffer that the RTS regularly
flushes to the eventlog file (or whatever the destination is). The
aforementioned paper has some precise numbers on the overhead of the eventlog
mechanism; as you would expect, it is quite low.
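As a concrete sketch, assuming a program called MyProg (a made-up name; any GHC-compiled executable works the same way), enabling the eventlog looks like this:
$ ghc -eventlog MyProg.hs -o MyProg   # (re-)link against the tracing RTS
$ ./MyProg +RTS -l                    # produces MyProg.eventlog in the current directory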
This can therefore be an interesting solution when you want to get a big picture without having to rebuild your whole program and some of its dependencies. All you have to do is set up some events that cover the fragments of the program’s execution that you’re particularly interested in.
Events are in general emitted using one of the following four functions from Debug.Trace:
- traceEvent :: String -> a -> a
- traceEventIO :: String -> IO ()
- traceMarker :: String -> a -> a
- traceMarkerIO :: String -> IO ()
The traceEvent[IO] functions should be used for all sorts of events,
particularly the ones that are likely to happen a lot during the execution of
your program, while traceMarker[IO] are generally used to mark certain points
or phases in the execution, so that they show up visually in the profile. This
is particularly helpful with tools like eventlog2html (see the last section of
this post) that allow you to visualize heap profiles, drawing those
“marker events” on top so that users can get a sense of when particular
allocations or deallocations take place, with user-supplied labels instead of
having to guess from timestamps.
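As a small self-contained sketch (the program and its labels are made up for illustration), emitting both kinds of events could look like this:
import Debug.Trace (traceEventIO, traceMarkerIO)

main :: IO ()
main = do
  traceMarkerIO "setup: start"   -- phase boundary, drawn as a vertical marker
  let xs = [1 .. 1000] :: [Int]
  traceMarkerIO "setup: done"
  -- frequent, fine-grained events go through traceEvent[IO]
  mapM_ (\i -> traceEventIO ("processing " ++ show i)) xs
  print (sum xs)
Compiled with -eventlog and run with +RTS -l, this program records one marker per phase boundary and one user event per list element.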
For more about eventlogs in general, see GHC’s users guide or the event-log page on the GHC wiki.
GHC events
Starting with this commit,
by Ben Gamari about 3 months ago, GHC started emitting eventlog entries for all
calls to withTiming, a function we use to measure how long various parts of GHC
take. We then started adding a few more withTiming calls:
- 688a1b89 added tracing around calls to various external tools (C compiler, linker, assembler, etc);
- 0c5cd771 added tracing around all the individual passes that make up the codegen pipeline (we previously only tracked codegen as a whole, single event);
- e3cbe319 added tracing around the package database initialization.
Other similar patches followed. As a result, we can trace most of the execution of GHC when compiling a trivial hello world module:
-- hello.hs
main :: IO ()
main = putStrLn "hello, world!"
First, you need to get a somewhat recent checkout of GHC’s master
branch - e3cbe319 or newer. Then you need to build that source tree and make
sure that the stage 2 GHC is linked with -eventlog:
$ ./boot && ./configure
# Build with Hadrian:
$ hadrian/build.sh -j "stage1.ghc-bin.ghc.link.opts += -eventlog"
# stage 2 ghc at: _build/stage1/bin/ghc
# Build with Make:
$ make -j GhcStage2HcOpts+=-eventlog
# stage 2 ghc at: inplace/bin/ghc-stage2
# If you have a GHC build around already, both of those commands should not
# cause a lot of work to be done: just linking the GHC executable against a
# slightly different RTS flavour! No recompilation needed.
You can then build any module, library or program with the resulting stage 2
executable as you would normally do, except that if you pass +RTS -l along
with one of -v2 or -ddump-timings, GHC will produce an eventlog at
ghc.eventlog with all the standard RTS events, as well as events for each pass
in GHC’s pipeline. Let’s see this in action by compiling hello.hs from earlier.
# use inplace/bin/ghc-stage2 if you built GHC with Make
$ _build/stage1/bin/ghc -ddump-timings hello.hs -o hello +RTS -l
We now have an eventlog. But what can we do with it? You’ve got several options:
- The ghc-events library and program, used by all the other options, provide primitives for decoding the eventlog and extracting data from it; the accompanying program provides various commands for querying eventlog files (see the sketch after this list).
- The threadscope program lets you visualize eventlogs with a GTK+ frontend.
- ghc-events-analyze produces per-label totals, SVG visualizations and per-label timing data given an eventlog file as input.
- … and maybe other options that I’m not remembering or aware of at all.
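To give a taste of the first option, here is a minimal sketch using the ghc-events library to dump every event along with its timestamp; the constructor and field names (EventLog, Data, evTime, evSpec) are those of recent ghc-events releases, so double-check them against the version you install:
import GHC.RTS.Events (Data (..), Event (..), EventLog (..), readEventLogFromFile)

main :: IO ()
main = do
  result <- readEventLogFromFile "ghc.eventlog"
  case result of
    Left err -> putStrLn ("could not decode the eventlog: " ++ err)
    Right (EventLog _header (Data evs)) ->
      -- print each event's timestamp (in nanoseconds) and payload
      mapM_ (\ev -> print (evTime ev, evSpec ev)) evs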
I personally use ghc-events-analyze a lot, since it can produce pretty SVG
pictures that I can then embed in GitLab comments or… blog posts. :-)
To get the timings and SVG picture, you need to install ghc-events-analyze
and then ask it to process ghc.eventlog with the right “markers”:
$ ghc-events-analyze --timed --timed-txt --totals \
    --start "GHC:started:" --stop "GHC:finished:" \
    ghc.eventlog
# --timed: produce an SVG visualization of the eventlog, in ghc.timed.svg
# --timed-txt: produce per-label groupings of timings that report when the
#   events were emitted, in ghc.timed.txt
# --totals: produce per-label totals, reporting how much time was spent in
#   a given label, in total
# --start/--stop: all the events we're interested in are wrapped in
#   GHC:started:... / GHC:finished:... events, so we just tell
#   ghc-events-analyze to look for those markers and report about those events.
Here are the totals that I get for our hello world program compilation.
GC 389461503ns 0.389s
USER EVENTS (user events are corrected for GC)
systool:linker 2386891920ns 2.387s
systool:cc 801347228ns 0.801s
systool:as 145128851ns 0.145s
Renamer/typechecker [Main] 45709853ns 0.046s
initializing package database 20877412ns 0.021s
CodeGen [Main] 20754058ns 0.021s
CoreTidy [Main] 8262122ns 0.008s
NCG 7566252ns 0.008s
Chasing dependencies 2441212ns 0.002s
Cmm pipeline 2040174ns 0.002s
Desugar [Main] 1657607ns 0.002s
STG -> Cmm 1103737ns 0.001s
Simplifier [Main] 1045768ns 0.001s
Cmm -> Raw Cmm 319442ns 0.000s
Parser [Main] 286350ns 0.000s
CorePrep [Main] 88795ns 0.000s
TOTAL 3445520781ns 3.446s
THREAD EVENTS
1 460138ns 0.000s
IOManager on cap 0:2 670044ns 0.001s
TimerManager:3 143699ns 0.000s
4 153050783ns 0.153s
weak finalizer thread:5 43518ns 0.000s
weak finalizer thread:6 10677ns 0.000s
weak finalizer thread:7 33126ns 0.000s
weak finalizer thread:8 23787ns 0.000s
weak finalizer thread:9 1534ns 0.000s
weak finalizer thread:10 8142ns 0.000s
weak finalizer thread:11 1352ns 0.000s
weak finalizer thread:12 10080ns 0.000s
weak finalizer thread:13 10603ns 0.000s
weak finalizer thread:14 8767ns 0.000s
weak finalizer thread:15 10849ns 0.000s
16 69745ns 0.000s
17 307420ns 0.000s
18 83017ns 0.000s
19 76866ns 0.000s
weak finalizer thread:20 10083ns 0.000s
21 96582ns 0.000s
22 103373ns 0.000s
23 62562ns 0.000s
24 97655ns 0.000s
weak finalizer thread:25 11676ns 0.000s
26 238116ns 0.000s
27 245821ns 0.000s
weak finalizer thread:28 8268ns 0.000s
29 80235ns 0.000s
30 86759ns 0.000s
31 21571ns 0.000s
weak finalizer thread:32 396ns 0.000s
TOTAL 156087244ns 0.156s
And the SVG image (click here for the full-size picture):
We can see two time windows where we don’t have an “active” label:
- towards the beginning, where GHC likely parses settings, sets everything up and reads the input file;
- after code generation and before calling the assembler, where it turns out we spend most of the time getting some information about the C compiler being used.
We otherwise have a pretty good picture of what GHC is doing the rest of the time. It might be nice in the future to add a few more events, e.g. to track when we’re running Template Haskell code.
If you want GHC to only emit so-called “user events” (that is, non-RTS events)
to the eventlog, pass +RTS -l-au instead of +RTS -l. More options can be found
in the relevant section of the GHC user manual.
Wrapping up
Of course, if you’re looking into a performance problem, once you narrow down
the part of the execution that is taking too long, you will quite likely move
on to slightly fancier profiling methods; nothing new there. You can however
emit your own events when working on GHC, so as to collect some simple timing
data about the codepaths that you are working on, without affecting GHC’s
performance much. You simply have to use compiler/main/ErrUtils.hs:withTiming:
withTiming <action to get dynflags> (text "<label>") <forcing function> $
    <action to time>
where the forcing function is used to force the evaluation of the action’s
result just as much as the user wants; the time it takes to perform this
evaluation is included in the timings reported by withTiming. Your <label>
would then show up in the eventlog reports, if the corresponding code path is
entered. See Note [withTiming] for more explanations about the withTiming
function and how to use it.
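If you want the same start/finish pattern in your own programs, outside of GHC’s codebase, you can approximate it with Debug.Trace. The helper below is a made-up sketch, not GHC’s actual withTiming, but it emits the same GHC:started:/GHC:finished: event pairs that ghc-events-analyze matched earlier:
import Debug.Trace (traceEventIO)

-- Hypothetical analogue of withTiming for plain IO code: wrap an action in
-- "GHC:started:<label>" / "GHC:finished:<label>" events, forcing the result
-- with the supplied function before the "finished" event is emitted.
withTimingIO :: String -> (a -> ()) -> IO a -> IO a
withTimingIO label force action = do
  traceEventIO ("GHC:started:" ++ label)
  result <- action
  force result `seq` traceEventIO ("GHC:finished:" ++ label)
  pure result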
All the eventlog markers that withTiming emits can then be integrated in heap
profiles using eventlog2html, written by Matthew Pickering; see here for an
example. If you put your mouse on one of the vertical lines, you will see the
label of the phase that begins or ends at that moment.
Before ending this post, I would like to mention that we have some
infrastructure to collect timings and allocations for various compilation
phases in GHC, over the head.hackage package set.
This can be used locally or triggered manually from the CI interface of
the GHC repository, and is set up to run nightly against GHC’s master
branch. The resulting data can be loaded in PostgreSQL and queried at will.
We plan on collecting more data about GHC’s behaviour when compiling all
those packages, and on relying more on the eventlog than on parsing GHC’s
output to do so, with two major goals in mind:
- have an easy way to track GHC’s performance over time and in particular provide some initial data about where a performance regression might be coming from;
- expose enough data to be able to identify “real world modules” where GHC doesn’t perform as well as we would like, which could in turn lead to an investigation and possibly a patch that improves GHC’s performance on that module.
Happy profiling!