An overhaul of stack management, and some performance improvements
simonmar - 2010-12-15
I just committed a patch I’ve been working on for a few days that revamps the way thread stacks are managed in the runtime system. There’ll be some nice performance improvements coming in 7.2.1 as a result of this.
Background
In GHC 7.0 and earlier, a thread is represented by a TSO object in the heap. The TSO contains:
- some thread-specific state and flags
- a couple of link fields for linking the TSO onto lists
- the stack, growing backwards from the end of the TSO
when the stack overflows, as long as the maximum stack size (+RTS -K<size>
) has not yet been reached, then the runtime system would
create a new TSO double the size of the old one, copy the stack and
the thread state into the new object, and mark the old TSO as a
ThreadRelocated
pointing to the new one. This last step is
important, because the TSO might be reachable from other places: a
ThreadId
is basically a pointer to the TSO, for instance.
This ThreadRelocated
thing is a nuisance, because it means every
time we derefernce a TSO pointer, we have to check for
ThreadRelocated
in a little loop. This loop was enshrined in a
function deRefTSO()
in the RTS, and was called from various places.
Over the years we’ve had several bugs caused by a missing deRefTSO
.
Overheads of the old way
Obviously there’s a certain amount of overhead associated with copying
the whole stack every time we need to enlarge the TSO: every stack
word is written approximately twice. What’s worse, though, is that
the entire stack is traversed during every GC, if the thread is active
(threads that haven’t run since the last GC are marked “clean” and not
traversed). If the thread has a deep stack, this makes minor GCs
expensive, which shows up as a high GC overhead. A common workaround
is to use a larger allocation area, e.g. +RTS -A4m
, but that has the
disadvantage of making the allocation area larger than the L2 cache,
which also hurts performance.
A better way: stack chunks
The solution to the repeated traversal problem is to identify the parts of the stack that are unmodified since the last GC, and not traverse them. The GC already does a lot of this kind of thing - it’s called a “write barrier”, and it’s a basic requirement for generational GC.
There are two ways we could do this: either do a card-marking trick like we do for mutable arrays, or we could divide the stack up into multiple objects and mark each object as either clean or dirty. The card-marking option would have been fairly complicated to administer, so we discarded that idea. In contrast, having multiple objects containing parts of the stack was relatively straightforward.
Perhaps we should have done this ages ago. I put it off because I thought it would be complex - we have lots of places in the RTS that traverse stacks. It turns out that the majority of these places don’t care about stack chunks, which is nice.
Death to ThreadRelocated
The obvious solution to the ThreadRelocated
problem is to separate
the stack from the TSO, and that’s exactly what we’ve done. Now the
TSO contains the thread state only, and it points to a separate STACK
object for the topmost stack chunk, containing the stack pointer and the stack itself. The result is that a fragile invariant goes away, which is a very good thing.
Why didn’t we do this before? Well, in fact I did try making this change a long time ago, and found that it made performance worse for some thread benchmarks. I put it down to the extra cache-line fill required to access the stack pointer in the separate STACK object. However, this time the results look pretty good - I’m not entirely sure why, maybe it’s the combination of doing this with stack chunks, or maybe I did something wrong last time, who knows (or cares!).
Performance
Here are some performance figures from a small collection of thread benchmarks (http://darcs.haskell.org/nofib/smp). This is comparing yesterday’s HEAD with my working branch:
--------------------------------------------------------------------------------
Program Size Allocs Runtime Elapsed
--------------------------------------------------------------------------------
callback001 +289.3% -13.5% +0.8% +1.2%
callback002 +302.0% -13.5% -1.3% -1.3%
sieve +329.2% -0.2% -1.6% -1.3%
threads001 +292.4% +0.0% +4.2% +4.5%
threads003 +298.9% -0.5% -36.5% -36.4%
threads006 +304.1% -2.2% -66.4% -66.4%
threads007 +409.7% +0.0% +2.4% +2.5%
--------------------------------------------------------------------------------
Min +289.3% -13.5% -66.4% -66.4%
Max +409.7% +0.0% +4.2% +4.5%
Geometric Mean +316.3% -4.5% -19.3% -19.2%
Ignore the size figures, it’s because the HEAD build was using
-split-objs
.
We can see that several benchmarks display no difference or get very slightly worse, but a couple (threads003 and threads006) see a big jump in performance. threads003 is an old variant on threadring, and threads006 is a microbenchmark for throwTo. I don’t fully understand the difference in performance here, but typically I don’t investigate when things go faster, only when they go slower.
GHC itself got a little faster: 1-2% when compiling Cabal.
But what you really wanted to know was… what about threadring, right? Well, it currently goes a few percent slower with the new changes. However, I notice that it’s allocating a lot more than before, so I think there’s something suspicious going on. I shall investigate…
Update: some more results
!BinaryTrees, from the shootout:
- before: 64.26s
- after: 17.61s
So I guess the problem with !BinaryTrees was more to do with repeated stack traversals than it was to do with a low mortality rate. Nice that we can get good performance from this program without tweaking the heap settings at all now.
Hackage benchmarks, from David Peixotto’s fibon benchmark suite:
--------------------------------------------------------------------------------
Program Size Allocs Runtime Elapsed TotalMem
--------------------------------------------------------------------------------
Agum +306.5% +3.5% +0.8% +0.9% +0.0%
Bzlib ----- ----- ----- ----- -----
Cpsa +125.1% +2.1% -2.2% -2.3% +0.0%
Crypto ----- ----- ----- ----- -----
Fgl +408.9% +0.1% +2.0% +2.0% -5.4%
Fst +137.7% -0.0% -4.4% -4.3% +0.0%
Funsat +151.4% +0.2% +1.7% +1.8% -2.0%
Gf 13990k 13885653k 9.03 9.60 116736k
HaLeX +195.9% +0.0% +1.2% +1.1% +0.0%
Happy +135.0% +3.3% -5.3% -5.4% -13.7%
Hgalib +236.7% +1.0% +0.7% +0.7% +0.0%
Palindromes +243.1% -10.1% +3.4% +3.4% +2.1%
Pappy +148.9% +0.5% +3.7% +3.8% -17.7%
QuickCheck +139.5% -0.1% -1.6% -1.6% +0.0%
Regex +182.1% +0.0% +3.5% +3.5% -21.4%
Simgi +159.3% +0.0% +6.9% +6.9% +0.0%
TernaryTrees +519.4% -2.0% -32.4% -32.7% -14.3%
Xsact +246.4% +1.2% +0.9% +0.9% -0.6%
--------------------------------------------------------------------------------
Min +125.1% -10.1% -32.4% -32.7% -21.4%
Max +519.4% +3.5% +6.9% +6.9% +2.1%
Geometric Mean +207.6% -0.1% -1.9% -1.9% -5.2%
One or two nice ones in there: !TernaryTrees, Happy, Fst all look good. Simgi probably deserves further investigation.
Notice how the overall memory size is reducing too, by 5% on average.
Update 2: threadring results
threadring 10000000, without -threaded
:
- 6.12.3: 1.50s
- 7.0.1: 2.05s
- HEAD with stack patches: 1.67s
So the stack patches recovered almost all the performance lost between 6.12.3 and 7.0.1 on non-threaded threadring. The performance lost in 7.0.1 was due to the BLACKHOLE changes and some other changes to fix pathological cases of bad MVar
performance.
More interesting are the -threaded
numbers:
- 6.12.3: 5.86s
- 7.0.1: 3.00s
- HEAD with stack patches: 2.31s
The new code is very clearly winning here. I spent a little while with the perf tool on Linux trying to understand the results, but wasn’t able to fully account for it, sadly.