Fragmentation Analysis
In a previous post, [David] explained how he analysed a memory usage issue which turned out to be caused by fragmentation. At the time of writing, the exact cause of the fragmentation was unknown and difficult to analyse: we could work out the extent of the problem, but not formulate a strategy to fix it.
Characterising Fragmentation
In the previous post David explained that the primary kind of fragmentation we are interested in is block fragmentation. Block-level fragmentation is the wasted space between blocks in a megablock. In Haskell applications this is typically caused by pinned objects, which can’t be moved by the garbage collector.
With pinned data, it is easy to get into situations where a small pinned object ends up retaining a large amount of memory. For example, a single pinned block which contains only one live pinned object can end up retaining an entire megablock. The copying GC cannot reuse freed space in pinned blocks and, unlike unpinned blocks, pinned blocks are not compacted when live data is copied during garbage collection.
Fragmentation is worse if there are a lot of small, short-lived allocations, as the surviving longer-lived allocations will reside in mostly empty blocks once the rest of the previously live data is collected.
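To make this concrete, here is a small, contrived sketch (not taken from the application discussed in this post) of how retaining only a sparse subset of many small pinned allocations keeps many mostly empty pinned blocks alive. It uses newPinnedByteArray from the primitive package; any API that allocates pinned heap objects would illustrate the same point.

import Control.Monad (forM)
import Control.Monad.Primitive (PrimState)
import Data.Primitive.ByteArray (MutableByteArray, newPinnedByteArray)

-- Allocate many small pinned byte arrays but keep only every 100th one.
-- Many of the pinned blocks end up containing at least one live, immovable
-- object, so the copying GC can neither free nor compact them, even though
-- most of each block is dead space.
retainSparsePinned :: IO [MutableByteArray (PrimState IO)]
retainSparsePinned = do
  arrs <- forM [1 .. 100000 :: Int] (\_ -> newPinnedByteArray 64)
  pure [ a | (i, a) <- zip [0 :: Int ..] arrs, i `mod` 100 == 0 ]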
Block Statistics
As a bit of a warm-up let’s first consider how to write a program which analyses how many blocks, pinned blocks and megablocks are allocated. This will be enough to see that we have a problem with fragmentation.
p1 :: Debuggee -> IO ()
p1 e = do
  pause e
  bs <- run e precacheBlocks
  summariseBlocks bs
  resume e
p1 is an example script which pauses a process, queries it for the list of blocks and then prints a simple summary of the blocks. The precacheBlocks function fetches the list of blocks from the paused process and additionally populates a cache which makes subsequently decoding closures much faster.
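For completeness, here is a sketch of how such a script might be run against a process, assuming the debuggee was started with ghc-debug support and is listening on a socket. withDebuggeeConnect comes from ghc-debug’s client library, but its exact arguments differ between versions, so treat the socket path and signature below as assumptions.

-- Hypothetical wrapper: connect to the debuggee's ghc-debug socket and run p1.
-- The socket path and the exact arguments of withDebuggeeConnect are
-- assumptions; consult the ghc-debug-client documentation for your version.
main :: IO ()
main = withDebuggeeConnect "/tmp/ghc-debug" p1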
-- | Print a summary of the given raw blocks
-- This is useful to see how many MBlocks and how many pinned blocks there
-- are.
summariseBlocks :: [RawBlock] -> IO ()
summariseBlocks bs = do
  putStrLn ("TOTAL BLOCKS: " ++ show (length bs))
  putStrLn ("PINNED BLOCKS: " ++ show (length (filter isPinnedBlock bs)))
  putStrLn ("MBLOCK: " ++ show n_mblocks)
  where
    n_mblocks :: Int
    n_mblocks = length (nub (map (blockMBlock . rawBlockAddr) bs))
The summary prints, in order, the total number of blocks, the total number of pinned blocks and the total number of megablocks.
One iteration of the benchmark we are using creates 2000 connections to the server and then releases all the connections. Looking at the heap profile for this case, the live memory usage as reported by GHC is the same before and after the benchmark runs. However, David’s analysis showed that the OS reports a large difference in memory usage before and afterwards, which he attributed to fragmentation.
Running p1 on our benchmark confirms our suspicions that the heap is badly fragmented after running one iteration. Before the benchmark runs, the summary reports:
TOTAL BLOCKS: 10944
PINNED BLOCKS: 90
MBLOCK: 60
After the benchmark, the numbers of blocks, pinned blocks and megablocks have all increased.
TOTAL BLOCKS: 11172
PINNED BLOCKS: 299
MBLOCK: 163
Now there are about 100 more live megablocks and 200 more live pinned blocks. Each megablock is 1MB in size, so this accounts for the 100MB discrepancy between the memory usage reported by the OS and the memory usage reported by the heap profiler.
Now the fragmentation can be seen with our own eyes! Time to delve a little deeper.
Block Histogram
In order to characterise fragmented memory we will write a custom heap analysis using ghc-debug. The analysis works by traversing the heap starting from the root objects and recording the block into which each pinned object is allocated. Once the traversal has finished, the total size of the live objects in each block is divided by the maximum size of the block. Ideally, each block would be as full as possible; in a fragmented heap, most pinned blocks are less than 10% full. The code for this custom analysis is straightforward to write using some library functions.
In the GHC.Debug.Profile module there is a helper function closureCensusBy which simplifies implementing a parallel heap census.
closureCensusBy :: forall k v . (Semigroup v, Ord k)
                => (ClosurePtr -> SizedClosure -> DebugM (Maybe (k, v)))
                -> [ClosurePtr]
                -> DebugM (Map.Map k v)
To closureCensusBy you provide a continuation which is used to classify each closure as it is visited during the traversal. If the classification function returns Just a key-value pair then the result is added into the census summary. The result is returned in a Map which is segmented by the reported keys.
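As a warm-up for the block census, here is a minimal sketch of using closureCensusBy for a different purpose: counting how many live closures there are of each size. sizeCensus is a name invented for this example; it assumes the additive Semigroup instance that ghc-debug provides for Count and relies only on the dcSize function used later in this post.

-- Hypothetical example census: group live closures by their size in bytes
-- and count how many closures fall into each size class.
sizeCensus :: [ClosurePtr] -> DebugM (Map.Map Size Count)
sizeCensus = closureCensusBy go
  where
    go :: ClosurePtr -> SizedClosure -> DebugM (Maybe (Size, Count))
    go _ d = return (Just (dcSize d, Count 1))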
For the block census, the classification needs to do two things: first, check whether the object is allocated into a pinned block; and second, if it is, add its size to the appropriate bucket.
The result of this analysis is going to be a map from block pointers to PinnedCensusStats, which pairs the normal statistics of a census (total size, count of objects and so on) with the actual pinned objects found in each pinned block. This will be useful later. In order to use it with the generic closureCensusBy it needs to be an instance of Semigroup.
data CensusStats = CS { n :: !Count, cssize :: !Size, csmax :: !(Max Size) }
newtype PinnedCensusStats = PinnedCensusStats (CensusStats, [(ClosurePtr, SizedClosure)]) deriving (Semigroup)
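The deriving clause relies on CensusStats itself being a Semigroup. That instance ships with ghc-debug, but as a sketch of its shape (and only a sketch, assuming counts and total sizes combine by addition) it could look like this:

-- Sketch of the Semigroup instance the deriving clause above relies on:
-- counts and total sizes add, the maximum object size is the larger of the two.
-- The actual instance ships with ghc-debug and may be defined differently.
instance Semigroup CensusStats where
  CS (Count c1) (Size s1) (Max m1) <> CS (Count c2) (Size s2) (Max m2) =
    CS (Count (c1 + c2)) (Size (s1 + s2)) (Max (max m1 m2))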
Then onto the analysis function, which provides the right continuation to closureCensusBy to perform the previously described census. The function takes a list of all blocks in the program, which is filtered to just leave the pinned blocks. If a visited closure is in one of these pinned blocks then the value is added into the census.
censusPinnedBlocks :: [RawBlock]
                   -> [ClosurePtr]
                   -> DebugM (Map.Map BlockPtr PinnedCensusStats)
censusPinnedBlocks bs = closureCensusBy go
  where
    pbs :: Set.Set BlockPtr
    pbs = Set.fromList (map rawBlockAddr (filter isPinnedBlock bs))

    go :: ClosurePtr -> SizedClosure
       -> DebugM (Maybe (BlockPtr, PinnedCensusStats))
    go cp d =
      let cs :: CensusStats
          cs = CS (Count 1) (dcSize d) (Max (dcSize d))

          bp :: BlockPtr
          bp = applyBlockMask cp

      in return $ if bp `Set.member` pbs
                    then Just (bp, PinnedCensusStats (cs, [(cp, d)]))
                    else Nothing
It only took a small amount of code to write this custom analysis of fragmentation. The function is now assembled into our second analysis script:
p2 :: Debuggee -> IO ()
p2 e = do
  pause e
  census <- run e $ do
    bs <- precacheBlocks
    roots <- gcRoots
    censusPinnedBlocks bs roots
  resume e
  printBlockHistogram census
Once the fragmentation per pinned block is calculated, the information is displayed in a histogram which shows the percentage utilisation of each block. The printBlockHistogram function, provided by ghc-debug to print the census, is omitted here for brevity.
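Since its definition is not shown here, the following is only a rough sketch of what such a histogram printer could look like, assuming a 4096-byte block size and that Size is a newtype over Int; the real printBlockHistogram may be implemented quite differently.

-- Hypothetical histogram printer: bucket pinned blocks by their percentage
-- utilisation (live bytes / assumed 4096-byte block size) in 10% bins and
-- print how many blocks fall into each bin.
histogramSketch :: Map.Map BlockPtr PinnedCensusStats -> IO ()
histogramSketch census = mapM_ printBucket (Map.toAscList buckets)
  where
    percentFull :: PinnedCensusStats -> Double
    percentFull (PinnedCensusStats (CS _ (Size tot) _, _)) =
      100 * fromIntegral tot / 4096

    buckets :: Map.Map Int Int
    buckets = Map.fromListWith (+)
                [ (floor (percentFull stats / 10), 1) | stats <- Map.elems census ]

    printBucket (b, k) =
      putStrLn (show (10 * b) ++ ".0%-" ++ show (10 * (b + 1)) ++ ".0%: " ++ show k)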
Before the benchmark runs, most blocks are quite full; fragmentation hasn’t occurred yet. Some blocks show utilisation above 100% because large objects are also pinned and may be larger than a single block.
0.0%-10.0%: 7
10.0%-20.0%: 1
20.0%-30.0%: 3
30.0%-40.0%: 3
40.0%-50.0%: 3
50.0%-60.0%: 2
60.0%-70.0%: 7
70.0%-80.0%: 15
80.0%-90.0%: 16
90.0%-100.0%: 11
200.0%-210.0%: 2
510.0%-520.0%: 1
5290.0%-5300.0%: 2
After the benchmark has finished running, there are many more blocks which are less than 10% full.
0.0%-10.0%: 174
10.0%-20.0%: 5
20.0%-30.0%: 2
30.0%-40.0%: 5
40.0%-50.0%: 6
50.0%-60.0%: 18
60.0%-70.0%: 9
70.0%-80.0%: 12
80.0%-90.0%: 16
90.0%-100.0%: 11
150.0%-160.0%: 2
200.0%-210.0%: 2
510.0%-520.0%: 2
5290.0%-5300.0%: 2
This information gives us a much clearer picture of what’s going on with our memory. Now it’s time to work out what live data sits in the badly fragmented blocks and whether there is anything we can do to stop those blocks being retained.
Analysing Fragmentation
Now that we have a better handle on the level of fragmentation and can observe exactly how badly fragmented the heap is, we are in a position to improve the situation.
One way to reduce the fragmentation is to stop retaining as many pinned objects. We’ll concentrate on the badly fragmented blocks, as those will yield the biggest gains. Therefore we want to record which live objects reside in the worst-fragmented blocks and then look at why they are being retained. With this precise information and some domain knowledge we should be able to work out whether we can do anything about the fragmentation.
Recall from the previous section that the censusPinnedBlocks function already returns the addresses of the objects resident in each pinned block. Now it’s time to use that information:
- Find all the objects allocated in the blocks which are less than 10% used.
- For a sample of 5 objects, find one path from a GC root which retains each of them. This gives us a good idea about where the objects came from.
The information about which objects live in the fragmented blocks is already present in the result returned by censusPinnedBlocks. The pointers in the fragmented blocks are identified and 5 of them are sampled; for each sample we will compute a retainer path.
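The selection step might look something like the following sketch. findBadlyFragmentedObjects is a name invented for this post, and the 4096-byte block size is again an assumption; the idea is simply to keep the objects from blocks below the 10% threshold and take a handful of them.

-- Hypothetical helper: pick (up to) 5 objects that live in pinned blocks
-- which are less than 10% utilised, to use as targets for the retainer search.
findBadlyFragmentedObjects :: Map.Map BlockPtr PinnedCensusStats -> [ClosurePtr]
findBadlyFragmentedObjects census =
  take 5
    [ cp
    | PinnedCensusStats (CS _ (Size tot) _, objs) <- Map.elems census
    , fromIntegral tot / 4096 < (0.1 :: Double)
    , (cp, _) <- objs
    ]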
It was straightforward to implement another traversal mode in ghc-debug which computed the retainer paths for a specific object. Looking at these paths is enlightening, as we can finally see exactly where the problem is. The findRetainersOf function takes an optional search limit, a list of targets to find paths to and a list of roots to start looking from. The result is a list of paths from a root to a target.
findRetainersOf :: Maybe Int     -- ^ Maximum number of paths to return
                -> [ClosurePtr]  -- ^ List of target closures
                -> [ClosurePtr]  -- ^ List of roots to start from
                -> DebugM [[ClosurePtr]]
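Putting the pieces together, a third script could combine the census with findRetainersOf along the following lines. p3 is an invented name, the path limit of 5 is arbitrary, and this sketch only prints the raw paths; the next step, described below, is to dereference and decorate the closures along each path.

-- Hypothetical script: run the pinned-block census, sample some objects from
-- badly fragmented blocks and ask for retainer paths to them from the GC roots.
p3 :: Debuggee -> IO ()
p3 e = do
  pause e
  paths <- run e $ do
    bs     <- precacheBlocks
    roots  <- gcRoots
    census <- censusPinnedBlocks bs roots
    let targets = findBadlyFragmentedObjects census
    findRetainersOf (Just 5) targets roots
  resume e
  print paths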
Given the raw paths, we can perform some further requests in order to dereference all the closures along each path and attach source locations to as many closures as possible. The result is then printed so we can see what retains the fragmented objects.
Here is the first retainer stack reported by findRetainersOf when supplied with the list of objects living in fragmented pinned blocks and starting from the GC roots.
nl:toArray (8 bytes) "\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL"
GHC.Event.IntTable:MallocPtr 0x4204d088c8 0x42069c6528
GHC.Event.IntTable:IT 0x42069c3000 0x42069c30a0 283548616920
nl:_mutVar 0x4206bf57c8
GHC.Event.Manager:STRef 0x4206bf4e88
nl:MVar 0x4206bf47b8
GHC.Event.Manager:MVar 0x42070f57e8
nl:[ 0x42070f22b0, 0x42070f22c0, 0x42070f22d0, 0x42070f22e0, 0x42070f22f0, 0x42070f2300, 0x42070f2310, 0x42070f2320, 0x42070f2330, 0x42070f2340, 0x42070f2350, 0x42070f2360, 0x42070f2370, 0x42070f2380, 0x42070f2390, 0x42070f23a0, 0x42070f23b0, 0x42070f23c0, 0x42070f23d0, 0x42070f23e0, (and more) ]
GHC.Event.Thread:_fun{0x42070f1518,0x42070f1548,0,31,77309411345,180388626452,0}
nl:Stack( 1021 )
nl:TSO
Each line is a pretty-printed closure with some location information. The line starts with the module the allocation originated from (this uses the same machinery as the -hi profiling mode) and is followed by the rendered closure. The closure on each line retains the closure above it. This stack indicates that there is a byte array consisting of 8 null bytes which is retained by something in GHC.Event.IntTable. Reading further down the stack suggests this is related to the GHC.Event.Manager module, which is from the base library.
Inspecting the GHC.Event.IntTable module, it can be seen that the pinned memory is used to keep track of the size of the table. You can see that it’s the tabSize field because MallocPtr is a constructor for ForeignPtr.
newtype IntTable a = IntTable (IORef (IT a))
data IT a = IT { tabArr  :: {-# UNPACK #-} !(Arr (Bucket a))
               , tabSize :: {-# UNPACK #-} !(ForeignPtr Int)
               }
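One way the pinned allocation could be avoided, sketched below purely for illustration and not necessarily the approach taken in !4740, is to track the size in an ordinary unpinned mutable reference instead of a ForeignPtr:

-- Illustrative alternative (not necessarily the actual fix): store the table
-- size in an IORef, which lives on the normal movable heap, so no pinned
-- allocation is needed just to hold an Int.
data IT a = IT { tabArr  :: {-# UNPACK #-} !(Arr (Bucket a))
               , tabSize :: {-# UNPACK #-} !(IORef Int)
               }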
It turned out to be possible to keep track of the size without using pinned memory, which led to #19171 and !4740. Thanks to [Ben] for offering a quick fix for this issue. Once the issue was identified and the usage of pinned memory eliminated, the fragmentation characteristics of the program improved.
For each pinned object it’s possible to see where the allocation arises from in a similar fashion. At this stage it’s easiest to perform an inspection by hand as the number of cases is not so large.