Exploring Cloud Builds in Hadrian
David Eichmann - 2019-07-31
GHC’s new build system, Hadrian, now supports a cloud build option: --share.
With that enabled, build artifacts will be cached and shared between builds.
This can improve build times, but getting this feature to function correctly is
still a work in progress. This blog post explores the requirements for
correctness, some examples in Hadrian, the current state of the project, and how
developers can use cloud builds to improve iteration time.
Background
For much of its life GHC has had a build system based around GNU Make. Though it
has served its purpose well, the make-based build system is being replaced by
a new and improved project called Hadrian. Hadrian is a build system written in
Haskell and based on the Shake library. With Hadrian
comes cleaner code and better maintainability, but also some interesting
features of Shake such as cloud builds. Cloud builds are enabled with the --share
option. A cloud build keeps a cache of build artifacts that can be shared between
builds, eliminating work that was already done in a previous build. This feature
seems particularly useful in
the context of continuous integration (CI). Merge Requests made via
GitLab must all pass a comprehensive set of tests
on the CI server (the CI pipeline). This involves building GHC multiple times
and running the test suite. Due to limited resources, long build times, and the
large number of tested build configurations, the CI pipeline typically takes 6 to
8 hours to complete. Keep in mind that each merge request is typically only
making a small change to the compiler. By utilizing Shake’s cloud build system
to cache build artifacts between CI runs, one would expect a large speedup.
After an initial effort,
Andrey Mokhov achieved a successful cloud build and rebuild of GHC with Hadrian.
After Andrey’s changes were merged I did a cloud build (using the --share
option) of GHC to fill up the cloud build cache, deleted the build directory,
then timed a rebuild. This resulted in a 10x speedup of the rebuild compared
to a clean non-cloud build, a very positive result. On my machine this took a
25 minute build
down to only 2.5 minutes! However, this is an overly optimistic test: no source files
have changed, so we expect full cache utilization. In practice we expect source
files to have changed and so should expect less cache utilization. Regardless,
the prospect of using cloud builds to improve CI times looked very promising.
Unfortunately, the correctness of cloud builds came into question, and the last
thing we want in CI is an incorrect build. We needed a better understanding of
the requirements for correct cloud builds.
Build rules
To understand the problems at hand, one must understand the basics of how a
build system like Hadrian is implemented using Shake. Put simply, you must
create a set of build rules that map file name patterns to Actions that perform
some IO to generate the corresponding file or files, i.e. the “targets” of the
rule. In particular, Action is a monad, and the need :: [FilePath] -> Action ()
function is used to dynamically specify the rule’s inputs (i.e. dependencies)
that must be built before the action continues. Here is an example of a typical
build rule that generates a file with the extension .output:
"*.output" %> \ outputFileName -> do
-- Input file is just the output file with ".input" as the extension.
let inputFileName = outputFileName -<.> "input"
-- Here we declare the dependency on the input file. Shake will then
-- ensure it exists and is up to date before this Action continues.
need [inputFileName]
-- Do some IO to generate the output file from the input file.
compile inputFileName outputFileName
-- In some cases we generate more files than just the target,
-- `outputFileName`, of the rule. We still need those files to be
-- cached for cloud builds, so they must be declared here using the
-- `produces :: [FileName] -> Action ()` function. In this case
-- `compile` generated some other extra file.
let extraFileName = outputFileName -<.> "extra"
produces [extraFileName]
Evaluating Cloud Build Correctness
Let’s consider what it means to have a correct build system. A correct build
system is one that always produces a correct set of build artifacts according to
some specification. In the
case of Hadrian, while not formally specified, this would include producing a
functioning ghc
binary that reflects the build inputs (e.g. source files and
build flavour). We can break correctness down into three specific definitions:
- Clean Build Correctness: Start with no existing build artifacts. Running a build will succeed and produce correct build artifacts.
- Incremental Build Correctness: Start with existing build artifacts (e.g. by doing a full or partial build). Possibly change some build inputs. Running a build will succeed and produce correct build artifacts.
- Cloud Build Correctness: Start with an existing and populated cloud cache and possibly existing build artifacts (e.g. by doing a cloud build then possibly deleting build artifacts). Possibly change some build inputs. Running a cloud build will succeed and produce correct build artifacts.
For each definition above, a requirement for correctness is that the build produces correct build artifacts. Another way of thinking about this is that correct clean, incremental, and cloud builds should always produce an equal set of build artifacts given the same build inputs. In practice, “equal” ignores some insignificant differences due to nondeterminism in GHC and other build tools.
A main source of incremental and cloud build correctness bugs relates to build
rules declaring an incorrect set of inputs via need. In an incremental build,
the rule inputs indicate when the rule needs to be rerun to produce fresh build
artifacts. If an input is undeclared and that input changes, it could result in
the rule failing to rerun and hence stale build artifacts. In cloud builds, the
rule inputs are used as keys into the cache. If an input is undeclared and that
input changes, it could result in an (incorrect) cache hit and again stale build
artifacts.
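To make this concrete, here is a hypothetical sketch (the file names and the compileWith helper are made up; this is not a real Hadrian rule) of a rule with an undeclared input:

"prog.o" %> \ out -> do
    -- Declared input: the Haskell source.
    need ["prog.hs"]
    -- The rule also reads "settings.cfg" but never `need`s it. If
    -- "settings.cfg" changes, an incremental build will not rerun this rule,
    -- and a cloud build may get a cache hit keyed only on "prog.hs". Either
    -- way the result is a stale "prog.o".
    compileWith "settings.cfg" "prog.hs" out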
In this blog post we only focus on declaring the correct set of inputs. Rules
are assumed to otherwise correctly call produces and generate build artifacts
for that rule. Note that Andrey Mokhov’s initial effort largely focused on the
use of produces.
Let’s take a closer look at what we mean by “input”. The natural definition of inputs, which we call “direct inputs”, is the set of all files used (i.e. read) by the rule. We can also define an “indicating inputs set”. Consider comparing the build artifacts of two clean builds where some build inputs were changed. Here “changed” means the files differ between the first and second build. An indicating inputs set for a rule is a set of files such that:
a rule’s output may change only if at least one of the rule’s indicating inputs has changed.
The direct inputs always form a valid indicating inputs set. There can be other
valid indicating inputs sets that need not be a subset of, or equal to, the
direct inputs.
An indicating inputs set may include files outside the set of direct inputs (the
libffi
rules are a
good example). Generally we try to pick a minimal or close to minimal set of
indicating inputs in order to keep the rule implementation simple.
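To illustrate, here is a hypothetical sketch in the spirit of the libffi rules (buildFromUnpackedTree is a made-up helper, and we assume a separate rule unpacks the tarball): the rule reads many unpacked files, yet only the tarball is needed as an indicating input.

"vendored-lib.a" %> \ out -> do
    -- Indicating input: the unpacked files can only differ between two
    -- builds if this tarball differs, so it indicates changes for all of them.
    need ["vendored-lib.tar.gz"]
    -- Reads many files under vendored-lib/ (direct inputs) without needing
    -- them individually.
    buildFromUnpackedTree "vendored-lib/" out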
On that note, the invariants to achieve the three levels of correctness defined above are as follows. Understanding these invariants is fundamental to evaluating incremental and cloud build correctness.
- Clean Build: All rules need enough to build all direct inputs (this can be via directly or transitively triggered rules).
- Incremental Build: Clean build invariants + All rules need a valid indicating inputs set.
- Cloud Build: Clean and incremental invariants + All rules produces all outputs (excluding the rule target(s)).
Dependency Lint Errors
Now that we understand the requirements for correctness, how can we find
correctness issues in Hadrian? Shake has a great linting feature (enabled with
--lint-fsatrace) that utilizes the Filesystem Access Tracer tool (fsatrace) to
observe file accesses within a rule. This feature results in a linting error
when any file is read without first being declared with need. In other words,
we get linting errors when there is an omitted need for a direct input. Zero
linting errors would imply that all direct inputs are needed and, since the
direct inputs are a valid indicating inputs set, this would imply incremental
build correctness. Great! So how many lint errors do we have? At the start of my
endeavor, about 400,000 lint errors! That’s 400,000 files used by various rules
that forgot to need them first.
Are all lint errors indicative of a correctness issue? Not necessarily. Remember, lint errors identify omitted direct inputs, but we are concerned with indicating inputs. As such, lint errors must be investigated individually. The general workflow is:
- Pick a lint error. The error contains the rule and the omitted input file paths.
- Choose an indicating inputs set for the rule.
- Ensure all indicating inputs are needed in that rule.
- The rule is now correct, so trackAllow any direct inputs not in the indicating inputs set (this disables lint errors for those files). A sketch of steps 3 and 4 is shown below.
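Here is a minimal sketch of that end state for a hypothetical rule (the file names and the generate helper are made up; trackAllow takes file patterns and exempts matching reads from lint tracking):

"out.txt" %> \ out -> do
    -- Step 3: need the chosen indicating inputs.
    need ["inputs.list"]
    -- Step 4: this rule also reads files under cache/, which we have argued
    -- are not indicating inputs, so allow them without tracking.
    trackAllow ["cache//*"]
    generate "inputs.list" out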
Challenges
So far there is nothing too daunting about this task. Even the large number of lint errors is not too bad when you consider that individual fixes can resolve large numbers of lint errors if the rule is run many times. It sounds like we have our work cut out for us, but let’s look at two cases that prove tricky to resolve.
GHC Dependency Generation and CPP Inputs
Imagine we have a Haskell module that uses CPP:
-- A.hs
{-# LANGUAGE CPP #-}
#include "B.h"
main = putStrLn "Hello CPP"

-- B.h
#include "C.h"

-- C.h
Now when we build A.hs we must make sure to need the CPP includes, i.e. the rule
for A.hs has direct inputs A.hs, B.h, and C.h. We ideally want to discover the
.h inputs dynamically so that if we, e.g., add new #include directives, we don’t
need to update our build system rules. In Hadrian we do this by generating a
.dependencies file using GHC’s dependency generation options
-M -include-cpp-deps. This is done in a rule for the .dependencies file. Now the
rule to build A.hs first needs the .dependencies file, then dynamically needs
the dependencies (i.e. inputs) listed within: in this case A.hs, B.h, and C.h.
The rules look something like this:
".dependencies" %> _ -> do
"A.hs"]
need ["A.hs" -- Runs `ghc -M -include-cpp-deps A.hs ...`
ghcBuidDeps
"A.o", "A.hi"] &%> _ -> do
[".dependencies"]
need [<- getInputsFromFile "A.o" ".dependencies"
inputs
need inputs"A.hs" ghcBuild
So what is the problem here? The rule for ["A.o", "A.hi"] is fine as it tracks
the correct inputs. What about the ".dependencies" rule? Running
ghc -M -include-cpp-deps A.hs ... reads A.hs then traverses all the #included
files, hence the direct inputs of .dependencies are A.hs, B.h, and C.h. This
can’t be reduced to a smaller indicating inputs set; any of those files can
change the output of .dependencies. Well, this is awkward! The inputs of
.dependencies are exactly the inputs that we are trying to discover in the first
place. If we now include D.h from C.h and recompile, A.hs will recompile (it
needs C.h, which has changed), but .dependencies will not be recalculated, so
["A.o", "A.hi"] will not track the new D.h input. Now change D.h and rebuild.
A.hs will not be rebuilt (i.e. the ["A.o", "A.hi"] rule will not rerun) even
though the D.h input has changed.
A possible solution to this is to use Shake’s needed action in the .dependencies
rule to declare the inputs after running GHC’s dependency generation (sketched
after the list below). This has a couple of issues:
- GHC’s dependency generation also outputs the source file and other Haskell interface files that are imported from this module. Unlike the CPP includes, these are not indicating inputs of the .dependencies file. The distinction could be made by observing file extensions, .hi/.hs vs others, though this is not foolproof as CPP includes have no restriction on file extension.
- This works fine if the CPP dependencies are source files in the source tree, but doesn’t work if one of the dependencies needs to be generated by the build system: we would need to need them a priori to ensure they exist and so can be read to discover any further dependencies.
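For concreteness, a rough sketch of this needed-based approach might look as follows. This is not what Hadrian currently does; parseDepsFile and isCppInclude are hypothetical helpers, and filtering by file extension is exactly the not-foolproof heuristic from the first point above.

".dependencies" %> \ depsFile -> do
    need ["A.hs"]
    -- Runs `ghc -M -include-cpp-deps A.hs ...` to generate depsFile.
    ghcBuildDeps "A.hs"
    -- Read back the generated dependencies and retroactively declare the CPP
    -- includes as inputs of this rule.
    deps <- parseDepsFile depsFile
    needed (filter isCppInclude deps)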
Currently this is an unresolved issue in Hadrian. It is hard to say how likely this issue is to surface as a bug in incremental/cloud builds.
Compiling Haskell Modules
Consider a rule for a file X.o that compiles X.hs with ghc. We use ghc -M, which
returns the following Makefile fragment:
X.o : X.hs X.hi Y.hi
The Y.hi is there because module X imports module Y. We conclude that

direct inputs = indicating inputs = { X.hs, X.hi, Y.hi }

So we implement the rule like this (normally this is done dynamically using
.dependencies as described above, but we will list the dependencies statically
here to illustrate the problem better):
"X.o" %> \ _ -> do
"X.hs", "X.hi", "Y.hi"]
need ["X.o" buildWithGhc
This seems correct, but running the rule with --lint-fsatrace complains that the
rule used a file Z.hi without needing it. Upon further inspection, module Y
imports module Z. It turns out ghc often uses a subset of transitively imported
modules. This means the set of direct inputs is larger than what we originally
thought: it contains transitive interface files.
Unfortunately there doesn’t seem to be any feasible way to predict which
transitive interface files ghc
will use at compile time. Fortunately we only
need to need
indicating inputs, so we can stay optimistic and try to show that
transitive interface files (.hi files) are not indicating inputs. This can be
done by showing that a change in any transitive interface file results in a
change in the non-transitive interface files (this is explained in more detail
on the wiki):

change({ transitive interface files }) → change({ indicating inputs } \ { transitive interface files })
i.e. change({ Z.hi }) → change({ X.hs, X.hi, Y.hi })
Where → means “logically implies”. With some insight from an experienced GHC
developer we see that all interface files contain “a list of the fingerprints of
everything it used when it last compiled the file” (from the user guide). Making
the practical assumption that the fingerprint’s hashing function is injective,
we know that change({ Z.hi }) implies change({ Y.hi }), and trivially
change({ X.hs, X.hi, Y.hi }), hence it is safe to not need transitive interface
files. As a result, we can need the direct interface files reported by ghc -M
and trackAllow all other interface files.
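Concretely, the fixed rule might look something like this (dependencies again listed statically for illustration; buildWithGhc is the same hypothetical helper as above):

"X.o" %> \ _ -> do
    -- Need the direct interface files reported by `ghc -M`.
    need ["X.hs", "X.hi", "Y.hi"]
    -- ghc may also read transitive interface files such as Z.hi, but we have
    -- argued they are not indicating inputs, so allow them without tracking
    -- instead of needing them.
    trackAllow ["//*.hi"]
    buildWithGhc "X.o"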
In this particular case, the lint errors were a non-issue: there were no omitted indicating inputs. This is good, but establishing correctness required in-depth knowledge of the contents and invariants of interface files as well as a solid understanding of how to manipulate the indicating inputs set.
Limitations
Caching Staged Builds
Having seen two examples, we can see that fixing these lint errors isn’t entirely trivial. So the cost of enabling caching has increased, but the benefits still sound great: a large reduction in CI build times. Unfortunately this expectation is compromised by the fact that GHC is usually built in two stages. The majority of the second stage is built using the GHC binary produced in the first stage. Stage 1 GHC is less featureful than stage 2 and so takes only one-quarter to one-third of the total build time. The key observation is that stage 1 GHC is a dependency for most of stage 2. This means that any change to GHC will result in cache misses for the majority of stage 2, severely degrading cache utilization. The impact of this should not be underestimated.
A possible solution is to bypass the two-stage build process with the help of
Hadrian’s --freeze1 option. This freezes the stage 1 GHC. By keeping stage 1
frozen across multiple CI runs, stage 2 can now better utilize the cache. While
this sounds like a good idea at first, we strongly want to avoid incorrect
builds on CI, and the correctness of freezing stage 1 is questionable:
- Bugs may appear or be hidden by the frozen stage 1 compiler.
- Stage 2 GHC is linked against the base package built by stage 1 GHC. This could cause bugs or build failures if the base package or the interface file format differs between stage 1 and stage 2 GHC (i.e. when using --freeze1).
On a positive note, stage 1 still takes a significant portion of time to build, and does not suffer from this cache utilization issue. A reduction in stage 1 compile times is already something to be celebrated.
Changes in Hadrian Build Rules
An important note is that Hadrian is currently seeing regular updates and
changes to its build rules. Hadrian is under version control with the rest of
GHC, so when changing between commits you’re also potentially changing the
version of Hadrian. Cloud/incremental builds may be incorrect or fail when
Hadrian rules change between builds. While there are ways to protect against
this, i.e. by making the Hadrian version a dependency of all rules, Hadrian
currently doesn’t do this. As a result it is not safe to run cloud/incremental
builds across changes to Hadrian. As an example, doing a cloud build of commit A
then of commit B gives a failed build complaining of a missing GHC.Version
Haskell module: the build rules surrounding that module were changed between
commits A and B. Hence the cache directory should be cleared when Hadrian has
changed compared to previous cloud builds. Alternatively, you can maintain
separate cache directories based on the version of Hadrian, e.g. keyed on the
latest commit that changed the hadrian directory: git rev-list -n1 HEAD -- ./hadrian.
Relocatable Build Artifacts
We want to take advantage of the cloud cache regardless of the absolute filesystem path of the build directory and GHC source directory. Unfortunately some build artifacts depend on these paths. In other words, the build artifacts are not relocatable. This makes cloud builds impractical when the build and source paths are not held constant: it would result in poor cache utilization. Issue #16956 documents this problem further.
What This Means for GHC Developers
Compared with the make build system, potential issues are much more visible due to the fsatrace linting feature. Considering Hadrian is a port of the make system, now with many lint errors fixed, it’s reasonable to have as much confidence in Hadrian’s incremental builds as in make’s, if not more. More confidence in incremental builds is a welcome improvement and the most impactful benefit of this work.
We can expect a similar level of confidence in cloud builds, assuming that
Hadrian rules are unchanged between builds. If you’re planning on running the
same builds multiple times, then the cloud build feature can benefit you.
Simply use the --share
command line option when running Hadrian (or
--share=DIR
to specify a cache location). Consider the use case of swapping
between build flavours. A simple solution would be to maintain separate build
directories per flavour, but you may not want to manage these directories
manually or perhaps you expect the flavours to share a significant amount of
build artifacts from the cloud build cache. In the table below I’ve run through
the listed steps in order with cloud builds enabled, then all steps again with
cloud builds disabled (i.e. just as incremental builds). Before step 1, I always
start with a clean build and cache directory. You can see that cloud builds are
much faster the second time each flavour is built:
Step | Action (flavour) | Cloud Build Time | Non-Cloud Build Time |
---|---|---|---|
1 | build a39a3cd663 (default) | 24m37.775s | 23m49.780s |
2 | build a39a3cd663 (quick) | 11m55.601s | 11m16.476s |
3 | build a39a3cd663 (default) | 1m23.450s | 18m58.530s |
4 | build a39a3cd663 (quick) | 1m 2.805s | 11m23.408s |
Conclusion
The project of supporting correct cloud builds resulted in numerous
accomplishments. The number of fsatrace lint errors has been reduced from
400,000 to 90,000 thanks to numerous bug fixes in Hadrian. We now have
greater confidence in incremental and cloud builds. GHC developers can benefit from
cloud builds locally in some narrow use cases, and they frequently benefit from
improved incremental builds. We’ve developed a clear vocabulary for reasoning
about the degrees of build system correctness and what these require. Use of the
fsatrace linting feature will be invaluable to understanding and fixing newly
surfaced Hadrian bugs in the future. GHC’s -include-cpp-deps
option was
implemented as part of this project, enabling more precise dependency tracking
in Hadrian and users’ build systems. This project also had a part to play in
recent ideas to simplify Haskell interface files (see issue
#16885).
Still, there is plenty of room to further improve incremental and cloud builds. This blog post is in part a call for help from the community. The remaining work is challenging and interesting, requires little compiler expertise, and yet can have a significant impact on GHC developers. #16926 is a top level “getting started” issue for continuing this project. Most of the remaining work is focused on resolving fsatrace lint errors summarized in issue #16400. That issue also explains how to get started using fsatrace linting. Specific issues of interest include:
- #16868 Unexpected CPP dependencies in later (non CPP) build stages of GHC.
- #16815 Undeclared hs-boot inputs when generating .dependencies files.
- #16983 Undeclared inputs for _build/generated/platformConstants/platformConstants.
- #16956 Some build artifacts are sensitive to build directory location. This destroys cloud build cache utilization when using multiple build directories.
It’s highly recommended that you read through and contribute to the Developing Hadrian wiki page which goes into more detail about incremental and cloud build correctness.