Improvements in GHC's testsuite infrastructure
Ben Gamari - 2019-07-08
GHC’s testsuite is our first line of defense against correctness regressions. However, as is often the case, the infrastructure that keeps it running has long been neglected. Our recent efforts to enforce CI cleanliness in all GHC builds have resulted in a few bits of work that I thought would be nice to share.
Improving testsuite driver maintainability
GHC’s testsuite is a collection of Makefiles and Haskell programs all glued together with a clump of Python known as the testsuite driver. The testsuite driver is a clever albeit quirky piece of software which, despite its implementation language, displays the marks of a codebase written by a functional programmer. It defines a small embedded DSL for describing tests, the user-facing interface of which is generally declarative. For instance, a simple test definition might look like:
test('T13618', normal, compile_and_run, ['-v0'])
To provide this succinct language, the implementation relies on a number of clever tricks, often involving global variables, functions returning functions, and a good helping of mutation.
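As a rough illustration (this is a hedged sketch of the style, not testlib.py’s actual code), the pattern looks something like this: modifiers are functions that mutate a per-test options dictionary, and test() applies them at declaration time while accumulating results into a global:

all_tests = []  # module-level mutable state collecting every declared test

def normal(name, opts):
    # the do-nothing modifier: leaves the default options untouched
    pass

def only_ways(ways):
    # a function returning a function: captures 'ways' and yields a
    # modifier that mutates the options when the test is declared
    def helper(name, opts):
        opts['only_ways'] = ways
    return helper

def compile_and_run(name, way, args):
    pass  # stub standing in for the driver's real compile-and-run action

def test(name, setup, func, args):
    opts = {'only_ways': None, 'expect': 'pass'}
    setup(name, opts)  # apply the modifier by mutation
    all_tests.append((name, opts, func, args))

test('T13618', normal, compile_and_run, ['-v0'])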
Of course, all of these clever tricks are implemented in a dynamically-typed
language in a codebase that has evolved over the course of nearly 20 years
(the first commit to testlib.py was by Simon Marlow in 2003). In its nearly
5kLoC
implementation one will find integers used as booleans, magic
strings, undeclared global variables, variables that are sometimes thread-local
yet elsewhere considered global, heaps of string concatenation, and numerous
other curiosities. Needless to say, comprehending, let alone
modifying, the testsuite driver has been getting harder and harder with every
accrued feature.
In recent years the Python community has gradually awoken to the
promise of statically-checked types. Python 3’s type annotation syntax in
conjunction with the mypy typechecker has given us a
path to bringing order to this cleverness. mypy
implements a pleasantly complete
type system in which most any Haskell 98 user would feel at home, with support
for newtypes, sums, and parametric polymorphism.
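As a small illustration of those features (the names here are hypothetical, not annotations from the actual driver):

from typing import List, NewType, Optional, TypeVar

WayName = NewType('WayName', str)  # a newtype: checked as distinct from plain str

T = TypeVar('T')  # parametric polymorphism

def first(xs: List[T]) -> Optional[T]:
    # Optional[T] is a simple sum type: either a T or None
    return xs[0] if xs else None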
Over the last few weeks the driver has grown hundreds of lines of type
annotations, all checked by mypy. Not only has this made the codebase
significantly more readable but the process unearthed a few latent bugs as well.
Of course, to ensure that the testsuite driver doesn’t regress, it is now typechecked during the lint stage of GHC’s CI pipeline.
In addition to typechecking, we took this opportunity to perform some
long-overdue refactoring to use modern Python interfaces (e.g. pathlib
instead of strings, bool instead of int, use of None where appropriate)
and added numerous assertions (revealing yet more unnoticed bugs, some of which
were silently causing tests to be inappropriately skipped or run).
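A hedged before-and-after sketch of the sort of change involved (the function is hypothetical, not taken from the driver):

from pathlib import Path
from typing import Optional

# before: string concatenation and a sentinel empty string
#   def expected_output(testdir, name):
#       return testdir + '/' + name + '.stdout'

def expected_output(testdir: Path, name: str) -> Optional[Path]:
    # returns None, rather than a magic value, when the test has no
    # expected output on file
    out = testdir / (name + '.stdout')
    return out if out.exists() else None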
Fragile and broken tests
GHC’s testsuite has a sizeable configuration space, with over 30 “ways” in which tests may be run (e.g. normal, profiled, with the threaded RTS, etc.) and a few “speeds” which select a subset of tests to be run.
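For instance, a test definition can restrict which ways apply to it via the driver’s modifiers; a sketch (the test name is made up):

test('T99999',
     [only_ways(['threaded1', 'threaded2'])],  # run only with the threaded RTS ways
     compile_and_run,
     [''])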
For most of its existence GHC’s CI infrastructure (in its various forms) has
tested only a small subset of the testsuite tests (namely the normal speed
with around a dozen of the ways enabled). While we have periodically looked
at the full slow testsuite output, rarely were we able to make significant
headway in fixing the many issues we found.
Recently our new CI infrastructure has placed a renewed emphasis on
improving testsuite coverage by regularly (e.g. at least on a nightly basis)
testing the entire testsuite (as well as nofib, our performance testsuite).
This, of course, meant fixing the hundreds of failing tests in the full
testsuite run.
These failures generally broke down into a few classes:
1. tests that are themselves buggy
2. tests on which GHC has regressed (but we hadn’t noticed due to the test not being run in the default testsuite configuration)
3. tests which are broken in certain ways
4. tests which fail non-deterministically in some or all ways
To handle cases (1-3) GHC has long had an expect_broken
test modifier to
mark a test as known to be broken due to a particular ticket, e.g.:
test('T13366',
     [when(opsys('darwin'), expect_broken(16083))],
     compile_and_run,
     ['-lstdc++ -v0'])
This modifier causes the test to be run, failing if it somehow successfully finishes (to ensure that we notice if the test is inadvertently fixed). Moreover, it encodes the fact that #16083 is the ticket where the breakage is documented.
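A minimal sketch of these semantics (not the driver’s real logic):

def classify(expect: str, passed: bool) -> str:
    # expect is 'pass' or 'broken'; passed is whether the test succeeded
    if expect == 'broken':
        # a pass here is itself reported as a failure: the 'broken'
        # marker has gone stale and should be removed
        return 'unexpected pass' if passed else 'expected failure'
    return 'pass' if passed else 'unexpected failure'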
Tracking fragile test outcomes
Sadly, the expect_broken
mechanism is not appropriate for fragile tests,
which may pass or fail nondeterministically.
For this case we have introduced a new fragile
modifier which runs the test but
merely takes note of whether it passed. We can then report this information in
two places:
- In the testsuite report printed at the end of the run, ensuring that it shows up in the testsuite log
- In the JUnit report produced by the testsuite run.
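Declaring a fragile test looks much like expect_broken; a sketch (the test name and ticket number are illustrative):

test('T99999',
     [fragile(12345)],  # 12345: the ticket tracking the nondeterminism
     compile_and_run,
     [''])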
While fragile
is a step up from simply skipping broken tests, we clearly want
to ensure that the designation is eventually removed from tests that have been
stabilized (intentionally or not).
For a few weeks we tried checking for such tests by hand: manually examining
recent testsuite logs and looking for patterns of fragile tests which routinely
pass. However, not only was this time-consuming but it was also quite
error-prone as some of our fragile tests pass over 90% of the time.
For this reason we have developed a tiny bit of automation to help with this process: a GitLab webhook ingests the JUnit output from each testsuite run into a relational database for later analysis. This turns the previously-arduous task of identifying no-longer-fragile tests into a simple SQL query:
SELECT *
FROM results_view AS x
WHERE message = 'fragile'
  AND NOT EXISTS (SELECT *
                  FROM results_view
                  WHERE test_name = x.test_name
                    AND other->'reason' = 'fragile fail'
                    AND date > now() - interval '4 day');
Diagnosing fragile tests
One of the challenges in fixing fragile tests is characterising how they fail. Frequently, fragile tests have several failure modes. Knowing the differences and similarities between these modes can be remarkably helpful in localizing the root cause of the failure. Moreover, it’s not uncommon for failure modes to be shared by multiple fragile tests. Unfortunately, identifying and correlating these modes can be quite time-consuming, especially with infrequently-failing tests.
To aid in this we extended the testsuite driver’s JUnit output to include failing test output. This is then persisted in the test tracking database described above for later reference. This information has already proven to be incredibly useful.
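As a rough sketch of what this looks like (using a hypothetical helper, not the driver’s actual code), the JUnit failure element simply carries the failing test’s output as its body:

import xml.etree.ElementTree as ET

def record_failure(suite: ET.Element, name: str,
                   reason: str, output: str) -> None:
    case = ET.SubElement(suite, 'testcase', name=name)
    failure = ET.SubElement(case, 'failure', message=reason)
    failure.text = output  # the failing output travels with the report

suite = ET.Element('testsuite', name='ghc-testsuite')
record_failure(suite, 'T99999', 'bad exit code', 'expected 0, got 1')
ET.dump(suite)  # the real driver writes this tree to a junit.xml file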
An unexpected clean-up
One unexpected source of improvement in our testsuite’s reliability recently
arose from the addition of an Alpine Linux CI target (#14502). Alpine, unlike
the other Linux platforms that we test on, uses the musl C runtime
implementation. Due to differences in musl’s file flushing semantics, testing
on Alpine immediately shed light on a few subtly fragile testcases which we
had previously noticed but had not yet investigated.
This was a helpful reminder that writing and testing software for portability is not only good for users but can also improve correctness by shedding light on subtly-flawed assumptions.
Closing
In this post we discussed a few of the measures we have taken to improve and maintain GHC’s correctness while lowering maintenance overhead. By tackling long-standing technical debt and putting in place tools to correlate test failures across testsuite runs, testcases, and time, we have gained significantly deeper insight into when, where, and how our tests are failing.
We look forward to sharing similar steps being taken on the compiler performance front in a future post.