Status update: GHC on Apple M1 hardware
bgamari - 2021-03-09
A few months ago, Apple released the latest round of Apple hardware built upon their M1 ARM implementation. Today, existing x86-64 GHC releases can be used on M1 hardware using Rosetta emulation. In this post I will describe recent work in GHC to enable native use of the compiler on M1 hardware, as well as some welcome improvements to GHC’s general ARM support coming in GHC 9.2.
To run natively on Apple M1 hardware running macOS, GHC needs three things:
1. the ability to generate code in the ARM instruction set
2. support for Darwin’s ARM application binary interface (ABI)
3. various build system changes to accommodate the new platform
GHC has had (1) for many years now through the LLVM code generation backend. By far the largest chunk of work necessary for minimal M1 support has been (2) and (3).
Below I will discuss efforts that have gone into each of these aspects. However, readers with less interest in compiler development may skip to the final section for a brief overview of the state of play.
The memory ordering problem
While GHC has long had ARM support, it has until recently been something of a second-class citizen and exhibited some degree of instability. One major cause of this was the runtime’s ad-hoc treatment of memory consistency. GHC’s runtime was written well before memory consistency models were widely understood, and nearly two decades before the standardization of C11 atomics.
Consequently, until recently GHC’s runtime relied on a mix of volatile
variables and explicit memory barriers to enforce memory consistency. However,
ensuring correctness in a large concurrent system like GHC’s runtime is
extraordinarily difficult. As a result, numerous bugs lurked. To make matters
worse, bugs generally did not affect platforms with strong memory models (e.g.
x86) and only manifested on platforms like ARM where they are considerably
harder to debug.
To address this, I merged a large refactoring of GHC’s lock-free memory paths in GHC 9.0. This patch moved the runtime to use standard acquire-release orderings in place of our previous explicit barriers. In addition to being easier to audit, this change enabled the use of ThreadSanitizer to identify data races within the runtime system. Using ThreadSanitizer, I was able to identify and fix over a dozen distinct data races within the runtime. As a result, we can now have considerably greater confidence in the correctness of GHC on today’s increasingly wide, out-of-order ARM implementations.
The ABI problem
Point (2) above has been particularly thorny due to the interaction
between Darwin’s ABI and GHC’s historically simplistic way of representing
sub-word-size integer types (e.g. Word8). The problem arises from the
definition of sub-word-size types in base. For instance, Word8 is defined
as
data Word8 = W8# Word#
This means that, from the perspective of the compiler, Word8 is simply a
word-sized (e.g. 64-bit) value; the fact that only the bottom 8 bits contain
useful information has no representation in the type system. Until now this
scheme has served us fine. However, this changes with Darwin/ARM,
which defines a calling convention that is sensitive to argument width.
This means that foreign calls like
foreign import ccall "f" f :: Word8 -> IO ()
are treated incorrectly under this scheme on ARM/Darwin.
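To make the affected pattern concrete, here is a minimal sketch of a foreign import whose argument and result are narrower than a machine word. It uses the C library's htons and ntohs (which operate on uint16_t) rather than the hypothetical f above, so that it can be run without writing any C. It assumes that the platform's C library exports both as real functions rather than only as macros.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

import Data.Word (Word16)

-- htons/ntohs take and return uint16_t, which is narrower than a machine
-- word on 64-bit platforms. Assumption: the C library provides both as
-- linkable functions, not only as preprocessor macros.
foreign import ccall unsafe "htons" c_htons :: Word16 -> Word16
foreign import ccall unsafe "ntohs" c_ntohs :: Word16 -> Word16

-- Convert to network byte order and back again.
roundTrip :: Word16 -> Word16
roundTrip = c_ntohs . c_htons

main :: IO ()
main = print (map roundTrip [0, 1, 0x1234, maxBound])
```

Since the two conversions are inverses of one another, roundTrip should behave as the identity whenever the argument is passed to C with the width the callee expects.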
The long-term plan
Fixing the above issue correctly requires teaching GHC about the representation
of sub-word size types. Thankfully, the first step in this direction was
already taken several releases ago with the introduction of unlifted
sub-word-size integer types (as described in
GHC Proposal 74).
However, there is still a fair amount of work that remains. First, we must
rework the lifted types defined by base to take advantage of these
sub-word-size types:
-- today we have:
data Word8 = W8# Word#
-- in the glorious future we will have:
data Word8 = W8# Word8#
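Code that looks inside the W8# constructor is what this change touches. The following is a rough sketch of a function that must be written differently under the two representations. It assumes that the new representation and the word8ToWord# primop arrive together in GHC 9.2 (hence the version check), and the name widen is purely illustrative.

```haskell
{-# LANGUAGE CPP, MagicHash #-}

import GHC.Exts
import GHC.Word (Word8 (W8#))

-- Widen a boxed Word8 to a boxed Word by looking inside the constructor.
widen :: Word8 -> Word
#if __GLASGOW_HASKELL__ >= 902
-- New representation: the field is a Word8# and is extended explicitly.
widen (W8# w8) = W# (word8ToWord# w8)
#else
-- Old representation: the field is already a full Word#.
widen (W8# w) = W# w
#endif

main :: IO ()
main = print (widen 200)
```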
However, this change in isolation ends up breaking quite a bit of code since many
of GHC’s primops are still defined in terms of Word#. For instance,
in GHC 9.0 the primop for reading a Word8 from a ByteArray# has the type:
indexWord8Array# :: ByteArray# -> Int# -> Word#
Consequently, a common idiom like W8# (indexWord8Array# byteArray n)
will break if we were to merely make the above change to Word8. It turns out that we
can mitigate most of this by changing the primop definitions as well. That is,
-- today:
indexWord8Array# :: ByteArray# -> Int# -> Word#
-- the glorious future:
indexWord8Array# :: ByteArray# -> Int# -> Word8#
Not only does this type better reflect the operation’s meaning, but it also
eliminates a significant fraction of the churn required by the lifted-type
change. The return type can help here because most code doesn’t actually care
what the type being returned is, so long as it can be boxed right away by the
corresponding constructor (W8# in this case). We just need the constructor
and primops to agree to preserve that level of compatibility.
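The compatibility argument can be illustrated with a small sketch in which the field of W8# only ever flows between the constructor and the byte-array primops, so its type is never named. The Bytes wrapper and the helper names are hypothetical, and no bounds checking is performed. Under the assumption that the constructor and the primops change together, the same source should compile under either representation.

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}

import GHC.Exts
import GHC.ST (ST (ST), runST)
import GHC.Word (Word8 (W8#))

-- A boxed wrapper so that a ByteArray# can be passed around as a lifted value.
data Bytes = Bytes ByteArray#

-- Write each byte with writeWord8Array#. The field of W8# is handed straight
-- to the primop, so its exact type never has to be written down.
fill :: MutableByteArray# s -> Int# -> [Word8] -> State# s -> State# s
fill _ _ [] s = s
fill mba i (W8# w : rest) s =
  fill mba (i +# 1#) rest (writeWord8Array# mba i w s)

-- Allocate an array of the right size, fill it, and freeze it.
packBytes :: [Word8] -> Bytes
packBytes ws = case length ws of
  I# n -> runST (ST (\s0 ->
    case newByteArray# n s0 of
      (# s1, mba #) ->
        case unsafeFreezeByteArray# mba (fill mba 0# ws s1) of
          (# s2, ba #) -> (# s2, Bytes ba #)))

-- The idiom from the text: box the primop's result with the constructor.
-- The index is not checked against the array's size.
byteAt :: Bytes -> Int -> Word8
byteAt (Bytes ba) (I# i) = W8# (indexWord8Array# ba i)

main :: IO ()
main = print (map (byteAt (packBytes [1, 2, 255])) [0, 1, 2])
```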
Improving consistency of word-sized integer types
Darwin/ARM’s size-sensitive ABI also provides considerable incentive to fix a
long-standing wart in our treatment of Word32# and Word64#: Word64# (resp.
Word32#) is only available on 32-bit (resp. 64-bit) platforms (#11953).
This requires the user to awkwardly rely on CPP to select between Word64#
(resp. Word32#) and Word# for 64-bit (resp. 32-bit) wide integers,
depending upon the host platform.
Making all sized integer types always available on all architectures allows
considerable simplification of many bits of base and other core libraries.
However, it also reveals an awkward property of our existing primop naming.
Specifically, to convert between the fixed-width integer types and Word#,
GHC 8.10 provided the following family of primops:
GHC.Exts.narrowWord8# :: GHC.Prim.Word# -> GHC.Prim.Word8#
GHC.Exts.extendWord8# :: GHC.Prim.Word8# -> GHC.Prim.Word#
GHC.Exts.narrowWord16# :: GHC.Prim.Word# -> GHC.Prim.Word16#
GHC.Exts.extendWord16# :: GHC.Prim.Word16# -> GHC.Prim.Word#
GHC.Exts.narrowInt8# :: GHC.Prim.Int# -> GHC.Prim.Int8#
GHC.Exts.extendInt8# :: GHC.Prim.Int8# -> GHC.Prim.Int#
GHC.Exts.narrowInt16# :: GHC.Prim.Int# -> GHC.Prim.Int16#
GHC.Exts.extendInt16# :: GHC.Prim.Int16# -> GHC.Prim.Int#
However, if we were to add such operations for, e.g., Word64# we would end up with:
GHC.Exts.narrowWord64# :: GHC.Prim.Word# -> GHC.Prim.Word64#
GHC.Exts.extendWord64# :: GHC.Prim.Word64# -> GHC.Prim.Word#
Consider the case of narrowWord64# on a 32-bit platform (where Word# is 32 bits wide):
the operation’s semantics make its argument wider, yet its name suggests it
makes the value narrower (N.B. on a 64-bit platform this operation is a no-op).
We felt that this is too confusing to be allowed to stand. Consequently the
narrow and extend primops will be deprecated in 9.2 and eventually removed in GHC
9.8. They will be replaced by a set of more clearly named operations:
GHC.Exts.wordToWord64# :: GHC.Prim.Word# -> GHC.Prim.Word64#
GHC.Exts.word64ToWord# :: GHC.Prim.Word64# -> GHC.Prim.Word#
GHC.Exts.wordToWord32# :: GHC.Prim.Word# -> GHC.Prim.Word32#
GHC.Exts.word32ToWord# :: GHC.Prim.Word32# -> GHC.Prim.Word#
-- et cetera
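As an illustration of how downstream code might bridge the two naming schemes, the following sketch truncates a Word to 8 bits and widens it back again. It assumes that the 8-bit members of the new family follow the same naming pattern (wordToWord8# and word8ToWord#) and become available in GHC 9.2, and that older compilers (8.10 and 9.0) provide narrowWord8# and extendWord8# as listed above. The name truncate8 is made up for this example.

```haskell
{-# LANGUAGE CPP, MagicHash #-}

import GHC.Exts

-- Keep only the low 8 bits of a Word, then extend the result back to a
-- full-width Word.
truncate8 :: Word -> Word
#if __GLASGOW_HASKELL__ >= 902
-- New, direction-neutral names.
truncate8 (W# w) = W# (word8ToWord# (wordToWord8# w))
#else
-- Old narrow/extend names.
truncate8 (W# w) = W# (extendWord8# (narrowWord8# w))
#endif

main :: IO ()
main = print (map truncate8 [44, 255, 256, 300])
```

The new names state the source and target types rather than a direction, so they remain accurate regardless of how wide Word# happens to be on the host platform.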
The above changes are documented in the Wiki and will ship in GHC 9.2.1. While we would generally prefer to make these changes through the usual GHC proposals process, we deemed that due to the urgency of M1 support and relatively confined impact of the changes, users would be better served by quickly moving ahead (since otherwise proper M1 support likely would not have happened until 9.4). On the whole, this new story will be considerably more pleasant for all and fixes a number of related issues (#17375, #17377).
The short-term plan
Of course, the changes described above would be inappropriate to include in a minor GHC release. However, we feel that it is important that M1 support is shipped well before the release of GHC 9.2.1. For this reason, Moritz Angerman has been working tirelessly over the past several months to backport the M1 changes to the GHC 8.10 branch, using a less principled approach for working around the calling convention issue.
This work will be released shortly in the form of a GHC 8.10.5 release. We also expect that this work will make it into GHC 9.0.2 or 9.0.3.
A cherry on top: A new NCG for ARM
While LLVM has served us well on ARM, it is not known for its speed. For instance, on x86-64 a GHC bootstrap build using the LLVM backend takes roughly twice as long as a similar build using the native code generator. As Apple’s new hardware will mean that ARM will gain considerable adoption by developers, we thought such a sizeable compile-time tax was unacceptable.
To address this, Moritz Angerman picked up the long-standing (and quite large) task of implementing a native ARM backend for GHC (as enjoyed by x86-64 and PowerPC). Early indications suggest that this backend will considerably reduce compilation time on ARM platforms, bringing times in line with what we see on x86-64. This work will ship in GHC 9.2.1.
Summary
To summarize, over two years of work to improve the state of GHC on ARM will be culminating in the coming months. Specifically:
- GHC 8.10.4 solidified ARM support by revamping the runtime’s treatment of memory consistency.
- GHC 8.10.5 will be out in the coming weeks with initial Darwin/ARM support.
- GHC 9.0.2 will be out a few weeks later, also with Darwin/ARM support.
- GHC 9.2.1 will be released in June 2021 sporting Darwin/ARM support, revamped sized-integer types in base, and considerably faster compilation thanks to Moritz’s ARM NCG backend.
On the whole, this has been a very long road but we are quickly approaching its end. This is almost entirely due to a few people who deserve recognition here: First, Moritz Angerman not just for his incredible work on the NCG and Darwin support, but also for his help wrangling CI and tireless work on the thankless task of backporting. Second, thanks to John Ericson for his work in pushing through the various primop cleanups (and knock-on changes in downstream libraries) necessitated by the M1 ABI issue.
Lastly, thanks to Davean Scies and Simspace for their help and support in hosting a set of M1 CI runners.