text-2.0.2: An efficient packed Unicode text type.
Copyright (c) 2009 2010 2011 Bryan O'Sullivan(c) 2009 Duncan Coutts(c) 2008 2009 Tom Harper(c) 2021 Andrew Lelechenko BSD-style bos@serpentine.com portable Trustworthy Haskell2010

Data.Text.Encoding

Description

Functions for converting Text values to and from ByteString, using several standard encodings.

To gain access to a much larger family of encodings, use the text-icu package.

Synopsis

# Decoding ByteStrings to Text

All of the single-parameter functions for decoding bytestrings encoded in one of the Unicode Transformation Formats (UTF) operate in a strict mode: each will throw an exception if given invalid input.

Each function has a variant, whose name is suffixed with -With, that gives greater control over the handling of decoding errors. For instance, decodeUtf8 will throw an exception, but decodeUtf8With allows the programmer to determine what to do on a decoding error.

## Total Functions

These functions facilitate total decoding and should be preferred over their partial counterparts.

Decode a ByteString containing Latin-1 (aka ISO-8859-1) encoded text.

decodeLatin1 is semantically equivalent to Data.Text.pack . Data.ByteString.Char8.unpack

This is a total function. However, bear in mind that decoding Latin-1 (non-ASCII) characters to UTf-8 requires actual work and is not just buffer copying.

Decode a ByteString containing ASCII text.

This is a total function which returns a pair of the longest ASCII prefix as Text, and the remaining suffix as ByteString.

Important note: the pair is lazy. This lets you check for errors by testing whether the second component is empty, without forcing the first component (which does a copy). To drop references to the input bytestring, force the prefix (using seq or BangPatterns) and drop references to the suffix.

### Properties

• If (prefix, suffix) = decodeAsciiPrefix s, then encodeUtf8 prefix <> suffix = s.
• Either suffix is empty, or head suffix > 127.

Since: text-2.0.2

Decode a ByteString containing UTF-8 encoded text.

Any invalid input bytes will be replaced with the Unicode replacement character U+FFFD.

Decode a ByteString containing UTF-8 encoded text.

If the input contains any invalid UTF-8 data, the relevant exception will be returned, otherwise the decoded text.

Decode a ByteString containing 7-bit ASCII encoded text.

This is a total function which returns either the ByteString converted to a Text containing ASCII text, or Nothing.

Use decodeASCIIPrefix to retain the longest ASCII prefix for an invalid input instead of discarding it.

Since: text-2.0.2

### Controllable error handling

Decode a ByteString containing UTF-8 encoded text.

Surrogate code points in replacement character returned by OnDecodeError will be automatically remapped to the replacement char U+FFFD.

Decode text from little endian UTF-16 encoding.

Decode text from big endian UTF-16 encoding.

Decode text from little endian UTF-32 encoding.

Decode text from big endian UTF-32 encoding.

### Stream oriented decoding

The streamDecodeUtf8 and streamDecodeUtf8With functions accept a ByteString that represents a possibly incomplete input (e.g. a packet from a network stream) that may not end on a UTF-8 boundary.

1. The maximal prefix of Text that could be decoded from the given input.
2. The suffix of the ByteString that could not be decoded due to insufficient input.
3. A function that accepts another ByteString. That string will be assumed to directly follow the string that was passed as input to the original function, and it will in turn be decoded.

To help understand the use of these functions, consider the Unicode string "hi ☃". If encoded as UTF-8, this becomes "hi \xe2\x98\x83"; the final '☃' is encoded as 3 bytes.

Now suppose that we receive this encoded string as 3 packets that are split up on untidy boundaries: ["hi \xe2", "\x98", "\x83"]. We cannot decode the entire Unicode string until we have received all three packets, but we would like to make progress as we receive each one.

ghci> let s0@(Some _ _ f0) = streamDecodeUtf8 "hi \xe2"
ghci> s0
Some "hi " "\xe2" _


We use the continuation f0 to decode our second packet.

ghci> let s1@(Some _ _ f1) = f0 "\x98"
ghci> s1
Some "" "\xe2\x98"


We could not give f0 enough input to decode anything, so it returned an empty string. Once we feed our second continuation f1 the last byte of input, it will make progress.

ghci> let s2@(Some _ _ f2) = f1 "\x83"
ghci> s2
Some "\x2603" "" _


If given invalid input, an exception will be thrown by the function or continuation where it is encountered.

Decode, in a stream oriented way, a lazy ByteString containing UTF-8 encoded text.

Since: text-1.0.0.0

data Decoding Source #

A stream oriented decoding result.

Since: text-1.0.0.0

Constructors

 Some !Text !ByteString (ByteString -> Decoding)

#### Instances

Instances details
 Source # Instance detailsDefined in Data.Text.Encoding MethodsshowList :: [Decoding] -> ShowS Source #

### Incremental UTF-8 decoding

The functions decodeUtf8Chunk and decodeUtf8More provide more control for error-handling and streaming.

• Those functions return an UTF-8 prefix of the given ByteString up to the next error. For example this lets you insert or delete arbitrary text, or do some stateful operations before resuming, such as keeping track of error locations. In contrast, the older stream-oriented interface only lets you substitute a single fixed Char for each invalid byte in OnDecodeError.
• That prefix is encoded as a StrictBuilder, so you can accumulate chunks before doing the copying work to construct a Text, or you can output decoded fragments immediately as a lazy Text.

For even lower-level primitives, see validateUtf8Chunk and validateUtf8More.

Decode a chunk of UTF-8 text. To be continued with decodeUtf8More.

See decodeUtf8More for details on the result.

### Properties

decodeUtf8Chunk = decodeUtf8More startUtf8State


Given:

decodeUtf8Chunk chunk = (builder, rest, ms)


builder is a prefix and rest is a suffix of chunk.

encodeUtf8 (strictBuilderToText builder) <> rest = chunk


Since: text-2.0.2

Decode another chunk in an ongoing UTF-8 stream.

Returns a triple:

1. A StrictBuilder for the decoded chunk of text. You can accumulate chunks with (<>) or output them with toText.
2. The undecoded remainder of the given chunk, for diagnosing errors and resuming (presumably after skipping some bytes).
3. Just the new state, or Nothing if an invalid byte was encountered (it will be within the first 4 bytes of the undecoded remainder).

### Properties

Given:

(pre, suf, ms) = decodeUtf8More s chunk

1. If the output pre is nonempty (alternatively, if length chunk > length suf)

s2b pre append suf = p2b s append chunk


where

s2b = encodeUtf8 . toText
p2b = partUtf8ToByteString

2. If the output pre is empty (alternatively, if length chunk = length suf)

suf = chunk
3. Decoding chunks separately is equivalent to decoding their concatenation.

Given:

(pre1, suf1, Just s1) = decodeUtf8More s chunk1
(pre2, suf2,     ms2) = decodeUtf8More s1 chunk2
(pre3, suf3,     ms3) = decodeUtf8More s (chunk1 B.append chunk2)


we have:

s2b (pre1 <> pre2) = s2b pre3
ms2 = ms3


data Utf8State Source #

State of decoding a ByteString in UTF-8. Enables incremental decoding (validateUtf8Chunk, validateUtf8More, decodeUtf8Chunk, decodeUtf8More).

Since: text-2.0.2

#### Instances

Instances details
 Source # Instance detailsDefined in Data.Text.Internal.Encoding MethodsshowList :: [Utf8State] -> ShowS Source # Source # Instance detailsDefined in Data.Text.Internal.Encoding Methods

Initial Utf8State.

Since: text-2.0.2

A delayed representation of strict Text.

Since: text-2.0.2

#### Instances

Instances details
 Source # Instance detailsDefined in Data.Text.Internal.StrictBuilder Methods Source # Concatenation of StrictBuilder is right-biased: the right builder will be run first. This allows a builder to run tail-recursively when it was accumulated left-to-right. Instance detailsDefined in Data.Text.Internal.StrictBuilder Methodsstimes :: Integral b => b -> StrictBuilder -> StrictBuilder Source #

Use StrictBuilder to build Text.

Since: text-2.0.2

Copy Text in a StrictBuilder

Since: text-2.0.2

## Partial Functions

These functions are partial and should only be used with great caution (preferably not at all). See Data.Text.Encoding for better solutions.

Decode a ByteString containing 7-bit ASCII encoded text.

This is a partial function: it checks that input does not contain anything except ASCII and copies buffer or throws an error otherwise.

Decode a ByteString containing UTF-8 encoded text that is known to be valid.

If the input contains any invalid UTF-8 data, an exception will be thrown that cannot be caught in pure code. For more control over the handling of invalid data, use decodeUtf8' or decodeUtf8With.

This is a partial function: it checks that input is a well-formed UTF-8 sequence and copies buffer or throws an error otherwise.

Decode text from little endian UTF-16 encoding.

If the input contains any invalid little endian UTF-16 data, an exception will be thrown. For more control over the handling of invalid data, use decodeUtf16LEWith.

Decode text from big endian UTF-16 encoding.

If the input contains any invalid big endian UTF-16 data, an exception will be thrown. For more control over the handling of invalid data, use decodeUtf16BEWith.

Decode text from little endian UTF-32 encoding.

If the input contains any invalid little endian UTF-32 data, an exception will be thrown. For more control over the handling of invalid data, use decodeUtf32LEWith.

Decode text from big endian UTF-32 encoding.

If the input contains any invalid big endian UTF-32 data, an exception will be thrown. For more control over the handling of invalid data, use decodeUtf32BEWith.

### Stream oriented decoding

Decode, in a stream oriented way, a ByteString containing UTF-8 encoded text that is known to be valid.

If the input contains any invalid UTF-8 data, an exception will be thrown (either by this function or a continuation) that cannot be caught in pure code. For more control over the handling of invalid data, use streamDecodeUtf8With.

Since: text-1.0.0.0

# Encoding Text to ByteStrings

Encode text using UTF-8 encoding.

Encode text using little endian UTF-16 encoding.

Encode text using big endian UTF-16 encoding.

Encode text using little endian UTF-32 encoding.

Encode text using big endian UTF-32 encoding.

# Encoding Text using ByteString Builders

Encode text to a ByteString Builder using UTF-8 encoding.

Since: text-1.1.0.0

Encode text using UTF-8 encoding and escape the ASCII characters using a BoundedPrim.

Use this function is to implement efficient encoders for text-based formats like JSON or HTML.

Since: text-1.1.0.0

# ByteString validation

These functions are for validating ByteStrings as encoded text.

Validate a ByteString as UTF-8-encoded text. To be continued using validateUtf8More.

See also validateUtf8More for details on the result of this function.

validateUtf8Chunk = validateUtf8More startUtf8State


### Properties

Given:

validateUtf8Chunk chunk = (n, ms)

• The prefix is valid UTF-8. In particular, it should be accepted by this validation:

validateUtf8Chunk (take n chunk) = (n, Just startUtf8State)


Validate another ByteString chunk in an ongoing stream of UTF-8-encoded text.

Returns a pair:

1. The first component n is the end position, relative to the current chunk, of the longest prefix of the accumulated bytestring which is valid UTF-8. n may be negative: that happens when an incomplete code point started in a previous chunk and is not completed by the current chunk (either that code point is still incomplete, or it is broken by an invalid byte).
2. The second component ms indicates the following:

• if ms = Nothing, the remainder of the chunk contains an invalid byte, within four bytes from position n;
• if ms = Just s', you can carry on validating another chunk by calling validateUtf8More with the new state s'.

### Properties

Given:

validateUtf8More s chunk = (n, ms)

• If the chunk is invalid, it cannot be extended to be valid.

ms = Nothing
==> validateUtf8More s (chunk <> more) = (n, Nothing)

• Validating two chunks sequentially is the same as validating them together at once:

ms = Just s'
==> validateUtf8More s (chunk <> more) = first (length chunk +) (validateUtf8More s' more)