Unicode String in OCaml's standard distribution.

In this post, I want to argue, against the apparent conventional wisdom of the OCaml community, that we need an abstract Unicode string module (which I call Text) and related machinery in the standard distribution. First I discuss the problems of the current practice, then propose the interface of the new module, in particular what equality and comparison of Unicode strings should be. Finally, I discuss how to implement such an interface.

Assumption

In this post, I assume that Uchar (an abstract type for Unicode characters) is already in the standard distribution. The reasons for this are explained eloquently in the thread at https://github.com/ocaml/ocaml/pull/80#issuecomment-48966322. I also assume that immutable strings will be widely accepted.

Background: What are problems in the current practice

Apparently, the current practice for handling Unicode text in OCaml is to use UTF-8 encoded byte strings. However, this practice is inherently unsafe and unintuitive.

This practice is unsafe because of the possibility of malformed UTF-8. Malformed input can arise from anything between programming mistakes and deliberate injection, and it has caused many security problems (see the literature on UTF-8 security exploits). Moreover, if a UTF-8 decoder is optimized with unsafe access to the underlying byte string, malformed UTF-8 can even cause a crash.

As a defence against this problem, one can validate the string before using it. However, the current practice does not guarantee that this validation actually occurs. Moreover, validation degrades performance, since it is an O(n) operation.

This practice is also unintuitive, because many functions on strings now return garbage.

  1. s.[i] returns a single byte, which may be a fragment of a multi-byte character.
  2. String.length returns the number of bytes, not the number of characters.
  3. String.sub can return garbage, possibly a malformed UTF-8 string.
  4. String.iter iterates over bytes, so it no longer does what one expects.

The objection that these functions are simply not meant for Unicode strings misses the point: nothing prevents a beginner, or legacy code, from calling them anyway and causing unexpected problems.
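These pitfalls are easy to demonstrate with today's stdlib. The snippet below uses the French word "été", written out as explicit bytes so the encoding is visible:

```ocaml
(* "été" is 3 Unicode characters but 5 bytes in UTF-8
   (each é is the two-byte sequence 0xC3 0xA9). *)
let s = "\xC3\xA9t\xC3\xA9"            (* "été" *)

let () =
  assert (String.length s = 5);        (* bytes, not characters *)
  assert (s.[0] = '\xC3');             (* a fragment of é, not a character *)
  assert (String.sub s 0 1 = "\xC3")   (* malformed UTF-8! *)
```

Every assertion here succeeds, which is exactly the problem: the byte-oriented answers are silently wrong for text.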

Proposed interface

There are a handful of people who believe that supporting Unicode in a programming language is incredibly difficult. This is not true. The Unicode standard is flexible enough that you do not need to implement everything (see its Conformance chapter). Moreover, the Unicode standard does not impose anything on language design (see its Implementation Guidelines). What the language has to do is give programmers the right abstraction. For Unicode strings, I believe this means providing an abstract Unicode string type: an abstract sequence of Uchar values.

So, I want to propose the addition of a Unicode string module to the standard distribution of OCaml. It is important to have this in the standard distribution, because

  1. A type for human-readable text is too important to leave out of the standard distribution, particularly from a beginner's perspective.

  2. It enhances the interoperability of Unicode processing libraries.

  3. It suppresses the current dangerous practice of using raw UTF-8 encoded byte strings as Unicode strings.

The Unicode string type must be an abstract data type, so that we keep the freedom to choose its implementation.

Unicode string (Text) module

Thus, we arrive at an abstract type for Unicode strings.

module Text : sig

    type t
    ...

end

Text.t is an abstract type, so we can enforce its invariant whenever we create an instance.

val of_string : string -> t option
val of_string_exn : string -> t

If the argument is malformed UTF-8, these functions return None or raise an exception, respectively.
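For concreteness, here is a minimal sketch of the well-formedness check that of_string would have to perform, written against the plain stdlib (the Text module does not exist yet). It follows the well-formed byte sequences of the Unicode standard, rejecting overlong encodings, surrogates, and out-of-range values:

```ocaml
(* A minimal UTF-8 well-formedness check: the invariant that
   of_string would enforce.  This is a sketch, not the proposed API. *)
let is_valid_utf8 (s : string) : bool =
  let len = String.length s in
  let byte i = Char.code s.[i] in
  (* [cont i] checks that byte [i] exists and is 10xxxxxx. *)
  let cont i = i < len && byte i land 0xC0 = 0x80 in
  let rec go i =
    if i >= len then true
    else
      let b = byte i in
      if b < 0x80 then go (i + 1)                        (* ASCII *)
      else if b < 0xC2 then false                        (* stray / overlong *)
      else if b < 0xE0 then cont (i + 1) && go (i + 2)   (* 2 bytes *)
      else if b < 0xF0 then                              (* 3 bytes *)
        cont (i + 1) && cont (i + 2)
        && (b <> 0xE0 || byte (i + 1) >= 0xA0)           (* no overlong *)
        && (b <> 0xED || byte (i + 1) < 0xA0)            (* no surrogates *)
        && go (i + 3)
      else if b < 0xF5 then                              (* 4 bytes *)
        cont (i + 1) && cont (i + 2) && cont (i + 3)
        && (b <> 0xF0 || byte (i + 1) >= 0x90)           (* no overlong *)
        && (b <> 0xF4 || byte (i + 1) < 0x90)            (* <= U+10FFFF *)
        && go (i + 4)
      else false                                         (* 0xF5..0xFF *)
  in
  go 0
```

With the internal representation hidden behind Text.t, this check runs exactly once, at construction time, rather than being hoped for at every use site.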

To access the contents, an iterator is preferable, since Text.t may not support fast random access.

type iterator

val first : t -> iterator

val last : t -> iterator

val nth : t -> int -> iterator option
val nth_exn : t -> int -> iterator

val next : iterator -> iterator option
val next_exn : iterator -> iterator
val prev : iterator -> iterator option
val prev_exn : iterator -> iterator

val value : iterator -> uchar
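As an illustration, a traversal with this iterator might look as follows. This is a sketch against the proposed (not yet existing) interface, and it glosses over what first should do on an empty text:

```ocaml
(* Hypothetical: count the characters of a text by walking the
   iterator forward from the first position. *)
let count_forward (text : Text.t) : int =
  let rec loop n it =
    match Text.next it with
    | None -> n + 1
    | Some it' -> loop (n + 1) it'
  in
  loop 0 (Text.first text)
```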

However, an index is conceptually easier to understand when you need to specify a location in a text.

val sub : t -> pos:int -> len:int -> t

Finally, we have fold.

val fold : t -> 'a -> ('a -> uchar -> 'a) -> 'a
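Most whole-text traversals can be phrased in terms of fold; for instance (again, a sketch against the proposed interface):

```ocaml
(* Hypothetical: character count and conversion to a uchar list,
   both expressed with the proposed fold. *)
let length (text : Text.t) : int =
  Text.fold text 0 (fun n _ -> n + 1)

let to_list (text : Text.t) : uchar list =
  List.rev (Text.fold text [] (fun acc u -> u :: acc))
```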

Equality and comparison

Now we come to a delicate issue: equality and comparison of Unicode strings. Unlike ASCII strings, Unicode strings have multiple notions of equality. First, there is equality as sequences of Uchar, which we call code-point equality. This is the simplest notion and the easiest to implement among the equalities and comparisons of Unicode strings.

Different sequences of Uchar can have exactly the same meaning for a human. This is mainly because of the existence of combining characters. In Unicode, ä can be represented by the single character ä (U+00E4), or by the sequence a + combining diaeresis (U+0061 followed by U+0308). This equivalence is called canonical equivalence.
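In UTF-8 terms, the two spellings of ä are visibly different byte (and code-point) sequences, even though they are canonically equivalent:

```ocaml
let nfc = "\xC3\xA4"   (* U+00E4: ä as one precomposed character *)
let nfd = "a\xCC\x88"  (* U+0061 U+0308: a + combining diaeresis *)

(* Code-point-unequal, yet canonically equivalent for a human reader. *)
let () = assert (nfc <> nfd)
```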

There is also an equivalence called compatibility equivalence. Some characters have different code points in Unicode but the same meaning according to the Unicode philosophy. For example, full-width katakana ア and half-width katakana ｱ have the same meaning, but since they are encoded as distinct characters in Japanese legacy encodings, the distinction is retained at the code-point level. Compatibility equivalence removes this distinction.

Further, there is a concept called collation: equivalence and ordering based on the cultural conventions of a given locale. Collation has a notion of strength, which distinguishes several levels of equivalence relations.

Among this multitude of equivalence relations, I propose that our Unicode module in the standard distribution support code-point equality as the standard way to compare Unicode strings. There are several reasons for this.

  1. There is no operation (besides physical equality) which can distinguish code-point-equal strings. In other words, code-point equality is the Leibniz equality over Unicode strings.

    This property is important for the principle of least surprise. It would be very surprising if two strings that are equal (under the default equality) behaved differently under some operation (such as accessing the Unicode characters they contain).

  2. Code-point equality and comparison are useful for low-level operations on Unicode strings; for example, as keys for hash tables and binary search trees (such as Set).

  3. Code-point equality and comparison are the easiest to implement, and they do not require large look-up tables.

Of course, the final results presented to humans should be treated with respect to canonical equivalence (at least). Higher-level protocols define their own equalities (for example, the PRECIS framework for internet protocols). This is a well-known fact, and it is best to leave it to programmers to choose which equality and comparison are appropriate for their task.

Unicode string literal

To write Unicode strings nicely in a program, we need Unicode string literals. Instead of introducing another syntax for them (which would add more clutter to the syntax), I propose that we reuse the current "..." notation for Unicode strings. For backward compatibility, "..." literals represent byte strings by default (at least for now) and mean Unicode strings only when this behaviour is activated. I will discuss how this can be implemented later.

I propose not to support Unicode string literals in pattern matching, because of the ambiguity of which equality between Unicode strings is meant.

Implementation

Although my main point is the introduction of an abstract Unicode string type, how to implement such an abstract type easily and efficiently is an important question too.

Unicode string

For the Unicode string type, I suggest UTF-8 encoded byte strings as the internal representation (but this fact must be totally hidden from users). There are several reasons.

  1. This makes conversion to/from byte strings easy and very fast. Since strings are now immutable, there is no need to copy during such conversions.

  2. This makes the interface to C very simple: we can treat our Unicode strings as plain byte strings on the C side.

  3. Implementation is easy, compared to other data structures such as ropes.

UTF-8 does not support fast random access, but if an iterator is provided this is not a major problem. Space efficiency is slightly inferior to UTF-16, since UTF-8 requires up to 3 bytes to encode a character of the BMP (Basic Multilingual Plane, where most characters in active use are located), while UTF-16 takes only 2 bytes. However, I believe the considerations above outweigh this slight space inefficiency.
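One pleasant consequence of this representation: UTF-8 was designed so that, for well-formed strings, bytewise lexicographic comparison coincides with code-point order. Code-point comparison can therefore be implemented as a plain String.compare on the hidden representation, with no decoding at all:

```ocaml
(* Bytewise order on well-formed UTF-8 equals code-point order. *)
let a     = "a"                   (* U+0061 *)
let e     = "\xC3\xA9"            (* U+00E9, é *)
let koala = "\xF0\x9F\x90\xA8"    (* U+1F428, a supplementary-plane character *)

let () =
  assert (String.compare a e < 0);       (* U+0061 < U+00E9  *)
  assert (String.compare e koala < 0)    (* U+00E9 < U+1F428 *)
```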

A rope is better suited to fast concatenation. I experimented with this option in ucorelib 0.2.0, but the code became very complicated. Using an imperative string buffer (as in the current stdlib) is much simpler and equally efficient.
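As a sketch of the buffer-based approach (assuming the UTF-8 internal representation, and using only the stdlib Buffer module):

```ocaml
(* Build one UTF-8 byte string out of many pieces without the
   quadratic cost of repeated (^).  Concatenating valid UTF-8
   fragments always yields valid UTF-8, so no re-validation is
   needed at the seams. *)
let concat_pieces (pieces : string list) : string =
  let buf = Buffer.create 256 in
  List.iter (Buffer.add_string buf) pieces;
  Buffer.contents buf
```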

Unicode string literal

As stated above, it is desirable to change the meaning of "..." literals by a directive, and I propose to use the open directive for this purpose. To support this, we change the parser so that it inserts a function, say _string_literal, before each string literal. Opening a module can then change the literal's meaning.

let s = "abcde" (* byte string *)

open Unicode

let t = "abcde" (* Unicode string *)
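The mechanism can be mimicked in today's OCaml to show that the scoping works as intended. The names below (_string_literal, Unicode, the Text constructor) are hypothetical, and the rewrite of "..." into a function application would of course be done by the parser, not by hand:

```ocaml
(* Default meaning: the identity on byte strings. *)
let _string_literal (s : string) : string = s

module Unicode = struct
  type text = Text of string              (* stand-in for Text.t *)
  let _string_literal (s : string) : text = Text s
end

(* What the parser would produce for  let s = "abcde"  ... *)
let s = _string_literal "abcde"           (* byte string *)

open Unicode

(* ... and for the same literal after  open Unicode. *)
let t = _string_literal "abcde"           (* Unicode string *)
```

Because open simply shadows _string_literal, the usual scoping rules decide which meaning a literal has, with no new syntax.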

Conclusion

In this post, I argued that we need an abstract Unicode string type in the standard distribution of OCaml. I also proposed how to write Unicode string literals. Finally, I discussed an easy and simple way to implement these proposals.

In my opinion, if OCaml wants to stay a cutting-edge programming language, it needs a standard way to represent Unicode strings. I hope this post stimulates discussion and leads to further improvements of OCaml.