UTF-8 AND UNICODE

Overview

Traditionally, Jim Tcl has supported strings, including binary strings containing nulls, but it has had no support for multi-byte character encodings.

In some fields, such as when dealing with the web or other user-generated content, support for multi-byte character encodings is necessary. In these cases it is very useful for Jim Tcl to be able to process strings as multi-byte character strings rather than simply binary bytes.

Supporting multiple character encodings and translation between those encodings is beyond the scope of Jim Tcl. Therefore, Jim has been enhanced to add support for UTF-8, as probably the most common general purpose multi-byte encoding.

UTF-8 support is optional. It can be enabled at compile time with:

$ ./configure --enable-utf8

When UTF-8 support is enabled, most string-related commands become UTF-8 aware, including string match, split, glob, scan and format.

UTF-8 encoding has many advantages, but one complication is that characters take a variable number of bytes. For this reason string bytelength has been added. It returns the number of bytes in a string, while string length returns the number of characters.

If UTF-8 support is not enabled, all commands treat bytes as characters and string bytelength returns the same value as string length.
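The difference between the two commands can be seen with a short string that mixes a non-ASCII character and an ASCII character. This is a minimal sketch that assumes a Jim Tcl build configured with --enable-utf8; the variable name s is only for illustration.

```tcl
# Assumption: Jim Tcl built with --enable-utf8.
set s "\u00b5A"               ;# U+00B5 followed by ASCII "A"
puts [string length $s]       ;# number of characters: 2
puts [string bytelength $s]   ;# number of bytes: 3 (0xc2 0xb5 0x41)
```

On a build without UTF-8 support, both commands would report 3 for this string, since every byte is then treated as a character.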

Unicode vs UTF-8

It is important to understand that Unicode is an abstract representation of the concept of a “character”, while UTF-8 is an encoding of Unicode into bytes. Thus the Unicode codepoint U+00B5 is encoded in UTF-8 with the byte sequence: 0xc2, 0xb5. This differs from ASCII, where the same name is used interchangeably for both the character set and its encoding.
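The byte sequence for U+00B5 can be derived by hand. The following sketch applies the standard two-byte UTF-8 layout (110xxxxx 10xxxxxx), which covers codepoints U+0080 to U+07FF. It uses only integer arithmetic, so it does not depend on UTF-8 support being enabled; the variable names are illustrative.

```tcl
# Encode a codepoint in the range U+0080..U+07FF as two UTF-8 bytes.
set cp 0xb5
set b1 [expr {0xc0 | ($cp >> 6)}]     ;# lead byte: 110xxxxx, top 5 bits
set b2 [expr {0x80 | ($cp & 0x3f)}]   ;# continuation byte: 10xxxxxx, low 6 bits
puts [format "0x%02x 0x%02x" $b1 $b2] ;# 0xc2 0xb5
```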

Unicode Escapes

Even without UTF-8 enabled, it is useful to be able to encode UTF-8 characters in strings. This can be done with the \uNNNN Unicode escape. This syntax is compatible with Tcl and is always available, whether or not UTF-8 support is enabled.

Like Tcl, currently only 16-bit Unicode characters can be encoded.
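As a small illustration, the escape below should produce exactly the same bytes as writing the UTF-8 sequence by hand with \x escapes. This sketch assumes that \xNN inserts a single raw byte, as the \xff example later in this document suggests; the variable names are hypothetical.

```tcl
set escaped "\u00b5"              ;# Unicode escape for U+00B5
set raw "\xc2\xb5"                ;# the same character written as two raw bytes
puts [expr {$escaped eq $raw}]    ;# 1 if both hold the same bytes
puts [string bytelength $escaped] ;# 2
```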

UTF-8 Properties

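Due to the design of the UTF-8 encoding, most commands continue to work unchanged with UTF-8 strings. In particular, ASCII characters keep their single-byte representation, and every byte of a multi-byte sequence has its high bit set, so an ASCII byte never occurs inside a multi-byte character.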

Invalid UTF-8 Sequences

Jim treats some byte sequences as invalid UTF-8, such as those beginning with 0xff, those which encode a character in more than 3 bytes (codepoints greater than U+FFFF), and those which end prematurely, such as a lone 0xc2.

In these situations, the offending bytes are treated as single characters. For example, the following returns 2.

string bytelength \xff\xff

Commands Supporting UTF-8

The following commands have been enhanced to support UTF-8 strings.

String Matching

Commands such as string match, lsearch -glob, array names and others use string pattern matching rules. These commands support UTF-8. For example:

string match a\[\ua0-\ubf\]b "a\u00a3b"
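The same applies to the single-character wildcard. In the sketch below, which assumes a build with --enable-utf8, the ? in the pattern is expected to match the whole two-byte character U+00A3 rather than one of its bytes.

```tcl
# Assumption: Jim Tcl built with --enable-utf8.
# "?" matches exactly one character, here the two-byte U+00A3.
puts [string match "a?b" "a\u00a3b"]
```

Without UTF-8 support the match would be expected to fail, because U+00A3 then counts as two separate characters.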

format and scan

format %c allows a Unicode codepoint to be encoded. For example, the following returns a string with two bytes and one character, the same as \ub5.

format %c 0xb5

format respects widths as character widths, not byte widths. For example, the following will return a string with three characters, not three bytes.

format %.3s \ub5\ub6\ub7\ub8

Similarly, scan ... %c allows a UTF-8 character to be decoded to a Unicode codepoint. The following will set a to 181 (0xb5) and b to 65 (0x41).

scan \u00b5A %c%c a b

scan %[...] also accepts a character class, including Unicode ranges.
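The format and scan behaviour described above can be combined into a round trip. This is a sketch assuming a build with --enable-utf8; the variable names s and cp are illustrative.

```tcl
# Assumption: Jim Tcl built with --enable-utf8.
set s [format %c 0xb5]        ;# encode codepoint U+00B5
puts [string length $s]       ;# 1 character
puts [string bytelength $s]   ;# 2 bytes
scan $s %c cp                 ;# decode back to a codepoint
puts $cp                      ;# 181
# A precision counts characters, not bytes.
puts [string length [format %.3s "\u00b5\u00b6\u00b7\u00b8"]]  ;# 3
```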

String and Character Classes

The string is command has not been extended to classify UTF-8 characters. Therefore, the following will return 0, even though the string may be considered to be alphabetic.

string is alpha \ub5Test

This does not affect the string classes ascii, control, digit, double, integer or xdigit.

Similarly, UTF-8 character classes are not supported. Thus [:alpha:] will match [a-zA-Z] but not non-ASCII alphabetic characters.

Case Mapping and Conversion

Jim provides a simplified Unicode case mapping. This means that case conversion and comparison will not increase or decrease the number of characters in a string.

string toupper will convert any lowercase letters to their uppercase equivalent. Any character which is not a letter or has no uppercase equivalent is left unchanged. Similarly for string tolower.

Commands which perform case-insensitive matches, such as string compare -nocase and lsearch -nocase, fold both strings to uppercase before comparison.
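A short sketch of the case mapping, assuming a build with --enable-utf8 and assuming that the generated tables map U+00E9 to U+00C9 as the Unicode data does. The variable names are illustrative.

```tcl
# Assumption: Jim Tcl built with --enable-utf8.
set lower "caf\u00e9"
set upper [string toupper $lower]
# Non-ASCII letters are converted along with the ASCII ones.
puts [expr {$upper eq "CAF\u00c9"}]
# The number of characters is unchanged by the conversion.
puts [expr {[string length $upper] == [string length $lower]}]
```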

Regular Expressions

Systems typically do not provide a UTF-8 capable regex implementation. Therefore, when UTF-8 support is enabled, the built-in regex implementation is used, which includes UTF-8 support. Both strings and patterns support UTF-8.

Case Insensitivity

Case folding is much more complex under Unicode than under ASCII. For example, it is possible for a character to change the number of bytes required for representation when converting from one case to another. Jim Tcl supports only “simple” case folding, where case is folded only where the number of bytes does not change.

Case folding tables are automatically generated from the official Unicode data table at http://unicode.org/Public/UNIDATA/UnicodeData.txt

Working with Binary Data and non-UTF-8 encodings

Almost all Jim commands will work identically with binary data and UTF-8 encoded data, including read, gets, puts and string eq. Only certain string manipulation commands operate differently. For example, string index will return UTF-8 characters, not bytes.

If it is necessary to manipulate strings containing binary, non-ASCII data (bytes >= 0x80), there are several options.

  1. Build Jim without UTF-8 support
  2. Arrange to encode and decode binary data or data in other encodings to UTF-8 before manipulation
  3. Use string bytelength, string byterange, pack, unpack and binary, which operate on bytes rather than characters
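The third option can be sketched as follows. The example contrasts character indexing with byte indexing on the same string, assuming a build with --enable-utf8 and assuming string byterange takes a string followed by first and last byte indexes.

```tcl
# Assumption: Jim Tcl built with --enable-utf8.
set s "\u00b5A"                  ;# bytes: 0xc2 0xb5 0x41
puts [string index $s 1]         ;# character index 1: A
puts [string byterange $s 2 2]   ;# byte index 2: A
# Extracting only the first byte splits the two-byte character.
puts [string bytelength [string byterange $s 0 0]]  ;# 1
```

Byte-oriented commands make it possible to slice data at arbitrary byte offsets, at the cost of being able to cut a multi-byte character in half.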