

In .NET, strings (System.String) are sequences of characters (System.Char), and each character is a UTF-16 encoded code unit. This distinction matters because the everyday meaning of "character" differs from the definition used by .NET (and by many other languages).

What people usually call a character is more correctly called a grapheme: it is displayed as a glyph and is defined by one or more Unicode code points. Each code point is in turn encoded as a sequence of one or more code units. This is why a single System.Char does not always represent a grapheme. For example, a letter followed by a combining accent is displayed as one grapheme but is stored as two System.Char values.
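To make the difference between code units, code points and graphemes concrete, here is a minimal sketch that counts each of them for the same string. The class and method names (CharVersusGrapheme, CodeUnitCount, CodePointCount, TextElementCount) are hypothetical and chosen only for illustration; the calls they rely on (String.Length, Char.IsSurrogatePair and StringInfo.LengthInTextElements) are standard .NET APIs.

```csharp
using System;
using System.Globalization;

public static class CharVersusGrapheme
{
    // Number of UTF-16 code units (System.Char values) in the string.
    public static int CodeUnitCount(string s)
    {
        return s.Length;
    }

    // Number of Unicode code points: a surrogate pair is counted once.
    public static int CodePointCount(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsSurrogatePair(s, i))
                i++; // skip the low surrogate of the pair
            count++;
        }
        return count;
    }

    // Number of text elements as reported by StringInfo.
    public static int TextElementCount(string s)
    {
        return new StringInfo(s).LengthInTextElements;
    }
}
```

For "e\u0301" (the letter e followed by a combining acute accent) the three counts are 2, 2 and 1. For "\U0001D11E" (a single code point outside the Basic Multilingual Plane) they are 2, 1 and 1. Note that the exact definition of a text element used by StringInfo has changed between .NET versions, so more complex sequences may be counted differently depending on the runtime.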

There are many more issues in text handling; see, for example, How can I perform a Unicode aware character by character comparison? for a broader introduction and more links to related topics.

In general, when dealing with international text you can use this simple function to enumerate the text elements in a string (without splitting Unicode surrogate pairs or combining character sequences):

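using System.Collections.Generic;
using System.Globalization;

public static class StringExtensions
{
    public static IEnumerable<string> EnumerateCharacters(this string s)
    {
        // An iterator method cannot return a sequence directly, so end it instead.
        if (s == null)
            yield break;
        // Normalize (Form C by default) so equivalent sequences enumerate alike.
        var enumerator = StringInfo.GetTextElementEnumerator(s.Normalize());
        while (enumerator.MoveNext())
            yield return (string)enumerator.Value;
    }
}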
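As an illustration of why enumerating by text element matters, the following sketch compares two ways of reversing a string. It is only an example under stated assumptions: TextElementDemo, NaiveReverse and ReverseTextElements are hypothetical names, and the code calls StringInfo.GetTextElementEnumerator directly (without normalizing) so that it is self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TextElementDemo
{
    // Reverses the individual System.Char values; this can detach a
    // combining mark from its base letter or split a surrogate pair.
    public static string NaiveReverse(string s)
    {
        char[] chars = s.ToCharArray();
        Array.Reverse(chars);
        return new string(chars);
    }

    // Reverses the string one text element at a time, so each element
    // keeps its internal order of System.Char values.
    public static string ReverseTextElements(string s)
    {
        var elements = new List<string>();
        var enumerator = StringInfo.GetTextElementEnumerator(s);
        while (enumerator.MoveNext())
            elements.Add((string)enumerator.Value);

        elements.Reverse();
        return string.Concat(elements);
    }
}
```

With the input "e\u0301x", NaiveReverse produces "x\u0301e", which moves the accent onto the x, while ReverseTextElements produces "xe\u0301" and keeps the accent on the e.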
