Working with Text


Nearly all programs with user interfaces manipulate text. In an international market the text your programs display must conform to the rules of languages from around the world. The Java programming language provides a number of classes that help you handle text in a locale-independent manner.

01. Checking Character Properties

You can categorize characters according to their properties. For instance, X is an uppercase letter and 4 is a decimal digit. Checking character properties is a common way to verify the data entered by end users. If you are selling books online, for example, your order entry screen should verify that the characters in the quantity field are all digits.

Developers who aren't used to writing global software might determine a character's properties by comparing it with character constants. For instance, they might write code like this:

char ch;

// This code is WRONG!

// check if ch is a letter
if ((ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z'))
    // ...

// check if ch is a digit
if (ch >= '0' && ch <= '9')
    // ...

// check if ch is a whitespace
if ((ch == ' ') || (ch =='\n') || (ch == '\t'))
    // ...

The preceding code is wrong because it works only with English and a few other languages. To internationalize the previous example, replace it with the following statements:

char ch;
// ...

// This code is OK!

if (Character.isLetter(ch))
    // ...

if (Character.isDigit(ch))
    // ...

if (Character.isSpaceChar(ch))
    // ...

The Character methods rely on the Unicode Standard for determining the properties of a character. Unicode is a character encoding standard that supports the world's major languages. In the Java programming language a char value is a 16-bit Unicode code unit: it represents most characters directly, while supplementary characters need a pair of char values, as described later in the Unicode section. If you check the properties of a char with the appropriate Character method, your code will work with all major languages. For example, the Character.isLetter method returns true if the character is a letter in Chinese, German, Arabic, or another language.

The following list gives some of the most useful Character comparison methods. The Character API documentation fully specifies the methods.

◈ isDigit
◈ isLetter
◈ isLetterOrDigit
◈ isLowerCase
◈ isUpperCase
◈ isSpaceChar
◈ isDefined
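As a minimal sketch of the order-entry check mentioned earlier, the following helper accepts a quantity field only if every character is a decimal digit according to Character.isDigit. The class name QuantityValidator and the method name isAllDigits are hypothetical and chosen only for this illustration.

```java
final class QuantityValidator {
    // Returns true only for a non-empty string whose code points are all decimal digits.
    static boolean isAllDigits(String input) {
        if (input == null || input.isEmpty()) {
            return false;
        }
        return input.codePoints().allMatch(Character::isDigit);
    }

    public static void main(String[] args) {
        System.out.println(isAllDigits("42"));            // ASCII digits
        System.out.println(isAllDigits("4two"));          // contains letters
        System.out.println(isAllDigits("\u0664\u0662"));  // Arabic-Indic digits
    }
}
```

Because the check is based on Unicode properties rather than on the range '0' to '9', it also accepts digits from other scripts. Whether that is desirable depends on how the value is parsed afterwards.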

The Character.getType method returns the Unicode category of a character. Each category corresponds to a constant defined in the Character class. For instance, getType returns the Character.UPPERCASE_LETTER constant for the character A. The following example shows how to use getType and the Character category constants. All of the expressions in these if statements are true:

if (Character.getType('a') == Character.LOWERCASE_LETTER)
    // ...

if (Character.getType('R') == Character.UPPERCASE_LETTER)
    // ...

if (Character.getType('>') == Character.MATH_SYMBOL)
    // ...

if (Character.getType('_') == Character.CONNECTOR_PUNCTUATION)
    // ...

02. Comparing Strings

Applications that sort through text perform frequent string comparisons. For example, a report generator performs string comparisons when sorting a list of strings in alphabetical order.

If your application audience is limited to people who speak English, you can probably perform string comparisons with the String.compareTo method. The String.compareTo method performs a binary comparison of the Unicode characters within the two strings. For most languages, however, this binary comparison cannot be relied on to sort strings, because the Unicode values do not correspond to the relative order of the characters.

Fortunately the Collator class allows your application to perform string comparisons for different languages. In this section, you'll learn how to use the Collator class when sorting text.

2.1 Performing Locale-Independent Comparisons

Collation rules define the sort sequence of strings. These rules vary with locale, because various natural languages sort words differently. You can use the predefined collation rules provided by the Collator class to sort strings in a locale-independent manner.

To instantiate the Collator class invoke the getInstance method. Usually, you create a Collator for the default Locale, as in the following example:

Collator myDefaultCollator = Collator.getInstance();

You can also specify a particular Locale when you create a Collator, as follows:

Collator myFrenchCollator = Collator.getInstance(Locale.FRENCH);

The getInstance method returns a RuleBasedCollator, which is a concrete subclass of Collator. The RuleBasedCollator contains a set of rules that determine the sort order of strings for the locale you specify. These rules are predefined for each locale. Because the rules are encapsulated within the RuleBasedCollator, your program won't need special routines to deal with the way collation rules vary with language.

You invoke the Collator.compare method to perform a locale-independent string comparison. The compare method returns an integer less than, equal to, or greater than zero when the first string argument is less than, equal to, or greater than the second string argument. The following table contains some sample calls to Collator.compare:

Example                   Return Value   Explanation
compare("abc", "def")     -1             "abc" is less than "def"
compare("rtf", "rtf")     0              the two strings are equal
compare("xyz", "abc")     1              "xyz" is greater than "abc"
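The difference between a binary comparison and a collated comparison can be sketched as follows. The class and method names are hypothetical. Because Collator implements Comparator, the sketch passes it straight to Arrays.sort rather than using a hand-written sort loop.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

final class CollatorSortSketch {
    // Sorts a copy of the input using the collation rules of the given locale.
    static String[] sortedCopy(String[] words, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        String[] copy = words.clone();
        Arrays.sort(copy, collator); // Collator implements Comparator<Object>
        return copy;
    }

    public static void main(String[] args) {
        String[] words = {"cherry", "Banana", "apple"};

        // Binary comparison: uppercase 'B' (U+0042) sorts before lowercase 'a' (U+0061).
        String[] binary = words.clone();
        Arrays.sort(binary);
        System.out.println(Arrays.toString(binary));

        // Collated comparison: case is only a minor difference, so "apple" comes first.
        System.out.println(Arrays.toString(sortedCopy(words, Locale.US)));
    }
}
```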

You use the compare method when performing sort operations. The sample program called CollatorDemo uses the compare method to sort an array of English and French words. This program shows what can happen when you sort the same list of words with two different collators:

Collator fr_FRCollator = Collator.getInstance(new Locale("fr","FR"));
Collator en_USCollator = Collator.getInstance(new Locale("en","US"));

The method for sorting, called sortStrings, can be used with any Collator. Notice that the sortStrings method invokes the compare method:

public static void sortStrings(Collator collator, String[] words) {
    String tmp;
    for (int i = 0; i < words.length; i++) {
        for (int j = i + 1; j < words.length; j++) {
            // Swap the pair if words[i] sorts after words[j]
            if (collator.compare(words[i], words[j]) > 0) {
                tmp = words[i];
                words[i] = words[j];
                words[j] = tmp;
            }
        }
    }
}

The English Collator sorts the words as follows:


According to the collation rules of the French language, the preceding list is in the wrong order. In French péché should follow pêche in a sorted list. The French Collator sorts the array of words correctly, as follows:


2.2 Customizing Collation Rules

The previous section discussed how to use the predefined rules for a locale to compare strings. These collation rules determine the sort order of strings. If the predefined collation rules do not meet your needs, you can design your own rules and assign them to a RuleBasedCollator object.

Customized collation rules are contained in a String object that is passed to the RuleBasedCollator constructor. Here's a simple example:

String simpleRule = "< a < b < c < d";
RuleBasedCollator simpleCollator =  new RuleBasedCollator(simpleRule);

For the simpleCollator object in the previous example, a is less than b, which is less than c, and so forth. The compare method references these rules when comparing strings. The full syntax used to construct a collation rule is more flexible and complex than this simple example. For a full description of the syntax, refer to the API documentation for the RuleBasedCollator class.
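As a small illustration of the rule syntax, the sketch below defines a rule string in which the pair ch is a single unit that sorts after c and before d. The rule string, the class name, and the method name are assumptions made for this example and cover only a handful of letters.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

final class ContractionRuleSketch {
    // Illustrative rules (an assumption for this sketch): "ch" is one
    // collation element that sorts after "c" and before "d".
    static final String RULES = "< a < b < c < ch < d < e < h < z";

    static RuleBasedCollator newCollator() {
        try {
            return new RuleBasedCollator(RULES);
        } catch (ParseException e) {
            // The rule string is a constant, so a parse failure is a programming error.
            throw new IllegalStateException("Invalid collation rules", e);
        }
    }

    public static void main(String[] args) {
        RuleBasedCollator collator = newCollator();
        // Negative result: "cz" sorts before "ch" under these rules.
        System.out.println(collator.compare("cz", "ch"));
        // Negative result: a binary comparison puts "ch" first, because 'h' < 'z'.
        System.out.println("ch".compareTo("cz"));
    }
}
```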

The example that follows sorts a list of Spanish words with two collators. Full source code for this example is in

The RulesDemo program starts by defining collation rules for English and Spanish. The program will sort the Spanish words in the traditional manner. When sorting by the traditional rules, the letters ch and ll and their uppercase equivalents each have their own positions in the sort order. These character pairs compare as if they were one character. For example, ch sorts as a single letter, following cz in the sort order. Note how the rules for the two collators differ:

String englishRules = (
    "< a,A < b,B < c,C < d,D < e,E < f,F " +
    "< g,G < h,H < i,I < j,J < k,K < l,L " +
    "< m,M < n,N < o,O < p,P < q,Q < r,R " +
    "< s,S < t,T < u,U < v,V < w,W < x,X " +
    "< y,Y < z,Z");

String smallnTilde = "\u00F1";    // ñ
String capitalNTilde = "\u00D1";  // Ñ

String traditionalSpanishRules = (
    "< a,A < b,B < c,C " +
    "< ch, cH, Ch, CH " +
    "< d,D < e,E < f,F " +
    "< g,G < h,H < i,I < j,J < k,K < l,L " +
    "< ll, lL, Ll, LL " +
    "< m,M < n,N " +
    "< " + smallnTilde + "," + capitalNTilde + " " +
    "< o,O < p,P < q,Q < r,R " +
    "< s,S < t,T < u,U < v,V < w,W < x,X " +
    "< y,Y < z,Z");

The following lines of code create the collators and invoke the sort routine:

try {
    RuleBasedCollator enCollator = new RuleBasedCollator(englishRules);
    RuleBasedCollator spCollator =
        new RuleBasedCollator(traditionalSpanishRules);

    sortStrings(enCollator, words);

    sortStrings(spCollator, words);
} catch (ParseException pe) {
    System.out.println("Parse exception for rules");
}

The sort routine, called sortStrings, is generic. It will sort any array of words according to the rules of any Collator object:

public static void sortStrings(Collator collator, String[] words) {
    String tmp;
    for (int i = 0; i < words.length; i++) {
        for (int j = i + 1; j < words.length; j++) {
            // Swap the pair if words[i] sorts after words[j]
            if (collator.compare(words[i], words[j]) > 0) {
                tmp = words[i];
                words[i] = words[j];
                words[j] = tmp;
            }
        }
    }
}

When sorted with the English collation rules, the array of words is as follows:


Compare the preceding list with the following, which is sorted according to the traditional Spanish rules of collation:


2.3 Improving Collation Performance

Sorting long lists of strings is often time consuming. If your sort algorithm compares strings repeatedly, you can speed up the process by using the CollationKey class.

A CollationKey object represents a sort key for a given String and Collator. Comparing two CollationKey objects involves a bitwise comparison of sort keys and is faster than comparing String objects with the Collator.compare method. However, generating CollationKey objects requires time. Therefore, if a String is to be compared just once, Collator.compare offers better performance.
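As a rough sketch of this approach, the following hypothetical helper builds one CollationKey per word, sorts the keys, and then recovers the original strings. It relies on CollationKey implementing Comparable, so Arrays.sort can order the keys directly. The class name, method name, and sample words are assumptions made for this illustration.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

final class CollationKeySketch {
    // Builds one key per word, sorts the keys, then recovers the source strings.
    static String[] sortWithKeys(String[] words, Collator collator) {
        CollationKey[] keys = new CollationKey[words.length];
        for (int i = 0; i < words.length; i++) {
            keys[i] = collator.getCollationKey(words[i]);
        }
        Arrays.sort(keys); // CollationKey implements Comparable<CollationKey>
        String[] sorted = new String[keys.length];
        for (int i = 0; i < keys.length; i++) {
            sorted[i] = keys[i].getSourceString();
        }
        return sorted;
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.US);
        String[] words = {"peach", "Apricot", "grape", "lemon"};
        System.out.println(Arrays.toString(sortWithKeys(words, collator)));
    }
}
```

Each key is generated once, however many comparisons the sort performs, which is where the saving comes from on long lists.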

The example that follows uses a CollationKey object to sort an array of words. Source code for this example is in

The KeysDemo program creates an array of CollationKey objects in the main method. To create a CollationKey, you invoke the getCollationKey method on a Collator object. You cannot compare two CollationKey objects unless they originate from the same Collator. The main method is as follows:

static public void main(String[] args) {
    Collator enUSCollator = Collator.getInstance(new Locale("en","US"));
    String[] words = {
        // word list elided in the source
    };

    CollationKey[] keys = new CollationKey[words.length];

    // Generate one sort key per word, all from the same Collator
    for (int k = 0; k < keys.length; k++) {
        keys[k] = enUSCollator.getCollationKey(words[k]);
    }

    sortArray(keys);
    displayWords(keys);
}


The sortArray method invokes the CollationKey.compareTo method. The compareTo method returns an integer less than, equal to, or greater than zero if the keys[i] object is less than, equal to, or greater than the keys[j] object. Note that the program compares the CollationKey objects, not the String objects from the original array of words. Here is the code for the sortArray method:

public static void sortArray(CollationKey[] keys) {
    CollationKey tmp;

    for (int i = 0; i < keys.length; i++) {
        for (int j = i + 1; j < keys.length; j++) {
            // Swap the pair if keys[i] sorts after keys[j]
            if (keys[i].compareTo(keys[j]) > 0) {
                tmp = keys[i];
                keys[i] = keys[j];
                keys[j] = tmp;
            }
        }
    }
}

The KeysDemo program sorts an array of CollationKey objects, but the original goal was to sort an array of String objects. To retrieve the String representation of each CollationKey, the program invokes getSourceString in the displayWords method, as follows:

static void displayWords(CollationKey[] keys) {
    for (int i = 0; i < keys.length; i++) {
        System.out.println(keys[i].getSourceString());
    }
}

The displayWords method prints the following lines:


03. Unicode

Unicode is a computing industry standard designed to consistently and uniquely encode characters used in written languages throughout the world. The Unicode standard uses hexadecimal to express a character. For example, the value 0x0041 represents the Latin character A. The Unicode standard was initially designed using 16 bits to encode characters because the primary machines were 16-bit PCs.

When the specification for the Java language was created, the Unicode standard was accepted and the char primitive was defined as a 16-bit data type, with characters in the hexadecimal range from 0x0000 to 0xFFFF.

Because 16-bit encoding supports 2^16 (65,536) characters, which is insufficient to define all characters in use throughout the world, the Unicode standard was extended to 0x10FFFF, which supports over one million characters. The definition of a character in the Java programming language could not be changed from 16 bits to 32 bits without causing millions of Java applications to no longer run properly. To correct the definition, a scheme was developed to handle characters that could not be encoded in 16 bits.

The characters with values that are outside of the 16-bit range, and within the range from 0x10000 to 0x10FFFF, are called supplementary characters and are defined as a pair of char values.


3.1 Terminology

A character is a minimal unit of text that has semantic value.

A character set is a collection of characters that might be used by multiple languages. For example, the Latin character set is used by English and most European languages, though the Greek character set is used only by the Greek language.

A coded character set is a character set where each character is assigned a unique number.

A code point is a value that can be used in a coded character set. A code point is a 32-bit int data type, where the lower 21 bits represent a valid code point value and the upper 11 bits are 0.

A Unicode code unit is a 16-bit char value. For example, imagine a String that contains the letters "abc" followed by the Deseret LONG I, which is represented with two char values. That string contains four characters, four code points, but five code units.
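The counts in this example can be checked with a short sketch. The class and method names are hypothetical. The Deseret LONG I is written here as its two char values, \uD801 and \uDC00.

```java
final class CodeUnitCountSketch {
    // "abc" followed by the Deseret LONG I (U+10400), written as two char values.
    static final String SAMPLE = "abc\uD801\uDC00";

    // Number of 16-bit code units (char values).
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(codeUnits(SAMPLE));  // 5
        System.out.println(codePoints(SAMPLE)); // 4
    }
}
```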

To express a character in Unicode, the hexadecimal value is prefixed with the string U+. The valid code point range for the Unicode standard is U+0000 to U+10FFFF, inclusive. The code point value for the Latin character A is U+0041. The character €, which represents the Euro currency, has the code point value U+20AC. The first letter in the Deseret alphabet, the LONG I, has the code point value U+10400.

The following table shows code point values for several characters:

Character          Unicode Code Point   Glyph
Latin A            U+0041               A
Latin sharp S      U+00DF               ß
Han for East       U+6771               東
Deseret, LONG I    U+10400              𐐀

As previously described, characters that are in the range U+10000 to U+10FFFF are called supplementary characters. The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP).

3.2 Supplementary Characters as Surrogates

To support supplementary characters without changing the char primitive data type and causing incompatibility with previous Java programs, supplementary characters are defined by a pair of code point values that are called surrogates. The first code point is from the high surrogates range of U+D800 to U+DBFF, and the second code point is from the low surrogates range of U+DC00 to U+DFFF. For example, the Deseret character LONG I, U+10400, is defined with this pair of surrogate values: U+D801 and U+DC00.
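The mapping from a supplementary code point to its surrogate pair follows the UTF-16 scheme: subtract 0x10000, then place the upper 10 bits of the result in the high surrogate and the lower 10 bits in the low surrogate. The sketch below does this arithmetic by hand and compares it with Character.toChars. The class and its two helper methods are hypothetical and assume a valid supplementary code point as input.

```java
final class SurrogateSketch {
    // High surrogate: 0xD800 plus the upper 10 bits of (codePoint - 0x10000).
    static char highSurrogate(int codePoint) {
        return (char) (0xD800 + ((codePoint - 0x10000) >> 10));
    }

    // Low surrogate: 0xDC00 plus the lower 10 bits of (codePoint - 0x10000).
    static char lowSurrogate(int codePoint) {
        return (char) (0xDC00 + ((codePoint - 0x10000) & 0x3FF));
    }

    public static void main(String[] args) {
        int longI = 0x10400; // Deseret LONG I

        // Prints "D801 DC00"
        System.out.printf("%04X %04X%n",
            (int) highSurrogate(longI), (int) lowSurrogate(longI));

        // The library conversion yields the same pair, and the pair maps back.
        char[] pair = Character.toChars(longI);
        System.out.println(pair[0] == highSurrogate(longI)
            && pair[1] == lowSurrogate(longI));
        System.out.println(Character.toCodePoint(pair[0], pair[1]) == longI);
    }
}
```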

3.3 Character and String API

The Character class encapsulates the char data type. For the J2SE release 5, many methods were added to the Character class to support supplementary characters. This API falls into two categories: methods that convert between char and code point values, and methods that verify the validity of, or map, code points.

This section describes a subset of the available methods in the Character class.

Conversion Methods and the Character Class

The following table includes the most useful conversion methods, or methods that facilitate conversion, in the Character class. The codePointAt and codePointBefore methods are included in this list because text is generally found in a sequence, such as a String, and these methods can be used to extract the desired substring.

Method(s)
    Description

toChars(int codePoint, char[] dst, int dstIndex)
toChars(int codePoint)
    Converts the specified Unicode code point to its UTF-16 representation and places it in a char array. Sample usage: Character.toChars(0x10400)

toCodePoint(char high, char low)
    Converts the specified surrogate pair to its supplementary code point value.

codePointAt(char[] a, int index)
codePointAt(char[] a, int index, int limit)
codePointAt(CharSequence seq, int index)
    Returns the Unicode code point at the specified index. The third method takes a CharSequence and the second method enforces an upper limit on the index.

codePointBefore(char[] a, int index)
codePointBefore(char[] a, int index, int start)
codePointBefore(CharSequence seq, int index)
    Returns the Unicode code point before the specified index. The third method accepts a CharSequence and the other methods accept a char array. The second method enforces a lower limit on the index.

charCount(int codePoint)
    Returns the value 1 for characters that can be represented by a single char. Returns the value 2 for supplementary characters that require two chars.
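A common use of codePointAt together with charCount is to walk through text one code point at a time. The following is a minimal sketch with hypothetical class and method names.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointIterationSketch {
    // Walks a string one code point at a time, advancing by one or two char values.
    static List<Integer> codePointsOf(String text) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int codePoint = Character.codePointAt(text, i);
            result.add(codePoint);
            i += Character.charCount(codePoint);
        }
        return result;
    }

    public static void main(String[] args) {
        // 'a', the Deseret LONG I (a surrogate pair), then 'b'
        for (int codePoint : codePointsOf("a\uD801\uDC00b")) {
            System.out.printf("U+%04X%n", codePoint);
        }
    }
}
```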

Verification and Mapping Methods in the Character Class

Some of the previous methods that used the char primitive data type, such as isLowerCase(char) and isDigit(char), were supplanted by methods that support supplementary characters, such as isLowerCase(int) and isDigit(int). The previous methods are supported but do not work with supplementary characters. To create a global application and ensure that your code works seamlessly with any language, it is recommended that you use the newer forms of these methods.

Note that, for performance reasons, most methods that accept a code point do not verify the validity of the code point parameter. You can use the isValidCodePoint method for that purpose.

The following table lists some of the verification and mapping methods in the Character class.

Method(s)
    Description

isValidCodePoint(int codePoint)
    Returns true if the code point is within the range of 0x0000 to 0x10FFFF, inclusive.

isSupplementaryCodePoint(int codePoint)
    Returns true if the code point is within the range of 0x10000 to 0x10FFFF, inclusive.

isHighSurrogate(char)
    Returns true if the specified char is within the high surrogate range of \uD800 to \uDBFF, inclusive.

isLowSurrogate(char)
    Returns true if the specified char is within the low surrogate range of \uDC00 to \uDFFF, inclusive.

isSurrogatePair(char high, char low)
    Returns true if the specified high and low surrogate code values represent a valid surrogate pair.

codePointCount(CharSequence, int, int)
codePointCount(char[], int, int)
    Returns the number of Unicode code points in the CharSequence, or char array.

isLowerCase(int)
isUpperCase(int)
    Returns true if the specified Unicode code point is a lowercase or uppercase character, respectively.

isDefined(int)
    Returns true if the specified Unicode code point is defined in the Unicode standard.

isJavaIdentifierStart(char)
isJavaIdentifierStart(int)
    Returns true if the specified character or Unicode code point is permissible as the first character in a Java identifier.

isLetter(int)
isDigit(int)
isLetterOrDigit(int)
    Returns true if the specified Unicode code point is a letter, a digit, or a letter or digit, respectively.

getDirectionality(int)
    Returns the Unicode directionality property for the given Unicode code point.

Character.UnicodeBlock.of(int codePoint)
    Returns the object representing the Unicode block that contains the given Unicode code point or returns null if the code point is not a member of a defined block.
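The sketch below combines a few of these methods. It validates the code point first, because most of the other methods do not check their argument. The describe method and its output format are hypothetical and exist only for this illustration.

```java
final class CodePointChecksSketch {
    // Summarizes a few properties of a code point; returns "invalid" for out-of-range values.
    static String describe(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            return "invalid";
        }
        String plane = Character.isSupplementaryCodePoint(codePoint) ? "supplementary" : "BMP";
        String kind = Character.isLetter(codePoint) ? "letter" : "non-letter";
        return plane + " " + kind + " in block " + Character.UnicodeBlock.of(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(describe('A'));       // a BMP letter
        System.out.println(describe('4'));       // a BMP non-letter
        System.out.println(describe(0x10400));   // a supplementary letter (Deseret LONG I)
        System.out.println(describe(0x110000));  // outside the valid range
    }
}
```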

Methods in the String Classes

The String, StringBuffer, and StringBuilder classes also have constructors and methods that work with supplementary characters. The following table lists some of the commonly used methods.

Constructor or Methods
    Description

String(int[] codePoints, int offset, int count)
    Allocates a new String instance that contains characters from a subarray of a Unicode code point array.

String.codePointAt(int index)
StringBuffer.codePointAt(int index)
StringBuilder.codePointAt(int index)
    Returns the Unicode code point at the specified index.

String.codePointBefore(int index)
StringBuffer.codePointBefore(int index)
StringBuilder.codePointBefore(int index)
    Returns the Unicode code point before the specified index.

String.codePointCount(int beginIndex, int endIndex)
StringBuffer.codePointCount(int beginIndex, int endIndex)
StringBuilder.codePointCount(int beginIndex, int endIndex)
    Returns the number of Unicode code points in the specified range.

StringBuffer.appendCodePoint(int codePoint)
StringBuilder.appendCodePoint(int codePoint)
    Appends the string representation of the specified code point to the sequence.

String.offsetByCodePoints(int index, int codePointOffset)
StringBuffer.offsetByCodePoints(int index, int codePointOffset)
StringBuilder.offsetByCodePoints(int index, int codePointOffset)
    Returns the index that is offset from the given index by the given number of code points.
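To illustrate how these methods fit together, the sketch below builds a string with appendCodePoint and then uses offsetByCodePoints to find the char index at which a given code point starts. The class and method names are hypothetical.

```java
final class StringCodePointSketch {
    // Builds a string from code points with StringBuilder.appendCodePoint.
    static String fromCodePoints(int... codePoints) {
        StringBuilder sb = new StringBuilder();
        for (int codePoint : codePoints) {
            sb.appendCodePoint(codePoint);
        }
        return sb.toString();
    }

    // Returns the char index at which the n-th code point (zero-based) starts.
    static int indexOfCodePoint(String text, int n) {
        return text.offsetByCodePoints(0, n);
    }

    public static void main(String[] args) {
        String s = fromCodePoints(0x10400, 'A', 'B');
        System.out.println(s.length());             // 4 char values
        System.out.println(indexOfCodePoint(s, 1)); // 'A' starts at char index 2
    }
}
```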

3.4 Sample Usage

This page contains some code snippets that show you several common scenarios.

Creating a String from a Code Point

String newString(int codePoint) {
    return new String(Character.toChars(codePoint));
}

Creating a String from a Code Point - Optimized for BMP Characters

The Character.toChars method creates a temporary array that is used once and then discarded. If this negatively affects performance, you can use the following approach that is optimized for BMP characters (characters that are represented by a single char value). In this method, toChars is invoked only for supplementary characters.

String newString(int codePoint) {
    if (Character.charCount(codePoint) == 1) {
        // Cast to char: String.valueOf(int) would return the decimal number instead
        return String.valueOf((char) codePoint);
    } else {
        return new String(Character.toChars(codePoint));
    }
}

Creating String Objects in Bulk

To create a large number of strings, the bulk version of the previous snippet reuses the array used by the toChars method. This method creates a separate String instance for each code point and is optimized for BMP characters.

String[] newStrings(int[] codePoints) {
    String[] result = new String[codePoints.length];
    char[] codeUnits = new char[2];  // reused for every code point
    for (int i = 0; i < codePoints.length; i++) {
        int count = Character.toChars(codePoints[i], codeUnits, 0);
        result[i] = new String(codeUnits, 0, count);
    }
    return result;
}

Generating Messages

The formatting API supports supplementary characters. The following example is a simple way to generate a message.

// recommended
System.out.printf("Character %c is invalid.%n", codePoint);

The preceding approach is simple and avoids concatenation. Concatenation makes the text more difficult to localize, because not all languages insert values into a string in the same order as English. The following concatenation-based version is therefore not recommended:

// not recommended (ch is a char value)
System.out.println("Character " + String.valueOf(ch) + " is invalid.");

3.5 Design Considerations

To write code that works seamlessly for any language using any script, there are a few things to keep in mind.

Consideration
    Reason

Avoid methods that use the char data type.
    Avoid using the char primitive data type or methods that use the char data type, because code that uses that data type does not work for supplementary characters. For methods that take a char type parameter, use the corresponding int method, where available. For example, use the Character.isDigit(int) method rather than the Character.isDigit(char) method.

Use the isValidCodePoint method to verify code point values.
    A code point is defined as an int data type, which allows for values outside of the valid range of code point values from 0x0000 to 0x10FFFF. For performance reasons, the methods that take a code point value as a parameter do not check the validity of the parameter, but you can use the isValidCodePoint method to check the value.

Use the codePointCount method to count characters.
    The String.length() method returns the number of code units, or 16-bit char values, in the string. If the string contains supplementary characters, the count can be misleading because it will not reflect the true number of code points. To get an accurate count of the number of characters (including supplementary characters), use the codePointCount method.

Use the String.toUpperCase() and String.toLowerCase() methods rather than the Character.toUpperCase(int codePoint) or Character.toLowerCase(int codePoint) methods.
    While the Character.toUpperCase(int) and Character.toLowerCase(int) methods do work with code point values, there are some characters that cannot be converted on a one-to-one basis. The lowercase German character ß, for example, becomes two characters, SS, when converted to uppercase. Likewise, the small Greek Sigma character is different depending on the position in the string. The Character.toUpperCase(int) and Character.toLowerCase(int) methods cannot handle these types of cases; however, the String.toUpperCase and String.toLowerCase methods handle these cases correctly.

Be careful when deleting characters.
    When invoking the StringBuilder.deleteCharAt(int index) or StringBuffer.deleteCharAt(int index) methods where the index points to a supplementary character, only the first half of that character (the first char value) is removed. First, invoke the Character.charCount method on the character to determine if one or two char values must be removed.

Be careful when reversing characters in a sequence.
    Reversing text one char value at a time swaps the high and low surrogates of each supplementary character, which results in incorrect and possibly invalid surrogate pairs. The StringBuffer.reverse() and StringBuilder.reverse() methods treat each surrogate pair as a single character, so prefer them to a hand-written loop over char values.
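Two of these considerations can be sketched briefly: deleting a whole code point from a StringBuilder, and the difference between String and Character case mapping. The class name and the deleteCodePointAt helper are hypothetical and are not part of the Java API.

```java
import java.util.Locale;

final class SupplementarySafeEdits {
    // Deletes the whole code point that starts at the given char index,
    // removing one or two char values as needed.
    static void deleteCodePointAt(StringBuilder sb, int index) {
        int codePoint = sb.codePointAt(index);
        sb.delete(index, index + Character.charCount(codePoint));
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder("a\uD801\uDC00b");
        deleteCodePointAt(sb, 1);
        System.out.println(sb); // ab

        // String case mapping can change the length of the text...
        System.out.println("stra\u00DFe".toUpperCase(Locale.ROOT)); // STRASSE
        // ...but Character case mapping is one-to-one, so ß maps to itself.
        System.out.println(Character.toUpperCase(0x00DF) == 0x00DF); // true
    }
}
```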

04. Detecting Text Boundaries

Applications that manipulate text need to locate boundaries within the text. For example, consider some of the common functions of a word processor: highlighting a character, cutting a word, moving the cursor to the next sentence, and wrapping a word at a line ending. To perform each of these functions, the word processor must be able to detect the logical boundaries in the text. Fortunately you don't have to write your own routines to perform boundary analysis. Instead, you can take advantage of the methods provided by the BreakIterator class.

4.1 About the BreakIterator Class

The BreakIterator class is locale-sensitive, because text boundaries vary with language. For example, the syntax rules for line breaks are not the same for all languages. To determine which locales the BreakIterator class supports, invoke the getAvailableLocales method, as follows:

Locale[] locales = BreakIterator.getAvailableLocales();

You can analyze four kinds of boundaries with the BreakIterator class: character, word, sentence, and potential line break. When instantiating a BreakIterator, you invoke the appropriate factory method:

◈ getCharacterInstance
◈ getWordInstance
◈ getSentenceInstance
◈ getLineInstance

Each instance of BreakIterator can detect just one type of boundary. If you want to locate both character and word boundaries, for example, you create two separate instances.

A BreakIterator has an imaginary cursor that points to the current boundary in a string of text. You can move this cursor within the text with the previous and the next methods. For example, if you've created a BreakIterator with getWordInstance, the cursor moves to the next word boundary in the text every time you invoke the next method. The cursor-movement methods return an integer indicating the position of the boundary. This position is the index of the character in the text string that would follow the boundary. Like string indexes, the boundaries are zero-based. The first boundary is at 0, and the last boundary is the length of the string. The following figure shows the word boundaries detected by the next and previous methods in a line of text:

[Figure: word boundaries detected by the next and previous methods in a line of text]

You should use the BreakIterator class only with natural-language text. To tokenize a programming language, use the StreamTokenizer class.
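To make the cursor model described in this section concrete, the following sketch collects every boundary that a word BreakIterator reports, starting with first and calling next until BreakIterator.DONE is returned. The class and method names are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class BoundaryListSketch {
    // Returns all word boundaries in the text, from 0 up to text.length().
    static List<Integer> wordBoundaries(String text, Locale locale) {
        BreakIterator iterator = BreakIterator.getWordInstance(locale);
        iterator.setText(text);
        List<Integer> boundaries = new ArrayList<>();
        for (int b = iterator.first(); b != BreakIterator.DONE; b = iterator.next()) {
            boundaries.add(b);
        }
        return boundaries;
    }

    public static void main(String[] args) {
        // "Hi" spans 0-2, the space 2-3, and "there" 3-8.
        System.out.println(wordBoundaries("Hi there", Locale.US)); // [0, 2, 3, 8]
    }
}
```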

4.2 Character Boundaries

You need to locate character boundaries if your application allows the end user to highlight individual characters or to move a cursor through text one character at a time. To create a BreakIterator that locates character boundaries, you invoke the getCharacterInstance method, as follows:

BreakIterator characterIterator =
    BreakIterator.getCharacterInstance(currentLocale);

This type of BreakIterator detects boundaries between user characters, not just Unicode characters.

A user character may be composed of more than one Unicode character. For example, the user character ü can be composed by combining the Unicode characters \u0075 (u) and \u0308 (the combining diaeresis). This isn't the best example, however, because the character ü may also be represented by the single Unicode character \u00fc. We'll draw on the Arabic language for a more realistic example.

In Arabic the word for house is:

[Image: the Arabic word for house]

This word contains three user characters, but it is composed of the following six Unicode characters:

String house = "\u0628" + "\u064e" + "\u064a" + "\u0652" + "\u067a" + "\u064f";

The Unicode characters at positions 1, 3, and 5 in the house string are diacritics. Arabic requires diacritics because they can alter the meanings of words. The diacritics in the example are nonspacing characters, since they appear above the base characters. In an Arabic word processor you cannot move the cursor on the screen once for every Unicode character in the string. Instead you must move it once for every user character, which may be composed of more than one Unicode character. Therefore you must use a BreakIterator to scan the user characters in the string.
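As a minimal sketch of this idea, the following hypothetical helper counts user characters by counting the boundaries that a character BreakIterator reports after the starting position. It uses the default locale, which is an assumption made for brevity.

```java
import java.text.BreakIterator;

final class UserCharacterSketch {
    // Counts user characters: each call to next() that does not return DONE
    // moves past exactly one user character.
    static int countUserCharacters(String text) {
        BreakIterator iterator = BreakIterator.getCharacterInstance();
        iterator.setText(text);
        int count = 0;
        while (iterator.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String house = "\u0628" + "\u064e" + "\u064a" + "\u0652" + "\u067a" + "\u064f";
        System.out.println(house.length());             // 6 char values
        System.out.println(countUserCharacters(house)); // 3 user characters
    }
}
```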

The sample program BreakIteratorDemo creates a BreakIterator to scan Arabic characters. The program passes this BreakIterator, along with the String object created previously, to a method named listPositions:

BreakIterator arCharIterator = BreakIterator.getCharacterInstance(
                                   new Locale ("ar","SA"));
listPositions (house, arCharIterator);

The listPositions method uses a BreakIterator to locate the character boundaries in the string. Note that the BreakIteratorDemo assigns a particular string to the BreakIterator with the setText method. The program retrieves the first character boundary with the first method and then invokes the next method until the constant BreakIterator.DONE is returned. The code for this routine is as follows:

static void listPositions(String target, BreakIterator iterator) {
    iterator.setText(target);
    int boundary = iterator.first();

    while (boundary != BreakIterator.DONE) {
        System.out.println(boundary);
        boundary = iterator.next();
    }
}

The listPositions method prints out the following boundary positions for the user characters in the string house. Note that the positions of the diacritics (1, 3, 5) are not listed:

0
2
4
6


4.3 Word Boundaries

You invoke the getWordInstance method to instantiate a BreakIterator that detects word boundaries:

BreakIterator wordIterator =
    BreakIterator.getWordInstance(currentLocale);

You'll want to create such a BreakIterator when your application needs to perform operations on individual words. These operations might be common word-processing functions, such as selecting, cutting, pasting, and copying. Or, your application may search for words, and it must be able to distinguish entire words from simple strings.

When a BreakIterator analyzes word boundaries, it differentiates between words and characters that are not part of words. These characters, which include spaces, tabs, punctuation marks, and most symbols, have word boundaries on both sides.
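To make the rule above concrete, here is a small sketch that returns every segment between two consecutive word boundaries, including the segments that are not words. The class name WordSegments and the sample sentence are assumptions for illustration only.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordSegments {

    // Splits the text at every word boundary and keeps all segments,
    // so spaces and punctuation marks show up as segments of their own.
    static List<String> segments(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            result.add(text.substring(start, end));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample text.
        for (String segment : segments("Hi, there.", Locale.US)) {
            System.out.println("[" + segment + "]");
        }
    }
}
```

Because the comma, the space, and the period each have a boundary on both sides, they come back as separate one-character segments between the words.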

The example that follows, which is from the program BreakIteratorDemo, marks the word boundaries in some text. The program creates the BreakIterator and then calls the markBoundaries method:

Locale currentLocale = new Locale ("en","US");

BreakIterator wordIterator =
    BreakIterator.getWordInstance(currentLocale);

String someText = "She stopped.  " +
    "She said, \"Hello there,\" and then went " +
    "on.";

markBoundaries(someText, wordIterator);

The markBoundaries method is defined in the BreakIteratorDemo program. This method marks boundaries by printing carets (^) beneath the target string. In the code that follows, notice the while loop where markBoundaries scans the string by calling the next method:

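static void markBoundaries(String target, BreakIterator iterator) {

    // Start with a line of spaces as long as the target, plus one
    // position for the boundary after the last character.
    StringBuffer markers = new StringBuffer();
    markers.setLength(target.length() + 1);
    for (int k = 0; k < markers.length(); k++) {
        markers.setCharAt(k,' ');
    }

    iterator.setText(target);
    int boundary = iterator.first();

    while (boundary != BreakIterator.DONE) {
        markers.setCharAt(boundary,'^');
        boundary = iterator.next();
    }

    System.out.println(target);
    System.out.println(markers);
}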


The output of the markBoundaries method follows. Note where the carets (^) occur in relation to the punctuation marks and spaces:

She stopped.  She said, "Hello there," and then
^  ^^      ^^ ^  ^^   ^^^^    ^^    ^^^^  ^^   ^

went on.
^   ^^ ^^

The BreakIterator class makes it easy to select words from within text. You don't have to write your own routines to handle the punctuation rules of various languages; the BreakIterator class does this for you.

The extractWords method in the following example extracts and prints words for a given string. Note that this method uses Character.isLetterOrDigit to avoid printing "words" that contain space characters.

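static void extractWords(String target, BreakIterator wordIterator) {

    wordIterator.setText(target);
    int start = wordIterator.first();
    int end = wordIterator.next();

    while (end != BreakIterator.DONE) {
        String word = target.substring(start,end);
        // Skip segments that consist of spaces or punctuation.
        if (Character.isLetterOrDigit(word.charAt(0))) {
            System.out.println(word);
        }
        start = end;
        end = wordIterator.next();
    }
}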

The BreakIteratorDemo program invokes extractWords, passing it the same target string used in the previous example. The extractWords method prints out the following list of words:

She
stopped
She
said
Hello
there
and
then
went
on

04. Sentence Boundaries

You can use a BreakIterator to determine sentence boundaries. You start by creating a BreakIterator with the getSentenceInstance method:

BreakIterator sentenceIterator =
    BreakIterator.getSentenceInstance(currentLocale);

To show the sentence boundaries, the program uses the markBoundaries method, which is discussed in the section Word Boundaries. The markBoundaries method prints carets (^) beneath a string to indicate boundary positions. Here are some examples:

She stopped.  She said, "Hello there," and then went on.
^             ^                                         ^

He's vanished!  What will we do?  It's up to us.
^               ^                 ^             ^

Please add 1.5 liters to the tank.
^                                 ^
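The boundaries marked above can also be used to split text into sentences. The following is a minimal sketch under that assumption; the class name SentenceSplitter is hypothetical, and trimming the whitespace between sentences is a choice made here, not something the BreakIterator does for you.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitter {

    // Returns the text between consecutive sentence boundaries,
    // with surrounding whitespace trimmed from each sentence.
    static List<String> sentences(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            result.add(text.substring(start, end).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "He's vanished!  What will we do?  It's up to us.";
        for (String sentence : sentences(text, Locale.US)) {
            System.out.println(sentence);
        }
    }
}
```

As in the third example above, a period inside a number such as 1.5 is not treated as the end of a sentence.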

05. Line Boundaries

Applications that format text or that perform line wrapping must locate potential line breaks. You can find these line breaks, or boundaries, with a BreakIterator that has been created with the getLineInstance method:

BreakIterator lineIterator =
    BreakIterator.getLineInstance(currentLocale);

This BreakIterator determines the positions in a string where text can break to continue on the next line. The positions detected by the BreakIterator are potential line breaks. The actual line breaks displayed on the screen may not be the same.

The two examples that follow use the markBoundaries method of BreakIteratorDemo to show the line boundaries detected by a BreakIterator. The markBoundaries method indicates line boundaries by printing carets (^) beneath the target string.

According to a BreakIterator, a line boundary occurs after the termination of a sequence of whitespace characters (space, tab, new line). In the following example, note that you can break the line at any of the boundaries detected:

She stopped.  She said, "Hello there," and then went on.
^   ^         ^   ^     ^      ^     ^ ^   ^    ^    ^  ^

Potential line breaks also occur immediately after a hyphen:

There are twenty-four hours in a day.
^     ^   ^      ^    ^     ^  ^ ^   ^

The next example breaks a long string of text into fixed-length lines with a method called formatLines. This method uses a BreakIterator to locate the potential line breaks. The formatLines method is short, simple, and, thanks to the BreakIterator, locale-independent. Here is the source code:

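static void formatLines(
    String target, int maxLength,
    Locale currentLocale) {

    BreakIterator boundary = BreakIterator.
        getLineInstance(currentLocale);
    boundary.setText(target);
    int start = boundary.first();
    int end = boundary.next();
    int lineLength = 0;

    while (end != BreakIterator.DONE) {
        String word = target.substring(start,end);
        lineLength = lineLength + word.length();
        // Start a new line when this segment would reach the limit.
        if (lineLength >= maxLength) {
            System.out.println();
            lineLength = word.length();
        }
        System.out.print(word);
        start = end;
        end = boundary.next();
    }
}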

The BreakIteratorDemo program invokes the formatLines method as follows:

String moreText =
    "She said, \"Hello there,\" and then " +
    "went on down the street. When she stopped " +
    "to look at the fur coats in a shop " +
    "window, her dog growled. \"Sorry Jake,\" " +
    "she said. \"I didn't know you would take " +
    "it personally.\"";

formatLines(moreText, 30, currentLocale);

The output from this call to formatLines is:

She said, "Hello there," and
then went on down the
street. When she stopped to
look at the fur coats in a
shop window, her dog
growled. "Sorry Jake," she
said. "I didn't know you
would take it personally."
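A variant that returns the lines instead of printing them can be easier to reuse and to test. The sketch below is one possible way to do this, not the BreakIteratorDemo implementation: the class name LineWrapper is hypothetical, and unlike formatLines it trims trailing whitespace from each line and starts a new line only when the next segment would exceed maxLength.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class LineWrapper {

    // Greedily packs the segments between potential line breaks into
    // lines of at most maxLength chars. The trailing whitespace of a
    // segment counts toward the limit; a single segment longer than
    // maxLength is kept on a line of its own.
    static List<String> wrap(String text, int maxLength, Locale locale) {
        BreakIterator it = BreakIterator.getLineInstance(locale);
        it.setText(text);
        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (line.length() > 0
                    && line.length() + segment.length() > maxLength) {
                lines.add(line.toString().trim());
                line.setLength(0);
            }
            line.append(segment);
        }
        if (line.length() > 0) {
            lines.add(line.toString().trim());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Hypothetical sample text and width.
        String text = "The quick brown fox jumps over the lazy dog";
        for (String line : wrap(text, 10, Locale.US)) {
            System.out.println(line);
        }
    }
}
```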

05. Converting Latin Digits to Other Unicode Digits

By default, when text contains numeric values, those values are displayed using Latin (European) digits. When other Unicode digit shapes are preferred, use the java.awt.font.NumericShaper class. The NumericShaper API enables you to display a numeric value represented internally as an ASCII value in any Unicode digit shape.

The following code snippet, from the ArabicDigits example, shows how to use a NumericShaper instance to convert Latin digits to Arabic digits. A comment in the code marks the line that determines the shaping action.

ArabicDigitsPanel(String fontname) {
    HashMap<TextAttribute, Object> map = new HashMap<>();
    Font font = new Font(fontname, Font.PLAIN, 60);
    map.put(TextAttribute.FONT, font);
    // This line determines the shaping action.
    map.put(TextAttribute.NUMERIC_SHAPING,
        NumericShaper.getShaper(NumericShaper.ARABIC));

    FontRenderContext frc = new FontRenderContext(null, false, false);
    layout = new TextLayout(text, map, frc);
}

// ...

public void paint(Graphics g) {
    Graphics2D g2d = (Graphics2D)g;
    layout.draw(g2d, 10, 50);
}

The NumericShaper instance for Arabic digits is fetched and placed into a HashMap for the TextAttribute.NUMERIC_SHAPING attribute key. The hash map is passed to the TextLayout instance. After rendering the text in the paint method, the digits are displayed in the desired script. In this example, the Latin digits, 0 through 9, are drawn as Arabic digits.
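Shaping can also be tried without any rendering, because NumericShaper has a shape method that rewrites the digits of a char array in place. The following sketch assumes that approach; the class name ShapeDigits and the sample strings are hypothetical and are not part of the ArabicDigits example.

```java
import java.awt.font.NumericShaper;

public class ShapeDigits {

    // Replaces the Latin digits in the text with Arabic digits
    // (U+0660 through U+0669); other characters are left unchanged.
    static String toArabicDigits(String text) {
        char[] chars = text.toCharArray();
        NumericShaper shaper = NumericShaper.getShaper(NumericShaper.ARABIC);
        shaper.shape(chars, 0, chars.length);
        return new String(chars);
    }

    public static void main(String[] args) {
        // Hypothetical sample text.
        System.out.println(toArabicDigits("Qty: 12"));
    }
}
```

This converts the character data itself, whereas the TextAttribute.NUMERIC_SHAPING attribute leaves the underlying text untouched and only changes how a TextLayout draws it.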


The previous example uses the NumericShaper.ARABIC constant to retrieve the desired shaper, but the NumericShaper class provides constants for many languages. These constants are defined as bit masks and are referred to as the NumericShaper bit mask-based constants.

Enum-Based Range Constants

An alternative way to specify a particular set of digits is to use the NumericShaper.Range enumerated type (enum). This enum, introduced in the Java SE 7 release, also provides a set of constants. Although these constants are defined using different mechanisms, the NumericShaper.ARABIC bit mask is functionally equivalent to the NumericShaper.Range.ARABIC enum, and there is a corresponding getShaper method for each constant type:

◈ getShaper(int singleRange)
◈ getShaper(NumericShaper.Range singleRange)

The ArabicDigitsEnum example is identical to the ArabicDigits example, except it uses the NumericShaper.Range enum to specify the language script:

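ArabicDigitsEnumPanel(String fontname) {
    HashMap<TextAttribute, Object> map = new HashMap<>();
    Font font = new Font(fontname, Font.PLAIN, 60);
    map.put(TextAttribute.FONT, font);
    // Same shaping action as before, selected with the Range enum.
    map.put(TextAttribute.NUMERIC_SHAPING,
        NumericShaper.getShaper(NumericShaper.Range.ARABIC));
    FontRenderContext frc = new FontRenderContext(null, false, false);
    layout = new TextLayout(text, map, frc);
}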

Both getShaper methods accept a singleRange parameter. With either constant type, you can also specify a set of script-specific digit ranges. The bit mask-based constants can be combined using the OR operator, or you can create a set of NumericShaper.Range enums. The following shows how to define a set of ranges using each constant type:

EnumSet.of(
    NumericShaper.Range.MONGOLIAN,
    NumericShaper.Range.THAI,
    NumericShaper.Range.TIBETAN)

NumericShaper.MONGOLIAN | NumericShaper.THAI |
NumericShaper.TIBETAN

You can query the NumericShaper object to determine which ranges it supports using either the getRanges method for bit mask-based shapers or the getRangeSet method for enum-based shapers.
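The following sketch shows one way to build a shaper from each constant type and then query it. The class name ShaperRanges is hypothetical, and the choice of the Mongolian, Thai, and Tibetan ranges is only an example. Because more than one range is involved, the sketch uses getContextualShaper rather than getShaper.

```java
import java.awt.font.NumericShaper;
import java.util.EnumSet;
import java.util.Set;

public class ShaperRanges {

    // Shaper built from bit mask-based constants combined with OR.
    static NumericShaper bitMaskShaper() {
        return NumericShaper.getContextualShaper(
            NumericShaper.MONGOLIAN | NumericShaper.THAI
                | NumericShaper.TIBETAN);
    }

    // Shaper built from a set of Range enum constants.
    static NumericShaper enumShaper() {
        Set<NumericShaper.Range> ranges = EnumSet.of(
            NumericShaper.Range.MONGOLIAN,
            NumericShaper.Range.THAI,
            NumericShaper.Range.TIBETAN);
        return NumericShaper.getContextualShaper(ranges);
    }

    public static void main(String[] args) {
        // getRanges returns the OR-ed bit masks, so test it with AND.
        boolean shapesThai =
            (bitMaskShaper().getRanges() & NumericShaper.THAI) != 0;
        System.out.println("bit mask shaper handles Thai: " + shapesThai);

        // getRangeSet returns the ranges as a Set of enum constants.
        System.out.println("enum shaper ranges: "
            + enumShaper().getRangeSet());
    }
}
```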

Note: You can use either the traditional bit mask-based constants or the Range enum-based constants. Here are some considerations when deciding which to use:

◈ The Range API requires JDK 7 or later.
◈ The Range API covers more Unicode ranges than the bit-mask API.
◈ The bit-mask API is a bit faster than the Range API.

Rendering Digits According to Language Context

The ArabicDigits example was designed to use the shaper for a specific language, but sometimes the digits must be rendered according to the language context. For example, if the text that precedes the digits uses the Thai script, Thai digits are preferred. If the text is displayed in Tibetan, Tibetan digits are preferred.

You can accomplish this using one of the getContextualShaper methods:

◈ getContextualShaper(int ranges)
◈ getContextualShaper(int ranges, int defaultContext)
◈ getContextualShaper(Set<NumericShaper.Range> ranges)
◈ getContextualShaper(Set<NumericShaper.Range> ranges, NumericShaper.Range defaultContext)

The first two methods use the bit-mask constants, and the last two use the enum constants. The methods that accept a defaultContext parameter enable you to specify the initial shaper that is used when numeric values are displayed before text. When no default context is defined, any leading digits are displayed using Latin shapes.
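The effect of the default context can be seen with the shape method, which works on a char array and needs no rendering. This is a minimal sketch; the class name ContextualDigits and the sample text (a digit, an Arabic letter, and another digit) are assumptions made for illustration.

```java
import java.awt.font.NumericShaper;

public class ContextualDigits {

    // Applies the given shaper to a copy of the text.
    static String shape(String text, NumericShaper shaper) {
        char[] chars = text.toCharArray();
        shaper.shape(chars, 0, chars.length);
        return new String(chars);
    }

    // Contextual Arabic shaper with no default context: digits that
    // appear before any Arabic text keep their Latin shapes.
    static String withoutDefaultContext(String text) {
        return shape(text,
            NumericShaper.getContextualShaper(NumericShaper.ARABIC));
    }

    // Contextual Arabic shaper whose default context is Arabic:
    // leading digits are shaped as Arabic digits as well.
    static String withArabicDefaultContext(String text) {
        return shape(text, NumericShaper.getContextualShaper(
            NumericShaper.ARABIC, NumericShaper.ARABIC));
    }

    public static void main(String[] args) {
        // Hypothetical sample: "1", the Arabic letter BEH (U+0628), "2".
        String text = "1 \u0628 2";
        System.out.println(withoutDefaultContext(text));
        System.out.println(withArabicDefaultContext(text));
    }
}
```

In both cases the digit that follows the Arabic letter is shaped; only the treatment of the leading digit differs.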

The ShapedDigits example shows how shapers work. Five text layouts are displayed:
  1. The first layout uses no shaper; all digits are displayed as Latin.
  2. The second layout shapes all digits as Arabic digits, regardless of language context.
  3. The third layout employs a contextual shaper that uses Arabic digits. The default context is defined to be Arabic.
  4. The fourth layout employs a contextual shaper that uses Arabic digits, but the shaper does not specify a default context.
  5. The fifth layout employs a contextual shaper that uses the ALL_RANGES bit mask, but the shaper does not specify a default context.

The following lines of code show how the shapers, if used, are defined:
  1. No shaper is used.
  2. NumericShaper arabic = NumericShaper.getShaper(NumericShaper.ARABIC);
  3. NumericShaper contextualArabic = NumericShaper.getContextualShaper(NumericShaper.ARABIC, NumericShaper.ARABIC);
  4. NumericShaper contextualArabicASCII = NumericShaper.getContextualShaper(NumericShaper.ARABIC);
  5. NumericShaper contextualAll = NumericShaper.getContextualShaper(NumericShaper.ALL_RANGES);

05. Converting Non-Unicode Text

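In the Java programming language, char values are 16-bit UTF-16 code units that represent Unicode characters; a character outside the Basic Multilingual Plane is represented by a pair of char values (a surrogate pair). Unicode is a character encoding standard that supports the world's major languages. You can learn more about the Unicode standard at the Unicode Consortium Web site.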

Few text editors currently support Unicode text entry. The text editor we used to write this section's code examples supports only ASCII characters, which are limited to 7 bits. To indicate Unicode characters that cannot be represented in ASCII, such as ö, we used the \uXXXX escape sequence. Each X in the escape sequence is a hexadecimal digit. The following example shows how to indicate the ö character with an escape sequence:

String str = "\u00F6";
char c = '\u00F6';
Character letter = Character.valueOf('\u00F6');

A variety of character encodings are used by systems around the world. Currently few of these encodings conform to Unicode. Because your program expects characters in Unicode, the text data it gets from the system must be converted into Unicode, and vice versa. Data in text files is automatically converted to Unicode when its encoding matches the default file encoding of the Java Virtual Machine. You can identify the default file encoding by creating an OutputStreamWriter using it and asking for its canonical name:

OutputStreamWriter out = new OutputStreamWriter(new ByteArrayOutputStream());
System.out.println(out.getEncoding());

If the default file encoding differs from the encoding of the text data you want to process, then you must perform the conversion yourself. You might need to do this when processing text from another country or computing platform.

This section discusses the APIs you use to translate non-Unicode text into Unicode. Before using these APIs, you should verify that the character encoding you wish to convert into Unicode is supported. The list of supported character encodings is not part of the Java programming language specification. Therefore the character encodings supported by the APIs may vary with platform.
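One way to perform this check is with the java.nio.charset.Charset class, which reports both the default encoding and whether a named encoding is available on the current platform. The sketch below assumes that approach; the class name EncodingSupport and the encoding names passed to it are only examples.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

public class EncodingSupport {

    // Returns true if the named encoding can be used on this platform.
    // A syntactically illegal name is reported as unsupported.
    static boolean isSupported(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("default: " + Charset.defaultCharset());
        System.out.println("UTF-8 supported: " + isSupported("UTF-8"));
        // Hypothetical name used to show the negative case.
        System.out.println("no-such-encoding supported: "
            + isSupported("no-such-encoding"));
    }
}
```

Charset.availableCharsets() returns the full map of encodings that the running platform supports, which is useful when the list has to be shown to a user.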

The material that follows describes two techniques for converting non-Unicode text to Unicode. You can convert non-Unicode byte arrays into String objects, and vice versa. Or you can translate between streams of Unicode characters and byte streams of non-Unicode text.

5.1 Byte Encodings and Strings

If a byte array contains non-Unicode text, you can convert the text to Unicode with one of the String constructor methods. Conversely, you can convert a String object into a byte array of non-Unicode characters with the String.getBytes method. When invoking either of these methods, you specify the encoding identifier as one of the parameters.

The example that follows converts characters between UTF-8 and Unicode. UTF-8 is a transmission format for Unicode that is safe for UNIX file systems. The example is taken from the StringConverter program.

The StringConverter program starts by creating a String containing Unicode characters:

String original = "A" + "\u00ea" + "\u00f1" + "\u00fc" + "C";

When printed, the String named original appears as:

AêñüC

To convert the String object to UTF-8, invoke the getBytes method and specify the appropriate encoding identifier as a parameter. The getBytes method returns an array of bytes in UTF-8 format. To create a String object from an array of non-Unicode bytes, invoke the String constructor with the encoding parameter. The code that makes these calls is enclosed in a try block, in case the specified encoding is unsupported:

try {
    byte[] utf8Bytes = original.getBytes("UTF8");
    byte[] defaultBytes = original.getBytes();

    String roundTrip = new String(utf8Bytes, "UTF8");
    System.out.println("roundTrip = " + roundTrip);
    System.out.println();
    printBytes(utf8Bytes, "utf8Bytes");
    System.out.println();
    printBytes(defaultBytes, "defaultBytes");
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}

The StringConverter program prints out the values in the utf8Bytes and defaultBytes arrays to demonstrate an important point: The length of the converted text might not be the same as the length of the source text. Some Unicode characters translate into single bytes, others into pairs or triplets of bytes.

The printBytes method displays the byte arrays by invoking the byteToHex method, which is defined in the UnicodeFormatter class. Here is the printBytes method:

public static void printBytes(byte[] array, String name) {
    for (int k = 0; k < array.length; k++) {
        System.out.println(name + "[" + k + "] = " + "0x" +
            UnicodeFormatter.byteToHex(array[k]));
    }
}

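The output of the printBytes method follows. Note that only the first and last bytes, the A and C characters, are the same in both arrays. The defaultBytes values shown come from a platform whose default encoding is a single-byte, Latin-1 style encoding; on a platform whose default encoding is UTF-8, the two arrays are identical: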

utf8Bytes[0] = 0x41
utf8Bytes[1] = 0xc3
utf8Bytes[2] = 0xaa
utf8Bytes[3] = 0xc3
utf8Bytes[4] = 0xb1
utf8Bytes[5] = 0xc3
utf8Bytes[6] = 0xbc
utf8Bytes[7] = 0x43
defaultBytes[0] = 0x41
defaultBytes[1] = 0xea
defaultBytes[2] = 0xf1
defaultBytes[3] = 0xfc
defaultBytes[4] = 0x43
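The output above depends on the default encoding of the platform. The following sketch reproduces the same byte values on any platform by naming both encodings explicitly with the java.nio.charset.StandardCharsets constants, which also removes the need to catch UnsupportedEncodingException. The class name ByteEncodings and the toHex helper are hypothetical and are not part of StringConverter.

```java
import java.nio.charset.StandardCharsets;

public class ByteEncodings {

    // Formats the bytes as two-digit lowercase hex values
    // separated by single spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "A\u00ea\u00f1\u00fcC";

        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);
        byte[] latin1Bytes = original.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8:      " + toHex(utf8Bytes));
        System.out.println("ISO-8859-1: " + toHex(latin1Bytes));

        // Decoding with the same encoding restores the original text.
        String roundTrip = new String(utf8Bytes, StandardCharsets.UTF_8);
        System.out.println("round trip equal: " + roundTrip.equals(original));
    }
}
```

Each of the three accented characters occupies two bytes in UTF-8 and one byte in ISO-8859-1, which accounts for the difference in array length (8 bytes versus 5).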

5.2 Character and Byte Streams

The java.io package provides classes that allow you to convert between Unicode character streams and byte streams of non-Unicode text. With the InputStreamReader class, you can convert byte streams to character streams. You use the OutputStreamWriter class to translate character streams into byte streams. The following figure illustrates the conversion process:

[Figure: InputStreamReader converts a byte stream into a character stream; OutputStreamWriter converts a character stream into a byte stream]

When you create InputStreamReader and OutputStreamWriter objects, you specify the byte encoding that you want to convert. For example, to translate a text file in the UTF-8 encoding into Unicode, you create an InputStreamReader as follows:

FileInputStream fis = new FileInputStream("test.txt");
InputStreamReader isr = new InputStreamReader(fis, "UTF8");

If you omit the encoding identifier, InputStreamReader and OutputStreamWriter rely on the default encoding. You can determine which encoding an InputStreamReader or OutputStreamWriter uses by invoking the getEncoding method, as follows:

InputStreamReader defaultReader = new InputStreamReader(fis);
String defaultEncoding = defaultReader.getEncoding();

The example that follows shows you how to perform character-set conversions with the InputStreamReader and OutputStreamWriter classes. The example is taken from the StreamConverter program, which displays Japanese characters. Before trying it out, verify that the appropriate fonts have been installed on your system. If you are using JDK software that is compatible with version 1.1, make a copy of the font properties file and then replace it with the Japanese version of that file.

The StreamConverter program converts a sequence of Unicode characters from a String object into a FileOutputStream of bytes encoded in UTF-8. The method that performs the conversion is called writeOutput:

static void writeOutput(String str) {
    try {
        FileOutputStream fos = new FileOutputStream("test.txt");
        Writer out = new OutputStreamWriter(fos, "UTF8");
        out.write(str);
        out.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

The readInput method reads the bytes encoded in UTF-8 from the file created by the writeOutput method. An InputStreamReader object converts the bytes from UTF-8 into Unicode and returns the result in a String. The readInput method is as follows:

static String readInput() {
    StringBuffer buffer = new StringBuffer();
    try {
        FileInputStream fis = new FileInputStream("test.txt");
        InputStreamReader isr = new InputStreamReader(fis, "UTF8");
        Reader in = new BufferedReader(isr);
        int ch;
        // read returns -1 at the end of the stream.
        while ((ch = in.read()) > -1) {
            buffer.append((char)ch);
        }
        in.close();
        return buffer.toString();
    } catch (IOException e) {
        e.printStackTrace();
        return null;
    }
}

The main method of the StreamConverter program invokes the writeOutput method to create a file of bytes encoded in UTF-8. The readInput method reads the same file, converting the bytes back into Unicode. Here is the source code for the main method:

public static void main(String[] args) {
    String jaString = "\u65e5\u672c\u8a9e\u6587\u5b57\u5217";
    writeOutput(jaString);
    String inputString = readInput();
    String displayString = jaString + " " + inputString;
    new ShowString(displayString, "Conversion Demo");
}

The original string (jaString) should be identical to the newly created string (inputString). To show that the two strings are the same, the program concatenates them and displays them with a ShowString object. The ShowString class, whose source code accompanies the example, displays a string with the Graphics.drawString method. When the StreamConverter program instantiates ShowString, the following window appears. The repetition of the characters displayed verifies that the two strings are identical:

[Screenshot: the Conversion Demo window displaying the Japanese string twice]
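The round trip can also be verified without a window. The following sketch is a variation on StreamConverter under a few assumptions: it writes to a temporary file instead of test.txt, uses try-with-resources so that the streams are always closed, and names the encoding with StandardCharsets.UTF_8. The class name StreamRoundTrip is hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class StreamRoundTrip {

    // Encodes the characters as UTF-8 bytes while writing the file.
    static void write(Path file, String text) throws IOException {
        try (Writer out = new OutputStreamWriter(
                Files.newOutputStream(file), StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }

    // Decodes the UTF-8 bytes back into characters while reading.
    static String read(Path file) throws IOException {
        StringBuilder buffer = new StringBuilder();
        try (Reader in = new BufferedReader(new InputStreamReader(
                Files.newInputStream(file), StandardCharsets.UTF_8))) {
            int ch;
            while ((ch = in.read()) != -1) {
                buffer.append((char) ch);
            }
        }
        return buffer.toString();
    }

    // Writes the text to a temporary file and reads it back.
    static String roundTrip(String text) {
        try {
            Path file = Files.createTempFile("stream-demo", ".txt");
            try {
                write(file, text);
                return read(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String jaString = "\u65e5\u672c\u8a9e\u6587\u5b57\u5217";
        System.out.println(jaString.equals(roundTrip(jaString)));
    }
}
```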

06. Normalizer's API

Normalization is the process of transforming text so that strings that should be treated as equivalent have the same representation. For example, before searching or sorting text you need to normalize it, so that different code point sequences that represent the same text are treated as the same text.

What can be normalized? Normalization applies when you need to convert characters with diacritical marks, decompose ligatures, or convert half-width katakana characters to full-width characters, and so on.

In accordance with Unicode Standard Annex #15, the Normalizer API supports all four Unicode text normalization forms, which are defined in the java.text.Normalizer.Form enum:

◈ Normalization Form D (NFD): Canonical Decomposition
◈ Normalization Form C (NFC): Canonical Decomposition, followed by Canonical Composition
◈ Normalization Form KD (NFKD): Compatibility Decomposition
◈ Normalization Form KC (NFKC): Compatibility Decomposition, followed by Canonical Composition

Let's examine how the Latin small letter "o" with diaeresis can be normalized by using these normalization forms:

Original word   NFC       NFD             NFKC      NFKD
"schön"         "schön"   "scho\u0308n"   "schön"   "scho\u0308n"

Notice that the original word is left unchanged in NFC and NFKC. With NFD and NFKD, composite characters are mapped to their canonical decompositions, so ö becomes o followed by the combining diaeresis (\u0308). With NFC and NFKC, combining character sequences are mapped to composites where possible; because ö is already a single composite character, the word is left as it is.

Another normalization feature concerns compatibility equivalence. The half-width and full-width katakana characters have the same compatibility decomposition and are thus compatibility equivalents. However, they are not canonical equivalents.

To find out whether text needs to be normalized, use the isNormalized method, which determines whether the given sequence of char values is already in the specified normalization form. If this method returns false, call the normalize method, which normalizes a sequence of char values according to the specified normalization form. For example, to transform text into the canonical decomposed form, use the following call:

normalized_string = Normalizer.normalize(target_chars, Normalizer.Form.NFD);

Also, the normalize method rearranges accents into the proper canonical order, so that you do not have to worry about accent rearrangement on your own.
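The behavior described in this section can be checked with a few calls. The following is a minimal sketch; the class name NormalizeDemo and its helper methods are hypothetical, and stripping combining marks after decomposition is shown only as one common use of NFD (for example, accent-insensitive matching), not as a feature of the Normalizer API itself.

```java
import java.text.Normalizer;

public class NormalizeDemo {

    // Normalizes only when the text is not already in the given form.
    static String normalizeIfNeeded(String text, Normalizer.Form form) {
        return Normalizer.isNormalized(text, form)
            ? text
            : Normalizer.normalize(text, form);
    }

    // Decomposes the text (NFD) and then removes the combining marks,
    // leaving only the base characters.
    static String stripMarks(String text) {
        String decomposed = normalizeIfNeeded(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        String word = "sch\u00f6n";            // precomposed o with diaeresis
        String nfd = normalizeIfNeeded(word, Normalizer.Form.NFD);
        System.out.println(nfd.length());      // one char longer than word
        System.out.println(stripMarks(word));

        // Half-width katakana KA (U+FF76) maps to full-width KA (U+30AB)
        // under compatibility normalization, but not under NFC.
        String halfWidth = "\uff76";
        System.out.println(
            normalizeIfNeeded(halfWidth, Normalizer.Form.NFKC));
    }
}
```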

07. Working with Bidirectional Text with JTextComponent Class

This section discusses how to work with bidirectional text with the JTextComponent class. Bidirectional text contains runs of text in two directions, left-to-right and right-to-left. An example of bidirectional text is Arabic text (which runs right-to-left) that contains numbers (which run left-to-right). Bidirectional text is more difficult to display and manage; however, the JTextComponent class handles these issues for you.

The following topics are covered:

◈ Determining Directionality of Bidirectional Text
◈ Displaying and Moving Carets
◈ Hit Testing
◈ Highlighting Selections
◈ Setting Component Orientation

Determining Directionality of Bidirectional Text

The sample displays bidirectional text in a JTextPane object. In most cases, the Java platform can determine the directionality of bidirectional Unicode text:

[Figure: bidirectional text displayed in a JTextPane object]
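Outside of a text component, the directionality that the platform determines for a string can be inspected with the java.text.Bidi class. The sketch below is offered only as an illustration of that analysis, on the assumption that it is helpful to see it in isolation; it is not part of the JTextComponent sample, and the class name Directionality and the sample strings are hypothetical.

```java
import java.text.Bidi;

public class Directionality {

    // Classifies the text as left-to-right, right-to-left, or mixed.
    // The base direction is taken from the first strongly directional
    // character and defaults to left-to-right when there is none.
    static String describe(String text) {
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        if (bidi.isMixed()) {
            return "mixed";
        }
        return bidi.isLeftToRight() ? "left-to-right" : "right-to-left";
    }

    public static void main(String[] args) {
        System.out.println(describe("hello"));
        // Three Arabic letters: ALEF, LAM, BEH.
        System.out.println(describe("\u0627\u0644\u0628"));
        // Arabic letters followed by a number.
        System.out.println(describe("\u0627\u0644\u0628 123"));
    }
}
```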

Explicitly Specifying Text Run Direction in JTextComponent Objects

You can specify the run direction of the Document object of a JTextComponent object. For example, the following statement specifies that the text in the JTextPane object textPane runs right-to-left:

textPane.getDocument().putProperty(
    TextAttribute.RUN_DIRECTION, TextAttribute.RUN_DIRECTION_RTL);

Alternatively, you can specify the component orientation of a particular Swing component based on locale. For example, the following statements specify that the component orientation of the object textPane is based on the ar-SA locale:

Locale arabicSaudiArabia =
    new Locale.Builder().setLanguage("ar").setRegion("SA").build();

textPane.setComponentOrientation(
    ComponentOrientation.getOrientation(arabicSaudiArabia));

Because the run direction of the Arabic language is right-to-left, the run direction of the text contained in the textPane object is right-to-left also.
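The mapping from a locale to an orientation can be checked on its own, without creating any Swing components. The following is a small sketch; the class name OrientationByLocale is hypothetical, and Locale.forLanguageTag is used here simply as a shorter way to obtain the ar-SA locale.

```java
import java.awt.ComponentOrientation;
import java.util.Locale;

public class OrientationByLocale {

    // Returns true if components for this locale should be laid out
    // right-to-left.
    static boolean isRightToLeft(Locale locale) {
        return !ComponentOrientation.getOrientation(locale).isLeftToRight();
    }

    public static void main(String[] args) {
        Locale arabicSaudiArabia = Locale.forLanguageTag("ar-SA");
        System.out.println("ar-SA right-to-left: "
            + isRightToLeft(arabicSaudiArabia));
        System.out.println("en-US right-to-left: "
            + isRightToLeft(Locale.US));
    }
}
```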

Displaying and Moving Carets

In editable text, a caret is used to graphically represent the current insertion point, the position in the text where new characters will be inserted. In the sample, the caret contains a small triangle that points toward the direction where an inserted character will be displayed.

By default, a JTextComponent object creates a keymap (of type Keymap) that is shared by all JTextComponent instances as the default keymap. A keymap lets an application bind key strokes to actions. A default keymap (for JTextComponent objects that support caret movement) binds the left and right arrow keys to backward and forward caret movement, which supports caret movement through bidirectional text.

Hit Testing

Often, a location in device space must be converted to a text offset. For example, when a user clicks the mouse on selectable text, the location of the mouse is converted to a text offset and used as one end of the selection range. Logically, this is the inverse of positioning a caret.

You can attach a caret listener to an instance of a JTextComponent. A caret listener enables you to handle caret events, which occur when the caret moves or when the selection in a text component changes. You attach a caret listener with the addCaretListener method.

Highlighting Selections

A selected range of characters is represented graphically by a highlight region, an area in which glyphs are displayed with inverse video or against a different background color.

JTextComponent objects implement logical highlighting. This means that the selected characters are always contiguous in the text model, and the highlight region is allowed to be discontiguous. The following is an example of logical highlighting:

[Figure: logical highlighting of a selection in bidirectional text]

Setting Component Orientation

Swing's layout managers understand how locale affects a UI; it is not necessary to create a new layout for each locale. For example, in a locale where text flows right to left, the layout manager will arrange components in the same orientation.

The sample has been localized for English, United States; English, United Kingdom; French, France; French, Canada; and Arabic, Saudi Arabia.

The following uses the en-US locale:

[Screenshot: the sample in the en-US locale, with components laid out left-to-right]

The following uses the ar-SA locale:

[Screenshot: the sample in the ar-SA locale, with components laid out right-to-left]

Note that the components have been laid out in the same direction as the corresponding locale: left-to-right for en-US and right-to-left for ar-SA. The sample calls the methods applyComponentOrientation and getOrientation to specify the direction of its components by locale:

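private static JFrame frame;

// ...

private static void createAndShowGUI(Locale currentLocale) {

    // Create and set up the window.
    // ...
    // Add contents to the window.
    // ...
    // Lay out the components in the direction of the locale.
    frame.applyComponentOrientation(
        ComponentOrientation.getOrientation(currentLocale));
    // ...
}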

The sample requires the following resource files:

◈ resources/
◈ resources/
◈ resources/
