/*
 * Copyright (C) 2013 The Android Open Source Project
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package android.text;

import android.view.View;

import static android.text.TextDirectionHeuristics.FIRSTSTRONG_LTR;

import java.util.Locale;

/**
 * Utility class for formatting text for display in a potentially opposite-directionality context
 * without garbling. The directionality of the context is set at formatter creation and the
 * directionality of the text can be either estimated or passed in when known.
 *
 * <p>To support versions lower than {@link android.os.Build.VERSION_CODES#JELLY_BEAN_MR2},
 * you can use the support library's {@link android.support.v4.text.BidiFormatter} class.
 *
 * <p>These APIs provide the following functionality:
 * <p>
 * 1. Bidi Wrapping
 * When text in one language is mixed into a document in another, opposite-directionality language,
 * e.g. when an English business name is embedded in some Hebrew text, both the inserted string
 * and the text surrounding it may be displayed incorrectly unless the inserted string is explicitly
 * separated from the surrounding text in a "wrapper" that:
 * <p>
 * - Declares its directionality so that the string is displayed correctly. This can be done with
 *   Unicode bidi formatting codes by {@link #unicodeWrap} and similar methods.
 * <p>
 * - Isolates the string's directionality, so it does not unduly affect the surrounding content.
 *   Currently, this can only be done using invisible Unicode characters of the same direction as
 *   the context (LRM or RLM) in addition to the directionality declaration above, thus "resetting"
 *   the directionality to that of the context. The "reset" may need to be done at both ends of the
 *   string. Without "reset" after the string, the string will "stick" to a number or logically
 *   separate opposite-direction text that happens to follow it in-line (even if separated by
 *   neutral content like spaces and punctuation). Without "reset" before the string, the same can
 *   happen there, but only with more opposite-direction text, not a number. One approach is to
 *   "reset" the direction only after each string, on the theory that if the preceding
 *   opposite-direction text is itself bidi-wrapped, the "reset" after it will prevent the
 *   sticking. (Doing the "reset" only before each string definitely does not work because we do
 *   not want to require bidi-wrapping numbers, and a bidi-wrapped opposite-direction string could
 *   be followed by a number.) Still, the safest policy is to do the "reset" on both ends of each
 *   string, since RTL message translations often contain untranslated Latin-script brand names and
 *   technical terms, and one of these can be followed by a bidi-wrapped inserted value. On the
 *   other hand, when one has such a message, it is best to do the "reset" manually in the message
 *   translation itself, since the message's opposite-direction text could be followed by an
 *   inserted number, which we would not bidi-wrap anyway. Thus, "reset" only after the string is
 *   often sufficient; note, however, that a formatter built with default options also does the
 *   "reset" before the string (see {@link Builder#stereoReset(boolean)}). As an alternative to
 *   "reset", recent additions to the HTML, CSS, and Unicode standards allow the isolation to be
 *   part of the directionality declaration. This form of isolation is better than "reset" because
 *   it takes less space, does not require knowing the context directionality, has a gentler effect
 *   than "reset", and protects both ends of the string. However, we do not yet allow using it
 *   because required platforms do not yet support it.
 * <p>
 * Providing these wrapping services is the basic purpose of the bidi formatter.
 * <p>
 * 2. Directionality estimation
 * How does one know whether a string about to be inserted into surrounding text has the same
 * directionality? In many cases, one knows that this must be the case when writing the code
 * doing the insertion, e.g. when a localized message is inserted into a localized page. In such
 * cases there is no need to involve the bidi formatter at all. In some other cases, it need not be
 * the same as the context, but is either constant (e.g. URLs are always LTR) or otherwise known.
 * In the remaining cases, e.g. when the string is user-entered or comes from a database, the
 * language of the string (and thus its directionality) is not known a priori, and must be
 * estimated at run-time. The bidi formatter can do this automatically using the default
 * first-strong estimation algorithm. It can also be configured to use a custom directionality
 * estimation object.
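 *
 * <p>A minimal usage sketch (illustrative only; {@code userName} is a hypothetical variable
 * holding text of unknown directionality):
 * <pre>{@code
 * BidiFormatter formatter = BidiFormatter.getInstance();
 * // Estimates the direction of userName with the default heuristic and wraps it if needed.
 * String message = "Signed in as " + formatter.unicodeWrap(userName);
 * }</pre>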
 */
public final class BidiFormatter {

    /**
     * The default text direction heuristic.
     */
    private static final TextDirectionHeuristic DEFAULT_TEXT_DIRECTION_HEURISTIC = FIRSTSTRONG_LTR;

    /**
     * Unicode "Left-To-Right Embedding" (LRE) character.
     */
    private static final char LRE = '\u202A';

    /**
     * Unicode "Right-To-Left Embedding" (RLE) character.
     */
    private static final char RLE = '\u202B';

    /**
     * Unicode "Pop Directional Formatting" (PDF) character.
     */
    private static final char PDF = '\u202C';

    /**
     * Unicode "Left-To-Right Mark" (LRM) character.
     */
    private static final char LRM = '\u200E';

    /**
     * Unicode "Right-To-Left Mark" (RLM) character.
     */
    private static final char RLM = '\u200F';

    /**
     * String representation of LRM.
     */
    private static final String LRM_STRING = Character.toString(LRM);

    /**
     * String representation of RLM.
     */
    private static final String RLM_STRING = Character.toString(RLM);

    /**
     * Empty string constant.
     */
    private static final String EMPTY_STRING = "";

    /**
     * A class for building a BidiFormatter with non-default options.
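     *
     * <p>A usage sketch, assuming an RTL context in which only the trailing "reset" is wanted
     * (illustrative only; the choice of heuristic is arbitrary):
     * <pre>{@code
     * BidiFormatter formatter = new BidiFormatter.Builder(true)   // RTL context
     *         .stereoReset(false)                                  // "reset" after the string only
     *         .setTextDirectionHeuristic(TextDirectionHeuristics.ANYRTL_LTR)
     *         .build();
     * }</pre>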
     */
    public static final class Builder {
        private boolean mIsRtlContext;
        private int mFlags;
        private TextDirectionHeuristic mTextDirectionHeuristic;

        /**
         * Constructs a builder whose context directionality is that of the default locale.
         */
        public Builder() {
            initialize(isRtlLocale(Locale.getDefault()));
        }

        /**
         * Constructs a builder with the given context directionality.
         *
         * @param rtlContext Whether the context directionality is RTL.
         */
        public Builder(boolean rtlContext) {
            initialize(rtlContext);
        }

        /**
         * Constructs a builder whose context directionality is that of the given locale.
         *
         * @param locale The context locale.
         */
        public Builder(Locale locale) {
            initialize(isRtlLocale(locale));
        }

        /**
         * Initializes the builder with the given context directionality and default options.
         *
         * @param isRtlContext Whether the context is RTL or not.
         */
        private void initialize(boolean isRtlContext) {
            mIsRtlContext = isRtlContext;
            mTextDirectionHeuristic = DEFAULT_TEXT_DIRECTION_HEURISTIC;
            mFlags = DEFAULT_FLAGS;
        }

        /**
         * Specifies whether the BidiFormatter to be built should also "reset" directionality before
         * a string being bidi-wrapped, not just after it. The default is true.
         *
         * @param stereoReset Whether to also "reset" directionality before the string.
         * @return the builder itself.
         */
        public Builder stereoReset(boolean stereoReset) {
            if (stereoReset) {
                mFlags |= FLAG_STEREO_RESET;
            } else {
                mFlags &= ~FLAG_STEREO_RESET;
            }
            return this;
        }

        /**
         * Specifies the default directionality estimation algorithm to be used by the BidiFormatter.
         * By default, uses the first-strong heuristic.
         *
         * @param heuristic the {@code TextDirectionHeuristic} to use.
         * @return the builder itself.
         */
        public Builder setTextDirectionHeuristic(TextDirectionHeuristic heuristic) {
            mTextDirectionHeuristic = heuristic;
            return this;
        }

        private static BidiFormatter getDefaultInstanceFromContext(boolean isRtlContext) {
            return isRtlContext ? DEFAULT_RTL_INSTANCE : DEFAULT_LTR_INSTANCE;
        }

        /**
         * @return A BidiFormatter with the specified options.
         */
        public BidiFormatter build() {
            if (mFlags == DEFAULT_FLAGS &&
                    mTextDirectionHeuristic == DEFAULT_TEXT_DIRECTION_HEURISTIC) {
                return getDefaultInstanceFromContext(mIsRtlContext);
            }
            return new BidiFormatter(mIsRtlContext, mFlags, mTextDirectionHeuristic);
        }
    }

    // Option flags stored in mFlags.
    private static final int FLAG_STEREO_RESET = 2;
    private static final int DEFAULT_FLAGS = FLAG_STEREO_RESET;

    private static final BidiFormatter DEFAULT_LTR_INSTANCE = new BidiFormatter(
            false /* LTR context */,
            DEFAULT_FLAGS,
            DEFAULT_TEXT_DIRECTION_HEURISTIC);

    private static final BidiFormatter DEFAULT_RTL_INSTANCE = new BidiFormatter(
            true /* RTL context */,
            DEFAULT_FLAGS,
            DEFAULT_TEXT_DIRECTION_HEURISTIC);

    private final boolean mIsRtlContext;
    private final int mFlags;
    private final TextDirectionHeuristic mDefaultTextDirectionHeuristic;

    /**
     * Factory for creating an instance of BidiFormatter for the default locale directionality.
     */
    public static BidiFormatter getInstance() {
        return new Builder().build();
    }

    /**
     * Factory for creating an instance of BidiFormatter given the context directionality.
     *
     * @param rtlContext Whether the context directionality is RTL.
     */
    public static BidiFormatter getInstance(boolean rtlContext) {
        return new Builder(rtlContext).build();
    }

    /**
     * Factory for creating an instance of BidiFormatter given the context locale.
     *
     * @param locale The context locale.
     */
    public static BidiFormatter getInstance(Locale locale) {
        return new Builder(locale).build();
    }

    /**
     * @param isRtlContext Whether the context directionality is RTL or not.
     * @param flags The option flags.
     * @param heuristic The default text direction heuristic.
     */
    private BidiFormatter(boolean isRtlContext, int flags, TextDirectionHeuristic heuristic) {
        mIsRtlContext = isRtlContext;
        mFlags = flags;
        mDefaultTextDirectionHeuristic = heuristic;
    }

    /**
     * @return Whether the context directionality is RTL
     */
    public boolean isRtlContext() {
        return mIsRtlContext;
    }

    /**
     * @return Whether directionality "reset" should also be done before a string being
     * bidi-wrapped, not just after it.
     */
    public boolean getStereoReset() {
        return (mFlags & FLAG_STEREO_RESET) != 0;
    }

    /**
     * Returns a Unicode bidi mark matching the context directionality (LRM or RLM) if either the
     * overall or the exit directionality of a given string is opposite to the context directionality.
     * Putting this after the string (including its directionality declaration wrapping) prevents it
     * from "sticking" to other opposite-directionality text or a number appearing after it inline
     * with only neutral content in between. Otherwise returns the empty string. While the exit
     * directionality is determined by scanning the end of the string, the overall directionality is
     * given explicitly by a heuristic to estimate the {@code str}'s directionality.
     *
     * @param str String after which the mark may need to appear.
     * @param heuristic The text direction heuristic that will be used to estimate the {@code str}'s
     *                  directionality.
     * @return LRM for RTL text in LTR context; RLM for LTR text in RTL context;
     *     else, the empty string.
     *
     * @hide
     */
    public String markAfter(String str, TextDirectionHeuristic heuristic) {
        final boolean isRtl = heuristic.isRtl(str, 0, str.length());
        // getExitDir() is called only if needed (short-circuit).
        if (!mIsRtlContext && (isRtl || getExitDir(str) == DIR_RTL)) {
            return LRM_STRING;
        }
        if (mIsRtlContext && (!isRtl || getExitDir(str) == DIR_LTR)) {
            return RLM_STRING;
        }
        return EMPTY_STRING;
    }

    /**
     * Returns a Unicode bidi mark matching the context directionality (LRM or RLM) if either the
     * overall or the entry directionality of a given string is opposite to the context
     * directionality. Putting this before the string (including its directionality declaration
     * wrapping) prevents it from "sticking" to other opposite-directionality text appearing before
     * it inline with only neutral content in between. Otherwise returns the empty string. While the
     * entry directionality is determined by scanning the beginning of the string, the overall
     * directionality is given explicitly by a heuristic to estimate the {@code str}'s directionality.
     *
     * @param str String before which the mark may need to appear.
     * @param heuristic The text direction heuristic that will be used to estimate the {@code str}'s
     *                  directionality.
     * @return LRM for RTL text in LTR context; RLM for LTR text in RTL context;
     *     else, the empty string.
     *
     * @hide
     */
    public String markBefore(String str, TextDirectionHeuristic heuristic) {
        final boolean isRtl = heuristic.isRtl(str, 0, str.length());
        // getEntryDir() is called only if needed (short-circuit).
        if (!mIsRtlContext && (isRtl || getEntryDir(str) == DIR_RTL)) {
            return LRM_STRING;
        }
        if (mIsRtlContext && (!isRtl || getEntryDir(str) == DIR_LTR)) {
            return RLM_STRING;
        }
        return EMPTY_STRING;
    }

    /**
     * Estimates the directionality of a string using the default text direction heuristic.
     *
     * @param str String whose directionality is to be estimated.
     * @return true if {@code str}'s estimated overall directionality is RTL. Otherwise returns
     *          false.
     */
    public boolean isRtl(String str) {
        return mDefaultTextDirectionHeuristic.isRtl(str, 0, str.length());
    }

    /**
     * Formats a string of given directionality for use in plain-text output of the context
     * directionality, so an opposite-directionality string is neither garbled nor garbles its
     * surroundings. This makes use of Unicode bidi formatting characters.
     * <p>
     * The algorithm: In case the given directionality doesn't match the context directionality, wraps
     * the string with Unicode bidi formatting characters: RLE+{@code str}+PDF for RTL text, or
     * LRE+{@code str}+PDF for LTR text.
     * <p>
     * If {@code isolate}, directionally isolates the string so that it does not garble its
     * surroundings. Currently, this is done by "resetting" the directionality after the string by
     * appending a trailing Unicode bidi mark matching the context directionality (LRM or RLM) when
     * either the overall directionality or the exit directionality of the string is opposite to
     * that of the context. Unless the formatter was built using
     * {@link Builder#stereoReset(boolean)} with a {@code false} argument, also prepends a Unicode
     * bidi mark matching the context directionality when either the overall directionality or the
     * entry directionality of the string is opposite to that of the context. Note that as opposed
     * to the overall directionality, the entry and exit directionalities are determined from the
     * string itself.
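     * <p>
     * As an illustration of the rules above (a sketch only; {@code rtlText} is a hypothetical
     * variable assumed to hold text that the default heuristic estimates as RTL):
     * <pre>{@code
     * BidiFormatter formatter = BidiFormatter.getInstance(false);  // LTR context, default options
     * String wrapped = formatter.unicodeWrap(rtlText);
     * // wrapped is LRM + RLE + rtlText + PDF + LRM
     * String bare = formatter.unicodeWrap(rtlText, false);
     * // bare is RLE + rtlText + PDF (no isolation marks)
     * }</pre>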
     * <p>
     * Does *not* do HTML-escaping.
     *
     * @param str The input string.
     * @param heuristic The algorithm to be used to estimate the string's overall direction.
     *        See {@link TextDirectionHeuristics} for pre-defined heuristics.
     * @param isolate Whether to directionally isolate the string to prevent it from garbling the
     *     content around it.
     * @return Input string after applying the above processing. {@code null} if {@code str} is
     *     {@code null}.
     */
    public String unicodeWrap(String str, TextDirectionHeuristic heuristic, boolean isolate) {
        if (str == null) return null;
        final boolean isRtl = heuristic.isRtl(str, 0, str.length());
        StringBuilder result = new StringBuilder();
        if (getStereoReset() && isolate) {
            result.append(markBefore(str,
                    isRtl ? TextDirectionHeuristics.RTL : TextDirectionHeuristics.LTR));
        }
        if (isRtl != mIsRtlContext) {
            result.append(isRtl ? RLE : LRE);
            result.append(str);
            result.append(PDF);
        } else {
            result.append(str);
        }
        if (isolate) {
            result.append(markAfter(str,
                    isRtl ? TextDirectionHeuristics.RTL : TextDirectionHeuristics.LTR));
        }
        return result.toString();
    }

    /**
     * Operates like {@link #unicodeWrap(String, TextDirectionHeuristic, boolean)}, but assumes
     * {@code isolate} is true.
     *
     * @param str The input string.
     * @param heuristic The algorithm to be used to estimate the string's overall direction.
     *        See {@link TextDirectionHeuristics} for pre-defined heuristics.
     * @return Input string after applying the above processing. {@code null} if {@code str} is
     *     {@code null}.
     */
    public String unicodeWrap(String str, TextDirectionHeuristic heuristic) {
        return unicodeWrap(str, heuristic, true /* isolate */);
    }

    /**
     * Operates like {@link #unicodeWrap(String, TextDirectionHeuristic, boolean)}, but uses the
     * formatter's default direction estimation algorithm.
     *
     * @param str The input string.
     * @param isolate Whether to directionally isolate the string to prevent it from garbling the
     *     content around it.
     * @return Input string after applying the above processing. {@code null} if {@code str} is
     *     {@code null}.
     */
    public String unicodeWrap(String str, boolean isolate) {
        return unicodeWrap(str, mDefaultTextDirectionHeuristic, isolate);
    }

    /**
     * Operates like {@link #unicodeWrap(String, TextDirectionHeuristic, boolean)}, but uses the
     * formatter's default direction estimation algorithm and assumes {@code isolate} is true.
     *
     * @param str The input string.
     * @return Input string after applying the above processing. {@code null} if {@code str} is
     *     {@code null}.
     */
    public String unicodeWrap(String str) {
        return unicodeWrap(str, mDefaultTextDirectionHeuristic, true /* isolate */);
    }

    /**
     * Helper method to return true if the Locale directionality is RTL.
     *
     * @param locale The Locale whose directionality will be checked to be RTL or LTR
     * @return true if the {@code locale} directionality is RTL. False otherwise.
     */
    private static boolean isRtlLocale(Locale locale) {
        return (TextUtils.getLayoutDirectionFromLocale(locale) == View.LAYOUT_DIRECTION_RTL);
    }

    /**
     * Directionality values (int constants used in place of an enum): LTR, unknown, and RTL.
     */
    private static final int DIR_LTR = -1;
    private static final int DIR_UNKNOWN = 0;
    private static final int DIR_RTL = +1;

    /**
     * Returns the directionality of the last character with strong directionality in the string, or
     * DIR_UNKNOWN if none was encountered. For efficiency, actually scans backwards from the end of
     * the string. Treats a non-BN character between an LRE/RLE/LRO/RLO and its matching PDF as a
     * strong character, LTR after LRE/LRO, and RTL after RLE/RLO. The results are undefined for a
     * string containing unbalanced LRE/RLE/LRO/RLO/PDF characters. The intended use is to check
     * whether a logically separate item that starts with a number or a character of the string's
     * exit directionality and follows this string inline (not counting any neutral characters in
     * between) would "stick" to it in an opposite-directionality context, thus being displayed in
     * an incorrect position. An LRM or RLM character (the one of the context's directionality)
     * between the two will prevent such sticking.
     *
     * @param str the string to check.
     */
    private static int getExitDir(String str) {
        return new DirectionalityEstimator(str, false /* isHtml */).getExitDir();
    }

    /**
     * Returns the directionality of the first character with strong directionality in the string,
     * or DIR_UNKNOWN if none was encountered. Treats a non-BN character between an
     * LRE/RLE/LRO/RLO and its matching PDF as a strong character, LTR after LRE/LRO, and RTL after
     * RLE/RLO. The results are undefined for a string containing unbalanced LRE/RLE/LRO/RLO/PDF
     * characters. The intended use is to check whether a logically separate item that ends with a
     * character of the string's entry directionality and precedes the string inline (not counting
     * any neutral characters in between) would "stick" to it in an opposite-directionality context,
     * thus being displayed in an incorrect position. An LRM or RLM character (the one of the
     * context's directionality) between the two will prevent such sticking.
     *
     * @param str the string to check.
     */
    private static int getEntryDir(String str) {
        return new DirectionalityEstimator(str, false /* isHtml */).getEntryDir();
    }

    /**
     * An object that estimates the directionality of a given string by various methods.
     */
    private static class DirectionalityEstimator {

        // Internal static variables and constants.

        /**
         * Size of the bidi character class cache. The results of the Character.getDirectionality()
         * calls on the lowest DIR_TYPE_CACHE_SIZE codepoints are kept in an array for speed.
         * The 0x700 value is designed to leave all the European and Near Eastern languages in the
         * cache. It can be reduced to 0x180, restricting the cache to the Western European
         * languages.
         */
        private static final int DIR_TYPE_CACHE_SIZE = 0x700;

        /**
         * The bidi character class cache.
         */
        private static final byte[] DIR_TYPE_CACHE;

        static {
            DIR_TYPE_CACHE = new byte[DIR_TYPE_CACHE_SIZE];
            for (int i = 0; i < DIR_TYPE_CACHE_SIZE; i++) {
                DIR_TYPE_CACHE[i] = Character.getDirectionality(i);
            }
        }

        // Internal instance variables.

        /**
         * The text to be scanned.
         */
        private final String text;

        /**
         * Whether the text to be scanned is to be treated as HTML, i.e. skipping over tags and
         * entities when looking for the next / preceding dir type.
         */
        private final boolean isHtml;

        /**
         * The length of the text in chars.
         */
        private final int length;

        /**
         * The current position in the text.
         */
        private int charIndex;

        /**
         * The char encountered by the last dirTypeForward or dirTypeBackward call. If it
         * encountered a supplementary codepoint, this contains a char that is not a valid
         * codepoint. This is ok, because this member is only used to detect some well-known ASCII
         * syntax, e.g. "http://" and the beginning of an HTML tag or entity.
         */
        private char lastChar;

        /**
         * Constructor.
         *
         * @param text The string to scan.
         * @param isHtml Whether the text to be scanned is to be treated as HTML, i.e. skipping over
         *     tags and entities.
         */
        DirectionalityEstimator(String text, boolean isHtml) {
            this.text = text;
            this.isHtml = isHtml;
            length = text.length();
        }

        /**
         * Returns the directionality of the first character with strong directionality in the
         * string, or DIR_UNKNOWN if none was encountered. Treats a non-BN character between an
         * LRE/RLE/LRO/RLO and its matching PDF as a strong character, LTR after LRE/LRO, and RTL
         * after RLE/RLO. The results are undefined for a string containing unbalanced
         * LRE/RLE/LRO/RLO/PDF characters.
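         *
         * <p>Some illustrative cases, assuming {@code dirTypeForward()} reports each character's
         * {@link Character#getDirectionality(int)} class (that helper is not shown here):
         * <pre>
         *   "abc"                  returns DIR_LTR
         *   "123 " + Hebrew text   returns DIR_RTL (digits and spaces are not strong)
         *   RLE + "abc" + PDF      returns DIR_RTL (embedded text takes the embedding's direction)
         *   "123"                  returns DIR_UNKNOWN
         * </pre>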
         */
        int getEntryDir() {
            // The reason for this method name, as opposed to getFirstStrongDir(), is that
            // "first strong" is a commonly used description of Unicode's estimation algorithm,
            // but the two must treat formatting characters quite differently. Thus, we are staying
            // away from both "first" and "last" in these method names to avoid confusion.
            charIndex = 0;
            int embeddingLevel = 0;
            int embeddingLevelDir = DIR_UNKNOWN;
            int firstNonEmptyEmbeddingLevel = 0;
            while (charIndex < length && firstNonEmptyEmbeddingLevel == 0) {
                switch (dirTypeForward()) {
                    case Character.DIRECTIONALITY_LEFT_TO_RIGHT_EMBEDDING:
                    case Character.DIRECTIONALITY_LEFT_TO_RIGHT_OVERRIDE:
                        ++embeddingLevel;
                        embeddingLevelDir = DIR_LTR;
                        break;
                    case Character.DIRECTIONALITY_RIGHT_TO_LEFT_EMBEDDING:
                    case Character.DIRECTIONALITY_RIGHT_TO_LEFT_OVERRIDE:
                        ++embeddingLevel;
                        embeddingLevelDir = DIR_RTL;
                        break;
                    case Character.DIRECTIONALITY_POP_DIRECTIONAL_FORMAT:
                        --embeddingLevel;
                        // To restore embeddingLevelDir to its previous value, we would need a
                        // stack, which we want to avoid. Thus, at this point we do not know the
                        // current embedding's directionality.
                        embeddingLevelDir = DIR_UNKNOWN;
                        break;
                    case Character.DIRECTIONALITY_BOUNDARY_NEUTRAL:
                        break;
                    case Character.DIRECTIONALITY_LEFT_TO_RIGHT:
                        if (embeddingLevel == 0) {
                            return DIR_LTR;
                        }
                        firstNonEmptyEmbeddingLevel = embeddingLevel;
                        break;
                    case Character.DIRECTIONALITY_RIGHT_TO_LEFT:
                    case Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC:
                        if (embeddingLevel == 0) {
                            return DIR_RTL;
                        }
                        firstNonEmptyEmbeddingLevel = embeddingLevel;
                        break;
                    default:
                        firstNonEmptyEmbeddingLevel = embeddingLevel;
                        break;
                }
            }

            // We have either found a non-empty embedding or scanned the entire string finding
            // neither a non-empty embedding nor a strong character outside of an embedding.
            if (firstNonEmptyEmbeddingLevel == 0) {
                // We have not found a non-empty embedding. Thus, the string contains neither a
                // non-empty embedding nor a strong character outside of an embedding.
                return DIR_UNKNOWN;
            }

            // We have found a non-empty embedding.
            if (embeddingLevelDir != DIR_UNKNOWN) {
                // We know the directionality of the non-empty embedding.
                return embeddingLevelDir;
            }

            // We do not remember the directionality of the non-empty embedding we found. So, we go
            // backwards to find the start of the non-empty embedding and get its directionality.
            while (charIndex > 0) {
                switch (dirTypeBackward()) {
                    case Character.DIRECTIONALITY_LEFT_TO_RIGHT_EMBEDDING:
                    case Character.DIRECTIONALITY_LEFT_TO_RIGHT_OVERRIDE:
                        if (firstNonEmptyEmbeddingLevel == embeddingLevel) {
                            return DIR_LTR;
                        }
                        --embeddingLevel;
                        break;
                    case Character.DIRECTIONALITY_RIGHT_TO_LEFT_EMBEDDING:
    650                     case Character.DIRECTIONALITY_RIGHT_TO_LEFT_OVERRIDE:
    651                         if (firstNonEmptyEmbeddingLevel == embeddingLevel) {
    652                             return DIR_RTL;
    653                         }
    654                         --embeddingLevel;
    655                         break;
    656                     case Character.DIRECTIONALITY_POP_DIRECTIONAL_FORMAT:
    657                         ++embeddingLevel;
    658                         break;
    659                 }
    660             }
    661             // We should never get here.
    662             return DIR_UNKNOWN;
    663         }
    664 
        /**
         * Returns the directionality of the last character with strong directionality in the
         * string, or DIR_UNKNOWN if none was encountered. For efficiency, actually scans backwards
         * from the end of the string. Treats a non-BN character between an LRE/RLE/LRO/RLO and its
         * matching PDF as a strong character, LTR after LRE/LRO, and RTL after RLE/RLO. The results
         * are undefined for a string containing unbalanced LRE/RLE/LRO/RLO/PDF characters.
         */
        int getExitDir() {
            // The reason for this method name, as opposed to getLastStrongDir(), is that "last
            // strong" sounds like the exact opposite of "first strong", which is a commonly used
            // description of Unicode's estimation algorithm (getUnicodeDir() above), but the two
            // must treat formatting characters quite differently. Thus, we are staying away from
            // both "first" and "last" in these method names to avoid confusion.
            charIndex = length;
            int embeddingLevel = 0;
            int lastNonEmptyEmbeddingLevel = 0;
            while (charIndex > 0) {
                switch (dirTypeBackward()) {
                    case Character.DIRECTIONALITY_LEFT_TO_RIGHT:
                        if (embeddingLevel == 0) {
                            return DIR_LTR;
                        }
                        if (lastNonEmptyEmbeddingLevel == 0) {
                            lastNonEmptyEmbeddingLevel = embeddingLevel;
                        }
                        break;
                    case Character.DIRECTIONALITY_LEFT_TO_RIGHT_EMBEDDING:
                    case Character.DIRECTIONALITY_LEFT_TO_RIGHT_OVERRIDE:
                        if (lastNonEmptyEmbeddingLevel == embeddingLevel) {
                            return DIR_LTR;
                        }
                        --embeddingLevel;
                        break;
                    case Character.DIRECTIONALITY_RIGHT_TO_LEFT:
                    case Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC:
                        if (embeddingLevel == 0) {
                            return DIR_RTL;
                        }
                        if (lastNonEmptyEmbeddingLevel == 0) {
                            lastNonEmptyEmbeddingLevel = embeddingLevel;
                        }
                        break;
                    case Character.DIRECTIONALITY_RIGHT_TO_LEFT_EMBEDDING:
                    case Character.DIRECTIONALITY_RIGHT_TO_LEFT_OVERRIDE:
                        if (lastNonEmptyEmbeddingLevel == embeddingLevel) {
                            return DIR_RTL;
                        }
                        --embeddingLevel;
                        break;
                    case Character.DIRECTIONALITY_POP_DIRECTIONAL_FORMAT:
                        ++embeddingLevel;
                        break;
                    case Character.DIRECTIONALITY_BOUNDARY_NEUTRAL:
                        break;
                    default:
                        if (lastNonEmptyEmbeddingLevel == 0) {
                            lastNonEmptyEmbeddingLevel = embeddingLevel;
                        }
                        break;
                }
            }
            return DIR_UNKNOWN;
        }

        // Internal methods

        /**
         * Gets the bidi character class, i.e. Character.getDirectionality(), of a given char, using
         * a cache for speed. Not designed for supplementary codepoints, whose results we do not
         * cache.
         */
        private static byte getCachedDirectionality(char c) {
            return c < DIR_TYPE_CACHE_SIZE ? DIR_TYPE_CACHE[c] : Character.getDirectionality(c);
        }

        /**
         * Returns the Character.DIRECTIONALITY_... value of the next codepoint and advances
         * charIndex. If isHtml, and the codepoint is '<' or '&', advances through the tag/entity,
         * and returns Character.DIRECTIONALITY_WHITESPACE. For an entity, it would be best to
         * figure out the actual character, and return its dirtype, but treating it as whitespace is
         * good enough for our purposes.
         *
         * @throws java.lang.IndexOutOfBoundsException if called when charIndex >= length or < 0.
         */
        byte dirTypeForward() {
            lastChar = text.charAt(charIndex);
            if (Character.isHighSurrogate(lastChar)) {
                int codePoint = Character.codePointAt(text, charIndex);
                charIndex += Character.charCount(codePoint);
                return Character.getDirectionality(codePoint);
            }
            charIndex++;
            byte dirType = getCachedDirectionality(lastChar);
            if (isHtml) {
                // Process tags and entities.
                if (lastChar == '<') {
                    dirType = skipTagForward();
                } else if (lastChar == '&') {
                    dirType = skipEntityForward();
                }
            }
            return dirType;
        }

        /**
         * Returns the Character.DIRECTIONALITY_... value of the preceding codepoint and advances
         * charIndex backwards. If isHtml, and the codepoint is the end of a complete HTML tag or
         * entity, advances over the whole tag/entity and returns
         * Character.DIRECTIONALITY_WHITESPACE. For an entity, it would be best to figure out the
         * actual character, and return its dirtype, but treating it as whitespace is good enough
         * for our purposes.
         *
         * @throws java.lang.IndexOutOfBoundsException if called when charIndex > length or <= 0.
         */
        byte dirTypeBackward() {
            lastChar = text.charAt(charIndex - 1);
            if (Character.isLowSurrogate(lastChar)) {
                int codePoint = Character.codePointBefore(text, charIndex);
                charIndex -= Character.charCount(codePoint);
                return Character.getDirectionality(codePoint);
            }
            charIndex--;
            byte dirType = getCachedDirectionality(lastChar);
            if (isHtml) {
                // Process tags and entities.
                if (lastChar == '>') {
                    dirType = skipTagBackward();
                } else if (lastChar == ';') {
                    dirType = skipEntityBackward();
                }
            }
            return dirType;
        }

        /**
         * Advances charIndex forward through an HTML tag (after the opening &lt; has already been
         * read) and returns Character.DIRECTIONALITY_WHITESPACE. If there is no matching &gt;,
         * does not change charIndex and returns Character.DIRECTIONALITY_OTHER_NEUTRALS (for the
         * &lt; that hadn't been part of a tag after all).
         */
        private byte skipTagForward() {
            int initialCharIndex = charIndex;
            while (charIndex < length) {
                lastChar = text.charAt(charIndex++);
                if (lastChar == '>') {
                    // The end of the tag.
                    return Character.DIRECTIONALITY_WHITESPACE;
                }
                if (lastChar == '"' || lastChar == '\'') {
                    // Skip over a quoted attribute value inside the tag.
                    char quote = lastChar;
                    while (charIndex < length && (lastChar = text.charAt(charIndex++)) != quote) {}
                }
            }
            // The original '<' wasn't the start of a tag after all.
            charIndex = initialCharIndex;
            lastChar = '<';
            return Character.DIRECTIONALITY_OTHER_NEUTRALS;
        }

        /**
         * Advances charIndex backward through an HTML tag (after the closing &gt; has already been
         * read) and returns Character.DIRECTIONALITY_WHITESPACE. If there is no matching &lt;, does
         * not change charIndex and returns Character.DIRECTIONALITY_OTHER_NEUTRALS (for the &gt;
         * that hadn't been part of a tag after all). Nevertheless, the running time for calling
         * skipTagBackward() in a loop remains linear in the size of the text, even for a text like
         * "&gt;&gt;&gt;&gt;", because skipTagBackward() also stops looking for a matching &lt;
         * when it encounters another &gt;.
         */
        private byte skipTagBackward() {
            int initialCharIndex = charIndex;
            while (charIndex > 0) {
                lastChar = text.charAt(--charIndex);
                if (lastChar == '<') {
                    // The start of the tag.
                    return Character.DIRECTIONALITY_WHITESPACE;
                }
                if (lastChar == '>') {
                    break;
                }
                if (lastChar == '"' || lastChar == '\'') {
                    // Skip over a quoted attribute value inside the tag.
                    char quote = lastChar;
                    while (charIndex > 0 && (lastChar = text.charAt(--charIndex)) != quote) {}
                }
            }
            // The original '>' wasn't the end of a tag after all.
            charIndex = initialCharIndex;
            lastChar = '>';
            return Character.DIRECTIONALITY_OTHER_NEUTRALS;
        }

        /**
         * Advances charIndex forward through an HTML character entity (after the opening
         * &amp; has already been read) and returns Character.DIRECTIONALITY_WHITESPACE. It would be
         * best to figure out the actual character and return its dirtype, but this is good enough.
         * If there is no closing ;, charIndex ends up at the end of the text.
         */
        private byte skipEntityForward() {
            while (charIndex < length && (lastChar = text.charAt(charIndex++)) != ';') {}
            return Character.DIRECTIONALITY_WHITESPACE;
        }

        /**
         * Advances charIndex backward through an HTML character entity (after the closing ;
         * has already been read) and returns Character.DIRECTIONALITY_WHITESPACE. It would be best
         * to figure out the actual character and return its dirtype, but this is good enough.
         * If there is no matching &amp;, does not change charIndex and returns
         * Character.DIRECTIONALITY_OTHER_NEUTRALS (for the ';' that did not end an entity after
         * all). Nevertheless, the running time for calling skipEntityBackward() in a loop remains
         * linear in the size of the text, even for a text like ";;;;;;;", because
         * skipEntityBackward() also stops looking for a matching &amp; when it encounters another ;.
         */
        private byte skipEntityBackward() {
            int initialCharIndex = charIndex;
            while (charIndex > 0) {
                lastChar = text.charAt(--charIndex);
                if (lastChar == '&') {
                    // The start of the entity.
                    return Character.DIRECTIONALITY_WHITESPACE;
                }
                if (lastChar == ';') {
                    break;
                }
            }
            // The original ';' wasn't the end of an entity after all.
            charIndex = initialCharIndex;
            lastChar = ';';
            return Character.DIRECTIONALITY_OTHER_NEUTRALS;
        }
    }
}
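
// The backward scan performed by getExitDir() can be tried outside Android with the rough
// sketch below. It is a simplified illustration, not the class's actual code: the names
// ExitDirSketch and exitDir() and the local DIR_* values are hypothetical, and the HTML
// tag/entity skipping and the directionality cache are left out on purpose.

```java
/** Hypothetical standalone sketch of an exit-directionality scan; not part of BidiFormatter. */
final class ExitDirSketch {
    // Assumed values, chosen for this sketch only.
    static final int DIR_LTR = -1;
    static final int DIR_UNKNOWN = 0;
    static final int DIR_RTL = 1;

    /** Scans backwards by code point, tracking embedding depth with a counter, not a stack. */
    static int exitDir(CharSequence text) {
        int index = text.length();
        int embeddingLevel = 0;
        int lastNonEmptyEmbeddingLevel = 0;
        while (index > 0) {
            int codePoint = Character.codePointBefore(text, index);
            index -= Character.charCount(codePoint);
            switch (Character.getDirectionality(codePoint)) {
                case Character.DIRECTIONALITY_LEFT_TO_RIGHT:
                    if (embeddingLevel == 0) {
                        return DIR_LTR;
                    }
                    if (lastNonEmptyEmbeddingLevel == 0) {
                        lastNonEmptyEmbeddingLevel = embeddingLevel;
                    }
                    break;
                case Character.DIRECTIONALITY_RIGHT_TO_LEFT:
                case Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC:
                    if (embeddingLevel == 0) {
                        return DIR_RTL;
                    }
                    if (lastNonEmptyEmbeddingLevel == 0) {
                        lastNonEmptyEmbeddingLevel = embeddingLevel;
                    }
                    break;
                case Character.DIRECTIONALITY_LEFT_TO_RIGHT_EMBEDDING:
                case Character.DIRECTIONALITY_LEFT_TO_RIGHT_OVERRIDE:
                    // Start of the embedding whose content we already saw: it decides.
                    if (lastNonEmptyEmbeddingLevel == embeddingLevel) {
                        return DIR_LTR;
                    }
                    --embeddingLevel;
                    break;
                case Character.DIRECTIONALITY_RIGHT_TO_LEFT_EMBEDDING:
                case Character.DIRECTIONALITY_RIGHT_TO_LEFT_OVERRIDE:
                    if (lastNonEmptyEmbeddingLevel == embeddingLevel) {
                        return DIR_RTL;
                    }
                    --embeddingLevel;
                    break;
                case Character.DIRECTIONALITY_POP_DIRECTIONAL_FORMAT:
                    // Going backwards, a PDF opens an embedding.
                    ++embeddingLevel;
                    break;
                case Character.DIRECTIONALITY_BOUNDARY_NEUTRAL:
                    break;
                default:
                    // Any other character makes the current embedding non-empty.
                    if (lastNonEmptyEmbeddingLevel == 0) {
                        lastNonEmptyEmbeddingLevel = embeddingLevel;
                    }
                    break;
            }
        }
        return DIR_UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(exitDir("abc"));                    // Latin letters only
        System.out.println(exitDir("abc \u05d0"));             // ends with a Hebrew letter
        System.out.println(exitDir("123 !"));                  // no strong character at all
        System.out.println(exitDir("abc\u202B123\u202C"));     // digits inside RLE ... PDF
        System.out.println(exitDir("\u05d0 \u202Aabc\u202C")); // Latin inside LRE ... PDF
    }
}
```

// In this sketch, digits wrapped in RLE ... PDF make the string exit as RTL even though the
// digits themselves are not strong, which mirrors the "non-BN character inside an embedding
// counts as strong" rule described in the getExitDir() Javadoc.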