Tuesday, October 5, 2010

Text direction

Most scripts in the world are written left-to-right (LTR). But some scripts are right-to-left (RTL), e.g. Hebrew and Arabic. OK, the days when poor programmers had to reverse Hebrew strings to show them right-to-left are gone. But the document layout is still our responsibility. Speaking of HTML, we have to define the document direction (ltr or rtl) and choose between margin-left and margin-right, padding-left and padding-right.
But first we have to determine whether the current script is LTR or RTL. I typically used code like this:

private static boolean isRTL(Locale locale) {
    Pattern rtl = Pattern.compile("he|iw|ar");
    return rtl.matcher(locale.getLanguage()).matches();
}

But once I had a few free minutes and tried to find out what code would support all languages. Fortunately, I found such code in Google GWT:
 
private static final Pattern RtlLocalesRe = Pattern.compile(
    "^(ar|dv|he|iw|fa|nqo|ps|sd|ug|ur|yi|.*[-_](Arab|Hebr|Thaa|Nkoo|Tfng))" +
    "(?!.*[-_](Latn|Cyrl)($|-|_))($|-|_)");

public static boolean isRtlLanguage(String language) {
    return language != null && RtlLocalesRe.matcher(language).find();
}

Obviously I did not want to put GWT on my classpath just for this small function, so I created my own humble utility class and copied the code there. The full code can be found at the end of this post.
But this was not enough. I found that I also needed a method that returns “ltr” or “rtl” for the current language (see getDocumentDirection). The method getAlign, which returns “right” or “left”, is useful as the value of the align attribute and for generating margin and padding properties:
 
<div style="margin-RTLUtil.getLangAlign(lang): 2em; direction: RTLUtil.getLangDirection(lang)">
</div>
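
To make the snippet above more concrete, here is a minimal sketch of how such helpers might look. The method names getLangDirection and getLangAlign follow the snippet; their bodies, the class name RtlDirectionSketch and the helper divStyle are my assumptions for illustration and are not taken from the attached utility.

```java
import java.util.regex.Pattern;

public class RtlDirectionSketch {
    // Same pattern as the GWT snippet quoted earlier in the post.
    private static final Pattern RTL_LOCALES = Pattern.compile(
        "^(ar|dv|he|iw|fa|nqo|ps|sd|ug|ur|yi|.*[-_](Arab|Hebr|Thaa|Nkoo|Tfng))"
        + "(?!.*[-_](Latn|Cyrl)($|-|_))($|-|_)");

    public static boolean isRtlLanguage(String language) {
        return language != null && RTL_LOCALES.matcher(language).find();
    }

    // Assumed behaviour: the value for the CSS "direction" property.
    public static String getLangDirection(String language) {
        return isRtlLanguage(language) ? "rtl" : "ltr";
    }

    // Assumed behaviour: the side on which a line of text starts.
    public static String getLangAlign(String language) {
        return isRtlLanguage(language) ? "right" : "left";
    }

    // Hypothetical helper: builds the style attribute used in the <div> above.
    public static String divStyle(String language) {
        return "margin-" + getLangAlign(language) + ": 2em; direction: "
            + getLangDirection(language);
    }

    public static void main(String[] args) {
        System.out.println(divStyle("he")); // margin-right: 2em; direction: rtl
        System.out.println(divStyle("en")); // margin-left: 2em; direction: ltr
    }
}
```

Building the property name from the alignment keeps the indent on the side where the text starts, so the same template works for both directions.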

OK, everything is fine when the whole screen is written in one language. But some sites contain user-created content, for example blogs, advertisement sites etc. Some users may post text in a language whose direction differs from that of the site itself. This looks bad. I suggest the following solution.
As we see in the previous code snippet, there are very few RTL languages. All others are LTR. I did not spend much time investigating all the languages mentioned in Google’s regular expression, but Hebrew and Arabic definitely use their own scripts. This means it is very easy to write code that identifies the script type using the Unicode range. But how many letters should we check? Even one paragraph may use letters from several character sets with different directions (this is, for example, very common in job descriptions for programmers). I decided that a 50% criterion is good enough, i.e. the text should be aligned to the left if most characters in the text should be shown from left to right, and vice versa. The following method implements this idea.
 
public static boolean isRtlScript(String text) {
    int[][] rtlRanges = new int[][] {
        new int[] {'\u0590', '\u05FF'}, // Hebrew and Yiddish
        new int[] {'\u0600', '\u06FF'}, // Arabic
        new int[] {'\u0780', '\u07BF'}, // Thaana (Thaa)
    };
    int rtlCount = 0;
    for (int i = 0;  i < text.length();  i++) {
        char c = text.charAt(i);
        for (int j = 0;  j < rtlRanges.length;  j++) {
            if (c >= rtlRanges[j][0] && c <= rtlRanges[j][1]) {
                rtlCount++;
                break; // a character belongs to at most one range
            }
        }
    }
    // RTL only if more than half of all characters (spaces and punctuation included) are RTL
    return rtlCount > text.length() / 2;
}

This method does not support all languages mentioned in Google’s regular expression because I did not spend enough time googling these languages.
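
One possible way to cover more scripts without listing ranges by hand is to ask the JDK for each character's bidirectional class via Character.getDirectionality. The sketch below is an alternative technique, not the code from the attached utility; the class name ScriptDirectionSketch and the method name isMostlyRtl are hypothetical. Note that it deliberately differs from the method above: it compares strong RTL characters against strong LTR characters only, so spaces, digits and punctuation do not count toward either side.

```java
public class ScriptDirectionSketch {

    /** Returns true when strongly RTL characters outnumber strongly LTR ones. */
    public static boolean isMostlyRtl(String text) {
        int rtl = 0;
        int ltr = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            switch (Character.getDirectionality(cp)) {
                case Character.DIRECTIONALITY_RIGHT_TO_LEFT:        // e.g. Hebrew, Thaana, N'Ko
                case Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC: // Arabic-script letters
                    rtl++;
                    break;
                case Character.DIRECTIONALITY_LEFT_TO_RIGHT:
                    ltr++;
                    break;
                default:
                    break; // digits, spaces, punctuation: no strong direction
            }
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return rtl > ltr;
    }

    public static void main(String[] args) {
        String hebrew = "\u05E9\u05DC\u05D5\u05DD"; // "shalom", 4 Hebrew letters
        System.out.println(isMostlyRtl(hebrew + " hi"));    // true:  4 RTL vs 2 LTR
        System.out.println(isMostlyRtl(hebrew + " world")); // false: 4 RTL vs 5 LTR
    }
}
```

Which scripts are recognised then depends on the Unicode data shipped with the Java version in use rather than on a hand-written table.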
Here is the source code of the utility itself and its JUnit test, saved in MS Word format to satisfy this blog’s restrictions.

RtlUtil.java
RtlUtilTest.java
