Hierarchically Implicit Bidi

Author: Beni Cherniavsky
Version: 0
Date: 2004-07-30

There is an incomplete (but sufficient for many documents) implementation now!

Vision

Currently each document format that supports bidirectional text has special constructs to explicitly assign RTL/LTR direction to elements of the document tree. For the programmer, this means that bidi support must be explicitly specified for each format, and that all tools that process or convert these formats must have special support for bidi. This inevitably results in a user experience where bidi information is fragile and easily lost or mangled in conversion.

Unlike most markup issues, bidi information should be expressible in all formats to the same degree. Expressing 3 different kinds of emphasis vs. a bold font justifies differences between formats. The information needed for complete bidi support is always the same. The common part to all formats is that they carry text. Ideally, bidi information should be encoded as part of the text. (This is the same logic that makes character sets orthogonal to formats and Unicode/UTF-8 a Good Thing.)

Indeed, the Unicode Bidi Algorithm was designed for this purpose. Unfortunately, it stopped short of a complete solution. First, it ignores the need to specify the direction of elements bigger than a paragraph. This was left entirely to higher-level markup formats, which alone requires the per-format support mess described above. Second, inside a paragraph, bidi embedding codes were specified, but higher-level formats were allowed (to some degree even encouraged) to replace them with other constructs.

Here I present a scheme which relies on the carrying format to provide structure, while inferring the direction from the text. It can be applied to any format that structures a document as a tree of elements (the specific kinds of elements are not important for it). A direction is inferred for every element implicitly (requiring at most LRM/RLM to override it). With some luck, this scheme should allow unmodified conversion/generation tools to preserve bidi information.

Design

What bidi information does a document need?

To be able to render a document you need to know:

  • The document direction, which determines general layout (paragraph alignment and indentation, bullet position, table column order, etc.). This direction can vary between elements of the document (e.g. Hebrew text containing an English quotation of several paragraphs).
  • The text direction determines the order of letters. This direction can vary for parts of a paragraph (e.g. inline quotations).

So assuming a hierarchical document model, consisting of nested elements (like in HTML), we have to derive a direction for every element. In complex documents LTR and RTL elements can be arbitrarily nested ("embedded" is the standard bidi term). Note that some embeddings don't have any semantics beyond the embedding of one language inside another. For these, a grouping element with no inherent semantics (like span or div in HTML) is needed.

Hierarchically implicit encoding

HTML provides a dir="LTR" / dir="RTL" attribute that can be set on any element. This approach is suboptimal because in most conversion/processing programs elements come and go but text stays. If we could derive the direction from the text itself, no format adaptation would be needed and programs would preserve the bidi info in most cases automatically.

The Unicode bidi algorithm defined in UAX 9 describes the "first strong character" heuristic for implicitly determining the paragraph base direction: the first strongly directional character (e.g. an English or Hebrew letter) sets the direction of the paragraph. In most cases it works right; in the few cases where it doesn't, one can put an implicit bidi mark (LRM/RLM) at the beginning of the paragraph to override the decision.
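
As a minimal sketch of the first-strong heuristic (only the heuristic itself, not the rest of UAX 9), the following Python uses the standard unicodedata module. The function name first_strong_direction and the sample strings are illustrative assumptions, not part of any existing implementation.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK, bidi class L
RLM = "\u200f"  # RIGHT-TO-LEFT MARK, bidi class R


def first_strong_direction(text, default="ltr"):
    """Return 'ltr' or 'rtl' according to the first strong character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return default  # no strong character at all


hebrew = "\u05e9\u05dc\u05d5\u05dd"  # four Hebrew letters

print(first_strong_direction("hello " + hebrew))                 # ltr
print(first_strong_direction('"' + hebrew + '" he said.'))       # rtl (a wrong guess)
print(first_strong_direction(LRM + '"' + hebrew + '" he said.')) # ltr (overridden)
```

The last call shows the override: since LRM is itself a strong L character, putting it first is all it takes to change the guess.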

This approach is good. Unfortunately the Unicode bidi algorithm doesn't extend to other levels. For encoding the direction of paragraph parts it suggests separate explicit embedding codes for RTL and LTR embeddings (these inspired the HTML dir attribute). Document parts bigger than a paragraph are not accounted for in UAX 9 at all.

The basic proposed idea is to extend the first-strong heuristic to all document elements, applying it hierarchically (how exactly will be specified below). Thus a Hebrew quotation inside an English paragraph would be automatically handled correctly as long as it's surrounded by some HTML element (e.g. <blockquote> .. </blockquote> or <em> .. </em>). Similarly, the whole document would get the direction of its title and a table would get the direction of its first cell.

Fixing ambiguities

Hierarchical direction inference should already make good guesses in most cases. There are two ways to guide it, together allowing you to express any possible bidi structure:

  • Add structure: wrap text in some element just for the sake of indicating embedding structure (e.g. span or div in HTML).
  • Override the direction of any element by inserting an LRM/RLM mark right at the start of the element.

Note that you don't need any of the explicit bidi control codes.

Not all formats/tools give you these abilities at all points of a document. Workarounds for such cases will be given below; meanwhile let's make some choices that minimize the need for overriding in sane cases.

What kinds of ambiguities do we have?

Missing structure

The hierarchical inference relies on the provided document structure. Bidi embeddings constitute logical structure. Ideally, every embedding would always be wrapped in elements. But humans are way too lazy to do that! We type flat text and want the structure to be inferred. Without structure we are back to the problem that the Unicode bidi algorithm tries to solve. In fact, for compatibility, we MUST fall back on the Unicode bidi algorithm for flat text of mixed directionality. Let's list the common cases where it fails and see how we should fix them:

Neutrals on embedding boundaries left outside:

Neutral characters on LTR/RTL and RTL/number boundaries cause the most trouble. The classic case is punctuation characters:

Logical:he said "FOO!".
Base dir:L
Levels:000000000111000
Visual:he said "OOF!".

which is obviously not what you want (the exclamation mark is part of the RTL quotation and should be on the left of it). Taking all adjacent neutrals as part of the embedding wouldn't work either (the period must stay outside). Clearly, you must indicate the boundaries of the embedding. The common recommendation is to use RLM after the exclamation mark. But giving the missing structure by wrapping it with an element is a cleaner solution:

Logical:he said "{FOO!}".
Base dir:L
Levels:000000000 1111 00
Visual:he said "!OOF".

This will remain descriptive even if embedded inside another RTL text (see deep

Number boundaries:

Similar things happen at the boundaries of numbers inside RTL text (which are actually just another case of embeddings):

Logical:FOO-2 FEATURES 2 kb OF RAM AND 2 quux CPUS!  IT CAN COMPUTE 3 - 2!
Base dir:R
Levels:111221111111111212211111111111121222211111111111111111111111211121
Visual:!2 - 3 ETUPMOC NAC TI  !SUPC quux 2 DNA MAR FO kb 2 SERUTAEF -2OOF

FOO-2, 2 kb and 3 - 2 were mishandled but 2 quux (which looks exactly like 2 kb in logical order) was handled correctly. Again, let's add structure:

Logical:FOO-{2} FEATURES {2 kb} OF RAM AND 2 quux CPUS!  IT CAN COMPUTE {3 - 2}!
Base dir:R
Levels:1111 2 1111111111 2222 11111111111121222211111111111111111111111 22222 1
Visual:!3 - 2 ETUPMOC NAC TI  !SUPC quux 2 DNA MAR FO 2 kb SERUTAEF 2-OOF

Embeddings separated by neutrals are merged:

Consider these two sentences:

Logical:he said "FOO!".  "BAR!" he added.
Base dir:L
Levels:000000000111111111111000000000000
Visual:he said "RAB"  ."!OOF!" he added.

These should have been 2 separate embeddings. Let's add them:

Logical:he said "{FOO!}".  "{BAR!}" he added.
Base dir:L
Levels:000000000 1111 00000 1111 00000000000
Visual:he said "!OOF".  "!RAB" he added.
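
The Visual rows in the examples above follow from the Logical and Levels rows by the reordering rule of UAX 9 (rule L2): from the highest level down to the lowest odd level, reverse every contiguous run of characters at that level or higher. Here is a small sketch of that rule in Python. The function name reorder_by_levels is hypothetical, the braces must be removed from the Logical row (and the matching spaces from the Levels row) before calling it, and like the examples it does not mirror characters such as parentheses.

```python
def reorder_by_levels(text, levels):
    """Reorder logical text for display, given one level digit per character."""
    if len(text) != len(levels):
        raise ValueError("need exactly one level per character")
    chars = list(text)
    lv = [int(d) for d in levels]
    odd = [level for level in lv if level % 2 == 1]
    if not odd:
        return text  # nothing is right-to-left, so nothing moves
    for level in range(max(lv), min(odd) - 1, -1):
        i = 0
        while i < len(chars):
            # Find the run starting at i whose levels are all >= level.
            # Runs at a higher level always lie inside runs at a lower one,
            # so indexing lv by position stays valid after earlier reversals.
            j = i
            while j < len(chars) and lv[j] >= level:
                j += 1
            if j > i:
                chars[i:j] = chars[i:j][::-1]
                i = j
            else:
                i += 1
    return "".join(chars)


# Flat text: the exclamation mark is left outside the embedding.
print(reorder_by_levels('he said "FOO!".', "000000000111000"))
# With the structure {FOO!} added (braces and their level gaps removed).
print(reorder_by_levels('he said "FOO!".', "000000000111100"))
```

The first call prints he said "OOF!". and the second prints he said "!OOF". as in the punctuation example.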

Deep structure

This is a relatively rare need but it can't be neglected. LTR inside RTL inside LTR (or vice-versa) can never be inferred from flat text (well, in theory, structure could be harvested from parentheses and quote marks but that is too unreliable and the Unicode bidi algorithm didn't go that way).

There is only one possible fix: add structure by wrapping one of the embedding levels with an element. E.g. turn:

foo BAR quux BAZ quux BAN foo

into:

foo {BAR quux BAZ quux BAN} foo

or:

foo BAR {quux} BAZ {quux} BAN foo

Either should be enough.

Wrong direction

Suppose that you have given enough structure. We still need to infer the correct direction. The common case is something like:

"FOO" he said.

which is misinterpreted as RTL because it starts with an embedding. Wait! We said you've given enough structure:

"{FOO}" he said.

This can already be interpreted correctly with the right direction inference rules. As will be explained below, almost any sane structure will also get correct directions. And you do have RLM/LRM at your disposal for other cases...

TODO

Account for number embeddings. COMPUTE {3 - 2} should resolve as LTR; the definition below makes it inherit the RTL direction from the containing text. This must be fixed delicately.

Interactions between elements

[This section was written before the previous. It duplicates most of the arguments given above (this document could use some more editing ;-)) but goes into details of behaviour. The described behaviour should allow the above use cases of fixing structure.]

How should a sub-element interact with its surroundings? UAX 9 defines the following behaviour of embeddings:

  1. The paragraph's base direction guessing ignores embedding boundaries and takes the first strong character whether it's embedded or not.
  2. Embeddings are represented by levels. Adjacent embeddings are merged as a side effect of this.
  3. Implicit levels are resolved. On the inside, an embedding behaves as a sub-paragraph with appropriate base level; on the outside it behaves as a single strongly-directional character (of the embedding's direction).
  4. The representation is still level-based, so adjacent embeddings are merged again.

The current standard for HTML bidi (described e.g. in the CSS2 spec) follows this behavior, mapping markup (normally dir attributes) to UAX 9 embeddings and applying the above algorithm on the whole paragraph at once (allowing embedding boundaries to be quite independent from element boundaries).

I claim this is sub-optimal in several ways.

  • It's questionable whether element boundaries should be transparent for purposes of direction guessing. In other words, should {FOO} bar. have an RTL base direction? If it does, it means that an element starting with a sub-element always gets the same direction as that sub-element, and one must insert an LRM/RLM before it to make them differ. So I claim that skipping sub-elements during direction guessing would do the right thing in most cases. Note that again we treat sub-elements as neutrals. However, if there is no strong character outside sub-elements (e.g. in {FOO}, {bar}), we should fall back on the first sub-element with a strong direction.

    What should we do about elements that contain no strong characters whatsoever? For paragraphs, UAX 9 says that a paragraph containing no strongly directional characters defaults to LTR. This makes little sense in a hierarchical document; we should try to infer the direction from the surroundings of the element. One idea is to take the containing element's direction. A better idea is to run the bidi algorithm on the containing element, treating each sub-element as if it were a neutral character, and take the direction resolved for the corresponding "neutral character". This makes sense in cases like the following:

    Logical:he said: "THE NUMBERS {3, 7, 13} ARE ALL PRIME."
    Base dir:L
    Levels:0000000000111111111111 21121122 1111111111111100
    Visual:he said: "EMIRP LLA ERA 13 ,7 ,3 SREBMUN EHT."

    Here you had an implicit RTL embedding and any neutral sub-elements of it should indeed inherit an RTL direction.

  • The next thing to question is the effect elements have on surrounding neutral characters. A neutral character trapped between two elements would get their direction, which might not be what you want at all:

    Logical:he said "FOO!".  "BAR!" he added.
    Base dir:L
    Levels:000000000111111111111000000000000
    Visual:he said "RAB"  ."!OOF!" he added.

    This is the effect you would expect from the following element structure:

    Logical:he said "{FOO!".  "BAR}!" he added.
    Base dir:L
    Levels:000000000 111111111111 000000000000
    Visual:he said "RAB"  ."!OOF!" he added.

    but in our case you would embed each quotation in its own element and expect the intervening punctuation not to be taken as part of the embedding:

    Logical:he said "{FOO!}".  "{BAR!}" he added.
    Base dir:L
    Levels:000000000 1111 00000 1111 00000000000
    Visual:he said "!OOF".  "!RAB" he added.

    The point is that given the ability to express different structures explicitly, you would like the bidi algorithm to guess less (for flat text without structure information, the UAX 9 algorithm should apply as today, of course). The desired effect is achievable if elements behave as neutral characters on the outside.

    • A sub-point of this is behaviour of adjacent sub-elements. If several adjacent sub-elements have a direction opposite to that of the containing element, UAX 9 would merge them into one embedding of opposite direction:

      Logical:ab {CD}{EF} gh
      Base dir:L
      Levels:000 11  11 000
      Visual:ab FEDC gh

      (effectively assuming an embedding around both of them: ab {{CD}{EF}} gh), whereas treating each as a neutral character would maintain the containing element's order between them:

      Logical:ab {CD}{EF} gh
      Base dir:L
      Levels:000{11}{11}000
      Visual:ab DCFE gh

      It's harder here to make a case for the second behavior because the reader is left with no punctuation to separate the embeddings and the reading order is very ambiguous. But if you special-case adjacent sub-elements, you would have to define the semantics for adjacent sub-elements of different directions, which becomes very complicated; always treating sub-elements as neutrals is easy to understand.

      And again, the behavior of "merging" them is always achievable by explicitly adding an element around both, so always treating sub-elements as neutrals is more flexible.

  • Another very basic thing to question in standard bidi handling of HTML elements is that the final merging of adjacent runs of same level can break an element into non-adjacent parts if the first/last character of an element is at a deeper level than the element's base level:

    Logical:{ab CD}{XY zw}
    Base dir:L
    Levels: 00011  11000 
    Visual:ab YXDC zw

    This example shows standard HTML bidi behaviour, so both elements have LTR base direction; in our model the second element would get an RTL base direction but still the first element would be disjoined:

    Logical:{ab CD}{XY zw}
    Base dir:L
    Levels: 00011  11122 
    Visual:ab zw YXDC

    Still, it makes no sense. Opening an element outside an embedding and closing it inside it is not any better than overlapping elements: <element><bidi></element></bidi>! Bidi embeddings always correspond to some logical nesting of text, so we should require every element to be in one piece (though perhaps in different internal order) in the output:

    Logical:{ab CD}{XY zw}
    Base dir:L
    Levels:{00011}{11122}
    Visual:ab DCzw YX

    Such treatment also means that reordering could be done directly on the tree; this should also be simpler to implement.

Workarounds

The scheme described so far is sufficient to describe any bidi structure merely by using elements (sometimes only for the structure, like HTML's span or div elements) and LRM/RLM to override directions. If you directly write the HTML, there is no problem. However, one of the major ideas of this scheme is that you can use an existing converter from some other format to HTML that doesn't have to know anything about bidi. As long as it translates elements of the source format to HTML elements and doesn't drop your LRM/RLM marks, it will work. But the conversion will usually limit your freedom to control the resulting HTML. You can suffer from either or both of two problems:

  • Inability to create arbitrary structure of elements.
  • Inability to put text, specifically bidi control codes, into arbitrary places of the resulting document.

The first problem is easily solved by the fact that the full UAX 9 bidi algorithm runs on each element of the document. You can use LRE..PDF / RLE..PDF (or sometimes just LRM/RLM) to embed parts without having to actually wrap them with elements. Obviously, the ideal development would be to add two new implicit embedding characters to Unicode, behaving as described here. That will not happen soon (and one of my goals here is to prove it's a good idea), so for now the explicit embeddings will do.

The second problem is harder. Suppose you want to put RLM at the start of the document to override its direction. Many input formats will not let you put any text above the heading and in any case all your text will be wrapped in block-level elements. The best you can do is to create a paragraph containing nothing but an RLM.

What I propose is to artificially strip any element(s) that contain nothing but bidi control codes (LRM, RLM, LRE, RLE, LRO, RLO, PDF), until they appear in the same element with other things. E.g. {text {{&LRM;}{&RLE;}} text} becomes {text &LRM;&RLE; text}. This is a rather ugly hack but it allows you to overcome the above problems. A paragraph containing nothing but an LRM would be converted to an LRM at the block level, which would have priority over all sub-elements (including the heading appearing before it) to set the document's direction. You could even place part of the document between two paragraphs containing nothing but an RLE and a PDF respectively, to embed this part of the document.
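
To make the stripping concrete, here is a rough sketch on a toy tree model. It assumes (purely for illustration) that an element is a Python list whose items are either strings or nested lists; the names is_control_only and strip_control_wrappers are hypothetical. Elements are processed bottom-up, and the root is never stripped because there is no parent to splice into.

```python
LRM, RLM = "\u200e", "\u200f"
LRE, RLE, PDF = "\u202a", "\u202b", "\u202c"
LRO, RLO = "\u202d", "\u202e"
BIDI_CONTROLS = {LRM, RLM, LRE, RLE, PDF, LRO, RLO}


def is_control_only(elem):
    """True if elem holds at least one character and only bidi controls."""
    if not all(isinstance(item, str) for item in elem):
        return False  # still has a real sub-element inside
    text = "".join(elem)
    return bool(text) and set(text) <= BIDI_CONTROLS


def strip_control_wrappers(elem):
    """Return a copy of elem with control-only sub-elements unwrapped."""
    result = []
    for item in elem:
        if isinstance(item, list):
            item = strip_control_wrappers(item)  # bottom-up
            pieces = item if is_control_only(item) else [item]
        else:
            pieces = [item]
        for piece in pieces:
            if isinstance(piece, str) and result and isinstance(result[-1], str):
                result[-1] += piece  # keep adjacent text merged
            else:
                result.append(piece)
    return result


# {text {{LRM}{RLE}} text} becomes {text LRM RLE text}
tree = ["text ", [[LRM], [RLE]], " text"]
print(strip_control_wrappers(tree) == ["text " + LRM + RLE + " text"])
```

A paragraph holding nothing but an RLM, as in [["Title"], [RLM], ["body"]], ends up as a bare RLM directly inside the root, which is exactly where it can override the document direction.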

Specification

This is version 0 of the specification. It has not yet been reviewed by other people and almost surely can be improved.

Here I summarize the algorithm decided upon:

  1. In bottom-up order, strip any elements that surround nothing but bidi control codes.

  2. In bottom-up order, assign strong directions to elements.

    • If an element directly contains strongly-directional (L, R, AL) text, it gets the direction of the first such character.
    • Otherwise, it gets the direction of the first sub-element that was assigned a strong direction.
    • Otherwise, it's considered neutral.
      • If it's the document root, assign it L direction (and you can stop running this algorithm, nothing will be reordered).

    TODO: account for numbers! Behaviour must be adapted (see number boundaries example).

  3. In top-down order, apply the UAX 9 bidi algorithm to every element, using its direction as the base direction and treating each sub-element (no matter what direction it was assigned) as a neutral (ON) character.

    • For each sub-element that was neutral, assign it a direction corresponding to the final resolved level of the character that stood for it (and use that as the base direction when applying the UAX 9 algorithm to it).

  4. Reorder hierarchically: the characters in each element are re-ordered as defined in UAX 9 according to the levels resolved for them; sub-elements are placed where the neutral characters representing them would be placed (and the order inside each sub-element is determined in the same manner).
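
Steps 2 to 4 above can be tried out on a toy model. The sketch below is only an approximation under loud assumptions: elements are Python lists of strings and nested lists (step 1 is assumed to be done already); text follows this document's convention that lowercase is LTR and UPPERCASE is RTL; everything else, digits included, is neutral (so the numbers TODO is not addressed); and a simplified neutral resolution (a neutral run takes the direction of its neighbours if they agree, else the base direction) stands in for the full UAX 9 algorithm. All function names are hypothetical.

```python
def strong_class(ch):
    """Document convention: lowercase = LTR, UPPERCASE = RTL, rest neutral."""
    if ch.islower():
        return "L"
    if ch.isupper():
        return "R"
    return None


def inferred_direction(elem):
    """Step 2: first strong char in direct text, else first strong sub-element."""
    for item in elem:
        if isinstance(item, str):
            for ch in item:
                if strong_class(ch):
                    return strong_class(ch)
    for item in elem:
        if isinstance(item, list):
            direction = inferred_direction(item)
            if direction:
                return direction
    return None


def resolve_levels(classes, base):
    """Step 3, simplified: resolve neutral runs, then assign implicit levels."""
    resolved = list(classes)
    i, n = 0, len(classes)
    while i < n:
        if classes[i] is not None:
            i += 1
            continue
        j = i
        while j < n and classes[j] is None:
            j += 1
        before = classes[i - 1] if i > 0 else base
        after = classes[j] if j < n else base
        resolved[i:j] = [before if before == after else base] * (j - i)
        i = j
    base_level = 0 if base == "L" else 1
    # A token whose direction differs from the base goes one level deeper.
    return [base_level + (1 if c != base else 0) for c in resolved]


def visual_order(levels):
    """Step 4: UAX 9 rule L2, returning token indices in display order."""
    order = list(range(len(levels)))
    odd = [level for level in levels if level % 2]
    if not odd:
        return order
    for level in range(max(levels), min(odd) - 1, -1):
        i = 0
        while i < len(order):
            j = i
            while j < len(order) and levels[order[j]] >= level:
                j += 1
            if j > i:
                order[i:j] = order[i:j][::-1]
                i = j
            else:
                i += 1
    return order


def render(elem, base=None):
    """Return the visual string for elem; sub-elements act as neutrals."""
    base = base or inferred_direction(elem) or "L"
    tokens, classes = [], []
    for item in elem:
        if isinstance(item, str):
            tokens.extend(item)
            classes.extend(strong_class(ch) for ch in item)
        else:
            tokens.append(item)
            classes.append(None)  # a sub-element stands in as one neutral
    levels = resolve_levels(classes, base)
    out = []
    for k in visual_order(levels):
        if isinstance(tokens[k], list):
            # A neutral sub-element takes the direction of its resolved level.
            sub = inferred_direction(tokens[k]) or ("R" if levels[k] % 2 else "L")
            out.append(render(tokens[k], sub))
        else:
            out.append(tokens[k])
    return "".join(out)


print(render(["ab ", ["CD"], ["EF"], " gh"]))  # ab DCFE gh
print(render([["ab CD"], ["XY zw"]]))          # ab DCzw YX
```

Both printed results agree with the Visual rows given earlier for these two structures, and each element stays in one piece because reordering happens per element on the tree.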

Implementation

Implementing this model requires changes only to the display stage (or as close to it as possible). For instance, it would be nice to implement it in a browser. However HTML should be readable in all browsers, so the only way to try out this new scheme is to write (just once) a program for converting hierarchically implicit HTML to standard HTML.

HTML conversion

Theory

Since HTML already does implicit bidi, the resulting standard HTML is not very different from the hibidi HTML. The required changes are:

  1. Strip tags containing only bidi control codes.
  2. Detect the direction of every tag.
  3. Set dir attributes.
    A. Insert LRM/RLM codes to prevent level run merging that can corrupt nesting structure?

Practice

The existing implementation is a quick-and-dirty prototype. It doesn't do (1), approximates (2) and doesn't do (3.A). "Approximates" means that it doesn't really perform the bidi algorithm at each level. It only infers the base direction of each element from the first strong character or first element containing a strong character and takes it as default for the whole element. This means that in:

<foo>
    ltr
    <bar>??</bar>
    RTL
    <quux>??</quux>
    RTL
    <baz>??</baz>
    ltr
</foo>

the neutral element quux will not be inferred as RTL although it should be according to the spec.
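
To illustrate the kind of approximation described above, here is a short sketch. It is not the actual hibidi.py. It assumes well-formed XML input so that the standard xml.etree.ElementTree parser can be used, and the function names are made up for the example. Each element gets the direction of the first strong character in its own text, else of its first strongly-directional child, else of its parent. A dir attribute is written only where the direction changes.

```python
import unicodedata
import xml.etree.ElementTree as ET


def char_direction(ch):
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class == "L":
        return "ltr"
    if bidi_class in ("R", "AL"):
        return "rtl"
    return None


def infer_direction(elem):
    """First strong char in the element's own text, else first strong child."""
    chunks = [elem.text] + [child.tail for child in elem]
    for chunk in chunks:
        for ch in chunk or "":
            direction = char_direction(ch)
            if direction:
                return direction
    for child in elem:
        direction = infer_direction(child)
        if direction:
            return direction
    return None


def set_dir_attrs(elem, inherited=None):
    """Set dir on elem if it differs from the parent's, then recurse."""
    direction = infer_direction(elem) or inherited or "ltr"
    if direction != inherited:
        elem.set("dir", direction)
    for child in elem:
        set_dir_attrs(child, direction)


html = ('<div><p>he said <q>\u05e9\u05dc\u05d5\u05dd!</q>.</p>'
        '<p>\u05e9\u05dc\u05d5\u05dd <em>??</em></p></div>')
root = ET.fromstring(html)
set_dir_attrs(root)
print(ET.tostring(root, encoding="unicode"))
```

Note that the neutral em element simply inherits the direction of its parent here; it is not resolved from the neighbouring text as the spec requires, which is the same shortcut as the one described for quux.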

I do intend to implement it fully, but for now this is enough for 98% of mixed documents out there... I have successfully used it on several highly mixed-language documents processed with docutils from plain text, without any internal changes to docutils (for example, see http://cben-hacks.sf.net/python/lectures/osdc2006/gen-fu.html). This is an example of my hope that with hibidi, we won't have to explicitly support bidi in every program and format.

Usage

Grab hibidi.py and rst2html_hibidi.py. Put them in the same directory and make both executable.

  • hibidi.py < without-dir-attrs.html > with-dir-attrs.html does the processing.
  • rst2html_hibidi.py is a version of rst2html.py from docutils that takes the same options but processes the output with hibidi. See the bidi entry in the docutils FAQ for details.

Future