
3884 - Unicode combining characters

There is a macro "Display Character Code." It used to be the case that one could use the cursor keys (e.g. the right arrow) to step through the parts of a Unicode composite character, and at each character--base or combining--get the character code of that character using this macro.

After bug 3455 was fixed, the above behavior was no longer possible. In particular, the cursor will no longer move to each combining character--it only moves to base characters.

There are several implications of this. One is that it is no longer possible to edit a composite character: the only thing you can do is delete the entire composite character (base + diacritics) and start over. I suppose that's not the end of the world.

A more important implication is that the "Display Character Code" macro no longer works for combining characters--it is impossible to use this macro to find the code point of such a character (it only works for base characters). In principle, one can use the Hex plugin, but this is a clumsy workaround: it involves switching to a different file format, finding where the character in question was, and switching back. It would be far better to be able to use this macro.

There are (at least) two possible fixes. One would be to restore the ability of the cursor keys (and backspace and delete keys) to move one character at a time, regardless of whether that character is a base or combining character. Based on an email thread of Aug 2012 entitled "Editing Unicode combining characters", this would be a lot of work. A simpler but acceptable method, suggested by Kazutoshi Satoda 25 Aug 2012 in that same thread, would be for the "Display Character Code" macro to output a sequence of code points, including the base character and any combining characters.
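To make the second fix concrete, here is a minimal sketch in plain Java of what the macro's output logic might look like. It is not the existing macro code: the class name CodePointSequence and the methods isCombiningMark and describe are hypothetical, and it assumes that "combining character" means any code point whose Unicode general category is Mn, Mc or Me. It uses only standard java.lang calls, so it should be portable to a BeanShell macro, but that is untested.

```java
import java.util.StringJoiner;

/** Hypothetical helper; not part of jEdit or of the existing macro. */
final class CodePointSequence {

    /** True for Unicode general categories Mn, Mc and Me (combining marks). */
    static boolean isCombiningMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    /**
     * Returns the hex code points of the character at 'offset' followed by
     * those of any combining marks directly after it, separated by spaces.
     * Returns an empty string when 'offset' is outside the text.
     */
    static String describe(CharSequence text, int offset) {
        if (offset < 0 || offset >= text.length()) {
            return "";
        }
        StringJoiner out = new StringJoiner(" ");
        int cp = Character.codePointAt(text, offset);
        out.add(Integer.toHexString(cp).toUpperCase());
        int i = offset + Character.charCount(cp);
        // Keep going while the following code points are combining marks.
        while (i < text.length()) {
            cp = Character.codePointAt(text, i);
            if (!isCombiningMark(cp)) {
                break;
            }
            out.add(Integer.toHexString(cp).toUpperCase());
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // 'a' followed by a combining acute accent, caret before the 'a'.
        System.out.println(describe("a\u0301", 0)); // prints "61 301"
    }
}
```

In the macro, 'text' would be the buffer contents and 'offset' the caret position. The category test is deliberately simple; it does not try to follow the full Unicode grapheme cluster rules.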

I'm attaching a file that contains a base character 'a' (ASCII/Unicode U+61) followed by a combining acute accent (U+301). Notice that this is in Unicode *decomposed* (NFD) format; if it were in the composed format (NFC), it would be a single code point, U+E1, and would not illustrate the problem. (The problem also arises with combinations of base+diacritics for which there is no NFC form. BTW, I'm assuming that SourceForge doesn't convert uploaded files to NFC--if it does, the code point U+E1 will show up. Let me know if that happens and I'll come up with a combination that won't change.)
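For anyone who wants to check which form a given string is in, the following is a small sketch using the standard java.text.Normalizer class. The class name NormalizationDemo and its helpers toNfc and hex are made up for illustration; only Normalizer.normalize and Normalizer.Form.NFC come from the JDK.

```java
import java.text.Normalizer;

/** Hypothetical demo class; only java.text.Normalizer is a real JDK API here. */
final class NormalizationDemo {

    /** Converts the string to Unicode composed form (NFC). */
    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Lists the code points of a string as "U+XXXX" values separated by spaces. */
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        String decomposed = "a\u0301";              // 'a' + combining acute (NFD)
        System.out.println(hex(decomposed));        // prints "U+0061 U+0301"
        System.out.println(hex(toNfc(decomposed))); // prints "U+00E1"
    }
}
```

If the attached file has been converted to NFC on upload, its contents will look like the second line of output rather than the first.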

To demonstrate the problem with the attached file, put the cursor at the top of the file, and call "Display Character Code." It will show code point 61. Move the cursor one position to the right; you'll be at EOF, and the macro will not display anything. It should instead either display the code point U+301 (first solution above, allow cursor movement to each character whether base or diacritic), or else when the cursor is at the top of the file (before the accented 'a') the macro should display "61 301" or similar (e.g. "61+301"), i.e. one code point for each character (second solution above).

This problem will not occur in 8-bit encodings, since they don't have a notion of combining characters. I don't know enough about other non-8-bit encodings (like Big5) to know whether it happens there; probably just in Unicode.

jEdit v5.1.0 (and other versions since at least Aug 2012), Windows 7, Java version 1.8.0_20.

Submitted: mcswell - 2014-09-21 18:19:57.607000
Assigned:
Priority: 5
Labels: unicode macros
Status: open
Group: minor bug
Resolution: None

Comments