Friday, February 27, 2009

So, Why Not Use Huneybee, Anyway?



One of the things that surprises me as I poke around the World Wide Web looking for Deseret Alphabet materials is the presence of Huneybee and recommendations that it be used. I have nothing against its design—indeed, I am happy to see new designs of the Deseret Alphabet, particularly if they’re not slavish copies of the font used for the four books printed in the 1860’s—but I will confess that Huneybee’s continued use makes me shudder.

The problem can be summarized in one word: mojibake. Computers, after all, don’t represent text qua text; they represent it as numbers. Text is stored internally as a series of numbers, and the software involved has to somehow map those numbers into something the user can see.
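
For the curious, here is a tiny Python sketch of my own making that shows the failure mode; the strings and encodings are arbitrary and have nothing to do with any particular Deseret document. Store the numbers under one interpretation, read them back under another, and mojibake is what you get:

    # The writer's software stores the text as numbers (here, UTF-8 bytes)...
    original = "Äpfel and naïveté"
    stored = original.encode("utf-8")

    # ...but the reader's software maps those same numbers through a
    # different character set, and mojibake is what comes out.
    garbled = stored.decode("cp1252")
    print(garbled)   # Ã„pfel and naÃ¯vetÃ©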

In practice, there are actually three sets of numbers associated with text. The first is the keycode, the number associated with the physical key the user is pressing. The second is the character code, the number used internally to represent a particular character. The third is the glyph ID, which is the index of a particular graphic shape within a font.
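
If you want to poke at the second and third of these yourself, here is a rough sketch using Python and the fontTools library (the keycode never leaves the operating system’s input layer, so it doesn’t show up here, and the font filename below is only a placeholder for whatever font you happen to have lying around):

    from fontTools.ttLib import TTFont   # pip install fonttools

    # Character code: the number that stands for the character itself,
    # independent of any font.
    char = "S"
    print(hex(ord(char)))                     # 0x53

    # Glyph ID: the index of the drawn shape for that character inside
    # one particular font file (the filename is just a placeholder).
    font = TTFont("SomeFont.ttf")
    cmap = font["cmap"].getBestCmap()         # character code -> glyph name
    glyph_name = cmap[ord(char)]
    glyph_id = font.getGlyphID(glyph_name)    # glyph name -> glyph index
    print(glyph_name, glyph_id)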

Naïve users (that is to say, most computer users, especially people whose experience is based on English, where text display is obscenely simple) assume that there is a direct mapping between the three. Huneybee is an example of this. You want to see a particular symbol on screen or in print, and you want to generate it with a particular keystroke. For example, you may want to type shift-S and have 𐐝 show up at the other end. You get this result by using your font-production software to put the 𐐝 glyph in the slot currently occupied by S. Done!

All this works if you are generating a text for immediate display and you don’t care what happens down the line, either when you transmit the text to someone else or when you come back in five years and try to edit the text. In order for this to work over space or time, you need to guarantee that the person at the other end has the right font installed and is set up to use it. If not, you get garbled nonsense: mojibake. The 2006 New Deseret Reader illustrates this. It works, if you have Huneybee installed. If not, you get illegible nonsense.

This is actually a serious problem in computer science and is one of the main motivations underlying Unicode. I still have some of the first computer-generated documents I ever made, but I can’t use them anymore. They were written using defunct software with an undocumented internal format on a defunct platform (the Atari ST) using a defunct, proprietary character set. Trajan’s column can still be read effortlessly nearly two thousand years after it was erected, but my own journals from the early 1980’s are illegible.

(I spent a fair chunk of my wasted youth as a secretary in the Molecular Biology department at the university where I did graduate work. We started out with WordPerfect on DOS, which was a very non-WYSIWYG environment. Once I managed to switch the font to “Greek” to insert some symbols without switching back, and I didn’t realize it until I printed a draft and the last two-thirds of the paper in question came out as garbage.

(I should also point out in fairness that I did this kind of thing myself out of laziness when I produced the Deseret Alphabet Triple Combination in 1997. I knew better, but I did it anyway, and I regret it now. I’ve managed to get away with it because the document is a PDF and doesn’t store the text as text, but as glyph IDs for an embedded font, so the data is entirely self-contained. It is, however, impossible for me to take that document and back-convert it to raw text because I don’t have a copy of the font I used anymore. I could probably manage to recreate the encoding, but more likely than not, I’m going to have to do the work all over again.)

The New Deseret Reader, by the way, illustrates another aspect of this problem. Because the Deseret Alphabet has thirty-eight letters in its standard form, and because it uses both upper- and lowercase forms, you need room for seventy-six letters, whereas ASCII only has slots for fifty-two. That means that you have to steal slots from punctuation as well as letters, and that means that you can’t use the punctuation yourself. Or Latin letters, for that matter, if you want to intermingle scripts.
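
Put in numbers, the shortfall is easy to see (a trivial check against Python’s own inventory of ASCII letters):

    import string

    deseret_slots_needed = 38 * 2                      # 38 letters, two cases
    ascii_letter_slots = len(string.ascii_letters)     # A-Z plus a-z
    print(deseret_slots_needed, ascii_letter_slots)    # 76 52
    print(deseret_slots_needed - ascii_letter_slots)   # 24 slots to steal from elsewhere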

There is a natural solution to this, and it comes in two pieces. The first piece is to decouple the characters from the specific font being used to represent them, and this is what Unicode does. It provides a standard way of representing text for dozens of writing systems and thousands of languages, one which is not tied to a specific font or platform. You still need a font covering the specific language/script in question, of course, but you don’t need to have a specific version of a specific font. Thus Wikipedia’s article on the Deseret Alphabet can contain Deseret Alphabet text and not require you to download and install a specific font before you can read it. You can use any Unicode-savvy Deseret Alphabet font you want. If you’re on a Mac, of course, you’re in luck because every Mac ships with a Unicode-savvy Deseret Alphabet font. If you’re on Windows, you can use James Kass’s excellent Code2001 font.
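
To see what the decoupling buys you, here is a small Python sketch (relying only on the standard unicodedata module) that walks the Deseret block, U+10400 through U+1044F. The numbers identify the letters; a font only enters the picture at the moment of display:

    import unicodedata

    # The Deseret block sits at U+10400..U+1044F in the Supplementary
    # Multilingual Plane.  The code points name the letters; any font
    # that covers the block can draw them.
    for cp in range(0x10400, 0x10450):
        ch = chr(cp)
        print(f"U+{cp:04X}  {ch}  {unicodedata.name(ch)}")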

And Unicode’s Web site can contain a whole page of Deseret Alphabet text and blithely assume that this page will continue to be legible for decades to come on any computer system with an appropriate font installed. And even if a font is not available, the text will be indisputably Deseret and not badly-spelled Latin.

There is a slight trickiness in doing this with somewhat older software which doesn’t support the non-BMP portions of Unicode, but current font-editing software and operating systems handle them fine. Some applications are still lacking in this area, I’m sorry to say, but that will change over time. (Firefox, for example, doesn’t display Unicode Deseret correctly.)

The other thing you need is a keyboard, that is, a way of mapping particular keystrokes into particular characters. All major operating systems have a way of using custom keyboard mappings, and editors for these mappings are freely available. Now, there are still issues with making a keyboard for the Deseret Alphabet which I’ll go into at some future point. And yes, you do need to have them installed. Making keyboards, however, is trivial, and getting them installed isn’t hard.
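
The mapping itself is nothing more than a table from keystrokes to character codes. The real thing gets built with a layout editor (Ukelele on the Mac, for instance), but a toy Python transliteration table shows the shape of the idea. The key assignments below are invented purely for illustration and aren’t any real Deseret layout, least of all mine:

    # A keyboard layout is, at bottom, a keystroke -> character-code table.
    # These assignments are made up for this example only.
    KEY_TO_DESERET = {
        "P": "\U00010411",  # DESERET CAPITAL LETTER PEE
        "p": "\U00010439",  # DESERET SMALL LETTER PEE
        "T": "\U00010413",  # DESERET CAPITAL LETTER TEE
        "t": "\U0001043B",  # DESERET SMALL LETTER TEE
    }

    def type_deseret(keystrokes: str) -> str:
        """Send each keystroke through the table; let anything else pass through."""
        return "".join(KEY_TO_DESERET.get(k, k) for k in keystrokes)

    print(type_deseret("Tip"))   # capital Tee, an untouched 'i', small pee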

This, by the way, is what the Deseret Language Kit did for Mac OS 9. It provided a keyboard, font, and other software pieces necessary to get the Deseret Alphabet to work in a semi-standardized way with any Mac software. It hasn’t been as necessary for Mac OS X, because that’s Unicode-based, which is one reason why I haven’t come out with a successor. I do have a keyboard which I use myself when I want to type Deseret text. I have other techniques for converting lots of text at once, however, which are generally easier to use.

Now, I don’t fault the people who do use Huneybee, because by and large they don’t know better. They haven’t run into the practical problems that made software companies like Apple and Microsoft move towards soft keyboards and Unicode. As such, it’s really a communication problem. It’s the responsibility of people like me who do deal with these issues to educate the public at large.

And one has to allow for the fact that people are people and don’t always do things the right way. After all, I’ve been blithely typing two spaces at the end of every sentence in this blog, even though I know it’s wrong.

But the bottom line is, if you really want to communicate with the Deseret Alphabet, use the standardized techniques which have become available and switch to Unicode. If the owner of Huneybee would like me to create a Unicode-savvy version of it, I’d be happy to oblige, pending the free time to do so.

Wednesday, February 25, 2009

Whys and Wherefores

I suppose I should start with a word of introduction regarding me and my history with the Deseret Alphabet.  My personal involvement with the Deseret Alphabet goes way back to the one linguistics course I took as an undergraduate at the University of Utah, where the instructor brought it up as an example of an interesting local linguistic oddity.  

This would have been in late 1977.  (Excuse me for a minute while I find a quiet corner in which to have a cry about how long ago this was.)  Somewhat over a decade later, I became involved with the Unicode Standard.  In those early days of Unicode, there was a list of potential scripts for encoding circulating among the various Unicodets, and amongst these was listed the “Mormon Alphabet.”  Knowing something about it—including its proper name—I was quick to point out that it was really not an appropriate candidate for encoding, because it was rather thoroughly dead and not much used when it was alive.  

And yet as the 1990’s drew to a middle, Unicode found itself in an awkward position.  The standard had originally been designed to support some 65,000 different characters, but it became apparent that this would not be sufficient.  An architectural change was introduced in Unicode 2.1 to deal with this, splitting Unicode into the Basic Multilingual Plane (BMP) and sixteen additional planes, each with room for another 65,536 code points.
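
To make the mechanism concrete, here is the standard surrogate arithmetic worked out in Python for a Deseret letter (this is just the formula from the standard, nothing specific to the software discussed here):

    # DESERET CAPITAL LETTER LONG I is U+10400, outside the BMP, so a
    # sixteen-bit encoding form has to carry it as a pair of units.
    cp = 0x10400
    offset = cp - 0x10000                     # the 20 bits that need carrying
    high = 0xD800 + (offset >> 10)            # lead (high) surrogate
    low  = 0xDC00 + (offset & 0x3FF)          # trail (low) surrogate
    print(hex(high), hex(low))                # 0xd801 0xdc00

    # Python's own UTF-16 encoder produces the same pair (little-endian bytes):
    print(chr(cp).encode("utf-16-le").hex())  # 01d800dc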

What then happened was a bit of a chicken-and-egg problem.  Unicode support was beginning to appear in applications and system software, but it was of the BMP-only sort.  As a result, nobody wanted their script to end up in the astral planes, as the new planes were often called; it wouldn’t be supported by current software. And since the astral planes had no actual content, there was little incentive for anybody to even start the process of implementing support.

What was needed was a scapegoat or sacrificial lamb:  a script which was arguably a legitimate candidate for encoding but which could live indefinitely as a second-class citizen until software support caught up with it.  

As a result, I started to put together proposals for the encoding of various scripts which would reasonably end up in the Supplementary Multilingual Plane (SMP) of the standard.  There were six, as I recall, and were I sufficiently ambitious I’d look them up.  They included, if memory serves, Etruscan, Linear B, Gothic, Shavian and Pollard.  The sixth was the Deseret Alphabet.  With the exception of Pollard, these are all now encoded, all in the SMP, and work on Pollard is proceeding slowly.  

(In fairness, none of these are actually driving non-BMP Unicode support.  The characters making non-BMP support a sine qua non are from East Asian character sets such as HKSCS and JIS X 0213.  But that would be another blog.)

Actually, Deseret (as it is called in encoding circles) is not an inappropriate candidate for encoding after all.  There is a limited amount of printed material in the Deseret Alphabet, to be sure, but a fair amount of additional material of historical interest exists in manuscript.  More to the point, there are hobbyists who want to use it even now, despite its serious design flaws.  

I am amongst these hobbyists, I’m sorry to say, and have foisted a fair chunk of Deseret material on the world, including this blog.  Now, you may have noticed that this blog isn’t actually in the Deseret Alphabet.  I may or may not add entries in the DA in the future, depending on software support and the amount of time I’m willing to waste on it.  This is more a spot for me to think aloud, as I say, about the technical problems involved in Deseret support and its significance both in LDS culture and in the broader world.