
25/07/2008

Character reference

I was looking through an old (1990!) book on Unicode the other day. I've always been intrigued by the amazing diversity of letter forms that we've created over the last few thousand years. Here are just some of the many wonderful and peculiar characters you'll find tucked away in the Unicode glyph banks:

  • ۞
  • Ϣ
  • ܍

You'll also find the I Ching, Braille, alchemy, an alphabet funded by George Bernard Shaw, neo-pagan tree language, astrology, dentists, talking leaves, and much more besides.

Most of the technical aspects of Unicode escape me (supplementary planes, normalization, high surrogates, collation?), but it's useful to know the basics of using Unicode in newLISP, particularly now that UTF-8 is the most popular encoding used on the internet.

newLISP is UTF-8 friendly by default on Mac OS X, and UTF-8 versions are available for other platforms too (although I'm not sure whether the default versions are UTF-8). UTF-8 is a variable-length character encoding: each character takes 1, 2, 3 or 4 bytes, depending on its Unicode code point.
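To see the variable-length encoding at work, here's a minimal sketch that compares byte counts with character counts. It assumes a UTF-8 enabled build of newLISP (utf8len isn't available otherwise), and the name samples is just for illustration:

```newlisp
; Sketch: assumes a UTF-8 enabled build of newLISP
; (utf8len is only available in those builds).
(set 'samples (list (char 0x41) (char 0xE9) (char 0x2660)))  ; A, é, ♠

; length counts bytes in a string, utf8len counts characters
(dolist (s samples)
  (println s ": " (length s) " byte(s), " (utf8len s) " character(s)"))

; On a UTF-8 build:
; (length (char 0x2660))   gives 3
; (utf8len (char 0x2660))  gives 1
```

The three samples need 1, 2 and 3 bytes respectively, but each counts as a single character.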

One essential newLISP function for exploring the Unicode character set is char. This takes either a number or a character, and returns the matching character or number:

(char 63498)
""

(char "")
63498

Unicode characters are usually described using hexadecimal, so it's useful to know how to translate between hex and decimal. To convert a decimal integer to a hex string, use format:

(format "%llx" 63498)
"f80a"

To convert a hex string to a decimal integer, pass a hexadecimal string starting with "0x" to int:

(int (string "0x" "f80a"))
63498
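The two conversions can be wrapped up as a pair of small helpers. This is only a sketch: the names codepoint-to-hex and hex-to-codepoint are made up for illustration and aren't part of newLISP. The second one uses int's optional base argument instead of the "0x" prefix:

```newlisp
; Hypothetical helper names, not part of newLISP itself.
(define (codepoint-to-hex n)
  (format "U+%04X" n))        ; zero-padded, upper-case hex

(define (hex-to-codepoint s)
  (int s 0 16))               ; parse s as base 16; 0 is the fallback value

(codepoint-to-hex 63498)      ; "U+F80A"
(hex-to-codepoint "f80a")     ; 63498
```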

When you're writing text, it would be good if you could easily insert these characters as you type. There are useful system tools for doing this (on Mac OS X, there's the Character Palette), but for fun I've added the following two functions to the Markdown converter that I use to process my writing:

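; Convert a string such as "U2660" to the character with that code point.
(define (hex-str-to-unicode-char strng)
  ; (1 strng) drops the leading "U"; int then parses the rest as hexadecimal
  (char (int (string "0x" (1 strng)) 0 16)))

; Replace every "U" followed by four or more hex digits in s.
; The final 1 is the regex option for case-insensitive matching.
(define (ustring s)
  (replace "U[0-9a-f]{4,}" s (hex-str-to-unicode-char $0) 1))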

So now I can type "U" followed by four or more hexadecimal digits, and the appropriate Unicode character is inserted automatically: "U f80a" is converted to "". (I had to insert a space after the U here to prevent translation; leave it out in real input.)
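Here's a small usage sketch, assuming a UTF-8 enabled newLISP. The two definitions are repeated so that the snippet runs on its own, and the sample sentence is just an illustration:

```newlisp
; The two definitions are repeated from the post so this runs on its own.
(define (hex-str-to-unicode-char strng)
  (char (int (string "0x" (1 strng)) 0 16)))

(define (ustring s)
  (replace "U[0-9a-f]{4,}" s (hex-str-to-unicode-char $0) 1))

; No space after the U, so the pattern matches.
(println (ustring "spades U2660, hearts U2661"))
; on a UTF-8 build this should print: spades ♠, hearts ♡
```

One thing to watch: because the match is case-insensitive, any ordinary text containing a "u" followed by four or more digits or letters from a to f will be converted as well.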

You can happily use Unicode characters anywhere in newLISP code, if your text editor or console is up to the job. And if ustring is available, you can generate them easily too:

; the codes are written without a space after the U so that ustring matches them
(constant (sym (ustring "U2660")) 4   ; spades
          (sym (ustring "U2661")) 3   ; hearts
          (sym (ustring "U2662")) 2   ; diamonds
          (sym (ustring "U2663")) 1)  ; clubs

(symbols)

(! != $ $0 $1 $10 $11 $12 $13 $14 $15 $2 $3 $4 $5 $6 $7 $8 $9 $HOME $args $idx $main-args ...  zero? | ~ ♠ ♡ ♢ ♣)

(println "(> ♢ ♣)? " (> ♢ ♣))
(> ♢ ♣)? true

(println "(> ♡ ♠)? " (> ♡ ♠))
(> ♡ ♠)? nil

Using descriptive Unicode characters for your symbol names could introduce a whole new level of readability to your code!

(constant (global '☼)  MAIN)
(context '☺)

(define (☻ ✄ ☁ ⍾)
   (print ✄ ☁ ⍾))

(define (‽) 
   (println {‽}))

(context ☼)
(set '℥ "what "  'ᴥ "the " 'ᴒ "dickens")
(☺:☻ ℥ ᴥ ᴒ)
(☺:‽)

Appropriately enough, that last function call returns "‽", which is the much-needed interrobang character.

The problem now is remembering all those four-digit hexadecimal numbers that identify the Unicode characters. I whipped up a quick Unicode browser in newLISP:

[Image: a Unicode browser]

This just shows a page of Unicode characters at a time, and lets you move up and down through the 'pages'. It has some problems when the character code exceeds FFFF - I don't know why‽
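The paging idea can be sketched for the console too. This isn't the browser itself, just a rough text-only stand-in: show-page and its parameters are made-up names, and it assumes a UTF-8 enabled newLISP and a terminal font that has the glyphs:

```newlisp
; Console sketch only; show-page and its parameters are made-up names.
; Assumes a UTF-8 enabled newLISP and a terminal font with the glyphs.
(define (show-page start (rows 8) (cols 16))
  (dotimes (r rows)
    (let (base (+ start (* r cols)))
      (print (format "%04X  " base))     ; row label in hex
      (dotimes (c cols)
        (print (char (+ base c)) " "))   ; one character per column
      (println))))

(show-page 0x2600)   ; first 128 characters of the Miscellaneous Symbols block
```

Calling show-page with a start value 128 higher or lower moves to the next or previous 'page'.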

This post should display correctly on most modern browsers. If you see lots of boxes rather than characters, then you are using a browser or system that doesn't handle Unicode well. This applies to the iPhone and iPod Touch as well: it appears that Mobile Safari doesn't like Unicode as much as its desktop version. Apple - improve Unicode support please!
