30/01/2006

Super strings: the basics of newLISP strings

Strings are one of the basic building blocks of all programming languages. newLISP has many powerful, easy-to-use string-handling tools, and you can easily add more to your toolbox if your particular needs aren't met.

Here's a quick guided tour of newLISP's 'string orchestra'. It's also an extract from the book about newLISP I'm writing, so don't be dismayed by the length of this post. Relax, this is a gentle journey rather than a steep climb!

Strings in newLISP code

You can write strings in three ways:

  • enclosed in double quotes
  • embraced by curly brackets
  • marked-up by markup codes

like this:

(set 's "this is a string")
(set 's {this is a string})
(set 's [text]this is a string[/text])

Use the first method for strings with fewer than 2048 characters, or if you want to include escaped characters such as \n and \t, or characters given by their decimal code numbers (\046).

(set 's "this is a string \n with two lines")
(println s)
;-> 
this is a string 
with two lines

Double-quote characters must be escaped with backslashes, as must a backslash.
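
As a quick illustration of that rule, here's a small sketch (the string itself is just a made-up example) with both kinds of escape in one quoted string:

```newlisp
; escape embedded double quotes and backslashes inside a quoted string
(set 's "a \"quoted\" word and a backslash: \\")
(println s)
;-> a "quoted" word and a backslash: \
```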

Use the second method, braces ('curly brackets'), for strings shorter than 2048 characters when you don't want any escaped characters to be processed:

(set 's {strings can be enclosed in "quotation marks" \n } )
(println s)
;-> strings can be enclosed in "quotation marks" \n

This is a really useful way of writing strings, because you don't have to worry about putting backslashes before every quotation character, or backslashes before other backslashes. However, you can't include an unmatched closing brace before the end of the string, because there's no way to escape it. You can nest balanced pairs of braces inside a braced string, though.
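
Here's a minimal sketch of nesting, using a made-up string. The inner pair of braces is balanced, so it's simply part of the text:

```newlisp
; balanced braces may appear inside a braced string
(set 's {outer {inner} text})
(println s)
;-> outer {inner} text
```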

I like to use braces, not only because they face the right way (which plain quotation marks don't), but also because text editors can balance and match them.

The third method, using [text] and [/text] markup tags, is intended for longer text strings running over many lines, and is used automatically by newLISP when it outputs long strings. Again, you don't have to worry about which characters you can and can't include - you can put anything you like in, with the obvious exception of "[/text]"!

(set 'novel (read-file {my-latest-novel.txt} ))
;->
[text]
It was a dark and "stormy" night...
...
The End.
[/text]

If you want to know the length of a string, use length:

(length novel)
;-> 575196

Half a million characters or so doesn't seem to bother newLISP too much.

Making strings

A lot of functions, such as the file reading ones, return strings or lists of strings for you. If you want to build a string from scratch, one way is to start with the char function. This converts the supplied number to the equivalent character string with that code number. (It can also reverse the operation, converting the supplied character string to its equivalent code number.)

(char 33) ;-> "!"
(char "!") ;-> 33
(char 955)
;-> Unicode lambda character 
(char 0x2318)
;-> Unicode Place of Interest Sign character 2318

These last two examples are available when you're running the Unicode-capable version of newLISP. Since Unicode is hexadecimally inclined, I've used the hexadecimal number, which char can convert to a string. (I haven't attempted to get them displayed in this post!)

You can use char to build strings in other ways:

(join (map char (sequence (char "a") (char "z"))))
;-> "abcdefghijklmnopqrstuvwxyz"

This applies the char function to a list of integers generated by sequence, so producing a list of strings. This list can be converted back to a single string by join, which turns a list into a string. join can also take a separator when building strings:

(join (map char (sequence (char "a") (char "z"))) "-")
;-> "a-b-c-d-e-f-g-h-i-j-k-l-m-n-o-p-q-r-s-t-u-v-w-x-y-z"

Similar to join is append, which works directly on strings:

(append "con" "cat" "e" "nation")
;-> "concatenation"

but even more useful is string, which turns any collection of numbers, lists, and strings into a single string.

(string ' '(sequence 1 10) { produces '} (sequence 1 10) "\n")
;-> '(sequence 1 10) produces '(1 2 3 4 5 6 7 8 9 10)

Notice that even the parentheses around the lists are included in the string.

The string function, combined with the various string markers such as braces and markup tags, is a good way to include the values of variables inside strings:

(set 'x 42)
(string {the value of } 'x { is } x) 
;-> "the value of x is 42"

dup makes copies:

(dup "spam" 10)
;-> "spamspamspamspamspamspamspamspamspamspam"

And date makes a date:

(date)
;-> "Wed Jan 25 15:04:49 2006"

or you can give it a number of seconds since 1970 to convert:

(date 1230000000) 
;-> "Tue Dec 23 02:40:00 2008"

String surgery

Now that you've got your string, there are plenty of functions for operating on it. Some of these are 'destructive' functions - they change the string permanently, possibly losing information for ever - whereas others are 'constructive', producing a new string and leaving the old one unharmed.

reverse is destructive:

(set 't "a hypothetical one-dimensional subatomic particle")
(reverse t)
;-> "elcitrap cimotabus lanoisnemid-eno lacitehtopyh a"

Now t has changed for ever. However, the case-changing functions aren't destructive, producing new strings without harming the old ones:

(set 't "a hypothetical one-dimensional subatomic particle")
(upper-case t)
;-> "A HYPOTHETICAL ONE-DIMENSIONAL SUBATOMIC PARTICLE"
(lower-case t)
;-> "a hypothetical one-dimensional subatomic particle"
(title-case t)
;-> "A hypothetical one-dimensional subatomic particle"

Substrings

If you know which part of a string you want to extract, use one of the following constructive functions:

(set 't "a hypothetical one-dimensional subatomic particle")
(first t)
;-> "a"
(rest t)
;-> " hypothetical one-dimensional subatomic particle"
(last t)
;-> "e"
(nth 2 t) ; the first character has index 0
;-> "h"

There's a useful shortcut: follow the string with a number:

(t 2)
;-> "h"

slice gives you a new slice of an existing string, counting either from the beginning (positive integers) or from the end (negative integers), for a given number of characters:

(slice t 15 13)
;-> "one-dimension"
(slice t -8 8)
;-> "particle"

There's an easier way to do this, too, by putting the required start and length before the string in a list:

(15 13 t)
;-> "one-dimension"
(0 14 t)
;-> "a hypothetical"

If you don't want a continuous run of characters, but want to cherry-pick some of them for a new string, use select followed by a sequence of character index numbers:

(set 't "a hypothetical one-dimensional subatomic particle")
(select t 3 5 24 48 21 10 44 8)
;-> "yosemite"
(select t (sequence 1 49 12)) ; every 12th character starting at character 1
;-> " lime"

which is good for finding secret Da Vinci-style coded messages buried in text...

If you just want to swap two characters, use the destructive function swap:

(set 'typo {teh})
(swap 2 1 typo)
;-> "the"

Changing strings

trim and chop are both constructive string-editing functions that work from the ends of the original strings inwards:

(chop t) ; defaults to last character
;-> "a hypothetical one-dimensional subatomic particl"
(chop t 9) ; chop 9 characters off
;-> "a hypothetical one-dimensional subatomic"

trim removes strings from the ends of a source string:

(set 's "      centred       ")
(trim s) ; defaults to removing spaces
;-> "centred"
(set 's "------centred------")
(trim s "-")
;-> "centred"
(set 's "------centred********")
(trim s "-" "*")
;-> "centred"

There are two approaches to changing characters inside a string. Either use the index numbers of the characters, or specify the substring you want to change.

Using index numbers

Use indexing with the nth-set and set-nth functions. nth-set and set-nth are twin character assassins - destructive functions for changing strings. They look the same, but nth-set returns just the part of the string that was destroyed, and set-nth returns the modified string. nth-set is quicker.

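(set 't "a b c")
;-> "a b c"
(set-nth 0 t "xyz") ; returns the modified string
;-> "xyz b c"
(set 't "a b c") ; start again from the original string
;-> "a b c"
(nth-set 0 t "xyz") ; returns the character that was replaced
;-> "a"
t
;-> "xyz b c"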

To remember which does which, consider that set-nth starts with "s" and returns the string, whereas nth-set starts with "n" and returns only the nth character. (If this doesn't work for you, remember them another way!)

Changing substrings

If you don't want to - or can't - deal with index numbers or character positions, use replace, a powerful destructive function that does all kinds of useful operations on strings. Use it in the form:

(replace old-string source-string replacement)

So:

(set 't "a hypothetical one-dimensional subatomic particle")
(replace "hypoth" t "theor")
;-> "a theoretical one-dimensional subatomic particle"

replace is usually destructive, but if you want to use replace or another destructive function constructively, without affecting the original string, enclose the string in a string function call:

(set 't "a hypothetical one-dimensional subatomic particle")
(replace "hypoth" (string t) "theor")
;-> "a theoretical one-dimensional subatomic particle"
t
;-> "a hypothetical one-dimensional subatomic particle"

The use of string creates a new string that gets operated on by replace. The original string t is unaffected.

replace is one of a group of newLISP functions that accept regular expressions for defining patterns in text. You add a number at the end of the list which specifies the type of regular expression to use: 0 means basic regular expressions, 1 means case-insensitive matching, and so on.

(set 't "a hypothetical one-dimensional subatomic particle")
(replace "h.*?l" t "" 0) ; look for "h" followed by "l", but not too greedily
;-> "a  one-dimensional subatomic particle"

If you're happy working with Perl-compatible Regular Expressions (PCRE), you'll be happy with replace. Full details are in the newLISP reference manual.

Another interesting feature of replace is that the replacement doesn't have to be a simple string: it can be any newLISP expression. Each time the pattern is found, the replacement expression runs. You can use this to provide a replacement value that's calculated dynamically, or to do anything else you want. For example, here's a simple search and replace operation that keeps count of how many times a letter has been found, and replaces each occurrence in the original string with the total so far:

(set 't "a hypothetical one-dimensional subatomic particle")
(set 'counter 0)
(replace "o" t 
    (begin 
        (inc 'counter)
        (println {replacing "} $0 {" number } counter) 
        (string counter)) 0)
replacing "o" number 1
replacing "o" number 2
replacing "o" number 3
replacing "o" number 4
;-> "a hyp1thetical 2ne-dimensi3nal subat4mic particle"

Did you notice the $0 in there? replace updates a set of system variables $0, $1, $2 up to $15 with the matched expressions, so you can access the inner workings of the regular expression matching that's going on while the function is running. You could do other useful things too, such as build a list of matches for later processing.
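
Here's one way that list-building idea might look. It's only a sketch: the pattern and the symbol names str and found are made up for the example. The replacement expression appends each match to a list and then hands the match back unchanged, and the string wrapper keeps the original text safe:

```newlisp
; collect every match in a list while leaving the text unchanged
(set 'str "a hypothetical one-dimensional subatomic particle")
(set 'found '())
(replace "[aeiou]t" (string str)
    (begin
        (push $0 found -1)   ; append the matched text to the list
        $0)                  ; put the match back unchanged
    0)
found
;-> ("ot" "et" "at")
```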

Testing and comparing strings

There are various tests that you can run on strings. newLISP's comparison operators work by finding and comparing the code numbers of the characters until a decision can be made:

(> {Higgs Boson} {Higgs boson}) ; nil
(> {Higgs Boson} {Higgs}) ; true
(< {dollar} {euro}) ; true
(> {newLISP} {LISP}) ; true
(= {fred} {Fred}) ; nil
(= {fred} {fred}) ; true

and of course newLISP's flexible argument handling lets you test loads of strings at the same time:

(< "a" "c" "d" "f" "h")  
;-> true

To check whether two strings share common features, you can either use starts-with and ends-with, or the more general pattern matching commands regex and find.

starts-with and ends-with are simple enough:

(starts-with "newLISP" "new")
;-> true
(ends-with "newLISP" "LISP")
;-> true

regex is more interesting. It returns nil if the string doesn't contain the pattern, or, if it does contain the pattern, it returns a list with the matched strings and substrings and the start and length of each string.

(regex "sub.*" t)
;-> ("subatomic particle" 31 18)
(regex {(s[a-z]*)(.*)(s[a-z]*)} t 0)
;-> ("sional subatomic" 24 16 "sional" 24 6 " " 30 1 "subatomic" 31 9)

and these matches are also stored in the system variables $0, $1, $2 up to $15, which you could inspect with:

(dotimes (i 16) (println ($ i)))

Instead of regex you could use find, which returns the index of the matching substring.
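
A short sketch of find on the same sample string (the symbol name str is just for illustration). Without an options number it looks for a plain substring; with one, it treats the first argument as a regular expression:

```newlisp
(set 'str "a hypothetical one-dimensional subatomic particle")
(find "sub" str)           ; plain substring search
;-> 31
(find "s[a-z]*" str 0)     ; regular expression search, option 0
;-> 24
(find "quark" str)         ; no match
;-> nil
```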

Strings to lists

Two functions let you convert strings to lists, ready for manipulation with newLISP's extensive list-processing powers. The well-named explode function cracks open a string and returns a list of single characters:

(set 't "a hypothetical one-dimensional subatomic particle")
(explode t)
;-> ("a" " " "h" "y" "p" "o" "t" "h" "e" "t" "i" "c" "a" "l" " " "o" 
 "n" "e" "-" "d" "i" "m" "e" "n" "s" "i" "o" "n" "a" "l" " " "s" 
"u" "b" "a" "t" "o" "m" "i" "c" " " "p" "a" "r" "t" "i" "c" "l" 
"e")

The explosion is easily reversed with join.
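
For instance, this small sketch (with a shorter, made-up string) takes a string apart and puts it back together, first as it was and then back to front:

```newlisp
(set 'str "newLISP")
(join (explode str))            ; apart and back together again
;-> "newLISP"
(join (reverse (explode str)))  ; reverse the list of characters first
;-> "PSILwen"
```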

parse is a more powerful way of breaking strings up and returning the pieces. Used on its own, it will break a string apart at the spaces between the words:

(parse t)
;-> ("a" "hypothetical" "one-dimensional" "subatomic" "particle")

Or you can supply a delimiting character, and parse will break the string whenever it meets the character:

(set 'pathname {/System/Library/Fonts/Courier.dfont})
(parse pathname {/})
;-> ("" "System" "Library" "Fonts" "Courier.dfont")

By the way, we could eliminate that first empty string by filtering it out. Notice the use of a lambda function for defining a quick nameless test function - we can use either fn or lambda:

(filter (fn (s) (not (empty? s))) (parse pathname {/}))
;-> ("System" "Library" "Fonts" "Courier.dfont")

You can also specify a delimiter string rather than a delimiter character:

(set 't {spamspamspamspamspamspamspamspam})
;-> "spamspamspamspamspamspamspamspam"
(parse t {am}) ; break on "am"
;-> ("sp" "sp" "sp" "sp" "sp" "sp" "sp" "sp" "")

Or you can specify a regular expression, remembering the options flag 0 (or whatever):

(set 't {/System/Library/Fonts/Courier.dfont})
(parse t {[/aeiou]} 0) ; strip out vowels and slashes
;-> ("" "Syst" "m" "L" "br" "ry" "F" "nts" "C" "" "r" "" "r.df" "nt")

Here's the well-known quick and not very reliable HTML tag-stripper:

(set 'html (read-file "/Users/Sites/index.html"))
(println (parse html {<.*?>} 4))
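
To try the idea without a file to hand, here's a sketch that uses a tiny made-up HTML fragment instead, and glues the surviving pieces back together with join:

```newlisp
(set 'html {<p>Hello <b>world</b></p>})
(join (parse html {<.*?>} 4))   ; split at the tags, then rejoin the text
;-> "Hello world"
```

Like the original, this is a quick hack rather than a real HTML parser.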

For parsing XML strings, newLISP provides the specialized function xml-parse.

Other string functions

There are a few other functions that work with strings. search looks for a string inside a file:

(set 'f (open {/private/var/log/system.log} {read}))
(if (search f {kernel})               ; only continue if the string was found
    (begin
        (seek f (- (seek f) 64))      ; step back 64 characters for context
        (dotimes (n 3)
            (println (read-line f)))))
(close f)

This example looks in the system.log for the string "kernel". If it's found, newLISP rewinds the file pointer by 64 characters, then prints out three lines, showing the line in context.

There are also functions for base64-encoding files, and for encrypting strings.
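
As a rough sketch of the base64 side, assuming a newLISP version that provides the string functions base64-enc and base64-dec (check the reference manual for your version), a round trip looks like this:

```newlisp
; assumes base64-enc and base64-dec are available in your newLISP
(set 'encoded (base64-enc "Hello World"))
;-> "SGVsbG8gV29ybGQ="
(base64-dec encoded)
;-> "Hello World"
```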

It's also worth mentioning the format function, which lets you insert the values of newLISP expressions into a pre-defined template string. Use %s to represent the location of a string expression inside the template. For example, suppose we want to display a list of files like this:

[File: foo.txt]
[File: bar.txt]

A suitable template looks like this:

"[File: %s]"

We give the format function this template string, followed by the expression (f) that produces a filename:

(format "[File: %s]" f)

The code to generate a directory listing using this format and the directory function looks like this:

(dolist (f (directory)) 
    (println (format "[File: %s]" f)))

and generates a listing like this:

[File: .hotfiles.btree]
[File: .Spotlight-V100]
[File: .Trashes]
[File: .vol]
[File: .VolumeIcon.icns]
[File: Applications]
[File: automount]
[File: bin]
[File: Cleanup At Startup]
[File: cores]
[File: Desktop Folder]
[File: dev]
[File: Developer]
[File: etc]
[File: Library]
...
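
The template can hold more than one placeholder. Besides %s, format follows the familiar C-style specifiers, so as a sketch (the values here are invented) %d takes an integer and %f a floating-point number, with optional widths:

```newlisp
; left-justified string, right-justified integer, float with 2 decimals
(format "%-10s|%5d|%6.2f" "apples" 42 3.14159)
;-> "apples    |   42|  3.14"
```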

Lastly, we must mention eval-string, a version of newLISP's eval function for use with strings. eval-string tries to process a string as newLISP code. If it's valid newLISP, you'll see the result:

(set 'sum "(+ 2 2)")
;-> "(+ 2 2)"
(eval-string sum)
;-> 4

This means that you can build newLISP code strings, using all the functions we've described in this chapter, and then have them evaluated by newLISP. You could write programs that write programs. But that's another chapter.
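
As a tiny taste of that, here's a sketch that assembles a piece of code with string and then runs it (the symbol names op and code are just for illustration):

```newlisp
(set 'op "*")
(set 'code (string "(" op " 6 7)"))  ; build the source text
;-> "(* 6 7)"
(eval-string code)                   ; and evaluate it
;-> 42
```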

Updated for correction of minor errors and incorporation of comments.

5 Comments:

At 16:59, Anonymous Gordon said...

The constructive replace -- using (string x) -- is quite slick.

I normally do (set 'tmp x) and then a replace on tmp. I will now start wrapping with 'string'.

 
At 17:07, Anonymous Lutz said...

Thanks for this great introduction into newLISP's string theory ;)

instead of:

(char (int "0x2318"))

you can just say:

(char 0x2318)

newLISP recognizes hex format for numbers starting with 0x and octals for numbers starting with 0 (zero)

 
At 17:35, Blogger newlisper said...

Thanks! I often wonder whether I should edit posts once there have been improvements and corrections. I suppose if people read the comments after the text, that would be all right... but if they don't, then I've misled them... :-(

Don't know what's best...

and Gordon - thanks for your comments on these posts. Much appreciated!

 
At 04:48, Blogger Rick Hanson said...

One more for your errata sheet:

> (set 's "------centred------")
"------centred------"
> (trim s "-")
"centred"

You had as the punchline '(centred)'.

 
At 09:28, Blogger newlisper said...

Thanks, Rick. I'm keeping the "introduction to newlisp" document up to date, so your comments and corrections will probably make it into there, rather than these 'draft' posts.

 
