17/01/2006

Sherlock Holmes and the Case of the Picture in the Attic

A context in newLISP is, according to the manual:

a namespace that is lexically separated from other namespaces

or

a stateful namespace

This second definition is from John Small's excellent 21-minute introduction to newLISP.

My present understanding of a newLISP context is that it provides a named container for symbols, and that symbols in different contexts can have the same name without clashing. So, for example, in one context I can define the symbol called meaning-of-life to have the value "42", but, in another context, the identically-named symbol could have the value "dna-propagation", and, in yet another, "worship-of-deity-x".

Unless you specifically choose to create and/or switch contexts, all your newLISP work is carried out in the default context, called MAIN.
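
As a minimal sketch of those two points, the following creates three contexts and gives each its own meaning-of-life symbol. The context names Guide, Biologist, and Believer are invented for this illustration.

```newlisp
; Guide, Biologist and Believer are hypothetical context names.
(println (context))              ; a fresh session starts in MAIN

(context 'Guide)                 ; create and switch to Guide
(set 'meaning-of-life "42")
(context MAIN)                   ; switch back

(context 'Biologist)
(set 'meaning-of-life "dna-propagation")
(context MAIN)

(context 'Believer)
(set 'meaning-of-life "worship-of-deity-x")
(context MAIN)

; same name, three separate symbols
(println Guide:meaning-of-life)
(println Biologist:meaning-of-life)
(println Believer:meaning-of-life)
```

Each println should show a different value, because the three symbols only share a name, not a home.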

I decided to investigate contexts and namespaces with the help of that great detective, Sherlock Holmes. I downloaded Sir Arthur Conan Doyle's "The Sign of Four" from Project Gutenberg, stripped out the introductory text, and ran the following newLISP script, which reads the text of the original novel and stores every word as a symbol, prefaced by an underscore (_) character (a convention that helps us to avoid confusing ordinary words and symbols):

(context 'Doyle)
(set 'file (open "/Users/me/doyle-sign4.txt" "read"))
(set 'word-count 0)
; remember and count each word
(while (read-line file) 
    (set 'data (parse (lower-case (current-line)) "[^a-z]+" 0))
    (dolist (w data)
        (inc 'word-count)
        (and (!= w "") ; skip blanks
            (if (set 'result (eval (sym (append "_" w) Doyle ) ))
                    (set (sym (append "_" w) Doyle ) (+ result 1)) ; increase count
                    (set (sym (append "_" w) Doyle ) 1)))))
(close file)
; create a word list
(dolist (w (symbols Doyle))
    (set 'wrd (name w))
    (if (and (starts-with wrd "_") (!= "_" wrd))
        (push (list (eval w) (slice wrd 1) ) words) ))
; save the context
(save "/Users/me/doyle-context.lsp" 'Doyle)

The first line creates - and switches to - a new context called "Doyle", and all the new symbols are created in this context rather than in MAIN. Each line of the file is converted to lower-case and then split into words. If the word preceded by an underscore doesn't already exist, it is created. But if it evaluates to something, the word has already been encountered, so the symbol's associated count is updated instead.
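
To watch the counting logic at a smaller scale, here is a sketch that runs the same create-or-increment step on a single made-up sentence. It is not the script above: the context name Sample and the sentence are assumptions for illustration, and the code stays in MAIN, passing the quoted context name to sym.

```newlisp
; Sample is a hypothetical context; the sentence is made-up test data.
(set 'sentence "The game is afoot, and the game is on")

(dolist (w (parse (lower-case sentence) "[^a-z]+" 0))
    (if (!= w "")                                   ; skip blanks
        (let ((s (sym (append "_" w) 'Sample)))     ; find or create Sample:_word
            (if (eval s)
                (set s (+ (eval s) 1))              ; seen before: add one
                (set s 1)))))                       ; first sighting: start at 1

(println Sample:_the " " Sample:_game " " Sample:_afoot)
```

A brand-new symbol evaluates to nil, which is what lets the if test tell a first sighting from a repeat.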

Then the words and their frequencies are stored as a list in the symbol words, without the initial underscore:

(set 'words '(
    (2 "zum") 
    (1 "zigzag") 
    (3 "youth") 
    (2 "yourselves") 
    (9 "yourself") 
    (7 "yours") 
    (107 "your") ...

Finally, the entire context is saved in a newLISP source file. The whole script takes 2 seconds on my machine, which is pretty quick.

Loading contexts

I now have a collection of data, wrapped up in a package called "Doyle", that captures the words used in the novel (although it has, of course, completely lost the plot). I can quickly load this saved context in another script or newLISP session using:

(load "/Users/me/doyle-context.lsp")

and newLISP will automatically recreate all the symbols in the Doyle context, switching back to the MAIN (default) context when done. It takes about 80 milliseconds here.
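
The save and load round trip can be tried on a tiny context first. This is only a sketch: the context name Casebook and the /tmp path are assumptions, so adjust the path for your system.

```newlisp
; Casebook and the /tmp path are hypothetical.
(set 'Casebook:_holmes 3)
(save "/tmp/casebook-context.lsp" 'Casebook)

; the saved file is ordinary newLISP source, so it can be inspected
(println (read-file "/tmp/casebook-context.lsp"))

(set 'Casebook:_holmes 0)              ; clobber the value...
(load "/tmp/casebook-context.lsp")     ; ...and restore it from the file
(println Casebook:_holmes)
(println (context))                    ; still in the context we started from
```

Printing the saved file is a quick way to see what load will evaluate when it rebuilds the context.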

I can access the values of any symbol in the Doyle context by prefacing it with the name of the context and a colon, eg "Doyle:". For example:

Doyle:word-count
;-> 43795

I can find out the frequency of any word just by evaluating the name of the symbol, remembering the underscore we used as a prefix. If I'm in the MAIN context, I have to use the "Doyle" 'prefix' - of course, if I'm already in the Doyle context, I don't need to.

Doyle:_treasure
;-> 75
(context Doyle)
_india
;-> 12
(context MAIN)  ; switch back to MAIN context
Doyle:_cocaine
;-> 5

Conan Doyle famously describes Holmes's drug-taking habits in the opening paragraphs...

Loading other contexts

It's the work of a few seconds to load up other contexts with other novels. This lets us make lots of pointless but amusing comparisons between different novels. As before, we obtain the novel's text and create a context to hold the words. I've chosen Oscar Wilde's "The Picture of Dorian Gray". All I need to do is change the context name from "Doyle" to "Wilde" in the above script and change the file names accordingly:

(context 'Wilde)
(set 'file (open "/Users/me/wilde-doriangray.txt" "read"))
... 
(save "/Users/me/wilde-context.lsp" 'Wilde)
...
(load "/Users/me/wilde-context.lsp")

When both the Doyle and Wilde contexts have been loaded side by side (they're happy to co-exist) we can start to ask questions like "How often do the two writers use the word 'charming'?":

(dolist (ctx '(Wilde Doyle))
    (println (context ctx (string "_charming") )))
;-> 
43 
1

Here, we're using the dolist function to step through the two contexts, and the context function to assemble a reference to the symbol that you'd otherwise refer to as Wilde:_charming or Doyle:_charming if you were addressing them directly. As you might have guessed if you've read both authors, the word appears far more often in Oscar's sentences than in Arthur's.
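
The same lookup can be tried without loading any novels. In this sketch the contexts Alpha and Beta and their counts are made up; only the dolist and context calls mirror the code above.

```newlisp
; Alpha and Beta are hypothetical contexts with made-up counts.
(set 'Alpha:_charming 7)
(set 'Beta:_charming 2)

(dolist (ctx '(Alpha Beta))
    ; (context ctx "_charming") looks up the symbol named "_charming" in ctx
    (println ctx ": " (context ctx "_charming")))
```

Because the symbol name is passed as a string, it can be built at run time, which is what makes the loop over several contexts possible.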

If we produce a pair of word lists, without frequencies, we can ask how many words appear in just one novel. The difference function returns a new list of all the elements that appear in the first list but not the second:

; first, make the word lists, in their own contexts 
(dolist (w (reverse (sort Doyle:words)))
    (push (last w) Doyle:wlist))
(dolist (w (reverse (sort Wilde:words)))
    (push (last w) Wilde:wlist))
; now compare the word lists
(println " words in Wilde but not in Doyle: " (length (difference Wilde:wlist Doyle:wlist)))
(println " words in Doyle but not in Wilde: " (length (difference Doyle:wlist Wilde:wlist)))
;-> 
words in Wilde but not in Doyle: 4060
words in Doyle but not in Wilde: 2626

This suggests that, despite the more exotic nature of Sherlock Holmes's quest for Indian treasure, Wilde manages to reach more corners of the English dictionary.
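
On a small scale the list comparison looks like this. The two word lists are made up for the sketch, and the results are sorted only so that the printed order is predictable; intersect, which returns the elements common to both lists, is included for comparison.

```newlisp
; list-a and list-b are made-up word lists
(set 'list-a '("treasure" "river" "charming" "pity"))
(set 'list-b '("charming" "pity" "painting"))

(println (sort (difference list-a list-b)))   ; only in list-a
(println (sort (difference list-b list-a)))   ; only in list-b
(println (sort (intersect list-a list-b)))    ; in both lists
```

Note that difference is not symmetric: swapping the arguments answers a different question.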

You can also use intersect to find list elements that appear in both lists. In fact there's no end to the number of strange tests and queries you could run - there are probably university researchers who get paid for doing this stuff all day.

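(define (wfreq-diff wlist)
    ; r, wf and df are local, so repeated calls start from an empty list
    (let ((r '()) (wf 0) (df 0))
        (dolist (w wlist)
            (set 'wf (context 'Wilde (string "_" w)))   ; frequency in Wilde
            (set 'df (context 'Doyle (string "_" w)))   ; frequency in Doyle
            (push (list (- wf df) w wf df) r))
        r))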
(println (sort (wfreq-diff (intersect Wilde:wlist Doyle:wlist))))

This produces a list of the words used by both writers, sorted to show which writer uses them more frequently: "wooden", "river", "police", and "business" are Conan Doyle words; "simply", "pity", "perfect", and "painting" are Wilde words.

We can do some calculations, too:

(dolist (ctx '(Wilde Doyle))
    (println  ctx " " (div (context ctx "word-count") (length (context ctx "wlist")))))
;->
Wilde 12.67362847
Doyle 8.163094129

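I'm not sure what dividing the length of the novel by the number of different words used tells us. Strictly, it's the average number of times each distinct word is used, so the higher figure for Wilde points to more repetition rather than a wider vocabulary, although longer texts tend to score higher in any case.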
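
One detail in that calculation is the use of div rather than /. In newLISP, / is an integer operator and div is its floating-point counterpart. A quick sketch with made-up figures, a 10-word text containing 4 distinct words:

```newlisp
; 10 words in total, 4 distinct words (made-up figures)
(println (/ 10 4))     ; integer division: prints 2
(println (div 10 4))   ; floating-point division: prints 2.5
```

With / the two novels would have printed 12 and 8, hiding the fractional part of the ratios.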

Elementary

To be honest, I haven't learnt much about the novels that I couldn't have learned by reading them again - but I have started to learn about using newLISP contexts. There are many other uses for contexts, such as for prototype-based object-oriented programming, whatever that is. The newLISP documentation provides many useful examples.

By the way, there's an interesting connection between these two novels. I'll leave you to google it.

2 Comments:

At 14:35, Anonymous Lutz said...

It is not necessary to put the program in the same context 'Doyle' which you are using as the word dictionary 'Doyle'. The 'Doyle' in the 'sym' statement is enough to tell it to put all words in the right place. But in that case you would have to quote the word 'Doyle':

(sym (append "_" w) 'Doyle )

Now all your program text can be in a different place and 'Doyle only contains what is Doyle's.

Thanks for showing this great application for contexts.

 
At 14:59, Anonymous Lutz said...

... or easier, you could just switch back to MAIN after creating 'Doyle':

(context 'Doyle)
(context MAIN)
...
...

and leave the rest of the program like it is.

 
