28/07/2006

Repeating myself

Do I repeat myself a lot? Small children like repeating things a lot, and I wouldn't be surprised if one of the indications of encroaching old age might be the tendency to repeat oneself. Anyone who writes, or indulges in any creative endeavour, might want to be wary of repeating their previous successful ideas too frequently, in case facile repetition might be edging out genuine invention. Naturally, every great mind indulges in some repetition now and again - but it's probably beneficial for us lesser mortals to be wary of the oft-used cliche, the tired old phrase, or the same old jokes, which will merely water down still further the stream of successful ideas, if you can water down a stream ... -

Do I repeat myself? I'm starting to worry now. There's nothing worse than to re-read your old writing and wish that you'd never written this phrase or that. But it's too late. Don't look back, look forward. I certainly think that I should be wary of relying on the tired old phrase, the oft-used ... -

It's no use, I need a writing tool that checks my writing to see if I'm repeating myself a lot. Luckily it looks possible in newLISP:

(context 'REPETITIONS)      
(define (report-repetition)
  (set 'file (open ((main-args) 2) "read"))
  ; load the file: each line is lower-cased and split into words
  (while (read-line file) 
    (map (lambda (i) (push i input-list -1)) 
      (parse (lower-case (current-line)) "[^a-z]+" 0)))
  (close file)
  ; start scanning
  (for (phrase-length 2 8) 
    (for (cursor 0 (- (length input-list) phrase-length)) 
      (set 'pattern (slice input-list cursor phrase-length)
           'score 0
           'data input-list) 
      ; count matches, cutting each one out of data before searching again
      (while (set 'temp-list (match (flat (list '* pattern '*)) data)) 
        (inc 'score) 
        (set 'data (apply append temp-list))) 
      (if (> score 1) 
        (push (list score pattern) results))))
  ; report results
  (if (> (length results) 0) 
    (map println (sort (unique results) (lambda (x y) (> x y))))))
(report-repetition)
(exit)

A list of words called input-list is compiled, then a loop looks through this list for patterns 2 elements long, then 3, then 4, and so on. The important part of this script is the cool function, match:

(set 'temp-list 
    (match (flat (list '* pattern '*)) data))

Here I'm searching data for a pattern, similar to the way you can look for patterns using regular expressions. The two '* wildcards stand for any number of elements, including none, before and after the pattern. If the pattern appears in the list, match returns the elements that each wildcard swallowed: one list for everything before the phrase and one for everything after it. If it doesn't appear, match returns nil. (I'm using flat as well, because pattern is itself a list, but I don't want to find a nested sublist, just the elements in the pattern list.)
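As a minimal sketch of what match hands back here (the word list is made up for illustration, and flat-pattern is a name introduced only for this example):

```newlisp
; a short, made-up word list and a two-word phrase to look for
(set 'data '("to" "be" "wary" "of" "the" "tired" "old" "phrase"))
(set 'pattern '("wary" "of"))

; (list '* pattern '*) alone would be (* ("wary" "of") *), a nested pattern;
; flat turns it into (* "wary" "of" *)
(set 'flat-pattern (flat (list '* pattern '*)))

; one list per wildcard: the words before the phrase and the words after it
(println (match flat-pattern data))
; → (("to" "be") ("the" "tired" "old" "phrase"))

; a phrase that is not in the list gives nil
(println (match '(* "no" "such" *) data))
; → nil
```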

If I find a match for the pattern, I increase the score, then set data to what remains once that occurrence is cut out, that is, the elements before it joined to the elements after it. Then I search again, and keep going until the pattern no longer matches.

(set 'data (apply append temp-list))
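The counting loop can be pulled out into a small function to see how it behaves on its own. This is only a sketch: count-by-removal is a hypothetical name, not part of the script, and the word lists are invented.

```newlisp
; hypothetical helper: count a phrase by cutting out one occurrence at a time
(define (count-by-removal pattern data)
  (let ((score 0) (parts nil))
    (while (set 'parts (match (flat (list '* pattern '*)) data))
      (set 'score (+ score 1))
      ; parts is (before after); appending them drops the matched phrase
      (set 'data (apply append parts)))
    score))

(println (count-by-removal '("a" "lot")
  '("repeating" "things" "a" "lot" "and" "things" "a" "lot")))
; → 2

; cutting a phrase out can push its neighbours together into a new occurrence
(println (count-by-removal '("a" "b") '("a" "a" "b" "b")))
; → 2, although "a b" appears only once in the original list
```

The second call shows a side effect of appending the two halves: the words either side of a removed phrase become adjacent, so a score can occasionally come out higher than the number of times the phrase was actually written.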

Then we start again and look for longer phrases. Finally we sort the results and print them. Output might look a bit like this:

(5 ("i" "m"))
(4 ("of" "the"))
(4 ("it" "s"))
(3 ("wary" "of"))
(3 ("want" "to"))
(3 ("to" "be"))
(3 ("the" "list"))
(3 ("be" "wary" "of"))
(3 ("be" "wary"))
(3 ("a" "lot"))
(2 ("wary" "of" "repeating"))
(2 ("to" "be" "wary" "of"))
(2 ("to" "be" "wary"))
(2 ("tired" "old"))
(2 ("this" "script"))
(2 ("things" "a" "lot"))
(2 ("things" "a"))
(2 ("the" "pattern"))
(2 ("temp" "list"))
(2 ("successful" "ideas"))
(2 ("set" "data"))
(2 ("repeating" "things" "a" "lot"))
(2 ("repeating" "things" "a"))
(2 ("repeating" "things"))
(2 ("repeat" "myself"))
(2 ("phrase" "or"))
(2 ("on" "the"))
(2 ("of" "repeating"))
...

(I was trying to write repetitively - I hope you noticed!)

The big problem with this script is that it runs quite slowly. Of course, the algorithm isn't smart (I'm no good at algorithms): it makes a number of passes through the document, looking at every word many times. That seems wrong somehow, although I can't think of a better way at present.
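One possible alternative, offered as a sketch and not as a tested replacement, is to sweep the list once per phrase length and tally each phrase in a dictionary, so nothing has to be searched for again. The names Counts, count-phrases and repeated-phrases are hypothetical, and the code assumes a newLISP release that provides the predefined Tree context for dictionaries.

```newlisp
; a context used as a dictionary with string keys (assumes Tree is available)
(new Tree 'Counts)

; hypothetical helper: tally every phrase of min-length..max-length words
(define (count-phrases words min-length max-length)
  (for (phrase-length min-length max-length)
    ; skip lengths longer than the text itself
    (if (>= (length words) phrase-length)
      (for (cursor 0 (- (length words) phrase-length))
        (let ((key (join (slice words cursor phrase-length) " ")))
          (Counts key (+ 1 (or (Counts key) 0))))))))

; keep only phrases seen more than once, highest count first
(define (repeated-phrases)
  (sort (filter (fn (entry) (> (entry 1) 1)) (Counts))
        (fn (x y) (> (x 1) (y 1)))))

(count-phrases '("be" "wary" "of" "this" "be" "wary" "of" "that") 2 3)
(map println (repeated-phrases))
```

Unlike the match loop, this counts overlapping occurrences and never joins the words either side of a phrase, so its scores may differ slightly from the script's.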

But I found a trick for speeding this script up dramatically. I read these statements on the newLISP forum:

...newLISP handles cell memory more efficiently than string memory. As a result, it is often better to use symbols than strings for efficient text processing. For example, when handling natural language it is more efficient to handle natural language words as individual symbols in a separated name-space, rather than as a single string ... Programmers coming from other programming languages frequently overlook that symbols in LISP can act as more than just variables or object references. The symbol is a useful data type in itself, which in many cases can replace the string data type.

So I tried changing this:

(push i input-list -1)

to this:

(push (sym (string "_" i)) input-list -1)

and noticed a dramatic speed increase - the script now runs over twice as fast! This converts the input text into a list of symbols rather than a list of strings. I made a few more changes, to strip off the underscores before the results are reported. Now I've got no excuse for repeating myself!
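The post doesn't show those extra changes, so the following is only a guess at their shape. The helpers to-symbol and to-word are hypothetical names, and the sketch assumes it runs at the top level in the default MAIN context, where string gives a symbol's name without a context prefix.

```newlisp
; turn a word into a symbol, with the same "_" prefix the script uses
(define (to-symbol word) (sym (string "_" word)))

; turn the symbol back into a plain word by dropping the leading "_"
(define (to-word s) (slice (string s) 1))

(set 'input-list (map to-symbol '("be" "wary" "of" "the" "list")))
(println input-list)
; → (_be _wary _of _the _list)

(println (map to-word input-list))
; → ("be" "wary" "of" "the" "list")
```

Presumably the underscore is there so that words such as "set" or "list" don't collide with newLISP's own symbols, though the post doesn't say so.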

2 Comments:

At 23:01, Blogger don Lucio said...

Nice application for 'match', which is very much underused in newLISP.

You could get a further speedup by setting up a word index storing for each word a pointer (symbol) to the sentence where it occurs. That would cut down on the work 'match' has to perform, but it would also make the program too complex for the scope of a blog entry.

 
At 23:18, Blogger newlisper said...

sounds like real programming to me :-) but it's a good idea!

 
