Simple sums

Someone asked the TextWrangler mailing list for suggestions on how to add up every occurrence of a range of prices in a large text file, and generate subtotals and grand totals. There were about a dozen different prices (1.99, 2.99, 3.99, 4.99, and so on) scattered through the file. This was the first suggestion, when the format of the file was still unspecified:

; open the data file for reading
(set 'file (open "/Users/me/bigdata.txt" "read"))
; prices to look for, a running count for each price, and the grand total
(set 'prices '("1.99" "2.99" "3.99" "4.99" "5.99" "6.99")
     'tally (dup 0 (length prices))
     'total 0)
; add each line's counts to the running tally, element by element
(while (read-line file)
  (set 'tally
    (map add tally (count prices (parse (current-line))))))
(close file)
; print a subtotal for each price and accumulate the grand total
(map (fn (price number)
       (set 'value (mul (float price) number))
       (println number " at " price ", value " value)
       (inc 'total value))
     prices tally)
(println "Total " total " for " (apply add tally) " items")

which produced the following (untidy) output for a large sample file:

262530 at 1.99, value 522434.7
131270 at 2.99, value 392497.3
131270 at 3.99, value 523767.3
87510 at 4.99, value 436674.9
0 at 5.99, value 0
0 at 6.99, value 0
Total 1875374.2 for 612580 items

newLISP provides useful tools for this type of job. For example, the count function takes two lists and counts how many times each element of the first list occurs in the second:

(count '("1.99" "2.99" "3.99") '("1.99" "2.99" "1.99" "1.99" "1.99"))
;-> (4 1 0)

and this can cope with input lines that have any number of occurrences of prices.

The map function does two jobs here. It adds lists together element by element, so that the counts returned by count for each line are added to a running tally. It then produces the subtotals by multiplying each price by its final count.
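As a small illustration of that element-wise addition and multiplication, here is a sketch with made-up numbers (the tally and counts are invented for the example, not taken from the data file):

```newlisp
; running tally so far, and the counts found on one new line (made-up values)
(set 'tally '(3 0 1))
(set 'line-counts '(1 2 0))

; add the two lists element by element
(set 'tally (map add tally line-counts))
;-> (4 2 1)

; multiply each price by its count to get the subtotals
(map mul '(1.99 2.99 3.99) tally)
;-> (7.96 5.98 3.99)
```

Because map accepts a built-in function such as add directly, there is no need to wrap it in an anonymous function.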

However, this first version of the script didn't work. The problem is here:

(count prices (parse (current-line)))

Using parse like this isn't going to work on every text file. I'm not sure exactly what the problem was, but it seems to be related to string constants, since the error message was:

string token too long :

I think the reason is that parse, used without a break string, falls back on newLISP's own source-code tokenizer. It therefore treats each input line as newLISP code, and presumably finds some construction, such as an unbalanced quote, that breaks the rules.

The solution is to use parse more carefully:

(parse (current-line) " ") ; split at spaces
(parse (current-line) "[^A-Za-z]" 0) ; split at non-alphabetics

choosing the technique to match the format of the input file.
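For example, here is a minimal sketch of the space-splitting form feeding count. The sample line is invented, since the real file's contents aren't shown here:

```newlisp
; a made-up sample line standing in for one line of the data file
(set 'line "widget 1.99 gadget 2.99 gizmo 1.99")
(set 'prices '("1.99" "2.99" "3.99"))

; with an explicit break string, parse simply splits at each space
(set 'tokens (parse line " "))
;-> ("widget" "1.99" "gadget" "2.99" "gizmo" "1.99")

; count how often each price appears among the tokens
(count prices tokens)
;-> (2 1 0)
```

With a break string supplied, parse no longer tries to read the line as newLISP source, so stray quotes in the data are just ordinary characters.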

In fact, the input file turned out to be XML, so the problem was easily solved!
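How the XML file was finally handled isn't shown here, so the following is only one possible sketch and not necessarily the fix that was used. It assumes a newLISP release that provides find-all, and the element names in the sample line are invented for illustration. A regular expression pulls out anything shaped like a price, and count tallies the matches as before:

```newlisp
; hypothetical XML fragment; the real file's element names are unknown
(set 'line "<item><price>1.99</price></item><item><price>2.99</price></item>")
(set 'prices '("1.99" "2.99" "3.99"))

; collect every substring shaped like digits, a dot, and two digits
(set 'found (find-all {[0-9]+\.[0-9][0-9]} line))
;-> ("1.99" "2.99")

(count prices found)
;-> (1 1 0)
```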


At 15:06, Blogger sarken said...

What version of newLISP are you using?

I ran into similar problems with parse using some 8.7 releases, but it has been fixed in 8.8

At 21:40, Blogger newlisper said...

I'm using 8.8. parse has been improved anyway for 8.8 (speeded up, I think, as well), but I also think that this function is quite hard to get to grips with if you're a beginner, so it might be me...!

At 13:22, Anonymous donlucio said...

parse without the break parameter will parse as if parsing newLISP source code and complain about unbalanced quotes etc.

