### Bayesian functions in the latest newLISP

The latest development version of newLISP contains two new Bayesian statistical functions: *bayes-train* analyses datasets for word frequencies, saving the results in a newLISP context, and *bayes-query* uses this context to estimate the probability that a given piece of text belongs to one or other of the datasets. Here's a version of the example in the manual, which estimates the probability that a given piece of text can be considered spam - a typical use for Bayesian analysis tools.

First, we obtain two sets of data, one good, one bad: I dumped a dozen or so desirable and undesirable email messages into a pair of text files. *parse* converts each file into a list of symbols:

```
(set 'spam-data (parse (read-file "/Users/me/spam.txt") {\s+} 0))
(set 'nospam-data (parse (read-file "/Users/me/nospam.txt") {\s+} 0))
```

Then we use the *bayes-train* function to produce a context containing all the words in these lists, together with their frequency data:

```
(bayes-train spam-data nospam-data 'Lexicon)
```

Next, we can save the resulting context in a text file:

```
(save "lex.lsp" 'Lexicon)
```

so that we can load it again later:

```
(load "lex.lsp")
```

The training process takes a few seconds, so it makes sense to do it once, and then load the context when you want to analyse some text.

With the context loaded, we can use the *bayes-query* function to analyse a piece of text against the previously-analysed data:

```
(set 'q1 (bayes-query (parse "newLISP is fine open source software") Lexicon))
(set 'q2 (bayes-query (parse "Office XP is cheap at the moment") Lexicon))
```

Each result is a list of two numbers: the first is the probability that the phrase belongs in the first dataset (which, in this case, is the spam data), and the second is the probability that it belongs in the second, non-spam dataset. The two numbers sum to 1.

```
(println (format "%5f" (first q1)))
(println (format "%5f" (first q2)))
```

with the following results:

```
0.000090
0.999998
```

So the phrase "newLISP is fine open source software" scores a tiny 0.000090, and so is not considered similar to the spam text I used for training, whereas "Office XP is cheap at the moment" certainly scores like some of the spam email I receive - I get dozens of messages like it each month - so the statistics have clearly produced the right results this time.

Consult the newLISP manual for all the options and formulas.

I discovered these functions just after writing my previous entry about analysing novels. If I use two of the novels for training, I can find out the probability that a piece of text was written by one or other of the authors. Plainly newLISP is an excellent choice for this sort of activity!
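As a sketch of that idea, the same train-and-query pattern would apply, with one novel per dataset. The file names here are hypothetical, and I haven't verified the resulting probabilities:

```
; a sketch: train on two novels, one per dataset (file names are made up)
(set 'austen-data (parse (read-file "/Users/me/austen.txt") {\s+} 0))
(set 'dickens-data (parse (read-file "/Users/me/dickens.txt") {\s+} 0))
(bayes-train austen-data dickens-data 'Authors)

; the first number returned is the probability that the phrase
; belongs to the first (Austen) dataset
(bayes-query (parse "It is a truth universally acknowledged") Authors)
```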

## 3 Comments:

The new 'bayes-train' and 'bayes-query' are not yet in the official release, but are in the current development version 8.7.8.

Also, when processing really big files it's a lot faster to parse and train tokens line by line instead of all at once. It's the parsing/tokenizing that is faster when done in small portions rather than in one big chunk.
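Line-by-line training might look something like the following sketch. It assumes that repeated *bayes-train* calls accumulate counts into the same context, and that an empty list can stand in for the category not being trained on a given call - check the Bayes documentation linked below for the definitive pattern:

```
; a sketch: split each file on newlines and train one line at a time,
; accumulating frequencies into the same Lexicon context
(dolist (line (parse (read-file "/Users/me/spam.txt") "\n"))
    (bayes-train (parse line {\s+} 0) '() 'Lexicon))

(dolist (line (parse (read-file "/Users/me/nospam.txt") "\n"))
    (bayes-train '() (parse line {\s+} 0) 'Lexicon))
```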

The next development release 8.7.8 will have improved documentation and examples of how to do this. Meanwhile, I've posted the improved Bayes documentation here:

http://newlisp.org/downloads/development/bayesian.html

The link was cut off; here is the link in two lines:

http://www.newlisp.org/downloads/

development/bayesian.html

Thanks Lutz - Yes, I know that these are very new, and development-version only at the moment, but I thought them worth mentioning since I'd only just posted something very similar and they looked pretty cool...

By the time anyone else reads this post, it will probably be newLISP version 9! ;-)

