[Users] bogofilter gone mad

David Relson relson at osagesoftware.com
Fri Feb 24 00:00:29 CET 2012


On Thu, 23 Feb 2012 22:18:05 +0000
richard wrote:

> On Thu, 23 Feb 2012 13:50:30 +0100
> Gour <gour at atmarama.net> wrote:
> 
> > On Thu, 23 Feb 2012 12:46:18 +0000
> > richard <richard.bown at blueyonder.co.uk> wrote:
> > 
> > > the word list was a bit large (4 MB), so I deleted it
> > 
> > [gour at atmarama gour] ls -lh .bogofilter
> > total 375M
> > -rw-r--r-- 1 gour users 375M Vel 21 17:14 wordlist.db
> > 
> > [gour at atmarama gour] file .bogofilter/wordlist.db 
> > .bogofilter/wordlist.db: SQLite 3.x database
> > 
> > Sincerely,
> > Gour
> > 
> 
> Must have a lot of words.
> But I have big problems, as it's still treating 80% of incoming mail
> as spam and putting it into trash.  I have two inboxes, inbox and trash.
> 
> This is after deleting the wordlist.db.  Are any other files used by
> bogofilter?

Bogofilter assigns scores to words using information in the wordlist.
The wordlist has counts of the number of spam messages and the
number of ham messages used to build it.  For each token (word) in the
wordlist, ham and spam counts are stored indicating how many ham and
spam messages (respectively) the word has been used in.  From this
information bogofilter can decide whether a word is "hammy" or
"spammy". For example, a word used in 10% of all spam messages and in
90% of all ham messages is "hammy".  A word in 10% of spam and 1% of
ham is "spammy".  The balance of hammy and spammy words in a message
determines the score for the message.
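
To make the two examples above concrete, here is a minimal sketch in
Python.  It is a simplified illustration only, not bogofilter's actual
calculation; the function name token_spamminess and the message counts
of 1000 are assumptions made up for the example.

```python
def token_spamminess(spam_count, ham_count, spam_msgs, ham_msgs):
    """Rough spamminess of one token: near 0 is hammy, near 1 is spammy.

    Simplified illustration; hypothetical helper, not bogofilter's code.
    """
    # Fraction of spam (resp. ham) messages that contained the token.
    # An empty wordlist has zero messages, so guard the division.
    spam_rate = spam_count / spam_msgs if spam_msgs else 0.0
    ham_rate = ham_count / ham_msgs if ham_msgs else 0.0
    if spam_rate + ham_rate == 0:
        return 0.5  # token never seen: no evidence either way
    return spam_rate / (spam_rate + ham_rate)


# The two examples from the text, assuming 1000 messages of each kind.
hammy = token_spamminess(spam_count=100, ham_count=900,
                         spam_msgs=1000, ham_msgs=1000)
spammy = token_spamminess(spam_count=100, ham_count=10,
                          spam_msgs=1000, ham_msgs=1000)
print(f"{hammy:.2f} {spammy:.2f}")
```

Note that with an empty (freshly deleted) wordlist every token comes
out at 0.5 in this sketch, which is why more training data helps.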

It may help to train bogofilter with more ham messages (than you've
already done) in order to increase bogofilter's vocabulary.  The more
information in the wordlist, the better bogofilter can do.

Bogofilter also has a configuration file that sets the thresholds
for considering a message to be ham, spam, or unsure.  Changing
those values will affect whether a message with a given score is
considered ham, spam, or unsure.
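
As a rough sketch of how two thresholds turn a score into one of three
verdicts (in Python; the parameter names ham_cutoff and spam_cutoff and
the numbers used are illustrative assumptions, so check the
configuration file shipped with your installation for the real
settings):

```python
def classify(score, ham_cutoff=0.45, spam_cutoff=0.99):
    """Map a message score in [0, 1] to 'ham', 'spam', or 'unsure'.

    The cutoff values are illustrative assumptions, not recommendations.
    """
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"


# The same score can land in a different class when a threshold moves.
print(classify(0.70))                    # between the cutoffs
print(classify(0.70, spam_cutoff=0.60))  # lower spam threshold
print(classify(0.30))                    # below the ham threshold
```

If too much good mail is being classified as spam, a higher spam
threshold sends fewer borderline messages to the spam folder, at the
cost of more messages ending up as unsure.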

HTH,

David


