How To Train SpamAssassin

This is an overview of how to train SpamAssassin to more effectively catch spam.

This document is based largely on the installation on Obscure, so local configuration (file paths, etc.) may differ on your system.

Why Train?

SpamAssassin comes with a large set of rules for likely spam behavior. This is somewhat effective, but not very smart. Training allows SA to learn what kind of spam you get over time, and adjust its results accordingly.

How To Train

First, file all your mail into folders that contain only spam or only non-spam (“ham”), never both.

Then, run the commands listed below that match the format your mail server uses to store mail. Most Unix systems use mbox format folders. Some use mbx folders, while others use Maildir format. If you aren’t sure what your system uses, check with an administrator.
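If no administrator is at hand, a rough guess at the format can be made by looking at the folder itself. The sketch below is only a heuristic: guess_mail_format is a hypothetical helper, not part of SpamAssassin. It assumes that a Maildir folder is a directory containing cur, new, and tmp, that an mbox file begins with a "From " line, and that an mbx file begins with the marker *mbx*.

```shell
# Heuristic sketch: guess the storage format of a mail folder.
# guess_mail_format is a hypothetical helper, not part of SpamAssassin.
guess_mail_format() {
    if [ -d "$1/cur" ] && [ -d "$1/new" ] && [ -d "$1/tmp" ]; then
        # A directory holding cur/, new/ and tmp/ looks like Maildir.
        echo maildir
    elif [ -f "$1" ]; then
        # For a plain file, inspect the first line only.
        case $(head -n 1 "$1") in
            "*mbx*"*) echo mbx ;;     # assumed mbx file marker
            "From "*) echo mbox ;;    # mbox message separator line
            *)        echo unknown ;;
        esac
    else
        echo unknown
    fi
}
```

For example, guess_mail_format ~/mail/spam prints one of maildir, mbx, mbox, or unknown. An empty mbox file reports unknown, so treat the answer as a hint rather than proof.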

You’ll run sa-learn once per folder, telling sa-learn whether to consider that folder’s contents to be spam or ham. It will then use the contents as examples to compare against future mail.

Make sure you have filed your mail into folders correctly before running sa-learn, and make sure you run sa-learn with the right flags. If you leave mail from your mother in a folder you train as spam, SA will start to think your mom is a spammer. If a message was misclassified (a false negative or a false positive), file it correctly and train on it; for a false positive, for example, move it back into the inbox and retrain. SA will learn the contents of that message and should handle similar mail correctly in the future.

SA’s training starts to work effectively once:

You’ve trained about 3000 messages each of both spam and ham.
You have an equal amount of spam and ham trained, or more ham than spam trained. This last bit is important: If you only train spam – and not ham – the filter will become biased towards spam.

I’ve found that it makes sense to:

train regularly until SA becomes smart enough that spam isn’t annoying me
train again when spam slips through
train intermittently, even when spam doesn’t slip through, just to keep SA up to date (of course, this necessitates keeping old spam around)

You should sweep your spam folder occasionally to make sure you aren’t accidentally trapping legitimate mail. If you are, refile and retrain. Be careful of false positives. Training is a feedback loop, and legitimate mail learned to be spam will lead to more legitimate mail captured as spam.
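One way to make that sweep quicker on an mbox spam folder is to print only the sender and subject of each message and skim the list. This is a minimal sketch: list_mbox_headers is a hypothetical helper name, and it assumes a conventional mbox file in which each message starts with a "From " separator line and the headers end at the first blank line.

```shell
# Print the From: and Subject: headers of every message in an mbox file,
# so a spam folder can be skimmed for legitimate mail before retraining.
# list_mbox_headers is a hypothetical helper name.
list_mbox_headers() {
    awk '
        /^From / { in_headers = 1; next }      # mbox separator: a new message starts
        in_headers && /^$/ { in_headers = 0 }  # blank line ends the header block
        in_headers && /^(From|Subject):/ { print }
    ' "$1"
}
```

For example, list_mbox_headers ~/mail/spam | less. Anything that looks legitimate can then be refiled before the next training run.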

Training with mbox format

The general format is:

sa-learn --no-sync [--spam or --ham] --mbox [folder]

For example, assuming you’ve filed all your spam into a spam folder in the ~/mail directory:

sa-learn --no-sync --spam --mbox ~/mail/spam

And assuming you’ve filed ham into several folders (in this example, “friends” and “lists”, also in the ~/mail directory):

sa-learn --no-sync --ham --mbox ~/mail/friends
sa-learn --no-sync --ham --mbox ~/mail/lists

You’ll also want to train on your inbox as ham. Before you do, move any spam out of the inbox and into your spam folder. Most systems using the mbox format store the inbox in a special location. On Obscure, the location is /var/spool/mail/[userid]. For example, for me the command is:

sa-learn --no-sync --ham --mbox /var/spool/mail/faisal

Once you’ve trained all the folders you’re using, you’ll need to run this command to tell sa-learn to clean up after itself and rebuild its database:

sa-learn --sync
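The whole mbox run above, per-folder training followed by a single sync, could be wrapped in a small shell helper. This is only a sketch: train_mbox_folders and the SA_LEARN variable are hypothetical names introduced here, and the folder paths in the usage comments are the examples from this document. Setting SA_LEARN=echo previews the commands without touching the database.

```shell
# Sketch of a batch run over mbox folders. train_mbox_folders is a
# hypothetical helper; SA_LEARN can be overridden (for example with "echo")
# to preview the commands instead of running them.
SA_LEARN=${SA_LEARN:-sa-learn}

train_mbox_folders() {
    kind=$1    # "spam" or "ham"
    shift
    case $kind in
        spam|ham) ;;
        *) echo "usage: train_mbox_folders spam|ham FOLDER..." >&2; return 2 ;;
    esac
    for folder in "$@"; do
        # Same command as above, one folder at a time; stop on the first failure.
        "$SA_LEARN" --no-sync "--$kind" --mbox "$folder" || return 1
    done
}

# Typical use:
#   train_mbox_folders spam ~/mail/spam
#   train_mbox_folders ham ~/mail/friends ~/mail/lists
#   "$SA_LEARN" --sync
```

Refusing anything other than spam or ham as the first argument guards against the mistake warned about earlier: training a folder with the wrong flag.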

If you’d like to see what’s currently in the database, do:

sa-learn --dump magic

nspam and nham are the number of spam and ham messages that SpamAssassin has learned from.
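To check the balance between the two counts at a glance, the dump can be piped through a short filter. This is a sketch under an assumption about the output layout: it expects the count to be the third whitespace-separated column on the lines ending in "non-token data: nspam" and "non-token data: nham". bayes_counts is a hypothetical helper name.

```shell
# Summarise nspam/nham from "sa-learn --dump magic" output read on stdin.
# bayes_counts is a hypothetical helper; it assumes the count is the third
# column of the "non-token data: nspam" and "non-token data: nham" lines.
bayes_counts() {
    awk '
        /non-token data: nspam$/ { nspam = $3 }
        /non-token data: nham$/  { nham  = $3 }
        END {
            printf "spam=%d ham=%d\n", nspam, nham
            # The guideline above: ham should equal or exceed spam.
            if (nham < nspam) print "warning: less ham than spam trained"
        }
    '
}

# Typical use:
#   sa-learn --dump magic | bayes_counts
```

If the warning appears, train more ham before training more spam.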

Training with mbx format

Training with mbx format works much the same as training with mbox format, except you must use --mbx instead of --mbox for all commands:

sa-learn --no-sync [--spam or --ham] --mbx [folder]

Generally the special folder for the inbox using mbx format is the INBOX folder in the user’s home directory:

sa-learn --no-sync --ham --mbx INBOX

Training with Maildir format

Maildir format is a bit different: it stores each message in a separate file within one of three subdirectories (‘cur’, ‘new’, and ‘tmp’). Instead of pointing sa-learn at a specific mbox or mbx file, you point sa-learn at the directories and it looks at all the files inside:

sa-learn --no-sync [--spam or --ham] [folder/{cur,new}]

For example:

sa-learn --no-sync --spam ~/Maildir/.INBOX.Spam/{cur,new}

sa-learn --no-sync --ham ~/Maildir/.INBOX/{cur,new}

(This ignores the ‘tmp’ directory, which is used as a working directory and is usually empty. You may also wish to ignore the ‘new’ directory, which lowers the odds of picking up newly delivered, wrongly filed mail during a training run. To do that, point sa-learn at folder/cur only.)
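A cur-only run over several Maildir folders might look like the sketch below. train_maildir_cur and the SA_LEARN variable are hypothetical names introduced for illustration. The helper skips any folder that has no cur subdirectory, and setting SA_LEARN=echo previews the commands instead of running them.

```shell
# Sketch: train from the cur/ subdirectory of one or more Maildir folders,
# skipping new/ and tmp/. train_maildir_cur is a hypothetical helper.
SA_LEARN=${SA_LEARN:-sa-learn}

train_maildir_cur() {
    kind=$1    # "spam" or "ham"
    shift
    for folder in "$@"; do
        if [ ! -d "$folder/cur" ]; then
            echo "skipping $folder: no cur/ directory" >&2
            continue
        fi
        "$SA_LEARN" --no-sync "--$kind" "$folder/cur" || return 1
    done
}

# Typical use:
#   train_maildir_cur spam ~/Maildir/.INBOX.Spam
#   train_maildir_cur ham ~/Maildir/.INBOX
#   "$SA_LEARN" --sync
```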

More Information

The SA project’s documentation regarding training: BayesInSpamAssassin

Paul Graham’s foundational essay on probabilistic filters: A Plan for Spam