How To Train SpamAssassin

Faisal N. Jawdat

This is an overview of how to train SpamAssassin to more effectively catch spam.

The document is based somewhat heavily on the installation on Obscure, so local configuration (filepaths, etc.) may vary.

Why Train?

SpamAssassin comes with a large set of rules for likely spam behavior. This is somewhat effective, but not very smart. Training allows SA to learn what kind of spam you get over time, and adjust its results accordingly.

How To Train

First, file all your mail into folders which only contain spam or non-spam ("ham"), but not both.

Then, run the commands listed below, based on the format your mail server uses to store mail. Most unix systems use "mbox" format folders. Some use "mbx" folders, while others use Maildir format. If you aren't sure what your system uses, check with an administrator.
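If no administrator is handy, a folder's format can often be guessed from how it looks on disk. The following is a rough heuristic sketch, not a definitive test: it assumes a Maildir folder is a directory containing 'cur', 'new', and 'tmp', that an mbox file begins with a "From " line, and that an mbx file begins with the marker "*mbx*". The function name guess_mail_format is made up for this example.

```shell
# Heuristic guess at a mail folder's storage format.
# Prints one of: maildir, mbx, mbox, unknown.
guess_mail_format() {
    path=$1
    if [ -d "$path/cur" ] && [ -d "$path/new" ] && [ -d "$path/tmp" ]; then
        # A Maildir folder is a directory with these three subdirectories.
        echo maildir
    elif [ -f "$path" ]; then
        # Single-file formats: look at the first line only.
        first_line=$(head -n 1 "$path")
        case $first_line in
            '*mbx*'*) echo mbx ;;      # assumed mbx marker line
            'From '*) echo mbox ;;     # mbox messages start with "From "
            *)        echo unknown ;;
        esac
    else
        echo unknown
    fi
}
```

For example, `guess_mail_format ~/mail/spam` prints the guessed format for that folder. An empty folder file gives "unknown", so check a folder that already has mail in it.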

You'll run sa-learn once per folder, telling sa-learn whether to consider that folder's contents to be spam or ham. It will then use the contents as examples to compare to future mail.

Make sure you have filed your mail into folders correctly before running sa-learn, and make sure you run sa-learn with the right flags. If you leave mail from your mother in a folder you train as spam, SA will start to think your mom is a spammer. If SA misclassifies a message (a false negative or a false positive), refile it and retrain. For a false positive, for example, move the message back into the inbox and train again. SA will learn the contents of that message and should do the right thing with similar mail in the future.

SA's training starts to work effectively once:

I've found that it makes sense to:

You should sweep your spam folder occasionally to make sure you aren't accidentally trapping legitimate mail. If you are, refile and retrain. Be careful of false positives. Training is a feedback loop, and legitimate mail learned as spam will lead to more legitimate mail being caught as spam.
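When only a handful of messages were misfiled, it can be quicker to retrain just those messages than to rerun a whole folder. The sketch below assumes each rescued message has been saved as its own file, and it relies on sa-learn forgetting an earlier spam classification when the same message is learned as ham, which is how the sa-learn manual describes --ham. The function name relearn_as_ham and the SA_LEARN override (handy for a dry run with SA_LEARN=echo) are inventions for this example.

```shell
# Relearn individual message files as ham after a false positive.
# Usage: relearn_as_ham MESSAGE_FILE...
relearn_as_ham() {
    if [ "$#" -eq 0 ]; then
        echo "usage: relearn_as_ham MESSAGE_FILE..." >&2
        return 2
    fi
    # No --no-sync here: for a few messages, let sa-learn sync right away.
    "${SA_LEARN:-sa-learn}" --ham "$@"
}
```

For example, `relearn_as_ham ~/rescued/msg1 ~/rescued/msg2`, where the paths are placeholders for wherever the rescued messages were saved.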

Training with mbox format

The general format is:

sa-learn --no-sync [--spam or --ham] --mbox [folder]

For example, assuming you've shoved all spam into a spam folder in the ~/mail directory:

sa-learn --no-sync --spam --mbox ~/mail/spam

And assuming you've filed ham into several folders (in this example, "friends" and "lists", also in the ~/mail directory):

sa-learn --no-sync --ham --mbox ~/mail/friends
sa-learn --no-sync --ham --mbox ~/mail/lists
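With more than a couple of ham folders, a small loop saves retyping the command. This is only a sketch: the function name train_ham_folders and the MAIL_DIR and SA_LEARN variables are made up for this example, and the mail directory defaults to ~/mail as in the examples above.

```shell
# Train each named mbox folder under the mail directory as ham.
# Usage: train_ham_folders FOLDER_NAME...
train_ham_folders() {
    for name in "$@"; do
        # One sa-learn run per folder, deferring the database sync.
        "${SA_LEARN:-sa-learn}" --no-sync --ham --mbox "${MAIL_DIR:-$HOME/mail}/$name"
    done
}
```

Running `train_ham_folders friends lists` is then equivalent to the two commands above.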

You'll also want to move any spam out of your inbox and into your spam folder, then train on the inbox as ham. Most systems using the mbox format store the inbox in a special location. On Obscure, the location is /var/spool/mail/[userid]. For example, for me the command is:

sa-learn --no-sync --ham --mbox /var/spool/mail/faisal

Once you've trained all the folders you're using, you'll need to run this command to tell sa-learn to clean up after itself and rebuild its database:

sa-learn --sync
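Putting the steps together, a whole training pass might look something like the sketch below: learn the spam folder, learn each ham folder, and sync once at the end. The function name train_all and the SA_LEARN override are inventions for this example. Stopping at the first failed run is a design choice made here, not something sa-learn requires.

```shell
# Train one spam folder and any number of ham folders, then sync once.
# Usage: train_all SPAM_FOLDER HAM_FOLDER...
train_all() {
    sa=${SA_LEARN:-sa-learn}
    spam_folder=$1
    shift
    "$sa" --no-sync --spam --mbox "$spam_folder" || return 1
    for ham_folder in "$@"; do
        "$sa" --no-sync --ham --mbox "$ham_folder" || return 1
    done
    # Rebuild the database only after every folder has been learned.
    "$sa" --sync
}
```

For example: `train_all ~/mail/spam ~/mail/friends ~/mail/lists /var/spool/mail/faisal`. Setting SA_LEARN=echo first prints the commands instead of running them.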

If you'd like to see what's currently in the database, do:

sa-learn --dump magic

nspam and nham are the number of spam and ham messages that SpamAssassin has learned from.
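To pull just those two numbers out of the dump, a short awk filter will do. This sketch assumes the layout commonly printed by sa-learn --dump magic, where each counter's value is the third whitespace-separated field and its name is the last word on the line. Check that against your own output before relying on it. The function name bayes_counts is made up for this example.

```shell
# Read `sa-learn --dump magic` output on stdin and print "nspam nham".
# Assumes the value is field 3 and the counter name is the last field.
bayes_counts() {
    awk '$NF == "nspam" { spam = $3 }
         $NF == "nham"  { ham = $3 }
         END { print spam + 0, ham + 0 }'
}
```

Use it as `sa-learn --dump magic | bayes_counts`. If either counter is missing from the input, it is reported as 0.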

Training with mbx format

Training with mbx format works much the same as training with mbox format, except you must use "--mbx" instead of "--mbox" for all commands:

sa-learn --no-sync [--spam or --ham] --mbx [folder]

Generally the special folder for the inbox using mbx format is the INBOX folder in the user's home directory:

sa-learn --no-sync --ham --mbx INBOX

Training with Maildir format

Maildir format is a bit different -- it stores each message in a separate file within one of three subdirectories ('cur', 'new', and 'tmp'). Instead of pointing sa-learn at a specific mbox or mbx file, you point sa-learn at the directories and it looks at all the files inside:

sa-learn --no-sync [--spam or --ham] [folder/{cur,new}]

For example:

sa-learn --no-sync --spam ~/Maildir/.INBOX.Spam/{cur,new}

or

sa-learn --no-sync --ham ~/Maildir/.INBOX/{cur,new}

(This ignores the 'tmp' directory, which is used as a working directory and is usually empty. You may also wish to ignore the 'new' directory, which holds mail that has just been delivered and that you have not yet had a chance to refile; skipping it lowers the odds of learning from wrongly-filed mail that arrives during a scan. To do that you would just scan folder/cur.)
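The {cur,new} shorthand is brace expansion, which bash and zsh provide but a plain POSIX sh does not. A small wrapper that spells the directories out avoids that, and makes skipping 'new' a switch. This is a sketch only: the function name train_maildir and the SKIP_NEW and SA_LEARN variables are made up for this example.

```shell
# Train one Maildir folder. KIND is "spam" or "ham".
# Set SKIP_NEW=1 to learn from cur only.
# Usage: train_maildir KIND FOLDER
train_maildir() {
    kind=$1
    folder=$2
    case $kind in
        spam|ham) ;;
        *) echo "kind must be spam or ham" >&2; return 2 ;;
    esac
    if [ "${SKIP_NEW:-0}" = 1 ]; then
        "${SA_LEARN:-sa-learn}" --no-sync "--$kind" "$folder/cur"
    else
        "${SA_LEARN:-sa-learn}" --no-sync "--$kind" "$folder/cur" "$folder/new"
    fi
}
```

For example, `train_maildir spam ~/Maildir/.INBOX.Spam` matches the first command above, and `SKIP_NEW=1 train_maildir ham ~/Maildir/.INBOX` scans only 'cur'. As with the other formats, finish with sa-learn --sync.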

More Information

The SA project's documentation regarding training: BayesInSpamAssassin

Paul Graham's foundational essay on probabilistic filters: A Plan for Spam