Bayesian Filters – Basic Explanation
Bayesian filters, as implemented within IceWarp Server, use two reference databases to estimate the probability that a message is spam:
The Reference Base, which is built and supplied by us from real-world messages on a real-world mail server. Updates are supplied through the Antispam update function.
The User Reference Base, which is built by IceWarp Server using the Auto Learn and/or Learning Rules functions. It is built from actual messages passing through the server and consequently becomes much more specific to the individual installation.
User Reference Base information overrides Reference Base information.
Bayesian filters are based on Bayesian probability theory, which estimates how likely an event is from how often it has occurred in the past. For the filters to work correctly, a good selection of both spam and genuine (ham) messages must be analyzed.
The implementation within IceWarp Server is as follows:
Take the probability that a spam message contains a certain word.
Multiply by the probability that any message is spam.
Divide by the probability that any message (spam or ham) contains that word.
The result is the probability that a message containing that word is spam.
Example:
Assume that we have received and analyzed 100,000 messages in total:
- 80,000 messages are spam
- 48,000 spam messages contain the word viagra
- 400 ham messages contain the word viagra
Then:
- The probability that spam contains viagra = 48,000 / 80,000 = 0.6
- The probability that a message is spam = 80,000 / 100,000 = 0.8
- The probability that any message contains viagra = (48,000 + 400) / 100,000 = 0.484
- So Bayes' theorem says the probability that a message containing viagra is spam = 0.6 * 0.8 / 0.484 = 0.48 / 0.484 ≈ 0.992
- Meaning a message containing viagra has about a 99.2% chance of being spam.
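The calculation above can be reproduced in a few lines of Python. This is only an illustrative sketch: the function name spam_probability and its parameters are hypothetical and are not part of IceWarp Server, and the sketch considers a single word only.

```python
def spam_probability(total, spam_total, spam_with_word, ham_with_word):
    """Return the probability that a message containing the word is spam.

    All arguments are raw message counts, as in the example above.
    """
    # Probability that a spam message contains the word.
    p_word_given_spam = spam_with_word / spam_total
    # Probability that any message is spam.
    p_spam = spam_total / total
    # Probability that any message (spam or ham) contains the word.
    p_word = (spam_with_word + ham_with_word) / total
    return p_word_given_spam * p_spam / p_word


p = spam_probability(
    total=100_000,
    spam_total=80_000,
    spam_with_word=48_000,
    ham_with_word=400,
)
print(f"{p:.4f}")  # prints 0.9917
```

Note that the result is the same as dividing the number of spam messages containing the word by the number of all messages containing it (48,000 / 48,400).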
We recommend an initial Auto Learn period of about two weeks, followed by a Compact and re-learn at least every 3-4 months. This allows the User Reference Base to follow any changes in company message content (for example, the company starts selling mortgages).
The User Reference Base can hold a maximum of 100,000 words. (This limit can be changed – use the C_AS_SpamBayesMaxWords API variable.) You can see how many words are actually stored in the General tab.
Once the limit is reached, you should Compact the database (which removes lower-frequency, less important words) and enable the Auto Learn feature again for a time.
The Reference Base is stored in the file <install_dir>/spam/spam.db
The User Reference Base is stored in the file <install_dir>/spam/spam.usr
spam.db and spam.usr Files
These files contain one record per word, from which the spam probability of that word is derived:
return 38953 128 190
revealed 38891 0 16
where items on each line are:
- word itself
- timestamp of the last modification (Delphi date value, the number of days since 30 December 1899)
- number of genuine messages that contained this word
- number of spam messages that contained this word
Note: Sometimes the same number is subtracted from both the spam and genuine counters to keep the numbers low. So, the third example record does not mean that no spam message contained this word.
These counters are only 32-bit numbers, so they cannot exceed the largest value a 32-bit number can hold. If this limit is exceeded, the affected record can look a bit strange (see the first record).
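The record layout described above can be read with a short Python sketch. It assumes that each record is a whitespace-separated text line exactly as displayed in this section (the on-disk format may differ), and that the timestamp follows the usual Delphi date convention in which day 0 is 30 December 1899; adjust DELPHI_EPOCH if the files use a different origin. The names WordRecord and parse_record are hypothetical and are not part of IceWarp Server.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumption: day 0 of the Delphi date value is 30 December 1899.
DELPHI_EPOCH = date(1899, 12, 30)


@dataclass
class WordRecord:
    word: str
    modified: date       # date of the last modification
    genuine_count: int   # genuine messages that contained the word
    spam_count: int      # spam messages that contained the word


def parse_record(line: str) -> WordRecord:
    """Parse one 'word timestamp genuine spam' record line."""
    word, days, genuine, spam = line.split()
    modified = DELPHI_EPOCH + timedelta(days=int(days))
    return WordRecord(word, modified, int(genuine), int(spam))


record = parse_record("return 38953 128 190")
print(record.word, record.modified, record.genuine_count, record.spam_count)
```

Because the same number may be subtracted from both counters, the parsed counts should be treated as relative weights rather than exact message totals.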