Filtering spam with probabilities
Slashdot has linked to an interesting article on a new technique for filtering out email spam.
From the article: "I think we will be able to solve the problem with fairly simple algorithms. In fact, I've found that you can filter present-day spam acceptably well using nothing more than a Bayesian combination of the spam probabilities of individual words. Using a slightly tweaked (as described below) Bayesian filter, we now miss only 5 per 1000 spams, with 0 false positives."
What's most interesting about the author's technique is that it determines the actual probability that an email is spam rather than generating an arbitrary score.
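To make the "actual probability" point concrete, here is a minimal sketch of what a Bayesian combination of per-word spam probabilities could look like. It is not the author's code: the function name `combine_word_probs`, the clamping bounds, and the example numbers are all my own assumptions, and it treats words as independent with spam and non-spam equally likely beforehand.

```python
from math import prod


def combine_word_probs(word_probs, lo=0.01, hi=0.99):
    """Combine per-word spam probabilities into one spam probability.

    Assumptions (mine, for illustration): words are independent, and a
    message is equally likely to be spam or not before looking at it.
    """
    # Clamp each probability away from 0 and 1 so a single word cannot
    # force the result and the denominator can never be zero.
    clamped = [min(hi, max(lo, p)) for p in word_probs]
    spam = prod(clamped)                    # evidence for spam
    ham = prod(1.0 - p for p in clamped)    # evidence against spam
    return spam / (spam + ham)


# Hypothetical per-word probabilities for one message.
score = combine_word_probs([0.99, 0.95, 0.2])
print(f"probability of spam: {score:.3f}")
```

The result is always a number between 0 and 1 that can be read as a probability, so a threshold like "flag anything above 0.9" has a direct meaning, unlike a cutoff on an arbitrary point score.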
The article is on the long side, but it's well worth a look unless you're in the market for an extra 3-4 inches or want to repair your credit rating online.