January 30, 2004
I do not recall how I first heard of Bayes or his formula. It is not well known to the common man, except perhaps to those who have sat through a semester of Statistics. The formula is here (for your viewing pleasure):

Looks useful, right? Not to me either. I know the P stands for Probability, though. Here is a pretty good description of what it does (source):

It lets us combine the probability of multiple independent events into one number. For example, suppose I know that when my car makes a loud clicking sound, there’s a 75% chance it’s going to break down. I also know that if it’s more than 80 degrees Fahrenheit outside, my car only has a 15% chance of breaking down. If it’s more than 80 degrees outside and my car is making a loud clicking sound, I can use Bayes’ Formula to figure out how likely it is that I’m going to be walking to work that morning (34.6%, if you’re curious).
The explanation above describes nicely what the formula means, but its example is far outside the realm of plausibility (and practicality). Bayes' Formula is a nice tool, but tools just hang on the wall unless there is a valid application for them. (As I always say... "If all you have is Bayes' Formula, every problem looks like a statistic.")
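
For what it's worth, the 34.6% in the quoted example checks out if you assume the two-event "combining" form of the formula, where the two pieces of evidence are independent and there is no prior leaning either way. The symbols a (clicking sound, 0.75) and b (hot day, 0.15) are my own labels, not the source's:

```latex
\[
P = \frac{ab}{ab + (1-a)(1-b)}
  = \frac{0.75 \times 0.15}{0.75 \times 0.15 + 0.25 \times 0.85}
  = \frac{0.1125}{0.325}
  \approx 0.346
\]
```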

Recently, one very practical application for this tool has emerged - coping with spam. Bayesian Filtering of e-mail is simple: 1. Break the e-mail into tokens (individual words); 2. Determine, for each token, the probability that a message containing it is spam; 3. Use Bayes' Formula to combine the individual token probabilities into the probability that the whole message is spam.
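
To make the three steps concrete, here is a minimal sketch in Python. It is only an illustration under simplifying assumptions: tokens are treated as independent, spam and non-spam are assumed equally likely up front, and both training lists are assumed to be non-empty. The function names, the 0.01/0.99 clamp, and the 0.5 default for unseen tokens are my own choices for the example. This is not Graham's exact algorithm, which adds further refinements.

```python
import re
from collections import Counter
from math import prod  # requires Python 3.8+


def tokenize(text):
    """Step 1: break a message into a set of lowercase tokens."""
    return set(re.findall(r"[a-z0-9$'-]+", text.lower()))


def token_probabilities(spam_msgs, ham_msgs):
    """Step 2: estimate, per token, the probability that a message containing it is spam."""
    spam_counts = Counter(t for m in spam_msgs for t in tokenize(m))
    ham_counts = Counter(t for m in ham_msgs for t in tokenize(m))
    probs = {}
    for token in set(spam_counts) | set(ham_counts):
        spam_rate = spam_counts[token] / len(spam_msgs)
        ham_rate = ham_counts[token] / len(ham_msgs)
        # Clamp so that no single token counts as absolute proof (assumed bounds).
        probs[token] = min(0.99, max(0.01, spam_rate / (spam_rate + ham_rate)))
    return probs


def combine(probabilities):
    """Step 3: merge the per-token probabilities into one number."""
    spam = prod(probabilities)
    ham = prod(1 - p for p in probabilities)
    return spam / (spam + ham)


def spam_probability(message, probs, unknown=0.5):
    """Score a message; tokens never seen in training get a neutral default."""
    return combine([probs.get(t, unknown) for t in tokenize(message)])
```

As a sanity check, combine([0.75, 0.15]) comes out to roughly 0.346, which matches the car example quoted earlier.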

Paul Graham has two fantastic articles describing Bayesian Filtering with regard to e-mail here and here. I am posting a few key ideas from his first article, but I would highly recommend reading both in their entirety.

The typical approach to Spam is not Bayesian Filtering:

The statistical approach is not usually the first one people try when they write spam filters. Most hackers' first instinct is to try to write software that recognizes individual properties of spam. You look at spams and you think, the gall of these guys to try sending me mail that begins "Dear Friend" or has a subject line that's all uppercase and ends in eight exclamation points. I can filter out that stuff with about one line of code.
Even Graham himself tried the statistical approach last:
I don't know why I avoided trying the statistical approach for so long. I think it was because I got addicted to trying to identify spam features myself, as if I were playing some kind of competitive game with the spammers. (Nonhackers don't often realize this, but most hackers are very competitive.) When I did try statistical analysis, I found immediately that it was much cleverer than I had been. It discovered, of course, that terms like "virtumundo" and "teens" were good indicators of spam. But it also discovered that "per" and "FL" and "ff0000" are good indicators of spam. In fact, "ff0000" (html for bright red) turns out to be as good an indicator of spam as any pornographic term.
Why Bayesian Filtering works so well:
But the real advantage of the Bayesian approach, of course, is that you know what you're measuring. Feature-recognizing filters like SpamAssassin assign a spam "score" to email. The Bayesian approach assigns an actual probability. The problem with a "score" is that no one knows what it means. The user doesn't know what it means, but worse still, neither does the developer of the filter. How many points should an email get for having the word "sex" in it? A probability can of course be mistaken, but there is little ambiguity about what it means, or how evidence should be combined to calculate it. Based on my corpus, "sex" indicates a .97 probability of the containing email being a spam, whereas "sexy" indicates .99 probability. And Bayes' Rule, equally unambiguous, says that an email containing both words would, in the (unlikely) absence of any other evidence, have a 99.97% chance of being a spam.
It also automatically adapts as spam tactics change:
If I thought that I could keep up current rates of spam filtering, I would consider this problem solved. But it doesn't mean much to be able to filter out most present-day spam, because spam evolves. Indeed, most antispam techniques so far have been like pesticides that do nothing more than create a new, resistant strain of bugs.
I'm more hopeful about Bayesian filters, because they evolve with the spam. So as spammers start using "c0ck" instead of "cock" to evade simple-minded spam filters based on individual words, Bayesian filters automatically notice. Indeed, "c0ck" is far more damning evidence than "cock", and Bayesian filters know precisely how much more.
After reading those articles, I was sold. I was tired of fighting spam, and I wanted a Bayesian Filter.

The rest of this story can be found in Winning the Spam War (Part II).
