68b4d23acc
git-svn-id: file:///srv/svn/repos/haiku/haiku/trunk@15963 a95241bf-73f2-0310-859d-f6bbb57e9c96
807 lines
40 KiB
HTML
807 lines
40 KiB
HTML
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
|
|
<HTML>
|
|
<HEAD>
|
|
<TITLE>AGMSBayesianSpam Documentation</TITLE>
|
|
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=us-ascii">
|
|
<META NAME="author" CONTENT="Alexander G. M. Smith">
|
|
<META NAME="description" CONTENT="Documentation for AGMSBayesianSpam, for classifying incoming e-mail messages as spam (junk mail) or genuine.">
|
|
<!--
|
|
; $Log: index.html,v $
|
|
; Revision 1.11 2003/02/08 21:54:14 agmsmith
|
|
; Updated the AGMSBayesianSpamServer documentation to match the current
|
|
; version. Also removed the Beep options from the spam filter, now they
|
|
; are turned on or off in the system sound preferences.
|
|
;
|
|
; Revision 1.10 2002/12/16 17:32:43 agmsmith
|
|
; Added Alex's settings paragraph. Added screen shot of dangerous
|
|
; header filter that deletes things on the server.
|
|
;
|
|
; Revision 1.9 2002/12/13 22:45:22 agmsmith
|
|
; More changes for self training and chi-squared scoring.
|
|
;
|
|
; Revision 1.8 2002/12/13 22:20:50 agmsmith
|
|
; Under construction.
|
|
;
|
|
; Revision 1.7 2002/11/29 23:42:26 agmsmith
|
|
; Describe the word display and what you can do with it.
|
|
;
|
|
; Revision 1.6 2002/11/29 22:20:02 agmsmith
|
|
; Updated version numbers in the text
|
|
;
|
|
; Revision 1.5 2002/11/28 21:19:41 agmsmith
|
|
; Updated to explain how to check for spam without downloading the
|
|
; whole message.
|
|
;
|
|
; Revision 1.4 2002/11/10 20:56:39 agmsmith
|
|
; Updated documentation to include MDR installer effects, and added a
|
|
; section on the tokenizing experiments.
|
|
;
|
|
; Revision 1.3 2002/11/06 00:54:47 agmsmith
|
|
; Spam definition corrected, with prodding from Ian G.
|
|
;
|
|
; Revision 1.2 2002/11/05 22:47:30 agmsmith
|
|
; Replace UTF-8 copyright symbol with useable token.
|
|
;
|
|
; Revision 1.1 2002/11/05 22:43:24 agmsmith
|
|
; Starting point for the HTML documentation for AGMSBayesianSpam stuff.
|
|
;
|
|
; Revision 1.5 2002/10/21 21:07:19 agmsmith
|
|
; Added references to the original spam detection papers.
|
|
;
|
|
; Revision 1.4 2002/10/21 20:56:01 agmsmith
|
|
; Added hyperlinks for local files, and an explanation of "spam".
|
|
;
|
|
; Revision 1.3 2002/10/21 20:19:29 agmsmith
|
|
; Finished updating instructions for version 1.60, and adding
|
|
; lots of screen shots.
|
|
;
|
|
; Revision 1.2 2002/10/21 02:03:35 agmsmith
|
|
; Added log in HTML comments area.
|
|
;
|
|
; Revision 1.1 2002/10/21 02:00:54 agmsmith
|
|
; Initial revision
|
|
-->
|
|
</HEAD>
|
|
<BODY BGCOLOR="WHITE" TEXT="BLACK">
|
|
|
|
<P><FONT COLOR="MAGENTA">Short: Junk E-Mail Classifier.
|
|
<BR>Author: agmsmith@rogers.com (Alexander G. M. Smith)
|
|
<BR>Uploader: agmsmith@rogers.com (Alexander G. M. Smith)
|
|
<BR>Website: <A HREF="http://members.rogers.com/agmsmith/">http://members.rogers.com/agmsmith/</A>
|
|
<BR>Version: 1.77
|
|
<BR>Type: internet & network/e-mail
|
|
<BR>Requires: BeOS 5.0+
|
|
<BR>Related things: <A HREF="http://www.paulgraham.com/spam.html">http://www.paulgraham.com/spam.html</A>, <A HREF="http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html">http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html</A></FONT>
|
|
|
|
<H1><A NAME="Contents"></A>Table of Contents</H1>
|
|
|
|
<UL>
|
|
<LI><A HREF="#Contents">Table of Contents</A>
|
|
<LI><A HREF="#Introduction">Introduction to AGMSBayesianSpam</A>
|
|
<LI><A HREF="#Installation">Installation</A>
|
|
<LI><A HREF="#Usage">Usage</A>
|
|
<UL>
|
|
<LI><A HREF="#Reading">Reading E-Mail</A>
|
|
<LI><A HREF="#Training">Training</A>
|
|
<LI><A HREF="#HidingServer">Hiding the Server Window</A>
|
|
<LI><A HREF="#AlexSettings">Alex's Settings</A>
|
|
</UL>
|
|
<LI><A HREF="#AdvancedUsage">Advanced Usage</A>
|
|
<UL>
|
|
<LI><A HREF="#CommandLine">Command Line Mode and Scripting</A>
|
|
<LI><A HREF="#Spreadsheet">Using a Spreadsheet to Examine Word Statistics</A>
|
|
<LI><A HREF="#WordDisplay">Understanding and Using the Word Display</A>
|
|
<LI><A HREF="#Tokenizing">Tokenizing Modes Compared</A>
|
|
<LI><A HREF="#HeadersOnly">High Speed and High Danger - Headers Only Trick</A>
|
|
</UL>
|
|
<LI><A HREF="#ChangeLog">Change Log</A>
|
|
</UL>
|
|
|
|
<H1><A NAME="Introduction"></A>Introduction to AGMSBayesianSpam</H1>
|
|
|
|
<P>AGMSBayesianSpam is a set of BeOS programs for classifying e-mail messages
|
|
and other text as either spam or genuine. "Spam" is the colloquial name for
|
|
unwanted junk messages, usually advertising. The name comes from a 1970's <A
|
|
HREF="http://www.google.com/search?&q=monty+python+spam">Monty Python comedy
|
|
skit</A> involving lots of unwanted Spam, which is the name for the spicy ham
|
|
in a can made by the <A HREF="http://www.hormel.com/">Hormel Foods</A> company,
|
|
originally from Austin, Minnesota, USA. The program classifies messages as
|
|
spam or genuine (sometimes called "ham"), based on the words they contain and
|
|
previous messages which have been identified by the user as spam or genuine.
|
|
It's implemented as a server program (AGMSBayesianSpamServer) which keeps track
|
|
of the word list and a Mail Daemon Replacement add-on (AGMSBayesianSpamFilter)
|
|
which uses the server to classify incoming messages. Theoretically other
|
|
programs, like a news reader, could also use the word database using the
|
|
scripting interface. There's also a command line interface and a graphical
|
|
user interface.
|
|
|
|
<P>If you want to know more about the technique of counting words, have a look
|
|
at Paul Graham's wonderful write-up at <A
|
|
HREF="http://www.paulgraham.com/spam.html">http://www.paulgraham.com/spam.html</A>.
|
|
This program is currently using an improved version of Graham's method, called
|
|
Gary-combining, by Gary Robinson. See <A
|
|
HREF="http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html"
|
|
>http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html</A> for
|
|
Gary's story. There's also an even more improved method called Chi-Squared
|
|
(χ²) combining, which grew from discussions on the <A
|
|
HREF="http://mail.python.org/mailman-21/listinfo/spambayes">Spambayes</A>
|
|
mailing list.
|
|
|
|
<H1><A NAME="Installation"></A>Installation</H1>
|
|
|
|
<OL TYPE="1">
|
|
<LI>Install the BeOS Mail Daemon Replacement (MDR) version 2.0.0 beta 7 or
|
|
later. Beta 3 and later include AGMSBayesianSpam so you don't need to
|
|
worry about incompatible versions, and the MDR will even do some of the
|
|
installation for you. You can get MDR from <A
|
|
HREF="http://www.bebits.com/app/2289">http://www.bebits.com/app/2289</A> or
|
|
get the latest source code and compile it yourself from <A
|
|
HREF="http://sourceforge.net/projects/bemaildaemon"
|
|
>http://sourceforge.net/projects/bemaildaemon</A>.
|
|
<LI>Move the AGMSBayesianSpamServer program to the
|
|
<A HREF="file:/boot/home/config/bin/">/boot/home/config/bin/</A> directory
|
|
(the MDR installer will do this for you). It's
|
|
put there to make it useable from the command line. If you use it
|
|
frequently, you can also add a symbolic link to it in your desktop
|
|
applications menu or to the mail menu.
|
|
<LI>Move the AGMSBayesianSpamFilter mail add-on to the
|
|
<A HREF="file:/boot/home/config/add-ons/mail_daemon/inbound_filters/"
|
|
>/boot/home/config/add-ons/mail_daemon/inbound_filters/</A> directory
|
|
(the MDR installer will do this for you).<BR>
|
|
<IMG SRC="pictures/HomeConfigAddonsMaildaemonInboundfilters.png"
|
|
ALT="[/boot/home/config/add-ons/mail_daemon/inbound_filters/ directory]"
|
|
WIDTH="634" HEIGHT="288">
|
|
<LI><IMG SRC="pictures/CantFindSettings.png" ALT="[Can't Find Settings]"
|
|
WIDTH="326" HEIGHT="120" ALIGN="RIGHT">Set up MIME types and indices (the
|
|
MDR installer will do this step and the next one for you, invisibly). Run
|
|
the AGMSBayesianSpamServer program. It will put up an alert box
|
|
complaining about not finding the settings file. Just hit the Acknowledge
|
|
button to get past it. Then click the "Install MIME Types & Make
|
|
Indices on All Drives" button which does what it says plus it also adds a
|
|
few sound effect names to the system.<BR CLEAR="ALL">
|
|
<IMG SRC="pictures/TheInstallButton.png" ALT="[The Install Button]"
|
|
WIDTH="604" HEIGHT="403">
|
|
<LI>Quit the program (the close box at the top left corner is one way of
|
|
doing that). It will make the settings file and settings directory
|
|
<A HREF="file:/boot/home/config/settings/AGMSBayesianSpam/"
|
|
>/boot/home/config/settings/AGMSBayesianSpam/</A> when it exits.
|
|
<LI>Use the Sounds preferences (or the installsound command) to associate
|
|
the names with your sound files (SoundGenuine, SoundUncertain and SoundSpam
|
|
are included as examples with MDR in the <A
|
|
HREF="file:/boot/home/config/settings/AGMSBayesianSpam/"
|
|
>/boot/home/config/settings/AGMSBayesianSpam/</A> directory), no I don't
|
|
have the rights to the Monty Python Spam skit). If you don't want it to
|
|
make sounds, don't do anything (you can also use the Sounds preferences
|
|
later on to disable or remove the sounds if you get tired of them).<BR>
|
|
<IMG SRC="pictures/StartingSoundPreferences.png" WIDTH="291" HEIGHT="445"
|
|
ALT="[Starting Sound Preferences]"> <IMG
|
|
SRC="pictures/SoundPrefChoosingAFile.png" WIDTH="318" HEIGHT="387"
|
|
ALT="[Sound Preferences Choosing a File]">
|
|
<P>When you're done, it should look something like this:<BR>
|
|
<IMG SRC="pictures/SoundPrefFinished.png" WIDTH="314" HEIGHT="245"
|
|
ALT="[Sound Preferences Finished]">
|
|
<LI>Add some example messages to the database.
|
|
<OL TYPE="A">
|
|
<LI>If you don't want to do this, see step B. You need to add roughly the
|
|
same number of sample Spam messages as you add of genuine e-mail. A few
|
|
hundred of each should do, though you can get useful results with a dozen.
|
|
<P>Run the AGMSBayesianSpamServer program again. This time it shouldn't
|
|
complain. Click the "Create" button to make a new database with the
|
|
default name of "<A
|
|
HREF="file:/boot/home/config/settings/AGMSBayesianSpam/AGMSBayesianSpam%20Database"
|
|
>/boot/home/config/settings/AGMSBayesianSpam/AGMSBayesianSpam Database</A>".
|
|
<P>Use the "Add Example of Spam/Genuine" button, and only select at most
|
|
80 files at a time (otherwise the Tracker/File Requester will lock up and
|
|
you'll have to reboot your computer). It will ask you to identify each
|
|
file as spam or genuine, you also have the choice of identifying a whole
|
|
batch of them as all spam or all genuine.<BR>
|
|
<IMG SRC="pictures/SingleMessageClassificationRequest.png"
|
|
ALT="[Single Message Classification Request]"
|
|
WIDTH="349" HEIGHT="104" ALIGN="LEFT">
|
|
<IMG SRC="pictures/MultipleMessageClassificationRequest.png"
|
|
ALT="[Multiple Message Classification Request]"
|
|
WIDTH="349" HEIGHT="104" ALIGN="RIGHT">
|
|
<BR CLEAR="ALL">
|
|
<P>You can also drag and drop example messages into the bottom half of
|
|
the window. Drop in the left side for genuine, right side for spam, but
|
|
avoid the middle third of the window.<BR>
|
|
<IMG SRC="pictures/DropZones.png" ALT="[Drop Zones]"
|
|
WIDTH="596" HEIGHT="204"><BR>
|
|
<P>If you have thousands of messages, use the command line mode.<BR>
|
|
<IMG SRC="pictures/CommandLineSetSpam.png" ALT="[Command Line Set Spam]"
|
|
WIDTH="634" HEIGHT="205">
|
|
<LI>If you don't have a few hundred spam messages, instead of doing step
|
|
A copy the sample database file to "AGMSBayesianSpam Database" in the <A
|
|
HREF="file:/boot/home/config/settings/AGMSBayesianSpam/"
|
|
>/boot/home/config/settings/AGMSBayesianSpam/</A> directory (the MDR
|
|
installer will do this for you). Due to complaints about the huge file
|
|
size, the sample spam database that comes with MDR is now very small
|
|
(10 spam, 10 genuine example messages), so you'll need to train it before
|
|
it gets accurate (auto-training is your friend). Or you could get the
|
|
huge (976KB, 484 spam, 1009 genuine messages) one from version 2.0.0 Beta
|
|
8, available at: <A
|
|
HREF="http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/*checkout*/bemaildaemon/AGMSBayesianSpamServer/SampleDatabase?rev=release-2-0-0-beta8"
|
|
>http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/*checkout*/bemaildaemon/AGMSBayesianSpamServer/SampleDatabase?rev=release-2-0-0-beta8</A>
|
|
<BR>
|
|
<IMG SRC="pictures/DatabaseFileLocation.png"
|
|
ALT="[Database File Location]" WIDTH="609" HEIGHT="234"><BR>
|
|
Run the AGMSBayesianSpamServer program again. Hit the Purge button
|
|
(because it doesn't load the database until it has to do something). If
|
|
things are working correctly, you should see a list of the words in the
|
|
sample database in the bottom half of the window. An alternative method
|
|
of picking a database file is to double click on it in Tracker, which is
|
|
useful if you don't want to type in the full name.
|
|
</OL>
|
|
<LI>Quit the AGMSBayesianSpamServer program. Delete all the remaining
|
|
files you unzipped from the archive (such as the example database, this
|
|
readme, this documentation, or source code), unless you want to
|
|
keep them around. You will have to decide where to store them; I can't
|
|
tell you everything :-).
|
|
<LI>Start up the E-mail preferences control panel (part of the Mail Daemon
|
|
Replacement project).<BR>
|
|
<IMG SRC="pictures/StartingEMailPreferences.png"
|
|
ALT="[Starting EMail Preferences]" WIDTH="291" HEIGHT="443"><BR>
|
|
Choose the e-mail account you wish to have checked for spam. Then hit the
|
|
Add Filter button to bring up the menu with the list of filters you can
|
|
add, and pick AGMSBayesianSpamFilter.<BR>
|
|
<IMG SRC="pictures/ClickingOnAddFilterShowsList.png"
|
|
ALT="[Clicking On Add Filter Shows List]" WIDTH="454" HEIGHT="413"><BR>
|
|
Remember to click on the filter after you have added it to set the settings
|
|
(though the defaults are useable too).<BR>
|
|
<IMG SRC="pictures/ClickOnFilterNameToGetSettings.png"
|
|
ALT="[Click On Filter Name To Get Settings]" WIDTH="454" HEIGHT="413"><BR>
|
|
Then select the settings you wish. If you installed sound files earlier,
|
|
you can turn on the sound effects here.<BR>
|
|
<IMG SRC="pictures/FilterSettings.png" ALT="[Filter Settings]"
|
|
WIDTH="454" HEIGHT="413">
|
|
<LI>Test it. Send yourself some e-mail and see if it gets rated correctly.
|
|
</OL>
|
|
|
|
|
|
<H1><A NAME="Usage"></A>Usage</H1>
|
|
|
|
<H2><A NAME="Reading"></A>Reading E-Mail</H2>
|
|
|
|
<P>Check for e-mail as usual. If you look at the inbox directory in Tracker,
|
|
you can add an extra column with the E-mail attribute "Spam/Genuine Estimate"
|
|
to see how spammy the messages are. 0.0 means the system thinks the message is
|
|
fully genuine, 1.0 fully spam. But it can be wrong, for things like a friend
|
|
of yours quoting a spam message. For the Chi-squared method (the default), you
|
|
see numbers close to zero for genuine (like 9.750e-13), close to 1 for spam and
|
|
in-between (0.01 to 0.99) if it can't decide. With the Robinson scoring
|
|
method, usually if it is over 0.56 (the best cutoff value depends a bit on your
|
|
database quality, but 0.56 is typical) then it is spam, and the closer it is to
|
|
1.0 the more likely it really is spam.
|
|
|
|
<P>I sort by spam ratio, and manually throw away the messages that are spammy,
|
|
then I switch the Tracker window back to sorting by thread+date (just a click
|
|
on the appropriate column title does it) and get on with reading the mail.
|
|
|
|
<P>If you turned on the filter option to modify the subject, you'll see spam
|
|
messages with something like [Spam 95%] in front of the subject (I don't use it
|
|
because it looks ugly). But only in the Tracker display of the Subject, the
|
|
actual subject inside the message isn't affected, just the MAIL:subject
|
|
attribute, which is what the Tracker shows.
|
|
|
|
<H2><A NAME="Training"></A>Training</H2>
|
|
|
|
<P><EM>The accuracy is only as good as your database</EM>, so update it with
|
|
more example spam and genuine messages. In particular, if it gets the estimate
|
|
wrong, add that message to the database to tell it what it should be doing. A
|
|
quick way to do that is to right click on the e-mail in Tracker, and pick Open
|
|
With... AGMSBayesianSpamServer.<BR>
|
|
<IMG SRC="pictures/SortingInboxBySpamEstimate.png"
|
|
ALT="[Sorting Inbox By Spam Estimate]" WIDTH="831" HEIGHT="361"><BR>
|
|
It should start up and ask you if the message is spam or genuine.<BR>
|
|
<IMG SRC="pictures/SingleMessageClassificationRequest.png"
|
|
ALT="[Single Message Classification Request]" WIDTH="349" HEIGHT="104"><BR>
|
|
You can also drag and drop the message into the left third of the word list for
|
|
genuine messages, or right third for spam messages. Dropping in the middle
|
|
third does something else that's mostly harmless and fun.<BR>
|
|
<IMG SRC="pictures/DropZones.png" ALT="[Drop Zones]" WIDTH="596"
|
|
HEIGHT="204"><BR>
|
|
|
|
<P>You may also want to train it with all your messages (it gives slightly
|
|
better results in the long run than just training on the mistakes). To make it
|
|
easier, turn on the self-training option in the mail filter. It will compute
|
|
the spam ratio of new mail messages, then feed back the same message into the
|
|
database as an example of spam/genuine. When it gets it wrong, you should
|
|
manually retrain it with the correct classification, otherwise the database
|
|
will get worse and worse and finally turn into mush.
|
|
|
|
<H2><A NAME="HidingServer"></A>Hiding the Server Window</H2>
|
|
|
|
<P>If you're annoyed by the server window popping up whenver the system checks
|
|
for e-mail, you can tell it to hide. Just click the "Server Mode" checkbox.
|
|
Actually, that's now the default since people were complaining about the window
|
|
getting in the way. The disadvantage is that you don't get to see error
|
|
messages. To make it visible again, start up AGMSBayesianSpamServer (possibly
|
|
by double clicking on its icon in <A
|
|
HREF="file:/boot/home/config/bin/">/boot/home/config/bin/</A> and bring up the
|
|
hidden window by using the deskbar, or by using the "Edit Server Settings"
|
|
button in the spam filter configuration).<BR>
|
|
<IMG SRC="pictures/MakingTheWindowVisibleFromTheDeskbar.png"
|
|
ALT="MakingTheWindowVisibleFromTheDeskbar" WIDTH="640" HEIGHT="109"><BR>
|
|
|
|
<H2><A NAME="AlexSettings"></A>Alex's Settings</H2>
|
|
|
|
<P>I'm currently using it with these settings: Chi-squared scoring,
|
|
AnyTextAndHeader tokenizing, server mode on, ignore previous classification
|
|
off, mark subject with [Spam %] off, spam cutoff 0.95, genuine below 0.05, no
|
|
words found on, self-training on, close AGMSBayesianSpamServer when Finished
|
|
on. Because of the self training, I always correct it when it gets the
|
|
classification wrong (that means I have to manually delete the messages, can't
|
|
use a Match Header filter to do it). My Tracker window shows the
|
|
Classification Group attribute rather than the Spam/Genuine Estimate number
|
|
(which isn't pretty when using Chi-squared).
|
|
|
|
<H1><A NAME="AdvancedUsage"></A>Advanced Usage</H1>
|
|
|
|
<H2><A NAME="CommandLine"></A>Command Line Mode and Scripting</H2>
|
|
|
|
<P>Besides the graphical user interface, there
|
|
is also a command line mode. Just type "AGMSBayesianSpamServer help"
|
|
in the terminal to get a list of the commands and what they do (the ultimate
|
|
documentation). It also explains all of the mysterious options you see in the
|
|
graphical user interface. The same commands can be used in scripting, either
|
|
from some other program or via the "hey" utility which you can get from <A
|
|
HREF="http://www.bebits.com/app/2042">http://www.bebits.com/app/2042</A>. A
|
|
useful command, if you have a lot of spam messages to add, is
|
|
"AGMSBayesianSpamServer set genuine *" which will use all messages in the
|
|
current directory as examples of genuine text.
|
|
|
|
<PRE>
|
|
Sat Feb 8 16:30:51 274 /tmp>AGMSBayesianSpamServer help
|
|
|
|
AGMSBayesianSpamServer - A Spam Database Server
|
|
Copyright © 2002 by Alexander G. M. Smith. Released to the public domain.
|
|
|
|
Compiled on Feb 8 2003 at 11:13:28. $Revision: 1.11 $ $Header:
|
|
/cvsroot/bemaildaemon/AGMSBayesianSpamServer/AGMSBayesianSpamServer.cpp,v 1.77
|
|
2003/01/22 03:19:48 agmsmith Exp $
|
|
|
|
This is a program for classifying e-mail messages as spam (junk mail which
|
|
you don't want to read) and regular genuine messages. It can learn what's
|
|
spam and what's genuine. You just give it a bunch of spam messages and a
|
|
bunch of non-spam ones. It uses them to make a list of the words from the
|
|
messages with the probability that each word is from a spam message or from
|
|
a genuine message. Later on, it can use those probabilities to classify
|
|
new messages as spam or not spam. If the classifier stops working well
|
|
(because the spammers have changed their writing style and vocabulary, or
|
|
your regular correspondants are writing like spammers), you can use this
|
|
program to update the list of words to identify the new messages
|
|
correctly.
|
|
|
|
The original idea was from Paul Graham's algorithm, which has an excellent
|
|
writeup at: http://www.paulgraham.com/spam.html
|
|
|
|
Gary Robinson came up with the improved algorithm, which you can read about at:
|
|
http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html
|
|
|
|
Then he, Tim Peters and the SpamBayes mailing list developed the Chi-Squared
|
|
test, see http://mail.python.org/pipermail/spambayes/2002-October/001036.html
|
|
for one of the earlier messages leading from the central limit theorem to
|
|
the current chi-squared scoring method.
|
|
|
|
Thanks go to Isaac Yonemoto for providing a better icon.
|
|
|
|
Usage: Specify the operation as the first argument followed by more
|
|
information as appropriate. The program's configuration will affect the
|
|
actual operation (things like the name of the database file to use, or
|
|
whether it should allow non-email messages to be added). In command line
|
|
mode it will do the operation and exit. In GUI/server mode a command line
|
|
invocation will just send the command to the running server. You can also
|
|
use BeOS scripting (see the "Hey" command which you can get from
|
|
http://www.bebits.com/app/2042 ) to control the Spam server. And finally,
|
|
there's also a GUI interface which shows up if you start it without any
|
|
command line arguments.
|
|
|
|
Commands:
|
|
|
|
Quit
|
|
Stop the program. Useful if it's running as a server.
|
|
|
|
Get DatabaseFile
|
|
Get the pathname of the current database file. The default name is something
|
|
like B_USER_SETTINGS_DIRECTORY / AGMSBayesianSpam / AGMSBayesianSpamServer
|
|
Database
|
|
|
|
Set DatabaseFile NewValue
|
|
Change the pathname of the database file to use. It will automatically be
|
|
converted to an absolute path name, so make sure the parent directories exist
|
|
before setting it. If it doesn't exist, you'll have to use the create command
|
|
next.
|
|
|
|
Create DatabaseFile
|
|
Creates a new empty database, will replace the existing database file too.
|
|
|
|
Delete DatabaseFile
|
|
Deletes the database file and all backup copies of that file too. Really only
|
|
of use for uninstallers.
|
|
|
|
Count DatabaseFile
|
|
Returns the number of words in the database.
|
|
|
|
Set Spam NewValue
|
|
Adds the spam in the given file (specify full pathname to be safe) to the
|
|
database. The words in the files will be added to the list of words in the
|
|
database that identify spam messages. The files processed will also have the
|
|
attribute MAIL:classification added with a value of "Spam" or "Genuine" as
|
|
specified. They also have their spam ratio attribute updated, as if you had
|
|
also used the Evaluate command on them. If they already have the
|
|
MAIL:classification attribute and it matches the new classification then they
|
|
won't get processed (and if it is different, they will get removed from the
|
|
statistics for the old class and added to the statistics for the new one).
|
|
You can turn off that behaviour with the IgnorePreviousClassification
|
|
property. The command line version lets you specify more than one pathname.
|
|
|
|
Count Spam
|
|
Returns the number of spam messages in the database.
|
|
|
|
Set SpamString NewValue
|
|
Adds the spam in the given string (assumed to be the text of a whole e-mail
|
|
message, not just a file name) to the database.
|
|
|
|
Set Genuine NewValue
|
|
Similar to adding spam except that the message file is added to the genuine
|
|
statistics.
|
|
|
|
Count Genuine
|
|
Returns the number of genuine messages in the database.
|
|
|
|
Set GenuineString NewValue
|
|
Adds the genuine message in the given string (assumed to be the text of a
|
|
whole e-mail message, not just a file name) to the database.
|
|
|
|
Set IgnorePreviousClassification NewValue
|
|
If set to true then the previous classification (which was saved as an
|
|
attribute of the e-mail message file) will be ignored, so that you can add the
|
|
message to the database again. If set to false (the normal case), the
|
|
attribute will be examined, and if the message has already been classified as
|
|
what you claim it is, nothing will be done. If it was misclassified, then the
|
|
message will be removed from the statistics for the old class and added to the
|
|
stats for the new classification you have requested.
|
|
|
|
Get IgnorePreviousClassification
|
|
Find out the current setting of the flag for ignoring the previously recorded
|
|
classification.
|
|
|
|
Set ServerMode NewValue
|
|
If set to true then error messages get printed to the standard error stream
|
|
rather than showing up in an alert box. It also starts up with the window
|
|
minimized.
|
|
|
|
Get ServerMode
|
|
Find out the setting of the server mode flag.
|
|
|
|
Flush
|
|
Writes out the database file to disk, if it has been updated in memory but
|
|
hasn't been saved to disk. It will automatically get written when the program
|
|
exits, so this command is mostly useful for server mode.
|
|
|
|
Set PurgeAge NewValue
|
|
Sets the old age limit. Words which haven't been updated since this many
|
|
message additions to the database may be deleted when you do a purge. A good
|
|
value is 1000, meaning that if a word hasn't appeared in the last 1000
|
|
spam/genuine messages, it will be forgotten. Zero will purge all words, 1
|
|
will purge words not in the last message added to the database, 2 will purge
|
|
words not in the last two messages added, and so on. This is mostly useful
|
|
for removing those one time words which are often hunks of binary garbage, not
|
|
real words. This acts in combination with the popularity limit; both
|
|
conditions have to be valid before the word gets deleted.
|
|
|
|
Get PurgeAge
|
|
Gets the old age limit.
|
|
|
|
Set PurgePopularity NewValue
|
|
Sets the popularity limit. Words which aren't this popular may be deleted
|
|
when you do a purge. A good value is 5, which means that the word is safe
|
|
from purging if it has been seen in 6 or more e-mail messages. If it's only
|
|
in 5 or less, then it may get purged. The extreme is zero, where only words
|
|
that haven't been seen in any message are deleted (usually means no words).
|
|
This acts in combination with the old age limit; both conditions have to be
|
|
valid before the word gets deleted.
|
|
|
|
Get PurgePopularity
|
|
Gets the purge popularity limit.
|
|
|
|
Purge
|
|
Purges the old obsolete words from the database, if they are old enough
|
|
according to the age limit and also unpopular enough according to the
|
|
popularity limit.
|
|
|
|
Get Oldest
|
|
Gets the age of the oldest message in the database. It's relative to the
|
|
beginning of time, so you need to do (total messages - age - 1) to see how
|
|
many messages ago it was added.
|
|
|
|
Set Evaluate NewValue
|
|
Evaluates a given file (by path name) to see if it is spam or not. Returns
|
|
the ratio of spam probability vs genuine probability, 0.0 meaning completely
|
|
genuine, 1.0 for completely spam. Normally you should safely be able to
|
|
consider it as spam if it is over 0.56 for the Robinson scoring method. For
|
|
the ChiSquared method, the numbers are near 0 for genuine, near 1 for spam,
|
|
and anywhere in the middle means it can't decide. The program attaches a
|
|
MAIL:ratio_spam attribute with the ratio as its float32 value to the file.
|
|
Also returns the top few interesting words in "words" and the associated
|
|
per-word probability ratios in "ratios".
|
|
|
|
Set EvaluateString NewValue
|
|
Like Evaluate, but rather than a file name, the string argument contains the
|
|
entire text of the message to be evaluated.
|
|
|
|
ResetToDefaults
|
|
Resets all the configuration options to the default values, including the
|
|
database name.
|
|
|
|
InstallThings
|
|
Creates indices for the MAIL:classification and MAIL:ratio_spam attributes on
|
|
all volumes which support BeOS queries, identifies them to the system as
|
|
e-mail related attributes (modifies the text/x-email MIME type), and sets up
|
|
the new MIME type (text/x-vnd.agmsmith.spam_probability_database) for the
|
|
database file. Also registers names for the sound effects used by the
|
|
separate filter program (use the installsound BeOS program or the Sounds
|
|
preferences program to associate sound files with the names).
|
|
|
|
Set TokenizeMode NewValue
|
|
Sets the method used for breaking up the message into words. Use "Whole" for
|
|
the whole file (also use it for non-email files). The file isn't broken into
|
|
parts; the whole thing is converted into words, headers and attachments are
|
|
just more raw data. Well, not quite raw data since it converts
|
|
quoted-printable codes (equals sign followed by hex digits or end of line) to
|
|
the equivalent single characters. "PlainText" breaks the file into MIME
|
|
components and only looks at the ones which are of MIME type text/plain.
|
|
"AnyText" will look for words in all text/* things, including text/html
|
|
attachments. "AllParts" will decode all message components and look for words
|
|
in them, including binary attachments. "JustHeader" will only look for words
|
|
in the message header. "AllPartsAndHeader", "PlainTextAndHeader" and
|
|
"AnyTextAndHeader" will also include the words from the message headers.
|
|
|
|
Get TokenizeMode
|
|
Gets the method used for breaking up the message into words.
|
|
|
|
Set ScoringMode NewValue
|
|
Sets the method used for combining the probabilities of individual words into
|
|
an overall score. "Robinson" mode will use Gary Robinson's nth root of the
|
|
product method. It gives a nice range of values between 0 and 1 so you can
|
|
see shades of spaminess. The cutoff point between spam and genuine varies
|
|
depending on your database of words (0.56 was one point in some experiments).
|
|
"ChiSquared" mode will use chi-squared statistics to evaluate the difference
|
|
in probabilities that the lists of word ratios are random. The result is very
|
|
close to 0 for genuine and very close to 1 for spam, and near the middle if it
|
|
is uncertain.
|
|
|
|
Get ScoringMode
|
|
Gets the method used for combining the individual word ratios into an overall
|
|
score.
|
|
|
|
ProcessArgs: The property specified isn't known or doesn't support the requested action (usually means it is an unknown command), error code $FFFFFFFF/-1 (General OS error) has occured.
|
|
AGMSBayesianSpamServer shutting down...
|
|
Sat Feb 8 16:30:58 275 /tmp>
|
|
</PRE>
|
|
<!-- End the C style comment which makes editing this look bad with BeIDE's syntax colouring. */ -->
|
|
|
|
<H2><A NAME="Spreadsheet"></A>Using a Spreadsheet to Examine Word Statistics</H2>
|
|
|
|
<P>Another advanced trick is to load the list of words into Gobe Productive's
|
|
spreadsheet, so that you can find the most popular word or chart the word
|
|
frequencies. Unfortunately it can only handle about 16000 words. To do that,
|
|
start up Gobe Productive, pick Open, then from the file requester's "Document
|
|
Type" menu, pick "Spreadsheet" and then in the submenu pick "Tab-delimited
|
|
text". Then navigate to the database, the default location is "<A
|
|
HREF="file:/boot/home/config/settings/AGMSBayesianSpam/AGMSBayesianSpam%20Database"
|
|
>/boot/home/config/settings/AGMSBayesianSpam/AGMSBayesianSpam Database</A>".
|
|
Have fun!
|
|
|
|
|
|
<H2><A NAME="WordDisplay"></A>Understanding and Using the Word Display</H2>
|
|
|
|
<P><IMG ALIGN="RIGHT" SRC="pictures/WordDisplay.png" WIDTH="285" HEIGHT="645"
|
|
ALT="[Narrow Word Display Window]">The word display tells you more than you
|
|
need to know about the words in the database. Those colour bars actually mean
|
|
something.
|
|
|
|
<P>Obviously words which are more genuine than spamish show up in <FONT
|
|
COLOR="BLUE">blue</FONT>, while spammier words are in <FONT
|
|
COLOR="RED">red</FONT>. It's proportionally based on the total message counts
|
|
so that a word which shows up in 10% of the genuine messages and 9% of the spam
|
|
will show up in blue, even if it was in more spam messages than genuine
|
|
messages (this compensates a bit for not training on an equal number of spam
|
|
and genuine messages). The length of the bar shows the ratio of the
|
|
proportions; further to the left for larger genuine proportions, and similarly
|
|
further right for larger spam proportions.
|
|
|
|
<P>The thickness of the bar shows how many messages the word was found in.
|
|
It's kind of a weight, saying how frequently used that word is and thus how
|
|
significant it is.
|
|
|
|
<P>The paleness of the bar shows you how old that word is. A light colour
|
|
means that the word was last added to the database long ago. A darker, more
|
|
saturated colour means that the word was added more recently, when you added
|
|
example messages to the database.
|
|
|
|
<P>Finally, if you click on the word display, the background will change from a
|
|
pale blue tint into solid white, to show you that it is the active keyboard
|
|
focus. That means you can type in letters to find a particular word (delay for
|
|
one second to start typing the letters for a new word). The arrow keys, page
|
|
up/down keys and the mouse scroll wheel also show you different words. Sorry,
|
|
there's no scroll bar since finding the Nth word is a slow operation with a set
|
|
of words (they aren't numbered); each twitch of the scroll bar would mean going
|
|
through the list of tens or even hundreds of thousands of words and counting to
|
|
find the scroll position.
|
|
|
|
<BR CLEAR="ALL">
|
|
|
|
<H2><A NAME="Tokenizing"></A>Tokenizing Modes Compared</H2>
|
|
|
|
<P>I did some tests with tokenizing different parts of mail messages to see
|
|
what would work best.
|
|
|
|
<P>The Database:
|
|
<BR>341 training genuine messages, 406 training spam messages (or 398 when
|
|
parsing due to a bug (fixed later on in 2.0.0b5) with messages that don't have
|
|
body text).
|
|
<BR>40 test genuine messages, 40 test spam messages, all more recent than the
|
|
training ones.
|
|
<BR>Spam threshold is 0.56, Gary-combining method.
|
|
|
|
<P>The results:
|
|
|
|
<TABLE BORDER="2" SUMMARY="[Table showing results of different tokenizing methods]">
|
|
<TR><TH>Tokenizing Method
|
|
<TH>Genuine Test Details
|
|
<TH>Genuine Accuracy
|
|
<TH>Spam Test Details
|
|
<TH>Spam Accuracy
|
|
|
|
<TR><TD>Just headers
|
|
<TD>Genuine .181352 to .557881, one false positive (a mailbox full announcement).
|
|
<TD>2.5% wrong.
|
|
<TD>Spam .450602 to .750511, 21 false negatives.
|
|
<TD>52.5% wrong.
|
|
|
|
<TR><TD>Whole raw message text
|
|
<TD>Genuine .163027 to .627022, 3 false positives.
|
|
<TD>7.5% wrong.
|
|
<TD>Spam .509355 to .993985, 1 false negative.
|
|
<TD>2.5% wrong.
|
|
|
|
<TR><TD>Message parsed into parts plus header
|
|
<TD>Genuine .168857 to .609005, 4 false positives.
|
|
<TD>10% wrong.
|
|
<TD>Spam .614564 to .994364, 0 false negatives.
|
|
<TD>0% wrong.
|
|
|
|
<TR><TD>Message parsed into parts, no header data
|
|
<TD>Genuine .220161 to .631161, 5 false positives.
|
|
<TD>12.5% wrong.
|
|
<TD>Spam .592501 to .994444, 0 false negatives.
|
|
<TD>0% wrong.
|
|
|
|
<TR><TD>Any text parts and header
|
|
<TD>Genuine .162697 to .614136, 4 false positives.
|
|
<TD>10% wrong.
|
|
<TD>Spam .614973 to .994362, 0 false negatives.
|
|
<TD>0% wrong.
|
|
|
|
<TR><TD>Any text parts, no headers
|
|
<TD>Genuine .221923 to .635487, 6 false positives.
|
|
<TD>15% wrong.
|
|
<TD>Spam .594271 to .994441, 0 false negatives.
|
|
<TD>0% wrong.
|
|
|
|
<TR><TD>text/plain parts (including body text)
|
|
<TD>Genuine .137869 to .583192, 3 false positives.
|
|
<TD>7.5% wrong.
|
|
<TD>Spam .448059 to .994119, 17 false negatives.
|
|
<TD>42.5% wrong.
|
|
|
|
<TR><TD>Only text/plain sub-parts, no headers.<BR>
|
|
150 spam and 1 genuine training message had no words!
|
|
<TD>Genuine .219169 to .696899, 9 false positives.
|
|
<TD>22.5% wrong.
|
|
<TD>Spam .660755 to .994116, 0 false negatives, 27 had no words.
|
|
<TD>0% wrong.
|
|
</TABLE>
|
|
|
|
<P>The results look good for the whole message tokenizing method (which also
|
|
works on non-email files) and for the all text parts plus header. Since the
|
|
text parts method doesn't add lots of garbage words to the database from trying
|
|
to find words in binary attachments, it's now the default setting.
|
|
|
|
<P>The header only method is pretty good too for identifying genuine messages,
|
|
and so-so for spam messages. That may make it useable for pre-download tests
|
|
(delete some of the spam on the mail server before downloading it, without
|
|
worrying about deleting too many genuine messages).
|
|
|
|
|
|
<H2><A NAME="HeadersOnly"></A>High Speed and High Danger - Headers Only Trick</H2>
|
|
|
|
<P>If you have a slow dial-up connection, you may wish to classify your mail
|
|
quickly by deleting spam messages without downloading the entire junk message.
|
|
|
|
<P><IMG SRC="pictures/ChoosingJustHeaderTokenizingMode.png" WIDTH="920"
|
|
HEIGHT="402" ALT="[Choosing JustHeader Tokenizing Mode]">
|
|
|
|
<P><IMG SRC="pictures/DangerousMatchFilter.png" WIDTH="276" HEIGHT="243"
|
|
ALIGN="RIGHT" ALT="[Dangerous Match Filter]">This can be done with three
|
|
settings. First switch the AGMSBayesianSpamServer into tokenizing just the
|
|
headers. Then go into the E-mail preferences and add an
|
|
AGMSBayeisianSpamFilter with the "Add [Spam %] in Front of Subject" option
|
|
turned on, and the ratio set to a nice safe high level like 0.95 (so that your
|
|
genuine mail is less likely to get deleted, but it will still delete the 1% of
|
|
your real mail that looks like spam, which is why this is dangerous). Do not
|
|
turn on self-training, since you can't manually correct it. Finally in the
|
|
E-mail preferences, add a "Match Header" filter after the spam filter and set
|
|
it so that If <B>Subject</B> is <B>\[Spam*</B> then <B>Delete Message</B>.
|
|
That's backslash, left square bracket, Spam with the S capitalised, asterix.
|
|
Now it will download the headers, check them against the spam database, and
|
|
then delete the spam ones on the server without downloading the rest of their
|
|
contents.
|
|
|
|
<P>You should also make a new spam database trained in Just Headers tokenizing
|
|
mode with roughly equal examples of your genuine messages and spam messages (50
|
|
of each should be enough to start). A full message database may also work, but
|
|
headers only training should be more accurate for headers only decisions. When
|
|
testing JustHeader mode, I noticed that the false positive rate (genuine
|
|
reported as spam) is nice and low, but the false negative rate (spam reported
|
|
as genuine) is high (tested with Robinson scoring, not Chi-Squared scoring).
|
|
So this means JustHeader mode will delete maybe half the spam (and download the
|
|
rest) and also delete the occasional genuine message.
|
|
|
|
|
|
<H1><A NAME="ChangeLog"></A>Change Log</H1>
|
|
|
|
<P>The various versions released to the public. These are actually several
|
|
accumulated minor changes, which you can see by looking at the log in the top of the
|
|
source code files.
|
|
|
|
<UL>
|
|
<LI>Version 1.77 changed the tokenizing to not convert words to lower case,
|
|
the case is important for spam! Minimize the window before opening it so
|
|
that it doesn't flash on the screen in server mode. Also load the database
|
|
when the window is displayed so that the user can see the words.
|
|
|
|
<LI>Version 1.73 added self training support and the Chi-Squared scoring
|
|
method.
|
|
|
|
<LI>Version 1.68 nothing significant changed. Just very minor tweaking.
|
|
|
|
<LI>Version 1.65 added a time delay for exiting the program. This is so that
|
|
multiple e-mail accounts can simultaneously download mail, without having the
|
|
server close when one of the accounts finishes downloading. Scripting
|
|
requests that come in while it is counting down to quitting time will cancel
|
|
the countdown. In the belt <I>and</I> suspenders department, the filter has
|
|
been enhanced to try starting up the server up to three times.
|
|
|
|
<LI>Version 1.60 got rid of the need to use a modified Inbox filter for MDR
|
|
(found out the correct way of setting attributes on a message), added sound
|
|
effects, and added parsing of mail messages (parsing MIME headers, decoding
|
|
base64, quoted-printable and converting character sets to UTF-8 for text, all
|
|
thanks to using the MDR mail kit, which you now need since it uses their
|
|
libmail.so code library). There are now new options for selecting what kind
|
|
of parsing to do (text/plain or text/* or */* attachments, with or without
|
|
headers, etc). Plus sound effect options. The sample database has also been
|
|
updated to use text/* plus headers tokenization, which makes it slightly
|
|
smaller.<!-- End the C style comment which makes editing this look bad with
|
|
BeIDE's syntax colouring. */ -->
|
|
|
|
<LI>Version 1.49 switched to Gary Robinson's method for calculating spam
|
|
ratios. The overall results are about the same but you have less false
|
|
positives and the numbers are spread more evenly between 0.0 and 1.0 than
|
|
with Paul Graham's method (change the E-mail preferences filter setting
|
|
cutoff point to 0.56, adjust as needed). Also, as "jaf" requested, you can
|
|
now drag and drop messages into the word list - drop in the left third to use
|
|
it as an example of genuine messages, right third for spam, and middle third
|
|
to get an evaluation of a message's spaminess. Also a useless command was
|
|
removed. Updated files (replace your existing copies): AGMSBayesianSpam
|
|
Database, AGMSBayesianSpamFilter, AGMSBayesianSpamServer.
|
|
|
|
<LI>Version 1.47 was the first public (and working) version. It used Paul
|
|
Graham's algorithm with a few simplifications.
|
|
</UL>
|
|
|
|
<P>Released to the public domain in 2002 by the author, Alexander G. M. Smith.
|
|
</BODY>
|
|
</HTML>
|