Filtering Out Spam
April 6, 2017
6:16pm
© 2017 Avinash Kak, Purdue University
Goals:
• Spam and computer security
• How I read my email
• The acronyms MTA, MSA, MDA, MUA, etc.
• Structure of email messages
• How spammers alter email headers
• A very brief introduction to regular expressions
• An overview of procmail based spam filtering
• Writing Procmail recipes
Computer and Network Security by Avi Kak Lecture 31
departments advertising their activities, and students in various parts of the world seeking to come to
Purdue. Even just opening all of these messages would consume a significant portion of each day. ]
These filters are also sometimes called Bayesian filters for blocking
spam. A statistical filter with sufficiently low “falses” to suit
my tastes would require too many samples of a certain type of
spam before blocking such messages in the future. On the other
hand, with a regular-expression based filter, once you see a spam
message that has leaked through, it is not that difficult to figure
out variations on that message that the spammers may use in the
future. In many cases, you can design a short regular expression
to block the email you just saw and all its variations that the
spammer may use in the future in just one single step.
• Spam filters that are close to 100% effective for your specific needs
in the sense defined above can only be built slowly. My spam filter
has evolved over several years. It needs to be tweaked up every
once in a while as spammers discover new ways of delivering their
unwelcome goods.
• These days most folks read their email through web based mail
clients. If you are at Purdue, in all likelihood, you log into Pur-
due’s webmail service to check your email. Or, perhaps, you have
it forwarded to your email account at a third party service such
as that provided by gmail or yahoomail. This way of reading
email is obviously convenient for, say, English ma-
jors. However, if you happen to be a CS or a CompE
major, that is not the way to receive and send your
email.
• The web based email tools can only filter out standard spam,
that is, the usual spam about fake drugs, about how you can enlarge
certain parts of your body, and things of that sort. But
nowadays there is another kind of spam that is just as much of a
nuisance. As mentioned in the previous section, you have gener-
ally well-meaning folks (and organizations) who want to keep you
informed of all the great stuff they are engaged in and why you
should check out their latest doings. These include local busi-
nesses, marketing companies, PR folks, etc. When you write
your own spam filter, you can deal with such email in
a much more selective manner than would otherwise
be the case.
Although the main function of an MTA is to exchange email with another MTA, an MTA can also be programmed
to receive email directly from MUAs and to send messages directly to the same. More generally, the client
email first goes to an MSA (Mail Submission Agent) and the MSA forwards it to the MTA. By the same token,
when an MTA receives email for clients in its own domain, it generally forwards the email to an MDA (Mail
Delivery Agent) and it is the MDA’s job to send that email to the clients. However, an MTA can also be
[Figure labels: Internet; My Spam Filter (the procmail program on the Engineering Computer Network); the email client on the laptop, Thunderbird, picks up the email from the mailbox /var/mail/kak in the laptop and makes it available to me through a visual interface.]
the email sent to me. The most popular program that is used
as an MTA is known as Sendmail. Other MTAs include MMDF,
Postfix, Smail, Qmail, Zmailer, Exchange, etc.
] On Linux/Unix platforms,
sending the messages to the clients in the local network.
• The rest of this section is for folks who wish to use the Thun-
derbird MUA on their Ubuntu laptop (or other mobile devices
based on Ubuntu) to pick up email from a designated maildrop
machine (and to also deliver the outgoing email emanating from
your laptop to the SMTP server running on the maildrop ma-
chine or elsewhere in the internet). The material that follows is
particularly applicable if you want your spam filter to do its job
in the maildrop machine itself. That is, you want the incoming
email to be filtered before it is made available for pickup at the
for input on port 25, picks up the messages offered by fetchmail and
deposits them in the laptop’s mailbox /var/mail/user name.
– I did not have to change anything in sendmail’s very large config
files for the above mentioned behavior by sendmail.
– The remaining issue is to get Thunderbird (TB) to work off the mail-
box /var/mail/user name in the laptop itself. [To get the TB email
client to work directly off an IMAP server on a remote maildrop ma-
chine is easy. All you have to do is to enter the IMAP server infor-
mation and your email address in the remote machine directly in the
initial welcome screen you see when you bring up TB in the laptop.
But, for reasons already explained, that’s not what I wanted.] To
get TB to work with the local (meaning, on the laptop itself) mailbox
/var/mail/user name, you have to work off the Edit menubutton at
the top of the TB GUI and select “Account Settings...” from its drop-
down menu. After you click on this selection, you click on “Add Other
Account”. That brings up a popup, in which you click on “Choose
Unix Movemail” and hit “next” and so on. This process will also
prompt you for the SMTP server for the outgoing email, which in my
case happened to be smtp.ecn.purdue.edu. [It is choosing “Unix Move-
mail” that causes the TB client to work off the mailbox /var/mail/user name on
the laptop itself.]
– You might ask: What is Movemail? [Before I realized what Movemail was,
the TB would display in the GUI my [email protected] account that I had created
as described above, but without the Inbox, Sent, Trash, etc., folders.]
As
it turns out, for the TB GUI to make available the Inbox, Sent,
Trash, etc., folders, you need to have previously installed the Gnu
email utilities that are included in the mailutils package that you
can install through the Synaptic Package Manager. Movemail is one
of the utilities in this package. The purpose of Movemail — more
– One more thing: You will also be asked for the SSL/TLS based au-
thorizations for SMTP in a screen that you’ll see after you provide
information about the SMTP server.
body: This is the part that carries the message of the email. It
may also contain multimedia objects.
rules can be based on just the header, or just the body, or both. For a spam filter rule meant for just the
header, the pattern matching operations of the rule are applied to just the header portion of the emails. ]
]
ers. It consists of the “conversation” that takes place be-
tween a sender MTA and a receiver MTA involving recipient
authentication, etc.
• For the email shown above, here is a printout of what was actually
sent by the MTA to the MDA:
Message-Id: <[email protected]>
Received: from lulu.it.northwestern.edu (localhost [127.0.0.1]) by lulu.it.northwester
id xma028114; Sat, 14 Feb 04 19:06:56 -0600
Content-Type: text/plain
Content-Disposition: inline
Content-Transfer-Encoding: binary
X-Originating-Ip: 165.124.28.55
Priority: 3 (Normal)
X-Webmail-User: cdo388@localhost
To: [email protected]
X-Priority: 3 (Normal)
MIME-Version: 1.0
X-Http_host: lulu.it.northwestern.edu
From: [email protected]
Subject: Re: hi...
Date: Sat, 14 Feb 2004 19:06:56 -0600
Reply-To: [email protected]
X-Mailer: EMUmail 5.2.7 (UA Mozilla/4.0 (compatible; MSIE 6.0; Windows NT
5.1; .NET CLR 1.1.4322))
X-Virus-Scanned-ECN: by AMaVIS version 11 (perl 5.8) (http://amavis.org/)
• Also note that the name of the final recipient is present in the
conversation that takes place between the MTAs at the Northwestern
end and at Purdue’s fairway.ecn.purdue.edu machine.
The name of the recipient is also present in the conversation that
takes place between Purdue’s fairway machine and the local
RVL4 machine.
• So you can see why you can get email even if your name shows
up nowhere in any of the headers you can see on your computer.
Here is an example of one such spam email I received:
--0.D6.._EF0B97BFE__AA._6_
Content-Type: text/plain;
Content-Transfer-Encoding: quoted-printable
<html>
<TABLE cellpadding=3D’0’ cellspacing=3D’0’ border=3D0 align=3D’center’>=
<TR>
<TD height=3D’50’ bgcolor=3D’#FFFFFF’ align=3D’center’ valign=3D=
’middle’>
<a href=3D"http://nipponbog.com/partner/recom.asp?recome_id=3Dstart"=
target=3D"_blank"><img src=3D"http://nipponbog.com/partner/email/email2=
/1.jpg" border=3D"0"></a>
</TD>
</TR>
</TABLE>
</html>
oada slh vwudbxr sodb frjmh
bs arf
ohf
vjkutctg
yzmyzfuwjadg
ua
uq ffwd
uh
--0.D6.._EF0B97BFE__AA._6_--
Return-Path: [email protected]
Delivery-Date: Sat Feb 14 20:07:06 2004
Received: from fairway.ecn.purdue.edu (fairway.ecn.purdue.edu [128.46.125.96])
by rvl4.ecn.purdue.edu (8.12.10/8.12.10) with ESMTP id i1F1758Y006551
(version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NOT)
for <[email protected]>; Sat, 14 Feb 2004 20:07:06 -0500 (EST)
Received: from lulu.it.northwestern.edu (lulu.it.northwestern.edu [129.105.16.54])
by fairway.ecn.purdue.edu (8.12.10/8.12.10) with ESMTP id i1F172gN003361
for <[email protected]>; Sat, 14 Feb 2004 20:07:02 -0500 (EST)
Received: (from mailnull@localhost)
by lulu.it.northwestern.edu (8.12.10/8.12.10) id i1F1718S028285
for <[email protected]>; Sat, 14 Feb 2004 19:07:01 -0600 (CST)
Message-Id: <[email protected]>
Received: from lulu.it.northwestern.edu (localhost [127.0.0.1]) by lulu.it.northwester
id xma028114; Sat, 14 Feb 04 19:06:56 -0600
Content-Type: text/plain
Content-Disposition: inline
Content-Transfer-Encoding: binary
X-Originating-Ip: 165.124.28.55
Priority: 3 (Normal)
X-Webmail-User: cdo388@localhost
To: [email protected]
X-Priority: 3 (Normal)
MIME-Version: 1.0
X-Http_host: lulu.it.northwestern.edu
From: [email protected]
Subject: Re: hi...
Date: Sat, 14 Feb 2004 19:06:56 -0600
Reply-To: [email protected]
X-Mailer: EMUmail 5.2.7 (UA Mozilla/4.0 (compatible; MSIE 6.0; Windows NT
• With regard to the printout shown above, recall I said earlier that
for an email to be legal, its first line must start with “From”,
which in turn must be followed by a blank space. The printout
is meant to convey to you the fact that an MUA may modify the
very first “From” line into two separate lines, one for “Return-
Path” and the other for “Delivery-Date”.
Return-Path: [email protected]
Delivery-Date: Sun Apr 4 12:36:10 2010
Received: from mx03.ecn.purdue.edu (mx03.ecn.purdue.edu [128.46.105.218])
by rvl4.ecn.purdue.edu (8.14.4/8.14.4) with ESMTP id o34GaAhE013679
(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NOT)
for <[email protected]>; Sun, 4 Apr 2010 12:36:10 -0400 (EDT)
Received: from 114-24-88-69.dynamic.hinet.net (114-24-88-69.dynamic.hinet.net [114.24.88.69])
by mx03.ecn.purdue.edu (8.14.4/8.14.4) with ESMTP id o34GZ2k8020095;
Sun, 4 Apr 2010 12:35:23 -0400
Received: from 114.24.88.69 by e33.co.us.ibm.com; Mon, 5 Apr 2010 00:34:59 +0800
Message-ID: <000d01cad414$c4404060$6400a8c0@cossacksrg1>
From: "Minerva Souza" <[email protected]>
To: <[email protected]>
Subject: ecn.purdue.edu account notification
Date: Mon, 5 Apr 2010 00:34:59 +0800
MIME-Version: 1.0
Content-Type: multipart/mixed;
boundary="----=_NextPart_000_0006_01CAD414.C4404060"
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2900.2180
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.2180
X-ECN-MailServer-VirusScanned: by amavisd-new
X-ECN-MailServer-Origination: 114-24-88-69.dynamic.hinet.net [114.24.88.69]
X-ECN-MailServer-SpamScanAdvice: DoScan
Status: RO
X-Status:
X-Keywords:
X-UID: 7
------=_NextPart_000_0006_01CAD414.C4404060
Content-Type: text/plain;
format=flowed;
charset="iso-8859-1";
reply-type=original
Content-Transfer-Encoding: 7bit
Dear Customer,
This e-mail was send by ecn.purdue.edu to notify you that we have temporanly prevented access to your account.
We have reasons to beleive that your account may have been accessed by someone else. Please run attached file a
(C) ecn.purdue.edu
------=_NextPart_000_0006_01CAD414.C4404060
Content-Type: application/zip;
name="Instructions.zip"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
filename="Instructions.zip"
UEsDBBQAAgAIAFkQhDwZeJaCR18AADVzAAAQAAAASW5zdHJ1Y3Rpb25zLmV4Ze38BVQfTbcnjP5x
CO4ElwDBHUJwtxDc3d3d3d3dXQNBA8EhENzd3R0S/DZPnvOe98icO3dm7pr5vjW1dknvqv5tqapd
3f1nIa0eC4IAgUCQQH55AYGaQX8SP+j/e3odi0TUggSqhxshaQb7NEKiaGrmQGxrb2Nir2dFbKBn
bW3jSKxvRGzvZE1sZk0sLKNAbGVjaESPiPjmHeh/LsmKgECfwKBAyFiNUv/CWwchg8GDQSH8ZRDK
30yIvzP031aBgf7KkH93/0sNcvx7HJDA/ypR/sZA+QcWyj/JJwbwuF8bsCCQLiLof10CcIn/i256
RyPXV1WNwf/JNoh/Owa4X5fe3lDPUQ8Euv0b8y+7of/tOMAb/PR/hv2xBebvcTD/YVwnvb2DvQHo
.....
.....
.....
------=_NextPart_000_0006_01CAD414.C4404060--
• If you examine the headers, you will see that the email was
generated by 114.24.88.69. If you enter this address in the
http://www.ip2location.com window, you will see that this address belongs
to “Chunghwa Telecom Data Communication Business Group”
• You will also notice in the email message shown above that it con-
tains a fake “Received: from” line that seems to indicate that
the email was received by a server named e33.co.us.ibm.com
from the address 114.24.88.69 in Taiwan. This line is fake be-
cause higher up in the email header you can see that the mail
exchange server for the ecn.purdue.edu domain received the
email directly from 114.24.88.69.
attachment consisting of just ‘.exe’ executables. But then I would not have found this gem. ]
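One way to peek inside the zip attachment without unpacking it is to decode only the first Base64 line of the attachment shown in the email above. What follows is a minimal Python sketch, not a full archive parser. The helper name read_zip_local_header is hypothetical, and the sketch assumes the standard layout of a ZIP local file header: a 30-byte fixed part followed by the name of the archived file.

```python
import base64
import struct

# First Base64 line of the attachment in the email shown above.
FIRST_LINE = ("UEsDBBQAAgAIAFkQhDwZeJaCR18AADVzAAAQAAAASW5zdHJ1"
              "Y3Rpb25zLmV4Ze38BVQfTbcnjP5x")

def read_zip_local_header(data):
    """Return (signature, file name) from the start of a ZIP archive.

    Hypothetical helper for illustration; assumes data holds at least
    the 30-byte fixed header plus the file name that follows it.
    """
    fields = struct.unpack_from("<4sHHHHHIIIHH", data, 0)
    signature, name_length = fields[0], fields[9]
    name = data[30:30 + name_length].decode("ascii", errors="replace")
    return signature, name

signature, name = read_zip_local_header(base64.b64decode(FIRST_LINE))
print(signature, name)
```

The signature comes out as the four bytes "PK\x03\x04" that mark a ZIP local file header, and the archived file name is Instructions.exe, which is consistent with the attachment being a zipped Windows executable.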
• Another way to confirm the fact that this file is a Windows exe-
cutable is by looking at its hexdump:
As shown below, in the very first line you can see the telltale
“MZ” marker that begins the MS-DOS header at the start of a Windows PE executable.
00000000 4d 5a 90 00 03 00 00 00 04 00 00 00 ff ff 00 00 |MZ..............|
00000010 b8 00 00 00 00 00 00 00 40 00 00 00 00 00 00 00 |........@.......|
00000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000030 00 00 00 00 00 00 00 00 00 00 00 00 b8 00 00 00 |................|
00000040 0e 1f ba 0e 00 b4 09 cd 21 b8 01 4c cd 21 54 68 |........!..L.!Th|
00000050 69 73 20 70 72 6f 67 72 61 6d 20 63 61 6e 6e 6f |is program canno|
00000060 74 20 62 65 20 72 75 6e 20 69 6e 20 44 4f 53 20 |t be run in DOS |
00000070 6d 6f 64 65 2e 0d 0d 0a 24 00 00 00 00 00 00 00 |mode....$.......|
00000080 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
.....
.....
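The same check can be scripted. The sketch below is only an illustration: it hard-codes the first 64 bytes from the hexdump above, and the two function names are hypothetical. It tests for the "MZ" signature and reads the little-endian field at offset 0x3c, which by convention holds the file offset of the PE header.

```python
import struct

# The first 64 bytes of the attachment, copied from the hexdump above.
DOS_HEADER = bytes.fromhex(
    "4d 5a 90 00 03 00 00 00 04 00 00 00 ff ff 00 00 "
    "b8 00 00 00 00 00 00 00 40 00 00 00 00 00 00 00 "
    "00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "
    "00 00 00 00 00 00 00 00 00 00 00 00 b8 00 00 00"
)

def looks_like_dos_executable(data):
    """True if data starts with the two-byte 'MZ' signature."""
    return data[:2] == b"MZ"

def pe_header_offset(data):
    """Read the 4-byte little-endian value stored at offset 0x3c."""
    return struct.unpack_from("<I", data, 0x3C)[0]

print(looks_like_dos_executable(DOS_HEADER))
print(hex(pe_header_offset(DOS_HEADER)))
```

A spam filter could apply a test of this kind to decoded attachments, although whether that is worth the cost depends on how much decoding you are willing to do at filtering time.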
constraints that can be built into a regular expression include specifying the number of repetitions of
a given elemental pattern, whether the matching of the regular expression with an input string should
be greedy or non-greedy, etc. Regular expressions are also useful in search-and-replace operations in
text processing, for specifying the separators for splitting long strings of text substrings, etc. ]
file, which justifies input in the name input string. But an input string may also be specified directly in a
program. ]
#!/usr/bin/perl -w
## word_match.pl
use strict;
my $regular_expression = "hello";
print "Enter a line of text:\n";
while (chomp( my $input_string = <> ) ) {
    if ( $input_string =~ /$regular_expression/ ) {
        print 'The line you entered contains "hello"', "\n";
        print "The portion of the line before the match: ", $`,"\n";
        print "The portion of the line after the match: ", $', "\n";
        print "The portion of the line actually matched: ", $&,"\n";
        print "The current line number read by <>: ", $., "\n";
        print "\nEnter another line of text or Ctrl-C to exit:\n\n";
    } else {
        print "\nNo match --- try again or enter Ctrl-C to exit\n\n";
    }
}
$input_string =~ /$regular_expression/
#!/usr/bin/env python
## word_match.py
## works with both Python 2.x and Python 3.x
import re
import sys
regular_expression = r'hello'
while 1:
    try:
        if sys.version_info[0] == 3:
            input_string = input("\nEnter a line of text: ")
        else:
            input_string = raw_input("\nEnter a line of text: ")
    except IOError as e:
        print(e.strerror)
    m = re.search( regular_expression, input_string )
    if m:
        # Print starting position index for the match:
        print( m.start() )
        # Print the ending position index for the match:
        print( m.end() )
        # Print a tuple of the position indices that span this match:
        print( m.span() )
        # Print the input string's characters consumed by this match:
        print( m.group() )
    else:
        print("no match")
The rest of the code then extracts the needed information from
this object. [Regular expression matching in Python is carried out with the re module.
Also note that the prefix r on a string literal causes the backslashes in the string to be accepted
literally, so that they reach the regex engine unchanged. ]
• In both the Perl and the Python examples shown above, we used
a simple pattern, hello, as our regular expression. The matching
functions invoked in both scripts looked for this pattern anywhere
in the input string.
the string abracadabra, but not the string dabracababra. In addition to forcing a regex match
to take place at the beginning and the end of a line with the help of anchor metacharacters, it is also
possible to force a regex to match at the beginning or the end of a word boundary. Both Perl and
Python use the anchor metacharacter \b to denote the word boundary. The symbol \b can stand
for both a non-word to word transition and a word to non-word transition. So the regex \bwhat will
match the string whatever will be will be free, but not the string somewhat happier than
thou. Similarly, the regex ever\b will match the string whatever will be will be free, but
not the string everywhere I go you go. Note that the anchors do not consume any characters from
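The four word-boundary cases just described can be checked with a short Python sketch. It uses only the regexes and strings quoted in the text; the variable names are arbitrary.

```python
import re

examples = [
    (r"\bwhat", "whatever will be will be free"),
    (r"\bwhat", "somewhat happier than thou"),
    (r"ever\b", "whatever will be will be free"),
    (r"ever\b", "everywhere I go you go"),
]

results = [bool(re.search(regex, text)) for regex, text in examples]
for (regex, text), matched in zip(examples, results):
    print(regex, "|", text, "|", "match" if matched else "no match")

# The anchor itself contributes no characters to the matched substring.
consumed = re.search(r"\bwhat", "whatever will be will be free").group()
print(consumed)
```

The first and third cases match and the other two do not, and the substring consumed in the first case is just what, with nothing added by the anchor.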
- ^ ] \
• Here are some other illustrations of the use of the range operator
inside a character class:
input_string = "hellosweetsie"
regex = h(ey|ello|i)(sweet|sweetsie)
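The input string and regex shown above can be tried out directly in Python. In the sketch below, the second regex, anchored with ^ and $, is an added assumption made purely to force the engine to reject its first successful choice of alternative.

```python
import re

input_string = "hellosweetsie"

# Unanchored: the first alternative that leads to a match is kept.
m1 = re.search(r"h(ey|ello|i)(sweet|sweetsie)", input_string)

# Anchored at both ends: 'sweet' matches, but then '$' fails, so the
# engine goes back and tries the next alternative, 'sweetsie'.
m2 = re.search(r"^h(ey|ello|i)(sweet|sweetsie)$", input_string)

print(m1.group(), m1.groups())
print(m2.group(), m2.groups())
```

In the unanchored case the match stops at hellosweet because the alternatives are tried in the order in which they are written; only when the rest of the regex cannot be satisfied does the engine move on to a later alternative.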
• Note that when a match with the input string does not work out
with the first choice in a set of alternatives, backtracking is used
to try each of the remaining choices. [To explain why we use the word
‘backtracking’ to describe the matching process in the presence of alternatives, let’s say we have two
alternatives in the first portion of a regex and two alternatives in the remaining portion. Let’s also say
we have a successful match between the input string and the first of the two alternatives in the first
portion of the regex. But, then, we are not able to match either of the two alternatives in the second
part of the regex with what remains of the input string. Now the matcher must backtrack and try the
to repetition through the use of quantifier metacharacters. (iii) For extracting a desired substring from
an input string. The input-string substring that matches a parenthesized portion of a regex is available
to the rest of the program through a special variable. It is also available inside later portions of the
regex through a backreference. (iv) For specifying non-capturing groupings in regexes. Non-capturing
parentheses have a special notation, ‘(?: )’, as opposed to ‘( )’. (v) For specifying lookahead and
lookbehind assertions. The parentheses are used in the form ‘(?= )’ for lookahead assertions and
$1 $2 $3 $4 .....
\1 \2 \3 \4 .....
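For comparison, here is a minimal Python sketch of the same two mechanisms. In Python the role of the Perl variables $1, $2, and so on is played by the group() method of the match object, while the backreference syntax \1, \2 inside a regex is the same. The regex and the input string are made up for this illustration.

```python
import re

# Two parenthesized groups; \1 inside the regex refers back to group 1.
regex = r"(\w+) \1 (\d+)"
m = re.search(regex, "the the 42 times")

first = m.group(1)     # plays the role of $1 in Perl
second = m.group(2)    # plays the role of $2 in Perl
print(first, second)
```

The backreference makes the regex succeed only where the same word occurs twice in a row, which is a handy way to catch accidentally doubled words.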
#!/usr/bin/perl -w
## Grouping.pl
use strict;
# balalaika
# balalaikas
### Grouping.py
import re #(A)
a
a[bc]
a[bc][bc]
a[bc][bc][bc]
a[bc][bc][bc][bc]
...
...
• If there exists a match between the input string and any one of this
indefinitely large set of regexes, the regex engine will declare
a successful match between the input string and the regex.
ab
a.b
a..b
a...b
a....b
a.....b
and so on
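A family of patterns like the one listed above is what a regex such as a.*b stands for; that this is the regex under discussion is an assumption made for this sketch. The Python lines below show how a greedy quantifier picks the longest candidate from the family and a non-greedy quantifier picks the shortest. The input string is made up.

```python
import re

text = "a1b2b"

greedy = re.search(r"a.*b", text).group()    # longest candidate wins
lazy = re.search(r"a.*?b", text).group()     # shortest candidate wins
print(greedy, lazy)
```

For a spam filter the difference matters mostly when you extract a substring from a header; for a plain yes/no match, both forms succeed on the same inputs.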
• What precisely is returned by the regex engine when you set the
global option depends on two factors: (i) whether or not the regex
contains any groupings of subexpressions; and (ii) the evaluation
context of matching.
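Python has no evaluation contexts, but the first of the two factors can be illustrated with re.findall, which is assumed here to be a reasonable stand-in for matching with the global option. What it returns changes with the number of groupings in the regex. The input string is made up.

```python
import re

text = "apples 12, pears 7, plums 30"

no_groups = re.findall(r"\d+", text)             # whole matches
one_group = re.findall(r"(\w+) \d+", text)       # contents of the one group
two_groups = re.findall(r"(\w+) (\d+)", text)    # one tuple per match
print(no_groups)
print(one_group)
print(two_groups)
```

With no groupings you get the matched substrings themselves, with one grouping you get what that group captured, and with two or more you get a tuple of captures for each match.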
• All of our discussion so far has dealt with input strings that con-
sisted of single lines, which were either read one line at a time
from an input file or were specified directly in the program.
Another match modifier takes care of the case when the input
string consists of multiple lines.
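As a small illustration of why such a modifier is needed, the Python sketch below applies the same anchored regex to a three-line string with and without the re.MULTILINE flag. Using this particular flag as the counterpart of the modifier meant in the text is an assumption, and the string is made up.

```python
import re

text = "first line\nsecond line\nthird line"

# Without the flag, '^' matches only at the very start of the string.
single = re.findall(r"^\w+", text)

# With the flag, '^' also matches just after each newline.
multi = re.findall(r"^\w+", text, re.MULTILINE)
print(single, multi)
```

This is the behavior you want when a rule is meant to fire on a header field that may sit on any line of a multi-line header.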
"|/usr/local/bin/procmail #kak"
where you must replace ‘kak’ by your own login name. If you
are outside the ‘ecn’ domain at Purdue, you must also replace
the path to the procmail utility with what it is on the host where
the MTA to MDA transfer of email takes place. The pipe symbol
at the very beginning of the string in the .forward file tells the
Sendmail program to make the email available to the Procmail
program on its standard input. What follows ’#’ is really a com-
ment that sendmail may use to make your .forward file unique
in its own cache.
• The very first thing that Procmail does is to look for the file
$HOME/.procmailrc
sure you change the name of the file from dot procmailrc to .procmailrc ]
3. Recipes
SHELL=/bin/sh
PATH=/usr/local/lib/mh:$PATH
MAILDIR=$HOME/Mail
LOGFILE=$HOME/Mail/logfile
#VERBOSE=1
VERBOSE=0
EOL="
"
LOG="$EOL$EOL$EOL"
LOG="New message log:$EOL"
LOG=`perl GET_MESSAGE_INDEX`
LOG="$EOL"
where SHELL, PATH, MAILDIR, and LOGFILE are local variables that
store the environment information needed by Procmail. The vari-
ables VERBOSE and EOL are the two other local variables; the first
controls the level of detail placed in the log files and the second
defines the end-of-line character for log entries. The variable EOL
defines a macro that can subsequently be used through the $EOL
syntax shown in the last line. Note that all these variables are
local to the .procmailrc file. Any assignment to the local vari-
3. An action starting in a new line. There can only be one action line
in a recipe.
• Here are some examples of the colon line. The examples also
illustrate the use of flags in the colon line. Note that when there is
a second colon present in the same line, as in the second recipe, a
:0 fhw You will use this for a filtering recipe that tells
procmail that the body of the email will NOT be
changed by the external filtering program. In other
words, the external program in the action line will
only change the header of the email. All that is
accomplished by the ‘h’ flag. The ‘w’ flag tells
procmail to wait for the filtering program to return
and TO CHECK THAT IT EXECUTED SUCCESSFULLY.
# Recipe 2:
:0:
* !^From.*groothuis
* ^From.*root
junkMail
# Recipe 3:
:0:
* ^From.*joe.*bureaucrat
* ^To.*engfaculty
junkMail
# Recipe 4:
:0 HB:
* ^Content-Type: text/html
* !(charset="?us-ascii"?|charset="?iso-8859-1"?)
junkMail
# Recipe 5:
:0 HB
* ^Content-Disposition:.*attachment
* < 300000
{
:0 c
! [email protected]
:0 c:
medium_attachments
:0 :
/var/mail/kak
}
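To see how the condition lines of a recipe combine, here is a rough Python model of Recipe 2. It is only a sketch under stated assumptions: every condition must hold for the action to fire, a leading ‘!’ inverts a condition, matching ignores case, and ‘^’ may match at the start of any header line. The function name and the two sample headers are hypothetical.

```python
import re

def conditions_hold(header, conditions):
    """Return True if every condition line is satisfied by the header.

    Each condition is the text that follows '* ' in a recipe; a leading
    '!' inverts it. This only approximates procmail's own matching.
    """
    for condition in conditions:
        negate = condition.startswith("!")
        regex = condition[1:] if negate else condition
        found = re.search(regex, header, re.IGNORECASE | re.MULTILINE) is not None
        if found == negate:
            return False
    return True

recipe_2 = ["!^From.*groothuis", "^From.*root"]

# Hypothetical headers, made up for this illustration.
header_a = "From: root <root@example.com>\nSubject: cron output\n"
header_b = "From: groothuis <groothuis@example.com>\nSubject: hello\n"
print(conditions_hold(header_a, recipe_2), conditions_hold(header_b, recipe_2))
```

The second header shows why the negated condition is there: the name groothuis contains the substring root, so without the first condition that sender's mail would also land in junkMail.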
• You will find two kinds of recipes in the list shown above:
• The sole action line that is allowed in a recipe starts with one of
the following symbols:
You saw all four of these types of action lines in the five recipes
shown earlier. Note the very different roles played by the character
‘!’ in a condition line and in an action line.
:0 HB:
* < 15000
* ? $MAILDIR/condfilter2.pl 2>&1
junkMail
This recipe feeds the email into the Perl script condfilter2.pl.
The condition succeeds if the Perl script returns the exit code of
0 and fails if the exit code returned is nonzero. The string ‘2>&1’ redirects
the STDERR stream to the STDOUT stream (which the filtering
program redirects into the log file).
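The exit-code convention can be mimicked in a few lines of Python. The sketch below is an illustration only: a one-line Python command stands in for an external filter such as condfilter2.pl, the message is fed to it on its standard input, and the condition is taken to hold when the exit code is 0. The function name and the sample messages are hypothetical.

```python
import subprocess
import sys

def program_condition_holds(command, message):
    """Feed message to command on stdin; succeed on exit code 0."""
    completed = subprocess.run(
        command,
        input=message.encode(),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,   # same idea as 2>&1
    )
    return completed.returncode == 0

# Stand-in for an external filter: exits with 0 if a trigger word is seen.
stand_in = [
    sys.executable, "-c",
    "import sys; sys.exit(0 if 'viagra' in sys.stdin.read().lower() else 1)",
]

spam_result = program_condition_holds(stand_in, "Subject: cheap VIAGRA\n\nbuy now\n")
ham_result = program_condition_holds(stand_in, "Subject: meeting\n\nsee you at 3\n")
print(spam_result, ham_result)
```

Any program that reads the message from standard input and reports its verdict through the exit code can be plugged into a condition line in this manner.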
a filter would be invoked only AFTER a lot of other tests that would have declared the message to be
non-spam if that was indeed the case. Base64 encoding is commonly used by spammers to hide their
text content. ]
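As a minimal sketch of the decoding step that a filter of this kind needs, the Python lines below use the standard email package to pull out and decode the text/html parts of a message. This is not the Perl script discussed in this section; the function name and the test message are made up for the illustration.

```python
import base64
import email
from email import policy

def decoded_html_parts(raw_message):
    """Return the decoded text of every text/html part in raw_message."""
    msg = email.message_from_string(raw_message, policy=policy.default)
    return [part.get_content() for part in msg.walk()
            if part.get_content_type() == "text/html"]

# A made-up message whose HTML body is hidden behind Base64.
hidden = base64.b64encode(b"<html>Cheap pills</html>").decode("ascii")
raw = ("From: someone@example.com\n"
       "MIME-Version: 1.0\n"
       "Content-Type: text/html; charset=\"us-ascii\"\n"
       "Content-Transfer-Encoding: base64\n"
       "\n" + hidden + "\n")

html_parts = decoded_html_parts(raw)
print(html_parts)
```

Once the text is decoded, the usual regular-expression tests can be applied to it just as they would be to an unencoded body.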
#!/usr/bin/perl -w
use strict;
use MIME::Base64;
my $encoded_string = "";
my $decoded_string = "";
my $content_html_flag = 0;
my $encoding_flag = 0;
while ( <STDIN> ) {
    chomp;
    if ( /^From:/ ) {
        print "$_\n";
        next;
    }
    if ( /^Date:/ ) {
        print "$_\n";
        next;
    }
    if ( /content-type.*text\/html/i ) {
        $content_html_flag = 1;
        next;
    }
    if ( $content_html_flag && /content.*encoding.*base64/i ) {
        $encoding_flag = 1;
        next;
    }
    next if $content_html_flag == 0;
    next if /^Content-T/;
    next if /^X-/;
    next if /^\s*$/;
    $encoded_string .= $_;
{
KEY=`echo $MATCH | sed 's/[^0-9a-zA-Z]//g' | tr 'A-Z' 'a-z'`
SUBJECT=`echo "the key you supplied $KEY"`
:0 fhw
| formail -I "Subject: $SUBJECT"
:0
[email protected]
}
To understand this recipe, you must know about the special role
played by the symbol pair ‘\/’ in the second condition line. What-
ever portion of the subject line in the email being processed by
this recipe matches the regex that comes after ‘\/’ becomes im-
plicitly the value of the local variable MATCH. Next we have a local
variable KEY inside a sub-recipe. Because of the backquotes, the
value of KEY will be whatever is returned by the Unix process in
which the command(s) that is/are within the backquotes is/are
executed. The first Unix command is echo; this command simply
echoes its argument to the standard output, where it is picked up
by the second Unix command sed, etc. What that means is that
the string value of the local variable MATCH will be subject to a
modification by the sed command, and so on.
]
utilities in the manner I have shown.
• I will next show a small recipe file called my recipe file whose
job is to accomplish the following:
SHELL=/bin/sh
MAILDIR=$HOME/proc_folder
LOGFILE=$HOME/proc_folder/logfile
#VERBOSE=1
VERBOSE=0
EOL="
"
:0
* ^From.*ack
* ^Subject.*the key is[ ]+\/.*[0-9a-z].*
{
KEY=`echo $MATCH | sed 's/[^0-9a-zA-Z]//g' | tr '[A-Z]' '[a-z]'`
SUBJECT=`echo "the key you supplied $KEY"`
DATE=`formail -x Date:`
:0
{
:0 fhw
| formail -I "Subject: $SUBJECT"
:0 fhw
| formail -I "Date: $DATE"
}
:0 fhw
| cat -; echo "<><><>MESSAGE AT THE BEGINNING OF NEW BODY<><><>"
:0
[email protected]
}
• In the recipe shown above, note the following two different
uses of the formail Unix utility. I first use this utility in the
line:
DATE=`formail -x Date:`
• Now note the second, different use of formail in the action line
for the recipe shown in the file my recipe file:
formail -I "Subject: $SUBJECT"
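The two uses of formail, extracting the contents of a header field with -x and replacing a header field with -I, have straightforward counterparts in Python's standard email package. The sketch below is an illustration with a made-up message; it does not call formail itself.

```python
import email

# A made-up message for illustration.
raw = ("From: someone@example.com\n"
       "Date: Sat, 14 Feb 2004 19:06:56 -0600\n"
       "Subject: the key is AbcDEF 123\n"
       "\n"
       "body text\n")
msg = email.message_from_string(raw)

# Counterpart of `formail -x Date:`: pull out the contents of one field.
date_value = msg["Date"]

# Counterpart of `formail -I "Subject: ..."`: drop the old field and
# put the new one in its place.
del msg["Subject"]
msg["Subject"] = "the key you supplied abcdef123"

print(date_value)
print(msg["Subject"])
```

Deleting the field before assigning to it matters: assigning alone would add a second Subject field rather than replace the first.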
the string ‘AbcDEF 123’ will become the value of the local vari-
able MATCH.
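The path from the subject line to the value of KEY can be traced in Python. In this sketch, a parenthesized group stands in for the portion of the regex that follows ‘\/’ in the recipe, and the subject line is a made-up one chosen so that the captured string is the one mentioned above.

```python
import re

subject_line = "Subject: the key is AbcDEF 123"

# The parenthesized group stands in for what follows '\/' in the recipe.
m = re.search(r"^Subject.*the key is[ ]+(.*[0-9a-z].*)",
              subject_line, re.IGNORECASE)
match_value = m.group(1)

# Counterpart of: sed 's/[^0-9a-zA-Z]//g' | tr 'A-Z' 'a-z'
key = re.sub(r"[^0-9a-zA-Z]", "", match_value).lower()
print(repr(match_value), key)
```

Stripping everything but letters and digits before the key is used keeps stray punctuation and whitespace typed by the sender from affecting it.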
• Again in the file my recipe file, notice from the following action
line how I am adding some additional text to the body of the
incoming email to form the body of the outgoing email:
The echo function will place in the standard output the text that
is given to it as the argument. This additional text will appear
BEFORE the body of the incoming email because only the flag
‘h’ is in the colon line of this sub-recipe. Regarding the invocation
‘cat -’ , note that the basic job of the command cat is to send
In the first case, the ‘h’ flag is crucial; and in the second case, the
‘b’ flag is crucial. The ‘h’ flag makes available only the header
section on the standard input. The ‘b’ flag makes available only
the body at the standard input. [Recall that the ‘-’ argument to cat causes the
standard input to be used for reading the input. Of course, in both cases, cat will make its output
• I should also point out that for experimenting with a recipe, you
do NOT have to put it in a .procmailrc file at the top level of
where the file mail file is some file that contains a previously
collected email message for testing purposes.
2. Programming Assignment:
where “XX” is the integer suffix for the message file. Obviously,
you would need to write either a shell script, or a Python script,
or a Perl script to execute the above command in a loop for all 75
spam messages. If your recipes work on all 75 messages, you will
not see any messages being subject to the default action of your
procmail filter, which is usually to put the surviving messages in
your mailbox /var/mail/account name.
Since the spam messages in the tar archive are in their raw form,
it is sometimes difficult to see what is in them — especially if the
MIME objects in the messages are Base64 encoded. To help you
decipher those spam messages that are fully or partially encoded,
you’ll find in the starter kit a Perl script named EmailParser2.pl.
Execute this script and give it a command-line argument that
is the name of the junk mail file you want to decipher. It will
deposit the different MIME objects in the email in a subdirectory
called mimemail in the directory in which you execute the script.